1 Reply Latest reply on May 6, 2009 1:02 PM by setatum

JBoss cache instances fail to join cluster after bounce

setatum May 6, 2009 12:02 PM

We are experiencing a problem with a 3-node JBoss Cache setup. All three nodes start up fine and changes propagate as expected. However, if we later restart one of our app servers (or an instance dies), it may fail to rejoin the cluster. When that happens, the only fix I have found is to change the multicast address to something different and then bounce all three servers. Until I do that, I can restart the app server over and over again and get the same error every time JBoss Cache tries to start.

We currently only use the cache for a small amount of information - one node with 153 children, one with 249 children, and one with 266 children. Each child may have one, two, or three name/value pairs added to it. When everything is working, both reads and updates are blazing fast. The only problem is the sometimes complete and utter failure to rejoin the cluster.

The environment details:

O/S: SunOS 5.10 Generic_138888-06 sun4us sparc FJSV,GPUZC-M
Java: Java(TM) 2 Runtime Environment, Standard Edition (build 1.5.0_10-b03)
App Server: WebLogic Server 9.2 MP2 Mon Jun 25 01:32:01 EDT 2007 952826
JBoss Cache: jbosscache-core-3.0.3GA
Jars installed in Weblogic's domain lib: jboss-common-core.jar jboss-logging-spi.jar jbosscache-core.jar jcip-annotations.jar jgroups.jar (all from 3.0.3GA download)

Some additional IP/routing details on the three instances (just being thorough):

server1 (ip 10.16.106.220 netmask 255.255.255.0) netstat -nr output:

Routing Table: IPv4
  Destination           Gateway           Flags  Ref     Use     Interface
-------------------- -------------------- ----- ----- ---------- ---------
default              10.16.106.190        UG        1     238875
10.16.106.0          10.16.106.220        U         1     270565 fjgi0
224.0.0.0            10.16.106.220        U         1          0 fjgi0
127.0.0.1            127.0.0.1            UH     1347    8972376 lo0

server2 (ip 10.16.106.221 netmask 255.255.255.0) netstat -nr output:

Routing Table: IPv4
  Destination           Gateway           Flags  Ref     Use     Interface
-------------------- -------------------- ----- ----- ---------- ---------
default              10.16.106.190        UG        1     339798
10.16.106.0          10.16.106.221        U         1    1248223 fjgi0
224.0.0.0            10.16.106.221        U         1          0 fjgi0
127.0.0.1            127.0.0.1            UH     1362   10112605 lo0


server3 (ip 10.16.106.222 netmask 255.255.255.0) netstat -nr output:

Routing Table: IPv4
  Destination           Gateway           Flags  Ref     Use     Interface
-------------------- -------------------- ----- ----- ---------- ---------
default              10.16.106.190        UG        1     346621
10.16.106.0          10.16.106.222        U         1     437006 fjgi0
224.0.0.0            10.16.106.222        U         1          0 fjgi0
127.0.0.1            127.0.0.1            UH     1186   10364437 lo0
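
For reference, the address checks implied by the tables above can be sketched in plain Java. This is only an illustrative helper (the class and method names are made up, and it is not part of JBoss Cache or JGroups). It takes the fjgi0 gateway addresses from the three routing tables (10.16.106.220, .221, .222), confirms they share a network under the 255.255.255.0 netmask, and confirms that the configured mcast_addr 228.16.106.2 falls inside the IPv4 multicast range.

```java
/** Illustrative address checks; names are hypothetical, not from JBoss Cache or JGroups. */
class ClusterAddressCheck {

    /** Packs a dotted-quad IPv4 address such as "10.16.106.220" into a 32-bit int. */
    static int toInt(String dottedQuad) {
        String[] parts = dottedQuad.split("\\.");
        if (parts.length != 4) {
            throw new IllegalArgumentException("not an IPv4 address: " + dottedQuad);
        }
        int value = 0;
        for (String part : parts) {
            int octet = Integer.parseInt(part);
            if (octet < 0 || octet > 255) {
                throw new IllegalArgumentException("octet out of range: " + part);
            }
            value = (value << 8) | octet;
        }
        return value;
    }

    /** True if both hosts fall in the same network under the given netmask. */
    static boolean sameSubnet(String ipA, String ipB, String netmask) {
        int mask = toInt(netmask);
        return (toInt(ipA) & mask) == (toInt(ipB) & mask);
    }

    /** True if the address lies in 224.0.0.0/4, the IPv4 multicast range. */
    static boolean isMulticast(String address) {
        return (toInt(address) >>> 28) == 0xE;
    }

    public static void main(String[] args) {
        String mask = "255.255.255.0";
        System.out.println(sameSubnet("10.16.106.220", "10.16.106.221", mask));
        System.out.println(sameSubnet("10.16.106.221", "10.16.106.222", mask));
        System.out.println(isMulticast("228.16.106.2"));
    }
}
```

With the values listed above, all three checks print true, so the addressing itself looks consistent across the three servers.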

Now the jboss-cache.xml config that is used by each of the three instances:

<?xml version="1.0" encoding="UTF-8" ?>

<server>
   <mbean code="org.jboss.cache.pojo.jmx.PojoCacheJmxWrapper"
          name="jboss.cache:service=PojoCache">

      <depends>jboss:service=TransactionManager</depends>

      <!-- Configure the TransactionManager -->
      <attribute name="TransactionManagerLookupClass">
         org.jboss.cache.transaction.DummyTransactionManagerLookup
      </attribute>

      <!-- Isolation level : SERIALIZABLE
                             REPEATABLE_READ (default)
                             READ_COMMITTED
                             READ_UNCOMMITTED
                             NONE
      -->
      <attribute name="IsolationLevel">REPEATABLE_READ</attribute>

      <!-- Valid modes are LOCAL, REPL_ASYNC and REPL_SYNC -->
      <attribute name="CacheMode">REPL_ASYNC</attribute>

      <!-- Name of cluster. Needs to be the same for all caches,
           in order for them to find each other
      -->
      <attribute name="ClusterName">prodMwCluster</attribute>

      <!-- JGroups protocol stack properties. -->
      <attribute name="ClusterConfig">
         <config>
            <!-- UDP: if you have a multihomed machine, set the bind_addr
                 attribute to the appropriate NIC IP address
            -->
            <!-- UDP: On Windows machines, because of the media sense feature
                 being broken with multicast (even after disabling media sense)
                 set the loopback attribute to true
            -->
            <UDP mcast_addr="228.16.106.2" mcast_port="48863"
                 ip_ttl="64" ip_mcast="true"
                 mcast_send_buf_size="150000" mcast_recv_buf_size="80000"
                 ucast_send_buf_size="150000" ucast_recv_buf_size="80000"
                 loopback="false"/>
            <PING timeout="2000" num_initial_members="3"/>
            <MERGE2 min_interval="10000" max_interval="20000"/>
            <FD shun="true"/>
            <FD_SOCK/>
            <VERIFY_SUSPECT timeout="1500"/>
            <pbcast.NAKACK gc_lag="50" retransmit_timeout="600,1200,2400,4800"
                           max_xmit_size="8192"/>
            <UNICAST timeout="600,1200,2400,4800"/>
            <pbcast.STABLE desired_avg_gossip="400000"/>
            <FC max_credits="2000000" min_threshold="0.10"/>
            <FRAG2 frag_size="8192"/>
            <pbcast.GMS join_timeout="5000" join_retry_timeout="2000"
                        shun="true" print_local_addr="true"/>
            <pbcast.STATE_TRANSFER/>
         </config>
      </attribute>

      <!-- Whether or not to fetch state on joining a cluster -->
      <attribute name="FetchInMemoryState">true</attribute>

      <!-- The max amount of time (in milliseconds) we wait until the
           initial state (ie. the contents of the cache) are retrieved from
           existing members in a clustered environment
      -->
      <attribute name="InitialStateRetrievalTimeout">15000</attribute>

      <!-- Number of milliseconds to wait until all responses for a
           synchronous call have been received.
      -->
      <attribute name="SyncReplTimeout">15000</attribute>

      <!-- Max number of milliseconds to wait for a lock acquisition -->
      <attribute name="LockAcquisitionTimeout">10000</attribute>

   </mbean>
</server>

I created startup/shutdown classes for WebLogic that create the Cache instance and place it in JNDI. I won't post the entire code here, but the cache creation code in the startup class looks like this:

 System.out.println("JBossCache - starting up...");
 CacheFactory<String, String> factory = new DefaultCacheFactory<String, String>();
 // configFile is jboss-cache.xml
 Cache<String, String> cache = factory.createCache(configFile, true);
 System.out.println("JbossCache - started cache");
 // put cache into JNDI...

The corresponding shutdown class code snippet looks like this:

 // grabbed cache out of JNDI and unbound it from there...
 cache.stop();
 cache.destroy();
 System.out.println("JbossCache - stopped cache.");
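
The startup snippet above creates and starts the cache in one call, so if the start throws, the shutdown code never gets a cache to stop. Whether a failed start actually leaves anything behind in this setup is an assumption and has not been confirmed. The following is a minimal sketch of pairing a start with cleanup on failure. The Lifecycle interface and GuardedStartup class are hypothetical stand-ins so the example compiles on its own; Lifecycle only mirrors the start(), stop() and destroy() calls used in the snippets above.

```java
/** Hypothetical stand-in for the start/stop/destroy calls used on the cache above. */
interface Lifecycle {
    void start();
    void stop();
    void destroy();
}

/** Illustrative helper, not part of JBoss Cache. */
class GuardedStartup {

    /**
     * Starts the component. If start() throws, stop() and destroy() are
     * called before the original exception is rethrown, so a half-started
     * instance is not left behind.
     */
    static void startOrCleanUp(Lifecycle component) {
        try {
            component.start();
        } catch (RuntimeException startFailure) {
            try {
                component.stop();
                component.destroy();
            } catch (RuntimeException cleanupFailure) {
                // Keep the start failure as the primary error; just report this one.
                System.err.println("cleanup after failed start also failed: " + cleanupFailure);
            }
            throw startFailure;
        }
    }
}
```

In the real startup class this would mean calling factory.createCache(configFile, false) so the cache is created but not started, then wrapping the cache's own start(), stop() and destroy() calls in an adapter that implements the stand-in interface.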

Below is an example of the error that occurred this past weekend on the 2nd of the three servers. The server needed to be bounced for an unrelated configuration change, and upon startup an error was generated when the JBossCacheLoader class fired on startup (this is from the WebLogic system out logs):

JBossCache - starting up...

-------------------------------------------------------
GMS: address is 10.16.106.221:34622
-------------------------------------------------------


(approximately 10 seconds elapse then)


<May 1, 2009 9:39:23 PM CDT> <Critical> <WebLogicServer> <BEA-000362> <Server failed. Reason:

There are 1 nested errors:

org.jboss.cache.CacheException: java.lang.reflect.InvocationTargetException
 at org.jboss.cache.util.reflect.ReflectionUtil.invokeAccessibly(ReflectionUtil.java:148)
 at org.jboss.cache.factories.ComponentRegistry$PrioritizedMethod.invoke(ComponentRegistry.java:883)
 at org.jboss.cache.factories.ComponentRegistry.internalStart(ComponentRegistry.java:680)
 at org.jboss.cache.factories.ComponentRegistry.start(ComponentRegistry.java:561)
 at org.jboss.cache.invocation.CacheInvocationDelegate.start(CacheInvocationDelegate.java:301)
 at org.jboss.cache.DefaultCacheFactory.createCache(DefaultCacheFactory.java:119)
 at org.jboss.cache.DefaultCacheFactory.createCache(DefaultCacheFactory.java:94)
 at com.company.cache.weblogic.JBossCacheStartup.main(JBossCacheStartup.java:41)
 at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
 at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
 at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
 at java.lang.reflect.Method.invoke(Method.java:585)
 at weblogic.management.deploy.classdeployment.ClassDeploymentManager.invokeMain(ClassDeploymentManager.java:353)
 at weblogic.management.deploy.classdeployment.ClassDeploymentManager.invokeClass(ClassDeploymentManager.java:263)
 at weblogic.management.deploy.classdeployment.ClassDeploymentManager.access$000(ClassDeploymentManager.java:54)
 at weblogic.management.deploy.classdeployment.ClassDeploymentManager$1.run(ClassDeploymentManager.java:205)
 at weblogic.security.acl.internal.AuthenticatedSubject.doAs(AuthenticatedSubject.java:321)
 at weblogic.security.service.SecurityManager.runAs(SecurityManager.java:121)
 at weblogic.management.deploy.classdeployment.ClassDeploymentManager.invokeClassDeployment(ClassDeploymentManager.java:198)
 at weblogic.management.deploy.classdeployment.ClassDeploymentManager.invokeClassDeployments(ClassDeploymentManager.java:177)
 at weblogic.management.deploy.classdeployment.ClassDeploymentManager.runStartupsBeforeAppActivation(ClassDeploymentManager.java:151)
 at weblogic.management.deploy.internal.DeploymentAdapter$4.activate(DeploymentAdapter.java:166)
 at weblogic.management.deploy.internal.AppTransition$2.transitionApp(AppTransition.java:30)
 at weblogic.management.deploy.internal.ConfiguredDeployments.transitionApps(ConfiguredDeployments.java:233)
 at weblogic.management.deploy.internal.ConfiguredDeployments.activate(ConfiguredDeployments.java:169)
 at weblogic.management.deploy.internal.ConfiguredDeployments.deploy(ConfiguredDeployments.java:123)
 at weblogic.management.deploy.internal.DeploymentServerService.resume(DeploymentServerService.java:173)
 at weblogic.management.deploy.internal.DeploymentServerService.start(DeploymentServerService.java:89)
 at weblogic.t3.srvr.SubsystemRequest.run(SubsystemRequest.java:64)
 at weblogic.work.ExecuteThread.execute(ExecuteThread.java:209)
 at weblogic.work.ExecuteThread.run(ExecuteThread.java:181)
Caused by: java.lang.reflect.InvocationTargetException
 at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
 at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
 at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
 at java.lang.reflect.Method.invoke(Method.java:585)
 at org.jboss.cache.util.reflect.ReflectionUtil.invokeAccessibly(ReflectionUtil.java:144)
 ... 30 more
Caused by: org.jboss.cache.CacheException: Unable to connect to JGroups channel
 at org.jboss.cache.RPCManagerImpl.start(RPCManagerImpl.java:252)
 ... 35 more
Caused by: org.jgroups.StateTransferException: 10.16.106.221:34622 could not fetch state null from null
 at org.jgroups.JChannel.connect(JChannel.java:466)
 at org.jboss.cache.RPCManagerImpl.start(RPCManagerImpl.java:242)
 ... 35 more
Caused by: org.jgroups.StateTransferException: 10.16.106.221:34622 could not fetch state null from null
 at org.jgroups.JChannel.connect(JChannel.java:459)
 ... 36 more

Any idea what could be causing this problem, given my configuration? I may try the JGroups probe script to see if it can tell me any more information. Otherwise I am completely at a loss. Sometimes restarts work ok, but it seems that once one fails, they will continue to fail until they are all restarted with a new multicast IP.
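
As a rough alternative to the JGroups probe script (this is not that script, just a plain java.net.MulticastSocket check), a sketch like the one below could show whether multicast datagrams sent to 228.16.106.2 actually reach the other servers. The class name, the payload format and port 48864 are all made up for illustration; the port is deliberately different from the cluster's mcast_port so the test traffic does not land on the running JGroups stack. It also assumes a newer JDK than the 1.5 runtime listed above.

```java
import java.net.DatagramPacket;
import java.net.InetAddress;
import java.net.InetSocketAddress;
import java.net.MulticastSocket;
import java.net.SocketTimeoutException;
import java.nio.charset.StandardCharsets;

/** Illustrative multicast reachability check; not the JGroups probe. */
class MulticastProbe {

    /** Builds the probe payload "<sender>#<sequence>". */
    static byte[] encode(String sender, int sequence) {
        return (sender + "#" + sequence).getBytes(StandardCharsets.UTF_8);
    }

    /** Recovers the payload text from a received packet. */
    static String decode(DatagramPacket packet) {
        return new String(packet.getData(), packet.getOffset(),
                packet.getLength(), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws Exception {
        String sender = args.length > 0 ? args[0] : "node";
        InetAddress group = InetAddress.getByName("228.16.106.2");
        int port = 48864; // hypothetical test port, not the cluster's mcast_port
        InetSocketAddress groupAddress = new InetSocketAddress(group, port);

        try (MulticastSocket socket = new MulticastSocket(port)) {
            socket.setTimeToLive(64);   // same value as ip_ttl in the config
            socket.setSoTimeout(1000);  // treat one second of silence as "nothing more"
            socket.joinGroup(groupAddress, null); // null = default interface

            for (int i = 0; i < 10; i++) {
                byte[] payload = encode(sender, i);
                socket.send(new DatagramPacket(payload, payload.length, group, port));
                try {
                    // Print everything that arrives until the socket goes quiet.
                    while (true) {
                        byte[] buffer = new byte[256];
                        DatagramPacket received = new DatagramPacket(buffer, buffer.length);
                        socket.receive(received);
                        System.out.println(decode(received) + " from " + received.getAddress());
                    }
                } catch (SocketTimeoutException quiet) {
                    // No packet within one second; send the next probe.
                }
            }
            socket.leaveGroup(groupAddress, null);
        }
    }
}
```

Run with a different sender name on each server at the same time. A server that only ever prints its own payloads is not receiving multicast traffic from the others on that address.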

Also, if server 1 is bounced and fails to rejoin, server 2 will fail the same way if we bounce it. All three need the config change and a bounce. After that they all talk to each other again and are happy.

Thanks for any insight.

-Scott

1. Re: JBoss cache instances fail to join cluster after bounce

setatum May 6, 2009 1:02 PM (in response to setatum)

Well I just realized something that may be important. Originally I had tried using Pojo Cache for our implementation but ran into serious memory leak issues. I switched the implementation to use core cache, and this got rid of our memory leak problems.

However, the config file we use now is based on the config file from section 8.2 of the Pojo Cache User's Guide. Looking at this and comparing it to the configuration files from section 12.1 of the Core Cache User's Guide, it is totally different! However, core cache seems to load this configuration file ok (if I intentionally mess up the file format, the loader complains).

I'll now attempt to set up a new configuration file based on the core cache examples.

Is there a tuning guide separate from the User's Guide which recommends settings based on your setup? For example, my setup is 3 servers all close together on a high speed network, where low latency is more important than throughput (our cached datasets are currently very small). The instances themselves should be reliable, but if an instance properly shuts itself down with stop/destroy, stays down for ~5 minutes, and then attempts to rejoin the group, it should be allowed to do so (this seems to be my problem now).

-Scott