10 Replies Latest reply on Mar 12, 2004 9:39 AM by michael.daleiden

Initial state replication is not working! Please help!

michael.daleiden Mar 8, 2004 2:59 PM

I am using JBoss 3.2.2 with the TreeCache (non-AOP) and I am having problems with replication of the initial state of the cache when a new node joins the cluster. I have the TreeCache set up as a deployed MBean (in the /deploy directory).

Here's the scenario:

1) Server A is started, the TreeCache MBean is started successfully, and the cache is populated with several nodes as other elements of the server startup.
2) Server B is started and the TreeCache MBean is started but it never "acquires" the existing state from the instance on Server A (the log displays the message: "start(): state could not be retrieved (must be first member in group)")
3) Server B completes its startup, adding more nodes to the cache as its components start. These nodes are replicated successfully to the cache on Server A.

The end result is that the cache on Server A correctly contains ALL of the nodes created by both Server A and Server B, while Server B only contains the nodes created by Server B -- the caches are not in sync!

Here is the MBean deployment descriptor for Server A:

<?xml version="1.0" encoding="UTF-8"?>

<!-- ===================================================================== -->
<!-- -->
<!-- Sample TreeCache Service Configuration -->
<!-- -->
<!-- ===================================================================== -->

<server>

 <classpath codebase="./lib" archives="jboss-cache.jar, jgroups.jar"/>


 <!-- ==================================================================== -->
 <!-- Defines TreeCache configuration -->
 <!-- ==================================================================== -->

 <mbean code="org.jboss.cache.TreeCache"
 name="jboss.cache:service=TreeCache">

 <depends>jboss:service=Naming</depends>
 <depends>jboss:service=TransactionManager</depends>

 <!--
 Configure the TransactionManager
 -->
 <attribute name="TransactionManagerLookupClass">org.jboss.cache.JBossTransactionManagerLookup</attribute>

 <!--
 Isolation level : SERIALIZABLE
 REPEATABLE_READ (default)
 READ_COMMITTED
 READ_UNCOMMITTED
 NONE
 -->
 <attribute name="IsolationLevel">REPEATABLE_READ</attribute>

 <!--
 Valid modes are LOCAL, REPL_ASYNC and REPL_SYNC
 -->
 <attribute name="CacheMode">REPL_ASYNC</attribute>

 <!--
 Just used for async repl: use a replication queue
 -->
 <attribute name="UseReplQueue">false</attribute>

 <!--
 Replication interval for replication queue (in ms)
 -->
 <attribute name="ReplQueueInterval">0</attribute>

 <!--
 Max number of elements which trigger replication
 -->
 <attribute name="ReplQueueMaxElements">0</attribute>

 <!-- Name of cluster. Needs to be the same for all members of the
 cluster, in order for them to find each other
 -->
 <attribute name="ClusterName">TelematicsCache</attribute>

 <!-- JGroups protocol stack properties. Can also be a URL,
 e.g. file:/home/bela/default.xml
 <attribute name="ClusterProperties"></attribute>
 -->

 <attribute name="ClusterConfig">
 <config>
 <TCP start_port="7900"/>
 <TCPPING initial_hosts="ServerB[7900]" port_range="3" timeout="3000"
 num_initial_members="3" up_thread="true" down_thread="true"/>
 <VERIFY_SUSPECT timeout="1500" up_thread="false" down_thread="false" />
 <pbcast.NAKACK gc_lag="100" retransmit_timeout="3000" up_thread="true" down_thread="true" />
 <pbcast.STABLE desired_avg_gossip="20000" up_thread="false" down_thread="false" />
 <pbcast.GMS join_timeout="5000" join_retry_timeout="2000" shun="false" print_local_addr="false" down_thread="true" up_thread="true" />
 </config>
 </attribute>

 <!--
 Max number of entries in the cache. If this is exceeded, the
 eviction policy will kick some entries out in order to make
 more room
 -->
 <attribute name="MaxCapacity">20000</attribute>

 <!--
 Whether or not to fetch state on joining a cluster
 -->
 <attribute name="FetchStateOnStartup">true</attribute>

 <!--
 The max amount of time (in milliseconds) we wait until the
 initial state (ie. the contents of the cache) are retrieved from
 existing members in a clustered environment
 -->
 <attribute name="InitialStateRetrievalTimeout">15000</attribute>

 <!--
 Number of milliseconds to wait until all responses for a
 synchronous call have been received.
 -->
 <attribute name="SyncReplTimeout">10000</attribute>

 <!-- Max number of milliseconds to wait for a lock acquisition -->
 <attribute name="LockAcquisitionTimeout">15000</attribute>

 <!-- Max number of milliseconds we hold a lock (not currently
 implemented) -->
 <attribute name="LockLeaseTimeout">60000</attribute>

 <!-- Name of the eviction policy class. Not supported now. -->
 <attribute name="EvictionPolicyClass"></attribute>
 </mbean>

</server>

The MBean deployment descriptor on Server B is identical, except the "initial_hosts" attribute is set to "ServerA[7900]".

I need to use TCP for JGroups, as the servers are in different subnets and the network admins do not allow cross-subnet UDP multicasting.

Any ideas as to why this is occurring? I really need to get this up and running quickly, as we are gearing up for production deployment.

1. Re: Initial state replication is not working! Please help!

michael.daleiden Mar 8, 2004 4:11 PM (in response to michael.daleiden)
Found the problem: the following needs to be the last element in the cluster config for the cache:

<pbcast.STATE_TRANSFER up_thread="true" down_thread="true"/>
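
For context, here is a sketch of the Server A protocol stack from the first post with that element appended. Every attribute value is copied from the descriptor above; only the final protocol line is new, and its position at the end of the stack follows the description of the fix rather than any separate verification.

```xml
<attribute name="ClusterConfig">
 <config>
 <TCP start_port="7900"/>
 <TCPPING initial_hosts="ServerB[7900]" port_range="3" timeout="3000"
 num_initial_members="3" up_thread="true" down_thread="true"/>
 <VERIFY_SUSPECT timeout="1500" up_thread="false" down_thread="false" />
 <pbcast.NAKACK gc_lag="100" retransmit_timeout="3000" up_thread="true" down_thread="true" />
 <pbcast.STABLE desired_avg_gossip="20000" up_thread="false" down_thread="false" />
 <pbcast.GMS join_timeout="5000" join_retry_timeout="2000" shun="false" print_local_addr="false" down_thread="true" up_thread="true" />
 <!-- Added: state transfer protocol, placed last in the stack -->
 <pbcast.STATE_TRANSFER up_thread="true" down_thread="true"/>
 </config>
</attribute>
```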

Question: does the paid documentation discuss all of the options available for configuring clusters? In particular, does it address the various JGroups configuration parameters and how they interact with the cluster management functions of JBoss? Or is this covered somewhere in the JGroups docs? This is an area that needs a detailed "roadmap" to help developers and deployers find the available configuration options and learn how to use them to configure JBoss clusters.
2. Re: Initial state replication is not working! Please help!

belaban Mar 8, 2004 9:21 PM (in response to michael.daleiden)

Hi Michael,

Glad you found the STATE_TRANSFER protocol. You're also missing a failure detection protocol (FD, for example), so try killing a member and see what happens; this might lead to problems.
Wrt docs, I'm working on it, but because I'm also working on several other projects, this thread gets a bit starved... :-)
Bela

3. Re: Initial state replication is not working! Please help!

michael.daleiden Mar 9, 2004 1:53 PM (in response to michael.daleiden)

"bela" wrote:
Glad you found the STATE_TRANSFER protocol. You're also missing a failure detection protocol (FD, for example), so try killing a member and see what happens; this might lead to problems.

Well, I am further along now, but things are still quirky. Here is the latest configuration that I am using for TreeCache (IP addresses changed for security reasons):

<?xml version="1.0" encoding="UTF-8"?>

<!-- ===================================================================== -->
<!-- -->
<!-- Sample TreeCache Service Configuration -->
<!-- -->
<!-- ===================================================================== -->

<server>

 <classpath codebase="./lib" archives="jboss-cache.jar, jgroups.jar"/>


 <!-- ==================================================================== -->
 <!-- Defines TreeCache configuration -->
 <!-- ==================================================================== -->

 <mbean code="org.jboss.cache.TreeCache"
 name="jboss.cache:service=TreeCache">

 <depends>jboss:service=Naming</depends>
 <depends>jboss:service=TransactionManager</depends>

 <!--
 Configure the TransactionManager
 -->
 <attribute name="TransactionManagerLookupClass">org.jboss.cache.JBossTransactionManagerLookup</attribute>

 <!--
 Isolation level : SERIALIZABLE
 REPEATABLE_READ (default)
 READ_COMMITTED
 READ_UNCOMMITTED
 NONE
 -->
 <attribute name="IsolationLevel">REPEATABLE_READ</attribute>

 <!--
 Valid modes are LOCAL, REPL_ASYNC and REPL_SYNC
 -->
 <attribute name="CacheMode">REPL_ASYNC</attribute>

 <!--
 Just used for async repl: use a replication queue
 -->
 <attribute name="UseReplQueue">false</attribute>

 <!--
 Replication interval for replication queue (in ms)
 -->
 <attribute name="ReplQueueInterval">0</attribute>

 <!--
 Max number of elements which trigger replication
 -->
 <attribute name="ReplQueueMaxElements">0</attribute>

 <!-- Name of cluster. Needs to be the same for all members of the
 cluster, in order for them to find each other
 -->
 <attribute name="ClusterName">TelematicsCache</attribute>

 <!-- JGroups protocol stack properties. Can also be a URL,
 e.g. file:/home/bela/default.xml
 <attribute name="ClusterProperties"></attribute>
 -->

 <attribute name="ClusterConfig">
 <config>
 <TCP start_port="7900" bind_addr="143.61.XXX.XXX"/>
 <TCPPING initial_hosts="143.61.YYY.ZZZ[7900]" port_range="1" timeout="3000"
 num_initial_members="1" up_thread="true" down_thread="true"/>
 <FD shun="true" up_thread="true" down_thread="true" timeout="2500" max_tries="5" />
 <VERIFY_SUSPECT timeout="1500" up_thread="false" down_thread="false" />
 <pbcast.NAKACK gc_lag="100" retransmit_timeout="3000" up_thread="true" down_thread="true" />
 <pbcast.STABLE desired_avg_gossip="20000" up_thread="false" down_thread="false" />
 <pbcast.GMS join_timeout="5000" join_retry_timeout="2000" shun="false" print_local_addr="false" down_thread="true" up_thread="true" />
 <pbcast.STATE_TRANSFER up_thread="true" down_thread="true"/>
 </config>
 </attribute>

 <!--
 Max number of entries in the cache. If this is exceeded, the
 eviction policy will kick some entries out in order to make
 more room
 -->
 <attribute name="MaxCapacity">20000</attribute>

 <!--
 Whether or not to fetch state on joining a cluster
 -->
 <attribute name="FetchStateOnStartup">true</attribute>

 <!--
 The max amount of time (in milliseconds) we wait until the
 initial state (ie. the contents of the cache) are retrieved from
 existing members in a clustered environment
 -->
 <attribute name="InitialStateRetrievalTimeout">15000</attribute>

 <!--
 Number of milliseconds to wait until all responses for a
 synchronous call have been received.
 -->
 <attribute name="SyncReplTimeout">10000</attribute>

 <!-- Max number of milliseconds to wait for a lock acquisition -->
 <attribute name="LockAcquisitionTimeout">15000</attribute>

 <!-- Max number of milliseconds we hold a lock (not currently
 implemented) -->
 <attribute name="LockLeaseTimeout">60000</attribute>

 <!-- Name of the eviction policy class. Not supported now. -->
 <attribute name="EvictionPolicyClass"></attribute>
 </mbean>


</server>

I have two servers that are clustered: one Win2K server with Java 1.4.1 and one HP/UX server with Java 1.3.1. Both are running JBoss 3.2.2.

Here's the sequence of events that works as expected:

1) Start Win2K JBoss instance. TreeCache is initialized with data from this instance. Wait until it is completely started.
2) Start HP/UX JBoss instance. TreeCache state is transferred successfully to this node and additional data is added from this node. Both nodes have synchronized cache information.
3) Stop HP/UX JBoss instance. TreeCache is updated to reflect the loss of this node.
4) Start HP/UX JBoss instance again. TreeCache state is successfully transferred (again) and the cache is updated to reflect the reappearance of this node.

Now, time for quirkiness to occur -- new sequence of events:

1) Start Win2K JBoss instance. TreeCache is initialized with data from this instance. Wait until it is completely started.
2) Start HP/UX JBoss instance. TreeCache state is transferred successfully to this node and additional data is added from this node. Both nodes have synchronized cache information.
3) Stop Win2K JBoss instance. TreeCache is updated to reflect loss of this node.
4) Start Win2K JBoss instance again. TreeCache state appears to be transferred successfully, BUT the updates applied to the cache by the Win2K node are not distributed to the cache instance on the HP/UX node. Looking in the log, I see the following message repeated for each update that should have been replicated:

[ERROR] RpcDispatcher.handle(): exception=java.lang.ClassNotFoundException: No ClassLoaders found for: boolean

Any ideas as to what's going on here? Why would it work in one direction and not the other?

"bela" wrote:
Wrt docs, I'm working on it, but because I'm also working on several other projects, this thread gets a bit starved... :-)
Bela

I know the feeling... :-) I've got about 5 projects that are going concurrently, plus a baby on the way at home (which adds a whole lot of home projects!) If I can get this cache to work as I expect, it will really help take the pressure off of this project, so I can focus on the others.

4. Re: Initial state replication is not working! Please help!

belaban Mar 9, 2004 6:19 PM (in response to michael.daleiden)

First of all, add *all* hosts to initial_hosts. Second, use the same JDK on all members. I can't vouch for a cluster that mixes members running 1.4 with members running 1.3; e.g. serialVersionUIDs may have changed, leading to the serialization problem.
Bela
5. Re: Initial state replication is not working! Please help!

michael.daleiden Mar 9, 2004 7:54 PM (in response to michael.daleiden)

Thanks, Bela.

I had my suspicions about different versions of Java. I am in the process of upgrading the HP/UX server to Java 1.4. I will let you know if this solves the problem.

As for the initial_hosts, there are only the two servers at present. Is there some reason for having *all* servers in the initial_hosts list? I would rather not specify every host, since I expect to add hosts in the future and would not want to update the config each time I add a new server...
6. Re: Initial state replication is not working! Please help!

belaban Mar 9, 2004 9:00 PM (in response to michael.daleiden)

You need all hosts (A and B):

If A comes up 2nd it needs to 'see' B.
If B comes up 2nd it needs to see A.

Bela
7. Re: Initial state replication is not working! Please help!

belaban Mar 9, 2004 9:01 PM (in response to michael.daleiden)

For more hosts, you can get away with fewer initial_hosts, as long as a member that comes up as non-first member gets an initial response from at least 1 member.

Bela
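
The point about initial_hosts can be sketched as a single TCPPING element that both servers could share. This is an illustration, not a tested configuration: ServerA and ServerB are the placeholder host names from the first post, the comma-separated list is the usual JGroups form for initial_hosts, and the remaining attributes are copied from the configuration posted earlier in this thread.

```xml
<!-- Same element deployed on both servers: whichever node starts
     second can discover the node that is already running. -->
<TCPPING initial_hosts="ServerA[7900],ServerB[7900]" port_range="1" timeout="3000"
 num_initial_members="1" up_thread="true" down_thread="true"/>
```

With a larger cluster the list would not necessarily need every host, as long as each joining member can reach at least one member that is already up.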
8. Re: Initial state replication is not working! Please help!

michael.daleiden Mar 10, 2004 9:58 AM (in response to michael.daleiden)

Well, I switched the Win2K server back to JDK 1.3.1 (since this was easier than upgrading the HP/UX server to 1.4). The java.lang.ClassNotFoundException: No ClassLoaders found for: boolean error still occurs on the HP/UX server when the Win2K server is shut down and then restarted.

With respect to the initial_hosts, I have the HP/UX cache config set up to point to the Win2K server (e.g., initial_hosts="<Win2K server IP>[7900]") and the Win2K cache config set up to point at the HP/UX server (e.g., initial_hosts="<HP/UX server IP>[7900]"). This allows either server to come up first and have the second server "see" it when it comes up.

Any more ideas or suggestions to help resolve this?
9. Re: Initial state replication is not working! Please help!

michael.daleiden Mar 10, 2004 2:26 PM (in response to michael.daleiden)
After yet another journey down into the depths of the cache, it now appears that the problem lies with the combination of Java 1.3, TreeCache, and boolean attributes. Basically, once I downgraded the Win2K server to 1.3, it began exhibiting the same behaviour as the HP/UX server (java.lang.ClassNotFoundException: No ClassLoaders found for: boolean reported by JavaGroups).

More specifically, this error (so far) only seems to occur when setting a boolean attribute on a tree node. I tried using both styles of the put() method:
cache.put("/test/mynode", "testAttribute", new Boolean(true));

and
HashMap testMap = new HashMap();
testMap.put("testAttribute", new Boolean(true));
cache.put("/test/mynode", testMap);

In both cases, the classloader error was reported.

In order to check my sanity, I moved the Win2K server back to Java 1.4 and ran the tests again. Sure enough, the Win2K server no longer reported the classloader error.

So...to sum things up: it appears that when a JBoss server instance running Java 1.3 and TreeCache receives a replication transaction from another server that involves replication of a boolean attribute, it fails. If the recipient server is running Java 1.4, the replication is successful.

I have a request into our HP/UX admin to upgrade the HP/UX server to 1.4 (which will probably not occur until Monday or Tuesday of next week). Once this upgrade is complete, I'll run the tests again to verify my suspicions about Java 1.3 and the replication of booleans in TreeCache.
10. Re: Initial state replication is not working! Please help!

michael.daleiden Mar 12, 2004 9:39 AM (in response to michael.daleiden)

"michael.daleiden" wrote:
I have a request into our HP/UX admin to upgrade the HP/UX server to 1.4 (which will probably not occur until Monday or Tuesday of next week). Once this upgrade is complete, I'll run the tests again to verify my suspicions about Java 1.3 and the replication of booleans in TreeCache.

I was able to get the HP/UX server upgraded to Java 1.4 and I reran my tests. All tests were successful, which confirms my suspicions about replication failure for boolean cache values under Java 1.3. I don't know whether anyone at JBoss should spend time trying to resolve this issue, given the fact that Java 1.3 support has already been dropped by Sun and even 1.4.1 is about to go into "end of lifecycle" state at Sun. Maybe it should just be noted in the JBoss Cache documentation and/or release notes.