Initial state replication is not working! Please help!
michael.daleiden Mar 8, 2004 2:59 PM
I am using JBoss 3.2.2 with the TreeCache (non-AOP) and I am having problems with replication of the initial state of the cache when a new node joins the cluster. I have the TreeCache set up as a deployed MBean (in the /deploy directory).
Here's the scenario:
1) Server A is started, the TreeCache MBean is started successfully, and the cache is populated with several nodes as other elements of the server startup.
2) Server B is started and the TreeCache MBean is started, but it never "acquires" the existing state from the instance on Server A (the log displays the message: "start(): state could not be retrieved (must be first member in group)").
3) Server B completes its startup, adding more nodes to the cache as its components start. These nodes are replicated successfully to the cache on Server A.
The end result is that the cache on Server A correctly contains ALL of the nodes created by both Server A and Server B, while Server B only contains the nodes created by Server B -- the caches are not in sync!
Here is the MBean deployment descriptor for Server A:
<?xml version="1.0" encoding="UTF-8"?>

<!-- ===================================================================== -->
<!--                                                                       -->
<!--  Sample TreeCache Service Configuration                               -->
<!--                                                                       -->
<!-- ===================================================================== -->

<server>

    <classpath codebase="./lib" archives="jboss-cache.jar, jgroups.jar"/>

    <!-- ==================================================================== -->
    <!-- Defines TreeCache configuration                                      -->
    <!-- ==================================================================== -->

    <mbean code="org.jboss.cache.TreeCache" name="jboss.cache:service=TreeCache">

        <depends>jboss:service=Naming</depends>
        <depends>jboss:service=TransactionManager</depends>

        <!-- Configure the TransactionManager -->
        <attribute name="TransactionManagerLookupClass">org.jboss.cache.JBossTransactionManagerLookup</attribute>

        <!--
            Isolation level : SERIALIZABLE
                              REPEATABLE_READ (default)
                              READ_COMMITTED
                              READ_UNCOMMITTED
                              NONE
        -->
        <attribute name="IsolationLevel">REPEATABLE_READ</attribute>

        <!-- Valid modes are LOCAL, REPL_ASYNC and REPL_SYNC -->
        <attribute name="CacheMode">REPL_ASYNC</attribute>

        <!-- Just used for async repl: use a replication queue -->
        <attribute name="UseReplQueue">false</attribute>

        <!-- Replication interval for replication queue (in ms) -->
        <attribute name="ReplQueueInterval">0</attribute>

        <!-- Max number of elements which trigger replication -->
        <attribute name="ReplQueueMaxElements">0</attribute>

        <!-- Name of cluster. Needs to be the same for all clusters, in order
             to find each other -->
        <attribute name="ClusterName">TelematicsCache</attribute>

        <!-- JGroups protocol stack properties. Can also be a URL,
             e.g. file:/home/bela/default.xml
        <attribute name="ClusterProperties"></attribute>
        -->
        <attribute name="ClusterConfig">
            <config>
                <TCP start_port="7900"/>
                <TCPPING initial_hosts="ServerB[7900]" port_range="3" timeout="3000"
                         num_initial_members="3" up_thread="true" down_thread="true"/>
                <VERIFY_SUSPECT timeout="1500" up_thread="false" down_thread="false"/>
                <pbcast.NAKACK gc_lag="100" retransmit_timeout="3000"
                               up_thread="true" down_thread="true"/>
                <pbcast.STABLE desired_avg_gossip="20000"
                               up_thread="false" down_thread="false"/>
                <pbcast.GMS join_timeout="5000" join_retry_timeout="2000" shun="false"
                            print_local_addr="false" down_thread="true" up_thread="true"/>
            </config>
        </attribute>

        <!-- Max number of entries in the cache. If this is exceeded, the
             eviction policy will kick some entries out in order to make
             more room -->
        <attribute name="MaxCapacity">20000</attribute>

        <!-- Whether or not to fetch state on joining a cluster -->
        <attribute name="FetchStateOnStartup">true</attribute>

        <!-- The max amount of time (in milliseconds) we wait until the
             initial state (ie. the contents of the cache) are retrieved from
             existing members in a clustered environment -->
        <attribute name="InitialStateRetrievalTimeout">15000</attribute>

        <!-- Number of milliseconds to wait until all responses for a
             synchronous call have been received. -->
        <attribute name="SyncReplTimeout">10000</attribute>

        <!-- Max number of milliseconds to wait for a lock acquisition -->
        <attribute name="LockAcquisitionTimeout">15000</attribute>

        <!-- Max number of milliseconds we hold a lock (not currently
             implemented) -->
        <attribute name="LockLeaseTimeout">60000</attribute>

        <!-- Name of the eviction policy class. Not supported now. -->
        <attribute name="EvictionPolicyClass"></attribute>

    </mbean>

</server>
The MBean deployment descriptor on Server B is identical, except the "initial_hosts" attribute is set to "ServerA[7900]".
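To make the difference concrete, this is the one element that changes on Server B. It is copied from the descriptor above with only the host name in "initial_hosts" swapped; everything else in the file stays the same:

```xml
<!-- Server B: discovery points at Server A instead of Server B -->
<TCPPING initial_hosts="ServerA[7900]" port_range="3" timeout="3000"
         num_initial_members="3" up_thread="true" down_thread="true"/>
```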
I need to use TCP for JGroups, as the servers are in different subnets and the network admins do not allow cross-subnet UDP multicasting.
Any ideas as to why this is occurring? I really need to get this up and running quickly, as we are gearing up for production deployment.