3 Replies Latest reply on Jan 6, 2006 4:24 PM by belaban

    Getting TreeCacheAOP Clustering, A Few Questions


      Hi all!

      It feels like I'm so close, yet so far. After a couple of days of configuring and building, I've been able to get annotated POJOs working with TreeCacheAOP and Java 1.5. (Most of that time went to my own bonehead mistakes.)

      The cache works well locally, but I can't seem to get it to cluster. I've got a listener on the mcast_addr, so I see instances come up and announce that they're running with the same cache name. I don't believe it's a JGroups or networking issue (although I've been wrong about 100 times in the last couple of days).

      Here's what I'm doing:

      1) Deploy app to Tomcat 5.5 on machine 1. Access the app and see TreeCacheAOP working (at least locally). Check the logs and verify that jgroups has bound to an address. I see this, which tells me the cache is running:

      0 [http-] INFO org.jboss.cache.PropertyConfigurator - Found existing property editor for org.w3c.dom.Element: org.jboss.util.propertyeditor.ElementEditor@15099a1
      31 [http-] INFO org.jboss.cache.PropertyConfigurator - configure(): attribute size: 20
      47 [http-] INFO org.jboss.cache.TreeCache - setting cluster properties from xml to: UDP(ip_mcast=true;ip_ttl=64;loopback=true;mcast_addr=;mcast_port=41332;mcast_recv_buf_size=80000;mcast_send_buf_size=150000;ucast_recv_buf_size=80000;ucast_send_buf_size=150000):PING(down_thread=false;num_initial_members=3;timeout=2000;up_thread=false):MERGE2(max_interval=20000;min_interval=10000):FD_SOCK:VERIFY_SUSPECT(down_thread=false;timeout=1500;up_thread=false):pbcast.NAKACK(down_thread=false;gc_lag=50;max_xmit_size=8192;retransmit_timeout=600,1200,2400,4800;up_thread=false):UNICAST(down_thread=false;min_threshold=10;timeout=600,1200,2400;window_size=100):pbcast.STABLE(desired_avg_gossip=20000;down_thread=false;up_thread=false):FRAG(down_thread=false;frag_size=8192;up_thread=false):pbcast.GMS(join_retry_timeout=2000;join_timeout=5000;print_local_addr=true;shun=true):pbcast.STATE_TRANSFER(down_thread=true;up_thread=true)
      78 [http-] INFO org.jboss.cache.TreeCache - setEvictionPolicyConfig(): [config: null]
      110 [http-] INFO org.jboss.cache.TreeCache - interceptor chain is:
      class org.jboss.cache.interceptors.CallInterceptor
      class org.jboss.cache.interceptors.PessimisticLockInterceptor
      class org.jboss.cache.interceptors.UnlockInterceptor
      class org.jboss.cache.interceptors.ReplicationInterceptor
      141 [http-] INFO org.jboss.cache.TreeCache - cache mode is REPL_SYNC
      1110 [http-] INFO org.jboss.cache.TreeCache - USE_MARSHALLING is true. We will marshall/unmarshall the value.
      GMS: address is
      3203 [Thread-153] INFO org.jboss.cache.TreeCache - viewAccepted(): [|0] []
      3219 [http-] INFO org.jboss.cache.TreeCache - my local address is
      3219 [http-] INFO org.jboss.cache.TreeCache - state could not be retrieved (must be first member in group)
      3219 [http-] INFO org.jboss.cache.eviction.LRUPolicy - Starting eviction policy using the provider: org.jboss.cache.eviction.AopLRUPolicy
      3219 [http-] INFO org.jboss.cache.eviction.LRUPolicy - Starting a eviction timer with wake up interval of (secs) 5
      3219 [Thread-153] INFO org.jboss.cache.TreeCache - new cache is null (maybe first member in cluster)
      3219 [http-] INFO org.jboss.cache.TreeCache - Cache is started!!

      2) Check the mcast listener. It sees machine 1's messages and the correct cache name.

      3) Deploy app to Tomcat 5.5 on machine 2. Access the app and see TreeCacheAOP in action. Check the logs and verify that jgroups has bound to an address. I see log messages similar to those from machine 1. Unfortunately, machine 2 doesn't connect with machine 1 and it creates its own cluster. I see:

      3219 [Thread-153] INFO org.jboss.cache.TreeCache - new cache is null (maybe first member in cluster)

      4) Check the mcast listener. I now see heartbeats from both machine 1 and machine 2. They both report the same cache name, "RegerCom-TreeCache-Cluster", which makes me think they should start replicating.

      But in the logs of both I see things like:

      79687 [UpHandler (FD_SOCK)] WARN org.jgroups.protocols.pbcast.NAKACK -] discarded message from non-member

      So I shut down all instances and bring them up, one at a time, about two minutes apart. I've got three machines that I do this with. They all share the same replSync-service.xml, which I've tweaked throughout the week:

       <?xml version="1.0" encoding="UTF-8"?>
       <server>
         <classpath codebase="./lib" archives="jboss-cache.jar, jgroups.jar"/>
         <mbean code="org.jboss.cache.aop.TreeCacheAop">
           <attribute name="TransactionManagerLookupClass">org.jboss.cache.JBossTransactionManagerLookup</attribute>
           <attribute name="IsolationLevel">REPEATABLE_READ</attribute>
           <attribute name="CacheMode">REPL_SYNC</attribute>
           <attribute name="UseReplQueue">false</attribute>
           <attribute name="ReplQueueInterval">0</attribute>
           <attribute name="ReplQueueMaxElements">0</attribute>
           <attribute name="ClusterName">RegerCom-TreeCache-Cluster</attribute>
           <attribute name="ClusterConfig">
             <config>
               <UDP mcast_addr="" mcast_port="41332"
                    ip_ttl="64" ip_mcast="true"
                    mcast_send_buf_size="150000" mcast_recv_buf_size="80000"
                    ucast_send_buf_size="150000" ucast_recv_buf_size="80000"/>
               <PING timeout="2000" num_initial_members="3" up_thread="false" down_thread="false"/>
               <MERGE2 min_interval="10000" max_interval="20000"/>
               <VERIFY_SUSPECT timeout="1500" up_thread="false" down_thread="false"/>
               <pbcast.NAKACK gc_lag="50" retransmit_timeout="600,1200,2400,4800" max_xmit_size="8192" up_thread="false" down_thread="false"/>
               <UNICAST timeout="600,1200,2400" window_size="100" min_threshold="10" down_thread="false"/>
               <pbcast.STABLE desired_avg_gossip="20000" up_thread="false" down_thread="false"/>
               <FRAG frag_size="8192" down_thread="false" up_thread="false"/>
               <pbcast.GMS join_timeout="10000" join_retry_timeout="2000" shun="true" print_local_addr="true"/>
               <pbcast.STATE_TRANSFER up_thread="true" down_thread="true"/>
             </config>
           </attribute>
           <attribute name="FetchStateOnStartup">true</attribute>
           <attribute name="InitialStateRetrievalTimeout">15000</attribute>
           <attribute name="SyncReplTimeout">15000</attribute>
           <attribute name="LockAcquisitionTimeout">10000</attribute>
           <attribute name="EvictionPolicyClass">org.jboss.cache.eviction.AopLRUPolicy</attribute>
           <attribute name="EvictionPolicyConfig">
             <config>
               <attribute name="wakeUpIntervalSeconds">5</attribute>
               <region name="/_default_">
                 <attribute name="maxNodes">50000</attribute>
                 <attribute name="timeToLiveSeconds">0</attribute>
               </region>
               <region name="/usersession">
                 <attribute name="maxNodes">50000</attribute>
                 <attribute name="timeToLiveSeconds">0</attribute>
               </region>
             </config>
           </attribute>
           <attribute name="UseMarshalling">true</attribute>
         </mbean>
       </server>

      All machines share the same code + jars (deployed as a war file), same jvm version, same Tomcat 5.5.12 version, same network segment.
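
      One way to double-check that every node really starts with the same protocol stack is to compare the "setting cluster properties from xml to:" line from each machine's log, since JGroups members need matching stacks to form one group. Below is a minimal sketch of such a comparison using only the JDK. The class name StackDiff and its methods are made up for illustration and are not part of JBoss Cache or JGroups; it assumes the stack string has the PROTOCOL(key=value;key=value):PROTOCOL:... shape seen in the log above.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

// Hypothetical helper: compares two JGroups stack strings copied from TreeCache logs.
class StackDiff {

    // Splits "UDP(a=1;b=2):FD_SOCK:..." into protocol name -> properties, keeping stack order.
    static Map<String, Map<String, String>> parse(String stack) {
        Map<String, Map<String, String>> out = new LinkedHashMap<>();
        int depth = 0, start = 0;
        for (int i = 0; i <= stack.length(); i++) {
            // Treat end of input as a final ':' so the last protocol is flushed.
            char c = i < stack.length() ? stack.charAt(i) : ':';
            if (c == '(') depth++;
            else if (c == ')') depth--;
            else if (c == ':' && depth == 0) {
                addProtocol(stack.substring(start, i), out);
                start = i + 1;
            }
        }
        return out;
    }

    private static void addProtocol(String token, Map<String, Map<String, String>> out) {
        token = token.trim();
        if (token.isEmpty()) return;
        int open = token.indexOf('(');
        String name = open < 0 ? token : token.substring(0, open);
        Map<String, String> props = new TreeMap<>();
        if (open >= 0 && token.endsWith(")")) {
            String body = token.substring(open + 1, token.length() - 1);
            for (String kv : body.split(";")) {
                int eq = kv.indexOf('=');
                if (eq > 0) props.put(kv.substring(0, eq).trim(), kv.substring(eq + 1).trim());
            }
        }
        out.put(name, props);
    }

    // Returns one line per difference; an empty list means the two stacks match.
    static List<String> diff(String a, String b) {
        Map<String, Map<String, String>> pa = parse(a), pb = parse(b);
        List<String> out = new ArrayList<>();
        if (!new ArrayList<>(pa.keySet()).equals(new ArrayList<>(pb.keySet()))) {
            out.add("protocol order differs: " + pa.keySet() + " vs " + pb.keySet());
        }
        for (String proto : pa.keySet()) {
            Map<String, String> x = pa.get(proto), y = pb.get(proto);
            if (y == null) continue; // already reported by the order check
            Set<String> keys = new TreeSet<>(x.keySet());
            keys.addAll(y.keySet());
            for (String k : keys) {
                String vx = x.get(k), vy = y.get(k);
                if (vx == null ? vy != null : !vx.equals(vy)) {
                    out.add(proto + "." + k + ": " + vx + " vs " + vy);
                }
            }
        }
        return out;
    }
}
```

      Feeding it the stack strings from two machines would list any property that differs, for example a join_timeout that is 5000 on one node and 10000 on another.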

      About one time in five, seemingly at random, two of the machines will cluster, but never all three. It's neither predictable nor stable. I'm sure there's something misconfigured on my side, but I'm at a loss as to what it is.

      Does anything look out of whack? Can you point me to common clustering issues? I've checked out the jgroups doc for ideas. Is it possible that I have two instances able to see each other via multicast, using the same cluster name but not clustering? I thought that if I had the same replSync-service.xml on all three that they'd always cluster up.
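
      To rule out the network independently of JBoss Cache, a plain multicast send/receive test between two of the machines can help. The following is a minimal sketch using only java.net. The class name McastCheck is made up, and the group address 228.1.2.3 in the usage note is only a placeholder, not the address from my config; the port 41332 is the mcast_port from the config above. I believe JGroups also ships McastSenderTest and McastReceiverTest in org.jgroups.tests for the same purpose.

```java
import java.net.DatagramPacket;
import java.net.InetAddress;
import java.net.MulticastSocket;
import java.net.SocketTimeoutException;
import java.net.UnknownHostException;
import java.nio.charset.StandardCharsets;

// Hypothetical standalone check: can one machine hear another on a multicast group?
class McastCheck {

    // True only for a non-empty address that lies in the IP multicast range.
    static boolean isMulticast(String addr) {
        if (addr == null || addr.trim().isEmpty()) return false;
        try {
            return InetAddress.getByName(addr.trim()).isMulticastAddress();
        } catch (UnknownHostException e) {
            return false;
        }
    }

    // Sends a single text datagram to the group, with the same TTL as the UDP config (64).
    static void send(InetAddress group, int port, String text) throws Exception {
        MulticastSocket sock = new MulticastSocket();
        try {
            sock.setTimeToLive(64);
            byte[] data = text.getBytes(StandardCharsets.UTF_8);
            sock.send(new DatagramPacket(data, data.length, group, port));
        } finally {
            sock.close();
        }
    }

    // Joins the group and prints every datagram until nothing arrives for timeoutMs.
    static void receive(InetAddress group, int port, int timeoutMs) throws Exception {
        MulticastSocket sock = new MulticastSocket(port);
        try {
            sock.joinGroup(group);
            sock.setSoTimeout(timeoutMs);
            byte[] buf = new byte[1024];
            while (true) {
                DatagramPacket p = new DatagramPacket(buf, buf.length);
                try {
                    sock.receive(p);
                } catch (SocketTimeoutException e) {
                    System.out.println("nothing received within " + timeoutMs + " ms");
                    return;
                }
                String text = new String(p.getData(), 0, p.getLength(), StandardCharsets.UTF_8);
                System.out.println(p.getAddress().getHostAddress() + ": " + text);
            }
        } finally {
            sock.close();
        }
    }

    public static void main(String[] args) throws Exception {
        if (args.length < 3 || !isMulticast(args[1])) {
            System.err.println("usage: McastCheck send|recv <multicast-addr> <port> [text]");
            return;
        }
        InetAddress group = InetAddress.getByName(args[1]);
        int port = Integer.parseInt(args[2]);
        if ("send".equals(args[0])) {
            send(group, port, args.length > 3 ? args[3] : "hello");
        } else {
            receive(group, port, 10000);
        }
    }
}
```

      With the webapp stopped, one machine would run "java McastCheck recv 228.1.2.3 41332" and another "java McastCheck send 228.1.2.3 41332 hello". If the receiver prints nothing in both directions, the problem is probably below JGroups (multicast routing, interface binding, or a firewall) rather than in the cache config.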

      Thanks for any help you can offer. I'm looking forward to running TreeCacheAOP! It's a great piece of work and exactly what many webapps need.