3 Replies Latest reply on Jan 6, 2006 4:24 PM by belaban

Getting TreeCacheAOP Clustering, A Few Questions

joereger Jan 6, 2006 10:50 AM

Hi all!

It feels like I'm so close, yet so far. After a couple of days of configuring and building I've been able to get annotated POJOs working with TreeCacheAOP and Java 1.5. (Most of that time was spent on my own bonehead mistakes.)

The cache works well locally but I can't seem to get it clustering. I've got a listener on the mcast_addr so I see instances come up and announce that they're running with the same cache name. I don't believe it's a jgroups or networking issue (although I've been wrong about 100 times in the last couple of days).

Here's what I'm doing:

1) Deploy app to Tomcat 5.5 on machine 1. Access the app and see TreeCacheAOP working (at least locally). Check the logs and verify that jgroups has bound to an address. I see this, which tells me the cache is running:

0 [http-127.0.0.1-80-1] INFO org.jboss.cache.PropertyConfigurator - Found existing property editor for org.w3c.dom.Element: org.jboss.util.propertyeditor.ElementEditor@15099a1
31 [http-127.0.0.1-80-1] INFO org.jboss.cache.PropertyConfigurator - configure(): attribute size: 20
47 [http-127.0.0.1-80-1] INFO org.jboss.cache.TreeCache - setting cluster properties from xml to: UDP(ip_mcast=true;ip_ttl=64;loopback=true;mcast_addr=228.1.2.3;mcast_port=41332;mcast_recv_buf_size=80000;mcast_send_buf_size=150000;ucast_recv_buf_size=80000;ucast_send_buf_size=150000):PING(down_thread=false;num_initial_members=3;timeout=2000;up_thread=false):MERGE2(max_interval=20000;min_interval=10000):FD_SOCK:VERIFY_SUSPECT(down_thread=false;timeout=1500;up_thread=false):pbcast.NAKACK(down_thread=false;gc_lag=50;max_xmit_size=8192;retransmit_timeout=600,1200,2400,4800;up_thread=false):UNICAST(down_thread=false;min_threshold=10;timeout=600,1200,2400;window_size=100):pbcast.STABLE(desired_avg_gossip=20000;down_thread=false;up_thread=false):FRAG(down_thread=false;frag_size=8192;up_thread=false):pbcast.GMS(join_retry_timeout=2000;join_timeout=5000;print_local_addr=true;shun=true):pbcast.STATE_TRANSFER(down_thread=true;up_thread=true)
78 [http-127.0.0.1-80-1] INFO org.jboss.cache.TreeCache - setEvictionPolicyConfig(): [config: null]
110 [http-127.0.0.1-80-1] INFO org.jboss.cache.TreeCache - interceptor chain is:
class org.jboss.cache.interceptors.CallInterceptor
class org.jboss.cache.interceptors.PessimisticLockInterceptor
class org.jboss.cache.interceptors.UnlockInterceptor
class org.jboss.cache.interceptors.ReplicationInterceptor
141 [http-127.0.0.1-80-1] INFO org.jboss.cache.TreeCache - cache mode is REPL_SYNC
1110 [http-127.0.0.1-80-1] INFO org.jboss.cache.TreeCache - USE_MARSHALLING is true. We will marshall/unmarshall the value.

-------------------------------------------------------
GMS: address is 192.168.1.101:2181
-------------------------------------------------------
3203 [Thread-153] INFO org.jboss.cache.TreeCache - viewAccepted(): [192.168.1.101:2181|0] [192.168.1.101:2181]
3219 [http-127.0.0.1-80-1] INFO org.jboss.cache.TreeCache - my local address is 192.168.1.101:2181
3219 [http-127.0.0.1-80-1] INFO org.jboss.cache.TreeCache - state could not be retrieved (must be first member in group)
3219 [http-127.0.0.1-80-1] INFO org.jboss.cache.eviction.LRUPolicy - Starting eviction policy using the provider: org.jboss.cache.eviction.AopLRUPolicy
3219 [http-127.0.0.1-80-1] INFO org.jboss.cache.eviction.LRUPolicy - Starting a eviction timer with wake up interval of (secs) 5
3219 [Thread-153] INFO org.jboss.cache.TreeCache - new cache is null (maybe first member in cluster)
3219 [http-127.0.0.1-80-1] INFO org.jboss.cache.TreeCache - Cache is started!!

2) Check the mcast listener. It sees machine 1's messages and the correct cache name.

3) Deploy app to Tomcat 5.5 on machine 2. Access the app and see TreeCacheAOP in action. Check the logs and verify that jgroups has bound to an address. I see log messages similar to those from machine 1. Unfortunately, machine 2 doesn't connect with machine 1 and it creates its own cluster. I see:

3219 [Thread-153] INFO org.jboss.cache.TreeCache - new cache is null (maybe first member in cluster)

4) Check the mcast listener. I now see heartbeats from both machine 1 and machine 2. They both report the same cache name "RegerCom-TreeCache-Cluster". That makes me think they should start replicating.

But in the logs of both I see things like:

79687 [UpHandler (FD_SOCK)] WARN org.jgroups.protocols.pbcast.NAKACK - 192.168.1.101:2356] discarded message from non-member 192.168.1.103:1195

So I shut down all instances and bring them up, one at a time, about two minutes apart. I've got three machines that I do this with. They all share the same replSync-service.xml, which I've tweaked throughout the week:

<?xml version="1.0" encoding="UTF-8"?>
<server>
  <classpath codebase="./lib" archives="jboss-cache.jar, jgroups.jar"/>
  <mbean code="org.jboss.cache.aop.TreeCacheAop"
         name="jboss.cache:service=TreeCacheAop">
    <depends>jboss:service=Naming</depends>
    <depends>jboss:service=TransactionManager</depends>
    <attribute name="TransactionManagerLookupClass">org.jboss.cache.JBossTransactionManagerLookup</attribute>
    <attribute name="IsolationLevel">REPEATABLE_READ</attribute>
    <attribute name="CacheMode">REPL_SYNC</attribute>
    <attribute name="UseReplQueue">false</attribute>
    <attribute name="ReplQueueInterval">0</attribute>
    <attribute name="ReplQueueMaxElements">0</attribute>
    <attribute name="ClusterName">RegerCom-TreeCache-Cluster</attribute>
    <attribute name="ClusterConfig">
      <config>
        <UDP mcast_addr="228.1.2.3" mcast_port="41332"
             ip_ttl="64" ip_mcast="true"
             mcast_send_buf_size="150000" mcast_recv_buf_size="80000"
             ucast_send_buf_size="150000" ucast_recv_buf_size="80000"
             loopback="true"/>
        <PING timeout="2000" num_initial_members="3" up_thread="false" down_thread="false"/>
        <MERGE2 min_interval="10000" max_interval="20000"/>
        <FD_SOCK/>
        <VERIFY_SUSPECT timeout="1500" up_thread="false" down_thread="false"/>
        <pbcast.NAKACK gc_lag="50" retransmit_timeout="600,1200,2400,4800" max_xmit_size="8192" up_thread="false" down_thread="false"/>
        <UNICAST timeout="600,1200,2400" window_size="100" min_threshold="10" down_thread="false"/>
        <pbcast.STABLE desired_avg_gossip="20000" up_thread="false" down_thread="false"/>
        <FRAG frag_size="8192" down_thread="false" up_thread="false"/>
        <pbcast.GMS join_timeout="10000" join_retry_timeout="2000" shun="true" print_local_addr="true"/>
        <pbcast.STATE_TRANSFER up_thread="true" down_thread="true"/>
      </config>
    </attribute>
    <attribute name="FetchStateOnStartup">true</attribute>
    <attribute name="InitialStateRetrievalTimeout">15000</attribute>
    <attribute name="SyncReplTimeout">15000</attribute>
    <attribute name="LockAcquisitionTimeout">10000</attribute>
    <attribute name="EvictionPolicyClass">org.jboss.cache.eviction.AopLRUPolicy</attribute>
    <attribute name="EvictionPolicyConfig">
      <config>
        <attribute name="wakeUpIntervalSeconds">5</attribute>
        <region name="/_default_">
          <attribute name="maxNodes">50000</attribute>
          <attribute name="timeToLiveSeconds">0</attribute>
        </region>
        <region name="/usersession">
          <attribute name="maxNodes">50000</attribute>
          <attribute name="timeToLiveSeconds">0</attribute>
        </region>
      </config>
    </attribute>
    <attribute name="UseMarshalling">true</attribute>
  </mbean>
</server>

All machines share the same code + jars (deployed as a war file), same jvm version, same Tomcat 5.5.12 version, same network segment.

About one time in five, and seemingly randomly, I'll get some clustering to happen between two of the machines, but never between all three. But it's not predictable and not solid. I'm sure there's something misconfigured on my side, but I'm at a loss as to what it is.

Does anything look out of whack? Can you point me to common clustering issues? I've checked out the jgroups doc for ideas. Is it possible that I have two instances able to see each other via multicast, using the same cluster name but not clustering? I thought that if I had the same replSync-service.xml on all three that they'd always cluster up.

Thanks for any help you can offer. I'm looking forward to running TreeCacheAOP! It's a great piece of work and exactly what many webapps need.

Best,

Joe

1. Re: Getting TreeCacheAOP Clustering, A Few Questions

joereger Jan 6, 2006 1:31 PM (in response to joereger)

I just ran the test outlined on this page: http://wiki.jboss.org/wiki/Wiki.jsp?page=TestingJBoss

With the jgroups config I'm using in replSync-service.xml my machines have no trouble finding one another and clustering. Why do my TreeCacheAOP caches always think that there's no cache and that they need to create a new cluster even if one already exists?

Thanks,

-- Joe

2. Re: Getting TreeCacheAOP Clustering, A Few Questions

brian.stansberry Jan 6, 2006 2:10 PM (in response to joereger)

Does the ViewDemo test work consistently (i.e. it wasn't just luck that it worked once)? If so, this is quite strange, as TreeCacheAop doesn't do anything special to create its JGroups channel -- it applies the protocol stack config from the xml (which looks fine) and starts the channel using the name in the "ClusterName" attribute. If the ViewDemo test works consistently using your protocol stack config and the different cache instances have the same ClusterName attribute, they should see each other.

All I can recommend is turning up the logging on org.jgroups to DEBUG and seeing whether anything looks odd. Maybe start with org.jgroups.protocols.pbcast.GMS in order to reduce the noise level in the logs.
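
A minimal sketch of that logging change, assuming the webapp logs through log4j 1.2 and picks up a log4j.xml from WEB-INF/classes. The log lines earlier in the thread look like log4j output, but nothing in the thread confirms it, so treat this as an assumption. The appender name CONSOLE and the conversion pattern are illustrative, not taken from the original setup:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE log4j:configuration SYSTEM "log4j.dtd">
<log4j:configuration xmlns:log4j="http://jakarta.apache.org/log4j/">

  <!-- Illustrative console appender; the pattern is an assumption -->
  <appender name="CONSOLE" class="org.apache.log4j.ConsoleAppender">
    <layout class="org.apache.log4j.PatternLayout">
      <param name="ConversionPattern" value="%r [%t] %p %c - %m%n"/>
    </layout>
  </appender>

  <!-- Start narrow: group membership (join and view handling) only -->
  <logger name="org.jgroups.protocols.pbcast.GMS">
    <level value="DEBUG"/>
  </logger>

  <!-- If GMS alone shows nothing odd, widen to the whole stack by
       adding a second logger named org.jgroups at level DEBUG -->

  <!-- Everything else stays at INFO to keep the noise down -->
  <root>
    <level value="INFO"/>
    <appender-ref ref="CONSOLE"/>
  </root>

</log4j:configuration>
```

Keeping the root at INFO and raising only the GMS logger means the join attempts and view changes stand out without the per-message traffic from the rest of the protocol stack.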

3. Re: Getting TreeCacheAOP Clustering, A Few Questions

belaban Jan 6, 2006 4:24 PM (in response to joereger)

Try the tips in here too: http://wiki.jboss.org/wiki/Wiki.jsp?page=HandleJoinProblem