
My single machine, super cluster

wesleyhall Apr 20, 2006 6:28 AM

Hello folks,

I seem to be having a problem with the TreeCache replication service.

I have been receiving a lot of console output from the TreeCache logger category, which seems a little odd. For example:

11:00:59,531 INFO [STDOUT]
-------------------------------------------------------
GMS: address is cgl0677:3230
-------------------------------------------------------
11:01:01,531 INFO [TreeCache] viewAccepted(): new members: [CGL0657:1237, cgl06
77:2325, cgl0677:2335, cgl0677:2362, cgl0677:2388, cgl0677:2400, cgl0677:2409, c
gl0677:2420, cgl0677:2433, cgl0677:2439, cgl0677:2449, cgl0677:2455, cgl0677:246
7, cgl0677:2475, cgl0677:2486, cgl0677:2496, cgl0677:2502, cgl0677:2511, cgl0677
:2523, cgl0677:2531, cgl0677:2542, cgl0677:2553, cgl0677:2559, cgl0677:2564, cgl
0677:2582, cgl0677:2591, cgl0677:2602, cgl0677:2612, cgl0677:2624, cgl0677:2629,
 cgl0677:2684, cgl0677:2696, cgl0677:2704, cgl0677:2715, cgl0677:2721, cgl0677:2
727, cgl0677:2739, cgl0677:2749, cgl0677:2752, cgl0677:2762, cgl0677:2769, cgl06
77:2780, cgl0677:2791, cgl0677:2794, cgl0677:2829, cgl0677:2839, cgl0677:2851, c
gl0677:2890, cgl0677:2902, cgl0677:2910, cgl0677:2921, cgl0677:2930, cgl0677:293
5, cgl0677:2941, cgl0677:2947, cgl0677:2953, cgl0677:2959, cgl0677:2967, cgl0677
:2973, cgl0677:2979, cgl0677:2985, cgl0677:2991, cgl0677:2997, cgl0677:3003, cgl
0677:3009, cgl0677:3015, cgl0677:3021, cgl0677:3028, cgl0677:3037, cgl0677:3041,
 cgl0677:3056, cgl0677:3080, cgl0677:3086, cgl0677:3092, cgl0677:3098, cgl0677:3
104, cgl0677:3117, cgl0677:3126, cgl0677:3172, cgl0677:3182, cgl0677:3189, cgl06
77:3195, cgl0677:3201, cgl0677:3209, cgl0677:3215, cgl0677:3223, cgl0677:3230]
11:01:01,546 INFO [TreeCache] received the state (size=192 bytes)
11:01:01,546 INFO [TreeCache] transient state: 140 bytes
11:01:01,546 INFO [TreeCache] setting transient state
11:01:01,546 INFO [TreeCache] locking the old tree
11:01:01,546 INFO [TreeCache] locking the old tree was successful
11:01:01,562 INFO [TreeCache] setting the transient state was successful
11:01:01,562 INFO [TreeCache] forcing release of all locks in old tree
11:01:03,765 INFO [STDOUT]
-------------------------------------------------------
GMS: address is cgl0677:3233
-------------------------------------------------------
11:01:05,765 INFO [TreeCache] viewAccepted(): new members: [CGL0657:1263, cgl06
77:2405, cgl0677:2412, cgl0677:2423, cgl0677:2431, cgl0677:2441, cgl0677:2452, c
gl0677:2461, cgl0677:2471, cgl0677:2478, cgl0677:2489, cgl0677:2499, cgl0677:250
5, cgl0677:2517, cgl0677:2527, cgl0677:2534, cgl0677:2545, cgl0677:2554, cgl0677
:2561, cgl0677:2567, cgl0677:2588, cgl0677:2594, cgl0677:2603, cgl0677:2611, cgl
0677:2625, cgl0677:2632, cgl0677:2690, cgl0677:2700, cgl0677:2707, cgl0677:2724,
 cgl0677:2730, cgl0677:2755, cgl0677:2765, cgl0677:2772, cgl0677:2783, cgl0677:2
788, cgl0677:2797, cgl0677:2832, cgl0677:2842, cgl0677:2854, cgl0677:2896, cgl06
77:2906, cgl0677:2913, cgl0677:2924, cgl0677:2931, cgl0677:2938, cgl0677:2944, c
gl0677:2950, cgl0677:2956, cgl0677:2962, cgl0677:2970, cgl0677:2976, cgl0677:298
2, cgl0677:2988, cgl0677:2994, cgl0677:3000, cgl0677:3006, cgl0677:3012, cgl0677
:3018, cgl0677:3024, cgl0677:3030, cgl0677:3036, cgl0677:3044, cgl0677:3063, cgl
0677:3066, cgl0677:3083, cgl0677:3089, cgl0677:3095, cgl0677:3101, cgl0677:3107,
 cgl0677:3114, cgl0677:3122, cgl0677:3175, cgl0677:3185, cgl0677:3192, cgl0677:3
198, cgl0677:3204, cgl0677:3212, cgl0677:3219, cgl0677:3226, cgl0677:3233]
11:01:05,781 INFO [TreeCache] received the state (size=192 bytes)
11:01:05,781 INFO [TreeCache] transient state: 140 bytes
11:01:05,781 INFO [TreeCache] setting transient state
11:01:05,781 INFO [TreeCache] locking the old tree
11:01:05,781 INFO [TreeCache] locking the old tree was successful
11:01:05,781 INFO [TreeCache] setting the transient state was successful
11:01:05,781 INFO [TreeCache] forcing release of all locks in old tree
11:01:11,343 INFO [STDOUT]

I am not sure whether this is related, but we are also seeing the following message a lot. Sometimes it repeats as if it were looping infinitely, and it won't stop until the server is shut down:

11:00:49,640 ERROR [ClientGmsImpl] suspect() should not be invoked on an instance of org.jgroups.protocols.pbcast.ClientGmsImpl

I am not very experienced with the TreeCache/JGroups replication mechanism, but it looks to me as if my host (CGL0677) is joining the notification group multiple times. The first host in the list (CGL0657) belongs to a colleague who is running the same configuration as me but with a different partition name. Strangely, restarting my JBoss instance doesn't resolve the problem: the server comes back up with exactly the same behaviour, which suggests to me that the replication nodes are persisted somewhere. Is this true?
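
One way to check that suspicion is to tally the members of a single view by host name. The following is only a minimal sketch and not part of JBoss or JGroups: the class name ViewMemberTally and the method countByHost are hypothetical, and it assumes the console text has the same shape as the output quoted above (a hard-wrapped "new members: [host:port, ...]" list).

```java
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

public class ViewMemberTally {

    /**
     * Counts cluster members per host in the first "new members: [...]" list
     * found in the given console text. Host names are compared
     * case-insensitively, so CGL0677 and cgl0677 count as the same host.
     */
    public static Map<String, Integer> countByHost(String consoleText) {
        Map<String, Integer> counts = new TreeMap<>();
        // The console hard-wraps long lines, so join them before parsing.
        String joined = consoleText.replace("\r", "").replace("\n", "");
        int key = joined.indexOf("new members:");
        if (key < 0) {
            return counts; // no view line in this text
        }
        int open = joined.indexOf('[', key);
        int close = joined.indexOf(']', open + 1);
        if (open < 0 || close < 0) {
            return counts; // member list is missing or truncated
        }
        for (String member : joined.substring(open + 1, close).split(",")) {
            String address = member.trim();
            int colon = address.lastIndexOf(':');
            if (colon <= 0) {
                continue; // not a host:port entry
            }
            String host = address.substring(0, colon).toLowerCase(Locale.ROOT);
            counts.merge(host, 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        // Shortened sample in the same shape as the log above, including a wrapped line.
        String sample = "11:01:01,531 INFO [TreeCache] viewAccepted(): new members: [CGL0657:1237, cgl06\n"
                + "77:2325, cgl0677:2335, cgl0677:3230]";
        System.out.println(countByHost(sample)); // prints {cgl0657=1, cgl0677=3}
    }
}
```

A host that shows up many times with different ports in one view, as cgl0677 does in the output above, means that many separate group members were registered under that host name.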

After some investigation, I found the problem disappears when I remove the ejb3.deployer from my deploy directory, suggesting the problem is either the SFSB cache or the entity cache.

Has anybody seen or resolved this problem before? If anybody has time to post an explanation, it would be most gratefully received. Below is the configuration from my ejb3.deployer.

Thanks.

<mbean code="org.jboss.ejb3.cache.tree.PassivationTreeCache" name="jboss.cache:service=EJB3SFSBClusteredCache">
 <!--
 Node locking level : SERIALIZABLE
 REPEATABLE_READ (default)
 READ_COMMITTED
 READ_UNCOMMITTED
 NONE
 -->
 <attribute name="IsolationLevel">READ_UNCOMMITTED</attribute>

 <!-- Valid modes are LOCAL
 REPL_ASYNC
 REPL_SYNC
 -->
 <attribute name="CacheMode">REPL_SYNC</attribute>

 <attribute name="ClusterName">SFSB-Cache</attribute>

 <attribute name="ClusterConfig">
 <config>
 <!-- UDP: if you have a multihomed machine,
 set the bind_addr attribute to the appropriate NIC IP address
 -->
 <!-- UDP: On Windows machines, because of the media sense feature
 being broken with multicast (even after disabling media sense)
 set the loopback attribute to true
 -->
 <UDP mcast_addr="${jboss.partition.udpGroup:228.1.2.3}" mcast_port="45551" ip_ttl="64" ip_mcast="true"
 mcast_send_buf_size="150000" mcast_recv_buf_size="80000" ucast_send_buf_size="150000"
 ucast_recv_buf_size="80000" loopback="false"/>
 <PING timeout="2000" num_initial_members="3" up_thread="false" down_thread="false"/>
 <MERGE2 min_interval="10000" max_interval="20000"/>
 <FD shun="true" up_thread="true" down_thread="true"/>
 <VERIFY_SUSPECT timeout="1500" up_thread="false" down_thread="false"/>
 <pbcast.NAKACK gc_lag="50" max_xmit_size="8192" retransmit_timeout="600,1200,2400,4800" up_thread="false"
 down_thread="false"/>
 <UNICAST timeout="600,1200,2400" window_size="100" min_threshold="10" down_thread="false"/>
 <pbcast.STABLE desired_avg_gossip="20000" up_thread="false" down_thread="false"/>
 <FRAG frag_size="8192" down_thread="false" up_thread="false"/>
 <pbcast.GMS join_timeout="5000" join_retry_timeout="2000" shun="true" print_local_addr="true"/>
 <pbcast.STATE_TRANSFER up_thread="false" down_thread="false"/>
 </config>
 </attribute>

 <!-- Number of milliseconds to wait until all responses for a
 synchronous call have been received.
 -->
 <attribute name="SyncReplTimeout">10000</attribute>

 <!-- Max number of milliseconds to wait for a lock acquisition -->
 <attribute name="LockAcquisitionTimeout">15000</attribute>

 <!-- Name of the eviction policy class. -->
 <attribute name="EvictionPolicyClass">org.jboss.ejb3.cache.tree.StatefulEvictionPolicy</attribute>

 <!-- Specific eviction policy configurations. This is LRU -->
 <attribute name="EvictionPolicyConfig">
 <config>
 <attribute name="wakeUpIntervalSeconds">1</attribute>
 <name>statefulClustered</name>
 <region name="/_default_">
 <attribute name="maxNodes">1000000</attribute>
 <attribute name="timeToIdleSeconds">300</attribute>
 </region>

 </config>
 </attribute>

 <attribute name="CacheLoaderFetchPersistentState">false</attribute>
 <attribute name="CacheLoaderFetchTransientState">true</attribute>
 <attribute name="FetchStateOnStartup">true</attribute>
 <attribute name="CacheLoaderClass">org.jboss.ejb3.cache.tree.StatefulCacheLoader</attribute>
 <attribute name="CacheLoaderConfig">location=statefulClustered</attribute>
 </mbean>

 <mbean code="org.jboss.cache.TreeCache" name="jboss.cache:service=EJB3EntityTreeCache">
 <depends>jboss:service=Naming</depends>
 <depends>jboss:service=TransactionManager</depends>

 <!-- Configure the TransactionManager -->
 <attribute name="TransactionManagerLookupClass">org.jboss.cache.JBossTransactionManagerLookup</attribute>

 <!--
 Node locking level : SERIALIZABLE
 REPEATABLE_READ (default)
 READ_COMMITTED
 READ_UNCOMMITTED
 NONE
 -->
 <attribute name="IsolationLevel">REPEATABLE_READ</attribute>

 <!-- Valid modes are LOCAL
 REPL_ASYNC
 REPL_SYNC
 -->
 <attribute name="CacheMode">REPL_SYNC</attribute>

 <!-- Name of cluster. Needs to be the same on all nodes of a cluster, in order
 for them to find each other -->
 <attribute name="ClusterName">EJB3-entity-cache</attribute>

 <attribute name="ClusterConfig">
 <config>
 <!-- UDP: if you have a multihomed machine,
 set the bind_addr attribute to the appropriate NIC IP address
 -->
 <!-- UDP: On Windows machines, because of the media sense feature
 being broken with multicast (even after disabling media sense)
 set the loopback attribute to true
 -->
 <UDP mcast_addr="${jboss.partition.udpGroup:228.1.2.3}" mcast_port="43333" ip_ttl="2" ip_mcast="true"
 mcast_send_buf_size="150000" mcast_recv_buf_size="80000" ucast_send_buf_size="150000"
 ucast_recv_buf_size="80000" loopback="false" />
 <PING timeout="2000" num_initial_members="3" up_thread="false" down_thread="false" />
 <MERGE2 min_interval="10000" max_interval="20000" />
 <FD shun="true" up_thread="true" down_thread="true" />
 <VERIFY_SUSPECT timeout="1500" up_thread="false" down_thread="false" />
 <pbcast.NAKACK gc_lag="50" max_xmit_size="8192" retransmit_timeout="600,1200,2400,4800" up_thread="false"
 down_thread="false" />
 <UNICAST timeout="600,1200,2400" window_size="100" min_threshold="10" down_thread="false" />
 <pbcast.STABLE desired_avg_gossip="20000" up_thread="false" down_thread="false" />
 <FRAG frag_size="8192" down_thread="false" up_thread="false" />
 <pbcast.GMS join_timeout="5000" join_retry_timeout="2000" shun="true" print_local_addr="true" />
 <pbcast.STATE_TRANSFER up_thread="false" down_thread="false" />
 </config>
 </attribute>

 <!-- The max amount of time (in milliseconds) we wait until the
 initial state (ie. the contents of the cache) are retrieved from
 existing members in a clustered environment
 -->
 <attribute name="InitialStateRetrievalTimeout">5000</attribute>

 <!-- Number of milliseconds to wait until all responses for a
 synchronous call have been received.
 -->
 <attribute name="SyncReplTimeout">10000</attribute>

 <!-- Max number of milliseconds to wait for a lock acquisition -->
 <attribute name="LockAcquisitionTimeout">15000</attribute>

 <!-- Name of the eviction policy class. -->
 <attribute name="EvictionPolicyClass">org.jboss.cache.eviction.LRUPolicy</attribute>

 <!-- Specific eviction policy configurations. This is LRU -->
 <attribute name="EvictionPolicyConfig">
 <config>
 <attribute name="wakeUpIntervalSeconds">5</attribute>
 <!-- Cache wide default -->
 <region name="/_default_">
 <attribute name="maxNodes">5000</attribute>
 <attribute name="timeToLiveSeconds">1000</attribute>
 </region>
 </config>
 </attribute>

 </mbean>