4 Replies Latest reply on Oct 3, 2007 3:06 PM by Nolan Johnson

    FetchInMemoryState + jgroups config

    Nolan Johnson Newbie

      I'm having trouble configuring JBoss Cache (2.0.0.GA) so that in-memory state is transferred to a new server joining a cluster. Relevant config:

       <attribute name="ClusterConfig">
         <config>
           <UDP mcast_addr="228.1.2.3" mcast_port="48866"
                ip_ttl="64" ip_mcast="true"
                mcast_send_buf_size="150000" mcast_recv_buf_size="80000"
                ucast_send_buf_size="150000" ucast_recv_buf_size="80000"
                loopback="false"/>
           <PING timeout="10000" num_initial_members="3"/>
           <MERGE2 min_interval="1000" max_interval="2000"/>
           <FD shun="true"/>
           <FD_SOCK/>
           <VERIFY_SUSPECT timeout="1500"
                           up_thread="false" down_thread="false"/>
           <pbcast.NAKACK gc_lag="50" retransmit_timeout="600,1200,2400,4800"
                          max_xmit_size="8192"/>
           <UNICAST timeout="600,1200,2400"/>
           <pbcast.STABLE desired_avg_gossip="400000"/>
           <pbcast.GMS join_timeout="5000" join_retry_timeout="20000"
                       shun="true" print_local_addr="true" leave_timeout="3000"/>
           <FC max_credits="2000000"
               min_threshold="0.20"/>
           <FRAG2 frag_size="8192"/>
           <pbcast.STREAMING_STATE_TRANSFER/>
         </config>
       </attribute>
       <attribute name="FetchInMemoryState">true</attribute>
       <attribute name="StateRetrievalTimeout">60000</attribute>
      


      What I get in the log indicates that JGroups is timing out before it can join the group. The relevant log messages are shown below; the IP of the machine joining is 1.1.1.1 and the IP of the machine already up is 9.9.9.9:

      2007-09-12 14:26:51,443 DEBUG [org.jboss.cache.CacheImpl.MyCluster] (main) cache mode is REPL_ASYNC
      2007-09-12 14:26:51,453 INFO [org.jgroups.JChannel] (main) JGroups version: 2.5.0
      2007-09-12 14:27:01,691 INFO [org.jboss.cache.CacheImpl.MyCluster] (main) viewAccepted(): [1.1.1.1:32774|0] [1.1.1.1:32774]
      2007-09-12 14:27:01,709 INFO [org.jboss.cache.CacheImpl.MyCluster] (main) CacheImpl local address is 1.1.1.1:32774
      2007-09-12 14:27:01,709 DEBUG [org.jboss.cache.CacheImpl.MyCluster] (main) State could not be retrieved (we are the first member in group)
      2007-09-12 14:27:01,710 INFO [org.jboss.cache.CacheImpl.MyCluster] (main) JBoss Cache version: JBossCache 'Habanero' 2.0.0.GA[ $Id: Version.java,v 1.35 2007/08/01 16:52:13 msurtani Exp $]
      2007-09-12 14:27:03,046 INFO [org.jboss.cache.CacheImpl.MyCluster] (Incoming Thread,MyCluster,1.1.1.1:32774) viewAccepted(): MergeView::[9.9.9.9:32781|19] [9.9.9.9:32781, 1.1.1.1:32774], subgroups=[[9.9.9.9:32781|18] [9.9.9.9:32781], [1.1.1.1:32774|0] [1.1.1.1:32774]]
      


      So I should increase the PING timeout, right? I probably only need a few more seconds, since the other server's group was merged with this one less than 2 seconds after this JGroups instance declared that it couldn't find the other one.

      So I changed the timeout for PING to 200000 (200 seconds, much longer). However, this is the result:
      2007-09-12 14:18:41,098 DEBUG [org.jboss.cache.CacheImpl.MyCluster] (main) cache mode is REPL_ASYNC
      2007-09-12 14:18:41,108 INFO [org.jgroups.JChannel] (main) JGroups version: 2.5.0
      2007-09-12 14:22:01,344 INFO [org.jboss.cache.CacheImpl.MyCluster] (main) viewAccepted(): [1.1.1.1:32773|0] [1.1.1.1:32773]
      2007-09-12 14:22:01,362 INFO [org.jboss.cache.CacheImpl.MyCluster] (main) CacheImpl local address is 1.1.1.1:32773
      2007-09-12 14:22:01,362 DEBUG [org.jboss.cache.CacheImpl.MyCluster] (main) State could not be retrieved (we are the first member in group)
      2007-09-12 14:22:01,363 INFO [org.jboss.cache.CacheImpl.MyCluster] (main) JBoss Cache version: JBossCache 'Habanero' 2.0.0.GA[ $Id: Version.java,v 1.35 2007/08/01 16:52:13 msurtani Exp $]
      2007-09-12 14:22:04,245 WARN [org.jgroups.protocols.pbcast.NAKACK] (Incoming Thread,MyCluster,1.1.1.1:32773) 1.1.1.1:32773] discarded message from non-member 9.9.9.9:32781, my view is [10.195.70.107:32773|0] [1.1.1.1:32773]
      2007-09-12 14:22:04,246 WARN [org.jgroups.protocols.pbcast.NAKACK] (Incoming Thread,MyCluster,10.195.70.107:32773) 1.1.1.1:32773] discarded message from non-member 9.9.9.9:32781, my view is [1.1.1.1:32773|0] [1.1.1.1:32773]
      2007-09-12 14:22:04,248 INFO [org.jboss.cache.CacheImpl.MyCluster] (Incoming Thread,MyCluster,1.1.1.1:32773) viewAccepted(): MergeView::[9.9.9.9:32781|17] [9.9.9.9:32781, 1.1.1.1:32773], subgroups=[[9.9.9.9:32781|16] [9.9.9.9:32781], [1.1.1.1:32773|0] [1.1.1.1:32773]]
      
      

      What happens is that after the 200 seconds expire, JGroups decides that it's alone and no state transfer happens. Then, about 3 seconds later, JGroups decides that the other server exists. Too late for state transfer.
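
      To double-check the timing, here is a minimal sketch that computes the gaps between the relevant timestamps in both runs. The class and method names (LogGaps, millisBetween) are made up for illustration; only the timestamps are taken from the logs above.

```java
import java.time.Duration;
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;

public class LogGaps {
    // Timestamp layout used in the log lines above, e.g. "2007-09-12 14:26:51,443"
    private static final DateTimeFormatter FMT =
            DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss,SSS");

    // Milliseconds elapsed between two log timestamps (hypothetical helper)
    static long millisBetween(String from, String to) {
        return Duration.between(LocalDateTime.parse(from, FMT),
                                LocalDateTime.parse(to, FMT)).toMillis();
    }

    public static void main(String[] args) {
        // Run 1 (PING timeout="10000"): JGroups start -> first view -> MergeView
        System.out.println(millisBetween("2007-09-12 14:26:51,453", "2007-09-12 14:27:01,691")); // 10238
        System.out.println(millisBetween("2007-09-12 14:27:01,691", "2007-09-12 14:27:03,046")); // 1355

        // Run 2 (PING timeout="200000"): JGroups start -> first view -> MergeView
        System.out.println(millisBetween("2007-09-12 14:18:41,108", "2007-09-12 14:22:01,344")); // 200236
        System.out.println(millisBetween("2007-09-12 14:22:01,344", "2007-09-12 14:22:04,248")); // 2904
    }
}
```

      In both runs the singleton view shows up right after the configured PING timeout has elapsed (about 10 s and about 200 s), and the MergeView follows 1.4 to 2.9 seconds later, regardless of how long the timeout was.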

      Am I misconfiguring something? Misunderstanding something?