1 Reply Latest reply on Nov 16, 2006 10:18 AM by Jian Fang

    Problem when a node tries to join the cluster

    Jian Fang Newbie

      Hello,

      I have some problems when try to bring up a new node to an existing cluster. It seems the new node
      waits for pretty long time before it receives the join response. I set the PING timeout to be 2 seconds,
      the new node always timed out and thought itself to be the first member in the group, and thus, cannot
      fetch the in-memory state from other nodes during startup. Until after sometime, it joined the group, then
      everything works fine. But before that, system does not work properly because the cache state is not
      synchronized in the new node. (I use JBOSS TreeCache)

      I changed the PING timeout to be 20 seconds, in most cases, the
      new node can join the existing group, but there are still some cases the new node timed out and
      thought itself to be the first member in the group. I use Fedora core 6 linux box, use JBoss TreeCache
      and tomcat. The cluster includes only two nodes. I just wonder why it takes so long for the new node
      to join the existing group?

      The trace information and JBoss Cache configuration are attached. Any hints are welcome.

      Thanks in advance,

      Jian Fang

      --------------------------------
      Trace information:
      --------------------------------
      11: class org.jgroups.protocols.PerfHeader
      12: class org.jgroups.protocols.PIGGYBACK$PiggybackHeader
      13: class org.jgroups.protocols.PingHeader
      --


      2006-11-15 13:03:26,292 (main) org.jgroups.protocols.AUTOCONF.senseMaxFragSize
      DEBUG: frag_size=64000

      2006-11-15 13:03:26,361 (main) org.jgroups.protocols.pbcast.GMS.setImpl
      DEBUG: changed role to org.jgroups.protocols.pbcast.ClientGmsImpl

      2006-11-15 13:03:26,415 (DownHandler (STABLE)) org.jgroups.protocols.pbcast.STABLE.startStableTask
      DEBUG: stable task started

      2006-11-15 13:03:26,420 (DownHandler (UDP)) org.jgroups.protocols.UDP.start
      DEBUG: creating sockets and starting threads

      2006-11-15 13:03:26,421 (DownHandler (UDP)) org.jgroups.protocols.UDP.createSockets
      INFO : sockets will use interface 10.20.2.37

      2006-11-15 13:03:26,426 (DownHandler (UDP)) org.jgroups.protocols.UDP.createSockets
      --
      mcast_send_sock: bound to 10.20.2.37:33043, send buffer size=320000, receive buffer size=320000

      2006-11-15 13:03:26,433 (DownHandler (UDP)) org.jgroups.protocols.UDP.bindToInterfaces
      DEBUG: joined /224.0.0.75:7500 on eth0

      2006-11-15 13:03:26,435 (DownHandler (UDP)) org.jgroups.protocols.UDP.bindToInterfaces
      DEBUG: joined /224.0.0.75:7500 on lo

      2006-11-15 13:03:26,438 (DownHandler (UDP)) org.jgroups.protocols.UDP.startThreads
      DEBUG: created unicast receiver thread

      2006-11-15 13:03:26,440 (UpHandler (STATE_TRANSFER)) org.jgroups.blocks.MessageDispatcher$ProtocolAdapter.passUp
      DEBUG: setting local_addr (null) to 10.20.2.37:33042

      2006-11-15 13:03:26,442 (PingSender) org.jgroups.protocols.PingSender.run
      DEBUG: sending GET_MBRS_REQ

      2006-11-15 13:03:26,442 (PingWaiter) org.jgroups.protocols.PingWaiter.findInitialMembers
      DEBUG: waiting for initial members: time_to_wait=2000, got 0 rsps

      2006-11-15 13:03:26,447 (DownHandler (UDP)) org.jgroups.protocols.UDP.down
      DEBUG: sending msg to null (src=10.20.2.37:33042), headers are {PING=[PING: type=GET_MBRS_REQ, arg=null], UDP=[channel_name=JTVCORE-Cluster]}

      2006-11-15 13:03:26,450 (UDP mcast receiver) org.jgroups.protocols.UDP.receive
      DEBUG: received (mcast)57 bytes from 10.20.2.37:33043

      2006-11-15 13:03:26,454 (UDP mcast receiver) org.jgroups.protocols.UDP.handleIncomingMessage
      DEBUG: message is [dst: 228.8.8.8:45566, src: 10.20.2.37:33042 (2 headers), size = 0 bytes], headers are {UDP=[channel_name=JTVCORE-Cluster], PING=[PING: type=GET_MBRS_REQ, arg=null]}

      2006-11-15 13:03:26,455 (UpHandler (PING)) org.jgroups.protocols.PING.up
      DEBUG: discarded my own discovery request

      2006-11-15 13:03:27,445 (PingSender) org.jgroups.protocols.PingSender.run
      DEBUG: sending GET_MBRS_REQ

      2006-11-15 13:03:27,447 (DownHandler (UDP)) org.jgroups.protocols.UDP.down
      DEBUG: sending msg to null (src=10.20.2.37:33042), headers are {PING=[PING: type=GET_MBRS_REQ, arg=null], UDP=[channel_name=JTVCORE-Cluster]}

      2006-11-15 13:03:27,448 (UDP mcast receiver) org.jgroups.protocols.UDP.receive
      DEBUG: received (mcast)57 bytes from 10.20.2.37:33043

      2006-11-15 13:03:27,449 (UDP mcast receiver) org.jgroups.protocols.UDP.handleIncomingMessage
      DEBUG: message is [dst: 228.8.8.8:45566, src: 10.20.2.37:33042 (2 headers), size = 0 bytes], headers are {UDP=[channel_name=JTVCORE-Cluster], PING=[PING: type=GET_MBRS_REQ, arg=null]}

      2006-11-15 13:03:27,450 (UpHandler (PING)) org.jgroups.protocols.PING.up
      DEBUG: discarded my own discovery request

      2006-11-15 13:03:27,557 (UDP mcast receiver) org.jgroups.protocols.UDP.receive
      DEBUG: received (mcast)57 bytes from 10.20.2.38:32789

      2006-11-15 13:03:27,558 (UDP mcast receiver) org.jgroups.protocols.UDP.handleIncomingMessage
      DEBUG: message is [dst: 228.8.8.8:45566, src: 10.20.2.38:32788 (2 headers), size = 0 bytes], headers are {UDP=[channel_name=JTVCORE-Cluster], PING=[PING: type=GET_MBRS_REQ, arg=null]}

      2006-11-15 13:03:27,559 (UpHandler (PING)) org.jgroups.protocols.PING.up
      DEBUG: received GET_MBRS_REQ from 10.20.2.38:32788, sending response [PING: type=GET_MBRS_RSP, arg=[own_addr=10.20.2.37:33042, coord_addr=10.20.2.37:33042, is_server=false]]

      2006-11-15 13:03:27,560 (DownHandler (UDP)) org.jgroups.protocols.UDP.down
      DEBUG: sending msg to 10.20.2.38:32788 (src=10.20.2.37:33042), headers are {PING=[PING: type=GET_MBRS_RSP, arg=[own_addr=10.20.2.37:33042, coord_addr=10.20.2.37:33042, is_server=false]], UDP=[channel_name=JTVCORE-Cluster]}

      2006-11-15 13:03:28,445 (PingWaiter) org.jgroups.protocols.PingWaiter.findInitialMembers
      DEBUG: initial mbrs are []

      2006-11-15 13:03:28,446 (DownHandler (GMS)) org.jgroups.protocols.pbcast.GMS.join
      DEBUG: initial_mbrs are []

      2006-11-15 13:03:28,447 (DownHandler (GMS)) org.jgroups.protocols.pbcast.GMS.join
      DEBUG: no initial members discovered: creating group as first member

      2006-11-15 13:03:28,448 (DownHandler (GMS)) org.jgroups.protocols.pbcast.GMS.installView
      DEBUG: [local_addr=10.20.2.37:33042] view is [10.20.2.37:33042|0] [10.20.2.37:33042]

      2006-11-15 13:03:28,456 (main) org.jboss.cache.TreeCache.startService
      INFO : TreeCache local address is 10.20.2.37:33042

      2006-11-15 13:03:28,459 (UpHandler (STATE_TRANSFER)) org.jboss.cache.TreeCache.viewAccepted
      INFO : viewAccepted(): [10.20.2.37:33042|0] [10.20.2.37:33042]

      2006-11-15 13:03:28,463 (DownHandler (STATE_TRANSFER)) org.jgroups.protocols.pbcast.STATE_TRANSFER.down
      DEBUG: GET_STATE: first member (no state)

      2006-11-15 13:03:28,466 (DownHandler (STABLE)) org.jgroups.protocols.pbcast.STABLE.resetDigest
      DEBUG: resetting digest from NAKACK: [10.20.2.37:33042#-1]

      2006-11-15 13:03:28,469 (DownHandler (GMS)) org.jgroups.protocols.pbcast.GMS.setImpl
      DEBUG: 10.20.2.37:33042 changed role to org.jgroups.protocols.pbcast.CoordGmsImpl

      2006-11-15 13:03:28,469 (DownHandler (GMS)) org.jgroups.protocols.pbcast.GMS.becomeSingletonMember
      DEBUG: created group (first member). My view is [10.20.2.37:33042|0], impl is org.jgroups.protocols.pbcast.CoordGmsImpl

      2006-11-15 13:03:28,483 (main) org.jboss.cache.TreeCache.fetchStateOnStartup
      INFO : State could not be retrieved (we are the first member in group)

      ----------------------
      configuration
      ----------------------

      jboss:service=Naming

      jboss:service=TransactionManager

      <!-- Configure the TransactionManager -->
      <!-- org.jboss.cache.DummyTransactionManagerLookup-->
      com.jtv.core.resource.impl.spring.cache.JBossCacheTransactionManagerLookup

      <!--
      Node locking scheme:
      OPTIMISTIC
      PESSIMISTIC (default)
      -->
      <!-- PESSIMISTIC -->


      <!--
      Node locking level : SERIALIZABLE
      REPEATABLE_READ (default)
      READ_COMMITTED
      READ_UNCOMMITTED
      NONE
      -->
      REPEATABLE_READ

      <!--
      Valid modes are LOCAL
      REPL_ASYNC
      REPL_SYNC
      INVALIDATION_ASYNC
      INVALIDATION_SYNC
      -->
      REPL_SYNC

      <!-- Name of cluster. Needs to be the same for all clusters, in order
      to find each other
      -->
      JTVCORE-Cluster

      <!-- JGroups protocol stack properties. Can also be a URL,
      e.g. file:/home/bela/default.xml

      -->


      <UDP mcast_send_buf_size="320000"
      mcast_port="45566"
      ucast_recv_buf_size="320000"
      mcast_addr="228.8.8.8"
      bind_addr="10.20.2.37"
      loopback="false"
      mcast_recv_buf_size="320000"
      max_bundle_size="60000"
      max_bundle_timeout="30"
      use_incoming_packet_handler="false"
      use_outgoing_packet_handler="false"
      ucast_send_buf_size="320000"
      ip_ttl="32"
      enable_bundling="false"/>
      <PING timeout="20000"
      num_initial_members="2"
      up_thread="true"
      down_thread="true"/>
      <MERGE2 max_interval="10000"
      min_interval="5000"/>
      <FD timeout="2000"
      max_tries="3"
      shun="true"
      up_thread="true"
      down_thread="true"/>
      <VERIFY_SUSPECT timeout="1500"/>
      <pbcast.NAKACK max_xmit_size="8192"
      use_mcast_xmit="false"
      gc_lag="50"
      retransmit_timeout="600,1200,2400,4800"/>


      <!--
      - desired_avg_gossip: periodically sends STABLE messages around. 0 disables this
      - max_bytes: max number of bytes received from anyone until a STABLE message is sent. Use either this or
      desired_avg_gossip, but not both ! 0 disables it.
      - stability_delay: range (number of milliseconds) that we wait until sending a STABILITY message.
      This prevents STABILITY multicast storms. If max_bytes is used, this should be set to a low value (> 0 though !).
      -->
      <pbcast.STABLE stability_delay="1000"
      desired_avg_gossip="20000"
      max_bytes="0"/>
      <FRAG frag_size="8192"
      down_thread="true"
      up_thread="true"/>
      <VIEW_SYNC avg_send_interval="20000" down_thread="true" up_thread="true" />
      <pbcast.GMS print_local_addr="true"
      join_timeout="3000"
      join_retry_timeout="2000"
      shun="true"/>
      <pbcast.STATE_TRANSFER
      down_thread="true"
      up_thread="true"/>



      <!--
      Whether or not to fetch state on joining a cluster
      NOTE this used to be called FetchStateOnStartup and has been renamed to be more descriptive.
      -->
      <!-- -->
      true
      <!-- -->

      <!--
      The max amount of time (in milliseconds) we wait until the
      initial state (ie. the contents of the cache) are retrieved from
      existing members in a clustered environment
      -->
      20000

      <!--
      Number of milliseconds to wait until all responses for a
      synchronous call have been received.
      -->
      20000

      <!-- Max number of milliseconds to wait for a lock acquisition -->
      15000

      <!-- Name of the eviction policy class. -->
      <!--attribute name="EvictionPolicyClass"></attribute-->

      <!--
      org.jboss.cache.eviction.LRUPolicy
      -->

      <!-- Specific eviction policy configurations. This is LRU -->


      5
      <!-- Cache wide default /com/jtv/core-->

      5000
      0


      2000
      0


      10000
      0




      <!--
      Indicate whether to use region based marshalling or not. Set this to true if you are running under a scoped
      class loader, e.g., inside an application server. Default is "false".
      -->
      false