3 Replies Latest reply on Nov 11, 2010 2:39 PM by cbo_

    Cluster Setup Issue in a Replicated Cache And StateTransfer

    cbo_

      I have what appears to be a cluster setup issue in jgroups.  Seems to be similar in ways to issue http://community.jboss.org/message/551332

       

      In the current scenario the cluster setup appears to fail and the 2 participants in the cluster are operating separately for a short while (enough time for the StateTransfer to have not occurred). 

       

      I am using MPING for discovery in my jgroups file.

       

      I did not have detailed logging turned on for stateTransfer at the time so you can't see that come into the logging picture.  Our use case involves an highly available application with an active side and a passive side.  We became aware of a problem because we noticed our passive app did not have some of the entries in a cache when we brought up the passive side application (as though stateTransfer was not done).  It seems from some of the logging (copied below) that the cluster discover ran into some issues and each of the applications decided it was coordinator for some reason.  Later, this seems to get resolved (i.e. merged).  The problem is that the stateTransfer is already considered completed so the caches are forever out of sync.  I will put the logging from the passive side first, followed by the active side.  Also attaching jgroups files from both sides.

       

       

      PASSIVE SIDE:

      2010-11-05 08:27:23,231 20629 TRACE [org.jgroups.protocols.MPING] (ReceiverThread:) received GET_MBRS_REQ from Prod_perfbc2b-9973, sending response [PING: type=GET_MBRS_RSP, arg=own_addr=Prod_perfbc2a
      -58563, view id=[Prod_perfbc2a-58563|0], is_server=true, is_coord=true, logical_name=Prod_perfbc2a-58563]
      2010-11-05 08:27:23,294 20692 TRACE [org.jgroups.protocols.MPING] (ReceiverThread:) received GET_MBRS_REQ from Prod_perfbc2b-9973, sending response [PING: type=GET_MBRS_RSP, arg=own_addr=Prod_perfbc2a
      -58563, view id=[Prod_perfbc2a-58563|0], is_server=true, is_coord=true, logical_name=Prod_perfbc2a-58563, physical_addrs=170.137.253.44:7020]
      2010-11-05 08:27:23,298 20696 WARN  [org.jgroups.protocols.pbcast.NAKACK] (OOB-1,infinispan-cluster-OHServerHybrid-Prod,Prod_perfbc2a-58563:) Prod_perfbc2a-58563: dropped message from Prod_perfbc2b-99
      73 (not in xmit_table), keys are [Prod_perfbc2a-58563], view=[Prod_perfbc2a-58563|0] [Prod_perfbc2a-58563]
      2010-11-05 08:27:24,300 21698 TRACE [org.jgroups.protocols.MPING] (ReceiverThread:) received GET_MBRS_REQ from Prod_perfbc2a-58563, sending response [PING: type=GET_MBRS_RSP, arg=own_addr=Prod_perfbc2
      a-58563, view id=[Prod_perfbc2a-58563|0], is_server=true, is_coord=true, logical_name=Prod_perfbc2a-58563]
      2010-11-05 08:27:24,300 21698 TRACE [org.jgroups.protocols.MPING] (OOB-1,infinispan-cluster-OHServerHybrid-Prod,Prod_perfbc2a-58563:) received GET_MBRS_RSP from Prod_perfbc2a-58563: own_addr=Prod_perf
      bc2a-58563, view id=[Prod_perfbc2a-58563|0], is_server=true, is_coord=true, logical_name=Prod_perfbc2a-58563
      2010-11-05 08:27:24,301 21699 TRACE [org.jgroups.protocols.MPING] (OOB-2,infinispan-cluster-OHServerHybrid-Prod,Prod_perfbc2a-58563:) received GET_MBRS_RSP from Prod_perfbc2b-9973: own_addr=Prod_perfb
      c2b-9973, view id=[Prod_perfbc2b-9973|4], is_server=true, is_coord=true, logical_name=Prod_perfbc2b-9973
      2010-11-05 08:27:24,301 21699 TRACE [org.jgroups.protocols.MPING] (Timer-1,infinispan-cluster-OHServerHybrid-Prod,Prod_perfbc2a-58563:) discovery took 2 ms: responses: 2 total (2 servers (2 coord), 0
      clients)
      2010-11-05 08:27:24,302 21700 TRACE [org.jgroups.protocols.MERGE2] (Timer-1,infinispan-cluster-OHServerHybrid-Prod,Prod_perfbc2a-58563:) Discovery results:
      [Prod_perfbc2a-58563]: [Prod_perfbc2a-58563|0] [Prod_perfbc2a-58563]
      [Prod_perfbc2b-9973]: [Prod_perfbc2b-9973|4] [Prod_perfbc2b-9973]
      2010-11-05 08:27:24,302 21700 DEBUG [org.jgroups.protocols.MERGE2] (Timer-1,infinispan-cluster-OHServerHybrid-Prod,Prod_perfbc2a-58563:) Prod_perfbc2a-58563 found different views : [Prod_perfbc2b-9973
      |4], [Prod_perfbc2a-58563|0]; sending up MERGE event with merge participants [Prod_perfbc2b-9973, Prod_perfbc2a-58563].
      Discovery results:
      [Prod_perfbc2a-58563]: [Prod_perfbc2a-58563|0] [Prod_perfbc2a-58563]
      [Prod_perfbc2b-9973]: [Prod_perfbc2b-9973|4] [Prod_perfbc2b-9973]
      2010-11-05 08:27:24,355 21753 DEBUG [org.jgroups.protocols.pbcast.GMS] (ViewHandler,infinispan-cluster-OHServerHybrid-Prod,Prod_perfbc2a-58563:) determining merge leader from [Prod_perfbc2b-9973, Prod
      _perfbc2a-58563]
      2010-11-05 08:27:24,355 21753 DEBUG [org.jgroups.protocols.pbcast.GMS] (ViewHandler,infinispan-cluster-OHServerHybrid-Prod,Prod_perfbc2a-58563:) I (Prod_perfbc2a-58563) am not the merge leader, waitin
      g for merge leader (Prod_perfbc2b-9973) to initiate merge
      2010-11-05 08:27:28,296 25694 DEBUG [org.jgroups.protocols.pbcast.GMS] (Incoming-1,infinispan-cluster-OHServerHybrid-Prod,Prod_perfbc2a-58563:) Prod_perfbc2a-58563: sending RESUME event
      2010-11-05 08:27:28,297 25695 WARN  [org.jgroups.protocols.pbcast.NAKACK] (Incoming-3,infinispan-cluster-OHServerHybrid-Prod,Prod_perfbc2a-58563:) Prod_perfbc2a-58563: dropped message from Prod_perfbc
      2b-9973 (not in xmit_table), keys are [Prod_perfbc2a-58563], view=[Prod_perfbc2a-58563|0] [Prod_perfbc2a-58563]
      2010-11-05 08:27:28,297 25695 WARN  [org.jgroups.protocols.pbcast.NAKACK] (Incoming-2,infinispan-cluster-OHServerHybrid-Prod,Prod_perfbc2a-58563:) Prod_perfbc2a-58563: dropped message from Prod_perfbc
      2b-9973 (not in xmit_table), keys are [Prod_perfbc2a-58563], view=[Prod_perfbc2a-58563|0] [Prod_perfbc2a-58563]
      2010-11-05 08:27:29,299 26697 TRACE [org.jgroups.protocols.MPING] (ReceiverThread:) received GET_MBRS_REQ from Prod_perfbc2b-9973, sending response [PING: type=GET_MBRS_RSP, arg=own_addr=Prod_perfbc2a
      -58563, view id=[Prod_perfbc2a-58563|0], is_server=true, is_coord=true, logical_name=Prod_perfbc2a-58563]
      2010-11-05 08:27:33,482 30880 TRACE [org.jgroups.protocols.MPING] (ReceiverThread:) received GET_MBRS_REQ from Prod_perfbc2a-58563, sending response [PING: type=GET_MBRS_RSP, arg=own_addr=Prod_perfbc2
      a-58563, view id=[Prod_perfbc2a-58563|0], is_server=true, is_coord=true, logical_name=Prod_perfbc2a-58563]
      2010-11-05 08:27:33,482 30880 TRACE [org.jgroups.protocols.MPING] (OOB-2,infinispan-cluster-OHServerHybrid-Prod,Prod_perfbc2a-58563:) received GET_MBRS_RSP from Prod_perfbc2a-58563: own_addr=Prod_perf
      bc2a-58563, view id=[Prod_perfbc2a-58563|0], is_server=true, is_coord=true, logical_name=Prod_perfbc2a-58563
      2010-11-05 08:27:33,483 30881 TRACE [org.jgroups.protocols.MPING] (OOB-1,infinispan-cluster-OHServerHybrid-Prod,Prod_perfbc2a-58563:) received GET_MBRS_RSP from Prod_perfbc2b-9973: own_addr=Prod_perfb
      c2b-9973, view id=[Prod_perfbc2b-9973|4], is_server=true, is_coord=true, logical_name=Prod_perfbc2b-9973
      2010-11-05 08:27:33,483 30881 TRACE [org.jgroups.protocols.MPING] (Timer-2,infinispan-cluster-OHServerHybrid-Prod,Prod_perfbc2a-58563:) discovery took 2 ms: responses: 2 total (2 servers (2 coord), 0
      clients)
      2010-11-05 08:27:33,483 30881 TRACE [org.jgroups.protocols.MERGE2] (Timer-2,infinispan-cluster-OHServerHybrid-Prod,Prod_perfbc2a-58563:) Discovery results:
      [Prod_perfbc2a-58563]: [Prod_perfbc2a-58563|0] [Prod_perfbc2a-58563]
      [Prod_perfbc2b-9973]: [Prod_perfbc2b-9973|4] [Prod_perfbc2b-9973]
      2010-11-05 08:27:33,484 30882 DEBUG [org.jgroups.protocols.MERGE2] (Timer-2,infinispan-cluster-OHServerHybrid-Prod,Prod_perfbc2a-58563:) Prod_perfbc2a-58563 found different views : [Prod_perfbc2b-9973
      |4], [Prod_perfbc2a-58563|0]; sending up MERGE event with merge participants [Prod_perfbc2b-9973, Prod_perfbc2a-58563].
      Discovery results:
      [Prod_perfbc2a-58563]: [Prod_perfbc2a-58563|0] [Prod_perfbc2a-58563]
      [Prod_perfbc2b-9973]: [Prod_perfbc2b-9973|4] [Prod_perfbc2b-9973]

      2010-11-05 08:27:23,231 20629 TRACE [org.jgroups.protocols.MPING] (ReceiverThread:) received GET_MBRS_REQ from Prod_perfbc2b-9973, sending response [PING: type=GET_MBRS_RSP, arg=own_addr=Prod_perfbc2a

      -58563, view id=[Prod_perfbc2a-58563|0], is_server=true, is_coord=true, logical_name=Prod_perfbc2a-58563]

      2010-11-05 08:27:23,294 20692 TRACE [org.jgroups.protocols.MPING] (ReceiverThread:) received GET_MBRS_REQ from Prod_perfbc2b-9973, sending response [PING: type=GET_MBRS_RSP, arg=own_addr=Prod_perfbc2a

      -58563, view id=[Prod_perfbc2a-58563|0], is_server=true, is_coord=true, logical_name=Prod_perfbc2a-58563, physical_addrs=170.137.253.44:7020]

      2010-11-05 08:27:23,298 20696 WARN  [org.jgroups.protocols.pbcast.NAKACK] (OOB-1,infinispan-cluster-OHServerHybrid-Prod,Prod_perfbc2a-58563:) Prod_perfbc2a-58563: dropped message from Prod_perfbc2b-99

      73 (not in xmit_table), keys are [Prod_perfbc2a-58563], view=[Prod_perfbc2a-58563|0] [Prod_perfbc2a-58563]

      2010-11-05 08:27:24,300 21698 TRACE [org.jgroups.protocols.MPING] (ReceiverThread:) received GET_MBRS_REQ from Prod_perfbc2a-58563, sending response [PING: type=GET_MBRS_RSP, arg=own_addr=Prod_perfbc2

      a-58563, view id=[Prod_perfbc2a-58563|0], is_server=true, is_coord=true, logical_name=Prod_perfbc2a-58563]

      2010-11-05 08:27:24,300 21698 TRACE [org.jgroups.protocols.MPING] (OOB-1,infinispan-cluster-OHServerHybrid-Prod,Prod_perfbc2a-58563:) received GET_MBRS_RSP from Prod_perfbc2a-58563: own_addr=Prod_perf

      bc2a-58563, view id=[Prod_perfbc2a-58563|0], is_server=true, is_coord=true, logical_name=Prod_perfbc2a-58563

      2010-11-05 08:27:24,301 21699 TRACE [org.jgroups.protocols.MPING] (OOB-2,infinispan-cluster-OHServerHybrid-Prod,Prod_perfbc2a-58563:) received GET_MBRS_RSP from Prod_perfbc2b-9973: own_addr=Prod_perfb

      c2b-9973, view id=[Prod_perfbc2b-9973|4], is_server=true, is_coord=true, logical_name=Prod_perfbc2b-9973

      2010-11-05 08:27:24,301 21699 TRACE [org.jgroups.protocols.MPING] (Timer-1,infinispan-cluster-OHServerHybrid-Prod,Prod_perfbc2a-58563:) discovery took 2 ms: responses: 2 total (2 servers (2 coord), 0

      clients)

      2010-11-05 08:27:24,302 21700 TRACE [org.jgroups.protocols.MERGE2] (Timer-1,infinispan-cluster-OHServerHybrid-Prod,Prod_perfbc2a-58563:) Discovery results:

      [Prod_perfbc2a-58563]: [Prod_perfbc2a-58563|0] [Prod_perfbc2a-58563]

      [Prod_perfbc2b-9973]: [Prod_perfbc2b-9973|4] [Prod_perfbc2b-9973]

       

      2010-11-05 08:27:24,302 21700 DEBUG [org.jgroups.protocols.MERGE2] (Timer-1,infinispan-cluster-OHServerHybrid-Prod,Prod_perfbc2a-58563:) Prod_perfbc2a-58563 found different views : [Prod_perfbc2b-9973

      |4], [Prod_perfbc2a-58563|0]; sending up MERGE event with merge participants [Prod_perfbc2b-9973, Prod_perfbc2a-58563].

      Discovery results:

      [Prod_perfbc2a-58563]: [Prod_perfbc2a-58563|0] [Prod_perfbc2a-58563]

      [Prod_perfbc2b-9973]: [Prod_perfbc2b-9973|4] [Prod_perfbc2b-9973]

       

      2010-11-05 08:27:24,355 21753 DEBUG [org.jgroups.protocols.pbcast.GMS] (ViewHandler,infinispan-cluster-OHServerHybrid-Prod,Prod_perfbc2a-58563:) determining merge leader from [Prod_perfbc2b-9973, Prod

      _perfbc2a-58563]

      2010-11-05 08:27:24,355 21753 DEBUG [org.jgroups.protocols.pbcast.GMS] (ViewHandler,infinispan-cluster-OHServerHybrid-Prod,Prod_perfbc2a-58563:) I (Prod_perfbc2a-58563) am not the merge leader, waitin

      g for merge leader (Prod_perfbc2b-9973) to initiate merge

      2010-11-05 08:27:28,296 25694 DEBUG [org.jgroups.protocols.pbcast.GMS] (Incoming-1,infinispan-cluster-OHServerHybrid-Prod,Prod_perfbc2a-58563:) Prod_perfbc2a-58563: sending RESUME event

      2010-11-05 08:27:28,297 25695 WARN  [org.jgroups.protocols.pbcast.NAKACK] (Incoming-3,infinispan-cluster-OHServerHybrid-Prod,Prod_perfbc2a-58563:) Prod_perfbc2a-58563: dropped message from Prod_perfbc

      2b-9973 (not in xmit_table), keys are [Prod_perfbc2a-58563], view=[Prod_perfbc2a-58563|0] [Prod_perfbc2a-58563]

      2010-11-05 08:27:28,297 25695 WARN  [org.jgroups.protocols.pbcast.NAKACK] (Incoming-2,infinispan-cluster-OHServerHybrid-Prod,Prod_perfbc2a-58563:) Prod_perfbc2a-58563: dropped message from Prod_perfbc

      2b-9973 (not in xmit_table), keys are [Prod_perfbc2a-58563], view=[Prod_perfbc2a-58563|0] [Prod_perfbc2a-58563]

      2010-11-05 08:27:29,299 26697 TRACE [org.jgroups.protocols.MPING] (ReceiverThread:) received GET_MBRS_REQ from Prod_perfbc2b-9973, sending response [PING: type=GET_MBRS_RSP, arg=own_addr=Prod_perfbc2a

      -58563, view id=[Prod_perfbc2a-58563|0], is_server=true, is_coord=true, logical_name=Prod_perfbc2a-58563]

      2010-11-05 08:27:33,482 30880 TRACE [org.jgroups.protocols.MPING] (ReceiverThread:) received GET_MBRS_REQ from Prod_perfbc2a-58563, sending response [PING: type=GET_MBRS_RSP, arg=own_addr=Prod_perfbc2

      a-58563, view id=[Prod_perfbc2a-58563|0], is_server=true, is_coord=true, logical_name=Prod_perfbc2a-58563]

      2010-11-05 08:27:33,482 30880 TRACE [org.jgroups.protocols.MPING] (OOB-2,infinispan-cluster-OHServerHybrid-Prod,Prod_perfbc2a-58563:) received GET_MBRS_RSP from Prod_perfbc2a-58563: own_addr=Prod_perf

      bc2a-58563, view id=[Prod_perfbc2a-58563|0], is_server=true, is_coord=true, logical_name=Prod_perfbc2a-58563

      2010-11-05 08:27:33,483 30881 TRACE [org.jgroups.protocols.MPING] (OOB-1,infinispan-cluster-OHServerHybrid-Prod,Prod_perfbc2a-58563:) received GET_MBRS_RSP from Prod_perfbc2b-9973: own_addr=Prod_perfb

      c2b-9973, view id=[Prod_perfbc2b-9973|4], is_server=true, is_coord=true, logical_name=Prod_perfbc2b-9973

      2010-11-05 08:27:33,483 30881 TRACE [org.jgroups.protocols.MPING] (Timer-2,infinispan-cluster-OHServerHybrid-Prod,Prod_perfbc2a-58563:) discovery took 2 ms: responses: 2 total (2 servers (2 coord), 0

      clients)

      2010-11-05 08:27:33,483 30881 TRACE [org.jgroups.protocols.MERGE2] (Timer-2,infinispan-cluster-OHServerHybrid-Prod,Prod_perfbc2a-58563:) Discovery results:

      [Prod_perfbc2a-58563]: [Prod_perfbc2a-58563|0] [Prod_perfbc2a-58563]

      [Prod_perfbc2b-9973]: [Prod_perfbc2b-9973|4] [Prod_perfbc2b-9973]

       

      2010-11-05 08:27:33,484 30882 DEBUG [org.jgroups.protocols.MERGE2] (Timer-2,infinispan-cluster-OHServerHybrid-Prod,Prod_perfbc2a-58563:) Prod_perfbc2a-58563 found different views : [Prod_perfbc2b-9973

      |4], [Prod_perfbc2a-58563|0]; sending up MERGE event with merge participants [Prod_perfbc2b-9973, Prod_perfbc2a-58563].

      Discovery results:

      [Prod_perfbc2a-58563]: [Prod_perfbc2a-58563|0] [Prod_perfbc2a-58563]

      [Prod_perfbc2b-9973]: [Prod_perfbc2b-9973|4] [Prod_perfbc2b-9973]

      2010-11-05 08:27:33,534 30932 DEBUG [org.jgroups.protocols.pbcast.GMS] (ViewHandler,infinispan-cluster-OHServerHybrid-Prod,Prod_perfbc2a-58563:) determining merge leader from [Prod_perfbc2b-9973, Prod
      _perfbc2a-58563]
      2010-11-05 08:27:33,534 30932 DEBUG [org.jgroups.protocols.pbcast.GMS] (ViewHandler,infinispan-cluster-OHServerHybrid-Prod,Prod_perfbc2a-58563:) I (Prod_perfbc2a-58563) am not the merge leader, waitin
      g for merge leader (Prod_perfbc2b-9973) to initiate merge
      2010-11-05 08:27:41,593 38991 INFO  [org.infinispan.factories.ComponentRegistry] (main:) Infinispan version: Infinispan 'Ursus' 4.2.0.ALPHA2
      2010-11-05 08:27:43,172 40570 TRACE [org.jgroups.protocols.MPING] (ReceiverThread:) received GET_MBRS_REQ from Prod_perfbc2b-9973, sending response [PING: type=GET_MBRS_RSP, arg=own_addr=Prod_perfbc2a
      -58563, view id=[Prod_perfbc2a-58563|0], is_server=true, is_coord=true, logical_name=Prod_perfbc2a-58563]
      2010-11-05 08:27:43,226 40624 WARN  [org.jgroups.protocols.pbcast.NAKACK] (Incoming-6,infinispan-cluster-OHServerHybrid-Prod,Prod_perfbc2a-58563:) Prod_perfbc2a-58563: dropped message from Prod_perfbc
      2b-9973 (not in xmit_table), keys are [Prod_perfbc2a-58563], view=[Prod_perfbc2a-58563|0] [Prod_perfbc2a-58563]
      2010-11-05 08:27:43,227 40625 WARN  [org.jgroups.protocols.pbcast.NAKACK] (OOB-2,infinispan-cluster-OHServerHybrid-Prod,Prod_perfbc2a-58563:) Prod_perfbc2a-58563: dropped message from Prod_perfbc2b-99
      73 (not in xmit_table), keys are [Prod_perfbc2a-58563], view=[Prod_perfbc2a-58563|0] [Prod_perfbc2a-58563]
      2010-11-05 08:27:43,229 40627 TRACE [org.jgroups.protocols.pbcast.GMS] (OOB-1,infinispan-cluster-OHServerHybrid-Prod,Prod_perfbc2a-58563:) Prod_perfbc2a-58563: view handler for merge_id Prod_perfbc2b-
      9973::2 was suspended
      2010-11-05 08:27:43,230 40628 DEBUG [org.jgroups.protocols.pbcast.GMS] (OOB-1,infinispan-cluster-OHServerHybrid-Prod,Prod_perfbc2a-58563:) Prod_perfbc2a-58563: got merge request from Prod_perfbc2b-997
      3, merge_id=Prod_perfbc2b-9973::2, mbrs=[Prod_perfbc2a-58563]
      2010-11-05 08:27:43,231 40629 TRACE [org.jgroups.protocols.pbcast.GMS] (OOB-1,infinispan-cluster-OHServerHybrid-Prod,Prod_perfbc2a-58563:) Prod_perfbc2a-58563: successful GMS flush by coordinator
      2010-11-05 08:27:43,232 40630 DEBUG [org.jgroups.protocols.pbcast.GMS] (OOB-1,infinispan-cluster-OHServerHybrid-Prod,Prod_perfbc2a-58563:) Prod_perfbc2a-58563: fetched all digests for [Prod_perfbc2a-5
      8563]
      2010-11-05 08:27:43,232 40630 DEBUG [org.jgroups.protocols.pbcast.GMS] (OOB-1,infinispan-cluster-OHServerHybrid-Prod,Prod_perfbc2a-58563:) Prod_perfbc2a-58563: sending merge response=GmsHeader[MERGE_R
      SP]: view=[Prod_perfbc2a-58563|0] [Prod_perfbc2a-58563], digest=Prod_perfbc2a-58563: [0 : 4 (4)], merge_id=Prod_perfbc2b-9973::2
      2010-11-05 08:27:43,235 40633 WARN  [org.jgroups.protocols.pbcast.NAKACK] (Incoming-7,infinispan-cluster-OHServerHybrid-Prod,Prod_perfbc2a-58563:) Prod_perfbc2a-58563: dropped message from Prod_perfbc
      2b-9973 (not in xmit_table), keys are [Prod_perfbc2a-58563], view=[Prod_perfbc2a-58563|0] [Prod_perfbc2a-58563]
      2010-11-05 08:27:43,237 40635 TRACE [org.jgroups.protocols.pbcast.GMS] (Incoming-6,infinispan-cluster-OHServerHybrid-Prod,Prod_perfbc2a-58563:) Prod_perfbc2a-58563: mcasting view {MergeView::[Prod_perfbc2b-9973|5] [Prod_perfbc2b-9973, Prod_perfbc2a-58563], subgroups=[[Prod_perfbc2b-9973|4] [Prod_perfbc2b-9973], [Prod_perfbc2a-58563|0] [Prod_perfbc2a-58563]]} (2 mbrs)
      ACTIVE SIDE:
      2010-11-05 08:27:23,231 58496756 TRACE [org.jgroups.protocols.MPING] (ReceiverThread:) received GET_MBRS_REQ from Prod_perfbc2b-9973, sending response [PING: type=GET_MBRS_RSP, arg=own_addr=Prod_perf
      bc2b-9973, view id=[Prod_perfbc2b-9973|4], is_server=true, is_coord=true, logical_name=Prod_perfbc2b-9973]
      2010-11-05 08:27:23,232 58496757 TRACE [org.jgroups.protocols.MPING] (OOB-1,infinispan-cluster-OHServerHybrid-Prod,Prod_perfbc2b-9973:) received GET_MBRS_RSP from Prod_perfbc2b-9973: own_addr=Prod_pe
      rfbc2b-9973, view id=[Prod_perfbc2b-9973|4], is_server=true, is_coord=true, logical_name=Prod_perfbc2b-9973
      2010-11-05 08:27:23,239 58496764 TRACE [org.jgroups.protocols.MPING] (OOB-1,infinispan-cluster-OHServerHybrid-Prod,Prod_perfbc2b-9973:) received GET_MBRS_RSP from Prod_perfbc2a-58563: own_addr=Prod_p
      erfbc2a-58563, view id=[Prod_perfbc2a-58563|0], is_server=true, is_coord=true, logical_name=Prod_perfbc2a-58563
      2010-11-05 08:27:23,239 58496764 TRACE [org.jgroups.protocols.MPING] (Timer-3,infinispan-cluster-OHServerHybrid-Prod,Prod_perfbc2b-9973:) discovery took 8 ms: responses: 2 total (2 servers (2 coord),
      0 clients)
      2010-11-05 08:27:23,239 58496764 TRACE [org.jgroups.protocols.MERGE2] (Timer-3,infinispan-cluster-OHServerHybrid-Prod,Prod_perfbc2b-9973:) Discovery results:
      [Prod_perfbc2b-9973]: [Prod_perfbc2b-9973|4] [Prod_perfbc2b-9973]
      [Prod_perfbc2a-58563]: [Prod_perfbc2a-58563|0] [Prod_perfbc2a-58563]
      2010-11-05 08:27:23,239 58496764 DEBUG [org.jgroups.protocols.MERGE2] (Timer-3,infinispan-cluster-OHServerHybrid-Prod,Prod_perfbc2b-9973:) Prod_perfbc2b-9973 found different views : [Prod_perfbc2b-99
      73|4], [Prod_perfbc2a-58563|0]; sending up MERGE event with merge participants [Prod_perfbc2b-9973, Prod_perfbc2a-58563].
      Discovery results:
      [Prod_perfbc2b-9973]: [Prod_perfbc2b-9973|4] [Prod_perfbc2b-9973]
      [Prod_perfbc2a-58563]: [Prod_perfbc2a-58563|0] [Prod_perfbc2a-58563]
      2010-11-05 08:27:23,294 58496819 WARN  [org.jgroups.protocols.TCP] (MergeTask,infinispan-cluster-OHServerHybrid-Prod,Prod_perfbc2b-9973:) Prod_perfbc2b-9973: no physical address for Prod_perfbc2a-585
      63, dropping message
      2010-11-05 08:27:23,295 58496820 TRACE [org.jgroups.protocols.MPING] (ReceiverThread:) received GET_MBRS_REQ from Prod_perfbc2b-9973, sending response [PING: type=GET_MBRS_RSP, arg=own_addr=Prod_perf
      bc2b-9973, view id=[Prod_perfbc2b-9973|4], is_server=true, is_coord=true, logical_name=Prod_perfbc2b-9973, physical_addrs=170.137.233.15:7020]
      2010-11-05 08:27:23,295 58496820 TRACE [org.jgroups.protocols.MPING] (OOB-2,infinispan-cluster-OHServerHybrid-Prod,Prod_perfbc2b-9973:) received GET_MBRS_RSP from Prod_perfbc2b-9973: own_addr=Prod_pe
      rfbc2b-9973, view id=[Prod_perfbc2b-9973|4], is_server=true, is_coord=true, logical_name=Prod_perfbc2b-9973, physical_addrs=170.137.233.15:7020
      2010-11-05 08:27:23,296 58496821 TRACE [org.jgroups.protocols.MPING] (OOB-2,infinispan-cluster-OHServerHybrid-Prod,Prod_perfbc2b-9973:) received GET_MBRS_RSP from Prod_perfbc2a-58563: own_addr=Prod_p
      erfbc2a-58563, view id=[Prod_perfbc2a-58563|0], is_server=true, is_coord=true, logical_name=Prod_perfbc2a-58563, physical_addrs=170.137.253.44:7020
      2010-11-05 08:27:24,301 58497826 TRACE [org.jgroups.protocols.MPING] (ReceiverThread:) received GET_MBRS_REQ from Prod_perfbc2a-58563, sending response [PING: type=GET_MBRS_RSP, arg=own_addr=Prod_per
      fbc2b-9973, view id=[Prod_perfbc2b-9973|4], is_server=true, is_coord=true, logical_name=Prod_perfbc2b-9973]
      2010-11-05 08:27:28,295 58501820 WARN  [org.jgroups.protocols.pbcast.GMS] (MergeTask,infinispan-cluster-OHServerHybrid-Prod,Prod_perfbc2b-9973:) Prod_perfbc2b-9973: merge leader did not get data from
      all partition coordinators [Prod_perfbc2b-9973, Prod_perfbc2a-58563], merge is cancelled
      2010-11-05 08:27:28,298 58501823 WARN  [org.jgroups.protocols.pbcast.NAKACK] (Incoming-4,infinispan-cluster-OHServerHybrid-Prod,Prod_perfbc2b-9973:) Prod_perfbc2b-9973: dropped message from Prod_perf
      bc2a-58563 (not in xmit_table), keys are [Prod_perfbc2b-9973], view=[Prod_perfbc2b-9973|4] [Prod_perfbc2b-9973]
      2010-11-05 08:27:29,299 58502824 TRACE [org.jgroups.protocols.MPING] (ReceiverThread:) received GET_MBRS_REQ from Prod_perfbc2b-9973, sending response [PING: type=GET_MBRS_RSP, arg=own_addr=Prod_perf
      bc2b-9973, view id=[Prod_perfbc2b-9973|4], is_server=true, is_coord=true, logical_name=Prod_perfbc2b-9973]
      2010-11-05 08:27:29,300 58502825 TRACE [org.jgroups.protocols.MPING] (OOB-5,infinispan-cluster-OHServerHybrid-Prod,Prod_perfbc2b-9973:) received GET_MBRS_RSP from Prod_perfbc2b-9973: own_addr=Prod_pe
      rfbc2b-9973, view id=[Prod_perfbc2b-9973|4], is_server=true, is_coord=true, logical_name=Prod_perfbc2b-9973
      2010-11-05 08:27:29,301 58502826 TRACE [org.jgroups.protocols.MPING] (OOB-6,infinispan-cluster-OHServerHybrid-Prod,Prod_perfbc2b-9973:) received GET_MBRS_RSP from Prod_perfbc2a-58563: own_addr=Prod_p
      erfbc2a-58563, view id=[Prod_perfbc2a-58563|0], is_server=true, is_coord=true, logical_name=Prod_perfbc2a-58563
      2010-11-05 08:27:29,301 58502826 TRACE [org.jgroups.protocols.MPING] (Timer-2,infinispan-cluster-OHServerHybrid-Prod,Prod_perfbc2b-9973:) discovery took 2 ms: responses: 2 total (2 servers (2 coord),
      0 clients)
      2010-11-05 08:27:29,301 58502826 TRACE [org.jgroups.protocols.MERGE2] (Timer-2,infinispan-cluster-OHServerHybrid-Prod,Prod_perfbc2b-9973:) Discovery results:
      [Prod_perfbc2b-9973]: [Prod_perfbc2b-9973|4] [Prod_perfbc2b-9973]
      [Prod_perfbc2a-58563]: [Prod_perfbc2a-58563|0] [Prod_perfbc2a-58563]
      2010-11-05 08:27:29,301 58502826 DEBUG [org.jgroups.protocols.MERGE2] (Timer-2,infinispan-cluster-OHServerHybrid-Prod,Prod_perfbc2b-9973:) Prod_perfbc2b-9973 found different views : [Prod_perfbc2b-99
      73|4], [Prod_perfbc2a-58563|0]; sending up MERGE event with merge participants [Prod_perfbc2b-9973, Prod_perfbc2a-58563].
      Discovery results:
      [Prod_perfbc2b-9973]: [Prod_perfbc2b-9973|4] [Prod_perfbc2b-9973]
      [Prod_perfbc2a-58563]: [Prod_perfbc2a-58563|0] [Prod_perfbc2a-58563]
      2010-11-05 08:27:30,297 58503822 WARN  [org.jgroups.protocols.pbcast.FLUSH] (Incoming-1,infinispan-cluster-OHServerHybrid-Prod,Prod_perfbc2b-9973:) Prod_perfbc2b-9973: waiting for UNBLOCK timed out a
      fter 2000 ms
      2010-11-05 08:27:33,483 58507008 TRACE [org.jgroups.protocols.MPING] (ReceiverThread:) received GET_MBRS_REQ from Prod_perfbc2a-58563, sending response [PING: type=GET_MBRS_RSP, arg=own_addr=Prod_per
      fbc2b-9973, view id=[Prod_perfbc2b-9973|4], is_server=true, is_coord=true, logical_name=Prod_perfbc2b-9973]
      2010-11-05 08:27:43,173 58516698 TRACE [org.jgroups.protocols.MPING] (ReceiverThread:) received GET_MBRS_REQ from Prod_perfbc2b-9973, sending response [PING: type=GET_MBRS_RSP, arg=own_addr=Prod_perf
      bc2b-9973, view id=[Prod_perfbc2b-9973|4], is_server=true, is_coord=true, logical_name=Prod_perfbc2b-9973]
      2010-11-05 08:27:43,173 58516698 TRACE [org.jgroups.protocols.MPING] (OOB-6,infinispan-cluster-OHServerHybrid-Prod,Prod_perfbc2b-9973:) received GET_MBRS_RSP from Prod_perfbc2b-9973: own_addr=Prod_pe
      rfbc2b-9973, view id=[Prod_perfbc2b-9973|4], is_server=true, is_coord=true, logical_name=Prod_perfbc2b-9973
      2010-11-05 08:27:43,174 58516699 TRACE [org.jgroups.protocols.MPING] (OOB-6,infinispan-cluster-OHServerHybrid-Prod,Prod_perfbc2b-9973:) received GET_MBRS_RSP from Prod_perfbc2a-58563: own_addr=Prod_p
      erfbc2a-58563, view id=[Prod_perfbc2a-58563|0], is_server=true, is_coord=true, logical_name=Prod_perfbc2a-58563
      2010-11-05 08:27:43,174 58516699 TRACE [org.jgroups.protocols.MPING] (Timer-4,infinispan-cluster-OHServerHybrid-Prod,Prod_perfbc2b-9973:) discovery took 2 ms: responses: 2 total (2 servers (2 coord),
      0 clients)
      2010-11-05 08:27:43,174 58516699 TRACE [org.jgroups.protocols.MERGE2] (Timer-4,infinispan-cluster-OHServerHybrid-Prod,Prod_perfbc2b-9973:) Discovery results:
      [Prod_perfbc2b-9973]: [Prod_perfbc2b-9973|4] [Prod_perfbc2b-9973]
      [Prod_perfbc2a-58563]: [Prod_perfbc2a-58563|0] [Prod_perfbc2a-58563]
      2010-11-05 08:27:43,175 58516700 DEBUG [org.jgroups.protocols.MERGE2] (Timer-4,infinispan-cluster-OHServerHybrid-Prod,Prod_perfbc2b-9973:) Prod_perfbc2b-9973 found different views : [Prod_perfbc2b-99
      73|4], [Prod_perfbc2a-58563|0]; sending up MERGE event with merge participants [Prod_perfbc2b-9973, Prod_perfbc2a-58563].
      Discovery results:
      [Prod_perfbc2b-9973]: [Prod_perfbc2b-9973|4] [Prod_perfbc2b-9973]
      [Prod_perfbc2a-58563]: [Prod_perfbc2a-58563|0] [Prod_perfbc2a-58563]
      2010-11-05 08:27:43,232 58516757 WARN  [org.jgroups.protocols.pbcast.NAKACK] (Incoming-6,infinispan-cluster-OHServerHybrid-Prod,Prod_perfbc2b-9973:) Prod_perfbc2b-9973: dropped message from Prod_perf
      bc2a-58563 (not in xmit_table), keys are [Prod_perfbc2b-9973], view=[Prod_perfbc2b-9973|4] [Prod_perfbc2b-9973]
      2010-11-05 08:27:43,233 58516758 WARN  [org.jgroups.protocols.pbcast.NAKACK] (OOB-5,infinispan-cluster-OHServerHybrid-Prod,Prod_perfbc2b-9973:) Prod_perfbc2b-9973: dropped message from Prod_perfbc2a-
      58563 (not in xmit_table), keys are [Prod_perfbc2b-9973], view=[Prod_perfbc2b-9973|4] [Prod_perfbc2b-9973]
      2010-11-05 08:27:43,236 58516761 INFO  [org.infinispan.remoting.transport.jgroups.JGroupsTransport] (Incoming-5,infinispan-cluster-OHServerHybrid-Prod,Prod_perfbc2b-9973:) Received new, MERGED cluste
      r view: MergeView::[Prod_perfbc2b-9973|5] [Prod_perfbc2b-9973, Prod_perfbc2a-58563], subgroups=[[Prod_perfbc2b-9973|4] [Prod_perfbc2b-9973], [Prod_perfbc2a-58563|0] [Prod_perfbc2a-58563]]
      2010-11-05 08:27:58,236 58531761 DEBUG [org.jgroups.protocols.FD] (Timer-3,infinispan-cluster-OHServerHybrid-Prod,Prod_perfbc2b-9973:) sending are-you-alive msg to Prod_perfbc2a-58563 (own address=Pr
      od_perfbc2b-9973

        • 1. Re: Cluster Setup Issue in a Replicated Cache And StateTransfer
          cbo_

          Should have mentioned there were no significant pauses to the JVM during the time of discovery (i.e. no lengthy GCs nor jmap'ing going on). 

           

          Both applications have the is_coord=true which seems to be problematic.  As such, stateTransfer does not seem to occur.

          • 2. Re: Cluster Setup Issue in a Replicated Cache And StateTransfer
            mircea.markus

            Can you please attach the full logs including infinispan state transfer requests? Infinispan to be in TRACE as well.

            • 3. Re: Cluster Setup Issue in a Replicated Cache And StateTransfer
              cbo_

              I was looking at the logs some more and figured out the necessary adjustments to tolerate issues at cluster setup/discovery.  I still believe the issue here is related to a dropped multicast packet.  This only happens occasionally.  I figured out that I could increase the tolerance for such occurrences by adjusting 2 parameters in the jgroups MPING section.  I increased the timeout= and also the number for num_ping_requests.  Increasing the num_ping_requests will allow more retries to occur.  Increasing the timeout= setting will allow more time for the overall discovery to take place before moving on to the next steps (which include stateTransfer).  The problem is when the discovery fails and causes multiple VMs to be coordinators for their own caches (not sharing).  Then, it moves onto the stateTransfer and will effectively transfer nothing.  Then, later on when the MERGE occurs it only resolves the cluster but not state.

               

              After increasing my timeout to 10 seconds (10000 millis) and my num_ping_requests to 20 (20 attemps in the 10 seconds) I am not seeing the problem.  Of course if an issue persists beyond 10 seconds then I am in trouble.  However, keep in mind if I simply set the timeout very high, then the first VM into the cluster will always be delayed by that amount of time since no other coordinators are in the cluster.