0 Replies Latest reply on Nov 21, 2013 12:44 AM by erikpearson

    Server fails to join cluster after being restarted

    erikpearson

      Does anyone have suggestions about why a server can't drop and rejoin a cluster?  I'm running Infinispan 6.0.0.Final in a two node cluster.  With servers A and B running and clustered, if B is brought down, then restarted, it fails to rejoin the cluster.  We see the following errors in the logs:

       

      Server A:

       

      [2013-11-20 23:26:53,087][ViewHandler,mn4s34196-5743][DEBUG][org.jgroups.protocols.pbcast.STABLE] suspending message garbage collection

      [2013-11-20 23:26:53,087][ViewHandler,mn4s34196-5743][DEBUG][org.jgroups.protocols.pbcast.STABLE] mn4s34196-5743: resume task started, max_suspend_time=33000

      [2013-11-20 23:26:53,087][Incoming-1,mn4s34196-5743][DEBUG][org.jgroups.protocols.pbcast.GMS] mn4s34196-5743: installing view [mn4s34196-5743|9] (2) [mn4s34196-5743, mn4s34198-20624]

      [2013-11-20 23:26:53,088][Incoming-1,mn4s34196-5743][DEBUG][org.jgroups.protocols.FD_SOCK] VIEW_CHANGE received: [mn4s34196-5743, mn4s34198-20624]

      [2013-11-20 23:26:53,088][FD_SOCK pinger,mn4s34196-5743][DEBUG][org.jgroups.protocols.FD_SOCK] ping_dest is mn4s34198-20624, pingable_mbrs=[mn4s34196-5743, mn4s34198-20624]

      [2013-11-20 23:26:53,088][Incoming-1,mn4s34196-5743][DEBUG][org.infinispan.remoting.transport.jgroups.JGroupsTransport] New view accepted: [mn4s34196-5743|9] (2) [mn4s34196-5743, mn4s34198-20624]

      [2013-11-20 23:26:53,088][Incoming-1,mn4s34196-5743][INFO][org.infinispan.remoting.transport.jgroups.JGroupsTransport] ISPN000094: Received new cluster view: [mn4s34196-5743|9] (2) [mn4s34196-5743,,mn4s34198-20624]

      [2013-11-20 23:26:53,135][ViewHandler,mn4s34196-5743][DEBUG][org.jgroups.protocols.pbcast.STABLE] resuming message garbage collection

      [2013-11-20 23:27:52,497][Timer-4,mn4s34196-5743][DEBUG][org.jgroups.protocols.FD_ALL] haven't received a heartbeat from mn4s34198-20624 for 50351 ms, adding it to suspect list

      [2013-11-20 23:27:52,497][Timer-4,mn4s34196-5743][DEBUG][org.jgroups.protocols.FD_ALL] suspecting [mn4s34198-20624]

      [2013-11-20 23:27:52,498][Incoming-2,mn4s34196-5743][WARN][org.jgroups.protocols.pbcast.GMS] mn4s34196-5743: not member of view [mn4s34198-20624|10]; discarding it

      [2013-11-20 23:27:55,499][Timer-2,mn4s34196-5743][DEBUG][org.jgroups.protocols.MERGE2] mn4s34196-5743 found different views : [mn4s34198-20624|10], [mn4s34196-5743|9]; sending up MERGE event with merge participants [mn4s34198-20624, mn4s34196-5743].

      Discovery results:

      [mn4s34198-20624]: coord=mn4s34198-20624

      [mn4s34196-5743]: coord=mn4s34196-5743

       

      [2013-11-20 23:27:56,748][INT-1,mn4s34196-5743][DEBUG][org.jgroups.protocols.pbcast.STABLE] suspending message garbage collection

      [2013-11-20 23:27:56,748][INT-1,mn4s34196-5743][DEBUG][org.jgroups.protocols.pbcast.STABLE] mn4s34196-5743: resume task started, max_suspend_time=220000

      [2013-11-20 23:27:56,803][Incoming-1,mn4s34196-5743][DEBUG][org.jgroups.protocols.pbcast.NAKACK2]

      [mn4s34196-5743 mergeDigest()]

      existing digest:  mn4s34198-20624: [3 (3)], mn4s34196-5743: [14 (14)]

      new digest:       mn4s34198-20624: [1 (1)], mn4s34196-5743: [14 (14)]

      resulting digest: mn4s34198-20624: [3 (3)], mn4s34196-5743: [14 (14)]

      [2013-11-20 23:27:56,803][Incoming-1,mn4s34196-5743][DEBUG][org.jgroups.protocols.pbcast.GMS] mn4s34196-5743: installing view MergeView::[mn4s34198-20624|11] (2) [mn4s34198-20624, mn4s34196-5743], 2 subgroups: [mn4s34198-20624|10] (1) [mn4s34198-20624], [mn4s34196-5743|9] (1) [mn4s34196-5743]

      [2013-11-20 23:27:56,804][Incoming-1,mn4s34196-5743][DEBUG][org.jgroups.protocols.FD_SOCK] VIEW_CHANGE received: [mn4s34198-20624, mn4s34196-5743]

      [2013-11-20 23:27:56,804][Incoming-1,mn4s34196-5743][DEBUG][org.infinispan.remoting.transport.jgroups.JGroupsTransport] New view accepted: MergeView::[mn4s34198-20624|11] (2) [mn4s34198-20624, mn4s34196-5743], 2 subgroups: [mn4s34198-20624|10] (1) [mn4s34198-20624], [mn4s34196-5743|9] (1) [mn4s34196-5743]

      [2013-11-20 23:27:56,804][Incoming-1,mn4s34196-5743][INFO][org.infinispan.remoting.transport.jgroups.JGroupsTransport] ISPN000093: Received new, MERGED cluster view: MergeView::[mn4s34198-20624|11] (2) [mn4s34198-20624, mn4s34196-5743], 2 subgroups: [mn4s34198-20624|10] (1) [mn4s34198-20624], [mn4s34196-5743|9] (1) [mn4s34196-5743]

      [2013-11-20 23:27:56,804][Incoming-1,mn4s34196-5743][DEBUG][org.jgroups.protocols.pbcast.STABLE] resuming message garbage collection

      [2013-11-20 23:27:58,806][Incoming-2,mn4s34196-5743][WARN][org.jgroups.protocols.pbcast.GMS] mn4s34196-5743: failed to collect all ACKs (expected=2) for view [mn4s34198-20624|11] after 2000ms, missing 2 ACKs from mn4s34198-20624, mn4s34196-5743

       

       

      Server B:

       

       

      [2013-11-20 23:26:52,976][main][DEBUG][org.jgroups.protocols.UDP] socket information:

      mcast_addr=230.33.1.1:46655, bind_addr=/10.151.115.22, ttl=2

      sock: bound to 10.151.115.22:39198, receive buffer size=20000000, send buffer size=640000

      mcast_sock: bound to 10.151.115.22:46655, send buffer size=640000, receive buffer size=25000000

      [2013-11-20 23:26:52,978][main][DEBUG][org.jgroups.protocols.pbcast.GMS] address=mn4s34198-20624, cluster=ISPN, physical address=10.151.115.22:39198

      [2013-11-20 23:26:53,007][main][DEBUG][org.jgroups.protocols.pbcast.GMS] mn4s34198-20624: sending JOIN(mn4s34198-20624) to mn4s34196-5743

      [2013-11-20 23:26:53,116][main][DEBUG][org.jgroups.protocols.pbcast.NAKACK2]

      [mn4s34198-20624 setDigest()]

      existing digest:  []

      new digest:       mn4s34196-5743: [13 (13)], mn4s34198-20624: [0 (0)]

      resulting digest: mn4s34196-5743: [13 (13)], mn4s34198-20624: [0 (0)]

      [2013-11-20 23:26:53,117][main][DEBUG][org.jgroups.protocols.pbcast.GMS] mn4s34198-20624: installing view [mn4s34196-5743|9] (2) [mn4s34196-5743, mn4s34198-20624]

      [2013-11-20 23:26:53,117][main][DEBUG][org.jgroups.protocols.FD_SOCK] VIEW_CHANGE received: [mn4s34196-5743, mn4s34198-20624]

      [2013-11-20 23:26:53,120][main][DEBUG][org.infinispan.remoting.transport.jgroups.JGroupsTransport] New view accepted: [mn4s34196-5743|9] (2) [mn4s34196-5743, mn4s34198-20624]

      [2013-11-20 23:26:53,120][main][INFO][org.infinispan.remoting.transport.jgroups.JGroupsTransport] ISPN000094: Received new cluster view: [mn4s34196-5743|9] (2) [mn4s34196-5743, mn4s34198-20624]

      [2013-11-20 23:26:53,121][main][INFO][org.infinispan.remoting.transport.jgroups.JGroupsTransport] ISPN000079: Cache local address is mn4s34198-20624, physical addresses are [10.151.115.22:39198]

      [2013-11-20 23:26:53,122][main][DEBUG][org.infinispan.remoting.transport.jgroups.JGroupsTransport] Waiting on view being accepted

      [2013-11-20 23:26:53,125][main][INFO][org.infinispan.factories.GlobalComponentRegistry] ISPN000128: Infinispan version: Infinispan 'Infinium' 6.0.0.Final

      [2013-11-20 23:26:53,156][FD_SOCK pinger,mn4s34198-20624][DEBUG][org.jgroups.protocols.FD_SOCK] ping_dest is mn4s34196-5743, pingable_mbrs=[mn4s34196-5743, mn4s34198-20624]

      ...

      [2013-11-20 23:27:17,138][Timer-3,mn4s34198-20624][DEBUG][org.jgroups.protocols.FD_ALL] haven't received a heartbeat from mn4s34196-5743 for 15026 ms, adding it to suspect list

      [2013-11-20 23:27:17,138][Timer-3,mn4s34198-20624][DEBUG][org.jgroups.protocols.FD_ALL] suspecting [mn4s34196-5743]

      [2013-11-20 23:27:18,641][VERIFY_SUSPECT.TimerThread,mn4s34198-20624][DEBUG][org.jgroups.blocks.RequestCorrelator] suspect=mn4s34196-5743

      [2013-11-20 23:27:18,695][ViewHandler,mn4s34198-20624][DEBUG][org.jgroups.protocols.pbcast.GMS] mn4s34198-20624: members are [mn4s34196-5743, mn4s34198-20624], coord=mn4s34198-20624: I'm the new coord !

      [2013-11-20 23:27:18,696][ViewHandler,mn4s34198-20624][DEBUG][org.jgroups.protocols.pbcast.STABLE] resuming message garbage collection

      [2013-11-20 23:27:18,749][Incoming-1,mn4s34198-20624][DEBUG][org.jgroups.protocols.pbcast.GMS] mn4s34198-20624: installing view [mn4s34198-20624|10] (1) [mn4s34198-20624]

      [2013-11-20 23:27:18,749][Incoming-1,mn4s34198-20624][DEBUG][org.jgroups.protocols.pbcast.NAKACK2] mn4s34198-20624: removed mn4s34196-5743 from xmit_table (not member anymore)

      [2013-11-20 23:27:18,749][Incoming-1,mn4s34198-20624][DEBUG][org.jgroups.protocols.FD_SOCK] VIEW_CHANGE received: [mn4s34198-20624]

      [2013-11-20 23:27:19,091][FD_SOCK pinger,mn4s34198-20624][DEBUG][org.jgroups.protocols.FD_SOCK] socket to null was closed gracefully

      [2013-11-20 23:27:19,093][Incoming-1,mn4s34198-20624][DEBUG][org.infinispan.remoting.transport.jgroups.JGroupsTransport] New view accepted: [mn4s34198-20624|10] (1) [mn4s34198-20624]

      [2013-11-20 23:27:19,093][Incoming-1,mn4s34198-20624][INFO][org.infinispan.remoting.transport.jgroups.JGroupsTransport] ISPN000094: Received new cluster view: [mn4s34198-20624|10] (1) [mn4s34198-20624] !