
    Failed node removal during performance test

    an1310

      Hi all:


      I was running a stress test (around 100 tx/sec) against a 4-node distributed cluster, with three caches involved.  I then stopped one of the nodes, and the cluster was unable to rehash -- it produced non-stop failures until I finally stopped the test to see whether the cluster would catch up.
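
      For context, a minimal sketch of the kind of distributed-cache configuration involved is below.  The cache name, sync mode, timeouts, and owner counts are illustrative placeholders rather than my exact settings:

      <infinispan xmlns="urn:infinispan:config:5.0">
         <global>
            <!-- cluster transport; the JGroups stack comes from the attached file -->
            <transport clusterName="stress-test">
               <properties>
                  <property name="configurationFile" value="jgroups-tcp.xml"/>
               </properties>
            </transport>
         </global>
         <default>
            <clustering mode="distribution">
               <!-- synchronous distributed mode; each entry kept on numOwners nodes -->
               <sync replTimeout="15000"/>
               <hash numOwners="2" rehashEnabled="true"/>
            </clustering>
         </default>
         <!-- three caches along these lines were active during the test -->
         <namedCache name="testCache"/>
      </infinispan>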


      It didn't.  Here are the relevant log snippets from when I stopped the node.  This was with ISPN 5.0.0.CR8 and the JGroups version it ships with.


      Node 1:


      2011-08-04 00:09:47,659 ERROR [org.jgroups.protocols.TCP] failed sending message to Satriani_2-12188(PHL) (254 bytes): java.lang.IllegalStateException: Queue full, cause: null

      2011-08-04 00:09:47,660 ERROR [org.jgroups.protocols.TCP] failed sending message to Satriani_2-12188(PHL) (254 bytes): java.lang.IllegalStateException: Queue full, cause: null

      2011-08-04 00:09:47,727 ERROR [org.jgroups.protocols.TCP] failed sending message to Satriani_2-12188(PHL) (179 bytes): java.lang.IllegalStateException: Queue full, cause: null

      2011-08-04 00:09:47,730 ERROR [org.jgroups.protocols.TCP] failed sending message to Satriani_2-12188(PHL) (179 bytes): java.lang.IllegalStateException: Queue full, cause: null

      2011-08-04 00:09:49,389 ERROR [org.jgroups.protocols.TCP] failed sending message to Satriani_2-12188(PHL) (254 bytes): java.lang.IllegalStateException: Queue full, cause: null

      2011-08-04 00:09:49,424 ERROR [org.jgroups.protocols.TCP] failed sending message to Satriani_2-12188(PHL) (179 bytes): java.lang.IllegalStateException: Queue full, cause: null

      2011-08-04 00:09:56,497 WARN  [org.jgroups.protocols.pbcast.GMS] Satriani_1-20971(PHL): merge leader did not get data from all partition coordinators [Satriani_1-20971(PHL), Satriani_2-12188(PHL)], merge is cancelled

      2011-08-04 00:10:16,099 WARN  [org.jgroups.protocols.pbcast.GMS] Satriani_1-20971(PHL): merge leader did not get data from all partition coordinators [Satriani_1-20971(PHL), Satriani_2-12188(PHL)], merge is cancelled

      2011-08-04 00:10:36,592 WARN  [org.jgroups.protocols.pbcast.GMS] Satriani_1-20971(PHL): merge leader did not get data from all partition coordinators [Satriani_1-20971(PHL), Satriani_2-12188(PHL)], merge is cancelled

      2011-08-04 00:10:55,600 WARN  [org.jgroups.protocols.pbcast.GMS] Satriani_1-20971(PHL): merge leader did not get data from all partition coordinators [Satriani_1-20971(PHL), Satriani_2-12188(PHL)], merge is cancelled
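
      In case it helps narrow things down: the "Queue full" IllegalStateException suggests a bounded queue in the TCP transport filled up once the stopped node went away.  If use_send_queues is enabled in the stack, the per-connection send queue (send_queue_size) would be my first suspect.  For reference, the TCP attributes in question look like this -- the values here are illustrative, the real ones are in the attached configuration:

      <TCP bind_port="7800"
           use_send_queues="true"
           send_queue_size="10000"
           thread_pool.enabled="true"
           thread_pool.queue_enabled="true"
           thread_pool.queue_max_size="100000"
           thread_pool.rejection_policy="discard"/>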


      Node 2:

      [3121 : 3124 (3124) (size=3, missing=3122, highest stability=3121)]

      2011-08-04 00:11:29,498 WARN  [org.jgroups.protocols.pbcast.NAKACK] (requester=Satriani_3-59165(PHL), local_addr=Satriani_2-12188(PHL)) message Satriani_2-12188(PHL)::197 not found in retransmission table of Satriani_2-12188(PHL):
      [3121 : 3124 (3124) (size=3, missing=3122, highest stability=3121)]

      2011-08-04 00:11:29,954 WARN  [org.jgroups.protocols.pbcast.NAKACK] Satriani_2-12188(PHL): dropped message from Satriani_1-20971(PHL) (not in table [Satriani_2-12188(PHL)]), view=[Satriani_2-12188(PHL)|8] [Satriani_2-12188(PHL)]

      2011-08-04 00:11:31,998 WARN  [org.jgroups.protocols.pbcast.NAKACK] (requester=Satriani_1-20971(PHL), local_addr=Satriani_2-12188(PHL)) message Satriani_2-12188(PHL)::1782 not found in retransmission table of Satriani_2-12188(PHL):
      [3121 : 3124 (3124) (size=3, missing=3122, highest stability=3121)]

      2011-08-04 00:11:34,294 WARN  [org.jgroups.protocols.pbcast.NAKACK] (requester=Satriani_3-59165(PHL), local_addr=Satriani_2-12188(PHL)) message Satriani_2-12188(PHL)::197 not found in retransmission table of Satriani_2-12188(PHL):
      [3121 : 3124 (3124) (size=3, missing=3122, highest stability=3121)]

      2011-08-04 00:11:36,724 WARN  [org.jgroups.protocols.pbcast.NAKACK] (requester=Satriani_1-20971(PHL), local_addr=Satriani_2-12188(PHL)) message Satriani_2-12188(PHL)::1782 not found in retransmission table of Satriani_2-12188(PHL):
      [3121 : 3124 (3124) (size=3, missing=3122, highest stability=3121)]

      2011-08-04 00:11:37,838 WARN  [org.jgroups.protocols.pbcast.NAKACK] Satriani_2-12188(PHL): dropped message from Satriani_1-20971(PHL) (not in table [Satriani_2-12188(PHL)]), view=[Satriani_2-12188(PHL)|8] [Satriani_2-12188(PHL)]

      2011-08-04 00:11:39,093 WARN  [org.jgroups.protocols.pbcast.NAKACK] (requester=Satriani_3-59165(PHL), local_addr=Satriani_2-12188(PHL)) message Satriani_2-12188(PHL)::197 not found in retransmission table of Satriani_2-12188(PHL):
      [3121 : 3124 (3124) (size=3, missing=3122, highest stability=3121)]

      2011-08-04 00:11:40,349 WARN  [org.jgroups.protocols.pbcast.NAKACK] Satriani_2-12188(PHL): dropped message from Satriani_3-59165(PHL) (not in table [Satriani_2-12188(PHL)]), view=[Satriani_2-12188(PHL)|8] [Satriani_2-12188(PHL)]

      2011-08-04 00:11:41,554 WARN  [org.jgroups.protocols.pbcast.NAKACK] (requester=Satriani_1-20971(PHL), local_addr=Satriani_2-12188(PHL)) message Satriani_2-12188(PHL)::1782 not found in retransmission table of Satriani_2-12188(PHL):
      [3121 : 3124 (3124) (size=3, missing=3122, highest stability=3121)]


      Node 3:


      2011-08-03 14:54:57,928 ERROR [org.infinispan.distribution.RebalanceTask] ISPN000145: Error during rehash

      java.util.concurrent.TimeoutException: Timeout waiting for cluster-wide rehash to finish

              at org.infinispan.distribution.DistributionManagerImpl.waitForRehashToComplete(DistributionManagerImpl.java:265)

              at org.infinispan.distribution.RebalanceTask.performRehash(RebalanceTask.java:165)

              at org.infinispan.distribution.RehashTask.call(RehashTask.java:67)

              at org.infinispan.distribution.RehashTask.call(RehashTask.java:44)

              at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)

              at java.util.concurrent.FutureTask.run(FutureTask.java:138)

              at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)

              at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)

              at java.lang.Thread.run(Thread.java:619)

      2011-08-03 14:54:57,934 ERROR [org.infinispan.distribution.RebalanceTask] ISPN000145: Error during rehash

      java.util.concurrent.TimeoutException: Timeout waiting for cluster-wide rehash to finish

              at org.infinispan.distribution.DistributionManagerImpl.waitForRehashToComplete(DistributionManagerImpl.java:265)

              at org.infinispan.distribution.RebalanceTask.performRehash(RebalanceTask.java:165)

              at org.infinispan.distribution.RehashTask.call(RehashTask.java:67)

              at org.infinispan.distribution.RehashTask.call(RehashTask.java:44)

              at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)

              at java.util.concurrent.FutureTask.run(FutureTask.java:138)

              at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)

              at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)

              at java.lang.Thread.run(Thread.java:619)
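
      In case the timeout itself is tunable: the rehash-related knobs I can find on the 5.0 <hash> element are below.  The values are illustrative, and I haven't confirmed which of these (if either) governs the wait in waitForRehashToComplete:

      <clustering mode="distribution">
         <hash numOwners="2"
               rehashEnabled="true"
               rehashWait="60000"
               rehashRpcTimeout="600000"/>
      </clustering>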


      Node 3 hadn't logged a message in a while, so I took a thread dump.  I've also included the JGroups configuration.