Failed node removal during performance test
an1310 · Aug 3, 2011 8:28 PM

Hi all,
I was running a stress test (around 100 tx/sec) against a 4-node distributed cluster with three caches involved. I then stopped one of the nodes, and the cluster was not able to rehash -- it produced non-stop failures until I finally stopped the test to see if the cluster would catch up.
It didn't. Here are the relevant log snippets from the time I stopped the node. This was with ISPN 5.0.0.CR8 and the included JGroups dependency.
Node 1:
2011-08-04 00:09:47,659 ERROR [org.jgroups.protocols.TCP] failed sending message to Satriani_2-12188(PHL) (254 bytes): java.lang.IllegalStateException: Queue full, cause: null
2011-08-04 00:09:47,660 ERROR [org.jgroups.protocols.TCP] failed sending message to Satriani_2-12188(PHL) (254 bytes): java.lang.IllegalStateException: Queue full, cause: null
2011-08-04 00:09:47,727 ERROR [org.jgroups.protocols.TCP] failed sending message to Satriani_2-12188(PHL) (179 bytes): java.lang.IllegalStateException: Queue full, cause: null
2011-08-04 00:09:47,730 ERROR [org.jgroups.protocols.TCP] failed sending message to Satriani_2-12188(PHL) (179 bytes): java.lang.IllegalStateException: Queue full, cause: null
2011-08-04 00:09:49,389 ERROR [org.jgroups.protocols.TCP] failed sending message to Satriani_2-12188(PHL) (254 bytes): java.lang.IllegalStateException: Queue full, cause: null
2011-08-04 00:09:49,424 ERROR [org.jgroups.protocols.TCP] failed sending message to Satriani_2-12188(PHL) (179 bytes): java.lang.IllegalStateException: Queue full, cause: null
2011-08-04 00:09:56,497 WARN [org.jgroups.protocols.pbcast.GMS] Satriani_1-20971(PHL): merge leader did not get data from all partition coordinators [Satriani_1-20971(PHL), Satriani_2-12188(PHL)], merge is cancelled
2011-08-04 00:10:16,099 WARN [org.jgroups.protocols.pbcast.GMS] Satriani_1-20971(PHL): merge leader did not get data from all partition coordinators [Satriani_1-20971(PHL), Satriani_2-12188(PHL)], merge is cancelled
2011-08-04 00:10:36,592 WARN [org.jgroups.protocols.pbcast.GMS] Satriani_1-20971(PHL): merge leader did not get data from all partition coordinators [Satriani_1-20971(PHL), Satriani_2-12188(PHL)], merge is cancelled
2011-08-04 00:10:55,600 WARN [org.jgroups.protocols.pbcast.GMS] Satriani_1-20971(PHL): merge leader did not get data from all partition coordinators [Satriani_1-20971(PHL), Satriani_2-12188(PHL)], merge is cancelled
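For what it's worth, the "Queue full" errors above look like the bounded per-connection send queue in the JGroups TCP transport filling up while the stopped member is unreachable. If that's the cause, the queue can be tuned in tcp.xml -- the attribute names below are from memory, so please check them against the JGroups schema bundled with ISPN 5.0.0.CR8:

```xml
<!-- Sketch only: enlarge (or disable) the per-connection send queue
     on the TCP transport. Attribute names may differ between JGroups
     versions; verify against the bundled schema. -->
<TCP bind_port="7800"
     use_send_queues="true"
     send_queue_size="10000"
     ... />
```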
Node 2:
2011-08-04 00:11:29,498 WARN [org.jgroups.protocols.pbcast.NAKACK] (requester=Satriani_3-59165(PHL), local_addr=Satriani_2-12188(PHL)) message Satriani_2-12188(PHL)::197 not found in retransmission table of Satriani_2-12188(PHL):
[3121 : 3124 (3124) (size=3, missing=3122, highest stability=3121)]
2011-08-04 00:11:29,954 WARN [org.jgroups.protocols.pbcast.NAKACK] Satriani_2-12188(PHL): dropped message from Satriani_1-20971(PHL) (not in table [Satriani_2-12188(PHL)]), view=[Satriani_2-12188(PHL)|8] [Satriani_2-12188(PHL)]
2011-08-04 00:11:31,998 WARN [org.jgroups.protocols.pbcast.NAKACK] (requester=Satriani_1-20971(PHL), local_addr=Satriani_2-12188(PHL)) message Satriani_2-12188(PHL)::1782 not found in retransmission table of Satriani_2-12188(PHL):
[3121 : 3124 (3124) (size=3, missing=3122, highest stability=3121)]
2011-08-04 00:11:34,294 WARN [org.jgroups.protocols.pbcast.NAKACK] (requester=Satriani_3-59165(PHL), local_addr=Satriani_2-12188(PHL)) message Satriani_2-12188(PHL)::197 not found in retransmission table of Satriani_2-12188(PHL):
[3121 : 3124 (3124) (size=3, missing=3122, highest stability=3121)]
2011-08-04 00:11:36,724 WARN [org.jgroups.protocols.pbcast.NAKACK] (requester=Satriani_1-20971(PHL), local_addr=Satriani_2-12188(PHL)) message Satriani_2-12188(PHL)::1782 not found in retransmission table of Satriani_2-12188(PHL):
[3121 : 3124 (3124) (size=3, missing=3122, highest stability=3121)]
2011-08-04 00:11:37,838 WARN [org.jgroups.protocols.pbcast.NAKACK] Satriani_2-12188(PHL): dropped message from Satriani_1-20971(PHL) (not in table [Satriani_2-12188(PHL)]), view=[Satriani_2-12188(PHL)|8] [Satriani_2-12188(PHL)]
2011-08-04 00:11:39,093 WARN [org.jgroups.protocols.pbcast.NAKACK] (requester=Satriani_3-59165(PHL), local_addr=Satriani_2-12188(PHL)) message Satriani_2-12188(PHL)::197 not found in retransmission table of Satriani_2-12188(PHL):
[3121 : 3124 (3124) (size=3, missing=3122, highest stability=3121)]
2011-08-04 00:11:40,349 WARN [org.jgroups.protocols.pbcast.NAKACK] Satriani_2-12188(PHL): dropped message from Satriani_3-59165(PHL) (not in table [Satriani_2-12188(PHL)]), view=[Satriani_2-12188(PHL)|8] [Satriani_2-12188(PHL)]
2011-08-04 00:11:41,554 WARN [org.jgroups.protocols.pbcast.NAKACK] (requester=Satriani_1-20971(PHL), local_addr=Satriani_2-12188(PHL)) message Satriani_2-12188(PHL)::1782 not found in retransmission table of Satriani_2-12188(PHL):
[3121 : 3124 (3124) (size=3, missing=3122, highest stability=3121)]
Node 3:
2011-08-03 14:54:57,928 ERROR [org.infinispan.distribution.RebalanceTask] ISPN000145: Error during rehash
java.util.concurrent.TimeoutException: Timeout waiting for cluster-wide rehash to finish
at org.infinispan.distribution.DistributionManagerImpl.waitForRehashToComplete(DistributionManagerImpl.java:265)
at org.infinispan.distribution.RebalanceTask.performRehash(RebalanceTask.java:165)
at org.infinispan.distribution.RehashTask.call(RehashTask.java:67)
at org.infinispan.distribution.RehashTask.call(RehashTask.java:44)
at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
at java.util.concurrent.FutureTask.run(FutureTask.java:138)
at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
at java.lang.Thread.run(Thread.java:619)
2011-08-03 14:54:57,934 ERROR [org.infinispan.distribution.RebalanceTask] ISPN000145: Error during rehash
java.util.concurrent.TimeoutException: Timeout waiting for cluster-wide rehash to finish
at org.infinispan.distribution.DistributionManagerImpl.waitForRehashToComplete(DistributionManagerImpl.java:265)
at org.infinispan.distribution.RebalanceTask.performRehash(RebalanceTask.java:165)
at org.infinispan.distribution.RehashTask.call(RehashTask.java:67)
at org.infinispan.distribution.RehashTask.call(RehashTask.java:44)
at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
at java.util.concurrent.FutureTask.run(FutureTask.java:138)
at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
at java.lang.Thread.run(Thread.java:619)
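The TimeoutException is DistributionManagerImpl giving up in waitForRehashToComplete. If I read the 5.0 configuration reference correctly, that wait is bounded by the rehashWait attribute on the hash element; raising it (and rehashRpcTimeout) might at least show whether the rehash eventually completes or is genuinely stuck. Again, attribute names from memory -- verify against the urn:infinispan:config:5.0 schema:

```xml
<!-- Sketch only: give the cluster-wide rehash more time to finish.
     Values are illustrative; defaults and attribute names should be
     confirmed against the Infinispan 5.0 schema. -->
<clustering mode="distribution">
   <hash rehashEnabled="true"
         rehashWait="120000"
         rehashRpcTimeout="600000"/>
</clustering>
```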
Node 3 hadn't logged a message in a while, so I took a thread dump. I've also attached the JGroups configuration.
Attachments:
- tcp.xml (2.4 KB)
- node_3_dmp.txt.zip (18.2 KB)