
    Failed node removal during performance test

    an1310

      Hi all:


      I was running a stress test (around 100 tx/sec) against a 4-node distributed cluster, with three caches involved.  I then stopped one of the nodes, and the cluster was unable to rehash -- it produced non-stop failures until I finally stopped the test to see whether the cluster would catch up.
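
      For context, a minimal sketch of the kind of distributed-cache configuration involved is below.  The cache name, sync mode, timeouts, and owner counts are illustrative placeholders rather than my exact settings:

      <infinispan xmlns="urn:infinispan:config:5.0">
         <global>
            <!-- cluster transport; the JGroups stack comes from the attached file -->
            <transport clusterName="stress-test">
               <properties>
                  <property name="configurationFile" value="jgroups-tcp.xml"/>
               </properties>
            </transport>
         </global>
         <default>
            <clustering mode="distribution">
               <!-- synchronous distributed mode; each entry kept on numOwners nodes -->
               <sync replTimeout="15000"/>
               <hash numOwners="2" rehashEnabled="true"/>
            </clustering>
         </default>
         <!-- three caches along these lines were active during the test -->
         <namedCache name="testCache"/>
      </infinispan>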


      It didn't.  Here are the relevant log snippets from when I stopped the node.  This was with ISPN 5.0.0.CR8 and the JGroups version it ships with.


      Node 1:


      2011-08-04 00:09:47,659 ERROR [org.jgroups.protocols.TCP] failed sending message to Satriani_2-12188(PHL) (254 bytes): java.lang.IllegalStateException: Queue full, cause: null

      2011-08-04 00:09:47,660 ERROR [org.jgroups.protocols.TCP] failed sending message to Satriani_2-12188(PHL) (254 bytes): java.lang.IllegalStateException: Queue full, cause: null

      2011-08-04 00:09:47,727 ERROR [org.jgroups.protocols.TCP] failed sending message to Satriani_2-12188(PHL) (179 bytes): java.lang.IllegalStateException: Queue full, cause: null

      2011-08-04 00:09:47,730 ERROR [org.jgroups.protocols.TCP] failed sending message to Satriani_2-12188(PHL) (179 bytes): java.lang.IllegalStateException: Queue full, cause: null

      2011-08-04 00:09:49,389 ERROR [org.jgroups.protocols.TCP] failed sending message to Satriani_2-12188(PHL) (254 bytes): java.lang.IllegalStateException: Queue full, cause: null

      2011-08-04 00:09:49,424 ERROR [org.jgroups.protocols.TCP] failed sending message to Satriani_2-12188(PHL) (179 bytes): java.lang.IllegalStateException: Queue full, cause: null

      2011-08-04 00:09:56,497 WARN  [org.jgroups.protocols.pbcast.GMS] Satriani_1-20971(PHL): merge leader did not get data from all partition coordinators [Satriani_1-20971(PHL), Satriani_2-12188(PHL)], merge is cancelled

      2011-08-04 00:10:16,099 WARN  [org.jgroups.protocols.pbcast.GMS] Satriani_1-20971(PHL): merge leader did not get data from all partition coordinators [Satriani_1-20971(PHL), Satriani_2-12188(PHL)], merge is cancelled

      2011-08-04 00:10:36,592 WARN  [org.jgroups.protocols.pbcast.GMS] Satriani_1-20971(PHL): merge leader did not get data from all partition coordinators [Satriani_1-20971(PHL), Satriani_2-12188(PHL)], merge is cancelled

      2011-08-04 00:10:55,600 WARN  [org.jgroups.protocols.pbcast.GMS] Satriani_1-20971(PHL): merge leader did not get data from all partition coordinators [Satriani_1-20971(PHL), Satriani_2-12188(PHL)], merge is cancelled
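
      In case it helps narrow things down: the "Queue full" IllegalStateException suggests a bounded queue in the TCP transport filled up once the stopped node went away.  If use_send_queues is enabled in the stack, the per-connection send queue (send_queue_size) would be my first suspect.  For reference, the TCP attributes in question look like this -- the values here are illustrative, the real ones are in the attached configuration:

      <TCP bind_port="7800"
           use_send_queues="true"
           send_queue_size="10000"
           thread_pool.enabled="true"
           thread_pool.queue_enabled="true"
           thread_pool.queue_max_size="100000"
           thread_pool.rejection_policy="discard"/>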


      Node 2:

      [3121 : 3124 (3124) (size=3, missing=3122, highest stability=3121)]

      2011-08-04 00:11:29,498 WARN  [org.jgroups.protocols.pbcast.NAKACK] (requester=Satriani_3-59165(PHL), local_addr=Satriani_2-12188(PHL)) message Satriani_2-12188(PHL)::197 not found in retransmission table of Satriani_2-12188(PHL):
      [3121 : 3124 (3124) (size=3, missing=3122, highest stability=3121)]

      2011-08-04 00:11:29,954 WARN  [org.jgroups.protocols.pbcast.NAKACK] Satriani_2-12188(PHL): dropped message from Satriani_1-20971(PHL) (not in table [Satriani_2-12188(PHL)]), view=[Satriani_2-12188(PHL)|8] [Satriani_2-12188(PHL)]

      2011-08-04 00:11:31,998 WARN  [org.jgroups.protocols.pbcast.NAKACK] (requester=Satriani_1-20971(PHL), local_addr=Satriani_2-12188(PHL)) message Satriani_2-12188(PHL)::1782 not found in retransmission table of Satriani_2-12188(PHL):
      [3121 : 3124 (3124) (size=3, missing=3122, highest stability=3121)]

      2011-08-04 00:11:34,294 WARN  [org.jgroups.protocols.pbcast.NAKACK] (requester=Satriani_3-59165(PHL), local_addr=Satriani_2-12188(PHL)) message Satriani_2-12188(PHL)::197 not found in retransmission table of Satriani_2-12188(PHL):
      [3121 : 3124 (3124) (size=3, missing=3122, highest stability=3121)]

      2011-08-04 00:11:36,724 WARN  [org.jgroups.protocols.pbcast.NAKACK] (requester=Satriani_1-20971(PHL), local_addr=Satriani_2-12188(PHL)) message Satriani_2-12188(PHL)::1782 not found in retransmission table of Satriani_2-12188(PHL):
      [3121 : 3124 (3124) (size=3, missing=3122, highest stability=3121)]

      2011-08-04 00:11:37,838 WARN  [org.jgroups.protocols.pbcast.NAKACK] Satriani_2-12188(PHL): dropped message from Satriani_1-20971(PHL) (not in table [Satriani_2-12188(PHL)]), view=[Satriani_2-12188(PHL)|8] [Satriani_2-12188(PHL)]

      2011-08-04 00:11:39,093 WARN  [org.jgroups.protocols.pbcast.NAKACK] (requester=Satriani_3-59165(PHL), local_addr=Satriani_2-12188(PHL)) message Satriani_2-12188(PHL)::197 not found in retransmission table of Satriani_2-12188(PHL):
      [3121 : 3124 (3124) (size=3, missing=3122, highest stability=3121)]

      2011-08-04 00:11:40,349 WARN  [org.jgroups.protocols.pbcast.NAKACK] Satriani_2-12188(PHL): dropped message from Satriani_3-59165(PHL) (not in table [Satriani_2-12188(PHL)]), view=[Satriani_2-12188(PHL)|8] [Satriani_2-12188(PHL)]

      2011-08-04 00:11:41,554 WARN  [org.jgroups.protocols.pbcast.NAKACK] (requester=Satriani_1-20971(PHL), local_addr=Satriani_2-12188(PHL)) message Satriani_2-12188(PHL)::1782 not found in retransmission table of Satriani_2-12188(PHL):
      [3121 : 3124 (3124) (size=3, missing=3122, highest stability=3121)]


      Node 3:


      2011-08-03 14:54:57,928 ERROR [org.infinispan.distribution.RebalanceTask] ISPN000145: Error during rehash

      java.util.concurrent.TimeoutException: Timeout waiting for cluster-wide rehash to finish

              at org.infinispan.distribution.DistributionManagerImpl.waitForRehashToComplete(DistributionManagerImpl.java:265)

              at org.infinispan.distribution.RebalanceTask.performRehash(RebalanceTask.java:165)

              at org.infinispan.distribution.RehashTask.call(RehashTask.java:67)

              at org.infinispan.distribution.RehashTask.call(RehashTask.java:44)

              at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)

              at java.util.concurrent.FutureTask.run(FutureTask.java:138)

              at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)

              at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)

              at java.lang.Thread.run(Thread.java:619)

      2011-08-03 14:54:57,934 ERROR [org.infinispan.distribution.RebalanceTask] ISPN000145: Error during rehash

      java.util.concurrent.TimeoutException: Timeout waiting for cluster-wide rehash to finish

              at org.infinispan.distribution.DistributionManagerImpl.waitForRehashToComplete(DistributionManagerImpl.java:265)

              at org.infinispan.distribution.RebalanceTask.performRehash(RebalanceTask.java:165)

              at org.infinispan.distribution.RehashTask.call(RehashTask.java:67)

              at org.infinispan.distribution.RehashTask.call(RehashTask.java:44)

              at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)

              at java.util.concurrent.FutureTask.run(FutureTask.java:138)

              at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)

              at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)

              at java.lang.Thread.run(Thread.java:619)
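
      In case the timeout itself is tunable: the rehash-related knobs I can find on the 5.0 <hash> element are below.  The values are illustrative, and I haven't confirmed which of these (if either) governs the wait in waitForRehashToComplete:

      <clustering mode="distribution">
         <hash numOwners="2"
               rehashEnabled="true"
               rehashWait="60000"
               rehashRpcTimeout="600000"/>
      </clustering>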


      Node 3 hadn't logged a message in a while, so I took a thread dump.  I've also included the JGroups configuration.