3 Replies Latest reply on Feb 12, 2011 10:41 PM by girishadat

JGroups issues at cluster nodes restart.

girishadat Feb 9, 2011 4:45 PM

I am testing how my application, which uses Infinispan extensively, behaves under abnormal termination.

But when I restart one of the cluster members, JGroups logs several different errors, such as:

repeated endlessly: [org.jgroups.protocols.pbcast.NAKACK] mynode1-61896: dropped message from mynode2-62861 (not in table [mynode1-61896]), view=[mynode1-61896|2] [mynode1-61896]
repeated endlessly: [org.jgroups.protocols.pbcast.ClientGmsImpl] join(mynode2-62861) sent to mynode1-61896 timed out (after 7000 ms), retrying

Why do these happen? Also, my transactions get aborted when I perform operations on a synchronously replicated cache within a transaction context.

I am using Infinispan 4.2.0 with JGroups 2.11 and JBossTS 4.6.

1. Re: JGroups issues at cluster nodes restart.

girishadat Feb 9, 2011 4:47 PM (in response to girishadat)

As far as I know, the join-retry timeout problem in JGroups was solved in 2.8.0 (by introducing logical addresses and removing shunning?).
2. Re: JGroups issues at cluster nodes restart.

girishadat Feb 11, 2011 8:42 AM (in response to girishadat)
It looks like the cause is at the coordinator to which the new member sends its join request. The join retries on rejoin start after the coordinator logs the lines below, which does not always happen. These lines were printed after a view change.

2011-02-11 17:23:37,974 WARN [org.jgroups.protocols.pbcast.GMS] 192.168.12.3:7800:1331-25091: failed to collect all ACKs (expected=1) for view [192.168.12.3:7800:1331-25091|4] [192.168.12.3:7800:1331-25091] after 2000ms, missing ACKs from [192.168.12.3:7800:1331-25091]
2011-02-11 17:23:39,977 WARN [org.jgroups.protocols.pbcast.FLUSH] 192.168.12.3:7800:1331-25091: waiting for UNBLOCK timed out after 2000 ms

Why would a node fail to get an ACK from itself? Anyway, let me try setting -Djgroups.bind.address.
3. Re: JGroups issues at cluster nodes restart.

girishadat Feb 12, 2011 10:41 PM (in response to girishadat)

Sorry again; it was due to a bug in the host application code. A view change listener registered on the coordinator node was blocked waiting for a thread to complete; that thread had been started to perform some application operations for the view change.
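
For anyone hitting the same symptoms, here is a minimal sketch of the kind of fix that applies: the listener callback hands the work to a separate executor and returns immediately, instead of waiting for that work to finish. It uses only plain Java; the class name NonBlockingViewListener and the methods onViewChange, handle, lastHandledView and shutdownAndWait are hypothetical names for illustration, not Infinispan or JGroups API. In a real application the callback would be whatever view change listener method you register with the cache manager. My assumption is that the callback runs on a thread the cluster stack needs back promptly, which would explain the missing ACK from the node itself.

```java
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicReference;

// Hypothetical sketch: a view change handler that never blocks the calling thread.
final class NonBlockingViewListener {

    // A single worker thread keeps view changes processed in arrival order.
    private final ExecutorService worker = Executors.newSingleThreadExecutor();

    // Remembers the last view the application finished handling.
    private final AtomicReference<List<String>> lastHandled =
            new AtomicReference<List<String>>();

    // Stand-in for the registered view change callback (assumed name).
    // It only queues the work; it does NOT wait for the task to complete.
    public void onViewChange(final List<String> newMembers) {
        worker.submit(new Runnable() {
            @Override
            public void run() {
                handle(newMembers);
            }
        });
        // Returning here frees the calling thread right away.
    }

    // Placeholder for the slow application work done per view change.
    private void handle(List<String> members) {
        lastHandled.set(members);
    }

    public List<String> lastHandledView() {
        return lastHandled.get();
    }

    // Call at application shutdown, not from inside the listener callback.
    public boolean shutdownAndWait(long timeoutMs) {
        worker.shutdown();
        try {
            return worker.awaitTermination(timeoutMs, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }
}
```

The important part is that nothing in onViewChange joins or waits on the worker; any waiting is confined to shutdownAndWait, which is meant to be called from application shutdown code.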