-
1. JBoss 5.1.0.GA - org.jgroups.protocols.pbcast.GMS timed out.
belaban Jan 7, 2011 8:29 AM (in response to dafydd2277)
Is this with JBoss 5.x? JBoss 5.x uses JGroups 2.6, which can run into reincarnation issues. Switch to JBoss 6 if you can (probably you can't!).
If not, then make sure you wait for all cluster nodes to exclude a killed node before you restart that node. Or, if you can, shut down the node gracefully before you restart it. Kill -9 in conjunction with a high timeout in FD leads to reincarnation issues...
Googling for "jgroups reincarnation" should shed some light on this, too.
-
2. JBoss 5.1.0.GA - org.jgroups.protocols.pbcast.GMS timed out.
dafydd2277 Jan 7, 2011 9:19 AM (in response to belaban)
Hi, Bela,
Thanks for the suggestions. No, sadly, I can't upgrade JBoss.
Your suggestion to start searching for "jgroups reincarnation" led me to http://community.jboss.org/wiki/HandleJoinProblem, which led me to adding a new <category> in jboss-log4j.xml:
<category name="org.jgroups.protocols.pbcast.GMS">
  <priority value="DEBUG" />
</category>
That gave me this output:
2011-01-07 08:56:05,468 DEBUG [org.jgroups.protocols.pbcast.GMS] initial_mbrs are [[own_addr=XXX.XXX.236.64:43996, coord_addr=XXX.XXX.236.64:43996, is_server=true]]
2011-01-07 08:56:05,468 DEBUG [org.jgroups.protocols.pbcast.GMS] election results: {XXX.XXX.236.64:43996=1}
2011-01-07 08:56:05,468 DEBUG [org.jgroups.protocols.pbcast.GMS] sending handleJoin(XXX.XXX.236.65:17224) to XXX.XXX.236.64:43996
2011-01-07 08:56:05,564 WARN [org.jgroups.protocols.pbcast.GMS] join(XXX.XXX.236.65:17224) sent to XXX.XXX.236.64:43996 timed out (after 3000 ms), retrying
(Over and over...)
First, the own_addr value in the first line has me puzzled. Shouldn't that be 236.65?
More importantly, should I be seeing more than just the one initial_mbr, if six (well, five right now) of the 10 hosts in the cluster are up and processing? (The other four handle those services that haven't been activated, yet.)
I got one of the developers on line, and he suggested emptying .../server/<config>/tmp. That didn't change any behaviors. Now, we've stopped this JBoss process and are letting it sit for 10 minutes before restarting it. Hopefully, that will be long enough for "all cluster nodes to exclude a killed node." (We do have graceful stop and start scripts for these processes. We've not had to kill -9 to stop them.)
Thanks!
David
-
3. JBoss 5.1.0.GA - org.jgroups.protocols.pbcast.GMS timed out.
dafydd2277 Jan 7, 2011 9:34 AM (in response to belaban)
And, letting this instance stay shut down for 10 minutes didn't change the behavior. Still digging...
-
4. JBoss 5.1.0.GA - org.jgroups.protocols.pbcast.GMS timed out.
belaban Jan 7, 2011 10:22 AM (in response to dafydd2277)
No, own_addr is the address of the member who sent the discovery response, in this case the coordinator to whom we will send the JOIN request.
Can you reproduce this scenario ?
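To make the reading of that DEBUG line concrete, here is a minimal sketch that pulls the responder (own_addr) and the coordinator it names (coord_addr) out of an "initial_mbrs are" line. The regular expression and the function name parse_initial_mbrs are assumptions built only from the single log line quoted earlier in this thread; a line with several responses or a different JGroups version may be formatted differently.

```python
import re

# One discovery response inside a GMS "initial_mbrs are [...]" DEBUG line.
# The pattern is inferred from the log line quoted in this thread (assumption).
RESPONSE_RE = re.compile(
    r"own_addr=(?P<own>[^,\]]+), "
    r"coord_addr=(?P<coord>[^,\]]+), "
    r"is_server=(?P<server>true|false)"
)


def parse_initial_mbrs(line):
    """Return (responder, coordinator, is_server) tuples found in the line.

    Returns an empty list if the line is not an initial_mbrs line.
    """
    if "initial_mbrs are" not in line:
        return []
    return [
        (m.group("own"), m.group("coord"), m.group("server") == "true")
        for m in RESPONSE_RE.finditer(line)
    ]


sample = (
    "2011-01-07 08:56:05,468 DEBUG [org.jgroups.protocols.pbcast.GMS] "
    "initial_mbrs are [[own_addr=XXX.XXX.236.64:43996, "
    "coord_addr=XXX.XXX.236.64:43996, is_server=true]]"
)

for responder, coordinator, is_server in parse_initial_mbrs(sample):
    # own_addr is the node that answered discovery, not the joining node.
    print(f"response from {responder}; it names {coordinator} as coordinator")
```

If the only responder is a host you believed was shut down, that host is the one to go look at.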
-
5. JBoss 5.1.0.GA - org.jgroups.protocols.pbcast.GMS timed out.
dafydd2277 Jan 7, 2011 10:29 AM (in response to belaban)
Really?! How weird. If 236.64 is shut down, how is it sending a discovery response? That's something for me to go investigate!
No, we've never reproduced this in house. Unfortunately.
-
6. JBoss 5.1.0.GA - org.jgroups.protocols.pbcast.GMS timed out.
dafydd2277 Jan 7, 2011 10:36 AM (in response to belaban)
And, never mind... Turns out, someone had started 236.64 at some point. We're going to shut that down, and see what 236.65 does with a new discovery response.
(D'oh!)
-
7. JBoss 5.1.0.GA - org.jgroups.protocols.pbcast.GMS timed out.
dafydd2277 Jan 7, 2011 10:51 AM (in response to dafydd2277)
Yah, that turned out to be it. I suppose the summary or lesson learned is: Verify that the JBoss instance sending the discovery reply is itself in a good state. Shut it down if necessary.
On to the next problem...
Thanks, Bela!
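For anyone hitting the same symptom, a rough sketch of how one might find which instance to check: count the repeated "join(...) sent to ... timed out" warnings per target address in server.log. The function name count_join_timeouts and the regular expression are assumptions based on the WARN line quoted in this thread, not part of JGroups or JBoss.

```python
import re
from collections import Counter

# Pattern inferred from the WARN line quoted in this thread (assumption).
JOIN_TIMEOUT_RE = re.compile(
    r"join\((?P<joiner>[^)]+)\) sent to (?P<coord>\S+) "
    r"timed out \(after (?P<ms>\d+) ms\)"
)


def count_join_timeouts(lines):
    """Count JOIN timeouts per coordinator address across the given log lines."""
    counts = Counter()
    for line in lines:
        match = JOIN_TIMEOUT_RE.search(line)
        if match:
            counts[match.group("coord")] += 1
    return counts


warn_line = (
    "2011-01-07 08:56:05,564 WARN [org.jgroups.protocols.pbcast.GMS] "
    "join(XXX.XXX.236.65:17224) sent to XXX.XXX.236.64:43996 "
    "timed out (after 3000 ms), retrying"
)

# Three retries against the same coordinator, plus one unrelated line.
log_lines = [warn_line, warn_line, "some other log line", warn_line]

for coord, n in count_join_timeouts(log_lines).most_common():
    print(f"{n} JOIN timeouts against {coord}: verify that instance is healthy")
```

An address that collects every timeout is the instance answering discovery but never completing the join, which in this thread was the one that had to be shut down.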