7 Replies Latest reply on Jan 7, 2011 10:51 AM by dafydd2277

JBoss 5.1.0.GA - org.jgroups.protocols.pbcast.GMS timed out.

dafydd2277 Jan 7, 2011 8:10 AM

Good Morning!

I'll start by saying that I wouldn't even qualify as a "trained monkey" in JBoss. Someone else designed this cluster; I just get to support it.

For context: We have two parallel clusters of 10 servers each at this customer site. However, not all 10 servers of a given cluster are running because some of the associated applications have not yet been turned on from the customer side. (That is, there's no point in our monitoring those queues if they're not yet running.)

I'm sure many of you have seen similar WARNings in your server.log:

---------------------------------------------------------

GMS: address is XXX.XXX.236.65:20629 (cluster=<CLUSTER_NAME>-HAPartitionCache)

---------------------------------------------------------

2011-01-07 06:57:07,769 WARN [org.jgroups.protocols.pbcast.GMS] join(XXX.XXX.236.65:20629) sent to XXX.XXX.236.64:43996 timed out (after 3000 ms), retrying

2011-01-07 06:57:07,855 WARN [org.jgroups.protocols.pbcast.GMS] join(XXX.XXX.236.65:20629) sent to XXX.XXX.236.64:43996 timed out (after 3000 ms), retrying

2011-01-07 06:57:10,774 WARN [org.jgroups.protocols.pbcast.GMS] join(XXX.XXX.236.65:20629) sent to XXX.XXX.236.64:43996 timed out (after 3000 ms), retrying

"And the messages go on forever..."

Obviously, 236.65 is the local host, trying to connect to the next host up (236.64). However, 236.64 is one of those hosts that is currently turned off.

How do I tell this member of the cluster that timeouts to another member mean it's time to give up and try the next host on the list? I don't mind getting the timeouts. I do mind that it doesn't give up trying!

Thanks!

David

1. JBoss 5.1.0.GA - org.jgroups.protocols.pbcast.GMS timed out.

belaban Jan 7, 2011 8:29 AM (in response to dafydd2277)

Is this with JBoss 5.x? JBoss 5.x uses JGroups 2.6, which can run into reincarnation issues. Switch to JBoss 6 if you can (you probably can't!).

If not, then make sure you wait for all cluster nodes to exclude a killed node before you restart that node. Or, if you can, shut down the node gracefully before you restart it. Kill -9 in conjunction with a high timeout in FD leads to reincarnation issues...

Googling for "jgroups reincarnation" should shed some light on this, too.
2. JBoss 5.1.0.GA - org.jgroups.protocols.pbcast.GMS timed out.

dafydd2277 Jan 7, 2011 9:19 AM (in response to belaban)

Hi, Bela,

Thanks for the suggestions. No, sadly, I can't upgrade JBoss.

Your suggestion to start searching for "jgroups reincarnation" led me to http://community.jboss.org/wiki/HandleJoinProblem, which led me to adding a new <category> in jboss-log4j.xml:

<category name="org.jgroups.protocols.pbcast.GMS">
    <priority value="DEBUG" />
</category>

That gave me this output:

2011-01-07 08:56:05,468 DEBUG [org.jgroups.protocols.pbcast.GMS] initial_mbrs are [[own_addr=XXX.XXX.236.64:43996, coord_addr=XXX.XXX.236.64:43996, is_server=true]]
2011-01-07 08:56:05,468 DEBUG [org.jgroups.protocols.pbcast.GMS] election results: {XXX.XXX.236.64:43996=1}
2011-01-07 08:56:05,468 DEBUG [org.jgroups.protocols.pbcast.GMS] sending handleJoin(XXX.XXX.236.65:17224) to XXX.XXX.236.64:43996
2011-01-07 08:56:05,564 WARN [org.jgroups.protocols.pbcast.GMS] join(XXX.XXX.236.65:17224) sent to XXX.XXX.236.64:43996 timed out (after 3000 ms), retrying

(Over and over...)

First, the own_addr value in the first line has me puzzled. Shouldn't that be 236.65?

More importantly, should I be seeing more than just the one initial_mbr if six (well, five right now) of the 10 hosts in the cluster are up and processing? (The other four handle the services that haven't been activated yet.)

I got one of the developers on line, and he suggested emptying .../server/<config>/tmp. That didn't change any behaviors. Now, we've stopped this JBoss process and are letting it sit for 10 minutes before restarting it. Hopefully, that will be long enough for "all cluster nodes to exclude a killed node." (We do have graceful stop and start scripts for these processes. We've not had to kill -9 to stop them.)

Thanks!
David
3. JBoss 5.1.0.GA - org.jgroups.protocols.pbcast.GMS timed out.

dafydd2277 Jan 7, 2011 9:34 AM (in response to belaban)

And, letting this instance stay shut down for 10 minutes didn't change the behavior. Still digging...
4. JBoss 5.1.0.GA - org.jgroups.protocols.pbcast.GMS timed out.

belaban Jan 7, 2011 10:22 AM (in response to dafydd2277)

No, own_addr is the address of the member who sent the discovery response, in this case the coordinator to whom we will send the JOIN request.
Can you reproduce this scenario?
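
To see which member actually answered discovery, the initial_mbrs DEBUG line can be picked apart mechanically. The following is only a minimal Python sketch, assuming the log format quoted earlier in this thread; the function name parse_initial_mbrs and the regular expression are illustrative and not part of JGroups or JBoss.

```python
import re

# Matches one discovery response inside a GMS "initial_mbrs are [...]" DEBUG line.
# Assumption: the line format is exactly as quoted in this thread.
RESPONSE = re.compile(
    r"own_addr=(?P<own>[^,\]]+), "
    r"coord_addr=(?P<coord>[^,\]]+), "
    r"is_server=(?P<server>true|false)"
)


def parse_initial_mbrs(line):
    """Return (responder, coordinator, is_server) tuples found in one log line."""
    if "initial_mbrs are" not in line:
        return []
    return [
        (m.group("own"), m.group("coord"), m.group("server") == "true")
        for m in RESPONSE.finditer(line)
    ]


sample = (
    "2011-01-07 08:56:05,468 DEBUG [org.jgroups.protocols.pbcast.GMS] "
    "initial_mbrs are [[own_addr=XXX.XXX.236.64:43996, "
    "coord_addr=XXX.XXX.236.64:43996, is_server=true]]"
)

for responder, coordinator, is_server in parse_initial_mbrs(sample):
    # own_addr is the member that sent the response, not the local node.
    print(f"{responder} answered discovery and names {coordinator} as coordinator")
```

If a host that is supposed to be shut down shows up as the responder, that host is worth checking before anything else.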
5. JBoss 5.1.0.GA - org.jgroups.protocols.pbcast.GMS timed out.

dafydd2277 Jan 7, 2011 10:29 AM (in response to belaban)

Really?! How weird. If 236.64 is shut down, how is it sending a discovery response? That's something for me to go investigate!

No, we've never reproduced this in house. Unfortunately.
6. JBoss 5.1.0.GA - org.jgroups.protocols.pbcast.GMS timed out.

dafydd2277 Jan 7, 2011 10:36 AM (in response to belaban)

And, never mind... Turns out, someone had started 236.64 at some point. We're going to shut that down, and see what 236.65 does with a new discovery response.

(D'oh!)
7. JBoss 5.1.0.GA - org.jgroups.protocols.pbcast.GMS timed out.

dafydd2277 Jan 7, 2011 10:51 AM (in response to dafydd2277)

Yah, that turned out to be it. I suppose the summary or lesson learned is: Verify that the JBoss instance sending the discovery reply is itself in a good state. Shut it down if necessary.
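
As a rough aid for that check, the join-timeout warnings can be tallied per target address to show which instance to inspect first. This is a sketch only, assuming the WARN format quoted at the top of this thread; join_timeout_targets is a hypothetical helper name, and the sample lines are copied from the log excerpt above.

```python
import re
from collections import Counter

# Matches the GMS join-timeout WARN lines quoted earlier in this thread.
# Assumption: the message text is exactly "join(<joiner>) sent to <target> timed out".
JOIN_TIMEOUT = re.compile(
    r"join\((?P<joiner>[^)]+)\) sent to (?P<target>\S+) timed out"
)


def join_timeout_targets(lines):
    """Count join timeouts per target address (the member the JOIN was sent to)."""
    counts = Counter()
    for line in lines:
        match = JOIN_TIMEOUT.search(line)
        if match:
            counts[match.group("target")] += 1
    return counts


sample_log = [
    "2011-01-07 06:57:07,769 WARN [org.jgroups.protocols.pbcast.GMS] "
    "join(XXX.XXX.236.65:20629) sent to XXX.XXX.236.64:43996 timed out "
    "(after 3000 ms), retrying",
    "2011-01-07 06:57:10,774 WARN [org.jgroups.protocols.pbcast.GMS] "
    "join(XXX.XXX.236.65:20629) sent to XXX.XXX.236.64:43996 timed out "
    "(after 3000 ms), retrying",
    "2011-01-07 08:56:05,468 DEBUG [org.jgroups.protocols.pbcast.GMS] "
    "election results: {XXX.XXX.236.64:43996=1}",
]

for target, count in join_timeout_targets(sample_log).most_common():
    print(f"{count} join timeouts against {target}: check that instance first")
```

Against a real server.log, the same function can be fed an open file object, since it only iterates over lines.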

On to the next problem...

Thanks, Bela!