5 Replies Latest reply on Sep 21, 2007 12:18 PM by belaban

Cluster merge issues

jbossmk Sep 19, 2007 7:01 AM

We have a cluster of nodes deployed on the same machine. The cluster-service.xml contains the following snippet:

<TCP bind_addr="localhost" start_port="${jboss.cluster.tcp.port:7800}" loopback="true"/>
<TCPPING initial_hosts="localhost[${jboss.cluster.tcp.port:7800}]" port_range="${jboss.cluster.tcp.port.range:5}" timeout="3500"
num_initial_members="${jboss.cluster.tcp.members:5}" up_thread="true" down_thread="true"/>
<MERGE2 min_interval="5000" max_interval="10000"/>
<FD shun="true" timeout="5000" max_tries="5" up_thread="false" down_thread="false" />
<VERIFY_SUSPECT timeout="4000" down_thread="false" up_thread="false" />
<pbcast.NAKACK down_thread="true" up_thread="true" gc_lag="100"
retransmit_timeout="3000"/>
<pbcast.STABLE desired_avg_gossip="20000" down_thread="false" up_thread="false" />
<pbcast.GMS join_timeout="5000" join_retry_timeout="2000" shun="false"
print_local_addr="true" down_thread="true" up_thread="true"/>
<pbcast.STATE_TRANSFER up_thread="true" down_thread="true"/>

When a split happens, the nodes from the secondary partition don't merge back at all. We have to restart the node every time this happens.
Could someone tell me if there is anything wrong in the configuration?
Would setting shun="true" in GMS change the behavior? I also heard that the JGroups channel's AUTO_RECONNECT option should be set to true programmatically; how do we do that declaratively?

Your help is appreciated.

Thanks.

1. Re: Cluster merge issues

belaban Sep 19, 2007 7:05 AM (in response to jbossmk)

#1 Check that 'localhost' really resolves to the correct address (e.g. not to 127.0.0.1) on *all* hosts.
#2 You can't set AUTO_RECONNECT declaratively; use the following code to do this:

JChannel ch; // your existing channel instance
ch.setOpt(Channel.AUTO_RECONNECT, Boolean.TRUE);

Possibly also:
ch.setOpt(Channel.AUTO_GETSTATE, Boolean.TRUE);
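For point #1, a few lines of plain Java are enough to see how 'localhost' resolves on a given host. This is only a sketch: the class name LocalhostCheck and its helper method are made up for illustration and are not part of JGroups.

```java
import java.io.UncheckedIOException;
import java.net.InetAddress;
import java.net.UnknownHostException;

// Hypothetical helper (not part of JGroups): reports how a host name resolves.
final class LocalhostCheck {

    // Returns true if the name resolves to a loopback address such as 127.0.0.1.
    static boolean isLoopback(String host) {
        try {
            return InetAddress.getByName(host).isLoopbackAddress();
        } catch (UnknownHostException e) {
            // Surface resolution failures without forcing callers to catch a checked exception.
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // Default to the name used in bind_addr in the posted configuration.
        String host = args.length > 0 ? args[0] : "localhost";
        System.out.println(host + " resolves to a loopback address: " + isLoopback(host));
    }
}
```

Running it on each host, with the same name that is used in bind_addr and initial_hosts, shows whether that name maps to the loopback interface there.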
2. Re: Cluster merge issues

jbossmk Sep 19, 2007 7:12 AM (in response to jbossmk)

The "localhost" resolves to a valid domain, and all of the nodes (five of them) run on a single box. We are using JGroups 2.4.1.

Is the configuration okay otherwise?

Why does the split happen at all when all the nodes are running on the same box? Could it be because of long GC pauses? Could there be any other reasons?

Please let us know.

Thanks.
3. Re: Cluster merge issues

belaban Sep 19, 2007 7:26 AM (in response to jbossmk)

Use FD_SOCK instead of, or on top of, FD (see http://wiki.jboss.org/wiki/Wiki.jsp?page=FDVersusFD_SOCK for details). Suspicions can happen for a number of reasons, e.g. garbage collection or an up queue blocked by a callback; these are also explained there.
4. Re: Cluster merge issues

jbossmk Sep 21, 2007 12:16 PM (in response to jbossmk)

Thanks for your suggestion Bela.

If I use FD_SOCK on top of FD, what happens when FD has timed out after all its retries but the FD_SOCK socket between the nodes is still active? Would FD still send a SUSPECT message?

Thanks in advance.
5. Re: Cluster merge issues

belaban Sep 21, 2007 12:18 PM (in response to jbossmk)

Yes, FD would still send a SUSPECT message, so set the timeout in FD to a sufficiently high value.
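To get a feel for what "sufficiently high" means, a rough estimate helps. The sketch below assumes that FD suspects a member once max_tries consecutive heartbeats have gone unanswered, each waited on for timeout milliseconds, and that VERIFY_SUSPECT then adds its own timeout before the suspicion is confirmed. The class and method names are made up for illustration, and the exact timing may differ between JGroups versions.

```java
// Hypothetical calculator (not part of JGroups) for rough failure-detection timing.
final class FdTiming {

    // Assumed model: FD raises a suspicion after maxTries missed heartbeats,
    // each of which is waited on for timeoutMillis.
    static long suspectAfterMillis(long timeoutMillis, int maxTries) {
        return timeoutMillis * maxTries;
    }

    // Assumed model: VERIFY_SUSPECT double-checks for verifyTimeoutMillis
    // before the suspicion is passed on.
    static long confirmedAfterMillis(long timeoutMillis, int maxTries, long verifyTimeoutMillis) {
        return suspectAfterMillis(timeoutMillis, maxTries) + verifyTimeoutMillis;
    }

    public static void main(String[] args) {
        // Values taken from the configuration in the original post.
        long fdTimeout = 5000;
        int maxTries = 5;
        long verifyTimeout = 4000;
        System.out.println("FD suspects after (ms): " + suspectAfterMillis(fdTimeout, maxTries));
        System.out.println("Suspicion confirmed after (ms): "
                + confirmedAfterMillis(fdTimeout, maxTries, verifyTimeout));
    }
}
```

Under these assumptions, the posted values (timeout="5000", max_tries="5") give FD about 25 seconds before it suspects a member, so a node that stays unresponsive longer than that would be suspected even while its FD_SOCK connection is still up.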