I am seeing a cluster split in our production JBoss environment and wondering why the merging process is not working. The configuration is as follows:
- JBoss: 4.0.3SP1
- OS: Solaris
- Cluster: there are 5 server instances on the same box. TCP is used as transport:
<TCP bind_addr="localhost" start_port="${jboss.cluster.tcp.port:7800}" loopback="true"/>
<TCPPING initial_hosts="localhost[${jboss.cluster.tcp.port:7800}]" port_range="${jboss.cluster.tcp.port.range:5}" timeout="3500"
num_initial_members="${jboss.cluster.tcp.members:5}" up_thread="true" down_thread="true"/>
<MERGE2 min_interval="5000" max_interval="10000"/>
<FD shun="true" timeout="5000" max_tries="5" up_thread="false" down_thread="false" />
<VERIFY_SUSPECT timeout="4000" down_thread="false" up_thread="false" />
<pbcast.NAKACK down_thread="true" up_thread="true" gc_lag="100"
retransmit_timeout="3000"/>
<pbcast.STABLE desired_avg_gossip="20000" down_thread="false" up_thread="false" />
<pbcast.GMS join_timeout="5000" join_retry_timeout="2000" shun="false"
print_local_addr="true" down_thread="true" up_thread="true"/>
<pbcast.STATE_TRANSFER up_thread="true" down_thread="true"/>
4 instances have formed one cluster and 1 instance has formed another. The following views are from the JMX console:
1) CurrentView java.util.Vector R [162.111.75.85:3599, 162.111.75.85:3799, 162.111.75.85:3499, 162.111.75.85:3899]
2) CurrentView java.util.Vector R [162.111.75.85:3699]
Why doesn't the merge process work?
Is the TCP configuration wrong somewhere? The clustering had been working fine for a few weeks, and it split after a restart today.
I recommend replacing JGroups 2.2.7, which ships with JBoss 4.0.3, with 2.4.1.
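Separately from the upgrade, it may be worth checking whether TCPPING can reach every instance. MERGE2 relies on the discovery protocol (TCPPING here) to find the coordinators of other partitions, so a merge can only happen if initial_hosts together with port_range covers the ports of the other members. The views you posted show ports roughly 100 apart (3499 to 3899), while port_range defaults to 5. If each instance sets jboss.cluster.tcp.port to its own port (an assumption, since the post doesn't show how the properties are set), each one may only be probing itself and a few adjacent ports. A minimal sketch that lists every member explicitly, with the ports taken from the two views above:

```xml
<!-- Sketch only: ports are copied from the posted views; adjust them to the ports your instances actually listen on. -->
<TCPPING initial_hosts="localhost[3499],localhost[3599],localhost[3699],localhost[3799],localhost[3899]"
         port_range="1" timeout="3500"
         num_initial_members="${jboss.cluster.tcp.members:5}" up_thread="true" down_thread="true"/>
```

With every member listed, port_range can stay small. The same initial_hosts list would go into the configuration of all 5 instances, so that whichever instance ends up isolated can still find the coordinator of the other partition.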