Cluster Errors
dansimmons Dec 15, 2005 11:08 AM
Hello,
I am attempting to set up a JBoss cluster of four instances across two servers, with two instances per server. Errors appear in the logs shortly after the cluster starts. Does anyone have suggestions as to why this is happening? Does my configuration look OK?
1) Server Configuration
App1
- OS = Red Hat Enterprise Linux ES release 4 (Nahant Update 1)
- Linux Kernel = 2.6.9-11.ELsmp
- IPv6 Disabled = alias net-pf-10 off >> modprobe.conf
- Multiple NICs, but only one (eth0) is used in the cluster. An alias (eth0:0) has been created for the second instance on the server.
- JDK = /usr/java/j2sdk1.4.2_10
- JBoss Version = jboss-3.2.7
=====
Cluster Configuration Steps
Config Node1
1) cd /opt/jboss-3.2.7/server
2) for i in node1 node2 ; do cp -a all $i ; done
3) cd node1/deploy
4) Add bind_addr to cluster-service.xml
Sample Config:
<!-- UDP: if you have a multihomed machine,
set the bind_addr attribute to the appropriate NIC IP address -->
<!-- UDP: On Windows machines, because of the media sense feature
being broken with multicast (even after disabling media sense)
set the loopback attribute to true -->
<UDP mcast_addr="228.1.2.3" mcast_port="45566" bind_addr="192.168.3.118"
ip_ttl="32" ip_mcast="true"
mcast_send_buf_size="800000" mcast_recv_buf_size="150000"
ucast_send_buf_size="800000" ucast_recv_buf_size="150000"
loopback="false" />
<PING timeout="2000" num_initial_members="3"
up_thread="true" down_thread="true" />
<MERGE2 min_interval="10000" max_interval="20000" />
<FD shun="true" up_thread="true" down_thread="true"
timeout="2500" max_tries="5" />
<VERIFY_SUSPECT timeout="3000" num_msgs="3"
up_thread="true" down_thread="true" />
<pbcast.NAKACK gc_lag="50" retransmit_timeout="300,600,1200,2400,4800"
max_xmit_size="8192"
up_thread="true" down_thread="true" />
<UNICAST timeout="300,600,1200,2400,4800" window_size="100" min_threshold="10"
down_thread="true" />
<pbcast.STABLE desired_avg_gossip="20000"
up_thread="true" down_thread="true" />
<FRAG frag_size="8192"
down_thread="true" up_thread="true" />
<pbcast.GMS join_timeout="5000" join_retry_timeout="2000"
shun="true" print_local_addr="true" />
<pbcast.STATE_TRANSFER up_thread="true" down_thread="true" />
=======================
5) Command to start node1.
/opt/jboss-3.2.7/bin/run.sh -c node1 -b 192.168.3.118
Config Node2
1) cd /opt/jboss-3.2.7/server/node2/deploy
4) Add bind_addr to cluster-service.xml
Sample Config:
<!-- UDP: if you have a multihomed machine,
set the bind_addr attribute to the appropriate NIC IP address -->
<!-- UDP: On Windows machines, because of the media sense feature
being broken with multicast (even after disabling media sense)
set the loopback attribute to true -->
<UDP mcast_addr="228.1.2.3" mcast_port="45566" bind_addr="192.168.3.218"
ip_ttl="32" ip_mcast="true"
mcast_send_buf_size="800000" mcast_recv_buf_size="150000"
ucast_send_buf_size="800000" ucast_recv_buf_size="150000"
loopback="false" />
<PING timeout="2000" num_initial_members="3"
up_thread="true" down_thread="true" />
<MERGE2 min_interval="10000" max_interval="20000" />
<FD shun="true" up_thread="true" down_thread="true"
timeout="2500" max_tries="5" />
<VERIFY_SUSPECT timeout="3000" num_msgs="3"
up_thread="true" down_thread="true" />
<pbcast.NAKACK gc_lag="50" retransmit_timeout="300,600,1200,2400,4800"
max_xmit_size="8192"
up_thread="true" down_thread="true" />
<UNICAST timeout="300,600,1200,2400,4800" window_size="100" min_threshold="10"
down_thread="true" />
<pbcast.STABLE desired_avg_gossip="20000"
up_thread="true" down_thread="true" />
<FRAG frag_size="8192"
down_thread="true" up_thread="true" />
<pbcast.GMS join_timeout="5000" join_retry_timeout="2000"
shun="true" print_local_addr="true" />
<pbcast.STATE_TRANSFER up_thread="true" down_thread="true" />
=======================
5) Command to start node2.
/opt/jboss-3.2.7/bin/run.sh -c node2 -b 192.168.3.218
=======================
App2
- OS = Red Hat Enterprise Linux ES release 4 (Nahant Update 1)
- Linux Kernel = 2.6.9-11.ELsmp
- IPv6 Disabled = alias net-pf-10 off >> modprobe.conf
- Multiple NICs, but only one (eth0) is used in the cluster. An alias (eth0:0) has been created for the second instance on the server.
- JDK = /usr/java/j2sdk1.4.2_10
- JBoss Version = jboss-3.2.7
=====
Cluster Configuration Steps
Config Node3
1) cd /opt/jboss-3.2.7/server
2) for i in node3 node4 ; do cp -a all $i ; done
3) cd node3/deploy
4) Add bind_addr to cluster-service.xml
Sample Config:
<!-- UDP: if you have a multihomed machine,
set the bind_addr attribute to the appropriate NIC IP address -->
<!-- UDP: On Windows machines, because of the media sense feature
being broken with multicast (even after disabling media sense)
set the loopback attribute to true -->
<UDP mcast_addr="228.1.2.3" mcast_port="45566" bind_addr="192.168.3.119"
ip_ttl="32" ip_mcast="true"
mcast_send_buf_size="800000" mcast_recv_buf_size="150000"
ucast_send_buf_size="800000" ucast_recv_buf_size="150000"
loopback="false" />
<PING timeout="2000" num_initial_members="3"
up_thread="true" down_thread="true" />
<MERGE2 min_interval="10000" max_interval="20000" />
<FD shun="true" up_thread="true" down_thread="true"
timeout="2500" max_tries="5" />
<VERIFY_SUSPECT timeout="3000" num_msgs="3"
up_thread="true" down_thread="true" />
<pbcast.NAKACK gc_lag="50" retransmit_timeout="300,600,1200,2400,4800"
max_xmit_size="8192"
up_thread="true" down_thread="true" />
<UNICAST timeout="300,600,1200,2400,4800" window_size="100" min_threshold="10"
down_thread="true" />
<pbcast.STABLE desired_avg_gossip="20000"
up_thread="true" down_thread="true" />
<FRAG frag_size="8192"
down_thread="true" up_thread="true" />
<pbcast.GMS join_timeout="5000" join_retry_timeout="2000"
shun="true" print_local_addr="true" />
<pbcast.STATE_TRANSFER up_thread="true" down_thread="true" />
=======================
5) Command to start node3.
/opt/jboss-3.2.7/bin/run.sh -c node3 -b 192.168.3.119
Config Node4
1) cd /opt/jboss-3.2.7/server/node4/deploy
4) Add bind_addr to cluster-service.xml
Sample Config:
<!-- UDP: if you have a multihomed machine,
set the bind_addr attribute to the appropriate NIC IP address -->
<!-- UDP: On Windows machines, because of the media sense feature
being broken with multicast (even after disabling media sense)
set the loopback attribute to true -->
<UDP mcast_addr="228.1.2.3" mcast_port="45566" bind_addr="192.168.3.219"
ip_ttl="32" ip_mcast="true"
mcast_send_buf_size="800000" mcast_recv_buf_size="150000"
ucast_send_buf_size="800000" ucast_recv_buf_size="150000"
loopback="false" />
<PING timeout="2000" num_initial_members="3"
up_thread="true" down_thread="true" />
<MERGE2 min_interval="10000" max_interval="20000" />
<FD shun="true" up_thread="true" down_thread="true"
timeout="2500" max_tries="5" />
<VERIFY_SUSPECT timeout="3000" num_msgs="3"
up_thread="true" down_thread="true" />
<pbcast.NAKACK gc_lag="50" retransmit_timeout="300,600,1200,2400,4800"
max_xmit_size="8192"
up_thread="true" down_thread="true" />
<UNICAST timeout="300,600,1200,2400,4800" window_size="100" min_threshold="10"
down_thread="true" />
<pbcast.STABLE desired_avg_gossip="20000"
up_thread="true" down_thread="true" />
<FRAG frag_size="8192"
down_thread="true" up_thread="true" />
<pbcast.GMS join_timeout="5000" join_retry_timeout="2000"
shun="true" print_local_addr="true" />
<pbcast.STATE_TRANSFER up_thread="true" down_thread="true" />
=======================
5) Command to start node4.
/opt/jboss-3.2.7/bin/run.sh -c node4 -b 192.168.3.219
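To double-check that each instance really has its own bind_addr in cluster-service.xml, something like the following could be run on each server. This is only a minimal sketch: the function name show_bind_addr is made up for illustration, and it assumes the attribute is written as bind_addr="..." on a single line, as in the sample configs above.

```shell
# Print the bind_addr found in each cluster-service.xml given as an argument.
# show_bind_addr is a hypothetical helper name, not part of JBoss or JGroups.
show_bind_addr() {
  for f in "$@"; do
    # Take the first bind_addr="..." attribute in the file, if any.
    addr=$(sed -n 's/.*bind_addr="\([^"]*\)".*/\1/p' "$f" | head -n 1)
    printf '%s %s\n' "$f" "${addr:-<none>}"
  done
}

# Example (paths follow the directory layout described above):
# show_bind_addr /opt/jboss-3.2.7/server/node*/deploy/cluster-service.xml
```

Each output line pairs a file with the address it binds to, so a copy that still lacks the attribute shows up as <none>.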
Errors on App2, Node3
====================
2005-12-15 08:51:26,730 WARN [org.jgroups.protocols.pbcast.NAKACK] [192.168.3.119:46745] discarded message from non-member 192.168.3.118:34330
2005-12-15 08:51:33,552 WARN [org.jgroups.protocols.pbcast.NAKACK] [192.168.3.119:46745] discarded message from non-member 192.168.3.218:34327
2005-12-15 08:51:34,786 WARN [org.jgroups.protocols.pbcast.NAKACK] [192.168.3.119:46736 (additional data: 18 bytes)] discarded message from non-member 192.168.3.218:34283 (additional data: 18 bytes)
2005-12-15 08:51:38,866 WARN [org.jgroups.protocols.pbcast.NAKACK] [192.168.3.119:46745] discarded message from non-member 192.168.3.118:34330
2005-12-15 08:51:46,425 WARN [org.jgroups.protocols.pbcast.NAKACK] [192.168.3.119:46736 (additional data: 18 bytes)] discarded message from non-member 192.168.3.219:46739 (additional data: 18 bytes)
2005-12-15 08:51:57,469 WARN [org.jgroups.protocols.pbcast.NAKACK] [192.168.3.119:46736 (additional data: 18 bytes)] discarded message from non-member 192.168.3.219:46739 (additional data: 18 bytes)
2005-12-15 08:52:05,034 WARN [org.jgroups.protocols.pbcast.NAKACK] [192.168.3.119:46745] discarded message from non-member 192.168.3.118:34330
2005-12-15 08:52:05,221 WARN [org.jgroups.protocols.pbcast.NAKACK] [192.168.3.119:46745] discarded message from non-member 192.168.3.218:34327
2005-12-15 08:52:08,667 WARN [org.jgroups.protocols.pbcast.NAKACK] [192.168.3.119:46736 (additional data: 18 bytes)] discarded message from non-member 192.168.3.118:34280 (additional data: 18 bytes)
2005-12-15 08:52:13,381 WARN [org.jgroups.protocols.pbcast.NAKACK] [192.168.3.119:46736 (additional data: 18 bytes)] discarded message from non-member 192.168.3.218:34283 (additional data: 18 bytes)
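To see which senders the "discarded message from non-member" warnings come from, the log can be tallied per sender address. The sketch below is a rough helper, not a JBoss or JGroups tool: the function name count_nonmember_senders is hypothetical, and it assumes each warning sits on one line in the format shown above.

```shell
# Count "discarded message from non-member" warnings per sender address.
# Reads log text on stdin; prints "<count> <sender>" lines, highest count first.
# count_nonmember_senders is a hypothetical helper name.
count_nonmember_senders() {
  awk '/discarded message from non-member/ {
         # The sender address is the field right after "non-member".
         for (i = 1; i < NF; i++)
           if ($i == "non-member") { count[$(i + 1)]++; break }
       }
       END { for (s in count) print count[s], s }' | sort -k1,1nr -k2,2
}

# Example (the log path is an assumption about the node3 layout):
# count_nonmember_senders < /opt/jboss-3.2.7/server/node3/log/server.log
```

Grouping by sender address and port makes it easier to tell whether the discarded messages come from one channel or from several.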
==========
Errors on App1, Node2
2005-12-15 08:56:58,519 INFO [org.jboss.cache.TreeCache] returning the transient state (217 bytes)
2005-12-15 08:57:00,243 WARN [org.jgroups.protocols.pbcast.NAKACK] [192.168.3.218:34334] discarded message from non-member 192.168.3.118:34330
2005-12-15 08:57:00,245 ERROR [org.jgroups.protocols.pbcast.NAKACK] sender at index 3 in digest is null
2005-12-15 08:57:00,246 INFO [org.jboss.cache.TreeCache] viewAccepted(): new members: [192.168.3.118:34330, 192.168.3.118:34337, 192.168.3.218:34334, 192.168.3.219:46748]
2005-12-15 08:57:01,375 ERROR [org.jgroups.protocols.pbcast.GMS] [192.168.3.218:34334] received view <= current view; discarding it (current vid: [192.168.3.118:34330|25], new vid: [192.168.3.219:46748|24])
2005-12-15 08:57:02,880 INFO [org.jboss.cache.TreeCache] viewAccepted(): new members: [192.168.3.118:34337, 192.168.3.218:34334, 192.168.3.219:46748]
2005-12-15 08:57:08,675 DEBUG [org.jboss.ha.framework.server.DistributedReplicantManagerImpl.DefaultPartition] _add(jboss.ha:service=HASingletonDeployer, 192.168.3.219:1099
2005-12-15 08:57:08,675 DEBUG [org.jboss.ha.framework.server.DistributedReplicantManagerImpl.DefaultPartition] notifyKeyListeners
2005-12-15 08:57:08,675 DEBUG [org.jboss.ha.framework.server.DistributedReplicantManagerImpl.DefaultPartition] notifying 1 listeners for key change: jboss.ha:service=HASingletonDeployer
2005-12-15 08:57:08,676 DEBUG [org.jboss.ha.singleton.HASingletonController] partitionTopologyChanged, isElectedNewMaster=false, isMasterNode=false, viewID=996459982
2005-12-15 08:57:08,679 DEBUG [org.jboss.ha.singleton.HASingletonController] _stopOldMaster, isMasterNode=false
2005-12-15 08:57:08,792 DEBUG [org.jboss.ha.framework.server.DistributedReplicantManagerImpl.DefaultPartition] _add(HAJNDI, 192.168.3.219:1099
2005-12-15 08:57:08,792 DEBUG [org.jboss.ha.framework.server.DistributedReplicantManagerImpl.DefaultPartition] notifyKeyListeners
2005-12-15 08:57:08,792 DEBUG [org.jboss.ha.framework.server.DistributedReplicantManagerImpl.DefaultPartition] notifying 1 listeners for key change: HAJNDI
2005-12-15 08:57:08,792 DEBUG [org.jboss.ha.framework.server.HARMIServerImpl$RefreshProxiesHATarget] replicantsChanged 'HAJNDI' to 3 (intra-view id: 996459982)
2005-12-15 08:57:08,793 DEBUG [org.jboss.ha.framework.server.DistributedReplicantManagerImpl.DefaultPartition] _add(DCacheBridge-DefaultJGBridge, 192.168.3.219:1099
2005-12-15 08:57:08,793 DEBUG [org.jboss.ha.framework.server.DistributedReplicantManagerImpl.DefaultPartition] notifyKeyListeners
2005-12-15 08:57:08,793 DEBUG [org.jboss.ha.framework.server.DistributedReplicantManagerImpl.DefaultPartition] notifying 1 listeners for key change: DCacheBridge-DefaultJGBridge
2005-12-15 08:57:09,398 ERROR [org.jgroups.protocols.pbcast.NAKACK] sender at index 3 in digest is null
2005-12-15 08:57:09,399 INFO [org.jboss.cache.TreeCache] viewAccepted(): new members: [192.168.3.118:34330, 192.168.3.118:34337, 192.168.3.218:34334, 192.168.3.219:46748]
==========
Errors on App1, Node1
2005-12-15 08:58:16,020 ERROR [org.jgroups.protocols.pbcast.NAKACK] (requester=192.168.3.118:34337, local_addr=192.168.3.118:34330) message with seqno=10 not found in sent_msgs ! sent_msgs=[12 - 14]
2005-12-15 08:58:16,020 ERROR [org.jgroups.protocols.pbcast.NAKACK] (requester=192.168.3.118:34337, local_addr=192.168.3.118:34330) message with seqno=11 not found in sent_msgs ! sent_msgs=[12 - 14]
2005-12-15 08:58:16,020 ERROR [org.jgroups.protocols.pbcast.NAKACK] (requester=192.168.3.218:34334, local_addr=192.168.3.118:34330) message with seqno=0 not found in sent_msgs ! sent_msgs=[12 - 14]
2005-12-15 08:58:16,020 ERROR [org.jgroups.protocols.pbcast.NAKACK] (requester=192.168.3.218:34334, local_addr=192.168.3.118:34330) message with seqno=1 not found in sent_msgs ! sent_msgs=[12 - 14]
2005-12-15 08:58:16,021 ERROR [org.jgroups.protocols.pbcast.NAKACK] (requester=192.168.3.218:34334, local_addr=192.168.3.118:34330) message with seqno=2 not found in sent_msgs ! sent_msgs=[12 - 14]
2005-12-15 08:58:16,021 ERROR [org.jgroups.protocols.pbcast.NAKACK] (requester=192.168.3.218:34334, local_addr=192.168.3.118:34330) message with seqno=3 not found in sent_msgs ! sent_msgs=[12 - 14]
2005-12-15 08:58:16,021 ERROR [org.jgroups.protocols.pbcast.NAKACK] (requester=192.168.3.218:34334, local_addr=192.168.3.118:34330) message with seqno=4 not found in sent_msgs ! sent_msgs=[12 - 14]
2005-12-15 08:58:16,021 ERROR [org.jgroups.protocols.pbcast.NAKACK] (requester=192.168.3.218:34334, local_addr=192.168.3.118:34330) message with seqno=5 not found in sent_msgs ! sent_msgs=[12 - 14]
2005-12-15 08:58:16,021 ERROR [org.jgroups.protocols.pbcast.NAKACK] (requester=192.168.3.218:34334, local_addr=192.168.3.118:34330) message with seqno=6 not found in sent_msgs ! sent_msgs=[12 - 14]
2005-12-15 08:58:16,021 ERROR [org.jgroups.protocols.pbcast.NAKACK] (requester=192.168.3.218:34334, local_addr=192.168.3.118:34330) message with seqno=7 not found in sent_msgs ! sent_msgs=[12 - 14]
2005-12-15 08:58:16,021 ERROR [org.jgroups.protocols.pbcast.NAKACK] (requester=192.168.3.218:34334, local_addr=192.168.3.118:34330) message with seqno=8 not found in sent_msgs ! sent_msgs=[12 - 14]
2005-12-15 08:58:16,021 ERROR [org.jgroups.protocols.pbcast.NAKACK] (requester=192.168.3.218:34334, local_addr=192.168.3.118:34330) message with seqno=9 not found in sent_msgs ! sent_msgs=[12 - 14]
2005-12-15 08:58:16,021 ERROR [org.jgroups.protocols.pbcast.NAKACK] (requester=192.168.3.218:34334, local_addr=192.168.3.118:34330) message with seqno=10 not found in sent_msgs ! sent_msgs=[12 - 14]
2005-12-15 08:58:16,021 ERROR [org.jgroups.protocols.pbcast.NAKACK] (requester=192.168.3.218:34334, local_addr=192.168.3.118:34330) message with seqno=11 not found in sent_msgs ! sent_msgs=[12 - 14]
2005-12-15 08:58:20,820 ERROR [org.jgroups.protocols.pbcast.NAKACK] (requester=192.168.3.118:34337, local_addr=192.168.3.118:34330) message with seqno=0 not found in sent_msgs ! sent_msgs=[12 - 14]
2005-12-15 08:58:20,820 ERROR [org.jgroups.protocols.pbcast.NAKACK] (requester=192.168.3.118:34337, local_addr=192.168.3.118:34330) message with seqno=1 not found in sent_msgs ! sent_msgs=[12 - 14]
===========
Errors on App2, Node4
2005-12-15 08:57:04,471 DEBUG [org.jboss.ha.framework.server.DistributedReplicantManagerImpl.DefaultPartition] notifyKeyListeners
2005-12-15 08:57:04,471 DEBUG [org.jboss.ha.framework.server.DistributedReplicantManagerImpl.DefaultPartition] notifying 1 listeners for key change: DCacheBridge-DefaultJGBridge
2005-12-15 08:57:04,472 DEBUG [org.jboss.ha.framework.server.DistributedReplicantManagerImpl.DefaultPartition] End Re-Publish local replicants
2005-12-15 08:57:05,080 ERROR [org.jgroups.protocols.pbcast.NAKACK] sender at index 3 in digest is null
2005-12-15 08:57:05,081 INFO [org.jboss.cache.TreeCache] viewAccepted(): new members: [192.168.3.118:34330, 192.168.3.118:34337, 192.168.3.218:34334, 192.168.3.219:46748]
2005-12-15 08:58:48,262 DEBUG [org.jboss.resource.connectionmanager.IdleRemover] run: IdleRemover notifying pools, interval: 150000
2005-12-15 09:00:12,335 ERROR [org.jgroups.protocols.FD_SOCK] socket address for 192.168.3.118:34330 could not be fetched, retrying
2005-12-15 09:00:20,642 ERROR [org.jgroups.protocols.FD_SOCK] socket address for 192.168.3.118:34330 could not be fetched, retrying
2005-12-15 09:00:28,949 ERROR [org.jgroups.protocols.FD_SOCK] socket address for 192.168.3.118:34330 could not be fetched, retrying
========
Thanks!!
Dan Simmons
Enterprise Linux Systems Engineer
Rackspace Managed Hosting
Toll Free: 800-961-2888, ext 4642
Direct: 210-447-4642
Mobile: 832-217-9506
dsimmons@rackspace.com