1 Reply Latest reply on Apr 27, 2006 8:05 AM by anubisman

    Cluster Errors

    dansimmons

      Hello,

      I am attempting to set up a JBoss cluster with a total of four instances on two servers, with each server running two instances. I see errors shortly after starting the cluster. Does anyone have any suggestions as to why this is happening? Does my configuration look OK?

      1) Server Configuration

      App1
      - OS = Red Hat Enterprise Linux ES release 4 (Nahant Update 1)
      - Linux Kernel = 2.6.9-11.ELsmp
      - IPv6 Disabled = alias net-pf-10 off >> modprobe.conf
      - Multiple NICs, but only one (eth0) is used in the cluster. An alias (eth0:0) has been created for the second instance on the server.
      - JDK = /usr/java/j2sdk1.4.2_10
      - JBoss Version = jboss-3.2.7

      =====

      Cluster Configuration Steps
      Config Node1
      1) cd /opt/jboss-3.2.7/server
      2) for i in node1 node2 ; do cp -a all $i ; done
      3) cd node1/deploy
      4) Add bind_addr to cluster-service.xml
      Sample Config:


      <!-- UDP: if you have a multihomed machine,
      set the bind_addr attribute to the appropriate NIC IP address -->
      <!-- UDP: On Windows machines, because of the media sense feature
      being broken with multicast (even after disabling media sense)
      set the loopback attribute to true -->
      <UDP mcast_addr="228.1.2.3" mcast_port="45566" bind_addr="192.168.3.118"
      ip_ttl="32" ip_mcast="true"
      mcast_send_buf_size="800000" mcast_recv_buf_size="150000"
      ucast_send_buf_size="800000" ucast_recv_buf_size="150000"
      loopback="false" />
      <PING timeout="2000" num_initial_members="3"
      up_thread="true" down_thread="true" />
      <MERGE2 min_interval="10000" max_interval="20000" />
      <FD shun="true" up_thread="true" down_thread="true"
      timeout="2500" max_tries="5" />
      <VERIFY_SUSPECT timeout="3000" num_msgs="3"
      up_thread="true" down_thread="true" />
      <pbcast.NAKACK gc_lag="50" retransmit_timeout="300,600,1200,2400,4800"
      max_xmit_size="8192"
      up_thread="true" down_thread="true" />
      <UNICAST timeout="300,600,1200,2400,4800" window_size="100" min_threshold="10"
      down_thread="true" />
      <pbcast.STABLE desired_avg_gossip="20000"
      up_thread="true" down_thread="true" />
      <FRAG frag_size="8192"
      down_thread="true" up_thread="true" />
      <pbcast.GMS join_timeout="5000" join_retry_timeout="2000"
      shun="true" print_local_addr="true" />
      <pbcast.STATE_TRANSFER up_thread="true" down_thread="true" />


      =======================

      5) Command to start node1.
      /opt/jboss-3.2.7/bin/run.sh -c node1 -b 192.168.3.118

      Config Node2
      1) cd /opt/jboss-3.2.7/server/node2/deploy
      4) Add bind_addr to cluster-service.xml
      Sample Config:


      <!-- UDP: if you have a multihomed machine,
      set the bind_addr attribute to the appropriate NIC IP address -->
      <!-- UDP: On Windows machines, because of the media sense feature
      being broken with multicast (even after disabling media sense)
      set the loopback attribute to true -->
      <UDP mcast_addr="228.1.2.3" mcast_port="45566" bind_addr="192.168.3.218"
      ip_ttl="32" ip_mcast="true"
      mcast_send_buf_size="800000" mcast_recv_buf_size="150000"
      ucast_send_buf_size="800000" ucast_recv_buf_size="150000"
      loopback="false" />
      <PING timeout="2000" num_initial_members="3"
      up_thread="true" down_thread="true" />
      <MERGE2 min_interval="10000" max_interval="20000" />
      <FD shun="true" up_thread="true" down_thread="true"
      timeout="2500" max_tries="5" />
      <VERIFY_SUSPECT timeout="3000" num_msgs="3"
      up_thread="true" down_thread="true" />
      <pbcast.NAKACK gc_lag="50" retransmit_timeout="300,600,1200,2400,4800"
      max_xmit_size="8192"
      up_thread="true" down_thread="true" />
      <UNICAST timeout="300,600,1200,2400,4800" window_size="100" min_threshold="10"
      down_thread="true" />
      <pbcast.STABLE desired_avg_gossip="20000"
      up_thread="true" down_thread="true" />
      <FRAG frag_size="8192"
      down_thread="true" up_thread="true" />
      <pbcast.GMS join_timeout="5000" join_retry_timeout="2000"
      shun="true" print_local_addr="true" />
      <pbcast.STATE_TRANSFER up_thread="true" down_thread="true" />


      =======================

      5) Command to start node2.
      /opt/jboss-3.2.7/bin/run.sh -c node2 -b 192.168.3.218
      =======================
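A quick way to double-check step 4 on this server is to print the bind_addr that each copied configuration actually contains. This is only a sketch: the helper name show_bind_addr is made up for illustration, and JBOSS_HOME defaults to the install path used above.

```shell
#!/bin/sh
# Print the bind_addr value(s) found in a node's cluster-service.xml.
# JBOSS_HOME and the helper name are assumptions for illustration only.
JBOSS_HOME="${JBOSS_HOME:-/opt/jboss-3.2.7}"

show_bind_addr() {
    # $1 = path to a cluster-service.xml file.
    # Only lines containing bind_addr="..." produce output, so the XML
    # comment that merely mentions bind_addr is skipped.
    sed -n 's/.*bind_addr="\([^"]*\)".*/\1/p' "$1"
}

for node in node1 node2; do
    f="$JBOSS_HOME/server/$node/deploy/cluster-service.xml"
    if [ -f "$f" ]; then
        echo "$node: $(show_bind_addr "$f")"
    else
        echo "$node: $f not found"
    fi
done
```

With the configuration described above, node1 should report 192.168.3.118 and node2 should report 192.168.3.218.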

      App2
      - OS = Red Hat Enterprise Linux ES release 4 (Nahant Update 1)
      - Linux Kernel = 2.6.9-11.ELsmp
      - IPv6 Disabled = alias net-pf-10 off >> modprobe.conf
      - Multiple NICs, but only one (eth0) is used in the cluster. An alias (eth0:0) has been created for the second instance on the server.
      - JDK = /usr/java/j2sdk1.4.2_10
      - JBoss Version = jboss-3.2.7

      =====

      Cluster Configuration Steps
      Config Node3
      1) cd /opt/jboss-3.2.7/server
      2) for i in node3 node4 ; do cp -a all $i ; done
      3) cd node3/deploy
      4) Add bind_addr to cluster-service.xml
      Sample Config:


      <!-- UDP: if you have a multihomed machine,
      set the bind_addr attribute to the appropriate NIC IP address -->
      <!-- UDP: On Windows machines, because of the media sense feature
      being broken with multicast (even after disabling media sense)
      set the loopback attribute to true -->
      <UDP mcast_addr="228.1.2.3" mcast_port="45566" bind_addr="192.168.3.119"
      ip_ttl="32" ip_mcast="true"
      mcast_send_buf_size="800000" mcast_recv_buf_size="150000"
      ucast_send_buf_size="800000" ucast_recv_buf_size="150000"
      loopback="false" />
      <PING timeout="2000" num_initial_members="3"
      up_thread="true" down_thread="true" />
      <MERGE2 min_interval="10000" max_interval="20000" />
      <FD shun="true" up_thread="true" down_thread="true"
      timeout="2500" max_tries="5" />
      <VERIFY_SUSPECT timeout="3000" num_msgs="3"
      up_thread="true" down_thread="true" />
      <pbcast.NAKACK gc_lag="50" retransmit_timeout="300,600,1200,2400,4800"
      max_xmit_size="8192"
      up_thread="true" down_thread="true" />
      <UNICAST timeout="300,600,1200,2400,4800" window_size="100" min_threshold="10"
      down_thread="true" />
      <pbcast.STABLE desired_avg_gossip="20000"
      up_thread="true" down_thread="true" />
      <FRAG frag_size="8192"
      down_thread="true" up_thread="true" />
      <pbcast.GMS join_timeout="5000" join_retry_timeout="2000"
      shun="true" print_local_addr="true" />
      <pbcast.STATE_TRANSFER up_thread="true" down_thread="true" />


      =======================

      5) Command to start node3.
      /opt/jboss-3.2.7/bin/run.sh -c node3 -b 192.168.3.119

      Config Node4
      1) cd /opt/jboss-3.2.7/server/node4/deploy
      4) Add bind_addr to cluster-service.xml
      Sample Config:


      <!-- UDP: if you have a multihomed machine,
      set the bind_addr attribute to the appropriate NIC IP address -->
      <!-- UDP: On Windows machines, because of the media sense feature
      being broken with multicast (even after disabling media sense)
      set the loopback attribute to true -->
      <UDP mcast_addr="228.1.2.3" mcast_port="45566" bind_addr="192.168.3.219"
      ip_ttl="32" ip_mcast="true"
      mcast_send_buf_size="800000" mcast_recv_buf_size="150000"
      ucast_send_buf_size="800000" ucast_recv_buf_size="150000"
      loopback="false" />
      <PING timeout="2000" num_initial_members="3"
      up_thread="true" down_thread="true" />
      <MERGE2 min_interval="10000" max_interval="20000" />
      <FD shun="true" up_thread="true" down_thread="true"
      timeout="2500" max_tries="5" />
      <VERIFY_SUSPECT timeout="3000" num_msgs="3"
      up_thread="true" down_thread="true" />
      <pbcast.NAKACK gc_lag="50" retransmit_timeout="300,600,1200,2400,4800"
      max_xmit_size="8192"
      up_thread="true" down_thread="true" />
      <UNICAST timeout="300,600,1200,2400,4800" window_size="100" min_threshold="10"
      down_thread="true" />
      <pbcast.STABLE desired_avg_gossip="20000"
      up_thread="true" down_thread="true" />
      <FRAG frag_size="8192"
      down_thread="true" up_thread="true" />
      <pbcast.GMS join_timeout="5000" join_retry_timeout="2000"
      shun="true" print_local_addr="true" />
      <pbcast.STATE_TRANSFER up_thread="true" down_thread="true" />


      =======================

      5) Command to start node4.
      /opt/jboss-3.2.7/bin/run.sh -c node4 -b 192.168.3.219
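
For reference, the four start commands can be collected in one small helper so that each configuration name is always paired with the same address. This is a sketch only: node_bind_addr and start_command are hypothetical names, not part of JBoss, and the command is printed rather than executed.

```shell
#!/bin/sh
# Pair each server configuration with the bind address listed above.
# JBOSS_HOME and both helper names are illustrative assumptions.
JBOSS_HOME="${JBOSS_HOME:-/opt/jboss-3.2.7}"

node_bind_addr() {
    case "$1" in
        node1) echo 192.168.3.118 ;;
        node2) echo 192.168.3.218 ;;
        node3) echo 192.168.3.119 ;;
        node4) echo 192.168.3.219 ;;
        *)     echo "unknown node: $1" >&2; return 1 ;;
    esac
}

# Print (rather than run) the start command so it can be reviewed first.
start_command() {
    addr=$(node_bind_addr "$1") || return 1
    echo "$JBOSS_HOME/bin/run.sh -c $1 -b $addr"
}

start_command node3
```

The printed line can be compared against the commands in step 5 before it is run by hand.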

      Errors on App2, Node3
      ====================
      2005-12-15 08:51:26,730 WARN [org.jgroups.protocols.pbcast.NAKACK] [192.168.3.119:46745] discarded message from non-member 192.168.3.118:34330
      2005-12-15 08:51:33,552 WARN [org.jgroups.protocols.pbcast.NAKACK] [192.168.3.119:46745] discarded message from non-member 192.168.3.218:34327
      2005-12-15 08:51:34,786 WARN [org.jgroups.protocols.pbcast.NAKACK] [192.168.3.119:46736 (additional data: 18 bytes)] discarded message from non-member 192.168.3.218:34283 (additional data: 18 bytes)
      2005-12-15 08:51:38,866 WARN [org.jgroups.protocols.pbcast.NAKACK] [192.168.3.119:46745] discarded message from non-member 192.168.3.118:34330
      2005-12-15 08:51:46,425 WARN [org.jgroups.protocols.pbcast.NAKACK] [192.168.3.119:46736 (additional data: 18 bytes)] discarded message from non-member 192.168.3.219:46739 (additional data: 18 bytes)
      2005-12-15 08:51:57,469 WARN [org.jgroups.protocols.pbcast.NAKACK] [192.168.3.119:46736 (additional data: 18 bytes)] discarded message from non-member 192.168.3.219:46739 (additional data: 18 bytes)
      2005-12-15 08:52:05,034 WARN [org.jgroups.protocols.pbcast.NAKACK] [192.168.3.119:46745] discarded message from non-member 192.168.3.118:34330
      2005-12-15 08:52:05,221 WARN [org.jgroups.protocols.pbcast.NAKACK] [192.168.3.119:46745] discarded message from non-member 192.168.3.218:34327
      2005-12-15 08:52:08,667 WARN [org.jgroups.protocols.pbcast.NAKACK] [192.168.3.119:46736 (additional data: 18 bytes)] discarded message from non-member 192.168.3.118:34280 (additional data: 18 bytes)
      2005-12-15 08:52:13,381 WARN [org.jgroups.protocols.pbcast.NAKACK] [192.168.3.119:46736 (additional data: 18 bytes)] discarded message from non-member 192.168.3.218:34283 (additional data: 18 bytes)
      ==========
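
To see at a glance which senders are being rejected, the warnings above can be tallied per sender address. This is a rough sketch: the function name count_non_member_senders is made up, and the log path in the usage comment assumes the default server.log location for the node3 configuration.

```shell
#!/bin/sh
# Count NAKACK "discarded message from non-member" warnings per sender.
# Reads log text on stdin and prints "<sender address> <count>" lines.
# The function name is illustrative; it is not a JBoss or JGroups tool.
count_non_member_senders() {
    # Keep only the sender address that follows "non-member"; any trailing
    # "(additional data: ...)" text is dropped by the final .* in the pattern.
    sed -n 's/.*discarded message from non-member \([0-9.]*:[0-9]*\).*/\1/p' \
        | sort | uniq -c | awk '{ print $2, $1 }' | sort
}

# Usage (path is an assumption):
#   count_non_member_senders < /opt/jboss-3.2.7/server/node3/log/server.log
```

Each distinct port in the output is a separate sender endpoint, which makes it easier to tell whether the rejected messages come from one source or several.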

      Errors on App1, Node2




      2005-12-15 08:56:58,519 INFO [org.jboss.cache.TreeCache] returning the transient state (217 bytes)
      2005-12-15 08:57:00,243 WARN [org.jgroups.protocols.pbcast.NAKACK] [192.168.3.218:34334] discarded message from non-member 192.168.3.118:34330
      2005-12-15 08:57:00,245 ERROR [org.jgroups.protocols.pbcast.NAKACK] sender at index 3 in digest is null
      2005-12-15 08:57:00,246 INFO [org.jboss.cache.TreeCache] viewAccepted(): new members: [192.168.3.118:34330, 192.168.3.118:34337, 192.168.3.218:34334, 192.168.3.219:46748]
      2005-12-15 08:57:01,375 ERROR [org.jgroups.protocols.pbcast.GMS] [192.168.3.218:34334] received view <= current view; discarding it (current vid: [192.168.3.118:34330|25], new vid: [192.168.3.219:46748|24])
      2005-12-15 08:57:02,880 INFO [org.jboss.cache.TreeCache] viewAccepted(): new members: [192.168.3.118:34337, 192.168.3.218:34334, 192.168.3.219:46748]
      2005-12-15 08:57:08,675 DEBUG [org.jboss.ha.framework.server.DistributedReplicantManagerImpl.DefaultPartition] _add(jboss.ha:service=HASingletonDeployer, 192.168.3.219:1099
      2005-12-15 08:57:08,675 DEBUG [org.jboss.ha.framework.server.DistributedReplicantManagerImpl.DefaultPartition] notifyKeyListeners
      2005-12-15 08:57:08,675 DEBUG [org.jboss.ha.framework.server.DistributedReplicantManagerImpl.DefaultPartition] notifying 1 listeners for key change: jboss.ha:service=HASingletonDeployer
      2005-12-15 08:57:08,676 DEBUG [org.jboss.ha.singleton.HASingletonController] partitionTopologyChanged, isElectedNewMaster=false, isMasterNode=false, viewID=996459982
      2005-12-15 08:57:08,679 DEBUG [org.jboss.ha.singleton.HASingletonController] _stopOldMaster, isMasterNode=false
      2005-12-15 08:57:08,792 DEBUG [org.jboss.ha.framework.server.DistributedReplicantManagerImpl.DefaultPartition] _add(HAJNDI, 192.168.3.219:1099
      2005-12-15 08:57:08,792 DEBUG [org.jboss.ha.framework.server.DistributedReplicantManagerImpl.DefaultPartition] notifyKeyListeners
      2005-12-15 08:57:08,792 DEBUG [org.jboss.ha.framework.server.DistributedReplicantManagerImpl.DefaultPartition] notifying 1 listeners for key change: HAJNDI
      2005-12-15 08:57:08,792 DEBUG [org.jboss.ha.framework.server.HARMIServerImpl$RefreshProxiesHATarget] replicantsChanged 'HAJNDI' to 3 (intra-view id: 996459982)
      2005-12-15 08:57:08,793 DEBUG [org.jboss.ha.framework.server.DistributedReplicantManagerImpl.DefaultPartition] _add(DCacheBridge-DefaultJGBridge, 192.168.3.219:1099
      2005-12-15 08:57:08,793 DEBUG [org.jboss.ha.framework.server.DistributedReplicantManagerImpl.DefaultPartition] notifyKeyListeners
      2005-12-15 08:57:08,793 DEBUG [org.jboss.ha.framework.server.DistributedReplicantManagerImpl.DefaultPartition] notifying 1 listeners for key change: DCacheBridge-DefaultJGBridge
      2005-12-15 08:57:09,398 ERROR [org.jgroups.protocols.pbcast.NAKACK] sender at index 3 in digest is null
      2005-12-15 08:57:09,399 INFO [org.jboss.cache.TreeCache] viewAccepted(): new members: [192.168.3.118:34330, 192.168.3.118:34337, 192.168.3.218:34334, 192.168.3.219:46748]

      ==========

      Errors on App1, Node1

      2005-12-15 08:58:16,020 ERROR [org.jgroups.protocols.pbcast.NAKACK] (requester=192.168.3.118:34337, local_addr=192.168.3.118:34330) message with seqno=10 not found in sent_msgs ! sent_msgs=[12 - 14]
      2005-12-15 08:58:16,020 ERROR [org.jgroups.protocols.pbcast.NAKACK] (requester=192.168.3.118:34337, local_addr=192.168.3.118:34330) message with seqno=11 not found in sent_msgs ! sent_msgs=[12 - 14]
      2005-12-15 08:58:16,020 ERROR [org.jgroups.protocols.pbcast.NAKACK] (requester=192.168.3.218:34334, local_addr=192.168.3.118:34330) message with seqno=0 not found in sent_msgs ! sent_msgs=[12 - 14]
      2005-12-15 08:58:16,020 ERROR [org.jgroups.protocols.pbcast.NAKACK] (requester=192.168.3.218:34334, local_addr=192.168.3.118:34330) message with seqno=1 not found in sent_msgs ! sent_msgs=[12 - 14]
      2005-12-15 08:58:16,021 ERROR [org.jgroups.protocols.pbcast.NAKACK] (requester=192.168.3.218:34334, local_addr=192.168.3.118:34330) message with seqno=2 not found in sent_msgs ! sent_msgs=[12 - 14]
      2005-12-15 08:58:16,021 ERROR [org.jgroups.protocols.pbcast.NAKACK] (requester=192.168.3.218:34334, local_addr=192.168.3.118:34330) message with seqno=3 not found in sent_msgs ! sent_msgs=[12 - 14]
      2005-12-15 08:58:16,021 ERROR [org.jgroups.protocols.pbcast.NAKACK] (requester=192.168.3.218:34334, local_addr=192.168.3.118:34330) message with seqno=4 not found in sent_msgs ! sent_msgs=[12 - 14]
      2005-12-15 08:58:16,021 ERROR [org.jgroups.protocols.pbcast.NAKACK] (requester=192.168.3.218:34334, local_addr=192.168.3.118:34330) message with seqno=5 not found in sent_msgs ! sent_msgs=[12 - 14]
      2005-12-15 08:58:16,021 ERROR [org.jgroups.protocols.pbcast.NAKACK] (requester=192.168.3.218:34334, local_addr=192.168.3.118:34330) message with seqno=6 not found in sent_msgs ! sent_msgs=[12 - 14]
      2005-12-15 08:58:16,021 ERROR [org.jgroups.protocols.pbcast.NAKACK] (requester=192.168.3.218:34334, local_addr=192.168.3.118:34330) message with seqno=7 not found in sent_msgs ! sent_msgs=[12 - 14]
      2005-12-15 08:58:16,021 ERROR [org.jgroups.protocols.pbcast.NAKACK] (requester=192.168.3.218:34334, local_addr=192.168.3.118:34330) message with seqno=8 not found in sent_msgs ! sent_msgs=[12 - 14]
      2005-12-15 08:58:16,021 ERROR [org.jgroups.protocols.pbcast.NAKACK] (requester=192.168.3.218:34334, local_addr=192.168.3.118:34330) message with seqno=9 not found in sent_msgs ! sent_msgs=[12 - 14]
      2005-12-15 08:58:16,021 ERROR [org.jgroups.protocols.pbcast.NAKACK] (requester=192.168.3.218:34334, local_addr=192.168.3.118:34330) message with seqno=10 not found in sent_msgs ! sent_msgs=[12 - 14]
      2005-12-15 08:58:16,021 ERROR [org.jgroups.protocols.pbcast.NAKACK] (requester=192.168.3.218:34334, local_addr=192.168.3.118:34330) message with seqno=11 not found in sent_msgs ! sent_msgs=[12 - 14]
      2005-12-15 08:58:20,820 ERROR [org.jgroups.protocols.pbcast.NAKACK] (requester=192.168.3.118:34337, local_addr=192.168.3.118:34330) message with seqno=0 not found in sent_msgs ! sent_msgs=[12 - 14]
      2005-12-15 08:58:20,820 ERROR [org.jgroups.protocols.pbcast.NAKACK] (requester=192.168.3.118:34337, local_addr=192.168.3.118:34330) message with seqno=1 not found in sent_msgs ! sent_msgs=[12 - 14]
      ===========

      Errors on App2, Node4
      2005-12-15 08:57:04,471 DEBUG [org.jboss.ha.framework.server.DistributedReplicantManagerImpl.DefaultPartition] notifyKeyListeners
      2005-12-15 08:57:04,471 DEBUG [org.jboss.ha.framework.server.DistributedReplicantManagerImpl.DefaultPartition] notifying 1 listeners for key change: DCacheBridge-DefaultJGBridge
      2005-12-15 08:57:04,472 DEBUG [org.jboss.ha.framework.server.DistributedReplicantManagerImpl.DefaultPartition] End Re-Publish local replicants
      2005-12-15 08:57:05,080 ERROR [org.jgroups.protocols.pbcast.NAKACK] sender at index 3 in digest is null
      2005-12-15 08:57:05,081 INFO [org.jboss.cache.TreeCache] viewAccepted(): new members: [192.168.3.118:34330, 192.168.3.118:34337, 192.168.3.218:34334, 192.168.3.219:46748]
      2005-12-15 08:58:48,262 DEBUG [org.jboss.resource.connectionmanager.IdleRemover] run: IdleRemover notifying pools, interval: 150000
      2005-12-15 09:00:12,335 ERROR [org.jgroups.protocols.FD_SOCK] socket address for 192.168.3.118:34330 could not be fetched, retrying
      2005-12-15 09:00:20,642 ERROR [org.jgroups.protocols.FD_SOCK] socket address for 192.168.3.118:34330 could not be fetched, retrying
      2005-12-15 09:00:28,949 ERROR [org.jgroups.protocols.FD_SOCK] socket address for 192.168.3.118:34330 could not be fetched, retrying

      ========


      Thanks!!


      Dan Simmons
      Enterprise Linux Systems Engineer
      Rackspace Managed Hosting
      Toll Free: 800-961-2888, ext 4642
      Direct: 210-447-4642
      Mobile: 832-217-9506
      dsimmons@rackspace.com