2 Replies Latest reply on Apr 27, 2006 6:15 PM by coachvargo

    Cluster problem: Farm works, but HASingleton service does no

    coachvargo Newbie

      I am having a problem with clustering on two servers running Red Hat Enterprise. I have set up the clustering to use the TCP config. It works fine with two servers I have locally, but not on two servers at my ISP. The two servers get up and running, and I can see in the logs that they recognize each other... sort of. Farming works fine: I can copy a XXXXX-ds.xml file to the farm directory and it is properly sent to the other server. Both servers list the cluster as having 2 members, but neither of them wants to run the HASingleton services I set up. When I start just the first server, the service runs correctly and MasterNode = true. When I start the second server, it joins the cluster, the HASingleton services I set up are destroyed on the main node (they no longer exist in the JMX console on either server), and then BOTH nodes show MasterNode = false in the HASingleton service. The .sar files exist on both servers in the deploy-hasingleton directory, so that isn't the issue here.

      Does anyone have any ideas? Here is a log sample from the second node in the cluster.
      I can see that my config is making it into the log OK:

      2005-12-06 11:51:53,486 DEBUG [org.jboss.ha.framework.server.ClusterPartition] Setting JGProps from xml to: TCP(bind_addr=172.25.5.30;loopback=true;start_port=7800):TCPPING(down_thread=true;
      initial_hosts=172.25.5.30[7800],172.25.5.29[7800];num_initial_members=3;port_range=3;
      timeout=3500;up_thread=true):MERGE2(max_interval=10000;min_interval=5000):
      FD(down_thread=true;max_tries=5;shun=true;timeout=2500;up_thread=true):
      VERIFY_SUSPECT(down_thread=false;timeout=1500;up_thread=false):
      pbcast.NAKACK(down_thread=true;gc_lag=100;retransmit_timeout=3000;up_thread=true):
      pbcast.STABLE(desired_avg_gossip=20000;down_thread=false;up_thread=false):
      pbcast.GMS(down_thread=true;join_retry_timeout=2000;join_timeout=5000;
      print_local_addr=true;shun=false;up_thread=true):
      pbcast.STATE_TRANSFER(down_thread=true;up_thread=true)
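
      A property string like the one above is hard to read when it wraps across log lines. The following is a minimal sketch, not part of JBoss or JGroups, that splits such a string into protocols and their properties so individual settings can be compared between nodes. The class name JGroupsPropsParser and its methods are hypothetical, and it assumes the old colon-separated format with no ':' or ';' inside property values. It makes it easy to see, for instance, that num_initial_members is 3 while initial_hosts lists two addresses; whether that is related to the problem is not established here.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical helper: splits an old-style JGroups property string for inspection. */
final class JGroupsPropsParser {

    /** Returns protocol name -> (property -> value), preserving stack order. */
    static Map<String, Map<String, String>> parse(String props) {
        Map<String, Map<String, String>> stack = new LinkedHashMap<>();
        // Drop whitespace and line wraps introduced by the log output.
        String compact = props.replaceAll("\\s+", "");
        for (String segment : splitTopLevel(compact)) {
            if (segment.isEmpty()) {
                continue;
            }
            int open = segment.indexOf('(');
            String name = open < 0 ? segment : segment.substring(0, open);
            Map<String, String> attrs = new LinkedHashMap<>();
            if (open >= 0) {
                int close = segment.lastIndexOf(')');
                String body = segment.substring(open + 1, close < 0 ? segment.length() : close);
                for (String pair : body.split(";")) {
                    int eq = pair.indexOf('=');
                    if (eq > 0) {
                        attrs.put(pair.substring(0, eq), pair.substring(eq + 1));
                    }
                }
            }
            stack.put(name, attrs);
        }
        return stack;
    }

    /** Splits on ':' only when it appears outside parentheses. */
    private static List<String> splitTopLevel(String s) {
        List<String> parts = new ArrayList<>();
        int depth = 0;
        int start = 0;
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c == '(') {
                depth++;
            } else if (c == ')') {
                depth--;
            } else if (c == ':' && depth == 0) {
                parts.add(s.substring(start, i));
                start = i + 1;
            }
        }
        parts.add(s.substring(start));
        return parts;
    }

    public static void main(String[] args) {
        // First three protocols copied from the log excerpt above.
        String props = "TCP(bind_addr=172.25.5.30;loopback=true;start_port=7800):"
                + "TCPPING(down_thread=true;initial_hosts=172.25.5.30[7800],172.25.5.29[7800];"
                + "num_initial_members=3;port_range=3;timeout=3500;up_thread=true):"
                + "MERGE2(max_interval=10000;min_interval=5000)";
        parse(props).forEach((name, attrs) -> System.out.println(name + " -> " + attrs));
    }
}
```

      Running the same parse on the string logged by each node and comparing the resulting maps is one way to confirm that both servers really start with the settings intended for them, such as bind_addr and initial_hosts.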


      Here are the results of the TCP ping requests, which I think are a little strange, since the IP address I have in my config for the other machine is being resolved to the network alias:

      2005-12-06 11:51:54,021 DEBUG [org.jgroups.protocols.TCPPING] [FIND_INITIAL_MBRS] sending PING request to st2clxll13:7800
      2005-12-06 11:51:54,022 DEBUG [org.jgroups.protocols.TCP] dest=st2clxll13:7800, hdrs:
      TCP: [TCP:group_addr=DefaultPartition]
      TCPPING: [PING: type=GET_MBRS_REQ, arg=null]
      2005-12-06 11:51:54,023 DEBUG [org.jgroups.protocols.TCPPING] [FIND_INITIAL_MBRS] sending PING request to st2clxll13:7801
      2005-12-06 11:51:54,024 DEBUG [org.jgroups.protocols.TCPPING] [FIND_INITIAL_MBRS] sending PING request to st2clxll13:7802
      2005-12-06 11:51:54,032 DEBUG [org.jgroups.protocols.TCP] opened connection to st2clxll13:7800
      2005-12-06 11:51:54,032 INFO [org.jgroups.blocks.ConnectionTable] connection was created to st2clxll13:7800
      2005-12-06 11:51:54,032 INFO [org.jgroups.blocks.ConnectionTable] created socket to st2clxll13:7800

      Here's the membership info from the log. It lists both members of the cluster as localhost. Also notice that "I am" = null, where it should be the IP address of the host machine:

      2005-12-06 11:51:57,637 INFO [org.jboss.ha.framework.server.DistributedReplicantManagerImpl.DefaultPartition] I am (null) received membershipChanged event:
      2005-12-06 11:51:57,638 INFO [org.jboss.ha.framework.server.DistributedReplicantManagerImpl.DefaultPartition] Dead members: 0 ([])
      2005-12-06 11:51:57,638 INFO [org.jboss.ha.framework.server.DistributedReplicantManagerImpl.DefaultPartition] New Members : 0 ([])
      2005-12-06 11:51:57,638 INFO [org.jboss.ha.framework.server.DistributedReplicantManagerImpl.DefaultPartition] All Members : 2 ([127.0.0.1:1099, 127.0.0.1:1099])
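
      Both entries in the member list above are 127.0.0.1:1099, so the two nodes are indistinguishable in that view. One possible explanation, which is an assumption and not confirmed by these logs, is that each machine resolves its own host name to the loopback address. The sketch below is a hypothetical diagnostic (the class MemberListCheck is not part of JBoss): it prints what the local host name resolves to using the standard java.net.InetAddress API, and it flags member entries from a log line that are loopback addresses or duplicates.

```java
import java.net.InetAddress;
import java.net.UnknownHostException;
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Hypothetical diagnostic: inspects a membership log line and local name resolution. */
final class MemberListCheck {

    /** Extracts entries from the last [...] group, e.g. "All Members : 2 ([a:1099, b:1099])". */
    static List<String> members(String logLine) {
        List<String> out = new ArrayList<>();
        int open = logLine.lastIndexOf('[');
        int close = logLine.lastIndexOf(']');
        if (open < 0 || close <= open) {
            return out;
        }
        for (String entry : logLine.substring(open + 1, close).split(",")) {
            String trimmed = entry.trim();
            if (!trimmed.isEmpty()) {
                out.add(trimmed);
            }
        }
        return out;
    }

    /** Returns entries that are loopback addresses or that repeat an earlier entry. */
    static List<String> suspicious(List<String> members) {
        Set<String> seen = new HashSet<>();
        List<String> flagged = new ArrayList<>();
        for (String member : members) {
            boolean duplicate = !seen.add(member);
            boolean loopback = member.startsWith("127.") || member.startsWith("localhost");
            if (duplicate || loopback) {
                flagged.add(member);
            }
        }
        return flagged;
    }

    public static void main(String[] args) throws UnknownHostException {
        // What does this machine think its own address is?
        InetAddress local = InetAddress.getLocalHost();
        System.out.println("local host: " + local.getHostName() + " / " + local.getHostAddress()
                + " (loopback=" + local.isLoopbackAddress() + ")");

        // Member list copied from the log excerpt above.
        String line = "All Members : 2 ([127.0.0.1:1099, 127.0.0.1:1099])";
        System.out.println("suspicious entries: " + suspicious(members(line)));
    }
}
```

      Running main on each server shows whether the local host name resolves to a loopback address there. If it does, the host name mapping on that machine (for example in /etc/hosts) would be a reasonable place to look next.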