0 Replies Latest reply on Jul 3, 2007 4:16 AM by Frank Snedecor

    handleJoin(192.168.1.9:32902) failed, retrying

    Frank Snedecor Newbie

      Hello,
      Every so often we start getting this error ?handleJoin(192.168.1.9:32
      915) failed, retrying
      ? after a restart of a machine in the production cluster. While in this state, no connections can be made and sometimes a global restart is needed. (Jboss4.0.3SP1)

      Using 4.0.5GA I was able to reproduce this in a virtual environment by starting virtual machine 1 then starting virtual machine 2. Then stopping machine 1 the coordinator and restarting virtual 1. However in4.0.5GA it seems to only do this for approx 5 minutes and then it finally reconnects. In the production environment, it appears as if it never reconnects even after hours of being in this state. See last code for state of production.

      Any tips on how to prevent this would be greatly apprectiated. I supplied some org.jgroups debug output while in this state.

      I think I might have read a bug about this such that getUniqueName is not unique when restarting since uses JNDI port??

      2007-07-03 05:03:49,733 DEBUG [org.jgroups.protocols.UDP] creating sockets and starting threads
      2007-07-03 05:03:49,733 INFO [org.jgroups.protocols.UDP] unicast sockets will use interface 192.168.1.9
      2007-07-03 05:03:49,737 INFO [org.jgroups.protocols.UDP] socket information:
      local_addr=192.168.1.9:32915, mcast_addr=230.1.2.7:45577, bind_addr=/192.168.1.9, ttl=2
      sock: bound to 192.168.1.9:32915, receive buffer size=131071, send buffer size=131071
      mcast_recv_sock: bound to 192.168.1.9:45577, send buffer size=131071, receive buffer size=131071
      mcast_send_sock: bound to 192.168.1.9:32916, send buffer size=131071, receive buffer size=131071
      2007-07-03 05:03:49,830 INFO [STDOUT]
      -------------------------------------------------------
      GMS: address is 192.168.1.9:32915
      -------------------------------------------------------
      2007-07-03 05:03:49,833 DEBUG [org.jgroups.protocols.UDP] created unicast receiver thread
      2007-07-03 05:03:49,906 DEBUG [org.jgroups.protocols.FD_SOCK] determinePingDest()=null, pingable_mbrs=[]
      2007-07-03 05:03:49,907 DEBUG [org.jgroups.protocols.FD_SOCK] pinger thread terminated
      2007-07-03 05:03:49,937 DEBUG [org.jgroups.protocols.UDP]
      sending msgs:
      230.1.2.7:45577: 1 msgs
      
      2007-07-03 05:03:51,929 DEBUG [org.jgroups.protocols.PING] initial mbrs are [[own_addr=192.168.1.2:1155, coord_addr=192.168.1.9:32908]]
      2007-07-03 05:03:51,929 DEBUG [org.jgroups.protocols.pbcast.ClientGmsImpl] initial_mbrs are [[own_addr=192.168.1.2:1155, coord_addr=192.168.1.9:32908]]
      2007-07-03 05:03:51,929 DEBUG [org.jgroups.protocols.pbcast.ClientGmsImpl] sending handleJoin(192.168.1.9:32915) to 192.168.1.9:32908
      2007-07-03 05:03:51,933 DEBUG [org.jgroups.protocols.pbcast.STABLE] stable task started; num_gossip_runs=3, max_gossip_runs=3
      2007-07-03 05:03:51,983 DEBUG [org.jgroups.protocols.UDP]
      sending msgs:
      192.168.1.9:32908: 1 msgs
      
      2007-07-03 05:03:52,293 DEBUG [org.jgroups.protocols.UDP]
      sending msgs:
      192.168.1.9:32908: 1 msgs
      
      2007-07-03 05:03:52,897 DEBUG [org.jgroups.protocols.UDP]
      sending msgs:
      192.168.1.9:32908: 1 msgs
      
      2007-07-03 05:03:54,101 DEBUG [org.jgroups.protocols.UDP]
      sending msgs:
      192.168.1.9:32908: 1 msgs
      
      2007-07-03 05:03:54,953 WARN [org.jgroups.protocols.pbcast.ClientGmsImpl] handleJoin(192.168.1.9:32915) failed, retrying
      2007-07-03 05:03:56,510 DEBUG [org.jgroups.protocols.UDP]
      sending msgs:
      192.168.1.9:32908: 1 msgs
      
      2007-07-03 05:03:56,993 DEBUG [org.jgroups.protocols.UDP]
      sending msgs:
      230.1.2.7:45577: 1 msgs
      
      2007-07-03 05:03:58,966 DEBUG [org.jgroups.protocols.PING] initial mbrs are [[own_addr=192.168.1.2:1155, coord_addr=192.168.1.9:32908]]
      2007-07-03 05:03:58,966 DEBUG [org.jgroups.protocols.pbcast.ClientGmsImpl] initial_mbrs are [[own_addr=192.168.1.2:1155, coord_addr=192.168.1.9:32908]]
      2007-07-03 05:03:58,966 DEBUG [org.jgroups.protocols.pbcast.ClientGmsImpl] sending handleJoin(192.168.1.9:32915) to 192.168.1.9:32908
      
      *********************************************************************A bunuch of these and then finally we get .
      
      2007-07-03 05:05:44,188 DEBUG [org.jgroups.protocols.PING] initial mbrs are [[own_addr=192.168.1.2:1155, coord_addr=192.168.1.9:32908]]
      2007-07-03 05:05:44,188 DEBUG [org.jgroups.protocols.pbcast.ClientGmsImpl] initial_mbrs are [[own_addr=192.168.1.2:1155, coord_addr=192.168.1.9:32908]]
      2007-07-03 05:05:44,188 DEBUG [org.jgroups.protocols.pbcast.ClientGmsImpl] sending handleJoin(192.168.1.9:32915) to 192.168.1.9:32908
      2007-07-03 05:05:44,224 DEBUG [org.jgroups.protocols.UDP]
      sending msgs:
      192.168.1.9:32908: 1 msgs
      
      2007-07-03 05:05:44,260 DEBUG [org.jgroups.protocols.UDP]
      sending msgs:
      192.168.1.9:32908: 1 msgs
      
      2007-07-03 05:05:44,436 DEBUG [org.jgroups.protocols.UDP]
      sending msgs:
      192.168.1.9:32908: 1 msgs
      
      2007-07-03 05:05:44,528 DEBUG [org.jgroups.protocols.UDP]
      sending msgs:
      192.168.1.9:32908: 1 msgs
      
      2007-07-03 05:05:44,648 DEBUG [org.jgroups.protocols.UDP]
      sending msgs:
      192.168.1.9:32908: 1 msgs
      
      2007-07-03 05:05:45,133 DEBUG [org.jgroups.protocols.UDP]
      sending msgs:
      192.168.1.9:32908: 1 msgs
      
      2007-07-03 05:05:45,336 DEBUG [org.jgroups.protocols.UDP]
      sending msgs:
      192.168.1.9:32908: 1 msgs
      
      2007-07-03 05:05:45,540 DEBUG [org.jgroups.protocols.UDP]
      sending msgs:
      192.168.1.9:32908: 1 msgs
      
      2007-07-03 05:05:45,740 DEBUG [org.jgroups.protocols.UDP]
      sending msgs:
      192.168.1.9:32908: 1 msgs
      
      2007-07-03 05:05:45,920 DEBUG [org.jgroups.protocols.UDP]
      sending msgs:
      192.168.1.9:32908: 1 msgs
      
      2007-07-03 05:05:46,120 DEBUG [org.jgroups.protocols.UDP]
      sending msgs:
      192.168.1.9:32908: 1 msgs
      
      2007-07-03 05:05:46,308 DEBUG [org.jgroups.protocols.UDP]
      sending msgs:
      192.168.1.9:32908: 2 msgs
      
      2007-07-03 05:05:46,504 DEBUG [org.jgroups.protocols.UDP]
      sending msgs:
      192.168.1.9:32908: 1 msgs
      
      2007-07-03 05:05:46,696 DEBUG [org.jgroups.protocols.UDP]
      sending msgs:
      192.168.1.9:32908: 1 msgs
      
      2007-07-03 05:05:46,893 DEBUG [org.jgroups.protocols.UDP]
      sending msgs:
      192.168.1.9:32908: 1 msgs
      
      2007-07-03 05:05:47,088 DEBUG [org.jgroups.protocols.UDP]
      sending msgs:
      192.168.1.9:32908: 1 msgs
      
      2007-07-03 05:05:47,192 WARN [org.jgroups.protocols.pbcast.ClientGmsImpl] handleJoin(192.168.1.9:32915) failed, retrying
      2007-07-03 05:05:47,280 DEBUG [org.jgroups.protocols.UDP]
      sending msgs:
      192.168.1.9:32908: 1 msgs
      
      2007-07-03 05:05:47,476 DEBUG [org.jgroups.protocols.UDP]
      sending msgs:
      192.168.1.9:32908: 1 msgs
      
      2007-07-03 05:05:47,582 DEBUG [org.jgroups.protocols.FD] [SUSPECT] suspect hdr is [FD: SUSPECT (suspected_mbrs=[192.168.1.9:32908], from=192.168.1.2:1155)]
      2007-07-03 05:05:47,616 DEBUG [org.jgroups.protocols.UDP]
      sending msgs:
      192.168.1.9:32908: 1 msgs
      
      2007-07-03 05:05:47,668 DEBUG [org.jgroups.protocols.UDP]
      sending msgs:
      192.168.1.9:32908: 1 msgs
      
      2007-07-03 05:05:47,864 DEBUG [org.jgroups.protocols.UDP]
      sending msgs:
      192.168.1.9:32908: 1 msgs
      
      2007-07-03 05:05:48,040 DEBUG [org.jgroups.protocols.UDP]
      sending msgs:
      192.168.1.9:32908: 1 msgs
      
      2007-07-03 05:05:48,252 DEBUG [org.jgroups.protocols.UDP]
      sending msgs:
      192.168.1.9:32908: 1 msgs
      
      2007-07-03 05:05:48,740 DEBUG [org.jgroups.protocols.UDP]
      sending msgs:
      192.168.1.9:32908: 1 msgs
      
      2007-07-03 05:05:48,940 DEBUG [org.jgroups.protocols.UDP]
      sending msgs:
      192.168.1.9:32908: 1 msgs
      
      2007-07-03 05:05:49,086 ERROR [org.jgroups.protocols.pbcast.ClientGmsImpl] suspect() should not be invoked on an instance of org.jgroups.protocols.pbcast.ClientGmsImpl
      2007-07-03 05:05:49,087 DEBUG [org.jgroups.blocks.RequestCorrelator] suspect=192.168.1.9:32908
      2007-07-03 05:05:49,136 DEBUG [org.jgroups.protocols.pbcast.NAKACK] message was discarded (not yet server)
      2007-07-03 05:05:49,144 DEBUG [org.jgroups.protocols.UDP]
      sending msgs:
      192.168.1.9:32908: 1 msgs
      
      2007-07-03 05:05:49,232 DEBUG [org.jgroups.protocols.UDP]
      sending msgs:
      230.1.2.7:45577: 1 msgs
      
      2007-07-03 05:05:49,344 DEBUG [org.jgroups.protocols.UDP]
      sending msgs:
      192.168.1.9:32908: 1 msgs
      
      2007-07-03 05:05:49,524 DEBUG [org.jgroups.protocols.UDP]
      sending msgs:
      192.168.1.9:32908: 1 msgs
      
      2007-07-03 05:05:49,724 DEBUG [org.jgroups.protocols.UDP]
      sending msgs:
      192.168.1.9:32908: 1 msgs
      
      2007-07-03 05:05:49,912 DEBUG [org.jgroups.protocols.UDP]
      sending msgs:
      192.168.1.9:32908: 1 msgs
      
      2007-07-03 05:05:50,109 DEBUG [org.jgroups.protocols.UDP]
      sending msgs:
      192.168.1.9:32908: 1 msgs
      
      2007-07-03 05:05:50,300 DEBUG [org.jgroups.protocols.UDP]
      sending msgs:
      192.168.1.9:32908: 1 msgs
      
      2007-07-03 05:05:50,496 DEBUG [org.jgroups.protocols.UDP]
      sending msgs:
      192.168.1.9:32908: 1 msgs
      
      2007-07-03 05:05:50,692 DEBUG [org.jgroups.protocols.UDP]
      sending msgs:
      192.168.1.9:32908: 1 msgs
      
      2007-07-03 05:05:50,884 DEBUG [org.jgroups.protocols.UDP]
      sending msgs:
      192.168.1.9:32908: 1 msgs
      
      2007-07-03 05:05:51,080 DEBUG [org.jgroups.protocols.UDP]
      sending msgs:
      192.168.1.9:32908: 1 msgs
      
      2007-07-03 05:05:51,200 DEBUG [org.jgroups.protocols.PING] initial mbrs are [[own_addr=192.168.1.2:1155, coord_addr=192.168.1.2:1155]]
      2007-07-03 05:05:51,200 DEBUG [org.jgroups.protocols.pbcast.ClientGmsImpl] initial_mbrs are [[own_addr=192.168.1.2:1155, coord_addr=192.168.1.2:1155]]
      2007-07-03 05:05:51,200 DEBUG [org.jgroups.protocols.pbcast.ClientGmsImpl] sending handleJoin(192.168.1.9:32915) to 192.168.1.2:1155
      2007-07-03 05:05:51,236 DEBUG [org.jgroups.protocols.UDP]
      sending msgs:
      192.168.1.2:1155: 1 msgs
      
      2007-07-03 05:05:51,272 DEBUG [org.jgroups.protocols.UDP]
      

      Here is the output of the other machine. Looks like the unique name is not unique.
      01:29:19,705 INFO [Server] JBoss (MX MicroKernel) [4.0.5.GA (build: CVSTag=Branch_4_0 date=200702201759)] Started in 52s:15ms
      01:33:08,987 INFO [UI] New cluster view for partition UI (id: 2, delta: -1) : [127.0.0.2:1099]
      01:33:09,063 INFO [UI] I am (127.0.0.2:1099) received membershipChanged event:
      01:33:09,075 INFO [UI] Dead members: 0 ([])
      01:33:09,076 INFO [UI] New Members : 0 ([])
      01:33:09,076 INFO [UI] All Members : 1 ([127.0.0.2:1099])
      01:33:11,800 INFO [TomcatDeployer] deploy, ctxPath=/jbossmq-httpil, warUrl=.../deploy-hasingleton/jms/jbossmq-httpil.sar/jbossmq-httpil.war/
      01:33:12,814 INFO [A] Bound to JNDI name: queue/A
      01:33:12,825 INFO [B] Bound to JNDI name: queue/B
      01:33:12,834 INFO [C] Bound to JNDI name: queue/C
      01:33:12,843 INFO [D] Bound to JNDI name: queue/D
      01:33:12,850 INFO [ex] Bound to JNDI name: queue/ex
      01:33:12,891 INFO [testTopic] Bound to JNDI name: topic/testTopic
      01:33:12,907 INFO [securedTopic] Bound to JNDI name: topic/securedTopic
      01:33:12,917 INFO [testDurableTopic] Bound to JNDI name: topic/testDurableTopic
      01:33:12,936 INFO [testQueue] Bound to JNDI name: queue/testQueue
      01:33:13,069 INFO [UILServerILService] JBossMQ UIL service available at : /0.0.0.0:8093
      01:33:13,210 INFO [DLQ] Bound to JNDI name: queue/DLQ
      01:35:35,247 INFO [TreeCache] viewAccepted(): [192.168.1.2:1155|2] [192.168.1.2:1155]
      01:35:37,367 INFO [TreeCache] viewAccepted(): [192.168.1.2:1155|3] [192.168.1.2:1155, 192.168.1.9:32915]
      01:35:37,503 INFO [TreeCache] locking the subtree at / to transfer state
      01:35:37,517 INFO [StateTransferGenerator_140] returning the state for tree rooted in /(1024 bytes)
      01:35:40,765 INFO [UI] New cluster view for partition UI (id: 3, delta: 1) : [127.0.0.2:1099, 127.0.0.2:1099]
      01:35:40,767 INFO [UI] I am (127.0.0.2:1099) received membershipChanged event:
      01:35:40,768 INFO [UI] Dead members: 0 ([])
      01:35:40,768 INFO [UI] New Members : 0 ([])
      01:35:40,769 INFO [UI] All Members : 2 ([127.0.0.2:1099, 127.0.0.2:1099])

      On this production one, notice TreeCache shows 5 members but HAPartition.UI is only showing 4. The missing one vrx15 is now in this state such that it cannot accept incoming connections, and cannot join the cluster. It is only able to show handleJoin(something here) failed, retrying
       [org.jboss.cache.TreeCache] viewAccepted(): new members: [vrx13-in:39016, vrx12-in:42073, vrx14-in:58902, vrx15-in:41415, vrx10:57939]
      2007-06-27 18:30:49,682 INFO [org.jboss.ha.framework.interfaces.HAPartition.UI] New cluster view for partition UI: 159 ([192.168.120.112:1099, 192.168.120.143:1099, 192.168.120.144:1099, 192.168.120.110:1099] delta: -1)
      2007-06-27 18:30:49,683 INFO [org.jboss.ha.framework.server.DistributedReplicantManagerImpl.UI] I am (192.168.120.110:1099) received membershipChanged event:
      2007-06-27 18:30:49,683 INFO [org.jboss.ha.framework.server.DistributedReplicantManagerImpl.UI] Dead members: 1 ([192.168.120.111:1099])
      2007-06-27 18:30:49,683 INFO [org.jboss.ha.framework.server.DistributedReplicantManagerImpl.UI] New Members : 0 ([])
      2007-06-27 18:30:49,683 INFO [org.jboss.ha.framework.server.DistributedReplicantManagerImpl.UI] All Members : 4 ([ 192.168.120.112:1099, 192.168.120.143:1099, 192.168.120.144:1099, 192.168.120.110:1099])