handleJoin(192.168.1.9:32902) failed, retrying
snedapunk Jul 3, 2007 4:16 AMHello,
Every so often we start getting this error ?handleJoin(192.168.1.9:32
915) failed, retrying
? after a restart of a machine in the production cluster. While in this state, no connections can be made and sometimes a global restart is needed. (Jboss4.0.3SP1)
Using 4.0.5GA I was able to reproduce this in a virtual environment by starting virtual machine 1 then starting virtual machine 2. Then stopping machine 1 the coordinator and restarting virtual 1. However in4.0.5GA it seems to only do this for approx 5 minutes and then it finally reconnects. In the production environment, it appears as if it never reconnects even after hours of being in this state. See last code for state of production.
Any tips on how to prevent this would be greatly apprectiated. I supplied some org.jgroups debug output while in this state.
I think I might have read a bug about this such that getUniqueName is not unique when restarting since uses JNDI port??
2007-07-03 05:03:49,733 DEBUG [org.jgroups.protocols.UDP] creating sockets and starting threads 2007-07-03 05:03:49,733 INFO [org.jgroups.protocols.UDP] unicast sockets will use interface 192.168.1.9 2007-07-03 05:03:49,737 INFO [org.jgroups.protocols.UDP] socket information: local_addr=192.168.1.9:32915, mcast_addr=230.1.2.7:45577, bind_addr=/192.168.1.9, ttl=2 sock: bound to 192.168.1.9:32915, receive buffer size=131071, send buffer size=131071 mcast_recv_sock: bound to 192.168.1.9:45577, send buffer size=131071, receive buffer size=131071 mcast_send_sock: bound to 192.168.1.9:32916, send buffer size=131071, receive buffer size=131071 2007-07-03 05:03:49,830 INFO [STDOUT] ------------------------------------------------------- GMS: address is 192.168.1.9:32915 ------------------------------------------------------- 2007-07-03 05:03:49,833 DEBUG [org.jgroups.protocols.UDP] created unicast receiver thread 2007-07-03 05:03:49,906 DEBUG [org.jgroups.protocols.FD_SOCK] determinePingDest()=null, pingable_mbrs=[] 2007-07-03 05:03:49,907 DEBUG [org.jgroups.protocols.FD_SOCK] pinger thread terminated 2007-07-03 05:03:49,937 DEBUG [org.jgroups.protocols.UDP] sending msgs: 230.1.2.7:45577: 1 msgs 2007-07-03 05:03:51,929 DEBUG [org.jgroups.protocols.PING] initial mbrs are [[own_addr=192.168.1.2:1155, coord_addr=192.168.1.9:32908]] 2007-07-03 05:03:51,929 DEBUG [org.jgroups.protocols.pbcast.ClientGmsImpl] initial_mbrs are [[own_addr=192.168.1.2:1155, coord_addr=192.168.1.9:32908]] 2007-07-03 05:03:51,929 DEBUG [org.jgroups.protocols.pbcast.ClientGmsImpl] sending handleJoin(192.168.1.9:32915) to 192.168.1.9:32908 2007-07-03 05:03:51,933 DEBUG [org.jgroups.protocols.pbcast.STABLE] stable task started; num_gossip_runs=3, max_gossip_runs=3 2007-07-03 05:03:51,983 DEBUG [org.jgroups.protocols.UDP] sending msgs: 192.168.1.9:32908: 1 msgs 2007-07-03 05:03:52,293 DEBUG [org.jgroups.protocols.UDP] sending msgs: 192.168.1.9:32908: 1 msgs 2007-07-03 05:03:52,897 DEBUG [org.jgroups.protocols.UDP] sending msgs: 192.168.1.9:32908: 1 msgs 2007-07-03 05:03:54,101 DEBUG [org.jgroups.protocols.UDP] sending msgs: 192.168.1.9:32908: 1 msgs 2007-07-03 05:03:54,953 WARN [org.jgroups.protocols.pbcast.ClientGmsImpl] handleJoin(192.168.1.9:32915) failed, retrying 2007-07-03 05:03:56,510 DEBUG [org.jgroups.protocols.UDP] sending msgs: 192.168.1.9:32908: 1 msgs 2007-07-03 05:03:56,993 DEBUG [org.jgroups.protocols.UDP] sending msgs: 230.1.2.7:45577: 1 msgs 2007-07-03 05:03:58,966 DEBUG [org.jgroups.protocols.PING] initial mbrs are [[own_addr=192.168.1.2:1155, coord_addr=192.168.1.9:32908]] 2007-07-03 05:03:58,966 DEBUG [org.jgroups.protocols.pbcast.ClientGmsImpl] initial_mbrs are [[own_addr=192.168.1.2:1155, coord_addr=192.168.1.9:32908]] 2007-07-03 05:03:58,966 DEBUG [org.jgroups.protocols.pbcast.ClientGmsImpl] sending handleJoin(192.168.1.9:32915) to 192.168.1.9:32908 *********************************************************************A bunuch of these and then finally we get . 2007-07-03 05:05:44,188 DEBUG [org.jgroups.protocols.PING] initial mbrs are [[own_addr=192.168.1.2:1155, coord_addr=192.168.1.9:32908]] 2007-07-03 05:05:44,188 DEBUG [org.jgroups.protocols.pbcast.ClientGmsImpl] initial_mbrs are [[own_addr=192.168.1.2:1155, coord_addr=192.168.1.9:32908]] 2007-07-03 05:05:44,188 DEBUG [org.jgroups.protocols.pbcast.ClientGmsImpl] sending handleJoin(192.168.1.9:32915) to 192.168.1.9:32908 2007-07-03 05:05:44,224 DEBUG [org.jgroups.protocols.UDP] sending msgs: 192.168.1.9:32908: 1 msgs 2007-07-03 05:05:44,260 DEBUG [org.jgroups.protocols.UDP] sending msgs: 192.168.1.9:32908: 1 msgs 2007-07-03 05:05:44,436 DEBUG [org.jgroups.protocols.UDP] sending msgs: 192.168.1.9:32908: 1 msgs 2007-07-03 05:05:44,528 DEBUG [org.jgroups.protocols.UDP] sending msgs: 192.168.1.9:32908: 1 msgs 2007-07-03 05:05:44,648 DEBUG [org.jgroups.protocols.UDP] sending msgs: 192.168.1.9:32908: 1 msgs 2007-07-03 05:05:45,133 DEBUG [org.jgroups.protocols.UDP] sending msgs: 192.168.1.9:32908: 1 msgs 2007-07-03 05:05:45,336 DEBUG [org.jgroups.protocols.UDP] sending msgs: 192.168.1.9:32908: 1 msgs 2007-07-03 05:05:45,540 DEBUG [org.jgroups.protocols.UDP] sending msgs: 192.168.1.9:32908: 1 msgs 2007-07-03 05:05:45,740 DEBUG [org.jgroups.protocols.UDP] sending msgs: 192.168.1.9:32908: 1 msgs 2007-07-03 05:05:45,920 DEBUG [org.jgroups.protocols.UDP] sending msgs: 192.168.1.9:32908: 1 msgs 2007-07-03 05:05:46,120 DEBUG [org.jgroups.protocols.UDP] sending msgs: 192.168.1.9:32908: 1 msgs 2007-07-03 05:05:46,308 DEBUG [org.jgroups.protocols.UDP] sending msgs: 192.168.1.9:32908: 2 msgs 2007-07-03 05:05:46,504 DEBUG [org.jgroups.protocols.UDP] sending msgs: 192.168.1.9:32908: 1 msgs 2007-07-03 05:05:46,696 DEBUG [org.jgroups.protocols.UDP] sending msgs: 192.168.1.9:32908: 1 msgs 2007-07-03 05:05:46,893 DEBUG [org.jgroups.protocols.UDP] sending msgs: 192.168.1.9:32908: 1 msgs 2007-07-03 05:05:47,088 DEBUG [org.jgroups.protocols.UDP] sending msgs: 192.168.1.9:32908: 1 msgs 2007-07-03 05:05:47,192 WARN [org.jgroups.protocols.pbcast.ClientGmsImpl] handleJoin(192.168.1.9:32915) failed, retrying 2007-07-03 05:05:47,280 DEBUG [org.jgroups.protocols.UDP] sending msgs: 192.168.1.9:32908: 1 msgs 2007-07-03 05:05:47,476 DEBUG [org.jgroups.protocols.UDP] sending msgs: 192.168.1.9:32908: 1 msgs 2007-07-03 05:05:47,582 DEBUG [org.jgroups.protocols.FD] [SUSPECT] suspect hdr is [FD: SUSPECT (suspected_mbrs=[192.168.1.9:32908], from=192.168.1.2:1155)] 2007-07-03 05:05:47,616 DEBUG [org.jgroups.protocols.UDP] sending msgs: 192.168.1.9:32908: 1 msgs 2007-07-03 05:05:47,668 DEBUG [org.jgroups.protocols.UDP] sending msgs: 192.168.1.9:32908: 1 msgs 2007-07-03 05:05:47,864 DEBUG [org.jgroups.protocols.UDP] sending msgs: 192.168.1.9:32908: 1 msgs 2007-07-03 05:05:48,040 DEBUG [org.jgroups.protocols.UDP] sending msgs: 192.168.1.9:32908: 1 msgs 2007-07-03 05:05:48,252 DEBUG [org.jgroups.protocols.UDP] sending msgs: 192.168.1.9:32908: 1 msgs 2007-07-03 05:05:48,740 DEBUG [org.jgroups.protocols.UDP] sending msgs: 192.168.1.9:32908: 1 msgs 2007-07-03 05:05:48,940 DEBUG [org.jgroups.protocols.UDP] sending msgs: 192.168.1.9:32908: 1 msgs 2007-07-03 05:05:49,086 ERROR [org.jgroups.protocols.pbcast.ClientGmsImpl] suspect() should not be invoked on an instance of org.jgroups.protocols.pbcast.ClientGmsImpl 2007-07-03 05:05:49,087 DEBUG [org.jgroups.blocks.RequestCorrelator] suspect=192.168.1.9:32908 2007-07-03 05:05:49,136 DEBUG [org.jgroups.protocols.pbcast.NAKACK] message was discarded (not yet server) 2007-07-03 05:05:49,144 DEBUG [org.jgroups.protocols.UDP] sending msgs: 192.168.1.9:32908: 1 msgs 2007-07-03 05:05:49,232 DEBUG [org.jgroups.protocols.UDP] sending msgs: 230.1.2.7:45577: 1 msgs 2007-07-03 05:05:49,344 DEBUG [org.jgroups.protocols.UDP] sending msgs: 192.168.1.9:32908: 1 msgs 2007-07-03 05:05:49,524 DEBUG [org.jgroups.protocols.UDP] sending msgs: 192.168.1.9:32908: 1 msgs 2007-07-03 05:05:49,724 DEBUG [org.jgroups.protocols.UDP] sending msgs: 192.168.1.9:32908: 1 msgs 2007-07-03 05:05:49,912 DEBUG [org.jgroups.protocols.UDP] sending msgs: 192.168.1.9:32908: 1 msgs 2007-07-03 05:05:50,109 DEBUG [org.jgroups.protocols.UDP] sending msgs: 192.168.1.9:32908: 1 msgs 2007-07-03 05:05:50,300 DEBUG [org.jgroups.protocols.UDP] sending msgs: 192.168.1.9:32908: 1 msgs 2007-07-03 05:05:50,496 DEBUG [org.jgroups.protocols.UDP] sending msgs: 192.168.1.9:32908: 1 msgs 2007-07-03 05:05:50,692 DEBUG [org.jgroups.protocols.UDP] sending msgs: 192.168.1.9:32908: 1 msgs 2007-07-03 05:05:50,884 DEBUG [org.jgroups.protocols.UDP] sending msgs: 192.168.1.9:32908: 1 msgs 2007-07-03 05:05:51,080 DEBUG [org.jgroups.protocols.UDP] sending msgs: 192.168.1.9:32908: 1 msgs 2007-07-03 05:05:51,200 DEBUG [org.jgroups.protocols.PING] initial mbrs are [[own_addr=192.168.1.2:1155, coord_addr=192.168.1.2:1155]] 2007-07-03 05:05:51,200 DEBUG [org.jgroups.protocols.pbcast.ClientGmsImpl] initial_mbrs are [[own_addr=192.168.1.2:1155, coord_addr=192.168.1.2:1155]] 2007-07-03 05:05:51,200 DEBUG [org.jgroups.protocols.pbcast.ClientGmsImpl] sending handleJoin(192.168.1.9:32915) to 192.168.1.2:1155 2007-07-03 05:05:51,236 DEBUG [org.jgroups.protocols.UDP] sending msgs: 192.168.1.2:1155: 1 msgs 2007-07-03 05:05:51,272 DEBUG [org.jgroups.protocols.UDP]
Here is the output of the other machine. Looks like the unique name is not unique.
01:29:19,705 INFO [Server] JBoss (MX MicroKernel) [4.0.5.GA (build: CVSTag=Branch_4_0 date=200702201759)] Started in 52s:15ms 01:33:08,987 INFO [UI] New cluster view for partition UI (id: 2, delta: -1) : [127.0.0.2:1099] 01:33:09,063 INFO [UI] I am (127.0.0.2:1099) received membershipChanged event: 01:33:09,075 INFO [UI] Dead members: 0 ([]) 01:33:09,076 INFO [UI] New Members : 0 ([]) 01:33:09,076 INFO [UI] All Members : 1 ([127.0.0.2:1099]) 01:33:11,800 INFO [TomcatDeployer] deploy, ctxPath=/jbossmq-httpil, warUrl=.../deploy-hasingleton/jms/jbossmq-httpil.sar/jbossmq-httpil.war/ 01:33:12,814 INFO [A] Bound to JNDI name: queue/A 01:33:12,825 INFO [B] Bound to JNDI name: queue/B 01:33:12,834 INFO [C] Bound to JNDI name: queue/C 01:33:12,843 INFO [D] Bound to JNDI name: queue/D 01:33:12,850 INFO [ex] Bound to JNDI name: queue/ex 01:33:12,891 INFO [testTopic] Bound to JNDI name: topic/testTopic 01:33:12,907 INFO [securedTopic] Bound to JNDI name: topic/securedTopic 01:33:12,917 INFO [testDurableTopic] Bound to JNDI name: topic/testDurableTopic 01:33:12,936 INFO [testQueue] Bound to JNDI name: queue/testQueue 01:33:13,069 INFO [UILServerILService] JBossMQ UIL service available at : /0.0.0.0:8093 01:33:13,210 INFO [DLQ] Bound to JNDI name: queue/DLQ 01:35:35,247 INFO [TreeCache] viewAccepted(): [192.168.1.2:1155|2] [192.168.1.2:1155] 01:35:37,367 INFO [TreeCache] viewAccepted(): [192.168.1.2:1155|3] [192.168.1.2:1155, 192.168.1.9:32915] 01:35:37,503 INFO [TreeCache] locking the subtree at / to transfer state 01:35:37,517 INFO [StateTransferGenerator_140] returning the state for tree rooted in /(1024 bytes) 01:35:40,765 INFO [UI] New cluster view for partition UI (id: 3, delta: 1) : [127.0.0.2:1099, 127.0.0.2:1099] 01:35:40,767 INFO [UI] I am (127.0.0.2:1099) received membershipChanged event: 01:35:40,768 INFO [UI] Dead members: 0 ([]) 01:35:40,768 INFO [UI] New Members : 0 ([]) 01:35:40,769 INFO [UI] All Members : 2 ([127.0.0.2:1099, 127.0.0.2:1099])
On this production one, notice TreeCache shows 5 members but HAPartition.UI is only showing 4. The missing one vrx15 is now in this state such that it cannot accept incoming connections, and cannot join the cluster. It is only able to show handleJoin(something here) failed, retrying
[org.jboss.cache.TreeCache] viewAccepted(): new members: [vrx13-in:39016, vrx12-in:42073, vrx14-in:58902, vrx15-in:41415, vrx10:57939] 2007-06-27 18:30:49,682 INFO [org.jboss.ha.framework.interfaces.HAPartition.UI] New cluster view for partition UI: 159 ([192.168.120.112:1099, 192.168.120.143:1099, 192.168.120.144:1099, 192.168.120.110:1099] delta: -1) 2007-06-27 18:30:49,683 INFO [org.jboss.ha.framework.server.DistributedReplicantManagerImpl.UI] I am (192.168.120.110:1099) received membershipChanged event: 2007-06-27 18:30:49,683 INFO [org.jboss.ha.framework.server.DistributedReplicantManagerImpl.UI] Dead members: 1 ([192.168.120.111:1099]) 2007-06-27 18:30:49,683 INFO [org.jboss.ha.framework.server.DistributedReplicantManagerImpl.UI] New Members : 0 ([]) 2007-06-27 18:30:49,683 INFO [org.jboss.ha.framework.server.DistributedReplicantManagerImpl.UI] All Members : 4 ([ 192.168.120.112:1099, 192.168.120.143:1099, 192.168.120.144:1099, 192.168.120.110:1099])