5 Replies Latest reply on Apr 24, 2018 9:54 PM by carcara

Order of node start up works one way, but not the other way

philipwu Mar 6, 2007 12:46 PM

Hi folks,

I'm having trouble trying to figure this out. Any help would be appreciated.
I have 2 nodes that I am trying to cluster.

If I start up node A and then node B, I get an error saying

12:35:05,988 WARN [ServiceController] Problem starting service jboss.cache:service=TomcatClusteringCache
org.jboss.cache.CacheException: Initial state transfer failed: Channel.getState() returned false

However, if I start up node B and then node A, everything starts up just fine.

Firewalls are disabled and both have the same cluster-service.xml configuration. I have read some FAQs saying that I should increase timeout, but that didn't help either. I also set the following in my tree cache configuration:

<pbcast.STATE_TRANSFER down_thread="true" up_thread="true"/>

And that didn't help either.

Any ideas?

thanx,
Phil

1. Re: Order of node start up works one way, but not the other

brian.stansberry Mar 6, 2007 1:06 PM (in response to philipwu)

1) What version of JBoss AS?

2) Which node gives you that WARN?

3) If it's the second node to start that reports the WARN, are there any exceptions in the first node?
Actions

2. Re: Order of node start up works one way, but not the other

philipwu Mar 6, 2007 1:30 PM (in response to philipwu)

Thanx for the quick reply:

1) I am using Jboss 4.0.5.GA installed with 1.2.0.GA jems installer, with ejb3-clustered configuration.

2) If I start node A and then node B, it is node B that gives that warning. I had switched log4j.xml to DEBUG level and I see no errors coming from node A. According to the logs, it appers at first that they can find each other and they are registered in a cluster. But soon afterwards node B is removed from the cluster probably because of the WARN message.

In the trace below node A = 192.168.1.100, node B = 192.168.1.105

Here's a larger trace of node B:

13:09:10,407 INFO [FRAG2] frag_size=60000, overhead=0, new frag_size=60000
13:09:10,516 INFO [UDP] unicast sockets will use interface 192.168.1.105
13:09:10,521 INFO [UDP] socket information:
local_addr=localhost:32811, mcast_addr=228.1.2.4:45577, bind_addr=/192.168.1.105, ttl=2
sock: bound to 192.168.1.105:32811, receive buffer size=109568, send buffer size=109568
mcast_recv_sock: bound to 192.168.1.105:45577, send buffer size=109568, receive buffer size=109568
mcast_send_sock: bound to 192.168.1.105:32812, send buffer size=109568, receive buffer size=109568
13:09:10,522 INFO [STDOUT]
-------------------------------------------------------
GMS: address is localhost:32811
-------------------------------------------------------
13:09:12,815 INFO [TreeCache] TreeCache local address is localhost:32811
13:09:12,816 INFO [TreeCache] viewAccepted(): [192.168.1.100:32771|1] [192.168.1.100:32771, localhost:32811]
13:09:27,832 WARN [ServiceController] Problem starting service jboss.cache:service=TomcatClusteringCache
org.jboss.cache.CacheException: Initial state transfer failed: Channel.getState() returned false
 at org.jboss.cache.TreeCache.fetchStateOnStartup(TreeCache.java:3191)
 at org.jboss.cache.TreeCache.startService(TreeCache.java:1429)

Here's some of the trace from node A:

2007-03-06 12:53:50,201 DEBUG [org.jgroups.protocols.pbcast.GMS] mbr=192.168.1.105:32811
2007-03-06 12:53:50,201 DEBUG [org.jgroups.protocols.pbcast.CoordGmsImpl] mbr=192.168.1.105:32811
2007-03-06 12:53:50,202 DEBUG [org.jgroups.protocols.pbcast.CoordGmsImpl] got digest=[localhost:32771: [5 : 6]
2007-03-06 12:53:50,202 DEBUG [org.jgroups.protocols.pbcast.GMS] VID=1, current members=(localhost:32771), new_mbrs=(192.168.1.105:32811), old
_mbrs=(), suspected_mbrs=()
2007-03-06 12:53:50,202 DEBUG [org.jgroups.protocols.pbcast.GMS] new view is [localhost:32771|1] [localhost:32771, 192.168.1.105:32811]
2007-03-06 12:53:50,202 DEBUG [org.jgroups.protocols.pbcast.CoordGmsImpl] joined member 192.168.1.105:32811, view is [localhost:32771|1] [loca
lhost:32771, 192.168.1.105:32811]

3) No exceptions in Node A

Thanx,

Phil

3. Re: Order of node start up works one way, but not the other

philipwu Mar 6, 2007 2:26 PM (in response to philipwu)

Another experiment:

I tried updating Jboss lib to use the latest stable JGroups 2.4.0

This time Node B shows a lightly different trace.

2007-03-06 14:20:09,055 DEBUG [org.jboss.ha.framework.server.ClusterPartition] Caught exception after channel connected; closing channel -- In
itial state transfer failed: Channel.getState() returned false

Looks like there was an exception in Node B, so node B sent a LEAVE request to Node A. Here's a more detailed trace of Node B:

2007-03-06 14:19:38,990 DEBUG [org.jgroups.protocols.FD_SOCK] VIEW_CHANGE received: [192.168.1.100:32772, 192.168.1.105:32825]
2007-03-06 14:19:38,991 DEBUG [org.jgroups.protocols.FD] suspected_mbrs: [], after adjustment: []
2007-03-06 14:19:38,993 DEBUG [org.jgroups.protocols.FD_SOCK] determinePingDest()=192.168.1.100:32772, pingable_mbrs=[192.168.1.100:32772, 192
.168.1.105:32825]
2007-03-06 14:19:38,994 DEBUG [org.jgroups.protocols.FD_SOCK] ping_dest=192.168.1.100:32772, ping_sock=Socket[addr=/192.168.1.100,port=54423,l
ocalport=47344], cache={192.168.1.100:32772=192.168.1.100:54423}
2007-03-06 14:19:39,047 DEBUG [org.jboss.ha.framework.interfaces.HAPartition.wubrothers] ViewAccepted: initial members set
2007-03-06 14:19:39,047 DEBUG [org.jboss.ha.framework.server.ClusterPartition] Starting channel
2007-03-06 14:19:39,048 DEBUG [org.jboss.ha.framework.interfaces.HAPartition.wubrothers] get nodeName
2007-03-06 14:19:39,048 DEBUG [org.jboss.ha.framework.interfaces.HAPartition.wubrothers] Get current members
2007-03-06 14:19:39,048 INFO [org.jboss.ha.framework.interfaces.HAPartition.wubrothers] Number of cluster members: 2
2007-03-06 14:19:39,049 INFO [org.jboss.ha.framework.interfaces.HAPartition.wubrothers] Other members: 1
2007-03-06 14:19:39,049 INFO [org.jboss.ha.framework.interfaces.HAPartition.wubrothers] Fetching state (will wait for 30000 milliseconds):
2007-03-06 14:19:39,050 DEBUG [org.jgroups.protocols.pbcast.STATE_TRANSFER] GET_STATE: asking 192.168.1.100:32772 for state
2007-03-06 14:19:39,050 DEBUG [org.jgroups.protocols.pbcast.STATE_TRANSFER] passing down a SUSPEND_STABLE event
2007-03-06 14:19:39,050 DEBUG [org.jgroups.protocols.pbcast.STABLE] suspending message garbage collection
2007-03-06 14:19:39,050 DEBUG [org.jgroups.protocols.pbcast.STABLE] resume task started, max_suspend_time=33000
2007-03-06 14:19:43,168 DEBUG [org.jgroups.util.TimeScheduler] Running task org.jgroups.protocols.pbcast.STABLE$StableTask@d7c6bf
2007-03-06 14:19:43,208 DEBUG [org.jgroups.util.TimeScheduler] Running task org.jgroups.protocols.TP$Bundler$BundlingTimer@392814
2007-03-06 14:19:46,920 DEBUG [org.jgroups.util.TimeScheduler] Running task org.jgroups.protocols.pbcast.STABLE$StableTask@7673a2
2007-03-06 14:19:49,000 DEBUG [org.jgroups.util.TimeScheduler] Running task true
2007-03-06 14:19:49,000 DEBUG [org.jgroups.protocols.FD] sending are-you-alive msg to 192.168.1.100:32772 (own address=192.168.1.105:32825)
2007-03-06 14:19:49,001 DEBUG [org.jgroups.protocols.FD] received ack from 192.168.1.100:32772
2007-03-06 14:19:55,487 DEBUG [org.jgroups.util.TimeScheduler] Running task org.jgroups.protocols.TP$Bundler$BundlingTimer@a193fe
2007-03-06 14:19:55,559 DEBUG [org.jgroups.util.TimeScheduler] Running task true
2007-03-06 14:19:55,559 DEBUG [org.jgroups.protocols.FD] sending are-you-alive msg to 192.168.1.100:32770 (own address=192.168.1.105:32823)
2007-03-06 14:19:55,599 DEBUG [org.jgroups.util.TimeScheduler] Running task org.jgroups.protocols.TP$Bundler$BundlingTimer@1fb2ef9
2007-03-06 14:19:55,633 DEBUG [org.jgroups.protocols.FD] received ack from 192.168.1.100:32770
2007-03-06 14:19:59,007 DEBUG [org.jgroups.util.TimeScheduler] Running task true
2007-03-06 14:19:59,007 DEBUG [org.jgroups.protocols.FD] sending are-you-alive msg to 192.168.1.100:32772 (own address=192.168.1.105:32825)
2007-03-06 14:19:59,008 DEBUG [org.jgroups.protocols.FD] received ack from 192.168.1.100:32772
2007-03-06 14:20:09,015 DEBUG [org.jgroups.util.TimeScheduler] Running task true
2007-03-06 14:20:09,015 DEBUG [org.jgroups.protocols.FD] sending are-you-alive msg to 192.168.1.100:32772 (own address=192.168.1.105:32825)
2007-03-06 14:20:09,016 DEBUG [org.jgroups.protocols.FD] received ack from 192.168.1.100:32772
2007-03-06 14:20:09,055 DEBUG [org.jboss.ha.framework.server.ClusterPartition] Caught exception after channel connected; closing channel -- In
itial state transfer failed: Channel.getState() returned false
2007-03-06 14:20:09,056 DEBUG [org.jgroups.protocols.pbcast.STABLE] resuming message garbage collection
2007-03-06 14:20:09,056 DEBUG [org.jgroups.protocols.pbcast.GMS] sending LEAVE request to 192.168.1.100:32772 (local_addr=192.168.1.105:32825)
2007-03-06 14:20:09,113 DEBUG [org.jgroups.protocols.pbcast.GMS] view=[192.168.1.100:32772|2] [192.168.1.100:32772]
2007-03-06 14:20:09,113 DEBUG [org.jgroups.protocols.pbcast.GMS] view=[192.168.1.100:32772|2] [192.168.1.100:32772]
2007-03-06 14:20:13,538 DEBUG [org.jgroups.util.TimeScheduler] Running task org.jgroups.protocols.pbcast.STABLE$StabilitySendTask@8d41f2
2007-03-06 14:20:13,578 DEBUG [org.jgroups.util.TimeScheduler] Running task org.jgroups.protocols.TP$Bundler$BundlingTimer@1e492d8
2007-03-06 14:20:14,062 DEBUG [org.jgroups.protocols.pbcast.GMS] sending LEAVE request to 192.168.1.100:32772 (local_addr=192.168.1.105:32825)
2007-03-06 14:20:14,371 DEBUG [org.jgroups.util.TimeScheduler] Running task 6-6
2007-03-06 14:20:14,978 DEBUG [org.jgroups.util.TimeScheduler] Running task 6-6
2007-03-06 14:20:15,494 DEBUG [org.jgroups.util.TimeScheduler] Running task org.jgroups.protocols.TP$Bundler$BundlingTimer@164de59
2007-03-06 14:20:15,566 DEBUG [org.jgroups.util.TimeScheduler] Running task true
2007-03-06 14:20:15,566 DEBUG [org.jgroups.protocols.FD] sending are-you-alive msg to 192.168.1.100:32770 (own address=192.168.1.105:32823)
2007-03-06 14:20:15,602 DEBUG [org.jgroups.util.TimeScheduler] Running task org.jgroups.protocols.TP$Bundler$BundlingTimer@141b736
2007-03-06 14:20:15,635 DEBUG [org.jgroups.protocols.FD] received ack from 192.168.1.100:32770
2007-03-06 14:20:16,186 DEBUG [org.jgroups.util.TimeScheduler] Running task 6-6
2007-03-06 14:20:18,594 DEBUG [org.jgroups.util.TimeScheduler] Running task 6-6
2007-03-06 14:20:19,022 DEBUG [org.jgroups.util.TimeScheduler] Running task true
2007-03-06 14:20:19,023 DEBUG [org.jgroups.protocols.FD] sending are-you-alive msg to 192.168.1.100:32772 (own address=192.168.1.105:32825)
2007-03-06 14:20:19,024 DEBUG [org.jgroups.protocols.FD] received ack from 192.168.1.100:32772
2007-03-06 14:20:19,070 DEBUG [org.jgroups.protocols.pbcast.GMS] sending LEAVE request to 192.168.1.100:32772 (local_addr=192.168.1.105:32825)
2007-03-06 14:20:19,378 DEBUG [org.jgroups.util.TimeScheduler] Running task 7-7
2007-03-06 14:20:19,986 DEBUG [org.jgroups.util.TimeScheduler] Running task 7-7
2007-03-06 14:20:21,194 DEBUG [org.jgroups.util.TimeScheduler] Running task 7-7
2007-03-06 14:20:23,402 DEBUG [org.jgroups.util.TimeScheduler] Running task 6-6
2007-03-06 14:20:23,602 DEBUG [org.jgroups.util.TimeScheduler] Running task 7-7
2007-03-06 14:20:24,074 DEBUG [org.jgroups.protocols.pbcast.GMS] 192.168.1.105:32825 changed role to org.jgroups.protocols.pbcast.ClientGmsImp
l
2007-03-06 14:20:24,078 DEBUG [org.jgroups.protocols.FD_SOCK] socket to 192.168.1.100:32772 was reset
2007-03-06 14:20:24,078 DEBUG [org.jgroups.protocols.FD_SOCK] pinger thread terminated
2007-03-06 14:20:24,079 DEBUG [org.jgroups.protocols.UDP] closing sockets and stopping threads
2007-03-06 14:20:24,079 DEBUG [org.jgroups.protocols.UDP] multicast receive socket closed
2007-03-06 14:20:24,079 DEBUG [org.jgroups.protocols.UDP] multicast send socket closed
2007-03-06 14:20:24,079 DEBUG [org.jgroups.protocols.UDP] multicast thread terminated
2007-03-06 14:20:24,080 DEBUG [org.jgroups.protocols.UDP] socket closed
2007-03-06 14:20:24,080 DEBUG [org.jgroups.protocols.UDP] unicast receiver socket is closed, exception=java.net.SocketException: Socket closed
2007-03-06 14:20:24,080 DEBUG [org.jgroups.protocols.UDP] unicast receiver thread terminated
2007-03-06 14:20:24,082 DEBUG [org.jboss.ha.framework.server.ClusterPartition] Starting failed jboss:service=wubrothers
java.lang.IllegalStateException: Initial state transfer failed: Channel.getState() returned false
at org.jboss.ha.framework.server.HAPartitionImpl.fetchState(HAPartitionImpl.java:351)

Node A still shows no exceptions. Here's the trace for node A that received the LEAVE request:

2007-03-06 14:04:45,864 DEBUG [org.jgroups.protocols.FD] received ack from 192.168.1.105:32825
2007-03-06 14:04:45,920 DEBUG [org.jgroups.protocols.pbcast.GMS] received LEAVE_REQ for 192.168.1.105:32825 from 192.168.1.105:32825
2007-03-06 14:04:45,973 DEBUG [org.jgroups.protocols.pbcast.GMS] new=[], suspected=[], leaving=[192.168.1.105:32825], new view: [192.168.1.100
:32772|2] [192.168.1.100:32772]
2007-03-06 14:04:45,975 DEBUG [org.jgroups.protocols.pbcast.GMS] view=[192.168.1.100:32772|2] [192.168.1.100:32772]
2007-03-06 14:04:45,975 DEBUG [org.jgroups.protocols.pbcast.GMS] [local_addr=192.168.1.100:32772] view is [192.168.1.100:32772|2] [192.168.1.1
00:32772]

Any other ideas?

Phil
Actions
4. Re: Order of node start up works one way, but not the other

brian.stansberry Mar 6, 2007 2:39 PM (in response to philipwu)

In this case the problem was with the HAPartition channel (cluster-service.xml). Earlier it was the Tomcat clustering channel (tc5-cluster.sar).

The state transfer failed because node B didn't get a response to the state transfer request from A within 30 secs. The bits of logs from A don't cover the period when B requested state, so it's hard to say why.

Suggest you:

1) Go through the process outlined at http://wiki.jboss.org/wiki/Wiki.jsp?page=TestingJBoss with the configs you use for both cluster-service.xml and tc5-cluster.sar. Most likely there is some connectivity issue.

2) Get the server clocks in sync. Won't fix the problem, but trying to interpret cluster logs with out of sync timestamps is way too much of a pain.
Actions
5. Re: Order of node start up works one way, but not the other way

carcara Apr 24, 2018 9:54 PM (in response to philipwu)

Hello,

I have the same problem. Do you fix this issue?

This issue only happening to me when my server is was the load. If I don't have load, the cluster works fine.

logs when I try to join the node in cluster:

2018-04-24 22:43:08,049 DEBUG [org.jgroups.protocols.pbcast.GMS] 192.168.123.10:7810 changed role to org.jgroups.protocols.pbcast.ParticipantGmsImpl
2018-04-24 22:43:08,051 DEBUG [org.jgroups.protocols.pbcast.STATE_TRANSFER] GET_STATE: asking 192.168.123.20:7810 for state
2018-04-24 22:43:08,051 DEBUG [org.jgroups.protocols.pbcast.STATE_TRANSFER] passing down a SUSPEND_STABLE event
2018-04-24 22:43:08,051 DEBUG [org.jgroups.protocols.pbcast.STABLE] suspending message garbage collection
2018-04-24 22:43:08,052 DEBUG [org.jgroups.protocols.pbcast.STABLE] resume task started, max_suspend_time=16500
2018-04-24 22:43:10,918 WARN [org.jboss.cache.TreeCache] node /JSESSION/localhost/xxxx/hv37aPNsKLuHbQgkCp54+A** not found
2018-04-24 22:43:10,970 WARN [org.jboss.cache.TreeCache] node /JSESSION/localhost/xxxx/2eiU7jl+RJSZX0t1KhLe1w** not found
2018-04-24 22:43:11,031 WARN [org.jboss.cache.TreeCache] node /JSESSION/localhost/xxxx/DWJKghNriwboNeTtfAqsYw** not found
2018-04-24 22:43:11,041 WARN [org.jboss.cache.TreeCache] node /JSESSION/localhost/xxx/aUE1zdZYr2lCR0rL14WZhA** not found
2018-04-24 22:43:11,042 WARN [org.jboss.cache.TreeCache] node /JSESSION/localhost/xxxx/oODkdnsefeZNttgVCWTvLQ** not found
...
2018-04-24 22:43:23,053 DEBUG [org.jgroups.protocols.pbcast.STABLE] resuming message garbage collection
2018-04-24 22:43:23,054 DEBUG [org.jgroups.protocols.pbcast.GMS] sending LEAVE request to 192.168.123.20:7810 (local_addr=192.168.123.10:7810)
2018-04-24 22:43:23,060 WARN [org.jboss.cache.TreeCache] node /JSESSION/localhost/xxx/s+rIVvEFUyZFC71u9s8pCA** not found
2018-04-24 22:43:23,078 WARN [org.jboss.cache.TreeCache] node /JSESSION/localhost/xxx/1va3twuNTm2qAPF-UdyAxQ** not found
2018-04-24 22:43:23,095 WARN [org.jboss.cache.TreeCache] node /JSESSION/localhost/xx/surKCZMqxr4Ib9R4c+FGyg** not found
2018-04-24 22:43:23,165 WARN [org.jboss.cache.TreeCache] node /JSESSION/localhost/xxx/BPvPRsYsgUmg07eNWmekHw** not found
2018-04-24 22:43:23,165 WARN [org.jboss.cache.TreeCache] node /JSESSION/localhost/xxxx/FwikxGZgSvDcbDgJ1X+Rpg** not found
2018-04-24 22:43:23,166 WARN [org.jboss.cache.TreeCache] node /JSESSION/localhost/xx/OiJuFVuTNMDfRk6yJviBrQ** not found
2018-04-24 22:43:23,166 WARN [org.jboss.cache.TreeCache] node /JSESSION/localhost/xxxxx/dyhn1m1voD9fC4PXqW8Xvw** not found
2018-04-24 22:43:23,537 DEBUG [org.jgroups.protocols.pbcast.GMS] 192.168.123.10:7810 changed role to org.jgroups.protocols.pbcast.ClientGmsImpl
2018-04-24 22:43:23,545 DEBUG [org.jgroups.protocols.FD_SOCK] socket to 192.168.123.20:7810 was reset
2018-04-24 22:43:23,545 DEBUG [org.jgroups.protocols.FD_SOCK] pinger thread terminated
2018-04-24 22:43:23,545 INFO [org.jgroups.blocks.ConnectionTable] exception is java.net.SocketException: Socket closed

2018-04-24 22:43:23,549 WARN [org.jboss.system.ServiceController] Problem starting service jboss.cache:service=TomcatClusteringCache
org.jboss.cache.CacheException: Initial state transfer failed: Channel.getState() returned false
at org.jboss.cache.TreeCache.fetchStateOnStartup(TreeCache.java:3378)
at org.jboss.cache.TreeCache.startService(TreeCache.java:1556)
at org.jboss.cache.aop.PojoCache.startService(PojoCache.java:94)
at org.jboss.system.ServiceMBeanSupport.jbossInternalStart(ServiceMBeanSupport.java:289)

Can you help me, please?

Thanks
Actions

Go to original post