5 Replies Latest reply on Apr 24, 2018 9:54 PM by cfranco Franco

    Order of node start up works one way, but not the other way

    Philip Wu Newbie

      Hi folks,

      I'm having trouble trying to figure this out. Any help would be appreciated.
      I have 2 nodes that I am trying to cluster.

      If I start up node A and then node B, I get an error saying

      12:35:05,988 WARN [ServiceController] Problem starting service jboss.cache:service=TomcatClusteringCache
      org.jboss.cache.CacheException: Initial state transfer failed: Channel.getState() returned false

      However, if I start up node B and then node A, everything starts up just fine.

      Firewalls are disabled and both have the same cluster-service.xml configuration. I have read some FAQs saying that I should increase timeout, but that didn't help either. I also set the following in my tree cache configuration:

      <pbcast.STATE_TRANSFER down_thread="true" up_thread="true"/>

      And that didn't help either.

      Any ideas?

      thanx,
      Phil

        • 1. Re: Order of node start up works one way, but not the other
          Brian Stansberry Master

          1) What version of JBoss AS?

          2) Which node gives you that WARN?

          3) If it's the second node to start that reports the WARN, are there any exceptions in the first node?

          • 2. Re: Order of node start up works one way, but not the other
            Philip Wu Newbie

            Thanx for the quick reply:

            1) I am using Jboss 4.0.5.GA installed with 1.2.0.GA jems installer, with ejb3-clustered configuration.

            2) If I start node A and then node B, it is node B that gives that warning. I had switched log4j.xml to DEBUG level and I see no errors coming from node A. According to the logs, it appers at first that they can find each other and they are registered in a cluster. But soon afterwards node B is removed from the cluster probably because of the WARN message.

            In the trace below node A = 192.168.1.100, node B = 192.168.1.105

            Here's a larger trace of node B:

            13:09:10,407 INFO [FRAG2] frag_size=60000, overhead=0, new frag_size=60000
            13:09:10,516 INFO [UDP] unicast sockets will use interface 192.168.1.105
            13:09:10,521 INFO [UDP] socket information:
            local_addr=localhost:32811, mcast_addr=228.1.2.4:45577, bind_addr=/192.168.1.105, ttl=2
            sock: bound to 192.168.1.105:32811, receive buffer size=109568, send buffer size=109568
            mcast_recv_sock: bound to 192.168.1.105:45577, send buffer size=109568, receive buffer size=109568
            mcast_send_sock: bound to 192.168.1.105:32812, send buffer size=109568, receive buffer size=109568
            13:09:10,522 INFO [STDOUT]
            -------------------------------------------------------
            GMS: address is localhost:32811
            -------------------------------------------------------
            13:09:12,815 INFO [TreeCache] TreeCache local address is localhost:32811
            13:09:12,816 INFO [TreeCache] viewAccepted(): [192.168.1.100:32771|1] [192.168.1.100:32771, localhost:32811]
            13:09:27,832 WARN [ServiceController] Problem starting service jboss.cache:service=TomcatClusteringCache
            org.jboss.cache.CacheException: Initial state transfer failed: Channel.getState() returned false
             at org.jboss.cache.TreeCache.fetchStateOnStartup(TreeCache.java:3191)
             at org.jboss.cache.TreeCache.startService(TreeCache.java:1429)
            


            Here's some of the trace from node A:


            2007-03-06 12:53:50,201 DEBUG [org.jgroups.protocols.pbcast.GMS] mbr=192.168.1.105:32811
            2007-03-06 12:53:50,201 DEBUG [org.jgroups.protocols.pbcast.CoordGmsImpl] mbr=192.168.1.105:32811
            2007-03-06 12:53:50,202 DEBUG [org.jgroups.protocols.pbcast.CoordGmsImpl] got digest=[localhost:32771: [5 : 6]
            2007-03-06 12:53:50,202 DEBUG [org.jgroups.protocols.pbcast.GMS] VID=1, current members=(localhost:32771), new_mbrs=(192.168.1.105:32811), old
            _mbrs=(), suspected_mbrs=()
            2007-03-06 12:53:50,202 DEBUG [org.jgroups.protocols.pbcast.GMS] new view is [localhost:32771|1] [localhost:32771, 192.168.1.105:32811]
            2007-03-06 12:53:50,202 DEBUG [org.jgroups.protocols.pbcast.CoordGmsImpl] joined member 192.168.1.105:32811, view is [localhost:32771|1] [loca
            lhost:32771, 192.168.1.105:32811]


            3) No exceptions in Node A

            Thanx,

            Phil


            • 3. Re: Order of node start up works one way, but not the other
              Philip Wu Newbie

              Another experiment:

              I tried updating Jboss lib to use the latest stable JGroups 2.4.0

              This time Node B shows a lightly different trace.

              2007-03-06 14:20:09,055 DEBUG [org.jboss.ha.framework.server.ClusterPartition] Caught exception after channel connected; closing channel -- In
              itial state transfer failed: Channel.getState() returned false


              Looks like there was an exception in Node B, so node B sent a LEAVE request to Node A. Here's a more detailed trace of Node B:

              2007-03-06 14:19:38,990 DEBUG [org.jgroups.protocols.FD_SOCK] VIEW_CHANGE received: [192.168.1.100:32772, 192.168.1.105:32825]
              2007-03-06 14:19:38,991 DEBUG [org.jgroups.protocols.FD] suspected_mbrs: [], after adjustment: []
              2007-03-06 14:19:38,993 DEBUG [org.jgroups.protocols.FD_SOCK] determinePingDest()=192.168.1.100:32772, pingable_mbrs=[192.168.1.100:32772, 192
              .168.1.105:32825]
              2007-03-06 14:19:38,994 DEBUG [org.jgroups.protocols.FD_SOCK] ping_dest=192.168.1.100:32772, ping_sock=Socket[addr=/192.168.1.100,port=54423,l
              ocalport=47344], cache={192.168.1.100:32772=192.168.1.100:54423}
              2007-03-06 14:19:39,047 DEBUG [org.jboss.ha.framework.interfaces.HAPartition.wubrothers] ViewAccepted: initial members set
              2007-03-06 14:19:39,047 DEBUG [org.jboss.ha.framework.server.ClusterPartition] Starting channel
              2007-03-06 14:19:39,048 DEBUG [org.jboss.ha.framework.interfaces.HAPartition.wubrothers] get nodeName
              2007-03-06 14:19:39,048 DEBUG [org.jboss.ha.framework.interfaces.HAPartition.wubrothers] Get current members
              2007-03-06 14:19:39,048 INFO [org.jboss.ha.framework.interfaces.HAPartition.wubrothers] Number of cluster members: 2
              2007-03-06 14:19:39,049 INFO [org.jboss.ha.framework.interfaces.HAPartition.wubrothers] Other members: 1
              2007-03-06 14:19:39,049 INFO [org.jboss.ha.framework.interfaces.HAPartition.wubrothers] Fetching state (will wait for 30000 milliseconds):
              2007-03-06 14:19:39,050 DEBUG [org.jgroups.protocols.pbcast.STATE_TRANSFER] GET_STATE: asking 192.168.1.100:32772 for state
              2007-03-06 14:19:39,050 DEBUG [org.jgroups.protocols.pbcast.STATE_TRANSFER] passing down a SUSPEND_STABLE event
              2007-03-06 14:19:39,050 DEBUG [org.jgroups.protocols.pbcast.STABLE] suspending message garbage collection
              2007-03-06 14:19:39,050 DEBUG [org.jgroups.protocols.pbcast.STABLE] resume task started, max_suspend_time=33000
              2007-03-06 14:19:43,168 DEBUG [org.jgroups.util.TimeScheduler] Running task org.jgroups.protocols.pbcast.STABLE$StableTask@d7c6bf
              2007-03-06 14:19:43,208 DEBUG [org.jgroups.util.TimeScheduler] Running task org.jgroups.protocols.TP$Bundler$BundlingTimer@392814
              2007-03-06 14:19:46,920 DEBUG [org.jgroups.util.TimeScheduler] Running task org.jgroups.protocols.pbcast.STABLE$StableTask@7673a2
              2007-03-06 14:19:49,000 DEBUG [org.jgroups.util.TimeScheduler] Running task true
              2007-03-06 14:19:49,000 DEBUG [org.jgroups.protocols.FD] sending are-you-alive msg to 192.168.1.100:32772 (own address=192.168.1.105:32825)
              2007-03-06 14:19:49,001 DEBUG [org.jgroups.protocols.FD] received ack from 192.168.1.100:32772
              2007-03-06 14:19:55,487 DEBUG [org.jgroups.util.TimeScheduler] Running task org.jgroups.protocols.TP$Bundler$BundlingTimer@a193fe
              2007-03-06 14:19:55,559 DEBUG [org.jgroups.util.TimeScheduler] Running task true
              2007-03-06 14:19:55,559 DEBUG [org.jgroups.protocols.FD] sending are-you-alive msg to 192.168.1.100:32770 (own address=192.168.1.105:32823)
              2007-03-06 14:19:55,599 DEBUG [org.jgroups.util.TimeScheduler] Running task org.jgroups.protocols.TP$Bundler$BundlingTimer@1fb2ef9
              2007-03-06 14:19:55,633 DEBUG [org.jgroups.protocols.FD] received ack from 192.168.1.100:32770
              2007-03-06 14:19:59,007 DEBUG [org.jgroups.util.TimeScheduler] Running task true
              2007-03-06 14:19:59,007 DEBUG [org.jgroups.protocols.FD] sending are-you-alive msg to 192.168.1.100:32772 (own address=192.168.1.105:32825)
              2007-03-06 14:19:59,008 DEBUG [org.jgroups.protocols.FD] received ack from 192.168.1.100:32772
              2007-03-06 14:20:09,015 DEBUG [org.jgroups.util.TimeScheduler] Running task true
              2007-03-06 14:20:09,015 DEBUG [org.jgroups.protocols.FD] sending are-you-alive msg to 192.168.1.100:32772 (own address=192.168.1.105:32825)
              2007-03-06 14:20:09,016 DEBUG [org.jgroups.protocols.FD] received ack from 192.168.1.100:32772
              2007-03-06 14:20:09,055 DEBUG [org.jboss.ha.framework.server.ClusterPartition] Caught exception after channel connected; closing channel -- In
              itial state transfer failed: Channel.getState() returned false
              2007-03-06 14:20:09,056 DEBUG [org.jgroups.protocols.pbcast.STABLE] resuming message garbage collection
              2007-03-06 14:20:09,056 DEBUG [org.jgroups.protocols.pbcast.GMS] sending LEAVE request to 192.168.1.100:32772 (local_addr=192.168.1.105:32825)
              2007-03-06 14:20:09,113 DEBUG [org.jgroups.protocols.pbcast.GMS] view=[192.168.1.100:32772|2] [192.168.1.100:32772]
              2007-03-06 14:20:09,113 DEBUG [org.jgroups.protocols.pbcast.GMS] view=[192.168.1.100:32772|2] [192.168.1.100:32772]
              2007-03-06 14:20:13,538 DEBUG [org.jgroups.util.TimeScheduler] Running task org.jgroups.protocols.pbcast.STABLE$StabilitySendTask@8d41f2
              2007-03-06 14:20:13,578 DEBUG [org.jgroups.util.TimeScheduler] Running task org.jgroups.protocols.TP$Bundler$BundlingTimer@1e492d8
              2007-03-06 14:20:14,062 DEBUG [org.jgroups.protocols.pbcast.GMS] sending LEAVE request to 192.168.1.100:32772 (local_addr=192.168.1.105:32825)
              2007-03-06 14:20:14,371 DEBUG [org.jgroups.util.TimeScheduler] Running task 6-6
              2007-03-06 14:20:14,978 DEBUG [org.jgroups.util.TimeScheduler] Running task 6-6
              2007-03-06 14:20:15,494 DEBUG [org.jgroups.util.TimeScheduler] Running task org.jgroups.protocols.TP$Bundler$BundlingTimer@164de59
              2007-03-06 14:20:15,566 DEBUG [org.jgroups.util.TimeScheduler] Running task true
              2007-03-06 14:20:15,566 DEBUG [org.jgroups.protocols.FD] sending are-you-alive msg to 192.168.1.100:32770 (own address=192.168.1.105:32823)
              2007-03-06 14:20:15,602 DEBUG [org.jgroups.util.TimeScheduler] Running task org.jgroups.protocols.TP$Bundler$BundlingTimer@141b736
              2007-03-06 14:20:15,635 DEBUG [org.jgroups.protocols.FD] received ack from 192.168.1.100:32770
              2007-03-06 14:20:16,186 DEBUG [org.jgroups.util.TimeScheduler] Running task 6-6
              2007-03-06 14:20:18,594 DEBUG [org.jgroups.util.TimeScheduler] Running task 6-6
              2007-03-06 14:20:19,022 DEBUG [org.jgroups.util.TimeScheduler] Running task true
              2007-03-06 14:20:19,023 DEBUG [org.jgroups.protocols.FD] sending are-you-alive msg to 192.168.1.100:32772 (own address=192.168.1.105:32825)
              2007-03-06 14:20:19,024 DEBUG [org.jgroups.protocols.FD] received ack from 192.168.1.100:32772
              2007-03-06 14:20:19,070 DEBUG [org.jgroups.protocols.pbcast.GMS] sending LEAVE request to 192.168.1.100:32772 (local_addr=192.168.1.105:32825)
              2007-03-06 14:20:19,378 DEBUG [org.jgroups.util.TimeScheduler] Running task 7-7
              2007-03-06 14:20:19,986 DEBUG [org.jgroups.util.TimeScheduler] Running task 7-7
              2007-03-06 14:20:21,194 DEBUG [org.jgroups.util.TimeScheduler] Running task 7-7
              2007-03-06 14:20:23,402 DEBUG [org.jgroups.util.TimeScheduler] Running task 6-6
              2007-03-06 14:20:23,602 DEBUG [org.jgroups.util.TimeScheduler] Running task 7-7
              2007-03-06 14:20:24,074 DEBUG [org.jgroups.protocols.pbcast.GMS] 192.168.1.105:32825 changed role to org.jgroups.protocols.pbcast.ClientGmsImp
              l
              2007-03-06 14:20:24,078 DEBUG [org.jgroups.protocols.FD_SOCK] socket to 192.168.1.100:32772 was reset
              2007-03-06 14:20:24,078 DEBUG [org.jgroups.protocols.FD_SOCK] pinger thread terminated
              2007-03-06 14:20:24,079 DEBUG [org.jgroups.protocols.UDP] closing sockets and stopping threads
              2007-03-06 14:20:24,079 DEBUG [org.jgroups.protocols.UDP] multicast receive socket closed
              2007-03-06 14:20:24,079 DEBUG [org.jgroups.protocols.UDP] multicast send socket closed
              2007-03-06 14:20:24,079 DEBUG [org.jgroups.protocols.UDP] multicast thread terminated
              2007-03-06 14:20:24,080 DEBUG [org.jgroups.protocols.UDP] socket closed
              2007-03-06 14:20:24,080 DEBUG [org.jgroups.protocols.UDP] unicast receiver socket is closed, exception=java.net.SocketException: Socket closed
              2007-03-06 14:20:24,080 DEBUG [org.jgroups.protocols.UDP] unicast receiver thread terminated
              2007-03-06 14:20:24,082 DEBUG [org.jboss.ha.framework.server.ClusterPartition] Starting failed jboss:service=wubrothers
              java.lang.IllegalStateException: Initial state transfer failed: Channel.getState() returned false
              at org.jboss.ha.framework.server.HAPartitionImpl.fetchState(HAPartitionImpl.java:351)



              Node A still shows no exceptions. Here's the trace for node A that received the LEAVE request:

              2007-03-06 14:04:45,864 DEBUG [org.jgroups.protocols.FD] received ack from 192.168.1.105:32825
              2007-03-06 14:04:45,920 DEBUG [org.jgroups.protocols.pbcast.GMS] received LEAVE_REQ for 192.168.1.105:32825 from 192.168.1.105:32825
              2007-03-06 14:04:45,973 DEBUG [org.jgroups.protocols.pbcast.GMS] new=[], suspected=[], leaving=[192.168.1.105:32825], new view: [192.168.1.100
              :32772|2] [192.168.1.100:32772]
              2007-03-06 14:04:45,975 DEBUG [org.jgroups.protocols.pbcast.GMS] view=[192.168.1.100:32772|2] [192.168.1.100:32772]
              2007-03-06 14:04:45,975 DEBUG [org.jgroups.protocols.pbcast.GMS] [local_addr=192.168.1.100:32772] view is [192.168.1.100:32772|2] [192.168.1.1
              00:32772]


              Any other ideas?

              Phil

              • 4. Re: Order of node start up works one way, but not the other
                Brian Stansberry Master

                In this case the problem was with the HAPartition channel (cluster-service.xml). Earlier it was the Tomcat clustering channel (tc5-cluster.sar).

                The state transfer failed because node B didn't get a response to the state transfer request from A within 30 secs. The bits of logs from A don't cover the period when B requested state, so it's hard to say why.

                Suggest you:

                1) Go through the process outlined at http://wiki.jboss.org/wiki/Wiki.jsp?page=TestingJBoss with the configs you use for both cluster-service.xml and tc5-cluster.sar. Most likely there is some connectivity issue.

                2) Get the server clocks in sync. Won't fix the problem, but trying to interpret cluster logs with out of sync timestamps is way too much of a pain.

                • 5. Re: Order of node start up works one way, but not the other way
                  cfranco Franco Newbie

                  Hello,

                   

                  I have the same problem. Do you fix this issue?

                   

                  This issue only happening to me when my server is was the load. If I don't have load, the cluster works fine.

                   

                   

                  logs when I try to join the node in cluster:

                   

                  2018-04-24 22:43:08,049 DEBUG [org.jgroups.protocols.pbcast.GMS] 192.168.123.10:7810 changed role to org.jgroups.protocols.pbcast.ParticipantGmsImpl

                  2018-04-24 22:43:08,051 DEBUG [org.jgroups.protocols.pbcast.STATE_TRANSFER] GET_STATE: asking 192.168.123.20:7810 for state

                  2018-04-24 22:43:08,051 DEBUG [org.jgroups.protocols.pbcast.STATE_TRANSFER] passing down a SUSPEND_STABLE event

                  2018-04-24 22:43:08,051 DEBUG [org.jgroups.protocols.pbcast.STABLE] suspending message garbage collection

                  2018-04-24 22:43:08,052 DEBUG [org.jgroups.protocols.pbcast.STABLE] resume task started, max_suspend_time=16500

                  2018-04-24 22:43:10,918 WARN  [org.jboss.cache.TreeCache] node /JSESSION/localhost/xxxx/hv37aPNsKLuHbQgkCp54+A** not found

                  2018-04-24 22:43:10,970 WARN  [org.jboss.cache.TreeCache] node /JSESSION/localhost/xxxx/2eiU7jl+RJSZX0t1KhLe1w** not found

                  2018-04-24 22:43:11,031 WARN  [org.jboss.cache.TreeCache] node /JSESSION/localhost/xxxx/DWJKghNriwboNeTtfAqsYw** not found

                  2018-04-24 22:43:11,041 WARN  [org.jboss.cache.TreeCache] node /JSESSION/localhost/xxx/aUE1zdZYr2lCR0rL14WZhA** not found

                  2018-04-24 22:43:11,042 WARN  [org.jboss.cache.TreeCache] node /JSESSION/localhost/xxxx/oODkdnsefeZNttgVCWTvLQ** not found

                  ...

                  2018-04-24 22:43:23,053 DEBUG [org.jgroups.protocols.pbcast.STABLE] resuming message garbage collection

                  2018-04-24 22:43:23,054 DEBUG [org.jgroups.protocols.pbcast.GMS] sending LEAVE request to 192.168.123.20:7810 (local_addr=192.168.123.10:7810)

                  2018-04-24 22:43:23,060 WARN  [org.jboss.cache.TreeCache] node /JSESSION/localhost/xxx/s+rIVvEFUyZFC71u9s8pCA** not found

                  2018-04-24 22:43:23,078 WARN  [org.jboss.cache.TreeCache] node /JSESSION/localhost/xxx/1va3twuNTm2qAPF-UdyAxQ** not found

                  2018-04-24 22:43:23,095 WARN  [org.jboss.cache.TreeCache] node /JSESSION/localhost/xx/surKCZMqxr4Ib9R4c+FGyg** not found

                  2018-04-24 22:43:23,165 WARN  [org.jboss.cache.TreeCache] node /JSESSION/localhost/xxx/BPvPRsYsgUmg07eNWmekHw** not found

                  2018-04-24 22:43:23,165 WARN  [org.jboss.cache.TreeCache] node /JSESSION/localhost/xxxx/FwikxGZgSvDcbDgJ1X+Rpg** not found

                  2018-04-24 22:43:23,166 WARN  [org.jboss.cache.TreeCache] node /JSESSION/localhost/xx/OiJuFVuTNMDfRk6yJviBrQ** not found

                  2018-04-24 22:43:23,166 WARN  [org.jboss.cache.TreeCache] node /JSESSION/localhost/xxxxx/dyhn1m1voD9fC4PXqW8Xvw** not found

                  2018-04-24 22:43:23,537 DEBUG [org.jgroups.protocols.pbcast.GMS] 192.168.123.10:7810 changed role to org.jgroups.protocols.pbcast.ClientGmsImpl

                  2018-04-24 22:43:23,545 DEBUG [org.jgroups.protocols.FD_SOCK] socket to 192.168.123.20:7810 was reset

                  2018-04-24 22:43:23,545 DEBUG [org.jgroups.protocols.FD_SOCK] pinger thread terminated

                  2018-04-24 22:43:23,545 INFO  [org.jgroups.blocks.ConnectionTable] exception is java.net.SocketException: Socket closed

                   

                  2018-04-24 22:43:23,549 WARN  [org.jboss.system.ServiceController] Problem starting service jboss.cache:service=TomcatClusteringCache

                  org.jboss.cache.CacheException: Initial state transfer failed: Channel.getState() returned false

                    at org.jboss.cache.TreeCache.fetchStateOnStartup(TreeCache.java:3378)

                    at org.jboss.cache.TreeCache.startService(TreeCache.java:1556)

                    at org.jboss.cache.aop.PojoCache.startService(PojoCache.java:94)

                    at org.jboss.system.ServiceMBeanSupport.jbossInternalStart(ServiceMBeanSupport.java:289)

                   

                  Can you help me, please?

                   

                  Thanks