3 Replies Latest reply on Dec 13, 2005 6:53 PM by brian.stansberry

    Clustered nodes cannot discover each other after unplugging/

    vegecat

      Hi, all

      I have two clustered servers running. I unplugged the network cable from one server and, after the other server detected the failure, plugged it back in. The two servers could not discover each other, and no error message was shown.

        • 1. Re: Clustered nodes cannot discover each other after unplugg

          You should turn on JGroups trace logging to see the details. Please refer to the JGroups page on the JBoss wiki.

          • 2. Re: Clustered nodes cannot discover each other after unplugg
            vegecat

            Hi, Ben

            Thanks for the pointer. I repeated the unplug/replug test with TRACE logging enabled for JGroups. The result was different from what I saw previously: the two clustered nodes did discover each other after an initial failure. This is the output on one node after I plugged the network cable back in (NMS is the partition name):

            16:59:04,867 INFO [TreeCache] viewAccepted(): new members: [USABBRDUL14407:2567, USABBRDUL15163:1672]
            16:59:04,877 ERROR [GMS] [USABBRDUL14407:2567] received view <= current view; discarding it (current vid: [USABBRDUL14407:2567|13], new vid: [USABBRDUL14407:2567|13])
            16:59:05,077 WARN [NAKACK] [USABBRDUL14407:2569 (additional data: 19 bytes)] discarded message from non-member USABBRDUL15163:1674 (additional data: 18 bytes)
            16:59:06,740 INFO [TreeCache] viewAccepted(): new members: [USABBRDUL14407:2568, USABBRDUL15163:1676]
            16:59:06,740 ERROR [GMS] [USABBRDUL14407:2568] received view <= current view; discarding it (current vid: [USABBRDUL14407:2568|13], new vid: [USABBRDUL14407:2568|13])
            16:59:08,642 WARN [NAKACK] [USABBRDUL14407:2569 (additional data: 19 bytes)] discarded message from non-member USABBRDUL15163:1674 (additional data: 18 bytes)
            16:59:08,963 INFO [NMS] New cluster view for partition NMS (id: 13, delta: 1) : [130.110.93.205:1099, 10.66.248.243:1099]
            16:59:08,963 INFO [NMS] Merging partitions...
            16:59:08,963 INFO [NMS] Dead members: 0
            16:59:08,963 INFO [NMS] Originating groups: [[USABBRDUL14407:2569 (additional data: 19 bytes)|12] [USABBRDUL14407:2569 (additional data: 19 bytes)], [USABBRDUL15163:1674 (additional data: 18 bytes)|12] [USABBRDUL15163:1674 (additional data: 18 bytes)]]
            16:59:08,973 ERROR [GMS] [USABBRDUL14407:2569 (additional data: 19 bytes)] received view <= current view; discarding it (current vid: [USABBRDUL14407:2569 (additional data: 19 bytes)|13], new vid: [USABBRDUL14407:2569 (additional data: 19 bytes)|13])
            16:59:09,323 ERROR [NMS] merge failed
            java.lang.ClassCastException: EDU.oswego.cs.dl.util.concurrent.ConcurrentReaderHashMap
             at org.jboss.ha.framework.server.DistributedReplicantManagerImpl.mergeMembers(DistributedReplicantManagerImpl.java:791)
             at org.jboss.ha.framework.server.DistributedReplicantManagerImpl$MergeMembers.run(DistributedReplicantManagerImpl.java:927)
            16:59:17,595 WARN [CoordGmsImpl] merge responses from subgroup coordinators <= 1 ([sender=USABBRDUL14407:2534, view=[USABBRDUL14407:2534|1] [USABBRDUL14407:2534, USABBRDUL15163:1641], digest=[USABBRDUL14407:2534: [0 : 21, USABBRDUL15163:1641: [0 : 0]]). Cancelling merge
            16:59:37,784 WARN [CoordGmsImpl] merge responses from subgroup coordinators <= 1 ([sender=USABBRDUL14407:2534, view=[USABBRDUL14407:2534|1] [USABBRDUL14407:2534, USABBRDUL15163:1641], digest=[USABBRDUL14407:2534: [0 : 21, USABBRDUL15163:1641: [0 : 0]]). Cancelling merge
            16:59:50,963 WARN [CoordGmsImpl] merge responses from subgroup coordinators <= 1 ([sender=USABBRDUL14407:2534, view=[USABBRDUL14407:2534|1] [USABBRDUL14407:2534, USABBRDUL15163:1641], digest=[USABBRDUL14407:2534: [0 : 22, USABBRDUL15163:1641: [0 : 0]]). Cancelling merge
            17:00:04,162 WARN [CoordGmsImpl] merge responses from subgroup coordinators <= 1 ([sender=USABBRDUL14407:2534, view=[USABBRDUL14407:2534|1] [USABBRDUL14407:2534, USABBRDUL15163:1641], digest=[USABBRDUL14407:2534: [0 : 22, USABBRDUL15163:1641: [0 : 0]]). Cancelling merge
            17:00:18,172 WARN [CoordGmsImpl] merge responses from subgroup coordinators <= 1 ([sender=USABBRDUL14407:2534, view=[USABBRDUL14407:2534|1] [USABBRDUL14407:2534, USABBRDUL15163:1641], digest=[USABBRDUL14407:2534: [0 : 23, USABBRDUL15163:1641: [0 : 0]]). Cancelling merge
            17:00:21,848 WARN [FD] ping_dest is null: members=[USABBRDUL14407:2568, USABBRDUL15163:1676], pingable_mbrs=[USABBRDUL14407:2568], local_addr=USABBRDUL14407:2568
            17:00:22,979 WARN [FD] ping_dest is null: members=[USABBRDUL14407:2567, USABBRDUL15163:1672], pingable_mbrs=[USABBRDUL14407:2567], local_addr=USABBRDUL14407:2567
            17:00:23,350 INFO [TreeCache] viewAccepted(): new members: [USABBRDUL14407:2568]
            17:00:24,482 INFO [TreeCache] viewAccepted(): new members: [USABBRDUL14407:2567]
            17:00:29,148 WARN [FD] ping_dest is null: members=[USABBRDUL14407:2569 (additional data: 19 bytes), USABBRDUL15163:1674 (additional data: 18 bytes)], pingable_mbrs=[USABBRDUL14407:2569 (additional data: 19 bytes)], local_addr=USABBRDUL14407:2569 (additional data: 19 bytes)
            17:00:29,579 INFO [NMS] Suspected member: USABBRDUL15163:1674 (additional data: 18 bytes)
            17:00:29,579 INFO [NMS] New cluster view for partition NMS (id: 14, delta: -1) : [130.110.93.205:1099]
            17:00:29,589 INFO [NMS] I am (130.110.93.205:1099) received membershipChanged event:
            17:00:29,589 INFO [NMS] Dead members: 1 ([10.66.248.243:1099])
            17:00:29,589 INFO [NMS] New Members : 0 ([])
            17:00:29,589 INFO [NMS] All Members : 1 ([130.110.93.205:1099])
            17:00:33,044 WARN [NAKACK] [USABBRDUL14407:2569 (additional data: 19 bytes)] discarded message from non-member USABBRDUL15163:1674 (additional data: 18 bytes)
            17:00:33,054 WARN [NAKACK] [USABBRDUL14407:2568] discarded message from non-member USABBRDUL15163:1676
            17:00:33,064 WARN [NAKACK] [USABBRDUL14407:2567] discarded message from non-member USABBRDUL15163:1672
            17:00:35,958 INFO [TreeCache] viewAccepted(): new members: [USABBRDUL14407:2568, USABBRDUL15163:1698]
            17:00:35,958 INFO [TreeCache] viewAccepted(): new members: [USABBRDUL14407:2567, USABBRDUL15163:1701]
            17:00:35,968 INFO [NMS] New cluster view for partition NMS (id: 15, delta: 1) : [130.110.93.205:1099, 10.66.248.243:1099]
            17:00:35,968 INFO [NMS] I am (130.110.93.205:1099) received membershipChanged event:
            17:00:35,968 INFO [NMS] Dead members: 0 ([])
            17:00:35,968 INFO [NMS] New Members : 1 ([10.66.248.243:1099])
            17:00:35,978 INFO [TreeCache] locking the tree to obtain transient state
            17:00:35,978 INFO [TreeCache] returning the transient state (140 bytes)
            17:00:35,978 INFO [NMS] All Members : 2 ([130.110.93.205:1099, 10.66.248.243:1099])
            17:00:35,978 INFO [TreeCache] locking the tree to obtain transient state
            17:00:35,978 INFO [TreeCache] returning the transient state (140 bytes)
            17:00:52,071 WARN [CoordGmsImpl] merge responses from subgroup coordinators <= 1 ([sender=USABBRDUL14407:2534, view=[USABBRDUL14407:2534|1] [USABBRDUL14407:2534, USABBRDUL15163:1641], digest=[USABBRDUL14407:2534: [0 : 24, USABBRDUL15163:1641: [0 : 0]]). Cancelling merge
            17:01:06,442 WARN [CoordGmsImpl] merge responses from subgroup coordinators <= 1 ([sender=USABBRDUL14407:2534, view=[USABBRDUL14407:2534|1] [USABBRDUL14407:2534, USABBRDUL15163:1641], digest=[USABBRDUL14407:2534: [0 : 24, USABBRDUL15163:1641: [0 : 0]]). Cancelling merge


            • 3. Re: Clustered nodes cannot discover each other after unplugg
              brian.stansberry

              The

              16:59:09,323 ERROR [NMS] merge failed
              java.lang.ClassCastException: EDU.oswego.cs.dl.util.concurrent.ConcurrentReaderHashMap
              at org.jboss.ha.framework.server.DistributedReplicantManagerImpl.mergeMembers(DistributedReplicantManagerImpl.java:791)
              at org.jboss.ha.framework.server.DistributedReplicantManagerImpl$MergeMembers.run(DistributedReplicantManagerImpl.java:927)
              problem is due to http://jira.jboss.com/jira/browse/JBAS-2439.

              As for the rest of the problems, it's hard to tell without understanding your environment. Is this the "all" config from 4.0.3, with 3 tree caches (session replication, SFSB replication, entity bean replication) + the NMS Partition? If so, it looks like 2 of the caches recovered:

              17:00:35,958 INFO [TreeCache] viewAccepted(): new members: [USABBRDUL14407:2568, USABBRDUL15163:1698]
              17:00:35,958 INFO [TreeCache] viewAccepted(): new members: [USABBRDUL14407:2567, USABBRDUL15163:1701]


              while the NMS partition lost a member:

              17:00:35,968 INFO [NMS] New Members : 1 ([10.66.248.243:1099])


              which isn't surprising given the bug referenced above. It's not clear what happened to the 3rd TreeCache; by the end of the log snippet it hadn't recovered.
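
For anyone reading a similar log, one rough way to check which caches recovered is to look at the last view each TreeCache channel logged. Below is a minimal Python sketch of that idea. It assumes the console output has been unwrapped to one entry per line, and that a channel can be identified by its address on the local host. The names `latest_views` and `recovered` are made up for illustration and are not part of JBoss or JGroups.

```python
import re

# Matches TreeCache view lines such as:
#   17:00:35,958 INFO [TreeCache] viewAccepted(): new members: [HOST:2568, HOST:1698]
VIEW_RE = re.compile(r"\[TreeCache\] viewAccepted\(\): new members: \[([^\]]*)\]")


def latest_views(log_text, local_host):
    """Map each local channel address to the member list of its last logged view."""
    views = {}
    for match in VIEW_RE.finditer(log_text):
        members = [m.strip() for m in match.group(1).split(",") if m.strip()]
        # Assumption: a channel is identified by its address on the local host.
        local = next((m for m in members if m.startswith(local_host + ":")), None)
        if local is not None:
            views[local] = members  # later views overwrite earlier ones
    return views


def recovered(views):
    """Local addresses whose most recent view contains more than one member."""
    return sorted(addr for addr, members in views.items() if len(members) > 1)


# Four TreeCache lines taken from the log posted in this thread.
SAMPLE = """\
17:00:23,350 INFO [TreeCache] viewAccepted(): new members: [USABBRDUL14407:2568]
17:00:24,482 INFO [TreeCache] viewAccepted(): new members: [USABBRDUL14407:2567]
17:00:35,958 INFO [TreeCache] viewAccepted(): new members: [USABBRDUL14407:2568, USABBRDUL15163:1698]
17:00:35,958 INFO [TreeCache] viewAccepted(): new members: [USABBRDUL14407:2567, USABBRDUL15163:1701]
"""

views = latest_views(SAMPLE, "USABBRDUL14407")
print(recovered(views))
```

On the four sample lines this reports the channels at ports 2567 and 2568, the two caches that show a two-member view at 17:00:35. Treat the result as a hint rather than proof: a cache whose last logged view predates the network split would still be reported with that old view, and a cache that never logged a view in the captured output would not appear at all.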