Clustered nodes cannot discover each other after unplugging/... | JBoss.org Content Archive (Read Only)

1. Re: Clustered nodes cannot discover each other after unplugg

ben.wang Dec 13, 2005 11:50 AM (in response to vegecat)

You should turn on JGroups trace logging to see the details. Please refer to the JGroups page on the JBoss wiki.

2. Re: Clustered nodes cannot discover each other after unplugg

vegecat Dec 13, 2005 5:08 PM (in response to vegecat)

Hi, Ben

Thanks for the pointer. I tested the unplug/replug network cable scenario again with JGroups TRACE logging enabled. The result was different from what I saw previously: the two clustered nodes could discover each other after an initial failure. This is the output shown on one node after I plugged the network cable back in (NMS is the partition name):

16:59:04,867 INFO [TreeCache] viewAccepted(): new members: [USABBRDUL14407:2567, USABBRDUL15163:1672]
16:59:04,877 ERROR [GMS] [USABBRDUL14407:2567] received view <= current view; discarding it (current vid: [USABBRDUL14407:2567|13], new vid: [USABBRDUL14407:2567|13])
16:59:05,077 WARN [NAKACK] [USABBRDUL14407:2569 (additional data: 19 bytes)] discarded message from non-member USABBRDUL15163:1674 (additional data: 18 bytes)
16:59:06,740 INFO [TreeCache] viewAccepted(): new members: [USABBRDUL14407:2568, USABBRDUL15163:1676]
16:59:06,740 ERROR [GMS] [USABBRDUL14407:2568] received view <= current view; discarding it (current vid: [USABBRDUL14407:2568|13], new vid: [USABBRDUL14407:2568|13])
16:59:08,642 WARN [NAKACK] [USABBRDUL14407:2569 (additional data: 19 bytes)] discarded message from non-member USABBRDUL15163:1674 (additional data: 18 bytes)
16:59:08,963 INFO [NMS] New cluster view for partition NMS (id: 13, delta: 1) : [130.110.93.205:1099, 10.66.248.243:1099]
16:59:08,963 INFO [NMS] Merging partitions...
16:59:08,963 INFO [NMS] Dead members: 0
16:59:08,963 INFO [NMS] Originating groups: [[USABBRDUL14407:2569 (additional data: 19 bytes)|12] [USABBRDUL14407:2569 (additional data: 19 bytes)], [USABBRDUL15163:1674 (additional data: 18 bytes)|12] [USABBRDUL15163:1674 (additional data: 18 bytes)]]
16:59:08,973 ERROR [GMS] [USABBRDUL14407:2569 (additional data: 19 bytes)] received view <= current view; discarding it (current vid: [USABBRDUL14407:2569 (additional data: 19 bytes)|13], new vid: [USABBRDUL14407:2569 (additional data: 19 bytes)|13])
16:59:09,323 ERROR [NMS] merge failed
java.lang.ClassCastException: EDU.oswego.cs.dl.util.concurrent.ConcurrentReaderHashMap
    at org.jboss.ha.framework.server.DistributedReplicantManagerImpl.mergeMembers(DistributedReplicantManagerImpl.java:791)
    at org.jboss.ha.framework.server.DistributedReplicantManagerImpl$MergeMembers.run(DistributedReplicantManagerImpl.java:927)
16:59:17,595 WARN [CoordGmsImpl] merge responses from subgroup coordinators <= 1 ([sender=USABBRDUL14407:2534, view=[USABBRDUL14407:2534|1] [USABBRDUL14407:2534, USABBRDUL15163:1641], digest=[USABBRDUL14407:2534: [0 : 21, USABBRDUL15163:1641: [0 : 0]]). Cancelling merge
16:59:37,784 WARN [CoordGmsImpl] merge responses from subgroup coordinators <= 1 ([sender=USABBRDUL14407:2534, view=[USABBRDUL14407:2534|1] [USABBRDUL14407:2534, USABBRDUL15163:1641], digest=[USABBRDUL14407:2534: [0 : 21, USABBRDUL15163:1641: [0 : 0]]). Cancelling merge
16:59:50,963 WARN [CoordGmsImpl] merge responses from subgroup coordinators <= 1 ([sender=USABBRDUL14407:2534, view=[USABBRDUL14407:2534|1] [USABBRDUL14407:2534, USABBRDUL15163:1641], digest=[USABBRDUL14407:2534: [0 : 22, USABBRDUL15163:1641: [0 : 0]]). Cancelling merge
17:00:04,162 WARN [CoordGmsImpl] merge responses from subgroup coordinators <= 1 ([sender=USABBRDUL14407:2534, view=[USABBRDUL14407:2534|1] [USABBRDUL14407:2534, USABBRDUL15163:1641], digest=[USABBRDUL14407:2534: [0 : 22, USABBRDUL15163:1641: [0 : 0]]). Cancelling merge
17:00:18,172 WARN [CoordGmsImpl] merge responses from subgroup coordinators <= 1 ([sender=USABBRDUL14407:2534, view=[USABBRDUL14407:2534|1] [USABBRDUL14407:2534, USABBRDUL15163:1641], digest=[USABBRDUL14407:2534: [0 : 23, USABBRDUL15163:1641: [0 : 0]]). Cancelling merge
17:00:21,848 WARN [FD] ping_dest is null: members=[USABBRDUL14407:2568, USABBRDUL15163:1676], pingable_mbrs=[USABBRDUL14407:2568], local_addr=USABBRDUL14407:2568
17:00:22,979 WARN [FD] ping_dest is null: members=[USABBRDUL14407:2567, USABBRDUL15163:1672], pingable_mbrs=[USABBRDUL14407:2567], local_addr=USABBRDUL14407:2567
17:00:23,350 INFO [TreeCache] viewAccepted(): new members: [USABBRDUL14407:2568]
17:00:24,482 INFO [TreeCache] viewAccepted(): new members: [USABBRDUL14407:2567]
17:00:29,148 WARN [FD] ping_dest is null: members=[USABBRDUL14407:2569 (additional data: 19 bytes), USABBRDUL15163:1674 (additional data: 18 bytes)], pingable_mbrs=[USABBRDUL14407:2569 (additional data: 19 bytes)], local_addr=USABBRDUL14407:2569 (additional data: 19 bytes)
17:00:29,579 INFO [NMS] Suspected member: USABBRDUL15163:1674 (additional data: 18 bytes)
17:00:29,579 INFO [NMS] New cluster view for partition NMS (id: 14, delta: -1) : [130.110.93.205:1099]
17:00:29,589 INFO [NMS] I am (130.110.93.205:1099) received membershipChanged event:
17:00:29,589 INFO [NMS] Dead members: 1 ([10.66.248.243:1099])
17:00:29,589 INFO [NMS] New Members : 0 ([])
17:00:29,589 INFO [NMS] All Members : 1 ([130.110.93.205:1099])
17:00:33,044 WARN [NAKACK] [USABBRDUL14407:2569 (additional data: 19 bytes)] discarded message from non-member USABBRDUL15163:1674 (additional data: 18 bytes)
17:00:33,054 WARN [NAKACK] [USABBRDUL14407:2568] discarded message from non-member USABBRDUL15163:1676
17:00:33,064 WARN [NAKACK] [USABBRDUL14407:2567] discarded message from non-member USABBRDUL15163:1672
17:00:35,958 INFO [TreeCache] viewAccepted(): new members: [USABBRDUL14407:2568, USABBRDUL15163:1698]
17:00:35,958 INFO [TreeCache] viewAccepted(): new members: [USABBRDUL14407:2567, USABBRDUL15163:1701]
17:00:35,968 INFO [NMS] New cluster view for partition NMS (id: 15, delta: 1) : [130.110.93.205:1099, 10.66.248.243:1099]
17:00:35,968 INFO [NMS] I am (130.110.93.205:1099) received membershipChanged event:
17:00:35,968 INFO [NMS] Dead members: 0 ([])
17:00:35,968 INFO [NMS] New Members : 1 ([10.66.248.243:1099])
17:00:35,978 INFO [TreeCache] locking the tree to obtain transient state
17:00:35,978 INFO [TreeCache] returning the transient state (140 bytes)
17:00:35,978 INFO [NMS] All Members : 2 ([130.110.93.205:1099, 10.66.248.243:1099])
17:00:35,978 INFO [TreeCache] locking the tree to obtain transient state
17:00:35,978 INFO [TreeCache] returning the transient state (140 bytes)
17:00:52,071 WARN [CoordGmsImpl] merge responses from subgroup coordinators <= 1 ([sender=USABBRDUL14407:2534, view=[USABBRDUL14407:2534|1] [USABBRDUL14407:2534, USABBRDUL15163:1641], digest=[USABBRDUL14407:2534: [0 : 24, USABBRDUL15163:1641: [0 : 0]]). Cancelling merge
17:01:06,442 WARN [CoordGmsImpl] merge responses from subgroup coordinators <= 1 ([sender=USABBRDUL14407:2534, view=[USABBRDUL14407:2534|1] [USABBRDUL14407:2534, USABBRDUL15163:1641], digest=[USABBRDUL14407:2534: [0 : 24, USABBRDUL15163:1641: [0 : 0]]). Cancelling merge
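
The GMS errors in this log all report a new view whose id equals the current one (|13 against |13). As a rough illustration of why such a view is dropped, here is a minimal sketch that pulls both ids out of a log line and applies a "strictly newer" rule. The class ViewIdCheck and its methods are hypothetical helpers written for this thread, not JGroups API, and the rule is a simplification that assumes only the numeric part of the view id matters.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for reading the log above; not part of JGroups.
final class ViewIdCheck {
    // Captures the number after '|' in "current vid: [...|13]" and "new vid: [...|13]".
    private static final Pattern VID =
            Pattern.compile("(current|new) vid: \\[.*?\\|(\\d+)\\]");

    /** Returns {currentId, newId} parsed from a "received view <= current view" line. */
    static long[] parseVids(String logLine) {
        long current = -1;
        long next = -1;
        Matcher m = VID.matcher(logLine);
        while (m.find()) {
            long id = Long.parseLong(m.group(2));
            if (m.group(1).equals("current")) {
                current = id;
            } else {
                next = id;
            }
        }
        if (current < 0 || next < 0) {
            throw new IllegalArgumentException("no view ids found in: " + logLine);
        }
        return new long[] {current, next};
    }

    /** Simplified rule (an assumption): install a view only if its id is strictly newer. */
    static boolean wouldInstall(long currentId, long newId) {
        return newId > currentId;
    }

    public static void main(String[] args) {
        String line = "16:59:04,877 ERROR [GMS] [USABBRDUL14407:2567] received view <= current view; "
                + "discarding it (current vid: [USABBRDUL14407:2567|13], new vid: [USABBRDUL14407:2567|13])";
        long[] ids = parseVids(line);
        // Prints: current=13 new=13 install=false
        System.out.println("current=" + ids[0] + " new=" + ids[1]
                + " install=" + wouldInstall(ids[0], ids[1]));
    }
}
```

Under that assumption, the three "received view <= current view" errors mean each channel was offered a view it had already installed.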

3. Re: Clustered nodes cannot discover each other after unplugg

brian.stansberry Dec 13, 2005 6:53 PM (in response to vegecat)

The

16:59:09,323 ERROR [NMS] merge failed
java.lang.ClassCastException: EDU.oswego.cs.dl.util.concurrent.ConcurrentReaderHashMap
    at org.jboss.ha.framework.server.DistributedReplicantManagerImpl.mergeMembers(DistributedReplicantManagerImpl.java:791)
    at org.jboss.ha.framework.server.DistributedReplicantManagerImpl$MergeMembers.run(DistributedReplicantManagerImpl.java:927)

problem is due to http://jira.jboss.com/jira/browse/JBAS-2439.
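
For readers unfamiliar with this kind of failure: the exception message names ConcurrentReaderHashMap as the object that could not be cast. The sketch below shows the general pattern, using java.util.concurrent.ConcurrentHashMap as a stand-in for the oswego class. It is not the DistributedReplicantManagerImpl code and not the actual fix for JBAS-2439; the class and method names are made up for illustration.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Illustration only: a generic version of the failure mode, not JBoss code.
final class MergeCastSketch {
    /** Fragile: assumes the payload is always a java.util.HashMap. */
    static int countFragile(Object payload) {
        // Throws ClassCastException for any Map that is not a HashMap subclass.
        HashMap<?, ?> map = (HashMap<?, ?>) payload;
        return map.size();
    }

    /** More tolerant: depends only on the Map interface. */
    static int countSafe(Object payload) {
        if (!(payload instanceof Map)) {
            throw new IllegalArgumentException("expected a Map, got " + payload.getClass().getName());
        }
        return ((Map<?, ?>) payload).size();
    }

    public static void main(String[] args) {
        // Stand-in for a concurrent map received from another node (an assumption).
        Map<String, String> replicants = new ConcurrentHashMap<>();
        replicants.put("serviceA", "10.66.248.243:1099");

        System.out.println("safe count: " + countSafe(replicants));
        try {
            countFragile(replicants);
        } catch (ClassCastException e) {
            System.out.println("fragile cast failed");
        }
    }
}
```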

As for the rest of the problems, it's hard to tell without understanding your environment. Is this the "all" config from 4.0.3, with 3 tree caches (session replication, SFSB replication, entity bean replication) + the NMS Partition? If so, it looks like 2 of the caches recovered:

17:00:35,958 INFO [TreeCache] viewAccepted(): new members: [USABBRDUL14407:2568, USABBRDUL15163:1698]
17:00:35,958 INFO [TreeCache] viewAccepted(): new members: [USABBRDUL14407:2567, USABBRDUL15163:1701]

while the NMS partition lost a member:

17:00:35,968 INFO [NMS] New Members : 1 ([10.66.248.243:1099])

which isn't surprising given the above-referenced bug. It's not clear what happened to the third TreeCache; by the end of the log snippet it hadn't recovered.
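
One way to check which caches recovered is to keep only the last viewAccepted() line seen for each channel. The sketch below does that for lines in the format shown in this thread. ViewSummary is a hypothetical helper, and it assumes the first member listed in a view (the local node in this log) identifies the channel.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical log helper; not part of JBoss or JGroups.
final class ViewSummary {
    private static final Pattern VIEW =
            Pattern.compile("viewAccepted\\(\\): new members: \\[(.*?)\\]");

    /** Maps each view's first member (assumed to identify the channel) to its latest member list. */
    static Map<String, List<String>> latestViews(List<String> lines) {
        Map<String, List<String>> latest = new LinkedHashMap<>();
        for (String line : lines) {
            Matcher m = VIEW.matcher(line);
            if (!m.find()) {
                continue; // not a viewAccepted() line
            }
            List<String> members = new ArrayList<>();
            for (String member : m.group(1).split(",")) {
                members.add(member.trim());
            }
            latest.put(members.get(0), members); // later lines overwrite earlier ones
        }
        return latest;
    }

    public static void main(String[] args) {
        List<String> lines = List.of(
                "17:00:23,350 INFO [TreeCache] viewAccepted(): new members: [USABBRDUL14407:2568]",
                "17:00:24,482 INFO [TreeCache] viewAccepted(): new members: [USABBRDUL14407:2567]",
                "17:00:35,958 INFO [TreeCache] viewAccepted(): new members: [USABBRDUL14407:2568, USABBRDUL15163:1698]",
                "17:00:35,958 INFO [TreeCache] viewAccepted(): new members: [USABBRDUL14407:2567, USABBRDUL15163:1701]");
        latestViews(lines).forEach((channel, members) ->
                System.out.println(channel + " -> " + members.size() + " member(s)"));
    }
}
```

A channel whose latest view lists both hosts has rejoined; a channel that never shows up again in the log (such as the one on port 2534 here) still needs investigating.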
