We have the problem that our cluster sometimes loses its master, and only restarting the 'right' node solves it. We are using JBoss 4.0.2, but the code that causes the problem is present up to 4.2.0.CR1. I can reproduce it as follows:
cluster with two nodes: node_a, node_b, both running JBoss 4.0.2
node_a is the master; the cluster view is node_a:1199, node_b:1199
we have a high workload on both nodes
node_b has problems and doesn't shut down normally, so we have to kill it with kill -9 and restart it immediately
node_bnew (node_bn) rejoins the cluster before node_a recognizes that node_bold (node_bo) is really dead
the result is the cluster view node_a:1199, node_b(o):1199, node_b(n):1199
after a very short time, node_a sends a new cluster view without node_b(o), but node_b(n) doesn't remove its dead predecessor
on node_bn, the old node_bo still remains in the replicants list of org.jboss.ha.framework.server.DistributedReplicantManagerImpl, and this never changes
when node_a is now restarted normally, node_bn is the first node in the cluster view, and JGroups knows it is the master, but DistributedReplicantManagerImpl does not
only when node_bn is restarted too (the 'right' node from above) does everything work fine again
otherwise we don't have a master in our cluster; this means, e.g., that isMasterReplica of HAJNDI returns 'false' on all nodes, and code which should run on the master is never triggered
The reason for this behaviour lies in generateUniqueNodeName in class ClusterPartition. It uses <node_name>:<jndi_port> if the JNDI subsystem is started and <node_name>:<generated UID> otherwise. The first form is not unique after a restart of a node; it is set as additional data in JGroups and then used as a key in some HashMaps. Because of this ambiguous key you see a node in the cluster view twice, and because of the implementation of DistributedReplicantManagerImpl.setCurrentState and HAPartitionImpl.getDeadMembers, the dead old node is never removed from the replicants list of DistributedReplicantManagerImpl. This means that, for all deployed services, the call to isMasterReplica returns false, even if node_bn is the master.
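The collision can be sketched with a toy version of the two naming schemes. The class and method names below are illustrative only; in JBoss the real logic lives in ClusterPartition.generateUniqueNodeName:

```java
import java.rmi.server.UID;

// Illustrative simplification of the two naming schemes: the real method
// prefers <node_name>:<jndi_port> when JNDI is up and falls back to
// <node_name>:<generated UID> otherwise.
public class NodeNames {
    // JNDI port is fixed in the config, so this name is identical
    // before and after a restart of the node -> ambiguous HashMap key
    static String portBasedName(String nodeName, int jndiPort) {
        return nodeName + ":" + jndiPort;
    }

    // A fresh UID is generated per JVM start, so the restarted
    // incarnation gets a name distinct from the dead one
    static String uidBasedName(String nodeName) {
        return nodeName + ":" + new UID();
    }

    public static void main(String[] args) {
        String beforeKill = portBasedName("node_b", 1199);
        String afterRestart = portBasedName("node_b", 1199); // simulated restart
        System.out.println(beforeKill.equals(afterRestart)); // same key twice in the view
        System.out.println(uidBasedName("node_b").equals(uidBasedName("node_b"))); // false
    }
}
```

This is why node_b(o) and node_b(n) collide: both incarnations map to the same key, while the UID form would keep them apart.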
I solved the problem by commenting out the return statement in ClusterPartition.generateUniqueNodeName that returns the name with the JNDI port; the method then falls through to generating a UID and returns <node_name>:<generated UID>. This is unique, but the drawback is that the cluster view now looks a little ugly.
In my opinion this is a bug: the "unique" node name is not unique. But maybe it is not supposed to be unique across node restarts. Would it be possible to add a config option to ClusterPartition to always use the UID, off by default? Then anyone who wants a completely unique node name can configure it, and nothing changes for everyone else.
After fixing this, I hit a new, minor problem: there is a deadlock condition in the locking order of the synchronized blocks in DistributedReplicantManagerImpl. I had to move notifyKeyListeners outside of the synchronized(replicants) block in purgeDeadMembers and remember the changed keys in an ArrayList. All other methods acquire locks in the order localReplicants -> replicants; only purgeDeadMembers uses replicants -> localReplicants. I see that newer versions of this class use the collections from the concurrent package, so this problem may not be present there.
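The fix described above can be sketched roughly as follows. This is a simplified stand-in, not the actual DistributedReplicantManagerImpl code: the changed keys are collected while holding the replicants lock, and the listener callbacks run only after the lock is released, so purgeDeadMembers never holds replicants while other locks may still be needed:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Simplified sketch of the purgeDeadMembers lock-ordering fix.
public class PurgeSketch {
    // service key -> (node name -> replicant), a stand-in for the real field
    final Map<String, Map<String, Object>> replicants = new HashMap<>();
    final List<String> notified = new ArrayList<>();

    void purgeDeadMembers(Set<String> deadNodeNames) {
        List<String> changedKeys = new ArrayList<>();
        synchronized (replicants) {
            for (Map.Entry<String, Map<String, Object>> e : replicants.entrySet()) {
                // removeAll returns true if any dead node was actually removed
                if (e.getValue().keySet().removeAll(deadNodeNames)) {
                    changedKeys.add(e.getKey());
                }
            }
        }
        // Listener notification happens outside synchronized(replicants),
        // so no second lock is ever taken while replicants is held.
        for (String key : changedKeys) {
            notifyKeyListeners(key);
        }
    }

    void notifyKeyListeners(String key) {
        notified.add(key); // placeholder for the real callback dispatch
    }
}
```

The essential point is that both lock-taking paths now agree on a single order, which removes the deadlock window.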
This posting is long enough ;), so if someone wants additional data, like line numbers, diffs, etc., I can post it here. It would be nice if the config option could be integrated, and I would be glad to help do this :).
Thanks much for the excellent report. Could you post a couple of JIRA issues for these two problems?
Sure, but I haven't used JIRA yet, so I have some, maybe stupid, questions: should I use 'Bug' as the issue type or 'Support Patch'? I would file it under the project 'JBoss Clustering'; is this OK? I can add a patch for both issues against JBoss 4.0.2. Would this be helpful?
Good questions; I wasn't focusing when I wrote before.
Use "Bug" and the project should be "JBoss Application Server".
Sure, a patch would be helpful. :)
I posted two bug reports:
While doing so, I found another minor bug in HAPartitionImpl. The debug output in the viewAccepted method doesn't log the old size, because the members field changes during the method. I changed 'this.members' to 'oldMembers', with some checking for oldMembers == null. It's not very elegant, but it works. Should I post a bug for this, too? I don't think it's very important.
--- HAPartitionImpl.java 2007-04-10 18:30:52.000000000 +0200
+++ HAPartitionImpl.java-patched 2007-04-10 18:38:51.000000000 +0200
@@ -503,7 +503,10 @@
event.originatingGroups = mergeView.getSubgroups();
- log.debug("membership changed from " + this.members.size() + " to "
+ //log.debug("membership changed from " + this.members.size() + " to "
+ int tmpOldSize = 0;
+ if (oldMembers != null) tmpOldSize = oldMembers.size();
+ log.debug("membership changed from " + tmpOldSize + " to "
// Put the view change to the asynch queue
Thanks for those two. If you don't mind, go ahead and open a JIRA for the logging issue; it needs to get fixed, and JIRA is the to-fix list.
I added a bug report for the logging issue, too. The attached patch is a little bit smaller. I think the check whether oldMembers is null could be removed, because the prior code guarantees it can't be null, but keeping the check is safer in case the code changes in the future.
Thanks. The last one is fixed. I believe the null check was needed for the receipt of the first view. Anyway, as you say, it makes things safer, and a viewAccepted() call is infrequent, so there's no point in reducing safety for speed.
Thanks a lot. I played around a little, and I can now generate a patch that adds the attribute 'forceUidInNodeName' to ClusterPartition/ClusterPartitionMBean. It is false by default, which leaves the behaviour untouched; when it is set to true, the UID is used. But whether to implement it as an attribute depends on the decision of how unique the node name should be. If using the JNDI port is a bug, we should not implement an attribute; but if it is a feature, e.g. for readability, or for someone who doesn't need uniqueness across restarts, an attribute would be a good idea, I think. Anyway, are you interested in the two patches? I could add them to the JIRA entry.
The use of the JNDI port is not meant for prettiness. It's actually intended that, in the use case you describe in your initial post, the name should *not* be unique.
The idea here is to detect the case you describe, where a node is killed, restarted, and rejoins the group before the group knows the first incarnation is dead. The intent is to prevent that from happening by giving the node the same name and taking appropriate action (IIRC, throwing an exception) if the node name is a duplicate in the view.
In JGroups 2.6, a feature called "Logical Addresses" will be added that will deal with this problem at the JGroups level, avoiding this admittedly hacky generateUniqueNodeName business.
Your report indicates that this mechanism is breaking down; that's the bug.
BTW, does your JGroups config in cluster-service.xml include FD_SOCK next to FD? If not, adding it is a good idea; with it node_a would detect the kill -9 of node_b immediately. See http://wiki.jboss.org/wiki/Wiki.jsp?page=JGroupsFD_SOCK.
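For reference, a fragment of what the relevant part of the JGroups protocol stack in cluster-service.xml might look like with FD_SOCK added. The attribute values here are illustrative examples, not recommendations:

```xml
<!-- FD_SOCK holds an open TCP connection to a neighbor; a kill -9 closes
     the socket, so the crash is detected immediately, whereas FD alone
     must wait for heartbeat timeouts. -->
<FD_SOCK down_thread="false" up_thread="false"/>
<FD timeout="10000" max_tries="5" shun="true"
    down_thread="false" up_thread="false"/>
<VERIFY_SUSPECT timeout="1500" down_thread="false" up_thread="false"/>
```

Note that FD_SOCK only helps for hard process death; it does not cover a node that is alive but unresponsive (e.g. during a long GC pause), which is why FD stays in the stack alongside it.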
Are you using a TCP-based config? The mechanism I described is encapsulated in HAPartitionImpl.verifyNodeIsUnique(), which, from a look at it, will break down if the node returns with the same InetAddress + port. That's more likely to happen with a TCP-based config.
No, I don't have FD_SOCK next to FD. I saw the doubled node when the whole cluster was under high load with lots of garbage collection runs. That is why I think the use of FD_SOCK will lessen the likelihood of the problem, but it can still happen, especially when a node takes a GC pause of about 5 to 8 seconds. Admittedly, the nodes were less powerful machines; now they are much stronger :).
I have a UDP-based config, and some of the timing parameters are not very smart. As a result, it sometimes happens that a node is suspected before it has had the chance to send its first message. But changing these parameters will again only reduce the likelihood of the doubled node; it can't be avoided entirely.
I agree, the base problem is that the doubled node is not fully recognized. I thought it was easier to make a node name unique than to detect that it appears in the list twice and that one of those entries belongs to a dead member. But it may be that the unique name is not a clean design and that it only works because of some side effects ;).