9 Replies Latest reply on Nov 5, 2007 5:33 PM by brian.stansberry

purgeDeadMembers Failing - what is the cause?

jboss_cody Nov 5, 2007 1:40 PM

Hello everyone,

I have yet another issue with my cluster. I have seen this issue in other forum entries, but I have been unable to find a definitive answer as to its cause.

I have a cluster of 6 nodes, with fail-over and session replication implemented. My problem is that when specific nodes in the cluster are stopped, their session information is not accurate when fail-over occurs.

I have looked through each node's logs and discovered that not all of my nodes have this problem.

Here is the log output for a node that does have the problem, from the moment I called the shutdown.

2007-11-04 21:15:04,784 DEBUG [org.jboss.ha.framework.server.HARMIServerImpl$RefreshProxiesHATarget] replicantsChanged 'HAJNDI' to 2 (intra-view id: 56409408)
2007-11-04 21:15:04,860 INFO [org.jboss.ha.framework.interfaces.HAPartition.lifecycle.jboss1] New cluster view for partition jboss1 (id: 27, delta: -1) : [192.168.202.x:1099, 192.168.202.x:1099]
2007-11-04 21:15:04,860 DEBUG [org.jboss.ha.framework.interfaces.HAPartition.jboss1] dead members: [192.168.202.x:1099]
2007-11-04 21:15:04,860 DEBUG [org.jboss.ha.framework.interfaces.HAPartition.jboss1] membership changed from 3 to 2
2007-11-04 21:15:04,862 DEBUG [org.jboss.ha.framework.interfaces.HAPartition.jboss1] Begin notifyListeners, viewID: 27
2007-11-04 21:15:04,863 INFO [org.jboss.ha.framework.server.DistributedReplicantManagerImpl.jboss1] I am (192.168.202.x:1099) received membershipChanged event:
2007-11-04 21:15:04,863 INFO [org.jboss.ha.framework.server.DistributedReplicantManagerImpl.jboss1] Dead members: 1 ([192.168.202.x:1099])
2007-11-04 21:15:04,864 INFO [org.jboss.ha.framework.server.DistributedReplicantManagerImpl.jboss1] New Members : 0 ([])
2007-11-04 21:15:04,864 INFO [org.jboss.ha.framework.server.DistributedReplicantManagerImpl.jboss1] All Members : 2 ([192.168.202.x:1099, 192.168.202.x:1099])
2007-11-04 21:15:04,864 DEBUG [org.jboss.ha.framework.server.DistributedReplicantManagerImpl.jboss1] purgeDeadMembers, [192.168.202.x:1099]
2007-11-04 21:15:04,864 DEBUG [org.jboss.ha.framework.server.DistributedReplicantManagerImpl.jboss1] trying to remove deadMember 192.168.202.x:1099 for key DCacheBridge-DefaultJGBridge
2007-11-04 21:15:04,864 DEBUG [org.jboss.ha.framework.server.DistributedReplicantManagerImpl.jboss1] 192.168.202.11:1099 was NOT removed!!!
2007-11-04 21:15:04,864 DEBUG [org.jboss.ha.framework.server.DistributedReplicantManagerImpl.jboss1] trying to remove deadMember 192.168.202.x:1099 for key jboss.ha:service=HASingletonDeployer
2007-11-04 21:15:04,864 DEBUG [org.jboss.ha.framework.server.DistributedReplicantManagerImpl.jboss1] 192.168.202.11:1099 was NOT removed!!!
2007-11-04 21:15:04,864 DEBUG [org.jboss.ha.framework.server.DistributedReplicantManagerImpl.jboss1] trying to remove deadMember 192.168.202.x:1099 for key HAJNDI
2007-11-04 21:15:04,864 DEBUG [org.jboss.ha.framework.server.DistributedReplicantManagerImpl.jboss1] 192.168.202.11:1099 was NOT removed!!!
2007-11-04 21:15:04,864 DEBUG [org.jboss.ha.framework.interfaces.HAPartition.jboss1] End notifyListeners, viewID: 27
2007-11-04 21:15:05,625 INFO [org.jboss.cache.TreeCache] viewAccepted(): [192.168.202.x:33173|27] [192.168.202.x:33173, 192.168.202.x:33183]
2007-11-04 21:15:07,200 DEBUG [org.jboss.web.tomcat.service.session.JBossCacheManager] Looking for sessions that have expired ...
2007-11-04 21:15:14,362 INFO [org.jboss.cache.TreeCache] viewAccepted(): [192.168.202.x:33180|28] [192.168.202.x:33180]
2007-11-04 21:15:14,470 DEBUG [org.jboss.ha.singleton.HASingletonController] partitionTopologyChanged, isElectedNewMaster=true, isMasterNode=true, viewID=-424447
2007-11-04 21:15:14,500 DEBUG [org.jboss.cache.invalidation.bridges.JGCacheInvalidationBridge] Updating list of invalidation groups that are bridged...

Next I shut down another node, which does not have the problem.

2007-11-04 21:15:14,500 DEBUG [org.jboss.cache.invalidation.bridges.JGCacheInvalidationBridge] ... nothing needs to be bridged.
2007-11-04 21:15:14,501 DEBUG [org.jboss.cache.invalidation.bridges.JGCacheInvalidationBridge] The list of replicant for the JG bridge has changed, computing and updating local info...
2007-11-04 21:15:14,501 DEBUG [org.jboss.cache.invalidation.bridges.JGCacheInvalidationBridge] ... No bridge info was associated to this node
2007-11-04 21:15:14,731 INFO [org.jboss.cache.TreeCache] viewAccepted(): [192.168.202.x:33178|28] [192.168.202.x:33178]
2007-11-04 21:15:14,744 DEBUG [org.jboss.ha.framework.server.HARMIServerImpl$RefreshProxiesHATarget] replicantsChanged 'HAJNDI' to 1 (intra-view id: -424447)
2007-11-04 21:15:15,078 INFO [org.jboss.ha.framework.interfaces.HAPartition.lifecycle.jboss1] New cluster view for partition jboss1 (id: 28, delta: -1) : [192.168.202.x:1099]
2007-11-04 21:15:15,078 DEBUG [org.jboss.ha.framework.interfaces.HAPartition.jboss1] dead members: [192.168.202.x:1099]
2007-11-04 21:15:15,079 DEBUG [org.jboss.ha.framework.interfaces.HAPartition.jboss1] membership changed from 2 to 1
2007-11-04 21:15:15,081 DEBUG [org.jboss.ha.framework.interfaces.HAPartition.jboss1] Begin notifyListeners, viewID: 28
2007-11-04 21:15:15,081 INFO [org.jboss.ha.framework.server.DistributedReplicantManagerImpl.jboss1] I am (192.168.202.x:1099) received membershipChanged event:
2007-11-04 21:15:15,081 INFO [org.jboss.ha.framework.server.DistributedReplicantManagerImpl.jboss1] Dead members: 1 ([192.168.202.x:1099])
2007-11-04 21:15:15,081 INFO [org.jboss.ha.framework.server.DistributedReplicantManagerImpl.jboss1] New Members : 0 ([])
2007-11-04 21:15:15,081 INFO [org.jboss.ha.framework.server.DistributedReplicantManagerImpl.jboss1] All Members : 1 ([192.168.202.x:1099])
2007-11-04 21:15:15,081 DEBUG [org.jboss.ha.framework.server.DistributedReplicantManagerImpl.jboss1] purgeDeadMembers, [192.168.202.x:1099]
2007-11-04 21:15:15,081 DEBUG [org.jboss.ha.framework.interfaces.HAPartition.jboss1] End notifyListeners, viewID: 28
2007-11-04 21:15:15,571 INFO [org.jboss.cache.TreeCache] viewAccepted(): [192.168.202.x:33173|28] [192.168.202.x:33173]

I do not understand why one node can be removed and the other cannot. The two configurations are the same. What causes "DistributedReplicantManager" to remove one node but not the other?

Could anyone please advise as to what my problem may be?

Thank you in advance.

1. Re: purgeDeadMembers Failing - what is the cause?

brian.stansberry Nov 5, 2007 1:49 PM (in response to jboss_cody)

What AS release is this?
2. Re: purgeDeadMembers Failing - what is the cause?

brian.stansberry Nov 5, 2007 1:53 PM (in response to jboss_cody)

Also, what kind of "sessions" are not accurate? I'm guessing EJB2 SFSBs, as that's the only type of session that uses DistributedReplicantManager, but that's just a guess.
3. Re: purgeDeadMembers Failing - what is the cause?

jboss_cody Nov 5, 2007 2:41 PM (in response to jboss_cody)

The release is 4.2.1 GA.

The sessions are just a "session.getId()" call within a JSP page.

When I said "sessions", I may have misspoken. I have a sample application that displays the sessionId inside a .jsp. It also displays the name of the node and a counter of how many hits it has taken.

Here's an example from the JSP:
session ID: y20PI8Ro-nlJavY5zjNXaQ**.jboss11 You have hit this page 1 times (Link to this Page)

I start 2 nodes.
I enter my url and I get sent to node1.
Then I stop node1, and click a link that refreshes the current page. (basically a refresh)

Once I do this, the sessionId appears (which is good), with the same node name (which is bad, because node1 is now dead).

The counter resets (which is also bad; it should have incremented).

Then I refresh again, and this time the sessionId is the same (which is good) and the node name is different (which is the fail-over occurring, so good).

The counter increments (which is good, but the count is 1 less than it is supposed to be).

I see in the logs that the DistributedReplicantManager is unable to remove the dead members.

What would make this happen? Is there a configuration that I'm missing?
4. Re: purgeDeadMembers Failing - what is the cause?

jboss_cody Nov 5, 2007 2:47 PM (in response to jboss_cody)

It is possible that the browser (IE) is causing this behavior, but it only occurs when I attempt to test fail-over from one node to another.
5. Re: purgeDeadMembers Failing - what is the cause?

brian.stansberry Nov 5, 2007 3:11 PM (in response to jboss_cody)

The DistributedReplicantManager is unrelated to web session clustering, so that's likely not the issue.

Do you have UseJK="true" in the jboss-web-cluster.sar/META-INF/jboss-service.xml file on all members?

If yes, what you report is odd for sure.
6. Re: purgeDeadMembers Failing - what is the cause?

jboss_cody Nov 5, 2007 3:17 PM (in response to jboss_cody)

Don't you mean in /jboss-web.deployer/META-INF/jboss-service.xml?

There isn't such an attribute in jboss-web-cluster.sar/META-INF/jboss-service.xml...

And the answer to your question is yes, I do. I have tested the functionality with Apache.
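
For anyone following along, here is a rough sketch of the setting under discussion. It assumes the stock 4.2.x layout, where UseJK is an attribute of the JBossWeb service MBean in jboss-web.deployer/META-INF/jboss-service.xml. The MBean's other attributes are omitted, and the class and object names are from memory, so check them against your own file:

```xml
<!-- jboss-web.deployer/META-INF/jboss-service.xml (fragment; other attributes omitted) -->
<!-- Assumption: MBean code/name as shipped in a default 4.2.x install -->
<mbean code="org.jboss.web.tomcat.service.JBossWeb"
       name="jboss.web:service=WebServer">
  <!-- Set to true on every cluster member when mod_jk/Apache sits in front -->
  <attribute name="UseJK">true</attribute>
</mbean>
```

With this enabled, the server is expected to manage the jvmRoute suffix on the session ID (the ".jboss11" part in the example above) when a request fails over to another node.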
7. Re: purgeDeadMembers Failing - what is the cause?

brian.stansberry Nov 5, 2007 4:38 PM (in response to jboss_cody)

Yeah, I meant jboss-web.deployer. Sorry about that. Not sure what the issue is. A possibility is a slow replication, but given that you are doing things manually and stopping a server, that seems pretty unlikely.

You can turn on TRACE logging of org.jboss.web.tomcat.service.session -- that might show you something.
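
A minimal sketch of what that could look like, assuming the stock conf/jboss-log4j.xml of a 4.2.x server configuration (the file name and location are assumptions; check your own install):

```xml
<!-- conf/jboss-log4j.xml (fragment): add alongside the existing <category> elements -->
<category name="org.jboss.web.tomcat.service.session">
  <!-- TRACE is the most verbose level; revert once the test is done -->
  <priority value="TRACE"/>
</category>
```

If the appender that writes server.log has a Threshold param set above TRACE, the messages will still be filtered out, so that may need lowering too. Some older log4j.xml files also put class="org.jboss.logging.XLevel" on the priority element; whether yours needs it depends on the bundled log4j version.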
8. Re: purgeDeadMembers Failing - what is the cause?

jboss_cody Nov 5, 2007 4:57 PM (in response to jboss_cody)

A possibility is a slow replication...

It seemed that I may be causing the issue by not giving my cluster enough time to replicate the session throughout.

I tried waiting about a minute before refreshing, but again, when I click the link or hit the refresh button, the first info displayed is stale. It isn't until I refresh twice that accurate info is displayed. I will keep investigating; knowing my luck, I've configured this to happen without realizing it.

: {

Thanks again for the replies...
9. Re: purgeDeadMembers Failing - what is the cause?

brian.stansberry Nov 5, 2007 5:33 PM (in response to jboss_cody)

A replication with no load should take tens to a few hundred ms, so I suspect something else is going on; probably will seem obvious once we see it. Try the TRACE logging; you'll get timestamped log messages as sessions are replicated and received and special messages when the server recognizes a failover.