3 Replies Latest reply on Aug 26, 2004 1:53 AM by ovidiu.feodorov

    Clustering related Question: Case 1843

    ivelin.ivanov

      Please find attached a write-up that briefly explains the problem and its causes, and suggests several solutions.

      The Problem

      If there are two or more cluster nodes with an HASingleton service enabled and the master service instance shuts down (explicit node shutdown, for example), there is a noticeable delay until another node takes over and starts the service instance. More precisely, it takes 60 seconds for the transfer to complete.

      The problem does not show up in JBoss 3.2.3 (JGroups 2.2.0) but shows up in JBoss 3.2.4 (JGroups 2.2.4).

      The Cause

      The sequence of events in the HA Singleton layer during an explicit node shutdown is as follows:

      1. The HASingletonController is notified to shut down.
      2. HAServiceMBeanSupport.stopService() executes and, while unregistering the DRM listener, calls "_remove" asynchronously on the cluster, so the call reaches the future master too.
      3. On the future master, DistributedReplicantManager._remove() triggers a local master election.
      4. The HASingletonController instance realizes that it will become master and, in one of the subsequent steps, synchronously calls "_stopOldMaster" on the partition.
      5. This distributed RPC blocks and exits only by timeout (60 seconds).

      I was able to replicate the problem independently in JGroups. The deadlock shows up every time I use nested distributed calls. The test setup consists of two RpcDispatchers A and B, each dispatcher being able to handle innerMethod() and outerMethod(). innerMethod() is simple (just a System.out.println(), for example). outerMethod() internally calls innerMethod() as a group RPC:

      public void innerMethod() {
          System.out.println("innerMethod()");
      }

      public void outerMethod() {
          rpcDispatcher.callRemoteMethods(null,
                                          new MethodCall("innerMethod", new Object[0]),
                                          GroupRequest.GET_ALL,
                                          60000);
      }


      From A, I invoke callRemoteMethod(B, "outerMethod"). The following happens:

      A: callRemoteMethod(B, "outerMethod")
      B: outerMethod() executes
      B: outerMethod() calls innerMethod() on the group
      B: the group call never returns except by timeout
      A: innerMethod() executes and replies
      B: ..... (still waiting for its own local innerMethod() invocation)
      B: the innerMethod() group call times out after 60000 ms
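
      For reference, here is a minimal sketch of how such a two-node test could be wired up. It assumes the JGroups 2.2.x API; the Server class simply wraps the two methods shown above, and the class and group names are mine, not the original test case:

      import org.jgroups.Address;
      import org.jgroups.JChannel;
      import org.jgroups.blocks.GroupRequest;
      import org.jgroups.blocks.MethodCall;
      import org.jgroups.blocks.RpcDispatcher;

      public class NestedCallTest {

          // Server object exposing the two methods from the listing above.
          public static class Server {
              RpcDispatcher rpcDispatcher;

              public void innerMethod() {
                  System.out.println("innerMethod()");
              }

              public void outerMethod() {
                  rpcDispatcher.callRemoteMethods(null,
                                                  new MethodCall("innerMethod", new Object[0]),
                                                  GroupRequest.GET_ALL,
                                                  60000);
              }
          }

          public static void main(String[] args) throws Exception {
              JChannel channelA = new JChannel();
              JChannel channelB = new JChannel();

              Server serverA = new Server();
              Server serverB = new Server();

              serverA.rpcDispatcher = new RpcDispatcher(channelA, null, null, serverA);
              serverB.rpcDispatcher = new RpcDispatcher(channelB, null, null, serverB);

              channelA.connect("nested-call-test");
              channelB.connect("nested-call-test");

              // Crude, but enough for a sketch: let the view stabilize.
              Thread.sleep(1000);

              // A connected first, so B is the second member of the view.
              Address addressB = (Address) channelA.getView().getMembers().get(1);

              // From A, call outerMethod() on B; B then issues the nested group
              // call to innerMethod() and its up-processing thread blocks until
              // the 60-second timeout expires.
              serverA.rpcDispatcher.callRemoteMethod(addressB,
                                                     new MethodCall("outerMethod", new Object[0]),
                                                     GroupRequest.GET_ALL,
                                                     60000);

              channelB.close();
              channelA.close();
          }
      }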


      The root cause of the problem is that MessageDispatcher uses only one thread (the "MessageDispatcher up processing thread") both to handle incoming RPC requests and to make nested calls. Once a nested call is made, that thread blocks on a mutex and is never woken up to handle the responses to the nested call until the timeout expires.
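
      The same pattern can be illustrated outside JGroups with any single-threaded work queue. The sketch below is an analogy only, not MessageDispatcher code (it uses the java.util.concurrent API for brevity): the outer task occupies the single thread and then waits for an inner task that can only run on that same thread, so the wait can end only by timeout.

      import java.util.concurrent.ExecutorService;
      import java.util.concurrent.Executors;
      import java.util.concurrent.Future;
      import java.util.concurrent.TimeUnit;
      import java.util.concurrent.TimeoutException;

      public class SingleThreadNestedCall {

          public static void main(String[] args) throws Exception {
              // Stand-in for the single "MessageDispatcher up processing thread".
              final ExecutorService upThread = Executors.newSingleThreadExecutor();

              Future<?> outer = upThread.submit(new Runnable() {
                  public void run() {
                      // The "nested call": its handler needs the same single thread...
                      Future<?> inner = upThread.submit(new Runnable() {
                          public void run() {
                              System.out.println("innerMethod()");
                          }
                      });
                      try {
                          // ...so this wait can only end by timing out.
                          inner.get(5, TimeUnit.SECONDS);
                      } catch (TimeoutException e) {
                          System.out.println("nested call timed out");
                      } catch (Exception e) {
                          e.printStackTrace();
                      }
                  }
              });

              outer.get();   // returns only after the 5-second timeout above
              // The inner task finally runs once the outer one has returned --
              // just as the nested group RPC is only served after the timeout.
              upThread.shutdown();
          }
      }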

      Solutions

      1. A quick and temporary solution is to make the "_stopOldMaster" distributed RPC in HASingletonSupport.partitionTopologyChanged() asynchronous (see the asynchronous-call sketch after this list). In the situation presented above the result of the call does not matter anyway, since the master instance is shutting down or is already down. And if another service instance in the cluster (not the master) goes down or comes up, that should not be a reason to switch the master, so the result does not matter there either. This works, but it only hides the symptoms.

      2. Fix JGroups. One idea is to have RpcDispatcher use a thread per incoming call, or possibly a thread pool (see the thread-pool sketch below). That way the deadlock goes away. I am still looking at JGroups 2.2.0, trying to understand why nested distributed calls work in that release. I will come up with a solution and a test case.
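
      In terms of the RpcDispatcher example above, the workaround in solution 1 amounts to issuing the nested call without waiting for replies. A minimal sketch, reusing outerMethod() from the test above (GroupRequest.GET_NONE with a zero timeout is the asynchronous variant; this is not the actual HASingletonSupport code):

      public void outerMethod() {
          // GET_NONE: fire the group call and return immediately, so the
          // single up-processing thread is never parked waiting for replies.
          rpcDispatcher.callRemoteMethods(null,
                                          new MethodCall("innerMethod", new Object[0]),
                                          GroupRequest.GET_NONE,
                                          0);
      }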
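
      For solution 2, the idea is roughly the following: the up-processing thread only hands each incoming request off to a worker, so it stays free to deliver the replies of any nested group call the handler makes. This is an illustration only, not actual MessageDispatcher code; the class name, handleUpRequest(), invokeHandler() and sendResponse() are all made-up stand-ins.

      import java.util.concurrent.ExecutorService;
      import java.util.concurrent.Executors;

      // Hypothetical fragment illustrating the thread-pool idea.
      public class PooledRequestHandling {

          private final ExecutorService requestPool = Executors.newCachedThreadPool();

          // Called from the (single) up-processing thread for every incoming request.
          public void handleUpRequest(final Object request) {
              requestPool.execute(new Runnable() {
                  public void run() {
                      // The handler may itself issue nested group RPCs; their replies
                      // are still delivered because the up thread was never blocked.
                      Object result = invokeHandler(request);
                      sendResponse(request, result);
                  }
              });
          }

          // Made-up stand-ins for the real dispatching and reply code.
          private Object invokeHandler(Object request) {
              return "result of " + request;
          }

          private void sendResponse(Object request, Object result) {
              System.out.println("reply for " + request + ": " + result);
          }
      }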

      Cheers,
      Ovidiu