6 Replies Latest reply on Apr 22, 2005 4:46 PM by belaban

60 seconds delay now and then

lepe Aug 25, 2004 10:16 AM

Sometimes when a node has been restarted it takes very long time before the restart is completed (20 minutes or more instead of 1-2 minutes). Looking in the server log i noticed where it lost time... now and then there is noting done for a period of 60 seconds. Look at this clip from a log. Any direction on where to look... (JBoss 3.2.3 is used)

/L

2004-08-24 15:10:29,423 DEBUG [org.jboss.ha.singleton.HASingletonController] start HASingletonController
2004-08-24 15:11:29,431 DEBUG [DefaultPartition:ReplicantManager] notifyKeyListeners

jboss.mq:service=HAILSingletonController
2004-08-24 15:11:29,431 DEBUG [org.jboss.ha.singleton.HASingletonController] partitionTopologyChanged, isElectedNewMaster=false, isMasterNode=false, viewID=-2094479582
2004-08-24 15:12:29,442 INFO [org.jboss.ha.singleton.HASingletonController] Started jboss.mq:service=HAILSingletonController

2004-08-24 15:12:29,457 DEBUG [org.jboss.ha.framework.server.HARMIServerImpl$RefreshProxiesHATarget] Invocation authorization called: ENABLE_INVOCATIONS
2004-08-24 15:13:29,473 DEBUG [DefaultPartition:ReplicantManager] notifyKeyListeners

2004-08-24 15:13:29,502 INFO [org.jboss.ha.jndi.HANamingService] Started jboss:service=HAJNDI
2004-08-24 15:14:29,515 DEBUG [DefaultPartition:ReplicantManager] notifyKeyListeners

2004-08-24 15:14:37,393 DEBUG [org.jboss.ha.framework.server.HATarget] Invocation authorization called: MAKE_INVOCATIONS_WAIT
2004-08-24 15:15:37,395 DEBUG [DefaultPartition:ReplicantManager] notifyKeyListeners

2004-08-24 15:20:02,925 DEBUG [org.jboss.ha.singleton.HASingletonController] partitionTopologyChanged, isElectedNewMaster=false, isMasterNode=false, viewID=1157502159
2004-08-24 15:21:02,929 INFO [org.jboss.ha.singleton.HASingletonController] Started WCar.voc:service=VOC_DUMMY_ACCOUNTScheduler-HASingletonController

2004-08-24 15:22:02,972 DEBUG [org.jboss.ha.singleton.HASingletonController] partitionTopologyChanged, isElectedNewMaster=false, isMasterNode=false, viewID=1157502159
2004-08-24 15:23:02,972 INFO [org.jboss.ha.singleton.HASingletonController] Started WCar.voc:service=GSM_SMS_VO_10494Scheduler-HASingletonController

2004-08-24 15:24:03,013 DEBUG [org.jboss.ha.singleton.HASingletonController] partitionTopologyChanged, isElectedNewMaster=false, isMasterNode=false, viewID=1157502159
2004-08-24 15:25:03,026 INFO [org.jboss.ha.singleton.HASingletonController] Started WCar.voc:service=GSM_SMS_VO_10576Scheduler-HASingletonController

2004-08-24 15:26:03,070 DEBUG [org.jboss.ha.singleton.HASingletonController] partitionTopologyChanged, isElectedNewMaster=false, isMasterNode=false, viewID=1157502159
2004-08-24 15:27:03,083 INFO [org.jboss.ha.singleton.HASingletonController] Started WCar.voc:service=GSM_SMS_VO_10103Scheduler-HASingletonController

2004-08-24 15:27:03,108 DEBUG [org.jboss.ha.framework.server.FarmMemberService] registerRPCHandler
2004-08-24 15:28:03,109 DEBUG [org.jboss.ha.framework.server.FarmMemberService] Found 1 farmDeployments responses

2004-08-24 15:28:03,109 INFO [org.jboss.ha.framework.server.FarmMemberService] **** pullNewDeployments ****
2004-08-24 15:29:03,119 INFO [org.jboss.ha.framework.server.FarmMemberService] farmDeployment(), deploy locally: /opt/jboss-3.2.3/server/voc/tmp/cluster-examples-service.xml

1. Re: 60 seconds delay now and then

slaboure Aug 25, 2004 10:22 AM (in response to lepe)

Hello,

Please try with a recent snapshot of JBoss (like 3.2.6RCx or 3.2.5), I bet you won't get much feedback based on an old release of jboss.

Cheers,

sacha
Actions
2. Re: 60 seconds delay now and then

hsaha Apr 21, 2005 6:37 AM (in response to lepe)

Can you check whether the servers you are using for the cluster are connected to the same switch? There may be some multicasting issues.

We had a similar problem which got solved when we put the servers in the same switch. Sounds weird.. but it solved our problem.

Himadri
Actions
3. Re: 60 seconds delay now and then

belaban Apr 21, 2005 7:19 AM (in response to lepe)

Some of the hangs were caused by distributed deadlocks in the clustering module. Those have been fixed, probably in 3.2.6/3.2.7, definitely in 4.0.1/4.0.2
Actions
4. Re: 60 seconds delay now and then

lepe Apr 21, 2005 7:23 AM (in response to lepe)

yes, i can confirm that. When the customer moved to 3.2.6 the problems went away (but others came instead but that is life ;)

/L
Actions
5. Re: 60 seconds delay now and then

mikefinn Apr 22, 2005 3:35 PM (in response to lepe)
We are seeing this error now as well on 3.2.2. We have been running without problems on 3.2.2 for over a year, and now are getting these problems in our test environment. Both nodes are plugged into same switch, are on same subnet, and nothing seems to have not changed on them, besides deployment of new apps.

I traced it down to org.jgroups.blocks.GroupRequest, where it is waiting for a mutex lock:

rsp_mutex.wait(timeout);

It hangs there until the 60sec timeout, for each deployment unit. Starting the server used to take 1.5 minutes. Now it's over an hour. Here's an interesting twist - whatever node comes up first seems to come up fine. It's the second node that has problems. We have a 2 node cluster, and it doesn't matter which node is brought up first. Results are always the same.

Kind of smells like a deadlock problem, but I can't understand why this is happening all of a sudden.

Unfortunately, upgrading to 3.2.6 is not an option right now (and shouldn't be necessary, as we have been running fine for some time). We have 10-15 apps running on the environment, so upgrades are costly and complicated.

Mike
Actions
6. Re: 60 seconds delay now and then

belaban Apr 22, 2005 4:46 PM (in response to lepe)

You can produce a deadlock relatively simply: I described this some time ago in http://www.jgroups.org/javagroupsnew/docs/papers/Middleware.ps.gz.
Any of your own apps can do this, or (more likely as with pre 3.2.6 versions) the Clustering blocks are not reentrant.
Could be as simple as binding into HA-JNDI in a callback.
Actions

Go to original post