6 Replies Latest reply on Apr 22, 2005 4:46 PM by belaban

    60 seconds delay now and then

    lepe

      Sometimes when a node has been restarted it takes very long time before the restart is completed (20 minutes or more instead of 1-2 minutes). Looking in the server log i noticed where it lost time... now and then there is noting done for a period of 60 seconds. Look at this clip from a log. Any direction on where to look... (JBoss 3.2.3 is used)

      /L

      2004-08-24 15:10:29,423 DEBUG [org.jboss.ha.singleton.HASingletonController] start HASingletonController
      2004-08-24 15:11:29,431 DEBUG [DefaultPartition:ReplicantManager] notifyKeyListeners
      
      jboss.mq:service=HAILSingletonController
      2004-08-24 15:11:29,431 DEBUG [org.jboss.ha.singleton.HASingletonController] partitionTopologyChanged, isElectedNewMaster=false, isMasterNode=false, viewID=-2094479582
      2004-08-24 15:12:29,442 INFO [org.jboss.ha.singleton.HASingletonController] Started jboss.mq:service=HAILSingletonController
      
      2004-08-24 15:12:29,457 DEBUG [org.jboss.ha.framework.server.HARMIServerImpl$RefreshProxiesHATarget] Invocation authorization called: ENABLE_INVOCATIONS
      2004-08-24 15:13:29,473 DEBUG [DefaultPartition:ReplicantManager] notifyKeyListeners
      
      2004-08-24 15:13:29,502 INFO [org.jboss.ha.jndi.HANamingService] Started jboss:service=HAJNDI
      2004-08-24 15:14:29,515 DEBUG [DefaultPartition:ReplicantManager] notifyKeyListeners
      
      2004-08-24 15:14:37,393 DEBUG [org.jboss.ha.framework.server.HATarget] Invocation authorization called: MAKE_INVOCATIONS_WAIT
      2004-08-24 15:15:37,395 DEBUG [DefaultPartition:ReplicantManager] notifyKeyListeners
      
      2004-08-24 15:20:02,925 DEBUG [org.jboss.ha.singleton.HASingletonController] partitionTopologyChanged, isElectedNewMaster=false, isMasterNode=false, viewID=1157502159
      2004-08-24 15:21:02,929 INFO [org.jboss.ha.singleton.HASingletonController] Started WCar.voc:service=VOC_DUMMY_ACCOUNTScheduler-HASingletonController
      
      2004-08-24 15:22:02,972 DEBUG [org.jboss.ha.singleton.HASingletonController] partitionTopologyChanged, isElectedNewMaster=false, isMasterNode=false, viewID=1157502159
      2004-08-24 15:23:02,972 INFO [org.jboss.ha.singleton.HASingletonController] Started WCar.voc:service=GSM_SMS_VO_10494Scheduler-HASingletonController
      
      2004-08-24 15:24:03,013 DEBUG [org.jboss.ha.singleton.HASingletonController] partitionTopologyChanged, isElectedNewMaster=false, isMasterNode=false, viewID=1157502159
      2004-08-24 15:25:03,026 INFO [org.jboss.ha.singleton.HASingletonController] Started WCar.voc:service=GSM_SMS_VO_10576Scheduler-HASingletonController
      
      2004-08-24 15:26:03,070 DEBUG [org.jboss.ha.singleton.HASingletonController] partitionTopologyChanged, isElectedNewMaster=false, isMasterNode=false, viewID=1157502159
      2004-08-24 15:27:03,083 INFO [org.jboss.ha.singleton.HASingletonController] Started WCar.voc:service=GSM_SMS_VO_10103Scheduler-HASingletonController
      
      2004-08-24 15:27:03,108 DEBUG [org.jboss.ha.framework.server.FarmMemberService] registerRPCHandler
      2004-08-24 15:28:03,109 DEBUG [org.jboss.ha.framework.server.FarmMemberService] Found 1 farmDeployments responses
      
      2004-08-24 15:28:03,109 INFO [org.jboss.ha.framework.server.FarmMemberService] **** pullNewDeployments ****
      2004-08-24 15:29:03,119 INFO [org.jboss.ha.framework.server.FarmMemberService] farmDeployment(), deploy locally: /opt/jboss-3.2.3/server/voc/tmp/cluster-examples-service.xml
      


        • 1. Re: 60 seconds delay now and then
          slaboure

          Hello,

          Please try with a recent snapshot of JBoss (like 3.2.6RCx or 3.2.5), I bet you won't get much feedback based on an old release of jboss.

          Cheers,


          sacha

          • 2. Re: 60 seconds delay now and then
            hsaha

            Can you check whether the servers you are using for the cluster are connected to the same switch? There may be some multicasting issues.

            We had a similar problem which got solved when we put the servers in the same switch. Sounds weird.. but it solved our problem.

            Himadri

            • 3. Re: 60 seconds delay now and then
              belaban

              Some of the hangs were caused by distributed deadlocks in the clustering module. Those have been fixed, probably in 3.2.6/3.2.7, definitely in 4.0.1/4.0.2

              • 4. Re: 60 seconds delay now and then
                lepe

                yes, i can confirm that. When the customer moved to 3.2.6 the problems went away (but others came instead but that is life ;)

                /L

                • 5. Re: 60 seconds delay now and then
                  mikefinn

                  We are seeing this error now as well on 3.2.2. We have been running without problems on 3.2.2 for over a year, and now are getting these problems in our test environment. Both nodes are plugged into same switch, are on same subnet, and nothing seems to have not changed on them, besides deployment of new apps.

                  I traced it down to org.jgroups.blocks.GroupRequest, where it is waiting for a mutex lock:

                  rsp_mutex.wait(timeout);


                  It hangs there until the 60sec timeout, for each deployment unit. Starting the server used to take 1.5 minutes. Now it's over an hour. Here's an interesting twist - whatever node comes up first seems to come up fine. It's the second node that has problems. We have a 2 node cluster, and it doesn't matter which node is brought up first. Results are always the same.

                  Kind of smells like a deadlock problem, but I can't understand why this is happening all of a sudden.

                  Unfortunately, upgrading to 3.2.6 is not an option right now (and shouldn't be necessary, as we have been running fine for some time). We have 10-15 apps running on the environment, so upgrades are costly and complicated.

                  Mike

                  • 6. Re: 60 seconds delay now and then
                    belaban

                    You can produce a deadlock relatively simply: I described this some time ago in http://www.jgroups.org/javagroupsnew/docs/papers/Middleware.ps.gz.
                    Any of your own apps can do this, or (more likely as with pre 3.2.6 versions) the Clustering blocks are not reentrant.
                    Could be as simple as binding into HA-JNDI in a callback.