0 Replies Latest reply on Feb 18, 2005 1:15 PM by alapins

    HASingleton failover slow

    alapins

      I'm setting up a clustered singleton MDB. Only a single MDB should exist on the cluster. When the node the MDB is on fails, it should be activated on another node (i.e. startDelivery is called on the invoker-binding-proxy for the MDB). The MDB is set up as Adrian recommended in
      http://www.jboss.org/index.html?module=bb&op=viewtopic&t=41489 and the MDB activator hasingleton MBean is set up per http://www.jboss.org/index.html?module=bb&op=viewtopic&t=55794.

      The failover works fine, eventually. Once the master node fails, it consistently takes over a minute for the singleton MBean to come up on another node. I'm working on a two-node cluster on separate boxes, running JBoss 3.2.6 on Windows (loopback is set to true, though this doesn't make any difference in this case). Here's a snippet from the log files showing the time gap:

      2005-02-17 16:54:00,959 DEBUG [org.jboss.deployment.MainDeployer] Begin deployment start file:/C:/pf/jboss-3.2.6/server/all//deploy-hasingleton
      2005-02-17 16:54:00,959 DEBUG [org.jboss.deployment.MainDeployer] Begin deployment start file:/C:/pf/jboss-3.2.6/server/all/deploy-hasingleton/MdbActivator.sar
      2005-02-17 16:54:00,959 DEBUG [org.jboss.deployment.SARDeployer] Deploying SAR, start step: url file:/C:/pf/jboss-3.2.6/server/all/deploy-hasingleton/MdbActivator.sar
      2005-02-17 16:54:00,959 DEBUG [org.jboss.system.ServiceController] starting service test.hasingletonmdb:service=MdbActivator
      2005-02-17 16:54:00,959 DEBUG [com.gweiss.test.jboss.hasingletonmdb.MdbActivator] Starting test.hasingletonmdb:service=MdbActivator
      2005-02-17 16:54:00,959 DEBUG [com.gweiss.test.jboss.hasingletonmdb.MdbActivator] Started test.hasingletonmdb:service=MdbActivator
      2005-02-17 16:54:00,959 DEBUG [org.jboss.management.j2ee.LocalJBossServerDomain] handleNotification: javax.management.Notification[source=jboss.system:service=ServiceController,type= org.jboss.system.ServiceMBean.start,sequenceNumber=175,timeStamp=1108677240959,message=null,userData=test.hasingletonmdb:service=MdbActivator]
      2005-02-17 16:54:00,959 DEBUG [org.jboss.system.ServiceController] Starting dependent components for: test.hasingletonmdb:service=MdbActivator dependent components: [ObjectName: test.hasingletonmdb:service=MdbActivatorController
       state: CREATED
       I Depend On: jboss:service=DefaultPartition
       test.hasingletonmdb:service=MdbActivator
      
       Depends On Me: ]
      2005-02-17 16:54:00,959 DEBUG [org.jboss.system.ServiceController] starting service test.hasingletonmdb:service=MdbActivatorController
      2005-02-17 16:54:00,959 DEBUG [org.jboss.ha.singleton.HASingletonController] Starting test.hasingletonmdb:service=MdbActivatorController
      2005-02-17 16:54:00,959 DEBUG [org.jboss.ha.singleton.HASingletonController] start HASingletonController
      2005-02-17 16:54:00,959 DEBUG [org.jboss.ha.singleton.HASingletonController] findHAPartitionWithName, name=DefaultPartition
      2005-02-17 16:54:00,974 DEBUG [org.jboss.ha.framework.server.HAPartitionImpl] dests=[eugenehp3000cluster:2275 (additional data: 20 bytes)], method_call=DistributedReplicantManager._add(test.hasingletonmdb:service=MdbActivatorController, 192.168.201.224:1099, ), mode=2, timeout=60000
      2005-02-17 16:54:00,974 DEBUG [org.jboss.ha.framework.server.HAPartitionImpl] real_dests=[eugenehp3000cluster:2275 (additional data: 20 bytes)]
      2005-02-17 16:55:00,976 DEBUG [org.jboss.ha.framework.server.HAPartitionImpl] responses: [sender=eugenehp3000cluster:2275 (additional data: 20 bytes), retval=null, received=false, suspected=false]
      

      Looking at the logs, it appears that the node is notified that it is the new master and starts bringing up the hasingleton. As it starts, the singleton base class (HASingletonController) seems to query the main partition, which I would think it would already know is itself, since it has been notified that it is the master. That doesn't seem to be the case, though, so the request times out after 60 seconds (timeout=60000 in the log), and only then does the hasingleton come up.
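
      As a sanity check on that reading, the gap between the last two HAPartitionImpl lines in the log above can be compared with the timeout=60000 value in the request. This is only a minimal sketch: the class and method names (LogGap, gapMillis) are made up for illustration, and only the two timestamps and the timeout value come from the log.

```java
import java.time.Duration;
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;

// Hypothetical helper, not part of JBoss: measures the gap between two
// timestamps written in the log format shown above.
class LogGap {
    // Timestamp layout used in the log excerpt, e.g. "2005-02-17 16:54:00,974".
    private static final DateTimeFormatter LOG_TS =
            DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss,SSS");

    // Returns the number of milliseconds from the first timestamp to the second.
    static long gapMillis(String from, String to) {
        LocalDateTime start = LocalDateTime.parse(from, LOG_TS);
        LocalDateTime end = LocalDateTime.parse(to, LOG_TS);
        return Duration.between(start, end).toMillis();
    }

    public static void main(String[] args) {
        // Timestamps of the "real_dests=" line and the "responses:" line.
        long gap = gapMillis("2005-02-17 16:54:00,974", "2005-02-17 16:55:00,976");
        long timeout = 60000L; // timeout=60000 from the method_call log line
        System.out.println("gap = " + gap + " ms (timeout = " + timeout + " ms)");
    }
}
```

      The gap comes out at 60002 ms, just over the 60000 ms timeout, which is consistent with the call waiting out the full timeout before the singleton is started.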

      My question is: how can I reduce the failover time? Where in the configuration can I reduce the timeout for just this partition request? Or is my guess right that the node should already know it's the master, and if so, is there a patch available that corrects this?

      Thanks,
      Alex