HASingleton failover slow
alapins Feb 18, 2005 1:15 PM
I'm setting up a clustered singleton MDB. Only a single MDB should exist on the cluster. When the node the MDB is on fails, it should be activated on another node (i.e. startDelivery is called on the invoker-binding-proxy for the MDB). The MDB is set up as Adrian recommended in http://www.jboss.org/index.html?module=bb&op=viewtopic&t=41489 and the MDB activator hasingleton MBean is set up per http://www.jboss.org/index.html?module=bb&op=viewtopic&t=55794.
The failover works, eventually. It consistently takes over 1 minute for the singleton MBean to come up on another node once the master node fails. I'm working on a two-node cluster on different boxes, running 3.2.6 on Windows (loopback is set to true, though this doesn't make any difference in this case). Here's a snippet from the log files showing the time gap:
2005-02-17 16:54:00,959 DEBUG [org.jboss.deployment.MainDeployer] Begin deployment start file:/C:/pf/jboss-3.2.6/server/all//deploy-hasingleton
2005-02-17 16:54:00,959 DEBUG [org.jboss.deployment.MainDeployer] Begin deployment start file:/C:/pf/jboss-3.2.6/server/all/deploy-hasingleton/MdbActivator.sar
2005-02-17 16:54:00,959 DEBUG [org.jboss.deployment.SARDeployer] Deploying SAR, start step: url file:/C:/pf/jboss-3.2.6/server/all/deploy-hasingleton/MdbActivator.sar
2005-02-17 16:54:00,959 DEBUG [org.jboss.system.ServiceController] starting service test.hasingletonmdb:service=MdbActivator
2005-02-17 16:54:00,959 DEBUG [com.gweiss.test.jboss.hasingletonmdb.MdbActivator] Starting test.hasingletonmdb:service=MdbActivator
2005-02-17 16:54:00,959 DEBUG [com.gweiss.test.jboss.hasingletonmdb.MdbActivator] Started test.hasingletonmdb:service=MdbActivator
2005-02-17 16:54:00,959 DEBUG [org.jboss.management.j2ee.LocalJBossServerDomain] handleNotification: javax.management.Notification[source=jboss.system:service=ServiceController,type= org.jboss.system.ServiceMBean.start,sequenceNumber=175,timeStamp=1108677240959,message=null,userData=test.hasingletonmdb:service=MdbActivator]
2005-02-17 16:54:00,959 DEBUG [org.jboss.system.ServiceController] Starting dependent components for: test.hasingletonmdb:service=MdbActivator dependent components: [ObjectName: test.hasingletonmdb:service=MdbActivatorController state: CREATED I Depend On: jboss:service=DefaultPartition test.hasingletonmdb:service=MdbActivator Depends On Me: ]
2005-02-17 16:54:00,959 DEBUG [org.jboss.system.ServiceController] starting service test.hasingletonmdb:service=MdbActivatorController
2005-02-17 16:54:00,959 DEBUG [org.jboss.ha.singleton.HASingletonController] Starting test.hasingletonmdb:service=MdbActivatorController
2005-02-17 16:54:00,959 DEBUG [org.jboss.ha.singleton.HASingletonController] start HASingletonController
2005-02-17 16:54:00,959 DEBUG [org.jboss.ha.singleton.HASingletonController] findHAPartitionWithName, name=DefaultPartition
2005-02-17 16:54:00,974 DEBUG [org.jboss.ha.framework.server.HAPartitionImpl] dests=[eugenehp3000cluster:2275 (additional data: 20 bytes)], method_call=DistributedReplicantManager._add(test.hasingletonmdb:service=MdbActivatorController, 192.168.201.224:1099, ), mode=2, timeout=60000
2005-02-17 16:54:00,974 DEBUG [org.jboss.ha.framework.server.HAPartitionImpl] real_dests=[eugenehp3000cluster:2275 (additional data: 20 bytes)]
2005-02-17 16:55:00,976 DEBUG [org.jboss.ha.framework.server.HAPartitionImpl] responses: [sender=eugenehp3000cluster:2275 (additional data: 20 bytes), retval=null, received=false, suspected=false]
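To put a number on the gap, here is a small Java sketch that parses the timestamps of the last two HAPartitionImpl entries above and prints the difference. The class and method names (LogGap, gapMillis) are made up for illustration; only the two timestamps come from the log.

```java
import java.time.Duration;
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;

public class LogGap {
    // Timestamp layout used by the log lines above (comma before the milliseconds).
    private static final DateTimeFormatter LOG_TS =
            DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss,SSS");

    /** Returns the number of milliseconds between two log timestamps. */
    static long gapMillis(String earlier, String later) {
        LocalDateTime a = LocalDateTime.parse(earlier, LOG_TS);
        LocalDateTime b = LocalDateTime.parse(later, LOG_TS);
        return Duration.between(a, b).toMillis();
    }

    public static void main(String[] args) {
        // Timestamps of the "real_dests=" entry and the "responses:" entry.
        long gap = gapMillis("2005-02-17 16:54:00,974", "2005-02-17 16:55:00,976");
        System.out.println("gap = " + gap + " ms");
    }
}
```

The result is 60002 ms, which lines up with the timeout=60000 shown on the method_call entry.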
From the logs, it looks like the node is notified that it is the new master and starts bringing up the hasingleton. While starting, the singleton base class (HASingletonController) seems to query the main partition, which I would think it would know was itself, since it has already received notification that it is the master. That doesn't seem to be the case, though, so the request times out after 60 seconds, and only then does the hasingleton come up.
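To illustrate the behaviour I suspect, here is a toy Java model of a "wait for every destination" call. It is not JBoss or JGroups code: the Member interface and callAll method are hypothetical, and it assumes that the call in the log blocks until every listed destination answers or the timeout expires. Under that assumption, a failed node that is still in the destination list costs the full timeout.

```java
import java.util.List;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

public class WaitForAllSketch {
    /** Hypothetical stand-in for a cluster member; not a JBoss or JGroups type. */
    interface Member {
        boolean alive();
    }

    /**
     * Pretends to send a request to every destination, then waits until all
     * have answered or the timeout expires. Returns the time spent waiting, in ms.
     */
    static long callAll(List<Member> dests, long timeoutMs) throws InterruptedException {
        CountDownLatch pending = new CountDownLatch(dests.size());
        for (Member m : dests) {
            if (m.alive()) {
                // A live member answers immediately; a dead one never does.
                pending.countDown();
            }
        }
        long start = System.nanoTime();
        pending.await(timeoutMs, TimeUnit.MILLISECONDS);
        return TimeUnit.NANOSECONDS.toMillis(System.nanoTime() - start);
    }

    public static void main(String[] args) throws InterruptedException {
        Member dead = () -> false;
        Member live = () -> true;
        // Scaled down: 600 ms here stands in for the 60000 ms seen in the log.
        System.out.println("dead member in dests: waited ~" + callAll(List.of(dead), 600) + " ms");
        System.out.println("live member in dests: waited ~" + callAll(List.of(live), 600) + " ms");
    }
}
```

With the dead member in the list the call returns only after the timeout; with a live member it returns almost at once. If that is what is happening here, either a shorter timeout or removing the failed node from the destination list sooner would cut the failover time.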
My question is, how can I reduce the failover time? Where in the config scripts can I reduce the timeout only for requesting the partition? Or is my guess right that it should know that it's the master already, and if so is there a patch available that corrects this?
Thanks,
Alex