HA-JMS fails, Master node undeploying channels, no failover
jkressin Jun 22, 2006 5:56 AM
First, sorry for the lengthy post, but I need to describe the problem in detail:
We have a cluster of 6 JBoss instances (JBoss 4.0.3SP1) on 3 physical machines. Each machine runs two JBoss instances, and each instance has its own IP address. Each machine has one network adapter with two IP addresses. We use UDP as the transport layer in JGroups (config below). Of the cluster services we only use HA-JMS, i.e. clustered topics and queues. Everything works fine, but from time to time (every 2-4 days) HA-JMS fails completely and messages get lost, which should not happen at all (that's why we use a cluster).
Here's what happens: All instances are up and running, and I can see that all 6 instances participate in the cluster. Suddenly on the master node I see a log file entry like this:
2006-06-21 08:14:35,049 INFO [org.jboss.ha.framework.interfaces.HAPartition.StagePartition] New cluster view for partition StagePartition: 203 ([62.50.43.211:1099, 62.50.43.213:1099, 62.50.43.216:1099, 62.50.43.215:1099, 62.50.43.214:1099, 62.50.43.210:1099] delta: 0)
2006-06-21 08:14:35,049 INFO [org.jboss.ha.framework.server.DistributedReplicantManagerImpl.StagePartition] I am (62.50.43.210:1099) received membershipChanged event:
2006-06-21 08:14:35,049 INFO [org.jboss.ha.framework.server.DistributedReplicantManagerImpl.StagePartition] Dead members: 0 ([])
2006-06-21 08:14:35,049 INFO [org.jboss.ha.framework.server.DistributedReplicantManagerImpl.StagePartition] New Members : 0 ([])
2006-06-21 08:14:35,049 INFO [org.jboss.ha.framework.server.DistributedReplicantManagerImpl.StagePartition] All Members : 6 ([62.50.43.211:1099, 62.50.43.213:1099, 62.50.43.216:1099, 62.50.43.215:1099, 62.50.43.214:1099, 62.50.43.210:1099])
The first strange thing is: Dead members: 0, New Members: 0, which I read as "nothing has changed at all" ;)
Directly after this message, the master node starts to undeploy all queues and topics:
2006-06-21 08:14:35,329 INFO [org.jboss.mq.server.jmx.Topic.sgw/MOCacheInvalidationTopic] Unbinding JNDI name: topic/sgw/MOCacheInvalidationTopic
2006-06-21 08:14:35,465 INFO [org.jboss.mq.server.jmx.Topic.sgw/CdaHtmlCacheInvalidationTopic] Unbinding JNDI name: topic/sgw/CdaHtmlCacheInvalidationTopic
2006-06-21 08:14:35,466 INFO [org.jboss.mq.server.jmx.Queue.sgw/AlertUserQueue] Unbinding JNDI name: queue/sgw/AlertUserQueue
2006-06-21 08:14:35,466 INFO [org.jboss.mq.server.jmx.Queue.sgw/UserQueue] Unbinding JNDI name: queue/sgw/UserQueue
2006-06-21 08:14:35,467 INFO [org.jboss.mq.server.jmx.Queue.sgw/OrderQueue] Unbinding JNDI name: queue/sgw/OrderQueue
[...]
2006-06-21 08:14:35,470 INFO [org.jboss.mq.server.jmx.Queue.DLQ] Unbinding JNDI name: queue/DLQ
2006-06-21 08:14:35,546 INFO [org.jboss.web.tomcat.tc5.TomcatDeployer] undeploy, ctxPath=/jbossmq-httpil, warUrl=.../deploy-hasingleton/jms/jbossmq-httpil.sar/jbossmq-httpil.war/
But the instance still claims to be the master node. No other instance starts to take over the undeployed services, so whenever an instance tries to post a message we get:
javax.jms.InvalidDestinationException: This destination does not exist! TOPIC.sgw/MOCacheInvalidationTopic
at org.jboss.mq.server.JMSDestinationManager.addMessage(JMSDestinationManager.java:389)
at org.jboss.mq.server.JMSDestinationManager.addMessage(JMSDestinationManager.java:373)
at org.jboss.mq.server.JMSServerInvoker.addMessage(JMSServerInvoker.java:136)
at org.jboss.mq.il.uil2.ServerSocketManagerHandler.handleMsg(ServerSocketManagerHandler.java:92)
at org.jboss.mq.il.uil2.SocketManager$ReadTask.handleMsg(SocketManager.java:369)
Exactly at the time when the master node undeploys all services, all the other instances start to go crazy as well:
2006-06-21 08:14:24,728 INFO [org.jboss.ha.framework.server.DistributedReplicantManagerImpl.StagePartition] I am (62.50.43.215:1099) received membershipChanged event:
2006-06-21 08:14:24,728 INFO [org.jboss.ha.framework.server.DistributedReplicantManagerImpl.StagePartition] Dead members: 2 ([62.50.43.210:1099, 62.50.43.214:1099])
2006-06-21 08:14:24,728 INFO [org.jboss.ha.framework.server.DistributedReplicantManagerImpl.StagePartition] New Members : 0 ([])
2006-06-21 08:14:24,728 INFO [org.jboss.ha.framework.server.DistributedReplicantManagerImpl.StagePartition] All Members : 4 ([62.50.43.211:1099, 62.50.43.213:1099, 62.50.43.216:1099, 62.50.43.215:1099])
2006-06-21 08:14:24,798 INFO [org.jboss.ha.framework.interfaces.HAPartition.StagePartition] Suspected member: 62.50.43.214:54923 (additional data: 17 bytes)
2006-06-21 08:14:26,800 INFO [org.jboss.ha.framework.interfaces.HAPartition.StagePartition] Suspected member: dep004174-05:54893 (additional data: 17 bytes)
2006-06-21 08:14:31,547 ERROR [com.artnology.sgw.cda.tracking.Webtracking] getObjectType() returns null for SGWID '4-102-0-0-0'
2006-06-21 08:14:34,867 INFO [org.jboss.ha.framework.interfaces.HAPartition.StagePartition] New cluster view for partition StagePartition: 202 ([62.50.43.211:1099, 62.50.43.213:1099, 62.50.43.216:1099, 62.50.43.215:1099, 62.50.43.214:1099] delta: 1)
2006-06-21 08:14:34,867 INFO [org.jboss.ha.framework.server.DistributedReplicantManagerImpl.StagePartition] I am (62.50.43.215:1099) received membershipChanged event:
2006-06-21 08:14:34,867 INFO [org.jboss.ha.framework.server.DistributedReplicantManagerImpl.StagePartition] Dead members: 0 ([])
2006-06-21 08:14:34,867 INFO [org.jboss.ha.framework.server.DistributedReplicantManagerImpl.StagePartition] New Members : 1 ([62.50.43.214:1099])
2006-06-21 08:14:34,867 INFO [org.jboss.ha.framework.server.DistributedReplicantManagerImpl.StagePartition] All Members : 5 ([62.50.43.211:1099, 62.50.43.213:1099, 62.50.43.216:1099, 62.50.43.215:1099, 62.50.43.214:1099])
2006-06-21 08:14:35,022 INFO [org.jboss.ha.framework.interfaces.HAPartition.StagePartition] New cluster view for partition StagePartition: 203 ([62.50.43.211:1099, 62.50.43.213:1099, 62.50.43.216:1099, 62.50.43.215:1099, 62.50.43.214:1099, 62.50.43.210:1099] delta: 1)
2006-06-21 08:14:35,022 INFO [org.jboss.ha.framework.server.DistributedReplicantManagerImpl.StagePartition] I am (62.50.43.215:1099) received membershipChanged event:
2006-06-21 08:14:35,022 INFO [org.jboss.ha.framework.server.DistributedReplicantManagerImpl.StagePartition] Dead members: 0 ([])
2006-06-21 08:14:35,022 INFO [org.jboss.ha.framework.server.DistributedReplicantManagerImpl.StagePartition] New Members : 1 ([62.50.43.210:1099])
2006-06-21 08:14:35,022 INFO [org.jboss.ha.framework.server.DistributedReplicantManagerImpl.StagePartition] All Members : 6 ([62.50.43.211:1099, 62.50.43.213:1099, 62.50.43.216:1099, 62.50.43.215:1099, 62.50.43.214:1099, 62.50.43.210:1099])
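One way to line up the membership events from several nodes is to pull them out of each log into a timeline. The following is only a minimal sketch and not part of the setup described here: it assumes the log line format shown in the excerpts above (with wrapped lines rejoined), and the names `parse_membership`, `changes_only` and `SAMPLE` are made up for illustration.

```python
import re
from datetime import datetime

# Matches the DistributedReplicantManagerImpl lines shown above, e.g.
#   2006-06-21 08:14:24,728 INFO [...StagePartition] Dead members: 2 ([a:1099, b:1099])
# The pattern is an assumption derived from these excerpts only.
MEMBERSHIP_LINE = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3})\s+\w+\s+\[[^\]]+\]\s+"
    r"(?P<kind>Dead members|New Members|All Members)\s*:\s*\d+\s+"
    r"\(\[(?P<members>[^\]]*)\]\)"
)


def parse_membership(lines):
    """Return (timestamp, kind, members) tuples for every membership line."""
    events = []
    for line in lines:
        match = MEMBERSHIP_LINE.match(line.strip())
        if match is None:
            continue  # not a membership line, or a line that is still wrapped
        timestamp = datetime.strptime(match.group("ts"), "%Y-%m-%d %H:%M:%S,%f")
        members = [m.strip() for m in match.group("members").split(",") if m.strip()]
        events.append((timestamp, match.group("kind"), members))
    return events


def changes_only(events):
    """Keep only the events in which members actually died or joined."""
    return [e for e in events if e[1] != "All Members" and e[2]]


# Lines taken from the log of node 62.50.43.215 quoted above.
PREFIX = "INFO [org.jboss.ha.framework.server.DistributedReplicantManagerImpl.StagePartition] "
SAMPLE = [
    "2006-06-21 08:14:24,728 " + PREFIX + "Dead members: 2 ([62.50.43.210:1099, 62.50.43.214:1099])",
    "2006-06-21 08:14:24,728 " + PREFIX + "New Members : 0 ([])",
    "2006-06-21 08:14:34,867 " + PREFIX + "New Members : 1 ([62.50.43.214:1099])",
    "2006-06-21 08:14:35,022 " + PREFIX + "New Members : 1 ([62.50.43.210:1099])",
]

if __name__ == "__main__":
    for timestamp, kind, members in changes_only(parse_membership(SAMPLE)):
        print(timestamp.time(), kind, members)
```

Running the same extraction over the log of every instance and sorting the merged events by timestamp would show which node saw which member leave or rejoin first, provided the machines' clocks are reasonably in sync.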
The log messages from the different instances do not correlate. I have no idea why this happens, and as I said, it happens sporadically; there is no obvious pattern (say, every 4 hours or so). Has anyone experienced similar behaviour before? Can someone tell me how I can track down this problem? Why does the master node suddenly start to undeploy all channels while still claiming to be the master node? Whenever this problem occurs, messages get lost, which is unacceptable for a production system. Any help is greatly appreciated.
Thanks!
Jochen
JGroups configuration:
<server>

  <!-- ==================================================================== -->
  <!-- Cluster Partition: defines cluster -->
  <!-- ==================================================================== -->
  <mbean code="org.jboss.ha.framework.server.ClusterPartition"
         name="jboss:service=${jboss.partition.name:DefaultPartition}">
    <!-- Name of the partition being built -->
    <attribute name="PartitionName">${jboss.partition.name:DefaultPartition}</attribute>
    <!-- The address used to determine the node name -->
    <attribute name="NodeAddress">${jboss.bind.address}</attribute>
    <!-- Determine if deadlock detection is enabled -->
    <attribute name="DeadlockDetection">False</attribute>
    <!-- Max time (in ms) to wait for state transfer to complete. Increase for large states -->
    <attribute name="StateTransferTimeout">30000</attribute>
    <!-- The JGroups protocol configuration -->
    <attribute name="PartitionConfig">
      <Config>
        <UDP mcast_addr="${jboss.partition.udpGroup:228.1.2.3}" mcast_port="45566"
             ip_ttl="8" ip_mcast="true"
             mcast_send_buf_size="800000" mcast_recv_buf_size="150000"
             ucast_send_buf_size="800000" ucast_recv_buf_size="150000"
             loopback="false"/>
        <PING timeout="2000" num_initial_members="3"
              up_thread="true" down_thread="true"/>
        <MERGE2 min_interval="10000" max_interval="20000"/>
        <FD shun="true" up_thread="true" down_thread="true"
            timeout="2500" max_tries="5"/>
        <VERIFY_SUSPECT timeout="3000" num_msgs="3"
                        up_thread="true" down_thread="true"/>
        <pbcast.NAKACK gc_lag="50" retransmit_timeout="300,600,1200,2400,4800"
                       max_xmit_size="8192"
                       up_thread="true" down_thread="true"/>
        <UNICAST timeout="300,600,1200,2400,4800" window_size="100" min_threshold="10"
                 down_thread="true"/>
        <pbcast.STABLE desired_avg_gossip="20000"
                       up_thread="true" down_thread="true"/>
        <FRAG frag_size="8192"
              down_thread="true" up_thread="true"/>
        <pbcast.GMS join_timeout="5000" join_retry_timeout="2000"
                    shun="true" print_local_addr="true"/>
        <pbcast.STATE_TRANSFER up_thread="true" down_thread="true"/>
      </Config>
    </attribute>
    <depends>jboss:service=Naming</depends>
  </mbean>

  <!-- ==================================================================== -->
  <!-- HA Session State Service for SFSB -->
  <!-- ==================================================================== -->
  <mbean code="org.jboss.ha.hasessionstate.server.HASessionStateService"
         name="jboss:service=HASessionState">
    <!-- Name of the partition to which the service is linked -->
    <attribute name="PartitionName">${jboss.partition.name:DefaultPartition}</attribute>
    <!-- JNDI name under which the service is bound -->
    <attribute name="JndiName">/HASessionState/Default</attribute>
    <!-- Max delay before cleaning unreclaimed state. Defaults to 30*60*1000 => 30 minutes -->
    <attribute name="BeanCleaningDelay">0</attribute>
    <depends>jboss:service=Naming</depends>
    <depends>jboss:service=${jboss.partition.name:DefaultPartition}</depends>
  </mbean>

  <!-- ==================================================================== -->
  <!-- HA JNDI -->
  <!-- ==================================================================== -->
  <mbean code="org.jboss.ha.jndi.HANamingService"
         name="jboss:service=HAJNDI">
    <depends>jboss:service=${jboss.partition.name:DefaultPartition}</depends>
    <!-- Name of the partition to which the service is linked -->
    <attribute name="PartitionName">${jboss.partition.name:DefaultPartition}</attribute>
    <!-- Bind address of bootstrap and HA-JNDI RMI endpoints -->
    <attribute name="BindAddress">${jboss.bind.address}</attribute>
    <!-- Port on which the HA-JNDI stub is made available -->
    <attribute name="Port">1100</attribute>
    <!-- RmiPort to be used by the HA-JNDI service once bound. 0 => auto. -->
    <attribute name="RmiPort">1101</attribute>
    <!-- Accept backlog of the bootstrap socket -->
    <attribute name="Backlog">50</attribute>
    <!-- The thread pool service used to control the bootstrap and auto discovery lookups -->
    <depends optional-attribute-name="LookupPool" proxy-type="attribute">jboss.system:service=ThreadPool</depends>
    <!-- A flag to disable the auto discovery via multicast -->
    <attribute name="DiscoveryDisabled">false</attribute>
    <!-- Set the auto-discovery bootstrap multicast bind address. If not specified and a BindAddress is specified, the BindAddress will be used. -->
    <attribute name="AutoDiscoveryBindAddress">${jboss.bind.address}</attribute>
    <!-- Multicast Address and group port used for auto-discovery -->
    <attribute name="AutoDiscoveryAddress">${jboss.partition.udpGroup:230.0.0.4}</attribute>
    <attribute name="AutoDiscoveryGroup">1102</attribute>
    <!-- The TTL (time-to-live) for autodiscovery IP multicast packets -->
    <attribute name="AutoDiscoveryTTL">16</attribute>
    <!-- Client socket factory to be used for client-server RMI invocations during JNDI queries
    <attribute name="ClientSocketFactory">custom</attribute>
    -->
    <!-- Server socket factory to be used for client-server RMI invocations during JNDI queries
    <attribute name="ServerSocketFactory">custom</attribute>
    -->
  </mbean>

  <mbean code="org.jboss.invocation.jrmp.server.JRMPInvokerHA"
         name="jboss:service=invoker,type=jrmpha">
    <attribute name="ServerAddress">${jboss.bind.address}</attribute>
    <attribute name="RMIObjectPort">4447</attribute>
    <!--
    <attribute name="RMIClientSocketFactory">custom</attribute>
    <attribute name="RMIServerSocketFactory">custom</attribute>
    -->
    <depends>jboss:service=Naming</depends>
  </mbean>

  <!-- the JRMPInvokerHA creates a thread per request. This implementation uses a pool of threads -->
  <mbean code="org.jboss.invocation.pooled.server.PooledInvokerHA"
         name="jboss:service=invoker,type=pooledha">
    <attribute name="NumAcceptThreads">1</attribute>
    <attribute name="MaxPoolSize">300</attribute>
    <attribute name="ClientMaxPoolSize">300</attribute>
    <attribute name="SocketTimeout">60000</attribute>
    <attribute name="ServerBindAddress">${jboss.bind.address}</attribute>
    <attribute name="ServerBindPort">4446</attribute>
    <attribute name="ClientConnectAddress">${jboss.bind.address}</attribute>
    <attribute name="ClientConnectPort">0</attribute>
    <attribute name="EnableTcpNoDelay">false</attribute>
    <depends optional-attribute-name="TransactionManagerService">jboss:service=TransactionManager</depends>
    <depends>jboss:service=Naming</depends>
  </mbean>

  <!-- ==================================================================== -->
  <!-- ==================================================================== -->
  <!-- Distributed cache invalidation -->
  <!-- ==================================================================== -->
  <mbean code="org.jboss.cache.invalidation.bridges.JGCacheInvalidationBridge"
         name="jboss.cache:service=InvalidationBridge,type=JavaGroups">
    <attribute name="InvalidationManager">jboss.cache:service=InvalidationManager</attribute>
    <attribute name="PartitionName">${jboss.partition.name:DefaultPartition}</attribute>
    <attribute name="BridgeName">DefaultJGBridge</attribute>
    <depends>jboss:service=${jboss.partition.name:DefaultPartition}</depends>
    <depends>jboss.cache:service=InvalidationManager</depends>
  </mbean>

</server>
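
For orientation, the failure-detection window implied by the FD and VERIFY_SUSPECT settings above can be estimated with a little arithmetic. This is a rough sketch under stated assumptions, not a description of the exact JGroups behaviour: it assumes FD suspects a member after `max_tries` consecutive heartbeats each go unanswered for `timeout` ms, and that VERIFY_SUSPECT then waits up to its own `timeout` before the suspicion is confirmed. The function name `worst_case_detection_ms` is made up for illustration.

```python
def worst_case_detection_ms(fd_timeout_ms, fd_max_tries, verify_timeout_ms):
    """Approximate time until an unresponsive member is declared dead.

    Assumption: FD needs fd_max_tries missed heartbeats of fd_timeout_ms each,
    then VERIFY_SUSPECT waits up to verify_timeout_ms for a last reply.
    """
    return fd_timeout_ms * fd_max_tries + verify_timeout_ms


# Values from the FD and VERIFY_SUSPECT elements in the configuration above.
print(worst_case_detection_ms(2500, 5, 3000))  # 15500
```

Under these assumptions, an instance that does not answer heartbeats for roughly 15.5 seconds could be dropped from the view. Since both FD and pbcast.GMS are configured with shun="true", such an instance would then probably have to rejoin, which may be what the "Dead members" followed by "New Members" pattern in the logs reflects.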