6 Replies Latest reply on Jul 21, 2006 11:58 AM by jkressin

    HA-JMS fails, Master node undeploying channels, no failover

    jkressin

      First, sorry for the lengthy post, but I need to describe the problem in detail:
      We have a cluster of 6 JBoss instances (JBoss 4.0.3SP1) on 3 physical machines. Each machine runs two JBoss instances, and each JBoss instance has its own IP; the machines have one network adapter with two IP addresses. We use UDP as the transport layer in JGroups (config below). Of the cluster services we only use HA-JMS, i.e. clustered topics and queues. Everything works fine, but from time to time (every 2-4 days) HA-JMS fails completely and messages get lost, which should not happen at all (that's why we use a cluster).

      Here's what happens: All instances are up and running, and I can see that all 6 instances participate in the cluster. Suddenly on the master node I see a log file entry like this:

      2006-06-21 08:14:35,049 INFO [org.jboss.ha.framework.interfaces.HAPartition.StagePartition] New cluster view for partition StagePartition: 203 ([62.50.43.211:1099, 62.50.43.213:1099, 62.50.43.216:1099, 62.50.43.215:1099, 62.50.43.214:1099, 62.50.43.210:1099] delta: 0)
      2006-06-21 08:14:35,049 INFO [org.jboss.ha.framework.server.DistributedReplicantManagerImpl.StagePartition] I am (62.50.43.210:1099) received membershipChanged event:
      2006-06-21 08:14:35,049 INFO [org.jboss.ha.framework.server.DistributedReplicantManagerImpl.StagePartition] Dead members: 0 ([])
      2006-06-21 08:14:35,049 INFO [org.jboss.ha.framework.server.DistributedReplicantManagerImpl.StagePartition] New Members : 0 ([])
      2006-06-21 08:14:35,049 INFO [org.jboss.ha.framework.server.DistributedReplicantManagerImpl.StagePartition] All Members : 6 ([62.50.43.211:1099, 62.50.43.213:1099, 62.50.43.216:1099, 62.50.43.215:1099, 62.50.43.214:1099, 62.50.43.210:1099])

      The first strange thing is: Dead members: 0, New members: 0, which I read as "nothing has changed at all" ;)

      Directly after this message, the master node starts to undeploy all queues and topics:

      2006-06-21 08:14:35,329 INFO [org.jboss.mq.server.jmx.Topic.sgw/MOCacheInvalidationTopic] Unbinding JNDI name: topic/sgw/MOCacheInvalidationTopic
      2006-06-21 08:14:35,465 INFO [org.jboss.mq.server.jmx.Topic.sgw/CdaHtmlCacheInvalidationTopic] Unbinding JNDI name: topic/sgw/CdaHtmlCacheInvalidationTopic
      2006-06-21 08:14:35,466 INFO [org.jboss.mq.server.jmx.Queue.sgw/AlertUserQueue] Unbinding JNDI name: queue/sgw/AlertUserQueue
      2006-06-21 08:14:35,466 INFO [org.jboss.mq.server.jmx.Queue.sgw/UserQueue] Unbinding JNDI name: queue/sgw/UserQueue
      2006-06-21 08:14:35,467 INFO [org.jboss.mq.server.jmx.Queue.sgw/OrderQueue] Unbinding JNDI name: queue/sgw/OrderQueue
      [...]
      2006-06-21 08:14:35,470 INFO [org.jboss.mq.server.jmx.Queue.DLQ] Unbinding JNDI name: queue/DLQ
      2006-06-21 08:14:35,546 INFO [org.jboss.web.tomcat.tc5.TomcatDeployer] undeploy, ctxPath=/jbossmq-httpil, warUrl=.../deploy-hasingleton/jms/jbossmq-httpil.sar/jbossmq-httpil.war/

      But the instance still claims to be the master node. No other instance starts to take over the undeployed services, so whenever an instance tries to post a message we get:

      javax.jms.InvalidDestinationException: This destination does not exist! TOPIC.sgw/MOCacheInvalidationTopic
      at org.jboss.mq.server.JMSDestinationManager.addMessage(JMSDestinationManager.java:389)
      at org.jboss.mq.server.JMSDestinationManager.addMessage(JMSDestinationManager.java:373)
      at org.jboss.mq.server.JMSServerInvoker.addMessage(JMSServerInvoker.java:136)
      at org.jboss.mq.il.uil2.ServerSocketManagerHandler.handleMsg(ServerSocketManagerHandler.java:92)
      at org.jboss.mq.il.uil2.SocketManager$ReadTask.handleMsg(SocketManager.java:369)

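      To be clear about what "post a message" means: the senders just do a plain JMS publish against a topic they looked up from JNDI. A simplified sketch (not our actual code; the "ConnectionFactory" JNDI name is the usual JBossMQ default):

      import javax.jms.*;
      import javax.naming.InitialContext;

      // Sketch only: a plain JMS publish to one of the HA topics.
      public class PublishSketch {
          public static void main(String[] args) throws Exception {
              InitialContext ctx = new InitialContext();
              TopicConnectionFactory tcf = (TopicConnectionFactory) ctx.lookup("ConnectionFactory");
              Topic topic = (Topic) ctx.lookup("topic/sgw/MOCacheInvalidationTopic");

              TopicConnection con = tcf.createTopicConnection();
              TopicSession session = con.createTopicSession(false, Session.AUTO_ACKNOWLEDGE);
              TopicPublisher publisher = session.createPublisher(topic);

              // Once the master has unbound the destination and nobody re-deployed it,
              // this publish is the call that fails.
              publisher.publish(session.createTextMessage("invalidate"));

              con.close();
          }
      }

      Since the topic reference was obtained while the destination was still bound, the failure shows up as the server-side addMessage() rejecting the publish, which is the InvalidDestinationException shown above.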
      Exactly at the time when the master node undeploys all services, all the other instances start to go crazy as well:

      2006-06-21 08:14:24,728 INFO [org.jboss.ha.framework.server.DistributedReplicantManagerImpl.StagePartition] I am (62.50.43.215:1099) received membershipChanged event:
      2006-06-21 08:14:24,728 INFO [org.jboss.ha.framework.server.DistributedReplicantManagerImpl.StagePartition] Dead members: 2 ([62.50.43.210:1099, 62.50.43.214:1099])
      2006-06-21 08:14:24,728 INFO [org.jboss.ha.framework.server.DistributedReplicantManagerImpl.StagePartition] New Members : 0 ([])
      2006-06-21 08:14:24,728 INFO [org.jboss.ha.framework.server.DistributedReplicantManagerImpl.StagePartition] All Members : 4 ([62.50.43.211:1099, 62.50.43.213:1099, 62.50.43.216:1099, 62.50.43.215:1099])
      2006-06-21 08:14:24,798 INFO [org.jboss.ha.framework.interfaces.HAPartition.StagePartition] Suspected member: 62.50.43.214:54923 (additional data: 17 bytes)
      2006-06-21 08:14:26,800 INFO [org.jboss.ha.framework.interfaces.HAPartition.StagePartition] Suspected member: dep004174-05:54893 (additional data: 17 bytes)
      2006-06-21 08:14:31,547 ERROR [com.artnology.sgw.cda.tracking.Webtracking] getObjectType() returns null for SGWID '4-102-0-0-0'
      2006-06-21 08:14:34,867 INFO [org.jboss.ha.framework.interfaces.HAPartition.StagePartition] New cluster view for partition StagePartition: 202 ([62.50.43.211:1099, 62.50.43.213:1099, 62.50.43.216:1099, 62.50.43.215:1099, 62.50.43.214:1099] delta: 1)
      2006-06-21 08:14:34,867 INFO [org.jboss.ha.framework.server.DistributedReplicantManagerImpl.StagePartition] I am (62.50.43.215:1099) received membershipChanged event:
      2006-06-21 08:14:34,867 INFO [org.jboss.ha.framework.server.DistributedReplicantManagerImpl.StagePartition] Dead members: 0 ([])
      2006-06-21 08:14:34,867 INFO [org.jboss.ha.framework.server.DistributedReplicantManagerImpl.StagePartition] New Members : 1 ([62.50.43.214:1099])
      2006-06-21 08:14:34,867 INFO [org.jboss.ha.framework.server.DistributedReplicantManagerImpl.StagePartition] All Members : 5 ([62.50.43.211:1099, 62.50.43.213:1099, 62.50.43.216:1099, 62.50.43.215:1099, 62.50.43.214:1099])
      2006-06-21 08:14:35,022 INFO [org.jboss.ha.framework.interfaces.HAPartition.StagePartition] New cluster view for partition StagePartition: 203 ([62.50.43.211:1099, 62.50.43.213:1099, 62.50.43.216:1099, 62.50.43.215:1099, 62.50.43.214:1099, 62.50.43.210:1099] delta: 1)
      2006-06-21 08:14:35,022 INFO [org.jboss.ha.framework.server.DistributedReplicantManagerImpl.StagePartition] I am (62.50.43.215:1099) received membershipChanged event:
      2006-06-21 08:14:35,022 INFO [org.jboss.ha.framework.server.DistributedReplicantManagerImpl.StagePartition] Dead members: 0 ([])
      2006-06-21 08:14:35,022 INFO [org.jboss.ha.framework.server.DistributedReplicantManagerImpl.StagePartition] New Members : 1 ([62.50.43.210:1099])
      2006-06-21 08:14:35,022 INFO [org.jboss.ha.framework.server.DistributedReplicantManagerImpl.StagePartition] All Members : 6 ([62.50.43.211:1099, 62.50.43.213:1099, 62.50.43.216:1099, 62.50.43.215:1099, 62.50.43.214:1099, 62.50.43.210:1099])

      The log messages from the different instances do not correlate. I have no idea why this happens, and as I said it happens sporadically; there is no obvious pattern to it (such as it happening every 4 hours or so). Has anyone experienced similar behaviour before? Can someone tell me what I can do to hunt down this problem? Why does the master node suddenly start to undeploy all channels while still claiming to be the master node? Whenever this problem occurs, messages get lost, which is unacceptable for a production system. Any help is greatly appreciated.

      Thanks!

      Jochen

      JGroups configuration:

      <server>
      
       <!-- ==================================================================== -->
       <!-- Cluster Partition: defines cluster -->
       <!-- ==================================================================== -->
      
       <mbean code="org.jboss.ha.framework.server.ClusterPartition"
       name="jboss:service=${jboss.partition.name:DefaultPartition}">
      
       <!-- Name of the partition being built -->
       <attribute name="PartitionName">${jboss.partition.name:DefaultPartition}</attribute>
      
       <!-- The address used to determine the node name -->
       <attribute name="NodeAddress">${jboss.bind.address}</attribute>
      
       <!-- Determine if deadlock detection is enabled -->
       <attribute name="DeadlockDetection">False</attribute>
      
       <!-- Max time (in ms) to wait for state transfer to complete. Increase for large states -->
       <attribute name="StateTransferTimeout">30000</attribute>
      
       <!-- The JGroups protocol configuration -->
       <attribute name="PartitionConfig">
       <Config>
       <UDP mcast_addr="${jboss.partition.udpGroup:228.1.2.3}" mcast_port="45566"
       ip_ttl="8" ip_mcast="true"
       mcast_send_buf_size="800000" mcast_recv_buf_size="150000"
       ucast_send_buf_size="800000" ucast_recv_buf_size="150000"
       loopback="false"/>
       <PING timeout="2000" num_initial_members="3"
       up_thread="true" down_thread="true"/>
       <MERGE2 min_interval="10000" max_interval="20000"/>
       <FD shun="true" up_thread="true" down_thread="true"
       timeout="2500" max_tries="5"/>
       <VERIFY_SUSPECT timeout="3000" num_msgs="3"
       up_thread="true" down_thread="true"/>
       <pbcast.NAKACK gc_lag="50" retransmit_timeout="300,600,1200,2400,4800"
       max_xmit_size="8192"
       up_thread="true" down_thread="true"/>
       <UNICAST timeout="300,600,1200,2400,4800" window_size="100" min_threshold="10"
       down_thread="true"/>
       <pbcast.STABLE desired_avg_gossip="20000"
       up_thread="true" down_thread="true"/>
       <FRAG frag_size="8192"
       down_thread="true" up_thread="true"/>
       <pbcast.GMS join_timeout="5000" join_retry_timeout="2000"
       shun="true" print_local_addr="true"/>
       <pbcast.STATE_TRANSFER up_thread="true" down_thread="true"/>
       </Config>
       </attribute>
       <depends>jboss:service=Naming</depends>
       </mbean>
      
       <!-- ==================================================================== -->
       <!-- HA Session State Service for SFSB -->
       <!-- ==================================================================== -->
      
       <mbean code="org.jboss.ha.hasessionstate.server.HASessionStateService"
       name="jboss:service=HASessionState">
       <!-- Name of the partition to which the service is linked -->
       <attribute name="PartitionName">${jboss.partition.name:DefaultPartition}</attribute>
       <!-- JNDI name under which the service is bound -->
       <attribute name="JndiName">/HASessionState/Default</attribute>
       <!-- Max delay before cleaning unreclaimed state.
       Defaults to 30*60*1000 => 30 minutes -->
       <attribute name="BeanCleaningDelay">0</attribute>
       <depends>jboss:service=Naming</depends>
       <depends>jboss:service=${jboss.partition.name:DefaultPartition}</depends>
       </mbean>
      
       <!-- ==================================================================== -->
       <!-- HA JNDI -->
       <!-- ==================================================================== -->
      
       <mbean code="org.jboss.ha.jndi.HANamingService"
       name="jboss:service=HAJNDI">
       <depends>jboss:service=${jboss.partition.name:DefaultPartition}</depends>
       <!-- Name of the partition to which the service is linked -->
       <attribute name="PartitionName">${jboss.partition.name:DefaultPartition}</attribute>
       <!-- Bind address of bootstrap and HA-JNDI RMI endpoints -->
       <attribute name="BindAddress">${jboss.bind.address}</attribute>
       <!-- Port on which the HA-JNDI stub is made available -->
       <attribute name="Port">1100</attribute>
       <!-- RmiPort to be used by the HA-JNDI service once bound. 0 => auto. -->
       <attribute name="RmiPort">1101</attribute>
       <!-- Accept backlog of the bootstrap socket -->
       <attribute name="Backlog">50</attribute>
       <!-- The thread pool service used to control the bootstrap and
       auto discovery lookups -->
       <depends optional-attribute-name="LookupPool"
       proxy-type="attribute">jboss.system:service=ThreadPool</depends>
      
       <!-- A flag to disable the auto discovery via multicast -->
       <attribute name="DiscoveryDisabled">false</attribute>
       <!-- Set the auto-discovery bootstrap multicast bind address. If not
       specified and a BindAddress is specified, the BindAddress will be used. -->
       <attribute name="AutoDiscoveryBindAddress">${jboss.bind.address}</attribute>
       <!-- Multicast Address and group port used for auto-discovery -->
       <attribute name="AutoDiscoveryAddress">${jboss.partition.udpGroup:230.0.0.4}</attribute>
       <attribute name="AutoDiscoveryGroup">1102</attribute>
       <!-- The TTL (time-to-live) for autodiscovery IP multicast packets -->
       <attribute name="AutoDiscoveryTTL">16</attribute>
      
       <!-- Client socket factory to be used for client-server
       RMI invocations during JNDI queries
       <attribute name="ClientSocketFactory">custom</attribute>
       -->
       <!-- Server socket factory to be used for client-server
       RMI invocations during JNDI queries
       <attribute name="ServerSocketFactory">custom</attribute>
       -->
       </mbean>
      
       <mbean code="org.jboss.invocation.jrmp.server.JRMPInvokerHA"
       name="jboss:service=invoker,type=jrmpha">
       <attribute name="ServerAddress">${jboss.bind.address}</attribute>
       <attribute name="RMIObjectPort">4447</attribute>
       <!--
       <attribute name="RMIClientSocketFactory">custom</attribute>
       <attribute name="RMIServerSocketFactory">custom</attribute>
       -->
       <depends>jboss:service=Naming</depends>
       </mbean>
      
       <!-- the JRMPInvokerHA creates a thread per request. This implementation uses a pool of threads -->
       <mbean code="org.jboss.invocation.pooled.server.PooledInvokerHA"
       name="jboss:service=invoker,type=pooledha">
       <attribute name="NumAcceptThreads">1</attribute>
       <attribute name="MaxPoolSize">300</attribute>
       <attribute name="ClientMaxPoolSize">300</attribute>
       <attribute name="SocketTimeout">60000</attribute>
       <attribute name="ServerBindAddress">${jboss.bind.address}</attribute>
       <attribute name="ServerBindPort">4446</attribute>
       <attribute name="ClientConnectAddress">${jboss.bind.address}</attribute>
       <attribute name="ClientConnectPort">0</attribute>
       <attribute name="EnableTcpNoDelay">false</attribute>
       <depends optional-attribute-name="TransactionManagerService">jboss:service=TransactionManager</depends>
       <depends>jboss:service=Naming</depends>
       </mbean>
      
       <!-- ==================================================================== -->
      
       <!-- ==================================================================== -->
       <!-- Distributed cache invalidation -->
       <!-- ==================================================================== -->
      
       <mbean code="org.jboss.cache.invalidation.bridges.JGCacheInvalidationBridge"
       name="jboss.cache:service=InvalidationBridge,type=JavaGroups">
       <attribute name="InvalidationManager">jboss.cache:service=InvalidationManager</attribute>
       <attribute name="PartitionName">${jboss.partition.name:DefaultPartition}</attribute>
       <attribute name="BridgeName">DefaultJGBridge</attribute>
       <depends>jboss:service=${jboss.partition.name:DefaultPartition}</depends>
       <depends>jboss.cache:service=InvalidationManager</depends>
       </mbean>
      
      </server>
      


        • 1. Re: HA-JMS fails, Master node undeploying channels, no failo
          brian.stansberry

          1) You refer to "the master node". Please confirm that this is 62.50.43.211.

          2) On the node that produced the first bit of logging in your post, do you see log entries with this content "New cluster view for partition StagePartition: 202" and "New cluster view for partition StagePartition: 201"?

          3) If you have a log entry somewhere that contains "New cluster view for partition StagePartition: 200", please compare the list of nodes to the first line in the first log entry in your post. Does it have the same 6 nodes but in different order?

          What I'm driving at here is: I wonder whether the machine doing the first bit of logging lost a couple of view changes, jumping from 200 to 203. The result would be Dead members: 0, New members: 0, but a different order of members.

          I'm not sure what that would mean if it were the case, but it's an avenue to explore.
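          To illustrate the idea (just a sketch of a set-difference comparison, not the actual DistributedReplicantManager code; the member order shown for view 200 is an assumption, with 62.50.43.210 first since it was the master):

          import java.util.*;

          // Sketch only: how a dead/new members diff can come out empty even though
          // the membership order (and therefore the first node) changed.
          public class ViewDiffSketch {
              public static void main(String[] args) {
                  List<String> view200 = Arrays.asList("62.50.43.210:1099", "62.50.43.211:1099",
                          "62.50.43.213:1099", "62.50.43.214:1099", "62.50.43.215:1099", "62.50.43.216:1099");
                  List<String> view203 = Arrays.asList("62.50.43.211:1099", "62.50.43.213:1099",
                          "62.50.43.216:1099", "62.50.43.215:1099", "62.50.43.214:1099", "62.50.43.210:1099");

                  Set<String> dead = new HashSet<String>(view200);
                  dead.removeAll(view203);                  // members missing from the new view
                  Set<String> added = new HashSet<String>(view203);
                  added.removeAll(view200);                 // members that are new in the new view

                  System.out.println("Dead members: " + dead.size() + " " + dead);   // 0 []
                  System.out.println("New Members : " + added.size() + " " + added); // 0 []
                  System.out.println("Old first node: " + view200.get(0));           // 62.50.43.210:1099
                  System.out.println("New first node: " + view203.get(0));           // 62.50.43.211:1099
              }
          }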

          • 2. Re: HA-JMS fails, Master node undeploying channels, no failo
            jkressin

            Thanks very much for your reply. I examined the logfiles again to answer your questions:

            "bstansberry@jboss.com" wrote:
            1) You refer to "the master node". Please confirm that this is 62.50.43.211.


            No, at that time the master node was 62.50.43.210. The first log output and the second one are from this machine, which means that the master node (62.50.43.210) produced the output "Dead members: 0, New members: 0" and immediately afterwards undeployed all the HA queues and HA topics. Sorry, I should have made that clear in my first post.

            "bstansberry@jboss.com" wrote:

            2) On the node that produced the first bit of logging in your post, do you see log entries with this content "New cluster view for partition StagePartition: 202" and "New cluster view for partition StagePartition: 201"?


            No, these messages are not present in the logfile.

            "bstansberry@jboss.com" wrote:

            3) If you have a log entry somewhere that contains "New cluster view for partition StagePartition: 200", please compare the list of nodes to the first line in the first log entry in your post. Does it have the same 6 nodes but in different order?


            You are right, I can see the same nodes, but in a different order.

            "bstansberry@jboss.com" wrote:

            What I'm driving at here is I wonder if the machine doing the first bit of logging lost a couple view changes, going from 200 to 203. The result would be Dead members:0, New members: 0 but a different order of members.


            Thanks, now I am starting to understand what is happening. You are right that the machine did lose some of the view changes; that is a problem I probably have to investigate at the network level.

            But the most interesting question for me is: Even if the (master) node lost some view changes, why does it suddenly undeploy the (HA) queues and (HA) topics? And why is the failover not happening? No other node starts to deploy the queues and topics instead. I cannot explain how this is possible, and I also found no information on this issue in the docs or in the forums.

            The critical thing is that if I run into this scenario, my HA queues and HA topics are not present on any instance, leading to lost messages and therefore lost data. This situation should not be possible at all in a cluster. I am not quite sure whether this is a clustering issue (I guess so), so if it is actually something related to JMS, please let me know and I will ask in the JMS forum.

            BTW: This is the only real problem we have with the JBoss platform. Everything else is working fine and stable. Developing with JBoss really was a breeze, so thanks for this great piece of software.

            Thanks again for your help.

            Jochen


            • 3. Re: HA-JMS fails, Master node undeploying channels, no failo
              brian.stansberry

              OK, things are a bit clearer. Don't know the full answer yet but we're getting there.

              "jkressin" wrote:

              But the most interesting question for me is: Even if the (master) node lost some view changes, why does it suddenly undeploy the (HA) queues and (HA) topics?


              They are undeployed because when view 203 came in, 62.50.43.210 was no longer the first node in the view; 62.50.43.211 was. All HASingleton services currently run on the first member in the view on which they are deployed (we're looking to change this). If a node that is currently the singleton master for the service discovers it's no longer that first node, it will stop providing the service.
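              Roughly, the rule looks like this (a sketch of the idea only, not the real HASingletonSupport code; the method names are made up, and the real logic only considers nodes on which the service is actually deployed):

              import java.util.List;

              // Sketch of the HA singleton election rule: the service runs on whichever
              // node is first in the current view; everyone else stays passive.
              public class SingletonElectionSketch {
                  private final String myAddress;   // e.g. "62.50.43.210:1099"
                  private boolean master;

                  public SingletonElectionSketch(String myAddress) {
                      this.myAddress = myAddress;
                  }

                  // Called on every membership change.
                  public void onViewChange(List<String> currentView) {
                      boolean shouldBeMaster = !currentView.isEmpty() && currentView.get(0).equals(myAddress);
                      if (master && !shouldBeMaster) {
                          master = false;
                          stopSingleton();   // e.g. undeploy the HA-JMS queues/topics
                      } else if (!master && shouldBeMaster) {
                          master = true;
                          startSingleton();  // e.g. deploy the HA-JMS queues/topics
                      }
                  }

                  private void startSingleton() { /* deploy the deploy-hasingleton contents */ }
                  private void stopSingleton()  { /* undeploy them again */ }
              }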

              "jkressin" wrote:
              And why is the failover not happening? No other node starts to deploy the queues and topics instead. I cannot explain how this is possible, and I also found no information on this issue in the docs or in the forums.


              This is the key question. 62.50.43.211 should have taken over as the HA-JMS server and deployed the queues and topics. Is there anything interesting in the 62.50.43.211 logs that could shed light on why it didn't?

              • 4. Re: HA-JMS fails, Master node undeploying channels, no failo
                jkressin

                Sorry for not replying for a while, but I was analyzing the logfiles and trying to reproduce the behaviour we have on our production system. Thanks to the answers here I think I now understand better what is going on, and I did find a way to reproduce the behaviour.

                First, I was wrong in my assumption that the channels are never rebound to JNDI when the master node fails. Here's what happens:

                Initially node 210 is the master node, and node 211 is a "slave" (I hope the terminology is correct). At 08:14:24 node 211 begins to receive new views.
                Taken from 211's logfile:

                2006-06-21 08:14:24,757 INFO [org.jboss.ha.framework.interfaces.HAPartition.lifecycle.StagePartition] New cluster view for partition StagePartition (id: 201, delta: -2) : [62.50.43.211:1099, 62.50.43.213:1099, 62.50.43.216:1099, 62.50.43.215:1099]
                2006-06-21 08:14:24,757 INFO [org.jboss.ha.framework.server.DistributedReplicantManagerImpl.StagePartition] I am (62.50.43.211:1099) received membershipChanged event:
                2006-06-21 08:14:24,757 INFO [org.jboss.ha.framework.server.DistributedReplicantManagerImpl.StagePartition] Dead members: 2 ([62.50.43.210:1099, 62.50.43.214:1099])
                2006-06-21 08:14:24,757 INFO [org.jboss.ha.framework.server.DistributedReplicantManagerImpl.StagePartition] New Members : 0 ([])
                2006-06-21 08:14:24,757 INFO [org.jboss.ha.framework.server.DistributedReplicantManagerImpl.StagePartition] All Members : 4 ([62.50.43.211:1099, 62.50.43.213:1099, 62.50.43.216:1099, 62.50.43.215:1099])


                As node 211 is now the master node and node 210 is in the list of dead members, node 211 deploys all channels, as it should.
                Taken from 211's logfile:

                2006-06-21 08:14:25,496 INFO [org.jboss.web.tomcat.tc5.TomcatDeployer] deploy, ctxPath=/jbossmq-httpil, warUrl=.../deploy-hasingleton/jms/jbossmq-httpil.sar/jbossmq-httpil.war/
                2006-06-21 08:14:26,916 INFO [org.jboss.mq.server.jmx.Topic.sgw/MOCacheInvalidationTopic] Bound to JNDI name: topic/sgw/MOCacheInvalidationTopic
                2006-06-21 08:14:26,917 INFO [org.jboss.mq.server.jmx.Topic.sgw/CdaHtmlCacheInvalidationTopic] Bound to JNDI name: topic/sgw/CdaHtmlCacheInvalidationTopic
                [...]

                But: node 210 did not receive view 201 at all, so it still has all the channels deployed as well. The next thing I see in the logfile of 211 is that node 214 is still sending messages, although from the viewpoint of 211 it is not a cluster member anymore. I do not know if this is of any relevance, but I wanted to mention it to give you a complete picture.
                Taken from 211's logfile:
                2006-06-21 08:14:29,985 ERROR [org.jgroups.protocols.pbcast.CoordGmsImpl] mbr 62.50.43.214:54923 (additional data: 17 bytes) is not a member !
                2006-06-21 08:14:29,987 INFO [org.jboss.ha.framework.interfaces.HAPartition.lifecycle.StagePartition] Suspected member: 62.50.43.214:54923 (additional data: 17 bytes)

                Next, 211 is receiving two more view changes (id 202 and 203).
                Taken from 211's logfile:

                2006-06-21 08:14:34,867 INFO [org.jboss.ha.framework.interfaces.HAPartition.lifecycle.StagePartition] New cluster view for partition StagePartition (id: 202, delta: 1) : [62.50.43.211:1099, 62.50.43.213:1099, 62.50.43.216:1099, 62.50.43.215:1099, 62.50.43.214:1099]
                2006-06-21 08:14:34,867 INFO [org.jboss.ha.framework.server.DistributedReplicantManagerImpl.StagePartition] I am (62.50.43.211:1099) received membershipChanged event:
                2006-06-21 08:14:34,867 INFO [org.jboss.ha.framework.server.DistributedReplicantManagerImpl.StagePartition] Dead members: 0 ([])
                2006-06-21 08:14:34,867 INFO [org.jboss.ha.framework.server.DistributedReplicantManagerImpl.StagePartition] New Members : 1 ([62.50.43.214:1099])
                2006-06-21 08:14:34,867 INFO [org.jboss.ha.framework.server.DistributedReplicantManagerImpl.StagePartition] All Members : 5 ([62.50.43.211:1099, 62.50.43.213:1099, 62.50.43.216:1099, 62.50.43.215:1099, 62.50.43.214:1099])
                2006-06-21 08:14:35,021 INFO [org.jboss.ha.framework.interfaces.HAPartition.lifecycle.StagePartition] New cluster view for partition StagePartition (id: 203, delta: 1) : [62.50.43.211:1099, 62.50.43.213:1099, 62.50.43.216:1099, 62.50.43.215:1099, 62.50.43.214:1099, 62.50.43.210:1099]
                2006-06-21 08:14:35,021 INFO [org.jboss.ha.framework.server.DistributedReplicantManagerImpl.StagePartition] I am (62.50.43.211:1099) received membershipChanged event:
                2006-06-21 08:14:35,021 INFO [org.jboss.ha.framework.server.DistributedReplicantManagerImpl.StagePartition] Dead members: 0 ([])
                2006-06-21 08:14:35,021 INFO [org.jboss.ha.framework.server.DistributedReplicantManagerImpl.StagePartition] New Members : 1 ([62.50.43.210:1099])
                2006-06-21 08:14:35,021 INFO [org.jboss.ha.framework.server.DistributedReplicantManagerImpl.StagePartition] All Members : 6 ([62.50.43.211:1099, 62.50.43.213:1099, 62.50.43.216:1099, 62.50.43.215:1099, 62.50.43.214:1099, 62.50.43.210:1099])

                Node 210 did not receive view 202, but it did receive view 203. After receiving view 203, node 210 is aware that it is no longer the master node, and it undeploys the channels:
                Taken from 210's logfile:
                2006-06-21 08:14:35,049 INFO [org.jboss.ha.framework.interfaces.HAPartition.StagePartition] New cluster view for partition StagePartition: 203 ([62.50.43.211:1099, 62.50.43.213:1099, 62.50.43.216:1099, 62.50.43.215:1099, 62.50.43.214:1099, 62.50.43.210:1099] delta: 0)
                2006-06-21 08:14:35,049 INFO [org.jboss.ha.framework.server.DistributedReplicantManagerImpl.StagePartition] I am (62.50.43.210:1099) received membershipChanged event:
                2006-06-21 08:14:35,049 INFO [org.jboss.ha.framework.server.DistributedReplicantManagerImpl.StagePartition] Dead members: 0 ([])
                2006-06-21 08:14:35,049 INFO [org.jboss.ha.framework.server.DistributedReplicantManagerImpl.StagePartition] New Members : 0 ([])
                2006-06-21 08:14:35,049 INFO [org.jboss.ha.framework.server.DistributedReplicantManagerImpl.StagePartition] All Members : 6 ([62.50.43.211:1099, 62.50.43.213:1099, 62.50.43.216:1099, 62.50.43.215:1099, 62.50.43.214:1099, 62.50.43.210:1099])
                2006-06-21 08:14:35,329 INFO [org.jboss.mq.server.jmx.Topic.sgw/MOCacheInvalidationTopic] Unbinding JNDI name: topic/sgw/MOCacheInvalidationTopic
                2006-06-21 08:14:35,465 INFO [org.jboss.mq.server.jmx.Topic.sgw/CdaHtmlCacheInvalidationTopic] Unbinding JNDI name: topic/sgw/CdaHtmlCacheInvalidationTopic
                2006-06-21 08:14:35,466 INFO [org.jboss.mq.server.jmx.Queue.sgw/AlertUserQueue] Unbinding JNDI name: queue/sgw/AlertUserQueue
                2006-06-21 08:14:35,466 INFO [org.jboss.mq.server.jmx.Queue.sgw/UserQueue] Unbinding JNDI name: queue/sgw/UserQueue

                And exactly from this point onwards the two nodes cannot look up any channel anymore, although the channels are bound on 211 and 211 is the master node according to view messages 201 - 203. The following messages appear on all nodes in the cluster:

                javax.jms.InvalidDestinationException: This destination does not exist! TOPIC.sgw/MOCacheInvalidationTopic
                at org.jboss.mq.server.JMSDestinationManager.addMessage(JMSDestinationManager.java:389)
                at org.jboss.mq.server.JMSDestinationManager.addMessage(JMSDestinationManager.java:373)
                at org.jboss.mq.server.JMSServerInvoker.addMessage(JMSServerInvoker.java:136)
                at org.jboss.mq.il.uil2.ServerSocketManagerHandler.handleMsg(ServerSocketManagerHandler.java:92)

                I guess it has something to do with the small period of time in which, in effect, two master nodes existed in the cluster. After view 201, node 211 thinks that it is the master node, but because 210 did not receive that view, it also still thinks it is the master node. This situation lasted for ~10 seconds, until view 203 was received by node 210. How does JGroups/JBoss handle such a scenario?

                In order to reproduce the behaviour we did the following: we set up a cluster with 4 nodes on 2 machines, with the same configuration as on the production system. We installed a script which blocks UDP traffic periodically: it runs every second and blocks incoming UDP traffic for another second with a probability of 50%. That way we wanted to simulate network "jitter", because we suspect that UDP packets somehow get lost on the production system. I can see the same behaviour on our test cluster by running this script: view change messages get lost from time to time, and after that happens the nodes fail to look up the channels, although the channels are always present on one of the nodes. Because I was able to reproduce the problem, I thought it was important to let you know. Currently I do not want to switch to the TCP stack, because we are not yet 100% sure that UDP packets really get lost. According to our hoster, everything is fine with the network (but, hey, they always tell you that ;)

                Anyway, if you have any more hints on how to solve this problem, or any questions about our test setup for reproducing the behaviour, please let me know. Thanks very much for your answers so far; I really appreciate the time you guys put into answering questions here in the forum!

                Thanks,

                Jochen

                • 5. Re: HA-JMS fails, Master node undeploying channels, no failo
                  brian.stansberry

                  Hi Jochen,

                  Sorry for the slow reply; I was on vacation the last 2 weeks.

                  If you have the log files from 210, 211 and 214 and can zip them up and send to me at bstansberry at jboss dot com that would be good. Please include your cluster-service.xml file as well.

                  • 6. Re: HA-JMS fails, Master node undeploying channels, no failo
                    jkressin

                    Hi Brian,

                    I will send the logfiles and the cluster-service.xml, and thanks again for looking into this issue!

                    Jochen