2 Replies Latest reply on Jul 2, 2011 3:23 PM by penumbraposts

    Cluster node being marked as NON-RELIABLE shortly after startup with no way to rejoin cluster

    parmstrong

      I have a cluster with two nodes in it. I start them both up, start sending messages to the cluster, and they get load balanced nicely. Then, a few minutes after working fine, I get the message shown below in the logs of one of the servers. From that point on only every other message gets handled; the other half are queued up for the node that has been booted from the cluster, and for some reason that node never receives messages again. Even after restarting that node the problem persists until I bring the entire cluster down and restart it, and then the cycle starts again. I have three questions:

      1. Why is the other node marked as NON-RELIABLE?
      2. Why does it never get messages again, even after a server restart?
      3. When the only consumer on the bridge is marked as NON-RELIABLE, why do messages keep being put on the bridge instead of just being handled by the surviving node?

      I am running HornetQ 2.1.1 integrated with JBoss 5.1. Here is the error message I get on the surviving node when the other node is kicked from the cluster:
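
      On the third question, one thing I wonder about is message redistribution. As I understand it, once the bridge consumer is removed, the store-and-forward queue just keeps accumulating unless redistribution is enabled for the address. A sketch of the address setting that might control this (the `jms.#` match and the value are my guess, not something I have verified):

          <!-- Guess only: redistribution-delay of 0 should move messages back
               for redistribution as soon as a queue loses its consumers;
               the default of -1 disables redistribution entirely. -->
          <address-setting match="jms.#">
             <redistribution-delay>0</redistribution-delay>
          </address-setting>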

       

      DATE:

      Thu Nov 11 17:02:37 MST 2010

      LEVEL:

      WARN

      LOGGER:

      [org.hornetq.core.server.impl.QueueImpl]

      MESSAGE:

      (group:HornetQ-server-threads238779431-1643168823))   removing consumer which did not handle a message,   consumer=org.hornetq.core.server.cluster.impl.BridgeImpl@2d349d88, message=Reference[6660]:NON-RELIABLE

      THREAD:

      (Thread-16

      EXTRA INFO:

      java.lang.NullPointerException
      at org.hornetq.core.buffers.impl.ResetLimitWrappedHornetQBuffer.<init>(ResetLimitWrappedHornetQBuffer.java:39)
      at org.hornetq.core.message.impl.MessageImpl.getBodyBuffer(MessageImpl.java:254)
      at org.hornetq.core.client.impl.ClientProducerImpl.doSend(ClientProducerImpl.java:220)
      at org.hornetq.core.client.impl.ClientProducerImpl.send(ClientProducerImpl.java:139)
      at org.hornetq.core.server.cluster.impl.BridgeImpl.handle(BridgeImpl.java:477)
      at org.hornetq.core.server.impl.QueueImpl.handle(QueueImpl.java:1361)
      at org.hornetq.core.server.impl.QueueImpl.deliver(QueueImpl.java:1153)
      at org.hornetq.core.server.impl.QueueImpl.access$700(QueueImpl.java:65)
      at org.hornetq.core.server.impl.QueueImpl$DeliverRunner.run(QueueImpl.java:1567)
      at org.hornetq.utils.OrderedExecutorFactory$OrderedExecutor$1.run(OrderedExecutorFactory.java:96)
      at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110)
      at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)
      at java.lang.Thread.run(Thread.java:636)

        • 1. Re: Cluster node being marked as NON-RELIABLE shortly after startup with no way to rejoin cluster
          brian.hayes

          I have this same issue too, running HornetQ 2.1.2 Final integrated with JBoss 5.0.1 GA. Also, if you look at the queues that HornetQ sets up as part of the cluster, there are no consumers — though on initial startup there is only one to begin with.

           

          Config-wise, here are the main changes I've made from the defaults:

           

           

                <message-counter-enabled>true</message-counter-enabled>

           

          <!-- JBoss says setting this to true has a big performance impact, but without it being set you get a message resend loop occurring. -->

          <persist-delivery-count-before-delivery>false</persist-delivery-count-before-delivery>

           

             <jmx-management-enabled>true</jmx-management-enabled>

           

             <broadcast-groups>
                <broadcast-group name="bg-group1">
          <!--<local-bind-address>${jboss.bind.address:localhost}</local-bind-address>-->
                   <group-address>${jboss.partition.udpGroup:231.7.7.7}</group-address>
                   <group-port>9876</group-port>
                   <broadcast-period>5000</broadcast-period>
                   <connector-ref connector-name="netty"/>
                </broadcast-group>
             </broadcast-groups>
             <discovery-groups>
                <discovery-group name="dg-group1">
          <!--<local-bind-address>${jboss.bind.address:localhost}</local-bind-address>-->
                   <group-address>${jboss.partition.udpGroup:231.7.7.7}</group-address>
                   <group-port>9876</group-port>
                   <refresh-timeout>10000</refresh-timeout>
                </discovery-group>
             </discovery-groups>
            
             <cluster-connections>
                <cluster-connection name="${jboss.cluster.name:HornetQCluster}">
                   <address>jms</address>
                   <use-duplicate-detection>true</use-duplicate-detection>
                   <forward-when-no-consumers>false</forward-when-no-consumers>
                   <max-hops>1</max-hops>          
               <discovery-group-ref discovery-group-name="dg-group1"/>
                </cluster-connection>
             </cluster-connections>
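
          One thing that might be worth checking (my guess — I have not verified these elements against the 2.1.x schema) is whether the cluster connection's retry settings allow the internal bridge to reconnect after the remote node restarts. Something along these lines:

              <!-- Sketch only: retry settings so the cluster bridge keeps
                   trying to reconnect to a restarted node. Element names and
                   availability may vary by HornetQ version. -->
              <cluster-connection name="${jboss.cluster.name:HornetQCluster}">
                 <address>jms</address>
                 <retry-interval>500</retry-interval>
                 <reconnect-attempts>-1</reconnect-attempts> <!-- -1 = retry forever -->
                 <use-duplicate-detection>true</use-duplicate-detection>
                 <forward-when-no-consumers>false</forward-when-no-consumers>
                 <max-hops>1</max-hops>
                 <discovery-group-ref discovery-group-name="dg-group1"/>
              </cluster-connection>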
             


          Queues are set up like this: each client application sends a message to EmailIn; EmailIn then parses/builds the SMTP message and sends it (using the same client code) to SMTPOut for the client. So EmailIn is treated as an internal client that sends to SMTPOut.
                <!--default for catch all-->
                <address-setting match="#">
                   <dead-letter-address>jms.queue.DLQ</dead-letter-address>
                   <expiry-address>jms.queue.ExpiryQueue</expiry-address>
                   <redelivery-delay>0</redelivery-delay>
                   <max-size-bytes>100485760</max-size-bytes>      
                   <message-counter-history-day-limit>10</message-counter-history-day-limit>
                   <address-full-policy>BLOCK</address-full-policy>
                </address-setting>
           
          <!-- Client DLQ settings; the ones in jboss.jar do not seem to work. -->
          <address-setting match="jms.queue.CM/EmailIn">
             <dead-letter-address>jms.queue.CM/EmailInDLQ</dead-letter-address>
             <redistribution-delay>0</redistribution-delay>
             <max-delivery-attempts>3</max-delivery-attempts>
             <message-counter-history-day-limit>10</message-counter-history-day-limit>
          </address-setting>
          <address-setting match="jms.queue.Client/EmailSMTPOut">
             <dead-letter-address>jms.queue.Client/EmailSMTPOutDLQ</dead-letter-address>
             <redistribution-delay>0</redistribution-delay>
             <max-delivery-attempts>3</max-delivery-attempts>
             <message-counter-history-day-limit>10</message-counter-history-day-limit>
          </address-setting>
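
          For reference, the queues those address-settings match are declared in hornetq-jms.xml roughly like this (a sketch — my actual queue names and JNDI entries may differ slightly):

              <!-- Hypothetical hornetq-jms.xml entries matching the
                   address-settings above. -->
              <queue name="CM/EmailIn">
                 <entry name="/queue/CM/EmailIn"/>
              </queue>
              <queue name="Client/EmailSMTPOut">
                 <entry name="/queue/Client/EmailSMTPOut"/>
              </queue>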

          • 2. Re: Cluster node being marked as NON-RELIABLE shortly after startup with no way to rejoin cluster
            penumbraposts

            I have the same issue using TorqueBox 1.0.2, which is based on JBoss 6.0 Final and HornetQ 2.1.2 Final.