2 Replies Latest reply on Jul 2, 2011 3:23 PM by penumbraposts

    Cluster node being marked as NON-RELIABLE shortly after startup with no way to rejoin cluster

    parmstrong

      I have a cluster with two nodes in it. I start them both up, start sending messages to the cluster, and they get load balanced nicely. Then, a few minutes after working fine, I get the message shown below in the logs of one of the servers. From that point on only every other message gets handled; the other half are queued up for the node that has been booted from the cluster, and for some reason that node never receives messages again. Even after restarting that node the problem persists until I bring the entire cluster down and restart it, and then the cycle starts again. I have three questions:

      1. Why is the other node marked as NON-RELIABLE?
      2. Why does it never get messages again, even after a server restart?
      3. When the only consumer on the bridge is marked as NON-RELIABLE, why do messages keep being put on the bridge instead of just being handled by the surviving node?

      I am running HornetQ 2.1.1 integrated with JBoss 5.1. Here is the error message I get on the surviving node when the other node is kicked from the cluster:
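
      On the third question, one thing I wonder about is message redistribution. As I understand it, once the bridge consumer is removed, the store-and-forward queue just keeps accumulating unless redistribution is enabled for the address. A sketch of the address setting that might control this (the `jms.#` match and the value are my guess, not something I have verified):

          <!-- Guess only: redistribution-delay of 0 should move messages back
               for redistribution as soon as a queue loses its consumers;
               the default of -1 disables redistribution entirely. -->
          <address-setting match="jms.#">
             <redistribution-delay>0</redistribution-delay>
          </address-setting>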

       

      DATE:

      Thu Nov 11 17:02:37 MST 2010

      LEVEL:

      WARN

      LOGGER:

      [org.hornetq.core.server.impl.QueueImpl]

      MESSAGE:

      (group:HornetQ-server-threads238779431-1643168823))   removing consumer which did not handle a message,   consumer=org.hornetq.core.server.cluster.impl.BridgeImpl@2d349d88, message=Reference[6660]:NON-RELIABLE

      THREAD:

      (Thread-16

      EXTRA INFO:

      java.lang.NullPointerException
      at org.hornetq.core.buffers.impl.ResetLimitWrappedHornetQBuffer.<init>(ResetLimitWrappedHornetQBuffer.java:39)
      at org.hornetq.core.message.impl.MessageImpl.getBodyBuffer(MessageImpl.java:254)
      at org.hornetq.core.client.impl.ClientProducerImpl.doSend(ClientProducerImpl.java:220)
      at org.hornetq.core.client.impl.ClientProducerImpl.send(ClientProducerImpl.java:139)
      at org.hornetq.core.server.cluster.impl.BridgeImpl.handle(BridgeImpl.java:477)
      at org.hornetq.core.server.impl.QueueImpl.handle(QueueImpl.java:1361)
      at org.hornetq.core.server.impl.QueueImpl.deliver(QueueImpl.java:1153)
      at org.hornetq.core.server.impl.QueueImpl.access$700(QueueImpl.java:65)
      at org.hornetq.core.server.impl.QueueImpl$DeliverRunner.run(QueueImpl.java:1567)
      at org.hornetq.utils.OrderedExecutorFactory$OrderedExecutor$1.run(OrderedExecutorFactory.java:96)
      at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110)
      at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)
      at java.lang.Thread.run(Thread.java:636)

        • 1. Re: Cluster node being marked as NON-RELIABLE shortly after startup with no way to rejoin cluster
          brian.hayes

          I have this same issue too, running HornetQ 2.1.2 Final integrated with JBoss 5.0.1 GA. Also, if you look at the queues that HornetQ sets up as part of the cluster, there are no consumers — though on initial startup there is only one to begin with.

           

          Config-wise, here are the main changes I've made from the defaults:

           

           

                <message-counter-enabled>true</message-counter-enabled>

           

          <!-- JBoss says setting this to true has a big performance impact, but without it being set you get a message resend loop occurring. -->

          <persist-delivery-count-before-delivery>false</persist-delivery-count-before-delivery>

           

             <jmx-management-enabled>true</jmx-management-enabled>

           

             <broadcast-groups>
                <broadcast-group name="bg-group1">
          <!--<local-bind-address>${jboss.bind.address:localhost}</local-bind-address>-->
                   <group-address>${jboss.partition.udpGroup:231.7.7.7}</group-address>
                   <group-port>9876</group-port>
                   <broadcast-period>5000</broadcast-period>
                   <connector-ref connector-name="netty"/>
                </broadcast-group>
             </broadcast-groups>
             <discovery-groups>
                <discovery-group name="dg-group1">
          <!--<local-bind-address>${jboss.bind.address:localhost}</local-bind-address>-->
                   <group-address>${jboss.partition.udpGroup:231.7.7.7}</group-address>
                   <group-port>9876</group-port>
                   <refresh-timeout>10000</refresh-timeout>
                </discovery-group>
             </discovery-groups>
            
             <cluster-connections>
                <cluster-connection name="${jboss.cluster.name:HornetQCluster}">
                   <address>jms</address>
                   <use-duplicate-detection>true</use-duplicate-detection>
                   <forward-when-no-consumers>false</forward-when-no-consumers>
                   <max-hops>1</max-hops>          
               <discovery-group-ref discovery-group-name="dg-group1"/>
                </cluster-connection>
             </cluster-connections>
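
          One thing that might be worth checking (my guess — I have not verified these elements against the 2.1.x schema) is whether the cluster connection's retry settings allow the internal bridge to reconnect after the remote node restarts. Something along these lines:

              <!-- Sketch only: retry settings so the cluster bridge keeps
                   trying to reconnect to a restarted node. Element names and
                   availability may vary by HornetQ version. -->
              <cluster-connection name="${jboss.cluster.name:HornetQCluster}">
                 <address>jms</address>
                 <retry-interval>500</retry-interval>
                 <reconnect-attempts>-1</reconnect-attempts> <!-- -1 = retry forever -->
                 <use-duplicate-detection>true</use-duplicate-detection>
                 <forward-when-no-consumers>false</forward-when-no-consumers>
                 <max-hops>1</max-hops>
                 <discovery-group-ref discovery-group-name="dg-group1"/>
              </cluster-connection>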
             


          Queues are set up like this: each client application sends a message to EmailIn; EmailIn then parses/builds the SMTP message and sends it (using the same client code) to SMTPOut for the client. So EmailIn is treated as an internal client that sends to SMTPOut.
                <!--default for catch all-->
                <address-setting match="#">
                   <dead-letter-address>jms.queue.DLQ</dead-letter-address>
                   <expiry-address>jms.queue.ExpiryQueue</expiry-address>
                   <redelivery-delay>0</redelivery-delay>
                   <max-size-bytes>100485760</max-size-bytes>      
                   <message-counter-history-day-limit>10</message-counter-history-day-limit>
                   <address-full-policy>BLOCK</address-full-policy>
                </address-setting>
           
          <!-- Client DLQ settings; the ones in jboss.jar do not seem to work. -->
          <address-setting match="jms.queue.CM/EmailIn">
             <dead-letter-address>jms.queue.CM/EmailInDLQ</dead-letter-address>
             <redistribution-delay>0</redistribution-delay>
             <max-delivery-attempts>3</max-delivery-attempts>
             <message-counter-history-day-limit>10</message-counter-history-day-limit>
          </address-setting>
          <address-setting match="jms.queue.Client/EmailSMTPOut">
             <dead-letter-address>jms.queue.Client/EmailSMTPOutDLQ</dead-letter-address>
             <redistribution-delay>0</redistribution-delay>
             <max-delivery-attempts>3</max-delivery-attempts>
             <message-counter-history-day-limit>10</message-counter-history-day-limit>
          </address-setting>
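
          For reference, the queues those address-settings match are declared in hornetq-jms.xml roughly like this (a sketch — my actual queue names and JNDI entries may differ slightly):

              <!-- Hypothetical hornetq-jms.xml entries matching the
                   address-settings above. -->
              <queue name="CM/EmailIn">
                 <entry name="/queue/CM/EmailIn"/>
              </queue>
              <queue name="Client/EmailSMTPOut">
                 <entry name="/queue/Client/EmailSMTPOut"/>
              </queue>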

          • 2. Re: Cluster node being marked as NON-RELIABLE shortly after startup with no way to rejoin cluster
            penumbraposts

            I have the same issue using TorqueBox 1.0.2, which is based on JBoss 6.0 Final and HornetQ 2.1.2 Final.