3 Replies Latest reply on Jun 26, 2013 12:29 PM by mlange

    Message loss during failover

    mlange

      We have a clustered setup (3 master/3 backup nodes) with shared storage. I wonder why messages are lost during a failover test: we shut down one master node and, after failover has occurred, restart it so that it becomes the live instance again.

       

      We are using plain JMS for producing and consuming:

       

      Producer:

      session = connection.createQueueSession(false, Session.AUTO_ACKNOWLEDGE);

      queueSender.send(....)

       

      Consumer:

      QueueSession session = connection.createQueueSession(true, Session.AUTO_ACKNOWLEDGE);

      message = queueReceiver.receive(2000L);

       

      During the failover the client gets several exceptions:

      STDERR 'Caused by: HornetQException[errorCode=3 message=Timed out waiting for response when sending packet 49]'

      ERROR 2013-06-13 17:23:22,078 [http-0.0.0.0-6003-37][d0a99a0f-0d09-4f9d-99d0-5de4336f409c][] STDERR '   at org.hornetq.core.protocol.core.impl.ChannelImpl.sendBlocking(ChannelImpl.java:302)'

      ERROR 2013-06-13 17:23:22,078 [http-0.0.0.0-6003-37][d0a99a0f-0d09-4f9d-99d0-5de4336f409c][] STDERR '   at org.hornetq.core.client.impl.ClientSessionImpl.bindingQuery(ClientSessionImpl.java:399)'

       

      OR

      "Unblocking a blocking call that will never get a response"

       

      According to the docs:

      "If the client code is in a blocking call to the server, waiting for a response to continue its execution, when failover occurs, the new session will not have any knowledge of the call that was in progress. This call might otherwise hang for ever, waiting for a response that will never come.

      To prevent this, HornetQ will unblock any blocking calls that were in progress at the time of failover by making them throw a javax.jms.JMSException (if using JMS), or a HornetQException with error code HornetQException.UNBLOCKED. It is up to the client code to catch this exception and retry any operations if desired."

       

      What is the recommended way here? Is it safe to retry in case of HornetQException.UNBLOCKED and/or HornetQException.CONNECTION_TIMEDOUT? Are there any recommendations regarding the consumer? In which exception cases must transactions be rolled back/when is HornetQ rolling back itself?

       

      Thanks!

       

      Marek

        • 1. Re: Message loss during failover
          gaohoward

          Hi, did you configure your Connection Factory to use HA?

          • 2. Re: Message loss during failover
            ataylor

            If failover occurs while the send method is being called, it is possible that the message was not received by the server; this is why we throw HornetQException.UNBLOCKED. It's up to you how you handle it, although resending the message may mean you end up with duplicates, so you can either:

             

            1. use XA; this is two-phase commit and will give you a 100% guarantee

            2. use a JMS transaction, although if failover occurs during the commit call you again won't know whether it has completed; you can just call commit again and assume that, if you get a No Transaction error, it was committed

            3. use message duplicate detection, the user manual will explain this in full but basically you add a unique ID to each message

             

            The same is true for any acks for a consumer
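
To illustrate the retry and duplicate-detection options above, here is a minimal, self-contained sketch. It does not use the HornetQ or JMS API: `UnblockedException`, `FakeBroker` and `RetryingProducer` are hypothetical stand-ins invented for this example. With real HornetQ the duplicate ID would be set as a property on the message (the user manual names the header `_HQ_DUPL_ID`) and the server would do the filtering. The point shown is only that the ID is generated once, outside the retry loop, so a resend after an unblocked call cannot create a second copy.

```java
import java.util.HashSet;
import java.util.Set;
import java.util.UUID;

/** Hypothetical stand-in for the "unblocked" failure a client sees during failover. */
class UnblockedException extends RuntimeException {
    UnblockedException(String message) {
        super(message);
    }
}

/** Hypothetical in-memory broker that ignores messages whose duplicate ID it has already seen. */
class FakeBroker {
    private final Set<String> seenIds = new HashSet<>();
    private int stored = 0;
    private int failuresLeft;

    FakeBroker(int failuresToSimulate) {
        this.failuresLeft = failuresToSimulate;
    }

    /** Stores the message, then may "fail over" before the reply reaches the client. */
    void send(String duplicateId, String body) {
        if (seenIds.add(duplicateId)) {
            stored++; // first time this ID is seen: keep the message
        }
        if (failuresLeft > 0) {
            failuresLeft--;
            throw new UnblockedException("unblocked by failover");
        }
    }

    int storedCount() {
        return stored;
    }
}

class RetryingProducer {
    /** Sends with one duplicate ID for all attempts and returns the number of attempts used. */
    static int sendWithRetry(FakeBroker broker, String body, int maxAttempts) {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be at least 1");
        }
        String duplicateId = UUID.randomUUID().toString(); // generated once, outside the loop
        UnblockedException last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                broker.send(duplicateId, body);
                return attempt;
            } catch (UnblockedException e) {
                last = e; // outcome unknown: the message may or may not have been stored
            }
        }
        throw last;
    }
}
```

In this sketch the first send is stored even though the client sees an exception, which is exactly the ambiguous case described above; the retries carry the same ID and are dropped by the fake broker.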

            • 3. Re: Message loss during failover
              mlange

              Retrying the unblocked call on the producer side seems to be working as expected. However, I cannot get the consumers to receive the same number of messages as were sent into the queues. It seems that HornetQ is somehow redelivering messages, and this leads to duplicates on the client (after the failover, more messages arrive than were sent). I have verified this by checking the duplicate IDs and the JMSXDeliveryCount property (>1) on the messages received by the consumers. I wonder why this exception occurs:

               

              ERROR [org.hornetq.core.protocol.core.ServerSessionPacketHandler] (Old I/O server worker (parentId: 70102389, [id: 0x042dad75, /10.230.34.13:11930])) Caught exception: HornetQException[errorCode=104 message=Could not find reference on consumer 3 for queue XYZ....]

               

              Looking at the code it seems that the reference to be delivered is empty:

               

              MessageReference ref;

              ref = deliveringRefs.poll();

              if (ref == null)

              { ..throw above exception...}

               

              I saw this error reported sometimes and it seems to be mainly related to JMS session sharing across threads. This is not the case in my client.

               

              Any clue what is going on here? I have been investigating this for more than a week now, and I am losing hope of deploying HornetQ in production with this non-working failover scenario.

               

              The same error is reported here: https://community.jboss.org/thread/222539

               

              Btw, this does not seem to be related to failover only: the same happens when shutting down a node in the cluster (without having a backup node in place) and restarting it again. So it could be related to redistribution of the JMS resources among the cluster.

               

              Thanks!
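
For the redelivery duplicates described in the last reply, one common client-side workaround is to make the consumer idempotent: remember the IDs of recently processed messages and skip any repeat. The sketch below is only an illustration under that assumption, not something proposed in this thread or provided by HornetQ. `SeenIdCache` is a hypothetical helper, and the ID passed in is assumed to be whatever unique identifier the producer attached to the message (for example the duplicate ID mentioned above). It does not explain or fix the "Could not find reference" error itself.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical helper: remembers the most recent message IDs so a redelivery can be skipped. */
class SeenIdCache {
    private final Map<String, Boolean> ids;

    SeenIdCache(final int capacity) {
        // accessOrder = false: entries are evicted in insertion order once capacity is exceeded
        this.ids = new LinkedHashMap<String, Boolean>(16, 0.75f, false) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<String, Boolean> eldest) {
                return size() > capacity;
            }
        };
    }

    /** Returns true the first time an ID is seen, false for a repeat that is still cached. */
    synchronized boolean firstTime(String id) {
        return ids.put(id, Boolean.TRUE) == null;
    }
}
```

A consumer would call `firstTime(id)` after `receive(...)` and process the message only when it returns true, committing the session either way. The cache is bounded, so a duplicate arriving after more than `capacity` newer messages would not be detected.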