4 Replies Latest reply on Oct 29, 2014 3:56 PM by jbertram

    HQ 2.4.4: Clients silently losing connection to server, lots of HQ224051 errors on server

    noky

      We're experiencing problems with a recent upgrade from HornetQ server 2.2.20 to 2.4.4 (upgrade happened yesterday). The reason for the upgrade was to address a problem whereby sometimes the HornetQ server would erroneously deliver a message to the wrong client (JMS message selectors were matching incorrectly). The selector bug seems to be fixed, but today we experienced some massive connection problems whereby JMS messages were silently not getting delivered to certain clients. It seemed like the clients lost contact with the server but were not detecting this and thus the reconnect could not happen. To remedy the problem, we had to restart the client applications and have downgraded HornetQ server back to 2.2.20.

       

      The HornetQ server logs show tons of messages like these:

       

      06:10:59,577 ERROR [org.hornetq.core.server] HQ224051: Failed to call notification listener: java.lang.IllegalStateException: Cannot find queue info for queue 8e600ed4-3e15-4bfa-af29-b580b528f2144c02d449-5763-11e4-bba5-ff6870e1e70e
              at org.hornetq.core.postoffice.impl.PostOfficeImpl.onNotification(PostOfficeImpl.java:292) [hornetq-server.jar:]
              at org.hornetq.core.server.management.impl.ManagementServiceImpl.sendNotification(ManagementServiceImpl.java:682) [hornetq-server.jar:]
              at org.hornetq.core.postoffice.impl.PostOfficeImpl.removeBinding(PostOfficeImpl.java:543) [hornetq-server.jar:]
              at org.hornetq.core.server.cluster.impl.ClusterConnectionImpl$MessageFlowRecordImpl.removeBinding(ClusterConnectionImpl.java:1395) [hornetq-server.jar:]
              at org.hornetq.core.server.cluster.impl.ClusterConnectionImpl$MessageFlowRecordImpl.doBindingRemoved(ClusterConnectionImpl.java:1383) [hornetq-server.jar:]
              at org.hornetq.core.server.cluster.impl.ClusterConnectionImpl$MessageFlowRecordImpl.handleNotificationMessage(ClusterConnectionImpl.java:1157) [hornetq-server.jar:]
              at org.hornetq.core.server.cluster.impl.ClusterConnectionImpl$MessageFlowRecordImpl.onMessage(ClusterConnectionImpl.java:1131) [hornetq-server.jar:]
              at org.hornetq.core.client.impl.ClientConsumerImpl.callOnMessage(ClientConsumerImpl.java:1116) [hornetq-core-client.jar:]
              at org.hornetq.core.client.impl.ClientConsumerImpl.access$500(ClientConsumerImpl.java:56) [hornetq-core-client.jar:]
              at org.hornetq.core.client.impl.ClientConsumerImpl$Runner.run(ClientConsumerImpl.java:1251) [hornetq-core-client.jar:]
              at org.hornetq.utils.OrderedExecutorFactory$OrderedExecutor$1.run(OrderedExecutorFactory.java:104) [hornetq-core-client.jar:]
              at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) [rt.jar:1.7.0_51]
              at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) [rt.jar:1.7.0_51]
              at java.lang.Thread.run(Thread.java:744) [rt.jar:1.7.0_51]

       

      Any ideas here? This seems like a serious problem.

       

      NOTE: Our client applications were still using the HornetQ 2.2.20 libraries to connect to the server. We could not update the client applications first because the HornetQ protocol does not seem to be backward compatible (newer clients cannot talk to an older server). Updating the client software is also a fairly tedious process (literally hundreds of individual applications).

       

      Also, we run the HQ server standalone in clustered mode (a two-server cluster).

       

      Thanks for your help,

       

      Mike

        • 1. Re: HQ 2.4.4: Clients silently losing connection to server, lots of HQ224051 errors on server
          noky

          Anything? Is HQ 2.4.4 even considered stable? I was under the assumption that it contains the latest bug fixes for the 2.4 branch.

           

          Would it help to try and provide a test case? This really is a showstopper bug. Things seem to run fine at first, but eventually clients listening on a topic stop getting updates from the server and the ExceptionListener does not seem to get invoked.

          • 2. Re: HQ 2.4.4: Clients silently losing connection to server, lots of HQ224051 errors on server
            jbertram

            I would expect 2.4.4.Final to be stable.

             

            A test-case always helps.  In fact, it is the greatest help when reporting issues.  If you can demonstrate a problem with code then we can see exactly what the problem is and fix it.

             

            Based on the error you're receiving I conclude that you're running a cluster.  Is that correct?  If so, can you describe your cluster and provide details about the JMS connection factory configuration(s)?

             

            Also, do you see this problem with clients using 2.4.4.Final libraries?

            • 3. Re: Re: HQ 2.4.4: Clients silently losing connection to server, lots of HQ224051 errors on server
              noky

              Indeed we are running a cluster: two HornetQ nodes in standalone mode using the "clustered" config. Configuration is fairly close to stock, with the following major exceptions:

              * Paging on, <max-size-bytes> increased by 10x and <page-size-bytes> = 10485760

              * <group-address> in <broadcast-groups> is in IPv6 format

              * Listening on different set of ports (via run.sh modifications)

               

              We did not upgrade any clients to the 2.4.4 libraries. The plan was to ensure the system was stable before doing this, particularly because updated clients would not have been able to communicate with an older version of the server in the event of a rollback to 2.2.20 (which we ended up performing).

               

              I can try to come up with a test case, but it might be difficult. The problem took a day or so to manifest. Everything seemed to be fine in testing and after the initial upgrade to HQ 2.4.4 servers in production. After about a day, we noticed random service disruptions, and certain clients just stopped getting data at different times (and the errors started appearing in the HQ logs). We normally detect problems on the client side via the ExceptionListener; when it fires, the clients re-establish a connection to the server. However, the ExceptionListener was never invoked (occurrences are logged).

               

              What exactly does that exception on the server side (HQ224051) even mean? From my naive interpretation, it seems like the server just lost information about various queues backing certain JMS topic subscriptions...

              • 4. Re: Re: HQ 2.4.4: Clients silently losing connection to server, lots of HQ224051 errors on server
                jbertram

                "We did not upgrade any clients to the 2.4.4 libraries. The plan was to ensure the system was stable before doing this, particularly because updated clients would not have been able to communicate with an older version of the server in the event of a rollback to 2.2.20 (which we ended up performing)"

                I assumed you had set this up in a QA or test environment and tested with clients using both old and new libraries.

                 

                "I can try to come up with a test case, but it might be difficult. The problem took a day or so to manifest."

                Perhaps you could speed the process up by running increased load through.

                 

                "After about a day, we noticed random service disruptions, and certain clients just stopped getting data at different times (and the errors started appearing in the HQ logs). We normally detect problems on the client side via the ExceptionListener; when it fires, the clients re-establish a connection to the server. However, the ExceptionListener was never invoked (occurrences are logged)."

                Did you see evidence that the connection between the client and the server was broken?  There are a handful of connection-related admin functions that can help with this.  If the connection didn't break then I wouldn't expect the ExceptionListener to be invoked.

                 

                How are the connection factories on the server configured?

                 

                "What exactly does that exception on the server side (HQ224051) even mean? From my naive interpretation, it seems like the server just lost information about various queues backing certain JMS topic subscriptions..."

                The error is related to the mechanism which HornetQ uses to keep track of events/data around the cluster.  In this case a binding has been removed from one node of the cluster and a notification was sent to the other node so that it could also remove the binding from its meta-data.  The problem, however, is that the other node doesn't recognize the ID of the binding in question.
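
To make the failure mode described above concrete, here is a minimal toy model in plain Java. It is only a sketch under stated assumptions: `ToyNode`, its methods, and the queue and address names are hypothetical and are not HornetQ classes or HornetQ's actual implementation. It only mimics the described behaviour, where a node keeps meta-data about bindings on the other node and fails when told to remove one it has never heard of.

```java
import java.util.HashMap;
import java.util.Map;

// Toy model only: hypothetical classes, NOT HornetQ's implementation.
class ToyNode {
    // Meta-data this node holds about bindings that live on the other cluster node,
    // keyed by queue ID and mapped to the address the queue is bound to.
    private final Map<String, String> remoteQueueInfos = new HashMap<>();

    // The other node announced that a binding was added.
    void onBindingAdded(String queueId, String address) {
        remoteQueueInfos.put(queueId, address);
    }

    // The other node announced that a binding was removed.
    void onBindingRemoved(String queueId) {
        if (!remoteQueueInfos.containsKey(queueId)) {
            // Same wording as the exception in the log above; the ID is unknown here.
            throw new IllegalStateException("Cannot find queue info for queue " + queueId);
        }
        remoteQueueInfos.remove(queueId);
    }

    boolean knows(String queueId) {
        return remoteQueueInfos.containsKey(queueId);
    }
}

class BindingNotificationSketch {
    public static void main(String[] args) {
        ToyNode receivingNode = new ToyNode();

        // Normal case: the add notification arrives first, so the removal succeeds.
        receivingNode.onBindingAdded("queue-1", "jms.topic.example");
        receivingNode.onBindingRemoved("queue-1");

        // Failure case: a removal arrives for a binding this node never recorded.
        try {
            receivingNode.onBindingRemoved("queue-2");
        } catch (IllegalStateException e) {
            System.out.println("Failed to call notification listener: " + e.getMessage());
        }
    }
}
```

The sketch says nothing about why the receiving node would be missing the entry in a real cluster (for example a lost or out-of-order notification); that is exactly what a reproducing test case would need to show.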