4 Replies Latest reply on Feb 9, 2011 5:10 PM by ryanhos

    ServiceInvoker.deliverSync() replies getting bounced around JBM Cluster

    ryanhos

      Environment:  JBoss ESB 4.7 deployed on JBoss 5.1.0 GA.  JBoss Messaging 1.4.7.

       

      We noticed some peculiar behavior on our test cluster last week when we spun up the load testing software.  Replies to ServiceInvoker.deliverSync() were seemingly getting lost, ending in an eventual timeout.  I watched the record counts in the JBoss Messaging tables and noticed something surprising.  There were quite a few messages on the reply queues, and they were getting shuffled from cluster node to cluster node.  I instantly suspected our old friend, the Message Selector, which is how JBoss ESB routes reply messages.  So, I wrote a test bed.  This is what I found.  I hope that someone has already run into this problem and can steer me around it.

       

      Blue Cluster Node:

           One clustered queue named "test.queue"

           One message producer, constantly sending messages with the property SelectorKey=Blue_<ever-increasing-integer-starting-with-zero> to the local test.queue.

           One message producer, constantly sending messages with the property SelectorKey=Red_<ever-increasing-integer-starting-with-zero> to the local test.queue.

           One message consumer, constantly making a new connection, session, and consumer, attempting to consume messages with SelectorKey=Blue_<ever-increasing-integer-starting-with-zero>.

       

      Red Cluster Node:

           One clustered queue named "test.queue"

           One message consumer, constantly making a new connection, session, and consumer, attempting to consume messages with SelectorKey=Red_<ever-increasing-integer-starting-with-zero>.
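
      For readers who want to reproduce the setup, the per-iteration keys and selector strings described above could be built roughly as follows.  This is only a sketch: the class and method names are hypothetical, and the only thing taken from the post is the SelectorKey property and the <Color>_<integer> value pattern.  The selector uses the standard JMS selector syntax, in which string literals go in single quotes.

```java
/** Hypothetical helper; not part of the original test bed. */
class SelectorKeys {
    /** Value stored in the SelectorKey message property, e.g. "Blue_40". */
    static String key(String color, int iteration) {
        return color + "_" + iteration;
    }

    /** JMS selector expression a consumer would use for one iteration. */
    static String selector(String color, int iteration) {
        return "SelectorKey = '" + key(color, iteration) + "'";
    }

    public static void main(String[] args) {
        // prints: SelectorKey = 'Blue_40'
        System.out.println(selector("Blue", 40));
    }
}
```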

       

      I booted both cluster nodes and invoked the MBean operations to launch the producers and then the consumers.  Everything went well until about the 40th producer iteration, when both consumers stalled waiting for a message. The producers kept going until I terminated them.

       

      On the Blue Cluster Node, Consumer was waiting on SelectorKey=Blue_40.  On the Red Cluster Node, Consumer was waiting on SelectorKey=Red_37.  I used the Queue's listAllMessages(String selector) method on each cluster node to ascertain where the messages were.  You guessed it.  Message Blue_40 was in the queue on the Red Cluster Node.  Message Red_37 was in the queue on the Blue Cluster Node.  (listAllMessages(String selector) does not return every known message on a logical clustered queue, just the messages currently the responsibility of a particular JBoss Messaging Cluster Node).

       

      It follows logically that the Blue_40 message was sucked across to the Red Node by the JBoss Messaging MessageSucker.  This is probably due to the fact that a consumer appeared over on that node and found an empty queue.  The Red Node JBM process wanted to feed the starved consumer, so it sucked a block of messages over from the Blue Node, which had all of them since that is the only place they were being produced.  It didn't manage to obtain the message it was looking for (Red_37), but it did manage to steal the Blue_40 message.
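
      The stall can be illustrated with a toy in-memory model.  This is not JBoss Messaging code; it assumes, as guessed above, that redistribution pulls messages from the head of the remote queue without looking at the starved consumer's selector.  The names SuckerModel, Node, and suck are made up for the illustration.

```java
import java.util.ArrayDeque;
import java.util.Deque;

/** Toy model of one clustered queue partitioned across two nodes. */
class SuckerModel {
    /** One node's local share of the clustered queue. */
    static final class Node {
        final Deque<String> local = new ArrayDeque<String>();

        /** Selector-style consume: succeeds only if the key is held locally. */
        boolean consume(String selectorKey) {
            return local.remove(selectorKey);
        }
    }

    /** ASSUMPTION: moves up to 'batch' head messages, ignoring any selector. */
    static void suck(Node from, Node to, int batch) {
        for (int i = 0; i < batch && !from.local.isEmpty(); i++) {
            to.local.addLast(from.local.pollFirst());
        }
    }

    public static void main(String[] args) {
        Node blue = new Node();
        Node red = new Node();
        // Both producers run on the Blue node, so everything starts there.
        blue.local.add("Blue_40");
        blue.local.add("Red_37");

        // A consumer appears on the empty Red node and triggers a pull.
        suck(blue, red, 1);

        System.out.println("red got Red_37:   " + red.consume("Red_37"));   // false
        System.out.println("blue got Blue_40: " + blue.consume("Blue_40")); // false
        System.out.println("blue holds " + blue.local + ", red holds " + red.local);
    }
}
```

      In this model each node ends up holding exactly the message the other node's consumer is waiting for, which matches what listAllMessages showed.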

       

      So, who else is running synchronous ESB invocations on a cluster and has navigated around this problem or avoided encountering it entirely?  Does anyone have any creative ideas for getting around this?  One "Hail Mary" suggestion from our team was to make the reply queue a topic, in the hope that the selector processing is somewhat different.  This is not a solution we're proud of; it's just a possible, untested solution.

        • 1. ServiceInvoker.deliverSync() replies getting bounced around JBM Cluster
          tfennelly

          Hi Ryan.

           

          This sounds like a bug in JBM.  Have you asked about this on the JBM forums?  Perhaps there's a way of configuring the queues that they'd know about.

           

          Can you package up your test in some form and send it to us?

          • 2. ServiceInvoker.deliverSync() replies getting bounced around JBM Cluster
            tfennelly

            Hi again Ryan.

             

            In fact... I was just talking to Kevin and he tells me that a number of clustering bugs relating to message selectors have been fixed in JBM.  So would it be possible for you to get a newer version (or get the upstream code or SNAPSHOTs) and test again?

            • 3. ServiceInvoker.deliverSync() replies getting bounced around JBM Cluster
              ryanhos

              Tom,

               

              Thanks for the reply.  It appears that I'm on the most recent non-2.0 version of JBMessaging, 1.4.7.  I realize that the problem lies solely within the JBM code, but I posted here for two reasons.  1.) I was hoping that someone had seen the same interaction between the ESB and JBM and had already overcome it, and 2.) the answer on the JBM forums is likely to be "upgrade to HornetQ," which is an unlikely solution at this point on this contract.

               

              I'll try to sanitize the code, removing all of our in-house libraries (logging, mostly) and send it over for examination.  I'll post it over in the JBM forums too, in hopes that there's a user over there who has seen the same problem.

              • 4. ServiceInvoker.deliverSync() replies getting bounced around JBM Cluster
                ryanhos

                I believe I have a reasonable and easy-to-implement solution.  I had clustered the *_reply queues, merely out of some false sense of necessity.  I dug into the ESB code and realized that the Client's ServiceInvoker self-addresses the replyToEpr, so the synchronous reply message will always return to the node that originally serviced the client's message, no matter which service EPR the load balancer chooses**.  This is exactly the node where the client is waiting for that message.  With the queues un-clustered, the MessageSucker will never redistribute reply messages to satisfy starved consumers (Pickup Couriers) on other nodes.  I guess the clustering of the *_reply queues was silly anyway, since the ESB clients weren't connecting to a failover-aware ConnectionFactory.  Now, if a node dies, the synchronous response dies with it, but at least when the nodes don't die, all of my clients will get their synchronous responses.
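
                As a rough illustration of why un-clustering works, here is a toy model (not ESB or JBM code) in which every node owns a private reply queue and the reply is always delivered to the node named in the reply-to address.  The class LocalReplyModel and its methods are hypothetical; the only behavior borrowed from the ESB is that the client addresses the reply to itself.

```java
import java.util.ArrayDeque;
import java.util.HashMap;
import java.util.Map;
import java.util.Queue;

/** Toy model: one private (un-clustered) reply queue per node. */
class LocalReplyModel {
    private final Map<String, Queue<String>> replyQueues =
            new HashMap<String, Queue<String>>();

    void addNode(String node) {
        replyQueues.put(node, new ArrayDeque<String>());
    }

    /**
     * A service running on any node sends its reply to the queue named by
     * the reply-to address; servicingNode plays no part in the routing.
     */
    void sendReply(String servicingNode, String replyToNode, String correlationId) {
        replyQueues.get(replyToNode).add(correlationId);
    }

    /** The client only ever polls the reply queue on its own node. */
    boolean receive(String clientNode, String correlationId) {
        return replyQueues.get(clientNode).remove(correlationId);
    }

    public static void main(String[] args) {
        LocalReplyModel cluster = new LocalReplyModel();
        cluster.addNode("blue");
        cluster.addNode("red");

        // Client on blue; the load balancer happened to pick a service on red.
        cluster.sendReply("red", "blue", "msg-1");

        // With no redistribution step, the reply waits where the client is.
        System.out.println(cluster.receive("blue", "msg-1")); // true
    }
}
```

                Because nothing in this model moves messages between queues, a reply can only be lost if the node holding it goes away, which is the trade-off described above.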

                 

                I think a good lesson here is that JMS is just not meant for point-to-point communication.  Unfortunately, I cannot offer an alternate solution to accompany my impeachment of the current mechanism for synchronous response delivery.

                 

                **We've installed a load balancer that prefers EPRs in the same process, and failing that (I don't see how), on the same server, just to minimize unnecessary network traffic.  We have a homogeneous ESB cluster, so this doesn't seem harmful.