30 Replies. Latest reply on Nov 18, 2009 12:48 PM by ataylor

    Race condition found by failure in GroupingFailoverReplicationTest

    jmesnil

      This is related to an intermittent failure of GroupingFailoverReplicationTest.testGroupingLocalHandlerFails.
      The test sometimes fails because the remote queue binding is added twice to the same server.

      Here are some logs I added to understand what's going on:

      INFO: ADDED BINDING uuid=51ab42be-ce05-11de-b662-001c42000009, notif.51aaa67a-ce05-11de-b662-001c42000009519b3d24-ce05-11de-b662-001c42000009, distance=0, backup=false, server=519b3d24-ce05-11de-b662-001c42000009
      INFO: ADDED BINDING uuid=51acf06f-ce05-11de-b662-001c42000009, queue0519b3d24-ce05-11de-b662-001c42000009, distance=0, backup=false, server=519b3d24-ce05-11de-b662-001c42000009
      INFO: SEND INFO TO QUEUE from 519b3d24-ce05-11de-b662-001c42000009
      INFO: ROUTE 51ae01e0-ce05-11de-b662-001c42000009 BINDING ADDED For queue0519b3d24-ce05-11de-b662-001c42000009 in 519b3d24-ce05-11de-b662-001c42000009
      INFO: CLEAR ALL BINDINGS
      INFO: adding 51ae01e0-ce05-11de-b662-001c42000009 binding for cluster queue0519b3d24-ce05-11de-b662-001c42000009, to 51a687c7-ce05-11de-b662-001c42000009, distance=0, backup=false
      INFO: ADDED BINDING uuid=51b3ce41-ce05-11de-b662-001c42000009, queue0519b3d24-ce05-11de-b662-001c42000009, distance=1, backup=false, server=51a687c7-ce05-11de-b662-001c42000009
      INFO: adding 51acf06f-ce05-11de-b662-001c42000009 binding for cluster queue0519b3d24-ce05-11de-b662-001c42000009, to 51a687c7-ce05-11de-b662-001c42000009, distance=0, backup=false
      ATTENTION: Remote queue binding queue0519b3d24-ce05-11de-b662-001c42000009 has already been bound in the post office. Most likely cause for this is you have a loop in your cluster due to cluster max-hops being too large or you have multiple cluster connections to the same nodes using overlapping addresses

      519b3d24 <-> server #0
      51a687c7 <-> server #1
      notif.51aaa67a <-> the bridge's notification consumer
      queue0519b3d24 <-> the queue

      Here is what happens when the test fails:

      1.1 on server #0, a local binding is added for queue0519b3d24
      -> a notification (51acf06f) is sent
      1.2 on server #1, the bridge creates a notification queue notif.51aaa67a
      -> it sends a management message invoking sendQueueInfoToQueue
      2. server #0 replies to this message by sending a BINDING_ADDED message (51ae01e0)
      3. server #1 clears all its bindings
      4. server #1 receives the BINDING_ADDED message sent from sendQueueInfoToQueue (51ae01e0)
      -> it adds the remote queue binding for queue0519b3d24
      5. server #1 receives the BINDING_ADDED *notification* (51acf06f)
      -> EXCEPTION! The binding has already been added

      There is a notificationLock in PostOfficeImpl which should prevent the messages from
      sendQueueInfoToQueue from being received *before* the notification when the local binding is added.
      However, the messages and the notification do not follow the same code path:
      the notification is routed like a regular message while the messages sent by sendQueueInfoToQueue
      use routeDirect() instead.
      I need to find out whether a message routed directly (i.e. from sendQueueInfoToQueue) can overtake a message routed normally (i.e. the notification) even though routeDirect() is called after route().

        • 1. Re: Race condition found by failure in GroupingFailoverReplicationTest
          jmesnil

          ok, some more findings

          The issue is that messages are routed from separate threads and are not routed in the expected order.

          * in one thread, the BINDING_ADDED notification (202d41cd) is routed normally
          -> it enters PostOffice.processRoute() first
          * in another thread, PostOffice.sendQueueInfoToQueue routes a BINDING_ADDED message (202e051e) directly
          -> it enters PostOffice.processRoute() second

          However, the BINDING_ADDED message (202e051e) finishes processing first and is added to the queue
          before the BINDING_ADDED notification (202d41cd).

          The order is not what we expect...

          But the real issue is the ordering of the CLEAR_BINDINGS message sent by PostOffice.sendQueueInfoToQueue() and the BINDING_ADDED message.

          If the CLEAR_BINDINGS message is received before the BINDING_ADDED notification, the binding will be added twice to the PostOffice -> the exception will occur whether it is the BINDING_ADDED notification or the BINDING_ADDED message which is received first by the consumer.

          The root issue is that the CLEAR_BINDINGS message can be delivered to the consumer first even if the BINDING_ADDED notification is sent first.
          These two messages are processed by separate threads and there is no guarantee that BINDING_ADDED will be routed and added to the queue first even though processRoute() is called for it first.

          Another issue which aggravates the race condition is that the notifications are *durable* messages while the messages sent by PostOffice.sendQueueInfoToQueue() are *not durable* -> the notifications spend even more time being processed by the PostOffice, making it more likely that the messages sent by PostOffice.sendQueueInfoToQueue() are delivered first.
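
          To make the race concrete, here is a minimal standalone sketch (my own model with made-up names, not HornetQ code): the "durable" path enqueues its message only when a simulated replication response arrives, while the "non-durable" path enqueues immediately, so the message routed second typically arrives first.

           import java.util.Queue;
           import java.util.concurrent.ConcurrentLinkedQueue;
           import java.util.concurrent.ExecutorService;
           import java.util.concurrent.Executors;
           import java.util.concurrent.TimeUnit;

           public class RouteOrderRace
           {
              static final Queue<String> queue = new ConcurrentLinkedQueue<String>();
              static final ExecutorService replicationResponses = Executors.newSingleThreadExecutor();

              // routed first, but enqueued only when the fake replication response arrives
              static void routeDurable(final String msg)
              {
                 replicationResponses.submit(new Runnable()
                 {
                    public void run()
                    {
                       sleep(10);      // simulate the round-trip to the backup server
                       queue.add(msg); // the afterReplicated() callback adds the reference
                    }
                 });
              }

              // routed second, but enqueued immediately: nothing to wait for
              static void routeNonDurable(String msg)
              {
                 queue.add(msg);
              }

              public static void main(String[] args) throws Exception
              {
                 routeDurable("BINDING_ADDED notification"); // route() called first
                 routeNonDurable("CLEAR_BINDINGS message");  // routeDirect() called second
                 replicationResponses.shutdown();
                 replicationResponses.awaitTermination(1, TimeUnit.SECONDS);
                 // typically prints [CLEAR_BINDINGS message, BINDING_ADDED notification]
                 System.out.println(queue);
              }

              static void sleep(long ms)
              {
                 try { Thread.sleep(ms); } catch (InterruptedException ignored) { }
              }
           }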

          For the record, here are the logs which show the race condition:

          INFO: ADDED BINDING uuid=20297139-ce17-11de-bb74-001c42000009, notif.20292318-ce17-11de-bb74-001c420000092013ed62-ce17-11de-bb74-001c42000009, distance=0, backup=false, server=2013ed62-ce17-11de-bb74-001c42000009
          INFO: ADDED BINDING uuid=202d41cd-ce17-11de-bb74-001c42000009, queue02013ed62-ce17-11de-bb74-001c42000009, distance=0, backup=false, server=2013ed62-ce17-11de-bb74-001c42000009
          INFO: PROCESS ROUTE for 202d41cd-ce17-11de-bb74-001c42000009 true null
          INFO: SEND INFO 202e051e-ce17-11de-bb74-001c42000009
          INFO: PROCESS ROUTE for 202e051e-ce17-11de-bb74-001c42000009 true null
          INFO: PROCESSING ROUTE TOOK 4 202e051e-ce17-11de-bb74-001c42000009 1257870958719
          INFO: ADDED 202e051e-ce17-11de-bb74-001c42000009 to notif.20292318-ce17-11de-bb74-001c42000009
          INFO: DELIVERING MESSAGE 202e051e-ce17-11de-bb74-001c42000009
          INFO: PROCESSING ROUTE TOOK 6 202d41cd-ce17-11de-bb74-001c42000009 1257870958720
          INFO: ADDED 202d41cd-ce17-11de-bb74-001c42000009 to notif.20292318-ce17-11de-bb74-001c42000009
          INFO: CLEAR BINDINGS
          INFO: DIRECT DELIVERED HANDLED 202e051e-ce17-11de-bb74-001c42000009 org.hornetq.core.server.impl.ServerConsumerImpl@2781b3d4
          INFO: DELIVERING MESSAGE 202d41cd-ce17-11de-bb74-001c42000009
          INFO: DIRECT DELIVERED HANDLED 202d41cd-ce17-11de-bb74-001c42000009 org.hornetq.core.server.impl.ServerConsumerImpl@2781b3d4
          INFO: adding 202e051e-ce17-11de-bb74-001c42000009 binding for cluster queue02013ed62-ce17-11de-bb74-001c42000009, to 202181f5-ce17-11de-bb74-001c42000009, distance=0, backup=false
          INFO: ADDED BINDING uuid=2031ae9f-ce17-11de-bb74-001c42000009, queue02013ed62-ce17-11de-bb74-001c42000009, distance=1, backup=false, server=202181f5-ce17-11de-bb74-001c42000009
          INFO: adding 202d41cd-ce17-11de-bb74-001c42000009 binding for cluster queue02013ed62-ce17-11de-bb74-001c42000009, to 202181f5-ce17-11de-bb74-001c42000009, distance=0, backup=false
          ATTENTION: Remote queue binding queue02013ed62-ce17-11de-bb74-001c42000009 has already been bound in the post office. Most likely cause for this is you have a loop in your cluster due to cluster max-hops being too large or you have multiple cluster connections to the same nodes using overlapping addresses
          



          • 2. Re: Race condition found by failure in GroupingFailoverReplicationTest
            jmesnil

            A simple question so that I can finish this task tomorrow:

            Why are messages sent by PostOffice.sendQueueInfoToQueue routed directly?

            They seem a little bastardized: they look like notifications (except they're not durable), they are handled like notifications, but they are routed directly instead of using the NotificationService.
            Why can't they be sent like regular notifications?

            • 3. Re: Race condition found by failure in GroupingFailoverReplicationTest
              timfox

               

              "jmesnil" wrote:
              A simple question so that I can finish this task tomorrow:

              Why are messages sent by PostOffice.sendQueueInfoToQueue routed directly?

              They seem a little bastardized: they look like notifications (except they're not durable), they are handled like notifications, but they are routed directly instead of using the NotificationService.
              Why can't they be sent like regular notifications?


              Because they must only be sent to the consumer that requested them.

              If you routed them through the post office they would be sent to all queues registered at that address, which would mean they would arrive at consumers on different nodes which had not requested them.
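
              To illustrate the difference, here is a toy sketch (hypothetical code, not HornetQ's API): routing by address fans out to every queue bound to it, while routing directly targets only the one queue that requested the info.

               import java.util.*;
               import java.util.concurrent.ConcurrentLinkedQueue;

               public class FanOutVsDirect
               {
                  // address -> all queues bound to that address (one per interested node)
                  private final Map<String, List<Queue<String>>> bindings = new HashMap<String, List<Queue<String>>>();

                  void bind(String address, Queue<String> queue)
                  {
                     List<Queue<String>> queues = bindings.get(address);
                     if (queues == null)
                     {
                        queues = new ArrayList<Queue<String>>();
                        bindings.put(address, queues);
                     }
                     queues.add(queue);
                  }

                  // normal routing: every binding for the address receives the message
                  void route(String address, String msg)
                  {
                     List<Queue<String>> queues = bindings.get(address);
                     if (queues != null)
                     {
                        for (Queue<String> q : queues)
                        {
                           q.add(msg);
                        }
                     }
                  }

                  // direct routing: only the queue that asked for the info receives it
                  void routeDirect(Queue<String> target, String msg)
                  {
                     target.add(msg);
                  }

                  public static void main(String[] args)
                  {
                     FanOutVsDirect po = new FanOutVsDirect();
                     Queue<String> requester = new ConcurrentLinkedQueue<String>();
                     Queue<String> otherNode = new ConcurrentLinkedQueue<String>();
                     po.bind("notif.address", requester);
                     po.bind("notif.address", otherNode);

                     po.route("notif.address", "BINDING_ADDED notification"); // both queues get it
                     po.routeDirect(requester, "queue info");                 // only the requester gets it

                     System.out.println("requester: " + requester); // [BINDING_ADDED notification, queue info]
                     System.out.println("otherNode: " + otherNode); // [BINDING_ADDED notification]
                  }
               }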

              • 4. Re: Race condition found by failure in GroupingFailoverReplicationTest
                timfox

                 

                "jmesnil" wrote:
                ok, some more findings

                The issue is that messages are routed from separate threads and are not routed in the expected order.

                * in one thread, the BINDING_ADDED notification (202d41cd) is routed normally
                -> it enters PostOffice.processRoute() first
                * in another thread, PostOffice.sendQueueInfoToQueue routes a BINDING_ADDED message (202e051e) directly
                -> it enters PostOffice.processRoute() second


                I'm trying to understand how this can be the case.

                Both managementService.sendNotification() and PostOffice.sendQueueInfoToQueue() require the notification lock before they can execute, so how can they happen concurrently as you suggest?

                I guess I am missing something here...

                • 5. Re: Race condition found by failure in GroupingFailoverReplicationTest
                  jmesnil

                   

                  "timfox" wrote:


                  I'm trying to understand how this can be the case.

                  Both managementService.sendNotification() and PostOffice.sendQueueInfoToQueue() require the notification lock before they can execute, so how can they happen concurrently as you suggest?

                  I guess I am missing something here...


                  Holding the notification lock is not enough to guarantee that the notification will be added to the queue before the sendQueueInfoToQueue messages.

                  The management service will hold the lock, route the notification, add a Runnable via StorageManager.afterReplicated(), and release the lock. The same goes for PostOffice.sendQueueInfoToQueue.

                   storageManager.afterReplicated(new Runnable()
                   {
                      public void run()
                      {
                         addReferences(refs);
                      }
                   });
                  


                  The lock ensures that the notification and the messages are routed in the correct order.
                  However, the messages are effectively added to the queue only later, when the Runnable is called back outside the notification lock.
                  When the test fails, the sendQueueInfoToQueue messages are added to the queue *before* the notification even though the notification was routed first.
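
                  In other words, a minimal sketch of the pattern (hypothetical names standing in for PostOfficeImpl and the StorageManager): the lock strictly orders the route() calls, but each add runs later in a callback, after the lock has been released, so the lock alone cannot order the arrivals.

                   import java.util.concurrent.ExecutorService;
                   import java.util.concurrent.Executors;
                   import java.util.concurrent.TimeUnit;

                   public class LockDoesNotOrderCallbacks
                   {
                      interface StorageManagerLike { void afterReplicated(Runnable action); }

                      private final Object notificationLock = new Object();

                      void routeUnderLock(final String msg, StorageManagerLike storageManager)
                      {
                         synchronized (notificationLock)
                         {
                            // routing work here is strictly ordered by the lock ...
                            storageManager.afterReplicated(new Runnable()
                            {
                               public void run()
                               {
                                  // ... but the add runs later, outside the lock, on another thread
                                  System.out.println("added " + msg + " on thread " + Thread.currentThread().getId());
                               }
                            });
                         } // lock released; nothing orders one callback relative to another
                      }

                      public static void main(String[] args) throws Exception
                      {
                         // two threads standing in for replication responses arriving independently
                         final ExecutorService responses = Executors.newFixedThreadPool(2);
                         StorageManagerLike storageManager = new StorageManagerLike()
                         {
                            public void afterReplicated(Runnable action) { responses.submit(action); }
                         };

                         LockDoesNotOrderCallbacks po = new LockDoesNotOrderCallbacks();
                         po.routeUnderLock("notification", storageManager); // routed first
                         po.routeUnderLock("queue info", storageManager);   // routed second, may be added first
                         responses.shutdown();
                         responses.awaitTermination(1, TimeUnit.SECONDS);
                      }
                   }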

                  I am now tracing the code related to the ReplicationManager to understand why the sendQueueInfoToQueue replication is not completed after the notification replication...



                  • 6. Re: Race condition found by failure in GroupingFailoverReplicationTest
                    timfox

                    Ah, you're talking about the new replication code that Clebert added?

                    • 7. Re: Race condition found by failure in GroupingFailoverReplicationTest
                      jmesnil

                      I am confused by the thread activity related to replication.

                      When GroupingFailoverReplicationTest.testGroupingLocalHandlerFails fails, messages are routed from separate server sessions (one for the management notification consumer, one for the client creating the queue).

                      I added some logging to PostOfficeImpl.processRoute():

                       if (storageManager.isReplicated())
                       {
                          log.info("IN PROCESS ROUTE " + message.getStringProperty("foobar") + " " + Thread.currentThread().getId());
                          storageManager.afterReplicated(new Runnable()
                          {
                             public void run()
                             {
                                log.info("AFTER REPLICATION " + message.getStringProperty("foobar") + " " + Thread.currentThread().getId());
                                addReferences(refs);
                             }
                          });
                       }
                      


                      It logs which thread is executing PostOffice.processRoute() and which thread is executing the runnable.
                      When the test fails, the logs are:

                      INFO: IN PROCESS ROUTE 353320f5-cf9a-11de-b688-001c42000009 82
                      INFO: IN PROCESS ROUTE 35351cc6-cf9a-11de-b688-001c42000009 83
                      INFO: AFTER REPLICATION 35351cc6-cf9a-11de-b688-001c42000009 83
                      INFO: AFTER REPLICATION 353320f5-cf9a-11de-b688-001c42000009 64
                      


                      As you can see, there are two threads calling PostOffice.processRoute() (#82 & #83).
                      Thread #83 finishes replication before thread #82 -> messages are added to the queue out of order. This is the issue causing the test to fail.
                      A second issue is that the thread which executes the runnable added by thread #82 is thread #64, not thread #82!



                      • 8. Re: Race condition found by failure in GroupingFailoverReplicationTest
                        timfox

                        It's not a problem if replication responses are executed on different threads. Don't expect them to be executed on the same thread.

                        But it is a problem if they are executed out of order.

                        • 9. Re: Race condition found by failure in GroupingFailoverReplicationTest
                          jmesnil

                          It's not clear yet, but the issue may be that the replication path is not the same for both types of messages.

                          The messages sent from PostOffice.sendQueueInfoToQueue are not replicated and do not trigger any replication (the bridge notification queue is non-durable). Hence, the "replication" process completes quite fast for them: they have nothing to wait for from the backup server. The thread calling this method does not wait for any replication from the backup server.

                          The notification message is not replicated either, but its executing thread waits for the replication of a REPLICATION_APPEND packet. The notification message will not be added to the queue until that packet has been replicated on the backup server.
                          When the test fails, it is because the replication response for the REPLICATION_APPEND arrives late and the CLEAR_BINDINGS message has already been added to the queue.

                          I need to understand under which conditions the failure occurs, what the REPLICATION_APPEND packet corresponds to, whether it is normal to wait for the replication of this packet before adding the notification message, etc.


                          • 10. Re: Race condition found by failure in GroupingFailoverReplicationTest
                            timfox

                            Hmm, this sounds like a more general problem with the replication code. From what I can gather from your explanation, it seems that durable messages are replicated but non-durable ones aren't?

                            If this is true, then this will cause ordering problems, e.g. if I send the following messages (D = durable, N = non-durable) from a producer:

                            D1 N1 D2 N2 D3 N3 D4 N4

                            Then on the server the non-durable ones will be added to the queue immediately, but the durable ones will only be added when the replication response comes back.

                            So you'll probably end up with the messages in the queue in the following order:

                            N1 N2 N3 N4 D1 D2 D3 D4

                            Which is incorrect and breaks JMS ordering guarantees.

                            The quick fix for this would be to always replicate a packet even for non-durable messages - we can create a dummy replication packet which actually does nothing on the backup server, but this guarantees that all messages are added to the queue in the correct order, since they will only be added when the replication responses return.
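
                            A rough sketch of that idea (my own simplification with hypothetical names, not the actual patch): if every message, durable or not, waits for a response on the same channel to the backup, the responses come back in send order and the adds stay ordered.

                             import java.util.Queue;
                             import java.util.concurrent.ConcurrentLinkedQueue;
                             import java.util.concurrent.ExecutorService;
                             import java.util.concurrent.Executors;
                             import java.util.concurrent.TimeUnit;

                             public class OrderedReplication
                             {
                                static final Queue<String> queue = new ConcurrentLinkedQueue<String>();

                                // single channel to the backup: responses return in the order packets were sent
                                static final ExecutorService channelToBackup = Executors.newSingleThreadExecutor();

                                static void route(final String msg, boolean durable)
                                {
                                   // durable -> real REPLICATION_APPEND; non-durable -> dummy no-op packet.
                                   // Either way, the add only happens when the in-order response comes back.
                                   final String packet = durable ? "REPLICATION_APPEND" : "DUMMY";
                                   channelToBackup.submit(new Runnable()
                                   {
                                      public void run()
                                      {
                                         // the backup does nothing for a DUMMY packet; the response alone orders the add
                                         queue.add(msg + " (via " + packet + ")");
                                      }
                                   });
                                }

                                public static void main(String[] args) throws Exception
                                {
                                   route("D1", true);
                                   route("N1", false);
                                   route("D2", true);
                                   route("N2", false);
                                   channelToBackup.shutdown();
                                   channelToBackup.awaitTermination(1, TimeUnit.SECONDS);
                                   System.out.println(queue); // send order preserved: D1, N1, D2, N2
                                }
                             }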

                            • 11. Re: Race condition found by failure in GroupingFailoverReplicationTest
                              clebert.suconic

                              Yes... it sounds like a bug:


                              I have created a JIRA: https://jira.jboss.org/jira/browse/HORNETQ-218

                              • 12. Re: Race condition found by failure in GroupingFailoverReplicationTest
                                clebert.suconic

                                 

                                I need to understand under which conditions the failure occurs, what the REPLICATION_APPEND packet corresponds to, whether it is normal to wait for the replication of this packet before adding the notification message, etc.



                                @Jeff: Just to answer your question, REPLICATION_APPEND is sent by the ReplicatedJournal.

                                Any time you call StorageManager.append, the REPLICATION_APPEND packet will be sent to the ReplicationEndpoint on the backup node.
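
                                For what it's worth, a simplified sketch of that mechanism (hypothetical interfaces, not the real ReplicatedJournal): every append is mirrored to the backup as a REPLICATION_APPEND-style packet in addition to the local write.

                                 public class ReplicatedJournalSketch
                                 {
                                    interface Journal { void appendAddRecord(long id, byte[] record); }
                                    interface Channel { void send(String packetType, long id, byte[] record); }

                                    private final Journal localJournal;
                                    private final Channel channelToBackup;

                                    ReplicatedJournalSketch(Journal localJournal, Channel channelToBackup)
                                    {
                                       this.localJournal = localJournal;
                                       this.channelToBackup = channelToBackup;
                                    }

                                    public void appendAddRecord(long id, byte[] record)
                                    {
                                       // every append is mirrored to the backup's ReplicationEndpoint ...
                                       channelToBackup.send("REPLICATION_APPEND", id, record);
                                       // ... and written to the live node's own journal
                                       localJournal.appendAddRecord(id, record);
                                    }

                                    public static void main(String[] args)
                                    {
                                       ReplicatedJournalSketch journal = new ReplicatedJournalSketch(
                                          new Journal() { public void appendAddRecord(long id, byte[] r) { System.out.println("local append " + id); } },
                                          new Channel() { public void send(String type, long id, byte[] r) { System.out.println(type + " " + id + " -> backup"); } });
                                       journal.appendAddRecord(1, new byte[0]);
                                    }
                                 }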

                                • 13. Re: Race condition found by failure in GroupingFailoverReplicationTest
                                  clebert.suconic

                                  I already have a fix, but I couldn't commit it today. If you guys are interested in it, I have attached a patch to the JIRA.

                                  • 14. Re: Race condition found by failure in GroupingFailoverReplicationTest
                                    clebert.suconic

                                    I got a 100% testsuite pass with the patch I attached to https://jira.jboss.org/jira/browse/HORNETQ-218


                                    However, GroupingFailoverReplicationTest takes 93 seconds per test. It always passes for me, but it's taking a huge amount of time to close the connection and to stop the server (the executor is taking forever to shut down). However, it looks like this was already happening without my changes.

