6 Replies Latest reply on Jan 13, 2009 9:54 PM by clebert.suconic

    Failover & Paging...

    clebert.suconic

      This is a summary of what we discussed today at:

      http://www.antwerkz.com/javabot/javabot/home/3/%23jbossmessaging/2/07/1/01/0/2009/ (starting at [10:41])


      The basic problem with paging and failover is that when the ACK is replicated to the backup node, the message may still be in a page file there, so replicateACK may fail because it cannot locate the ID on the queue:


      public void deliverReplicated(final long messageID) throws Exception
      {
         // It may not be the first in the queue, since there may be multiple producers
         // sending to the queue
         MessageReference ref = messageQueue.removeReferenceWithID(messageID);

         if (ref == null)
         {
            throw new IllegalStateException("Cannot find ref when replicating delivery " + messageID);
         }
         // ...
      
      


      To fix that, based on our discussion, I'm going to change DuplicateID detection to work at the Queue level (not the Address level), and when ref == null, I'm going to record that ID on the Queue as one to be ignored.

      That will require some synchronization on route when duplicateID != null.
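
To make the proposal above concrete, here is a minimal sketch of a queue-level ignore list. SketchQueue, idsToIgnore, and both method signatures are hypothetical stand-ins (plain long IDs instead of MessageReference), not the real QueueImpl. It only illustrates why route and deliverReplicated would have to synchronize on the same state.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashSet;
import java.util.Set;

// Hypothetical stand-in for a queue on the backup node; not the real QueueImpl.
class SketchQueue
{
   // Message IDs currently held in memory by this queue.
   private final Deque<Long> refs = new ArrayDeque<Long>();

   // IDs whose delivery was replicated before the message reached this queue.
   private final Set<Long> idsToIgnore = new HashSet<Long>();

   // Called when a message is routed (or depaged) into the queue.
   // Returns false if the message was dropped because its delivery was already replicated.
   synchronized boolean route(final long messageID)
   {
      if (idsToIgnore.remove(Long.valueOf(messageID)))
      {
         return false;
      }
      refs.add(Long.valueOf(messageID));
      return true;
   }

   // Called when the live node replicates a delivery.
   // Returns true if the reference was found; otherwise the ID is remembered for later.
   synchronized boolean deliverReplicated(final long messageID)
   {
      if (refs.removeFirstOccurrence(Long.valueOf(messageID)))
      {
         return true;
      }
      idsToIgnore.add(Long.valueOf(messageID));
      return false;
   }
}
```

Both methods lock the same object because a message can be routed at the same moment its replicated delivery arrives; without that, the ID could be added to the ignore list just after route has checked it.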

        • 1. Re: Failover & Paging...
          clebert.suconic

          The DuplicateIDCache will not work:


          - The MessageID lookup could fail in two places:
          I - In ServerConsumerImpl::deliverReplicated, where we need to temporarily remove the message from the Queue and place it on deliveringRefs.

          II - In ServerConsumerImpl::acknowledge: because (I) failed, deliveringRefs will not have the reference to ack.


          In (I), we can't really add anything to the DuplicateIDCache, as the reference is only being moved to deliveringRefs.


          I have created a branch with my current changes:

          https://svn.jboss.org/repos/messaging/branches/Branch_Failover_Page


          PagingFailoverTest passes with this hack in QueueImpl:

          // This is a temporary hack, for the temporary branch only
          for (int i = 0; i < 10; i++)
          {
             System.out.println("Retry " + i);
             Thread.sleep(100);

             ref = removeReferenceWithID(id, false);

             if (ref != null)
             {
                System.out.println("Finally found it:");
                break;
             }
          }
          






          • 2. Re: Failover & Paging...
            clebert.suconic

            There is also an issue with DuplicateIDCache and rollback.


            If the user decides to roll back after the failover, we need to make sure the messages go back to the Queue.

            • 3. Re: Failover & Paging...
              clebert.suconic

              It was faster to implement something now, while everybody in Europe was sleeping, and talk about other options later, so we could move forward.

              It was actually relatively simple to force a depage when the reference is not found, which fixed the problems I raised in my previous posts.

              All of this is controlled in ServerConsumerImpl::deliverReplicated. I'm now also sending the address used on delivery. If the reference is not found, I force a depage.

              I don't think we would actually have an issue with OMEs or anything. Even if there are ordering differences between the two nodes, I don't think the two paging systems would differ much. We will talk about it when I wake up.


              All of this is on the branch I created:

              https://svn.jboss.org/repos/messaging/branches/Branch_Failover_Page
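
The "force a depage when the reference is not found" idea described in this post can be sketched roughly as follows. This is only an illustration under assumptions: BackupQueueSketch, addPage, depage, and the in-memory lists of IDs are hypothetical stand-ins for the paging store and QueueImpl on the backup node, and the address parameter mentioned above is left out because the sketch models a single address.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

// Hypothetical stand-in for the backup node's page store plus queue; not the real classes.
class BackupQueueSketch
{
   // Pages still "on disk", oldest first; each page holds message IDs.
   private final Deque<List<Long>> pages = new ArrayDeque<List<Long>>();

   // Message IDs already in memory on the queue.
   private final Deque<Long> queue = new ArrayDeque<Long>();

   // References that were delivered on the live node and are waiting for an ACK.
   private final List<Long> deliveringRefs = new ArrayList<Long>();

   void addPage(final List<Long> page)
   {
      pages.add(new ArrayList<Long>(page));
   }

   // Moves the oldest page into the in-memory queue.
   // Returns false when there is nothing left to depage.
   private boolean depage()
   {
      List<Long> page = pages.poll();
      if (page == null)
      {
         return false;
      }
      queue.addAll(page);
      return true;
   }

   void deliverReplicated(final long messageID)
   {
      // Keep depaging until the reference shows up in memory.
      while (!queue.removeFirstOccurrence(Long.valueOf(messageID)))
      {
         if (!depage())
         {
            throw new IllegalStateException("Cannot find ref when replicating delivery " + messageID);
         }
      }
      deliveringRefs.add(Long.valueOf(messageID));
   }

   boolean isDelivering(final long messageID)
   {
      return deliveringRefs.contains(Long.valueOf(messageID));
   }

   int inMemoryCount()
   {
      return queue.size();
   }
}
```

The IllegalStateException from the original method is kept as the last resort, so a reference that exists neither in memory nor in any page still fails loudly instead of looping forever.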

              • 4. Re: Failover & Paging...
                timfox

                What's the latest status on this?

                Also, what's the status on large message replication? We haven't discussed the design for that one yet.

                • 5. Re: Failover & Paging...
                  clebert.suconic

                  Paging is implemented on the branch, as described in my last post on this thread: a depage is forced when the message is not found on replicateDelivery.

                  I'm just waiting for your approval before merging the branch.

                  For LargeMessage, I'm first debugging why the credits are not being replicated between the nodes for LargeMessages, which leaves the consumer busy. Per our last discussion, we may not need to do anything for LargeMessages: just replicate the credits and let it send the chunks.

                  Once I've found the root cause of the failures, I will decide whether the fix needs any design work.

                  • 6. Re: Failover & Paging...
                    clebert.suconic

                    Besides sending the AddressName on replicate delivery, and forcing a depage until a reference is found, I also had to make sure replicateDelivery wouldn't subtract sizes on the PageControl.


                    replicateDelivery was calling removeFromQueue, and that would call addSize(-size).

                    addSize is supposed to be called only on ACK, so I made a few tweaks around the method.
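
A rough sketch of the accounting rule described in this post, under stated assumptions: PagedQueueSketch, addressSize, and the ID-to-size maps are hypothetical and stand in for QueueImpl and the paging store; only the rule itself (a replicated delivery moves the reference without touching the size, the ACK subtracts it) comes from the post.

```java
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical stand-in for a paged queue; not the real QueueImpl or paging store.
class PagedQueueSketch
{
   // Message ID -> size in bytes, for messages in memory on the queue.
   private final Map<Long, Integer> queue = new LinkedHashMap<Long, Integer>();

   // Message ID -> size in bytes, for messages delivered but not yet acknowledged.
   private final Map<Long, Integer> deliveringRefs = new HashMap<Long, Integer>();

   // Bytes the paging store counts as being in memory for this address.
   private long addressSize;

   void add(final long messageID, final int size)
   {
      queue.put(Long.valueOf(messageID), Integer.valueOf(size));
      addressSize += size;
   }

   // Replicated delivery: move the reference, but leave addressSize alone.
   void deliverReplicated(final long messageID)
   {
      Integer size = queue.remove(Long.valueOf(messageID));
      if (size == null)
      {
         throw new IllegalStateException("Cannot find ref when replicating delivery " + messageID);
      }
      deliveringRefs.put(Long.valueOf(messageID), size);
   }

   // ACK: the only place where the size is subtracted.
   void acknowledge(final long messageID)
   {
      Integer size = deliveringRefs.remove(Long.valueOf(messageID));
      if (size == null)
      {
         throw new IllegalStateException("Cannot find ref to ack " + messageID);
      }
      addressSize -= size.intValue();
   }

   long getAddressSize()
   {
      return addressSize;
   }
}
```

If deliverReplicated also subtracted the size, the same bytes would be subtracted a second time on ACK, and the backup node would undercount what it holds in memory.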