49 Replies. Latest reply on Oct 26, 2006 10:13 AM by clebert.suconic
      • 15. Re: Client failover redeliveries discussion
        clebert.suconic

        YEaaaaaaaaaahhhhhh Yeaaaaaaaaahhh... (I needed to share my joy somehow :-) )

        I just got server-side failover working, along with client-side recovery (recreating client objects on the new server after a connection failure event), on durable subscriptions. Of course there is still work to do, but I was expecting to have it prototyped by the end of next week, hence my joy.

        Well, keeping the joy aside, let's discuss what I have done so far and what needs to be done:

        I have created a method transferChannel in PagingChannelSupport that receives the old channel's ID. This method transfers all the state from the old node (the ChainID representing the PostOffice) into the current chainID (or current queue).

        Look at the code:

        /**
         * Transfer messages from an old channel to a new channel.
         * This is used during HA failover, when a connection fails and messages need to be transferred to a new node.
         */
        public void transferChannel(long oldchannelID) throws Exception
        {
            log.info("Transferring state from " + oldchannelID + " into " + this.getChannelID());
            synchronized (refLock)
            {
                while (true)
                {
                    // Load the next batch of references still stored for the old channel
                    InitialLoadInfo ili = pm.getInitialReferenceInfos(oldchannelID, fullSize);
                    if (ili.getRefInfos().size() == 0)
                    {
                        break;
                    }

                    log.info("got " + ili.getRefInfos().size() + " references to move");

                    Map refMap = pushReferences(ili);
                    Iterator referencesIterator = ili.getRefInfos().iterator();
                    while (referencesIterator.hasNext())
                    {
                        ReferenceInfo info = (ReferenceInfo) referencesIterator.next();
                        log.info("transferring reference " + info.getMessageId() + " from " + oldchannelID + " into " + this.getChannelID());
                        MessageReference messageReference = (MessageReference) refMap.get(new Long(info.getMessageId()));

                        ///// BIG TODOS:
                        ///// What to do with the transaction here?
                        ///// Do we need to remove from the old channel? (Consider the case of the old server coming back... I guess we should, but we have to check this)
                        pm.addReference(this.getChannelID(), messageReference, null);
                        pm.removeReference(oldchannelID, messageReference, null);
                    }
                }
            }
            log.info("transfer state done");
        }
        
        



        To accomplish this, I have also added a new parameter to SessionEndpoint.createConsumerDelegate(..., oldChainID), through which I send the oldChainID when we are recreating objects.


        For the method itself, I still have to work out how to wrap it in a transaction in a better fashion, and resolve the TODOs in the method. And of course I need to test it better; I will probably get some bugs out of this, but it is a very good start.



        Clebert

        • 16. Re: Client failover redeliveries discussion
          timfox

          Clebert-

          There should be no need to transfer anything from one channel to another.

          You just need to load the old channel on the new node, keeping the same channel id.

          • 17. Re: Client failover redeliveries discussion
            clebert.suconic

             

            "timfox" wrote:
            Clebert-

            There should be no need to transfer anything from one channel to another.

            You just need to load the old channel on the new node, keeping the same channel id.


            It looks like we will have to think a little bit more about failover on durable subscriptions.

            As far as I can tell from the code, the PostOffice will only accept one binding per name. We will be loading a durable subscription from a dead node, where the current node could already have the same subscription (that's why I decided to transfer the contents), but now I realize there might be a problem:

            What if another client is already subscribed to an existing queue? Can we have two clients with the same connectionID/subscriptionName?

            • 18. Re: Client failover redeliveries discussion
              timfox

               

              "clebert.suconic@jboss.com" wrote:


              As far as I can tell from the code, the PostOffice will only accept one binding per name.



              Correct. This also applies to JMS partial queues, not just durable subscriptions.


              We will be loading a durable subscription from a dead node, where the current node could already have the same subscription (that's why I decided to transfer the contents)


              The post office needs to be extended to support more than one local queue with the same name.


              What if another client is already subscribed to an existing queue? Can we have two clients with the same connectionID/subscriptionName?


              We need to cope with this too.

              See, I told you server side failover wasn't trivial :)
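One way to picture the extension discussed above is a registry that keys bindings by node ID as well as by queue name, so that a failed-over queue and a local queue with the same name can coexist. The following is only a minimal sketch under that assumption; `NodeScopedBindings` and its methods are hypothetical names for illustration and are not part of the JBoss Messaging code base.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical registry (illustration only): bindings are keyed by (nodeId, queueName). */
final class NodeScopedBindings<Q> {
    // nodeId -> (queueName -> queue)
    private final Map<Integer, Map<String, Q>> byNode = new HashMap<>();

    /** Binds a queue; the same name may be reused as long as the node ID differs. */
    synchronized void bind(int nodeId, String queueName, Q queue) {
        Map<String, Q> queues = byNode.computeIfAbsent(nodeId, k -> new HashMap<>());
        if (queues.containsKey(queueName)) {
            throw new IllegalArgumentException(
                "Queue " + queueName + " is already bound for node " + nodeId);
        }
        queues.put(queueName, queue);
    }

    /** Returns the queue bound under (nodeId, queueName), or null if there is none. */
    synchronized Q lookup(int nodeId, String queueName) {
        Map<String, Q> queues = byNode.get(nodeId);
        return queues == null ? null : queues.get(queueName);
    }

    /** Removes and returns the queue bound under (nodeId, queueName), or null if there is none. */
    synchronized Q unbind(int nodeId, String queueName) {
        Map<String, Q> queues = byNode.get(nodeId);
        return queues == null ? null : queues.remove(queueName);
    }
}
```

With such a structure, binding "client1.subA" once for node 1 and once for node 2 succeeds, while binding it twice for the same node is rejected.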

              • 19. Re: Client failover redeliveries discussion
                ovidiu.feodorov

                I've checked in a simple failover test that fails. It's too late to look too deeply into it now, but I'll get back to it tomorrow. In the meantime, you could take a look.

                It's ManualClusteringTest.testSimpleFailover()

                • 20. Re: Client failover redeliveries discussion
                  ovidiu.feodorov

                  I also noticed that you don't deal with QueueBrowsers, only producers and consumers. This is perhaps a minor oversight, right?

                  • 21. Re: Client failover redeliveries discussion
                    timfox

                     

                    "clebert.suconic@jboss.com" wrote:


                    As far as I can tell from the code, the PostOffice will only accept one binding per name. We will be loading a durable subscription from a dead node, where the current node could already have the same subscription (that's why I decided to transfer the contents)


                    Another reason why you can't transfer the contents is that it could be very, very slow. There may be tens of millions of messages in the partial queue.

                    It shouldn't be too hard to extend the clustered post office to allow more than one local clustered queue on a cluster router instance.



                    but now I realize there might be a problem:

                    What if another client is already subscribed to an existing queue? Can we have two clients with the same connectionID/subscriptionName?


                    Do you mean client id?

                    Why would this be a problem?

                    • 22. Re: Client failover redeliveries discussion
                      timfox

                       

                      "clebert.suconic@jboss.com" wrote:

                      What if another client is already subscribed to an existing queue? Can we have two clients with the same connectionID/subscriptionName?


                      Now I think I understand what you are asking.

                      Currently the queue name is clientid.subname. In the failover situation, if we suffix the name with a number, then we should be able to tell them apart,

                      e.g. the failover queue would have the name clientid.subname.1.

                      When unsubscribing we can then identify the queue properly.

                      Any new connections should only be added to the original queue, not the failed-over one.

                      We could put in some clever logic that says something like "if the number of messages in the queue < 1000 (or whatever), then merge the partial queues", but this is a nice-to-have.
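The naming scheme described above can be sketched as a small helper. This is only an illustration: `FailoverQueueNames` and its methods are hypothetical, and the sketch assumes that neither the client ID nor the subscription name ends in a numeric ".N" segment of its own (otherwise an original queue name could be mistaken for a failed-over one).

```java
/** Hypothetical helper (illustration only) for the clientid.subname[.N] naming scheme. */
final class FailoverQueueNames {
    private FailoverQueueNames() {}

    /** Name of the original durable subscription queue: clientid.subname. */
    static String original(String clientId, String subName) {
        return clientId + "." + subName;
    }

    /** Name of a failed-over copy: clientid.subname.N, with N >= 1. */
    static String failedOver(String clientId, String subName, int index) {
        if (index < 1) {
            throw new IllegalArgumentException("index must be >= 1");
        }
        return original(clientId, subName) + "." + index;
    }

    /** True if queueName carries a numeric failover suffix for this subscription. */
    static boolean isFailedOver(String queueName, String clientId, String subName) {
        String prefix = original(clientId, subName) + ".";
        return queueName.startsWith(prefix)
            && queueName.substring(prefix.length()).matches("[1-9][0-9]*");
    }

    /** True if queueName is the original queue or any failed-over copy (useful when unsubscribing). */
    static boolean belongsTo(String queueName, String clientId, String subName) {
        return queueName.equals(original(clientId, subName))
            || isFailedOver(queueName, clientId, subName);
    }
}
```

On unsubscribe, `belongsTo` would select both the original queue and its failed-over copies, while new connections would only ever use the name returned by `original`.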




                      • 23. Re: Client failover redeliveries discussion
                        timfox

                        For routing purposes the failed-over queue should be treated like a remote queue.

                        • 24. Re: Client failover redeliveries discussion
                          clebert.suconic

                           

                          "ovidiu.feodorov@jboss.com" wrote:
                          I also noticed that you don't deal with QueueBrowsers, only producers and consumers. This is perhaps a minor oversight, right?


                          It's just not done yet. I will finish consumers first, then I will move to QueueBrowsers. Most of the logic I'm writing will be reused on QueueBrowsers.

                          • 25. Re: Client failover redeliveries discussion
                            clebert.suconic

                            At this point I have changed the PostOffice interface so that its methods accept the nodeID.

                            This code is only on my laptop for now. I won't commit it until I can test it.

                            public interface PostOffice extends MessagingComponent
                            {
                                Binding bindQueue(String condition, Queue queue) throws Exception;
                                Binding bindQueue(int nodeID, String condition, Queue queue) throws Exception;

                                Binding unbindQueue(String queueName) throws Throwable;
                                Binding unbindQueue(int nodeID, String queueName) throws Throwable;

                                /**
                                 * List the bindings that match the specified condition
                                 * @param condition
                                 * @return
                                 * @throws Exception
                                 */
                                Collection listBindingsForCondition(String condition) throws Exception;

                                /**
                                 * Get the binding for the specified queue name
                                 * @param queueName
                                 * @return
                                 * @throws Exception
                                 */
                                Binding getBindingForQueueName(int nodeID, String queueName) throws Exception;
                                Binding getBindingForQueueName(String queueName) throws Exception;

                                /**
                                 * Route a reference.
                                 * @param ref
                                 * @param condition The message will be routed to a queue if specified condition matches the condition
                                 * of the binding
                                 *
                                 * @param tx The transaction or null if not in the context of a transaction
                                 * @return true if ref was accepted by at least one queue
                                 * @throws Exception
                                 */
                                boolean route(MessageReference ref, String condition, Transaction tx) throws Exception;

                                boolean isLocal();
                            }
                            


                            I have also changed ConnectionState to keep its original nodeID, so that in case of failure it opens the ConnectionDelegate on the current node while passing the original (different) nodeID. I will have to make some changes (such as not throwing an exception if the nodeID is not the current nodeID), and I will be testing for implications at runtime.

                            • 26. Re: Client failover redeliveries discussion
                              clebert.suconic

                               

                              "Tim" wrote:
                              We could put in some clever logic that says something like "if the number of messages in the queue < 1000 (or whatever), then merge the partial queues", but this is a nice-to-have.


                              I guess this is hard to do, as you might have two clients with the same ID on the same node. I guess we will need to have the PartialQueue working on the same node, maybe using this newly introduced signature on the PostOffice. I don't know yet.

                              • 27. Re: Client failover redeliveries discussion
                                timfox

                                I don't understand why you need a new method on the post office (the bind with the node id), can you shed some light....?

                                On failover, it's the post office that will detect the failure using JGroups, so it will know what has failed over.

                                It will also know which queues it needs to take over responsibility for, since it already has this list of queues internally.

                                Therefore all it needs to do is this: for each of the remote queue stubs corresponding to the queues for which it will take over responsibility, replace the stub with a local clustered queue and call load() on it.

                                When reconnecting consumers to the failed-over queue, I agree you will need to add a new getBindingForQueueName() method that takes a node id.

                                However this only needs to be on the ClusteredPostOffice interface, not the post office interface.
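The takeover step described above can be sketched as follows. This is a simplified illustration, not the actual clustered post office: `SketchQueue`, `RemoteStub`, `LocalQueue` and `TakeoverSketch` are hypothetical types, failure detection (JGroups in the real system) is left out, and `load()` only sets a flag where the real queue would load its persisted state. The local queue keeps the channel ID of the stub it replaces, in line with the earlier suggestion to load the old channel on the new node under the same channel ID.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

/** Hypothetical minimal queue abstraction (illustration only). */
interface SketchQueue {
    long channelId();
}

/** Stand-in for a remote queue stub that refers to a queue hosted on another node. */
final class RemoteStub implements SketchQueue {
    private final long channelId;
    RemoteStub(long channelId) { this.channelId = channelId; }
    public long channelId() { return channelId; }
}

/** Stand-in for a local clustered queue; load() only records that loading happened. */
final class LocalQueue implements SketchQueue {
    private final long channelId;
    private boolean loaded;
    LocalQueue(long channelId) { this.channelId = channelId; }
    public long channelId() { return channelId; }
    void load() { loaded = true; }
    boolean isLoaded() { return loaded; }
}

final class TakeoverSketch {
    private TakeoverSketch() {}

    /**
     * Replaces every remote stub in the failed node's bindings (queueName -> queue)
     * with a loaded local queue that keeps the same channel ID.
     */
    static List<LocalQueue> takeOver(Map<String, SketchQueue> bindingsOfFailedNode) {
        List<LocalQueue> taken = new ArrayList<>();
        for (Map.Entry<String, SketchQueue> entry : bindingsOfFailedNode.entrySet()) {
            SketchQueue queue = entry.getValue();
            if (queue instanceof RemoteStub) {
                LocalQueue local = new LocalQueue(queue.channelId());
                local.load();
                entry.setValue(local); // replaces the value in place, no structural change
                taken.add(local);
            }
        }
        return taken;
    }
}
```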

                                • 28. Re: Client failover redeliveries discussion
                                  clebert.suconic

                                   

                                  "Tim Fox" wrote:
                                  I don't understand why you need a new method on the post office (the bind with the node id), can you shed some light....?

                                  On failover, it's the post office that will detect the failure using JGroups, so it will know what has failed over.


                                  I'm doing the failover from the client side. The client detects the socket failure and opens a new connection to a new server. When the connection is opened, the original nodeID is also sent, and the queue from the old server is then made active on the current server.

                                  "Tim Fox" wrote:
                                  Therefore all it needs to do is this: for each of the remote queue stubs corresponding to the queues for which it will take over responsibility, replace the stub with a local clustered queue and call load() on it.


                                  Consider the case where the client also died with the server, for example a MessageDrivenBean located in the same VM as the JMS server. In that case all we have to do is wait for the client to come back alive; we don't need to fail over that queue.

                                  Why am I saying that? Because I think it would make sense to fail over only as needed, with the client deciding where to fail over to. I'm just prototyping at this point.

                                  At this point I'm having to deal with the PostOffice interface as well, because in a failover event non-clustered queues will be reconnected too.
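The client-driven approach described above relies on the client remembering which node it was originally connected to. A minimal sketch of that bookkeeping, assuming a hypothetical `ConnectionStateSketch` class (the real ConnectionState is more involved, and the reconnect itself is not shown):

```java
/** Hypothetical client-side state (illustration only): remembers the original node across failovers. */
final class ConnectionStateSketch {
    private final int originalNodeId;
    private int currentNodeId;

    ConnectionStateSketch(int nodeId) {
        this.originalNodeId = nodeId;
        this.currentNodeId = nodeId;
    }

    /**
     * Records a reconnect to another node after a socket failure and returns the
     * node ID to send along with the new connection request, i.e. the original one.
     */
    int reconnectTo(int newNodeId) {
        if (newNodeId == currentNodeId) {
            throw new IllegalArgumentException("Already connected to node " + newNodeId);
        }
        currentNodeId = newNodeId;
        return originalNodeId;
    }

    int getOriginalNodeId() { return originalNodeId; }

    int getCurrentNodeId() { return currentNodeId; }

    /** True once the connection lives on a node other than the one it started on. */
    boolean hasFailedOver() { return currentNodeId != originalNodeId; }
}
```

The server receiving the reconnect would use the returned original node ID to decide which queue to make active, which is the part the rest of the thread debates.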

                                  • 29. Re: Client failover redeliveries discussion
                                    timfox

                                     

                                    "clebert.suconic@jboss.com" wrote:

                                    I'm doing the failover from the client side. The client detects the socket failure and opens a new connection to a new server. When the connection is opened, the original nodeID is also sent, and the queue from the old server is then made active on the current server.


                                    How can you ensure that different clients that use the same queue fail over on to the same node?

                                    There are other issues here. When a durable subscription is attached to a topic, then the subscription must retain all messages *even if it isn't active* - this is a requirement of the JMS spec.

                                    If you only load the durable sub when requested to by a client, then you're going to lose messages when doing in-memory persistent replication.

                                    Similar reasoning applies to queues.



                                    At this point I'm having to deal with the PostOffice interface as well, because in a failover event non-clustered queues will be reconnected too.


                                    Non-clustered queues don't need to be reconnected. If queues want to benefit from HA, then they should be made clustered.

                                    Personally I don't think you can drive the failover fully from the client side.