11 Replies Latest reply on Jan 13, 2007 10:48 PM by Ovidiu Feodorov

    Failover refactoring

    Ovidiu Feodorov Master

      1. The clustering/failover support has been refactored as follows:

      There is a new PER_INSTANCE ClusteringAspect part of ClusteredConnectionFactory's stack, which encapsulates load balancing and failover node picking policies. Each client-side delegate is guarded by a PER_INSTANCE FailoverValveInterceptor instance. The valve is normally open. On failover, all valves under a connection are closed recursively. If there are active threads through valve on closing, the current behavior is that closeValve() waits until all active threads unwrap (I could change this if necessary). Once the valve is closed, no threads are allowed through it until is explicitly opened again. The client-side failover event (which includes operating the valves, among other things) is detected and controlled by PER_VM FailoverAspect.

      2. ClientClusteredConnectionFactoryDelegate doesn't extend ClientConnectionFactoryDelegate anymore. Also, I introduced two distinct aspect stacks corresponding to clustered ConnectionFactories and non-clustered ConnectionFactories.

      3. One can register a FailoverListener to a JBossConnection and so to be notified when a failover starts and ends. So far, known events are FAILURE_DETECTED, FAILOVER_STARTED, FAILOVER_COMPLETED, FAILOVER_FAILED.

      4. The initialization of the AOP stack is done by a specialized component (ClientAOPStackLoader). Both ConnectionDelegate and ClusteredConnectionDelegate delegate to it. getAOPStackConfig() is not part of the ConnectionFactory interface anymore.

      Some tests fail, but I will follow up on that soon.

      I added a boatload of failover mop-up tasks, grouped under http://jira.jboss.org/jira/browse/JBMESSAGING-706

        • 1. Re: Failover refactoring
          Tim Fox Master

          Looks nice

          • 2. Re: Failover refactoring
            Clebert Suconic Master

            Why did you remove testSimpleFailover from FailoverTest?

            This test killed the server, and performed an invocation while the server is dead. Something would intercept an exception from the communication layer and activate failover from that exception.

            I reactivated the test on my local copy, and the test fails.

            • 3. Re: Failover refactoring
              Clebert Suconic Master

              I just want to add two points that were also part of the refactoring.

              - performFailover was made part of ClientConnectionDelegate (public method)
              - There is one instance of the Valve per Delegate. That means when a failure happens we will have to close multiple valves.

              • 4. Re: Failover refactoring
                Ovidiu Feodorov Master

                New stuff:

                Messages that come over a failed-over connection are now preferentially routed to their corresponding failover queues (and not the other local queues, as it was the case so far)

                • 5. Re: Failover refactoring
                  Clebert Suconic Master

                  > (and not the other local queues, as it was the case so far)

                  We had routers taking care of this, balancing between local queues and failedover queues. Did you change this policy?

                  • 6. Re: Failover refactoring
                    Clebert Suconic Master

                    I had this conversation in private with Ovidiu, and I wanted to make a public post about this.

                    I'm kind of concerned about the fact we are using multiple valves (instead of a single instance per connection).

                    The way is implemented now, you have a Valve on every descendant of Connection (Sessions, Producers, Consumers and Browsers).

                    When we get a failure, we have to close every single valve before performing a failover, i.e. The lock is not atomic and you have multiple locks.

                    The problem is when you place a try..catch(IOException) on the ValveIntercetor to capture failure exceptions. In cases where you have multiple threads using the same Connection (multiple sessions/threads on the same Connection). As the lock should be aways atomic (close every valve for the failed Connection), you can have multiple Threads trying to close the whole hierarchy of valves.

                    Lets talk in examples:

                    - You have one Connection and 50 Threads/50 Sessions. Suppose all these 50 Threads are communicating at the same time.

                    - We kill the server, all of the 50 threads are going to catch an IOException at the same time.

                    - As all of the 50 threads are going to catch the exception at the same time... All of them are going to close the valve at the same time. Valve close is a loop on the hierarchy:

                     // from HierarchicalStateSupport
                     public void closeChildrensValves() throws Exception
                     {
                     if (children == null)
                     {
                     return;
                     }
                    
                     for(Iterator i = children.iterator(); i.hasNext(); )
                     {
                     HierarchicalState s = (HierarchicalState)i.next();
                     ((Valve)s.getDelegate()).closeValve();
                     }
                     }
                    
                     public void openChildrensValves() throws Exception
                     {
                     if (children == null)
                     {
                     return;
                     }
                    
                     for(Iterator i = children.iterator(); i.hasNext(); )
                     {
                     HierarchicalState s = (HierarchicalState)i.next();
                     ((Valve)s.getDelegate()).openValve();
                     }
                     }
                    


                    I belive we should use a single Valve instance, the way it was done before.
                    The only reason for this refactoring was the fact I was using a single instance but using AOP API to install the Valve on the right place with the proper instantiation. This could be refactored in another way such as delegating the Valve to an external object on the ValveAspect (which was going to be renamed to FailureInterceptor).

                    I'm playing with another Branch and I have already refactored it to use a single instance of Valve, and have installed the try...catch(IOException) in place. I'm only using standard XML config.

                    • 7. Re: Failover refactoring
                      Clebert Suconic Master

                      I talked to Kabir. We could control instantiation of interceptors by using Factories (witch is a straighforward and documented way), but getInstance is not exposed.

                      So, I requested a feature that will be useful on future releases.

                      http://jira.jboss.org/jira/browse/JBAOP-338

                      I won't use such feature now as we will stick with delegates or with the muliple valves. (something we will have to decide later), but I think the feature request is valid.

                      • 8. Re: Failover refactoring
                        Ovidiu Feodorov Master

                         

                        "clebert.suconic@jboss.com" wrote:
                        > (and not the other local queues, as it was the case so far)

                        We had routers taking care of this, balancing between local queues and failedover queues. Did you change this policy?


                        No. This comes in top of the message pull policy, that it still in effect (well, right now it isn't since we're using a NullMessagePullPolicy, but if we were using one, it would be in effect).

                        These behaviors complement each other.

                        And by thew way, this reminds me that we should also test the pull policy, currently we have 0 tests for it.

                        • 9. Re: Failover refactoring
                          Tim Fox Master

                           

                          "ovidiu.feodorov@jboss.com" wrote:

                          And by thew way, this reminds me that we should also test the pull policy, currently we have 0 tests for it.


                          The pull functionality is tested in the core clustering tests, but not in the jms clustering tests.

                          • 10. Re: Failover refactoring
                            Ovidiu Feodorov Master

                            OK. We still need jms tests for it.

                            • 11. Re: Failover refactoring
                              Ovidiu Feodorov Master

                              I've added one: org.jboss.test.messaging.jms.clustering.DistributedQueueTest.testMessageRedistributionAmongNodes(). Of course we could add an infinite number of more sophisticated cases, but at least the basic functionality is tested.