11 Replies Latest reply on Jan 13, 2007 10:48 PM by ovidiu.feodorov

    Failover refactoring

    ovidiu.feodorov

      1. The clustering/failover support has been refactored as follows:

      There is a new PER_INSTANCE ClusteringAspect as part of the ClusteredConnectionFactory stack; it encapsulates the load-balancing and failover node-picking policies. Each client-side delegate is guarded by a PER_INSTANCE FailoverValveInterceptor instance. The valve is normally open. On failover, all valves under a connection are closed recursively. If there are active threads inside a valve when it closes, the current behavior is that closeValve() waits until all active threads unwrap (I could change this if necessary). Once the valve is closed, no threads are allowed through it until it is explicitly opened again. The client-side failover event (which includes operating the valves, among other things) is detected and controlled by a PER_VM FailoverAspect.
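      As a rough illustration of the valve semantics described above (normally open, closeValve() waiting for active threads to unwrap, no entry until explicitly reopened), here is a minimal self-contained sketch; the class and method names are mine, not the actual FailoverValveInterceptor code:

```java
// Illustrative sketch only -- not the real FailoverValveInterceptor.
public class ValveSketch
{
   private boolean open = true;       // valves start open
   private int activeThreads = 0;     // invocations currently inside the valve

   // A thread enters the valve before each invocation; it blocks while the
   // valve is closed.
   public synchronized void enter() throws InterruptedException
   {
      while (!open)
      {
         wait();                      // parked here until the valve reopens
      }
      activeThreads++;
   }

   // A thread "unwraps" after its invocation completes.
   public synchronized void leave()
   {
      activeThreads--;
      notifyAll();
   }

   // Closing flips the valve shut, then waits for all active threads to
   // unwrap, matching the closeValve() behavior described above.
   public synchronized void closeValve() throws InterruptedException
   {
      open = false;
      while (activeThreads > 0)
      {
         wait();
      }
   }

   public synchronized void openValve()
   {
      open = true;
      notifyAll();
   }

   public synchronized boolean isOpen()
   {
      return open;
   }
}
```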

      2. ClientClusteredConnectionFactoryDelegate doesn't extend ClientConnectionFactoryDelegate anymore. Also, I introduced two distinct aspect stacks, one for clustered ConnectionFactories and one for non-clustered ConnectionFactories.

      3. One can register a FailoverListener with a JBossConnection, to be notified when a failover starts and ends. The known events so far are FAILURE_DETECTED, FAILOVER_STARTED, FAILOVER_COMPLETED and FAILOVER_FAILED.
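      A sketch of how such a listener mechanism might work. The four event names come from the post, but the listener interface, its callback signature and the registration method shown here are assumptions, not the actual JBossConnection API:

```java
// Illustrative stand-in for the connection-side listener machinery.
import java.util.ArrayList;
import java.util.List;

interface FailoverListener
{
   // Assumed callback name; the real interface may differ.
   void failoverEventOccured(int eventType);
}

public class FailoverEvents
{
   // Event types named in the post
   public static final int FAILURE_DETECTED = 0;
   public static final int FAILOVER_STARTED = 1;
   public static final int FAILOVER_COMPLETED = 2;
   public static final int FAILOVER_FAILED = 3;

   private final List<FailoverListener> listeners =
      new ArrayList<FailoverListener>();

   public void registerFailoverListener(FailoverListener l)
   {
      listeners.add(l);
   }

   // Called by the (hypothetical) failover machinery as the event unfolds
   public void broadcast(int eventType)
   {
      for (FailoverListener l : listeners)
      {
         l.failoverEventOccured(eventType);
      }
   }
}
```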

      4. The initialization of the AOP stack is done by a specialized component (ClientAOPStackLoader). Both ConnectionDelegate and ClusteredConnectionDelegate delegate to it. getAOPStackConfig() is no longer part of the ConnectionFactory interface.
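      The loader pattern described above (one shared component that both delegate types call, loading the stack at most once) might look roughly like this; the class is an illustrative stand-in, not the actual ClientAOPStackLoader:

```java
// Illustrative load-once singleton, not the real ClientAOPStackLoader.
public class StackLoaderSketch
{
   private static StackLoaderSketch instance;

   private boolean loaded = false;

   public static synchronized StackLoaderSketch getInstance()
   {
      if (instance == null)
      {
         instance = new StackLoaderSketch();
      }
      return instance;
   }

   // Both delegate types call this; the stack is only loaded once, so
   // repeated calls are harmless.
   public synchronized void load()
   {
      if (loaded)
      {
         return;
      }
      // ... deploy the client-side AOP configuration here ...
      loaded = true;
   }

   public synchronized boolean isLoaded()
   {
      return loaded;
   }
}
```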

      Some tests fail, but I will follow up on that soon.

      I added a boatload of failover mop-up tasks, grouped under http://jira.jboss.org/jira/browse/JBMESSAGING-706

        • 1. Re: Failover refactoring
          timfox

          Looks nice

          • 2. Re: Failover refactoring
            clebert.suconic

            Why did you remove testSimpleFailover from FailoverTest?

            This test killed the server and then performed an invocation while the server was dead; something would intercept the resulting exception from the communication layer and activate failover from that exception.

            I reactivated the test on my local copy, and the test fails.

            • 3. Re: Failover refactoring
              clebert.suconic

              I just want to add two points that were also part of the refactoring.

              - performFailover was made part of ClientConnectionDelegate (public method)
              - There is one instance of the Valve per Delegate, which means that when a failure happens we will have to close multiple valves.

              • 4. Re: Failover refactoring
                ovidiu.feodorov

                New stuff:

                Messages that come over a failed-over connection are now preferentially routed to their corresponding failover queues (and not to the other local queues, as was the case so far).

                • 5. Re: Failover refactoring
                  clebert.suconic

                  > (and not the other local queues, as it was the case so far)

                  We had routers taking care of this, balancing between local queues and failed-over queues. Did you change this policy?

                  • 6. Re: Failover refactoring
                    clebert.suconic

                    I had this conversation in private with Ovidiu, and I wanted to make a public post about this.

                    I'm somewhat concerned about the fact that we are using multiple valves (instead of a single instance per connection).

                    The way it is implemented now, there is a Valve on every descendant of Connection (Sessions, Producers, Consumers and Browsers).

                    When we get a failure, we have to close every single valve before performing a failover, i.e. the lock is not atomic and we have multiple locks.

                    The problem appears when you place a try..catch(IOException) in the ValveInterceptor to capture failure exceptions, in cases where multiple threads are using the same Connection (multiple sessions/threads on the same Connection). Since the close should always be atomic (close every valve of the failed Connection), you can end up with multiple threads each trying to close the whole hierarchy of valves.

                    Let's talk in examples:

                    - You have one Connection and 50 Threads/50 Sessions. Suppose all these 50 Threads are communicating at the same time.

                    - We kill the server, all of the 50 threads are going to catch an IOException at the same time.

                    - As all 50 threads catch the exception at the same time, all of them try to close the valve at the same time. Closing the valve is a loop over the hierarchy:

                     // from HierarchicalStateSupport

                     public void closeChildrensValves() throws Exception
                     {
                        if (children == null)
                        {
                           return;
                        }

                        for (Iterator i = children.iterator(); i.hasNext(); )
                        {
                           HierarchicalState s = (HierarchicalState)i.next();
                           ((Valve)s.getDelegate()).closeValve();
                        }
                     }

                     public void openChildrensValves() throws Exception
                     {
                        if (children == null)
                        {
                           return;
                        }

                        for (Iterator i = children.iterator(); i.hasNext(); )
                        {
                           HierarchicalState s = (HierarchicalState)i.next();
                           ((Valve)s.getDelegate()).openValve();
                        }
                     }
                    


                    I believe we should use a single Valve instance, the way it was done before.
                    The only reason for this refactoring was that I was using a single instance, but using the AOP API to install the Valve in the right place with the proper instantiation. This could be refactored another way, such as delegating the Valve to an external object from the ValveAspect (which was going to be renamed FailureInterceptor).

                    I'm playing with another branch where I have already refactored this to use a single Valve instance and have installed the try..catch(IOException) in place, using only standard XML config.
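                    A single shared valve also makes the "first thread wins" decision trivial: all 50 threads race on one lock instead of each walking the hierarchy. A minimal sketch of that idea (the names are illustrative, not the branch's actual code):

```java
// Illustrative single-valve-per-connection sketch, not actual JBoss code.
public class SharedValveSketch
{
   private boolean open = true;
   private boolean failoverInProgress = false;

   // Every interceptor that catches an IOException calls this; only the
   // first caller wins the race and should drive the failover.
   public synchronized boolean tryClose()
   {
      if (failoverInProgress || !open)
      {
         return false;            // somebody else is already handling it
      }
      open = false;
      failoverInProgress = true;
      return true;
   }

   // Called once failover completes.
   public synchronized void reopen()
   {
      open = true;
      failoverInProgress = false;
   }

   public synchronized boolean isOpen()
   {
      return open;
   }
}
```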

                    • 7. Re: Failover refactoring
                      clebert.suconic

                      I talked to Kabir. We could control instantiation of interceptors by using factories (which is a straightforward and documented way), but getInstance is not exposed.

                      So, I requested a feature that will be useful on future releases.

                      http://jira.jboss.org/jira/browse/JBAOP-338

                      I won't use this feature now, as we will stick with delegates or with the multiple valves (something we will have to decide later), but I think the feature request is valid.

                      • 8. Re: Failover refactoring
                        ovidiu.feodorov

                         

                        "clebert.suconic@jboss.com" wrote:
                        > (and not the other local queues, as it was the case so far)

                        We had routers taking care of this, balancing between local queues and failed-over queues. Did you change this policy?


                        No. This comes on top of the message pull policy, which is still in effect (well, right now it isn't, since we're using a NullMessagePullPolicy, but if we were using a real one, it would be).

                        These behaviors complement each other.

                        And by the way, this reminds me that we should also test the pull policy; currently we have zero tests for it.

                        • 9. Re: Failover refactoring
                          timfox

                           

                          "ovidiu.feodorov@jboss.com" wrote:

                          And by the way, this reminds me that we should also test the pull policy; currently we have zero tests for it.


                          The pull functionality is tested in the core clustering tests, but not in the jms clustering tests.

                          • 10. Re: Failover refactoring
                            ovidiu.feodorov

                            OK. We still need jms tests for it.

                            • 11. Re: Failover refactoring
                              ovidiu.feodorov

                              I've added one: org.jboss.test.messaging.jms.clustering.DistributedQueueTest.testMessageRedistributionAmongNodes(). Of course we could add an infinite number of more sophisticated cases, but at least the basic functionality is tested.