10 Replies Latest reply on Feb 8, 2007 8:51 AM by smarlow

    JBAS-2520 - HASingleton stopping problem

        • 1. Re: JBAS-2520 - HASingleton stopping problem

          "
          Here's how I reproduced the bug.

          1. Create the following classes

          ----- interface hatest.MyHASingletonServiceMBean -----
          package hatest;

          import org.jboss.ha.singleton.HASingletonMBean;

          public interface MyHASingletonServiceMBean extends HASingletonMBean {
              // nothing to add
          }
          ----------

          ----- class hatest.MyHASingletonService -----
          package hatest;

          import org.jboss.ha.singleton.HASingletonSupport;
          import org.jboss.logging.Logger;

          public class MyHASingletonService extends HASingletonSupport implements MyHASingletonServiceMBean {
              private static final Logger logger = Logger.getLogger(MyHASingletonService.class);

              public void startSingleton() {
                  logger.info("I am the Master!");
              }

              public void stopSingleton() {
                  logger.info("I am no longer the Master.");
                  throw new RuntimeException("I don't want to die!");
              }
          }
          ----------

          2. Package the classes in a SAR with the following jboss-service.xml

          ----- ha-test.sar/META-INF/jboss-service.xml -----
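          <?xml version="1.0" encoding="UTF-8"?>
          <server>
            <mbean code="hatest.MyHASingletonService" name="hatest:service=MyHASingletonService">
              <depends>jboss:service=${jboss.partition.name:DefaultPartition}</depends>
            </mbean>
          </server>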
          ----------

          3. Deploy to JBoss

          15:47:02,978 INFO [MyHASingletonService] I am the Master!

          4. Redeploy the SAR by touching META-INF/jboss-service.xml

          15:47:18,168 INFO [MyHASingletonService] I am no longer the Master.
          15:47:18,170 WARN [MyHASingletonService] Stopping failed hatest:service=MyHASingletonService
          java.lang.RuntimeException: I don't want to die!
          at hatest.MyHASingletonService.stopSingleton(MyHASingletonService.java:15)
          ...
          15:47:18,348 WARN [ServiceController] Problem starting service hatest:service=MyHASingletonService
          java.lang.NullPointerException
          at org.jboss.ha.jmx.HAServiceMBeanSupport.getServiceHAName(HAServiceMBeanSupport.java:361)
          at org.jboss.ha.jmx.HAServiceMBeanSupport$1.replicantsChanged(HAServiceMBeanSupport.java:195)
          ...

          Comment by Mirko Nasato [03/Dec/05 11:08 AM]
          Attached the zipped ha-test.sar used to reproduce the bug.

          Comment by Mirko Nasato [03/Dec/05 11:12 AM]
          Regular (non-HA-singleton) MBeans do not have this problem, i.e. they are redeployed correctly even if they throw an exception when undeploying.

          Comment by Mirko Nasato [03/Dec/05 11:51 AM]
          In the real-world situation we weren't throwing a RuntimeException on purpose, of course. A ClassCastException was generated because of another problem: a JNDI object was replaced by another one in a different ClassLoader by NonSerializableFactory after another EAR was deployed.

          This bug effectively turned what was supposed to be a "high availability" service into a "zero availability" one.

          Comment by Scott Marlow [05/Dec/05 08:51 AM]
          This is a 50/50 problem. If the stop (_stopOldMaster()) fails, should the operation continue? There are probably some cases where the answer would be yes and some where it would be no.

          If we change this for the 4.0.4 release to catch the exception, log it and resume starting the new master (makeThisNodeMaster()), we would return from _stopOldMaster() not knowing whether the old singleton has stopped or not.

          I'll go ahead and make the change as it should help in the case that you hit.
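A minimal sketch of the control flow described above (catch the stop failure, log it, then carry on starting the new master). The class and its fields are simplified stand-ins, not the actual HASingletonSupport source; only the names _stopOldMaster() and makeThisNodeMaster() come from the comment itself.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for the failover step; NOT the real HASingletonSupport code.
class FailoverSketch {
    final List<String> log = new ArrayList<>();
    boolean master = false;

    // Stand-in for _stopOldMaster(): fails the same way the reproduction case does.
    void stopOldMaster() {
        throw new RuntimeException("I don't want to die!");
    }

    // Stand-in for makeThisNodeMaster(): marks this node as running the singleton.
    void makeThisNodeMaster() {
        master = true;
        log.add("started new master");
    }

    // Catch the stop failure, log it, and still start the new master.
    void failover() {
        try {
            stopOldMaster();
        } catch (RuntimeException e) {
            // At this point we do not know whether the old singleton really stopped.
            log.add("stop failed: " + e.getMessage());
        }
        makeThisNodeMaster();
    }
}
```

The trade-off is visible in the catch block: the new master always starts, but nothing here can tell whether the old one is still running.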

          Comment by Scott Marlow [05/Dec/05 10:09 AM]
          As noted in my comments, this doesn't completely solve the problem. The root cause of the exception still needs to be solved, as it's unknown whether the singleton stopped or not.

          Comment by Scott Marlow [05/Dec/05 10:10 AM]
          The code change is in head and 4.0.4
          "

          • 2. Re: JBAS-2520 - HASingleton stopping problem

            I think this fix is correct.

            But we should discuss the problem FIRST rather than just hacking a workaround
            for this particular use case.

            • 3. Re: JBAS-2520 - HASingleton stopping problem

              I'll explain my particular case in even more detail, so you can see if it can apply to others as well.

              The first time the HASingleton threw an exception when stopping was when one node was shut down (because of other problems with that node). The second node (we have only 2 nodes in the cluster at the moment) refused to become the master:

               2005-11-18 12:14:19,721 ERROR [ourapp.HASingletonScheduledService] _stopOldMaster failed. New master singleton will not start.
              


              In this case not starting the new master was clearly not the best choice, because the old master had certainly been stopped despite the exception: the whole application server was being shut down.

              After the problem occurred we tried redeploying the EAR containing the service to try and restore the service without affecting the other EARs running in the same appserver, but each time we got that NPE in HAServiceMBeanSupport.getServiceHAName().

              So eventually we had to bring down both JBoss nodes, which means an outage in all the applications deployed in that cluster, just to have that single HASingleton service start up again.

              I agree that in other cases it may not be a good idea to start the new master if the old one failed to stop because you could end up having the HASingleton service running on more than one node.

              But I think this is somewhat less likely to happen. As in our case, the service may be torn down anyway because it's being stopped as part of a server shutdown, or because it throws an exception while trying to close a resource that's already been closed, so it's effectively already stopped.

              And if it does happen, it's much easier to fix the situation. If your service is running on 2 nodes when it shouldn't, you can just stop the HASingleton on one node using the JMX console as a temporary measure, or restart one JBoss node so that when it comes up again it's in a clean state. The situation we ended up in required the whole cluster to be shut down and restarted, which is far worse.

              Well this is my biased point of view anyway ;-)

              Thanks

              Mirko


              • 4. Re: JBAS-2520 - HASingleton stopping problem

                So there are really two questions for me:

                1) Why does this fail in the first place?

                The lifecycle needs to be fixed to avoid the NPE.
                Indeed, what is wrong with the lifecycle that the MBean has no name?

                2) When it does fail what is the recovery?

                Clearly there is something going wrong if we are the master and we cannot
                stop ourselves.

                But we are probably stopping ourselves for a reason.
                With the exception of badly written subclasses (we can't do
                anything about bugs in users' create/start/stop/destroyService),
                we should be able to stop and restart regardless of errors.

                We need to differentiate the problems:

                a) We cannot stop ourselves
                b) We cannot start the new master on a different node because we left the cluster or there is no cluster
                c) We cannot start the new master on a different node because of some other error
                reported by the new master

                and consider how to recover from them and report the underlying issue.

                • 5. Re: JBAS-2520 - HASingleton stopping problem
                  smarlow

                  I agree that the root cause of the failure should be solved. Each occurrence of a failure is a separate issue from the HASingleton itself failing.

                  My take on recovery:

                  >a) We cannot stop ourselves

                  Send a message to an event listener indicating that we cannot stop ourselves (this might send an email or a beeper notification). Let user policy (code or configuration) determine whether we should {terminate server process, ignore error, try again}. The default action could be ignore.

                  > b) We cannot start the new master on a different node because we left the cluster or there is no cluster

                  I think that the current master will attempt to stop itself when HASingletonSupport.partitionTopologyChanged() is invoked. We could send a message to an event listener indicating that we left the cluster or that there is no cluster. I may be reading the current code wrong, but it looks like the remaining cluster will elect a new master (need to verify).

                  Let user policy (code or configuration) determine whether we should {terminate server process, ignore error, try again}. The default action could be ignore.

                  >c) We cannot start the new master on a different node because of some other error reported by the new master

                  We could send a message to an event listener on the different node indicating that it failed to become master. Let user policy (code or configuration) determine whether we should {terminate server process, ignore error, try again}. The default action could be terminate server process, so that a new master is chosen.
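One way the policy idea in (a) to (c) could be sketched is shown below. Every name here (SingletonFailure, RecoveryAction, FailurePolicy) is hypothetical and does not exist in JBoss; the defaults simply mirror the ones suggested above.

```java
import java.util.EnumMap;
import java.util.Map;

// Hypothetical names throughout; none of these types exist in JBoss.
enum SingletonFailure { CANNOT_STOP, LEFT_CLUSTER, NEW_MASTER_START_FAILED }

enum RecoveryAction { TERMINATE_SERVER, IGNORE, RETRY }

class FailurePolicy {
    private final Map<SingletonFailure, RecoveryAction> actions =
            new EnumMap<>(SingletonFailure.class);

    FailurePolicy() {
        // Defaults as proposed in the post: ignore for (a) and (b), terminate for (c).
        actions.put(SingletonFailure.CANNOT_STOP, RecoveryAction.IGNORE);
        actions.put(SingletonFailure.LEFT_CLUSTER, RecoveryAction.IGNORE);
        actions.put(SingletonFailure.NEW_MASTER_START_FAILED, RecoveryAction.TERMINATE_SERVER);
    }

    // Lets user code or configuration replace the default for one failure kind.
    void setAction(SingletonFailure failure, RecoveryAction action) {
        actions.put(failure, action);
    }

    // Returns the action currently configured for the given failure kind.
    RecoveryAction actionFor(SingletonFailure failure) {
        return actions.get(failure);
    }
}
```

An event listener would look up actionFor(...) when it receives a failure notification and act on the result; how the notification is delivered is left open here.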

                  A nice thing would be if cluster management failures were defined as an aspect that could be handled consistently across the board. The problem that I am thinking of is how we deal with failures across the board: do we handle the errors manually, or inject handlers that deal with varying qualities of service? Or perhaps I should be asking whether we should wait until we switch to using AOP before attempting across-the-board handling of failures.

                  • 6. Re: JBAS-2520 - HASingleton stopping problem
                    smarlow

                     

                    Scott, can you take a look at this in conjunction with what Alex did on configuring the HASingleton election policy and try to determine if solving this will require some change that breaks interoperability between releases (as opposed to just introducing a new feature). If it will break interoperability, please reschedule to 5.0.0.Beta. Otherwise, let's do this for 5.0.1.CR1.


                    I don't see an interoperability issue here.

                    From a cluster management point of view, it would be nice to apply a generic solution for handling cluster errors such as a SingletonService stop failure. The handling code, as previously suggested, could deal with solving the problem or notifying someone who can deal with it (email/pager/SNMP trap). Otherwise, we can continue to log the error as we do now and continue execution (starting the new singleton) in the hope that something might work. Of course, users can catch the exception directly in their code and do something better (if there is a better thing to do).

                    Should we invite someone from the JBoss ON team to give input on this (cluster management) issue?



                    • 7. Re: JBAS-2520 - HASingleton stopping problem
                      smarlow

                      Alex and I discussed some of our options (using a policy driven approach versus a listener.)

                      It seems to me now that users can already detect the error in their implementation of stopSingleton by surrounding the logic with a try {} catch(java.lang.Throwable). So, I don't think we need to do anything else.
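A minimal sketch of that try/catch approach. GuardedSingletonService and releaseResources() are hypothetical stand-ins so the example compiles on its own; a real service would extend org.jboss.ha.singleton.HASingletonSupport and log through its own logger.

```java
// Hypothetical stand-in; a real service would extend HASingletonSupport.
class GuardedSingletonService {
    boolean cleanupFailed = false;

    // Hypothetical cleanup step that fails, e.g. closing an already closed resource.
    void releaseResources() {
        throw new IllegalStateException("resource already closed");
    }

    public void stopSingleton() {
        try {
            releaseResources();
        } catch (Throwable t) {
            // Record and swallow the failure so it does not propagate to the caller.
            cleanupFailed = true;
            System.err.println("stopSingleton cleanup failed: " + t);
        }
    }
}
```

Catching Throwable this broadly hides the failure from the container, so the service itself has to report it somewhere useful.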

                      If we do decide to implement a cluster-wide error listener, we could include the stopSingleton failure in the list of events that can be observed.

                      Other thoughts?

                      • 8. Re: JBAS-2520 - HASingleton stopping problem
                        brian.stansberry

                        Thanks for the input on interoperability; we'll leave the issue for 5.0.1.CR1.

                        I'm packing up my house right now, so will defer further comment for a couple weeks until I'm resettled :)

                        • 9. Re: JBAS-2520 - HASingleton stopping problem
                          smarlow

                          What additional policy settings are desired for ha-singleton failures?

                          How about on ha-singleton Stop or Start failure, choose one of the following modes:

                          Terminate all (current) clustered nodes

                          Log the error and continue running without the singleton

                          • 10. Re: JBAS-2520 - HASingleton stopping problem
                            smarlow

                            From email with Brian:

                            For failure to stop, I could see:

                            1) Undeploy service on node that failed. Undeploy could mean stop + destroy, or a true full undeploy. Or just a stop.
                            2) Undeploy service across cluster.
                            3) Ignore, but don't proceed to start.
                            4) Ignore.
                            5) Redeploy on failed node.
                            6) Redeploy across cluster.

                            For failure to start, it would be the same, except you could replace #3 with "Elect a new master".