10 Replies Latest reply on Aug 31, 2010 9:12 AM by aloubyansky

    types of deployment errors and their handling

    aloubyansky

      I'd like to understand the idea behind the current design; at the moment it's confusing to me. I want to clarify whether there is a central point for service startup/lifecycle error handling, or whether different kinds of errors at different phases are supposed to go through different error handling code.

       


      There is ServiceListener, which has event callbacks, including serviceFailed for startup errors. There are implementations of ServiceListener (like ServerStartupListener) that collect all the failures so that they can later be queried to handle or log the errors.
      ServiceListeners are available only inside BatchBuilder, BatchServiceBuilder, ServiceBuilder and ServiceController, so listeners can be invoked only from instances of those classes.
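To make the "collect now, query later" pattern concrete, here is a minimal sketch. The StartListener interface and CollectingStartListener class are hypothetical stand-ins written for this illustration; they are not the MSC ServiceListener or ServerStartupListener types, whose real signatures may differ.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical stand-in for a listener callback; not the real MSC interface.
interface StartListener {
    void serviceStarted(String serviceName);
    void serviceFailed(String serviceName, Throwable cause);
}

// Collects failures during startup so they can be queried afterwards.
final class CollectingStartListener implements StartListener {
    private final List<String> failures = new ArrayList<>();
    private int started;

    @Override
    public synchronized void serviceStarted(String serviceName) {
        started++;
    }

    @Override
    public synchronized void serviceFailed(String serviceName, Throwable cause) {
        failures.add(serviceName + ": " + cause.getMessage());
    }

    synchronized int startedCount() {
        return started;
    }

    // Returns a read-only snapshot of the failures recorded so far.
    synchronized List<String> failures() {
        return Collections.unmodifiableList(new ArrayList<>(failures));
    }
}
```

Whoever drives startup can register one such listener, wait for startup to finish, and then log or act on failures() in one place.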

       

      Question 1. There could be errors related to service creation/initialization/start that happen before the service (service builder, controller, etc.) is even created as a Java object. For example, the logic performed in ServiceActivator.activate(ctx) can fail because the deployment root doesn't exist. Should these errors be reported to ServiceListener? Currently they are not: this logic is executed outside of service builders and controllers, so the errors can't be reported to service listeners.

       

      Question 2. There is the BatchBuilder.install() method, which may throw ServiceRegistryException. This method installs services, going through their dependencies, etc. If errors happen, they are immediately propagated up and re-thrown from install() without notifying the registered listeners. I guess the listeners are actually supposed to be notified in this case?
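One way to picture what Question 2 is asking for is an installer that tells its listeners about a failure before re-throwing it. This is only a sketch of the idea: NotifyingInstaller, InstallFailureListener and InstallException are hypothetical names invented here, not MSC classes, and the real BatchBuilder.install() does not behave this way according to the description above.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical types for illustration; these are not the MSC classes.
class InstallException extends Exception {
    InstallException(String message) {
        super(message);
    }
}

interface InstallFailureListener {
    void installFailed(String serviceName, Exception cause);
}

final class NotifyingInstaller {
    private final List<InstallFailureListener> listeners = new ArrayList<>();

    void addListener(InstallFailureListener listener) {
        listeners.add(listener);
    }

    // Installs each service in order; on failure, notifies listeners, then rethrows.
    void install(List<String> serviceNames) throws InstallException {
        for (String name : serviceNames) {
            try {
                installOne(name);
            } catch (InstallException e) {
                for (InstallFailureListener listener : listeners) {
                    listener.installFailed(name, e);
                }
                throw e;
            }
        }
    }

    // Stand-in for real installation: an empty name simulates a registry error.
    private void installOne(String name) throws InstallException {
        if (name.isEmpty()) {
            throw new InstallException("cannot install a service with an empty name");
        }
    }
}
```

With this shape the caller still sees the exception, but a failure-collecting listener also gets a record of it, so both error paths end up in the same place.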

        • 1. Re: types of deployment errors and their handling
          johnbailey

          As for item #1, there are a couple of things to think about. I think subsystems/extensions should be handled differently from deployments. If a subsystem fails to activate because of an exception thrown during activation (not a service start), I think the server should stop progressing and log the information related to the problem. I just don't think it makes sense for the server to continue if a subsystem failed. Deployments are a different story. I think the server should log the problem info, but continue to boot with the other deployments. This is basically what is happening now. What is still needed is some ability to roll back any deployment-specific services that may have been added to the batch before the failure occurred. Deployments already use MSC sub-batches, but the API does not have any option to roll back.

          • 2. Re: types of deployment errors and their handling
            brian.stansberry

            Agreed on subsystems/extensions being handled differently from deployments. I can't see how we're doing our users a favor by leaving a server running with partially broken subsystems. If a subsystem finds a problem that it can live with, it should log an ERROR/WARN and not throw an exception. If an exception is thrown, that's an indication that the subsystem and thus the entire server is in an invalid state.

             

            The draft deployment API that I'll start a thread on next includes ways for the user to specify how they want things to work if failure occurs during deployment.

            • 3. Re: types of deployment errors and their handling
              aloubyansky

              Yes, that makes sense. A subsystem failure will lead to failed deployments, but in general it would also make sense to keep a server running if it could recover from the error, which in this case would mean restarting the subsystem. I don't think we have even considered that?

               

              Actually, my point in starting this thread was to clarify the current deployment mechanism and (more importantly) how it's actually supposed to work and how it is supposed to fail.

               

              But the effects of errors from different kinds of "startable" pieces (subsystems, applications, etc.) have to be clearly defined as well. I'll have a look and start a new thread for that.

              • 4. Re: types of deployment errors and their handling
                johnbailey

                It may be good to clarify the current process being used for deployments.  The process for deployments is as follows:

                1. Create a sub-batch for the deployment
                2. Create a high-level service that represents the deployment (parent service)
                3. Add a sub-batch level dependency on the deployment service created in step 2

                 

                Having a high-level service for the deployment allows the deployment to be stopped/started/removed as a single unit, based on the batch-level dependency.
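The effect of the batch-level dependency can be modeled in a few lines. The sketch below is a toy model, not MSC code: ParentDeploymentService is a hypothetical class in which every child service is only considered up while the parent deployment service is up.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical model: every child service depends on one parent deployment service.
final class ParentDeploymentService {
    private final List<String> children = new ArrayList<>();
    private boolean up;

    void addChild(String name) {
        children.add(name);
    }

    void start() {
        up = true;
    }

    void stop() {
        up = false;
    }

    // A child can only be up while the parent it depends on is up.
    boolean isChildUp(String name) {
        return up && children.contains(name);
    }
}
```

Stopping the parent takes every child down with it, which is what lets the deployment be handled as one unit.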

                 

                There are a couple of additions that could be made to the process to add richer failure handling. One would be to add a sub-batch-level listener that watches for service start failures and decides whether to stop the whole deployment or to allow a partial deployment service start. How to handle partial starts could then be a user-provided configuration. The second addition would be support for rolling back the services added to the batch if any activation errors occur prior to the batch installation. This could be driven by the same user configuration that allows partial deployments. It would require an API addition to batches in MSC.
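The second addition (rollback on activation error, driven by a user setting) could look roughly like the sketch below. Everything here is hypothetical: SketchSubBatch, SketchActivator and DeploymentActivation are invented for illustration, and the allowPartial flag stands in for whatever user configuration would eventually be chosen. As noted above, MSC batches had no rollback option at the time of writing.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical model of a sub-batch with the proposed rollback operation.
final class SketchSubBatch {
    private final List<String> pending = new ArrayList<>();

    void add(String serviceName) {
        pending.add(serviceName);
    }

    // Discards everything added so far.
    void rollback() {
        pending.clear();
    }

    List<String> pending() {
        return new ArrayList<>(pending);
    }
}

// Hypothetical activator: adds services to the batch and may fail part-way.
interface SketchActivator {
    void activate(SketchSubBatch batch) throws Exception;
}

final class DeploymentActivation {
    // Runs the activator; on failure logs the error and either rolls back the
    // batch or keeps what was added. Returns true if activation completed.
    static boolean activate(SketchActivator activator, SketchSubBatch batch, boolean allowPartial) {
        try {
            activator.activate(batch);
            return true;
        } catch (Exception e) {
            System.err.println("Deployment activation failed: " + e.getMessage());
            if (!allowPartial) {
                batch.rollback();
            }
            return false;
        }
    }
}
```

In both cases the error is logged and the caller can continue booting; the flag only decides whether services added before the failure survive.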

                • 5. Re: types of deployment errors and their handling
                  jesper.pedersen

                  And deployment == ? Are we talking about an EAR here, for example? What is the use case for deploying the JAR part of the EAR if the WAR part fails?

                   

                  Or are we talking about starting the EAR on X nodes even if node Y fails?

                  • 6. Re: types of deployment errors and their handling
                    johnbailey

                    I would like to add a couple of definitions to the discussion.

                     

                    • Activation - The process step that executes the service activators (subsystems, deployments, etc.)
                    • Service Start - The actual start operation on the services themselves

                     

                    The reason I want to separate these is that the error handling should be different. Activation errors in a subsystem should be considered catastrophic failures. These will certainly cause major failures later in startup or at runtime. In all likelihood they are not recoverable and will result in only portions of the subsystem's services being available. I feel this should result in halting the server start. I also think service start errors in a subsystem should halt the server startup process. I just don't think these can be recovered from with a restart.

                     

                    What does everyone think?

                     

                    As for deployment activation errors, this has been discussed in previous posts, but in essence these should either roll back the batch or allow the previously added services to remain. Either way, the errors should be logged and the server should continue the boot process. Deployment service start errors should also either stop the whole deployment or allow a partial start, based on user configuration.
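The proposal in this post amounts to a small decision table over where the error came from and in which phase it happened. The sketch below encodes that table as described here; the enum and class names are hypothetical, and allowPartial stands in for the user configuration mentioned above.

```java
// Hypothetical names; this only encodes the policy proposed in this post.
enum Origin { SUBSYSTEM, DEPLOYMENT }

enum Phase { ACTIVATION, SERVICE_START }

enum Reaction { HALT_SERVER, ROLLBACK_BATCH, STOP_DEPLOYMENT, KEEP_PARTIAL }

final class FailurePolicy {
    // Maps an error's origin and phase to the proposed reaction.
    static Reaction decide(Origin origin, Phase phase, boolean allowPartial) {
        if (origin == Origin.SUBSYSTEM) {
            // Both activation and service start errors in a subsystem halt the server.
            return Reaction.HALT_SERVER;
        }
        if (allowPartial) {
            // The user opted in to keeping whatever was added or started.
            return Reaction.KEEP_PARTIAL;
        }
        // Deployment errors: log and continue booting, but undo the deployment.
        return phase == Phase.ACTIVATION ? Reaction.ROLLBACK_BATCH : Reaction.STOP_DEPLOYMENT;
    }
}
```

Written this way, the two phases only differ in how a failed deployment is undone, which is one way to frame the question of whether they need separate handling at all.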

                    • 7. Re: types of deployment errors and their handling
                      johnbailey

                      I think we are really talking about any user-provided deployment. So if a JAR were to fail inside an EAR deployment, I would think the same would be true: it will either roll back/stop the whole EAR deployment, or allow a partial deployment. So basically the rollback/stop is controlled for high-level deployments only.

                      • 8. Re: types of deployment errors and their handling
                        jesper.pedersen

                        Yes, but I'm questioning the use case for the partial deployment of a user deployment such as an EAR.

                         

                        As to subsystems - it is up to the subsystem to resolve its requirements and dependencies and throw an error if it can't be started. The question is then whether the server wants to shut down or to disable the deployments that the subsystem handles.

                        • 9. Re: types of deployment errors and their handling
                          aloubyansky

                          There shouldn't be such a thing as a partly deployed EAR. I would even consider it a bug.

                          • 10. Re: types of deployment errors and their handling
                            aloubyansky

                            John Bailey wrote:

                             

                            I would like to add a couple of definitions to the discussion.

                             

                            • Activation - The process step that executes the service activators (subsystems, deployments, etc.)
                            • Service Start - The actual start operation on the services themselves

                             

                            The reason I want to separate these is that the error handling should be different. Activation errors in a subsystem should be considered catastrophic failures. These will certainly cause major failures later in startup or at runtime. In all likelihood they are not recoverable and will result in only portions of the subsystem's services being available. I feel this should result in halting the server start. I also think service start errors in a subsystem should halt the server startup process. I just don't think these can be recovered from with a restart.

                             

                            What does everyone think?

                            If a service is an essential part of the subsystem then it's effectively a subsystem failure and the server should stop.

                             

                            As for deployment activation errors, this has been discussed in previous posts, but in essence these should either roll back the batch or allow the previously added services to remain. Either way, the errors should be logged and the server should continue the boot process. Deployment service start errors should also either stop the whole deployment or allow a partial start, based on user configuration.

                            I'd like to clarify what mechanism is responsible for handling deployment activation errors. These errors bypass the listeners. They happen before the deployment is created, and from that point of view they kind of aren't even deployment errors? Would any listener be notified at all if there was an activation failure? Can you imagine a listener that would be interested in receiving a deployment activation error?

                             

                            This kind of separation of activation and service start errors is confusing to me. You could have an exception hierarchy to differentiate between the two, but why you would want to have different error handling mechanisms isn't clear to me.
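As a rough illustration of the exception-hierarchy alternative, the sketch below routes both kinds of error through one handler and lets the exception type carry the distinction. All class names are hypothetical and chosen for this example only; nothing here is an existing MSC or server API.

```java
// Hypothetical hierarchy: one base type, one subtype per failure phase.
class DeploymentException extends Exception {
    DeploymentException(String message) {
        super(message);
    }
}

class ActivationException extends DeploymentException {
    ActivationException(String message) {
        super(message);
    }
}

class ServiceStartException extends DeploymentException {
    ServiceStartException(String message) {
        super(message);
    }
}

final class SingleErrorHandler {
    // One entry point for every deployment error; the type tells the phases apart.
    static String describe(DeploymentException e) {
        if (e instanceof ActivationException) {
            return "activation failed: " + e.getMessage();
        }
        if (e instanceof ServiceStartException) {
            return "service start failed: " + e.getMessage();
        }
        return "deployment failed: " + e.getMessage();
    }
}
```

A listener or logger that takes DeploymentException would then see activation failures too, instead of having them pass it by.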

                             

                            Thanks.