6 Replies Latest reply on Feb 15, 2010 10:55 AM by Jean-Frederic Clere

    Fixing JBPAPP-3463 a clean way

    Jean-Frederic Clere Master

      I have worked around JBPAPP-3463 on 1.0.x and I am wondering if it is worth doing something more sophisticated in 1.1.x.

       

      The first idea is to add to each context record in httpd the information about active requests (with and without a sessionid) and to have DISABLE-APP and STOP-APP return that information. The logic in AS could then wait a while and resend the DISABLE-APP or the STOP-APP until the active request count is zero. For example, wait 2 seconds and retry up to 3 times; after that the application would be undeployed and REMOVE-APP sent to httpd.
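
A minimal Java sketch of that retry loop, under stated assumptions: `DrainRetry` and its `activeRequests` supplier are hypothetical names, and the supplier stands in for "send DISABLE-APP or STOP-APP and read the active request count from the response", which does not exist in the protocol yet.

```java
import java.util.function.IntSupplier;

/** Hypothetical sketch: poll httpd until it reports zero active requests. */
final class DrainRetry {
    /**
     * @param activeRequests assumed stand-in for sending DISABLE-APP/STOP-APP and reading the returned count
     * @param retries        number of retries after the first attempt (3 in the example above)
     * @param waitMillis     pause between attempts (2000 in the example above)
     * @return true if the count reached zero, false if the caller should undeploy anyway
     */
    static boolean drain(IntSupplier activeRequests, int retries, long waitMillis) {
        for (int attempt = 0; attempt <= retries; attempt++) {
            if (activeRequests.getAsInt() == 0) {
                return true;
            }
            if (attempt < retries) {
                try {
                    Thread.sleep(waitMillis);
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt(); // keep the interrupt flag and give up
                    return false;
                }
            }
        }
        return false;
    }
}
```

Whatever `drain` returns, the caller would go on to undeploy and send REMOVE-APP; the return value only says whether in-flight requests may still exist.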

        • 1. Re: Fixing JBPAPP-3463 a clean way
          Brian Stansberry Master

          Interesting. Would that add a lot of overhead on the httpd side? To avoid a race I figure you'd have to track the requests all the time.

           

          Semi-OT: Remy's comments on JBPAPP-3463 re: 503s and the QoS effect of retries poke a hole in an idea I've long had -- using 503s (or some other header) and failover to shed HA sessions from nodes undergoing a shutdown. The concept was that once shutdown starts, the node sends a DISABLE_APP and then starts tracking the session id and time of request for requests that continue coming in. If a request comes in for a session it hasn't seen within a configurable amount of time (say 30 secs), it assumes the session has been safely replicated and responds with a 503.
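
To make the shedding idea concrete (not to endorse it, given the hole mentioned above), here is a rough Java sketch. `SessionShedder` is a hypothetical class, and treating a session not yet seen since shutdown began as "last seen at shutdown start" is an assumption of this sketch.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical sketch: decide per request whether to answer 503 during shutdown. */
final class SessionShedder {
    private final Map<String, Long> lastSeenMillis = new ConcurrentHashMap<>();
    private final long shutdownStartMillis;
    private final long quietMillis; // e.g. 30000 for the 30 secs mentioned above

    SessionShedder(long shutdownStartMillis, long quietMillis) {
        this.shutdownStartMillis = shutdownStartMillis;
        this.quietMillis = quietMillis;
    }

    /** Returns true if the request should get a 503 so that the proxy fails it over. */
    boolean shouldShed(String sessionId, long nowMillis) {
        // Assumption: a session not seen since shutdown began counts as last seen at shutdown start.
        long lastSeen = lastSeenMillis.getOrDefault(sessionId, shutdownStartMillis);
        if (nowMillis - lastSeen >= quietMillis) {
            return true; // quiet long enough: assume the session state has been replicated
        }
        lastSeenMillis.put(sessionId, nowMillis); // request is served, so the quiet period restarts
        return false;
    }
}
```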

           

          Perhaps an approach is to do what you suggest, but on the server side switch to synchronous replication before sending the DISABLE_APP. With that the server knows any request that's come in since it switched to sync repl has been safely replicated. So it feels more "comfortable" sending the STOP_APP.

           

          Still, if the cluster is under heavy load, the active request count httpd returns won't get to zero. So eventually the AS will proceed to undeploy and there may be "in flight" requests. We need to handle those cleanly.

          • 2. Re: Fixing JBPAPP-3463 a clean way
            Jean-Frederic Clere Master

            "Would that add a lot of overhead on the httpd side?"

            A bit: 1 or 2 fields in the context structure and the corresponding locking mechanism.

            What about having a DISABLE_APP_RSP and STOP_APP_RSP that tell AS how many requests are being processed when the DISABLE_APP (or STOP_APP) is sent? AS could send several DISABLE_APP until it gets 0 requests without a sessionid, and the same for STOP_APP.
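
For illustration only, the two counters could be modelled like this. It is a Java model of bookkeeping that would really live in the httpd context record (in C, in shared memory), and `ContextRecord` with its method names is hypothetical. A DISABLE_APP_RSP would presumably carry the "without sessionid" count.

```java
import java.util.concurrent.atomic.AtomicInteger;

/** Java model of the per-context counters; the real record would live in httpd. */
final class ContextRecord {
    private final AtomicInteger withSession = new AtomicInteger();
    private final AtomicInteger withoutSession = new AtomicInteger();

    /** Called when a request for this context is handed to the node. */
    void requestStarted(boolean hasSessionId) {
        (hasSessionId ? withSession : withoutSession).incrementAndGet();
    }

    /** Called when the request completes; must be passed the same flag as requestStarted. */
    void requestFinished(boolean hasSessionId) {
        (hasSessionId ? withSession : withoutSession).decrementAndGet();
    }

    int activeWithSession() {
        return withSession.get();
    }

    int activeWithoutSession() {
        return withoutSession.get();
    }
}
```

Atomic counters stand in here for the "corresponding locking mechanism"; the counting has to run all the time, not only once a DISABLE_APP arrives, to avoid the race mentioned earlier.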

             

             

            DISABLE_APP means: don't give me requests without a sessionid. Adding a timeout and sending 503 after that in httpd breaks the logic of sending DISABLE_APP, then waiting in AS for sessions to drain, and then sending STOP_APP.

             

            See JBPAPP-3614 for the requests broken in the "middle".

            • 3. Re: Fixing JBPAPP-3463 a clean way
              Brian Stansberry Master

              jfrederic.clere@jboss.com wrote:

               

              What about having a DISABLE_APP_RSP and STOP_APP_RSP that tell AS how many requests are being processed when the DISABLE_APP (or STOP_APP) is sent? AS could send several DISABLE_APP until it gets 0 requests without a sessionid, and the same for STOP_APP.

               

              The more I think about it the more I like it. I was too focused on DISABLE_APP before; the key is STOP_APP. Once it sends that once and gets a response, no more new "in the middle" requests. So sounds like the key thing is JBPAPP-3614.

               

              So, perhaps something like this on the java side (for an HA app):

               

              1) Get signal to cleanly shutdown.

              2) Start replicating sessions synchronously

              3) Start tracking requests flowing through whatever valve is doing all this

              4) Send DISABLE_APP

              5) Wait a bit to give time for any requests that came in right before 2) to complete their asynchronous replication

              6) Send STOP_APP, get STOP_APP_RSP response count

              7) Response count can be compared to # of requests that have hit the valve -- 3) above.

              a) if ==, block until in-flight requests return through valve

              b) if <, pause a bit and go back to 6

              8) (maybe not necessary ???) send STOP_APP until request count is 0, indicating requests that have returned through valve are completed at httpd side

              9) undeploy, send REMOVE_APP

               

              Not included above is timeout logic to give up and move to 9) if it's taking too long.
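
A rough Java sketch of steps 1) to 9) including the missing timeout, under stated assumptions: `CleanShutdown` and its `Proxy` interface are hypothetical (not a mod_cluster API), `stopApp()` is assumed to return the count from the proposed STOP_APP_RSP, and steps 7) and 8) are collapsed into one poll that waits until both httpd and the valve report zero.

```java
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical sketch of the shutdown sequence; none of these names are real mod_cluster APIs. */
final class CleanShutdown {
    interface Proxy {
        void disableApp();
        int stopApp();     // assumed to return the count carried by the proposed STOP_APP_RSP
        void removeApp();
    }

    /** Step 3: the tracking valve increments this on entry and decrements it on exit. */
    final AtomicInteger inFlight = new AtomicInteger();

    /** Returns true if everything drained, false if the timeout forced the undeploy. */
    boolean run(Proxy proxy, Runnable enableSyncReplication, Runnable undeploy,
                long pauseMillis, long timeoutMillis) {
        long deadline = System.nanoTime() + timeoutMillis * 1_000_000L;
        enableSyncReplication.run();                   // step 2
        proxy.disableApp();                            // step 4
        pause(pauseMillis);                            // step 5
        boolean drained = false;
        while (true) {
            int atProxy = proxy.stopApp();             // step 6
            if (atProxy == 0 && inFlight.get() == 0) { // steps 7 and 8, collapsed
                drained = true;
                break;
            }
            if (System.nanoTime() - deadline >= 0 || !pause(pauseMillis)) {
                break;                                 // taking too long: give up and move to step 9
            }
        }
        undeploy.run();                                // step 9
        proxy.removeApp();
        return drained;
    }

    /** Sleeps; returns false if the thread was interrupted. */
    private static boolean pause(long millis) {
        try {
            Thread.sleep(millis);
            return true;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }
}
```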

              • 4. Re: Fixing JBPAPP-3463 a clean way
                Paul Ferraro Master

                bstansberry@jboss.com wrote:

                 

                The more I think about it the more I like it. I was too focused on DISABLE_APP before; the key is STOP_APP. Once it sends that once and gets a response, no more new "in the middle" requests. So sounds like the key thing is JBPAPP-3614.

                 

                So, perhaps something like this on the java side (for an HA app):

                 

                1) Get signal to cleanly shutdown.

                2) Start replicating sessions synchronously

                3) Start tracking requests flowing through whatever valve is doing all this

                4) Send DISABLE_APP

                5) Wait a bit to give time for any requests that came in right before 2) to complete their asynchronous replication

                6) Send STOP_APP, get STOP_APP_RSP response count

                7) Response count can be compared to # of requests that have hit the valve -- 3) above.

                a) if ==, block until in-flight requests return through valve

                b) if <, pause a bit and go back to 6

                8) (maybe not necessary ???) send STOP_APP until request count is 0, indicating requests that have returned through valve are completed at httpd side

                9) undeploy, send REMOVE_APP

                 

                Not included above is timeout logic to give up and move to 9) if it's taking too long.

                In general, this seems like the right way to go.  A few thoughts:

                * If we're most likely going to resort to polling the STOP_APP response anyway, then we can probably avoid the request tracking valve altogether.

                * Rather than the arbitrary sleep to allow async session replication to complete, we should be able to leverage a CacheListener for proactive replication notifications, no?

                * Hmm, a cache listener would require that valve after all - to detect the replication completion of sessions for requests (3).
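
A sketch of how the arbitrary sleep could be replaced by a notification, assuming the valve and the cache listener can each call into a shared tracker. `ReplicationTracker` and its method names are hypothetical; wiring `replicationCompleted` to an actual cache listener is left out because it depends on the cache API in use.

```java
import java.util.HashSet;
import java.util.Set;

/** Hypothetical sketch: wait for pending session replications instead of sleeping. */
final class ReplicationTracker {
    private final Set<String> pending = new HashSet<>();

    /** Called by the valve when a request has touched a session that still has to replicate. */
    synchronized void requestTouched(String sessionId) {
        pending.add(sessionId);
    }

    /** Called from the (assumed) cache listener once that session's replication has completed. */
    synchronized void replicationCompleted(String sessionId) {
        pending.remove(sessionId);
        if (pending.isEmpty()) {
            notifyAll();
        }
    }

    /** Blocks until nothing is pending; returns false on timeout or interrupt. */
    synchronized boolean awaitAllReplicated(long timeoutMillis) {
        long deadline = System.nanoTime() + timeoutMillis * 1_000_000L;
        while (!pending.isEmpty()) {
            long remainingMillis = (deadline - System.nanoTime()) / 1_000_000L;
            if (remainingMillis <= 0) {
                return false;
            }
            try {
                wait(remainingMillis);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return false;
            }
        }
        return true;
    }
}
```

The shutdown logic would call `awaitAllReplicated` where the fixed wait was, still bounded by a timeout.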

                • 5. Re: Fixing JBPAPP-3463 a clean way
                  Jean-Frederic Clere Master

                  I would go for something like:

                   

                  1) Get signal to cleanly shutdown.

                  2) Send DISABLE_APP (no requests without a sessionid go to that node); this step is not mandatory.

                  3) Send STOP_APP.

                  4) replicate sessions synchronously (note that replication can happen because of the failover too).

                  5) Send STOP_APP, get STOP_APP_RSP response count.

                  6) Check response count

                  7) if != 0, pause a bit and go back to 4

                  8) if == 0, check replication, undeploy and send REMOVE_APP

                  • 6. Re: Fixing JBPAPP-3463 a clean way
                    Jean-Frederic Clere Master

                    In fact there are 2 things to solve:

                    - a single application undeployment.

                    - a shutdown of a node.

                     

                    For the shutdown of a node it would be nice to send the DISABLE-APP * and STOP-APP * before the undeploy of each application occurs, because a server with 1000 applications would need a while to stop otherwise, no?

                     

                    BTW: All of this logic needs a timeout to prevent waiting for hung requests; there should be a parameter for the timeout value.
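
A sketch of the node shutdown case, assuming the wildcard form written above ("*") and a single timeout parameter for the whole node. `NodeShutdown` and its `Proxy` interface are hypothetical, `stopApp` is assumed to return the active request count, and sending a REMOVE-APP per context after each undeploy is this sketch's choice.

```java
import java.util.List;
import java.util.function.Consumer;

/** Hypothetical sketch: drain the whole node once, then undeploy every application. */
final class NodeShutdown {
    interface Proxy {
        void disableApp(String context);
        int stopApp(String context);    // assumed to return the active request count
        void removeApp(String context);
    }

    /** Returns true if the node drained before the timeout. */
    static boolean shutdown(Proxy proxy, List<String> contexts, Consumer<String> undeploy,
                            long pauseMillis, long timeoutMillis) {
        long deadline = System.nanoTime() + timeoutMillis * 1_000_000L;
        proxy.disableApp("*");
        boolean drained = false;
        while (true) {
            if (proxy.stopApp("*") == 0) {
                drained = true;
                break;
            }
            if (System.nanoTime() - deadline >= 0) {
                break; // hung requests: stop waiting
            }
            try {
                Thread.sleep(pauseMillis);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                break;
            }
        }
        // One wait for the whole node, so 1000 applications do not mean 1000 waits.
        for (String context : contexts) {
            undeploy.accept(context);
            proxy.removeApp(context);
        }
        return drained;
    }
}
```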