6 Replies Latest reply on Mar 21, 2007 2:42 PM by rachmato

    Testsuite runs failing due to server failed startups/timeout

    rachmato

      We have been observing testsuite execution failures due to:
      (i) servers failing to shut down correctly, blocking the startup of subsequent servers
      (ii) servers timing out on startup
      Some of these failures arise on platforms which have been newly introduced for JBPAPP testing, some of which (e.g. HPUX) cause the AS to start more slowly. These build failures wreak havoc on any attempt to automate the execution of the testsuite.

      I've been investigating these problems. Here is what I have found:

      1. The standard means of starting and stopping servers is to:
      (i) define a server configuration in the server:config element in the file testsuite/import/server-config.xml
      (ii) use server:start conf="" to start the configuration
      (iii) use server:stop conf="" to stop the configuration

      server:start will exec the program org.jboss.Main and wait up to 120 seconds for the server to reach a started state (the server is 'started' if it responds to an HTTP request); if the server fails to respond, the ant task StartServerTask will raise a BuildException with the message "Error starting server " and the build will be terminated (if ant is set up to do so).
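To make the "started" check concrete, here is a minimal Python sketch of the polling idea, not the actual StartServerTask code. The function name `wait_for_http`, its parameters and the polling interval are hypothetical; only the 120 second limit and the "responds to an HTTP request" criterion come from the description above. It assumes that any HTTP response, even an error status, counts as the server being up.

```python
import http.client
import time
import urllib.error
import urllib.request

# Hypothetical helper: poll a URL until the server answers or the timeout expires.
# Proxies are bypassed so that a local server is always contacted directly.
_OPENER = urllib.request.build_opener(urllib.request.ProxyHandler({}))


def wait_for_http(url, timeout_s=120.0, interval_s=1.0):
    """Return True as soon as `url` yields any HTTP response, else False on timeout."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        try:
            with _OPENER.open(url, timeout=interval_s):
                return True
        except urllib.error.HTTPError:
            # Assumption: an error status still proves the server is answering.
            return True
        except (urllib.error.URLError, http.client.HTTPException, OSError):
            # Not listening yet (or connection dropped); wait and retry.
            time.sleep(interval_s)
    return False
```

A caller would treat a False result the way the ant task treats a timeout, i.e. as a failed start.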

      server:stop will exec the program org.jboss.Shutdown and will wait up to 120 seconds for the server to shut down (determined by a call to Process.exitValue()). If the server does not shut down, the message "Failed to shutdown server before timeout. Destroying the process" is logged and an attempt is made to destroy the process. No build exception is raised and the build will continue.
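The wait-then-destroy behaviour can be sketched in Python as follows. This is an illustration under assumptions, not the ant task itself: `stop_or_destroy` and `spawn_stand_in` are hypothetical names, the stand-in process is just a sleeping Python interpreter, and the step of exec'ing org.jboss.Shutdown is left out. Only the timeout, the message text and the "destroy but do not fail" outcome are taken from the description above.

```python
import subprocess
import sys


def spawn_stand_in(seconds):
    """Hypothetical stand-in for a server process: sleeps for `seconds`."""
    return subprocess.Popen(
        [sys.executable, "-c", "import time; time.sleep(%r)" % seconds]
    )


def stop_or_destroy(process, timeout_s=120.0):
    """Wait for `process` to exit; destroy it if it is still alive at the timeout.

    Returns True if the process exited by itself, False if it had to be destroyed.
    No exception is raised in either case, so the caller simply continues.
    """
    try:
        process.wait(timeout=timeout_s)
        return True
    except subprocess.TimeoutExpired:
        print("Failed to shutdown server before timeout. Destroying the process")
        process.kill()
        process.wait()  # reap the killed process
        return False
```

Because the function reports the outcome rather than raising, a hung server does not stop the run, which matches the "build will continue" behaviour described.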

      The target jboss-all-config-tests uses this approach, and it has been known to fail to correctly shut down the 'all' server instance, blocking the startup of tests-security-manager, which follows it in the testsuite. This needs further investigation. A similar issue was raised (JBQA-405) and the code was changed by Anil to add the class library commons-logging.jar to the stopServerClasspath(), as when exec'ing org.jboss.Shutdown, it may fail to execute due to classes it needs not being present on the classpath. I believe this was in response to 'minimal' not shutting down, and this was a repeatable error. However, the changes made in the JIRA issue are not present in the current distribution versions.

      2. There are some old mechanisms for starting and stopping servers, which some tests are using:
      <start-jboss>
      - executes the java program org.jboss.Main in the background (with 64m memory overriding)
      - doesn't block, so we need to use <wait-on-host> to wait until startup has completed
      <wait-on-host>
      - waits for a host to start, by periodically trying to make a connection to http://hostname:8080
      - fails the build and spits out a message if not up in 60 seconds
      <stop-jboss>
      - executes the java program org.jboss.Shutdown in the background
      - does not block once executed, therefore we need to follow this with <wait-on-shutdown>
      <wait-on-shutdown>
      - waits for a host to shut down, by checking for the presence of "[org.jboss.system.server.Server] Shutdown complete" in server.log
      - fails the build and spits out a message if not shut down in 60 seconds

      These old, and really deprecated, methods of startup and shutdown are inferior to server:start and server:stop, in that:
      - they will not try to kill a process which won't shut down, but only fail the build
      - they also use their own timeouts, which are 60 secs, instead of 120 secs. Given that JBoss configuration 'all' takes 45 secs to start on a fast machine, with no other processes running, this does not account for (i) slow hosts (ii) hosts with several jboss instances running (iii) complex configuration startups.
      - the wait-on-shutdown target seems to be broken, as the log entry for system shutdown has changed from "[org.jboss.system.server.Server] Shutdown complete" to "[Server] Shutdown complete".
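As an illustration of why a literal match on the old log entry breaks, here is a small Python sketch of a check that accepts both forms of the marker quoted above. The names `SHUTDOWN_MARKER` and `shutdown_complete` are hypothetical, and this is not how <wait-on-shutdown> is implemented; it only shows a pattern that tolerates the category name being logged with or without its package prefix.

```python
import re

# Matches "[org.jboss.system.server.Server] Shutdown complete"
# as well as the shortened "[Server] Shutdown complete".
SHUTDOWN_MARKER = re.compile(
    r"\[(?:org\.jboss\.system\.server\.)?Server\] Shutdown complete"
)


def shutdown_complete(log_text):
    """Return True if the server.log text contains either form of the marker."""
    return SHUTDOWN_MARKER.search(log_text) is not None
```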

      There are a number of targets which use <start-jboss>/<wait-on-host> and <stop-jboss>/<wait-on-shutdown> and so are susceptible to timeouts.
      They are:
      * tests-compatibility-pooledInvokers
      * tests-compatibility
      * tests-jacc-securitymgr
      * tests-jacc-security-external
      * tests-jacc-security-allstarrole

      What is worse, the latter three mix a server:start startup with <stop-jboss>/<wait-on-shutdown>.

      Thus, in trying to fix the problem of the testsuite failing due to (i) servers failing to start up because other servers have not cleanly shut down and (ii) server startups timing out and causing the build to fail, the following might help:

      1. reworking the targets above to use server:start and server:stop, so that at least we have consistent behaviours being applied to stopping and starting servers.
      2. re-investigating the problem with 'all' not shutting down, which may be due to OOME. One approach would be to introduce TRACE statements which trace the execution of steps between a client exec'ing Shutdown, and the shutdown() request being processed on the server side.

      I'll go ahead and make these changes to a temporary copy of the source tree and test them out as much as I can.
      If anyone has any comments or suggestions, please feel free to reply.

        • 1. Re: Testsuite runs failing due to server failed startups/tim

           

          "rachmatowicz@jboss.com" wrote:
          A similar issue was raised (JBQA-405) and the code was changed by Anil to add the class library commons-logging.jar to the stopServerClasspath(), as when exec'ing org.jboss.Shutdown, it may fail to execute due to classes it needs not being present on the classpath. I believe this was in response to 'minimal' not shutting down, and this was a repeatable error. However, the changes made in the JIRA issue are not present in the current distribution versions.


          If we were seeing the server fail to shut down cleanly across platforms, I agree investigation of JBQA-405 would be pertinent. But since it is intermittent, I think it is more likely that the "all" server is experiencing an OOME, which prevents it from responding to the shutdown request. Have you checked the logs to see whether the "all" server is experiencing an OOME?

          "rachmatowicz@jboss.com" wrote:

          1. reworking the targets above to use server:start and server:stop, so that at least we have consistent behaviours being applied to stopping and starting servers.
          2. re-investigating the problem with 'all' not shutting down, which may be due to OOME. One approach would be to introduce TRACE statements which trace the execution of steps between a client exec'ing Shutdown, and the shutdown() request being processed on the server side.


          I am in total agreement with #1. As for #2, you should be able to see the presence of an OOME just by grepping the logs. If you don't see it, an OOME on the server is unlikely, IIUC.


          • 2. Re: Testsuite runs failing due to server failed startups/tim
            rachmato

            1. The 'all' server was experiencing OOMEs of the PermSpace variety, as opposed to the heap space variety. Heap space is used to store object instances and other garbage-collectable data items, whereas PermSpace is used to store class definitions and other 'permanent' data structures, which are not subject to garbage collection (as far as I am aware). These were visible in the log as:

            18:23:08,027 ERROR [STDERR] Exception in thread "JBoss System Threads(1)-108"
            18:23:08,027 WARN [RunnableTaskWrapper] Unhandled throwable for runnable: org.jnp.server.Main$BootstrapRequestHandler@8aba19
            java.lang.OutOfMemoryError: PermGen space
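For anyone wanting to check their own logs, the distinction between the two kinds of OOME can be read off the text after "java.lang.OutOfMemoryError:". The Python sketch below is one possible way to do that over server.log lines; `count_oome_kinds` is a hypothetical helper, not part of the testsuite, and it assumes the detail message sits on the same line as the exception name, as in the excerpt above.

```python
import re
from collections import Counter

# The detail after the colon is what separates "PermGen space" from "Java heap space".
OOME = re.compile(r"java\.lang\.OutOfMemoryError(?::\s*(.*))?")


def count_oome_kinds(lines):
    """Count OutOfMemoryError occurrences in log lines, keyed by their detail message."""
    counts = Counter()
    for line in lines:
        match = OOME.search(line)
        if match:
            detail = (match.group(1) or "").strip()
            counts[detail or "(no detail)"] += 1
    return dict(counts)
```

Passing an open log file works as well as a list of strings, e.g. `count_oome_kinds(open("server.log"))`, since the function only iterates over lines.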


            2. The fact that servers controlled with server:start and server:stop do not shut down correctly seems to be due to memory problems: I ran the testsuite with the JVM memory settings

            -Xms256m -Xmx256m -XX:MaxPermSize=128m -XX:PermSize=128m

            and the server shutdown problems (as well as the OOMEs) disappeared.

            • 3. Re: Testsuite runs failing due to server failed startups/tim
              starksm64

              What branch are you talking about? I went through and updated all of the jboss5 trunk memory configs and have been getting complete runs.

              • 4. Re: Testsuite runs failing due to server failed startups/tim

                This is 4.2. Should we just be increasing the Permsize across the board? Or is this indicative of a memory leak?

                • 5. Re: Testsuite runs failing due to server failed startups/tim

                  It's a memory leak.

                  The permanent generation fills up with class/method references
                  which should be released when the classloader is undeployed.

                  So either there are tests that are not undeploying their test artifacts from the server
                  or there are still references to the classloader/classes somewhere.

                  EJB3 has had a known issue recently.

                  JBoss5 has a problem in the VFS layer (but I don't believe that is related to the
                  permsize) which still needs to be sorted out.

                  I don't see any reason why we can't bootstrap the server within 64M?

                  This used to be possible. In fact, the basic ejb server in the 2.4.x days booted
                  and ran in 4M. :-)

                  It is only because people keep increasing the
                  default memory to fix the testsuite that this "bloatware" problem is not being addressed.
                  i.e. what is using/wasting all this memory.

                  • 6. Re: Testsuite runs failing due to server failed startups/tim
                    rachmato

                    I ran the 4.2 testsuite yesterday on x86-RHEL4 with the heap size and PermSize set as above, and had a clean run (i.e. the problem with OOMEs on the 'all' configuration was 'fixed').

                    I ran the 4.2 testsuite today on Windows x86_64 with the same values of heap size and PermSize on the 'all' configuration and got OOMEs on heap space in the 'jacc-securitymgr' and 'default' configurations:

                    jacc-securitymgr/log/server.log: java.lang.OutOfMemoryError: Java heap space
                    ...

                    default/log/server.log: java.lang.OutOfMemoryError: Java heap space
                    ....

                    I had not adjusted the JVM memory options on those targets.