Testsuite runs failing due to server failed startups/timeout
rachmato Mar 18, 2007 3:49 PM
We have been observing testsuite execution failures due to:
(i) servers failing to shut down correctly, blocking the startup of subsequent servers
(ii) servers timing out on startup
Some of these failures arise on platforms which have been newly introduced for JBPAPP testing, where some platforms (e.g. HPUX) cause the AS to start more slowly. These build failures wreak havoc on any attempts to automate the execution of the testsuite.
I've been investigating these problems. Here is what I have found:
1. The standard means of starting and stopping servers is to:
(i) define a server configuration in the server:config element in the file testsuite/import/server-config.xml
(ii) use server:start conf="" to start the configuration
(iii) use server:stop conf="" to stop the configuration
server:start will exec the program org.jboss.Main and wait up to 120 seconds for the server to reach a started state (the server is 'started' if it responds to an HTTP request); if the server fails to respond, the ant task StartServerTask will raise a BuildException with the message "Error starting server " and the build will be terminated (if ant is set up to do so).
server:stop will exec the program org.jboss.Shutdown and will wait up to 120 seconds for the server to shut down (determined by a call to Process.exitValue()). If the server does not shut down, the message "Failed to shutdown server before timeout. Destroying the process" is logged and an attempt is made to destroy the process. No build exception is raised and the build will continue.
The target jboss-all-config-tests uses this approach, and it has been known to fail to correctly shut down the 'all' server instance, blocking the startup of tests-security-manager, which follows it in the testsuite. This needs further investigation. A similar issue was raised (JBQA-405) and the code was changed by Anil to add the class library commons-logging.jar to the stopServerClasspath(), because org.jboss.Shutdown may fail to execute when exec'ed if classes it needs are not present on the classpath. I believe this was in response to 'minimal' not shutting down, and this was a repeatable error. However, the changes made for that JIRA issue are not present in the current distribution versions.
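To make the server:stop behaviour described above concrete, here is a minimal sketch of "wait for exit, then destroy" in plain Java. It is not the actual testsuite code: the class name StopWithTimeout, the method names and the poll interval are my own, and only the 120 second timeout and the log message come from the description above.

```java
public class StopWithTimeout {

    /** True once the process has exited; Process.exitValue() throws while it is still running. */
    static boolean hasExited(Process process) {
        try {
            process.exitValue();
            return true;
        } catch (IllegalThreadStateException stillRunning) {
            return false;
        }
    }

    /**
     * Polls the process until it exits or the timeout elapses.
     * Returns true on a clean exit, false if the process had to be destroyed.
     * Like server:stop as described, it does not fail the build by throwing.
     */
    static boolean waitOrDestroy(Process process, long timeoutMillis, long pollMillis) {
        long deadline = System.nanoTime() + timeoutMillis * 1_000_000L;
        while (deadline - System.nanoTime() > 0) {
            if (hasExited(process)) {
                return true;
            }
            try {
                Thread.sleep(pollMillis);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt(); // preserve the interrupt and stop waiting
                break;
            }
        }
        if (hasExited(process)) {
            return true;
        }
        System.err.println("Failed to shutdown server before timeout. Destroying the process");
        process.destroy();
        return false;
    }
}
```

A caller would use something like StopWithTimeout.waitOrDestroy(serverProcess, 120_000, 1_000), where serverProcess is the Process handle kept from the earlier exec of org.jboss.Main. Note that Process.destroy() is itself only an attempt; whether the JVM really goes away is platform dependent.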
2. There are some old mechanisms for starting and stopping servers, which some tests are using:
<start-jboss>
- executes the java program org.jboss.Main in the background (with 64m memory overriding)
- doesn't block, so we need to use <wait-on-host> to wait until startup has completed
<wait-on-host>
- waits for a host to start, by periodically trying to make a connection to http://hostname:8080
- fails the build and spits out a message if not up in 60 seconds
<stop-jboss>
- executes the java program org.jboss.Shutdown in the background
- does not block once executed, therefore we need to follow this with <wait-on-shutdown>
<wait-on-shutdown>
- waits for a host to shut down, by checking for the presence of "[org.jboss.system.server.Server] Shutdown complete" in server.log
- fails the build and spits out a message if not down in 60 seconds
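For reference, the polling that <wait-on-host> performs can be sketched as below. This is only an illustration under my own assumptions, not the macro's implementation: the class name WaitOnHost, the method names and the 2 second connect/read timeouts are invented for the example. Only "poll an HTTP URL until a deadline" comes from the description above.

```java
import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.URL;
import java.util.function.BooleanSupplier;

public class WaitOnHost {

    /** One probe: true if an HTTP connection can be made and any response code is read back. */
    static boolean respondsToHttp(String url) {
        try {
            HttpURLConnection conn = (HttpURLConnection) new URL(url).openConnection();
            conn.setConnectTimeout(2000); // assumed value, keeps a dead host from stalling the probe
            conn.setReadTimeout(2000);
            conn.getResponseCode();       // any status code means something is listening
            conn.disconnect();
            return true;
        } catch (IOException notUpYet) {
            return false;
        }
    }

    /** Repeats the probe every pollMillis until it succeeds or timeoutMillis has elapsed. */
    static boolean waitUntil(BooleanSupplier probe, long timeoutMillis, long pollMillis) {
        long deadline = System.nanoTime() + timeoutMillis * 1_000_000L;
        do {
            if (probe.getAsBoolean()) {
                return true;
            }
            try {
                Thread.sleep(pollMillis);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return false;
            }
        } while (deadline - System.nanoTime() > 0);
        return false;
    }
}
```

Usage would be along the lines of WaitOnHost.waitUntil(() -> WaitOnHost.respondsToHttp("http://localhost:8080"), 60_000, 1_000), with localhost standing in for the real host name. Keeping the timeout as a parameter means a slow platform can be given 120 seconds or more without touching the probe itself.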
These old, and really deprecated, methods of startup and shutdown are inferior to server:start and server:stop, in that:
- they will not try to kill a process which won't shut down, but only fail the build
- they also use their own timeouts, which are 60 secs, instead of 120 secs. Given that JBoss configuration 'all' takes 45 secs to start on a fast machine, with no other processes running, this does not account for (i) slow hosts (ii) hosts with several jboss instances running (iii) complex configuration startups.
- the wait-on-shutdown target seems to be broken, as the log entry for system shutdown has changed from "[org.jboss.system.server.Server] Shutdown complete" to "[Server] Shutdown complete".
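On the last point, a check that tolerates both log formats is easy to write. The sketch below is just an illustration of the idea in Java (the class and method names are mine, and the real check lives in the ant macro, not in a class like this): it accepts the category either fully qualified or abbreviated.

```java
import java.util.regex.Pattern;

public class ShutdownMarker {

    // Matches "[Server] Shutdown complete" with an optional package prefix inside the
    // brackets, e.g. "[org.jboss.system.server.Server] Shutdown complete".
    private static final Pattern SHUTDOWN_COMPLETE =
            Pattern.compile("\\[(?:[\\w.]+\\.)?Server\\] Shutdown complete");

    /** True if the given server.log line contains the shutdown marker in either format. */
    static boolean isShutdownComplete(String logLine) {
        return SHUTDOWN_COMPLETE.matcher(logLine).find();
    }
}
```

Matching on a pattern rather than a fixed string means a change in how the logging category is printed does not silently turn every shutdown into a 60 second timeout.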
There are a number of targets which use <start-jboss>/<wait-on-host> and <stop-jboss>/<wait-on-shutdown> and so are susceptible to timeouts.
They are:
* tests-compatibility-pooledInvokers
* tests-compatibility
* tests-jacc-securitymgr
* tests-jacc-security-external
* tests-jacc-security-allstarrole
What is worse, the latter three mix a server:start startup with <stop-jboss>/<wait-on-shutdown>.
Thus, in trying to fix the problem of the testsuite failing due to (i) servers failing to start up because other servers have not cleanly shut down and (ii) server startups timing out and causing the build to fail, it might help to:
1. rework the targets above to use server:start and server:stop, so that at least we have consistent behaviour when stopping and starting servers.
2. re-investigate the problem with 'all' not shutting down, which may be due to OOME. One approach would be to introduce TRACE statements which trace the execution of steps between a client exec'ing Shutdown and the shutdown() request being processed on the server side.
I'll go ahead and make these changes to a temporary copy of the source tree and test them out as much as I can.
If anyone has any comments or suggestions, please feel free to post them.