Diagnosing the root cause of recent outages in a J2EE web application
mrgordonz  Aug 15, 2013 10:26 AM
Apologies in advance for a slightly wordy question.
We provide hosting and application management for a J2EE web application. The technology stack is CentOS 5.9 (64-bit), Apache 2.2.17, Tomcat 5 and JBoss 4.2.2, and the (virtual) server has 6 GB RAM. We typically see around 2500 concurrent users during business hours, and as a rule the environment runs fine (we have even cracked 3300 concurrent users without any performance problems). Recently we had some brief outages, and we aren't sure of the root cause. Each outage lasted only 2-3 minutes - long enough to get the alert email from the monitoring software, verify that the application is not available, open a terminal to the server, and restart a service.
Some information about the outages:
- All services were still running - nothing had crashed, and there were no heap dumps generated
- The JBoss logs show no evidence of an OutOfMemoryError
- No error messages appeared in browser
- When trying to access the site, nothing would load in the browser, and it seemed to spend an eternity thinking - like it was "spinning its wheels"
- Initially, restarting Apache restored access, but after the fourth outage in the space of two hours we restarted JBoss instead. This seemed to fix the problem (next time we intend to capture some diagnostics before restarting anything - see the sketch just after this list)
- During the outages, concurrency was quite low - well below the average during business hours
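Since nothing had actually crashed, next time we intend to capture a snapshot of what the JVM is doing before we restart anything. A minimal sketch of what we have in mind, assuming the Sun JDK tools are on the path and that JBoss shows up in ps as a single java process running org.jboss.Main (that lookup is our assumption, not something we've verified):

# Find the JBoss JVM (assumes a single java process running org.jboss.Main)
JBOSS_PID=$(pgrep -f 'org.jboss.Main' | head -n 1)

# Thread dump - shows what every JBoss/Tomcat thread is doing while the site hangs
jstack "$JBOSS_PID" > /tmp/threads-$(date +%H%M%S).txt

# Fallback if jstack misbehaves: SIGQUIT writes the thread dump (and, with our
# -XX:+PrintClassHistogram flag, a class histogram) to the JBoss console log
# kill -3 "$JBOSS_PID"

# Class histogram, to see which objects are filling the heap at that moment
jmap -histo "$JBOSS_PID" > /tmp/histo-$(date +%H%M%S).txt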
We have analysed log files, monitoring reports, GC logs, etc. What we can say for sure is that at the times of the outages, the monitoring software reported a very high number of Apache "Busy Servers". I'm not sure exactly what a Busy Server is, but at the time of each outage this value spiked to between 150 and 200, whereas the average is about 5 and it rarely goes over 10. We can also see from the GC logs that there seemed to be a memory issue at the time of the outages, for example:
4967.376: [Full GC [PSYoungGen: 17576K->0K(656384K)] [PSOldGen: 3131794K->628760K(3145728K)] 3149370K->628760K(3802112K) [PSPermGen: 157485K->157485K(315008K)], 2.6067200 secs]
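Reading that line: the Full GC brought the old gen down from roughly 3.0 GB to about 614 MB, PSPermGen stayed flat at about 154 MB of a ~308 MB committed perm gen, and the pause was about 2.6 seconds (the 4967.376 prefix is seconds since JVM start, so lining it up with the outage alerts means working back from the JBoss start time). To watch these numbers live during the next incident rather than reconstructing them from the log afterwards, we could run something along these lines - jstat ships with the JDK, and the 5-second/60-sample figures are just placeholders:

# Sample heap and perm gen utilisation every 5 seconds, 60 samples
# (replace <jboss-pid> with the JBoss java process id).
# E = eden %, O = old gen %, P = perm gen %, FGC/FGCT = full GC count and total time.
jstat -gcutil <jboss-pid> 5000 60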
The JVM options are:
JAVA_OPTIONS="${JAVA_OPTIONS} -Xms4096m -Xmx4096m -XX:NewRatio=3 -XX:PermSize=256m -XX:MaxPermSize=512m -XX:SurvivorRatio=4"
JAVA_OPTIONS="${JAVA_OPTIONS} -Dsun.rmi.dgc.client.gcInterval=3600000 -Dsun.rmi.dgc.server.gcInterval=3600000 -XX:+DisableExplicitGC -XX:+UseLWPSynchronization"
JAVA_OPTIONS="${JAVA_OPTIONS} -XX:+PrintClassHistogram -XX:+HeapDumpOnOutOfMemoryError -Xloggc:sabagc.log -XX:+PrintGCDetails"
What we think has happened is that PermGen ran out of space, which in turn caused Tomcat to stop accepting requests from Apache (via mod_jk). That caused Apache to start queuing requests, hence the high number of Busy Servers. Restarting Apache was only a short-term fix because it didn't address the underlying memory issue; memory was only freed up when we restarted JBoss.
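If that theory is right, then during the next outage Apache's workers should all be stuck waiting on Tomcat while the mod_jk log complains about the backend. A rough sketch of how we'd confirm it, assuming mod_status is enabled and /server-status is reachable from localhost (presumably that is where the monitoring tool gets its "Busy Servers" figure), and assuming the mod_jk log path below, which is only a guess at our own config:

# Machine-readable status page: BusyWorkers is what the monitoring tool calls "Busy Servers"
curl -s 'http://localhost/server-status?auto' | grep -E 'BusyWorkers|IdleWorkers'

# Full scoreboard - if most busy workers are in state 'W' (sending reply),
# they are sitting on requests that Tomcat has not answered yet
curl -s 'http://localhost/server-status'

# Recent mod_jk errors (the path is whatever JkLogFile points at;
# /var/log/httpd/mod_jk.log is just an assumption)
grep -i error /var/log/httpd/mod_jk.log | tail -n 50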
Based on the available information, does this sound plausible? And is the solution simply to increase -XX:PermSize and -XX:MaxPermSize?
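For concreteness, if raising PermGen turns out to be the answer, the change we have in mind only touches the first JAVA_OPTIONS line - the 512m/768m values below are a first guess rather than tuned numbers. We'd probably also add wall-clock timestamps to sabagc.log so GC events can be lined up with the outage alerts, assuming our JVM is new enough (6u4 or later) to support -XX:+PrintGCDateStamps:

# Proposed change: larger permanent generation (sizes are a guess, not tuned)
JAVA_OPTIONS="${JAVA_OPTIONS} -Xms4096m -Xmx4096m -XX:NewRatio=3 -XX:PermSize=512m -XX:MaxPermSize=768m -XX:SurvivorRatio=4"

# Optional: wall-clock timestamps in the GC log (requires Java 6u4 or later)
JAVA_OPTIONS="${JAVA_OPTIONS} -XX:+PrintGCDateStamps"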