While stress testing our application, which runs under JBoss 3.0.6 on a dual-processor Linux box (kernel 2.4.18-5smp) using Sun's 1.4.1_01 JVM, we've encountered an interesting failure. The box provides a variety of stateful and stateless session beans to a separate web server box that runs Apache with the Resin servlet container.
At right around 500 users, the web server box starts reporting the following error:
05/16 10:17:14.377 akZtqtw_JFfd (PageDisplayUI.java:62) - Exception occurred ...
java.rmi.RemoteException: Service unavailable
At that time, we noticed that the number of Apache and Resin threads on the web server box skyrocketed and the number of threads on the JBoss box jumped from 100 or so to over 350. During this time, the load on either box did not exceed 2 (and averaged around 1.2).
Further investigation revealed that JBoss reported an active thread count of 350+ threads, but its list-threads method on the JMX console reported only 85 or so. We then noticed that there were a large number of sockets open to the box that were stuck in the CLOSE_WAIT state. In fact, the number of these open sockets corresponded exactly to the number of missing threads.
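For anyone who wants to track the same number during a test run, here is a rough sketch of how the CLOSE_WAIT count could be sampled on Linux by reading /proc/net/tcp, where the "st" column holds the socket state as a hex code (08 is CLOSE_WAIT). The class and method names are made up for illustration, and it uses java.nio.file, so it needs a much newer JVM than the 1.4.1 one running JBoss and would have to run as a separate tool on the same box.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.List;

// Hypothetical helper, not part of JBoss: counts sockets in a given TCP state.
public class CloseWaitCounter {
    // Hex state code the Linux kernel prints for CLOSE_WAIT in /proc/net/tcp.
    static final String CLOSE_WAIT = "08";

    /** Counts rows of /proc/net/tcp-style text whose "st" column equals hexState. */
    static int countState(List<String> lines, String hexState) {
        int count = 0;
        for (String line : lines) {
            String[] cols = line.trim().split("\\s+");
            // Data rows start with a slot number such as "12:"; the header row does not.
            // Columns: 0 = slot, 1 = local address, 2 = remote address, 3 = state.
            if (cols.length > 3 && cols[0].endsWith(":")
                    && cols[3].equalsIgnoreCase(hexState)) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) throws IOException {
        // Assumption: IPv4 sockets only; IPv6 sockets are listed in /proc/net/tcp6.
        List<String> lines = Files.readAllLines(Paths.get("/proc/net/tcp"));
        System.out.println("CLOSE_WAIT sockets: " + countState(lines, CLOSE_WAIT));
    }
}
```

Running this every few seconds while the load ramps up should show whether the CLOSE_WAIT count grows in step with the thread count.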
Once we stopped the load on the web server, the web box did not show any open sockets to JBoss while the sockets stuck in CLOSE_WAIT on the JBoss box remained. In fact, these sockets did not go away until we stopped JBoss.
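This behavior matches how CLOSE_WAIT works in general: the state means the remote end has already closed, and the socket stays there until the local process itself calls close(). A minimal loopback sketch of that sequence follows. The class name is hypothetical, it only illustrates the TCP behavior and says nothing about where JBoss or RMI fails to close, and it again assumes a newer JVM (InetAddress.getLoopbackAddress() is not available in 1.4).

```java
import java.io.IOException;
import java.io.InputStream;
import java.net.InetAddress;
import java.net.ServerSocket;
import java.net.Socket;

// Hypothetical demo class: shows a server-side socket entering CLOSE_WAIT.
public class CloseWaitDemo {
    /**
     * Opens a loopback connection, closes the client end, and returns what the
     * server end reads next. A return value of -1 means the peer has closed.
     */
    static int readAfterPeerClose() throws IOException {
        InetAddress loopback = InetAddress.getLoopbackAddress();
        ServerSocket listener = new ServerSocket(0, 1, loopback);
        try {
            Socket client = new Socket(loopback, listener.getLocalPort());
            Socket accepted = listener.accept();
            try {
                // The peer closes first, which sends a FIN to our side.
                client.close();
                // Once the FIN arrives, 'accepted' sits in CLOSE_WAIT and
                // read() reports end-of-stream by returning -1.
                InputStream in = accepted.getInputStream();
                return in.read();
            } finally {
                // Only this local close() takes the socket out of CLOSE_WAIT.
                accepted.close();
            }
        } finally {
            listener.close();
        }
    }

    public static void main(String[] args) throws IOException {
        System.out.println("read() after peer close returned " + readAfterPeerClose());
    }
}
```

If the accepted.close() call were skipped, the socket would linger in CLOSE_WAIT until it was garbage collected or the process exited, which is consistent with our sockets disappearing only when JBoss was stopped.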
At this point, our leading theory is that we hit some kind of resource limit (probably related to networking), which caused RMI-related errors that JBoss was not able to recover from due to a bug in either JBoss or the RMI code. I have searched Sun's Java bug database, the JBoss forums, and the SourceForge project's bug database and have not seen anything that appears to be related to this.
So I'm posting this to the forum to see if anyone has ever seen anything similar or has any ideas about kernel or JBoss parameters to tune. We plan on trying the latest Sun JVM (1.4.1_02) as well as a newer JBoss version (3.0.7 and 3.2), and possibly even a different vendor's JVM. If we can find the cause of the failure, I'd then like to improve JBoss' handling of it.