We run an EJB-based web service on WildFly. It exposes both SOAP and REST endpoints and uses a MySQL database. Shortly after we migrated some production servers from WildFly 9.0.2.Final to 10.0.0.Final, we started experiencing occasional performance problems that rapidly escalate to complete inoperability. Symptoms include constantly very high CPU load, rapidly rising memory consumption that leads to out-of-memory errors, an increased count of timer interrupts, the web server serving content more and more slowly, some EJB operations taking a very long time to complete, and the management interface becoming unresponsive, so that deploying and undeploying applications (using jboss-cli.sh) fails with a timeout. The server remained responsive enough that it could be restarted using the init script.
The operating system is CentOS 6.8, with OpenJDK downgraded to the java-1.8.0-openjdk-126.96.36.199-0.b14.el6_7 package from CentOS 6.7 because of this bug: Issue with SSL and java-1.8.0-openjdk 91-1.b14 - Red Hat Customer Portal
We have so far encountered this issue only in production, though on two separate servers. One is virtual and the other is bare metal, so a hardware fault is very unlikely. Our servers that run WildFly 10 on CentOS 7 have not encountered the issue yet. We haven't been able to isolate a trigger. It has happened as soon as 15 minutes after an application server restart. The MySQL database has shown no signs of trouble, such as deadlocks. The MySQL JDBC connector version is 5.1.33, and the issue also occurs with version 5.1.39. There are indications that high load makes the issue more likely, but it is not a requirement: the issue has also occurred during off-peak hours.
I took a few thread dumps (stack traces), a few seconds apart, using "kill -3" while the issue was happening. The attached console.log file contains them.
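For anyone comparing the dumps, a rough way to summarize them is to count threads per state and tally the topmost frame of each RUNNABLE thread. The sketch below assumes the usual HotSpot "kill -3" output layout (a quoted thread name with nid=0x..., then a java.lang.Thread.State line, then "at ..." frames). The function name summarize_dump is only illustrative.

```python
import re
from collections import Counter

# Header of a thread entry, e.g.
# "default task-1" #101 prio=5 os_prio=0 tid=0x... nid=0x1a2b runnable [0x...]
THREAD_HEADER = re.compile(r'^"(?P<name>.*)" .*\bnid=(?P<nid>0x[0-9a-fA-F]+)')
# State line, e.g. "   java.lang.Thread.State: WAITING (parking)"
STATE_LINE = re.compile(r'^\s+java\.lang\.Thread\.State: (?P<state>[A-Z_]+)')
# Stack frame, e.g. "\tat java.net.SocketInputStream.socketRead0(Native Method)"
FRAME_LINE = re.compile(r'^\s+at (?P<frame>[^(\s]+)')


def summarize_dump(text):
    """Return (states, hot_frames) for thread dump text.

    states counts threads per java.lang.Thread.State; hot_frames counts the
    topmost stack frame of every RUNNABLE thread. If the text holds several
    dumps, the counts are summed over all of them.
    """
    states = Counter()
    hot_frames = Counter()
    state = None          # state of the thread entry currently being read
    need_frame = False    # True until the first frame of that entry is seen
    for line in text.splitlines():
        if THREAD_HEADER.match(line):
            # New thread entry: forget the previous thread's state.
            state, need_frame = None, False
            continue
        m = STATE_LINE.match(line)
        if m:
            state = m.group("state")
            states[state] += 1
            need_frame = True
            continue
        m = FRAME_LINE.match(line)
        if m and need_frame:
            need_frame = False
            if state == "RUNNABLE":
                hot_frames[m.group("frame")] += 1
    return states, hot_frames
```

Passing the contents of console.log to summarize_dump and looking at hot_frames.most_common(10) should show whether the RUNNABLE threads cluster in the same few methods across dumps, which is usually a hint about where the CPU time goes.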
We have also observed that with WildFly 9 the server sometimes consumes all the processing time of one core until it is restarted. This can accumulate, with more than one core affected over time, but it does not happen often and does not lead to inoperability. The server remains fully responsive. This has been observed on the aforementioned two servers that ran into the more serious issue on WildFly 10, and also on one server running CentOS 7 that has worked properly with WildFly 10.