We have a customer encountering what looks like a thread leak that, after about 12 hours of use, causes our server to throw an OutOfMemoryError (OOME) and crash. I have read a lot of the forum postings, such as http://community.jboss.org/message/157560#157560, and this blog.
What we are seeing is a constant growth of handles and threads (SocketManager.ReadTask, SocketManager.WriteTask and SocketManager.MsgPool) until the server finally falls over. We have double-checked our code to ensure that all sessions and connections are being closed. We are not able to reproduce this in our lab.
I am not saying this is JBoss MQ's problem; I am just looking for suggestions on how to debug and track this further.
- We are running JBoss 4.2.3.GA using the default MQ implementation running on Windows 2003 Server. Java is 1.5.0_22 Sun.
- Messages are being persisted to a SQL Server database.
- We have a hub and spoke architecture. Our Central Server, which is the hub, is where the problem occurs. There is one spoke for each of the customer's retail stores; in this case, that means about 200 retail stores.
- The retail stores post sales txns to a local queue, and a remote MDB defined at Central is called to save each txn on our Central server; the txn is later sent to a host system.
- Central also has a queue destination for every retail store for downloads that need to be delivered to each store.
- As a result, once up and running, JBoss at Central has 200 remote MDBs defined, each one pointing to a different JBoss 'store' server.
- Every retail store also connects to Central for the downloads. Those MDBs are deployed at the store, but each one holds a constant connection to Central as a consumer of a queue on our Central Server.
- It looks like each queue or MDB defined contributes 3-4 threads to the server (ReadTask, WriteTask, MsgPool, and some have 'connection consumer for dest' - queue name).
- I have two thread dumps taken just a couple of minutes apart. The first shows 430 threads each for ReadTask, WriteTask and MsgPool. A couple of minutes later, we have 444 ReadTask/WriteTask threads and 442 MsgPool threads, and the total thread count went from 1416 to 1484. Thread dumps attached.
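To compare the two dumps without counting by hand, something like the following could help. It is only a minimal sketch: it assumes Sun-style dumps where each thread header line starts with the quoted thread name, and it assumes the SocketManager thread names carry a '#' followed by a per-connection id (which is how they appear to be named in our dumps). The class name ThreadDumpSummary and the method names are made up for this example.

```java
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.io.Reader;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

// Hypothetical helper, not part of JBoss: groups threads in a dump by "kind".
class ThreadDumpSummary {

    // Reduce a thread name to its kind: cut at the first '#',
    // otherwise strip a trailing numeric suffix such as "-12".
    public static String kindOf(String threadName) {
        int hash = threadName.indexOf('#');
        if (hash >= 0) {
            return threadName.substring(0, hash).trim();
        }
        return threadName.replaceAll("[-\\s]*\\d+$", "").trim();
    }

    // Count thread header lines (lines starting with a double quote) per kind.
    public static Map<String, Integer> countByKind(Reader dump) throws IOException {
        Map<String, Integer> counts = new TreeMap<String, Integer>();
        BufferedReader in = new BufferedReader(dump);
        try {
            String line;
            while ((line = in.readLine()) != null) {
                if (!line.startsWith("\"")) {
                    continue; // stack frame or blank line
                }
                int end = line.lastIndexOf('"');
                if (end <= 0) {
                    continue; // no closing quote, not a header we understand
                }
                String kind = kindOf(line.substring(1, end));
                Integer old = counts.get(kind);
                counts.put(kind, old == null ? 1 : old + 1);
            }
        } finally {
            in.close();
        }
        return counts;
    }

    // Usage: java ThreadDumpSummary <earlier dump> <later dump>
    public static void main(String[] args) throws IOException {
        Map<String, Integer> before = countByKind(new FileReader(args[0]));
        Map<String, Integer> after = countByKind(new FileReader(args[1]));
        Set<String> kinds = new TreeSet<String>(before.keySet());
        kinds.addAll(after.keySet());
        for (String kind : kinds) {
            int b = before.containsKey(kind) ? before.get(kind) : 0;
            int a = after.containsKey(kind) ? after.get(kind) : 0;
            if (a != b) {
                System.out.println(kind + ": " + b + " -> " + a);
            }
        }
    }
}
```

Run against the two dump files, this prints only the thread kinds whose count changed, which should make it easier to see whether anything besides the SocketManager threads is growing.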
It looks like the ReadTask is not receiving the CloseMsg (m_connectionClosing). On Central, netstat and tcpview showed ESTABLISHED connections between local port 18093 (our Central server bind port) and some store ip:port (for example storea:1547). Yet when we logged into the remote store server and ran netstat -a, local port 1547 was not in use, was not in TIME_WAIT, and did not show up in the netstat output at all.
We have TRACE-level logs (for org.jboss.mq) covering about 15 minutes, captured before the customer needed to stop our testing on their production box.
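To illustrate what we suspect is happening, here is a small stand-alone sketch. It is not JBoss MQ code, and the class and method names are invented for the example. It only shows the general behaviour of a blocking socket read: if the peer never sends anything (or disappears without the close ever reaching us), the read returns only when a read timeout set with Socket.setSoTimeout expires. With no timeout, the reading thread would stay blocked indefinitely. In this demo the peer is merely silent rather than gone, which is the closest we can get on a single machine.

```java
import java.io.IOException;
import java.net.InetAddress;
import java.net.ServerSocket;
import java.net.Socket;
import java.net.SocketTimeoutException;

// Hypothetical demo class: a reader blocked on a silent peer.
class SilentPeerDemo {

    // Returns true if reading from a peer that never writes ends
    // with a SocketTimeoutException after timeoutMillis.
    public static boolean readTimesOut(int timeoutMillis) throws IOException {
        ServerSocket server = new ServerSocket(0, 1, InetAddress.getByName("127.0.0.1"));
        Socket client = null;
        Socket accepted = null;
        try {
            // The client connects but never sends a byte and never closes.
            client = new Socket(server.getInetAddress(), server.getLocalPort());
            accepted = server.accept();
            // Without this line the read below would block until the peer acts.
            accepted.setSoTimeout(timeoutMillis);
            try {
                accepted.getInputStream().read();
                return false; // data or end-of-stream arrived, not a timeout
            } catch (SocketTimeoutException expected) {
                return true;
            }
        } finally {
            if (accepted != null) {
                accepted.close();
            }
            if (client != null) {
                client.close();
            }
            server.close();
        }
    }

    public static void main(String[] args) throws IOException {
        System.out.println("read timed out: " + readTimesOut(500));
    }
}
```

If the leaked ReadTask threads are in this state, their stack traces in the thread dumps should all be sitting in a socket read, which is something we can check against the attached dumps.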
We are looking for suggestions on how to debug this further or get closer to finding the source of the problem!