I am using JBoss Messaging 1.4.7.GA with the bisocket transport from JBoss Remoting 2.5.3.SP1 connecting to remote clients.
Ever since I started using this combination I have seen stability issues on large, busy production systems that I have been unable to reproduce in a test environment. The issues appear mostly when the systems are under heavy load, and they get much worse when the network is poor (high packet loss and/or parts of the network disconnecting at times).
Today I think I have found the cause of these issues, and it looks like a bug in the interface between JBoss Messaging and JBoss Remoting. I would like some of you JBoss Messaging experts here to comment on my observations before I open a JIRA ticket.
My findings start with two thread dumps obtained from a production server experiencing these issues. The thread dumps were taken 8 seconds apart, and both show the same stack trace:
"WorkManager(2)-76" daemon prio=10 tid=0x00002aaae483c000 nid=0x3d40 in Object.wait() [0x000000004770a000]
java.lang.Thread.State: TIMED_WAITING (on object monitor)
at java.lang.Object.wait(Native Method)
- locked <0x000000070f9cb8f8> (a java.util.HashSet)
at org.jboss.remoting.Client.invokeOneway(Client.java:926) - not client side, means we may block instead of executing in a new thread
at org.jboss.remoting.callback.ServerInvokerCallbackHandler.handleCallbackOneway(ServerInvokerCallbackHandler.java:708) - serverSide=false
- locked <0x00000006a8c8dfd0> (a org.jboss.jms.server.endpoint.ServerSessionEndpoint)
- locked <0x0000000695caa390> (a java.lang.Object)
- locked <0x0000000695caa498> (a java.lang.Object)
This thread is trying to deliver a JMS message and is waiting in BisocketClientInvoker.createSocket() for a remote client to open a connection back to the server. At the same time the thread holds a read lock in MessagingPostOffice. This means that any work in MessagingPostOffice that needs a write lock is blocked.
The problem in this case is that the remote client is no longer connected to the network, so it will never connect back to the server. The wait in BisocketClientInvoker.createSocket() eventually times out. There is a retry loop in BisocketClientInvoker.createSocket() that retries a few times, but in the end the call fails and the read lock in MessagingPostOffice is released. If I read the code correctly, the read lock is held for 60 seconds with the default JBoss Remoting settings.
Having the MessagingPostOffice lock exposed to client communication problems in this way is not good, and probably not what you want.
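To illustrate why this matters, here is a minimal standalone sketch of the locking pattern. It is not JBoss code: java.util.concurrent's ReentrantReadWriteLock stands in for the lock in MessagingPostOffice (an assumption, the real class may use a different lock implementation), a Thread.sleep() stands in for the wait in BisocketClientInvoker.createSocket(), and the class and method names are made up for the example.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.ReentrantReadWriteLock;

public class ReadLockStallDemo {

    /**
     * Starts a "delivery" thread that takes the read lock and then blocks for
     * readerHoldMs, the way a delivery blocks on an unreachable client.
     * Returns true if a writer manages to get the write lock within writerWaitMs.
     */
    static boolean writerGetsLock(long readerHoldMs, long writerWaitMs) {
        ReentrantReadWriteLock lock = new ReentrantReadWriteLock();
        CountDownLatch readLockHeld = new CountDownLatch(1);

        Thread delivery = new Thread(() -> {
            lock.readLock().lock();
            try {
                readLockHeld.countDown();
                Thread.sleep(readerHoldMs); // stand-in for the blocked socket wait
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            } finally {
                lock.readLock().unlock();
            }
        });
        delivery.start();

        try {
            readLockHeld.await(); // make sure the reader really holds the lock
            boolean acquired = lock.writeLock().tryLock(writerWaitMs, TimeUnit.MILLISECONDS);
            if (acquired) {
                lock.writeLock().unlock();
            }
            delivery.join();
            return acquired;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting", e);
        }
    }

    public static void main(String[] args) {
        // Reader blocks longer than the writer is willing to wait: writer is starved.
        System.out.println("writer got lock (slow client): " + writerGetsLock(500, 100));
        // Reader finishes quickly: writer proceeds.
        System.out.println("writer got lock (fast client): " + writerGetsLock(50, 1000));
    }
}
```

In the sketch the writer only waits a bounded time. A writer that calls lock() without a timeout would simply hang for as long as the delivery thread is stuck.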
Looking at the way JBoss Messaging calls JBoss Remoting when delivering a message to a remote client, I see that you use one-way calls, where the caller does not care about a response and the call is executed in another thread. But in JBoss Remoting there are two ways of doing this:
- Starting a thread on the caller side to handle the call. This way the new thread will take care of any communication problems, and the calling thread can return immediately so the read lock in MessagingPostOffice is immediately released.
- Starting a new thread on the remote side to handle the call. This way the calling thread has to take care of any communication problems, and only when the invocation has been delivered to the remote side will the remote side start a new thread to handle the call.
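The difference between the two variants can be sketched in plain Java. This is only a simulation under stated assumptions, not the Remoting API: routingLock is a hypothetical stand-in for the MessagingPostOffice lock, slowTransport() simulates a transport call that blocks on a bad network, and deliver() returns how long the calling thread was tied up while holding the read lock.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.ReentrantReadWriteLock;

public class OnewayStyles {

    // Hypothetical stand-in for the lock held during delivery.
    static final ReentrantReadWriteLock routingLock = new ReentrantReadWriteLock();

    // Simulates a transport call that blocks, e.g. on an unreachable client.
    static void slowTransport(long delayMs) {
        try {
            Thread.sleep(delayMs);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    /**
     * Delivers one "message" while holding the read lock.
     * callerSideThread = true:  hand the transport call to a pool thread and return.
     * callerSideThread = false: the calling thread performs the transport call itself.
     * Returns the elapsed time in milliseconds, which covers the time the lock was held.
     */
    static long deliver(boolean callerSideThread, long transportDelayMs, ExecutorService pool) {
        long start = System.nanoTime();
        routingLock.readLock().lock();
        try {
            if (callerSideThread) {
                pool.execute(() -> slowTransport(transportDelayMs));
            } else {
                slowTransport(transportDelayMs);
            }
        } finally {
            routingLock.readLock().unlock();
        }
        return TimeUnit.NANOSECONDS.toMillis(System.nanoTime() - start);
    }

    public static void main(String[] args) {
        ExecutorService pool = Executors.newFixedThreadPool(2);
        System.out.println("caller-side thread, ms holding lock: " + deliver(true, 300, pool));
        System.out.println("remote-side thread, ms holding lock: " + deliver(false, 300, pool));
        pool.shutdown();
    }
}
```

With the caller-side variant the lock is released almost immediately and the pool thread absorbs the network delay. The trade-off is that a bounded pool can fill up when many clients are unreachable, so the pool size and queue policy would need some thought in a real fix.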
Unfortunately it looks like JBoss Messaging is using the second way of doing one-way calls.
So my question is: Isn't this a bug? Shouldn't we use the first way of doing one-way calls instead?