The following is a copy of the email I sent to our group earlier today.
From what I can see, the issue appears to be related to two things:
- The way the current codebase repeatedly creates connections
- The way JBoss Remoting works
One cause appears to be the timers used by Remoting. From what I can
see, there are three in play.
Each one of these timers has indirect access to the majority of the heap.
The big culprit in the timers appears to be LeasePinger$LeaseTimerTask.
When connections are closed this task is cancelled, but unfortunately
cancelling a j.u.TimerTask does *not* remove the task from the queue;
all it does is mark it as cancelled. The consequence of this is that
every instance referenced by the task *cannot* be garbage collected
until the timer would normally have fired (at which point the task is
removed from the queue).
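The TimerTask behaviour described above can be reproduced outside Remoting. The sketch below is a hypothetical demo, not Remoting code: `PayloadTask` merely stands in for LeasePinger$LeaseTimerTask. It schedules tasks far in the future, cancels them, and shows that they are still sitting in the timer's queue (and therefore reachable) until `Timer.purge()` removes them. A live task due earlier is scheduled first so the cancelled tasks never reach the head of the queue, roughly mirroring a timer that still has other leases to ping. Whether Remoting 2.2.0.beta1 ever calls `purge()` is not established here.

```java
import java.util.Timer;
import java.util.TimerTask;

public class CancelledTaskRetention {

    // Hypothetical stand-in for LeasePinger$LeaseTimerTask: it holds a large
    // payload, much as the real task indirectly holds socket buffers.
    static final class PayloadTask extends TimerTask {
        final byte[] payload = new byte[64 * 1024];

        @Override
        public void run() {
            // Never expected to run in this demo.
        }
    }

    /**
     * Schedules n payload tasks an hour ahead, cancels each one, and returns
     * how many cancelled tasks Timer.purge() then removes from the queue.
     * Until that purge, every cancelled task (and its payload) is still
     * strongly reachable from the Timer.
     */
    public static int cancelThenPurge(int n) {
        Timer timer = new Timer("lease-timer-demo", true);
        try {
            // A live task due earlier stays at the head of the queue; the timer
            // thread only discards a cancelled task once it reaches the head.
            timer.schedule(new PayloadTask(), 30 * 60 * 1000L);
            for (int i = 0; i < n; i++) {
                TimerTask task = new PayloadTask();
                timer.schedule(task, 60 * 60 * 1000L);
                task.cancel(); // marks the task as cancelled; it stays queued
            }
            return timer.purge(); // removes every cancelled task from the queue
        } finally {
            timer.cancel();
        }
    }

    public static void main(String[] args) {
        System.out.println("purged: " + cancelThenPurge(100));
    }
}
```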
Referenced from each LeasePinger instance is a BisocketClientInvoker
which contains a ClientSocketWrapper. Each ClientSocketWrapper
references a Socket, a DataInputStream (containing BufferedInputStream)
and a DataOutputStream (containing BufferedOutputStream). Each BIS/BOS
contains a 64k array! In my tests these instances amount to a
cumulative size of about 1/3 of the heap.
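To put the 64k figure in perspective, the sketch below builds a buffered input/output pair with an explicit 64 KiB buffer each, reads the buffer lengths back through subclasses (the `buf` field is protected), and multiplies out how much memory a given number of cancelled-but-still-queued tasks would pin. In-memory streams replace the Socket's streams, and the buffer size is taken from the profile figure above rather than from Remoting's source; both are assumptions. Only the buffered streams are counted, since they are what carry the 64k arrays.

```java
import java.io.BufferedInputStream;
import java.io.BufferedOutputStream;
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.InputStream;
import java.io.OutputStream;

public class RetainedBufferEstimate {

    // 64k per buffered stream, as reported from the profile (assumption).
    static final int BUFFER_SIZE = 64 * 1024;

    // Subclassed only to read the protected 'buf' field.
    static final class SizedInput extends BufferedInputStream {
        SizedInput(InputStream in, int size) { super(in, size); }
        int bufferLength() { return buf.length; }
    }

    static final class SizedOutput extends BufferedOutputStream {
        SizedOutput(OutputStream out, int size) { super(out, size); }
        int bufferLength() { return buf.length; }
    }

    /** Buffer bytes held by one buffered input/output stream pair. */
    public static long bytesPerWrapper() {
        // In-memory streams stand in for the Socket's streams.
        SizedInput in = new SizedInput(new ByteArrayInputStream(new byte[0]), BUFFER_SIZE);
        SizedOutput out = new SizedOutput(new ByteArrayOutputStream(), BUFFER_SIZE);
        return (long) in.bufferLength() + out.bufferLength();
    }

    /** Buffer bytes pinned by a given number of cancelled-but-queued lease tasks. */
    public static long retainedBytes(int pendingTasks) {
        return pendingTasks * bytesPerWrapper();
    }

    public static void main(String[] args) {
        System.out.println(retainedBytes(1000) + " bytes for 1000 pending tasks");
    }
}
```

At 128 KiB per wrapper, a thousand pending tasks already pin about 125 MiB of buffers, which makes the "1/3 of the heap" observation easy to believe.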
Another cause appears to be the use of hash maps. There are numerous
hash maps referenced from BisocketServerInvoker and BisocketClientInvoker
which do not appear to be garbage collected. One reason is the timers
above; another is that BisocketServerInvoker holds on to
BisocketServerInvoker instances in a static map called
listenerIdToServerInvokerMap. This map currently contains an instance
of BisocketServerInvoker for every iteration of the loop.
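The static map problem is the classic registry leak: entries are added when an invoker starts, but nothing removes them when it stops. A minimal sketch of the pattern and the obvious remedy follows. `ServerInvokerRegistry`, `Invoker`, `open()` and the per-invoker `state` map are hypothetical names, and whether the real listenerIdToServerInvokerMap should be cleaned up in exactly this way is an assumption.

```java
import java.util.Map;
import java.util.UUID;
import java.util.concurrent.ConcurrentHashMap;

public class ServerInvokerRegistry {

    // Hypothetical analogue of listenerIdToServerInvokerMap. Being static, it
    // keeps every registered invoker reachable for the life of the class.
    private static final Map<String, Invoker> LISTENER_ID_TO_INVOKER =
            new ConcurrentHashMap<>();

    static final class Invoker implements AutoCloseable {
        final String listenerId = UUID.randomUUID().toString();
        // Stands in for the per-invoker hash maps seen in the profile.
        final Map<String, Object> state = new ConcurrentHashMap<>();

        static Invoker open() {
            Invoker invoker = new Invoker();
            LISTENER_ID_TO_INVOKER.put(invoker.listenerId, invoker);
            return invoker;
        }

        @Override
        public void close() {
            // Without this removal every closed invoker stays in the static map.
            LISTENER_ID_TO_INVOKER.remove(listenerId);
        }
    }

    public static int registeredCount() {
        return LISTENER_ID_TO_INVOKER.size();
    }

    /** Mimics the test loop: open and close one invoker per iteration. */
    public static int runLoop(int iterations) {
        for (int i = 0; i < iterations; i++) {
            try (Invoker invoker = Invoker.open()) {
                invoker.state.put("iteration", i);
            }
        }
        return registeredCount();
    }

    public static void main(String[] args) {
        System.out.println("still registered: " + runLoop(1000));
    }
}
```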
This has all been discovered from examining profile information, not
source code. It may be that this analysis is completely wrong and that
examination of the source code will highlight other issues.
Thanks Kev - looks like a JBoss Remoting issue.
I will investigate on Monday after I return from California.
BTW, is there any particular reason the ESB is creating so many ephemeral connections?
I would say this is an anti-pattern, although of course this should not be causing a resource leak.
Yes, it appears from the profiler that this is a remoting issue. Which version are you using? Is it 2.0.0GA?
I also agree that it is an anti-pattern and we are working to address this in the 4.x code. Kurt and I are rewriting this part of the code now.
The version of Remoting appears to be 2.2.0.beta1; is this correct?
I have suggested fixes for both of these issues. Initial tests show the client remaining stable at around 2m-5m.
I'll run more tests later this weekend.
Yes, we are running 2.2.0.beta1.
BTW I think Ron Sigal may have some fixes for this too - maybe you guys should liaise?
I have sent my suggested fixes to Ron.
I will run more tests later this weekend, though, as there appears to be another issue on the server side.
The server side issue is with Messaging :-)
Each ServerSessionEndpoint instance has its own executor. Unfortunately, nothing shuts the executor down, which means that the threads it creates are never destroyed.
I modified the messaging code to include a call to executor.shutdownNow() in ServerSessionEndpoint.close() and this appears to have done the trick.
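The fix amounts to tying the executor's lifetime to the endpoint's. The sketch below uses `java.util.concurrent` rather than the QueuedExecutor that Messaging actually uses, and `SessionEndpoint` and `deliver()` are hypothetical stand-ins for ServerSessionEndpoint and its delivery path. It shows that once `close()` shuts the executor down, its worker thread terminates instead of waiting for work indefinitely.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class SessionEndpoint implements AutoCloseable {

    // One executor per endpoint instance, as described for ServerSessionEndpoint.
    private final ExecutorService executor = Executors.newSingleThreadExecutor();

    public void deliver(Runnable work) {
        executor.execute(work);
    }

    /** Shuts the executor down so its worker thread can exit. */
    @Override
    public void close() {
        executor.shutdownNow(); // mirrors the test fix; discards any queued work
        try {
            executor.awaitTermination(5, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    public boolean isTerminated() {
        return executor.isTerminated();
    }

    public static void main(String[] args) {
        SessionEndpoint endpoint = new SessionEndpoint();
        endpoint.deliver(() -> { });
        endpoint.close();
        System.out.println("worker terminated: " + endpoint.isTerminated());
    }
}
```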
Kurt's test is now up to 40000 iterations and the client/server look stable.
I'll give it a longer run over the next day or so.
I am moving this thread to "Design of Messaging on JBoss (Messaging/JBoss)"
Thanks for your work over the past few days :) If you want to join the messaging team there are vacancies ;)
Regarding the executor - good catch. Actually, this queued executor was a last-minute addition to work around another Remoting issue.
BTW, are you creating/destroying a lot of sessions rapidly now? If so, please bear in mind that sessions are also fairly heavyweight objects, so this would be considered an anti-pattern too.
This is why the JCA layer, for instance, caches the underlying JMS sessions.
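Caching avoids paying the construction cost on every use. A deliberately small, hypothetical pool is sketched below; it is not the JCA implementation, `SimplePool`, `borrow()` and `giveBack()` are illustrative names, and a plain `Object` stands in for a JMS session. Borrowing, returning and borrowing again hands back the same instance, so the loop only ever creates one heavyweight object.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Supplier;

public class SimplePool<T> {

    private final Deque<T> idle = new ArrayDeque<>();
    private final Supplier<T> factory;
    private final AtomicInteger created = new AtomicInteger();

    public SimplePool(Supplier<T> factory) {
        this.factory = factory;
    }

    /** Reuses an idle instance if one exists; otherwise creates a new one. */
    public synchronized T borrow() {
        T item = idle.pollFirst();
        if (item == null) {
            created.incrementAndGet();
            item = factory.get();
        }
        return item;
    }

    /** Returns an instance for later reuse instead of destroying it. */
    public synchronized void giveBack(T item) {
        idle.addFirst(item);
    }

    public int createdCount() {
        return created.get();
    }

    /** Borrow/return in a loop, as a cached session would be used. */
    public static int createdAfterLoop(int iterations) {
        SimplePool<Object> pool = new SimplePool<>(Object::new);
        for (int i = 0; i < iterations; i++) {
            Object session = pool.borrow();
            // ... send or receive using the session ...
            pool.giveBack(session);
        }
        return pool.createdCount();
    }

    public static void main(String[] args) {
        System.out.println("created: " + createdAfterLoop(1000));
    }
}
```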
Thanks for the offer, I'll bear it in mind :-)
These observations come from the initial test Kurt wrote, which was intended to mimic the code currently present in the ESB.
We are aware that these are anti-patterns and that their use will have a negative effect on performance. Kurt and I are rewriting this as we speak.
Do you have any plans to release a version of Messaging which addresses these issues?
We are hoping to release a follow up to 1.2.0 fairly soon, although I'm not sure of the exact timing.
BTW I couldn't see your change for the QueuedExecutor in SVN - I assume you haven't committed it yet?
(I'm thinking a shutdownAfterExecutingCurrentTask() would be more appropriate too)
I have not committed anything to svn as I felt the decision on how best to handle this should come from the messaging team. :-)
There are two other shutdown methods which could be used depending on what you wish to achieve. The choice of shutdownNow was only made to test the fix.
My instinct would be to let the queue drain using shutdownAfterProcessingCurrentlyQueuedTasks rather than to use the shutdownAfterExecutingCurrentTask.
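The difference between draining the queue and stopping immediately can be seen with `java.util.concurrent`, whose `shutdown()` and `shutdownNow()` are used here as rough analogues (an assumption, not an exact mapping) of shutdownAfterProcessingCurrentlyQueuedTasks and shutdownNow on the QueuedExecutor discussed above; `java.util.concurrent` has no direct equivalent of the "finish only the current task" variant. With `shutdown()` every queued task still runs; with `shutdownNow()` the running task is interrupted and the queued ones are handed back unexecuted. `ShutdownModes` and its methods are hypothetical names.

```java
import java.util.List;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

public class ShutdownModes {

    /** Graceful: already-queued tasks still run. Returns how many ran. */
    public static int drainThenStop(int tasks) {
        ExecutorService executor = Executors.newSingleThreadExecutor();
        AtomicInteger ran = new AtomicInteger();
        for (int i = 0; i < tasks; i++) {
            executor.execute(ran::incrementAndGet);
        }
        executor.shutdown(); // reject new work, drain the queue, then exit
        awaitQuietly(executor);
        return ran.get();
    }

    /** Abrupt: interrupts the running task. Returns how many queued tasks were dropped. */
    public static int stopNow(int queuedBehindBlocker) {
        ExecutorService executor = Executors.newSingleThreadExecutor();
        CountDownLatch started = new CountDownLatch(1);
        CountDownLatch never = new CountDownLatch(1);
        executor.execute(() -> {
            started.countDown();
            try {
                never.await(); // blocks until shutdownNow() interrupts it
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        for (int i = 0; i < queuedBehindBlocker; i++) {
            executor.execute(() -> { });
        }
        try {
            started.await(); // the blocker is running, so the rest are still queued
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        List<Runnable> notRun = executor.shutdownNow();
        awaitQuietly(executor);
        return notRun.size();
    }

    private static void awaitQuietly(ExecutorService executor) {
        try {
            executor.awaitTermination(5, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    public static void main(String[] args) {
        System.out.println("drained: " + drainThenStop(5));
        System.out.println("dropped: " + stopNow(5));
    }
}
```

Letting the queue drain means messages already handed to the executor are still processed before the session goes away, at the cost of a slower close.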
I'll apply the fix.