Could you check your GC via -verbose:gc? There's a thread in this forum where someone discovered that RMI did affect GC, at least in some environments. Apparently, where RMI is involved, a full GC runs periodically unless overridden in the JVM options. Also, are you using hsqldb for your message queues? I'm wondering if this could have an impact. Is there anything else in your configuration that has a cycle time in the 30 s to 1 min timeframe - e.g. JMS ping, etc.?
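If RMI's periodic full GC does turn out to be the trigger, the interval can be raised via the `sun.rmi.dgc.*` system properties. A minimal sketch - the one-hour value is just an example, and in practice you would pass these as `-D` options on the JBoss command line so they take effect before any RMI class is loaded:

```java
public class RmiGcIntervalDemo {
    public static void main(String[] args) {
        // RMI's distributed GC forces a full collection every
        // sun.rmi.dgc.{client,server}.gcInterval milliseconds
        // (historically once a minute by default). Raising it to an hour:
        System.setProperty("sun.rmi.dgc.client.gcInterval", "3600000");
        System.setProperty("sun.rmi.dgc.server.gcInterval", "3600000");
        // Normally set on the command line instead:
        //   -Dsun.rmi.dgc.client.gcInterval=3600000
        //   -Dsun.rmi.dgc.server.gcInterval=3600000
        System.out.println(System.getProperty("sun.rmi.dgc.server.gcInterval"));
    }
}
```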
All I can think of for now.
-verbose:gc shows frequent "normal" GCs taking about 5 to 25 msec each (average ~10), and rare full GCs taking about 30 to 40 msec. This suggests that excessive full GCs are not the issue. Also, the GC timings do not account for the strange 100 msec "sleep" gaps.
I am using the standard Hypersonic DB for the JMS queues (standard 2.4.10 configuration). Or do you mean something else with "hsqldb"? I did try clearing out the jbossmq directory (has worked wonders in the past), but without any discernible effect.
On the application side, there are a number of periodic tasks, but none with a cycle time of 30 sec to 1 min. Not sure about the frequency of JMS pings and other JBoss internals though.
I wonder whether I have got some thread priority issue. There are a number of JMS queues that are listened to by both an MDB and a non-EJB thread. The two grab a disjoint set of messages from the queue (filtered by a header field), i.e. they are not fighting over messages.
However, while the MDB gets sufficient processing time to process messages almost as soon as they appear in the queue, the non-EJB thread (launched from a web component) has to wait much longer.
What is the priority of threads running the MDB? Is it higher or lower than a non-EJB thread? In other words, does it make sense to experiment with giving the non-EJB thread a higher priority?
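For what it's worth, the experiment itself is cheap to set up. A minimal sketch of bumping a listener thread's priority - note that Java priorities are only hints, and whether the Solaris thread scheduler actually honours them is a separate question:

```java
public class PriorityExperiment {
    public static void main(String[] args) {
        // A new thread inherits its creator's priority - NORM_PRIORITY (5)
        // when created from main.
        Thread listener = new Thread(new Runnable() {
            public void run() {
                // stand-in for the non-EJB queue listener loop
            }
        });
        System.out.println(listener.getPriority());
        listener.setPriority(Thread.MAX_PRIORITY); // raise to 10
        System.out.println(listener.getPriority());
    }
}
```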
Sorry about asking all these questions instead of just trying it out - we have this issue on a customer installation and cannot reproduce it on our own boxes. And the customer techie's patience is obviously limited :-)
OK. I'm guessing that you have the same boxes and set up as your client? And you don't have the same performance issue as your client with the same data and load? Just confirm this.
I don't remember seeing any thread priority in the JBoss code ever. Someone might want to correct me on this.
So your non-EJB thread is running in the JBoss JVM?
If outside the JVM, the lookup and connection may take longer so that may be the hold up.
With the hsqldb, I am wondering if the size of the database is affecting your performance. People have commented on the impact of large amounts of data on the Hypersonic performance.
If the client's JVM is the same as on your machines for the same architecture, the only other thing I can think may be the problem is the patch level on the OS. Sun recommends a specific patch set for each of its JVMs.
Is there anything else going on, on their Solaris system that might hold up anything?
P.S. I agree that the GC does not seem to be the cause - but it is always good to be able to definitely rule something out.
> OK. I'm guessing that you have the same boxes and set up as your client? And you don't have the same performance issue as your client with the same data and load? Just confirm this.
We thought so - still, I have now asked them to tar up their whole directory structure (incl. JBoss, Apache, all their data etc.) so we can be 100% sure they have not inadvertently tweaked some obscure setting.
> So your non-EJB thread is running in the JBoss JVM?
Yes. Going up the stack all the way, it is effectively launched from a servlet running in Tomcat.
> Is there anything else going on, on their Solaris system that might hold up anything?
Nope - clean box. And CPU usage on both processors is well below 100%.
Well, if the JVM, application and data are exactly the same on the two systems once you transfer them, and you are not experiencing the same problem, it points to one of two things. Either the OS patch levels are not the same and something is interfering with the JVM operation, or there is a hardware-related issue. I would start by suspecting something in the JVM/OS interaction. You are getting an interrupt-and-wait symptom - a hold-up, but little CPU consumption. It is possible that hardware might be doing something, but I have never seen a Solaris box show these symptoms.
Keep us updated with what you find.
Turns out that the problem was a little vicious cycle in JBossMQ, triggered by awkward allocation of processing time by the JVM:
Not enough processor time for JMS queue => queue grows => message selectors take longer to filter messages => queue grows further etc.
It looks like the JBossMQ message selectors may become quite inefficient once a queue grows beyond a hundred messages or so. Applying the selectors was what caused the 100 msec gaps I mentioned earlier (and the 10-20 seconds tied up exclusively with this task).
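The growth effect is easy to reproduce outside JBossMQ: if a selector is applied by scanning the in-memory queue linearly, each receive costs O(queue depth), so a backlog of non-matching messages slows every subsequent receive. A self-contained simulation - the `Msg` class and the single-field "selector" are made up for illustration, not JBossMQ internals:

```java
import java.util.Iterator;
import java.util.LinkedList;

public class SelectorScanDemo {
    // Toy stand-in for a JMS message with one header field.
    static class Msg {
        final String type;
        Msg(String type) { this.type = type; }
    }

    // Receive the first message matching the "selector", scanning from the
    // head of the queue -- O(depth) per receive, so a backlog of
    // non-matching messages slows the consumer down.
    static Msg receive(LinkedList<Msg> queue, String wantedType, int[] scanned) {
        for (Iterator<Msg> it = queue.iterator(); it.hasNext(); ) {
            Msg m = it.next();
            scanned[0]++;
            if (m.type.equals(wantedType)) {
                it.remove();
                return m;
            }
        }
        return null;
    }

    public static void main(String[] args) {
        LinkedList<Msg> queue = new LinkedList<Msg>();
        for (int i = 0; i < 1000; i++) queue.add(new Msg("A")); // backlog for the other consumer
        queue.add(new Msg("B"));                                // the message we want

        int[] scanned = new int[1];
        Msg m = receive(queue, "B", scanned);
        System.out.println(m != null);
        System.out.println(scanned[0]); // had to inspect the whole backlog first
    }
}
```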
The fix in my case was in two areas -
(1) make the queue listener a fixed-rate timer task - this seems to give it sufficient priority
(2) throttle the input further up the chain to avoid long queues in the first place.
... and it also turned out that the initial trigger for the queue explosion only occurred for a specific combination of input data, which explained why it was so difficult to reproduce.
In other words, I do not now think that JVM and/or Solaris patch levels were relevant.
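For the record, fix (1) can be sketched with java.util.Timer: scheduleAtFixedRate keeps the polling cadence steady even when an individual run is delayed. The 50 msec period and the drain body below are placeholders, not the actual values used:

```java
import java.util.Timer;
import java.util.TimerTask;

public class FixedRateListener {
    static volatile int polls = 0; // written only from the single timer thread

    public static void main(String[] args) throws InterruptedException {
        Timer timer = new Timer(true); // daemon timer thread
        timer.scheduleAtFixedRate(new TimerTask() {
            public void run() {
                polls++; // stand-in for draining the queue with receiveNoWait()
            }
        }, 0, 50); // fire every 50 msec, catching up if a run is late
        Thread.sleep(500);
        timer.cancel();
        System.out.println(polls >= 5); // roughly 10 polls in half a second
    }
}
```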