Please see the thread http://www.jboss.com/index.html?module=bb&op=viewtopic&t=67846, in which I corresponded with you earlier. Background on our system: we use JBoss 3.2.6 with two app servers in the cluster, and JBossCache configured for read uncommitted isolation and repl_async replication. Our code is mostly plain old Java objects, and we use JBossCache extensively for cluster replication.
Earlier, we had a lot of problems in production that were not attributable to anything specific:
1. The JVM suddenly died, leaving defunct java processes, within 1 hour of the application starting.
2. NAKACK messages were displayed.
You had initially suggested that I post in the Apache groups about the defunct processes problem (since we use the Apache load balancer). For the NAKACK problem you had suggested we use JGroups 2.2.9 (alpha). Replacing JGroups 2.2.8 with 2.2.9 alpha actually fixed the defunct java processes to a great extent. Instead of dying within 1 hour of starting JBoss, the processes now died after a week or so.
Then we had another production release, and this time both app servers in the cluster died within 1-2 minutes of starting. To find the problem, we started rolling back each JBossCache change that had been introduced. The one rollback that made a difference was a new cache that is populated on demand through database lookups; earlier we used local hashmaps for this. This cache differs from all the other caches we already had in three ways: (a) the object it holds is a synchronized map, (b) some methods are synchronized, and (c) the volume is high. There are approximately 6000 database calls (and therefore 6000 object replications) in a span of 1-2 minutes. The same object is used and replicated every time, i.e., the 6000 database entries are all held in the same synchronized map and the entire object is replicated on each call. We rolled back this change to the version with local hashmaps, and that crash no longer happened.
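To give a rough feel for why replicating the entire map on each call could be expensive, here is a minimal sketch in plain Java. It does not use JBossCache or JGroups at all: it only assumes that each replication serializes whatever object is put into the cache, and it uses standard Java serialization as a stand-in for a replication message. The class name ReplicationCost, its methods, and the key/value strings are hypothetical and are not taken from our application.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.io.UncheckedIOException;
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;

// Hypothetical illustration only: compares the bytes serialized when the
// whole synchronized map is sent after every insert with the bytes
// serialized when only the new entry is sent.
final class ReplicationCost {

    // Serialize a value and return its size in bytes. This stands in for
    // the payload of one replication message (an assumption of the sketch).
    static long serializedSize(Serializable value) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
            out.writeObject(value);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.size();
    }

    // Pattern described in the post: one synchronized map holds every
    // entry, and the entire map is serialized after each insert.
    static long wholeMapBytes(int entries) {
        Map<String, String> map =
                Collections.synchronizedMap(new HashMap<String, String>());
        long total = 0;
        for (int i = 0; i < entries; i++) {
            map.put("key-" + i, "value-" + i);
            total += serializedSize((Serializable) map);
        }
        return total;
    }

    // Alternative for comparison: only the new key/value pair is
    // serialized after each insert.
    static long perEntryBytes(int entries) {
        long total = 0;
        for (int i = 0; i < entries; i++) {
            total += serializedSize(new String[] {"key-" + i, "value-" + i});
        }
        return total;
    }

    public static void main(String[] args) {
        // Scaled down from the roughly 6000 calls mentioned in the post
        // so that the sketch runs quickly.
        int entries = 1000;
        System.out.println("whole map per insert: " + wholeMapBytes(entries) + " bytes");
        System.out.println("one entry per insert: " + perEntryBytes(entries) + " bytes");
    }
}
```

In this sketch the whole-map total grows roughly with the square of the number of entries, while the per-entry total grows linearly. Whether that is what actually kills the app servers is not something this sketch can show; it only illustrates how much data the pattern generates.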
Rolling back this change has helped somewhat, but the JBoss crash now occurs approximately once a day, sometimes with defunct processes and sometimes without. There is absolutely no information in the log. But we know that this is somehow related to JGroups, since changing from 2.2.8 to 2.2.9 greatly changed the behavior.
The only other change that has not been rolled back is the introduction of a stateless session bean. Earlier all our code was just plain old Java objects (not distributed), but now we have a stateless session bean that we use to look up information from the other app server.
This is a production problem, so it is very urgent. We have also purchased support, so I am going to post there too. I would appreciate it if you could look at that.
Please let me know if you need any other info.