We have JBoss Cache running in production, and within the same cluster of JVMs we run a number of different instances, for the following reasons:
1) For data which doesn't need to be distributed, we use a single-node, non-replicated instance (one instance for each entity).
2) We also have two clustered instances: one for the main 'entity' of the system (in our case 'orders') and another for all other distributed entities.
Our assumption was that we would get greater throughput on the cache of the order entity if it had its own instance, rather than having to compete for resources (threads, JGroups, etc.) with the other distributed entities.
My question is: is this assumption correct, or are we guilty of 'over-engineering it'? Could deploying in such a fashion actually have the opposite effect and degrade the performance of the cluster? (If so, we would be interested to know why.)
On the whole the performance is great, but we are being hit by stability problems. About once a week (it tends to be when the system is under its highest load, although not always) we start getting locking/replication errors on the 'orders' cluster, which eventually leads to the CPU of one of the JVMs in the cluster hitting 100%. (We have yet to grab a thread dump, as it tends to happen when the developers are off site and ops are supporting the application.) This makes the cluster pretty useless, as all the transactions/calls to the other nodes in the cluster fail with replication errors.
Restarting the JVM with the high CPU sorts the issue.
Also, we have found that if the ops guys restart the wrong JVM, the initial state transfer fails with the following stack trace:
at java.io.FileOutputStream.writeBytes(Native Method)
at java.io.FileOutputStream.write(Unknown Source)
at sun.nio.cs.StreamEncoder$CharsetSE.writeBytes(Unknown Source)
at sun.nio.cs.StreamEncoder$CharsetSE.implFlushBuffer(Unknown Source)
at sun.nio.cs.StreamEncoder$CharsetSE.implFlush(Unknown Source)
at sun.nio.cs.StreamEncoder.flush(Unknown Source)
at java.io.OutputStreamWriter.flush(Unknown Source)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source)
at java.lang.reflect.Method.invoke(Unknown Source)
at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(Unknown Source)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
at java.lang.Thread.run(Unknown Source)
We are using PESSIMISTIC locking, but for reads we simply make a copy of the object and release the read lock immediately, rather than waiting for the main transaction to finish - i.e. all reads are done in a sub-transaction. We do the 'put' to the cache right at the end of the transaction (hence reducing the time that the write lock is held), checking that the version of the entity in the database is less than that of the entity being committed (so a poor man's OPTIMISTIC lock). We did try the cache's OPTIMISTIC locking, but we had all sorts of weird problems that are beyond the scope of this post.
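To make the 'poor man's optimistic lock' concrete, it amounts to something like the sketch below (names such as putIfNewer and Order are illustrative, not our actual code, and a plain map stands in for the cache plus the database version check):

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class VersionedPut {

    public static class Order {
        public final String id;
        public final long version;
        public Order(String id, long version) {
            this.id = id;
            this.version = version;
        }
    }

    private final Map<String, Order> cache = new ConcurrentHashMap<String, Order>();

    // At the end of the transaction, commit the entity only if no newer
    // version has been committed in the meantime; otherwise reject it.
    public synchronized boolean putIfNewer(Order candidate) {
        Order current = cache.get(candidate.id);
        if (current != null && current.version >= candidate.version) {
            return false; // stale write: a newer version is already committed
        }
        cache.put(candidate.id, candidate);
        return true;
    }

    public Order get(String id) {
        return cache.get(id);
    }
}
```

A rejected put signals the caller to retry with a fresh copy, which is essentially what optimistic locking does, minus the cache's own validation machinery.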
We are using JBoss Cache 2.2.0.
The stack trace is weird - it shows an interruption during logging, which doesn't make a lot of sense.
Regarding your architecture, it does make sense to some degree, although you are still sharing network resources for the actual replication (bandwidth), etc. So you may be over-engineering things a bit; I would have started with just two caches (a replicated one and a local one) until you knew for sure that there was contention on shared resources.
When using pessimistic locking, locks are per-node so there won't be contention for different entities, as long as they are spread well across the tree structure.
Also, I have just tagged 2.2.1.GA, which should be released today - it has some serialization-related performance enhancements that you may find interesting, as well as a faster JGroups library.
Also, I'd like to add, have you tried 3.0.0.CR1 with MVCC? It is significantly faster and conceptually more stable than either pessimistic or optimistic locking.
Many thanks for your reply - just to let you know, we found out what the problem was.
We have a listener attached to each cache which helps us manage indexes on the cache (we index some entities to avoid having to traverse all the entities when retrieving a selection of items - through JBoss Cache we have pretty much killed off any need to access the database, with the exception of the initial load, writes and updates).
Within the listener there was a HashMap instance variable, and it was this variable that was causing the JVM to lock up. When we did a thread dump we noticed that the replication thread was getting stuck in the put method of this HashMap at the same time as a normal application thread, hence the replication queue was backing up and causing the replication exceptions to be thrown to the other nodes in the cluster. Searching the net, I found a number of reports of HashMap spinning out of control when two threads hit its resize/rehash method at the same time. We have since added a read/write lock around this variable and have yet to experience the issue again. Unfortunately it was fairly hard to find, as you would have expected to see a deadlock - it was only when we took a number of thread dumps over a minute that we picked this up.
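For anyone hitting the same thing, the fix is roughly the following (GuardedIndex and the field names are illustrative, not our actual listener code; the guarding is done with java.util.concurrent's ReentrantReadWriteLock):

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.locks.ReadWriteLock;
import java.util.concurrent.locks.ReentrantReadWriteLock;

public class GuardedIndex {

    // Plain HashMap is not thread-safe: concurrent puts that trigger a
    // resize can leave the bucket chains in a cycle, so a later put/get
    // spins forever at 100% CPU (exactly the symptom we saw).
    private final Map<String, String> index = new HashMap<String, String>();
    private final ReadWriteLock lock = new ReentrantReadWriteLock();

    public void put(String key, String value) {
        lock.writeLock().lock(); // exclusive: only one thread may mutate
        try {
            index.put(key, value);
        } finally {
            lock.writeLock().unlock();
        }
    }

    public String get(String key) {
        lock.readLock().lock(); // shared: many readers may proceed together
        try {
            return index.get(key);
        } finally {
            lock.readLock().unlock();
        }
    }
}
```

A ConcurrentHashMap would be a simpler alternative if each index update is a single put; the explicit read/write lock is useful when several index entries must be updated atomically.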
We have briefly tried version 3, and things seemed to go really well until we ramped up the load on the clustered cache (the local caches were absolutely flying), at which point it started to throw lock errors (which was actually the last issue I was expecting, given the new locking system). Before pointing the finger at the new version I want to do more analysis at our end, as it could be some silly code on our side or the database running really slowly.