I am working on integrating Infinispan into Hawkular Metrics to implement roll-ups. Hawkular Metrics is built on top of Cassandra. As data is ingested, we perform async writes to persist it to Cassandra. The data is also written asynchronously to a cache. I specify the IGNORE_RETURN_VALUES and SKIP_LOCKING flags when accessing the cache. Entries are written once and never updated. Cache entries are purged by a background job that runs every minute: once its work is done, the data points are removed from the cache. Infinispan is embedded in Hawkular Metrics. We have some performance tests that run regularly, and they focus on data ingestion. My changes have caused a significant drop in performance, and I am struggling to figure out what I can do to get performance reasonably close to what it was. Here is my configuration:
<infinispan>
  <cache-container default-cache="default">
    <transport cluster="HawkularMetrics"/>
    <distributed-cache name="rawData" mode="ASYNC">
      <groups enabled="true"/>
      <eviction size="50000" strategy="LRU" type="COUNT"/>
      <persistence passivation="true">
        <leveldb-store path="/tmp/hawkular-metrics/raw/data" shared="true" preload="true">
          <expiration path="/tmp/hawkular-metrics/raw/expired"/>
          <compression type="NONE"/>
          <implementation type="JAVA"/>
          <write-behind/>
        </leveldb-store>
      </persistence>
    </distributed-cache>
  </cache-container>
</infinispan>
It was evident that long and frequent GC pauses were a big contributor to the performance drop. I refactored my cache entry class to have a much smaller memory footprint, which made a big improvement. Adding eviction and the LevelDB cache store seemed to make GC activity a lot more stable, but overall performance is still way down. The performance test uses 300 client threads. Each client request includes 10 data points, and requests are submitted as fast as the Hawkular Metrics server will process them. The server has a 4 GB heap with no adaptive sizing (i.e., min heap == max heap). Earlier today we updated the test to pass the following options to the server JVM: -XX:NewSize=1024M, -XX:MaxNewSize=1024M, and -XX:+UseCompressedOops. It didn't seem to help at all.
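For tracking GC activity over time without reading logs, something like the following minimal sketch could be run inside the server JVM. It uses only the standard java.lang.management API. The class name GcSnapshot and the allocation loop are hypothetical stand-ins, not anything from Hawkular Metrics or Infinispan, and the reported time is the accumulated collection time per collector, which is only an approximation of stop-the-world pause time.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

// Hypothetical helper for illustration; not part of Hawkular Metrics or Infinispan.
class GcSnapshot {

    // Total number of collections reported by all collectors in this JVM.
    static long totalCollections() {
        long total = 0;
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            long count = gc.getCollectionCount(); // -1 if the collector does not report it
            if (count > 0) {
                total += count;
            }
        }
        return total;
    }

    // Approximate accumulated collection time, in milliseconds, across all collectors.
    static long totalCollectionMillis() {
        long total = 0;
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            long millis = gc.getCollectionTime(); // -1 if the collector does not report it
            if (millis > 0) {
                total += millis;
            }
        }
        return total;
    }

    public static void main(String[] args) {
        long countBefore = totalCollections();
        long millisBefore = totalCollectionMillis();

        // Stand-in workload (an assumption): short-lived allocations intended to create some garbage.
        long checksum = 0;
        for (int i = 0; i < 200_000; i++) {
            byte[] chunk = new byte[4096];
            chunk[0] = (byte) i;
            checksum += chunk[0];
        }

        System.out.println("checksum: " + checksum);
        System.out.println("collections during workload: " + (totalCollections() - countBefore));
        System.out.println("approx. GC time during workload (ms): " + (totalCollectionMillis() - millisBefore));

        // Per-collector totals since JVM start, to separate young-generation from old-generation activity.
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            System.out.println(gc.getName() + ": count=" + gc.getCollectionCount()
                    + ", timeMs=" + gc.getCollectionTime());
        }
    }
}
```

Sampling these two totals before and after a test run, once with the cache writes enabled and once without, would give a rough measure of how much GC time the cache adds.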
Before I made some of these changes, I was seeing stop-the-world pauses in excess of 10 seconds. Granted, I have not spent a whole lot of time studying GC logs in general, but I think things look pretty sane now. What else should I look at?