3 Replies Latest reply on Aug 30, 2016 9:01 AM by rvansa

Write Performance

john.sanda Aug 29, 2016 11:32 PM

I am working on integrating Infinispan into Hawkular Metrics for implementing roll ups. Hawkular Metrics is built on top of Cassandra. As data is ingested, we perform async writes to persist it to Cassandra. The data is also asynchronously written to a cache. I specify the IGNORE_RETURN_VALUES and SKIP_LOCKING flags when accessing the cache. Entries are written once and never updated. Cache entries get purged by a background job that runs every minute. Once the work is done, the data points are removed from the cache. Infinispan is embedded in Hawkular Metrics. We have some performance tests that run regularly, and they focus on data ingestion. My changes have caused a significant drop in performance. I am struggling to figure out what I can do to get performance relatively close to what it was. Here is my configuration:

<infinispan>
  <cache-container default-cache="default">
    <transport cluster="HawkularMetrics"/>
    <distributed-cache name="rawData" mode="ASYNC" >
      <groups enabled="true"/>
      <eviction size="50000" strategy="LRU" type="COUNT"/>
      <persistence passivation="true">
        <leveldb-store path="/tmp/hawkular-metrics/raw/data" shared="true" preload="true">
          <expiration path="/tmp/hawkular-metrics/raw/expired"/>
          <compression type="NONE" />
          <implementation type="JAVA"/>
          <write-behind/>
        </leveldb-store>
      </persistence>
    </distributed-cache>
  </cache-container>
</infinispan>

It was evident that long and frequent GC pauses were a big contributor to the performance drop. I refactored my cache entry class to have a much smaller memory footprint. This made a big improvement. Adding the eviction and LevelDB cache store seemed to make the GC activity a lot more stable, but overall performance is still way down. The performance test uses 300 client threads. Each client request includes 10 data points, and requests are submitted as fast as the Hawkular Metrics server will process them. The server has a 4 GB heap with no adaptive sizing (i.e., min heap == max heap). Earlier today we updated the test to pass the following options to the server JVM: -XX:NewSize=1024M, -XX:MaxNewSize=1024M, and -XX:+UseCompressedOops. It didn't seem to help at all.

Before I made some of the changes, I was seeing stop-the-world pauses in excess of 10 seconds. Granted, I have not spent a whole lot of time studying GC logs in general, but I think things look pretty sane now. What else should I look at?

1. Re: Write Performance

rvansa Aug 30, 2016 2:50 AM (in response to john.sanda)

Your question doesn't say which changes caused the performance drop. I understand that adding LevelDB + eviction was only a countermeasure, right? In any case, use a cache store when you can't fit all the data into memory (is that your problem according to the GC telemetry?), but a cache store won't speed up the cache.

I would recommend checking out how much live data you try to keep in memory and set JVM options according to that. Don't try to use all memory for storing your data - we have observed that the performance degrades when the memory is more than half-full. And don't forget the backup copies (in your case, you have 2 copies - do you need the redundancy = do you need to handle failures?).
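For illustration: the number of copies in a distributed cache is governed by the owners attribute of the distributed-cache element, which defaults to 2 when omitted, as in the configuration above. A minimal sketch, assuming redundancy is not required and that losing a node's entries is acceptable, would set it to 1 explicitly:

```xml
<!-- Sketch only: owners="1" keeps a single copy of each entry,
     so the entries held by a failed node are lost. -->
<distributed-cache name="rawData" mode="ASYNC" owners="1">
  <groups enabled="true"/>
  <!-- eviction and persistence elements as in the original configuration -->
</distributed-cache>
```

Halving the number of copies roughly halves the cluster-wide memory needed for the same data set, at the cost of fault tolerance.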

One last remark: the pure-Java implementation of LevelDB uses the MappedByteBuffer.unmap() operation, which has caused JVM crashes in the past. So proceed with caution. // looking into 1.8 the method is gone, so it may not be an issue anymore.
2. Re: Write Performance

john.sanda Aug 30, 2016 8:35 AM (in response to rvansa)

Adding LevelDB + eviction was a countermeasure to deal with the constantly growing data set. I had hoped that by effectively capping the cache capacity, GC activity would stabilize and performance would improve. GC seems better, but I still have a very large performance drop. I have tried different eviction sizes, and that hasn't helped much either.

Don't try to use all memory for storing your data - we have observed that the performance degrades when the memory is more than half-full. And don't forget the backup copies (in your case, you have 2 copies - do you need the redundancy = do you need to handle failures?).

Again, this is the purpose of the eviction + cache store. I have changed the eviction size to as little as 10,000, and it didn't do much performance-wise.

Where is the second copy of the data?
3. Re: Write Performance

rvansa Aug 30, 2016 9:01 AM (in response to john.sanda)

Okay, so the performance was OK as long as you used a limited data set, and it dropped when you increased it beyond certain memory limits, right? And I assume that you can't just throw away the data.

With a cache store, write performance won't be as good as in the pure in-memory version (write-behind can help you handle load spikes, but under constant high load it won't provide much gain). So if you want to keep everything in memory, you need to increase the heap size, not just let a slow cache store handle the overflow.
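As a rough sketch of what the pure in-memory variant could look like, the cache definition from the original post can be reduced to the fragment below. This assumes the whole working set (including backup copies) fits comfortably in the heap, since nothing is evicted or spilled to disk:

```xml
<!-- Sketch only: no eviction and no persistence, so every entry stays on the heap
     until the background job removes it. -->
<distributed-cache name="rawData" mode="ASYNC">
  <groups enabled="true"/>
</distributed-cache>
```

With this variant the heap has to be sized for the peak amount of live data rather than for the eviction threshold.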

If you have long GC pauses even with a low eviction threshold, you're probably using the wrong GC or a badly tuned one. Could you try G1 or CMS? Infinispan produces quite a bit of garbage, but if that gets promoted to the old generation, it's a tuning issue.

And I apologize if I still misunderstand your situation.