FileCacheStore redesign

    This is an early design doc to redesign Infinispan's FileCacheStore.

     

    Some general ideas:

     

    • B+Tree-based, good for fast lookup (reading), but slower for writing.
    • Append-only store
      • Fast writing, slow to read
      • Useful if data set is held in memory and write through is purely for resilience (not expanded capacity).
      • Would require a separate thread/process to handle compacting (back into a B+Tree)
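
    The append-only idea above can be sketched roughly as follows. This is a minimal illustration, not Infinispan code: the class name AppendOnlyLogSketch, the record layout and the use of a TreeMap as a stand-in for a B+Tree are all assumptions made for the example. Writes are a single sequential append, reads have to replay the log, and compaction rewrites only the live entries.

```java
import java.io.File;
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.charset.StandardCharsets;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch of an append-only store; names and layout are assumptions.
final class AppendOnlyLogSketch implements AutoCloseable {
    private final RandomAccessFile file;

    AppendOnlyLogSketch(File path) throws IOException {
        this.file = new RandomAccessFile(path, "rw");
    }

    // Fast path: one sequential append of [keyLen][key][valueLen][value].
    // A valueLen of -1 is a tombstone that marks the key as removed.
    void put(String key, byte[] value) throws IOException {
        byte[] k = key.getBytes(StandardCharsets.UTF_8);
        file.seek(file.length());
        file.writeInt(k.length);
        file.write(k);
        file.writeInt(value == null ? -1 : value.length);
        if (value != null) {
            file.write(value);
        }
    }

    void remove(String key) throws IOException {
        put(key, null);
    }

    // Slow path: without an index, a read replays the whole log.
    byte[] get(String key) throws IOException {
        return replay().get(key);
    }

    // Replays the log into a sorted map; the last record for a key wins.
    // The TreeMap only stands in for the B+Tree mentioned in the text.
    TreeMap<String, byte[]> replay() throws IOException {
        TreeMap<String, byte[]> live = new TreeMap<>();
        long end = file.length();
        file.seek(0);
        while (file.getFilePointer() < end) {
            byte[] k = new byte[file.readInt()];
            file.readFully(k);
            String key = new String(k, StandardCharsets.UTF_8);
            int len = file.readInt();
            if (len < 0) {
                live.remove(key);
                continue;
            }
            byte[] v = new byte[len];
            file.readFully(v);
            live.put(key, v);
        }
        return live;
    }

    // Compaction: keep only live entries. A real store would do this in a
    // separate thread and write to a new file; truncating in place is not crash-safe.
    void compact() throws IOException {
        TreeMap<String, byte[]> live = replay();
        file.setLength(0);
        for (Map.Entry<String, byte[]> e : live.entrySet()) {
            put(e.getKey(), e.getValue());
        }
    }

    long sizeOnDisk() throws IOException {
        return file.length();
    }

    @Override
    public void close() throws IOException {
        file.close();
    }
}
```

    Overwrites and removals only grow the file until compact() runs, which is why the compaction step cannot be skipped in this design.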

     

    Some good background reading:

     

    https://www.kernel.org/pub/linux/kernel/people/suparna/aio/262/results/aio-stress-results.txt

    http://www.acunu.com/2/post/2011/03/why-is-acunu-in-kernel.html

    http://www.datastax.com/dev/blog/what-persistence-and-why-does-it-matter

    http://www.datastax.com/dev/blog/cassandra-file-system-design

    http://wiki.apache.org/cassandra/Durability

    http://wiki.apache.org/cassandra/ArchitectureCommitLog

    http://www.slideshare.net/rbranson/cassandra-and-solid-state-drives

    http://antirez.com/post/redis-persistence-demystified.html

    http://hornetq.blogspot.co.uk/2009/08/persistence-on-hornetq.html

    http://hornetq.sourceforge.net/docs/hornetq-2.0.0.BETA5/user-manual/en/html/persistence.html

    http://hornetq.sourceforge.net/docs/hornetq-2.0.0.GA/user-manual/en/html/libaio.html

    https://code.google.com/p/leveldb/

     

    Related JIRAs:

     

    https://issues.jboss.org/browse/ISPN-1808

    https://issues.jboss.org/browse/ISPN-1362

    https://issues.jboss.org/browse/ISPN-1303

    https://issues.jboss.org/browse/ISPN-1302

    https://issues.jboss.org/browse/ISPN-1301

    https://issues.jboss.org/browse/ISPN-517

     

    Test plan:

    • Operations to test: load, store, remove, preload
    • These operations should be tested in two major scenarios:
      • Run the operations against a local cache with no eviction, backed by the file cache store (no async store), so that the cache and the cache store hold exactly the same data, e.g. 1 GB of data stored. This test aims to see how fast we can update a cache store. Reads would be very fast because they'd be served by the in-memory cache.
      • Run the operations against a small in-memory local cache with aggressive eviction settings, backed by a file-based cache store (no async store) that is used as overflow, e.g. keep 1 GB in memory and store 20 GB in the file store. Here we're trying to get a better idea of how good the cache store is at reading data: most of the data will be present only in the cache store and not in the cache, so serving a read requires reading the cache store and storing that data in memory.
    • Before writing any cache stores, we should evaluate the performance of the cache stores available right now, which are:
    • Preferably, tests should be run on modern SSD drives.

     

    Objectives:

    • For each of the major scenarios, target performance objectives need to be set. TBD.

     

    Current results:

    All setups used a local cache, and the benchmark was executed via Radargun (actually a version not yet merged into master [2]). I used 4 nodes just to get more data; each slave was completely independent of the others.

     

    The first test measured preloading performance: the cache started and tried to load 1 GB of data from the hard drive. Without a cache store the startup takes about 2 - 4 seconds; average numbers for the cache stores are below:

     

     

    Cache store              Startup time
    FileCacheStore           9.8 s
    KarstenFileCacheStore    14 s
    LevelDB-JAVA impl.       12.3 s
    LevelDB-JNI impl.        12.9 s

     

     

    IMO nothing special, all times seem affordable. We did not specifically benchmark storing the data into the cache store, but for the record: FileCacheStore took about 44 minutes, Karsten about 38 seconds, LevelDB-JAVA 4 minutes and LevelDB-JNI 96 seconds. The units are right, it's minutes compared to seconds. But we all know that FileCacheStore is bloody slow.

     

    The second test is a stress test (5 minutes, preceded by a 2-minute warmup) where each of 10 threads works on 10k entries with 1 kB values (~100 MB in total). 20 % writes, 80 % reads, as usual. No eviction is configured, therefore the cache store works only as persistent storage in case of a crash.
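
    To make the workload shape concrete, here is a rough sketch of such a stressor. It is not the Radargun stressor used for the numbers in this document; the class name StressMixSketch, the key naming scheme and the fixed operation count (instead of a timed run) are assumptions for illustration. It shows the 10 threads x 10k entries x 1 kB sizing and the 20/80 write/read mix.

```java
import java.util.Map;
import java.util.concurrent.ThreadLocalRandom;
import java.util.concurrent.atomic.LongAdder;

// Hypothetical stressor sketch; not the Radargun implementation.
final class StressMixSketch {
    static final int THREADS = 10;
    static final int ENTRIES_PER_THREAD = 10_000;
    static final int VALUE_SIZE = 1024;   // 1 kB values
    static final int WRITE_PERCENT = 20;  // 20 % writes, 80 % reads

    final LongAdder reads = new LongAdder();
    final LongAdder writes = new LongAdder();

    // 10 threads x 10k entries x 1 kB, i.e. roughly 100 MB in total.
    static long dataSetBytes() {
        return (long) THREADS * ENTRIES_PER_THREAD * VALUE_SIZE;
    }

    // Each thread works on its own key range and performs opsPerThread operations.
    void run(Map<String, byte[]> cache, int opsPerThread) throws InterruptedException {
        Thread[] workers = new Thread[THREADS];
        for (int t = 0; t < THREADS; t++) {
            final int id = t;
            workers[t] = new Thread(() -> {
                ThreadLocalRandom rnd = ThreadLocalRandom.current();
                for (int i = 0; i < opsPerThread; i++) {
                    String key = "key-" + id + "-" + rnd.nextInt(ENTRIES_PER_THREAD);
                    if (rnd.nextInt(100) < WRITE_PERCENT) {
                        cache.put(key, new byte[VALUE_SIZE]);
                        writes.increment();
                    } else {
                        cache.get(key);
                        reads.increment();
                    }
                }
            });
            workers[t].start();
        }
        for (Thread w : workers) {
            w.join();
        }
    }
}
```

    Any thread-safe Map, for example a ConcurrentHashMap, can stand in for the cache when trying the sketch out; dividing the counters by the elapsed time gives reads/s and writes/s figures comparable in shape (not in value) to the tables in this document.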

     

    Cache store              reads/s    writes/s    note
    FileCacheStore           3.1M       112         on one node the performance was only 2.96M reads/s, 75 writes/s
    KarstenFileCacheStore    9.2M       226k
    LevelDB-JAVA impl.       3.9M       5100
    LevelDB-JNI impl.        6.6M       14k         on one node the performance was 3.9M/8.3k, about half of the others
    Without cache store      15.5M      4.4M

     

    The Karsten implementation clearly wins here, for two reasons. First, it does not flush the data (it only calls RandomAccessFile.write()). The other cheat is that it keeps in memory the keys and the offsets of the data values in the database file. It is therefore definitely the best choice for this scenario, but it does not allow the cache store to scale, especially when the keys are big and the values small. However, this performance boost is definitely worth investigating: I could imagine caching the disk offsets in memory and querying a persistent index only when a record is missing from that cache, with part of the persistent index flushed asynchronously (the index can always be rebuilt during preloading after a crash).
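
    The two tricks described above can be illustrated with a small sketch. This is not the actual KarstenFileCacheStore code; the class name OffsetIndexedStoreSketch and its layout are assumptions for the example. Values are appended with RandomAccessFile.write() and never forced to disk, while an in-memory map keeps each key together with the offset and length of its latest value.

```java
import java.io.File;
import java.io.IOException;
import java.io.RandomAccessFile;
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of an in-memory offset index; not the Karsten implementation.
final class OffsetIndexedStoreSketch implements AutoCloseable {
    private final RandomAccessFile data;
    // key -> {offset of the value in the data file, length of the value}
    private final Map<String, long[]> index = new HashMap<>();

    OffsetIndexedStoreSketch(File path) throws IOException {
        this.data = new RandomAccessFile(path, "rw");
    }

    synchronized void put(String key, byte[] value) throws IOException {
        long offset = data.length();
        data.seek(offset);
        data.write(value); // handed to the OS only; nothing is forced to disk
        index.put(key, new long[] {offset, value.length});
    }

    // A read is one map lookup plus one seek, with no on-disk index to traverse.
    synchronized byte[] get(String key) throws IOException {
        long[] location = index.get(key);
        if (location == null) {
            return null;
        }
        byte[] value = new byte[(int) location[1]];
        data.seek(location[0]);
        data.readFully(value);
        return value;
    }

    // Every key lives on the heap, so memory use grows with key count and key size.
    synchronized int indexedKeys() {
        return index.size();
    }

    @Override
    public synchronized void close() throws IOException {
        data.close();
    }
}
```

    For brevity the sketch writes only the values to the file. Rebuilding the index during preloading would additionally require each record to carry its key, and space from overwritten values is never reclaimed here.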

     

    The third test was meant to cover the scenario with more data than fits in memory: the stressors operated on 100k entries (~100 MB of data) but eviction was set to 10k entries (9216 entries ended up in memory after the test ended).
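
    The overflow behaviour exercised by this test can be sketched as a bounded in-memory map in front of a backing store. This is only an illustration and not how Infinispan implements eviction; the class name OverflowCacheSketch is hypothetical, a plain Map stands in for the file cache store, and the sketch is not thread-safe. With 100k entries and room for about 10k in memory, roughly 90 % of reads would have to go to the store if keys were picked uniformly at random (an assumption about the access pattern).

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical overflow sketch: bounded LRU in memory, everything written through to a store.
final class OverflowCacheSketch<K, V> {
    private final Map<K, V> store;             // stands in for the file cache store
    private final LinkedHashMap<K, V> memory;  // access-ordered, bounded in-memory part
    long storeReads;                           // number of reads that missed memory

    OverflowCacheSketch(final int maxEntries, Map<K, V> store) {
        this.store = store;
        this.memory = new LinkedHashMap<K, V>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
                // Evict the least recently used entry once the limit is exceeded.
                return size() > maxEntries;
            }
        };
    }

    // Write-through: the store always holds the full data set.
    void put(K key, V value) {
        store.put(key, value);
        memory.put(key, value);
    }

    // A miss in memory costs a store read and brings the entry back into memory.
    V get(K key) {
        V value = memory.get(key);
        if (value == null) {
            storeReads++;
            value = store.get(key);
            if (value != null) {
                memory.put(key, value);
            }
        }
        return value;
    }

    int inMemory() {
        return memory.size();
    }
}
```

    In this setup the read throughput is dominated by how fast the store can load single entries, which is what the third test is trying to measure.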

     

    Cache store              reads/s    writes/s      note
    FileCacheStore           750        285           one node had only 524 reads and 213 writes per second
    KarstenFileCacheStore    458k       137k
    LevelDB-JAVA impl.       21k        9k            these values are for mmap implementation (typo in test)
    LevelDB-JNI impl.        13k-46k    6.6k-15.2k    the performance varied a lot!

     

    We have also tested the second and third scenarios with an increased amount of data: each thread operated on 200k entries, giving about 2 GB of data in total. The test execution was also prolonged to a 5-minute warmup and a 10-minute test. FileCacheStore was excluded from this comparison.

    Update: I have also added FileChannel.force(false) calls to the Karsten implementation; the results are included below.
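
    For readers unfamiliar with the call, the following sketch shows what adding FileChannel.force(false) after each write looks like. It is not the patch applied to the Karsten implementation; the class name ForcedAppendSketch and the choice to force after every single write are assumptions for illustration. Passing false asks the OS to flush the file content to the storage device without requiring a metadata update as well.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Hypothetical sketch of appending with and without force(false).
final class ForcedAppendSketch implements AutoCloseable {
    private final FileChannel channel;
    private final boolean forceEachWrite;

    ForcedAppendSketch(Path file, boolean forceEachWrite) throws IOException {
        this.channel = FileChannel.open(file,
                StandardOpenOption.CREATE,
                StandardOpenOption.WRITE,
                StandardOpenOption.APPEND);
        this.forceEachWrite = forceEachWrite;
    }

    void append(byte[] record) throws IOException {
        ByteBuffer buffer = ByteBuffer.wrap(record);
        while (buffer.hasRemaining()) {
            channel.write(buffer); // may stay in the OS page cache
        }
        if (forceEachWrite) {
            // false: force file content only, metadata need not be written
            channel.force(false);
        }
    }

    long size() throws IOException {
        return channel.size();
    }

    @Override
    public void close() throws IOException {
        channel.close();
    }
}
```

    The forced variant trades write throughput for durability, which is consistent with the drop in writes/s between the plain and force(false) rows in the tables that follow.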

     

    Persistent storage scenario:

     

     

    Cache store                                   reads/s      writes/s     note
    KarstenFileCacheStore                         3.8M-5.3M    3600-7700
    KarstenFileCacheStore - force(false)          3.2M         1650
    LevelDB-JAVA                                  3.8M         2200
    LevelDB-JAVA - force(false)                   3.2M         400
    LevelDB-JAVA - force(false), SNAPPY (iq80)    3.2M         390
    LevelDB-JNI                                   5.3M         4650
    LevelDB-JNI - sync writes                     3.0M         1240
    LevelDB-JNI - sync writes, SNAPPY             3.2M         1240
    Without cache store                           6.2M         1.9M

     

    Overflow scenario:

     

    Cache store                                   reads/s        writes/s       note
    KarstenFileCacheStore                         265k           16k            one node had 21k writes/s
    KarstenFileCacheStore - force(false)          285k           1200
    LevelDB-JAVA                                  500 or 5900    400 or 4000    one node 10x faster! It shows different memory and CPU usage pattern. These values are for mmap implementation (typo in test)
    LevelDB-JAVA - force(false)                   950            520
    LevelDB-JAVA - force(false), SNAPPY (iq80)    950            515
    LevelDB-JNI                                   9200-14.4k     5400-6500
    LevelDB-JNI - sync writes                     15.5k          900            some variance between nodes
    LevelDB-JNI - sync writes, SNAPPY             14k-19k        750-1100       one node slower at writes

     

    Obviously the performance dropped radically from the 100 MB case.

     

    Another test tried to find out the impact of value size. We used the persistent storage configuration, with each thread operating on 100k entries with 1 kB values, 25k entries with 4 kB values, or 6125 entries with 16 kB values.

     

    Cache store              1kB values                    4kB values                    16kB values
    KarstenFileCacheStore    13k writes/s, one node 22k    13k writes/s, one node 24k    12.5k writes/s, one node 19k
    LevelDB-JNI              6k writes/s                   1400 writes/s                 400 writes/s

     

    Next test used 1kB, 4kB or 16kB keys and empty values:

     

    Cache store              1kB keys        4kB keys        16kB keys
    KarstenFileCacheStore    13k writes/s    12k writes/s    7k writes/s
    LevelDB-JNI              8k writes/s     490 writes/s    130 writes/s