I am experiencing stability issues running Hawkular Metrics on OpenShift. Every 10 minutes or so my 3-node Cassandra cluster becomes unavailable and I lose metrics. I'm guessing it's either garbage collection or compaction, because it only happens after the cluster has been running for a while.
Some notes:
Each node has ample resources: 64 GB of RAM and 24 cores.
The nodes never use more than 10 GB of RAM or go above 600 millicores.
During the unavailable periods, the Cassandra nodes report high numbers in the mutation pending column.
I'm not really collecting that many metrics right now (maybe 20 containers). God help me when there are hundreds of containers running!
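In case it helps anyone reproduce the observation about pending mutations, here is a minimal sketch of how that number could be pulled out of thread pool statistics. It assumes the column comes from `nodetool tpstats` and that rows are laid out as pool name, active, pending, completed, blocked, all time blocked. The function name `pending_for_pool` and the sample text are hypothetical and made up for illustration; the numbers are not from my cluster.

```python
from typing import Optional


def pending_for_pool(tpstats_output: str, pool: str = "MutationStage") -> Optional[int]:
    """Return the Pending count for a thread pool, or None if it is not found."""
    for line in tpstats_output.splitlines():
        fields = line.split()
        # Assumed row layout: name, active, pending, completed, blocked, all time blocked
        if len(fields) >= 3 and fields[0] == pool:
            try:
                return int(fields[2])
            except ValueError:
                # The row exists but the pending field is not a number
                return None
    return None


# Hypothetical sample in the assumed layout, not captured from a real cluster
SAMPLE = """\
Pool Name                    Active   Pending      Completed   Blocked  All time blocked
MutationStage                    32      4817        9120334         0                 0
ReadStage                         0         0          51233         0                 0
"""

print(pending_for_pool(SAMPLE))
```

Feeding this the real output on a schedule and logging the result alongside timestamps would show whether the pending count climbs right before each outage.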
Some adjustments and ideas:
I changed the JVM heap according to this document (Tuning Java resources); however, I did not change the garbage collector type.
I noticed that Hawkular is using the LCS compaction strategy. Wouldn't DTCS be more appropriate? (See: Configuring compaction.)
Any help would be great! Let me know if you need any more info.