6 Replies Latest reply on Aug 6, 2014 11:23 PM by genman

    RHQ Storage/Cassandra performance issue in RHQ 4.12

    genman

      This is my (fairly large) test system, with 3 cassandra nodes, 6 disks each, etc.

       

      RHQ 4.9 performed fairly well, with reasonable compaction and aggregation times, so I'm not sure why performance is now worse or different.

       

      This is what I observe.

       

      Overall, I am getting a high number of timeout exceptions on reading data. I increased the request timeout from 12 seconds to 60 through a code change, which helped somewhat.
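For what it's worth, the server side has equivalent knobs in cassandra.yaml, so the same experiment is possible without a code change. A sketch of the relevant options (names as in the Cassandra 1.2/2.0 cassandra.yaml; the 60-second values are illustrative, not a recommendation):

```yaml
# cassandra.yaml -- server-side request timeouts (illustrative values)
read_request_timeout_in_ms: 60000     # single-partition reads
range_request_timeout_in_ms: 60000    # range scans
request_timeout_in_ms: 60000          # default for other operations
```

Raising timeouts mostly hides the symptom, of course; the reads are still slow.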

       

      The nodes 'go down' and then come back frequently. Often the nodes hang (with high CPU), either outputting messages like this (which I think is GC cycling):

       

       INFO [ScheduledTasks:1] 2014-08-01 20:00:03,766 GCInspector.java (line 119) GC for ParNew: 609 ms for 2 collections, 5112200464 used; max is 8380219392
       INFO [ScheduledTasks:1] 2014-08-01 20:00:06,641 GCInspector.java (line 119) GC for ParNew: 1895 ms for 1 collections, 5760737488 used; max is 8380219392
      ERROR [Native-Transport-Requests:10200] 2014-08-01 20:09:07,215 ErrorMessage.java (line 210) Unexpected exception during request
      java.io.IOException: Connection reset by peer
        at sun.nio.ch.FileDispatcherImpl.read0(Native Method)
        at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39)
        at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:223)
        at sun.nio.ch.IOUtil.read(IOUtil.java:192)
        at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:375)
        at org.jboss.netty.channel.socket.nio.NioWorker.read(NioWorker.java:64)
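A side note on the ParNew lines above: with collections that long, it's worth checking how the heap is actually sized. In a stock Cassandra install that lives in cassandra-env.sh (RHQ drives these settings through its own properties file, so this fragment is illustrative only; the variable names are from the stock script):

```shell
# cassandra-env.sh -- explicit heap sizing (illustrative values)
MAX_HEAP_SIZE="8G"    # total heap; matches the ~8 GB max in the GC log above
HEAP_NEWSIZE="800M"   # new-gen size; an undersized new gen drives frequent ParNew collections
```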
      

       

      Or they print out stats like these frequently:

       

       INFO [ScheduledTasks:1] 2014-08-01 20:04:00,223 StatusLogger.java (line 114) ColumnFamily                Memtable ops,data
      ...
      
       INFO [ScheduledTasks:1] 2014-08-01 20:04:00,225 StatusLogger.java (line 117) rhq.raw_metrics              1426355,24117248
       INFO [ScheduledTasks:1] 2014-08-01 20:04:00,225 StatusLogger.java (line 117) rhq.twenty_four_hour_metrics                 0,0
       INFO [ScheduledTasks:1] 2014-08-01 20:04:00,225 StatusLogger.java (line 117) rhq.one_hour_metrics                      0,0
       INFO [ScheduledTasks:1] 2014-08-01 20:04:00,225 StatusLogger.java (line 117) rhq.schema_version                        0,0
       INFO [ScheduledTasks:1] 2014-08-01 20:04:00,225 StatusLogger.java (line 117) rhq.metrics_cache            1991582,66060288
       INFO [ScheduledTasks:1] 2014-08-01 20:04:00,225 StatusLogger.java (line 117) rhq.metrics_cache_index       144488,11534336
      

      Startup often fails with this:

       

      00:31:02,625 WARN  [org.rhq.enterprise.server.storage.StorageClientManager] (pool-6-thread-1) Storage client subsystem wasn't initialized because it wasn't possible to connect to the storage cluster. The RHQ server is set to MAINTENANCE mode. Please start the storage cluster as soon as possible.: com.datastax.driver.core.exceptions.NoHostAvailableException: All host(s) tried for query failed (tried: /17.176.208.117 (Timeout during read), /17.176.208.118 (Timeout during read), /17.176.208.119 (Timeout during read))
      

       

      I feel like I've hit a bug in Cassandra, or there is an installation issue, or there is a change in RHQ 4.12 that makes queries perform poorly, such as rows that are too wide.

       

      My nodes have 8G heap and pretty much the same cassandra.yaml configuration as 4.9.

       

      The weird thing is I can't seem to run repair reliably, even with RHQ down, so I suspect a bad disk, poor network performance, or the like. I really can't pin it on anything.
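One way to narrow that down is to scope repair tighter, assuming the keyspace is named rhq as in the StatusLogger output above; repairing one column family on one node at a time can localize a bad disk or a slow peer (nodetool syntax as in Cassandra 1.2):

```shell
# repair a single column family on one node to localize failures
nodetool -h <node> repair rhq raw_metrics

# watch for streams stuck between peers during the repair
nodetool -h <node> netstats
```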

       

      My thinking is that the queries are returning more data than can fit into the 8 GB heap.

        • 1. Re: RHQ Storage/Cassandra performance issue in RHQ 4.12
          genman

          My CF stats:

           

                          Column Family: metrics_cache_index

                          SSTable count: 10

                          Space used (live): 1000602058

                          Space used (total): 1000602058

                          Number of Keys (estimate): 1280

                          Memtable Columns Count: 0

                          Memtable Data Size: 0

                          Memtable Switch Count: 1

                          Read Count: 0

                          Read Latency: NaN ms.

                          Write Count: 932

                          Write Latency: 0.110 ms.

                          Pending Tasks: 0

                          Bloom Filter False Positives: 0

                          Bloom Filter False Ratio: 0.00000

                          Bloom Filter Space Used: 10400

                          Compacted row minimum size: 3312

                          Compacted row maximum size: 322381140

                          Compacted row mean size: 119919129

           

          And

           

                          Column Family: metrics_cache

                          SSTable count: 5

                          Space used (live): 10959279068

                          Space used (total): 10959279068

                          Number of Keys (estimate): 16665600

                          Memtable Columns Count: 2444

                          Memtable Data Size: 1048576

                          Memtable Switch Count: 1

                          Read Count: 0

                          Read Latency: NaN ms.

                          Write Count: 88555

                          Write Latency: 0.022 ms.

                          Pending Tasks: 0

                          Bloom Filter False Positives: 0

                          Bloom Filter False Ratio: 0.00000

                          Bloom Filter Space Used: 22764864

                          Compacted row minimum size: 51

                          Compacted row maximum size: 51012

                          Compacted row mean size: 1346
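Some plain arithmetic on the compacted row sizes in these two listings (nothing RHQ-specific, just unit conversion) shows why metrics_cache_index looks like the wide-row suspect while metrics_cache does not:

```python
MB = 1024 * 1024

# metrics_cache_index: compacted row sizes from the cfstats above
index_mean = 119_919_129 / MB   # ~114 MB per row on average
index_max = 322_381_140 / MB    # ~307 MB for the widest row

# metrics_cache: rows are tiny by comparison
cache_mean = 1_346 / MB         # well under a megabyte

print(f"metrics_cache_index mean row: {index_mean:.1f} MB, max: {index_max:.1f} MB")
print(f"metrics_cache mean row: {cache_mean:.4f} MB")
```

Rows that wide are expensive to read and compact, which lines up with the heap pressure described above.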

           

          Is there a way to better tune things?

           

          I created Bug 1126105 for now.

           

          Bug 1126105 – RHQ storage determineMostRecentRawDataSinceLastShutdown sub-optimal

          • 2. Re: RHQ Storage/Cassandra performance issue in RHQ 4.12
            john.sanda

            RHQ 4.10 introduced some performance changes to speed up the metrics aggregation that runs during the data purge job. One of the things I learned, in large part through your testing and feedback, is that while overall execution time improved, we still experience a lot of I/O contention. We performed N + 1 queries to aggregate raw data, where N is the number of measurement schedules. Several changes were made to the aggregation code in 4.12, some of them aimed at reducing the number of queries that have to be executed during data aggregation.

 

            I have done testing with a couple of different 4.12 installs that were collecting small amounts of metric data. One environment was a default installation; the other was running two nodes. In both cases, even though the load was small, I was getting alerts for high heap usage on my storage nodes. I suspect the increased memory usage and GC activity is due to the metrics_cache table and the wide rows in metrics_cache_index that get queried.
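To make the N + 1 shape concrete, the per-schedule read looks roughly like this (a hedged sketch; the column names are illustrative rather than the exact RHQ schema):

```sql
-- one of these per measurement schedule (the "N" in N + 1),
-- plus one query to enumerate the schedules
SELECT schedule_id, time, value
  FROM rhq.raw_metrics
 WHERE schedule_id = ? AND time >= ? AND time < ?;
```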

             

            When you say that nodes go down, are you referring to the long, increased GC pauses that you are experiencing? Or are nodes actually crashing? And are you noticing their availability being reported as DOWN as well?

            • 3. Re: RHQ Storage/Cassandra performance issue in RHQ 4.12
              john.sanda

              I forgot to ask before about your heap settings. What settings have you changed in cassandra-jvm.properties?
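(For anyone following along: that file lives under the storage node's conf directory, and the heap-related entries look roughly like the following — property names per a default RHQ storage install, values purely illustrative.)

```properties
# <rhq-storage>/conf/cassandra-jvm.properties (illustrative values)
heap_min="-Xms8192M"
heap_max="-Xmx8192M"
heap_new="-Xmn800M"
```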

              • 4. Re: RHQ Storage/Cassandra performance issue in RHQ 4.12
                genman

                John,

                 

                I think the problems are mostly resolved with the fix you brought forward here:

                 

                https://github.com/jsanda/rhq/commit/9fa4bfdb5c73c4aa6588a46d4f0a33c1364233d3

                 

                The GC issues were caused when some of the aggregation queries, namely the one for 'determineMostRecentRawDataSinceLastShutdown', would overwhelm the cluster. What I would see was high CPU, timeouts, temporary UP/DOWN statuses being logged, and sometimes crashes.

                • 5. Re: RHQ Storage/Cassandra performance issue in RHQ 4.12
                  john.sanda

                  That commit did a couple of things: it disabled the check for most recent raw data at start up, and it disabled the usage of the new metrics_cache table. The queries used for checking for recent raw data are also used during aggregation, so I am not entirely convinced that the problems are resolved. I would be interested to see how things look if you enabled the metrics_cache table again with the recent raw data check disabled.

                   

                  How many servers are you running and how frequently do they get restarted?

                  • 6. Re: RHQ Storage/Cassandra performance issue in RHQ 4.12
                    genman

                    John and I worked out most of the issues over email. Some of the problems may have been installation related.

                     

                    My JDK settings weren't different. I also noticed that I had (out of desperation) updated to JDK 8, and should go back to 7.

                     

                    There may or may not end up being an option to disable the metrics_cache table, among other fixes. These are tracked in Bugzilla, with some of the proposed fixes already tested:

                     

                    Bug 1126208 - Purge job can hang indefinitely

                    Bug 1126105 - RHQ storage determineMostRecentRawDataSinceLastShutdown sub-optimal