RHQ Storage/Cassandra performance issue in RHQ 4.12
genman Aug 1, 2014 4:48 PM
This is my (fairly large) test system, with 3 Cassandra nodes, 6 disks each, etc.
RHQ 4.9 performed fairly well, with reasonable compaction/aggregation times, so I'm not sure why performance is worse or different in 4.12.
This is what I observe.
Overall, I am getting a high number of timeout exceptions on reading data. I increased request timeout from 12 seconds to 60 through a code change and it somewhat helped.
The nodes 'go down' and then come back frequently. Often the nodes hang (with high CPU), either outputting messages like this (which I think is GC cycling):
INFO [ScheduledTasks:1] 2014-08-01 20:00:03,766 GCInspector.java (line 119) GC for ParNew: 609 ms for 2 collections, 5112200464 used; max is 8380219392
INFO [ScheduledTasks:1] 2014-08-01 20:00:06,641 GCInspector.java (line 119) GC for ParNew: 1895 ms for 1 collections, 5760737488 used; max is 8380219392
ERROR [Native-Transport-Requests:10200] 2014-08-01 20:09:07,215 ErrorMessage.java (line 210) Unexpected exception during request
java.io.IOException: Connection reset by peer
    at sun.nio.ch.FileDispatcherImpl.read0(Native Method)
    at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39)
    at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:223)
    at sun.nio.ch.IOUtil.read(IOUtil.java:192)
    at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:375)
    at org.jboss.netty.channel.socket.nio.NioWorker.read(NioWorker.java:64)
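To judge whether output like the GCInspector lines above really indicates GC pressure, it can help to pull out the pause time and heap occupancy per entry. The following is only a rough sketch: the function name `parse_gc_lines` and the regular expression are my own assumptions, written to match the log format exactly as quoted in this post, and may need adjusting for other Cassandra versions.

```python
import re

# Assumption: GCInspector lines look exactly like the two quoted in this post.
GC_LINE = re.compile(
    r"GC for (?P<collector>\w+): (?P<pause_ms>\d+) ms for (?P<count>\d+) collections, "
    r"(?P<used>\d+) used; max is (?P<max>\d+)"
)

def parse_gc_lines(text):
    """Return one dict per GCInspector entry found in `text` (hypothetical helper)."""
    events = []
    for m in GC_LINE.finditer(text):
        used = int(m.group("used"))
        max_heap = int(m.group("max"))
        events.append({
            "collector": m.group("collector"),
            "pause_ms": int(m.group("pause_ms")),
            "collections": int(m.group("count")),
            # Fraction of the maximum heap in use after the collection.
            "heap_used_fraction": used / max_heap,
        })
    return events

sample = (
    "INFO [ScheduledTasks:1] 2014-08-01 20:00:03,766 GCInspector.java (line 119) "
    "GC for ParNew: 609 ms for 2 collections, 5112200464 used; max is 8380219392\n"
    "INFO [ScheduledTasks:1] 2014-08-01 20:00:06,641 GCInspector.java (line 119) "
    "GC for ParNew: 1895 ms for 1 collections, 5760737488 used; max is 8380219392\n"
)

for event in parse_gc_lines(sample):
    print(event["collector"], event["pause_ms"], "ms,",
          f"heap {event['heap_used_fraction']:.0%} full")
```

For the two entries quoted here, the heap is roughly 61% and 69% full after collection, and the second ParNew pause is close to two seconds, which is long for a young-generation collection.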
Or they print out the stats frequently:
INFO [ScheduledTasks:1] 2014-08-01 20:04:00,223 StatusLogger.java (line 114) ColumnFamily Memtable ops,data
...
INFO [ScheduledTasks:1] 2014-08-01 20:04:00,225 StatusLogger.java (line 117) rhq.raw_metrics 1426355,24117248
INFO [ScheduledTasks:1] 2014-08-01 20:04:00,225 StatusLogger.java (line 117) rhq.twenty_four_hour_metrics 0,0
INFO [ScheduledTasks:1] 2014-08-01 20:04:00,225 StatusLogger.java (line 117) rhq.one_hour_metrics 0,0
INFO [ScheduledTasks:1] 2014-08-01 20:04:00,225 StatusLogger.java (line 117) rhq.schema_version 0,0
INFO [ScheduledTasks:1] 2014-08-01 20:04:00,225 StatusLogger.java (line 117) rhq.metrics_cache 1991582,66060288
INFO [ScheduledTasks:1] 2014-08-01 20:04:00,225 StatusLogger.java (line 117) rhq.metrics_cache_index 144488,11534336
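To see which column families hold the most memtable data in a StatusLogger dump like the one above, the ops,data pairs can be extracted and sorted. This is a minimal sketch under the assumption that each row has the form `<keyspace.table> <ops>,<data bytes>` as quoted in this post; `parse_memtable_stats` is a hypothetical helper name, not part of Cassandra or RHQ.

```python
import re

# Assumption: per-table rows look like "... StatusLogger.java (line N) rhq.raw_metrics 1426355,24117248".
MEMTABLE_ROW = re.compile(
    r"StatusLogger\.java \(line \d+\) (?P<table>[\w.]+)\s+(?P<ops>\d+),(?P<data>\d+)"
)

def parse_memtable_stats(text):
    """Return (table, ops, data_bytes) tuples, largest memtable first (hypothetical helper)."""
    rows = [
        (m.group("table"), int(m.group("ops")), int(m.group("data")))
        for m in MEMTABLE_ROW.finditer(text)
    ]
    return sorted(rows, key=lambda row: row[2], reverse=True)

prefix = "INFO [ScheduledTasks:1] 2014-08-01 20:04:00,225 StatusLogger.java (line 117) "
status_sample = "\n".join(prefix + row for row in [
    "rhq.raw_metrics 1426355,24117248",
    "rhq.twenty_four_hour_metrics 0,0",
    "rhq.one_hour_metrics 0,0",
    "rhq.schema_version 0,0",
    "rhq.metrics_cache 1991582,66060288",
    "rhq.metrics_cache_index 144488,11534336",
])

stats = parse_memtable_stats(status_sample)
total_bytes = sum(data for _, _, data in stats)
for table, ops, data in stats:
    print(f"{table}: {ops} ops, {data} bytes")
print("total memtable bytes:", total_bytes)
```

In the dump quoted here, rhq.metrics_cache is the largest memtable, ahead of rhq.raw_metrics, while the one-hour and twenty-four-hour tables are empty.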
Startup often fails with this:
00:31:02,625 WARN [org.rhq.enterprise.server.storage.StorageClientManager] (pool-6-thread-1) Storage client subsystem wasn't initialized because it wasn't possible to connect to the storage cluster. The RHQ server is set to MAINTENANCE mode. Please start the storage cluster as soon as possible.: com.datastax.driver.core.exceptions.NoHostAvailableException: All host(s) tried for query failed (tried: /17.176.208.117 (Timeout during read), /17.176.208.118 (Timeout during read), /17.176.208.119 (Timeout during read))
I feel like I hit a bug in Cassandra, or there is an installation issue, or there is a change in RHQ 4.12 such as queries failing to perform well, like rows that are too wide.
My nodes have 8G heap and pretty much the same cassandra.yaml configuration as 4.9.
The weird thing is I can't seem to run repair reliably, even with RHQ down, so I suspect a bad disk, bad network performance, etc. I really can't pin it on anything.
My thinking is that the queries are returning more data than can fit into the 8 GB heap.