6 Replies Latest reply on Aug 6, 2014 11:23 PM by genman

    RHQ Storage/Cassandra performance issue in RHQ 4.12

    genman

      This is my (fairly large) test system, with 3 cassandra nodes, 6 disks each, etc.

       

      RHQ 4.9 performed fairly well, with reasonable compaction and aggregation times, so I'm not sure why performance is now worse or different.

       

      This is what I observe.

       

      Overall, I am getting a high number of timeout exceptions on reading data. I increased the request timeout from 12 seconds to 60 through a code change, which helped somewhat.
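For what it's worth, the server side has equivalent knobs in cassandra.yaml, so the same experiment is possible without a code change. A sketch of the relevant options (names as in the Cassandra 1.2/2.0 cassandra.yaml; the 60-second values are illustrative, not a recommendation):

```yaml
# cassandra.yaml -- server-side request timeouts (illustrative values)
read_request_timeout_in_ms: 60000     # single-partition reads
range_request_timeout_in_ms: 60000    # range scans
request_timeout_in_ms: 60000          # default for other operations
```

Raising timeouts mostly hides the symptom, of course; the reads are still slow.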

       

      The nodes 'go down' and then come back frequently. Often the nodes hang (with high CPU), either outputting messages like this (which I think is GC cycling):

       

       INFO [ScheduledTasks:1] 2014-08-01 20:00:03,766 GCInspector.java (line 119) GC for ParNew: 609 ms for 2 collections, 5112200464 used; max is 8380219392
       INFO [ScheduledTasks:1] 2014-08-01 20:00:06,641 GCInspector.java (line 119) GC for ParNew: 1895 ms for 1 collections, 5760737488 used; max is 8380219392
      ERROR [Native-Transport-Requests:10200] 2014-08-01 20:09:07,215 ErrorMessage.java (line 210) Unexpected exception during request
      java.io.IOException: Connection reset by peer
        at sun.nio.ch.FileDispatcherImpl.read0(Native Method)
        at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39)
        at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:223)
        at sun.nio.ch.IOUtil.read(IOUtil.java:192)
        at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:375)
        at org.jboss.netty.channel.socket.nio.NioWorker.read(NioWorker.java:64)
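A side note on the ParNew lines above: with collections that long, it's worth checking how the heap is actually sized. In a stock Cassandra install that lives in cassandra-env.sh (RHQ drives these settings through its own properties file, so this fragment is illustrative only; the variable names are from the stock script):

```shell
# cassandra-env.sh -- explicit heap sizing (illustrative values)
MAX_HEAP_SIZE="8G"    # total heap; matches the ~8 GB max in the GC log above
HEAP_NEWSIZE="800M"   # new-gen size; an undersized new gen drives frequent ParNew collections
```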
      

       

      Or they print out stats like these frequently:

       

       INFO [ScheduledTasks:1] 2014-08-01 20:04:00,223 StatusLogger.java (line 114) ColumnFamily                Memtable ops,data
      ...
      
       INFO [ScheduledTasks:1] 2014-08-01 20:04:00,225 StatusLogger.java (line 117) rhq.raw_metrics              1426355,24117248
       INFO [ScheduledTasks:1] 2014-08-01 20:04:00,225 StatusLogger.java (line 117) rhq.twenty_four_hour_metrics                 0,0
       INFO [ScheduledTasks:1] 2014-08-01 20:04:00,225 StatusLogger.java (line 117) rhq.one_hour_metrics                      0,0
       INFO [ScheduledTasks:1] 2014-08-01 20:04:00,225 StatusLogger.java (line 117) rhq.schema_version                        0,0
       INFO [ScheduledTasks:1] 2014-08-01 20:04:00,225 StatusLogger.java (line 117) rhq.metrics_cache            1991582,66060288
       INFO [ScheduledTasks:1] 2014-08-01 20:04:00,225 StatusLogger.java (line 117) rhq.metrics_cache_index       144488,11534336
      

      Startup often fails with this:

       

      00:31:02,625 WARN  [org.rhq.enterprise.server.storage.StorageClientManager] (pool-6-thread-1) Storage client subsystem wasn't initialized because it wasn't possible to connect to the storage cluster. The RHQ server is set to MAINTENANCE mode. Please start the storage cluster as soon as possible.: com.datastax.driver.core.exceptions.NoHostAvailableException: All host(s) tried for query failed (tried: /17.176.208.117 (Timeout during read), /17.176.208.118 (Timeout during read), /17.176.208.119 (Timeout during read))
      

       

      I feel like I've hit a bug in Cassandra, or there is an installation issue, or there is a change in RHQ 4.12 that makes queries perform poorly, such as rows that are too wide.

       

      My nodes have 8G heap and pretty much the same cassandra.yaml configuration as 4.9.

       

      The weird thing is I can't seem to run repair reliably, even with RHQ down, so I suspect a bad disk, poor network performance, or the like. I really can't pin it on anything.
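One way to narrow that down is to scope repair tighter, assuming the keyspace is named rhq as in the StatusLogger output above; repairing one column family on one node at a time can localize a bad disk or a slow peer (nodetool syntax as in Cassandra 1.2):

```shell
# repair a single column family on one node to localize failures
nodetool -h <node> repair rhq raw_metrics

# watch for streams stuck between peers during the repair
nodetool -h <node> netstats
```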

       

      My thinking is that the queries are returning more data than can fit into the 8 GB heap.

        • 1. Re: RHQ Storage/Cassandra performance issue in RHQ 4.12
          genman

          My CF stats:

           

                          Column Family: metrics_cache_index

                          SSTable count: 10

                          Space used (live): 1000602058

                          Space used (total): 1000602058

                          Number of Keys (estimate): 1280

                          Memtable Columns Count: 0

                          Memtable Data Size: 0

                          Memtable Switch Count: 1

                          Read Count: 0

                          Read Latency: NaN ms.

                          Write Count: 932

                          Write Latency: 0.110 ms.

                          Pending Tasks: 0

                          Bloom Filter False Positives: 0

                          Bloom Filter False Ratio: 0.00000

                          Bloom Filter Space Used: 10400

                          Compacted row minimum size: 3312

                          Compacted row maximum size: 322381140

                          Compacted row mean size: 119919129

           

          And

           

                          Column Family: metrics_cache

                          SSTable count: 5

                          Space used (live): 10959279068

                          Space used (total): 10959279068

                          Number of Keys (estimate): 16665600

                          Memtable Columns Count: 2444

                          Memtable Data Size: 1048576

                          Memtable Switch Count: 1

                          Read Count: 0

                          Read Latency: NaN ms.

                          Write Count: 88555

                          Write Latency: 0.022 ms.

                          Pending Tasks: 0

                          Bloom Filter False Positives: 0

                          Bloom Filter False Ratio: 0.00000

                          Bloom Filter Space Used: 22764864

                          Compacted row minimum size: 51

                          Compacted row maximum size: 51012

                          Compacted row mean size: 1346
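Some plain arithmetic on the compacted row sizes in these two listings (nothing RHQ-specific, just unit conversion) shows why metrics_cache_index looks like the wide-row suspect while metrics_cache does not:

```python
MB = 1024 * 1024

# metrics_cache_index: compacted row sizes from the cfstats above
index_mean = 119_919_129 / MB   # ~114 MB per row on average
index_max = 322_381_140 / MB    # ~307 MB for the widest row

# metrics_cache: rows are tiny by comparison
cache_mean = 1_346 / MB         # well under a megabyte

print(f"metrics_cache_index mean row: {index_mean:.1f} MB, max: {index_max:.1f} MB")
print(f"metrics_cache mean row: {cache_mean:.4f} MB")
```

Rows that wide are expensive to read and compact, which lines up with the heap pressure described above.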

           

          Is there a way to better tune things?

           

          I created Bug 1126105 for now.

           

          Bug 1126105 – RHQ storage determineMostRecentRawDataSinceLastShutdown sub-optimal

          • 2. Re: RHQ Storage/Cassandra performance issue in RHQ 4.12
            john.sanda

            RHQ 4.10 introduced some performance changes to speed up the metrics aggregation that runs during the data purge job. One of the things I learned, in large part through your testing and feedback, is that while overall execution time improved, we still experience a lot of I/O contention. We performed N + 1 queries to aggregate raw data, where N is the number of measurement schedules. Several changes were made to the aggregation code in 4.12, some of them aimed at reducing the number of queries that have to be executed during data aggregation.

 

            I have done testing with a couple of different 4.12 installs that were collecting small amounts of metric data. One environment was a default installation; the other was running two nodes. In both cases, even though the load was small, I was getting alerts for high heap usage on my storage nodes. I suspect the increased memory usage and GC activity is due to the metrics_cache table and the wide rows in metrics_cache_index that get queried.
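To make the N + 1 shape concrete, the per-schedule read looks roughly like this (a hedged sketch; the column names are illustrative rather than the exact RHQ schema):

```sql
-- one of these per measurement schedule (the "N" in N + 1),
-- plus one query to enumerate the schedules
SELECT schedule_id, time, value
  FROM rhq.raw_metrics
 WHERE schedule_id = ? AND time >= ? AND time < ?;
```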

             

            When you say that nodes go down, are you referring to the long, increased GC pauses that you are experiencing? Or are nodes actually crashing? And are you noticing their availability being reported as DOWN as well?

            • 3. Re: RHQ Storage/Cassandra performance issue in RHQ 4.12
              john.sanda

              I forgot to ask before about your heap settings. What settings have you changed in cassandra-jvm.properties?
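(For anyone following along: that file lives under the storage node's conf directory, and the heap-related entries look roughly like the following — property names per a default RHQ storage install, values purely illustrative.)

```properties
# <rhq-storage>/conf/cassandra-jvm.properties (illustrative values)
heap_min="-Xms8192M"
heap_max="-Xmx8192M"
heap_new="-Xmn800M"
```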

              • 4. Re: RHQ Storage/Cassandra performance issue in RHQ 4.12
                genman

                John,

                 

                I think the problems are mostly resolved with the fix you brought forward here:

                 

                https://github.com/jsanda/rhq/commit/9fa4bfdb5c73c4aa6588a46d4f0a33c1364233d3

                 

                The GC issues were caused when some of the aggregation queries, namely the one for 'determineMostRecentRawDataSinceLastShutdown', would overwhelm the cluster. What I would see was high CPU, timeouts, temporary UP/DOWN statuses being logged, and sometimes crashes.

                • 5. Re: RHQ Storage/Cassandra performance issue in RHQ 4.12
                  john.sanda

                  That commit did a couple of things: it disabled the check for most recent raw data at start up, and it disabled the usage of the new metrics_cache table. The queries used for checking for recent raw data are also used during aggregation, so I am not entirely convinced that the problems are resolved. I would be interested to see how things look if you enabled the metrics_cache table again with the recent raw data check disabled.

                   

                  How many servers are you running and how frequently do they get restarted?

                  • 6. Re: RHQ Storage/Cassandra performance issue in RHQ 4.12
                    genman

                    John and I worked out most of the issues over email. Some of the problems may have been installation related.

                     

                    My JDK settings weren't different. I also noticed that I had (out of desperation) updated to JDK 8, and should go back to 7.

                     

                    There may or may not end up being an option to disable the metrics_cache table, among other fixes. These are tracked in Bugzilla, with some of the proposed fixes already tested:

                     

                    Bug 1126208 - Purge job can hang indefinitely

                    Bug 1126105 - RHQ storage determineMostRecentRawDataSinceLastShutdown sub-optimal