7 Replies Latest reply on Dec 16, 2013 3:04 PM by john.sanda

    RHQ 4.9; Data compression taking a long, long time

    genman

      Initially I noticed that graphs were missing data more than a week old.

       

      Digging a bit deeper, I've been seeing the following being logged:

       

      server.log:08:41:43,250 INFO  [org.rhq.enterprise.server.scheduler.jobs.DataPurgeJob] (RHQScheduler_Worker-3) Data Purge Job FINISHED [811155108]ms
      

       

      For a job that runs hourly, this is a long time... However, read and write performance for metrics data is generally very good.

       

      Since the purge job takes such a long, long time, the hourly data is very sporadic. Here is the disk space percentage for one filesystem, as an example:

       

      cqlsh> select * from rhq.one_hour_metrics where schedule_id = 5246208;
      
      
       schedule_id | time                     | type | value
      -------------+--------------------------+------+-----------
           5246208 | 2013-11-20 17:00:00+0000 |    0 |        nan
           5246208 | 2013-11-20 17:00:00+0000 |    1 |        nan
           5246208 | 2013-11-20 17:00:00+0000 |    2 |          0
           5246208 | 2013-11-20 18:00:00+0000 |    0 | 2.7201e+08
           5246208 | 2013-11-20 18:00:00+0000 |    1 |   2.72e+08
           5246208 | 2013-11-20 18:00:00+0000 |    2 | 2.7201e+08
           5246208 | 2013-11-21 22:00:00+0000 |    0 | 2.7199e+08
           5246208 | 2013-11-21 22:00:00+0000 |    1 | 2.7199e+08
           5246208 | 2013-11-21 22:00:00+0000 |    2 | 2.7199e+08
           5246208 | 2013-11-24 14:00:00+0000 |    0 |        nan
           5246208 | 2013-11-24 14:00:00+0000 |    1 |        nan
           5246208 | 2013-11-24 14:00:00+0000 |    2 |          0
           5246208 | 2013-11-24 21:00:00+0000 |    0 |        nan
           5246208 | 2013-11-24 21:00:00+0000 |    1 |        nan
           5246208 | 2013-11-24 21:00:00+0000 |    2 |          0
           5246208 | 2013-12-03 18:00:00+0000 |    0 | 2.7957e+08
           5246208 | 2013-12-03 18:00:00+0000 |    1 | 2.7956e+08
           5246208 | 2013-12-03 18:00:00+0000 |    2 | 2.7957e+08
      

       

      I've also noticed that some queries to Cassandra seem to hang:

       

      cqlsh> select * from system_auth.users;
      
       name      | super
      -----------+-------
        rhqadmin |  True
       cassandra |  True
      

       

      This can take about 30 seconds for some strange reason, though querying with WHERE name = 'x' comes back instantly. I've run nodetool repair on this keyspace (and on the rhq keyspace), and it doesn't seem to speed anything up.

       

      I've also seen disk space usage for Cassandra simply go up and up. Aside from the drastic plan of decommissioning each node and starting over, is there something that might be done here? I'm guessing there might be some carry-over from when I messed up the file system configuration, but I assume that repair would have fixed that.

        • 1. Re: RHQ 4.9; Data compression taking a long, long time
          john.sanda

          The data aggregation done during the purge job is the big culprit behind those long execution times. That is being addressed in https://bugzilla.redhat.com/show_bug.cgi?id=1009945. I have already done a lot of the work for this, and I hope to get it into 4.10. In 4.9, all of the aggregation is done sequentially. The changes I have made make extensive use of the C* driver's async API and compute the aggregates in parallel. All of the work will still be performed by a single RHQ server; however, you will be able to configure the read/write throughput for client requests. I have a few other things on my plate right now, but when I resume working on this (hopefully in the next week or two) I will start writing up some docs on the changes.
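
          To give a rough idea of the direction (this is only a sketch, not the actual RHQ code; the raw_metrics column layout, the schedule id, and the 2.0 version of the DataStax Java driver are all assumptions on my part), the async approach amounts to issuing the per-schedule reads through the driver's async API and capping how many requests are in flight. That cap is essentially the configurable throughput I mentioned:

          import com.datastax.driver.core.Cluster;
          import com.datastax.driver.core.PreparedStatement;
          import com.datastax.driver.core.ResultSet;
          import com.datastax.driver.core.ResultSetFuture;
          import com.datastax.driver.core.Session;
          import com.google.common.util.concurrent.MoreExecutors;

          import java.util.Arrays;
          import java.util.Date;
          import java.util.List;
          import java.util.concurrent.Semaphore;

          public class ParallelAggregationSketch {

              // Cap on concurrent Cassandra requests; stands in for the configurable
              // read/write throughput setting.
              private static final int MAX_IN_FLIGHT = 50;

              public static void main(String[] args) throws InterruptedException {
                  Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
                  final Session session = cluster.connect("rhq");
                  final Semaphore permits = new Semaphore(MAX_IN_FLIGHT);

                  // Assumed raw_metrics layout: (schedule_id int, time timestamp, value double)
                  PreparedStatement rawQuery = session.prepare(
                      "SELECT value FROM raw_metrics WHERE schedule_id = ? AND time >= ? AND time < ?");

                  List<Integer> scheduleIds = Arrays.asList(5246208); // placeholder for the real index lookup
                  Date start = new Date(System.currentTimeMillis() - 3600 * 1000L);
                  Date end = new Date();

                  for (Integer scheduleId : scheduleIds) {
                      permits.acquire(); // blocks once MAX_IN_FLIGHT requests are outstanding
                      final ResultSetFuture future =
                          session.executeAsync(rawQuery.bind(scheduleId, start, end));
                      future.addListener(new Runnable() {
                          public void run() {
                              try {
                                  ResultSet rows = future.getUninterruptibly();
                                  // compute min/max/avg for this schedule from rows and write the
                                  // 1-hour aggregate back, also via executeAsync(...)
                              } finally {
                                  permits.release();
                              }
                          }
                      }, MoreExecutors.sameThreadExecutor());
                  }

                  // Wait for all outstanding requests to complete, then shut down.
                  permits.acquire(MAX_IN_FLIGHT);
                  cluster.close();
              }
          }

          The semaphore is just the simplest possible throttle; the real implementation may do something more elaborate, but the idea of bounding the number of in-flight async requests is the same.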

          • 2. Re: Re: RHQ 4.9; Data compression taking a long, long time
            genman

            I have a server with 22,000 metrics/minute for about 60,000 resources, and the purge times I see are about 4-8 minutes, which is much faster than what you state in Bugzilla. (I probably have better hardware than you, though.)

             

            12:03:50,437 INFO  [org.rhq.server.metrics.MetricsServer] (RHQScheduler_Worker-1) Finished computing aggregates for table [raw_metrics]230416 ms
            12:08:49,855 INFO  [org.rhq.server.metrics.MetricsServer] (RHQScheduler_Worker-1) Finished computing aggregates for table [one_hour_metrics] 270002 ms
            12:09:20,535 INFO  [org.rhq.server.metrics.MetricsServer] (RHQScheduler_Worker-1) Finished computing aggregates for table [six_hour_metrics] 0 ms
            
            

             

            The other server, with about 150,000 metrics/minute for 400,000 resources, takes not 8 times as long but far longer (many hours). I would expect the purge to scale linearly at worst, but are you saying it is worse than that?

             

            17:56:49,985 INFO  [org.rhq.server.metrics.MetricsServer] (RHQScheduler_Worker-4) Finished computing aggregates for table [raw_metrics]119701597 ms
            02:04:36:00,650 INFO  [org.rhq.server.metrics.MetricsServer] (RHQScheduler_Worker-4) Finished computing aggregates for table [one_hour_metrics] 11826795 ms
            
            

             

            I'm really worried that the problem has to do with my Cassandra cluster, not with the purge process itself, but as I said, performance seems pretty good except in this case.

             

            ... Edit: Actually, I think most of this is because RHQ may be querying across datacenters. I have a fix, https://bugzilla.redhat.com/show_bug.cgi?id=1033943, that I haven't gotten around to deploying to all servers. I'll need to deploy this fix and see whether things work at least a little better.
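
            For anyone following along: keeping the driver's requests pinned to the local datacenter generally comes down to configuring a DC-aware load balancing policy. As a rough sketch only (I'm not claiming this is exactly what the patch does; the datacenter name, contact point, and driver 2.0 API are assumptions), it looks something like this:

            import com.datastax.driver.core.Cluster;
            import com.datastax.driver.core.Session;
            import com.datastax.driver.core.policies.DCAwareRoundRobinPolicy;
            import com.datastax.driver.core.policies.TokenAwarePolicy;

            public class LocalDcClientSketch {
                public static void main(String[] args) {
                    Cluster cluster = Cluster.builder()
                        .addContactPoint("127.0.0.1")
                        // Route requests only to nodes in the local DC ("dc-east" is a placeholder);
                        // TokenAwarePolicy additionally prefers the replicas that own the data.
                        .withLoadBalancingPolicy(
                            new TokenAwarePolicy(new DCAwareRoundRobinPolicy("dc-east")))
                        .build();
                    Session session = cluster.connect("rhq");
                    System.out.println(session.execute("SELECT release_version FROM system.local").one());
                    cluster.close();
                }
            }

            With a plain round-robin policy, a coordinator in the remote datacenter can end up handling requests, which is the sort of thing that could explain latencies like the ones I was seeing.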

            • 3. Re: RHQ 4.9; Data compression taking a long, long time
              john.sanda

              At 150k metrics/minute I have seen things get pretty slow, but nothing like the numbers you have reported. I have seen times as high as 3 hours (and even higher). What kinds of read latencies are being reported? In the resource inventory, go to Platform --> Storage Node --> Storage Proxy and look at the read latency metrics. I haven't had a chance to review the patch, but I am very eager to hear how things turn out when you deploy the changes.

              • 4. Re: RHQ 4.9; Data compression taking a long, long time
                genman

                Recent read latency is about 70,000 microseconds, which is lousy. Write latency is about 300 microseconds.

                 

                My lab cluster is reading at about 500 and writing at about 200.

                • 5. Re: Re: RHQ 4.9; Data compression taking a long, long time
                  genman

                  Things have gotten moderately better since I deployed the datacenter fix to all my RHQ servers. Read latency is now around 30,000 μs. (I'm actually not sure why this fix would help here, since this is latency from the perspective of Cassandra, not RHQ.) Here's a graph to ponder. The m* nodes are in the second datacenter.

                   

                  [Attached screenshot: Screen Shot 2013-12-10 at 10.10.11 AM.png]

                   

                  The second datacenter (the flat line above) looks much better, but I suspect not many reads are going on there. Still, it's something like 60 times faster.

                   

                  [Attached screenshot: Screen Shot 2013-12-10 at 10.55.24 AM.png]

                  • 6. Re: RHQ 4.9; Data compression taking a long, long time
                    genman

                    John, I'm not sure you'll see this. I've taken a look at your bug/1009945 branch; what kind of shape is it in? Can I test it now?

                     

                    One concern I have is that the metrics index table is basically changing, so I wonder if the branch might fail to work against existing data. It seems like the worst that could happen is that aggregates aren't computed for existing entries.

                    • 7. Re: RHQ 4.9; Data compression taking a long, long time
                      john.sanda

                      It's actually been a while since I've had a chance to work on it. It still needs some testing before I can consider merging to master. If anyone wanted to volunteer with testing (hint hint )) that would certainly move things along more quickly. I do need to revisit/complete the changes to metrics_index. It may be a premature optimization. If those changes do stay though, then there will be a schema update change to store index entries in the new format. That schema update would at installation time just like when the RDBMS schema changes are applied.