6 Replies Latest reply on Jul 17, 2014 3:58 AM by esatikka

    Metric data disappears from Cassandra after two weeks

    esatikka

      Hi,

       

      There is an issue with the RHQ instance I'm administering that started already when I upgraded to RHQ 4.9 from 4.6 (thus migrating metrics into Cassandra) and as upgrading to 4.11 didn't seem to change anything regarding the issue I have started looking for a fix, but so far haven't got a clue.

       

      All the metric data in the system seems to get dropped after just two weeks, which in some cases is too early as it makes it harder to see how some things have evolved and differs from how things worked back in 4.6. I actually still have metrics from the time prior the 4.6->4.9 upgrade migrated in Cassandra from Oracle up to last December, after that there is a long gap until the tail of metric data in Cassandra is found. Availability data seems to persist, though. I have peeked into Cassandra to verify that indeed there are no metrics after now-14 days, i.e. currently

      select * from twenty_four_hour_metrics where time < '2014-05-12 00:00:00' allow filtering;

      returns only data from last year, other metrics column-families are completely empty for that time (but I guess they should be that way).

       

      As there is no system setting stating how long measurement data is to be kept I wonder how this could be changed/fixed. I hope some of you guys could help.

       

      Regards,

      Esa

        • 1. Re: Metric data disappears from Cassandra after two weeks
          genman

          The loss of data wasn't intentional... Data should be getting kept for around 6 months or so.

           

          After two weeks, the data goes from raw_data to the 1h table.

           

          Can you post the logs for org.rhq.enterprise.server.scheduler.jobs.DataPurgeJob (grep for this). You should be seeing something going on here with measurement data compression.

           

          I would consider upgrading to 4.11 (taking note of the possible bugs) as well, as there are performance fixes for compression.

          • 2. Re: Re: Metric data disappears from Cassandra after two weeks
            esatikka

            Sorry I was not clear enough on it, but I am running 4.11 at the moment (I was kinda hoping the issue I first noticed in 4.9 to have been fixed there).

             

            Here's the logs on the first data purge job today:

            00:00:00,025 INFO  [org.rhq.enterprise.server.scheduler.jobs.DataPurgeJob] (RHQScheduler_Worker-3) Data Purge Job STARTING

            00:00:00,041 INFO  [org.rhq.enterprise.server.scheduler.jobs.DataPurgeJob] (RHQScheduler_Worker-3) Measurement data compression starting at Wed May 28 00:00:00 EEST 2014

            00:00:35,651 INFO  [org.rhq.enterprise.server.scheduler.jobs.DataPurgeJob] (RHQScheduler_Worker-3) Measurement data compression completed in [35610]ms

            00:00:35,651 INFO  [org.rhq.enterprise.server.scheduler.jobs.DataPurgeJob] (RHQScheduler_Worker-3) Measurement calltime data purge starting at Wed May 28 00:00:35 EEST 2014

            00:00:35,651 INFO  [org.rhq.enterprise.server.scheduler.jobs.DataPurgeJob] (RHQScheduler_Worker-3) Purging calltime data that is older than Sun Apr 27 00:00:35 EEST 2014

            00:00:35,682 INFO  [org.rhq.enterprise.server.scheduler.jobs.DataPurgeJob] (RHQScheduler_Worker-3) Calltime purged [0] - completed in [31]ms

            00:00:35,682 INFO  [org.rhq.enterprise.server.scheduler.jobs.DataPurgeJob] (RHQScheduler_Worker-3) Event data purge starting at Wed May 28 00:00:35 EEST 2014

            00:00:35,682 INFO  [org.rhq.enterprise.server.scheduler.jobs.DataPurgeJob] (RHQScheduler_Worker-3) Purging event data older than Wed May 14 00:00:35 EEST 2014

            00:00:37,151 INFO  [org.rhq.enterprise.server.scheduler.jobs.DataPurgeJob] (RHQScheduler_Worker-3) Event data purged [1570] - completed in [1469]ms

            00:00:37,151 INFO  [org.rhq.enterprise.server.scheduler.jobs.DataPurgeJob] (RHQScheduler_Worker-3) Alert data purge starting at Wed May 28 00:00:37 EEST 2014

            00:00:37,151 INFO  [org.rhq.enterprise.server.scheduler.jobs.DataPurgeJob] (RHQScheduler_Worker-3) Purging alert data older than Sun Apr 27 00:00:37 EEST 2014

            00:00:37,166 INFO  [org.rhq.enterprise.server.scheduler.jobs.DataPurgeJob] (RHQScheduler_Worker-3) Alert data purged [0] - completed in [15]ms

            00:00:37,166 INFO  [org.rhq.enterprise.server.scheduler.jobs.DataPurgeJob] (RHQScheduler_Worker-3) Alert definition unused purge starting at Wed May 28 00:00:37 EEST 2014

            00:00:37,182 INFO  [org.rhq.enterprise.server.scheduler.jobs.DataPurgeJob] (RHQScheduler_Worker-3) Alert definitions purged [0] - completed in [16]ms

            00:00:37,182 INFO  [org.rhq.enterprise.server.scheduler.jobs.DataPurgeJob] (RHQScheduler_Worker-3) Alert condition orphan purge starting at Wed May 28 00:00:37 EEST 2014

            00:00:37,182 INFO  [org.rhq.enterprise.server.scheduler.jobs.DataPurgeJob] (RHQScheduler_Worker-3) Purged [0] orphan alert conditions - completed in [0]ms

            00:00:37,182 INFO  [org.rhq.enterprise.server.scheduler.jobs.DataPurgeJob] (RHQScheduler_Worker-3) Alert notification orphan purge starting at Wed May 28 00:00:37 EEST 2014

            00:00:37,198 INFO  [org.rhq.enterprise.server.scheduler.jobs.DataPurgeJob] (RHQScheduler_Worker-3) Purged [0] orphan alert notifications - completed in [16]ms

            00:00:37,198 INFO  [org.rhq.enterprise.server.scheduler.jobs.DataPurgeJob] (RHQScheduler_Worker-3) Trait data purge starting at Wed May 28 00:00:37 EEST 2014

            00:00:37,198 INFO  [org.rhq.enterprise.server.scheduler.jobs.DataPurgeJob] (RHQScheduler_Worker-3) Purging traits that are older than Tue May 28 00:00:37 EEST 2013

            00:00:37,198 INFO  [org.rhq.enterprise.server.scheduler.jobs.DataPurgeJob] (RHQScheduler_Worker-3) Traits data purged [0] - completed in [0]ms

            00:00:37,198 INFO  [org.rhq.enterprise.server.scheduler.jobs.DataPurgeJob] (RHQScheduler_Worker-3) Availability data purge starting at Wed May 28 00:00:37 EEST 2014

            00:00:37,198 INFO  [org.rhq.enterprise.server.scheduler.jobs.DataPurgeJob] (RHQScheduler_Worker-3) Purging availablities that are older than Tue May 28 00:00:37 EEST 2013

            00:00:37,213 INFO  [org.rhq.enterprise.server.scheduler.jobs.DataPurgeJob] (RHQScheduler_Worker-3) Availability data purged [0] - completed in [15]ms

            00:00:37,213 INFO  [org.rhq.enterprise.server.scheduler.jobs.DataPurgeJob] (RHQScheduler_Worker-3) Drift file orphan purge starting at Wed May 28 00:00:37 EEST 2014

            00:00:37,213 INFO  [org.rhq.enterprise.server.scheduler.jobs.DataPurgeJob] (RHQScheduler_Worker-3) Purging orphaned drift files older than Sun Apr 27 00:00:37 EEST 2014

            00:00:37,245 INFO  [org.rhq.enterprise.server.scheduler.jobs.DataPurgeJob] (RHQScheduler_Worker-3) Purged [0] orphaned drift files - completed in [32]ms

            00:00:37,245 INFO  [org.rhq.enterprise.server.scheduler.jobs.DataPurgeJob] (RHQScheduler_Worker-3) Database maintenance starting at Wed May 28 00:00:37 EEST 2014

            00:00:37,245 INFO  [org.rhq.enterprise.server.scheduler.jobs.DataPurgeJob] (RHQScheduler_Worker-3) Performing daily database maintenance

            00:00:37,260 INFO  [org.rhq.enterprise.server.scheduler.jobs.DataPurgeJob] (RHQScheduler_Worker-3) Re-indexing data tables

            00:00:58,182 INFO  [org.rhq.enterprise.server.scheduler.jobs.DataPurgeJob] (RHQScheduler_Worker-3) Database maintenance completed in [20937]ms

            00:00:58,198 INFO  [org.rhq.enterprise.server.scheduler.jobs.DataPurgeJob] (RHQScheduler_Worker-3) Auto-calculation of baselines starting at Wed May 28 00:00:58 EEST 2014

            00:01:04,557 INFO  [org.rhq.enterprise.server.scheduler.jobs.DataPurgeJob] (RHQScheduler_Worker-3) Auto-calculation of baselines completed in [6359]ms

            00:01:04,557 INFO  [org.rhq.enterprise.server.scheduler.jobs.DataPurgeJob] (RHQScheduler_Worker-3) Auto-calculation of OOBs starting

            00:01:08,901 INFO  [org.rhq.enterprise.server.scheduler.jobs.DataPurgeJob] (RHQScheduler_Worker-3) Auto-calculation of OOBs completed in [4344]ms

            00:01:08,901 INFO  [org.rhq.enterprise.server.scheduler.jobs.DataPurgeJob] (RHQScheduler_Worker-3) Data Purge Job FINISHED [68876]ms

            What to me seems to not be explicitly stated neither in the log or in the config is purging of measurement data (other than calltime or trait data). Is the retention period hardcoded or based on some other setting?

            • 3. Re: Metric data disappears from Cassandra after two weeks
              genman

              The data retention is hard coded to raw being 7 days, one hour 14, six hour 31, and twenty four 365. So I'm guessing your six hour data isn't getting inserted?

               

              Check the date/time on your Cassandra host matches your server. Check the content of tables makes sense. You can also enable DEBUG to see how much data is going into the six hour side.

              • 4. Re: Re: Metric data disappears from Cassandra after two weeks
                esatikka

                There are six hour metrics, too, but also end at -2 weeks from now... The table contents look sensible to me but then again I have no way of telling if my tables differ from what they should be.

                The server time is synced from network and looks just fine (also haven't noticed significant agent-server time differences being reported). I have now enabled DEBUG logging, just have to wait and see until next data purge that touches the aggregated metrics... I'll get back when I have some results.

                • 5. Re: Re: Metric data disappears from Cassandra after two weeks
                  esatikka

                  I attached more detailed log on the data purge job last night. From what I can see, there are no explicit errors (except for OOBs which are not being used). I can't spot anything that would indicate to me that 6H metrics are being cut at -2 weeks.

                  I also did some reading of the project source code but uh, it would require quite a much more time to get somewhere with it than I have at my disposal at the moment. However, I did run into the metrics_index table in the source and did some queries there. I'm not completely sure what is the role of that index, but the data looks odd to me. Let's take an example schedule ID 15162:

                   

                  buckettimeschedule_id
                  one_hour_metrics2014-06-02 09:00:00+030015162
                  six_hour_metrics2014-06-02 06:00:00+030015162
                  twenty_four_hour_metrics2014-06-02 00:00:00+030015162
                  one_hour_metrics2014-05-06 23:00:00+030015162
                  one_hour_metrics2014-05-06 22:00:00+030015162
                  one_hour_metrics2014-05-06 21:00:00+030015162
                  one_hour_metrics2014-05-06 20:00:00+030015162
                  six_hour_metrics2014-05-06 18:00:00+030015162
                  twenty_four_hour_metrics2014-05-06 00:00:00+030015162
                  six_hour_metrics2014-03-30 00:00:00+020015162
                  twenty_four_hour_metrics2014-03-29 00:00:00+020015162
                  raw_metrics2013-12-09 09:00:00+020015162
                  raw_metrics2013-12-09 08:00:00+020015162
                  raw_metrics2013-12-09 07:04:24+020015162
                  raw_metrics2013-12-09 07:00:00+020015162
                  one_hour_metrics2013-12-09 06:00:00+020015162
                  one_hour_metrics2013-12-09 00:00:00+020015162
                  six_hour_metrics2013-12-09 00:00:00+020015162
                  six_hour_metrics2013-12-08 02:00:00+020015162
                  six_hour_metrics2013-12-08 00:00:00+020015162

                   

                  The installation was upgraded to 4.9 on 2013-12-09, which explains the data on that day and perhaps the previous day (migrated from Oracle?) but other than that it looks odd. Anyway, there are no 1H, 6H or 24H metrics for 2014-05-06 so that looks like a clear inconsistency.

                   

                  Since the only thing in the log that is saying anything about purging data on May 19th is event data purge, I experimentally increased the retention time for events from 14 days to 31 days to see if that would have any side effect. (Edit two days later: no it didn't make a difference regarding the disappearing metrics, which was expected)

                  • 6. Re: Re: Metric data disappears from Cassandra after two weeks
                    esatikka

                    It might be that whatever bug I was suffering from has eventually been fixed in 4.11, after all. Looking again at the Cassandra data for the same sched ID as in May, I now have full 31 days of 6 hour data and also continuous 24 hour data from now to 22nd May (before that there is a large gap until 9th Dec 2013 (when I upgraded to 4.9)).