-
1. Re: Metric data disappears from Cassandra after two weeks
genman May 27, 2014 3:36 PM (in response to esatikka)The loss of data wasn't intentional... Data should be getting kept for around 6 months or so.
After two weeks, the data goes from raw_data to the 1h table.
Can you post the logs for org.rhq.enterprise.server.scheduler.jobs.DataPurgeJob (grep for this). You should be seeing something going on here with measurement data compression.
I would consider upgrading to 4.11 (taking note of the possible bugs) as well, as there are performance fixes for compression.
-
2. Re: Re: Metric data disappears from Cassandra after two weeks
esatikka May 28, 2014 1:53 AM (in response to genman)Sorry I was not clear enough on it, but I am running 4.11 at the moment (I was kinda hoping the issue I first noticed in 4.9 to have been fixed there).
Here's the logs on the first data purge job today:
00:00:00,025 INFO [org.rhq.enterprise.server.scheduler.jobs.DataPurgeJob] (RHQScheduler_Worker-3) Data Purge Job STARTING
00:00:00,041 INFO [org.rhq.enterprise.server.scheduler.jobs.DataPurgeJob] (RHQScheduler_Worker-3) Measurement data compression starting at Wed May 28 00:00:00 EEST 2014
00:00:35,651 INFO [org.rhq.enterprise.server.scheduler.jobs.DataPurgeJob] (RHQScheduler_Worker-3) Measurement data compression completed in [35610]ms
00:00:35,651 INFO [org.rhq.enterprise.server.scheduler.jobs.DataPurgeJob] (RHQScheduler_Worker-3) Measurement calltime data purge starting at Wed May 28 00:00:35 EEST 2014
00:00:35,651 INFO [org.rhq.enterprise.server.scheduler.jobs.DataPurgeJob] (RHQScheduler_Worker-3) Purging calltime data that is older than Sun Apr 27 00:00:35 EEST 2014
00:00:35,682 INFO [org.rhq.enterprise.server.scheduler.jobs.DataPurgeJob] (RHQScheduler_Worker-3) Calltime purged [0] - completed in [31]ms
00:00:35,682 INFO [org.rhq.enterprise.server.scheduler.jobs.DataPurgeJob] (RHQScheduler_Worker-3) Event data purge starting at Wed May 28 00:00:35 EEST 2014
00:00:35,682 INFO [org.rhq.enterprise.server.scheduler.jobs.DataPurgeJob] (RHQScheduler_Worker-3) Purging event data older than Wed May 14 00:00:35 EEST 2014
00:00:37,151 INFO [org.rhq.enterprise.server.scheduler.jobs.DataPurgeJob] (RHQScheduler_Worker-3) Event data purged [1570] - completed in [1469]ms
00:00:37,151 INFO [org.rhq.enterprise.server.scheduler.jobs.DataPurgeJob] (RHQScheduler_Worker-3) Alert data purge starting at Wed May 28 00:00:37 EEST 2014
00:00:37,151 INFO [org.rhq.enterprise.server.scheduler.jobs.DataPurgeJob] (RHQScheduler_Worker-3) Purging alert data older than Sun Apr 27 00:00:37 EEST 2014
00:00:37,166 INFO [org.rhq.enterprise.server.scheduler.jobs.DataPurgeJob] (RHQScheduler_Worker-3) Alert data purged [0] - completed in [15]ms
00:00:37,166 INFO [org.rhq.enterprise.server.scheduler.jobs.DataPurgeJob] (RHQScheduler_Worker-3) Alert definition unused purge starting at Wed May 28 00:00:37 EEST 2014
00:00:37,182 INFO [org.rhq.enterprise.server.scheduler.jobs.DataPurgeJob] (RHQScheduler_Worker-3) Alert definitions purged [0] - completed in [16]ms
00:00:37,182 INFO [org.rhq.enterprise.server.scheduler.jobs.DataPurgeJob] (RHQScheduler_Worker-3) Alert condition orphan purge starting at Wed May 28 00:00:37 EEST 2014
00:00:37,182 INFO [org.rhq.enterprise.server.scheduler.jobs.DataPurgeJob] (RHQScheduler_Worker-3) Purged [0] orphan alert conditions - completed in [0]ms
00:00:37,182 INFO [org.rhq.enterprise.server.scheduler.jobs.DataPurgeJob] (RHQScheduler_Worker-3) Alert notification orphan purge starting at Wed May 28 00:00:37 EEST 2014
00:00:37,198 INFO [org.rhq.enterprise.server.scheduler.jobs.DataPurgeJob] (RHQScheduler_Worker-3) Purged [0] orphan alert notifications - completed in [16]ms
00:00:37,198 INFO [org.rhq.enterprise.server.scheduler.jobs.DataPurgeJob] (RHQScheduler_Worker-3) Trait data purge starting at Wed May 28 00:00:37 EEST 2014
00:00:37,198 INFO [org.rhq.enterprise.server.scheduler.jobs.DataPurgeJob] (RHQScheduler_Worker-3) Purging traits that are older than Tue May 28 00:00:37 EEST 2013
00:00:37,198 INFO [org.rhq.enterprise.server.scheduler.jobs.DataPurgeJob] (RHQScheduler_Worker-3) Traits data purged [0] - completed in [0]ms
00:00:37,198 INFO [org.rhq.enterprise.server.scheduler.jobs.DataPurgeJob] (RHQScheduler_Worker-3) Availability data purge starting at Wed May 28 00:00:37 EEST 2014
00:00:37,198 INFO [org.rhq.enterprise.server.scheduler.jobs.DataPurgeJob] (RHQScheduler_Worker-3) Purging availablities that are older than Tue May 28 00:00:37 EEST 2013
00:00:37,213 INFO [org.rhq.enterprise.server.scheduler.jobs.DataPurgeJob] (RHQScheduler_Worker-3) Availability data purged [0] - completed in [15]ms
00:00:37,213 INFO [org.rhq.enterprise.server.scheduler.jobs.DataPurgeJob] (RHQScheduler_Worker-3) Drift file orphan purge starting at Wed May 28 00:00:37 EEST 2014
00:00:37,213 INFO [org.rhq.enterprise.server.scheduler.jobs.DataPurgeJob] (RHQScheduler_Worker-3) Purging orphaned drift files older than Sun Apr 27 00:00:37 EEST 2014
00:00:37,245 INFO [org.rhq.enterprise.server.scheduler.jobs.DataPurgeJob] (RHQScheduler_Worker-3) Purged [0] orphaned drift files - completed in [32]ms
00:00:37,245 INFO [org.rhq.enterprise.server.scheduler.jobs.DataPurgeJob] (RHQScheduler_Worker-3) Database maintenance starting at Wed May 28 00:00:37 EEST 2014
00:00:37,245 INFO [org.rhq.enterprise.server.scheduler.jobs.DataPurgeJob] (RHQScheduler_Worker-3) Performing daily database maintenance
00:00:37,260 INFO [org.rhq.enterprise.server.scheduler.jobs.DataPurgeJob] (RHQScheduler_Worker-3) Re-indexing data tables
00:00:58,182 INFO [org.rhq.enterprise.server.scheduler.jobs.DataPurgeJob] (RHQScheduler_Worker-3) Database maintenance completed in [20937]ms
00:00:58,198 INFO [org.rhq.enterprise.server.scheduler.jobs.DataPurgeJob] (RHQScheduler_Worker-3) Auto-calculation of baselines starting at Wed May 28 00:00:58 EEST 2014
00:01:04,557 INFO [org.rhq.enterprise.server.scheduler.jobs.DataPurgeJob] (RHQScheduler_Worker-3) Auto-calculation of baselines completed in [6359]ms
00:01:04,557 INFO [org.rhq.enterprise.server.scheduler.jobs.DataPurgeJob] (RHQScheduler_Worker-3) Auto-calculation of OOBs starting
00:01:08,901 INFO [org.rhq.enterprise.server.scheduler.jobs.DataPurgeJob] (RHQScheduler_Worker-3) Auto-calculation of OOBs completed in [4344]ms
00:01:08,901 INFO [org.rhq.enterprise.server.scheduler.jobs.DataPurgeJob] (RHQScheduler_Worker-3) Data Purge Job FINISHED [68876]ms
What to me seems to not be explicitly stated neither in the log or in the config is purging of measurement data (other than calltime or trait data). Is the retention period hardcoded or based on some other setting?
-
3. Re: Metric data disappears from Cassandra after two weeks
genman May 28, 2014 1:11 PM (in response to esatikka)The data retention is hard coded to raw being 7 days, one hour 14, six hour 31, and twenty four 365. So I'm guessing your six hour data isn't getting inserted?
Check the date/time on your Cassandra host matches your server. Check the content of tables makes sense. You can also enable DEBUG to see how much data is going into the six hour side.
-
4. Re: Re: Metric data disappears from Cassandra after two weeks
esatikka May 30, 2014 6:33 AM (in response to genman)There are six hour metrics, too, but also end at -2 weeks from now... The table contents look sensible to me but then again I have no way of telling if my tables differ from what they should be.
The server time is synced from network and looks just fine (also haven't noticed significant agent-server time differences being reported). I have now enabled DEBUG logging, just have to wait and see until next data purge that touches the aggregated metrics... I'll get back when I have some results.
-
5. Re: Re: Metric data disappears from Cassandra after two weeks
esatikka Jun 4, 2014 3:02 AM (in response to genman)I attached more detailed log on the data purge job last night. From what I can see, there are no explicit errors (except for OOBs which are not being used). I can't spot anything that would indicate to me that 6H metrics are being cut at -2 weeks.
I also did some reading of the project source code but uh, it would require quite a much more time to get somewhere with it than I have at my disposal at the moment. However, I did run into the metrics_index table in the source and did some queries there. I'm not completely sure what is the role of that index, but the data looks odd to me. Let's take an example schedule ID 15162:
bucket time schedule_id one_hour_metrics 2014-06-02 09:00:00+0300 15162 six_hour_metrics 2014-06-02 06:00:00+0300 15162 twenty_four_hour_metrics 2014-06-02 00:00:00+0300 15162 one_hour_metrics 2014-05-06 23:00:00+0300 15162 one_hour_metrics 2014-05-06 22:00:00+0300 15162 one_hour_metrics 2014-05-06 21:00:00+0300 15162 one_hour_metrics 2014-05-06 20:00:00+0300 15162 six_hour_metrics 2014-05-06 18:00:00+0300 15162 twenty_four_hour_metrics 2014-05-06 00:00:00+0300 15162 six_hour_metrics 2014-03-30 00:00:00+0200 15162 twenty_four_hour_metrics 2014-03-29 00:00:00+0200 15162 raw_metrics 2013-12-09 09:00:00+0200 15162 raw_metrics 2013-12-09 08:00:00+0200 15162 raw_metrics 2013-12-09 07:04:24+0200 15162 raw_metrics 2013-12-09 07:00:00+0200 15162 one_hour_metrics 2013-12-09 06:00:00+0200 15162 one_hour_metrics 2013-12-09 00:00:00+0200 15162 six_hour_metrics 2013-12-09 00:00:00+0200 15162 six_hour_metrics 2013-12-08 02:00:00+0200 15162 six_hour_metrics 2013-12-08 00:00:00+0200 15162 The installation was upgraded to 4.9 on 2013-12-09, which explains the data on that day and perhaps the previous day (migrated from Oracle?) but other than that it looks odd. Anyway, there are no 1H, 6H or 24H metrics for 2014-05-06 so that looks like a clear inconsistency.
Since the only thing in the log that is saying anything about purging data on May 19th is event data purge, I experimentally increased the retention time for events from 14 days to 31 days to see if that would have any side effect. (Edit two days later: no it didn't make a difference regarding the disappearing metrics, which was expected)
-
6. Re: Re: Metric data disappears from Cassandra after two weeks
esatikka Jul 17, 2014 3:58 AM (in response to esatikka)It might be that whatever bug I was suffering from has eventually been fixed in 4.11, after all. Looking again at the Cassandra data for the same sched ID as in May, I now have full 31 days of 6 hour data and also continuous 24 hour data from now to 22nd May (before that there is a large gap until 9th Dec 2013 (when I upgraded to 4.9)).