RHQ 4.2 agent stops collecting measurements; 100% CPU
genman Jul 30, 2012 7:30 PM
I have an agent running that has stopped gathering metrics. It happened around this time.
There are only about 200 metrics at most to gather from this host, but somehow the log shows on the order of 100,000+ supposedly being collected:
2012-07-28 00:54:03,898 INFO [MeasurementManager.sender-1] (rhq.core.pc.measurement.MeasurementSenderRunner)- Measurement collection for [12] metrics took 2ms - sending report to Server...
2012-07-28 00:55:03,898 INFO [MeasurementManager.sender-1] (rhq.core.pc.measurement.MeasurementSenderRunner)- Measurement collection for [11] metrics took 2ms - sending report to Server...
2012-07-28 00:56:03,931 INFO [MeasurementManager.sender-1] (rhq.core.pc.measurement.MeasurementSenderRunner)- Measurement collection for [131374] metrics took 20900ms - sending report to Server...
2012-07-28 00:56:06,990 INFO [InventoryManager.availability-1] (rhq.core.pc.inventory.InventoryManager)- Sending availability report to Server...
2012-07-28 00:56:34,206 INFO [MeasurementManager.sender-1] (rhq.core.pc.measurement.MeasurementSenderRunner)- Measurement collection for [64475] metrics took 9278ms - sending report to Server...
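To find out when the counts first jump, something like the following could scan the agent log for oversized reports. This is only a rough sketch: the function name `find_oversized_reports` and the `max_expected` threshold are my own, the regular expression assumes the exact line format shown above, and the 200 figure is just my estimate for this host.

```python
import re

# Assumes the MeasurementSenderRunner line format shown in the log excerpt above.
LINE_RE = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3}) .*"
    r"Measurement collection for \[(?P<count>\d+)\] metrics took (?P<ms>\d+)ms"
)


def find_oversized_reports(lines, max_expected=200):
    """Return (timestamp, metric_count, duration_ms) for each report
    whose metric count exceeds max_expected (a hypothetical threshold)."""
    hits = []
    for line in lines:
        m = LINE_RE.search(line)
        if m and int(m.group("count")) > max_expected:
            hits.append((m.group("ts"), int(m.group("count")), int(m.group("ms"))))
    return hits


# Lines copied from the excerpt above.
sample = [
    "2012-07-28 00:55:03,898 INFO [MeasurementManager.sender-1] "
    "(rhq.core.pc.measurement.MeasurementSenderRunner)- Measurement collection "
    "for [11] metrics took 2ms - sending report to Server...",
    "2012-07-28 00:56:03,931 INFO [MeasurementManager.sender-1] "
    "(rhq.core.pc.measurement.MeasurementSenderRunner)- Measurement collection "
    "for [131374] metrics took 20900ms - sending report to Server...",
    "2012-07-28 00:56:06,990 INFO [InventoryManager.availability-1] "
    "(rhq.core.pc.inventory.InventoryManager)- Sending availability report to Server...",
    "2012-07-28 00:56:34,206 INFO [MeasurementManager.sender-1] "
    "(rhq.core.pc.measurement.MeasurementSenderRunner)- Measurement collection "
    "for [64475] metrics took 9278ms - sending report to Server...",
]

for ts, count, ms in find_oversized_reports(sample):
    print(ts, count, ms)
```

Passing an open file handle for the agent log instead of `sample` should work the same way, since the function only iterates over lines.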
I do have one measurement that takes about 5-10 seconds to gather (due to I/O), so I wonder if that's causing a bug to appear or not.
The only fix seems to be restarting the agent with "--purgedata".
I also see huge CPU usage:
PID   USER  PR  NI  VIRT   RES   SHR  S  %CPU   %MEM  TIME+    COMMAND
1645  rhq   19  0   1978m  327m  11m  S  107.5  0.5   4566:35  java
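One way to narrow down which thread is burning the CPU would be to list per-thread usage with `top -H -p 1645`, take the busiest thread ID, and look it up in a `jstack` thread dump, where the same ID appears in hex as `nid=0x...`. Below is a minimal sketch of that lookup. The helper name `thread_for_tid` is made up, and the dump lines are hypothetical: the thread names come from the log above, but the `tid`/`nid` values are invented for illustration.

```python
import re


def thread_for_tid(jstack_text, tid):
    """Return the name of the thread whose nid matches the decimal
    thread ID reported by top -H, or None if no thread matches."""
    nid_token = "nid=" + hex(tid)  # jstack prints the native thread ID in hex
    for line in jstack_text.splitlines():
        m = re.match(r'^"(?P<name>[^"]+)"', line)
        # Compare whole tokens so that nid=0x66d does not match nid=0x66d1.
        if m and nid_token in line.split():
            return m.group("name")
    return None


# Hypothetical dump excerpt; the tid/nid values are invented.
SAMPLE_DUMP = '''
"MeasurementManager.sender-1" daemon prio=10 tid=0x00007f0c1c0a1000 nid=0x675 runnable [0x00007f0c0a1f2000]
"InventoryManager.availability-1" daemon prio=10 tid=0x00007f0c1c0b2000 nid=0x676 waiting on condition [0x00007f0c0a0f1000]
'''

print(thread_for_tid(SAMPLE_DUMP, 1653))
```

The stack trace printed under the matching thread in the real dump would then show what that thread is spinning on.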
Is the agent getting itself brain damaged, perhaps from a bad plugin? Or a race condition?
I'm guessing there is some sort of data structure race.
This seems to take a few days to happen, so it's hard to reproduce.