3 Replies Latest reply on Jul 31, 2012 12:01 PM by jayshaughnessy

    RHQ 4.2 agent stops collecting measurements; 100% CPU

    genman

      I have an agent running that has stopped gathering metrics. It stopped at around the time of the log excerpt below.

       

      There are at most about 200 metrics to gather from this host, yet the log reports on the order of 100,000+ being collected:

       

      2012-07-28 00:54:03,898 INFO  [MeasurementManager.sender-1] (rhq.core.pc.measurement.MeasurementSenderRunner)- Measurement collection for [12] metrics took 2ms - sending report to Server...
      2012-07-28 00:55:03,898 INFO  [MeasurementManager.sender-1] (rhq.core.pc.measurement.MeasurementSenderRunner)- Measurement collection for [11] metrics took 2ms - sending report to Server...
      2012-07-28 00:56:03,931 INFO  [MeasurementManager.sender-1] (rhq.core.pc.measurement.MeasurementSenderRunner)- Measurement collection for [131374] metrics took 20900ms - sending report to Server...
      2012-07-28 00:56:06,990 INFO  [InventoryManager.availability-1] (rhq.core.pc.inventory.InventoryManager)- Sending availability report to Server...
      2012-07-28 00:56:34,206 INFO  [MeasurementManager.sender-1] (rhq.core.pc.measurement.MeasurementSenderRunner)- Measurement collection for [64475] metrics took 9278ms - sending report to Server...
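
      To pin down exactly when the count first jumps, something like the following could scan the agent log. This is only a minimal sketch: it assumes the log lines look like the excerpt above, and the function name find_spikes and the default threshold of 200 (my rough maximum for this host) are my own choices, not anything provided by RHQ.

```python
import re

# Matches the MeasurementSenderRunner report lines as they appear in the excerpt above.
LINE_RE = re.compile(
    r"(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3}).*"
    r"Measurement collection for \[(?P<count>\d+)\] metrics took (?P<ms>\d+)ms"
)


def find_spikes(lines, max_expected=200):
    """Return (timestamp, metric_count, duration_ms) for every report
    whose metric count exceeds max_expected (a hypothetical threshold)."""
    spikes = []
    for line in lines:
        m = LINE_RE.search(line)
        if m and int(m.group("count")) > max_expected:
            spikes.append((m.group("ts"), int(m.group("count")), int(m.group("ms"))))
    return spikes


# Usage (the file name is an assumption; point it at the agent's log):
# with open("agent.log") as f:
#     for ts, count, ms in find_spikes(f):
#         print(ts, count, "metrics in", ms, "ms")
```

      The first timestamp it reports should mark where the agent went wrong, which might help correlate the problem with whatever else was happening at that moment.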
      

       

      I do have one measurement that takes about 5-10 seconds to gather (due to I/O), so I wonder if that's causing a bug to appear or not.

       

      The only fix seems to be to restart the agent with "--purgedata".

       

      I also see huge CPU usage:

       

       PID USER  PR  NI  VIRT  RES  SHR S  %CPU %MEM     TIME+ COMMAND
      1645 rhq   19   0 1978m 327m  11m S 107.5  0.5   4566:35 java
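
      One way to find out which Java thread is burning the CPU would be to list per-thread usage with "top -H -p 1645", take a thread dump with "jstack 1645", and match the busy thread's id against the dump. The sketch below does the matching. It assumes the dump uses the HotSpot header format, where each thread line carries its native thread id as hex in "nid=0x...". The helper name thread_name_for_tid is made up for illustration.

```python
import re

# Assumed HotSpot thread-dump header: "<name>" ... nid=0x<hex native thread id> ...
NID_RE = re.compile(r'^"(?P<name>[^"]+)".*\bnid=0x(?P<nid>[0-9a-fA-F]+)')


def thread_name_for_tid(dump_text, tid):
    """Return the name of the Java thread whose native id equals the
    decimal thread id shown by top -H, or None if no header matches."""
    for line in dump_text.splitlines():
        m = NID_RE.match(line)
        if m and int(m.group("nid"), 16) == tid:
            return m.group("name")
    return None


# Usage (file name and thread id are placeholders):
# with open("threaddump.txt") as f:
#     print(thread_name_for_tid(f.read(), 1646))
```

      If the hot thread turns out to be one of the measurement threads, that would at least narrow things down to the measurement code rather than a plugin.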

       

      Is the agent getting itself brain damaged, perhaps from a bad plugin? Or a race condition?

       

      I'm guessing there is some sort of data structure race.

       

      This seems to take a few days to happen, so it's hard to reproduce.