2 Replies Latest reply on Apr 26, 2013 12:16 PM by pathduck

Checking for Agent-Server clock difference

pathduck Apr 26, 2013 4:41 AM

Hi,

Recently we've had a few situations where the Linux ntpd has fallen down, causing the clock difference between server and agent to increase and RHQ to set the resources to state "Unknown". The agent log starts getting errors like "The server and agent clocks are not in sync."

To monitor this better, I am trying to create an alert on the Agent-Server Clock Difference metric. The alert should fire when the difference is larger than 30000 ms (30 s, for testing). But after turning off ntpd, setting the time 5 minutes ahead, and watching the metric increase, the alert still does not fire.

So I have a few questions:

- How is it possible to create an alert for this, and what am I doing wrong? For instance, do I need to set the alert to both "larger than" 30s and "smaller than" -30s to catch negative differences?

- What is the maximum difference allowed before the RHQ agent starts logging that the difference is large? At what point would RHQ set the resources to Unknown?

- Can this value be tuned somewhere?

regards,

Stian

1. Re: Checking for Agent-Server clock difference

mazz Apr 26, 2013 9:14 AM (in response to pathduck)

The error message you see in the logs is merely there to let you know the times are skewed beyond 30s. Just because you see the log message doesn't mean resources are immediately going to be marked as down (the code that logs this has nothing to do with determining resource availability). It logs the message if the time skew is greater than 30s in either direction (server is behind agent or agent is behind server). This value cannot be tuned today - we always log if the difference exceeds 30s. There is nothing special about this hardcoded value; it just seemed an appropriate number that, if the time skew was larger than it, we'd better log something.
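As a rough illustration of the "either direction" check described above, the comparison is on the magnitude of the difference rather than its sign. This is only a sketch and is not taken from the RHQ source; the function name `should_log_skew` and the constant name are made up for the example.

```python
# Hypothetical sketch, not RHQ's actual code.
SKEW_LOG_THRESHOLD_MS = 30_000  # the hardcoded 30s mentioned above


def should_log_skew(server_time_ms: int, agent_time_ms: int) -> bool:
    """Return True when the clocks differ by more than 30s in either direction."""
    difference_ms = server_time_ms - agent_time_ms
    # abs() makes the check symmetric: it does not matter which side is behind.
    return abs(difference_ms) > SKEW_LOG_THRESHOLD_MS
```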

You should be able to alert on this - that's why the metric was created, so someone COULD do what you want: fire an RHQ alert if the time skew is too large. However, I'm thinking that the problems time skew causes may also cause alerts to misfire or malfunction, especially if you use dampening rules that rely on time data to determine whether an alert should fire - just a guess. What was your alert definition? I would create a very basic, minimal alert definition and see what happens (just an alert definition that checks if the time diff is larger than 30s, with no dampening). You could then either add a second alert def for when the time diff is smaller than -30s, or add a second condition to the same alert definition with the "ANY" setting, so your alert def would essentially fire if either time skew > 30s or time skew < -30s.
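To make the two-condition "ANY" idea concrete, here is a minimal sketch of how such a definition would evaluate a signed metric value. It is not RHQ's alert engine; `CONDITIONS` and `alert_fires` are illustrative names only.

```python
# Hypothetical sketch of an "ANY" alert definition with two threshold conditions.
import operator

CONDITIONS = [
    (operator.gt, 30.0),   # clock difference > 30 s
    (operator.lt, -30.0),  # clock difference < -30 s
]


def alert_fires(clock_difference_s: float) -> bool:
    """Fire if ANY condition matches the collected metric value."""
    return any(
        compare(clock_difference_s, threshold)
        for compare, threshold in CONDITIONS
    )
```

A value of -300.0 fails the first condition but matches the second, which is why a single "larger than 30 s" condition would not catch a skew in the negative direction.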

One caveat: this time skew metric is the only one I know of that can be negative, so I wonder if it's possible there is a bug in here somewhere in the handling of negative metric data. Something to keep in mind.
2. Re: Checking for Agent-Server clock difference

pathduck Apr 26, 2013 12:16 PM (in response to mazz)

Hi John,
the alert rule is simple:

Metric Value Threshold, Agent-Server Clock Difference > 30.0 s. No recovery, no dampening.

I also saw that it was negative, but I figured "larger than" would account for negative numbers too; I'm not sure how this logic works. I also tried setting "larger than 30 s" + "smaller than -30 s", but that didn't work either. What does "smaller than" mean in the context of a negative number?

I think you are right that if there is a large time skew, alerts will simply not fire. But I saw the metric increase to 5 minutes in the data collected, which would indicate that the server was aware of the data, right? Is it the Agent that determines if an alert is fired?

Another question: what would cause the server to set resources to "Unknown" when the skew is high, and at what point or limit does that occur? For instance, I saw six of seven JBoss servers in state Unknown and restarted the agent. Then the one that was Up went to state Unknown, and the rest went to state Up...