The error message you see in the logs is just there to let you know the times are getting skewed beyond 30s - seeing the log message doesn't mean resources are immediately going to be marked as down (the code that logs this has nothing to do with determining resource availability). It logs the message if the time skew is greater than 30s in either direction (server is behind agent or agent is behind server). Today this value cannot be tuned - we always log if the difference is greater than 30s. There is nothing special about this hardcoded value; it just seemed like an appropriate threshold beyond which we'd better log something.
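To make the "either direction" check concrete, here is a rough sketch in Python. It is not the actual RHQ code: the function name, the constant name and the sign convention (server time minus agent time) are all assumptions made for illustration.

```python
# Hypothetical sketch only; this is not the RHQ implementation.
SKEW_LOG_THRESHOLD_S = 30.0  # hardcoded threshold described above, not tunable


def should_log_skew(server_time_s: float, agent_time_s: float) -> bool:
    """Return True when the two clocks differ by more than the threshold,
    regardless of which side is behind."""
    # Sign convention is an assumption: positive means the agent is behind.
    skew_s = server_time_s - agent_time_s
    return abs(skew_s) > SKEW_LOG_THRESHOLD_S
```

Note that a check like this only decides whether to write a log line; it says nothing about resource availability.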
You should be able to alert on this - that's why the metric was created, so someone COULD do what you want: fire an RHQ alert if the time skew is too large. However, I'm thinking that the problems time skew causes may also cause alerts to misfire or malfunction, especially if you use dampening rules that rely on time data to determine if an alert should fire - just a guess?? What was your alert definition? I would create a very basic and minimal alert definition and see what happens (just have an alert definition that checks if the time diff is larger than 30s - no dampening). You could then either add a second alert def that fires if the time diff is smaller than -30s, or you could add a second condition to the same alert definition with the "ANY" setting (so your alert def would essentially fire if either time skew > 30s or time skew < -30s).
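As an illustration of why the second condition matters, the sketch below compares a single "> 30s" condition with the two-condition "ANY" definition on a few sample values. It only models the comparison logic in plain Python; the function names are hypothetical and it does not reflect how the RHQ server actually evaluates alert conditions.

```python
# Hypothetical model of the two alert definitions; not RHQ code.
def fires_single_condition(skew_s: float) -> bool:
    """One condition: metric value larger than 30 s."""
    return skew_s > 30.0


def fires_any_condition(skew_s: float) -> bool:
    """Two conditions combined with ANY: larger than 30 s, or smaller than -30 s."""
    conditions = [skew_s > 30.0, skew_s < -30.0]
    return any(conditions)


for sample in (45.0, -300.0, 10.0):
    print(sample, fires_single_condition(sample), fires_any_condition(sample))
```

A skew of -300s is not "larger than 30", so it satisfies only the "smaller than -30" condition; "smaller than" here is an ordinary signed comparison, not a comparison of magnitudes.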
One caveat: this time skew metric is the only one that I know of that could be negative - so I wonder if it's possible there is a bug in here somewhere due to the handling of negative metric data? Something to keep in mind.
The alert rule is simple:
Metric Value Threshold, Agent-Server Clock Difference > 30.0 s. No recovery, no dampening.
I saw it was also negative, and I figured "larger than" would also account for negative numbers, but I'm not sure how this logic works. I also tried setting "larger than 30 s" + "smaller than -30 s", but that didn't work either. What does "smaller than" mean in the context of a negative number?
I think you are right that if there is a large time skew, alerts will simply not fire. But I saw the metric increase to 5 minutes in the data collected, which would indicate that the server was aware of the data, right? Is it the Agent that determines if an alert is fired?
Another question: what would cause the server to set resources to "Unknown" when the skew is high, and at what time or limits does it occur? For instance, I saw six of seven JBoss servers in the Unknown state and restarted the agent. Then the one that was Up went to Unknown, and the rest went to Up...