Yes, I do have more than a week's worth of data.
I wonder if there is a limit tied to the number of metrics I have, for example RHQ can't recalculate more than 1000 or so at a time, so it never gets to all the metrics. Which metrics have a baseline and which don't feels very sporadic.
There has been an implementation change with Cassandra, and there is also the bug you ran into ( https://bugzilla.redhat.com/show_bug.cgi?id=993513 ).
We will be looking at that bug within the next few days. If you have more data points to share (e.g. the server log for the hourly job), please attach them to the BZ.
BZ 993513 should now be fixed. Prior to this fix, the baseline code was creating database entries with NaN when there was no actual data collected or aggregated for a metric schedule. As John notes in the BZ, there are two issues with this: entries were created for disabled metric schedules, and if a metric schedule was enabled or got aggregate data, a new baseline would not be calculated until the empty baseline expired.
The fix prevents creating empty baselines until at least one 1-hour aggregate exists for a metric schedule, so no more NaN entries are sent to the database. The main challenge with this fix was to avoid timeouts due to transaction boundaries. The old code was doing this in a brute-force fashion by slicing the data and creating those phony baselines. The new code should be somewhat faster if the baseline calculations do not all expire at the same time.
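To illustrate the idea behind the fix, here is a minimal sketch, not the actual RHQ code. The class BaselineSketch, the Baseline holder and the calculate method are all hypothetical names, and the input is assumed to be a simple map from schedule id to its 1-hour average values. The point is only the guard: a schedule with no 1-hour aggregates is skipped, instead of producing a baseline whose mean works out to NaN.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class BaselineSketch {

    /** Hypothetical value holder; this is not RHQ's baseline entity. */
    public static final class Baseline {
        public final int scheduleId;
        public final double min;
        public final double max;
        public final double mean;

        Baseline(int scheduleId, double min, double max, double mean) {
            this.scheduleId = scheduleId;
            this.min = min;
            this.max = max;
            this.mean = mean;
        }
    }

    /**
     * Computes a baseline per schedule, but only for schedules that have
     * at least one 1-hour aggregate. Schedules without data are skipped.
     * Without the guard, the mean of an empty list would be 0.0 / 0,
     * which is NaN in Java floating-point arithmetic.
     */
    public static List<Baseline> calculate(Map<Integer, List<Double>> oneHourAverages) {
        List<Baseline> result = new ArrayList<>();
        for (Map.Entry<Integer, List<Double>> entry : oneHourAverages.entrySet()) {
            List<Double> values = entry.getValue();
            if (values == null || values.isEmpty()) {
                continue; // no aggregate yet: do not create an empty baseline
            }
            double min = Double.POSITIVE_INFINITY;
            double max = Double.NEGATIVE_INFINITY;
            double sum = 0.0;
            for (double v : values) {
                min = Math.min(min, v);
                max = Math.max(max, v);
                sum += v;
            }
            result.add(new Baseline(entry.getKey(), min, max, sum / values.size()));
        }
        return result;
    }

    public static void main(String[] args) {
        Map<Integer, List<Double>> data = new LinkedHashMap<>();
        data.put(1, Arrays.asList(2.0, 4.0, 6.0));       // has 1-hour data
        data.put(2, Collections.<Double>emptyList());    // no data collected yet
        for (Baseline b : calculate(data)) {
            System.out.println(b.scheduleId + " min=" + b.min + " max=" + b.max + " mean=" + b.mean);
        }
    }
}
```

With the sample data in main, only schedule 1 gets a baseline. Schedule 2 is left alone, so a baseline can be calculated for it as soon as it has data rather than after an empty entry expires.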
One thing I noticed is that you still need a pretty hefty database backend if you have a ton of metrics.
I've been testing 4.9 with a three node Cassandra cluster and about 1,000 metrics per second. After a few days, the server eventually runs out of database connections. The error manifests as something like:
16:17:59,287 ERROR [org.jboss.as.ejb3.invocation] (RHQScheduler_Worker-1) JBAS014134: EJB Invocation failed on component MeasurementOOBManagerBean for method public abstract int org.rhq.enterprise.server.measurement.MeasurementOOBManagerLocal.calculateOOB(org.rhq.server.metrics.domain.AggregateNumericMetric): javax.ejb.EJBException: javax.persistence.PersistenceException: org.hibernate.exception.GenericJDBCException: Could not open connection
    at org.jboss.as.ejb3.tx.CMTTxInterceptor.handleExceptionInOurTx(CMTTxInterceptor.java:165) [jboss-as-ejb3-7.2.0.Alpha1-redhat-4.jar:7.2.0.Alpha1-redhat-4]
    at org.jboss.as.ejb3.tx.CMTTxInterceptor.invokeInOurTx(CMTTxInterceptor.java:250) [jboss-as-ejb3-7.2.0.Alpha1-redhat-4.jar:7.2.0.Alpha1-redhat-4]
    at org.jboss.as.ejb3.tx.CMTTxInterceptor.requiresNew(CMTTxInterceptor.java:344) [jboss-as-ejb3-7.2.0.Alpha1-redhat-4.jar:7.2.0.Alpha1-redhat-4]
    at org.jboss.as.ejb3.tx.CMTTxInterceptor.processInvocation(CMTTxInterceptor.java:216) [jboss-as-ejb3-7.2.0.Alpha1-redhat-4.jar:7.2.0.Alpha1-redhat-4]
It seems that almost every database session in Oracle is executing things like:
SELECT s.id
FROM rhq_measurement_sched s
INNER JOIN rhq_measurement_def d ON s.definition = d.id
LEFT JOIN rhq_measurement_bline b ON s.id = b.schedule_id
WHERE s.enabled = 1 AND b.schedule_id IS NULL AND d.numeric_type = 0
Since it's just doing a select, it seems like either there's a missing index or locking is an issue.
Out of curiosity, is there a way to turn this off? OOB isn't a feature I use or need. I see it is part of the DataPurgeJob; would it be safe to turn it off?
You ran into https://bugzilla.redhat.com/show_bug.cgi?id=1009640.
I will double check, but I think that OOBs are primarily used in the suspect metrics report, so I think you would be ok to turn it off. With that said, I would want to test that out first. This could be a good feature request as well.
Is there a way to fix this problem, say by connection leak detection configuration setting in JBoss, or do I need a patch?
Looking at the list of scheduled jobs in src/main/java/org/rhq/enterprise/server/core/StartupBean.java, it seems like it would make better sense to re-write those all as server plugins. IMHO a bit of flexibility to re-schedule/configure/upgrade/patch these jobs separately would be helpful. Although it may end up being more coding/abstraction than benefit.
I will write a Bugzilla RFE but who knows?
I think the safest thing to do would be to patch your server.
I think migrating some of the Quartz jobs to server plugins is an interesting idea. As you pointed out, it adds some flexibility, which can be helpful. It would also introduce some more modularity on the server, something of which I am strongly in favor. We would probably need to look at each job on a case-by-case basis to see what makes sense.
I filed this in regards to doing the migration: https://bugzilla.redhat.com/show_bug.cgi?id=1010418
By the way, one of the purge jobs I came up with was to remove the resource errors. It seems like these can accumulate, mostly from things like availability timeouts. (Getting resource avail timeouts was solved by fixing the plugin container, as in BZ 971556.)
To do the purge, I wrote my own plugin rather than patching the server process. That sure also makes it easier to debug and test.
Rather than patch my 4.9, I was thinking of waiting for 4.10, since I have run into multiple issues in 4.9 already. Is the timeline like 6-8 weeks or more for a 4.10 release? I know you can't say but you can read the leaves better than I.
I've just upgraded from 4.4.0 to 4.9.0 on our test environment, and after a couple of days with no problems I unfortunately also upgraded the live operation environment.
I have now seen the exact same problem as described in https://bugzilla.redhat.com/show_bug.cgi?id=1009640 where RHQ is not working because it is unable to get database connections.
This means that I now have two RHQ installations I have to restart every couple of days.
I really hope the auto-update functionality of clients will work from 4.9.0 to 4.10.0 (it did not from 4.4.0 to 4.9.0); otherwise I have wasted a couple of days on manual client updates.
I would recommend you patch your server. If you can use 'git' and get the build to mostly work, you can create a fork off of the 4.9.0 branch, then backport the commit (fix for BZ 1009640) to the branch and build that class file. Then you simply copy the fixed class file to your deployment. Obviously this is not ideal but better than manually doing a restart.
If you don't want to do that, I've attached the patched class file you can use...