11 Replies Latest reply on Jan 28, 2014 2:10 PM by genman

    Baseline calculations, trendsup metrics

    genman


      I'm using RHQ 4.5.1, and maybe this is different in 4.8 (Cassandra).

       

      I noticed baselines are calculated for some metrics for some resources, but not all. (These are the low/high bands etc.) Why is this? Is there some triggering mechanism for baselines to appear?

       

      Also, does RHQ support baseline calculations for metrics that trend up ('per minute')? I seem to recall a message saying no.

       

      FWIW, it would be ideal if RHQ could support per-hour group trend lines: that is, create an alert off a group of metrics by looking at the trend line from one week ago and seeing how far off the current value is for the given time. But I know RHQ supports neither group metric alerts nor this sort of trend analysis. Off the top of my head, the server could simply look back a week at a metric, store that metric's datapoints spanning 24 hours * N days in memory (just a few K of memory), then do a recalculation after midnight. It actually seems pretty straightforward to do this sort of thing on the client side, using the REST interface...
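A minimal sketch of the client-side idea above, assuming the datapoints have already been fetched (for example over the REST interface) as `(timestamp_ms, value)` pairs. The function names and the hour-of-day bucketing in UTC are illustrative assumptions, not anything RHQ provides.

```python
from collections import defaultdict
from statistics import mean

HOUR_MS = 3_600_000  # milliseconds per hour


def hour_slot(ts_ms):
    """Hour of the day (0-23, UTC) for a millisecond timestamp."""
    return (ts_ms // HOUR_MS) % 24


def hourly_profile(datapoints):
    """Average N days of (timestamp_ms, value) pairs into one value per hour of day.

    Datapoints from several resources can be pooled into one list to get a
    group-level trend line (hypothetical usage, not an RHQ feature).
    """
    buckets = defaultdict(list)
    for ts_ms, value in datapoints:
        buckets[hour_slot(ts_ms)].append(value)
    return {slot: mean(values) for slot, values in buckets.items()}


def deviation(profile, ts_ms, value):
    """Relative distance of a new value from the trend line for the same hour.

    Returns None when the profile has no usable data for that hour.
    """
    expected = profile.get(hour_slot(ts_ms))
    if expected is None or expected == 0:
        return None
    return (value - expected) / expected
```

The profile would be rebuilt once a day (after midnight, as suggested above), and an alert could fire when `abs(deviation(...))` exceeds a chosen threshold.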

        • 1. Re: Baseline calculations, trendsup metrics
          tsegismont

          Hi Elias,

           

          Did your resources have more than seven days of metric data? See https://docs.jboss.org/author/display/RHQ/FAQ#FAQ-WhendoBaselinesautocalculate%3F

           

          Thomas

          • 2. Re: Baseline calculations, trendsup metrics
            genman

            Yes, I do have more than a week's worth of data.

             

            I wonder if there is a limitation tied to the number of metrics I have, like RHQ can't recalculate more than 1000 or so at a time and never gets to all the metrics. Which metrics have a baseline and which don't feels very sporadic.

            • 3. Re: Baseline calculations, trendsup metrics
              pilhuhn

              There has been an implementation change with Cassandra, and there is also the bug you ran into ( https://bugzilla.redhat.com/show_bug.cgi?id=993513 ).

              We are looking at that bug within the next few days. If you have more datapoints to share (e.g. the server log for the hourly job), then please attach them to the BZ.

              • 4. Re: Baseline calculations, trendsup metrics
                nstefan

                BZ 993513 should now be fixed. Prior to this fix, the baseline code was creating database entries with NaN when there was no actual data collected or aggregated for a metric schedule. As John notes in the BZ, there are two issues with this. Entries were created for disabled metric schedules; and if a metric schedule was enabled or got aggregate data, then a new baseline would not be calculated until the empty baseline expired.

                 

                The fix prevents creating empty baselines until at least one 1-hour aggregate exists for a metric schedule, so no more NaN entries are sent to the database. The main challenge with this fix was to avoid timeouts due to transaction boundaries. The old code was doing this in a brute-force fashion by slicing the data and creating those phony baselines. The new code should be somewhat faster if the baseline calculations do not all expire at the same time.
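The rule described above (no baseline row until at least one 1-hour aggregate exists) can be sketched as follows. This is an illustration only: the `Aggregate` and `Baseline` types and the min/mean/max formula are assumptions made for the example, not the actual RHQ server code.

```python
import math
from typing import NamedTuple, Optional, Sequence


class Aggregate(NamedTuple):
    """One 1-hour aggregate for a metric schedule (hypothetical shape)."""
    minimum: float
    average: float
    maximum: float


class Baseline(NamedTuple):
    """Low/mean/high band derived from the aggregates (hypothetical shape)."""
    low: float
    mean: float
    high: float


def compute_baseline(hourly: Sequence[Aggregate]) -> Optional[Baseline]:
    """Return a baseline, or None when there is nothing to base it on.

    Returning None instead of a NaN-filled baseline means no empty row is
    stored, so the schedule is picked up again on the next run rather than
    waiting for a phony baseline to expire.
    """
    usable = [a for a in hourly if not math.isnan(a.average)]
    if not usable:
        return None
    return Baseline(
        low=min(a.minimum for a in usable),
        mean=sum(a.average for a in usable) / len(usable),
        high=max(a.maximum for a in usable),
    )
```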

                • 5. Re: Re: Baseline calculations, trendsup metrics
                  genman

                  One thing I noticed is that you still need a pretty hefty database backend if you have a ton of metrics.

                   

                  I've been testing 4.9 with a three-node Cassandra cluster and about 1,000 metrics per second. After a few days, the server eventually runs out of database connections. The error manifests as something like:

                   

                  16:17:59,287 ERROR [org.jboss.as.ejb3.invocation] (RHQScheduler_Worker-1) JBAS014134: EJB Invocation failed on component MeasurementOOBManagerBean for method public abstract int org.rhq.enterprise.server.measurement.MeasurementOOBManagerLocal.calculateOOB(org.rhq.server.metrics.domain.AggregateNumericMetric): javax.ejb.EJBException: javax.persistence.PersistenceException: org.hibernate.exception.GenericJDBCException: Could not open connection
                          at org.jboss.as.ejb3.tx.CMTTxInterceptor.handleExceptionInOurTx(CMTTxInterceptor.java:165) [jboss-as-ejb3-7.2.0.Alpha1-redhat-4.jar:7.2.0.Alpha1-redhat-4]
                          at org.jboss.as.ejb3.tx.CMTTxInterceptor.invokeInOurTx(CMTTxInterceptor.java:250) [jboss-as-ejb3-7.2.0.Alpha1-redhat-4.jar:7.2.0.Alpha1-redhat-4]
                          at org.jboss.as.ejb3.tx.CMTTxInterceptor.requiresNew(CMTTxInterceptor.java:344) [jboss-as-ejb3-7.2.0.Alpha1-redhat-4.jar:7.2.0.Alpha1-redhat-4]
                          at org.jboss.as.ejb3.tx.CMTTxInterceptor.processInvocation(CMTTxInterceptor.java:216) [jboss-as-ejb3-7.2.0.Alpha1-redhat-4.jar:7.2.0.Alpha1-redhat-4]
                  


                  It seems that almost every database session in Oracle is executing things like:

                  SELECT
                      s.id
                  FROM
                      rhq_measurement_sched s
                  INNER JOIN rhq_measurement_def d
                  ON
                      s.definition = d.id
                  LEFT JOIN rhq_measurement_bline b
                  ON
                      s.id = b.schedule_id
                  WHERE
                      s.enabled = 1
                  AND b.schedule_id IS NULL
                  AND d.numeric_type = 0
                  

                   

                  Since it's just doing a select it seems like either there's a missing index or locking is an issue.
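For readers unfamiliar with the pattern, the query above is an anti-join: it returns enabled schedules of `numeric_type = 0` that have no baseline row yet. Below is a small sketch that runs the same query against an in-memory SQLite database. The schema is an assumption reduced to only the columns the query names, and it says nothing about how Oracle executes or locks the real tables.

```python
import sqlite3

# The query from the post, unchanged apart from layout.
QUERY = """
SELECT s.id
FROM rhq_measurement_sched s
INNER JOIN rhq_measurement_def d ON s.definition = d.id
LEFT JOIN rhq_measurement_bline b ON s.id = b.schedule_id
WHERE s.enabled = 1
  AND b.schedule_id IS NULL
  AND d.numeric_type = 0
"""


def build_demo_db():
    """Create a toy database with simplified, assumed table definitions."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE rhq_measurement_def (id INTEGER PRIMARY KEY, numeric_type INTEGER);
        CREATE TABLE rhq_measurement_sched (id INTEGER PRIMARY KEY, definition INTEGER, enabled INTEGER);
        CREATE TABLE rhq_measurement_bline (schedule_id INTEGER);
        INSERT INTO rhq_measurement_def VALUES (1, 0), (2, 1);
        -- schedule 1 has a baseline, 2 has none, 3 is disabled, 4 has another numeric type
        INSERT INTO rhq_measurement_sched VALUES (1, 1, 1), (2, 1, 1), (3, 1, 0), (4, 2, 1);
        INSERT INTO rhq_measurement_bline VALUES (1);
    """)
    return conn


def schedules_missing_baseline(conn):
    """Return the ids of enabled schedules that still lack a baseline row."""
    return [row[0] for row in conn.execute(QUERY)]
```

In this toy data only schedule 2 qualifies: schedule 1 already has a baseline, 3 is disabled, and 4 belongs to a definition with a different `numeric_type`.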

                   

                  Is there a way to turn this off, out of curiosity? OOB isn't a feature I use or need. I do see that it's part of the DataPurgeJob; would it be safe to turn this off?

                  • 6. Re: Re: Baseline calculations, trendsup metrics
                    john.sanda

                    You ran into https://bugzilla.redhat.com/show_bug.cgi?id=1009640.

                     

                    I will double-check, but I think that OOBs are primarily used in the suspect metrics report, so I think you would be OK to turn it off. With that said, I would want to test that out first. This could be a good feature request as well.

                    • 7. Re: Baseline calculations, trendsup metrics
                      genman

                      Is there a way to fix this problem, say with a connection leak detection setting in JBoss, or do I need a patch?

                       

                      Looking at the list of scheduled jobs in src/main/java/org/rhq/enterprise/server/core/StartupBean.java, it seems like it would make better sense to re-write those all as server plugins. IMHO a bit of flexibility to re-schedule/configure/upgrade/patch these jobs separately would be helpful. Although it may end up being more coding/abstraction than benefit.

                       

                      I will write a Bugzilla RFE but who knows?

                      • 8. Re: Baseline calculations, trendsup metrics
                        john.sanda

                        I think the safest thing to do would be to patch your server.

                         

                        I think migrating some of the Quartz jobs to server plugins is an interesting idea. As you pointed out, it adds some flexibility, which can help. It would introduce some more modularity on the server, something of which I am strongly in favor. We would probably need to look at each job on a case-by-case basis to see what makes sense.

                        • 9. Re: Baseline calculations, trendsup metrics
                          genman

                          Hi,

                           

                          I filed this in regards to doing the migration: https://bugzilla.redhat.com/show_bug.cgi?id=1010418

                           

                          By the way, one of the purge jobs I came up with was to remove the resource errors. It seems like these can accumulate, mostly from things like availability timeouts. (Getting resource avail timeouts was solved by fixing the plugin container, as in BZ 971556.)

                           

                          For the purge, I wrote my own plugin rather than patch the server process. It sure also makes it easier to debug and test.

                           

                          Rather than patch my 4.9, I was thinking of waiting for 4.10, since I have run into multiple issues in 4.9 already. Is the timeline like 6-8 weeks or more for a 4.10 release? I know you can't say but you can read the leaves better than I.

                          • 10. Re: Baseline calculations, trendsup metrics
                            bech

                            I've just upgraded from 4.4.0 to 4.9.0 on our test environment, and after a couple of days with no problems I unfortunately also upgraded the live operation environment.

                             

                            I have now seen the exact same problem as described in https://bugzilla.redhat.com/show_bug.cgi?id=1009640 where RHQ is not working because it is unable to get database connections.

                             

                            This means that I now have two RHQ installations I have to restart every couple of days.

                             

                            I really hope the auto-update functionality of clients will work from 4.9.0 to 4.10.0 (it did not from 4.4.0 to 4.9.0); otherwise I have wasted a couple of days on manual client updates.

                            • 11. Re: Re: Baseline calculations, trendsup metrics
                              genman

                              I would recommend you patch your server. If you can use 'git' and get the build to mostly work, you can create a fork off of the 4.9.0 branch, then backport the commit (fix for BZ 1009640) to the branch and build that class file. Then you simply copy the fixed class file to your deployment. Obviously this is not ideal but better than manually doing a restart.

                               

                              If you don't want to do that, I've attached the patched class file you can use...