23 Replies Latest reply on May 29, 2013 4:21 AM by tsegismont

    RHQ Availability Check

    spolti

      Hello guys,

       

      For a few days now I have been having the following problem:

       

      Some resources monitored by the RHQ server are showing the following error:

       

       

      org.rhq.core.pc.inventory.TimeoutException: Call to [org.rhq.plugins.apache.ApacheServerComponent.getAvailability()] with args [] timed out after 5000 milliseconds - invocation thread will be interrupted.

      at org.rhq.core.pc.inventory.ResourceContainer$ResourceComponentInvocationHandler.invokeInNewThreadWithLock(ResourceContainer.java:574)

      at org.rhq.core.pc.inventory.ResourceContainer$ResourceComponentInvocationHandler.invoke(ResourceContainer.java:542)

      at $Proxy63.getAvailability(Unknown Source)

      at org.rhq.core.pc.inventory.AvailabilityExecutor.safeGetAvailability(AvailabilityExecutor.java:359)

      at org.rhq.core.pc.inventory.AvailabilityExecutor.checkInventory(AvailabilityExecutor.java:296)

      at org.rhq.core.pc.inventory.AvailabilityExecutor.checkInventory(AvailabilityExecutor.java:352)

      at org.rhq.core.pc.inventory.AvailabilityExecutor.call(AvailabilityExecutor.java:149)

      at org.rhq.core.pc.inventory.AvailabilityExecutor.run(AvailabilityExecutor.java:98)

      at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:441)

      at java.util.concurrent.FutureTask$Sync.innerRunAndReset(FutureTask.java:317)

      at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:150)

      at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$101(ScheduledThreadPoolExecutor.java:98)

      at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.runPeriodic(ScheduledThreadPoolExecutor.java:181)

      at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:205)

      at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)

      at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)

      at java.lang.Thread.run(Thread.java:619)

       

       

       

      Because of this, RHQ sends a lot of alerts, but they are in fact false alerts caused by this issue.

       

       

      Does anyone know how to solve this?

       

       

      Regards.

        • 1. Re: RHQ Availability Check
          tsegismont

          Hi Filippe,

           

          Are there any other messages in the agent log?

           

          If you set the URL property in the Apache resource Connection Settings, can you make sure your agent is able to connect to this URL?

           

          Regards,

          Thomas

          • 2. Re: RHQ Availability Check
            pathduck

            We have this as well, on RHQ 4.6 and 4.5.1. Not for Apache, but for JBoss resources and other types of resources.

             

            Basically any type of resource will show Availability going up and down. Sometimes it is unavailable for 10 minutes, sometimes just 2-3 minutes. For this reason we have to set our alerts to fire only if a resource is Unavailable for > 15 minutes, which is not very good but at least avoids lots of false positives.

             

            I was going to open a thread about this as well soon. We have had this for a long time though, since starting with RHQ 4.4.

             

            EDIT: See attachment, this is how it looks in Monitoring > Availability. Server has not been down these intervals, but RHQ marks it as Unavailable.

             

            RHQ-Unavailable.png

            -Stian

            • 3. Re: RHQ Availability Check
              jgarat

              We are having the same issue.

               

              You could check how long your JBoss GCs take; maybe that is why your JBoss availability response takes more than 5 seconds.

               

              In the agent log I see that, in some cases, the availability check takes 5 seconds, even though in our case the JBoss GC is fine. I think the thread is interrupted after these 5 seconds.

                                       

              2013-03-11 09:40:00,325 INFO  [InventoryManager.availability-1] (rhq.core.pc.inventory.AvailabilityExecutor)- Scan Starting: Mon Mar 11 09:40:00 UYT 2013

              2013-03-11 09:40:05,684 INFO  [InventoryManager.availability-1] (rhq.core.pc.inventory.AvailabilityExecutor)- Scan Ended   : Mon Mar 11 09:40:05 UYT 2013 : Scan [startTime=1363005600325, endTime=1363005605684, runtime=5359, isFull=false, isForced=false, numResources=186, numGetAvailabilityCalls=15, numScheduledRandomly=0, numPushedByInterval=14, numAvailabilityChanges=1, numDeferToParent=0]

              2013-03-11 09:40:40,354 INFO  [InventoryManager.availability-1] (rhq.core.pc.inventory.AvailabilityExecutor)- Scan Starting: Mon Mar 11 09:40:40 UYT 2013

              2013-03-11 09:40:45,360 INFO  [InventoryManager.availability-1] (rhq.core.pc.inventory.AvailabilityExecutor)- Scan Ended   : Mon Mar 11 09:40:45 UYT 2013 : Scan [startTime=1363005640354, endTime=1363005645360, runtime=5006, isFull=false, isForced=false, numResources=186, numGetAvailabilityCalls=5, numScheduledRandomly=0, numPushedByInterval=8, numAvailabilityChanges=143, numDeferToParent=149]

              2013-03-11 09:41:20,620 INFO  [InventoryManager.availability-1] (rhq.core.pc.inventory.AvailabilityExecutor)- Scan Starting: Mon Mar 11 09:41:20 UYT 2013

              2013-03-11 09:41:20,622 INFO  [InventoryManager.availability-1] (rhq.core.pc.inventory.AvailabilityExecutor)- Scan Ended   : Mon Mar 11 09:41:20 UYT 2013 : Scan [startTime=1363005680620, endTime=1363005680622, runtime=2, isFull=false, isForced=false, numResources=186, numGetAvailabilityCalls=2, numScheduledRandomly=0, numPushedByInterval=10, numAvailabilityChanges=0, numDeferToParent=149]

              2013-03-11 09:41:50,628 INFO  [InventoryManager.availability-1] (rhq.core.pc.inventory.AvailabilityExecutor)- Scan Starting: Mon Mar 11 09:41:50 UYT 2013

              2013-03-11 09:41:55,888 INFO  [InventoryManager.availability-1] (rhq.core.pc.inventory.AvailabilityExecutor)- Scan Ended   : Mon Mar 11 09:41:55 UYT 2013 : Scan [startTime=1363005710627, endTime=1363005715888, runtime=5261, isFull=false, isForced=false, numResources=186, numGetAvailabilityCalls=154, numScheduledRandomly=149, numPushedByInterval=4, numAvailabilityChanges=144, numDeferToParent=0]

              2013-03-11 09:42:26,155 INFO  [InventoryManager.availability-1] (rhq.core.pc.inventory.AvailabilityExecutor)- Scan Starting: Mon Mar 11 09:42:26 UYT 2013

              2013-03-11 09:42:31,273 INFO  [InventoryManager.availability-1] (rhq.core.pc.inventory.AvailabilityExecutor)- Scan Ended   : Mon Mar 11 09:42:31 UYT 2013 : Scan [startTime=1363005746155, endTime=1363005751273, runtime=5118, isFull=false, isForced=false, numResources=186, numGetAvailabilityCalls=7, numScheduledRandomly=0, numPushedByInterval=6, numAvailabilityChanges=1, numDeferToParent=0]

              2013-03-11 09:43:01,291 INFO  [InventoryManager.availability-1] (rhq.core.pc.inventory.AvailabilityExecutor)- Scan Starting: Mon Mar 11 09:43:01 UYT 2013

              2013-03-11 09:43:06,297 INFO  [InventoryManager.availability-1] (rhq.core.pc.inventory.AvailabilityExecutor)- Scan Ended   : Mon Mar 11 09:43:06 UYT 2013 : Scan [startTime=1363005781291, endTime=1363005786297, runtime=5006, isFull=false, isForced=false, numResources=186, numGetAvailabilityCalls=6, numScheduledRandomly=0, numPushedByInterval=16, numAvailabilityChanges=143, numDeferToParent=149]

              2013-03-11 09:43:41,565 INFO  [InventoryManager.availability-1] (rhq.core.pc.inventory.AvailabilityExecutor)- Scan Starting: Mon Mar 11 09:43:41 UYT 2013

              2013-03-11 09:43:41,568 INFO  [InventoryManager.availability-1] (rhq.core.pc.inventory.AvailabilityExecutor)- Scan Ended   : Mon Mar 11 09:43:41 UYT 2013 : Scan [startTime=1363005821565, endTime=1363005821568, runtime=3, isFull=false, isForced=false, numResources=186, numGetAvailabilityCalls=6, numScheduledRandomly=0, numPushedByInterval=15, numAvailabilityChanges=0, numDeferToParent=149]

              2013-03-11 09:44:11,573 INFO  [InventoryManager.availability-1] (rhq.core.pc.inventory.AvailabilityExecutor)- Scan Starting: Mon Mar 11 09:44:11 UYT 2013

              2013-03-11 09:44:17,047 INFO  [InventoryManager.availability-1] (rhq.core.pc.inventory.AvailabilityExecutor)- Scan Ended   : Mon Mar 11 09:44:17 UYT 2013 : Scan [startTime=1363005851573, endTime=1363005857047, runtime=5474, isFull=false, isForced=false, numResources=186, numGetAvailabilityCalls=157, numScheduledRandomly=149, numPushedByInterval=7, numAvailabilityChanges=144, numDeferToParent=0]

              2013-03-11 09:44:47,312 INFO  [InventoryManager.availability-1] (rhq.core.pc.inventory.AvailabilityExecutor)- Scan Starting: Mon Mar 11 09:44:47 UYT 2013

              2013-03-11 09:44:47,743 INFO  [InventoryManager.availability-1] (rhq.core.pc.inventory.AvailabilityExecutor)- Scan Ended   : Mon Mar 11 09:44:47 UYT 2013 :
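A quick way to pick out the scans that ran into the 5-second limit is to parse the "Scan Ended" lines. The following is a minimal sketch, assuming the log format shown above; `parse_scan`, `slow_scans` and `sample` are hypothetical names for illustration, not part of RHQ.

```python
import re

# Matches the "Scan [...]" summary printed at the end of each availability scan.
SCAN_RE = re.compile(r"Scan \[(?P<fields>[^\]]*)\]")


def parse_scan(line):
    """Return the key=value fields of a 'Scan Ended' line as a dict, or None."""
    match = SCAN_RE.search(line)
    if match is None:
        return None  # e.g. a 'Scan Starting' line or a truncated entry
    fields = {}
    for pair in match.group("fields").split(", "):
        key, _, value = pair.partition("=")
        fields[key] = value
    return fields


def slow_scans(lines, threshold_ms=5000):
    """Yield (runtime_ms, availability_changes) for scans at or above the threshold."""
    for line in lines:
        fields = parse_scan(line)
        if fields and int(fields["runtime"]) >= threshold_ms:
            yield int(fields["runtime"]), int(fields["numAvailabilityChanges"])


# Three lines copied from the agent log above (prefix shortened).
sample = [
    "(rhq.core.pc.inventory.AvailabilityExecutor)- Scan Starting: Mon Mar 11 09:40:40 UYT 2013",
    "(rhq.core.pc.inventory.AvailabilityExecutor)- Scan Ended   : Mon Mar 11 09:40:45 UYT 2013 : "
    "Scan [startTime=1363005640354, endTime=1363005645360, runtime=5006, isFull=false, "
    "isForced=false, numResources=186, numGetAvailabilityCalls=5, numScheduledRandomly=0, "
    "numPushedByInterval=8, numAvailabilityChanges=143, numDeferToParent=149]",
    "(rhq.core.pc.inventory.AvailabilityExecutor)- Scan Ended   : Mon Mar 11 09:41:20 UYT 2013 : "
    "Scan [startTime=1363005680620, endTime=1363005680622, runtime=2, isFull=false, "
    "isForced=false, numResources=186, numGetAvailabilityCalls=2, numScheduledRandomly=0, "
    "numPushedByInterval=10, numAvailabilityChanges=0, numDeferToParent=149]",
]

for runtime_ms, changes in slow_scans(sample):
    print(f"runtime={runtime_ms} ms, availability changes={changes}")
```

In the excerpt above, the scans with a runtime just over 5000 ms are also the ones reporting 143 or 144 availability changes, which fits the idea that a single timed-out check flips a whole subtree of resources.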

               

              I'm going to try setting rhq.agent.plugins.availability-scan.timeout to more than 5 seconds in the agent.

               

              Juan

              • 4. Re: RHQ Availability Check
                mazz

                It is assumed that all resources can at least respond to a "ping" (an availability check) within a 5,000 millisecond period of time. If they cannot, it is assumed "DOWN".

                 

                There is a "backdoor" system property you can change to increase this timeout - but if you have lots of resources that take over 5s to respond, it could adversely affect the performance of the agent availability reporting as a whole. But if you want to, you can do this - pass in "-Drhq.agent.plugins.availability-scan.timeout=###" to the agent when starting it, where ### is the number of milliseconds you want the timeout to be.
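The timeout behaviour described above can be illustrated with a small sketch. This is not the RHQ agent's implementation (the agent is Java and interrupts the invocation thread); it only mimics the rule "no answer within the timeout means DOWN". The function names and the scaled-down timings are assumptions for illustration.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as FutureTimeout


def check_with_timeout(get_availability, timeout_s):
    """Run the check in a worker thread; report DOWN if it does not answer in time."""
    pool = ThreadPoolExecutor(max_workers=1)
    future = pool.submit(get_availability)
    try:
        return future.result(timeout=timeout_s)
    except FutureTimeout:
        # The resource may be perfectly healthy, just slow: this is the false DOWN.
        return "DOWN"
    finally:
        # Unlike the agent, Python cannot interrupt the worker; we simply stop waiting.
        pool.shutdown(wait=False)


def fast_resource():
    return "UP"


def slow_resource():
    time.sleep(0.6)  # stands in for a resource that needs longer than the timeout
    return "UP"


# 0.2 s stands in for the 5000 ms default mentioned above.
print(check_with_timeout(fast_resource, timeout_s=0.2))
print(check_with_timeout(slow_resource, timeout_s=0.2))
```

Raising the timeout makes the second call succeed, but the caller then waits that much longer for every slow resource, which is the performance risk mentioned above.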

                 

                Another thing to try - If the resource supports it, you can also enable "async availability checking" - I don't think the Apache resource supports this though - I do know the JBossAS Server resources do support this. Look at the resource's "Connection Properties" (aka plugin configuration) and look for things regarding "asynchronous availability checking".

                • 5. Re: RHQ Availability Check
                  jayshaughnessy

                  As John mentions, upping that property can be dangerous as avail checking will basically be held up for N seconds for each resource taking a long time to respond, and there still is no guarantee that the resource will respond.  Typically, but not always, if a resource takes 5s or more to respond then it may very well be DOWN, or responding so slowly as to be in need of some sort of maintenance.  Or there may be network/connectivity issues.

                   

                  As for how long the resource is reported DOWN, that is likely a function of the Availability checking interval for the resource.  By default it's 1 minute for Server resources and 10 minutes for Service resources.  The Avail checking thread runs with 30s intervals.
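As a rough back-of-the-envelope model of the numbers above: if a resource is only re-checked on the first scan at or after its own interval has elapsed, a false DOWN lasts about one check interval. The sketch below assumes exactly that simplified model (fixed 30s scan period, no random scheduling, no forced scans); `next_check_delay` is a hypothetical helper, not an RHQ function.

```python
import math


def next_check_delay(check_interval_s, scan_period_s=30):
    """Seconds until the first scan at or after the resource's interval has elapsed."""
    return math.ceil(check_interval_s / scan_period_s) * scan_period_s


print(next_check_delay(60))   # Server resource, default 1 minute interval
print(next_check_delay(600))  # Service resource, default 10 minute interval
```

Under this model a Server that times out once stays DOWN for roughly a minute and a Service for roughly ten minutes, which is in the same range as the outages reported earlier in the thread.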

                   

                  Note that the Availability check interval can be adjusted on a type/group/resource level by setting the "Availability" metric schedule, just like other metric schedules.  So, if there are specific Services for which you don't want to wait for ~10 minutes between checks, you can set the value lower.  Be conservative, because setting everything low will potentially slow down the entire agent, as it must then repeatedly check avails for many resources, and in general resources stay UP.

                   

                  Since you are seeing the 5s timeouts, as evidenced in the log entry in the original post, you should perhaps try to understand why an UP apache instance is taking so long to respond.  Or, for the other poster, the AS resources.  Again, for AS resources you could enable the async avail checking, which would eliminate any timeouts by performing the avail checks asynchronously and reporting the last "known" avail.
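The idea behind async avail checking can be sketched as follows: a background thread performs the possibly slow check on its own schedule, and the availability call just returns the last known result without waiting. This is a minimal illustration of the pattern, not the plugin's code; the class name, the "UNKNOWN" initial value and the timings are assumptions.

```python
import threading
import time


class AsyncAvailability:
    """Collects availability in the background and serves the last known value."""

    def __init__(self, check, interval_s, initial="UNKNOWN"):
        self._check = check
        self._interval_s = interval_s
        self._last = initial
        self._lock = threading.Lock()
        self._stop = threading.Event()
        self._thread = threading.Thread(target=self._run, daemon=True)

    def start(self):
        self._thread.start()

    def stop(self):
        self._stop.set()
        self._thread.join()

    def _run(self):
        while not self._stop.is_set():
            result = self._check()  # may be slow; only this thread waits for it
            with self._lock:
                self._last = result
            self._stop.wait(self._interval_s)

    def get_availability(self):
        """Return immediately with the most recent result."""
        with self._lock:
            return self._last


def slow_check():
    time.sleep(0.2)  # stands in for a management call that takes a while
    return "UP"


collector = AsyncAvailability(slow_check, interval_s=0.1)
collector.start()
first = collector.get_availability()  # no result collected yet
time.sleep(0.5)
later = collector.get_availability()  # value from the background check
collector.stop()
print(first, later)
```

The trade-off is staleness: the caller never times out, but the value it sees can be as old as one collection cycle.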

                  • 6. Re: RHQ Availability Check
                    pathduck

                    Yeah I figured it might have to do with the availability-scan.timeout value but wasn't sure where to set it. Seems a lot of work to set it for each agent though...and quite risky as well?

                     

                    I would like to try the async avail checking but for AS7 I cannot seem to find this under Connection Settings for the AS7 plugin?

                    • 7. Re: RHQ Availability Check
                      mazz

                      > I would like to try the async avail checking but for AS7 I cannot seem to find this under Connection Settings for the AS7 plugin?

                       

                      that's a mistake and should be fixed. I created a BZ: https://bugzilla.redhat.com/show_bug.cgi?id=920214

                      • 8. Re: RHQ Availability Check
                        jayshaughnessy

                        Can you tell us a little more, like what is your AS7 server configuration? Standalone, domain?   Is it the server resource avail check that is timing out?  Is the machine and/or AS7 server under heavy load?

                        • 9. Re: RHQ Availability Check
                          pathduck

                          > that's a mistake and should be fixed. I created a BZ: https://bugzilla.redhat.com/show_bug.cgi?id=920214

                            

                          Thanks for that, appreciated

                           

                          jay shaughnessy wrote:

                           

                          Can you tell us a little more, like what is your AS7 server configuration? Standalone, domain?   is it the server resource avail check that is timing out?  Is the machine and/or AS7 server under heavy load?

                           

                          Sure. AS7 Standalone, 7.1.1 or a couple 7.1.2. Management interface set up to use LDAP so Connection Settings have to be set according to the management user.

                           

                          On Tomcat, 6.0.36, secure JMX enabled with a management user.

                           

                          The errors we are getting (managed component errors, type AVAILABILITY_CHECK) are

                          "Call to [org.rhq.modules.plugins.jbossas7.StandaloneASComponent.getAvailability()] with args [] has timed out after 5000 milliseconds - invocation thread will be interrupted."

                           

                          However, we are getting the same shorter Unavailabilities of 1 - 10 minutes on Tomcat servers, on Linux OS etc. For Tomcat it is then the call to TomcatServerComponent.getAvailability() that times out. Not sure about component errors for Linux OS.

                           

                          There is no indication the servers are under heavy load, they mostly peak at about 25-30% during the heaviest load times.

                           

                          Some more screens attached below.

                           

                          Tomcat-Unavailable.png

                           

                           

                          Linux-Unavailable.png

                          • 10. Re: RHQ Availability Check
                            spolti

                            Well, thx bro...

                             

                            This topic was very important, and we found a little bug.

                            for me this topic is closed, right everybody?

                            • 11. Re: RHQ Availability Check
                              pathduck

                              Filippe Spolti wrote:

                               

                              Well, thx bro...

                               

                              This topic was very important, and we found a little bug.

                              for me this topic is closed, right everybody?

                               

                              Well, I am still considering opening a BZ about this since it is quite a problem for us. However, I am not sure my problem is related to the getAvailability() timeout. We are moving to JON sometime during the year, and it will be interesting to see if the problem with intermittently available resources is also in JON. If so, it will be a problem for RH support.

                               

                              Filippe; could you tell me if you managed to fix it on your side, you say you found a bug, but are you referring to the one mentioned here?

                              • 12. Re: RHQ Availability Check
                                pathduck

                                Hi,

                                In the latest Snapshot 4.8 build for the AS7 Plugin, there is a new Property under Connection Settings - "Management Connection Timeout".

                                 

                                If I understand this correctly, this is the fix for having Async. Avail. Checking, and if it is zero, not set, or negative it is not enabled? It defaults to 5000 seconds I notice so will try to set it for all servers and see if it improves the situation.

                                 

                                I managed to dig up something John (I think) had written about this but lost the reference again.

                                 

                                Another thing: trying to set this value using a group of several JBoss AS7 servers, I get errors in the RHQ UI. Seems I have to go through every instance to set this?

                                The servers discovered by the 4.7 plugin do not have this value and I guess that gives an error.

                                 

                                Stian

                                • 13. Re: RHQ Availability Check
                                  tsegismont

                                  Stian,

                                   

                                  AS7 plugin uses a pool of persistent HTTP connections. The "Management Connection Timeout" parameter just defines the maximum time an idle connection will be kept alive. This is not related to Async avail checking.

                                   

                                  This new parameter was introduced a few weeks ago and the change was shipped in RHQ 4.7. In most cases users should keep the default value.

                                   

                                  Can you please give more details about the errors you got when trying to update a group of servers?

                                   

                                  Thomas

                                  • 14. Re: RHQ Availability Check
                                    jayshaughnessy

                                    You should certainly be able to update connection settings in a group-wise way.  For total success the relevant agents must all be running and the AS7 servers should all be UP.  If this is true then you should not be getting errors.  Are you saying the errors are related to the existing servers not having a value set?
