3 Replies Latest reply on Mar 24, 2010 11:32 AM by mazz

    JOPR Newbie having problems with nodes failing collection

    Chris Cardone Newbie

      Hi everyone -

      JOPR Newbie here with a few questions about node failures and "Backend services unavailable"

       

      I installed JOPR (rhq-server 2.3.1) and had it collecting data on 4 nodes rather easily. But at 4:30 am, one node just vanished - everything went red and no data was being collected. I have restarted the agent and I have restarted the server (which caused problem 2 below), but I cannot get the server to collect data from the node or from the agent running on the node.

       

      A manual rediscovery was successful

      Operation:Manual Autodiscovery
      Date Submitted:3/24/10, 11:56:00 AM, UTC
      Date Completed:3/24/10, 11:56:02 AM, UTC
      Requester:rhqadmin
      Status:Success

       

      But still no data is being collected... any suggestions?

       

      The second problem, which I run into a lot, is that if I restart the server I get the "Backend services unavailable" message. I have tried restarting the database as well as the server and local agent... the only thing that seems to work is a complete reboot of the server.

       

      Any ideas?

       

      Thanks in advance

       

      C

        • 1. Re: JOPR Newbie having problems with nodes failing collection
          mazz Master

          What OS/hardware is running your server and database? What kind of database (Postgres? Oracle?)?

           

          Whenever I see "Backend services unavailable" I immediately think the database can't keep up and it's probably not tuned or is not running on hardware that can handle the load.

           

          Did you look in the server's log file to see what errors, if any, you got? (Typically, if the database goes bad, the log will fill with errors - and if it is the database under high load not keeping up, the errors will be generic database errors.)

           

          How fast are you collecting metrics and how many metrics? If you change most of your metrics to collect every 60 seconds, that would cause high load and you'd need a very good/tuned database to keep up.
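To get a feel for how the collection interval drives database load, here is a rough back-of-the-envelope sketch. The resource and metric counts are made-up numbers for illustration, not figures from this thread, and the function name is hypothetical:

```python
def data_points_per_minute(resources, metrics_per_resource, interval_seconds):
    """Rough number of metric values the server must store each minute."""
    if interval_seconds <= 0:
        raise ValueError("interval_seconds must be positive")
    return resources * metrics_per_resource * 60.0 / interval_seconds


# Hypothetical inventory: 400 resources with 10 enabled metrics each.
for interval in (60, 600, 1200):
    rate = data_points_per_minute(400, 10, interval)
    print(f"every {interval:>4}s -> {rate:,.0f} data points per minute")
```

Dropping the interval from 10 minutes to 60 seconds multiplies the write rate by ten, which is why a short interval on most metrics needs a well-tuned database.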

           

          In short, it sounds like your hardware/DB setup can't handle the load for whatever reason.

          • 2. Re: JOPR Newbie having problems with nodes failing collection
            Chris Cardone Newbie

            HP Dual CPU - Intel(R) Xeon(TM)  CPU 3.60GHz

            8 gig ram

            [root@vitalsigns jopr-server-2.3.1]#  free
                         total       used       free     shared     buffers     cached
            Mem:       8309520    1935860    6373660           0     300496     528256
            -/+ buffers/cache:    1107108    7202412
            Swap:       2097144          0    2097144

             

            DB - Postgres 8.4

             

            OS - CentOS 5.4

             

             

             

            Here is some  information from JOPR about itself (stand alone agent installed and  monitored)

            Type: Linux (Platform)
            Description: Linux Operating System
            Version: Linux 2.6.18-164.11.1.el5PAE
            Parent: none
            Hostname: vitalsigns.xxx.xxxx.com
            Architecture: i686
            OS Name: Linux
            Distribution Name: CentOS
            OS Version: 2.6.18-164.11.1.el5PAE
            Distribution Version: release 5.4 (Final)

             

            Name              Alerts  Min       Max       Average   Last
            Free Memory       0       6.03GB    6.78GB    6.12GB    6.07GB
            Free Swap Space   0       2GB       2GB       2GB       2GB
            System Load       0       0.03%     1.58%     0.25%     0.26%
            Total Memory      0       7.9246GB  7.9246GB  7.9246GB  7.9246GB
            Total Swap Space  0       2GB       2GB       2GB       2GB
            Used Memory       0       1.14GB    1.9GB     1.8GB     1.85GB
            Used Swap Space   0       0B        0B        0B        0B
            User Load         0       0.2%      40.9%     1.8%      1.6%

             

             

            So I don't believe it's resource related - I'm only watching 4 nodes. I am open to some tuning tips if you have any, but from a system perspective this machine seems to be underutilized. But again, I am not familiar with the inner workings of JOPR, and I'm sure there are things I can improve.

             

            Server Logs:


            First sign of something strange


            2010-03-24  05:21:14,829 INFO   [org.rhq.enterprise.server.content.ContentManagerBean] Finished merging 1  packages in 10ms
            2010-03-24 05:21:23,795 INFO   [org.rhq.enterprise.server.content.ContentManagerBean] Merging packages  for resource ID [10682]. Package count [1]
            2010-03-24  05:21:23,804 INFO   [org.rhq.enterprise.server.content.ContentManagerBean] Finished merging 1  packages in 9ms
            ]
            script, name=RHQ Agent Launcher Script, parent=  s-ssf02-sjpop.xxx.xxxx.com RHQ Agent, version=1.3.1]] changed its  version from [1.3.1] to []
            , name=RHQ Agent JVM, parent=  s-ssf02-sjpop.xxx.xxxx.com RHQ Agent, version=1.5.0_14]] changed its  version from [1.5.0_14] to []
            099), version=2.0.1.GA]] changed its  version from [2.0.1.GA] to []
            ] changed its version from [JBoss  Messaging 1.4.0.SP3] to []
            r 24 05:00:23 UTC 2010]. Will be  backfilled since we suspect it is down

            2010-03-24  05:21:44,818 INFO   [org.rhq.enterprise.server.content.ContentManagerBean] Merging packages  for resource ID [10511]. Package count [1]

             

            Then started getting these at each interval:

             

            2010-03-24  05:22:32,289 WARN  [org.rhq.enterprise.server.core.AgentManagerBean]  Have not heard from agent [ s-ssf02-sjpop.xxx.xxxx.com] since [Wed Mar  24 05:00:23 UTC 2010]. Will be backfilled since we suspect it is down

             

            2010-03-24  05:23:32,288 WARN  [org.rhq.enterprise.server.core.AgentManagerBean]  Have not heard from agent [ s-ssf02-sjpop..xxx.xxxx.com] since [Wed Mar  24 05:00:23 UTC 2010]. Will be backfilled since we suspect it is down

             

            2010-03-24  05:24:32,288 WARN  [org.rhq.enterprise.server.core.AgentManagerBean]  Have not heard from agent [ s-ssf02-sjpop..xxx.xxxx.com] since [Wed Mar  24 05:00:23 UTC 2010]. Will be backfilled since we suspect it is down

             

            These continued. However, the node was not down (there would have been LOTS of angry people if it was, and ZenOss never lost communication with it or its snmpd agent), and the rhq-agent was running as well.

             

            I ran a manual discovery and got a "successful" message back, but still no data was being collected.

             

            In the end I deleted the node and re-ran the agent installer, and now the node is back (but all data was lost).

             

            Hope this  helps, and thanks for commenting

             

            C

            • 3. Re: JOPR Newbie having problems with nodes failing collection
              mazz Master

              Looks like the comm traffic from the agent box to the server box stopped over port 7080 (which is the default, unless you changed it). If your agent->server traffic is going over SSL, the port is 7443 by default.
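A quick way to rule out a blocked port is to try a plain TCP connection from the agent box to the server box. This is only a minimal sketch: it checks that something accepts connections on the port, not that the RHQ server behind it is healthy, and the function names are made up for this example:

```python
import socket


def can_connect(host, port, timeout=5.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and DNS lookup failures.
        return False


def check_server(host, ports=(7080, 7443)):
    """Check the default plain (7080) and SSL (7443) ports mentioned above."""
    return {port: can_connect(host, port) for port in ports}
```

Run it on the agent box with something like `print(check_server("your-server-hostname"))`, substituting the real server hostname. If the port you use comes back False, look at firewalls or routing between the two boxes.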

               

              Either that, or the clocks on the server and agent are not synced (??). Do you have NTP set up on the agent and server boxes?

               

              Did you see any errors in the agent log that indicated that it failed to be able to send availability reports to the server?

               

              That message:

               

              "Have not heard from agent [ s-ssf02-sjpop..xxx.xxxx.com] since [Wed Mar  24 05:00:23 UTC 2010]. Will be backfilled since we suspect it is down"

               

              means that the agent failed to successfully send an availability report to the server in a reasonable amount of time. When the server fails to get an availability report from an agent in a reasonable amount of time, it "backfills" that agent, meaning it turns "red" all of the agent's resources, including its top level platform resource.
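The backfill decision described above can be sketched roughly like this. This is not the server's actual code: the 15-minute timeout is a placeholder assumption for "a reasonable amount of time", and the function names are invented for illustration:

```python
from datetime import datetime, timedelta

# Assumption: the real timeout is whatever "a reasonable amount of time"
# means in your server's configuration; 15 minutes is only a placeholder.
ASSUMED_TIMEOUT = timedelta(minutes=15)


def is_suspect(last_report, now, timeout=ASSUMED_TIMEOUT):
    """True if no availability report has arrived within the timeout."""
    return now - last_report > timeout


def backfill(resources):
    """Mark every resource of a suspect agent as down ("red")."""
    return {name: "DOWN" for name in resources}


# Timestamps taken from the log excerpt in this thread.
last_report = datetime(2010, 3, 24, 5, 0, 23)
now = datetime(2010, 3, 24, 5, 22, 32)

# Illustrative resource names only; a real agent reports many more.
resources = ["platform", "RHQ Agent", "RHQ Agent JVM"]

if is_suspect(last_report, now):
    print(backfill(resources))
```

The point is that the server only looks at when it last heard from the agent. A perfectly healthy node still goes red if its reports stop arriving.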

               

              You can test that an agent can send an availability report to the server via the agent prompt command "avail". You can test that an agent has connectivity to the server by executing the "ping" agent prompt command.

               

              Also, the fact that you were getting "Backend services unavailable" (and your hardware/DB setup looks OK) tells me it's more about connectivity than anything. Perhaps there are network connectivity issues with the server box? Do you have more than one server installed (i.e. are you using High Availability (HA))?