3 Replies Latest reply on Mar 24, 2010 11:32 AM by mazz

JOPR Newbie having problems with nodes failing collection

modulus10 Mar 24, 2010 8:38 AM

Hi everyone -

JOPR newbie here with a few questions about node failures and "Backend services unavailable".

I installed JOPR (rhq-server 2.3.1) and had it collecting data on 4 nodes rather easily. But at 4:30 am, one node just vanished - everything went red and no data was being collected. I have restarted the agent and I have restarted the server (which caused problem 2 below), but I cannot get the server to collect data from the node or from the agent running on the node.

A manual rediscovery was successful:

Operation:	Manual Autodiscovery
Date Submitted:	3/24/10, 11:56:00 AM, UTC
Date Completed:	3/24/10, 11:56:02 AM, UTC
Requester:	rhqadmin
Status:	Success

But still no data is being collected... any suggestions?

The second problem, which I run into a lot, is that if I restart the server, I get the "Backend services unavailable" message. I have tried restarting the database as well as the server and local agent... the only thing that seems to work is a complete reboot of the server.

Any ideas?

Thanks in advance

1. Re: JOPR Newbie having problems with nodes failing collection

mazz Mar 24, 2010 9:15 AM (in response to modulus10)

What OS/hardware is running your server and database? What kind of database (Postgres? Oracle?)?

Whenever I see "Backend services unavailable" I immediately think the database can't keep up and it's probably not tuned or is not running on hardware that can handle the load.

Did you look in the server's log file to see what errors, if any, you got? Typically, if the database goes bad, the log will fill with errors - and if it is the database under high load not keeping up, the errors will be generic database errors.

How fast are you collecting metrics and how many metrics? If you change most of your metrics to collect every 60 seconds, that would cause high load and you'd need a very good/tuned database to keep up.

In short, it sounds like your hardware/DB setup can't handle the load for whatever reason.
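
To make the collection-rate point concrete, here is a rough back-of-the-envelope sketch. It is not part of JOPR/RHQ; the function name and the schedule numbers in the usage note are purely hypothetical. It only estimates how many metric datapoints per minute the server's database would have to absorb for a given set of collection intervals.

```python
def datapoints_per_minute(schedules):
    """Estimate datapoints written per minute.

    schedules: iterable of (metric_count, interval_seconds) pairs,
    e.g. (100, 60) means 100 metrics each collected every 60 seconds.
    These pairs are illustrative inputs, not values read from JOPR.
    """
    return sum(count * 60.0 / interval for count, interval in schedules)
```

For example, `datapoints_per_minute([(100, 60), (200, 600)])` gives 120.0, while moving all 300 of those hypothetical metrics to a 60-second interval would give 300.0, which is the kind of jump that can push an untuned database over the edge.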

2. Re: JOPR Newbie having problems with nodes failing collection

modulus10 Mar 24, 2010 10:17 AM (in response to mazz)

HP Dual CPU - Intel(R) Xeon(TM) CPU 3.60GHz

8 gig ram

[root@vitalsigns jopr-server-2.3.1]# free
             total       used       free     shared     buffers     cached
Mem:       8309520    1935860    6373660           0     300496     528256
-/+ buffers/cache:    1107108    7202412
Swap:       2097144          0    2097144

DB - Postgres 8.4

OS - CentOS 5.4

Here is some information from JOPR about itself (standalone agent installed and monitored):

Type: Linux (Platform)	Description: Linux Operating System
Version: Linux 2.6.18-164.11.1.el5PAE	Parent: none

Hostname: vitalsigns.xxx.xxxx.com	Architecture: i686
OS Name: Linux	Distribution Name: CentOS
OS Version: 2.6.18-164.11.1.el5PAE	Distribution Version: release 5.4 (Final)

Name	Alerts	Min	Max	Average	Last

Free Memory	0	6.03GB	6.78GB	6.12GB	6.07GB
Free Swap Space	0	2GB	2GB	2GB	2GB
System Load	0	0.03%	1.58%	0.25%	0.26%
Total Memory	0	7.9246GB	7.9246GB	7.9246GB	7.9246GB
Total Swap Space	0	2GB	2GB	2GB	2GB
Used Memory	0	1.14GB	1.9GB	1.8GB	1.85GB
Used Swap Space	0	0B	0B	0B	0B
User Load	0	0.2%	40.9%	1.8%	1.6%

So I don't believe it's resource related - I'm only watching 4 nodes. I am open to some tuning tips if you have any, but from a system perspective this machine seems to be underutilized. But again, I am not familiar with the inner workings of JOPR, and I'm sure there are things I can improve.

Server Logs:

First sign of something strange

2010-03-24 05:21:14,829 INFO   [org.rhq.enterprise.server.content.ContentManagerBean] Finished merging 1 packages in 10ms
2010-03-24 05:21:23,795 INFO   [org.rhq.enterprise.server.content.ContentManagerBean] Merging packages for resource ID [10682]. Package count [1]
2010-03-24 05:21:23,804 INFO   [org.rhq.enterprise.server.content.ContentManagerBean] Finished merging 1 packages in 9ms
]
script, name=RHQ Agent Launcher Script, parent= s-ssf02-sjpop.xxx.xxxx.com RHQ Agent, version=1.3.1]] changed its version from [1.3.1] to []
, name=RHQ Agent JVM, parent= s-ssf02-sjpop.xxx.xxxx.com RHQ Agent, version=1.5.0_14]] changed its version from [1.5.0_14] to []
099), version=2.0.1.GA]] changed its version from [2.0.1.GA] to []
] changed its version from [JBoss Messaging 1.4.0.SP3] to []
r 24 05:00:23 UTC 2010]. Will be backfilled since we suspect it is down
2010-03-24 05:21:44,818 INFO   [org.rhq.enterprise.server.content.ContentManagerBean] Merging packages for resource ID [10511]. Package count [1]

Then started getting these at each interval:

2010-03-24 05:22:32,289 WARN [org.rhq.enterprise.server.core.AgentManagerBean] Have not heard from agent [ s-ssf02-sjpop.xxx.xxxx.com] since [Wed Mar 24 05:00:23 UTC 2010]. Will be backfilled since we suspect it is down

2010-03-24 05:23:32,288 WARN [org.rhq.enterprise.server.core.AgentManagerBean] Have not heard from agent [ s-ssf02-sjpop..xxx.xxxx.com] since [Wed Mar 24 05:00:23 UTC 2010]. Will be backfilled since we suspect it is down

2010-03-24 05:24:32,288 WARN [org.rhq.enterprise.server.core.AgentManagerBean] Have not heard from agent [ s-ssf02-sjpop..xxx.xxxx.com] since [Wed Mar 24 05:00:23 UTC 2010]. Will be backfilled since we suspect it is down

These continued. However, the node was not down (I would have had LOTS of angry people if it was, and Zenoss never lost communication with it or its snmpd agent), and the rhq-agent was running as well.

I ran a manual discovery and got a "successful" message back, but still no data was being collected.

In the end I deleted the node and re-ran the agent installer, and now the node is back (but all data was lost).

Hope this helps, and thanks for commenting.

3. Re: JOPR Newbie having problems with nodes failing collection

mazz Mar 24, 2010 11:32 AM (in response to modulus10)

Looks like the comm traffic stopped between the agent box and the server box over port 7080 (which is the default, unless you changed it) - if your agent->server traffic is going over SSL, the port is 7443 by default.

Either that, or the clocks on the server and agent are not synced (??). Do you have NTP set up on the agent and server boxes?
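
As a quick way to check that first possibility from the agent box, a minimal sketch like the following tries to open a plain TCP connection to the server port. It only proves that the port is reachable, not that the agent's own comm layer is healthy. The function name is hypothetical and the host name in the usage note is a placeholder; the ports are the defaults mentioned above.

```python
import socket


def can_connect(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        # create_connection resolves the host and performs the TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name-resolution failures.
        return False
```

Run on the agent machine, `can_connect("your-server-host", 7080)` (or 7443 for SSL) returning False would point at a firewall, routing, or listener problem rather than at the database.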

Did you see any errors in the agent log that indicated that it failed to be able to send availability reports to the server?

That message:

"Have not heard from agent [ s-ssf02-sjpop..xxx.xxxx.com] since [Wed Mar 24 05:00:23 UTC 2010]. Will be backfilled since we suspect it is down"

means that the agent failed to successfully send an availability report to the server in a reasonable amount of time. When the server fails to get an availability report from an agent in a reasonable amount of time, it "backfills" that agent, meaning it turns "red" all of the agent's resources, including its top level platform resource.
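
To see how long the server has gone without hearing from each agent, a small sketch like this one can scan the server log for that warning. It is not an RHQ tool: the regular expression is written against the exact lines quoted in this thread, the function name is hypothetical, and it assumes that the log timestamp and the "since" timestamp are in the same time zone and that month and day names are in English.

```python
import re
from datetime import datetime

# Pattern modelled on the AgentManagerBean warning quoted in this thread.
WARNING = re.compile(
    r"^(?P<logged>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3})\s+WARN\s+"
    r"\[org\.rhq\.enterprise\.server\.core\.AgentManagerBean\]\s+"
    r"Have not heard from agent \[\s*(?P<agent>[^\]]+?)\s*\] "
    r"since \[(?P<since>[^\]]+)\]"
)


def silence_gaps(lines):
    """Return (agent, gap) pairs, where gap is a timedelta between the
    time the warning was logged and the last time the agent was heard from."""
    gaps = []
    for line in lines:
        match = WARNING.match(line.strip())
        if match is None:
            continue  # not a backfill warning
        logged = datetime.strptime(match["logged"], "%Y-%m-%d %H:%M:%S,%f")
        # Assumption: the "since" stamp is UTC, as printed in the log line.
        since = datetime.strptime(match["since"], "%a %b %d %H:%M:%S UTC %Y")
        gaps.append((match["agent"], logged - since))
    return gaps
```

Fed the first warning from the log excerpt in this thread, it reports a gap of a little over 22 minutes for that agent, which grows by a minute with each subsequent warning.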

You can test that an agent can send an availability report to the server via the agent prompt command "avail". You can test that an agent has connectivity to the server by executing the "ping" agent prompt command.

Also, the fact that you were getting "Backend services unavailable" (and your hardware/DB setup looks OK) tells me it's more about connectivity than anything. Perhaps there are network connectivity issues with the server box? Do you have more than one server installed (i.e. are you using High-Availability (HA))?