-
1. Re: JOPR Newbie having problems with nodes failing collection
mazz Mar 24, 2010 9:15 AM (in response to modulus10)
What OS/hardware is running your server and database? What kind of database (Postgres? Oracle?)?
Whenever I see "Backend services unavailable" I immediately think the database can't keep up, and it's probably not tuned or is not running on hardware that can handle the load.
Did you look in the server's log file to see what errors, if any, you got? (Typically, if the database goes bad, the log will fill with errors, and if it is the database not keeping up under high load, the errors will be generic database errors.)
How fast are you collecting metrics, and how many metrics? If you changed most of your metrics to collect every 60 seconds, that would cause high load and you'd need a very good/tuned database to keep up.
In short, it sounds like your hardware/DB setup can't handle the load for whatever reason.
-
2. Re: JOPR Newbie having problems with nodes failing collection
modulus10 Mar 24, 2010 10:17 AM (in response to mazz)
HP Dual CPU - Intel(R) Xeon(TM) CPU 3.60GHz
8 GB RAM
[root@vitalsigns jopr-server-2.3.1]# free
             total       used       free     shared    buffers     cached
Mem:       8309520    1935860    6373660          0     300496     528256
-/+ buffers/cache:    1107108    7202412
Swap:      2097144          0    2097144
DB - Postgres 8.4
OS - CentOS 5.4
Here is some information from JOPR about itself (standalone agent installed and monitored):
Type: Linux (Platform)
Description: Linux Operating System
Version: Linux 2.6.18-164.11.1.el5PAE
Parent: none
Hostname: vitalsigns.xxx.xxxx.com
Architecture: i686
OS Name: Linux
Distribution Name: CentOS
OS Version: 2.6.18-164.11.1.el5PAE
Distribution Version: release 5.4 (Final)

Name               Alerts  Min       Max       Average   Last
Free Memory        0       6.03GB    6.78GB    6.12GB    6.07GB
Free Swap Space    0       2GB       2GB       2GB       2GB
System Load        0       0.03%     1.58%     0.25%     0.26%
Total Memory       0       7.9246GB  7.9246GB  7.9246GB  7.9246GB
Total Swap Space   0       2GB       2GB       2GB       2GB
Used Memory        0       1.14GB    1.9GB     1.8GB     1.85GB
Used Swap Space    0       0B        0B        0B        0B
User Load          0       0.2%      40.9%     1.8%      1.6%

So I don't believe it's resource related; I'm only watching 4 nodes. I am open to some tuning tips if you have any, but from a system perspective this machine seems to be underutilized. But again, I am not familiar with the inner workings of JOPR, and I'm sure there are things I can improve.
Server Logs:
First sign of something strange
2010-03-24 05:21:14,829 INFO [org.rhq.enterprise.server.content.ContentManagerBean] Finished merging 1 packages in 10ms
2010-03-24 05:21:23,795 INFO [org.rhq.enterprise.server.content.ContentManagerBean] Merging packages for resource ID [10682]. Package count [1]
2010-03-24 05:21:23,804 INFO [org.rhq.enterprise.server.content.ContentManagerBean] Finished merging 1 packages in 9ms
]
script, name=RHQ Agent Launcher Script, parent= s-ssf02-sjpop.xxx.xxxx.com RHQ Agent, version=1.3.1]] changed its version from [1.3.1] to []
, name=RHQ Agent JVM, parent= s-ssf02-sjpop.xxx.xxxx.com RHQ Agent, version=1.5.0_14]] changed its version from [1.5.0_14] to []
099), version=2.0.1.GA]] changed its version from [2.0.1.GA] to []
] changed its version from [JBoss Messaging 1.4.0.SP3] to []
r 24 05:00:23 UTC 2010]. Will be backfilled since we suspect it is down
2010-03-24 05:21:44,818 INFO [org.rhq.enterprise.server.content.ContentManagerBean] Merging packages for resource ID [10511]. Package count [1]
Then I started getting these at each interval:
2010-03-24 05:22:32,289 WARN [org.rhq.enterprise.server.core.AgentManagerBean] Have not heard from agent [ s-ssf02-sjpop.xxx.xxxx.com] since [Wed Mar 24 05:00:23 UTC 2010]. Will be backfilled since we suspect it is down
2010-03-24 05:23:32,288 WARN [org.rhq.enterprise.server.core.AgentManagerBean] Have not heard from agent [ s-ssf02-sjpop..xxx.xxxx.com] since [Wed Mar 24 05:00:23 UTC 2010]. Will be backfilled since we suspect it is down
2010-03-24 05:24:32,288 WARN [org.rhq.enterprise.server.core.AgentManagerBean] Have not heard from agent [ s-ssf02-sjpop..xxx.xxxx.com] since [Wed Mar 24 05:00:23 UTC 2010]. Will be backfilled since we suspect it is down
These continued. However, the node was not down (I would have had LOTS of angry people if it was, and Zenoss never lost communication with it or its snmpd agent), and the rhq-agent was running as well.
I ran a manual discovery and got a "successful" message back, but still no data was being collected.
In the end I deleted the node and re-ran the agent installer, and now the node is back (but all data was lost).
Hope this helps, and thanks for commenting
C
-
3. Re: JOPR Newbie having problems with nodes failing collection
mazz Mar 24, 2010 11:32 AM (in response to modulus10)
Looks like the comm traffic stopped between the agent box and the server box over port 7080 (which is the default, unless you changed it). If your agent->server traffic is going over SSL, the port is 7443 by default.
Either that, or the clocks on the server and agent are not synced (??). Do you have NTP set up on the agent and server boxes?
Did you see any errors in the agent log indicating that it was unable to send availability reports to the server?
That message:
"Have not heard from agent [ s-ssf02-sjpop..xxx.xxxx.com] since [Wed Mar 24 05:00:23 UTC 2010]. Will be backfilled since we suspect it is down"
means that the agent failed to successfully send an availability report to the server in a reasonable amount of time. When the server fails to get an availability report from an agent in a reasonable amount of time, it "backfills" that agent, meaning it marks all of the agent's resources "red", including its top-level platform resource.
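To see how long the server thinks an agent has been silent, the gap can be read straight off those WARN lines. The following is only a rough sketch, not anything that ships with JOPR: the function name quiet_time is made up for illustration, and it assumes the timestamp at the start of the log line is in the same time zone (UTC) as the "since" time inside the message.

```python
import re
from datetime import datetime, timedelta

# Matches the "Have not heard from agent" WARN lines quoted in this thread.
WARN_RE = re.compile(
    r"^(?P<logged>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}),\d{3} WARN .*"
    r"Have not heard from agent \[\s*(?P<agent>[^\]]+?)\s*\] "
    r"since \[(?P<since>[^\]]+)\]"
)

def quiet_time(line):
    """Return (agent, gap) for a matching WARN line, or None otherwise.

    Assumption: the log timestamp and the 'since' timestamp share a time zone.
    """
    m = WARN_RE.search(line)
    if m is None:
        return None
    logged = datetime.strptime(m.group("logged"), "%Y-%m-%d %H:%M:%S")
    since = datetime.strptime(m.group("since"), "%a %b %d %H:%M:%S UTC %Y")
    return m.group("agent"), logged - since

# Example with the first WARN line posted above:
# line = ("2010-03-24 05:22:32,289 WARN [org.rhq.enterprise.server.core.AgentManagerBean] "
#         "Have not heard from agent [ s-ssf02-sjpop.xxx.xxxx.com] since "
#         "[Wed Mar 24 05:00:23 UTC 2010]. Will be backfilled since we suspect it is down")
# agent, gap = quiet_time(line)
# print(agent, gap)   # s-ssf02-sjpop.xxx.xxxx.com 0:22:09
```

If the gap keeps growing by a minute with each warning, as it does in the lines posted, the server has received no availability report at all since the "since" time.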
You can test that an agent can send an availability report to the server via the agent prompt command "avail". You can test that an agent has connectivity to the server by executing the "ping" agent prompt command.
Also, the fact that you were getting "backend datasource not available" (and your hardware/db setup looks ok) tells me it's more about connectivity than anything. Perhaps there are network connectivity issues with the server box?? Do you have more than one server installed (i.e. are you using High-Availability (HA))?
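If you want a quick check from outside the agent prompt, something like the following could be run on the agent box to see whether a plain TCP connection to the server port can be opened. This is a minimal sketch and not part of JOPR: the function name can_connect and the hostname in the usage comment are placeholders, and a successful connect only shows the port is reachable, not that the agent and server can actually exchange availability reports.

```python
import socket

def can_connect(host, port, timeout=5.0):
    """Return True if a TCP connection to host:port opens within timeout seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name resolution failures.
        return False

# Usage (hostname is a placeholder; 7080 and 7443 are the defaults mentioned above):
# for port in (7080, 7443):
#     print(port, can_connect("jopr-server.example.com", port))
```

A False result for the port you are actually using would point at a firewall or routing problem between the two boxes rather than at the database.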