-
15. Re: Few Issues
mazz Jul 23, 2009 3:37 PM (in response to jfrazier)As expected, the VM health check thread isn't doing anything to cause the VM to die.
I wonder why your measurement collections are getting delayed.
Are the machines that are running your agents and servers synced? i.e., are their clocks synced so they have the same time? This would have no bearing on your agent just dying out of the blue like that, but you do need to have your clocks synced. I'm not sure if these measurement delays are a symptom of that. See http://jopr.org/confluence/display/JOPR2/Jopr+Server+Installation+Preparation#JoprServerInstallationPreparation-Synchronizedmachineclocks -
16. Re: Few Issues
jfrazier Jul 23, 2009 7:39 PM (in response to jfrazier)"mazz" wrote:
As expected, the VM health check thread isn't doing anything to cause the VM to die.
I wonder why your measurement collections are getting delayed.
Are the machines that are running your agents and servers synced? i.e., are their clocks synced so they have the same time? This would have no bearing on your agent just dying out of the blue like that, but you do need to have your clocks synced. I'm not sure if these measurement delays are a symptom of that. See http://jopr.org/confluence/display/JOPR2/Jopr+Server+Installation+Preparation#JoprServerInstallationPreparation-Synchronizedmachineclocks
Clocks are in sync; they sync to an internal NTP server every hour.
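For anyone following along, here is a quick sanity check that the sync is actually holding. This is just a sketch: the NTP-specific tools may not be installed everywhere, so they are shown as comments, and only a plain `date` comparison is run.

```shell
# Quick clock-sync check on each box (ntpq/ntpstat availability varies,
# so the ntp-specific commands are left as comments):
# ntpq -p     # list peers with offset/jitter in milliseconds
# ntpstat     # one-line summary: "synchronised to NTP server ..."
date -u       # compare across boxes; offsets should be well under a second
```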
I set up some alerts in the Jopr UI to notify me when someone issues a shutdown command. That won't tell me who did it, but it will tell me if someone is messing with me here. -
17. Re: Few Issues
jfrazier Jul 24, 2009 12:48 PM (in response to jfrazier)It just went down again. I had two alerts set up on the RHQ agent in the UI:
If Condition: Shutdown Agent INPROGRESS
If Condition: Availability goes DOWN
In this case only the second condition generated an alert. When I tested after setting them up, initiating the shutdown from the UI gave me both alerts instead of just the one. The log still shows the same thing:
The agent itself is running as root as a daemon in the background, so if someone were to kill -9 it as root it would just die in the logs instead of shutting down nicely. 99% of the users that log in do so as a different user, which can't possibly kill the agent using the kill command.
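Worth noting: the shutdown log shows the JVM's shutdown hooks running, and a kill -9 never gives them the chance. Here is a small sketch of the difference, using a shell trap as a stand-in for the JVM's shutdown hooks (the file paths and the `sleep` stand-in are mine, not part of the agent):

```shell
# A trap on TERM models the JVM's shutdown hooks.
# SIGTERM: the handler runs - like the clean "Shutting down..." log lines.
sh -c 'trap "echo clean shutdown" TERM; sleep 30 >/dev/null 2>&1 & wait' > /tmp/term.out &
tpid=$!
sleep 1; kill -TERM "$tpid"; sleep 1
cat /tmp/term.out        # prints: clean shutdown

# SIGKILL: the process dies instantly; no handler can run, the log just stops.
sh -c 'trap "echo clean shutdown" TERM; sleep 30 >/dev/null 2>&1 & wait' > /tmp/kill.out &
kpid=$!
sleep 1; kill -KILL "$kpid"; sleep 1
cat /tmp/kill.out        # empty - SIGKILL cannot be trapped
```

So whatever is stopping the agent is sending it a catchable signal (SIGTERM, SIGINT, SIGHUP), not a kill -9.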
I still can't find anyone causing the agent to shut down. Where else can we look?

2009-07-24 12:11:52,229 INFO [MeasurementManager.sender-1] (rhq.core.pc.measurement.MeasurementCollectorRunner)- Measurement collection for [8] metrics took 4ms - sending report to Server...
2009-07-24 12:12:43,405 INFO [InventoryManager.availability-1] (rhq.core.pc.inventory.InventoryManager)- Sending availability report to Server...
2009-07-24 12:12:52,229 INFO [MeasurementManager.sender-1] (rhq.core.pc.measurement.MeasurementCollectorRunner)- Measurement collection for [5] metrics took 2ms - sending report to Server...
2009-07-24 12:13:43,457 INFO [InventoryManager.availability-1] (rhq.core.pc.inventory.InventoryManager)- Sending availability report to Server...
2009-07-24 12:13:52,229 INFO [MeasurementManager.sender-1] (rhq.core.pc.measurement.MeasurementCollectorRunner)- Measurement collection for [235] metrics took 549ms - sending report to Server...
2009-07-24 12:14:03,062 INFO [InventoryManager.discovery-1] (rhq.core.pc.inventory.AutoDiscoveryExecutor)- Executing server discovery scan...
2009-07-24 12:14:03,195 WARN [ResourceDiscoveryComponent.invoker.daemon-139] (org.rhq.plugins.virt.VirtualizationDiscoveryComponent)- Can not load native library for libvirt: Could not initialize class org.rhq.plugins.virt.LibVirt
2009-07-24 12:14:03,198 INFO [InventoryManager.discovery-1] (rhq.core.pc.inventory.AutoDiscoveryExecutor)- Process scan auto-detected new server resource: scan=[ProcessScan: query=[process|basename|match=sshd,process|basename|nomatch|parent=sshd], name=[SSHD]], discovered-process=[process: pid=[5478], name=[/usr/sbin/sshd], ppid=[1]]
2009-07-24 12:14:03,353 INFO [ResourceDiscoveryComponent.invoker.daemon-139] (org.rhq.plugins.agent.AgentDiscoveryComponent)- Discovering RHQ Agent...
2009-07-24 12:14:03,355 INFO [InventoryManager.discovery-1] (rhq.core.pc.inventory.AutoDiscoveryExecutor)- Process scan auto-detected new server resource: scan=[ProcessScan: query=[process|basename|match=^java.*,arg|org.jboss.Main|match=.*], name=[JBoss4]], discovered-process=[process: pid=[7030], name=[/vol/app/common/java/jdk1.6.0_12/bin/java], ppid=[7016]]
2009-07-24 12:14:03,366 INFO [ResourceDiscoveryComponent.invoker.daemon-139] (org.rhq.plugins.cli.CliDiscoveryComponent)- Processing discovered CLI resources
2009-07-24 12:14:03,369 INFO [InventoryManager.discovery-1] (rhq.core.pc.inventory.InventoryManager)- Sending server inventory report to Server...
2009-07-24 12:14:03,421 INFO [InventoryManager.discovery-1] (rhq.core.pc.inventory.InventoryManager)- Syncing local inventory with Server inventory...
2009-07-24 12:14:03,430 INFO [InventoryManager.discovery-1] (rhq.core.pc.inventory.AutoDiscoveryExecutor)- Found 0 servers.
2009-07-24 12:14:43,521 INFO [InventoryManager.availability-1] (rhq.core.pc.inventory.InventoryManager)- Sending availability report to Server...
2009-07-24 12:14:52,228 INFO [MeasurementManager.sender-1] (rhq.core.pc.measurement.MeasurementCollectorRunner)- Measurement collection for [5] metrics took 2ms - sending report to Server...
2009-07-24 12:15:43,581 INFO [InventoryManager.availability-1] (rhq.core.pc.inventory.InventoryManager)- Sending availability report to Server...
2009-07-24 12:15:52,227 INFO [MeasurementManager.sender-1] (rhq.core.pc.measurement.MeasurementCollectorRunner)- Measurement collection for [8] metrics took 2ms - sending report to Server...
2009-07-24 12:16:43,632 INFO [InventoryManager.availability-1] (rhq.core.pc.inventory.InventoryManager)- Sending availability report to Server...
2009-07-24 12:16:52,228 INFO [MeasurementManager.sender-1] (rhq.core.pc.measurement.MeasurementCollectorRunner)- Measurement collection for [5] metrics took 2ms - sending report to Server...
2009-07-24 12:17:43,687 INFO [InventoryManager.availability-1] (rhq.core.pc.inventory.InventoryManager)- Sending availability report to Server...
2009-07-24 12:17:47,803 INFO [Thread-7] (org.rhq.enterprise.agent.AgentShutdownHook)- {AgentShutdownHook.exit.shutting-down}Shutting down...
2009-07-24 12:17:47,803 INFO [Thread-7] (org.rhq.enterprise.agent.AgentMain)- {AgentMain.shutting-down}Agent is being shut down...
2009-07-24 12:17:47,804 INFO [RHQ Primary Server Switchover Thread] (org.rhq.enterprise.agent.AgentMain)- {PrimaryServerSwitchoverThread.stopped}The primary server switchover thread has stopped.
2009-07-24 12:17:47,805 INFO [Thread-7] (rhq.core.pc.content.ContentManager)- Shutting down Content Manager...
2009-07-24 12:17:47,806 INFO [MeasurementManager.collector-1] (rhq.core.pc.measurement.MeasurementManager)- Shutting down measurement collection...
2009-07-24 12:17:47,815 ERROR [ResourceContainer.invoker.daemon-247] (rhq.core.pc.event.EventManager)- Failed to remove poller with PollerKey[resourceId=502071, eventType=SnmpTrap] from thread pool.
2009-07-24 12:17:47,816 WARN [ResourceContainer.invoker.daemon-247] (plugins.platform.content.yum.YumServer)- Stop ignored: not running
2009-07-24 12:17:48,104 INFO [Thread-7] (org.rhq.enterprise.communications.ServiceContainer)- {ServiceContainer.shutting-down}Service container shutting down...
2009-07-24 12:17:48,122 INFO [Thread-7] (org.rhq.enterprise.communications.ServiceContainer)- {ServiceContainer.shutdown}Service container shut down - no longer accepting incoming commands
2009-07-24 12:17:48,122 INFO [Thread-7] (org.rhq.enterprise.agent.AgentMain)- {AgentMain.shut-down}Agent has been shut down
2009-07-24 12:17:48,126 INFO [Thread-7] (org.rhq.enterprise.agent.AgentShutdownHook)- {AgentShutdownHook.wait}The agent will wait for [0] threads to die
2009-07-24 12:17:48,127 INFO [Thread-7] (org.rhq.enterprise.agent.AgentShutdownHook)- {AgentShutdownHook.exit.shutdown-complete}Shutdown complete - agent will now exit.
-
18. Re: Few Issues
mazz Jul 24, 2009 1:12 PM (in response to jfrazier)I don't think it's a user doing it.
How exactly are you starting your agent?
I wonder if you did a "rhq-agent.sh &" and then exited the console, killing the process (i.e., you are not using nohup). Running the agent as a background process using "rhq-agent-wrapper.sh" is the way to run it as a background daemon process - it passes the --daemon arg by default and runs things in the background. I've not heard of this problem at all, much less when using that wrapper script.
I really have no explanation - nothing in the logs indicates anything is going wrong. I guess hunt around and look for reasons why the VM would trigger its shutdown hooks. I really think the OS is signaling the VM process for some reason - it's the only explanation I can think of.
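One concrete way to hunt for the sender, if the box has auditd available, is to audit kill() syscalls. This is only a sketch: the rules need root, and the key name "agent-signals" is made up, so the commands are shown as comments rather than run directly.

```shell
# Log every kill() syscall so the next unexpected shutdown records who sent
# the signal (run as root; "agent-signals" is an arbitrary search key):
#   auditctl -a always,exit -F arch=b64 -S kill -k agent-signals
# After the agent dies, search the audit log for the sending pid/uid/comm:
#   ausearch -k agent-signals -i
```

If a record shows up at the shutdown timestamp, it names the process and user that delivered the signal.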
I came across this:
http://www.wmusers.com/forum/showthread.php?t=4687
and this guy worked around what I think the problem is (that being Linux sending a signal - a kill signal or whatever - to the JVM). He used -Xrs - you can pass that in via RHQ_AGENT_ADDITIONAL_JAVA_OPTS. Might be worth a shot, although that might cause the shutdown hook to NEVER get called.
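For reference, the -Xrs experiment would look something like this. A sketch only: RHQ_AGENT_ADDITIONAL_JAVA_OPTS is the variable named above, and as noted, -Xrs may prevent the shutdown hooks from ever running on a normal kill.

```shell
# -Xrs tells the JVM to reduce its use of OS signal handlers, so an external
# signal may no longer trigger the shutdown hooks (clean shutdown included).
export RHQ_AGENT_ADDITIONAL_JAVA_OPTS="-Xrs"
# then start the agent as usual, e.g.:
# nohup ./rhq-agent.sh -d &
```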
This might be useful, I don't know: http://www.roseindia.net/javatutorials/switching_off_os_signals_at_runtime.shtml -
19. Re: Few Issues
jfrazier Jul 24, 2009 1:46 PM (in response to jfrazier)"mazz" wrote:
I don't think its a user doing it.
How exactly are you starting your agent?
I wonder if you did a "rhq-agent.sh &" and then exited the console, killing the process (i.e., you are not using nohup). Running the agent as a background process using "rhq-agent-wrapper.sh" is the way to run it as a background daemon process - it passes the --daemon arg by default and runs things in the background. I've not heard of this problem at all, much less when using that wrapper script.
I really have no explanation - nothing in the logs indicates anything is going wrong. I guess hunt around and look for reasons why the VM would trigger its shutdown hooks. I really think the OS is signaling the VM process for some reason - it's the only explanation I can think of.
I came across this:
http://www.wmusers.com/forum/showthread.php?t=4687
and this guy worked around what I think the problem is (that being Linux sending a signal - a kill signal or whatever - to the JVM). He used -Xrs - you can pass that in via RHQ_AGENT_ADDITIONAL_JAVA_OPTS. Might be worth a shot, although that might cause the shutdown hook to NEVER get called.
This might be useful, I don't know: http://www.roseindia.net/javatutorials/switching_off_os_signals_at_runtime.shtml
I was simply running it with "./rhq-agent.sh -d &". I ran it this way on our other servers as we are still testing all this out, and those servers have been fine. I just started it back up using the wrapper script, so we'll see how that does. I'll look into the other threads.
Thanks again for all your help so far. -
20. Re: Few Issues
mazz Jul 24, 2009 1:55 PM (in response to jfrazier)What if you log into your machine, run "rhq-agent.sh -d &" and then log out of that session?
Would the agent die?
In other words, would you need "nohup" to keep the agent up after you logged out?
"nohup rhq-agent.sh -d &"
BTW: I suggest reading the documentation on "Running the Jopr Agent" - it talks about running as a daemon among other things:
http://jopr.org/confluence/display/JOPR2/Running+the+Jopr+Agent -
21. Re: Few Issues
jfrazier Jul 24, 2009 3:34 PM (in response to jfrazier)No, the agent keeps running after closing my session.
I know about setting it up to run on startup, etc., but since we are still testing I didn't want to go through the trouble of doing that until we are sure we're going to stick with this (that it doesn't add too much overhead, etc.), as we had purchased a product called DynaTrace that was supposed to do all this but would bring our application down from the extra overhead.
We currently have two environments in Jopr. One env has two JBoss instances (two separate VMs); those agents are started the same way ("rhq-agent.sh -d &") and never have an issue - they have been running ~2 weeks straight.
This environment runs the same code / JBoss version / etc. and, it seems, can't stay up for 24 hours without exiting.
Resource has been UP since: 7/24/09, 10:45:29 AM, PDT
Availability: 46.652% Failures: 8
Down for: 2 days, 11 hours, 49 minutes
Up for: 2 days, 4 hours, 18 minutes
MTBF: 14 hours, 1 minute (Mean Time Between Failures over the known resource lifetime)
MTTR: 7 hours, 28 minutes (Mean Time To Recover from failure over the known resource lifetime)
Availability Start End Duration
UP Fri Jul 24 10:45:29 PDT 2009 1 hour, 46 minutes
DOWN Fri Jul 24 09:20:02 PDT 2009 Fri Jul 24 10:45:29 PDT 2009 1 hour, 25 minutes
UP Thu Jul 23 12:13:27 PDT 2009 Fri Jul 24 09:20:02 PDT 2009 21 hours, 6 minutes
DOWN Thu Jul 23 11:49:02 PDT 2009 Thu Jul 23 12:13:27 PDT 2009 24 minutes
UP Thu Jul 23 11:37:41 PDT 2009 Thu Jul 23 11:49:02 PDT 2009 11 minutes
DOWN Thu Jul 23 11:21:32 PDT 2009 Thu Jul 23 11:37:41 PDT 2009 16 minutes
UP Thu Jul 23 11:07:05 PDT 2009 Thu Jul 23 11:21:32 PDT 2009 14 minutes
DOWN Thu Jul 23 02:07:32 PDT 2009 Thu Jul 23 11:07:05 PDT 2009 8 hours, 59 minutes
UP Wed Jul 22 19:55:50 PDT 2009 Thu Jul 23 02:07:32 PDT 2009 6 hours, 11 minutes
DOWN Tue Jul 21 22:44:02 PDT 2009 Wed Jul 22 19:55:50 PDT 2009 21 hours, 11 minutes
UP Tue Jul 21 17:29:02 PDT 2009 Tue Jul 21 22:44:02 PDT 2009 5 hours, 15 minutes
DOWN Mon Jul 20 16:54:32 PDT 2009 Tue Jul 21 17:29:02 PDT 2009 1 day, 34 minutes
UP Mon Jul 20 13:17:41 PDT 2009 Mon Jul 20 16:54:32 PDT 2009 3 hours, 36 minutes
DOWN Mon Jul 20 10:24:02 PDT 2009 Mon Jul 20 13:17:41 PDT 2009 2 hours, 53 minutes
UP Mon Jul 20 08:28:31 PDT 2009 Mon Jul 20 10:24:02 PDT 2009 1 hour, 55 minutes
In all cases I launch the agent and then close my session, and the agent keeps running for random lengths of time. So far it hasn't died again since starting it using the wrapper, but again it's pretty random. -
23. Re: Few Issues
mazz Jul 24, 2009 4:15 PM (in response to jfrazier)What are your ulimits on the boxes in question?
Log in as the user running the agent, and post what you see when you execute "ulimit -a".
Run that command on the box where the agent disappears, and on the boxes where the agent is running fine. -
24. Re: Few Issues
jfrazier Jul 24, 2009 4:53 PM (in response to jfrazier)"mazz" wrote:
What are your ulimits on the boxes in question?
Log in as the user running the agent, and post what you see when you execute "ulimit -a".
Run that command on the box where the agent disappears, and on the boxes where the agent is running fine.
Working:
core file size (blocks, -c) 0
data seg size (kbytes, -d) unlimited
max nice (-e) 0
file size (blocks, -f) unlimited
pending signals (-i) 30720
max locked memory (kbytes, -l) 32
max memory size (kbytes, -m) unlimited
open files (-n) 2048
pipe size (512 bytes, -p) 8
POSIX message queues (bytes, -q) 819200
max rt priority (-r) 0
stack size (kbytes, -s) 65536
cpu time (seconds, -t) unlimited
max user processes (-u) 30720
virtual memory (kbytes, -v) unlimited
file locks (-x) unlimited
Not working:
core file size (blocks, -c) 0
data seg size (kbytes, -d) unlimited
scheduling priority (-e) 0
file size (blocks, -f) unlimited
pending signals (-i) 53248
max locked memory (kbytes, -l) 32
max memory size (kbytes, -m) unlimited
open files (-n) 65536
pipe size (512 bytes, -p) 8
POSIX message queues (bytes, -q) 819200
real-time priority (-r) 0
stack size (kbytes, -s) 65536
cpu time (seconds, -t) unlimited
max user processes (-u) 53248
virtual memory (kbytes, -v) unlimited
file locks (-x) unlimited
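A handy way to compare limits across boxes is to capture each one to a file and diff them. A sketch, with arbitrary file names:

```shell
# On each box, as the user that runs the agent, capture the limits:
ulimit -a > "/tmp/ulimit-$(uname -n).txt"
# Copy both files to one host, then diff to see every limit that differs:
# diff /tmp/ulimit-workingbox.txt /tmp/ulimit-brokenbox.txt
```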
The output from these boxes was slightly different and made me look at the kernel running on each:
Working: Linux pdxuat8app02vm.corp.unicru.com 2.6.18-128.1.1.el5 #1 SMP Mon Jan 26 13:58:24 EST 2009 x86_64 x86_64 x86_64 GNU/Linux
Not working: Linux pdxcfgapp802vm.corp.unicru.com 2.6.18-92.el5 #1 SMP Tue Apr 29 13:16:15 EDT 2008 x86_64 x86_64 x86_64 GNU/Linux
Would the minor kernel difference cause issues? -
25. Re: Few Issues
jfrazier Jul 27, 2009 11:23 AM (in response to jfrazier)Just by way of update: after starting it using the wrapper script, it hasn't exited on its own yet. I want to give it a bit more time, as it's just now Monday, but it's been online since Friday morning so far, which is the longest stretch.
So far it's looking like my being lazy while testing is what caused the random exits :) -
26. Re: Few Issues
jfrazier Jul 28, 2009 6:41 PM (in response to jfrazier)"jfrazier" wrote:
Just by way of update: after starting it using the wrapper script, it hasn't exited on its own yet. I want to give it a bit more time, as it's just now Monday, but it's been online since Friday morning so far, which is the longest stretch.
So far it's looking like my being lazy while testing is what caused the random exits :)
It's now most of the way through Tuesday and it still has not exited unexpectedly. I am going to conclude that it was probably the way I was starting it. Odd how that method works on some systems but not others, but at least it's resolved and was just me being lazy :)
Thanks again for all your help trying to figure this out mazz.