      • 15. Re: Few Issues
        mazz

        As expected, the VM health check thread isn't doing anything that would cause the VM to die.

        I wonder why your measurement collections are getting delayed.

        Are the clocks on the machines running your agents and servers synchronized, i.e. do they all have the same time? That wouldn't explain your agent dying out of the blue like this, but you do need your clocks in sync, and I'm not sure whether these measurement delays are a symptom of clock skew. See http://jopr.org/confluence/display/JOPR2/Jopr+Server+Installation+Preparation#JoprServerInstallationPreparation-Synchronizedmachineclocks
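
        For example, a quick way to check a box against your NTP server without touching the clock (a sketch; assumes ntpdate is installed, and the server name is a placeholder):

        # Query the offset only; does not step the clock:
        ntpdate -q ntp.example.com
        # Or simply compare UTC time across boxes by eye:
        date -u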

        • 16. Re: Few Issues

           

          "mazz" wrote:
          As expected, VM Health check thread isn't doing anything to cause the VM to die.

          I wonder why your measurement collections are getting delayed.

          Are the machines that are running your agents and servers synced? i.e. are their clocks synced so they have the same time? This would have no bearing on your agent just dying out of the blue like that, but, you do need to have your clocks synced. I'm not sure if these measurement delays are a symptom of that. See http://jopr.org/confluence/display/JOPR2/Jopr+Server+Installation+Preparation#JoprServerInstallationPreparation-Synchronizedmachineclocks


          Clocks are in sync; they sync to an internal NTP server every hour.
          I set up some alerts in the Jopr UI to notify me when someone issues a shutdown command. That won't tell me who did it, but it will tell me whether someone is messing with me here.


          • 17. Re: Few Issues

            It just went down again. I had two alerts set up on the RHQ agent in the UI:

            If Condition: Shutdown Agent INPROGRESS
            If Condition: Availability goes DOWN

            In this case only the second condition generated an alert. When I tested this after setting it up, initiating the shutdown from the UI fired both alerts instead of just the one.

            The agent itself is running as root as a background daemon, so if someone killed it with kill -9 as root it would just die silently in the logs instead of shutting down cleanly. 99% of the users who log in do so as a different user, which can't possibly kill the agent with the kill command.

            I still can't find anyone causing the agent to shut down. Where else can we look? The log still shows the same thing:

            2009-07-24 12:11:52,229 INFO [MeasurementManager.sender-1] (rhq.core.pc.measurement.MeasurementCollectorRunner)- Measurement collection for [8] metrics took 4ms - sending report to Server...
            2009-07-24 12:12:43,405 INFO [InventoryManager.availability-1] (rhq.core.pc.inventory.InventoryManager)- Sending availability report to Server...
            2009-07-24 12:12:52,229 INFO [MeasurementManager.sender-1] (rhq.core.pc.measurement.MeasurementCollectorRunner)- Measurement collection for [5] metrics took 2ms - sending report to Server...
            2009-07-24 12:13:43,457 INFO [InventoryManager.availability-1] (rhq.core.pc.inventory.InventoryManager)- Sending availability report to Server...
            2009-07-24 12:13:52,229 INFO [MeasurementManager.sender-1] (rhq.core.pc.measurement.MeasurementCollectorRunner)- Measurement collection for [235] metrics took 549ms - sending report to Server...
            2009-07-24 12:14:03,062 INFO [InventoryManager.discovery-1] (rhq.core.pc.inventory.AutoDiscoveryExecutor)- Executing server discovery scan...
            2009-07-24 12:14:03,195 WARN [ResourceDiscoveryComponent.invoker.daemon-139] (org.rhq.plugins.virt.VirtualizationDiscoveryComponent)- Can not load native library for libvirt: Could not initialize class org.rhq.plugins.virt.LibVirt
            2009-07-24 12:14:03,198 INFO [InventoryManager.discovery-1] (rhq.core.pc.inventory.AutoDiscoveryExecutor)- Process scan auto-detected new server resource: scan=[ProcessScan: query=[process|basename|match=sshd,process|basename|nomatch|parent=sshd], name=[SSHD]], discovered-process=[process: pid=[5478], name=[/usr/sbin/sshd], ppid=[1]]
            2009-07-24 12:14:03,353 INFO [ResourceDiscoveryComponent.invoker.daemon-139] (org.rhq.plugins.agent.AgentDiscoveryComponent)- Discovering RHQ Agent...
            2009-07-24 12:14:03,355 INFO [InventoryManager.discovery-1] (rhq.core.pc.inventory.AutoDiscoveryExecutor)- Process scan auto-detected new server resource: scan=[ProcessScan: query=[process|basename|match=^java.*,arg|org.jboss.Main|match=.*], name=[JBoss4]], discovered-process=[process: pid=[7030], name=[/vol/app/common/java/jdk1.6.0_12/bin/java], ppid=[7016]]
            2009-07-24 12:14:03,366 INFO [ResourceDiscoveryComponent.invoker.daemon-139] (org.rhq.plugins.cli.CliDiscoveryComponent)- Processing discovered CLI resources
            2009-07-24 12:14:03,369 INFO [InventoryManager.discovery-1] (rhq.core.pc.inventory.InventoryManager)- Sending server inventory report to Server...
            2009-07-24 12:14:03,421 INFO [InventoryManager.discovery-1] (rhq.core.pc.inventory.InventoryManager)- Syncing local inventory with Server inventory...
            2009-07-24 12:14:03,430 INFO [InventoryManager.discovery-1] (rhq.core.pc.inventory.AutoDiscoveryExecutor)- Found 0 servers.
            2009-07-24 12:14:43,521 INFO [InventoryManager.availability-1] (rhq.core.pc.inventory.InventoryManager)- Sending availability report to Server...
            2009-07-24 12:14:52,228 INFO [MeasurementManager.sender-1] (rhq.core.pc.measurement.MeasurementCollectorRunner)- Measurement collection for [5] metrics took 2ms - sending report to Server...
            2009-07-24 12:15:43,581 INFO [InventoryManager.availability-1] (rhq.core.pc.inventory.InventoryManager)- Sending availability report to Server...
            2009-07-24 12:15:52,227 INFO [MeasurementManager.sender-1] (rhq.core.pc.measurement.MeasurementCollectorRunner)- Measurement collection for [8] metrics took 2ms - sending report to Server...
            2009-07-24 12:16:43,632 INFO [InventoryManager.availability-1] (rhq.core.pc.inventory.InventoryManager)- Sending availability report to Server...
            2009-07-24 12:16:52,228 INFO [MeasurementManager.sender-1] (rhq.core.pc.measurement.MeasurementCollectorRunner)- Measurement collection for [5] metrics took 2ms - sending report to Server...
            2009-07-24 12:17:43,687 INFO [InventoryManager.availability-1] (rhq.core.pc.inventory.InventoryManager)- Sending availability report to Server...
            2009-07-24 12:17:47,803 INFO [Thread-7] (org.rhq.enterprise.agent.AgentShutdownHook)- {AgentShutdownHook.exit.shutting-down}Shutting down...
            2009-07-24 12:17:47,803 INFO [Thread-7] (org.rhq.enterprise.agent.AgentMain)- {AgentMain.shutting-down}Agent is being shut down...
            2009-07-24 12:17:47,804 INFO [RHQ Primary Server Switchover Thread] (org.rhq.enterprise.agent.AgentMain)- {PrimaryServerSwitchoverThread.stopped}The primary server switchover thread has stopped.
            2009-07-24 12:17:47,805 INFO [Thread-7] (rhq.core.pc.content.ContentManager)- Shutting down Content Manager...
            2009-07-24 12:17:47,806 INFO [MeasurementManager.collector-1] (rhq.core.pc.measurement.MeasurementManager)- Shutting down measurement collection...
            2009-07-24 12:17:47,815 ERROR [ResourceContainer.invoker.daemon-247] (rhq.core.pc.event.EventManager)- Failed to remove poller with PollerKey[resourceId=502071, eventType=SnmpTrap] from thread pool.
            2009-07-24 12:17:47,816 WARN [ResourceContainer.invoker.daemon-247] (plugins.platform.content.yum.YumServer)- Stop ignored: not running
            2009-07-24 12:17:48,104 INFO [Thread-7] (org.rhq.enterprise.communications.ServiceContainer)- {ServiceContainer.shutting-down}Service container shutting down...
            2009-07-24 12:17:48,122 INFO [Thread-7] (org.rhq.enterprise.communications.ServiceContainer)- {ServiceContainer.shutdown}Service container shut down - no longer accepting incoming commands
            2009-07-24 12:17:48,122 INFO [Thread-7] (org.rhq.enterprise.agent.AgentMain)- {AgentMain.shut-down}Agent has been shut down
            2009-07-24 12:17:48,126 INFO [Thread-7] (org.rhq.enterprise.agent.AgentShutdownHook)- {AgentShutdownHook.wait}The agent will wait for [0] threads to die
            2009-07-24 12:17:48,127 INFO [Thread-7] (org.rhq.enterprise.agent.AgentShutdownHook)- {AgentShutdownHook.exit.shutdown-complete}Shutdown complete - agent will now exit.
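
            For what it's worth, the AgentShutdownHook messages above mean the JVM ran its shutdown hooks, so a kill -9 can be ruled out: SIGKILL can't be caught, and the log would simply stop. A quick way to see the difference (a sketch; the pgrep pattern matches the agent's main class shown in the log above):

            # Find the agent's JVM by its main class:
            AGENT_PID=$(pgrep -f org.rhq.enterprise.agent.AgentMain)
            kill -TERM "$AGENT_PID"    # clean shutdown: hooks run, log shows "Shutting down..."
            # kill -KILL "$AGENT_PID"  # hard kill: the log just stops, no shutdown messages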


            • 18. Re: Few Issues
              mazz

              I don't think it's a user doing it.

              How exactly are you starting your agent?

              I wonder if you did a "rhq-agent.sh &" and then exited the console, killing the process (i.e. you are not using nohup). Running the agent via "rhq-agent-wrapper.sh" is the way to run it as a background daemon process: it passes the --daemon argument by default and runs everything in the background. I've never heard of this problem at all, much less when using that wrapper script.

              I really have no explanation; nothing in the logs indicates anything going wrong. I'd hunt around for reasons why the VM would trigger its shutdown hooks. I really think the OS is signaling the VM process for some reason; it's the only explanation I can think of.

              I came across this:

              http://www.wmusers.com/forum/showthread.php?t=4687

              That guy worked around what I think the problem is (Linux sending a signal, a kill signal or whatever, to the JVM). He used -Xrs; you can pass that in via RHQ_AGENT_ADDITIONAL_JAVA_OPTS. Might be worth a shot, although it might cause the shutdown hook to never get called.
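
              Something like this should do it (a sketch, not tested here; the env var is picked up by the agent launch scripts):

              # Tell the JVM to reduce its use of OS signals. Caveat from above:
              # with -Xrs, SIGTERM will no longer run the shutdown hooks.
              export RHQ_AGENT_ADDITIONAL_JAVA_OPTS="-Xrs"
              ./rhq-agent.sh -d &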

              This might be useful, I don't know: http://www.roseindia.net/javatutorials/switching_off_os_signals_at_runtime.shtml

              • 19. Re: Few Issues

                 

                "mazz" wrote:
                I don't think its a user doing it.

                How exactly are you starting your agent?

                I wonder if you did a "rhq-agent.sh &" and then the console is exited, killing the process (i.e. you are not doing a nohup). Running the agent as a background process using "rhq-agent-wrapper.sh" is the way to run it as a background daemon process - it passes the --daemon arg by default and runs things in background. I've not heard of this problem at all, much less when using that wrapper script.

                I really have no explanation - nothing in the logs indicate anything is going wrong. I guess hunt around and look for reasons why the VM would trigger shutdown hooks. I really think the OS is signaling the VM process for some reason - its the only explanation I can think of.

                I came across this:

                http://www.wmusers.com/forum/showthread.php?t=4687

                and this guy worked aroudn what I think the problem is (that being Linux sending a signal (kill signal or whatever) to the JVM. He used -Xrs - you can pass that in via RHQ_AGENT_ADDITIONAL_JAVA_OPTS. Might be worth a shot. although that might cause the shutdown hook to NEVER get called.

                this might be useful, I don't know: http://www.roseindia.net/javatutorials/switching_off_os_signals_at_runtime.shtml



                I was simply running it with "./rhq-agent.sh -d &". I ran it this way on our other servers, since we are still testing all this out, and those servers have been fine. I just started it back up using the wrapper script, so we'll see how that does. I'll look into the other threads.

                Thanks again for all your help so far.

                • 20. Re: Few Issues
                  mazz

                  What if you log into your machine, run "rhq-agent.sh -d &", and then log out of that session?

                  Would the agent die?

                  In other words, would you need nohup to keep the agent up after you log out?

                  "nohup rhq-agent.sh -d &"

                  BTW: I suggest reading the "Running the Jopr Agent" documentation; among other things, it covers running the agent as a daemon:

                  http://jopr.org/confluence/display/JOPR2/Running+the+Jopr+Agent

                  • 21. Re: Few Issues

                    No, the agent keeps running after I close my session.

                    I know about setting it up to run on startup, etc., but since we are still testing I didn't want to go through the trouble of doing that until we're sure we're going to stick with this (that it doesn't add too much overhead, etc.). We had purchased a product called DynaTrace that was supposed to do all this but would bring our application down from the extra overhead.

                    We currently have two environments in Jopr. One environment has two JBoss instances (two separate VMs), and those agents are started the same way ("rhq-agent.sh -d &"); they never have an issue and have been running for about two weeks straight.

                    This environment runs the same code / JBoss version / etc. and, it seems, can't stay up for 24 hours without exiting.

                    Resource has been UP since: 7/24/09, 10:45:29 AM, PDT
                    Availability: 46.652% Failures: 8
                    Down for: 2 days, 11 hours, 49 minutes
                    Up for: 2 days, 4 hours, 18 minutes
                    MTBF: 14 hours, 1 minute (Mean Time Between Failures over the known resource lifetime)
                    MTTR: 7 hours, 28 minutes (Mean Time To Recover from failure over the known resource lifetime)

                    Availability Start End Duration


                    UP Fri Jul 24 10:45:29 PDT 2009 1 hour, 46 minutes
                    DOWN Fri Jul 24 09:20:02 PDT 2009 Fri Jul 24 10:45:29 PDT 2009 1 hour, 25 minutes
                    UP Thu Jul 23 12:13:27 PDT 2009 Fri Jul 24 09:20:02 PDT 2009 21 hours, 6 minutes
                    DOWN Thu Jul 23 11:49:02 PDT 2009 Thu Jul 23 12:13:27 PDT 2009 24 minutes
                    UP Thu Jul 23 11:37:41 PDT 2009 Thu Jul 23 11:49:02 PDT 2009 11 minutes
                    DOWN Thu Jul 23 11:21:32 PDT 2009 Thu Jul 23 11:37:41 PDT 2009 16 minutes
                    UP Thu Jul 23 11:07:05 PDT 2009 Thu Jul 23 11:21:32 PDT 2009 14 minutes
                    DOWN Thu Jul 23 02:07:32 PDT 2009 Thu Jul 23 11:07:05 PDT 2009 8 hours, 59 minutes
                    UP Wed Jul 22 19:55:50 PDT 2009 Thu Jul 23 02:07:32 PDT 2009 6 hours, 11 minutes
                    DOWN Tue Jul 21 22:44:02 PDT 2009 Wed Jul 22 19:55:50 PDT 2009 21 hours, 11 minutes
                    UP Tue Jul 21 17:29:02 PDT 2009 Tue Jul 21 22:44:02 PDT 2009 5 hours, 15 minutes
                    DOWN Mon Jul 20 16:54:32 PDT 2009 Tue Jul 21 17:29:02 PDT 2009 1 day, 34 minutes
                    UP Mon Jul 20 13:17:41 PDT 2009 Mon Jul 20 16:54:32 PDT 2009 3 hours, 36 minutes
                    DOWN Mon Jul 20 10:24:02 PDT 2009 Mon Jul 20 13:17:41 PDT 2009 2 hours, 53 minutes
                    UP Mon Jul 20 08:28:31 PDT 2009 Mon Jul 20 10:24:02 PDT 2009 1 hour, 55 minutes


                    In all cases I launch the agent and then close my session, and the agent keeps running for random lengths of time. So far it hasn't died again since starting it with the wrapper, but again, it's pretty random.

                      • 23. Re: Few Issues
                        mazz

                        What are your ulimits on the boxes in question?

                        Log in as the user running the agent and post what you see when you execute "ulimit -a".

                        Run that command on the box where the agent disappears and on the boxes where the agent runs fine.
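
                        For example (a sketch; "rhq" is a placeholder for whatever user starts the agent):

                        # On each box, capture the limits a login shell for that user would get:
                        su - rhq -c 'ulimit -a'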

                        • 24. Re: Few Issues

                           

                          "mazz" wrote:
                          What are your ulimits on the boxes in question?

                          Login as the user running the agent, and post what you see when you execute "ulimit -a".

                          Run that command on the box where the agent disappears, and on the boxes where the agent is running fine.


                          Working:
                          core file size (blocks, -c) 0
                          data seg size (kbytes, -d) unlimited
                          max nice (-e) 0
                          file size (blocks, -f) unlimited
                          pending signals (-i) 30720
                          max locked memory (kbytes, -l) 32
                          max memory size (kbytes, -m) unlimited
                          open files (-n) 2048
                          pipe size (512 bytes, -p) 8
                          POSIX message queues (bytes, -q) 819200
                          max rt priority (-r) 0
                          stack size (kbytes, -s) 65536
                          cpu time (seconds, -t) unlimited
                          max user processes (-u) 30720
                          virtual memory (kbytes, -v) unlimited
                          file locks (-x) unlimited
                          


                          Not working:
                          core file size (blocks, -c) 0
                          data seg size (kbytes, -d) unlimited
                          scheduling priority (-e) 0
                          file size (blocks, -f) unlimited
                          pending signals (-i) 53248
                          max locked memory (kbytes, -l) 32
                          max memory size (kbytes, -m) unlimited
                          open files (-n) 65536
                          pipe size (512 bytes, -p) 8
                          POSIX message queues (bytes, -q) 819200
                          real-time priority (-r) 0
                          stack size (kbytes, -s) 65536
                          cpu time (seconds, -t) unlimited
                          max user processes (-u) 53248
                          virtual memory (kbytes, -v) unlimited
                          file locks (-x) unlimited
                          


                          The output from these was slightly different and made me look at the kernel running on each box:

                          Working:
                          Linux pdxuat8app02vm.corp.unicru.com 2.6.18-128.1.1.el5 #1 SMP Mon Jan 26 13:58:24 EST 2009 x86_64 x86_64 x86_64 GNU/Linux
                          



                          Not working:
                          Linux pdxcfgapp802vm.corp.unicru.com 2.6.18-92.el5 #1 SMP Tue Apr 29 13:16:15 EDT 2008 x86_64 x86_64 x86_64 GNU/Linux
                          



                          Would the minor kernel difference cause issues?

                          • 25. Re: Few Issues

                            Just by way of an update: since starting it with the wrapper script, it hasn't exited on its own yet. I want to give it a bit more time, as it's just now Monday, but it has been online since Friday morning, which is the longest stretch so far.

                            So far it's looking like my being lazy while testing is what caused the random exits :)

                            • 26. Re: Few Issues

                               

                              "jfrazier" wrote:
                              Just by way of update after starting it using the wrapper script it hasnt exited on its own yet. I want to give it a bit more as its just now monday but its been online since Friday morning so far which is the longest stretch.

                              So far its looking like my being lazy trying to test is what caused the random exits :)


                              It's now most of the way through Tuesday and the agent still has not exited unexpectedly. I am going to conclude that it was probably the way I was starting it. It's odd that that method works on some systems but not others, but at least it's resolved, and it was just me being lazy :)

                              Thanks again for all your help trying to figure this out, mazz.
