4 Replies Latest reply on Sep 16, 2013 3:04 PM by genman

    RHQ 4.9 bugs and quirks

    genman
      20:13:09,518 WARN  [com.arjuna.ats.arjuna] (Transaction Reaper Worker 28) ARJUNA012113: TransactionReaper::doCancellations worker Thread[Transaction Reaper Worker 28,5,main] missed interrupt when cancelling TX 0:ffff11b0d33f:-3da8ead:52324661:11297e7 -- exiting as zombie (zombie count decremented to 4)
      20:13:09,519 WARN  [com.arjuna.ats.arjuna] (http-/0.0.0.0:7080-8) ARJUNA012077: Abort called on already aborted atomic action 0:ffff11b0d33f:-3da8ead:52324661:11297e7
      20:13:09,519 ERROR [org.jboss.as.ejb3.invocation] (http-/0.0.0.0:7080-8) JBAS014134: EJB Invocation failed on component MeasurementOOBManagerBean for method public abstract org.rhq.core.domain.util.PageList org.rhq.enterprise.server.measurement.MeasurementOOBManagerLocal.getHighestNOOBsForResource(org.rhq.core.domain.auth.Subject,int,int): javax.ejb.EJBTransactionRolledbackException: Transaction rolled back
      

       

      I'm seeing the above when I click on the metric tab and change the time range (1h to 1d, for example): the timeline renders incorrectly and the errors above show up.

       

      This is on Chrome, and does seem to happen on Safari.


      Also, I don't know whether the "Get Live" value is working, or what it does anymore. It is useful, so I hope it stays.


      I also have gotten into a state where the storage node doesn't return any metrics at all. The components are up but no metrics are being returned.


      Also, this hangs:

       

      $ ./rhqctl stop --agent
      20:21:45,447 INFO  [org.jboss.modules] JBoss Modules version 1.2.0.CR1
      Stopping RHQ Agent...
      RHQ Agent (pid=352) is stopping...
      

       

      The agent seems to wait for a thread that's a scheduled executor:

       

      "pool-3-thread-1" prio=10 tid=0x00007fe78c4c0800 nid=0x192 waiting on condition [0x00007fe788126000]
         java.lang.Thread.State: TIMED_WAITING (parking)
              at sun.misc.Unsafe.park(Native Method)
              - parking to wait for  <0x00000000e1309b98> (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
              at java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:226)
              at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(AbstractQueuedSynchronizer.java:2082)
              at java.util.concurrent.ScheduledThreadPoolExecutor$DelayedWorkQueue.take(ScheduledThreadPoolExecutor.java:1090)
              at java.util.concurrent.ScheduledThreadPoolExecutor$DelayedWorkQueue.take(ScheduledThreadPoolExecutor.java:807)
              at java.util.concurrent.ThreadPoolExecutor.getTask(ThreadPoolExecutor.java:1068)
              at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1130)
              at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
              at java.lang.Thread.run(Thread.java:724)
      

       

      Not sure where this is coming from. Maybe this?

       

      modules/core/native-system/src/main/java/org/rhq/core/system/SigarAccessHandler.java
          SigarAccessHandler(SigarFactory sigarFactory) {
              this.sigarFactory = sigarFactory;
              sharedSigarLock = new ReentrantLock();
              localSigarLock = new ReentrantLock();
              scheduledExecutorService = Executors.newSingleThreadScheduledExecutor();
              scheduledExecutorService.scheduleWithFixedDelay(new ThresholdChecker(), 1, 5, MINUTES);
              localSigarInstancesCount = 0;
          }
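
For context on why an executor like this could hold up shutdown: the worker thread created by `Executors.newSingleThreadScheduledExecutor()` is non-daemon by default, and between runs it parks in `DelayedWorkQueue.take`, which matches the thread dump above. A minimal sketch, independent of RHQ and assuming Java 8 or later for the lambdas (the class name `ExecutorThreadDemo` and the no-op task are made up for illustration):

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicReference;

// Hypothetical demo class, not part of RHQ.
class ExecutorThreadDemo {

    // Runs a trivial task on the executor and reports whether its worker thread is a daemon.
    public static boolean workerIsDaemon(ScheduledExecutorService executor) throws Exception {
        AtomicReference<Thread> worker = new AtomicReference<>();
        executor.submit(() -> worker.set(Thread.currentThread())).get();
        return worker.get().isDaemon();
    }

    public static void main(String[] args) throws Exception {
        ScheduledExecutorService executor = Executors.newSingleThreadScheduledExecutor();
        // Same scheduling shape as in the constructor above, with a no-op task.
        executor.scheduleWithFixedDelay(() -> { }, 1, 5, TimeUnit.MINUTES);

        System.out.println("worker is daemon: " + workerIsDaemon(executor));

        // Without this call the idle worker thread keeps the JVM alive.
        executor.shutdownNow();
        System.out.println("terminated: " + executor.awaitTermination(5, TimeUnit.SECONDS));
    }
}
```

If the `shutdownNow()` line is removed, the demo should not exit on its own, which would be the same symptom as the agent waiting on `pool-3-thread-1`.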
      
        • 1. Re: RHQ 4.9 bugs and quirks
          pathduck

          Yeah, I've been seeing the same hangs when stopping the Agent in 4.9. Was hoping it would be something obvious that would be ok in Final. Will check on my side once I get 4.9 Final reinstalled.

          • 2. Re: RHQ 4.9 bugs and quirks
            tsegismont

            Hi Elias,

             

            I'm not sure about the first part of your question. We'll have a closer look.

             

            With respect to the agent shutdown issue, you may be right. The scheduled executor is shut down in org.rhq.core.system.SigarAccessHandler#close:

             

                void close() {
                    if (sharedSigar != null) {
                        sharedSigarLock.lock();
                        try {
                            sharedSigar.close();
                            sharedSigar = null;
                        } finally {
                            sharedSigarLock.unlock();
                        }
                    }
                    scheduledExecutorService.shutdownNow();
                }
            

             

            The problem is that org.rhq.core.system.SigarAccessHandler#close never gets called when the agent goes down.

             

            I created this BZ to track the issue: https://bugzilla.redhat.com/show_bug.cgi?id=1008570

             

            Thanks for your feedback!

             

            Thomas

            • 3. Re: RHQ 4.9 bugs and quirks
              tsegismont

              Elias,

               

              Note that the agent should not hang forever. After one minute or so, the JVM should terminate with messages like these:

               

              2013-09-16 18:06:25,569 INFO  [Thread-6] (org.rhq.enterprise.agent.AgentShutdownHook)- {AgentShutdownHook.wait}The agent will wait for [1] threads to die
              2013-09-16 18:06:35,571 INFO  [Thread-6] (org.rhq.enterprise.agent.AgentShutdownHook)- {AgentShutdownHook.wait}The agent will wait for [1] threads to die
              2013-09-16 18:06:45,573 INFO  [Thread-6] (org.rhq.enterprise.agent.AgentShutdownHook)- {AgentShutdownHook.wait}The agent will wait for [1] threads to die
              2013-09-16 18:06:55,574 INFO  [Thread-6] (org.rhq.enterprise.agent.AgentShutdownHook)- {AgentShutdownHook.wait}The agent will wait for [1] threads to die
              2013-09-16 18:07:05,576 INFO  [Thread-6] (org.rhq.enterprise.agent.AgentShutdownHook)- {AgentShutdownHook.wait}The agent will wait for [1] threads to die
              2013-09-16 18:07:15,578 INFO  [Thread-6] (org.rhq.enterprise.agent.AgentShutdownHook)- {AgentShutdownHook.wait}The agent will wait for [1] threads to die
              2013-09-16 18:07:25,580 INFO  [Thread-6] (org.rhq.enterprise.agent.AgentShutdownHook)- {AgentShutdownHook.wait}The agent will wait for [1] threads to die
              2013-09-16 18:07:35,582 INFO  [Thread-6] (org.rhq.enterprise.agent.AgentShutdownHook)- {AgentShutdownHook.wait}The agent will wait for [1] threads to die
              2013-09-16 18:07:45,584 INFO  [Thread-6] (org.rhq.enterprise.agent.AgentShutdownHook)- {AgentShutdownHook.wait}The agent will wait for [1] threads to die
              2013-09-16 18:07:55,585 INFO  [Thread-6] (org.rhq.enterprise.agent.AgentShutdownHook)- {AgentShutdownHook.wait}The agent will wait for [1] threads to die
              2013-09-16 18:08:05,586 INFO  [Thread-6] (org.rhq.enterprise.agent.AgentShutdownHook)- {AgentShutdownHook.no-more-wait}[1] threads are not dying - agent will not wait anymore
              2013-09-16 18:08:05,586 INFO  [Thread-6] (org.rhq.enterprise.agent.AgentShutdownHook)- {AgentShutdownHook.threads-still-alive}There are still [1] threads left - the kill thread will exit the VM shortly if these threads do not die
              2013-09-16 18:08:05,586 INFO  [Thread-6] (org.rhq.enterprise.agent.AgentShutdownHook)- {AgentShutdownHook.exit.shutdown-complete}Shutdown complete - agent will now exit.
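
The log pattern above (one message every ten seconds, then giving up) is consistent with a bounded wait on the remaining threads. A rough sketch of that general pattern follows. It is not the actual AgentShutdownHook code: the class name `BoundedWaitDemo`, the method `waitForThread`, and the attempt count and interval are all assumptions chosen for illustration.

```java
// Hypothetical demo class, not part of RHQ.
class BoundedWaitDemo {

    // Waits for the thread to die, checking up to maxAttempts times.
    // intervalMillis must be > 0, because Thread.join(0) would wait forever.
    // Returns true if the thread died, false if we gave up.
    public static boolean waitForThread(Thread thread, int maxAttempts, long intervalMillis)
            throws InterruptedException {
        for (int attempt = 0; attempt < maxAttempts && thread.isAlive(); attempt++) {
            System.out.println("Waiting for thread to die: " + thread.getName());
            thread.join(intervalMillis);
        }
        return !thread.isAlive();
    }

    public static void main(String[] args) throws InterruptedException {
        // A thread that stays parked, standing in for the idle executor worker.
        Thread stuck = new Thread(() -> {
            try {
                Thread.sleep(60_000);
            } catch (InterruptedException ignored) {
                // interrupted: just let the thread end
            }
        }, "stuck-worker");
        stuck.setDaemon(true); // daemon only so that this demo itself can exit
        stuck.start();

        if (!waitForThread(stuck, 3, 100)) {
            System.out.println("Thread is not dying - will not wait anymore");
            // A real shutdown hook might force the VM down at this point.
        }
    }
}
```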
              

               

              Thomas

              • 4. Re: Re: RHQ 4.9 bugs and quirks
                genman

                It sounds like the UI quirk was (maybe) fixed in BZ 1000175. It's not a big deal.

                 

                I've gotten the storage node stuff working better now. One problem I ran into is that adding two storage nodes at a time doesn't seem to work at all. I basically ended up hanging my cluster. (But I haven't really tried reproducing this scenario in order to file a bug.)

                 

                As for the agent shutdown getting stuck, I'm doing this:

                 

                diff --git a/modules/core/plugin-container/src/main/java/org/rhq/core/pc/PluginContainer.java b/modules/core/plugin-container/src/main/java/org/rhq/core/pc/PluginContainer.java
                index 4f6a9a1..8b78f02 100644
                --- a/modules/core/plugin-container/src/main/java/org/rhq/core/pc/PluginContainer.java
                +++ b/modules/core/plugin-container/src/main/java/org/rhq/core/pc/PluginContainer.java
                @@ -442,6 +443,7 @@ public boolean shutdown() {
                            started = false;
                            shuttingDown = false;
                
                +            SigarAccess.close();
                            log.info("Plugin container is now shutdown.");
                
                            // we typically do not want to do this if embedded somewhere other than the Agent VM
                
                

                 

                It works (fixes the problem), but I'm not sure whether it really makes sense to close it here. This may break tests, or anything else that recycles the container.
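
A different approach that would sidestep the recycling question is to give the executor a daemon thread, so it can never block JVM exit whether or not close is called. This is only a sketch of that alternative, not what the patch above does and not existing RHQ code; the class `DaemonExecutorDemo`, the helper `newDaemonScheduler`, and the thread name are invented for illustration.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.ThreadFactory;

// Hypothetical demo class, not part of RHQ.
class DaemonExecutorDemo {

    // Builds a single-thread scheduler whose worker is a daemon thread.
    public static ScheduledExecutorService newDaemonScheduler(final String threadName) {
        ThreadFactory factory = runnable -> {
            Thread thread = new Thread(runnable, threadName);
            thread.setDaemon(true); // daemon threads do not keep the JVM alive
            return thread;
        };
        return Executors.newSingleThreadScheduledExecutor(factory);
    }

    public static void main(String[] args) throws Exception {
        ScheduledExecutorService scheduler = newDaemonScheduler("threshold-checker");
        Callable<Boolean> check = () -> Thread.currentThread().isDaemon();
        System.out.println("worker is daemon: " + scheduler.submit(check).get());
        // No shutdownNow() here: the JVM can still exit because the worker is a daemon.
    }
}
```

The trade-off is that a daemon thread can be cut off mid-task when the JVM exits, so this would only suit periodic work that is safe to abandon.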