10 Replies Latest reply on Nov 15, 2013 12:22 PM by tsegismont

    Bug 1017961, which has to do with MBeans appearing down

    genman

      I've seen this a lot, especially with certain MBeans.

      [Attached screenshot: Screen Shot 2013-11-03 at 9.15.44 AM.png]

       

      The issue is documented here: https://bugzilla.redhat.com/show_bug.cgi?id=1017961

       

      I'm not sure this causes trouble for the storage node itself, but I've had a lot of reliability issues with storage node maintenance.

        • 1. Re: Bug 1017961, which has to do with MBeans appearing down
          tsegismont

          Hi Elias,

           

          I finally found a way to reproduce your issue.

           

          I tried with a Tomcat server in inventory. To trigger it, Tomcat needs to be back up *BEFORE* the agent notices the "UP -> DOWN" availability change. After some time (depending on how the availability scan interval is configured), the nested resources come back to the UP state.

           

          Can you confirm?

           

          I think it happens when:

          * the managed server goes down

          * an availability check is executed for a nested resource

          * an availability check is executed for the top server resource
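
One possible reading of the sequence above, as a toy model. This is a sketch under assumptions: `ManagedServer`, `Avail` and `simulate` are hypothetical names, not RHQ classes, and the model also assumes the server is back up before the top-level check runs, as described earlier in the post.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy model of the suspected ordering problem; none of these types are RHQ classes. */
class AvailabilityOrderingSketch {

    enum Avail { UP, DOWN }

    /** Hypothetical stand-in for a managed server that can be stopped and restarted. */
    static class ManagedServer {
        boolean running = true;
    }

    static Avail check(ManagedServer server) {
        return server.running ? Avail.UP : Avail.DOWN;
    }

    /** Replays the three steps and returns what the agent would record, in order. */
    static List<String> simulate() {
        ManagedServer server = new ManagedServer();
        List<String> recorded = new ArrayList<>();

        // 1. The managed server goes down.
        server.running = false;

        // 2. A nested resource is checked while the server is still down.
        recorded.add("nested=" + check(server));

        // Assumption: the server is restarted before the agent checks the top resource.
        server.running = true;

        // 3. The top server resource is checked and looks as if it never went down.
        recorded.add("server=" + check(server));
        return recorded;
    }

    public static void main(String[] args) {
        System.out.println(simulate()); // prints [nested=DOWN, server=UP]
    }
}
```

In this model the agent never records a DOWN for the server itself, so nothing prompts it to re-check the nested resource until the next scheduled scan.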

           

          How far have you got with your patch? You can attach it (even untested) to the bug report.

           

          Thanks for tracking this. Nice catch!

          Thomas

          • 2. Re: Bug 1017961, which has to do with MBeans appearing down
            genman

            Yes: the agent has to be up, the managed server has to be restarted, and some random ordering of the checks is involved. I created a test harness, but it couldn't reproduce the problem on OS X. I think you are right about the ordering issue. I didn't realize the ordering of the checks mattered.

             

            I want to test the patch for a bit. I also need to get it looked at by my company, which takes a few weeks sometimes.

             

            The one downside of my patch is that the EmsBeans are not cached and are looked up each time. (They are cached anyway by the server; only the Map lookup is done on each check.) I don't think the caching is worth the extra risk.
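
To illustrate the "look it up on every check" idea, here is a minimal sketch. It uses only the standard `javax.management` API against the local platform MBean server, not the EMS classes the JMX plugin actually uses, and the class and method names are made up for the example. It is not the patch discussed in this thread.

```java
import java.lang.management.ManagementFactory;
import javax.management.MBeanServerConnection;
import javax.management.ObjectName;

/** Sketch only: plain JMX, no EMS or RHQ classes. */
class UncachedAvailabilitySketch {

    /**
     * Resolves the MBean by name on every call instead of keeping a bean
     * reference obtained earlier. Any failure counts as "not available".
     */
    static boolean isAvailable(MBeanServerConnection connection, String objectName) {
        try {
            return connection.isRegistered(new ObjectName(objectName));
        } catch (Exception e) {
            // Malformed name, I/O error on a remote connection, and so on.
            return false;
        }
    }

    public static void main(String[] args) {
        MBeanServerConnection local = ManagementFactory.getPlatformMBeanServer();
        System.out.println(isAvailable(local, "java.lang:type=Runtime"));     // prints true
        System.out.println(isAvailable(local, "example:type=DoesNotExist")); // prints false
    }
}
```

Because nothing is held between checks, there is no stale reference to go wrong after the managed server restarts. The cost is one name lookup per check.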

            • 3. Re: Bug 1017961, which has to do with MBeans appearing down
              tsegismont

              Elias,

               

              Thanks for your answer. I think we can keep caching if we refresh the EmsConnection when the availability check fails. I'll push a fix soon and will tell you about it.

               

              Cheers,

              Thomas

              • 4. Re: Bug 1017961, which has to do with MBeans appearing down
                genman

                I tried doing a 'refresh' on EmsConnection--it wasn't working right. It also has the unintended side-effect of tossing out your entire cache if one MBean really is gone, so the performance is likely worse than before in many circumstances.
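
The cost of discarding the whole cache can be shown with a toy model. This is a sketch under assumptions: the cache, the "remote lookup" counter and all names are invented for the example, and no EMS or RHQ classes are involved. It compares clearing every cached handle on a failed check with forgetting only the bean that failed.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Toy model only: no EMS or RHQ classes are used here. */
class CacheRefreshCostSketch {

    private final Set<String> registered;                      // beans that exist remotely
    private final Map<String, String> cache = new HashMap<>(); // name -> cached handle
    private final boolean clearAllOnFailure;
    int remoteLookups = 0;                                     // simulated round trips

    CacheRefreshCostSketch(Set<String> registered, boolean clearAllOnFailure) {
        this.registered = registered;
        this.clearAllOnFailure = clearAllOnFailure;
    }

    boolean isAvailable(String name) {
        String handle = cache.get(name);
        if (handle == null) {
            remoteLookups++;                                   // cache miss: ask the server
            handle = registered.contains(name) ? name : null;
            if (handle != null) {
                cache.put(name, handle);
            }
        }
        boolean up = handle != null && registered.contains(handle);
        if (!up) {
            if (clearAllOnFailure) {
                cache.clear();                                 // "refresh": every handle is lost
            } else {
                cache.remove(name);                            // forget only the failed bean
            }
        }
        return up;
    }

    /** Runs the given number of scans over a fixed inventory and returns the lookup count. */
    static int lookupsFor(boolean clearAllOnFailure, int scans) {
        Set<String> registered = new HashSet<>(Arrays.asList("a", "b"));
        List<String> inventory = Arrays.asList("a", "b", "gone");
        CacheRefreshCostSketch sketch = new CacheRefreshCostSketch(registered, clearAllOnFailure);
        for (int i = 0; i < scans; i++) {
            for (String name : inventory) {
                sketch.isAvailable(name);
            }
        }
        return sketch.remoteLookups;
    }

    public static void main(String[] args) {
        System.out.println("clear all:   " + lookupsFor(true, 2));  // 6 lookups
        System.out.println("single bean: " + lookupsFor(false, 2)); // 4 lookups
    }
}
```

With one bean that really is gone, the clear-everything strategy ends up re-resolving every bean on every scan, which matches the performance concern raised above.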

                 

                Having reviewed the code, I see a lot of cruft that needs to be removed.

                 

                Simpuru izu besto, as the Japanese would say.

                • 5. Re: Bug 1017961, which has to do with MBeans appearing down
                  tsegismont

                  Hi Elias,

                   

                  The issue I found with Tomcat turned out to be something unrelated. I created another BZ to track it:

                  Bug 1029373 - "Tomcat Web Application (WAR)" components stay down when server comes back up


                  So I'm stuck again because I can't reproduce your issue with the Storage Node. You mentioned Flume as well. How do you monitor it? With a custom plugin based on the JMX plugin?


                  Regards,

                  Thomas

                  • 6. Re: Bug 1017961, which has to do with MBeans appearing down
                    genman

                    Flume uses the JMX plugin as base, yes. I've seen the same issue with any component that uses the JMX plugin, including the storage nodes as you can see in the original post.

                     

                    As for Tomcat, my fix works for the Tomcat HTTP connector, which was appearing down. As for the .war component appearing down, that may be a separate issue like you say.

                     

                    My issue 1017961 may be a problem specific to the version of Linux distro (EL6) or JVM (1.6.0_38) I'm using. Still, my fix works well and improves the code.

                     

                    I hope you will consider https://bugzilla.redhat.com/show_bug.cgi?id=971615 as well. I'm getting tired of having to port my fixes to each subsequent RHQ release.

                    • 7. Re: Bug 1017961, which has to do with MBeans appearing down
                      tsegismont

                      genman wrote:

                      Flume uses the JMX plugin as base, yes. I've seen the same issue with any component that uses the JMX plugin, including the storage nodes as you can see in the original post.

                      As for Tomcat, my fix works for the Tomcat HTTP connector, which was appearing down. As for the .war component appearing down, that may be a separate issue like you say.

                      My issue 1017961 may be a problem specific to the version of Linux distro (EL6) or JVM (1.6.0_38) I'm using. Still, my fix works well and improves the code.

                      I understand that your fix works, but I need to be able to reproduce the problem. That's why I was asking about Flume. Are you using the Oracle VM or OpenJDK? I will try again with Java 6.

                      genman wrote:

                      I hope you also consider https://bugzilla.redhat.com/show_bug.cgi?id=971615 as well. I'm getting tired of having to port my fixes to each subsequent RHQ release.

                      Mazz has recently re-targeted BZ971615 to RHQ 4.10. I think there is a good chance it will be fixed in the next RHQ release.

                       

                      I hope you will keep on reporting issues and providing patches as you already do. You're a great contributor to our community and we would all be sad to see you move away.

                       

                      Thomas

                      • 8. Re: Bug 1017961, which has to do with MBeans appearing down
                        genman

                        -bash-4.1$ java -version

                        java version "1.6.0_38"

                        Java(TM) SE Runtime Environment (build 1.6.0_38-b05)

                        Java HotSpot(TM) 64-Bit Server VM (build 20.13-b02, mixed mode)

                         

                        It's the Oracle version.

                        • 9. Re: Bug 1017961, which has to do with MBeans appearing down
                          tsegismont

                          Elias,

                           

                          A quick update: I was able to reproduce your problem with a Tomcat server on a Linux machine running OpenJDK6:

                           

                          2013-11-14 17:41:39,740 WARN  [InventoryManager.availability-1] (rhq.core.pc.inventory.AvailabilityExecutor)- Availability collection failed with exception on Resource[id=10206, uuid=c24df09c-0ee0-4171-bce0-e2b6a4721493, type={Tomcat}Memory Pool, key=java.lang:name=CMS Perm Gen,type=MemoryPool, name=CMS Perm Gen, parent=Memory Subsystem], availability will be reported as DOWN
                          java.lang.reflect.UndeclaredThrowableException
                                  at sun.proxy.$Proxy113.isRegistered(Unknown Source)
                                  at org.mc4j.ems.impl.jmx.connection.bean.DMBean.isRegistered(DMBean.java:188)
                                  at org.rhq.plugins.jmx.MBeanResourceComponent.isMBeanAvailable(MBeanResourceComponent.java:242)
                                  at org.rhq.plugins.jmx.MBeanResourceComponent.getAvailability(MBeanResourceComponent.java:229)
                                  at sun.reflect.GeneratedMethodAccessor51.invoke(Unknown Source)
                                  at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
                                  at java.lang.reflect.Method.invoke(Method.java:616)
                                  at org.rhq.core.pc.inventory.ResourceContainer$ComponentInvocation.call(ResourceContainer.java:654)
                                  at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:334)
                                  at java.util.concurrent.FutureTask.run(FutureTask.java:166)
                                  at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1146)
                                  at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
                                  at java.lang.Thread.run(Thread.java:679)
                          Caused by: java.rmi.ConnectException: Connection refused to host: 192.168.13.13; nested exception is:
                                  java.net.ConnectException: Connection refused
                                  at sun.rmi.transport.tcp.TCPEndpoint.newSocket(TCPEndpoint.java:619)
                                  at sun.rmi.transport.tcp.TCPChannel.createConnection(TCPChannel.java:216)
                                  at sun.rmi.transport.tcp.TCPChannel.newConnection(TCPChannel.java:202)
                                  at sun.rmi.server.UnicastRef.invoke(UnicastRef.java:128)
                                  at com.sun.jmx.remote.internal.PRef.invoke(Unknown Source)
                                  at javax.management.remote.rmi.RMIConnectionImpl_Stub.isRegistered(Unknown Source)
                                  at javax.management.remote.rmi.RMIConnector$RemoteMBeanServerConnection.isRegistered(RMIConnector.java:847)
                                  at sun.reflect.GeneratedMethodAccessor91.invoke(Unknown Source)
                                  at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
                                  at java.lang.reflect.Method.invoke(Method.java:616)
                                  at org.mc4j.ems.impl.jmx.connection.support.providers.proxy.JMXRemotingMBeanServerProxy.invoke(JMXRemotingMBeanServerProxy.java:59)
                                  ... 13 more
                          Caused by: java.net.ConnectException: Connection refused
                                  at java.net.PlainSocketImpl.socketConnect(Native Method)
                                  at java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:327)
                                  at java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:193)
                                  at java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:180)
                                  at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:385)
                                  at java.net.Socket.connect(Socket.java:546)
                                  at java.net.Socket.connect(Socket.java:495)
                                  at java.net.Socket.<init>(Socket.java:392)
                                  at java.net.Socket.<init>(Socket.java:206)
                                  at sun.rmi.transport.proxy.RMIDirectSocketFactory.createSocket(RMIDirectSocketFactory.java:40)
                                  at sun.rmi.transport.proxy.RMIMasterSocketFactory.createSocket(RMIMasterSocketFactory.java:146)
                                  at sun.rmi.transport.tcp.TCPEndpoint.newSocket(TCPEndpoint.java:613)
                                  ... 23 more
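
The trace above nests a `java.net.ConnectException` inside a `java.rmi.ConnectException`, which is in turn wrapped in an `UndeclaredThrowableException` by the proxy. As a minimal sketch, a caller could tell "connection refused" apart from other failures by walking the cause chain. The class and method names are hypothetical, and this is not code from the JMX plugin.

```java
import java.lang.reflect.UndeclaredThrowableException;

/** Sketch only: a helper for inspecting wrapped exceptions. */
class ConnectionRefusedSketch {

    /** Walks the cause chain and reports whether a java.net.ConnectException is in it. */
    static boolean causedByRefusedConnection(Throwable t) {
        int depth = 0;
        // The depth limit guards against a pathological, self-referencing cause chain.
        for (Throwable cause = t; cause != null && depth < 32; cause = cause.getCause(), depth++) {
            if (cause instanceof java.net.ConnectException) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // Rebuild the same nesting as in the log: proxy wrapper -> RMI -> socket.
        Exception socket = new java.net.ConnectException("Connection refused");
        Exception rmi = new java.rmi.ConnectException("Connection refused to host: 192.168.13.13", socket);
        Throwable top = new UndeclaredThrowableException(rmi);

        System.out.println(causedByRefusedConnection(top));                                // prints true
        System.out.println(causedByRefusedConnection(new IllegalStateException("other"))); // prints false
    }
}
```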
                          
                          

                           

                          I hope to close the bug by tomorrow.

                           

                          Regards,

                          Thomas

                          • 10. Re: Bug 1017961, which has to do with MBeans appearing down
                            tsegismont

                            Hi Elias,

                             

                            Please have a look at https://bugzilla.redhat.com/show_bug.cgi?id=1017961#c5 and following. Let's continue the discussion there.

                             

                            Regards,