18 Replies Latest reply on Mar 21, 2005 7:13 AM by dimitris

    JBoss HeartWatch service

    ivelin.ivanov

       

      "ivelin" wrote:

      Here is a task that Scott and Bela thought would be good to do.

      Implement a JMX HeartWatch service, which monitors the basic system resources - CPU Utilization and Memory. If any of them is starved for a long time, broadcast a JMX notification.
      Two helper services will be interested in these notifications:
      - System restart service, which will ask the JBoss kernel to unload all modules and redeploy from scratch.
      - Email notifier, which will use the mailer service to send a message to the administrator.
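      A minimal sketch of the notification flow described above, using the standard javax.management classes. The class names and the notification type strings are assumptions made for illustration, and the listener only records messages where a real e-mail notifier would call the mailer service.

```java
import javax.management.Notification;
import javax.management.NotificationBroadcasterSupport;
import javax.management.NotificationListener;
import java.util.ArrayList;
import java.util.List;

// Hypothetical broadcaster; the name and the type strings are illustrative only.
class HeartWatchBroadcaster extends NotificationBroadcasterSupport {
    static final String CPU_STARVED = "heartwatch.cpu.starved";
    static final String MEMORY_STARVED = "heartwatch.memory.starved";

    private long sequence = 0;

    // Broadcast one notification of the given type to every registered listener.
    synchronized void reportStarvation(String type, String message) {
        sendNotification(new Notification(
                type, this, ++sequence, System.currentTimeMillis(), message));
    }
}

// Stand-in for the e-mail notifier: it records what it would have sent.
class RecordingListener implements NotificationListener {
    final List<String> received = new ArrayList<>();

    @Override
    public void handleNotification(Notification notification, Object handback) {
        received.add(notification.getType() + ": " + notification.getMessage());
    }
}
```

      A restart service would register a second listener on the same broadcaster, for example with addNotificationListener(listener, null, null), and react only to the types it cares about.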

      The use cases for this service are production systems which have a slow memory leak or a rare spike scenario that is hard to reproduce in the lab. The heartwatch service will keep these systems going while the problem is being identified and resolved. Apache HTTPD and ASP.NET offer a similar service.

      Here are the pieces that need to be implemented:

      1) Scheduling-based CPU estimate. Schedule a regular heartbeat task which will measure the time between two runs. If the delay is over the scheduled interval for a prolonged (configurable) time, then broadcast a JMX notification.

      2) Memory monitor. A similarly scheduled task which measures the available memory and, if it approaches a certain limit, sends a Warning JMX notification. If it reaches a critical limit, it sends an Alarm notification. The latter will probably cause the kernel to redeploy all modules.

      3) Out-Of-Memory life saver. A soft-referenced buffer of 100K (exact size to be determined), which will give enough room for the kernel to restart the modules in case of memory starvation.

      4) Server restart service. Should use the Server.shutdown() method and then start(). Some refactoring of the ServerImpl may need to occur to prevent shutdown() from exiting the VM.
      If there is a way to determine which module is the offending one, then the kernel should only redeploy that module. However, there does not seem to be a pure-Java way to do this currently.
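      The scheduling-based CPU estimate from the list above can be sketched as follows. This is only one possible reading: "prolonged time" is modeled here as a configurable number of consecutive late heartbeats, and the class name, the tolerance parameter and the explicit timestamps are assumptions made so the logic can be exercised without a real scheduler.

```java
// Hypothetical helper: detects a heartbeat task that keeps running late.
class HeartbeatLagDetector {
    private final long intervalMillis;          // scheduled period of the heartbeat
    private final long toleranceMillis;         // lateness that is still considered normal
    private final int maxConsecutiveLateBeats;  // how many late beats count as "prolonged"

    private long lastBeatMillis = -1;
    private int lateBeats = 0;

    HeartbeatLagDetector(long intervalMillis, long toleranceMillis, int maxConsecutiveLateBeats) {
        this.intervalMillis = intervalMillis;
        this.toleranceMillis = toleranceMillis;
        this.maxConsecutiveLateBeats = maxConsecutiveLateBeats;
    }

    // Record one heartbeat; returns true once enough consecutive beats were late.
    synchronized boolean beat(long nowMillis) {
        if (lastBeatMillis >= 0) {
            long gap = nowMillis - lastBeatMillis;
            // An on-time beat resets the streak of late beats.
            lateBeats = (gap > intervalMillis + toleranceMillis) ? lateBeats + 1 : 0;
        }
        lastBeatMillis = nowMillis;
        return lateBeats >= maxConsecutiveLateBeats;
    }
}
```

      In a running service the heartbeat task would call beat(System.currentTimeMillis()) on every run and broadcast the JMX notification when it returns true.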

      If you are interested in taking on this task, please holler.
      I will try to assist.


      Ivelin





        • 1. Re: JBoss HeartWatch service
          dimitris

          Hi,

          I'd like to do parts of this. Although it can be solved by writing specific MBeans, I was thinking of a more generic mechanism that monitors JMX notifications and, through the execution of BeanShell scripts, decides what is an alarm and what is not, so as to be able to monitor any aspect of a deployment, not just memory and CPU.

          I'll come back with a more concrete proposal.

          Regards
          /Dimitris Andreadis

          • 2. Re: JBoss HeartWatch service
            ivelin.ivanov

            We can discuss your idea. Do you have a more detailed design document handy?

            Ivelin

            • 3. Re: JBoss HeartWatch service
              dimitris

              OK, I've started writing a short design doc.
              Will come back soon.
              Cheers
              /Dimitris

              • 4. Re: JBoss HeartWatch service
                fordprefect

                Instead of the restart service using shutdown() and start(), why not add two states to the server, suspend() and resume(), and maybe also add a restart() to ServerImpl (restart() calling suspend() and resume())?
                start() would then basically pass through the 'suspended' state and call resume(), probably somewhere around where doStart() calls initBootLibraries().

                suspend() would basically do what is done by shutdown() but leave both the ServerImplMBean and ServerConfigImplMBean around and the MBeanServer alive.

                This way you'd have a little more control over the server's lifecycle. In a clustered environment you could start a node in 'suspended' state and have it resume if a node in the cluster fails.
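                The proposed lifecycle can be sketched as a small state machine. This is a hypothetical model, not the ServerImpl API: the class, the state names and the startSuspended() method are assumptions, and the actual deploy/undeploy work is reduced to comments.

```java
// Hypothetical model of the suspend/resume proposal; not JBoss's ServerImpl.
class ServerLifecycle {
    enum State { STOPPED, SUSPENDED, RUNNING }

    private State state = State.STOPPED;

    synchronized State state() {
        return state;
    }

    // Boot only the kernel (MBeanServer and server MBeans) and stay suspended,
    // e.g. for a standby node in a cluster.
    synchronized void startSuspended() {
        require(State.STOPPED);
        state = State.SUSPENDED;
    }

    // Normal start: pass through SUSPENDED, then resume.
    synchronized void start() {
        startSuspended();
        resume();
    }

    // Deploy all modules on top of the kernel that is already up.
    synchronized void resume() {
        require(State.SUSPENDED);
        state = State.RUNNING;
    }

    // Undeploy all modules but keep the kernel MBeans and the VM alive.
    synchronized void suspend() {
        require(State.RUNNING);
        state = State.SUSPENDED;
    }

    synchronized void restart() {
        suspend();
        resume();
    }

    private void require(State expected) {
        if (state != expected) {
            throw new IllegalStateException("expected " + expected + " but was " + state);
        }
    }
}
```

                Rejecting illegal transitions with an exception keeps a restart triggered by a notification from running on a server that is already suspended.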

                Sorry if this is a bit off topic.

                • 5. Re: JBoss HeartWatch service
                  ivelin.ivanov


                  The semantics of suspend/resume make sense. The names are misleading. Suspend usually means that the state is frozen and when you resume the process will continue where it stopped.

                  I would think that clustering is independent of this problem. If you shut down the clustering MBeans, the other nodes in the cluster will automatically recognize this fact and regroup.

                  While working on the design, keep in mind that JDK 1.5 includes some of the JVM monitoring MBeans that would be necessary to solve this problem. It would be nice if you followed the naming convention of the upcoming MBean names and classes, but used best-approximation logic to implement them.
                  http://java.sun.com/j2se/1.5.0/docs/relnotes/features.html#jmx
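                  For reference, a minimal sketch of a Warning/Alarm memory check built on the java.lang.management API as it shipped in JDK 5. The class name, the Level enum and the threshold fractions are assumptions for illustration; the classification is kept separate from the heap reading so it can be checked with fixed numbers.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryMXBean;
import java.lang.management.MemoryUsage;

// Hypothetical helper; thresholds are fractions of the maximum heap size.
class HeapLevelCheck {
    enum Level { OK, WARNING, ALARM }

    // Pure classification of a used/max pair against the two thresholds.
    static Level classify(long usedBytes, long maxBytes, double warnFraction, double alarmFraction) {
        double usedFraction = (double) usedBytes / maxBytes;
        if (usedFraction >= alarmFraction) return Level.ALARM;
        if (usedFraction >= warnFraction) return Level.WARNING;
        return Level.OK;
    }

    // Classify the current heap usage as reported by the platform MemoryMXBean.
    static Level currentLevel(double warnFraction, double alarmFraction) {
        MemoryMXBean memory = ManagementFactory.getMemoryMXBean();
        MemoryUsage heap = memory.getHeapMemoryUsage();
        // getMax() returns -1 when the maximum is undefined; fall back to committed.
        long max = heap.getMax() > 0 ? heap.getMax() : heap.getCommitted();
        return classify(heap.getUsed(), max, warnFraction, alarmFraction);
    }
}
```

                  A scheduled task could call currentLevel(0.80, 0.95) and emit a Warning or Alarm notification accordingly; the two fractions here are arbitrary examples.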

                  Cheers,

                  Ivelin

                  • 6. Re: JBoss HeartWatch service
                    dimitris

                     

                    "ivelin" wrote:

                    ...
                    1) Scheduling-based CPU estimate. Schedule a regular heartbeat task which will measure the time between two runs. If the delay is over the scheduled interval for a prolonged (configurable) time, then broadcast a JMX notification.

                    2) Memory monitor. A similarly scheduled task which measures the available memory and, if it approaches a certain limit, sends a Warning JMX notification. If it reaches a critical limit, it sends an Alarm notification. The latter will probably cause the kernel to redeploy all modules.
                    ...


                    I'm done with [2]:

                    http://sourceforge.net/tracker/index.php?func=detail&aid=913422&group_id=22866&atid=381174

                    Ivelin, could you clarify [1] a little bit? If I understand correctly, the idea is to periodically run and time the execution of a certain task. If it takes too long to execute (for a number of consecutive measurements), then broadcast a notification.

                    What could make a suitable task? It could be configurable (e.g. execute a synchronous operation on some other MBean); still, what makes a good task?

                    Regards
                    /Dimitris




                    • 7. Re: JBoss HeartWatch service
                      ivelin.ivanov

                      Very nice, Dimitris.

                      I am not sure what a suitable task would be for the CPU monitoring. It is probably not as important what the actual task is, but rather how we measure the CPU usage. An example method could be to remember the highest performance within a 10-second (configurable) interval. If at some point there is a 10-second interval such that the task is executed within 5% of the time that it was executed in the best case, then there is a CPU spike.

                      What do you think?

                      Ivelin

                      • 8. Re: JBoss HeartWatch service
                        dimitris

                         

                        "ivelin" wrote:
                        Very nice, Dimitris.

                        I am not sure what a suitable task would be for the CPU monitoring. It is probably not as important what the actual task is, but rather how we measure the CPU usage. An example method could be to remember the highest performance within a 10-second (configurable) interval. If at some point there is a 10-second interval such that the task is executed within 5% of the time that it was executed in the best case, then there is a CPU spike.

                        What do you think?

                        Ivelin


                        It seems to be very hard to do proper CPU load calculations in a portable way. I did some googling and found a few research projects that try to do this... A friend recommended having a particular method for each popular O/S (e.g. read the /proc stats on Linux) and using that instead. I wonder what commercial products do?

                        /Dimitris


                        • 9. Re: JBoss HeartWatch service
                          ivelin.ivanov

                          I am thinking that it is not as important whether we detect the same CPU % as the OS. It is more important to know whether a regular Java thread in the VM is being resumed more or less often.
                          If we answer the following question positively, then we have a reason for serious doubt:
                          "Did the monitoring thread reach a point where it's being resumed 10 times less often within the last 15 minutes time slice than it was in any period of 15 minutes since the VM started?"
                          Of course these numbers are arbitrary and should be possible to configure per deployment.
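                          One way to read the question above is sketched below, under two loudly stated assumptions: time is cut into fixed, non-overlapping windows rather than a sliding 15-minute slice, and "any period" is taken to mean the busiest window seen so far. The class and parameter names are hypothetical.

```java
// Hypothetical tracker: flags a window in which the monitoring thread woke up
// at least `slowdownFactor` times less often than in the busiest earlier window.
class WakeUpRateTracker {
    private final long windowMillis;
    private final int slowdownFactor;

    private boolean started = false;
    private long windowStart;
    private long countInWindow = 0;
    private long bestWindowCount = 0;

    WakeUpRateTracker(long windowMillis, int slowdownFactor) {
        this.windowMillis = windowMillis;
        this.slowdownFactor = slowdownFactor;
    }

    // Record one wake-up; returns true if a window that just closed was suspiciously quiet.
    synchronized boolean wakeUp(long nowMillis) {
        if (!started) {
            windowStart = nowMillis;
            started = true;
        }
        boolean suspicious = false;
        // Close every window that ended before this wake-up (possibly several empty ones).
        while (nowMillis - windowStart >= windowMillis) {
            if (bestWindowCount > 0 && countInWindow * slowdownFactor <= bestWindowCount) {
                suspicious = true;
            }
            bestWindowCount = Math.max(bestWindowCount, countInWindow);
            countInWindow = 0;
            windowStart += windowMillis;
        }
        countInWindow++;
        return suspicious;
    }
}
```

                          With a 15-minute window and a factor of 10 this matches the numbers in the question; both values would be configuration attributes of the service.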

                          Ivelin

                          • 10. Re: JBoss HeartWatch service
                            starksm64

                            Right, without support from the VM/OS you're not going to be able to monitor thread liveness. What is important is whether a monitor thread actually gets sufficient time to emit a timely notification when there is an out-of-control thread spinning on a single-CPU system. Just mock up such a service and see if it works.

                            • 11. Re: JBoss HeartWatch service
                              dimitris

                              You gave me some ideas. I think I'm going to use as a baseline the "best" run (in terms of how much a thread actually waited on a timed wait and how much time it took to execute a simple CPU bound task) and measure the difference.
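                              A rough sketch of the "best run" baseline idea, not the implementation that was eventually written. The class names, the 50 ms wait and the spin loop are assumptions; one baseline tracks how long a timed wait really took, the other how long a simple CPU-bound task took, and each sample is reported as a multiple of the best run so far.

```java
// Hypothetical baseline: remembers the fastest sample and rates later ones against it.
class BestRunBaseline {
    private long bestNanos = Long.MAX_VALUE;

    // Returns how many times slower this sample is than the best one seen so far
    // (1.0 means it is the best, or ties the best).
    synchronized double record(long elapsedNanos) {
        bestNanos = Math.min(bestNanos, Math.max(1, elapsedNanos));
        return (double) elapsedNanos / bestNanos;
    }
}

class BestRunProbe {
    // A simple CPU-bound task; the result is returned so the loop is not optimized away.
    static long spin(int iterations) {
        long acc = 0;
        for (int i = 0; i < iterations; i++) {
            acc += i % 7;
        }
        return acc;
    }

    public static void main(String[] args) throws InterruptedException {
        BestRunBaseline waitBaseline = new BestRunBaseline();
        BestRunBaseline cpuBaseline = new BestRunBaseline();
        for (int run = 0; run < 5; run++) {
            long t0 = System.nanoTime();
            Thread.sleep(50);                 // timed wait
            long t1 = System.nanoTime();
            spin(1_000_000);                  // CPU-bound task
            long t2 = System.nanoTime();
            System.out.printf("run %d: wait x%.2f, cpu x%.2f of best%n",
                    run, waitBaseline.record(t1 - t0), cpuBaseline.record(t2 - t1));
        }
    }
}
```

                              A monitor would raise a notification when either ratio stays above a configurable factor for several runs in a row.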

                              Cheers
                              /Dimitris

                              • 12. Re: JBoss HeartWatch service
                                chiefcujo

                                Hello,

                                I'm not sure this is where I should post this, but I have found no other area where this is applicable.

                                The biggest issue I see with this project when measuring Server utilization is that JBoss does not manage thread utilization.

                                Correct me if I'm wrong, but I have been through the source and read the forums and have not found a global thread pool mechanism. I have found pools in the JMS code, but they are unbounded, so what is the point?

                                I see extremely high thread counts when we deploy our application on JBoss, as high as 200+ threads. Performance seems OK, but a thread count this high cannot be good from a management standpoint.

                                So instead of reinventing the wheel, I googled around and found an open source (GPL) thread pool called ThreadWorks. http://www.dvt.com/threadworks/DvtThreadWorks.html

                                Would there be any interest from the JBoss org if I applied this package throughout the JBoss server? I would also expose the management via an MBean, so as to keep the configuration more JBossish.

                                This would enable goal #1 of this project to be much more accurate.

                                Also, when JDK1.5 goes production, I would want to modify ThreadWorks to take advantage of the CPU utilization measurements offered by the VM in the java.lang.management.ThreadMBean.
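                                As a hedged illustration of the VM-provided measurements mentioned above: in the released JDK 5 API the interface is named java.lang.management.ThreadMXBean. The helper class below is hypothetical, and CPU time support is optional, so the sketch checks for it and returns -1 when it is unavailable.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadMXBean;

// Hypothetical helper around the platform ThreadMXBean.
class ThreadCpuProbe {
    // CPU time consumed by the calling thread in nanoseconds, or -1 if the VM
    // does not support or has not enabled thread CPU time measurement.
    static long currentThreadCpuNanos() {
        ThreadMXBean threads = ManagementFactory.getThreadMXBean();
        if (!threads.isCurrentThreadCpuTimeSupported() || !threads.isThreadCpuTimeEnabled()) {
            return -1;
        }
        return threads.getCurrentThreadCpuTime();
    }

    // Number of live threads in the VM, daemon and non-daemon.
    static int liveThreadCount() {
        return ManagementFactory.getThreadMXBean().getThreadCount();
    }
}
```

                                Sampling currentThreadCpuNanos() before and after a unit of work gives the CPU time that thread actually received, which is a more direct signal than wall-clock delays.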

                                Well, I'm just throwing it out there to see if there is any interest.

                                Cheers

                                • 13. Re: JBoss HeartWatch service
                                  chiefcujo

                                  Ignore my previous post about there not being a ThreadPool... I must have my eyes examined; I found it.

                                  Sorry if I annoyed anyone.

                                  Cheers

                                  • 14. Re: JBoss HeartWatch service/SNMP proxy
                                    qiminghe


                                    Regarding CPU/memory monitoring: these are not EJB-specific performance data, and most of this data is available through an SNMP interface. There are a couple of mature SNMP agents (e.g. net-snmp) providing such data via various MIBs.

                                    So we do not have to reinvent the wheel. IMHO, the best way to do it is to write an SNMP proxy to "delegate out" all such calls. BTW, WebLogic has such a thing.

                                    I have an SNMP proxy prototype built. Do you guys want me to check in the code?
