12 Replies Latest reply on Dec 11, 2008 10:33 AM by belaban

    Thread dumps generated by triggers/events?

    galder.zamarreno

      A while back, I started a thread about generating thread dumps if a Cache TimeoutException occurred (see http://www.jboss.com/index.html?module=bb&op=viewtopic&t=123989). Another situation that comes to my mind would be when deployments take ~1 min, which are the result of a DRM.add cluster sync call hanging or just been slow.

      In both situations, getting a local or remote thread dump at the right point is key to find out what it's going on. Currently, we're relying on users providing such thread dumps and if they get the timing wrong, which can easily happen, it can get quite frustrating cos they'd have to retry...etc.

      Maybe this is something a monitoring tool could do? I.e. generate thread dumps in all nodes in the cluster upon certain circumstances/triggers/events? This is certainly a cross cutting concern not clustering specifically related, but it's key for Clustering in general, hence why I opened the thread here. If there's a different forum where this should be raised, please let me know.

        • 1. Re: Thread dumps generated by triggers/events?
          brian.stansberry

          So are you thinking about a service that generates thread dumps (or tweaking the existing one), with configuration options to control where the dumps go? With hooks to inject the service into other interested services, e.g. HAPartition? HAPartition would then decide whether an event (e.g. a timeout on an RPC) justifies calling into the thread dump service.

          In the case of an RPC timeout, only the caller knows it happened.

          • 2. Re: Thread dumps generated by triggers/events?
            galder.zamarreno

             

            "bstansberry@jboss.com" wrote:
            So are you thinking about a service that generates thread dumps (or tweaking the existing one), with configuration options to control where the dumps go?


            Something along those lines but was thinking more about a service that lies in the management console that responds to specific JMX events and then requests generation of thread dumps. Could be a standard service in AS, but seems to fit more the monitoring area.

            This service should be able to instruct not only the local node, but other cluster members to generate thread dumps as well. For example if the TE happened when doing a sync repl, or when doing a sync JGroups RPC call, request the node where the sync repl/rpc failed to generate a thread dump.

            Information that would be needed:
            - timestamp (kill -3 does not provide timestamp of the thread dump!)
            - thread dump
            - some kind of unique id shared by all thread dumps in all nodes that were generated from a specific failure.
            - some information to match the thread dump(s) to the failure in the logs.

            The aim is for someone to be able to say something like this: "Machine A reported a TE (with suspected=false) and these are the thread dumps taken immediately after from Machines B,C,D in the cluster that are associated with this TE. I have already the GC logs in case the TE was due to long garbage collection."

            "bstansberry@jboss.com" wrote:
            With hooks to inject the service into other interested services, e.g. HAPartition? HAPartition would then decide whether an event (e.g. a timeout on an RPC) justifies calling into the thread dump service.

            In the case of an RPC timeout, only the caller knows it happened.


            That could work as well. My comment above on JMX notifications would be pretty much this. HAPartition generates a JMX notification upon an RPC timeout and a service in the monitoring tool does the job.

            Where do you think such service would fit better?

            One thing I need to get to the bottom of regarding thread dumps is that Thread.getAllStackTraces() does not provide the same information a kill -3. Some lock information seems to be missing from Thread.getAllStackTraces(), which is why I recommend against JMX method to generate stack traces.

            • 3. Re: Thread dumps generated by triggers/events?
              galder.zamarreno

              Couple of things on Thread.getAllStackTraces()

              It leaks :). Fixed in 6u1
              http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=6434648

              From: http://wiki.jboss.org/wiki/Wiki.jsp?page=StackTrace

              This function returns the information provided by Thread.getAllStackTraces(). This does not include the light weight thread id, which is important for troubleshooting hung and spinning processes; therefore, it is recommended to use the methods described above to obtain the stack trace if possible.


              • 4. Re: Thread dumps generated by triggers/events?
                galder.zamarreno

                There're 4 ways to generate a thread dump. I have taken the output from the same run with different methods:

                1.- kill -3. Can be parsed by tools like https://tda.dev.java.net/ (simply, one of the most useful tools I've found out there!). Example:

                "JBossLifeThread" prio=1 tid=0x0a690cc0 nid=0x1741 in Object.wait() [0x867f2000..0x867f30b0]
                 at java.lang.Object.wait(Native Method)
                 - waiting on <0x9247e200> (a java.lang.Object)
                 at java.lang.Object.wait(Object.java:474)
                 at org.jboss.system.server.ServerImpl$LifeThread.run(ServerImpl.java:940)
                 - locked <0x9247e200> (a java.lang.Object)


                2.- JMX via org.jboss.system.server.ServerInfo.listThreadDump(). Is it the same as what Thread.getAllStackTraces() would do? Need a JIRA to get to the bottom of this. Here's some output:

                Thread: JBossLifeThread : priority:5, demon:false, threadId:58, threadState:WAITING, lockName:java.lang.Object@10cff6b
                
                java.lang.Object.wait(Native Method)
                java.lang.Object.wait(Object.java:474)
                org.jboss.system.server.ServerImpl$LifeThread.run(ServerImpl.java:940)


                Lock name is different to lock monitor achieved via kill -3. Doesn't looks like it prints previously locked monitors. Cannot be parsed by TDA tools because it's in JBoss specific format. Useless.

                3.- Thread.getAllStackTraces().

                Haven't tested the output from this. What would this print? Couldn't find an MBean that uses/parses the output from this, but I suspect it's the same as method 2, specially since http://jira.jboss.com/jira/browse/JBAS-1448 has already been fixed.

                4.- JConsole Threads view:

                Name: JBossLifeThread
                State: WAITING on java.lang.Object@10cff6b
                Total blocked: 0 Total waited: 1
                
                Stack trace:
                java.lang.Object.wait(Native Method)
                java.lang.Object.wait(Object.java:474)
                org.jboss.system.server.ServerImpl$LifeThread.run(ServerImpl.java:940)


                Seems to print same info as JBoss' ServerInfo.

                --

                This has been bugging for me for ages, had to get to get to the bottom of it :). So far, there's nothing like "killing" JBoss.

                • 5. Re: Thread dumps generated by triggers/events?
                  galder.zamarreno

                  http://jira.jboss.com/jira/browse/JBAS-5163 for the thread dump discussion.

                  I'll create a JIRA for the other topic asap.

                  • 6. Re: Thread dumps generated by triggers/events?
                    galder.zamarreno

                    I've just realised that I never created a JIRA for a service that would request thread dumps from 1..n nodes upon some trigger or event.

                    Couple of weeks back I saw this JIRA which seems to attempt to do a similar thing: get a thread dump when the server shutdown times out.

                    https://jira.jboss.org/jira/browse/JBASM-3

                    The idea is basically the same except that the triggers are different. My initial motivation is members that didn't response to a sync call, whether AS cluster wide RPC, or JBC's repl sync of data.

                    • 7. Re: Thread dumps generated by triggers/events?
                      galder.zamarreno

                      I finally created a JIRA for this:

                      https://jira.jboss.org/jira/browse/JBAS-6311

                      • 8. Re: Thread dumps generated by triggers/events?
                        belaban

                        To get a stack trace of all nodes in a cluster, you can use JGroups' probe:

                        probe.sh dump


                        http://www.jboss.org/community/docs/DOC-11689 discusses how probe can be extended, e.g. to add custom functionality.

                        The down side is that you get back a string, there is currently no API to access the results. This is planned though

                        • 9. Re: Thread dumps generated by triggers/events?
                          galder.zamarreno

                          Probe dump would indeed be really handy for trying to retrieve the thread dumps accross the cluster but it would be called from some Java logic (notification consumer for each different type of situation when to generate a thread) and hence I would need to call JGroups API to get the same that "probe.sh dump" would return.

                          Also, I would expect all thread dumps from problem X to be dumped locally and then someone to inspect them with TDA or Samurai, so string sounds good to me. Note that it would help that the thread dump string is formatted in the same way as the JVM does (kill -3) as this would help compatibility with tools that are already able to read these type of dumps.

                          • 10. Re: Thread dumps generated by triggers/events?
                            belaban

                            Formatting is the same as the JVM dumps.

                            Regarding the API: the benefit here would be that you can drop responses from different clusters. Well, you can do that too with probe.sh, but you'd need a little help from sed/awk/cut and friends

                            • 11. Re: Thread dumps generated by triggers/events?
                              galder.zamarreno

                              Hmmm, I don't understand this very well. I need to test it but ideally what you want is that whatever API you call to get thread dumps for all nodes in the cluster returns in an array or similar each individual thread dump with information where the thread dump is coming from.

                              The idea here is that the problem plus each individual node's thread dumps are logged somewhere. Obviously, if the same problem happen a lot of times in a row, you'd only log it once, otherwise you could end up swamping the server's HD.

                              I'm not sure we'd want thread dump responses from different clusters at this point unless there was some kind inter cluster sync RPC call or replication.

                              • 12. Re: Thread dumps generated by triggers/events?
                                belaban

                                The JIRA issue for a programmatic API for Probe in JGroups is https://jira.jboss.org/jira/browse/JGRP-840