29 Replies Latest reply on Aug 23, 2007 8:49 PM by ramdas

    Could this be deadlock when modifying TreeCache

    ramdas

      Our app uses the JBoss TreeCache from within JBoss AS 3.2.6. The version of the JGroups JAR that we use is jgroups-2.2.9-beta.jar.

      I came across a situation in which a large share (290 out of 400) of the Tomcat threads configured within JBoss appeared, according to a thread dump, to be waiting to modify the TreeCache. Below is the stack trace for one of those threads, taken from the dump. What could have caused this?

      Thanks

      Ramdas

      ---------------------------------
      "TP-Processor400" daemon prio=1 tid=0x081c3140 nid=0x641d in Object.wait() [343fe000..343ff8d0]
      at java.lang.Object.wait(Native Method)
      at org.jgroups.protocols.FC.handleDownMessage(FC.java:360)
      - locked <0x52074af8> (a java.lang.Object)
      at org.jgroups.protocols.FC.down(FC.java:300)
      at org.jgroups.stack.Protocol.receiveDownEvent(Protocol.java:517)
      at org.jgroups.protocols.FC.receiveDownEvent(FC.java:294)
      at org.jgroups.stack.Protocol.passDown(Protocol.java:551)
      at org.jgroups.protocols.FRAG.down(FRAG.java:139)
      at org.jgroups.stack.Protocol.receiveDownEvent(Protocol.java:517)
      at org.jgroups.stack.Protocol.passDown(Protocol.java:551)
      at org.jgroups.protocols.pbcast.STATE_TRANSFER.down(STATE_TRANSFER.java:233)
      at org.jgroups.stack.Protocol.receiveDownEvent(Protocol.java:517)
      at org.jgroups.stack.ProtocolStack.down(ProtocolStack.java:341)
      at org.jgroups.JChannel.down(JChannel.java:1093)
      at org.jgroups.blocks.MessageDispatcher$ProtocolAdapter.down(MessageDispatcher.java:715)
      at org.jgroups.blocks.MessageDispatcher$ProtocolAdapter.passDown(MessageDispatcher.java:692)
      at org.jgroups.blocks.RequestCorrelator.sendRequest(RequestCorrelator.java:277)
      at org.jgroups.blocks.GroupRequest.doExecute(GroupRequest.java:446)
      at org.jgroups.blocks.GroupRequest.execute(GroupRequest.java:188)
      at org.jgroups.blocks.MessageDispatcher.castMessage(MessageDispatcher.java:417)
      at org.jgroups.blocks.RpcDispatcher.callRemoteMethods(RpcDispatcher.java:165)
      at org.jboss.cache.TreeCache.callRemoteMethods(TreeCache.java:2196)
      at org.jboss.cache.TreeCache.callRemoteMethods(TreeCache.java:2227)
      at org.jboss.cache.interceptors.ReplicationInterceptor.handleReplicatedMethod(ReplicationInterceptor.java:111)
      at org.jboss.cache.interceptors.ReplicationInterceptor.invoke(ReplicationInterceptor.java:85)
      at org.jboss.cache.TreeCache.invokeMethod(TreeCache.java:3116)
      at org.jboss.cache.TreeCache.put(TreeCache.java:1762)
      at org.jboss.cache.TreeCache.put(TreeCache.java:1702)
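
To count how many threads in a dump like the one above are parked in the same place, a small scanner can help. This is a minimal sketch, assuming a HotSpot-style dump in which each thread starts with a quoted name and each frame starts with "at". `ThreadDumpScan` and `threadsContaining` are hypothetical names, not part of JBoss or JGroups, and the code needs Java 8 or later, so it would run on a workstation rather than on the JDK 1.4 server.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper (not part of JBoss or JGroups): finds the threads in a
// HotSpot-style thread dump whose stack contains a given frame.
class ThreadDumpScan {

    /** Returns the names of threads whose stack trace mentions {@code frame}. */
    static List<String> threadsContaining(String dump, String frame) {
        List<String> hits = new ArrayList<>();
        String current = null;    // name of the thread whose stack is being read
        boolean counted = false;  // guards against adding the same thread twice
        for (String raw : dump.split("\\R")) {
            String line = raw.trim();
            if (line.startsWith("\"")) {
                // Thread header, e.g. "TP-Processor400" daemon prio=1 ...
                int end = line.indexOf('"', 1);
                current = end > 0 ? line.substring(1, end) : line;
                counted = false;
            } else if (current != null && !counted
                    && line.startsWith("at ") && line.contains(frame)) {
                hits.add(current);
                counted = true;
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        // Tiny made-up dump with one thread waiting in FC and one elsewhere.
        String dump = String.join("\n",
            "\"TP-Processor400\" daemon prio=1 in Object.wait()",
            "  at java.lang.Object.wait(Native Method)",
            "  at org.jgroups.protocols.FC.handleDownMessage(FC.java:360)",
            "\"TP-Processor401\" daemon prio=1 runnable",
            "  at java.net.SocketInputStream.read(SocketInputStream.java:129)");
        List<String> blocked =
            threadsContaining(dump, "org.jgroups.protocols.FC.handleDownMessage");
        System.out.println(blocked.size() + " thread(s) waiting in FC: " + blocked);
    }
}
```

Running the same scan over successive dumps taken a few seconds apart would show whether the same threads stay parked in FC or whether they are merely passing through.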

        • 1. Re: Could this be deadlock when modifying TreeCache
          manik

          You'd need to tune flow control in your JGroups stack. The JGroups FC protocol blocks senders when you have fast senders and slow receivers in your cluster.

          Are you using your cache in REPL_SYNC mode? If so you can just remove FC from your JGroups configuration.

          • 2. Re: Could this be deadlock when modifying TreeCache
            ramdas

            Manik,

            Thanks for the response. The TreeCache is set up for "REPL_ASYNC". I am not very familiar with the JBoss Cache tuning parameters. Do you have any comments on the configuration shown below? Since this happens on our production system and is hard to reproduce, do you have any recommendations on how to tune this? Do you still recommend removing FC?

            The JBoss cache is hosted on an 8-node cluster running Debian Linux with the 2.4 kernel.

            -Ramdas

             <mbean code="org.jboss.cache.TreeCache"
             name="jboss.cache:service=TreeCache">
             <depends>jboss:service=Naming</depends>
             <depends>jboss:service=TransactionManager</depends>
             <attribute name="TransactionManagerLookupClass">org.jboss.cache.JBossTransactionManagerLookup</attribute>
             <attribute name="IsolationLevel">READ_UNCOMMITTED</attribute>
             <attribute name="CacheMode">REPL_ASYNC</attribute>
             <attribute name="UseReplQueue">false</attribute>
             <attribute name="ClusterName">TreeCache-Cluster</attribute>
             <attribute name="ClusterConfig">
             <config>
             <UDP mcast_addr="227.1.2.3" mcast_port="45566"
             ip_ttl="32" ip_mcast="true"
             mcast_send_buf_size="10000000" mcast_recv_buf_size="10000000"
             ucast_send_buf_size="10000000" ucast_recv_buf_size="10000000"
             max_bundle_size="64000"
             max_bundle_timeout="30"
             use_incoming_packet_handler="false"
             use_outgoing_packet_handler="true"
             loopback="false"
             />
             <PING timeout="2000" down_thread="false" num_initial_members="3"/>
             <MERGE2 max_interval="10000" down_thread="false" min_interval="5000"/>
             <FD_SOCK down_thread="false"/>
             <VERIFY_SUSPECT timeout="1500" down_thread="false"/>
             <pbcast.NAKACK max_xmit_size="60000" down_thread="false" use_mcast_xmit="true" gc_lag="50" retransmit_timeout="300,600,1200,2400,4800"/>
             <UNICAST timeout="300,600,1200,2400,3600" down_thread="false"/>
             <pbcast.STABLE stability_delay="1000" desired_avg_gossip="5000" down_thread="false" max_bytes="250000"/>
             <pbcast.GMS print_local_addr="true" join_timeout="3000" down_thread="false" join_retry_timeout="2000" shun="true"/>
             <FC max_credits="1000000" down_thread="false" min_threshold="0.10"/>
             <FRAG frag_size="60000" down_thread="false" up_thread="true"/>
             <pbcast.STATE_TRANSFER down_thread="false" up_thread="false"/>
             </config>
             </attribute>
             <attribute name="FetchStateOnStartup">true</attribute>
             <attribute name="InitialStateRetrievalTimeout">5000</attribute>
             <attribute name="SyncReplTimeout">10000</attribute>
             <attribute name="LockAcquisitionTimeout">15000</attribute>
             <attribute name="EvictionPolicyClass"></attribute>
             </mbean>


            • 3. Re: Could this be deadlock when modifying TreeCache
              manik

              I would look at your max_credits FC param. If you have a lot of state being moved around (and I suspect you do given the cluster size and number of Tomcat threads) you probably want a higher max_credits - try about 20 million instead of the 1 million you already have configured.
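
For a rough feel of what the two max_credits values mean, the arithmetic can be sketched as below. This is a back-of-the-envelope model, not JGroups code: it assumes each message costs its size in bytes in credits and that the receiver sends no replenishment in the meantime. `CreditBudget` and `messagesBeforeBlocking` are hypothetical names; the 60000-byte message size is the frag_size from the posted FRAG configuration.

```java
// Back-of-the-envelope model only; not JGroups code.
class CreditBudget {

    /**
     * Number of full messages of messageBytes a sender could emit before its
     * credits run out, assuming one credit per byte and no replenishment.
     */
    static long messagesBeforeBlocking(long maxCredits, long messageBytes) {
        return maxCredits / messageBytes;
    }

    public static void main(String[] args) {
        long fragSize = 60_000L; // frag_size from the posted FRAG configuration

        // Currently configured max_credits of 1 million.
        System.out.println(messagesBeforeBlocking(1_000_000L, fragSize));  // prints 16

        // Suggested max_credits of about 20 million.
        System.out.println(messagesBeforeBlocking(20_000_000L, fragSize)); // prints 333
    }
}
```

Under these assumptions, a burst of more than 16 full-size fragments towards a receiver that has not yet sent credits back would stall the sender with the current setting, while the larger value leaves room for roughly twenty times as much.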

              • 4. Re: Could this be deadlock when modifying TreeCache
                ramdas

                Manik,

                Thanks for your tip. Is there a way to monitor the FC stats to find out whether it really is the cause of the bottleneck? There is a lot of history behind the values that are currently set, and the team is reluctant to make changes because their past experience with doing so has not been very pleasant.

                thanks

                Ramdas

                • 5. Re: Could this be deadlock when modifying TreeCache
                  belaban

                  Yes, FC exposes its stats via JMX. I'm not sure though whether JGroups registers its MBeans via the MBeanServer in standalone JBossCache. Manik ?

                  You can look at the stats (with jconsole) if you run for example the JGroups demo Draw with "-jmx".

                  • 6. Re: Could this be deadlock when modifying TreeCache
                    manik

                    Binding to JMX when running standalone: no it doesn't actually do this at the moment although I agree it should (JBCACHE-1140)

                    This particular case is running within AS 3.2.6 though.

                    How do you construct your cache instance?

                    • 7. Re: Could this be deadlock when modifying TreeCache
                      manik

                       

                      "manik.surtani@jboss.com" wrote:
                      Binding to JMX when running standalone: no it doesn't actually do this at the moment although I agree it should (JBCACHE-1140, http://jira.jboss.com/jira/browse/JBCACHE-1140)


                      Sorry, my mistake, it *does* actually do this. (I just tested it again in an attempt to close 1140.) I'm now looking into why it didn't do this when I tested it earlier today.

                      • 8. Re: Could this be deadlock when modifying TreeCache
                        ramdas

                        Thanks for checking on this. I have not had a chance to set up JConsole yet, but we have had a couple more failures on our production cluster with the same symptoms I listed first.
                        I had a question though on why the threads are blocking given that the CACHE_MODE is set as "REPL_ASYNC"?
                        Manik, in response to your question about how the cache instance is constructed: the app creates it as a singleton when the JBoss app server comes up.

                        Thanks

                        Ramdas

                        • 9. Re: Could this be deadlock when modifying TreeCache
                          manik

                           

                          "ramdas" wrote:

                          I had a question though on why the threads are blocking given that the CACHE_MODE is set as "REPL_ASYNC"?


                          Async only means that once the call is placed on the network it won't block and wait for a result. In this case, the thread blocks in the FC protocol in JGroups, *before* it gets to the network. This is just FC doing its job, and it is why I said FC may need tuning.
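
The blocking described above can be illustrated with a toy credit gate. This is a simplified sketch of credit-based flow control in general, not the JGroups FC implementation: `CreditGate`, `acquire` and `replenish` are hypothetical names, and a timeout is added so the demo terminates, whereas the stack trace in the first post shows a plain `Object.wait()`.

```java
// Toy model of credit-based flow control; not the JGroups FC implementation.
class CreditGate {
    private final long maxCredits;
    private long credits;

    CreditGate(long maxCredits) {
        this.maxCredits = maxCredits;
        this.credits = maxCredits;
    }

    /**
     * Sender side: waits until enough credits are available, then spends them.
     * Returns false if the credits did not arrive within timeoutMillis.
     */
    synchronized boolean acquire(long bytes, long timeoutMillis) {
        long deadline = System.currentTimeMillis() + timeoutMillis;
        while (credits < bytes) {
            long remaining = deadline - System.currentTimeMillis();
            if (remaining <= 0) {
                return false; // still out of credits: the sender stays blocked
            }
            try {
                wait(remaining); // the caller's thread parks here, before any I/O
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return false;
            }
        }
        credits -= bytes;
        return true;
    }

    /** Receiver side: hands the sender a fresh budget and wakes waiting senders. */
    synchronized void replenish() {
        credits = maxCredits;
        notifyAll();
    }

    synchronized long available() {
        return credits;
    }

    public static void main(String[] args) {
        CreditGate gate = new CreditGate(100);
        System.out.println(gate.acquire(60, 50)); // true: enough credits
        System.out.println(gate.acquire(60, 50)); // false: only 40 left, no replenishment
        gate.replenish();                         // slow receiver finally catches up
        System.out.println(gate.acquire(60, 50)); // true again
    }
}
```

The point of the model is that the wait happens on the calling thread. An asynchronous put skips waiting for replies from the other nodes, but it still has to pass a gate like this one before the message is handed to the transport.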


                          • 10. Re: Could this be deadlock when modifying TreeCache
                            hmesha

                            I had experienced the same error on one of our clusters (JBoss AS 4.0.5 and JGroups 2.2.7 though). I solved the issue by upgrading JGroups to 2.4 which fixed a deadlock issue as per http://jira.jboss.com/jira/browse/JGRP-292

                            I'm not sure if 2.2.9-beta has the same issue, just thought I'd share my experience.

                            • 11. Re: Could this be deadlock when modifying TreeCache
                              ramdas

                              I was hoping to find the JGroups MBeans in the jmx-console that comes with JBoss 3.2.6 but did not find them, unless I was looking in the wrong place.
                              If I was not, do I need to start the JBoss app server with a particular command-line parameter? The application uses JDK 1.4, so I will have to connect remotely if I use JConsole.

                              Thanks

                              Ramdas

                              • 12. Re: Could this be deadlock when modifying TreeCache
                                manik

                                Sorry, only just saw which version you're referring to. JMX attribs were only exposed in JBoss Cache from version 1.3.0.GA.

                                The version embedded in JBoss App Server 3.2.6 is pre-1.0.

                                So even though your version of JGroups exposes JMX info, JBoss Cache doesn't bind this to JMX until 1.3.0.GA.

                                A few questions:

                                1) Does your JBoss AS installation run in a cluster?
                                2) Is upgrading your version of JBoss AS feasible?

                                Thanks
                                Manik

                                • 13. Re: Could this be deadlock when modifying TreeCache
                                  ramdas

                                  We run newer versions of JBoss Cache and JGroups, which we were advised to download and use when we initially deployed this application in production along with JBoss 3.2.6.
                                  Given below is a dump of the manifest files from the JBoss Cache, JGroups and JBoss AS JARs.

                                  The JBoss AS is clustered and we currently have 8 members in the cluster.
                                  Almost all our work is handled within the Tomcat threads running within JBoss.

                                  The application outages caused by the blocked threads have become more frequent (every couple of days) in the last month. Given this, we as a team are willing to be a bit more aggressive about changing the existing infrastructure software, though changing the AS version would be a major change and more difficult to test and implement. I would prefer it if we could tune the existing software versions.

                                  Thanks for your time and attention on this issue.

                                  -Ramdas

                                  ---------------JBoss cache---------------------------------
                                  Manifest-Version: 1.0
                                  Ant-Version: Apache Ant 1.6.1
                                  Created-By: 1.4.2-b28 (Sun Microsystems Inc.)
                                  Built-By: bela
                                  Created-On: June 1 2005
                                  Main-Class: org.jboss.cache.Version
                                  Name: JBossCache
                                  Specification-Title: JBossCache
                                  Specification-Version: 1.2.3
                                  Specification-Vendor: JBoss Inc.
                                  Implementation-Title: JBossCache
                                  Implementation-Version: 1.2.3
                                  Implementation-Vendor: JBoss Inc.


                                  ----------------------------JGroups----------------------
                                  Manifest-Version: 1.0
                                  Ant-Version: Apache Ant 1.6.1
                                  Created-By: Apache Ant 1.5Beta2
                                  Main-Class: org.jgroups.Version
                                  Implementation-Version: 2.2.9 beta

                                  -------------------------JBoss AS(jboss.jar)--------------
                                  Manifest-Version: 1.0
                                  Ant-Version: Apache Ant 1.6.2
                                  Created-By: 1.4.2_07-b05 (Sun Microsystems Inc.)
                                  Specification-Title: JBoss
                                  Specification-Version: 3.2.6
                                  Specification-Vendor: JBoss (http://www.jboss.org/)
                                  Implementation-Title: JBoss [WonderLand]
                                  Implementation-URL: http://www.jboss.org/
                                  Implementation-Version: 3.2.6 (build: CVSTag=JBoss_3_2_6 date=20050802)
                                  Implementation-Vendor: JBoss.org
                                  Implementation-Vendor-Id: http://www.jboss.org/

                                  • 14. Re: Could this be deadlock when modifying TreeCache
                                    manik

                                    Ok, either way, your version of JBoss Cache will not give you JMX info. And I wouldn't recommend running a production system on a beta version of JGroups!

                                    Even if you are reluctant to move to a newer version of JBossAS or JBoss Cache, I'd certainly recommend using a production version of JGroups.

                                    http://labs.jboss.org/jbosscache/compatibility/index.html
