9 Replies Latest reply on Jun 30, 2004 1:53 PM by belaban

    JBoss 3.2.2 Clustering Issues After Servers are Running for

    rcostanzo

      I am running JBoss 3.2.2 in production. Our system processes hundreds of thousands transactions a day. We are using clustering with cache invalidation with 3 instances across 3 machines. In general, everything works great. However, every other week or so, I will see the system "hang", with errors like these thrown:

      2004-06-15 05:46:41,257 WARN [org.jboss.tm.TransactionImpl] Transaction TransactionImpl:XidImpl [FormatId=257, GlobalId=www7.mateso.com//55944547, BranchQual=] timed out. status=STATUS_ACTIVE
      2004-06-15 05:46:41,266 WARN [org.jboss.ejb.plugins.AbstractInstanceCache] Unable to passivate due to ctx lock, id=5084257
      2004-06-15 05:46:41,354 WARN [org.jboss.tm.TransactionImpl] Transaction TransactionImpl:XidImpl [FormatId=257, GlobalId=www7.mateso.com//55944555, BranchQual=] timed out. status=STATUS_ACTIVE
      2004-06-15 05:47:18,543 WARN [org.jboss.tm.TransactionImpl] Transaction TransactionImpl:XidImpl [FormatId=257, GlobalId=www7.mateso.com//55944825, BranchQual=] timed out. status=STATUS_ACTIVE
      2004-06-15 05:47:52,166 WARN [org.jboss.tm.TransactionImpl] Transaction TransactionImpl:XidImpl [FormatId=257, GlobalId=www7.mateso.com//55945145, BranchQual=] timed out. status=STATUS_ACTIVE
      2004-06-15 05:47:52,168 ERROR [org.jboss.ejb.BeanLock] PoolThread-7Saw rolled back tx=TransactionImpl:XidImpl [FormatId=257, GlobalId=www7.mateso.com//55945145, BranchQual=] waiting for txLock
      2004-06-15 05:47:52,168 WARN [org.jboss.tm.TransactionImpl] Lock contention, tx=TransactionImpl:XidImpl [FormatId=257, GlobalId=www7.mateso.com//55945145, BranchQual=]
      2004-06-15 05:47:52,168 ERROR [org.jboss.ejb.plugins.LogInterceptor] RuntimeException:
      java.lang.RuntimeException: Transaction marked for rollback, possibly a timeout
      at org.jboss.ejb.plugins.lock.QueuedPessimisticEJBLock.waitForTx(QueuedPessimisticEJBLock.java(Compiled Code))
      at org.jboss.ejb.plugins.lock.QueuedPessimisticEJBLock.doSchedule(QueuedPessimisticEJBLock.java(Compiled Code))
      at org.jboss.ejb.plugins.lock.QueuedPessimisticEJBLock.schedule(QueuedPessimisticEJBLock.java(Compiled Code))
      at org.jboss.ejb.plugins.EntityLockInterceptor.invoke(EntityLockInterceptor.java(Compiled Code))
      at org.jboss.ejb.plugins.EntityCreationInterceptor.invoke(EntityCreationInterceptor.java(Compiled Code))
      at org.jboss.ejb.plugins.AbstractTxInterceptor.invokeNext(AbstractTxInterceptor.java(Compiled Code))
      at org.jboss.ejb.plugins.TxInterceptorCMT.runWithTransactions(TxInterceptorCMT.java(Compiled Code))
      at org.jboss.ejb.plugins.TxInterceptorCMT.invoke(TxInterceptorCMT.java(Compiled Code))
      at org.jboss.ejb.plugins.SecurityInterceptor.invoke(SecurityInterceptor.java(Compiled Code))
      at org.jboss.ejb.plugins.LogInterceptor.invoke(LogInterceptor.java(Compiled Code))
      at org.jboss.ejb.plugins.ProxyFactoryFinderInterceptor.invoke(ProxyFactoryFinderInterceptor.java(Compiled Code))
      at org.jboss.ejb.EntityContainer.internalInvoke(EntityContainer.java(Compiled Code))
      at org.jboss.ejb.Container.invoke(Container.java(Compiled Code))
      at org.jboss.ejb.plugins.local.BaseLocalProxyFactory.invoke(BaseLocalProxyFactory.java(Compiled Code))
      at org.jboss.ejb.plugins.local.EntityProxy.invoke(EntityProxy.java(Compiled Code))
      at $Proxy219.getConsumerValue(Unknown Source)
      at mts.psp.consumer.ConsumerManagerEJB.getConsumerByEmail(ConsumerManagerEJB.java:248)
      at sun.reflect.GeneratedMethodAccessor745.invoke(Unknown Source)
      at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java(Compiled Code))
      at java.lang.reflect.Method.invoke(Method.java(Compiled Code))
      at org.jboss.ejb.StatelessSessionContainer$ContainerInterceptor.invoke(StatelessSessionContainer.java(Compiled Code))
      at org.jboss.resource.connectionmanager.CachedConnectionInterceptor.invoke(CachedConnectionInterceptor.java(Compiled Code))
      at org.jboss.ejb.plugins.StatelessSessionInstanceInterceptor.invoke(StatelessSessionInstanceInterceptor.java(Compiled Code))
      at org.jboss.ejb.plugins.AbstractTxInterceptor.invokeNext(AbstractTxInterceptor.java(Compiled Code))
      at org.jboss.ejb.plugins.TxInterceptorCMT.runWithTransactions(TxInterceptorCMT.java(Compiled Code))
      at org.jboss.ejb.plugins.TxInterceptorCMT.invoke(TxInterceptorCMT.java(Compiled Code))
      at org.jboss.ejb.plugins.SecurityInterceptor.invoke(SecurityInterceptor.java(Compiled Code))
      at org.jboss.ejb.plugins.LogInterceptor.invoke(LogInterceptor.java(Compiled Code))
      at org.jboss.ejb.plugins.ProxyFactoryFinderInterceptor.invoke(ProxyFactoryFinderInterceptor.java(Compiled Code))
      at org.jboss.ejb.StatelessSessionContainer.internalInvoke(StatelessSessionContainer.java(Compiled Code))
      at org.jboss.ejb.Container.invoke(Container.java(Compiled Code))
      at org.jboss.ejb.plugins.local.BaseLocalProxyFactory.invoke(BaseLocalProxyFactory.java(Compiled Code))
      at org.jboss.ejb.plugins.local.StatelessSessionProxy.invoke(StatelessSessionProxy.java(Compiled Code))
      at $Proxy579.getConsumerByEmail(Unknown Source)
      at mts.psp.consumer.ConsumerDelegate.getConsumerByEmail(ConsumerDelegate.java:290)


      This occurs across all 3 instances at the same time, and the site becomes unaccessible. 9 times out of 10, the servers will snap out of this state within a few minutes, and everything will work fine again. However, sometimes it stays in this state for a longer period of time and I need to bounce the instances. I'm am pretty sure that there is no external program locking our database (which would cause similar behavior), especially since just an application server restart fixes the issue.

      Another oddity (which may or may not be related), is I will see
      -------------------------------------------------------
      GMS: address is www7:32832
      -------------------------------------------------------
      Thrown out all of a sudden (www7 is one of the server names). Usually, you only see this on startup. After seeing this output, the other two members of the cluster will detect a dead member, and then detect a new member.

      Any help or clues as to what is going on would be greatly appreciated.

      A stack trace from one of the other servers looks pretty similar but is a little different:

      2004-06-15 07:18:42,089 WARN [org.jboss.tm.TransactionImpl] Transaction TransactionImpl:XidImpl [FormatId=257, GlobalId=www4.mateso.com//50385479, BranchQual=] timed out. status=STATUS_ACTIVE
      2004-06-15 07:18:42,120 ERROR [org.jboss.ejb.plugins.LogInterceptor] TransactionRolledbackLocalException in method: public abstract void mts.psp.consumer.ConsumerManagerLocal.updateSiteLoginDate(java.lang.Integer,java.lang.Integer), causedBy:
      org.jboss.tm.JBossRollbackException: Unable to commit, tx=TransactionImpl:XidImpl [FormatId=257, GlobalId=www4.mateso.com//50385479, BranchQual=] status=STATUS_NO_TRANSACTION
      at org.jboss.tm.TransactionImpl.commit(TransactionImpl.java(Compiled Code))
      at org.jboss.ejb.plugins.TxInterceptorCMT.endTransaction(TxInterceptorCMT.java(Compiled Code))
      at org.jboss.ejb.plugins.TxInterceptorCMT.runWithTransactions(TxInterceptorCMT.java(Compiled Code))
      at org.jboss.ejb.plugins.TxInterceptorCMT.invoke(TxInterceptorCMT.java(Compiled Code))
      at org.jboss.ejb.plugins.SecurityInterceptor.invoke(SecurityInterceptor.java(Compiled Code))
      at org.jboss.ejb.plugins.LogInterceptor.invoke(LogInterceptor.java(Compiled Code))
      at org.jboss.ejb.plugins.ProxyFactoryFinderInterceptor.invoke(ProxyFactoryFinderInterceptor.java(Compiled Code))
      at org.jboss.ejb.StatelessSessionContainer.internalInvoke(StatelessSessionContainer.java(Compiled Code))
      at org.jboss.ejb.Container.invoke(Container.java(Compiled Code))
      at org.jboss.ejb.plugins.local.BaseLocalProxyFactory.invoke(BaseLocalProxyFactory.java(Compiled Code))
      at org.jboss.ejb.plugins.local.StatelessSessionProxy.invoke(StatelessSessionProxy.java(Compiled Code))
      at mts.psp.consumer.ConsumerDelegate.updateSiteLoginDate(ConsumerDelegate.java(Compiled Code))
      at mts.psp.web.struts.util.WebSessionUtil.setLoggedIn(WebSessionUtil.java(Compiled Code))
      at mts.psp.web.struts.auth.SessionFilter.handleCookiesAndMungeCode(SessionFilter.java(Compiled Code))
      at mts.psp.web.struts.auth.SessionFilter.doFilter(SessionFilter.java(Compiled Code))
      at org.mortbay.jetty.servlet.WebApplicationHandler$Chain.doFilter(WebApplicationHandler.java(Compiled Code))
      at org.mortbay.jetty.servlet.WebApplicationHandler.dispatch(WebApplicationHandler.java(Compiled Code))
      at org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.java(Compiled Code))
      at org.mortbay.http.HttpContext.handle(HttpContext.java(Compiled Code))
      at org.mortbay.jetty.servlet.WebApplicationContext.handle(WebApplicationContext.java(Compiled Code))
      at org.mortbay.http.HttpContext.handle(HttpContext.java(Compiled Code))
      at org.mortbay.http.HttpServer.service(HttpServer.java(Compiled Code))
      at org.jboss.jetty.Jetty.service(Jetty.java(Compiled Code))
      at org.mortbay.http.HttpConnection.service(HttpConnection.java(Inlined Compiled Code))
      at org.mortbay.http.ajp.AJP13Connection.handleNext(AJP13Connection.java(Compiled Code))
      at org.mortbay.http.HttpConnection.handle(HttpConnection.java(Compiled Code))
      at org.mortbay.http.ajp.AJP13Listener.handleConnection(AJP13Listener.java(Compiled Code))
      at org.mortbay.util.ThreadedServer.handle(ThreadedServer.java(Compiled Code))
      at org.mortbay.util.ThreadPool$PoolThread.run(ThreadPool.java(Compiled Code))

        • 1. Re: JBoss 3.2.2 Clustering Issues After Servers are Running
          nmartel

          We noticed similar quirks/execptions thrown when running similar loads over long periods of time. With testing and artificial loads, I was able to reproduce a similar fault within 15-30 minutes.

          With MySQL transactions (InnoDB tables) and tree-cache in use -- it was observed that there are times when objects are idled and reclaimed by the system and some CPUs have reclaimed/idled objects ahead of others. The cache invalidator is doing its thing (via JGroups), but it takes a while for everything to synch up (much longer explanation abbreviated here for simplicity.) For us, one of the root causes was that the mq.pm.TxManager.java was throwing an exception because the code didn't handle the "Null" list case. We changed the code to handle the condition by adding a test for the NULL, and the exceptions and erroneous transaction rollback messages dis-appeared. We first verified that indeed the nulls could and did occur. It wouldn't make sense to add inter-system synch for these objects, so we decided to allow the null case to return and for the code to do "nothing". Basically, you are just performing house-keeping that (due to race conditions) already occurred.


          The updated routine is as follows:
          protected void releaseList( LinkedList list )
          {
          if (list == null) return;

          synchronized ( listPool )
          {
          if ( listPool.size() < MAX_POOL_SIZE )
          {
          list.clear();
          listPool.add( list );
          }
          }
          }

          Unfortunately (or fortunately) you need to rebuild the JBoss source to incorporate the fix.

          There are other instances where the null list or null object can occur, the JBoss code just doesn't handle these well (yet). The JBoss creators may have assumed/envisioned that the NULL wouldn't occur, however, with heavy loads, large clusters, fast machines, and context switching, we DO see these creep through.

          • 2. Re: JBoss 3.2.2 Clustering Issues After Servers are Running
            nmartel

            Ignore my last post. I re-read this specific fault and noticed the first exception to start you down the path to destruction is the database commit deadlock detection.

            The last post could be useful, but only if you are using persistent messages. There was a similar quirk and deadlock detection timeout that went into that rathole.

            • 3. Re: JBoss 3.2.2 Clustering Issues After Servers are Running
              nmartel

              We modeled and tested various combinations of entity beans and used session-wrapped transactions to cure/avoid most race conditions. We noticed though, that there are still ways to get into deadlock states.

              We cured our remaining deadlocks by selecting MySQL (and thus InnoDB) and using the built-in database deadlock detection and rollback mechanism to detect and recover from the occassional race. (To do this requires you set your default table types to InnoDB with XDoclet.)

              Other important things we noted was that we had to set the MySQL/InnoDB detection time to a reasonable number (10 seconds for us) and then we had to ensure that there was a sufficient number of threads in the datasource pool <min-pool-size> and <max-pool-size> to handle the worst-case load. An insufficient number of threads in the DS pool will cause a deadlock that can't be broken since there is no thread to perform a rollback. The 10 second timer (or whatever is reasonable for you) ensures that you notice the deadlock in a timely fashion in order to correct it. Too low a timer setting, and you cause un-necessary rollbacks, too high a setting, and you get the exception you noted above.

              Getting large loads to run is a slow and difficult job. I can only generalize on issues seen and corrected. If you can post more details about your specific situation, maybe I can suggest more things to look at.

              Other quirks that got us into trouble was the J-Groups and tree-cache. It was noted that the J-Groups multi-casting method can be problematic in that in high traffic farms, the switches tend to toss lower prioroity traffic such as broadcasts and multicasts. You may want to measure the traffic levels on your network interfaces and look at your network stats to ensure that traffic discard is not a contibutor to the scenario.

              • 4. Re: JBoss 3.2.2 Clustering Issues After Servers are Running
                nmartel

                I re-read your log again. A server name being thrown out could be the j-groups dropping you and re-adding you. Do you see a message indicating a new generation of the cluster being formed?

                Again, I am thinking that the multi-cast messages are getting dropped due to traffic conditions. Multi-casting could cause some gyrations. You could open up the grace window by changing some of the parameters in cluster-service.xml. You could increase the number of attempts from 3 to 5. You could also increase the reply time timeout (it's possible the 2 seconds default is TOO slow for the load running on your server). You might have to soften the timeout to more like 5 seconds. Lastly, if the message delivery is really nasty, you could use QOS or some other means to give your J-Groups traffic priority. Failing that, you might want to consider changing the J-Groups traffic to TCP/IP unicast messages. Although this would be more cumbersome, the deliver of unicast IP traffic is almost guaranteed (as compared to Multicasts).

                • 5. Re: JBoss 3.2.2 Clustering Issues After Servers are Running
                  rcostanzo

                  We do have an issue where servers will randomly drop out of the cluster. I will try adjusting the JGroups settings as you've recommended. I don't have a lot of faith in the environment that our production boxes are currently hosted in, so I wouldn't be surprised if we are dropping a good amount of the multicast packets.

                  I will also try decreasing the transaction rollback timeout, which is currently set to 60 seconds. I'm thinkng 30 seconds will be a good amount to try initially.

                  Thanks for the suggestions.

                  • 6. Re: JBoss 3.2.2 Clustering Issues After Servers are Running
                    rcostanzo

                    A couple of questions on which configs to change. My current jgroups settings in cluster-service looks like:



                    <!-- Name of the partition being built -->
                    ProductionPartition
                    <!-- Determine if deadlock detection is enabled -->
                    False
                    <!-- The JGroups protocol configuration -->


                    <!-- UDP: if you have a multihomed machine,
                    set the bind_addr attribute to the appropriate NIC IP address -->
                    <!-- UDP: On Windows machines, because of the media sense feature
                    being broken with multicast (even after disabling media sense)
                    set the loopback attribute to true -->
                    <UDP mcast_addr="228.1.2.3" mcast_port="45566"
                    ip_ttl="64" ip_mcast="true"
                    mcast_send_buf_size="150000" mcast_recv_buf_size="80000"
                    ucast_send_buf_size="150000" ucast_recv_buf_size="80000"
                    loopback="false" />
                    <PING timeout="2000" num_initial_members="3"
                    up_thread="true" down_thread="true" />
                    <MERGE2 min_interval="5000" max_interval="10000" />
                    <FD shun="true" up_thread="true" down_thread="true"
                    timeout="2500" max_tries="5" />
                    <VERIFY_SUSPECT timeout="3000" num_msgs="3"
                    up_thread="true" down_thread="true" />
                    <pbcast.NAKACK gc_lag="50" retransmit_timeout="300,600,1200,2400,4800"
                    up_thread="true" down_thread="true" />
                    <pbcast.STABLE desired_avg_gossip="20000"
                    up_thread="true" down_thread="true" />
                    <UNICAST timeout="5000" window_size="100" min_threshold="10"
                    down_thread="true" />
                    <FRAG frag_size="8192"
                    down_thread="true" up_thread="true" />
                    <pbcast.GMS join_timeout="5000" join_retry_timeout="2000"
                    shun="true" print_local_addr="true" />
                    <pbcast.STATE_TRANSFER up_thread="true" down_thread="true" />




                    Which properties map to the number of attempts to send a message and the timeout for detecting another member is down? Any other parameters which look as if they need some tweaking?

                    Thanks again.

                    • 7. Re: JBoss 3.2.2 Clustering Issues After Servers are Running
                      nmartel

                      Okay... you have the correct config file... cluster-service.xml

                      See the section that says PING Timeout---change the 2000 to 5000 (which is 5 seconds) (Bela may have other insight or preferences here... but this works for me).

                      PING Msgs is how J-Groups performs its status checks. Upping the timeout to 5 seconds assumes that the multi-casts are actually delivered and allows your overloaded server time to run and reply to the pings.

                      In the FD section, change the timeout from 2500 to 4500, I noticed that your retries amount has already been bumped from 3 to 5... good.

                      Change the verify suspect from 3000 to 4750.

                      This is a good start and SHOULD handle many conditions.

                      You need to stabilize this mechanism before anything else can happen. The J-Groups is fundamental to the cluster operating correctly along with the tree-cache and manny other services.

                      Once done, you will need to monitor your network ports (using TcpDump or similar tool) and watch the MCAST packets. Watch the delay on the response packets. If you see too much of a strectch between the response and request, then tweaking these numbers get's a bit trickier. (A poor man's perf tool is PING. You could just run a series of pings while your server is loaded and measure the response/stretch. This is if'y at best since you don't have the Java Runtime and JBoss stack under you, but it does show machine/network latency). If you see missing responses or pings from the cluster peers, then we need to determine why those are missing. If you are using fast Ethernet ports and a lower end switch, chances are the multi-casts are being discarded. Again, make your adjustments, and run the server.

                      When the load gets up there, try the ping test. Also, if you can, capture a mintue or two with TCP dump and analyze the multi-cast packet flow and the time gaps between messages.

                      Once this is solid, we can move on to transactions, deadlocks, and rollbacks.

                      • 8. Re: JBoss 3.2.2 Clustering Issues After Servers are Running
                        nmartel

                        You could provide some helpful data for example:

                        CPU/System/Config -- such as Dell Precision 650 (P-IV 3.0 Ghz, 2G Ram), 3-Drive RAID (striped set) OS = RedHat Linux 9

                        Network -- Single 3Com NIC card with Gigabit Ethernet

                        Switch -- HP ProCurve 2424M

                        ===

                        Just some simple stuff to get me grounded...

                        Also, what is the number of sessions/transacations per second per server?
                        If this is web-facing, is there a load balancing switch in front? Is it a Cisco/Arrowpoint switch? Is sticky IP being used? Is there a separate back-end network for the server farm (versus a font-end network) for the Web/session traffic?

                        See where I am going with the load/config analysis?

                        • 9. Re: JBoss 3.2.2 Clustering Issues After Servers are Running
                          belaban

                          You could also replace FD with FD_SOCK, so you establish sockets between peers rather than use messaging/heartbeating to detect a crashed server.

                          Note, however, that this will *not* exclude a hung server.

                          Also search for my more detailed expl on this forum earlier.

                          Bela