4 Replies Latest reply on Mar 18, 2014 11:21 AM by rhusar

    Trouble with mod_jk, JBoss EAP 6.1, clustering configuration - urgent help requested!

    dpasiuk

      Hello,

       

      I am looking for some urgently needed help with my JBoss EAP 6.1 clustering problem; at least, I believe it is a clustering issue.  The error stacks are coming from JGroups, and the problem happens only under load, although not as high a load as it could be.  I'll try to give as much information as possible.

       

      ENVIRONMENT (LOAD TESTING, NOT LIVE)

      o  two physical app servers, each with two JBoss server instances, clustered

      o  running an e-commerce production application where users access a store, browse products, search, etc.

      o  running on Apache 2.2.23/mod_jk, for comparison to JBoss 5 running on the same webserver

         o  I understand mod_cluster is the preferred load balancer, but for comparison to previous releases, I am running mod_jk

         o  unless I hear otherwise, I understand mod_jk is supported, albeit not well documented

      o  when starting up server instances, all running servers recognize the new member

      o  JVM is Oracle 1.7.0_40, see JAVA_OPTS below

       

      SYMPTOMS

      o  with a load tool, I can get up to about 1,400 users, then afterward, massive GCing and virtually no response

      o  see attached gc_graph.png, for representative server instance

      o  shows 8G heap, small new gen on top, heap usage for about 3.5hr

      o  all looks normal for about 3hr, which leads me to believe I at least have the basics configured correctly

      o  see attached errors.xls for errors, some examples:

      05:49:05,279 WARN  [nucleusNamespace.atg.userprofiling.ProfileAdapterRepository] (http-executor-threads - 113) Incremented an unexpected number of records.  Incremented: 2. Expected: 1

       

      05:50:15,219 WARN  [org.jgroups.protocols.pbcast.GMS] (ViewHandler,web,node10_1/web) node10_1/web: failed to collect all ACKs (expected=2) for view [node10_1/web|5] after 5000ms, missing ACKs from [node11_2/web]

       

      05:50:26,880 WARN  [org.jboss.as.clustering.web.infinispan] (OOB-44,shared=udp) JBAS010325: Possible concurrency problem: Replicated version id 7 is less than or equal to in-memory version for session KjczrMUhPuZ6-NIpg6mRw3GR

       

      05:51:27,507 ERROR [nucleusNamespace.atg.dynamo.servlet.dafpipeline.VirtualContextRootInterceptor] (http-executor-threads - 1139) Could not forward request to context org.apache.catalina.core.ApplicationContextFacade@58253cde: ClientAbortException:  java.net.SocketException: Broken pipe

       

       

      APP SERVERS

      o  I have tried several different JAVA_OPTS, here are my current settings:

      JAVA_OPTS="-Xms8g -Xmx8g -XX:MaxPermSize=256m -XX:ThreadStackSize=128k -Djava.net.preferIPv4Stack=true -XX:+UseConcMarkSweepGC -XX:+CMSClassUnloadingEnabled -XX:+ExplicitGCInvokesConcurrent -Dtomcat.util.buf.StringCache.byte.enabled=true -Dtomcat.util.buf.StringCache.char.enabled=true -Dtomcat.util.buf.StringCache.trainThreshold=5 -Dtomcat.util.buf.StringCache.cacheSize=500 -XX:+PrintCommandLineFlags -XX:+PrintGCTimeStamps -XX:+PrintGCDetails -Xloggc:<logname>_gc.log -Dsun.rmi.dgc.client.gcInterval=3600000 -Dsun.rmi.dgc.server.gcInterval=3600000 -Dsun.lang.ClassLoader.allowArraySyntax=true"

       

      o  see attached gc_graph.png

       

      o  here are my current settings as they pertain to connections to mod_jk:

      <subsystem xmlns="urn:jboss:domain:threads:1.1">

         <bounded-queue-thread-pool name="http-executor">

         <core-threads count="200"/>

         <queue-length count="50"/>

         <max-threads count="1000"/>

         <keepalive-time time="10" unit="seconds"/>

         </bounded-queue-thread-pool>

      </subsystem>

       

      <subsystem xmlns="urn:jboss:domain:web:1.4" default-virtual-server="default-host" instance-id="bus22410_node1" native="false">

      <connector name="http" protocol="HTTP/1.1" scheme="http" socket-binding="http"/>

      <connector name="ajp" protocol="AJP/1.3" enabled="true" scheme="http" socket-binding="ajp" executor="http-executor" max-connections="4000"/>

                  <virtual-server name="default-host" enable-welcome-root="true">

                      <alias name="localhost"/>

                      <alias name="example.com"/>

                  </virtual-server>

      </subsystem>

       

      o  see attached production_1.xml (from template standalone-ha.xml)

       

      WEBSERVER

      o  see attached mod_jk.conf, httpd.conf, workers.properties, and mod_jk_reconfig.log (not much to show in error_log)

      o  mod_jk_reconfig.log - no entries until 5:07am, cping/cpong; just a few similar errors until 5:48am; the last three lines appear to be thrown for the rest of the test time:

       

      [Sun Mar 16 05:07:46.162 2014] [32215:140024305211136] [error] ajp_connect_to_endpoint::jk_ajp_common.c (1026): (bus22410_node1) cping/cpong after connecting to the backend server failed (errno=110)

      [Sun Mar 16 05:07:46.162 2014] [32215:140024305211136] [error] ajp_send_request::jk_ajp_common.c (1630): (bus22410_node1) connecting to backend failed. Tomcat is probably not started or is listening on the wrong port (errno=110)

      [Sun Mar 16 05:48:29.902 2014] [32487:140025110910720] [error] ajp_get_reply::jk_ajp_common.c (2126): (bus22410_node2) Tomcat is down or refused connection. No response has been sent to the client (yet)

      [Sun Mar 16 05:48:30.393 2014] [29751:140025253586688] [error] ajp_get_reply::jk_ajp_common.c (2126): (bus22410_node2) Tomcat is down or refused connection. No response has been sent to the client (yet)

      [Sun Mar 16 05:48:30.395 2014] [13405:140024741631744] [error] ajp_get_reply::jk_ajp_common.c (2126): (bus22410_node2) Tomcat is down or refused connection. No response has been sent to the client (yet)

      [Sun Mar 16 05:48:30.446 2014] [32487:140024439494400] [error] ajp_get_reply::jk_ajp_common.c (2154): (bus22410_node2) Tomcat is down or network problems. Part of the response has already been sent to the client

      [Sun Mar 16 05:48:45.635 2014] [13202:140025018590976] [error] ajp_send_request::jk_ajp_common.c (1630): (bus22410_node2) connecting to backend failed. Tomcat is probably not started or is listening on the wrong port (errno=115)

      [Sun Mar 16 05:48:45.635 2014] [13202:140025018590976] [error] ajp_service::jk_ajp_common.c (2643): (bus22410_node2) connecting to tomcat failed.

      [Sun Mar 16 05:48:45.644 2014] [13405:140024405923584] [error] ajp_get_reply::jk_ajp_common.c (2154): (bus22410_node1) Tomcat is down or network problems. Part of the response has already been sent to the client

       

       

      That's it for now, I could go on...

       

      Thanks for any help you can provide.

       

      Dave Pasiuk

        • 1. Re: Trouble with mod_jk, JBoss EAP 6.1, clustering configuration - urgent help requested!
          wdfink

          Looks like the GC is not able to clean the memory as fast as needed.

          The picture shows heap utilisation between 3 and 8G over time, but the heap is being occupied faster and faster, and at the end the GC is not able to free it fast enough.

           

          Do you increase the load?

          You should analyze whether the Young or the Old Gen is used more, and whether short-lived objects are reaching the Old Gen before they can be GC'ed.
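
One rough way to look at young versus old generation usage from inside a JVM is the standard java.lang.management API. The sketch below is only an illustration: the class name HeapPoolReport is made up, the pool and collector names it prints depend on which collector is in use, and it reports on the JVM it runs in, so for the server it would have to run inside the server or be replaced by the same MXBeans read over JMX.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryPoolMXBean;
import java.lang.management.MemoryType;
import java.lang.management.MemoryUsage;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration; not part of JBoss or the application.
final class HeapPoolReport {

    private static final long MB = 1024L * 1024L;

    // One line per heap pool (the young-generation spaces and the old generation).
    static List<String> heapPoolLines() {
        List<String> lines = new ArrayList<String>();
        for (MemoryPoolMXBean pool : ManagementFactory.getMemoryPoolMXBeans()) {
            if (pool.getType() != MemoryType.HEAP) {
                continue; // skip non-heap pools such as the code cache
            }
            MemoryUsage usage = pool.getUsage();
            if (usage == null) {
                continue; // the pool is no longer valid
            }
            String max = usage.getMax() < 0 ? "undefined" : (usage.getMax() / MB) + "MB";
            lines.add(String.format("%s: used=%dMB committed=%dMB max=%s",
                    pool.getName(), usage.getUsed() / MB, usage.getCommitted() / MB, max));
        }
        return lines;
    }

    // One line per collector: cumulative collection count and time since JVM start.
    static List<String> collectorLines() {
        List<String> lines = new ArrayList<String>();
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            lines.add(String.format("%s: collections=%d time=%dms",
                    gc.getName(), gc.getCollectionCount(), gc.getCollectionTime()));
        }
        return lines;
    }

    public static void main(String[] args) {
        for (String line : heapPoolLines()) {
            System.out.println(line);
        }
        for (String line : collectorLines()) {
            System.out.println(line);
        }
    }
}
```

Sampling these numbers at intervals during the load test would show whether the old generation keeps growing between collections while the young-generation collector runs more and more often.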

          • 2. Re: Trouble with mod_jk, JBoss EAP 6.1, clustering configuration - urgent help requested!
            dpasiuk

            Hi, again, Wolf-Dieter!  You helped me before with a similar problem.  This is the same environment, but a little further down the road.

             

            To answer your question: yes, the load is increased over time, so the growth in occupied heap before it starts GC'ing out of control is most likely due to that.

             

            Do you have any insight on the urn:jboss:domain:threads and subsystem xmlns="urn:jboss:domain:web:1.4" values I have set?  That's the new information I'm most interested in learning.  It seems to me that Apache/mod_jk is running out of connections, but I have those things set high.  Maybe too high?  I don't know.

             

            Thanks

            • 3. Re: Trouble with mod_jk, JBoss EAP 6.1, clustering configuration - urgent help requested!
              wdfink

              It looks like you are running into a situation where too many resources are blocked and the GC is not able to free enough memory. Therefore the GC runs more often and uses more CPU power, other threads are slowed down, and even more memory stays blocked.

              This will escalate and you end up in a GC hell where the GC uses most of the CPU and your application threads run for only a few milliseconds until the next GC.

               

              So you need to analyze how many user requests you can handle with a given configuration.

              Maybe you are able to change the GC settings (e.g. apply more parallel GC threads if you have enough cores, increase or decrease the heap size, rebalance old and young gen) or use the G1 GC with Java 7.

              But consider that any fine-tuning of the GC might cause issues if the load profile or the application changes.

               

              If you run with more than 75% of your resources for this instance I would add another JBoss instance if you need to handle such load.

              • 4. Re: Trouble with mod_jk, JBoss EAP 6.1, clustering configuration - urgent help requested!
                rhusar

                Totally looks like a forum post I have seen before already :-)

                 

                Anyhow, looking at the GC graph, that might as well be a memory leak -- are you absolutely sure this is really just a result of high load and not a memory leak? Run a simple test: before reaching an OOM, turn off all clients, run GC, and make sure heap usage drops all the way back to the initial levels.
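
The check described above could be sketched with the standard MemoryMXBean, roughly as follows. The class name HeapBaselineCheck and the 10% tolerance are assumptions made for illustration only. The sketch measures the JVM it runs in, so against the server it would need to run inside the server JVM, or the same bean could be driven over JMX (for example the "Perform GC" button in JConsole). Note also that gc() is only a request to the JVM, and with -XX:+ExplicitGCInvokesConcurrent, as in the JAVA_OPTS posted, an explicit request starts a concurrent cycle rather than a full stop-the-world collection.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryMXBean;

// Hypothetical helper for illustration; not part of JBoss or the application.
final class HeapBaselineCheck {

    // Requests a collection, then reports the heap bytes still in use.
    static long usedHeapAfterGc() {
        MemoryMXBean memory = ManagementFactory.getMemoryMXBean();
        memory.gc(); // same effect as System.gc(): a request, not a guarantee
        return memory.getHeapMemoryUsage().getUsed();
    }

    // True if current usage is within the given tolerance (0.10 = 10%) of the baseline.
    static boolean returnedToBaseline(long baselineBytes, long currentBytes, double tolerance) {
        return currentBytes <= baselineBytes * (1.0 + tolerance);
    }

    public static void main(String[] args) {
        long baseline = usedHeapAfterGc(); // in a real test: taken before any load is applied
        // ... the load test would run here, then all clients are stopped ...
        long afterLoad = usedHeapAfterGc();
        System.out.println("baseline bytes:   " + baseline);
        System.out.println("after-load bytes: " + afterLoad);
        System.out.println("back to baseline: " + returnedToBaseline(baseline, afterLoad, 0.10));
    }
}
```

If usage stays well above the baseline after the clients are gone and sessions have expired, a leak becomes the more likely explanation than load alone.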

                 

                Thinking about some mitigations for cases when you cannot buy more HW or add more nodes... if the AS just cannot handle any more load, you might want to find the manageable limit, open only that many AJP/HTTP connections, and queue the rest on the LB. Or do not queue, and show people a static page saying that the server is overloaded, etc. I have done something similar before, showing a loader bar and setting a redirect after a few seconds. People seemed to tolerate that well.
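
The "do not queue, show an overloaded page" variant can be sketched in plain Java with a semaphore. This is only an illustration of the idea, not how it was done in the setup described above: the names AdmissionGate and Request are hypothetical, the limit would come from your own load tests, and in a real deployment the same logic would sit in a servlet filter or be enforced at the load balancer instead.

```java
import java.util.concurrent.Semaphore;

// Hypothetical admission gate: at most maxConcurrent requests run at once,
// everything beyond that gets the "busy" response immediately instead of queueing.
final class AdmissionGate {

    // Minimal stand-in for "the work done for one request".
    interface Request<T> {
        T run();
    }

    private final Semaphore permits;

    AdmissionGate(int maxConcurrent) {
        this.permits = new Semaphore(maxConcurrent);
    }

    <T> T handle(Request<T> request, T busyResponse) {
        if (!permits.tryAcquire()) {
            return busyResponse; // over the limit: shed the request without blocking
        }
        try {
            return request.run();
        } finally {
            permits.release(); // always give the slot back, even on failure
        }
    }

    int availablePermits() {
        return permits.availablePermits();
    }

    public static void main(String[] args) {
        AdmissionGate gate = new AdmissionGate(2); // the limit of 2 is an arbitrary example
        String result = gate.handle(new Request<String>() {
            public String run() {
                return "rendered page";
            }
        }, "static overloaded page");
        System.out.println(result);
    }
}
```

Using tryAcquire() rather than acquire() is the design choice that matters here: a rejected request costs almost nothing, whereas a blocked one would still hold a worker thread while the heap is already under pressure.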