I urgently need help with a JBoss EAP 6.1 clustering problem (at least, I believe it is a clustering issue). The error stack traces come from JGroups, and the problem appears only under load, even though the load is not as high as it could be. I'll try to give as much information as possible.
ENVIRONMENT (LOAD TESTING, NOT LIVE)
o two physical app servers, each running two clustered JBoss server instances
o running an e-commerce production application where users access a store, browse products, search, etc.
o fronted by Apache 2.2.23 with mod_jk, for comparison with JBoss 5 running behind the same web server
o I understand mod_cluster is the preferred load balancer, but for comparison with previous releases I am running mod_jk
o unless I hear otherwise, I understand mod_jk is supported, albeit not well documented
o when starting up server instances, all running servers recognize the new member
o JVM is Oracle Java 1.7.0_40 (see JAVA_OPTS below)
o with a load-testing tool I can reach about 1,400 users; beyond that, there is massive GC activity and virtually no response
o see attached gc_graph.png for a representative server instance
o it shows the 8 GB heap (the small new generation on top) and heap usage over about 3.5 hours
o everything looks normal for about 3 hours, which leads me to believe that at least the basics are configured correctly
o see attached errors.xls for the errors; some examples:
05:49:05,279 WARN [nucleusNamespace.atg.userprofiling.ProfileAdapterRepository] (http-executor-threads - 113) Incremented an unexpected number of records. Incremented: 2. Expected: 1
05:50:15,219 WARN [org.jgroups.protocols.pbcast.GMS] (ViewHandler,web,node10_1/web) node10_1/web: failed to collect all ACKs (expected=2) for view [node10_1/web|5] after 5000ms, missing ACKs from [node11_2/web]
05:50:26,880 WARN [org.jboss.as.clustering.web.infinispan] (OOB-44,shared=udp) JBAS010325: Possible concurrency problem: Replicated version id 7 is less than or equal to in-memory version for session KjczrMUhPuZ6-NIpg6mRw3GR
05:51:27,507 ERROR [nucleusNamespace.atg.dynamo.servlet.dafpipeline.VirtualContextRootInterceptor] (http-executor-threads - 1139) Could not forward request to context org.apache.catalina.core.ApplicationContextFacade@58253cde: ClientAbortException: java.net.SocketException: Broken pipe
o I have tried several different JAVA_OPTS settings; here are my current ones:
JAVA_OPTS="-Xms8g -Xmx8g -XX:MaxPermSize=256m -XX:ThreadStackSize=128k -Djava.net.preferIPv4Stack=true -XX:+UseConcMarkSweepGC -XX:+CMSClassUnloadingEnabled -XX:+ExplicitGCInvokesConcurrent -Dtomcat.util.buf.StringCache.byte.enabled=true -Dtomcat.util.buf.StringCache.char.enabled=true -Dtomcat.util.buf.StringCache.trainThreshold=5 -Dtomcat.util.buf.StringCache.cacheSize=500 -XX:+PrintCommandLineFlags -XX:+PrintGCTimeStamps -XX:+PrintGCDetails -Xloggc:<logname>_gc.log -Dsun.rmi.dgc.client.gcInterval=3600000 -Dsun.rmi.dgc.server.gcInterval=3600000 -Dsun.lang.ClassLoader.allowArraySyntax=true"
o see attached gc_graph.png
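To put numbers next to the GC graph, the -Xloggc file written with the flags above can be summarized per time window. This is a minimal sketch, not a JDK or JBoss tool: the class name GcPauseSummary and the 10-minute window are hypothetical choices, and it assumes the JDK 7 CMS log format in which each stop-the-world entry (ParNew, CMS initial mark, remark, Full GC) starts with the uptime timestamp and carries its own `[Times: ... real=N secs]` on the same line. Wrapped or interleaved lines are skipped, so the totals are a lower bound.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/**
 * Heuristic summary of stop-the-world pause time per window from a JDK 7 log
 * written with -XX:+PrintGCTimeStamps -XX:+PrintGCDetails (assumed format).
 * Only lines starting with "<uptime>: [GC" or "<uptime>: [Full GC" that also
 * contain "real=... secs" are counted; "[CMS-concurrent-..." lines are skipped
 * because those phases run alongside the application.
 */
public class GcPauseSummary {
    private static final Pattern PAUSE = Pattern.compile("^(\\d+\\.\\d+): \\[(Full GC|GC)");
    private static final Pattern REAL = Pattern.compile("real=(\\d+\\.\\d+) secs");

    /** Maps window start (seconds of JVM uptime) to total paused seconds. */
    static Map<Long, Double> summarize(List<String> lines, long windowSecs) {
        Map<Long, Double> pausePerWindow = new TreeMap<>();
        for (String line : lines) {
            Matcher p = PAUSE.matcher(line);
            Matcher r = REAL.matcher(line);
            if (p.find() && r.find()) {
                double uptime = Double.parseDouble(p.group(1));
                long window = ((long) uptime / windowSecs) * windowSecs;
                double real = Double.parseDouble(r.group(1));
                Double prev = pausePerWindow.get(window);
                pausePerWindow.put(window, (prev == null ? 0.0 : prev) + real);
            }
        }
        return pausePerWindow;
    }

    public static void main(String[] args) throws IOException {
        long windowSecs = 600; // hypothetical 10-minute buckets
        List<String> lines = Files.readAllLines(Paths.get(args[0]), StandardCharsets.UTF_8);
        for (Map.Entry<Long, Double> e : summarize(lines, windowSecs).entrySet()) {
            double pct = 100.0 * e.getValue() / windowSecs;
            System.out.printf("t=%6ds  paused %7.2fs  (%5.1f%% of window)%n",
                    e.getKey(), e.getValue(), pct);
        }
    }
}
```

Run it as `java GcPauseSummary <logname>_gc.log`. A window whose paused share climbs toward 100% would correspond to the "massive GC, virtually no response" phase that starts after roughly 3 hours.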
o here are my current settings relevant to the connections from mod_jk:
<keepalive-time time="10" unit="seconds"/>
<subsystem xmlns="urn:jboss:domain:web:1.4" default-virtual-server="default-host" instance-id="bus22410_node1" native="false">
<connector name="http" protocol="HTTP/1.1" scheme="http" socket-binding="http"/>
<connector name="ajp" protocol="AJP/1.3" enabled="true" scheme="http" socket-binding="ajp" executor="http-executor" max-connections="4000"/>
<virtual-server name="default-host" enable-welcome-root="true">
o see attached production_1.xml (based on the standalone-ha.xml template)
o see attached mod_jk.conf, httpd.conf, workers.properties, and mod_jk_reconfig.log (there is not much to show in error_log)
o mod_jk_reconfig.log: no entries until 5:07am (cping/cpong failures), then only a few similar errors until 5:48am; the last three lines below appear to be thrown for the rest of the test:
[Sun Mar 16 05:07:46.162 2014] [32215:140024305211136] [error] ajp_connect_to_endpoint::jk_ajp_common.c (1026): (bus22410_node1) cping/cpong after connecting to the backend server failed (errno=110)
[Sun Mar 16 05:07:46.162 2014] [32215:140024305211136] [error] ajp_send_request::jk_ajp_common.c (1630): (bus22410_node1) connecting to backend failed. Tomcat is probably not started or is listening on the wrong port (errno=110)
[Sun Mar 16 05:48:29.902 2014] [32487:140025110910720] [error] ajp_get_reply::jk_ajp_common.c (2126): (bus22410_node2) Tomcat is down or refused connection. No response has been sent to the client (yet)
[Sun Mar 16 05:48:30.393 2014] [29751:140025253586688] [error] ajp_get_reply::jk_ajp_common.c (2126): (bus22410_node2) Tomcat is down or refused connection. No response has been sent to the client (yet)
[Sun Mar 16 05:48:30.395 2014] [13405:140024741631744] [error] ajp_get_reply::jk_ajp_common.c (2126): (bus22410_node2) Tomcat is down or refused connection. No response has been sent to the client (yet)
[Sun Mar 16 05:48:30.446 2014] [32487:140024439494400] [error] ajp_get_reply::jk_ajp_common.c (2154): (bus22410_node2) Tomcat is down or network problems. Part of the response has already been sent to the client
[Sun Mar 16 05:48:45.635 2014] [13202:140025018590976] [error] ajp_send_request::jk_ajp_common.c (1630): (bus22410_node2) connecting to backend failed. Tomcat is probably not started or is listening on the wrong port (errno=115)
[Sun Mar 16 05:48:45.635 2014] [13202:140025018590976] [error] ajp_service::jk_ajp_common.c (2643): (bus22410_node2) connecting to tomcat failed.
[Sun Mar 16 05:48:45.644 2014] [13405:140024405923584] [error] ajp_get_reply::jk_ajp_common.c (2154): (bus22410_node1) Tomcat is down or network problems. Part of the response has already been sent to the client
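The first failures above come from mod_jk's CPing/CPong health check (on Linux, errno=110 is ETIMEDOUT and errno=115 is EINPROGRESS). To check whether a backend's AJP connector still answers during the load test, independently of Apache, a small probe can send the same 5-byte AJP13 CPing packet (magic 0x1234, length 1, type 10) and wait for the CPong reply ('A' 'B', length 1, type 9). This is a sketch, not part of mod_jk or JBoss: the class name AjpPing, the 5000 ms timeout and the default port 8009 (the stock ajp socket-binding in standalone-ha.xml) are assumptions to adjust to your own configuration.

```java
import java.io.DataInputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.net.InetSocketAddress;
import java.net.Socket;
import java.util.Arrays;

/** Hypothetical AJP13 CPing/CPong probe, similar in spirit to mod_jk's cping check. */
public class AjpPing {
    // Web server -> container: magic 0x1234, payload length 1, type 10 (CPing)
    static final byte[] CPING = {0x12, 0x34, 0x00, 0x01, 0x0A};
    // Container -> web server: magic "AB", payload length 1, type 9 (CPong)
    static final byte[] CPONG = {'A', 'B', 0x00, 0x01, 0x09};

    /** Returns the CPing round trip in ms; throws IOException on timeout or bad reply. */
    static long ping(String host, int port, int timeoutMs) throws IOException {
        try (Socket s = new Socket()) {
            s.connect(new InetSocketAddress(host, port), timeoutMs); // connect timeout
            s.setSoTimeout(timeoutMs);                              // read timeout
            long start = System.nanoTime();
            OutputStream out = s.getOutputStream();
            out.write(CPING);
            out.flush();
            byte[] reply = new byte[CPONG.length];
            new DataInputStream(s.getInputStream()).readFully(reply);
            long elapsedMs = (System.nanoTime() - start) / 1000000L;
            if (!Arrays.equals(reply, CPONG)) {
                throw new IOException("unexpected reply: " + Arrays.toString(reply));
            }
            return elapsedMs;
        }
    }

    public static void main(String[] args) throws InterruptedException {
        String host = args.length > 0 ? args[0] : "localhost";
        int port = args.length > 1 ? Integer.parseInt(args[1]) : 8009; // assumed default
        for (int i = 0; i < 10; i++) {
            try {
                System.out.println("cpong in " + ping(host, port, 5000) + " ms");
            } catch (IOException e) {
                System.out.println("cping failed: " + e);
            }
            Thread.sleep(1000);
        }
    }
}
```

For example, `java AjpPing <app-server-host> 8009` prints one line per second. CPong times that grow from milliseconds into seconds during the GC-heavy phase would suggest a paused JVM rather than a network problem, although that is only one possible reading.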
That's it for now; I could go on...
Thanks for any help you can provide.