8 replies. Latest reply on Nov 11, 2013 5:02 AM by nichele

Tomcat instance lost and mod-cluster behaviour. modcluster-97 ?

nichele Nov 6, 2013 7:02 AM

Hi all,

I'm using mod_cluster 1.2.0 with Tomcat 7. My environment is deployed on Amazon and, to keep the discussion simple, I have 1 httpd instance + 2 Tomcat instances (3 different VMs).

I have noticed that if a Tomcat instance is lost (stopped or terminated from the Amazon console), my environment becomes unstable.

httpd still reports the instance in mod_cluster-manager, sometimes with status OK and sometimes with status NOTOK.

At this point I have seen some requests still being sent to the dead node and, after 1 minute, being sent on to the other live node.

I have replicated this behavior using iptables on the Tomcat instance, blocking outgoing connections to port 6666 and incoming traffic on port 8009:

iptables -A OUTPUT -p tcp -m state --state NEW -m tcp --dport 6666 -j DROP

iptables -A INPUT -m state --state NEW,ESTABLISHED -p tcp --dport 8009 -j DROP

My concern is that mod_cluster should know that the node is not working as expected (at the very least because it no longer sends status information) and should remove it from the list of nodes.

I just found this issue:

[MODCLUSTER-97] httpd should remove workers who crashed - JBoss Issue Tracker

which is really similar to my case. Is this a regression?

This is my Tomcat configuration:

<Listener className="org.jboss.modcluster.container.catalina.standalone.ModClusterListener"
          advertise="false"
          proxyList="zzzzz:6666"
          maxAttempts="3"
          nodeTimeout="600"
          workerTimeout="-1"
          ping="60"
          stickySession="true"
          stickySessionRemove="false"
          stickySessionForce="false"
          loadMetricClass="org.jboss.modcluster.load.metric.impl.AverageSystemLoadMetric"
          loadMetricCapacity="20"/>

And this is my httpd configuration:

Listen *:6666
Order deny,allow
#Deny from all
Allow from all
</Directory>
KeepAliveTimeout 60
MaxKeepAliveRequests 0
ManagerBalancerName mycluster
ServerAdvertise Off
EnableMCPMReceive
</VirtualHost>

Many thanks in advance.

ste

1. Re: Tomcat instance lost and mod-cluster behaviour. modcluster-97 ?

rhusar Nov 6, 2013 8:48 AM (in response to nichele)

Sorry, what happens if you simulate the failover while a request is about to be routed to the node that has failed?

Also, the test is not quite right: an iptables DROP does not correspond 1:1 to a pulled cable or a dead process. You could try kill -9 on the Java process.
2. Re: Re: Tomcat instance lost and mod-cluster behaviour. modcluster-97 ?

nichele Nov 6, 2013 4:28 PM (in response to rhusar)

Thanks for your answer.
What do you mean by simulating the failover?

About iptables vs. kill -9: note that I'm not talking about a dead process but about a dead instance.
If I kill the process (with the -9 option), mod_cluster marks the node as NOTOK very quickly (which is the expected behavior).

But when a VM instance dies, the behavior is exactly the same as what I get with iptables.
The main difference, for example:
nc -vz xx.rr.tt.ww 8009
returns an error immediately if the process is down. But if xx.rr.tt.ww:8009 is blocked by a firewall, the nc command times out after 60 sec. The same happens if the instance is dead (nc times out after 60 sec).
So I suppose that if the process is dead but the instance is reachable, mod_cluster (like nc) can quickly detect that the Tomcat process is not running, because the host actively refuses the connection. If the instance is not reachable (because of a firewall, or because it is dead), mod_cluster (like nc) needs a long time (my PING value?) to detect that Tomcat is unreachable. In the meantime... something bad happens.
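The refused-versus-timeout difference that nc shows can be reproduced with a small TCP probe. This is a minimal Python sketch, not anything mod_cluster itself runs: it assumes a plain TCP connect is a fair stand-in for how httpd first reaches the AJP port, and probe and its outcome labels are hypothetical names chosen for illustration.

```python
import socket
import time


def probe(host: str, port: int, timeout: float) -> tuple[str, float]:
    """Hypothetical helper (not part of mod_cluster or nc): try one TCP
    connect, like `nc -vz host port`, and classify the result.

    Returns (outcome, elapsed_seconds), where outcome is one of:
      "open"    - something accepted the connection
      "refused" - the host answered, typically with a TCP reset
                  (host alive, nothing listening on the port)
      "timeout" - no answer within `timeout` (packets dropped, or host gone)
      "error"   - any other socket error (e.g. name resolution failure)
    """
    start = time.monotonic()
    try:
        # The connection is closed again as soon as it is established.
        with socket.create_connection((host, port), timeout=timeout):
            outcome = "open"
    except ConnectionRefusedError:
        outcome = "refused"
    except socket.timeout:
        outcome = "timeout"
    except OSError:
        outcome = "error"
    return outcome, time.monotonic() - start
```

For example, probe("xx.rr.tt.ww", 8009, timeout=60.0) with the real node address should return "refused" within milliseconds when only the Tomcat process is dead, and "timeout" after roughly 60 seconds when the port is firewalled or the VM is gone, matching what nc reports.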

For example:
at 21:01:16 I ran the iptables command
at 21:02:21 mod_cluster marked the node as NOTOK
at 21:02:39 the node is marked as OK
at 21:02:55 the node is marked as NOTOK
at 21:03:43 the node is marked as OK
at 21:03:55 the node is marked as NOTOK
at 21:05:57 the node is marked as OK
at 21:06:58 the node is marked as NOTOK
at 21:08:53 the node is marked as OK
at 21:09:54 the node is marked as NOTOK

Meanwhile, some requests were forwarded to the blocked node, where they waited 1 minute (I suppose because of my PING value) before being sent to the working node.
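To check the guess that the one-minute delay comes from PING=60, the timeline above can be turned into intervals between state changes. This is a minimal Python sketch; EVENTS and seconds_between are hypothetical names, and the timestamps are copied by hand from the list above.

```python
from datetime import datetime

# Hypothetical data table: timestamps copied from the timeline above.
# The first entry is when the iptables rules were applied.
EVENTS = [
    ("21:01:16", "iptables"),
    ("21:02:21", "NOTOK"),
    ("21:02:39", "OK"),
    ("21:02:55", "NOTOK"),
    ("21:03:43", "OK"),
    ("21:03:55", "NOTOK"),
    ("21:05:57", "OK"),
    ("21:06:58", "NOTOK"),
    ("21:08:53", "OK"),
    ("21:09:54", "NOTOK"),
]


def seconds_between(events):
    """Return (label, seconds since the previous event) for every event
    after the first. All timestamps are assumed to fall on the same day."""
    times = [datetime.strptime(t, "%H:%M:%S") for t, _ in events]
    return [
        (label, int((cur - prev).total_seconds()))
        for (_, label), prev, cur in zip(events[1:], times, times[1:])
    ]


for label, delta in seconds_between(EVENTS):
    print(f"{label:>6} after {delta:3d} s")
```

On this data the first NOTOK arrives 65 s after the iptables rules, which is close to one PING interval, but the later OK/NOTOK flips are irregular (from 12 s to 122 s apart), so the intervals alone don't prove that the PING value is the cause.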

Ideas?

thx
ste
3. Re: Tomcat instance lost and mod-cluster behaviour. modcluster-97 ?

nichele Nov 7, 2013 2:32 AM (in response to nichele)

I just ran a test with PING=10:

at 07:17:33 firewall closed
at 07:17:53 NOTOK
at 07:18:12 OK
at 07:18:24 NOTOK
at 07:18:30 OK
at 07:18:39 NOTOK
at 07:18:42 OK
at 07:18:51 NOTOK
...and so on...

As before, in the meantime some requests were forwarded to the blocked node...

thx
ste
4. Re: Tomcat instance lost and mod-cluster behaviour. modcluster-97 ?

jfclere Nov 7, 2013 3:55 AM (in response to nichele)

Could you try with mod_cluster-1.2.6? Set LogLevel to debug and send the trace.
What is the httpd -V output?
5. Re: Tomcat instance lost and mod-cluster behaviour. modcluster-97 ?

nichele Nov 7, 2013 8:10 AM (in response to jfclere)

OK, I'll try the latest version.

In the meantime, here is the requested output:

[root]# httpd -V
Server version: Apache/2.2.24 (Unix)
Server built:   Mar 19 2013 14:33:22
Server's Module Magic Number: 20051115:31
Server loaded: APR 1.4.6, APR-Util 1.4.1
Compiled using: APR 1.4.6, APR-Util 1.4.1
Architecture:   64-bit
Server MPM:     Worker
  threaded:     yes (fixed thread count)
    forked:     yes (variable process count)
Server compiled with....
-D APACHE_MPM_DIR="server/mpm/worker"
-D APR_HAS_SENDFILE
-D APR_HAS_MMAP
-D APR_HAVE_IPV6 (IPv4-mapped addresses enabled)
-D APR_USE_SYSVSEM_SERIALIZE
-D APR_USE_PTHREAD_SERIALIZE
-D APR_HAS_OTHER_CHILD
-D AP_HAVE_RELIABLE_PIPED_LOGS
-D DYNAMIC_MODULE_LIMIT=128
-D HTTPD_ROOT="/usr/local/httpd-2.2.24"
-D SUEXEC_BIN="/usr/local/httpd-2.2.24/bin/suexec"
-D DEFAULT_SCOREBOARD="logs/apache_runtime_status"
-D DEFAULT_ERRORLOG="logs/error_log"
-D AP_TYPES_CONFIG_FILE="conf/mime.types"
-D SERVER_CONFIG_FILE="conf/httpd.conf"

thanks again
ste
6. Re: Tomcat instance lost and mod-cluster behaviour. modcluster-97 ?

nichele Nov 7, 2013 11:33 AM (in response to nichele)

I replicated the issue with mod_cluster 1.2.6.

Here you can find the log: http://tny.cz/eb0e6021
(I couldn't find a way to attach it.)

For reference:
- at 15:57:28 I started httpd
- at 15:59:31 I ran iptables on NODE02 (10.2.2.2) (before that, both nodes were OK)
- at 15:59:52 NODE02 was marked as NOTOK
- at 16:00:21 NODE02 was marked as OK

thx
ste
7. Re: Tomcat instance lost and mod-cluster behaviour. modcluster-97 ?

jfclere Nov 8, 2013 5:27 AM (in response to nichele)

After looking at the trace: there are unexpected STATUS messages from NODE02, the CPING/CPONG logic retries the node and also marks it OK in order to do the retry, and during that time some requests go to NODE02.
That is a bug triggered by your weird test. You should create a JIRA issue (and attach the log file to it).
8. Re: Tomcat instance lost and mod-cluster behaviour. modcluster-97 ?

nichele Nov 11, 2013 5:02 AM (in response to jfclere)

Created [MODCLUSTER-369] httpd should remove lost node/worker - JBoss Issue Tracker.
Just to clarify: I reproduced the issue by closing traffic on port 6666 (my earlier iptables command was incomplete because it doesn't block already existing connections). In the end the behaviour is the same, but there are no unexpected STATUS commands from NODE02.

ste