6 Replies Latest reply on May 6, 2008 4:13 PM by brian.stansberry

Failover is taking 15 seconds to recognize by other node.

nobleman1997 Apr 30, 2008 1:47 AM

Hi,
I am using JBoss 4.0.4. I have set up two Linux machines, each with JBoss AS installed, and both JBoss instances are in a cluster. I use Apache 2.0 and mod_jk for load balancing, configured according to the JBoss 4.0.4 documentation. Both servers are set up as node1 and node2 in the load balancer. I have also set stickysession=1 in the worker.properties file.
When both servers are up, my application works fine.
Here is the test case.
While I am browsing my application, my session is on node2. I then shut node2 down gracefully. When I try to continue in the application within a couple of seconds, I get an error such as "bad request", or the application is not recognized. If I check the log on node1, only after 15 seconds does node1 report that a new view was accepted (the view lists the nodes in the cluster, which is now just one). So to me this failover is not smooth.
Shouldn't it keep working even though node2 is down? If I try again after 15 seconds, it works fine.
The same thing happens when I use stickysession=0.
My use case is this: if somebody is buying something online and is on node2 of the cluster, and node2 goes down for some reason, the other node should take over instantly (within milliseconds) so that the user's experience is smooth. Otherwise the user gets an error page.

Could you please let me know what configuration I should set up to avoid this type of problem?

Thanks
nobleman

1. Re: Failover is taking 15 seconds to recognize by other node

brian.stansberry May 1, 2008 9:31 AM (in response to nobleman1997)

You're shutting node2 down gracefully? How are you doing it?

Your symptoms sound more like a hard kill, where it's taking 15 secs for the other node to recognize missed heartbeats and suspect the dead node. To fix that you can add the FD_SOCK protocol to the JGroups config in the tc5-cluster-service.xml file. See http://wiki.jboss.org/wiki/FDVersusFD_SOCK for details.
2. Re: Failover is taking 15 seconds to recognize by other node

nobleman1997 May 1, 2008 2:32 PM (in response to nobleman1997)

Hi Brian,
I appreciate your reply. I manually shut down node2 for testing purposes, so that in production, if a node goes down or hangs for some reason, failover will be smooth and the user can be transferred to the other node without noticing any difference.
My /server/all/deploy/tc5-cluster.sar/META-INF/jboss-service.xml contains the following config:

<UDP mcast_addr="${jboss.partition.udpGroup:230.1.2.7}"
     mcast_port="45577"
     ucast_recv_buf_size="20000000"
     ucast_send_buf_size="640000"
     mcast_recv_buf_size="25000000"
     mcast_send_buf_size="640000"
     loopback="false"
     max_bundle_size="64000"
     max_bundle_timeout="30"
     use_incoming_packet_handler="true"
     use_outgoing_packet_handler="true"
     ip_ttl="2"
     down_thread="false" up_thread="false"
     enable_bundling="true"/>
<PING timeout="2000"
      down_thread="false" up_thread="false" num_initial_members="3"/>
<MERGE2 max_interval="100000"
        down_thread="false" up_thread="false" min_interval="20000"/>
<FD shun="true" up_thread="false" down_thread="false"
    timeout="2500" max_tries="5"/>
<VERIFY_SUSPECT timeout="1500"
                up_thread="false" down_thread="false"/>
<pbcast.NAKACK max_xmit_size="60000"
               use_mcast_xmit="false" gc_lag="50"
               retransmit_timeout="100,200,300,600,1200,2400,4800"
               down_thread="false" up_thread="false"
               discard_delivered_msgs="true"/>
<UNICAST timeout="300,600,1200,2400,3600"
         down_thread="false" up_thread="false"/>
<pbcast.STABLE stability_delay="1000" desired_avg_gossip="50000"
               down_thread="false" up_thread="false"
               max_bytes="2100000"/>
<pbcast.GMS print_local_addr="true" join_timeout="3000"
            down_thread="false" up_thread="false"
            join_retry_timeout="2000" shun="true"/>

<FC max_credits="10000000" down_thread="false" up_thread="false"
    min_threshold="0.20"/>
<FRAG2 frag_size="60000" down_thread="false" up_thread="false"/>
<pbcast.STATE_TRANSFER down_thread="false" up_thread="false"/>

So you are suggesting that I replace FD with FD_SOCK? And that way node1 would recognize the failure instantly?
3. Re: Failover is taking 15 seconds to recognize by other node

brian.stansberry May 1, 2008 3:38 PM (in response to nobleman1997)

No, don't replace FD, add FD_SOCK, below FD.
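
A minimal sketch of that placement, assuming the stack posted above and assuming FD_SOCK accepts the same down_thread/up_thread attributes as the other protocols in it. Only the failure-detection part of the stack is shown; the FD and VERIFY_SUSPECT values are copied unchanged from the posted config. The comments give a rough reading of the timing: with these values, heartbeat-based detection alone would need about 14 seconds, which is close to the 15 seconds observed.

```xml
<!-- Heartbeat-based detection: a node is suspected after roughly
     timeout * max_tries = 2500 ms * 5 = 12.5 s of missed heartbeats -->
<FD shun="true" up_thread="false" down_thread="false"
    timeout="2500" max_tries="5"/>
<!-- Added: socket-based detection. A member whose process dies is suspected
     as soon as its neighbor sees the TCP connection close. -->
<FD_SOCK down_thread="false" up_thread="false"/>
<!-- Double-checks a suspicion for up to 1500 ms before the member is
     excluded, bringing the FD-only worst case to about 14 s -->
<VERIFY_SUSPECT timeout="1500"
                up_thread="false" down_thread="false"/>
```

Keeping FD alongside FD_SOCK means a hung node, whose socket stays open, is still caught by the heartbeat timeout.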

You didn't answer my question about exactly how you shut down the node. :) It's important; if you are truly doing a clean shutdown the node shutting down informs the cluster of that fact and the other node should recognize the shutdown quickly.
4. Re: Failover is taking 15 seconds to recognize by other node

nobleman1997 May 1, 2008 8:27 PM (in response to nobleman1997)

Thanks Brian again,
Yes, I am shutting down the other server with ./bin/shutdown.sh -S.
Even after I added <FD_SOCK/>, it didn't work. The other node still takes a long time to recognize the shutdown, so the request fails and the session loses continuity.
In the browser I get the following error message:
No Host matches server name test1.dev.test.com
Your browser sent a request that this server could not understand.

Thanks
5. Re: Failover is taking 15 seconds to recognize by other node

nobleman1997 May 6, 2008 1:26 PM (in response to nobleman1997)

Hi Brian,
I made the changes you recommended, but there is still no difference. It still takes a long time (15 to 20 sec) for the other node to recognize the shutdown, which looks impractical in a production environment.
6. Re: Failover is taking 15 seconds to recognize by other node

brian.stansberry May 6, 2008 4:13 PM (in response to nobleman1997)

Been on vacation.

Please post your server.log showing what happens.