1. Re: Failover is taking 15 seconds to recognize by other node
brian.stansberry May 1, 2008 9:31 AM (in response to nobleman1997)
You're shutting node2 down gracefully? How are you doing it?
Your symptoms sound more like a hard kill, where it's taking 15 secs for the other node to recognize missed heartbeats and suspect the dead node. To fix that you can add the FD_SOCK protocol to the JGroups config in the tc5-cluster-service.xml file. See http://wiki.jboss.org/wiki/FDVersusFD_SOCK for details.
2. Re: Failover is taking 15 seconds to recognize by other node
nobleman1997 May 1, 2008 2:32 PM (in response to nobleman1997)
Hi Brian,
I appreciate your reply. I manually shut down node2 for testing purposes, so that in production, if one node goes down or hangs for some reason, failover will be smooth and users can be transferred to the other node without noticing any difference.
My /server/all/deploy/tc5-cluster.sar/META-INF/jboss-service.xml contains the following config:
<UDP mcast_addr="${jboss.partition.udpGroup:230.1.2.7}"
mcast_port="45577"
ucast_recv_buf_size="20000000"
ucast_send_buf_size="640000"
mcast_recv_buf_size="25000000"
mcast_send_buf_size="640000"
loopback="false"
max_bundle_size="64000"
max_bundle_timeout="30"
use_incoming_packet_handler="true"
use_outgoing_packet_handler="true"
ip_ttl="2"
down_thread="false" up_thread="false"
enable_bundling="true"/>
<PING timeout="2000"
down_thread="false" up_thread="false" num_initial_members="3"/>
<MERGE2 max_interval="100000"
down_thread="false" up_thread="false" min_interval="20000"/>
<FD shun="true" up_thread="false" down_thread="false"
timeout="2500" max_tries="5"/>
<VERIFY_SUSPECT timeout="1500"
up_thread="false" down_thread="false"/>
<pbcast.NAKACK max_xmit_size="60000"
use_mcast_xmit="false" gc_lag="50"
retransmit_timeout="100,200,300,600,1200,2400,4800"
down_thread="false" up_thread="false"
discard_delivered_msgs="true"/>
<UNICAST timeout="300,600,1200,2400,3600"
down_thread="false" up_thread="false"/>
<pbcast.STABLE stability_delay="1000" desired_avg_gossip="50000"
down_thread="false" up_thread="false"
max_bytes="2100000"/>
<pbcast.GMS print_local_addr="true" join_timeout="3000"
down_thread="false" up_thread="false"
join_retry_timeout="2000" shun="true"/>
<!-- If your CacheMode is set to REPL_SYNC we recommend you
comment out the FC (flow control) protocol -->
<FC max_credits="10000000" down_thread="false" up_thread="false"
min_threshold="0.20"/>
<FRAG2 frag_size="60000" down_thread="false" up_thread="false"/>
<pbcast.STATE_TRANSFER down_thread="false" up_thread="false"/>
So you are suggesting that I replace FD with FD_SOCK? And that way node1 would recognize the failure instantly?
3. Re: Failover is taking 15 seconds to recognize by other node
brian.stansberry May 1, 2008 3:38 PM (in response to nobleman1997)
No, don't replace FD, add FD_SOCK, below FD.
You didn't answer my question about exactly how you shut down the node. :) It's important; if you are truly doing a clean shutdown, the node shutting down informs the cluster of that fact and the other node should recognize the shutdown quickly.
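For reference, a rough sketch of how the failure detection part of the stack might look with FD_SOCK added below FD, as suggested above. The FD and VERIFY_SUSPECT values are copied from the config posted earlier in this thread; the down_thread/up_thread attributes on FD_SOCK are an assumption made only to match the other protocols, and the timings in the comments are approximate worst-case arithmetic, not measurements.

```xml
<!-- Heartbeat-based detection: a node is suspected after roughly
     timeout x max_tries = 2500 ms x 5 = 12500 ms of missed heartbeats -->
<FD shun="true" up_thread="false" down_thread="false"
    timeout="2500" max_tries="5"/>
<!-- Socket-based detection: each member keeps a TCP socket open to its
     neighbor, so a killed process is noticed as soon as the socket closes.
     The attributes here are assumed, mirroring the other protocols. -->
<FD_SOCK down_thread="false" up_thread="false"/>
<!-- Double-checks a suspicion for up to 1500 ms before the node is excluded:
     12500 ms + 1500 ms = 14000 ms, close to the 15 seconds reported -->
<VERIFY_SUSPECT timeout="1500"
    up_thread="false" down_thread="false"/>
```

FD stays in the stack because FD_SOCK only notices a closed socket; a node that hangs while its socket remains open would still have to be caught by missed heartbeats.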
4. Re: Failover is taking 15 seconds to recognize by other node
nobleman1997 May 1, 2008 8:27 PM (in response to nobleman1997)
Thanks again, Brian,
Yes, I am shutting down the other server with ./bin/shutdown.sh and the -S option.
Even after adding <FD_SOCK/>, it didn't work. It still takes a long time to recognize the shutdown, so the request fails and continuity is lost.
In the browser I get the following error message:
No Host matches server name test1.dev.test.com
Your browser sent a request that this server could not understand.
Thanks
5. Re: Failover is taking 15 seconds to recognize by other node
nobleman1997 May 6, 2008 1:26 PM (in response to nobleman1997)
Hi Brian,
I made the changes you recommended, but there is still no difference. It still takes a long time (15 to 20 sec) for the other node to recognize the failure, which is impractical in a production environment.
6. Re: Failover is taking 15 seconds to recognize by other node
brian.stansberry May 6, 2008 4:13 PM (in response to nobleman1997)
Been on vacation.
Please post your server.log showing what happens.