3 Replies Latest reply on Jun 1, 2005 6:17 PM by menkun

Failover take extreme long time with unplug the network cable

menkun Jun 1, 2005 10:00 AM

Hi guys,
I am sorry to bother you. I have two servers (A and B) in my cluster; both run W2K and JBoss 4.0.2.

I have a stateful session bean deployed on both servers and a standalone client running on one of them (A). When I test the JBoss failover by killing one server with Ctrl+C, my standalone client does not notice anything and fails over to the other server perfectly. However, if I unplug the network cable of server B, it takes an extremely long time to fail over to A. Here is the output on A after I unplug B's cable:

09:46:22,181 WARN [FD] ping_dest is null: members=[RDUL15163:3305 (additional data: 18 bytes), RDUL05301:1135 (additional data: 18 bytes)], pingable_mbrs=[RDUL05301:1135 (additional data: 18 bytes)], local_addr=RDUL05301:1135 (additional data: 18 bytes)
09:46:22,692 INFO [DefaultPartition] Suspected member: RDUL15163:3305 (additional data: 18 bytes)
09:46:22,702 INFO [DefaultPartition] New cluster view for partition DefaultPartition (id: 6, delta: -1) : [192.168.1.100:1099]
09:46:22,712 INFO [DefaultPartition] I am (192.168.1.100:1099) received membershipChanged event:
09:46:22,712 INFO [DefaultPartition] Dead members: 1 ([192.168.1.101:1099])
09:46:22,712 INFO [DefaultPartition] New Members : 0 ([])
09:46:22,712 INFO [DefaultPartition] All Members : 1 ([192.168.1.100:1099])
09:46:23,823 INFO [TomcatDeployer] deploy, ctxPath=/jbossmq-httpil, warUrl=file:/C:/Kun/jboss-4.0.2/server/all/deploy-hasingleton/jms/jbossmq-httpil.sar/jbossmq-httpil.war/
09:46:24,384 INFO [A] Bound to JNDI name: queue/A
09:46:24,394 INFO [B] Bound to JNDI name: queue/B
09:46:24,414 INFO [C] Bound to JNDI name: queue/C
09:46:24,424 INFO [D] Bound to JNDI name: queue/D
09:46:24,444 INFO [ex] Bound to JNDI name: queue/ex
09:46:24,504 INFO [testTopic] Bound to JNDI name: topic/testTopic
09:46:24,524 INFO [securedTopic] Bound to JNDI name: topic/securedTopic
09:46:24,544 INFO [testDurableTopic] Bound to JNDI name: topic/testDurableTopic
09:46:24,554 INFO [testQueue] Bound to JNDI name: queue/testQueue
09:46:24,674 INFO [UILServerILService] JBossMQ UIL service available at : /0.0.0.0:8093
09:46:24,835 INFO [DLQ] Bound to JNDI name: queue/DLQ

It seems to take almost 30 seconds to fail over to A. Could anybody suggest how to shorten this waiting time? Thanks a lot!

1. Re: Failover take extreme long time with unplug the network

schrouf Jun 1, 2005 10:44 AM (in response to menkun)

I would guess that this delay is most probably caused by the default connect/read timeout configuration of the underlying RMI TCP/UDP sockets?

Regards
Ulf
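
To illustrate the kind of timeout Ulf is referring to: when a server process is killed, its operating system closes the sockets and the peer usually sees that quickly, but when the cable is pulled no such notification arrives, so a blocked connect or read waits until some timeout expires. The following is only a minimal Java sketch of explicit connect and read timeouts on a plain socket. The class and method names (TimeoutProbe, canConnect, readTimesOut, probeSilentLocalPeer) are made up for this illustration and are not part of JBoss or JGroups, and the timeout values are arbitrary.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.InetSocketAddress;
import java.net.ServerSocket;
import java.net.Socket;
import java.net.SocketTimeoutException;

// Hypothetical helper, not JBoss code: shows explicit socket timeouts.
class TimeoutProbe {

    /** Tries to connect and gives up after connectTimeoutMs instead of the OS default. */
    static boolean canConnect(String host, int port, int connectTimeoutMs) {
        try (Socket socket = new Socket()) {
            socket.connect(new InetSocketAddress(host, port), connectTimeoutMs);
            return true;
        } catch (IOException e) {
            // Covers refused connections and SocketTimeoutException alike.
            return false;
        }
    }

    /** Returns true if a read from a peer that sends nothing is cut short by SO_TIMEOUT. */
    static boolean readTimesOut(String host, int port, int readTimeoutMs) {
        try (Socket socket = new Socket()) {
            socket.connect(new InetSocketAddress(host, port), readTimeoutMs);
            socket.setSoTimeout(readTimeoutMs); // without this, read() can block indefinitely
            socket.getInputStream().read();
            return false; // the peer sent data or closed the connection
        } catch (SocketTimeoutException e) {
            return true;
        } catch (IOException e) {
            return false;
        }
    }

    /** Runs both probes against a local server socket that accepts nothing and sends nothing. */
    static boolean[] probeSilentLocalPeer(int timeoutMs) {
        try (ServerSocket silentPeer = new ServerSocket(0)) {
            int port = silentPeer.getLocalPort();
            return new boolean[] {
                canConnect("127.0.0.1", port, timeoutMs),
                readTimesOut("127.0.0.1", port, timeoutMs)
            };
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        boolean[] result = probeSilentLocalPeer(500);
        System.out.println("connect ok: " + result[0]);
        System.out.println("read timed out: " + result[1]);
    }
}
```

The silent local peer only stands in for a host that has stopped answering; whether the delay in this thread really comes from such a socket timeout, and which JBoss setting would control it, is not established here.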

2. Re: Failover take extreme long time with unplug the network

menkun Jun 1, 2005 1:27 PM (in response to menkun)

Thanks for your advice. I have changed tc5-cluster-service.xml on both servers, but it does not seem to work: the failover still takes just as long if I unplug the network cable. Do I need to change the configuration in jgroups.jar instead? Here is my tc5-cluster-service.xml file:

<?xml version="1.0" encoding="UTF-8"?>

<!-- ===================================================================== -->
<!-- -->
<!-- Customized TreeCache Service Configuration for Tomcat 5 Clustering -->
<!-- -->
<!-- ===================================================================== -->

<server>

  <classpath codebase="./lib" archives="jboss-cache.jar, jgroups.jar"/>

  <!-- ==================================================================== -->
  <!-- Defines TreeCache configuration -->
  <!-- ==================================================================== -->

  <mbean code="org.jboss.cache.TreeCache"
         name="jboss.cache:service=TomcatClusteringCache">

    <depends>jboss:service=Naming</depends>
    <depends>jboss:service=TransactionManager</depends>

    <!-- Configure the TransactionManager -->
    <attribute name="TransactionManagerLookupClass">org.jboss.cache.JBossTransactionManagerLookup</attribute>

    <!--
      Isolation level : SERIALIZABLE
                        REPEATABLE_READ (default)
                        READ_COMMITTED
                        READ_UNCOMMITTED
                        NONE
    -->
    <attribute name="IsolationLevel">REPEATABLE_READ</attribute>

    <!--
      Valid modes are LOCAL, REPL_ASYNC and REPL_SYNC
    -->
    <attribute name="CacheMode">REPL_ASYNC</attribute>

    <!-- Name of cluster. Needs to be the same for all clusters, in order
         to find each other
    -->
    <attribute name="ClusterName">Tomcat-Cluster</attribute>

    <!-- JGroups protocol stack properties. Can also be a URL,
         e.g. file:/home/bela/default.xml
    <attribute name="ClusterProperties"></attribute>
    -->

    <attribute name="ClusterConfig">
      <!--
        The default UDP stack:
        - If you have a multihomed machine, set the UDP protocol's bind_addr attribute to the
          appropriate NIC IP address, e.g bind_addr="192.168.0.2".
        - On Windows machines, because of the media sense feature being broken with multicast
          (even after disabling media sense) set the UDP protocol's loopback attribute to true
      -->
      <config>
        <UDP mcast_addr="230.1.2.7" mcast_port="45577"
             ip_ttl="8" ip_mcast="true"
             mcast_send_buf_size="150000" mcast_recv_buf_size="80000"
             ucast_send_buf_size="150000" ucast_recv_buf_size="80000"
             loopback="true"/>
        <PING timeout="20" num_initial_members="1"
              up_thread="false" down_thread="false"/>
        <MERGE2 min_interval="100" max_interval="200"/>
        <FD_SOCK/>
        <VERIFY_SUSPECT timeout="15"
                        up_thread="false" down_thread="false"/>
        <pbcast.NAKACK gc_lag="5" retransmit_timeout="6,12,24,48"
                       max_xmit_size="8192" up_thread="false" down_thread="false"/>
        <UNICAST timeout="6,12,24" window_size="100" min_threshold="10"
                 down_thread="false"/>
        <pbcast.STABLE desired_avg_gossip="200"
                       up_thread="false" down_thread="false"/>
        <FRAG frag_size="8192"
              down_thread="false" up_thread="false"/>
        <pbcast.GMS join_timeout="5" join_retry_timeout="2"
                    shun="true" print_local_addr="true"/>
        <pbcast.STATE_TRANSFER up_thread="true" down_thread="true"/>
      </config>

      <!-- Alternate TCP stack: customize it for your environment, change bind_addr and initial_hosts -->
      <!--
      <config>
        <TCP bind_addr="thishost" start_port="7810" loopback="true"/>
        <TCPPING initial_hosts="thishost[7810],otherhost[7810]" port_range="3" timeout="3500"
                 num_initial_members="3" up_thread="true" down_thread="true"/>
        <MERGE2 min_interval="5000" max_interval="10000"/>
        <FD shun="true" timeout="2500" max_tries="5" up_thread="true" down_thread="true" />
        <VERIFY_SUSPECT timeout="1500" down_thread="false" up_thread="false" />
        <pbcast.NAKACK down_thread="true" up_thread="true" gc_lag="100"
                       retransmit_timeout="3000"/>
        <pbcast.STABLE desired_avg_gossip="20000" down_thread="false" up_thread="false" />
        <pbcast.GMS join_timeout="5000" join_retry_timeout="2000" shun="false"
                    print_local_addr="true" down_thread="true" up_thread="true"/>
        <pbcast.STATE_TRANSFER up_thread="true" down_thread="true"/>
      </config>
      -->

    </attribute>

    <!-- Max number of milliseconds to wait for a lock acquisition -->
    <attribute name="LockAcquisitionTimeout">15000</attribute>

  </mbean>

</server>

3. Re: Failover take extreme long time with unplug the network

menkun Jun 1, 2005 6:17 PM (in response to menkun)

I have changed cluster-service.xml, so now if I unplug B, A detects it almost instantly. However, the failover itself still takes a long time. I went through the log and found:

2005-06-01 17:45:56,897 DEBUG [org.jboss.ha.framework.interfaces.HAPartition.DefaultPartition] End notifyListeners, viewID: 10
2005-06-01 17:46:52,297 DEBUG [org.jboss.ha.framework.server.HAPartitionImpl] dests=[], method_call=SessionState-'/HASessionState/Default'._setOwnership(ejb/MyBank, 192.168.1.100:1099:e9g3an3e-3, 192.168.1.100:1099, 3), mode=2, timeout=60000
2005-06-01 17:46:52,297 DEBUG [org.jboss.ha.framework.server.HAPartitionImpl] destination list is non-null and empty: no need to send message
2005-06-01 17:46:52,297 INFO [STDOUT] Show the fail over result!

I am just interested to know what JBoss was doing for almost a minute, from 17:45:56,897 to 17:46:52,297.
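
For reference, the exact gap between the first and second log lines quoted above can be worked out from their timestamps. This is a small Java sketch; the class name LogGap and the method millisBetween are hypothetical names chosen for the example, and the pattern simply mirrors the timestamp layout in the quoted log.

```java
import java.time.Duration;
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;

// Hypothetical helper: measures the time between two log timestamps.
class LogGap {
    // Timestamp layout of the log lines quoted above, e.g. 2005-06-01 17:45:56,897
    private static final DateTimeFormatter LOG_TS =
            DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss,SSS");

    /** Returns the number of milliseconds from the earlier to the later timestamp. */
    static long millisBetween(String earlier, String later) {
        LocalDateTime start = LocalDateTime.parse(earlier, LOG_TS);
        LocalDateTime end = LocalDateTime.parse(later, LOG_TS);
        return Duration.between(start, end).toMillis();
    }

    public static void main(String[] args) {
        long gap = millisBetween("2005-06-01 17:45:56,897", "2005-06-01 17:46:52,297");
        System.out.println(gap + " ms"); // prints "55400 ms"
    }
}
```

That is 55.4 seconds, a little under the timeout=60000 that appears in the second log line, although the log alone does not show whether the two are related.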