3 Replies Latest reply on Jun 1, 2005 6:17 PM by menkun

    Failover takes an extremely long time after unplugging the network cable

    menkun

      Hi guys,
      Sorry to bother you. I have two servers (A and B) in my cluster; both run Windows 2000 and JBoss 4.0.2.

      I have a stateful session bean deployed on both servers, and a standalone client running on one of them (A). When I test JBoss failover by killing one server with Ctrl+C, my standalone client does not notice anything and fails over to the other server perfectly. However, if I unplug the network cable of server B, failover to A takes an extremely long time. This is the output on A after I unplug B's cable:

      09:46:22,181 WARN [FD] ping_dest is null: members=[RDUL15163:3305 (additional data: 18 bytes), RDUL05301:1135 (additional data: 18 bytes)], pingable_mbrs=[RDUL05301:1135 (additional data: 18 bytes)], local_addr=RDUL05301:1135 (additional data: 18 bytes)
      09:46:22,692 INFO [DefaultPartition] Suspected member: RDUL15163:3305 (additional data: 18 bytes)
      09:46:22,702 INFO [DefaultPartition] New cluster view for partition DefaultPartition (id: 6, delta: -1) : [192.168.1.100:1099]
      09:46:22,712 INFO [DefaultPartition] I am (192.168.1.100:1099) received membershipChanged event:
      09:46:22,712 INFO [DefaultPartition] Dead members: 1 ([192.168.1.101:1099])
      09:46:22,712 INFO [DefaultPartition] New Members : 0 ([])
      09:46:22,712 INFO [DefaultPartition] All Members : 1 ([192.168.1.100:1099])
      09:46:23,823 INFO [TomcatDeployer] deploy, ctxPath=/jbossmq-httpil, warUrl=file:/C:/Kun/jboss-4.0.2/server/all/deploy-hasingleton/jms/jbossmq-httpil.sar/jbossmq-httpil.war/
      09:46:24,384 INFO [A] Bound to JNDI name: queue/A
      09:46:24,394 INFO [B] Bound to JNDI name: queue/B
      09:46:24,414 INFO [C] Bound to JNDI name: queue/C
      09:46:24,424 INFO [D] Bound to JNDI name: queue/D
      09:46:24,444 INFO [ex] Bound to JNDI name: queue/ex
      09:46:24,504 INFO [testTopic] Bound to JNDI name: topic/testTopic
      09:46:24,524 INFO [securedTopic] Bound to JNDI name: topic/securedTopic
      09:46:24,544 INFO [testDurableTopic] Bound to JNDI name: topic/testDurableTopic
      09:46:24,554 INFO [testQueue] Bound to JNDI name: queue/testQueue
      09:46:24,674 INFO [UILServerILService] JBossMQ UIL service available at : /0.0.0.0:8093
      09:46:24,835 INFO [DLQ] Bound to JNDI name: queue/DLQ


      It seems to take almost 30 seconds to fail over to A. Could anybody suggest how to shorten this waiting time? Thanks a lot!

        • 1. Re: Failover take extreme long time with unplug the network
          schrouf

          I would guess that this delay is most probably caused by the default connect/read timeout configuration of the underlying RMI TCP/UDP sockets.

          Regards
          Ulf
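
To illustrate the kind of delay Ulf is guessing at: an unplugged cable produces no close or reset on an open TCP connection, so a blocking read only gives up when a read timeout fires. The sketch below is not JBoss or RMI code. It is a minimal, self-contained illustration in which a local server socket that never answers stands in for the unreachable node; the class name `SilentPeerDemo` and the 300 ms value are assumptions chosen for the example.

```java
import java.io.IOException;
import java.net.InetAddress;
import java.net.ServerSocket;
import java.net.Socket;
import java.net.SocketTimeoutException;

public class SilentPeerDemo {

    /**
     * Connects to a local peer that never sends anything and returns how many
     * milliseconds the read blocked before the read timeout fired.
     */
    static long readFromSilentPeer(int readTimeoutMillis) throws IOException {
        // The listen backlog lets the connect complete even though accept()
        // is never called, so the peer stays silent like an unplugged host.
        try (ServerSocket silentPeer =
                     new ServerSocket(0, 1, InetAddress.getLoopbackAddress());
             Socket client = new Socket(silentPeer.getInetAddress(),
                                        silentPeer.getLocalPort())) {
            // Without this call the read below would block indefinitely.
            client.setSoTimeout(readTimeoutMillis);
            long start = System.nanoTime();
            try {
                client.getInputStream().read();
                throw new IllegalStateException("peer unexpectedly sent data or closed");
            } catch (SocketTimeoutException expected) {
                return (System.nanoTime() - start) / 1_000_000;
            }
        }
    }

    public static void main(String[] args) throws IOException {
        System.out.println("read gave up after ~" + readFromSilentPeer(300) + " ms");
    }
}
```

The point of the sketch is only that the waiting time is set by whichever timeout applies to the socket in question; which timeout is actually responsible for the delay in this thread is not established here.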

          • 2. Re: Failover take extreme long time with unplug the network
            menkun

            Thanks for your advice. I have changed tc5-cluster-service.xml on both servers, but it does not seem to help: failover still takes just as long when I unplug the network cable. Do I need to change the configuration inside jgroups.jar instead? Here is my tc5-cluster-service.xml file:

            <?xml version="1.0" encoding="UTF-8"?>
            
            <!-- ===================================================================== -->
            <!-- -->
            <!-- Customized TreeCache Service Configuration for Tomcat 5 Clustering -->
            <!-- -->
            <!-- ===================================================================== -->
            
            <server>
            
             <classpath codebase="./lib" archives="jboss-cache.jar, jgroups.jar"/>
            
             <!-- ==================================================================== -->
             <!-- Defines TreeCache configuration -->
             <!-- ==================================================================== -->
            
             <mbean code="org.jboss.cache.TreeCache"
             name="jboss.cache:service=TomcatClusteringCache">
            
             <depends>jboss:service=Naming</depends>
             <depends>jboss:service=TransactionManager</depends>
            
             <!-- Configure the TransactionManager -->
             <attribute name="TransactionManagerLookupClass">org.jboss.cache.JBossTransactionManagerLookup</attribute>
            
             <!--
             Isolation level : SERIALIZABLE
             REPEATABLE_READ (default)
             READ_COMMITTED
             READ_UNCOMMITTED
             NONE
             -->
             <attribute name="IsolationLevel">REPEATABLE_READ</attribute>
            
             <!--
             Valid modes are LOCAL, REPL_ASYNC and REPL_SYNC
             -->
             <attribute name="CacheMode">REPL_ASYNC</attribute>
            
             <!-- Name of cluster. Needs to be the same for all clusters, in order
             to find each other
             -->
             <attribute name="ClusterName">Tomcat-Cluster</attribute>
            
             <!-- JGroups protocol stack properties. Can also be a URL,
             e.g. file:/home/bela/default.xml
             <attribute name="ClusterProperties"></attribute>
             -->
            
             <attribute name="ClusterConfig">
             <!--
             The default UDP stack:
             - If you have a multihomed machine, set the UDP protocol's bind_addr attribute to the
             appropriate NIC IP address, e.g bind_addr="192.168.0.2".
             - On Windows machines, because of the media sense feature being broken with multicast
             (even after disabling media sense) set the UDP protocol's loopback attribute to true
             -->
             <config>
             <UDP mcast_addr="230.1.2.7" mcast_port="45577"
             ip_ttl="8" ip_mcast="true"
             mcast_send_buf_size="150000" mcast_recv_buf_size="80000"
             ucast_send_buf_size="150000" ucast_recv_buf_size="80000"
             loopback="true"/>
             <PING timeout="20" num_initial_members="1"
             up_thread="false" down_thread="false"/>
             <MERGE2 min_interval="100" max_interval="200"/>
             <FD_SOCK/>
             <VERIFY_SUSPECT timeout="15"
             up_thread="false" down_thread="false"/>
             <pbcast.NAKACK gc_lag="5" retransmit_timeout="6,12,24,48"
             max_xmit_size="8192" up_thread="false" down_thread="false"/>
             <UNICAST timeout="6,12,24" window_size="100" min_threshold="10"
             down_thread="false"/>
             <pbcast.STABLE desired_avg_gossip="200"
             up_thread="false" down_thread="false"/>
             <FRAG frag_size="8192"
             down_thread="false" up_thread="false"/>
             <pbcast.GMS join_timeout="5" join_retry_timeout="2"
             shun="true" print_local_addr="true"/>
             <pbcast.STATE_TRANSFER up_thread="true" down_thread="true"/>
             </config>
            
             <!-- Alternate TCP stack: customize it for your environment, change bind_addr and initial_hosts -->
             <!--
             <config>
             <TCP bind_addr="thishost" start_port="7810" loopback="true"/>
             <TCPPING initial_hosts="thishost[7810],otherhost[7810]" port_range="3" timeout="3500"
             num_initial_members="3" up_thread="true" down_thread="true"/>
             <MERGE2 min_interval="5000" max_interval="10000"/>
             <FD shun="true" timeout="2500" max_tries="5" up_thread="true" down_thread="true" />
             <VERIFY_SUSPECT timeout="1500" down_thread="false" up_thread="false" />
             <pbcast.NAKACK down_thread="true" up_thread="true" gc_lag="100"
             retransmit_timeout="3000"/>
             <pbcast.STABLE desired_avg_gossip="20000" down_thread="false" up_thread="false" />
             <pbcast.GMS join_timeout="5000" join_retry_timeout="2000" shun="false"
             print_local_addr="true" down_thread="true" up_thread="true"/>
             <pbcast.STATE_TRANSFER up_thread="true" down_thread="true"/>
             </config>
             -->
            
             </attribute>
            
             <!-- Max number of milliseconds to wait for a lock acquisition -->
             <attribute name="LockAcquisitionTimeout">15000</attribute>
            
             </mbean>
            
            </server>
            



            • 3. Re: Failover take extreme long time with unplug the network
              menkun

              I have changed cluster-service.xml, so now if I unplug B, A detects it almost instantly. However, the failover itself still takes a long time. I went through the log and found:


              2005-06-01 17:45:56,897 DEBUG [org.jboss.ha.framework.interfaces.HAPartition.DefaultPartition] End notifyListeners, viewID: 10
              2005-06-01 17:46:52,297 DEBUG [org.jboss.ha.framework.server.HAPartitionImpl] dests=[], method_call=SessionState-'/HASessionState/Default'._setOwnership(ejb/MyBank, 192.168.1.100:1099:e9g3an3e-3, 192.168.1.100:1099, 3), mode=2, timeout=60000
              2005-06-01 17:46:52,297 DEBUG [org.jboss.ha.framework.server.HAPartitionImpl] destination list is non-null and empty: no need to send message
              2005-06-01 17:46:52,297 INFO [STDOUT] Show the fail over result!



              I am just curious: what was JBoss doing for almost a minute, between 17:45:56,897 and 17:46:52,297?
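
For reference, the exact size of that gap can be computed from the two log timestamps. The following is a small sketch, assuming the log4j-style `yyyy-MM-dd HH:mm:ss,SSS` format shown in the excerpt; the class and method names (`LogGap`, `gapMillis`) are made up for this example.

```java
import java.time.Duration;
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;

public class LogGap {

    // Timestamp layout as it appears in the server.log excerpt above.
    private static final DateTimeFormatter LOG_TIME =
            DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss,SSS");

    /** Returns the number of milliseconds between two log timestamps. */
    static long gapMillis(String earlier, String later) {
        LocalDateTime from = LocalDateTime.parse(earlier, LOG_TIME);
        LocalDateTime to = LocalDateTime.parse(later, LOG_TIME);
        return Duration.between(from, to).toMillis();
    }

    public static void main(String[] args) {
        long gap = gapMillis("2005-06-01 17:45:56,897", "2005-06-01 17:46:52,297");
        System.out.println("gap = " + gap + " ms"); // prints "gap = 55400 ms"
    }
}
```

So the silent period between the end of `notifyListeners` and the `_setOwnership` call is 55.4 seconds.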