6 Replies Latest reply on May 6, 2008 4:13 PM by Brian Stansberry

    Failover is taking 15 seconds to recognize by other node.

    nobleman nobleman Newbie

      I am using JBoss 4.0.4. I have set up two Linux machines, each with JBoss AS installed, and both JBoss instances are in a cluster. I am using Apache 2.0 and mod_jk for load balancing, configured according to the JBoss 4.0.4 documentation. Both servers are set up
      as node1 and node2 in the load balancer. I have also set stickysession=1 in the worker.properties file.
      When both servers are up, my application works fine.
      Here is the test case.
      While I am browsing my application, my session is on node2. I then shut node2 down gracefully. When I try to continue using the application within a couple of seconds, I get an error such as "bad request", or the application is not recognized. If I check the log on node1, about 15 seconds later node1 reports that a new view was accepted (the view lists the nodes in the cluster, which is now just one). So to me this failover is not smooth.
      Shouldn't it keep working even though node2 is down? If I retry after 15 seconds, it works fine.
      The same thing happens when I use stickysession=0.
      My use case: if a customer is buying something online and is on node2 of the cluster, and node2 goes down for some reason, the other node should take over instantly (within milliseconds) so the user's experience stays smooth. Otherwise the user
      gets an error page.

      Could you please let me know what configuration I should set up to avoid this type of problem?


        • 1. Re: Failover is taking 15 seconds to recognize by other node
          Brian Stansberry Master

          You're shutting node2 down gracefully? How are you doing it?

          Your symptoms sound more like a hard kill, where it's taking 15 secs for the other node to recognize missed heartbeats and suspect the dead node. To fix that you can add the FD_SOCK protocol to the JGroups config in the tc5-cluster-service.xml file. See http://wiki.jboss.org/wiki/FDVersusFD_SOCK for details.

          • 2. Re: Failover is taking 15 seconds to recognize by other node
            nobleman nobleman Newbie

            Hi Brian,
            I appreciate your reply. I manually shut down node2 for testing purposes, so that in production, if a node goes down or hangs for some reason, failover will be smooth and the user can be transferred to the other node without noticing any difference.
            In my /server/all/deploy/tc5-cluster.sar/META-INF/jboss-service.xml
            the following config is there:

            <UDP mcast_addr="${jboss.partition.udpGroup:}"
                 down_thread="false" up_thread="false"/>
            <PING timeout="2000"
                  down_thread="false" up_thread="false" num_initial_members="3"/>
            <MERGE2 max_interval="100000"
                    down_thread="false" up_thread="false" min_interval="20000"/>
            <FD shun="true" up_thread="false" down_thread="false"
                timeout="2500" max_tries="5"/>
            <VERIFY_SUSPECT timeout="1500"
                            up_thread="false" down_thread="false"/>
            <pbcast.NAKACK max_xmit_size="60000"
                           use_mcast_xmit="false" gc_lag="50"
                           down_thread="false" up_thread="false"/>
            <UNICAST timeout="300,600,1200,2400,3600"
                     down_thread="false" up_thread="false"/>
            <pbcast.STABLE stability_delay="1000" desired_avg_gossip="50000"
                           down_thread="false" up_thread="false"/>
            <pbcast.GMS print_local_addr="true" join_timeout="3000"
                        down_thread="false" up_thread="false"
                        join_retry_timeout="2000" shun="true"/>
            <!-- If your CacheMode is set to REPL_SYNC we recommend you
                 comment out the FC (flow control) protocol -->
            <FC max_credits="10000000" down_thread="false" up_thread="false"/>
            <FRAG2 frag_size="60000" down_thread="false" up_thread="false"/>
            <pbcast.STATE_TRANSFER down_thread="false" up_thread="false"/>

            So you are suggesting that I replace FD with FD_SOCK? Would node1 then recognize the failure instantly?

            • 3. Re: Failover is taking 15 seconds to recognize by other node
              Brian Stansberry Master

              No, don't replace FD. Add FD_SOCK below FD.
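
              A sketch of that placement, reusing the FD and VERIFY_SUSPECT entries from the config posted above. This is a fragment of the protocol stack, not a complete file. The down_thread/up_thread attributes on FD_SOCK are an assumption, copied from the pattern the other protocols in that stack follow. The timing in the comment is a rough estimate from the posted values, not a measurement.

```xml
<!-- FD alone: roughly timeout * max_tries = 2500 ms * 5 = 12.5 s of missed
     heartbeats before the node is suspected, plus up to 1.5 s in
     VERIFY_SUSPECT. That is close to the 15 s delay reported. -->
<FD shun="true" up_thread="false" down_thread="false"
    timeout="2500" max_tries="5"/>
<!-- Added: socket-based failure detection, directly below FD.
     Attributes are assumed, following the rest of the stack. -->
<FD_SOCK down_thread="false" up_thread="false"/>
<VERIFY_SUSPECT timeout="1500"
                up_thread="false" down_thread="false"/>
```

              FD_SOCK detects a closed socket rather than counting missed heartbeats, so it helps when a process dies but not necessarily when a node hangs, which is why FD stays in the stack.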

              You didn't answer my question about exactly how you shut down the node. :) It's important; if you are truly doing a clean shutdown the node shutting down informs the cluster of that fact and the other node should recognize the shutdown quickly.

              • 4. Re: Failover is taking 15 seconds to recognize by other node
                nobleman nobleman Newbie

                Thanks Brian again,
                Yes, I am shutting down the other server with ./bin/shutdown.sh -S.
                Even after adding <FD_SOCK/>, it didn't work. It still takes too long to recognize the shutdown, so the request fails and the session loses continuity.
                In the browser I get the following error message:
                No Host matches server name test1.dev.test.com
                Your browser sent a request that this server could not understand.


                • 5. Re: Failover is taking 15 seconds to recognize by other node
                  nobleman nobleman Newbie

                  Hi Brian,
                  I made the changes you recommended, but there is still no difference. It still takes a long time (15 to 20 seconds) for the other node to recognize the shutdown, which is impractical in a production environment.

                  • 6. Re: Failover is taking 15 seconds to recognize by other node
                    Brian Stansberry Master

                    Been on vacation.

                    Please post your server.log showing what happens.