8 Replies Latest reply on Mar 26, 2009 6:27 AM by arupkumarm

    Failure Detection when group coordinator dies

    luckywhd

      Hello,

      JBossCache and JGroups Services documentation (http://docs.jboss.org/jbossas/jboss4guide/r4/html/jbosscache.chapt.html#jbosscache-jgroups-fd-fd) states that when the FD failure detection protocol is used, the current group coordinator is responsible for updating the cluster's view when a node of the cluster dies. The documentation does not, however, say what happens when the group coordinator itself dies.

      We have been experiencing problems in a situation where the node acting as the cluster's coordinator crashes, is rebooted or is shut down. This seems to lead to a situation where the other nodes still see the dead node as the coordinator and are, for some reason, unable to elect a new one.

      Any ideas how to fix this?

        • 1. Re: Failure Detection when group coordinator dies
          manik

          The next available member takes over as a coordinator.
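
          If it helps to make that concrete: in JGroups the coordinator is simply the first address in the current view, so any member can watch the handover happen. A minimal sketch, assuming a reasonably recent 2.x JGroups and a plain JChannel with the default stack (not your actual TreeCache configuration):

          import org.jgroups.Address;
          import org.jgroups.JChannel;
          import org.jgroups.ReceiverAdapter;
          import org.jgroups.View;

          public class CoordinatorWatcher {
              public static void main(String[] args) throws Exception {
                  JChannel channel = new JChannel(); // default protocol stack
                  channel.setReceiver(new ReceiverAdapter() {
                      @Override
                      public void viewAccepted(View view) {
                          // the coordinator is the first member of the view; when it dies,
                          // the next surviving member moves to the front of the new view
                          Address coord = view.getMembers().get(0);
                          System.out.println("view=" + view + ", coordinator=" + coord);
                      }
                  });
                  channel.connect("coordinator-test");
                  Thread.sleep(60000); // watch views change while other nodes are stopped/restarted
                  channel.close();
              }
          }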

          • 2. Re: Failure Detection when group coordinator dies
            luckywhd

             

            "manik.surtani@jboss.com" wrote:
            The next available member takes over as a coordinator.


            Ok, that makes sense.

            Still, I'm confused by the behaviour of our cluster. We've been trying different configurations for FD and still occasionally experience problems when the coordinator is lost.

            • 3. Re: Failure Detection when group coordinator dies
              manik

              Have you looked at http://wiki.jboss.org/wiki/FDVersusFD_SOCK ?

              • 4. Re: Failure Detection when group coordinator dies
                luckywhd

                 

                "manik.surtani@jboss.com" wrote:
                Have you looked at http://wiki.jboss.org/wiki/FDVersusFD_SOCK ?


                We've also tried FD_SOCK, but it seems that it won't solve the problem.

                • 5. Re: Failure Detection when group coordinator dies
                  manik

                  Have you tried using both?
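
                  To make "both" concrete, here is a minimal sketch of a stack that runs FD_SOCK and FD together, so a crashed member is caught quickly via its closed socket while a hung-but-still-connected member is still caught by heartbeats. It uses the old-style plain stack string rather than your TreeCache ClusterConfig, and every value below is a placeholder, so treat it only as an illustration of the ordering:

                  import org.jgroups.JChannel;

                  public class FdPlusFdSock {
                      // illustrative JGroups 2.x-style stack; the relevant part is FD_SOCK plus FD
                      static final String STACK =
                          "UDP(mcast_addr=228.1.2.3;mcast_port=45566):" +
                          "PING(timeout=2000;num_initial_members=3):" +
                          "MERGE2(min_interval=5000;max_interval=10000):" +
                          "FD_SOCK:" +                                   // crashed members: detected via closed sockets
                          "FD(timeout=10000;max_tries=5;shun=true):" +   // hung members: detected via missed heartbeats
                          "VERIFY_SUSPECT(timeout=1500):" +
                          "pbcast.NAKACK(gc_lag=50;retransmit_timeout=600,1200,2400,4800):" +
                          "UNICAST(timeout=600,1200,2400):" +
                          "pbcast.STABLE(desired_avg_gossip=20000):" +
                          "pbcast.GMS(join_timeout=5000;join_retry_timeout=2000;shun=true;print_local_addr=true)";

                      public static void main(String[] args) throws Exception {
                          JChannel channel = new JChannel(STACK); // String-based constructor of JGroups 2.x
                          channel.connect("fd-test");
                          Thread.sleep(60000);
                          channel.close();
                      }
                  }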

                  • 6. Re: Failure Detection when group coordinator dies
                    luckywhd

                     

                    "manik.surtani@jboss.com" wrote:
                    Have you tried using both?


                    I guess we could try that and see if it helps...

                    Below is a more detailed example of our cluster hanging. It isn't exactly the same case I described above (a new coordinator is assigned when the current one dies), but it still relates to the coordinator problem.


                    Nodes in the cluster: A,B,C,D,E,F

                    1. Node A (the current coordinator) is shut down
                    -> Node C becomes a new coordinator

                    2. Node A is restarted
                    -> Node A sees two candidates for the coordinator: itself and C
                    -> Node A's join message to node C times out, A is unable to join the cluster

                    3. Node E is restarted
                    -> Node E sees two candidates for the coordinator: node A (node A is currently dead!) and C
                    -> Node E's join message to node C times out, E is unable to join the cluster

                    4. Node C is shut down
                    -> Node B becomes a new coordinator

                    5. Nodes A, E and C are restarted
                    -> each node is able to join the cluster


                    Summary of the problems encountered:
                    - nodes were unable to join the cluster after a new coordinator (C) had been assigned
                    - even though the ex-coordinator (A) was down, it was still seen as a candidate for coordinator
                    - the new coordinator (C) had to be shut down and yet another coordinator (B) elected in order to get the cluster working again


                    I'm wondering if the problem might lie in shutting down the original coordinator (A). Here are the log messages of node A (10.195.0.121) when it is shut down; the missing ACK is from node C (10.195.0.123), which becomes the new coordinator:

                    12:00:09,500 INFO [resin-destroy] TreeCache:1616 - stopService(): closing the channel
                    12:00:11,567 WARN [ViewHandler] GMS:409 - failed to collect all ACKs (5) for view [10.195.0.121:60413|44] [10.195.0.123:48908, 10.195.0.112:35954, 10.195.0.122:54362, 10.195.0.120:38607, 10.195.0.105:54567] after 2000ms, missing ACKs from [10.195.0.123:48908] (received=[10.195.0.112:35954, 10.195.0.120:38607, 10.195.0.105:54567, 10.195.0.121:60413, 10.195.0.122:54362]), local_addr=10.195.0.121:60413
                    12:00:11,572 INFO [resin-destroy] TreeCache:1622 - stopService(): stopping the dispatcher
                    12:00:12,073 WARN [resin-destroy] TreeCache:413 - Unexpected error during removal. jboss.cache:service=TreeCache-Invalidation-Cluster-prod
                    javax.management.InstanceNotFoundException: jboss.system:service=ServiceController
                     at com.caucho.jmx.AbstractMBeanServer.invoke(AbstractMBeanServer.java:728)
                     at org.jboss.system.ServiceMBeanSupport.postDeregister(ServiceMBeanSupport.java:409)
                     at com.caucho.jmx.MBeanContext.unregisterMBean(MBeanContext.java:304)
                     at com.caucho.jmx.MBeanContext.destroy(MBeanContext.java:565)
                     at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
                     at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
                     at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
                     at java.lang.reflect.Method.invoke(Method.java:585)
                     at com.caucho.loader.WeakCloseListener.classLoaderDestroy(WeakCloseListener.java:86)
                     at com.caucho.loader.Environment.closeGlobal(Environment.java:621)
                     at com.caucho.server.resin.ResinServer.destroy(ResinServer.java:653)
                     at com.caucho.server.resin.Resin$1.run(Resin.java:639)
                    


                    When node A (10.195.0.121) is shut down, nodes B, D, E and F log:

                    12:00:09,559 INFO [UpHandler (STATE_TRANSFER)] TreeCache:5673 - viewAccepted(): [10.195.0.121:60413|44] [10.195.0.123:48908, 10.195.0.112:35954, 10.195.0.122:54362, 10.195.0.120:38607, 10.195.0.105:54567]
                    12:00:36,662 INFO [UpHandler (STATE_TRANSFER)] TreeCache:5673 - viewAccepted(): [10.195.0.123:48910|44] [10.195.0.123:48910, 10.195.0.112:35956, 10.195.0.122:54364, 10.195.0.120:38610, 10.195.0.105:54569]
                    


                    but node C (10.195.0.123) only logs:

                    12:00:36,662 INFO [UpHandler (STATE_TRANSFER)] TreeCache:5673 - viewAccepted(): [10.195.0.123:48910|44] [10.195.0.123:48910, 10.195.0.112:35956, 10.195.0.122:54364, 10.195.0.120:38610, 10.195.0.105:54569]
                    




                    And here's what we get when we try to start nodes A and E while C is the coordinator:

                    16:03:06,847 WARN [DownHandler (GMS)] GMS:339 - there was more than 1 candidate for coordinator: {10.195.0.121:60413=1, 10.195.0.123:48908=3}
                    16:03:11,887 WARN [DownHandler (GMS)] GMS:127 - join(10.195.0.112:42521) sent to 10.195.0.123:48908 timed out, retrying
                    16:03:15,907 WARN [DownHandler (GMS)] GMS:339 - there was more than 1 candidate for coordinator: {10.195.0.121:60413=1, 10.195.0.123:48908=3}
                    16:03:20,920 WARN [DownHandler (GMS)] GMS:127 - join(10.195.0.112:42521) sent to 10.195.0.123:48908 timed out, retrying
                    ...
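
                    For reference, those warnings seem to map onto a couple of stack settings: the "more than 1 candidate for coordinator" and "join(...) timed out, retrying" lines are driven by PING discovery and the GMS join timeouts, and the "after 2000ms" in node A's shutdown log above matches GMS's default view-ACK collection window. Here is a hedged sketch of what we might try loosening, in the same plain-stack style as the earlier example; the values are guesses, and the view_ack_collection_timeout property is an assumption about the JGroups version in use:

                    import org.jgroups.JChannel;

                    public class JoinTimeoutSketch {
                        // trimmed-down plain stack; only the PING and GMS entries matter here
                        static final String STACK =
                            "UDP(mcast_addr=228.1.2.3;mcast_port=45566):" +
                            // discovery responses are where the stale candidate (dead node A) shows up
                            "PING(timeout=3000;num_initial_members=3):" +
                            "FD_SOCK:" +
                            "pbcast.NAKACK(gc_lag=50;retransmit_timeout=600,1200,2400,4800):" +
                            "UNICAST(timeout=600,1200,2400):" +
                            "pbcast.STABLE(desired_avg_gossip=20000):" +
                            // join_timeout/join_retry_timeout drive the "join(...) timed out, retrying" warnings;
                            // view_ack_collection_timeout is the 2000ms window from node A's shutdown warning
                            // (assumes the JGroups version in use exposes this property on GMS)
                            "pbcast.GMS(join_timeout=10000;join_retry_timeout=3000;" +
                            "view_ack_collection_timeout=5000;shun=true;print_local_addr=true)";

                        public static void main(String[] args) throws Exception {
                            JChannel channel = new JChannel(STACK);
                            channel.connect("join-test");
                            Thread.sleep(60000);
                            channel.close();
                        }
                    }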


                    • 7. Re: Failure Detection when group coordinator dies
                      lexsoto

                      Hello,

                      I'm experiencing the same problem. Did the last suggestion resolve it?

                      TIA,
                      Alex

                      • 8. Re: Failure Detection when group coordinator dies
                        arupkumarm

                        Hi,
                        We have been facing exactly the same problem. Is there a solution to it?

                        Thanks
                        Arup