6 Replies Latest reply on Sep 1, 2006 12:29 PM by jbirkenmaier

    Network Loss in a 1 to n Node Cluster

    jbirkenmaier

      Hi,

      I am testing my application using 1 and 2 nodes in a cluster. I am using a shared TreeCache to share data which has a listener attached to it. I use this listener to tell me when a node disappears from the cluster. There are a couple of problems with this that I hope someone can help me with.

      First Problem (2-node cluster): if I Ctrl-C one of the two nodes, the remaining node detects the node loss and all is well. If I instead unplug the network cable on one of the nodes, JBoss detects a node loss on each of the two nodes. However, the TreeCache doesn't fire off an event. Is there another way for JBoss to inform me directly that there has been a node change in the cluster?

      Second Problem (Single node): When the network cable is unplugged I get no indication that JBoss has noticed. I know that one node does not a cluster make but is there some way for JBoss to inform me of a loss of network?

      Thanks in advance for the help,
      Jim

        • 1. Re: Network Loss in a 1 to n Node Cluster
          brian.stansberry

          First problem: TreeCacheListener.viewChange(View new_view) passes you a JGroups View object whenever there is a cluster topology change.

          Second problem: No, we have nothing like that.
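
          For the first problem, note that viewChange only hands you the new view, so a listener that wants to know which node disappeared has to remember the previous membership and compare. Below is a minimal sketch of that comparison using plain strings. MembershipDiff and its method names are hypothetical and not part of JBoss Cache or JGroups; in a real listener the lists would presumably be built from View.getMembers().

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper (not a JBoss Cache or JGroups class): compares two
// membership snapshots, as a viewChange callback might do with the members
// of the previous and the new view.
final class MembershipDiff {

    // Members that were in the previous view but are missing from the current one.
    static List<String> departed(List<String> previous, List<String> current) {
        List<String> gone = new ArrayList<String>(previous);
        gone.removeAll(current);
        return gone;
    }

    // Members that appear in the current view but were not in the previous one.
    static List<String> joined(List<String> previous, List<String> current) {
        List<String> added = new ArrayList<String>(current);
        added.removeAll(previous);
        return added;
    }

    public static void main(String[] args) {
        // Illustrative addresses only.
        List<String> before = Arrays.asList("192.168.69.122:32809", "192.168.69.253:33013");
        List<String> after = Arrays.asList("192.168.69.122:32809");

        System.out.println("Departed: " + departed(before, after));
        System.out.println("Joined: " + joined(before, after));
    }
}
```

          The listener would keep the last membership in a field, run the comparison inside viewChange, and then replace the stored membership with the new one.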

          • 2. Re: Network Loss in a 1 to n Node Cluster
            jbirkenmaier

            Thanks for the quick reply. Regarding the first problem: I see the view change on node 1 when I Ctrl-C JBoss on node 2. However, I don't see the view change when I unplug the network cable.

            Here is what's logged when I unplug the cable:

            10:41:27,090 INFO [dragoneyes] (UpHandler (STATE_TRANSFER)) Suspected member: 192.168.69.253:33013
            10:41:27,092 INFO [dragoneyes] (UpHandler (STATE_TRANSFER)) New cluster view for partition dragoneyes (id: 2, delta: -1) : [192.168.69.122:1099]
            10:41:27,092 INFO [dragoneyes] (AsynchViewChangeHandler Thread) I am (192.168.69.122:1099) received membershipChanged event:
            10:41:27,092 INFO [dragoneyes] (AsynchViewChangeHandler Thread) Dead members: 1 ([192.168.69.253:1099])
            10:41:27,092 INFO [dragoneyes] (AsynchViewChangeHandler Thread) New Members : 0 ([])
            10:41:27,093 INFO [dragoneyes] (AsynchViewChangeHandler Thread) All Members : 1 ([192.168.69.122:1099])


            Here is what's logged when the cable is plugged back in:

            10:42:12,912 INFO [dragoneyes] (UpHandler (STATE_TRANSFER)) New cluster view for partition dragoneyes (id: 3, delta: 1) : [192.168.69.122:1099, 192.168.69.253:1099]
            10:42:12,914 INFO [dragoneyes] (AsynchViewChangeHandler Thread) Merging partitions...
            10:42:12,914 INFO [dragoneyes] (AsynchViewChangeHandler Thread) Dead members: 0
            10:42:12,914 INFO [dragoneyes] (AsynchViewChangeHandler Thread) Originating groups: [[192.168.69.122:32809|2] [192.168.69.122:32809], [192.168.69.253:33013|2] [192.168.69.253:33013]]

            JBoss/JGroups sees the loss and restoration of the network. Is there no way to hook into UpHandler or AsynchViewChangeHandler to catch the notification? The viewChange method just isn't doing it.

            • 3. Re: Network Loss in a 1 to n Node Cluster
              brian.stansberry

              OK, your problem is the JGroups channel that your cache instance is using isn't detecting the unplugging of the cable. I bet if you wait about a minute, it will.

              The logging you posted is actually for a completely different channel. Technically it's a different cluster, even though from a surface point of view it seems like there is only one "cluster".

              There is a semi-complicated mechanism for registering for the view change events you posted, but that's really not the right thing to do. The right thing is to ensure your JBoss Cache channel detects the cable unplug.

              In your cache config file, find the ClusterConfig element and replace FD with:

              <FD_SOCK down_thread="false" up_thread="false"/>
              <FD timeout="10000" max_tries="5" down_thread="false" up_thread="false" shun="true"/>


              That's the config we're starting to use everywhere now. See http://wiki.jboss.org/wiki/Wiki.jsp?page=FDVersusFD_SOCK for more details.

              • 4. Re: Network Loss in a 1 to n Node Cluster
                jbirkenmaier

                I read the Wiki page and changed the cluster-service.xml file. After 2 hours, the tree cache was notified about the loss of a cluster node and it in turn notified my application. However, 2 hours is much too long. I need to know a lot sooner. Is there a way to change the timeout value from 2 hours to, say, 30 seconds? Right now, the entry looks like what you suggested:
                <FD_SOCK down_thread="false" up_thread="false"/>
                <FD timeout="10000" max_tries="5" down_thread="false" up_thread="false" shun="true"/>

                The documentation says that setting this timeout value should override the system default of 2 hours but it still waited the 2 hours when it should have waited 50 seconds (?).
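
                The 50-second figure is just the two FD attributes multiplied together. Here is a rough sketch of that arithmetic, assuming FD suspects a member once max_tries consecutive heartbeat checks of timeout milliseconds each have gone unanswered. FdEstimate is a hypothetical name, and the real protocol's timing may differ somewhat.

```java
// Hypothetical helper: estimates how long the FD protocol takes to suspect
// an unreachable member, under the assumption stated above.
final class FdEstimate {

    // Approximate detection time in milliseconds: timeout multiplied by max_tries.
    static long approxDetectionMillis(long timeoutMillis, int maxTries) {
        return timeoutMillis * maxTries;
    }

    public static void main(String[] args) {
        // Values taken from the FD element quoted in this thread.
        long millis = approxDetectionMillis(10000L, 5);
        System.out.println("Approximate detection time: " + (millis / 1000) + " s");
    }
}
```

                By the same estimate, a target of about 30 seconds could be reached with, for example, timeout="10000" and max_tries="3".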

                • 5. Re: Network Loss in a 1 to n Node Cluster
                  brian.stansberry

                  The cluster-service.xml file does not affect your tree cache in any way. Completely unrelated. But, changing that one as well was good :-)

                  The tree cache you're using for the shared data must have a config file as well. (Unless you're configuring everything programmatically, in which case I'd say use a config file.) You need to apply the FD/FD_SOCK change to that file.

                  • 6. Re: Network Loss in a 1 to n Node Cluster
                    jbirkenmaier

                    OK, once I changed the proper tree cache deployment file, it works. Thanks for your help!!