6 Replies Latest reply on Dec 18, 2009 9:48 AM by kosiakk

    Basic TCP Cluster of two nodes fails to recover from a netwo

    kpandey

      This was tested on both 3.2.8 sp1 and 4.0.4 GA.

      Within a subnet I configured two nodes, A and B, using the TCP protocol for JGroups (I'm primarily interested in a WAN cluster for site failover, so the next step would have been to separate the nodes across a WAN).

      Node A is started and runs in Master mode. Node B is started and joins the DefaultPartition cluster along with node A in Slave mode. So far so good.
      Now I disconnect the LAN cable on the Node B machine. It fails to ping Node A and thus changes its mode to Master.

      Now I plug the connection back in. Node B never rejoins the DefaultPartition cluster with node A; it continues to run in Master mode.

      I'd expect it to rejoin the old cluster and switch its mode back from Master to Slave. This presents a serious issue for using singleton tasks in a cluster.
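      As an illustration of why this matters (a sketch only, against a JGroups 2.x-style API; the class name is made up and this is not the actual JBoss HASingleton code): the "master" decision typically amounts to "am I the first, i.e. oldest, member of the current view?", so once the cluster splits into two views, each partition elects its own master.

      import org.jgroups.Address;
      import org.jgroups.JChannel;
      import org.jgroups.ReceiverAdapter;
      import org.jgroups.View;

      // Sketch only: treat the first (oldest) member of the current view as master.
      // After a split, each partition installs its own view, so both sides report master=true.
      public class MasterWatcher extends ReceiverAdapter {
          private final JChannel channel;

          public MasterWatcher(JChannel channel) {
              this.channel = channel;
          }

          public void viewAccepted(View view) {
              Address oldest = (Address) view.getMembers().get(0); // oldest member = coordinator
              boolean master = oldest.equals(channel.getLocalAddress());
              System.out.println("view=" + view + ", master=" + master);
          }
      }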

      Can anyone suggest whether the behavior I'm observing is due to some network configuration issue at the JGroups level that prevents recovery?

      Thanks
      Kumar Pandey



        • 1. Re: Basic TCP Cluster of two nodes fails to recover from a n
          kpandey

          Some more details on test setup.
           Node A is running on Windows XP at a NAT address of, say, 10.0.1.4.
           Node B is running on Red Hat 9 at a NAT address of 10.0.1.5.

          • 2. Re: Basic TCP Cluster of two nodes fails to recover from a n
            belaban

             Do you have a MERGE2 protocol in your stack?

            • 3. Re: Basic TCP Cluster of two nodes fails to recover from a n
              kpandey

              Yes, here are my JGroups TCP settings:

              <TCP bind_addr="10.0.1.62" start_port="7800" loopback="true"/>
              <TCPPING initial_hosts="10.0.1.62[7800],10.0.1.61[7800]" port_range="3" timeout="3500"
                       num_initial_members="2" up_thread="true" down_thread="true"/>
              <MERGE2 min_interval="5000" max_interval="10000"/>
              <FD shun="true" timeout="2500" max_tries="5" up_thread="true" down_thread="true"/>
              <VERIFY_SUSPECT timeout="1500" down_thread="false" up_thread="false"/>
              <pbcast.NAKACK down_thread="true" up_thread="true" gc_lag="100"
                             retransmit_timeout="3000"/>
              <pbcast.STABLE desired_avg_gossip="20000" down_thread="false" up_thread="false"/>
              <pbcast.GMS join_timeout="5000" join_retry_timeout="2000" shun="false"
                          print_local_addr="true" down_thread="true" up_thread="true"/>
              <pbcast.STATE_TRANSFER up_thread="true" down_thread="true"/>

              I have similar settings on the other node.


              • 4. Re: Basic TCP Cluster of two nodes fails to recover from a n
                kpandey

                 I have more information on this issue.

                 I have a three-node setup using TCP for JGroups. It all works fine: if I stop
                 a node and restart it, or do a kill -9 and restart it, the oldest node becomes Master and all is well.
                 Now, while testing error conditions with the network, I'm running into problems. In the normal working case I have three nodes whose
                 DefaultPartition CurrentView is
                 [10.0.1.48:1099, 10.0.2.130:1099, 10.0.1.61:1099]


                Now I unplug the network cable from 10.0.1.61

                 I see the following debug trace on 10.0.2.130:

                02:05:18,276 INFO [DefaultPartition] Suspected member: 10.0.1.61:7800 (additional data: 14 bytes)
                02:05:18,278 INFO [DefaultPartition] Suspected member: 10.0.1.48:7800 (additional data: 14 bytes)
                02:05:18,280 INFO [DefaultPartition] New cluster view for partition DefaultPartition (id: 3, delta: -2) : [10.0.2.130:1099]
                02:05:18,281 INFO [DefaultPartition] I am (10.0.2.130:1099) received membershipChanged event:
                02:05:18,281 INFO [DefaultPartition] Dead members: 2 ([10.0.1.48:1099, 10.0.1.61:1099])
                02:05:18,282 INFO [DefaultPartition] New Members : 0 ([])
                02:05:18,282 INFO [DefaultPartition] All Members : 1 ([10.0.2.130:1099]

                 I do not understand why it thought 10.0.1.48 was dead as well.

                 The debug trace on 10.0.1.48 is:

                9:50:43,033 INFO [DefaultPartition] Suspected member: 10.0.1.61:7800 (additional data: 14 bytes)
                19:50:44,611 WARN [FD] I was suspected, but will not remove myself from membership (waiting for EXIT message)
                19:50:45,533 INFO [DefaultPartition] Suspected member: 10.0.1.61:7800 (additional data: 14 bytes)
                19:50:46,122 WARN [CoordGmsImpl] I am the coord and I'm being am suspected -- will probably leave shortly
                19:50:46,132 WARN [GMS] checkSelfInclusion() failed, 10.0.1.48:7810 is not a member of view [10.0.2.130:7810|3] [10.0.2.130:7810]; discarding view
                19:50:46,517 WARN [FD] I was suspected, but will not remove myself from membership (waiting for EXIT message)
                19:50:48,023 WARN [GMS] checkSelfInclusion() failed, 10.0.1.48:7800 (additional data: 14 bytes) is not a member of view [10.0.2.130:7800 (additional data: 15 bytes)|3] [10.0.2.130:7800 (additional data: 15 bytes)]; discarding view
                19:50:48,032 WARN [CoordGmsImpl] I am the coord and I'm being am suspected -- will probably leave shortly
                19:50:48,033 INFO [DefaultPartition] Suspected member: 10.0.1.61:7800 (additional data: 14 bytes)
                19:50:48,034 INFO [DefaultPartition] Suspected member: vallance-lnx:7800 (additional data: 14 bytes)


                Why is 10.0.1.48 a suspect?

                 The result is that both 10.0.1.48 and 10.0.2.130 now run in Master mode and are no longer in a cluster together.

                 Upon connecting the network cable back to 10.0.1.61, the cluster goes through several changing group views and finally settles down to the following view on all three nodes:
                 [10.0.2.130:1099, 10.0.1.61:1099, 10.0.1.48:1099]

                 How do I troubleshoot this? I would expect 10.0.2.130 and 10.0.1.48 to never lose the cluster group, and 10.0.1.61 to join at the end as the newest member.
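                 To see whether a merge is ever attempted after the cable is plugged back in, one option is to watch for a MergeView on a plain JGroups channel running the same stack. This is only a sketch assuming the JGroups 2.x-style API bundled with JBoss of that era; the config file name and cluster name are placeholders:

                 import org.jgroups.JChannel;
                 import org.jgroups.MergeView;
                 import org.jgroups.ReceiverAdapter;
                 import org.jgroups.View;

                 public class MergeWatcher {
                     public static void main(String[] args) throws Exception {
                         JChannel channel = new JChannel("tcp.xml"); // placeholder: same stack as above
                         channel.setReceiver(new ReceiverAdapter() {
                             public void viewAccepted(View view) {
                                 if (view instanceof MergeView) {
                                     // MERGE2 found the subgroups and GMS installed a merged view
                                     System.out.println("MERGED: " + view);
                                 } else {
                                     System.out.println("new view: " + view);
                                 }
                             }
                         });
                         channel.connect("DefaultPartition");
                         Thread.sleep(Long.MAX_VALUE); // keep running while the cable is pulled and replugged
                     }
                 }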

                Testing on jboss-3.2.8sp1 and jdk1.5

                Thanks
                Kumar

                • 5. Re: Basic TCP Cluster of two nodes fails to recover from a n
                  kpandey

                   Here's a sample of the JGroups settings on 10.0.1.62. The others have similar settings (i.e., all three nodes are listed in initial_hosts).

                   <TCP bind_addr="10.0.1.62" start_port="7800" loopback="true"/>
                   <TCPPING initial_hosts="10.0.1.62[7800],10.0.1.61[7800],10.0.2.130[7800]" port_range="3" timeout="3500"
                            num_initial_members="3" up_thread="true" down_thread="true"/>
                   <MERGE2 min_interval="5000" max_interval="10000"/>
                   <FD shun="true" timeout="2500" max_tries="5" up_thread="true" down_thread="true"/>
                   <VERIFY_SUSPECT timeout="1500" down_thread="false" up_thread="false"/>
                   <pbcast.NAKACK down_thread="true" up_thread="true" gc_lag="100"
                                  retransmit_timeout="3000"/>
                   <pbcast.STABLE desired_avg_gossip="20000" down_thread="false" up_thread="false"/>
                   <pbcast.GMS join_timeout="5000" join_retry_timeout="2000" shun="false"
                               print_local_addr="true" down_thread="true" up_thread="true"/>
                   <pbcast.STATE_TRANSFER up_thread="true" down_thread="true"/>

                • 6. Re: Basic TCP Cluster of two nodes fails to recover from a netwo
                  kosiakk

                    Well, it seems that the problem is still there!

                     

                    Instead of unplugging the network cable, you can suspend the process with a debug breakpoint or something like that: the cluster splits into independent parts and never recovers by itself; you have to restart everything.

                     

                    New nodes join both clusters simultaneously, so the mess just grows.
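                    For what it's worth, a quick way to see which half of the split a new node ends up in is to print the view it receives right after connecting. Again, just a sketch against a JGroups 2.x-style API; the file and cluster names are placeholders:

                    import org.jgroups.JChannel;

                    public class JoinCheck {
                        public static void main(String[] args) throws Exception {
                            JChannel channel = new JChannel("tcp.xml"); // placeholder stack file
                            channel.connect("DefaultPartition");
                            // the initial view shows which side of the split this node joined
                            System.out.println("joined view: " + channel.getView());
                            channel.close();
                        }
                    }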

                     

                    Am I doing something wrong? Is this the expected behaviour?

                    It looks like a show-stopper issue to me...