10 Replies Latest reply on Sep 1, 2006 5:51 PM by brian.stansberry

    Forcing Topology change

    kpandey

      I have a requirement that, of the two nodes a and b in a cluster, node a should always run in Master mode when it is available.

      Normal cluster functions between the two nodes are working fine. I'm using TCP as opposed to UDP.

      It seems the first name in the current view returned from the channel is used to determine if the HASingletonDeployer should be in master mode.
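
      As a rough illustration of that rule (a minimal sketch, not the actual HASingletonDeployer code): assuming the master is simply whichever node appears first in the current view, the check reduces to a comparison against the first list entry. The class and method names here are hypothetical.

```java
import java.util.Arrays;
import java.util.List;

class ViewMaster {
    // Hypothetical helper: a node is treated as master when it is the
    // first member of the current view (assumption taken from the post).
    static boolean isMaster(List<String> view, String localNode) {
        return !view.isEmpty() && view.get(0).equals(localNode);
    }

    public static void main(String[] args) {
        // Example view in which node a joined first.
        List<String> view = Arrays.asList("10.0.1.62:1099", "10.0.1.61:1099");
        System.out.println(isMaster(view, "10.0.1.62:1099")); // prints true
        System.out.println(isMaster(view, "10.0.1.61:1099")); // prints false
    }
}
```

      Under this assumption, making node a the master means getting it to the front of the view, which is why the order of members matters here.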

      JGroups/JChannel doesn't seem to provide a way to manipulate the order of members in the view.

      I tried to see whether stopping and starting the ClusterPartition would do the trick. When node a was the first item in the CurrentView shown in the jmx-console (i.e. the master node), I invoked stop. This closed the DefaultPartition, and node b then assumed the master role. But when I started the ClusterPartition on node a again, it didn't see node b. Is this a bug, or am I missing some subtlety?

      Is there a better way to leave and join the cluster programmatically?

      Thanks
      Kumar

        • 1. Re: Forcing Topology change
          brian.stansberry

          What JBoss version?

          JIRA http://jira.jboss.com/jira/browse/JBCLUSTER-38 deals with this; it was at least partially fixed in 4.0.3 and more completely fixed in 4.0.4.

          • 2. Re: Forcing Topology change
            kpandey

            Sorry for breaking the first rule of forum posting. Here's the version info

            Version
            Version: 4.0.4GA(build: CVSTag=JBoss_4_0_4_GA date=200605151000)

            Version Name: Zion

            Built on: May 15 2006


            I also see the start hang after the second stop.

            • 3. Re: Forcing Topology change
              brian.stansberry

              In 4.0.4 you should be able to stop a ClusterPartition and then restart it; there's a unit test that does this. In fact I just played around with the jmx-console starting and stopping partitions, without problems.

              The unit test and my playing around just now were with the default UDP config. If you switch to UDP does it work? If so that will narrow down the issue and I can try to continue debugging via the forum. If UDP doesn't work, you'll need to create a test case that shows the problem and open a JIRA.

              • 4. Re: Forcing Topology change
                kpandey

                Yes, UDP works for me too. Please let me know if you need debug logs.
                Thx

                • 5. Re: Forcing Topology change
                  brian.stansberry

                  Please paste your protocol stack config from cluster-service.xml. (Be sure to surround it with code tags, or the XML won't come through.)

                  • 6. Re: Forcing Topology change
                    kpandey

                    On the first machine, 10.0.1.62 (i.e. kumar-pc):

                     <Config>
                     <TCP bind_addr="kumar-pc" start_port="7800" loopback="true"/>
                     <TCPPING initial_hosts="kumar-pc[7800],kumar-lnx[7800]" port_range="3" timeout="3500"
                     num_initial_members="3" up_thread="true" down_thread="true"/>
                     <MERGE2 min_interval="5000" max_interval="10000"/>
                     <FD shun="true" timeout="2500" max_tries="5" up_thread="true" down_thread="true" />
                     <VERIFY_SUSPECT timeout="1500" down_thread="false" up_thread="false" />
                     <pbcast.NAKACK down_thread="true" up_thread="true" gc_lag="100"
                     retransmit_timeout="3000"/>
                     <pbcast.STABLE desired_avg_gossip="20000" down_thread="false" up_thread="false" />
                     <pbcast.GMS join_timeout="5000" join_retry_timeout="2000" shun="false"
                     print_local_addr="true" down_thread="true" up_thread="true"/>
                     <pbcast.STATE_TRANSFER up_thread="true" down_thread="true"/>
                     </Config>


                    On the second machine, 10.0.1.61 (i.e. kumar-lnx):
                    <Config>
                     <TCP bind_addr="kumar-lnx" start_port="7800" loopback="true"/>
                     <TCPPING initial_hosts="kumar-lnx[7800],kumar-pc[7800]" port_range="3" timeout="3500"
                     num_initial_members="3" up_thread="true" down_thread="true"/>
                     <MERGE2 min_interval="5000" max_interval="10000"/>
                     <FD shun="true" timeout="2500" max_tries="5" up_thread="true" down_thread="true" />
                     <VERIFY_SUSPECT timeout="1500" down_thread="false" up_thread="false" />
                     <pbcast.NAKACK down_thread="true" up_thread="true" gc_lag="100"
                     retransmit_timeout="3000"/>
                     <pbcast.STABLE desired_avg_gossip="20000" down_thread="false" up_thread="false" />
                     <pbcast.GMS join_timeout="5000" join_retry_timeout="2000" shun="false"
                     print_local_addr="true" down_thread="true" up_thread="true"/>
                     <pbcast.STATE_TRANSFER up_thread="true" down_thread="true"/>
                     </Config>


                    I bring up 10.0.1.62 before 10.0.1.61 so that it runs as Master. Then I stop the partition on 10.0.1.62. The partition closes, and 10.0.1.62 is removed from 10.0.1.61's current view.
                    I then start the partition on 10.0.1.62. It comes up but is unable to see 10.0.1.61 and hence runs in master mode.

                    console output on kumar-pc (10.0.1.62) after invoking partition stop


                    10:21:04,578 INFO [DefaultPartition] Closing partition DefaultPartition
                    10:21:04,594 INFO [ConnectionTable] exception is java.net.SocketException: socket closed
                    10:21:04,594 INFO [ConnectionTable] exception is java.net.SocketException: socket closed
                    10:21:04,594 INFO [ConnectionTable] addr=10.0.1.61:7800 (additional data: 14 bytes), connections are connections (0):
                    10:21:04,594 INFO [DefaultPartition] Partition DefaultPartition closed.
                    10:21:04,594 WARN [ConnectionTable] local_addr is null
                    10:21:04,594 INFO [ConnectionTable] connection was created to 10.0.1.61:7800 (additional data: 14 bytes)
                    10:21:04,609 INFO [ConnectionTable] created socket to 10.0.1.61:7800 (additional data: 14 bytes)
                    10:21:04,609 INFO [ConnectionTable] exception is java.io.EOFException
                    10:21:04,609 INFO [ConnectionTable] addr=10.0.1.61:7800 (additional data: 14 bytes), connections are connections (0):

                    console output on kumar-pc (10.0.1.62) after invoking partition start

                    10:21:36,235 INFO [ConnectionTable] server socket created on kumar-pc:7800
                    10:21:36,251 INFO [STDOUT]
                    -------------------------------------------------------
                    GMS: address is kumar-pc:7800 (additional data: 14 bytes)
                    -------------------------------------------------------
                    10:21:39,751 INFO [DefaultPartition] Number of cluster members: 1
                    10:21:39,751 INFO [DefaultPartition] Other members: 0
                    10:21:39,751 INFO [DefaultPartition] New cluster view for partition DefaultPartition (id: 0, delta: 0) : [10.0.1.62:1099]
                    10:21:39,751 INFO [DefaultPartition] Fetching state (will wait for 30000 milliseconds):
                    10:21:39,766 INFO [DefaultPartition] I am (10.0.1.62:1099) received membershipChanged event:
                    10:21:39,766 INFO [DefaultPartition] Dead members: 0 ([])
                    10:21:39,766 INFO [DefaultPartition] New Members : 0 ([])
                    10:21:39,766 INFO [DefaultPartition] All Members : 1 ([10.0.1.62:1099])
                    10:21:46,048 INFO [ConnectionTable] accepted connection, client_sock=Socket[addr=/10.0.1.61,port=47354,localport=7800]
                    10:21:46,048 INFO [ConnectionTable] input_cookie is bela
                    10:21:46,938 INFO [ConnectionTable] connection was created to 10.0.1.61:7800 (additional data: 14 bytes)

                    Note that it establishes a connection to 10.0.1.61 but doesn't find the instance running there.
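
                    One detail in the start log may help narrow this down. The gap between the GMS address banner (10:21:36,251) and the first single-member view (10:21:39,751) equals the TCPPING timeout of 3500 ms in the config above. That is consistent with, though not proof of, discovery waiting out its full timeout without getting a usable response from 10.0.1.61. A small sketch of the arithmetic, with hypothetical class and method names:

```java
import java.time.Duration;
import java.time.LocalTime;
import java.time.format.DateTimeFormatter;

class DiscoveryGap {
    // Timestamp layout used in the server log lines above.
    private static final DateTimeFormatter LOG_TIME =
            DateTimeFormatter.ofPattern("HH:mm:ss,SSS");

    // Milliseconds elapsed between two log timestamps on the same day.
    static long millisBetween(String from, String to) {
        return Duration.between(LocalTime.parse(from, LOG_TIME),
                                LocalTime.parse(to, LOG_TIME)).toMillis();
    }

    public static void main(String[] args) {
        long tcppingTimeoutMs = 3500; // TCPPING timeout attribute from the posted config
        long gap = millisBetween("10:21:36,251", "10:21:39,751");
        System.out.println(gap + " ms");             // prints "3500 ms"
        System.out.println(gap == tcppingTimeoutMs); // prints true
    }
}
```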

                    • 7. Re: Forcing Topology change
                      kpandey

                      Brian
                      Did you get a chance to test with the TCP stack? Also, if you could give me some pointers on where to troubleshoot this, that would be great.

                      I'm looking to run the cluster across a WAN, so I will need to use TCP instead of UDP multicast.

                      Thanks
                      Kumar

                      • 8. Re: Forcing Topology change
                        brian.stansberry

                        I had problems with TCP as well, although I didn't have time to debug them.

                        Can you try with JGroups 2.3.SP2? Just replace the jgroups.jar in server/all/lib. I'm swamped and it might be a while before I can dig into this. There've been a number of improvements in this area between 2.2.7 and 2.3.SP2; perhaps it's fixed.

                        • 9. Re: Forcing Topology change
                          kpandey

                          Brian
                          I tried JGroups 2.3 SP1 (I didn't see 2.3 SP2 on SourceForge) and it works!
                          Stopping and starting the partition now works across node1 and node2. However, I see a behavior that is probably a bug.
                          I have 10.0.1.61 and 10.0.1.62 in a cluster with 10.0.1.61 as Master.
                          I close the partition on 10.0.1.61. 10.0.1.62's current view changes to contain just itself, and it becomes Master.
                          However, 10.0.1.61's current view still shows [10.0.1.61, 10.0.1.62], although the Master flag is false in SingletonController.
                          Shouldn't the current view be reset on 10.0.1.61 when it leaves the partition?

                          Now I start the partition on 10.0.1.61. It initially starts a new view containing only itself and then merges with 10.0.1.62. However, instead of 10.0.1.62 being Master after the merge, 10.0.1.61 remains a master. Since 10.0.1.61 left the partition, shouldn't it have come up as non-master? It also receives a membershipChanged event listing 10.0.1.62 as a dead member, which I don't understand. I'll try the scenario with UDP as well to see whether both behave the same way.


                          Below is the server log

                          23:52:34,434 INFO [DefaultPartition] Closing partition DefaultPartition
                          23:52:39,440 INFO [DefaultPartition] Partition DefaultPartition closed.
                          23:53:41,839 INFO [STDOUT]
                          -------------------------------------------------------
                          GMS: address is 10.0.1.61:7800
                          -------------------------------------------------------
                          23:53:45,342 INFO [DefaultPartition] New cluster view for partition DefaultPartition (id: 0, delta: -1) : [10.0.1.61:1099]
                          23:53:45,343 INFO [DefaultPartition] Number of cluster members: 1
                          23:53:45,343 INFO [DefaultPartition] Other members: 0
                          23:53:45,343 INFO [DefaultPartition] Fetching state (will wait for 30000 milliseconds):
                          23:53:45,345 INFO [DefaultPartition] I am (10.0.1.61:1099) received membershipChanged event:
                          23:53:45,345 INFO [DefaultPartition] Dead members: 1 ([10.0.1.62:1099])
                          23:53:45,345 INFO [DefaultPartition] New Members : 0 ([])
                          23:53:45,346 INFO [DefaultPartition] All Members : 1 ([10.0.1.61:1099])
                          23:53:58,979 INFO [DefaultPartition] New cluster view for partition DefaultPartition (id: 11, delta: 1) : [10.0.1.61:1099, 10.0.1.62:1099]
                          23:53:58,980 INFO [DefaultPartition] Merging partitions...
                          23:53:58,981 INFO [DefaultPartition] Dead members: 0
                          23:53:58,981 INFO [DefaultPartition] Originating groups: [[10.0.1.61:7800|0] [10.0.1.61:7800], [10.0.1.61:7800|10] [10.0.1.62:7800]]
                          10.0.1.62:1099
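
                          A possible explanation for 10.0.1.61 ending up first after the merge, stated as an assumption rather than confirmed behavior of this JGroups version: if the merged membership is built by combining the two sub-views and sorting the addresses, the lowest address becomes the first member and therefore the master, regardless of which node was master before the merge. The sketch below uses hypothetical names and a simplified numeric comparison of IPv4 address and port:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class MergeOrder {
    // Turns "a.b.c.d:port" into five numbers so addresses compare numerically.
    static long[] key(String address) {
        String[] hostPort = address.split(":");
        String[] octets = hostPort[0].split("\\.");
        long[] k = new long[5];
        for (int i = 0; i < 4; i++) {
            k[i] = Long.parseLong(octets[i]);
        }
        k[4] = Long.parseLong(hostPort[1]);
        return k;
    }

    // Orders addresses by IP octets first, then by port.
    static int compare(String a, String b) {
        long[] ka = key(a);
        long[] kb = key(b);
        for (int i = 0; i < ka.length; i++) {
            int c = Long.compare(ka[i], kb[i]);
            if (c != 0) {
                return c;
            }
        }
        return 0;
    }

    // Assumed merge rule: union of both sub-views, sorted by address.
    static List<String> merge(List<String> viewA, List<String> viewB) {
        List<String> merged = new ArrayList<>(viewA);
        for (String member : viewB) {
            if (!merged.contains(member)) {
                merged.add(member);
            }
        }
        merged.sort(MergeOrder::compare);
        return merged;
    }

    public static void main(String[] args) {
        // 10.0.1.62 was master in its own sub-view before the merge.
        List<String> merged = merge(Arrays.asList("10.0.1.62:7800"),
                                    Arrays.asList("10.0.1.61:7800"));
        System.out.println(merged); // prints [10.0.1.61:7800, 10.0.1.62:7800]
    }
}
```

                          If the merge does work this way, the node with the lower address would take over as master after every merge, independent of join order.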

                          • 10. Re: Forcing Topology change
                            brian.stansberry

                            From the log it looks like the start happened before all the stop work was done. Does it work correctly if there is a greater time lag?
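
                            One way to test this is to script the stop and start with an explicit wait in between, for example polling until the other node's view no longer lists the stopped node before invoking start. A minimal sketch of such a wait loop follows. The helper name is hypothetical and the peer's view is simulated with a list; in practice the check would read CurrentView from the other node.

```java
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.function.BooleanSupplier;

class RestartAfterLeave {
    // Polls the condition until it holds or the timeout elapses.
    // Returns true if the condition became true in time.
    static boolean waitUntil(BooleanSupplier condition, long timeoutMs, long pollMs) {
        long deadline = System.currentTimeMillis() + timeoutMs;
        while (true) {
            if (condition.getAsBoolean()) {
                return true;
            }
            if (System.currentTimeMillis() >= deadline) {
                return false;
            }
            try {
                Thread.sleep(pollMs);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return false;
            }
        }
    }

    public static void main(String[] args) {
        // Simulated view held by the peer; stands in for its real CurrentView.
        final List<String> peerView = new CopyOnWriteArrayList<>(
                Arrays.asList("10.0.1.61:1099", "10.0.1.62:1099"));

        // Simulate the peer dropping the stopped node a little later.
        new Thread(() -> {
            try {
                Thread.sleep(200);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
            peerView.remove("10.0.1.61:1099");
        }).start();

        boolean left = waitUntil(() -> !peerView.contains("10.0.1.61:1099"), 5000, 50);
        System.out.println(left ? "safe to start partition" : "peer still lists this node");
    }
}
```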