10 Replies Latest reply on Sep 1, 2006 5:51 PM by brian.stansberry

Forcing Topology change

kpandey Aug 29, 2006 4:56 PM

I have a requirement that, of the two nodes a and b in a cluster, node a should always run in master mode when it is available.

Normal cluster functions between the two nodes are working fine. I'm using TCP as opposed to UDP.

It seems the first name in the current view returned from the channel is used to determine if the HASingletonDeployer should be in master mode.

JGroups/JChannel doesn't seem to provide a way to change the order of the members in the view.
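For illustration only, here is a minimal Java sketch of the rule described above. It assumes the view is an ordered member list and that a node which leaves and rejoins is appended at the end. The class `ViewOrderSketch` and its methods are hypothetical stand-ins, not JGroups or JBoss API; the real election is done by the HA singleton code.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical model of an ordered cluster view; not JGroups or JBoss code. */
final class ViewOrderSketch {
    private final List<String> members = new ArrayList<>();

    /** Assumption: a joining node is appended to the end of the view. */
    void join(String node) {
        if (!members.contains(node)) {
            members.add(node);
        }
    }

    /** A leaving node is removed; the order of the remaining members is unchanged. */
    void leave(String node) {
        members.remove(node);
    }

    /** The rule from the post: the first member of the view is the master. */
    boolean isMaster(String node) {
        return !members.isEmpty() && members.get(0).equals(node);
    }

    /** Returns a copy of the current view. */
    List<String> view() {
        return new ArrayList<>(members);
    }

    public static void main(String[] args) {
        ViewOrderSketch cluster = new ViewOrderSketch();
        cluster.join("a");
        cluster.join("b");
        cluster.leave("a");   // b is now first, so b is master
        cluster.join("a");    // a rejoins at the end of the view
        cluster.leave("b");   // to hand mastership back, b has to leave...
        cluster.join("b");    // ...and rejoin behind a
        System.out.println(cluster.view() + " master is a: " + cluster.isMaster("a"));
    }
}
```

Under that assumption, the only way to make node a the master again is to cycle node b after node a has rejoined, which is why a reliable stop and start of the partition matters here.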

I tried to see if stopping and starting the ClusterPartition would do the trick. When node a was the first item in the CurrentView shown in the jmx-console (i.e. the master node), I invoked stop. This closed the DefaultPartition, and node b then assumed the master role. However, after I started the ClusterPartition on node a again, it didn't see node b. Is this a bug, or am I missing some subtlety?

Is there a better way to leave and join the cluster programmatically?

Thanks
Kumar

1. Re: Forcing Topology change

brian.stansberry Aug 29, 2006 7:00 PM (in response to kpandey)

What JBoss version?

JIRA http://jira.jboss.com/jira/browse/JBCLUSTER-38 deals with this; it was at least partially fixed in 4.0.3 and more completely fixed in 4.0.4.
2. Re: Forcing Topology change

kpandey Aug 29, 2006 7:13 PM (in response to kpandey)

Sorry for breaking the first rule of forum posting. Here's the version info:

Version
Version: 4.0.4GA(build: CVSTag=JBoss_4_0_4_GA date=200605151000)

Version Name: Zion

Built on: May 15 2006

I also see start hang after the second stop.
3. Re: Forcing Topology change

brian.stansberry Aug 30, 2006 12:14 AM (in response to kpandey)

In 4.0.4 you should be able to stop a ClusterPartition and then restart it; there's a unit test that does this. In fact I just played around with the jmx-console starting and stopping partitions, without problems.

The unit test and my playing around just now were with the default UDP config. If you switch to UDP does it work? If so that will narrow down the issue and I can try to continue debugging via the forum. If UDP doesn't work, you'll need to create a test case that shows the problem and open a JIRA.
4. Re: Forcing Topology change

kpandey Aug 30, 2006 1:21 AM (in response to kpandey)

Yes, UDP works for me too. Please let me know if you need debug logs.
Thx
5. Re: Forcing Topology change

brian.stansberry Aug 30, 2006 9:49 AM (in response to kpandey)
Please paste your protocol stack config from cluster-service.xml. (Be sure to surround it with code tags or the XML won't come through.)
6. Re: Forcing Topology change

kpandey Aug 30, 2006 1:36 PM (in response to kpandey)
On first machine 10.0.1.62 ie kumar-pc
<Config>
  <TCP bind_addr="kumar-pc" start_port="7800" loopback="true"/>
  <TCPPING initial_hosts="kumar-pc[7800],kumar-lnx[7800]" port_range="3" timeout="3500" num_initial_members="3" up_thread="true" down_thread="true"/>
  <MERGE2 min_interval="5000" max_interval="10000"/>
  <FD shun="true" timeout="2500" max_tries="5" up_thread="true" down_thread="true" />
  <VERIFY_SUSPECT timeout="1500" down_thread="false" up_thread="false" />
  <pbcast.NAKACK down_thread="true" up_thread="true" gc_lag="100" retransmit_timeout="3000"/>
  <pbcast.STABLE desired_avg_gossip="20000" down_thread="false" up_thread="false" />
  <pbcast.GMS join_timeout="5000" join_retry_timeout="2000" shun="false" print_local_addr="true" down_thread="true" up_thread="true"/>
  <pbcast.STATE_TRANSFER up_thread="true" down_thread="true"/>
</Config>

On second machine 10.0.1.61 ie kumar-lnx
<Config>
  <TCP bind_addr="kumar-lnx" start_port="7800" loopback="true"/>
  <TCPPING initial_hosts="kumar-lnx[7800],kumar-pc[7800]" port_range="3" timeout="3500" num_initial_members="3" up_thread="true" down_thread="true"/>
  <MERGE2 min_interval="5000" max_interval="10000"/>
  <FD shun="true" timeout="2500" max_tries="5" up_thread="true" down_thread="true" />
  <VERIFY_SUSPECT timeout="1500" down_thread="false" up_thread="false" />
  <pbcast.NAKACK down_thread="true" up_thread="true" gc_lag="100" retransmit_timeout="3000"/>
  <pbcast.STABLE desired_avg_gossip="20000" down_thread="false" up_thread="false" />
  <pbcast.GMS join_timeout="5000" join_retry_timeout="2000" shun="false" print_local_addr="true" down_thread="true" up_thread="true"/>
  <pbcast.STATE_TRANSFER up_thread="true" down_thread="true"/>
</Config>

I bring up 10.0.1.62 before 10.0.1.61 so that it runs as master. Then I stop the partition on 10.0.1.62. The partition closes and 10.0.1.62 is removed from 10.0.1.61's current view.
I then start the partition on 10.0.1.62. It comes up but is unable to see 10.0.1.61, and hence runs in master mode.

console output on kumar-pc (10.0.1.62) after invoking partition stop

10:21:04,578 INFO [DefaultPartition] Closing partition DefaultPartition
10:21:04,594 INFO [ConnectionTable] exception is java.net.SocketException: socket closed
10:21:04,594 INFO [ConnectionTable] exception is java.net.SocketException: socket closed
10:21:04,594 INFO [ConnectionTable] addr=10.0.1.61:7800 (additional data: 14 bytes), connections are connections (0):
10:21:04,594 INFO [DefaultPartition] Partition DefaultPartition closed.
10:21:04,594 WARN [ConnectionTable] local_addr is null
10:21:04,594 INFO [ConnectionTable] connection was created to 10.0.1.61:7800 (additional data: 14 bytes)
10:21:04,609 INFO [ConnectionTable] created socket to 10.0.1.61:7800 (additional data: 14 bytes)
10:21:04,609 INFO [ConnectionTable] exception is java.io.EOFException
10:21:04,609 INFO [ConnectionTable] addr=10.0.1.61:7800 (additional data: 14 bytes), connections are connections (0):

console output on kumar-pc (10.0.1.62) after invoking partition start

10:21:36,235 INFO [ConnectionTable] server socket created on kumar-pc:7800
10:21:36,251 INFO [STDOUT]
-------------------------------------------------------
GMS: address is kumar-pc:7800 (additional data: 14 bytes)
-------------------------------------------------------
10:21:39,751 INFO [DefaultPartition] Number of cluster members: 1
10:21:39,751 INFO [DefaultPartition] Other members: 0
10:21:39,751 INFO [DefaultPartition] New cluster view for partition DefaultPartition (id: 0, delta: 0) : [10.0.1.62:1099]
10:21:39,751 INFO [DefaultPartition] Fetching state (will wait for 30000 milliseconds):
10:21:39,766 INFO [DefaultPartition] I am (10.0.1.62:1099) received membershipChanged event:
10:21:39,766 INFO [DefaultPartition] Dead members: 0 ([])
10:21:39,766 INFO [DefaultPartition] New Members : 0 ([])
10:21:39,766 INFO [DefaultPartition] All Members : 1 ([10.0.1.62:1099])
10:21:46,048 INFO [ConnectionTable] accepted connection, client_sock=Socket[addr=/10.0.1.61,port=47354,localport=7800]
10:21:46,048 INFO [ConnectionTable] input_cookie is bela
10:21:46,938 INFO [ConnectionTable] connection was created to 10.0.1.61:7800 (additional data: 14 bytes)

Note that it establishes a connection to 10.0.1.61 but doesn't find the instance running there.
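One detail the start log does show is the gap between the GMS address banner (10:21:36,251) and the single-member view (10:21:39,751). The short Java sketch below computes that gap; the class and method names are made up for this example. The result equals the timeout="3500" set on TCPPING in the config above, which would be consistent with discovery waiting out its full timeout without a usable response, although the log alone does not prove that.

```java
import java.time.Duration;
import java.time.LocalTime;
import java.time.format.DateTimeFormatter;

/** Computes the gap between two timestamps in the console log's HH:mm:ss,SSS format. */
final class LogGapSketch {
    private static final DateTimeFormatter LOG_TIME = DateTimeFormatter.ofPattern("HH:mm:ss,SSS");

    /** Returns the number of milliseconds from the earlier to the later timestamp. */
    static long millisBetween(String earlier, String later) {
        LocalTime start = LocalTime.parse(earlier, LOG_TIME);
        LocalTime end = LocalTime.parse(later, LOG_TIME);
        return Duration.between(start, end).toMillis();
    }

    public static void main(String[] args) {
        // Timestamps copied from the log: GMS address banner, then the single-member view.
        long gap = millisBetween("10:21:36,251", "10:21:39,751");
        System.out.println("gap in ms: " + gap);
    }
}
```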
7. Re: Forcing Topology change

kpandey Aug 31, 2006 12:34 PM (in response to kpandey)

Brian
Did you get a chance to test with the TCP stack? Also, if you could give me some pointers on where to troubleshoot this, that would be great.

I'm looking to use the cluster across a WAN, so I will need to use TCP instead of UDP multicast.

Thanks
Kumar
8. Re: Forcing Topology change

brian.stansberry Aug 31, 2006 1:48 PM (in response to kpandey)

I had problems with TCP as well, although I didn't have time to debug.

Can you try with JGroups 2.3.SP2? Just replace the jgroups.jar in server/all/lib. I'm swamped and it might be a while before I can dig into this. There've been a number of improvements in this area between 2.2.7 and 2.3.SP2; perhaps it's fixed.
9. Re: Forcing Topology change

kpandey Aug 31, 2006 8:16 PM (in response to kpandey)

Brian
I tried JGroups 2.3 SP1 (I didn't see 2.3 SP2 on SourceForge) and it works!
Stopping and starting the partition now works across node1 and node2. However, I see a behavior that is probably a bug.
I have 10.0.1.61 and 10.0.1.62 in a cluster with 10.0.1.61 as master.
I close the partition on 10.0.1.61. 10.0.1.62's current view changes to contain just itself, and it becomes master.
However, 10.0.1.61's current view still shows [10.0.1.61, 10.0.1.62], although the Master flag is false in SingletonController.
Shouldn't the current view be reset on 10.0.1.61 when it leaves the partition?

Now I start the partition on 10.0.1.61. It initially starts a new view containing only itself and then merges with 10.0.1.62. However, instead of 10.0.1.62 being the master after the merge, 10.0.1.61 remains master. Since 10.0.1.61 left the partition, shouldn't it have come up as non-master? It also receives a membershipChanged event that lists 10.0.1.62 as a dead member, which I don't understand. I'll try the scenario with UDP as well to see if both behave similarly.

Below is the server log

23:52:34,434 INFO [DefaultPartition] Closing partition DefaultPartition
23:52:39,440 INFO [DefaultPartition] Partition DefaultPartition closed.
23:53:41,839 INFO [STDOUT]
-------------------------------------------------------
GMS: address is 10.0.1.61:7800
-------------------------------------------------------
23:53:45,342 INFO [DefaultPartition] New cluster view for partition DefaultPartition (id: 0, delta: -1) : [10.0.1.61:1099]
23:53:45,343 INFO [DefaultPartition] Number of cluster members: 1
23:53:45,343 INFO [DefaultPartition] Other members: 0
23:53:45,343 INFO [DefaultPartition] Fetching state (will wait for 30000 milliseconds):
23:53:45,345 INFO [DefaultPartition] I am (10.0.1.61:1099) received membershipChanged event:
23:53:45,345 INFO [DefaultPartition] Dead members: 1 ([10.0.1.62:1099])
23:53:45,345 INFO [DefaultPartition] New Members : 0 ([])
23:53:45,346 INFO [DefaultPartition] All Members : 1 ([10.0.1.61:1099])
23:53:58,979 INFO [DefaultPartition] New cluster view for partition DefaultPartition (id: 11, delta: 1) : [10.0.1.61:1099, 10.0.1.62:1099]
23:53:58,980 INFO [DefaultPartition] Merging partitions...
23:53:58,981 INFO [DefaultPartition] Dead members: 0
23:53:58,981 INFO [DefaultPartition] Originating groups: [[10.0.1.61:7800|0] [10.0.1.61:7800], [10.0.1.61:7800|10] [10.0.1.62:7800]]
10.0.1.62:1099
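One possible reading of the "Dead members: 1" line, offered as a guess and not as a description of the actual DefaultPartition source: if dead members are computed as "in the previous view but not in the new one", and the previous view on 10.0.1.61 was never cleared on stop, then the first view after restart would report 10.0.1.62 as dead. The Java sketch below shows that set difference; `ViewDiffSketch` and its methods are hypothetical names.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical view-diff helper; a guess at the bookkeeping, not JBoss code. */
final class ViewDiffSketch {
    /** Members present in the old view but missing from the new one. */
    static List<String> deadMembers(List<String> oldView, List<String> newView) {
        List<String> dead = new ArrayList<>(oldView);
        dead.removeAll(newView);
        return dead;
    }

    /** Members present in the new view but missing from the old one. */
    static List<String> newMembers(List<String> oldView, List<String> newView) {
        List<String> added = new ArrayList<>(newView);
        added.removeAll(oldView);
        return added;
    }

    public static void main(String[] args) {
        // Assumption: the view from before the stop was never cleared.
        List<String> stale = Arrays.asList("10.0.1.61:1099", "10.0.1.62:1099");
        List<String> afterRestart = Arrays.asList("10.0.1.61:1099");
        System.out.println("Dead members: " + deadMembers(stale, afterRestart));
        System.out.println("New members : " + newMembers(stale, afterRestart));
    }
}
```

If the stale view had been reset to empty on stop, the same comparison would report 10.0.1.61 as a new member and no dead members.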
10. Re: Forcing Topology change

brian.stansberry Sep 1, 2006 5:51 PM (in response to kpandey)

From the log it looks like the start happened before all the stop work was done. Does it work correctly if there is a greater time lag?
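To try a longer lag in a repeatable way, the stop and start can be driven through plain JMX instead of clicking in the jmx-console. The Java sketch below is only an outline under assumptions: the class name is made up, the "stop" and "start" operation names are taken from the jmx-console usage described earlier in the thread, and the ObjectName of the partition MBean (for example something like jboss:service=DefaultPartition) must be checked against your own cluster-service.xml. How the MBeanServerConnection is obtained is left out.

```java
import javax.management.MBeanServerConnection;
import javax.management.ObjectName;

/** Sketch: stop a partition MBean, wait, then start it again through plain JMX. */
final class PartitionRestartSketch {
    private static final Object[] NO_ARGS = new Object[0];
    private static final String[] NO_SIGNATURE = new String[0];

    /**
     * Invokes "stop", sleeps for pauseMillis, then invokes "start" on the given MBean.
     * The pause length is arbitrary; tune it until the other node has settled.
     */
    static void restart(MBeanServerConnection server, ObjectName partition, long pauseMillis)
            throws Exception {
        // Assumed to be the same no-argument operation that was invoked from the jmx-console.
        server.invoke(partition, "stop", NO_ARGS, NO_SIGNATURE);
        // Give the other node time to install a view without this member.
        Thread.sleep(pauseMillis);
        server.invoke(partition, "start", NO_ARGS, NO_SIGNATURE);
    }
}
```

Calling restart with increasing pause values would show whether the wrong master assignment goes away once the stop has fully completed.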