11 Replies Latest reply on Dec 13, 2006 3:51 PM by rajeshchande

Clustering NOT working on physical separate boxes.

rajeshchande Dec 6, 2006 9:08 AM

Hello,

I am working on Solaris 10 (SunOS 5.10 Generic_118822-18 sun4u sparc SUNW,Sun-Fire-V240), JBoss 4.0.3SP1, JDK 1.5.0_01-b08. I have configured two nodes, "devl-01" and "devl-02" (copies of "all"), on the same physical machine. When I start them one by one, the cluster is detected and they see each other. I get the following output on the console:

14:54:41,241 INFO [partition-01] All Members : 2 ([3.187.196.86:53043, 3.187.196.86:63043])
14:49:25,936 INFO [DetachedHANamingService$AutomaticDiscovery] Listening on /0.0.0.0:1102, group=228.1.2.4, HA-JNDI address=3.187.196.86:1100

So that's fine.

But I have one more node, "devl-03", on a separate physical machine (the OS, JBoss and JDK versions are the same). When I start this third node, it does not join the existing cluster. I have specified the same partition name, multicast address and port by setting the system properties as:

-Djboss.partition.name=partition-01 -Djboss.partition.udpGroup=228.1.2.4

But after startup, the third node prints console output like:

14:51:24,492 INFO [partition-01] All Members : 1 ([3.187.200.23:53043])
14:51:24,727 INFO [DetachedHANamingService$AutomaticDiscovery] Listening on /0.0.0.0:1102, group=228.1.2.4, HA-JNDI address=3.187.200.23:1100

Can anyone tell me why the third node is not joining the cluster?

1. Re: Clustering NOT working on physical separate boxes.

rajeshchande Dec 7, 2006 5:29 AM (in response to rajeshchande)

Hello JBoss team,

Any ideas or guidance for me?

Regards,
Rajesh.
2. Re: Clustering NOT working on physical separate boxes.

rajeshchande Dec 7, 2006 8:19 AM (in response to rajeshchande)

Hello,

Do the two separate machines need to be on the same subnet?

If yes, is that documented in the JBoss docs?

Regards,
Rajesh.
3. Re: Clustering NOT working on physical separate boxes.

rajeshchande Dec 7, 2006 10:21 AM (in response to rajeshchande)

Hello All,

I would appreciate it if someone could answer my questions above. Were they so silly that I don't see any replies? ;-)

I hope I am not drastically wrong somewhere.

Regards,
Rajesh.
4. Re: Clustering NOT working on physical separate boxes.

brian.stansberry Dec 7, 2006 12:31 PM (in response to rajeshchande)

http://wiki.jboss.org/wiki/Wiki.jsp?page=TestingJBoss
http://www.jgroups.org/javagroupsnew/docs/manual/html/ch03.html#ItDoesntWork

Re: the subnet, the requirement is that the packets must be able to pass between the servers. If you are using UDP, that relies on multicast, so if the servers are not on the same subnet you need to be sure your router will pass multicast.
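One way to check whether multicast packets actually pass between the two boxes, independently of JBoss, is a small standalone probe. The following is only a minimal sketch: the class name `McastProbe` and its command-line arguments are made up for illustration, it is not one of the tools described on the linked pages, and it uses try-with-resources, so it needs a newer JDK than the 1.5 mentioned in this thread.

```java
import java.net.DatagramPacket;
import java.net.InetAddress;
import java.net.InetSocketAddress;
import java.net.MulticastSocket;
import java.net.UnknownHostException;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of JBoss or JGroups.
class McastProbe {

    // True only if the string is an address in the multicast range.
    static boolean isMulticastGroup(String address) {
        try {
            return InetAddress.getByName(address).isMulticastAddress();
        } catch (UnknownHostException e) {
            return false;
        }
    }

    // Join the group and print every datagram that arrives (runs until killed).
    static void receive(String group, int port) throws Exception {
        try (MulticastSocket socket = new MulticastSocket(port)) {
            // null = use the default interface; on a multi-homed box you may
            // need to pass the NetworkInterface that faces the other server.
            socket.joinGroup(new InetSocketAddress(InetAddress.getByName(group), port), null);
            byte[] buf = new byte[1024];
            while (true) {
                DatagramPacket packet = new DatagramPacket(buf, buf.length);
                socket.receive(packet);
                String text = new String(packet.getData(), 0, packet.getLength(), StandardCharsets.UTF_8);
                System.out.println(packet.getAddress() + " -> " + text);
            }
        }
    }

    // Send a single datagram to the group with the given time-to-live.
    static void send(String group, int port, int ttl) throws Exception {
        try (MulticastSocket socket = new MulticastSocket()) {
            socket.setTimeToLive(ttl); // must be high enough to cross any routers
            byte[] data = "hello from McastProbe".getBytes(StandardCharsets.UTF_8);
            socket.send(new DatagramPacket(data, data.length, InetAddress.getByName(group), port));
        }
    }

    public static void main(String[] args) throws Exception {
        if (args.length < 3 || !isMulticastGroup(args[1])) {
            System.out.println("usage: McastProbe receive|send <multicast-group> <port> [ttl]");
            return;
        }
        int port = Integer.parseInt(args[2]);
        if (args[0].equals("receive")) {
            receive(args[1], port);
        } else {
            int ttl = args.length > 3 ? Integer.parseInt(args[3]) : 32;
            send(args[1], port, ttl);
        }
    }
}
```

Run it in `receive` mode on one server and in `send` mode on the other, using the same group as the cluster and preferably a port the cluster is not using. If nothing is printed on the receiving side, the problem is in the network (router, TTL, interface selection) rather than in the JBoss configuration.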

5. Re: Clustering NOT working on physical separate boxes.

rajeshchande Dec 12, 2006 11:12 AM (in response to rajeshchande)

Hello JBoss team,

Thanks for the links; they helped me perform the necessary steps. Finally I was able to get the nodes to identify each other.

#####m/c1#########
15:59:33,679 INFO [Server] JBoss (MX MicroKernel) [4.0.3SP1 (build: CVSTag=JBoss_4_0_3_SP1 date=200510231054)] Started in 41s:253ms
16:02:20,112 INFO [devl-partition-01] New cluster view for partition devl-partition-01 (id: 1, delta: 1) : [3.187.196.86:53043, 3.187.200.23:53043]
16:02:20,116 INFO [devl-partition-01] Merging partitions...
16:02:20,116 INFO [devl-partition-01] Dead members: 0
16:02:20,117 INFO [devl-partition-01] Originating groups: [[server-1-bak:49934 (additional data: 18 bytes)|0] [server-1-bak:49934 (additional data: 18 bytes)], [172.17.132.140:59013 (additional data: 18 bytes)|0] [172.17.132.140:59013 (additional data: 18 bytes)]]
#####m/c1#########

#####m/c-2#########
15:55:59,974 INFO [Server] JBoss (MX MicroKernel) [4.0.3SP1 (build: CVSTag=JBoss_4_0_3_SP1 date=200510231054)] Started in 1m:8s:90ms
16:02:20,110 INFO [devl-partition-01] New cluster view for partition devl-partition-01: 1 ([3.187.196.86:53043, 3.187.200.23:53043] delta: 1)
16:02:20,111 INFO [devl-partition-01] Merging partitions...
16:02:20,112 INFO [devl-partition-01] Dead members: 0
16:02:20,113 INFO [devl-partition-01] Originating groups: [[172.17.132.70:49934 (additional data: 18 bytes)|0] [172.17.132.70:49934 (additional data:18 bytes)], [server2_b:59013 (additional data: 18 bytes)|0] [server2_b:59013 (additional data: 18 bytes)]]
#####m/c-2#########

I made the following changes:

1) In "run.sh" I specified -Djboss.partition.name=devl-partition-01 -Djboss.partition.udpGroup=228.1.2.4
2) The UDP parameters in deploy/cluster-service.xml look as shown below:

############
 <UDP bind_addr="172.17.132.70" mcast_addr="228.8.8.8" mcast_port="45566"
 ip_ttl="32" ip_mcast="true"
 mcast_send_buf_size="150000" mcast_recv_buf_size="80000"
 ucast_send_buf_size="150000" ucast_recv_buf_size="80000"
 loopback="true" max_bundle_size="60000" max_bundle_timeout="30"
 use_incoming_packet_handler="false" use_outgoing_packet_handler="false"
 enable_bundling="false" />
############

So they see each other now.

But when I deploy a WAR file in the "farm" folder of server-1 on the first m/c, local deployment on server-1 succeeds, but the node on the second m/c does not seem to pull the newly deployed WAR file, and vice versa. So the farm deployment is failing.

Any idea why this is failing? Do I need to change anything in deploy/tc5-cluster-service.xml to make "farm" deployment work?

6. Re: Clustering NOT working on physical separate boxes.

rajeshchande Dec 12, 2006 1:04 PM (in response to rajeshchande)
Hello All,

A few observations:

- When I start the instances "devl-01" and "devl-02" on the same physical machine, with "devl-01" started first and then "devl-02", "devl-02" appears to JOIN the cluster.
- When I start the third instance, "devl-03", on a separate machine, it does NOT join the existing partition (cluster); instead it says:

18:32:09,369 INFO [devl-partition-01] Merging partitions...

Also, it does NOT show the following

18:41:13,799 INFO [FarmMemberService] **** pullNewDeployments ****

when "devl-03" (separate m/c) is started, but I do get it when "devl-02" (same m/c as devl-01) is started.

Why is it merging the partition for the third instance but joining for the second instance on the same machine?
7. Re: Clustering NOT working on physical separate boxes.

brian.stansberry Dec 12, 2006 2:15 PM (in response to rajeshchande)

A merge is probably a sign that during the initial discovery process when devl-03 started, it did not find the other servers within the timeout period. So it formed a cluster of one. Later it did find the other servers, and the two clusters merged.

For a UDP config, the behavior of the discovery process is controlled via the PING protocol. See http://wiki.jboss.org/wiki/Wiki.jsp?page=JGroupsPING for more details.
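The join-versus-merge difference can be pictured with a toy model of the discovery step. This is only an illustrative sketch, not JGroups code: the class `DiscoveryModel`, the `Outcome` enum and the reply delays are invented, and the two numeric parameters merely stand in for PING's `timeout` and `num_initial_members` attributes.

```java
import java.util.Arrays;
import java.util.List;

// Toy model of initial discovery; hypothetical names, not the JGroups implementation.
class DiscoveryModel {

    enum Outcome { JOIN_EXISTING, FORM_SINGLETON }

    // replyDelaysMs: how long each existing member takes to answer the discovery request.
    static Outcome discover(List<Long> replyDelaysMs, long timeoutMs, int numInitialMembers) {
        int replies = 0;
        for (long delay : replyDelaysMs) {
            if (delay <= timeoutMs) {
                replies++; // this answer arrived before the timeout expired
            }
            if (replies >= numInitialMembers) {
                break; // enough answers, no need to wait for more
            }
        }
        // No answer in time: the node assumes it is alone and starts its own
        // cluster of one, which can only be reconciled later through a merge.
        return replies > 0 ? Outcome.JOIN_EXISTING : Outcome.FORM_SINGLETON;
    }

    public static void main(String[] args) {
        // Same machine: replies come back quickly, so the node joins.
        System.out.println(discover(Arrays.asList(150L, 400L), 2000, 3));
        // Separate machine with slow or blocked multicast: nothing arrives in time.
        System.out.println(discover(Arrays.asList(9000L), 2000, 3));
    }
}
```

In this model the first call yields JOIN_EXISTING and the second FORM_SINGLETON, which corresponds to the "cluster of one, merged later" case described above.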

If you post the protocol stack portion of your cluster-service.xml file, I can have a look.

8. Re: Clustering NOT working on physical separate boxes.

rajeshchande Dec 13, 2006 4:27 AM (in response to rajeshchande)

Hello Brian,

Thanks for the reply.

Here is the config for protocol stack:

<Config>
 <UDP bind_addr="172.17.132.70" mcast_addr="228.8.8.8" mcast_port="45566" ip_ttl="32" ip_mcast="true"
 mcast_send_buf_size="100000" mcast_recv_buf_size="200000"
 ucast_send_buf_size="100000" ucast_recv_buf_size="200000"
 loopback="true" max_bundle_size="60000" max_bundle_timeout="30" use_incoming_packet_handler="false" use_outgoing_packet_handler="false" enable_bundling="false" />
 <PING timeout="2000" num_initial_members="3"
 up_thread="true" down_thread="true"/>
 <MERGE2 min_interval="10000" max_interval="20000"/>
 <FD shun="true" up_thread="true" down_thread="true"
 timeout="2500" max_tries="5"/>
 <VERIFY_SUSPECT timeout="3000" num_msgs="3"
 up_thread="true" down_thread="true"/>
 <pbcast.NAKACK gc_lag="50" retransmit_timeout="300,600,1200,2400,4800" max_xmit_size="8192"
 up_thread="true" down_thread="true"/>
 <UNICAST timeout="300,600,1200,2400,4800" window_size="100" min_threshold="10" down_thread="true"/>
 <pbcast.STABLE desired_avg_gossip="20000"
 up_thread="true" down_thread="true"/>
 <FRAG frag_size="8192" down_thread="true" up_thread="true"/>
 <pbcast.GMS join_timeout="5000" join_retry_timeout="2000"
 shun="true" print_local_addr="true"/>
 <pbcast.STATE_TRANSFER up_thread="true" own_thread="true"/>
 </Config>

I have a few questions:

1) When they merge, there is no "pulling" of the deployments. Why?
2) On the JBoss index page, is there a step-by-step process for making a JBoss cluster work? (I see all the information is present, but it's too scattered, no?) If there is such a link, can you please provide it?

9. Re: Clustering NOT working on physical separate boxes.

rajeshchande Dec 13, 2006 7:55 AM (in response to rajeshchande)

Hello Brian,

As suggested, I tried changing the PING and GMS parameters as below in both cluster-service.xml and tc5-cluster-service.xml, keeping everything else as before:

<config>
 <UDP bind_addr="172.17.132.140" mcast_addr="228.8.8.8" mcast_port="45577"
 .....
 ... />
 <PING timeout="300000" num_initial_members="3"
 up_thread="false" down_thread="false"/>
 ....
 <pbcast.GMS join_timeout="50000" join_retry_timeout="2000"
 shun="true" print_local_addr="true"/>
 .....
 </config>

The nodes (devl-01 and devl-03) on different servers still don't join each other. Moreover, after the above changes the merge no longer happens either.

The output from each of them is as follows:

output from devl-01 on server-1:

12:14:28,366 INFO [ChannelSocket] JK: ajp13 listening on /0.0.0.0:53041
12:14:28,395 INFO [JkMain] Jk running ID=0 time=0/82 config=null
12:14:28,414 INFO [Server] JBoss (MX MicroKernel) [4.0.3SP1 (build: CVSTag=JBoss_4_0_3_SP1 date=200510231054)] Started in 10m:45s:134ms
12:40:29,541 INFO [TreeCache] viewAccepted(): new members: [server-1-bak:52692, 172.17.132.140:60857]
12:51:11,194 WARN [CoordGmsImpl] merge responses from subgroup coordinators <=1 ([sender=server-1-bak:52680 (additional data: 18 bytes), view=[server-1-bak:52680 (additional data: 18 bytes)|0] [server-1-bak:52680 (additional data: 18 bytes)], digest=[server-1-bak:52680 (additional data: 18 bytes): [0 : 4]]).Cancelling merge
12:51:53,440 ERROR [CoordGmsImpl] merge_id ([server-1-bak:52680 (additional data: 18 bytes)|1166010661184]) or this.merge_id (null) == null (sender=172.17.132.140:60847 (additional data: 18 bytes)).

output from devl-03 on server-2:

12:25:49,163 INFO [Server] JBoss (MX MicroKernel) [4.0.3SP1 (build: CVSTag=JBoss_4_0_3_SP1 date=200510231054)] Started in 11m:4s:198ms
12:40:29,540 INFO [TreeCache] viewAccepted(): new members: [172.17.132.70:52692, server-2_b:60857]
12:56:12,328 WARN [NAKACK] [server-2_b:60847 (additional data: 18 bytes)] discarded message from non-member 172.17.132.70:52680 (additional data: 18 bytes)
12:56:12,333 WARN [NAKACK] [server-2_b:60847 (additional data: 18 bytes)] discarded message from non-member 172.17.132.70:52680 (additional data: 18 bytes)

10. Re: Clustering NOT working on physical separate boxes.

brian.stansberry Dec 13, 2006 3:30 PM (in response to rajeshchande)

I checked and the farm service doesn't attempt to monitor merges. If you want, you can add a feature request JIRA for that under the JBoss Application Server project.

There really isn't a simple step-by-step guide. Typically the "run -c all on all the servers" approach just works. If it doesn't, it's not a simple step-by-step thing to resolve, as the source of problems tends to be very specific to the particular environment. But, you're right, the docs are weak when it comes to the initial effort of getting things running; they dive too much into details and architectural principles.

Re: your config, don't change the GMS.join_timeout. A PING timeout of 300000 is way too high; how about something like 5000 or 7500? To be honest, I have no idea what kind of weird effects a PING timeout of 300000 would cause. If your servers aren't able to discover each other normally within 5 seconds, I recommend you talk to your network guys to see if there are communication issues between the servers.
11. Re: Clustering NOT working on physical separate boxes.

rajeshchande Dec 13, 2006 3:51 PM (in response to rajeshchande)

Thanks Brian,

I really appreciate your timely and honest answers.

Regards,
Rajesh.
