3 Replies Latest reply on Sep 27, 2008 10:12 AM by asookazian

Problems with JBoss clustering

agohar Aug 21, 2008 5:30 AM

Hi,
I am having problems in jboss clustering, here is what i am trying to do:

We have 8 servers at 2 physical locations and these are all clustered well and were working fine with jdk1.4.2 and jboss-4.0.2. We wanted to upgrade our servers with new java 5 and jboss-4.2.2 versions and applications to use JbossWS and EJB3.0 versions. For that we put down the 4 servers (2 from each location) and upgraded them with new java and jboss versions and new upgraded applications. We are using udp multicast for the clustering so i changed the ports and IPs in following files to split the cluster:

- cluster-service.xml
- tree cache files
- jboss-web-cluster/META-INF/jboss-service.xml

I didn't change the partition name though and tried to run these both clusters in parallel. But got into problems, when i start any two of them they startup fine and create a new separate cluster fine but when i try to add third one, it is very very slow during start up and take ages to start, but doesn't throw any exceptions during startup and when it is up it throw:

2008-08-20 01:07:45,008 ERROR [org.jgroups.protocols.UDP] [10.100.54.135:41757] failed receiving unicast packet
java.lang.ArrayIndexOutOfBoundsException
at java.net.PlainDatagramSocketImpl.receive0(Native Method)
at java.net.PlainDatagramSocketImpl.receive(PlainDatagramSocketImpl.java:136)
at java.net.DatagramSocket.receive(DatagramSocket.java:712)
at org.jgroups.protocols.UDP$UcastReceiver.run(UDP.java:885)
at java.lang.Thread.run(Thread.java:595)
2008-08-20 01:07:45,008 ERROR [org.jgroups.protocols.UDP] failure in multicast receive()
java.lang.ArrayIndexOutOfBoundsException

I tried every combination of servers any two servers run perfectly fine and third and fourth ones cause problems. The other cluster works perfectly fine its jsut the problem with the new one. I had upgraded jgroups version to jgroups-2.6.0-GA because of a bug and now using this (2.6.0-GA) version.

Please let me know if there is anything i can look at? Your help will be highly appreciated

Thank you

1. Re: Problems with JBoss clustering

agohar Aug 28, 2008 12:09 PM (in response to agohar)

Hi,

Just to figure out the problem, i've tried to put fresh copies of jboss-4.2.2 on 3 test servers on the same network with same cluster configurations but find the same issue that when third one joins the cluster it is very slow, but i've got some WARN messages in logs of the other two servers. Here is what i tried:

- Started Server A (bind_addr = 10.100.54.14)
- Started Server B (bind_addr = 10.100.54.135).. Joins the cluster and Everything looks fine
- Started Server C(bind_addr = 10.100.54.12) .. It does join the cluster but is very slow

Here are the logs on C, it stucks for long time here (posting relevant portion only):

-------------------------------------------------------
GMS: address is 10.100.54.12:34566
-------------------------------------------------------
16:22:35,096 WARN [GMS] join(10.100.54.12:34566) sent to 10.100.54.14:40469 timed out, retrying
16:22:39,129 INFO [TreeCache] viewAccepted(): [10.100.54.14:40469|2] [10.100.54.14:40469, 10.100.54.135:45846, 10.100.54.12:34566]
16:22:42,160 ERROR [FD_SOCK] received null cache; retrying
16:22:45,668 ERROR [FD_SOCK] received null cache; retrying
16:22:49,176 ERROR [FD_SOCK] received null cache; retrying
16:22:49,686 INFO [TreeCache] TreeCache local address is 10.100.54.12:34566
16:22:49,699 INFO [TreeCache] received the state (size=1024 bytes)
16:22:49,740 INFO [TreeCache] state was retrieved successfully (in 54 milliseconds)
16:22:49,740 INFO [TreeCache] parseConfig(): PojoCacheConfig is empty
16:22:49,949 INFO [STDOUT] no object for null
16:22:49,958 INFO [STDOUT] no object for null
16:22:50,011 INFO [STDOUT] no object for null
16:22:50,053 INFO [STDOUT] no object for {urn:jboss:bean-deployer}supplyType
16:22:50,075 INFO [STDOUT] no object for {urn:jboss:bean-deployer}dependsType
16:22:53,624 INFO [NativeServerConfig] JBoss Web Services - Native
16:22:53,624 INFO [NativeServerConfig] jbossws-native-2.0.1.SP2 (build=200710210837)
16:22:54,978 INFO [SnmpAgentService] SNMP agent going active
16:22:55,627 INFO [DefaultPartition] Initializing
16:22:55,714 INFO [STDOUT]
-------------------------------------------------------
GMS: address is 10.100.54.12:34571
-------------------------------------------------------
16:23:02,800 ERROR [FD_SOCK] received null cache; retrying
16:23:06,308 ERROR [FD_SOCK] received null cache; retrying
16:23:09,816 ERROR [FD_SOCK] received null cache; retrying
16:23:10,323 INFO [DefaultPartition] Number of cluster members: 3
16:23:10,323 INFO [DefaultPartition] Other members: 2
16:23:10,323 INFO [DefaultPartition] Fetching state (will wait for 30000 milliseconds):
16:23:10,374 INFO [DefaultPartition] state was retrieved successfully (in 50 milliseconds)
16:24:10,483 INFO [HANamingService] Started ha-jndi bootstrap jnpPort=1100, backlog=50, bindAddress=/0.0.0.0
16:24:10,497 INFO [DetachedHANamingService$AutomaticDiscovery] Listening on /0.0.0.0:1102, group=230.0.0.4, HA-JNDI address=10.100.54.12:1100

I can see these warnings on Server A's Logs:

16:22:32,150 INFO [TreeCache] viewAccepted(): [10.100.54.14:40469|2] [10.100.54.14:40469, 10.100.54.135:45846, 10.100.54.12:34566]
16:22:37,157 WARN [GMS] failed to collect all ACKs (2) for view [10.100.54.14:40469|2] [10.100.54.14:40469, 10.100.54.135:45846, 10.100.54.12:34566] after 5000ms, missing ACKs from [10.100.54.135:45846] (received=[10.100.54.14:40469]), local_addr=10.100.54.14:40469
16:22:39,106 WARN [GMS] 10.100.54.12:34566 already present; returning existing view [10.100.54.14:40469|2] [10.100.54.14:40469, 10.100.54.135:45846, 10.100.54.12:34566]
16:22:49,694 INFO [TreeCache] locking the subtree at / to transfer state
16:22:49,694 INFO [StateTransferGenerator_140] returning the state for tree rooted in /(1024 bytes)
16:22:57,784 INFO [DefaultPartition] New cluster view for partition DefaultPartition (id: 2, delta: 1) : [10.100.54.14:1099, 10.100.54.135:1099, 10.100.54.12:1099]
16:22:57,784 INFO [DefaultPartition] I am (10.100.54.14:1099) received membershipChanged event:
16:22:57,784 INFO [DefaultPartition] Dead members: 0 ([])
16:22:57,784 INFO [DefaultPartition] New Members : 1 ([10.100.54.12:1099])
16:22:57,784 INFO [DefaultPartition] All Members : 3 ([10.100.54.14:1099, 10.100.54.135:1099, 10.100.54.12:1099])
16:22:59,790 WARN [GMS] failed to collect all ACKs (2) for view [10.100.54.14:40472|2] [10.100.54.14:40472, 10.100.54.135:45849, 10.100.54.12:34571] after 2000ms, missing ACKs from [10.100.54.135:45849] (received=[10.100.54.14:40472]), local_addr=10.100.54.14:40472
16:26:13,214 INFO [TreeCache] viewAccepted(): [10.100.54.14:40474|1] [10.100.54.14:40474, 10.100.54.12:34573]
16:26:26,091 INFO [TreeCache] viewAccepted(): [10.100.54.14:40476|1] [10.100.54.14:40476, 10.100.54.12:34575]

and warnings on Server B:

16:22:40,007 WARN [NAKACK] 10.100.54.135:45846] discarded message from non-member 10.100.54.12:34566, my view is [10.100.54.14:40469|1] [10.100.54.14:40469, 10.100.54.135:45846]
16:23:05,714 WARN [NAKACK] 10.100.54.135:45849] discarded message from non-member 10.100.54.12:34571, my view is [10.100.54.14:40472|1] [10.100.54.14:40472, 10.100.54.135:45849]
16:23:10,452 WARN [NAKACK] 10.100.54.135:45849] discarded message from non-member 10.100.54.12:34571, my view is [10.100.54.14:40472|1] [10.100.54.14:40472, 10.100.54.135:45849]
16:24:10,502 WARN [NAKACK] 10.100.54.135:45849] discarded message from non-member 10.100.54.12:34571, my view is [10.100.54.14:40472|1] [10.100.54.14:40472, 10.100.54.135:45849]
16:24:45,147 WARN [NAKACK] 10.100.54.135:45846] discarded message from non-member 10.100.54.12:34566, my view is [10.100.54.14:40469|1] [10.100.54.14:40469, 10.100.54.135:45846]
16:25:09,831 WARN [NAKACK] 10.100.54.135:45849] discarded message from non-member 10.100.54.12:34571, my view is [10.100.54.14:40472|1] [10.100.54.14:40472, 10.100.54.135:45849]
16:25:10,504 WARN [NAKACK] 10.100.54.135:45849] discarded message from non-member 10.100.54.12:34571, my view is [10.100.54.14:40472|1] [10.100.54.14:40472, 10.100.54.135:45849]

Please note that another cluster is already running on the same network with 5 servers in it and it works fine. and i am looking to run both of these clusters in parallel.

Any clue?

2. Re: Problems with JBoss clustering

mearman Sep 26, 2008 7:28 PM (in response to agohar)

Did you find an answer to this? I have a very similar problem.
Actions
3. Re: Problems with JBoss clustering

asookazian Sep 27, 2008 10:12 AM (in response to agohar)

"agohar" wrote:
Please note that another cluster is already running on the same network with 5 servers in it and it works fine. and i am looking to run both of these clusters in parallel.

when you say you are trying to run both of these clusters in parallel do you want all servers to join the same cluster or not?

if not, you must move them onto different subnets and/or perform the JGroups channel isolation configuration as per the appropriate wiki docs.

you may also need to configure/create more than one partition.

please provide more detail as to the design/architecture goal(s) of this cluster(s). it sounds like you have a complicated setup.

also, are these horizontal or mixed (vertical + horz) clusters?
Actions

Go to original post