Hi,
To narrow down the problem, I put fresh copies of jboss-4.2.2 on 3 test servers on the same network, with identical cluster configurations, and I hit the same issue: when the third server joins the cluster it is very slow, and I get some WARN messages in the logs of the other two servers. Here is what I tried:
- Started Server A (bind_addr = 10.100.54.14)
- Started Server B (bind_addr = 10.100.54.135) .. joins the cluster and everything looks fine
- Started Server C (bind_addr = 10.100.54.12) .. it does join the cluster but is very slow
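For reference, each node was started with the stock `all` configuration, passing the node's address via `-b` (which sets `jboss.bind.address` and, in a default 4.2 setup, flows into the JGroups bind_addr). Paths are from my environment; adjust as needed:

```shell
# Server A
./run.sh -c all -b 10.100.54.14

# Server B
./run.sh -c all -b 10.100.54.135

# Server C (the slow one)
./run.sh -c all -b 10.100.54.12
```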
Here are the logs on C; it gets stuck here for a long time (posting the relevant portion only):
-------------------------------------------------------
GMS: address is 10.100.54.12:34566
-------------------------------------------------------
16:22:35,096 WARN [GMS] join(10.100.54.12:34566) sent to 10.100.54.14:40469 timed out, retrying
16:22:39,129 INFO [TreeCache] viewAccepted(): [10.100.54.14:40469|2] [10.100.54.14:40469, 10.100.54.135:45846, 10.100.54.12:34566]
16:22:42,160 ERROR [FD_SOCK] received null cache; retrying
16:22:45,668 ERROR [FD_SOCK] received null cache; retrying
16:22:49,176 ERROR [FD_SOCK] received null cache; retrying
16:22:49,686 INFO [TreeCache] TreeCache local address is 10.100.54.12:34566
16:22:49,699 INFO [TreeCache] received the state (size=1024 bytes)
16:22:49,740 INFO [TreeCache] state was retrieved successfully (in 54 milliseconds)
16:22:49,740 INFO [TreeCache] parseConfig(): PojoCacheConfig is empty
16:22:49,949 INFO [STDOUT] no object for null
16:22:49,958 INFO [STDOUT] no object for null
16:22:50,011 INFO [STDOUT] no object for null
16:22:50,053 INFO [STDOUT] no object for {urn:jboss:bean-deployer}supplyType
16:22:50,075 INFO [STDOUT] no object for {urn:jboss:bean-deployer}dependsType
16:22:53,624 INFO [NativeServerConfig] JBoss Web Services - Native
16:22:53,624 INFO [NativeServerConfig] jbossws-native-2.0.1.SP2 (build=200710210837)
16:22:54,978 INFO [SnmpAgentService] SNMP agent going active
16:22:55,627 INFO [DefaultPartition] Initializing
16:22:55,714 INFO [STDOUT]
-------------------------------------------------------
GMS: address is 10.100.54.12:34571
-------------------------------------------------------
16:23:02,800 ERROR [FD_SOCK] received null cache; retrying
16:23:06,308 ERROR [FD_SOCK] received null cache; retrying
16:23:09,816 ERROR [FD_SOCK] received null cache; retrying
16:23:10,323 INFO [DefaultPartition] Number of cluster members: 3
16:23:10,323 INFO [DefaultPartition] Other members: 2
16:23:10,323 INFO [DefaultPartition] Fetching state (will wait for 30000 milliseconds):
16:23:10,374 INFO [DefaultPartition] state was retrieved successfully (in 50 milliseconds)
16:24:10,483 INFO [HANamingService] Started ha-jndi bootstrap jnpPort=1100, backlog=50, bindAddress=/0.0.0.0
16:24:10,497 INFO [DetachedHANamingService$AutomaticDiscovery] Listening on /0.0.0.0:1102, group=230.0.0.4, HA-JNDI address=10.100.54.12:1100
I can see these warnings in Server A's logs:
16:22:32,150 INFO [TreeCache] viewAccepted(): [10.100.54.14:40469|2] [10.100.54.14:40469, 10.100.54.135:45846, 10.100.54.12:34566]
16:22:37,157 WARN [GMS] failed to collect all ACKs (2) for view [10.100.54.14:40469|2] [10.100.54.14:40469, 10.100.54.135:45846, 10.100.54.12:34566] after 5000ms, missing ACKs from [10.100.54.135:45846] (received=[10.100.54.14:40469]), local_addr=10.100.54.14:40469
16:22:39,106 WARN [GMS] 10.100.54.12:34566 already present; returning existing view [10.100.54.14:40469|2] [10.100.54.14:40469, 10.100.54.135:45846, 10.100.54.12:34566]
16:22:49,694 INFO [TreeCache] locking the subtree at / to transfer state
16:22:49,694 INFO [StateTransferGenerator_140] returning the state for tree rooted in /(1024 bytes)
16:22:57,784 INFO [DefaultPartition] New cluster view for partition DefaultPartition (id: 2, delta: 1) : [10.100.54.14:1099, 10.100.54.135:1099, 10.100.54.12:1099]
16:22:57,784 INFO [DefaultPartition] I am (10.100.54.14:1099) received membershipChanged event:
16:22:57,784 INFO [DefaultPartition] Dead members: 0 ([])
16:22:57,784 INFO [DefaultPartition] New Members : 1 ([10.100.54.12:1099])
16:22:57,784 INFO [DefaultPartition] All Members : 3 ([10.100.54.14:1099, 10.100.54.135:1099, 10.100.54.12:1099])
16:22:59,790 WARN [GMS] failed to collect all ACKs (2) for view [10.100.54.14:40472|2] [10.100.54.14:40472, 10.100.54.135:45849, 10.100.54.12:34571] after 2000ms, missing ACKs from [10.100.54.135:45849] (received=[10.100.54.14:40472]), local_addr=10.100.54.14:40472
16:26:13,214 INFO [TreeCache] viewAccepted(): [10.100.54.14:40474|1] [10.100.54.14:40474, 10.100.54.12:34573]
16:26:26,091 INFO [TreeCache] viewAccepted(): [10.100.54.14:40476|1] [10.100.54.14:40476, 10.100.54.12:34575]
and these warnings in Server B's logs:
16:22:40,007 WARN [NAKACK] 10.100.54.135:45846] discarded message from non-member 10.100.54.12:34566, my view is [10.100.54.14:40469|1] [10.100.54.14:40469, 10.100.54.135:45846]
16:23:05,714 WARN [NAKACK] 10.100.54.135:45849] discarded message from non-member 10.100.54.12:34571, my view is [10.100.54.14:40472|1] [10.100.54.14:40472, 10.100.54.135:45849]
16:23:10,452 WARN [NAKACK] 10.100.54.135:45849] discarded message from non-member 10.100.54.12:34571, my view is [10.100.54.14:40472|1] [10.100.54.14:40472, 10.100.54.135:45849]
16:24:10,502 WARN [NAKACK] 10.100.54.135:45849] discarded message from non-member 10.100.54.12:34571, my view is [10.100.54.14:40472|1] [10.100.54.14:40472, 10.100.54.135:45849]
16:24:45,147 WARN [NAKACK] 10.100.54.135:45846] discarded message from non-member 10.100.54.12:34566, my view is [10.100.54.14:40469|1] [10.100.54.14:40469, 10.100.54.135:45846]
16:25:09,831 WARN [NAKACK] 10.100.54.135:45849] discarded message from non-member 10.100.54.12:34571, my view is [10.100.54.14:40472|1] [10.100.54.14:40472, 10.100.54.135:45849]
16:25:10,504 WARN [NAKACK] 10.100.54.135:45849] discarded message from non-member 10.100.54.12:34571, my view is [10.100.54.14:40472|1] [10.100.54.14:40472, 10.100.54.135:45849]
Please note that another cluster with 5 servers is already running on the same network, and it works fine. I am looking to run both of these clusters in parallel.
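One thing I am wondering about: if both clusters use the same multicast group and port, JGroups members of one cluster will see the other cluster's traffic, which would match the NAKACK "discarded message from non-member" warnings above. In JBoss 4.2 the usual way to separate clusters is to give each one its own partition name (`-g`) and its own UDP multicast address (`-u`). A sketch of what I mean (partition names and multicast addresses below are made-up example values, not from my actual setup):

```shell
# Existing 5-node cluster -- its own partition name and mcast group
./run.sh -c all -b <node-address> -g ClusterOnePartition -u 239.255.100.100

# New 3-node cluster -- a DIFFERENT partition name and mcast group
./run.sh -c all -b 10.100.54.14 -g ClusterTwoPartition -u 239.255.200.200
```

As far as I understand, the multicast ports used by the various JGroups stacks (cluster-service.xml, the web-session cache, etc.) may also need to differ between the clusters; I have not confirmed whether `-u` alone is enough to fully isolate them.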
Any clue?