3 Replies Latest reply on Sep 27, 2008 10:12 AM by asookazian

    Problems with JBoss clustering

    agohar

      Hi,
      I am having problems with JBoss clustering. Here is what I am trying to do:

      We have 8 servers at 2 physical locations. They were all clustered and working fine with JDK 1.4.2 and jboss-4.0.2. We wanted to upgrade the servers to Java 5 and jboss-4.2.2, and the applications to JBossWS and EJB 3.0. To do that, we took down 4 servers (2 from each location) and upgraded them with the new Java and JBoss versions and the upgraded applications. We use UDP multicast for clustering, so I changed the ports and IPs in the following files to split the cluster:

      - cluster-service.xml
      - tree cache files
      - jboss-web-cluster/META-INF/jboss-service.xml

      I didn't change the partition name, though, and tried to run both clusters in parallel. That is where the problems started. When I start any two of the upgraded servers, they start up fine and form a new, separate cluster. When I try to add a third one, it is very slow during startup and takes ages to start. It doesn't throw any exceptions during startup, but once it is up it throws:

      2008-08-20 01:07:45,008 ERROR [org.jgroups.protocols.UDP] [10.100.54.135:41757] failed receiving unicast packet
      java.lang.ArrayIndexOutOfBoundsException
      at java.net.PlainDatagramSocketImpl.receive0(Native Method)
      at java.net.PlainDatagramSocketImpl.receive(PlainDatagramSocketImpl.java:136)
      at java.net.DatagramSocket.receive(DatagramSocket.java:712)
      at org.jgroups.protocols.UDP$UcastReceiver.run(UDP.java:885)
      at java.lang.Thread.run(Thread.java:595)
      2008-08-20 01:07:45,008 ERROR [org.jgroups.protocols.UDP] failure in multicast receive()
      java.lang.ArrayIndexOutOfBoundsException


      I tried every combination of servers: any two servers run perfectly fine, and the third and fourth ones cause problems. The other cluster works perfectly fine; the problem is only with the new one. I upgraded JGroups to jgroups-2.6.0-GA because of a bug, and that (2.6.0-GA) is the version I am using now.

      Please let me know if there is anything I can look at. Your help will be highly appreciated.

      Thank you

        • 1. Re: Problems with JBoss clustering
          agohar

          Hi,

          Just to narrow down the problem, I put fresh copies of jboss-4.2.2 on 3 test servers on the same network with the same cluster configuration, and I see the same issue: when the third one joins the cluster it is very slow. This time I got some WARN messages in the logs of the other two servers. Here is what I tried:

          - Started Server A (bind_addr = 10.100.54.14)
          - Started Server B (bind_addr = 10.100.54.135). It joins the cluster and everything looks fine
          - Started Server C (bind_addr = 10.100.54.12). It does join the cluster but is very slow

          Here are the logs on C. It gets stuck here for a long time (posting the relevant portion only):

          -------------------------------------------------------
          GMS: address is 10.100.54.12:34566
          -------------------------------------------------------
          16:22:35,096 WARN [GMS] join(10.100.54.12:34566) sent to 10.100.54.14:40469 timed out, retrying
          16:22:39,129 INFO [TreeCache] viewAccepted(): [10.100.54.14:40469|2] [10.100.54.14:40469, 10.100.54.135:45846, 10.100.54.12:34566]
          16:22:42,160 ERROR [FD_SOCK] received null cache; retrying
          16:22:45,668 ERROR [FD_SOCK] received null cache; retrying
          16:22:49,176 ERROR [FD_SOCK] received null cache; retrying
          16:22:49,686 INFO [TreeCache] TreeCache local address is 10.100.54.12:34566
          16:22:49,699 INFO [TreeCache] received the state (size=1024 bytes)
          16:22:49,740 INFO [TreeCache] state was retrieved successfully (in 54 milliseconds)
          16:22:49,740 INFO [TreeCache] parseConfig(): PojoCacheConfig is empty
          16:22:49,949 INFO [STDOUT] no object for null
          16:22:49,958 INFO [STDOUT] no object for null
          16:22:50,011 INFO [STDOUT] no object for null
          16:22:50,053 INFO [STDOUT] no object for {urn:jboss:bean-deployer}supplyType
          16:22:50,075 INFO [STDOUT] no object for {urn:jboss:bean-deployer}dependsType
          16:22:53,624 INFO [NativeServerConfig] JBoss Web Services - Native
          16:22:53,624 INFO [NativeServerConfig] jbossws-native-2.0.1.SP2 (build=200710210837)
          16:22:54,978 INFO [SnmpAgentService] SNMP agent going active
          16:22:55,627 INFO [DefaultPartition] Initializing
          16:22:55,714 INFO [STDOUT]
          -------------------------------------------------------
          GMS: address is 10.100.54.12:34571
          -------------------------------------------------------
          16:23:02,800 ERROR [FD_SOCK] received null cache; retrying
          16:23:06,308 ERROR [FD_SOCK] received null cache; retrying
          16:23:09,816 ERROR [FD_SOCK] received null cache; retrying
          16:23:10,323 INFO [DefaultPartition] Number of cluster members: 3
          16:23:10,323 INFO [DefaultPartition] Other members: 2
          16:23:10,323 INFO [DefaultPartition] Fetching state (will wait for 30000 milliseconds):
          16:23:10,374 INFO [DefaultPartition] state was retrieved successfully (in 50 milliseconds)
          16:24:10,483 INFO [HANamingService] Started ha-jndi bootstrap jnpPort=1100, backlog=50, bindAddress=/0.0.0.0
          16:24:10,497 INFO [DetachedHANamingService$AutomaticDiscovery] Listening on /0.0.0.0:1102, group=230.0.0.4, HA-JNDI address=10.100.54.12:1100
          


          I can see these warnings in Server A's logs:
          16:22:32,150 INFO [TreeCache] viewAccepted(): [10.100.54.14:40469|2] [10.100.54.14:40469, 10.100.54.135:45846, 10.100.54.12:34566]
          16:22:37,157 WARN [GMS] failed to collect all ACKs (2) for view [10.100.54.14:40469|2] [10.100.54.14:40469, 10.100.54.135:45846, 10.100.54.12:34566] after 5000ms, missing ACKs from [10.100.54.135:45846] (received=[10.100.54.14:40469]), local_addr=10.100.54.14:40469
          16:22:39,106 WARN [GMS] 10.100.54.12:34566 already present; returning existing view [10.100.54.14:40469|2] [10.100.54.14:40469, 10.100.54.135:45846, 10.100.54.12:34566]
          16:22:49,694 INFO [TreeCache] locking the subtree at / to transfer state
          16:22:49,694 INFO [StateTransferGenerator_140] returning the state for tree rooted in /(1024 bytes)
          16:22:57,784 INFO [DefaultPartition] New cluster view for partition DefaultPartition (id: 2, delta: 1) : [10.100.54.14:1099, 10.100.54.135:1099, 10.100.54.12:1099]
          16:22:57,784 INFO [DefaultPartition] I am (10.100.54.14:1099) received membershipChanged event:
          16:22:57,784 INFO [DefaultPartition] Dead members: 0 ([])
          16:22:57,784 INFO [DefaultPartition] New Members : 1 ([10.100.54.12:1099])
          16:22:57,784 INFO [DefaultPartition] All Members : 3 ([10.100.54.14:1099, 10.100.54.135:1099, 10.100.54.12:1099])
          16:22:59,790 WARN [GMS] failed to collect all ACKs (2) for view [10.100.54.14:40472|2] [10.100.54.14:40472, 10.100.54.135:45849, 10.100.54.12:34571] after 2000ms, missing ACKs from [10.100.54.135:45849] (received=[10.100.54.14:40472]), local_addr=10.100.54.14:40472
          16:26:13,214 INFO [TreeCache] viewAccepted(): [10.100.54.14:40474|1] [10.100.54.14:40474, 10.100.54.12:34573]
          16:26:26,091 INFO [TreeCache] viewAccepted(): [10.100.54.14:40476|1] [10.100.54.14:40476, 10.100.54.12:34575]
          


          And these warnings on Server B:
          16:22:40,007 WARN [NAKACK] 10.100.54.135:45846] discarded message from non-member 10.100.54.12:34566, my view is [10.100.54.14:40469|1] [10.100.54.14:40469, 10.100.54.135:45846]
          16:23:05,714 WARN [NAKACK] 10.100.54.135:45849] discarded message from non-member 10.100.54.12:34571, my view is [10.100.54.14:40472|1] [10.100.54.14:40472, 10.100.54.135:45849]
          16:23:10,452 WARN [NAKACK] 10.100.54.135:45849] discarded message from non-member 10.100.54.12:34571, my view is [10.100.54.14:40472|1] [10.100.54.14:40472, 10.100.54.135:45849]
          16:24:10,502 WARN [NAKACK] 10.100.54.135:45849] discarded message from non-member 10.100.54.12:34571, my view is [10.100.54.14:40472|1] [10.100.54.14:40472, 10.100.54.135:45849]
          16:24:45,147 WARN [NAKACK] 10.100.54.135:45846] discarded message from non-member 10.100.54.12:34566, my view is [10.100.54.14:40469|1] [10.100.54.14:40469, 10.100.54.135:45846]
          16:25:09,831 WARN [NAKACK] 10.100.54.135:45849] discarded message from non-member 10.100.54.12:34571, my view is [10.100.54.14:40472|1] [10.100.54.14:40472, 10.100.54.135:45849]
          16:25:10,504 WARN [NAKACK] 10.100.54.135:45849] discarded message from non-member 10.100.54.12:34571, my view is [10.100.54.14:40472|1] [10.100.54.14:40472, 10.100.54.135:45849]
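
In the logs above, Server A reports view id 2 with three members while Server B's NAKACK warnings still show view id 1 with two members. The following is a minimal sketch, assuming the view format seen in these log lines (`[coordinator|id] [member, member, ...]`), that parses two such lines and lists the members one node has not yet installed. The class and method names (`ViewCheck`, `parse`, `missingFrom`) are hypothetical helpers for reading logs and are not part of JGroups or JBoss.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical log-reading helper; not a JGroups or JBoss API.
final class ViewCheck {
    // Matches "[coordinator|viewId] [member1, member2, ...]" as printed in the logs above.
    private static final Pattern VIEW =
        Pattern.compile("\\[([^\\]|]+)\\|(\\d+)\\]\\s*\\[([^\\]]*)\\]");

    static final class View {
        final String coordinator;
        final int id;
        final List<String> members;

        View(String coordinator, int id, List<String> members) {
            this.coordinator = coordinator;
            this.id = id;
            this.members = members;
        }
    }

    // Returns the first view found in the line, or null if the line contains none.
    static View parse(String logLine) {
        Matcher m = VIEW.matcher(logLine);
        if (!m.find()) {
            return null;
        }
        List<String> members = new ArrayList<String>();
        for (String s : m.group(3).split(",")) {
            String t = s.trim();
            if (!t.isEmpty()) {
                members.add(t);
            }
        }
        return new View(m.group(1).trim(), Integer.parseInt(m.group(2)), members);
    }

    // Members listed in the current view that the stale view does not contain.
    static List<String> missingFrom(View stale, View current) {
        List<String> missing = new ArrayList<String>(current.members);
        missing.removeAll(stale.members);
        return missing;
    }

    public static void main(String[] args) {
        // View fragments copied from the Server A and Server B logs in this post.
        View onA = parse("viewAccepted(): [10.100.54.14:40469|2] "
            + "[10.100.54.14:40469, 10.100.54.135:45846, 10.100.54.12:34566]");
        View onB = parse("my view is [10.100.54.14:40469|1] "
            + "[10.100.54.14:40469, 10.100.54.135:45846]");
        System.out.println("A is at view " + onA.id + ", B is at view " + onB.id);
        System.out.println("Not yet in B's view: " + missingFrom(onB, onA));
    }
}
```

Running the same comparison on lines from every node makes it easier to see which member never installed the new view, which in these logs is the node whose ACK Server A reports as missing.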
          


          Please note that another cluster with 5 servers in it is already running on the same network, and it works fine. I am looking to run both of these clusters in parallel.

          Any clue?

          • 2. Re: Problems with JBoss clustering
            mearman

            Did you find an answer to this? I have a very similar problem.

            • 3. Re: Problems with JBoss clustering
              asookazian

               

              "agohar" wrote:
              Please note that another cluster with 5 servers in it is already running on the same network, and it works fine. I am looking to run both of these clusters in parallel.


              When you say you are trying to run both of these clusters in parallel, do you want all servers to join the same cluster or not?

              If not, you must move them onto different subnets and/or perform the JGroups channel isolation configuration as per the appropriate wiki docs.

              You may also need to configure/create more than one partition.
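
As a rough illustration of the isolation check suggested here, the sketch below takes the multicast settings of each channel on each cluster and flags pairs from different clusters that share a multicast address and port, or that reuse the same partition/group name. Everything in it is an assumption for illustration: the `ChannelIsolationCheck` class, the `Channel` fields, and the addresses and ports in `main` are made up and are not taken from the poster's configuration or from any JBoss API.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper for reviewing cluster settings by hand; not a JBoss or JGroups API.
final class ChannelIsolationCheck {
    static final class Channel {
        final String cluster;    // label for the cluster the node belongs to, e.g. "old" or "new"
        final String groupName;  // partition or cache group name, e.g. "DefaultPartition"
        final String mcastAddr;  // multicast address configured for the channel
        final int mcastPort;     // multicast port configured for the channel

        Channel(String cluster, String groupName, String mcastAddr, int mcastPort) {
            this.cluster = cluster;
            this.groupName = groupName;
            this.mcastAddr = mcastAddr;
            this.mcastPort = mcastPort;
        }
    }

    // Compares every pair of channels that belong to different clusters.
    static List<String> conflicts(List<Channel> channels) {
        List<String> out = new ArrayList<String>();
        for (int i = 0; i < channels.size(); i++) {
            for (int j = i + 1; j < channels.size(); j++) {
                Channel a = channels.get(i);
                Channel b = channels.get(j);
                if (a.cluster.equals(b.cluster)) {
                    continue; // channels of the same cluster are expected to match
                }
                if (a.mcastAddr.equals(b.mcastAddr) && a.mcastPort == b.mcastPort) {
                    out.add(a.cluster + "/" + b.cluster + ": same multicast address and port "
                        + a.mcastAddr + ":" + a.mcastPort);
                } else if (a.groupName.equals(b.groupName)) {
                    out.add(a.cluster + "/" + b.cluster + ": same group name " + a.groupName
                        + " (worth reviewing)");
                }
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // Illustrative values only.
        List<Channel> channels = Arrays.asList(
            new Channel("old", "DefaultPartition", "239.1.1.1", 45001),
            new Channel("new", "DefaultPartition", "239.1.1.2", 45002),
            new Channel("old", "TreeCache", "239.1.1.3", 45003),
            new Channel("new", "TreeCache", "239.1.1.3", 45003));
        for (String c : conflicts(channels)) {
            System.out.println(c);
        }
    }
}
```

The check has to cover every channel a node opens (the partition, the tree caches, and the web session cache), since each of the files listed in the first post configures its own channel.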

              Please provide more detail on the design/architecture goal(s) of this cluster (or clusters). It sounds like you have a complicated setup.

              Also, are these horizontal or mixed (vertical + horizontal) clusters?