2 Replies Latest reply on Mar 24, 2011 11:36 AM by andyd513

    Odd issue with JBoss 4.2.3GA, JGroups and Clustering

    andyd513

      My environment:

       

      4 clustered nodes running:

       

      RHEL 5.5

      JBoss 4.2.3GA

      Java(TM) SE Runtime Environment (build 1.6.0_20-b02)

      Java HotSpot(TM) Server VM (build 16.3-b01, mixed mode)

      JGroups 1.4.1SP-4

       

      Yesterday I had an extremely strange issue.

       

      Consider the scenario below:

       

      Node 1 - Up

      Node 2 - Up

      Node 3 - Up <- coordinator

      Node 4 - Up

      For some reason the JVM for Node 3 hung. It would not respond to the shutdown command, and the container appeared to still be open, but requests to it never rendered anything. A kill command was issued, bringing Node 3 down. A standard JBoss restart was then performed, but on restart the log file showed the following:

       

      ------------------------------------------

      GMS Address: x.x.x.x:xxxx

      ------------------------------------------

       

      Then every few seconds the log line:

       

      WARN [org.jgroups.protocols.pbcast.GMS] Join(x.x.x.x:xxxx) to <address of coordinator that was killed> timed out, retrying

       

      I decided to try a restart of Node 1 with JGroups logging set to trace in log4j to see if it would show the same symptoms, and it did. I also set JGroups to trace on Node 2. I could then see Node 1 start up, send the appropriate REQs, and get a response back from Node 2 naming the dead coordinator as the active coordinator. Node 1 then could not contact the dead coordinator, so it never joined the cluster, never retrieved state, and its container never started.
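The stale-coordinator symptom above can be illustrated with a toy model. This is only a sketch and uses no JGroups classes: it assumes the usual JGroups convention that the first member of the current view acts as coordinator. The class name, the method names, and the order of node2 and node4 in the view are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Toy model only: plain Java, no JGroups API is used or implied.
class StaleViewDemo {

    // Assumption: the coordinator is the first member of the view.
    static String coordinatorOf(List<String> view) {
        if (view.isEmpty()) {
            throw new IllegalStateException("empty view");
        }
        return view.get(0);
    }

    // What one would expect after a member is detected as failed:
    // it is dropped from the view and the next member becomes first.
    static List<String> viewWithout(List<String> view, String suspected) {
        List<String> next = new ArrayList<String>(view);
        next.remove(suspected);
        return next;
    }

    public static void main(String[] args) {
        // View still held by the surviving nodes; member order is hypothetical.
        List<String> staleView = Arrays.asList("node3", "node2", "node4");
        System.out.println("coordinator reported to joiner: " + coordinatorOf(staleView));

        // View the survivors would hold if node3 had been removed.
        List<String> healedView = viewWithout(staleView, "node3");
        System.out.println("coordinator after node3 is removed: " + coordinatorOf(healedView));
    }
}
```

As long as the surviving nodes keep the dead node at the head of their view, every joiner is pointed at it, which matches the repeated "timed out, retrying" line in the log.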

       

      Now I have:

       

      Node 1 - Down

      Node 3 - Down (dead coordinator)

       

      Node 2 - Up -> reporting coord_addr as Node 3

      Node 4 - Up -> I'd imagine reporting the same

       

      I had to bring the whole cluster down and start it fresh to get it working again.

       

      I'm lucky this was a QA environment, but I'm curious whether anyone has run into this before. Is it a known issue with JGroups 1.4.1, a config issue, etc.?

       

      Thanks for your assistance!

        • 1. Odd issue with JBoss 4.2.3GA, JGroups and Clustering
          wdfink

          I've seen some strange behaviour like this. In my case it was a network misconfiguration (possibly changed on the fly).

           

          I experimented a little with the JGroups test programs and found that, in my case, it made a difference which IP address was bound (run.sh -b) and which multicast address was used.

          Maybe that helps in your case.

          See http://community.jboss.org/wiki/TestingJBoss
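As a first check before running the JGroups test programs, it can help to see which local addresses sit on interfaces that are up and report multicast support. The following is a minimal sketch using only the standard java.net API (Java 6 or later); the class and method names are hypothetical, and it does not prove that multicast traffic actually flows between nodes.

```java
import java.net.InetAddress;
import java.net.NetworkInterface;
import java.net.SocketException;
import java.util.ArrayList;
import java.util.Collections;
import java.util.Enumeration;
import java.util.List;

class BindAddressCheck {

    // Returns one line per address on every interface that is up,
    // noting whether that interface reports multicast support.
    static List<String> describeInterfaces() {
        List<String> lines = new ArrayList<String>();
        try {
            Enumeration<NetworkInterface> nics = NetworkInterface.getNetworkInterfaces();
            if (nics == null) {
                return lines; // no interfaces visible to the JVM
            }
            for (NetworkInterface nic : Collections.list(nics)) {
                if (!nic.isUp()) {
                    continue;
                }
                for (InetAddress addr : Collections.list(nic.getInetAddresses())) {
                    lines.add(nic.getName() + " " + addr.getHostAddress()
                            + " multicast=" + nic.supportsMulticast());
                }
            }
        } catch (SocketException e) {
            throw new IllegalStateException("could not query network interfaces", e);
        }
        return lines;
    }

    public static void main(String[] args) {
        for (String line : describeInterfaces()) {
            System.out.println(line);
        }
    }
}
```

On a host with both a public and a private address, the address passed to run.sh -b would be expected to appear in this list on an interface with multicast=true.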

          • 2. Odd issue with JBoss 4.2.3GA, JGroups and Clustering
            andyd513

            Well, we have been unable to reproduce it so far.

             

            I will definitely be using the link you provided to assist in troubleshooting any issues, and will reply on this thread if I see a recurrence.

             

            Interestingly, these systems are bound with both an intranet/public IP and a private IP address, so it's very possible that binding issues are causing problems.

             

            Thanks,

             

            Andrew