2 Replies Latest reply on Mar 24, 2011 11:36 AM by Andrew DuFour

    Odd issue with JBoss 4.2.3GA, JGroups and Clustering

    Andrew DuFour Newbie

      My environment:

       

      4 clustered nodes running:

       

      RHEL 5.5

      JBoss 4.2.3GA

      Java(TM) SE Runtime Environment (build 1.6.0_20-b02)

      Java HotSpot(TM) Server VM (build 16.3-b01, mixed mode)

      JGroups 1.4.1SP-4

       

      Yesterday I had an extremely strange issue.

       

      Consider the scenario below:

       

      Node 1 - Up

      Node 2 - Up

      Node 3 - Up <- coordinator

      Node 4 - Up

      For some reason the JVM for Node 3 hung. It would not respond to the shutdown command, and the container still appeared to be open, but requests to it never rendered anything. A kill command was issued, bringing Node 3 down. A standard JBoss restart was then performed, but on restart the log file showed the following:

       

      ------------------------------------------

      GMS Address: x.x.x.x:xxxx

      ------------------------------------------

       

      Then every few seconds the log line:

       

      WARN [org.jgroups.protocols.pbcast.GMS] Join(x.x.x.x:xxxx) to <address of coordinator that was killed> timed out, retrying

       

      I decided to restart Node 1 with JGroups set to TRACE in log4j to see whether it would show the same symptoms, and it did. I also set JGroups to TRACE on Node 2. I could then see Node 1 start up, send the appropriate REQs, and get a response back from Node 2 naming the dead coordinator as the still-active coordinator. Node 1 obviously can't contact the dead coordinator, so it never joins the cluster, never retrieves state, and its container never starts.
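      To make the symptom concrete, here is a toy Java model of what the trace output suggested. It is a sketch under stated assumptions, not JGroups code: the class `StaleCoordinatorSketch`, the `PingRsp` holder, and the "most-named coordinator wins" rule are all hypothetical stand-ins for whatever the real discovery and GMS code does. It only shows why a joiner keeps retrying when every surviving member still reports a dead node as coordinator.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Toy model only: names and rules here are assumptions, not the JGroups API.
public class StaleCoordinatorSketch {

    /** One discovery response: who answered, and who they believe is coordinator. */
    static final class PingRsp {
        final String sender;
        final String coordAddr;

        PingRsp(String sender, String coordAddr) {
            this.sender = sender;
            this.coordAddr = coordAddr;
        }
    }

    /** Returns the coordinator named by the most responders, or null if there are no responses. */
    static String determineCoordinator(List<PingRsp> responses) {
        Map<String, Integer> votes = new HashMap<>();
        for (PingRsp rsp : responses) {
            votes.merge(rsp.coordAddr, 1, Integer::sum);
        }
        String best = null;
        int bestVotes = 0;
        for (Map.Entry<String, Integer> e : votes.entrySet()) {
            if (e.getValue() > bestVotes) {
                best = e.getKey();
                bestVotes = e.getValue();
            }
        }
        return best;
    }

    /** A join succeeds only if the chosen coordinator is actually alive. */
    static boolean tryJoin(String joiner, List<PingRsp> responses,
                           Set<String> liveNodes, int maxAttempts) {
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            String coord = determineCoordinator(responses);
            if (coord != null && liveNodes.contains(coord)) {
                return true;
            }
            // Mirrors the shape of the WARN line seen in the log.
            System.out.println("Join(" + joiner + ") to " + coord + " timed out, retrying");
        }
        return false;
    }

    public static void main(String[] args) {
        // Node 3 was killed, Node 1 is restarting; only Nodes 2 and 4 are alive.
        Set<String> live = new HashSet<>(Arrays.asList("node2", "node4"));

        // Both survivors still name the dead Node 3 as coordinator.
        List<PingRsp> stale = Arrays.asList(
                new PingRsp("node2", "node3"),
                new PingRsp("node4", "node3"));
        System.out.println("joined with stale view: " + tryJoin("node1", stale, live, 3));

        // If the survivors had elected Node 2 instead, the join would go through.
        List<PingRsp> fresh = Arrays.asList(
                new PingRsp("node2", "node2"),
                new PingRsp("node4", "node2"));
        System.out.println("joined with fresh view: " + tryJoin("node1", fresh, live, 3));
    }
}
```

      In this model the joiner never recovers on its own, because nothing it does changes what Node 2 and Node 4 report. That matches what I saw: the only thing that cleared it was restarting the surviving members so they formed a new view.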

       

      Now I have:

       

      Node 1 - Down

      Node 3 - Down (dead coordinator)

       

      Node 2 - Up -> reporting coord_addr as Node 3

      Node 4 - Up -> I'd imagine reporting the same

       

      I had to bring the whole cluster down and start it fresh to get it working again.

       

      I'm lucky this was a QA environment, but I'm curious whether anyone has run into this before. Is it a known issue with JGroups 1.4.1, a config issue, or something else?

       

      Thanks for your assistance!