4 Replies Latest reply on Mar 19, 2008 6:54 PM by brian.stansberry

    JGroups Multiple Registrations

    kvbisme

      We have two JBoss servers (4.0.5.GA) clustered together. Each machine has two network cards: one connects to the outside world, and the other to a small subnet containing the JBoss servers and other support servers used by the enterprise (the database and the like).
      In run.conf we added -Djboss.bind.address= at startup to point to the network card connected to that smaller subnet.
      Every so often we start getting the following messages (the machine names and IP addresses have been changed to accommodate my overly concerned boss):


      [org.jgroups.protocols.FD] I was suspected, but will not remove myself from membership (waiting for EXIT message)
      [org.jgroups.protocols.pbcast.GMS] checkSelfInclusion() failed, Machine1-int:33252 (additional data: 18 bytes) is not a member of view [Machine2-int:33200 (additional data: 18 bytes) |2] [Machine2-int:33200 (additional data: 18 bytes)]; discarding view
      [org.jgroups.protocols.pbcast.GMS] I (Machine1-int:33252 (additional data: 18 bytes)) am being shunned, will leave and rejoin group (prev_members are [Machine1-int:33252 (additional data: 18 bytes) Machine2-int:33200 (additional data: 18 bytes) ])
      [org.jgroups.protocols.pbcast.NAKACK] [Machine1-int:33252 (additional data: 18 bytes)] discarded message from non-member Machine2-int:33200 (additional data: 18 bytes)
      [org.jgroups.protocols.pbcast.NAKACK] [Machine1-int:33252 (additional data: 18 bytes)] discarded message from non-member Machine2-int:33200 (additional data: 18 bytes)
      [org.jgroups.protocol.PING] down_handler thread for PING was interrupted (in order to be terminated), but is is still alive
      ----------------------------------------------------------------------------
      GMS: address is Machine1-int:33265 (additional data: 18 bytes)
      ----------------------------------------------------------------------------
      [org.jgroups.protocol.pbcast.NAKACK] sender Machine1-int:33252 (additional data: 18 bytes) not found in received_msgs
      [org.jgroups.protocol.pbcast.NAKACK] range is null
      [org.jgroups.protocol.pbcast.NAKACK] sender Machine2-int:33200 (additional data: 18 bytes) not found in received_msgs
      [org.jgroups.protocol.pbcast.NAKACK] range is null
      [org.jgroups.protocol.pbcast.Digest] sender is null, will not add it !
      [org.jgroups.protocol.pbcast.Digest] sender is null, will not add it !
      [org.jgroups.protocols.pbcast.NAKACK] sender at index 1 in digest is null
      [org.jgroups.protocols.pbcast.NAKACK] sender at index 2 in digest is null
      [org.jboss.ha.framework.interfaces.HAPartition.lifecycle.DefaultPartition] New cluster view for partition DefaultPartition ( id: 3, delta: 1) : [111.222.333.001:1099, 111.222.333.002:1099, 111.222.333.001:1099]
      [org.jboss.ha.framework.server.DistributedReplicantManagerImpl.DefaultPartition] I am (111.222.333.001:1099) receivedmembershipChanged event:
      [org.jboss.ha.framework.server.DistributedReplicantManagerImpl.DefaultPartition] Dead Members: 0 ([])
      [org.jboss.ha.framework.server.DistributedReplicantManagerImpl.DefaultPartition] New Members: 0 ([])
      [org.jboss.ha.framework.server.DistributedReplicantManagerImpl.DefaultPartition] All Members : 3 ([ 111.222.333.001:1099, 111.222.333.002:1099, 111.222.333.001:1099])
      [org.jgroups.protocols.pbcast.STATE_TRANSFER] GET_APPLSTATE_OK: received application state, but there are no requestors !

      Then there are four sets of the following messages with sequence numbers starting at zero and ending at 1273:

      [org.jgroups.protocols.pbcast.NAKACK] (requestor=Machine1-int:33265 (additional data: 18 bytes), local_addr=Machine1-int:33252 (additional data: 18 bytes)) message with seqno=0 not found in sent_msgs ! sent_msgs=[1274 - 1274]

      . . .

      [org.jgroups.protocols.pbcast.NAKACK] (requestor=Machine1-int:33265 (additional data: 18 bytes), local_addr=Machine1-int:33252 (additional data: 18 bytes)) message with seqno=1273 not found in sent_msgs ! sent_msgs=[1274 - 1274]

      At this point Machine1 starts adding itself to the cluster over and over again until we have to stop and restart the machine.

      What could possibly be going on here?
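The "All Members : 3" line above lists 111.222.333.001:1099 twice, which is the visible symptom of the repeated registration. As a rough way to spot this in a log, here is a minimal Java sketch that pulls the member list out of such a line and reports any address that appears more than once. The class and method names (ViewDuplicates, members, duplicates) are hypothetical helpers, not part of JBoss or JGroups, and the parsing assumes the member list is the last bracketed group on the line, as in the messages quoted above.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical helper (not part of JBoss or JGroups): finds repeated members in a view log line. */
final class ViewDuplicates {

    /** Extracts the comma-separated entries between the last '[' and the next ']' on the line. */
    static List<String> members(String logLine) {
        List<String> result = new ArrayList<>();
        int open = logLine.lastIndexOf('[');
        int close = logLine.indexOf(']', open);
        if (open < 0 || close < 0) {
            return result; // no bracketed member list on this line
        }
        for (String part : logLine.substring(open + 1, close).split(",")) {
            String member = part.trim();
            if (!member.isEmpty()) {
                result.add(member);
            }
        }
        return result;
    }

    /** Returns only the members that occur more than once, with their counts, in first-seen order. */
    static Map<String, Integer> duplicates(List<String> members) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (String member : members) {
            counts.merge(member, 1, Integer::sum);
        }
        counts.values().removeIf(count -> count < 2); // drop members seen exactly once
        return counts;
    }

    public static void main(String[] args) {
        // Line copied from the log excerpt in this post (addresses are already anonymized).
        String line = "All Members : 3 ([ 111.222.333.001:1099, 111.222.333.002:1099, 111.222.333.001:1099])";
        System.out.println(duplicates(members(line)));
    }
}
```

Run against the line from the log, this reports 111.222.333.001:1099 with a count of 2. It only detects the symptom; it says nothing about why the member registered twice.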


        • 1. Re: JGroups Multiple Registrations
          kvbisme

          Sorry, I lied: we do not have -Djboss.bind.address set.

          • 2. Re: JGroups Multiple Registrations
            brian.stansberry

            This looks to be the key to the problem:

            [org.jgroups.protocol.PING] down_handler thread for PING was interrupted (in order to be terminated), but is is still alive


            Machine1-int was detected as non-responsive and excluded from its group (see http://wiki.jboss.org/wiki/Wiki.jsp?page=FDVersusFD_SOCK for details); when it realized this, it tried to stop and restart its JGroups channel in order to rejoin the group. The above message indicates a problem in stopping the channel, which I believe left a "zombie" channel running and causing problems.

            Suggest you try upgrading JGroups to the latest compatible release, JGroups 2.4.2.

            • 3. Re: JGroups Multiple Registrations
              kvbisme

              Thanks for the reply, Brian, and I agree. What is confusing to me is that Machine1 is the first machine in the cluster so I am confused what would cause Machine1 to become unresponsive to itself?

              Is there a particular fix I can point to in JGroups for justification of this upgrade to the boss?

              • 4. Re: JGroups Multiple Registrations
                brian.stansberry

                 

                "kvbisme" wrote:
                What is confusing to me is that Machine1 is the first machine in the cluster so I am confused what would cause Machine1 to become unresponsive to itself?


                It isn't unresponsive to itself, it's unresponsive to other members in the cluster. They decide to remove Machine1 from the group.
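To illustrate the point that suspicion is decided by the other members, here is a toy Java model of heartbeat-style failure detection as seen from Machine2's side. It is a sketch under stated assumptions, not the JGroups FD implementation: the class name SuspicionModel, its fields, and the values 2500 ms and 5 tries are all hypothetical, chosen only to show that a peer which stops answering for long enough gets suspected regardless of how healthy it believes itself to be.

```java
/** Toy model of heartbeat-based failure detection; NOT the JGroups FD implementation. */
final class SuspicionModel {
    private final long timeoutMs; // how long the observer waits for each heartbeat reply (hypothetical)
    private final int maxTries;   // unanswered heartbeats tolerated before suspecting (hypothetical)
    private long lastAckMs;       // time of the last reply received from the monitored peer

    SuspicionModel(long timeoutMs, int maxTries, long startMs) {
        this.timeoutMs = timeoutMs;
        this.maxTries = maxTries;
        this.lastAckMs = startMs;
    }

    /** Called by the observer whenever the monitored peer answers a heartbeat. */
    void ackReceived(long nowMs) {
        lastAckMs = nowMs;
    }

    /** The observer suspects the peer once maxTries consecutive timeouts have elapsed without a reply. */
    boolean isSuspected(long nowMs) {
        return nowMs - lastAckMs >= timeoutMs * maxTries;
    }

    public static void main(String[] args) {
        // Machine2's view of Machine1, using illustrative values only.
        SuspicionModel machine2ViewOfMachine1 = new SuspicionModel(2500, 5, 0);

        machine2ViewOfMachine1.ackReceived(10_000);                             // Machine1 answers at t=10s
        System.out.println(machine2ViewOfMachine1.isSuspected(15_000));         // 5s of silence: not yet suspected
        System.out.println(machine2ViewOfMachine1.isSuspected(22_500));         // 12.5s of silence: suspected
    }
}
```

In this model Machine1 never takes part in the decision: if its replies do not reach Machine2 within the window, Machine2 suspects it, which matches the "I was suspected" message Machine1 logs afterwards.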

                "kvbisme" wrote:
                Is there a particular fix I can point to in JGroups for justification of this upgrade to the boss?


                Can't recall anything specific, no. I have an (admittedly vague) memory of a similar issue coming up sometime in the past, and that moving off JGroups 2.2.7 helped.

                In general, using JGroups 2.4 is recommended. 2.2.7 is very old. We were forced for policy reasons to leave it in all the 4.0.x releases (so for example a 4.0.5 server could interoperate with a 4.0.2). But if it weren't for that requirement, we would have used a much later JGroups release in AS 4.0.5.