5 Replies Latest reply on Sep 21, 2007 12:18 PM by belaban

    Cluster merge issues

    jbossmk

      We have a cluster of nodes deployed on the same machine. The cluster-service.xml contains the following snippet:

      <TCP bind_addr="localhost" start_port="${jboss.cluster.tcp.port:7800}" loopback="true"/>
      <TCPPING initial_hosts="localhost[${jboss.cluster.tcp.port:7800}]" port_range="${jboss.cluster.tcp.port.range:5}" timeout="3500"
      num_initial_members="${jboss.cluster.tcp.members:5}" up_thread="true" down_thread="true"/>
      <MERGE2 min_interval="5000" max_interval="10000"/>
      <FD shun="true" timeout="5000" max_tries="5" up_thread="false" down_thread="false" />
      <VERIFY_SUSPECT timeout="4000" down_thread="false" up_thread="false" />
      <pbcast.NAKACK down_thread="true" up_thread="true" gc_lag="100"
      retransmit_timeout="3000"/>
      <pbcast.STABLE desired_avg_gossip="20000" down_thread="false" up_thread="false" />
      <pbcast.GMS join_timeout="5000" join_retry_timeout="2000" shun="false"
      print_local_addr="true" down_thread="true" up_thread="true"/>
      <pbcast.STATE_TRANSFER up_thread="true" down_thread="true"/>
      



      When a split happens, the nodes from the secondary partition don't merge at all. We have to restart the node every time this happens.
      Could someone tell me if there is anything wrong in the configuration?
      Would setting shun="true" in GMS change the behavior? I have also heard that the JGroups channel's AUTO_RECONNECT option should be set to true programmatically. How do we do that declaratively?

      Your help is appreciated.

      Thanks.

        • 1. Re: Cluster merge issues
          belaban

          #1 Check that 'localhost' really resolves to the correct address (e.g. not to 127.0.0.1) on *all* hosts.
          #2 You can't set AUTO_RECONNECT declaratively; use the following code instead:

          JChannel ch; // the application's existing channel instance
          ch.setOpt(Channel.AUTO_RECONNECT, Boolean.TRUE);

          Possibly also
          ch.setOpt(Channel.AUTO_GETSTATE, Boolean.TRUE);

          • 2. Re: Cluster merge issues
            jbossmk

            "localhost" resolves to a valid domain, and all five nodes run on a single box. We are using JGroups 2.4.1.

            Is the configuration okay otherwise?

            Why does the split happen in the first place, given that all the nodes are running on the same box? Could it be because of long GC pauses? Could there be any other reasons?

            Please let us know.

            Thanks.

            • 3. Re: Cluster merge issues
              belaban

              Use FD_SOCK instead of or on top of FD (see http://wiki.jboss.org/wiki/Wiki.jsp?page=FDVersusFD_SOCK for details). Suspicions can happen for a number of reasons, e.g. garbage collection or the up queue being blocked by a callback; these are also explained there.

              • 4. Re: Cluster merge issues
                jbossmk

                Thanks for your suggestion, Bela.

                If I use FD_SOCK on top of FD, what happens when FD has timed out after retrying, but the socket (FD_SOCK) between the nodes is still active? Would FD send a SUSPECT message?

                Thanks in advance.

                • 5. Re: Cluster merge issues
                  belaban

                  Yes, FD would still send a SUSPECT message. So set the timeout in FD to a sufficiently high value.
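
                  As a rough sketch only, the failure-detection part of the stack from the original post might look like this with FD_SOCK added alongside FD. The placement of FD_SOCK just before FD and the FD timeout of 10000 ms are illustrative assumptions, not values given in this thread; tune them for your environment.

                  ```xml
                  <MERGE2 min_interval="5000" max_interval="10000"/>
                  <!-- Assumption: FD_SOCK listed just before FD; it suspects a member when its TCP socket closes -->
                  <FD_SOCK down_thread="false" up_thread="false"/>
                  <!-- Assumption: timeout raised from 5000 to 10000 ms as one example of a "sufficiently high" value -->
                  <FD shun="true" timeout="10000" max_tries="5" up_thread="false" down_thread="false" />
                  <VERIFY_SUSPECT timeout="4000" down_thread="false" up_thread="false" />
                  ```

                  With both protocols in the stack, FD_SOCK should catch crashed processes quickly, while the longer FD timeout makes it less likely that a GC pause alone triggers a suspicion.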