0 Replies Latest reply on Aug 8, 2016 11:08 PM by mohilkhare

    (Wildfly HA) FD_SOCK issue: Unable to reconnect and form tcp ring after firewall snaps tcp connection

    mohilkhare

      Hello,

       

      I am on WildFly 9, running a cluster of 3 nodes. We have the following JGroups config:

       

                 <stack name="tcp">

                          <transport socket-binding="jgroups-tcp" type="TCP"/>

                          <protocol type="TCPPING">

                              <property name="initial_hosts">

                              10.9.2.2[7600],10.9.3.2[7600],10.9.1.2[7600]</property>

                              <property name="port_range">

                                  0

                              </property>

                          </protocol>

                          <protocol type="MERGE2"/>

                          <protocol socket-binding="jgroups-tcp-fd" type="FD_SOCK"/>

                          <protocol type="FD"/>

                          <protocol type="VERIFY_SUSPECT"/>

                          <protocol type="pbcast.NAKACK2"/>

                          <protocol type="UNICAST3"/>

                          <protocol type="pbcast.STABLE"/>

                          <protocol type="pbcast.GMS">

                              <property name="join_timeout">

                                  5000

                              </property>

                          </protocol>

                          <protocol type="MFC"/>

                          <protocol type="FRAG2"/>

                          <protocol type="RSVP"/>

                      </stack>

       

          <interfaces>

              <interface name="management">

                  <inet-address value="${jboss.bind.address.management:0.0.0.0}"/>

              </interface>

              <interface name="public">

                  <inet-address value="${jboss.bind.address:0.0.0.0}"/>

              </interface>

              <interface name="unsecure">

                  <inet-address value="${jboss.bind.address.unsecure:127.0.0.1}"/>

              </interface>

              <interface name="jgroup-tcp-interface">

                  <inet-address value="10.9.2.2"/>

              </interface>

          </interfaces>

       

       

          <socket-binding-group default-interface="public" name="standard-sockets" port-offset="${jboss.socket.binding.port-offset:0}">

              <socket-binding interface="management" name="management-http" port="${jboss.management.http.port:9990}"/>

              <socket-binding interface="management" name="management-https" port="${jboss.management.https.port:9993}"/>

              <socket-binding name="ajp" port="${jboss.ajp.port:8009}"/>

              <socket-binding name="http" port="${jboss.http.port:8080}"/>

              <socket-binding name="https" port="${jboss.https.port:8443}"/>

              <socket-binding interface="jgroup-tcp-interface" name="jgroups-tcp" port="7600"/>

              <socket-binding interface="jgroup-tcp-interface" name="jgroups-tcp-fd" port="57600"/>

              <socket-binding name="txn-recovery-environment" port="4712"/>

              <socket-binding name="txn-status-manager" port="4713"/>

              <outbound-socket-binding name="mail-smtp">

                  <remote-destination host="localhost" port="25"/>

              </outbound-socket-binding>

          </socket-binding-group>

       

      Our kernel's TCP keep-alive time is 2 hours. We deployed our cluster in an environment where there is a firewall between two of the cluster nodes. Since the TCP connection to port 57600 is used only by FD_SOCK and stays idle most of the time, the firewall dropped that connection, breaking the FD_SOCK socket ring. After the keep-alive interval elapsed, I expected the sockets to reconnect and re-establish the ring; instead I ended up with a linear chain of sockets, i.e.

       

      Before the firewall broke the incoming and outgoing connections of A:

       

      A <----B <---C --

      |----------------->|        

       

      After the firewall broke the connections (before the keep-alive interval elapsed and before the "Received new cluster view" messages appeared because of MERGE):

       

      A    B<---C          

       

      After the firewall broke the connections (after the keep-alive interval elapsed and after the "Received new cluster view" messages appeared because of MERGE):


      A<---B<---C


      This looks like a bug. Am I missing something here?
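
      For reference, the only FD_SOCK knob I am aware of that relates to this is sketched below. This is an untested fragment, not a known fix. It assumes that `keep_alive` is the JGroups 3.x FD_SOCK property that enables SO_KEEPALIVE on the ping socket, and that it is already on by default. Even with it set, the probe interval would still come from the kernel setting (2 hours here), so it may not stop the firewall from dropping the idle connection.

```xml
<!-- Sketch only: the same FD_SOCK entry as in the stack above,
     with keep_alive made explicit. -->
<protocol socket-binding="jgroups-tcp-fd" type="FD_SOCK">
    <!-- Assumption: keep_alive turns on SO_KEEPALIVE for the FD_SOCK ping socket
         and is believed to default to true; it does not change the kernel's
         keep-alive interval. -->
    <property name="keep_alive">true</property>
</protocol>
```

      If that assumption holds, the config above already behaves this way, and the remaining question is why the ring is not re-formed once the broken connection is finally detected.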


      Thanks

      Mohil