(Wildfly HA) FD_SOCK issue: Unable to reconnect and form tcp ring after firewall snaps tcp connection
mohilkhare Aug 8, 2016 11:08 PM

Hello,
I am on WildFly 9, running a cluster of 3 nodes. We have the following JGroups config:
<stack name="tcp">
    <transport socket-binding="jgroups-tcp" type="TCP"/>
    <protocol type="TCPPING">
        <property name="initial_hosts">10.9.2.2[7600],10.9.3.2[7600],10.9.1.2[7600]</property>
        <property name="port_range">0</property>
    </protocol>
    <protocol type="MERGE2"/>
    <protocol socket-binding="jgroups-tcp-fd" type="FD_SOCK"/>
    <protocol type="FD"/>
    <protocol type="VERIFY_SUSPECT"/>
    <protocol type="pbcast.NAKACK2"/>
    <protocol type="UNICAST3"/>
    <protocol type="pbcast.STABLE"/>
    <protocol type="pbcast.GMS">
        <property name="join_timeout">5000</property>
    </protocol>
    <protocol type="MFC"/>
    <protocol type="FRAG2"/>
    <protocol type="RSVP"/>
</stack>
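For context, my understanding is that FD_SOCK already enables SO_KEEPALIVE on its ring sockets via its keep_alive property (true by default, as far as I can tell from the JGroups docs), so the sockets do get probed, but only at the kernel's interval; a sketch of making that explicit (the property name is from the JGroups FD_SOCK documentation, not something we have verified in production):

```xml
<!-- Sketch only: explicitly enable SO_KEEPALIVE on the FD_SOCK ring sockets.
     The probe timing itself still comes from the kernel, not from JGroups. -->
<protocol socket-binding="jgroups-tcp-fd" type="FD_SOCK">
    <property name="keep_alive">true</property>
</protocol>
```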
<interfaces>
    <interface name="management">
        <inet-address value="${jboss.bind.address.management:0.0.0.0}"/>
    </interface>
    <interface name="public">
        <inet-address value="${jboss.bind.address:0.0.0.0}"/>
    </interface>
    <interface name="unsecure">
        <inet-address value="${jboss.bind.address.unsecure:127.0.0.1}"/>
    </interface>
    <interface name="jgroup-tcp-interface">
        <inet-address value="10.9.2.2"/>
    </interface>
</interfaces>
<socket-binding-group default-interface="public" name="standard-sockets" port-offset="${jboss.socket.binding.port-offset:0}">
    <socket-binding interface="management" name="management-http" port="${jboss.management.http.port:9990}"/>
    <socket-binding interface="management" name="management-https" port="${jboss.management.https.port:9993}"/>
    <socket-binding name="ajp" port="${jboss.ajp.port:8009}"/>
    <socket-binding name="http" port="${jboss.http.port:8080}"/>
    <socket-binding name="https" port="${jboss.https.port:8443}"/>
    <socket-binding interface="jgroup-tcp-interface" name="jgroups-tcp" port="7600"/>
    <socket-binding interface="jgroup-tcp-interface" name="jgroups-tcp-fd" port="57600"/>
    <socket-binding name="txn-recovery-environment" port="4712"/>
    <socket-binding name="txn-status-manager" port="4713"/>
    <outbound-socket-binding name="mail-smtp">
        <remote-destination host="localhost" port="25"/>
    </outbound-socket-binding>
</socket-binding-group>
Our kernel's TCP keep-alive interval is 2 hours. We deployed the cluster in an environment with a firewall between two of the nodes. Since the TCP connection to port 57600 is used only by FD_SOCK and stays idle most of the time, a firewall rule dropped that connection, disrupting the TCP socket ring. After the keep-alive interval elapsed I expected the sockets to reconnect and re-establish the ring; instead I ended up with a linear chain of sockets, i.e.
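As a partial workaround we are considering lowering the kernel keep-alive timers so the idle FD_SOCK connections are probed, and a dead connection detected, well inside the firewall's idle timeout. A sketch for Linux (values are illustrative, not tuned for our environment):

```ini
# /etc/sysctl.d/99-tcp-keepalive.conf -- illustrative values only
net.ipv4.tcp_keepalive_time = 300    # first probe after 5 min idle (default 7200)
net.ipv4.tcp_keepalive_intvl = 60    # interval between probes once probing starts
net.ipv4.tcp_keepalive_probes = 5    # failed probes before the connection is declared dead
```

This only makes the break visible sooner, though; it does not explain why the ring never reforms afterwards.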
Before the firewall broke A's incoming and outgoing connections:
A <----B <---C --
|----------------->|
After the firewall broke the connection (before keep-alive elapsed and before "Received new cluster view" messages were seen because of MERGE):
A B<---C
After the firewall broke the connection (after keep-alive elapsed and after "Received new cluster view" messages were seen because of MERGE):
A<---B<---C
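The only mitigation I can think of in the meantime (untested on our side) is tightening the heartbeat-based FD protocol that sits above FD_SOCK, so that a wedged link is suspected quickly even while the socket ring is broken; the timeout and max_tries property names are from the JGroups FD documentation:

```xml
<!-- Hypothetical tuning, not what we run today: suspect a peer after
     3 missed heartbeats of 3000 ms each, rather than relying on FD_SOCK alone. -->
<protocol type="FD">
    <property name="timeout">3000</property>
    <property name="max_tries">3</property>
</protocol>
```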
This looks like a bug. Am I missing something here?
Thanks
Mohil