15 Replies Latest reply on Feb 16, 2006 12:23 PM by belaban

    silent TCP disconnect not detected

    annecotter

      Hi,

      I am wondering if there are any configuration parameters to have JBoss/JGroups perform an application-level TCP keepalive. For example, I would like to have a probe sent across the socket every 30 seconds, and if no reply is received, close the connection.

      Specifically, my problem is as follows: I am using TCP as the transport protocol for clustering. There are only 2 members in the cluster. I have a firewall between the 2 members that periodically drops TCP connections (for example - connections that have been active for 4 hours or more). When the firewall silently drops the connection, JBoss failure detection kicks in. I need to have a new TCP connection established before the far-end member is declared suspect.

      Thanks in advance,
      Anne
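
For illustration only, the bookkeeping behind such a probe could be sketched as follows. This is not a JBoss or JGroups feature: `KeepAliveMonitor` and all of its method names are hypothetical, the 30-second interval comes from the question above, and the reply timeout is an assumed extra parameter. Timestamps are passed in by the caller so that the logic stays independent of any real socket.

```java
/**
 * Hypothetical helper (not part of JBoss or JGroups): tracks when a probe
 * should be sent over a connection and when a missing reply means the
 * connection should be closed and re-created.
 */
final class KeepAliveMonitor {
    private final long probeIntervalMillis; // e.g. 30_000 for "every 30 seconds"
    private final long replyTimeoutMillis;  // assumed: how long to wait for a reply
    private long lastActivityAt;            // last time the peer was known to be reachable
    private long probeSentAt;               // when the outstanding probe was sent
    private boolean awaitingReply;

    KeepAliveMonitor(long probeIntervalMillis, long replyTimeoutMillis, long now) {
        this.probeIntervalMillis = probeIntervalMillis;
        this.replyTimeoutMillis = replyTimeoutMillis;
        this.lastActivityAt = now;
    }

    /** True when no probe is outstanding and the probe interval has elapsed. */
    boolean shouldSendProbe(long now) {
        return !awaitingReply && now - lastActivityAt >= probeIntervalMillis;
    }

    /** Record that a probe was written to the socket. */
    void probeSent(long now) {
        awaitingReply = true;
        probeSentAt = now;
    }

    /** Record that the peer answered; the connection is considered healthy again. */
    void replyReceived(long now) {
        awaitingReply = false;
        lastActivityAt = now;
    }

    /** True when a probe has gone unanswered for longer than the reply timeout. */
    boolean shouldCloseConnection(long now) {
        return awaitingReply && now - probeSentAt >= replyTimeoutMillis;
    }
}
```

A caller would poll `shouldSendProbe` and `shouldCloseConnection` from a timer and, when the latter returns true, close the socket and open a new one.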

        • 1. Re: silent TCP disconnect not detected
          brian.stansberry

          Are you using VERIFY_SUSPECT? (http://wiki.jboss.org/wiki/Wiki.jsp?page=JGroupsVERIFY_SUSPECT).

          You mention wanting to do this at the application level, so maybe I'm misunderstanding what you want.

          • 2. Re: silent TCP disconnect not detected
            annecotter

            Hi Brian,

            Yes, we are using VERIFY_SUSPECT. But the "ping" sent by VERIFY_SUSPECT seems to use the same socket as all the other clustering traffic. I suppose I need JBoss to question the integrity of the connection: drop the connection and attempt a reconnection before suspecting the member.

            Thanks
            Anne



            • 3. Re: silent TCP disconnect not detected
              brian.stansberry

              I want to be sure I understand what connection you're talking about. A connection opened by the TCP protocol for normal message traffic? The protocol itself should be able to handle that, so if it's not able to handle the firewall breaking the connection that's one issue.

              I interpreted your first post to be about the connection opened by the FD_SOCK protocol, which is opened and then sits idle for hours. If that connection gets broken, a suspect event will be sent up the stack, but then VERIFY_SUSPECT should kick in, send a new packet over the regular TCP connection, and see that the other member isn't really dead. At that point FD_SOCK should open a new connection.

              • 4. Re: silent TCP disconnect not detected
                annecotter

                Sorry, I should have provided more details in my original post - we use FD (not FD_SOCK) with timeout=5s and max_tries=9. We use TCP as the transport protocol. The firewall removes connections that have been established for more than 4 hours. The firewall accomplishes this by removing the connection from its "pass" list, causing all further packets across that connection to be dropped. This manifests as a loss of visibility to the members in the cluster.

                I can recover from this by lowering the OS tcp_keepalive parameter, so that the TCP connection will time out and be destroyed by the OS before JBoss failure detection causes the remote member to be deemed suspect. When the connection is destroyed by the OS, JBoss creates a new TCP connection and is able to reach the other member of the cluster.

                However, lowering the OS tcp_keepalive is not acceptable as a permanent solution. I was hoping that JBoss might have a configuration parameter to achieve this timeout behaviour at the application-level for sockets created by JBoss.

                I hope that's a little better, sorry for the confusion :)

                Thanks
                Anne
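
As a rough back-of-the-envelope check of the timing in the post above: assuming FD suspects a member only after max_tries consecutive heartbeats each go unanswered for timeout (an approximation, not taken from the JGroups source), the window in which the stale connection would have to be torn down is about 5 s x 9 = 45 s. The class and method names below are hypothetical.

```java
/** Hypothetical helper: estimates how long FD takes to suspect a silent member. */
final class FdDetectionWindow {
    /**
     * Approximate time until suspicion, assuming one heartbeat wait of
     * timeoutMillis per try and maxTries consecutive misses.
     */
    static long approxMillisUntilSuspect(long timeoutMillis, int maxTries) {
        return timeoutMillis * maxTries;
    }

    /** True if a keepalive-driven teardown would finish before FD suspects the member. */
    static boolean teardownBeatsFd(long teardownMillis, long timeoutMillis, int maxTries) {
        return teardownMillis < approxMillisUntilSuspect(timeoutMillis, maxTries);
    }
}
```

With the values from the post, `FdDetectionWindow.approxMillisUntilSuspect(5_000L, 9)` gives 45000 ms, so any keepalive-based recovery would need to destroy the dead connection well inside that window.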

                • 5. Re: silent TCP disconnect not detected
                  brian.stansberry

                  Interesting. So with what the firewall is doing, JGroups must not be seeing any exceptions on the Socket, and thus doesn't close the connection.

                  I've been trying to think of a workaround involving the conn_expire_time property of TCP (see http://wiki.jboss.org/wiki/Wiki.jsp?page=JGroupsTCP), but it has two flaws: 1) it doesn't work if there is continuous traffic over the connection, and 2) the connection would need to be recycled every few seconds if FD is used.

                  Using FD_SOCK shouldn't help either; eventually the firewall will cut the main TCP connection. This won't cause suspicions any more, but messages still won't get through -- that's actually worse.

                  Will have to get back to you on this one :(. AFAICT, there are no simple hooks in the TCP protocol code where you can trigger a connection recycle.

                  • 6. Re: silent TCP disconnect not detected
                    belaban

                    If you use FD rather than FD_SOCK, the reject rule of the FW will discard packets, so heartbeats sent by FD won't be received and the connection should be closed.
                    Does that work for you?

                    • 7. Re: silent TCP disconnect not detected
                      annecotter

                      Hi Bela,

                      We do use FD, and the behaviour seems to be as follows:

                      - firewall starts dropping packets belonging to the cluster TCP connection
                      - FD kicks in, heartbeats are sent to neighbor but no ACKs are received
                      - max_tries is finally reached, and the neighbor is deemed suspect (we do have a VERIFY_SUSPECT here, but as above, no ACK is received)
                      - since that same TCP connection is still being used, we get stuck in a state where JBoss thinks the neighbor is down.

                      I am looking for a way to have JBoss close the socket it's using for clustering traffic and open a new one.

                      Thanks
                      Anne

                      • 8. Re: silent TCP disconnect not detected
                        belaban

                        Okay, got you.
                        I created http://jira.jboss.com/jira/browse/JGRP-185, and am fixing it right now. It will be in CVS in 10 minutes.

                        • 9. Re: silent TCP disconnect not detected
                          belaban

                          Okay, done (TCP and ConnectionTable)

                          • 10. Re: silent TCP disconnect not detected
                            annecotter

                            Super - thanks Bela!

                            Do you know if there are any existing JBoss config parameters that would accomplish the application-level tcp keepalive that I was mentioning above? Reason being, ideally I would like to have this scenario detected, and the connection dropped and re-created before the far-end member is declared suspect.

                            Thanks in advance
                            Anne

                            • 11. Re: silent TCP disconnect not detected
                              belaban

                              So you want me to do pinging on the TCP connection, and a missed heartbeat would close the connection, *before* FD detects the connection loss and generates a view change ?
                              What's the diff ? Why would you use this rather than FD ?

                              • 12. Re: silent TCP disconnect not detected
                                annecotter

                                I was thinking that I would like to use it in addition to FD, but on further consideration I'm thinking you're right and it's probably unnecessary. With JGRP-185 that you just checked in, would I see the following behaviour?

                                - firewall starts dropping packets
                                - FD kicks in and member is declared suspect
                                - connection to suspect member is closed
                                - a new connection is attempted and is successful, member joins group again

                                If so, then I guess that is exactly what I need!

                                Thanks
                                Anne
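
The sequence listed in the post above (suspect, close, reconnect) can be sketched roughly as follows. This is not the actual JGRP-185 change or the real JGroups ConnectionTable; `SketchConnectionTable`, `getConnection` and `onSuspect` are hypothetical names, and members are identified by plain strings purely for illustration.

```java
import java.io.Closeable;
import java.io.IOException;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

/**
 * Hypothetical sketch (not the JGroups ConnectionTable): caches one
 * connection per member and drops it when that member is suspected,
 * so that the next send opens a fresh connection.
 */
final class SketchConnectionTable<C extends Closeable> {
    private final Map<String, C> connections = new ConcurrentHashMap<>();
    private final Function<String, C> connector; // opens a new connection to a member

    SketchConnectionTable(Function<String, C> connector) {
        this.connector = connector;
    }

    /** Returns the cached connection, opening one if none exists. */
    C getConnection(String member) {
        return connections.computeIfAbsent(member, connector);
    }

    /** Called when failure detection suspects a member: close and forget its connection. */
    void onSuspect(String member) {
        C stale = connections.remove(member);
        if (stale != null) {
            try {
                stale.close();
            } catch (IOException ignored) {
                // The connection is being discarded anyway.
            }
        }
    }
}
```

After `onSuspect`, the next `getConnection` call for the same member goes through the connector again, which is the point at which a new TCP connection would be set up through the firewall.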

                                • 13. Re: silent TCP disconnect not detected
                                  belaban

                                  Yes. Try it out and let me know whether this works. Ran it in the debugger and it did, but feedback is welcome.

                                  • 14. Re: silent TCP disconnect not detected
                                    annecotter

                                    Hi Bela,

                                    I'm not sure how to obtain the change made for JGRP-185 - the fixed version is 2.3, but this doesn't seem to be available for download yet. This question might not belong in this forum, if there is a better place to have it answered please let me know.

                                    Thanks
                                    Anne
