1. Re: silent TCP disconnect not detected
brian.stansberry Jan 20, 2006 2:19 PM (in response to annecotter)Are you using VERIFY_SUSPECT? (http://wiki.jboss.org/wiki/Wiki.jsp?page=JGroupsVERIFY_SUSPECT).
You mention wanting to do this at the application level, so maybe I'm misunderstanding what you want. -
2. Re: silent TCP disconnect not detected
annecotter Jan 20, 2006 3:45 PM (in response to annecotter)Hi Brian,
Yes, we are using VERIFY_SUSPECT. But the "ping" sent by VERIFY_SUSPECT seems to use the same socket as all the other clustering traffic. I suppose I need JBoss to question the integrity of the connection: drop the connection and attempt a reconnection before suspecting the member.
Thanks
Anne -
3. Re: silent TCP disconnect not detected
brian.stansberry Jan 20, 2006 4:25 PM (in response to annecotter)I want to be sure I understand what connection you're talking about. A connection opened by the TCP protocol for normal message traffic? The protocol itself should be able to handle that, so if it's not able to handle the firewall breaking the connection that's one issue.
I interpreted your first post to be about the connection opened by the FD_SOCK protocol, which is opened and then sits idle for hours. If that connection gets broken, a suspect event will be sent up the stack, but then VERIFY_SUSPECT should kick in, send a new packet over the regular TCP connection, and see that the other member isn't really dead. At that point FD_SOCK should open a new connection. -
4. Re: silent TCP disconnect not detected
annecotter Jan 23, 2006 5:30 PM (in response to annecotter)Sorry, I should have provided more details in my original post - we use FD (not FD_SOCK) with timeout=5s and max_tries=9. We use TCP as the transport protocol. The firewall removes connections that have been established for more than 4 hours. The firewall accomplishes this by removing the connection from its "pass" list, causing all further packets across that connection to be dropped. This manifests as a loss of visibility to the members in the cluster.
I can recover from this by lowering the OS tcp_keepalive parameter, so that the TCP connection will time out and be destroyed by the OS before JBoss failure detection causes the remote member to be deemed suspect. When the connection is destroyed by the OS, JBoss creates a new TCP connection and is able to reach the other member of the cluster.
However, lowering the OS tcp_keepalive is not acceptable as a permanent solution. I was hoping that JBoss might have a configuration parameter to achieve this timeout behaviour at the application-level for sockets created by JBoss.
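To make "application-level timeout" concrete, here is a minimal Java sketch of the idea, not anything JBoss or JGroups provides. The class and method names (KeepAliveProbe, isAlive, probeSilentPeer, probeEchoingPeer) are made up for illustration, and the one-byte ping/pong exchange is an assumption. It relies only on standard java.net.Socket calls: setSoTimeout bounds the blocking read, whereas setKeepAlive merely switches SO_KEEPALIVE on or off and leaves the interval to the OS.

```java
import java.io.IOException;
import java.net.InetAddress;
import java.net.ServerSocket;
import java.net.Socket;

/** Hypothetical application-level liveness probe; not part of JGroups or JBoss. */
class KeepAliveProbe {

    /**
     * Sends one ping byte and waits up to timeoutMs for a one-byte reply.
     * Returns false when nothing comes back in time or the socket fails,
     * which is how a silently dropped connection looks to the sender.
     */
    static boolean isAlive(Socket socket, int timeoutMs) {
        try {
            socket.setSoTimeout(timeoutMs);              // bound the blocking read
            socket.getOutputStream().write(1);           // ping
            socket.getOutputStream().flush();
            return socket.getInputStream().read() != -1; // any byte counts as pong
        } catch (IOException e) {
            // Covers SocketTimeoutException (no reply) as well as resets.
            return false;
        }
    }

    /** Loopback demo: a listener that never answers stands in for the firewall. */
    static boolean probeSilentPeer(int timeoutMs) {
        try (ServerSocket silent = new ServerSocket(0, 1, InetAddress.getLoopbackAddress());
             Socket client = new Socket(silent.getInetAddress(), silent.getLocalPort())) {
            return isAlive(client, timeoutMs);
        } catch (IOException e) {
            throw new IllegalStateException("loopback setup failed", e);
        }
    }

    /** Loopback demo: a peer that echoes the ping byte back. */
    static boolean probeEchoingPeer(int timeoutMs) {
        try (ServerSocket server = new ServerSocket(0, 1, InetAddress.getLoopbackAddress())) {
            Thread echo = new Thread(() -> {
                try (Socket peer = server.accept()) {
                    int b = peer.getInputStream().read();
                    peer.getOutputStream().write(b);
                    peer.getOutputStream().flush();
                } catch (IOException ignored) {
                    // The demo is over or the client went away; nothing to do.
                }
            });
            echo.setDaemon(true);
            echo.start();
            try (Socket client = new Socket(server.getInetAddress(), server.getLocalPort())) {
                return isAlive(client, timeoutMs);
            }
        } catch (IOException e) {
            throw new IllegalStateException("loopback setup failed", e);
        }
    }
}
```

A caller that gets false from isAlive would close the socket and open a fresh one, which is the behaviour being asked for here, ideally before failure detection gives up on the member.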
I hope that's a little better, sorry for the confusion :)
Thanks
Anne -
5. Re: silent TCP disconnect not detected
brian.stansberry Jan 23, 2006 7:07 PM (in response to annecotter)Interesting. So with what the firewall is doing, JGroups must not be seeing any exceptions on the Socket, and thus doesn't close the connection.
I've been trying to think of a workaround involving the conn_expire_time property of TCP (see http://wiki.jboss.org/wiki/Wiki.jsp?page=JGroupsTCP), but it has two flaws: 1) it won't work if there is continuous traffic over the connection, and 2) the connection would need to be recycled every few seconds if FD is used.
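A rough way to see both flaws is to model idle-based expiry directly. The Java sketch below is an illustration only, not the JGroups implementation of conn_expire_time; IdleExpiry and its methods are hypothetical names, and it assumes a heartbeat about every 5 seconds (the FD timeout quoted earlier in the thread) counts as traffic on the connection.

```java
/** Hypothetical idle-expiry bookkeeping for one connection; not JGroups code. */
class IdleExpiry {
    private final long expireAfterMs;
    private long lastUsedMs;

    IdleExpiry(long expireAfterMs, long nowMs) {
        this.expireAfterMs = expireAfterMs;
        this.lastUsedMs = nowMs;
    }

    /** Called whenever anything is sent or received on the connection. */
    void touch(long nowMs) {
        lastUsedMs = nowMs;
    }

    /** True once the connection has been idle for at least expireAfterMs. */
    boolean shouldExpire(long nowMs) {
        return nowMs - lastUsedMs >= expireAfterMs;
    }

    /**
     * Simulates traffic every intervalMs for durationMs and reports whether
     * the expiry ever fires. The check runs just before each send, which is
     * the moment the connection has been idle the longest.
     */
    static boolean everExpires(long expireAfterMs, long intervalMs, long durationMs) {
        IdleExpiry conn = new IdleExpiry(expireAfterMs, 0);
        for (long now = intervalMs; now <= durationMs; now += intervalMs) {
            if (conn.shouldExpire(now)) {
                return true;
            }
            conn.touch(now);
        }
        return false;
    }
}
```

With traffic every 5 seconds, an expiry of 10 seconds never fires even over 4 hours (flaw 1), and the only way to make it fire is to set it at or below the 5-second heartbeat interval, which means recycling the connection every few seconds (flaw 2).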
Using FD_SOCK shouldn't help either; eventually the firewall will cut the main TCP connection. This won't cause suspicions any more, but messages still won't get through -- that's actually worse.
Will have to get back to you on this one :(. AFAICT, there are no simple hooks in the TCP protocol code where you can trigger a connection recycle. -
6. Re: silent TCP disconnect not detected
belaban Jan 23, 2006 11:32 PM (in response to annecotter)If you use FD rather than FD_SOCK, the reject rule of the FW will discard packets, therefore heartbeats sent by FD won't be received, and the connection should be closed.
Does that work for you ? -
7. Re: silent TCP disconnect not detected
annecotter Jan 24, 2006 9:16 AM (in response to annecotter)Hi Bela,
We do use FD, and the behaviour seems to be as follows:
- firewall starts dropping packets belonging to the cluster TCP connection
- FD kicks in, heartbeats are sent to neighbor but no ACKs are received
- max_tries is finally reached, and the neighbor is deemed suspect (we do have a VERIFY_SUSPECT here, but as above, no ACK is received)
- since that same TCP connection is still being used, we get stuck in a state where JBoss thinks the neighbor is down.
I am looking for a way to have JBoss close the socket it's using for clustering traffic and open a new one.
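For a sense of the timing in the steps above, here is a small Java sketch of heartbeat counting, assuming suspicion is raised once max_tries consecutive heartbeats go unanswered. It is a simplified model rather than the FD source; MissedHeartbeatCounter and its methods are hypothetical names.

```java
/** Simplified, hypothetical model of FD-style heartbeat counting; not JGroups code. */
class MissedHeartbeatCounter {
    private final int maxTries;
    private int missed;

    MissedHeartbeatCounter(int maxTries) {
        this.maxTries = maxTries;
    }

    /** An ACK arrived: the neighbor is reachable, so start counting again. */
    void onAck() {
        missed = 0;
    }

    /** A heartbeat went unanswered; returns true once the neighbor should be suspected. */
    boolean onTimeout() {
        missed++;
        return missed >= maxTries;
    }

    /** Approximate time from the first lost heartbeat until suspicion. */
    static long approxDetectionMs(long timeoutMs, int maxTries) {
        return timeoutMs * maxTries;
    }
}
```

Under that assumption, timeout=5s and max_tries=9 give roughly 45 seconds between the firewall starting to drop packets and the neighbor being suspected. Nothing in this counting touches the socket itself, which is why the stale connection keeps being reused.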
Thanks
Anne -
8. Re: silent TCP disconnect not detected
belaban Jan 24, 2006 10:39 AM (in response to annecotter)Okay, got you.
I created http://jira.jboss.com/jira/browse/JGRP-185, and am fixing it right now. It will be in CVS in 10 minutes -
9. Re: silent TCP disconnect not detected
belaban Jan 24, 2006 10:53 AM (in response to annecotter)Okay, done (TCP and ConnectionTable)
-
10. Re: silent TCP disconnect not detected
annecotter Jan 24, 2006 10:54 AM (in response to annecotter)Super - thanks Bela!
Do you know if there are any existing JBoss config parameters that would accomplish the application-level tcp keepalive that I was mentioning above? Reason being, ideally I would like to have this scenario detected, and the connection dropped and re-created before the far-end member is declared suspect.
Thanks in advance
Anne -
11. Re: silent TCP disconnect not detected
belaban Jan 24, 2006 11:52 AM (in response to annecotter)So you want me to do pinging on the TCP connection, and a missed heartbeat would close the connection, *before* FD detects the connection loss and generates a view change ?
What's the diff ? Why would you use this rather than FD ? -
12. Re: silent TCP disconnect not detected
annecotter Jan 24, 2006 12:13 PM (in response to annecotter)I was thinking that I would like to use it in addition to FD, but on further consideration I'm thinking you're right and it's probably unnecessary. With JGRP-185 that you just checked in, would I see the following behaviour?
- firewall starts dropping packets
- FD kicks in and member is declared suspect
- connection to suspect member is closed
- a new connection is attempted and is successful, member joins group again
If so, then I guess that is exactly what I need!
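The sequence in the list above can be pictured with a toy connection cache in Java. This is only a sketch of the described behaviour, not the JGRP-185 change itself (the thread says only that TCP and ConnectionTable were touched); SuspectAwareTable, Conn, get and onSuspect are hypothetical names.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical connection cache that drops a peer's connection when it is suspected. */
class SuspectAwareTable {

    /** Minimal stand-in for a TCP connection to one peer. */
    static final class Conn {
        final String peer;
        final int generation; // 1 for the first connection opened, 2 for the next, ...
        boolean closed;

        Conn(String peer, int generation) {
            this.peer = peer;
            this.generation = generation;
        }
    }

    private final Map<String, Conn> conns = new HashMap<>();
    private int opened;

    /** Returns the cached connection to peer, opening a new one if none is cached. */
    Conn get(String peer) {
        return conns.computeIfAbsent(peer, p -> new Conn(p, ++opened));
    }

    /** On a suspect event, close and forget the connection so the next send reconnects. */
    void onSuspect(String peer) {
        Conn old = conns.remove(peer);
        if (old != null) {
            old.closed = true;
        }
    }
}
```

After onSuspect("B"), the next get("B") hands back a brand-new connection instead of the one the firewall has blacklisted, so traffic to the member can get through again.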
Thanks
Anne -
13. Re: silent TCP disconnect not detected
belaban Jan 24, 2006 12:15 PM (in response to annecotter)Yes. Try it out and let me know whether this works. Ran it in the debugger and it did, but feedback is welcome.
-
14. Re: silent TCP disconnect not detected
annecotter Feb 16, 2006 11:59 AM (in response to annecotter)Hi Bela,
I'm not sure how to obtain the change made for JGRP-185 - the fixed version is 2.3, but this doesn't seem to be available for download yet. This question might not belong in this forum, if there is a better place to have it answered please let me know.
Thanks
Anne