Failure Detection
While designing the transparent failover mechanism, there are three major use cases that must be taken into consideration:
Failure detected by the remoting pinger
Failure detected when sending a new invocation into the server
Failure detected during an in-flight invocation
1. Failure detected by the remoting pinger
This is the simplest case, and it is the one supported by the implementation currently in SVN. That implementation obviously doesn't cover all possible cases, since the pinger only runs periodically, while invocations can be sent to the server at any time, as we will see shortly.
The failure detection component in this case is a remoting connection listener we install on the remoting connection. On a server or network failure, the connection listener is notified by the remoting pinger, which detects that the last ping did not succeed.
The actions currently taken by the existing code (which doesn't necessarily mean the handling is correct and complete) are listed below; a sketch of such a listener follows the list:
Close valves recursively
Create a new JMS connection to the secondary server
Re-create the pre-failure delegate hierarchy and synchronize the old delegates with the new ones
Open the valves recursively
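To make the detection path concrete, here is a minimal sketch of such a listener, written against the JBoss Remoting ConnectionListener interface. FailoverCommandCenter and its failureDetected() method are hypothetical names standing in for the actual failover logic (see Failure Handling below); the code in SVN may differ.

import org.jboss.remoting.Client;
import org.jboss.remoting.ConnectionListener;

public class ConnectionFailureListener implements ConnectionListener
{
   // hypothetical helper, sketched under Failure Handling below
   private final FailoverCommandCenter commandCenter;

   public ConnectionFailureListener(FailoverCommandCenter commandCenter)
   {
      this.commandCenter = commandCenter;
   }

   // Invoked by the remoting pinger when it notices that the last
   // ping did not succeed.
   public void handleConnectionException(Throwable throwable, Client client)
   {
      // Drives the four steps above: close the valves, create a new
      // JMS connection to the secondary server, re-create and
      // synchronize the delegate hierarchy, then re-open the valves.
      commandCenter.failureDetected(throwable);
   }
}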
2. Failure detected when sending a new invocation into the server
If the server or the network connection to it fails between two consecutive remoting pinger runs, and one or more invocations happen to be sent to the server before the next pinger run, those invocations fail, most likely with an IOException. If several invocations are in flight in parallel, all of them will fail concurrently with IOExceptions.
This condition should be detected and should trigger the client-side failover process. We must be careful to differentiate failure-triggering exceptions from regular JMS exceptions, which must not initiate the failover process, but must instead be sent up the stack to the calling client code.
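One way to express that distinction is a predicate along the following lines. This is a sketch only; the real classification may also need to consider remoting-specific exception types.

import java.io.IOException;

import javax.jms.JMSException;

public class FailureDetector
{
   // Only connection-level errors should trigger failover; a
   // JMSException raised by the server must travel back to the
   // calling client code unchanged.
   public static boolean triggersFailover(Throwable t)
   {
      if (t instanceof JMSException)
      {
         return false;
      }
      return t instanceof IOException;
   }
}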
The failure detection component in this case is a FailoverValveInterceptor instance, installed at the beginning of each delegate's AOP stack (if we ignore the ClientLoggingInterceptors). The IOException generated by the server/network connection failure propagates up the delegate's AOP stack and is caught by the interceptor's invoke() try-catch clause.
One aspect to take into consideration here is that our implementation is supposed to make the failure transparent. That means the FailoverValveInterceptor has to block the invoking thread whose exception reached the catch clause, and then transparently retry the invocation when the client-side failover process completes. If it doesn't do that, the failover is no longer transparent. For the client, the call that failed should not differ from a successful call in any way, except perhaps that it takes a little longer to complete.
The FailoverValveInterceptor is thus the component responsible for blocking the calling thread and retrying the invocation once failover completes. It should be engineered to support this, in addition to blocking any other threads that happen to arrive at the valve after the failure is detected.
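A minimal sketch of such an interceptor follows, written against the JBoss AOP Interceptor interface. FailoverValve and FailoverCommandCenter are the hypothetical helpers used throughout this page (the valve itself is sketched under Failure Handling below); response copying and finer-grained error classification are elided.

import java.io.IOException;

import org.jboss.aop.advice.Interceptor;
import org.jboss.aop.joinpoint.Invocation;

public class FailoverValveInterceptor implements Interceptor
{
   private final FailoverValve valve;
   private final FailoverCommandCenter commandCenter;

   public FailoverValveInterceptor(FailoverValve valve,
                                   FailoverCommandCenter commandCenter)
   {
      this.valve = valve;
      this.commandCenter = commandCenter;
   }

   public String getName()
   {
      return "FailoverValveInterceptor";
   }

   public Object invoke(Invocation invocation) throws Throwable
   {
      while (true)
      {
         valve.enter(); // blocks here while the valve is closed

         try
         {
            // copy the invocation so that a failed attempt can be
            // retried from scratch after failover
            return invocation.copy().invokeNext();
         }
         catch (IOException e)
         {
            // a connection-level failure (see the triggersFailover()
            // sketch above), not a regular JMSException: report it and
            // wait until client-side failover completes; the loop then
            // retries the invocation over the new connection, keeping
            // the failure transparent
            commandCenter.failureDetected(e);
         }
         finally
         {
            valve.leave();
         }
      }
   }
}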
3. Failure detected during an in-flight invocation
A third distinct situation in which a failure may occur is when there are one or more active invocations to the server, and the server or the network connection in between fails. After due time, the TCP/IP connection will time out and the client-side remoting runtime will throw an exception, most likely an IOException subclass.
From a Messaging point of view, this situation is identical to the one described previously.
The FailoverValveInterceptor acts as the failure detection component, and it must block the active threads, as well as any threads arriving at the valve in the future, until failover completes. Upon completion, it must retry the invocations that were interrupted by the failure (otherwise the failover won't be transparent).
There's nothing about this situation that is essentially different from what happens in the case described previously (2. Failure detected when sending a new invocation into the server).
Failure Handling
A generic client-side failure handling scenario that takes into consideration all three previously described cases should include the following steps:
Failure is detected by a "failure detection component" (the remoting connection listener or one or more FailoverValveInterceptor instances).
If the failure detection component is a valve, it puts the calling thread to wait and automatically closes itself.
The failure detection component notifies the "failure command center", located at connection delegate level - most likely accessible from the connection state.
The "failure command center" closes connection's and connection's children valves. Closing a valve means that:
If the valve has no active invocations passing through it, it simply transitions to a state that prevents any future incoming thread from passing through. Any incoming thread will block on the valve's synchronization element until the valve is opened.
If there are threads actively traversing the valve at the moment the "close" command arrives, those threads must be interrupted and put to wait until the valve opens again, at which time their invocations should be retried. An alternative would be to close the valve regardless of any active threads, since those threads will have no choice but to fail shortly anyway. At that point, the valve would catch the exceptions and put the threads to wait in exactly the same way it did for the thread that triggered the failover.
A closed valve won't allow any incoming thread to pass; it puts the thread to wait instead, until the "open" notification arrives.
The failure command center initiates the client-side failover process: it creates a new connection to the failover server, re-creates the existing delegate hierarchy, and synchronizes the old delegates with state from the newly created delegates. Code that does this already exists and works well.
Upon successful completion of the client-side failover process, the failure command center opens all valves (hierarchically or otherwise). All threads waiting on a valve's opening are released, and their invocations follow their course over the newly opened connection. A sketch of such a valve, and of the command center driving it, follows this list.
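To make the valve semantics concrete, here is a minimal sketch built on java.util.concurrent primitives. FailoverValve is the hypothetical helper referenced by the interceptor sketch above, not the actual implementation.

import java.util.concurrent.locks.Condition;
import java.util.concurrent.locks.ReentrantLock;

public class FailoverValve
{
   private final ReentrantLock lock = new ReentrantLock();
   private final Condition opened = lock.newCondition();
   private boolean open = true;
   private int activeCount = 0;

   // Called before each invocation; blocks while the valve is closed.
   public void enter() throws InterruptedException
   {
      lock.lock();
      try
      {
         while (!open)
         {
            opened.await();
         }
         activeCount++;
      }
      finally
      {
         lock.unlock();
      }
   }

   // Called after each invocation attempt, successful or not.
   public void leave()
   {
      lock.lock();
      try
      {
         activeCount--;
      }
      finally
      {
         lock.unlock();
      }
   }

   // Called by the failure command center. Threads already past
   // enter() are left alone: they will fail shortly, be caught by
   // their valve interceptor and re-routed through enter().
   public void close()
   {
      lock.lock();
      try
      {
         open = false;
      }
      finally
      {
         lock.unlock();
      }
   }

   // Called when client-side failover completes; releases all
   // threads waiting in enter().
   public void open()
   {
      lock.lock();
      try
      {
         open = true;
         opened.signalAll();
      }
      finally
      {
         lock.unlock();
      }
   }
}

The command center then drives the sequence described above. Again a sketch with hypothetical names; in particular, a real implementation must detect that another thread has already completed failover and return immediately instead of failing over twice.

import java.util.List;

public class FailoverCommandCenter
{
   // the connection's valve first, followed by its children's valves
   private final List<FailoverValve> valves;

   public FailoverCommandCenter(List<FailoverValve> valves)
   {
      this.valves = valves;
   }

   // Called by the connection listener or by a valve interceptor;
   // returns only when client-side failover has completed, so a
   // calling interceptor can then retry its invocation.
   public synchronized void failureDetected(Throwable reason)
   {
      // NOTE (sketch): should return immediately if another thread
      // has already completed failover; omitted for brevity.

      for (FailoverValve valve : valves)
      {
         valve.close();
      }

      // create a new connection to the failover server, re-create the
      // delegate hierarchy and synchronize the old delegates with the
      // new ones (this code already exists; elided here)

      for (FailoverValve valve : valves)
      {
         valve.open();
      }
   }
}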