4 Replies Latest reply on Oct 13, 2006 11:35 AM by Brian Stansberry

    Summary of approach to HA/Failover

    Tim Fox Master

      Following our meeting today with Bela, Brian, Clebert, Ovidiu and myself here are my thoughts as to what we need to do:

      We need to maintain a cluster-wide consistent mapping of node -> List<FailoverInfo>

      Where FailoverInfo contains the JGroups address of the failover node (actually the async stack) and the remoting invoker locator.

      Whether the actual remoting invoker locator is used depends on whether we use Remoting, but in either case it is some kind of address the client can use to make a connection to a server.


      public class FailoverInfo
      {
         Address address;        // JGroups address of the failover node (the async stack)
         InvokerLocator locator; // Remoting locator the client uses to connect
      }

      We need to maintain a list per node, since any particular node can have 1 or more failover nodes.

      We should look at using the Distributed Replicant Manager (DRM) code from JBoss AS clustering, which Brian has separated into its own jar, for actually maintaining this state across the cluster.

      This has nice functionality such as being able to register listeners to respond to changes.

      If we can't use this then it's not a big deal to just use JGroups state directly, as we already do for managing the binding state.

      When a node joins the group we should generate the failover info list for that node before registering it with the group.

      A simple algorithm would be to treat the JGroups view list as a circular buffer and choose the failover nodes as the next entries after the current node in the view list. We would have to be careful not to choose addresses on the current node as failovers. We should make this pluggable.
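      A minimal sketch of that circular-buffer policy, assuming node identities are plain strings; real code would use JGroups Addresses and would also need the same-box check discussed below:

```java
import java.util.ArrayList;
import java.util.List;

public class CircularFailoverPolicy {

    // Returns up to maxFailovers nodes following 'node' in the view,
    // treating the view as a circular buffer and skipping the node itself.
    public static List<String> chooseFailovers(List<String> view, String node, int maxFailovers) {
        List<String> failovers = new ArrayList<>();
        int start = view.indexOf(node);
        for (int i = 1; i < view.size() && failovers.size() < maxFailovers; i++) {
            String candidate = view.get((start + i) % view.size());
            // A real implementation would also skip addresses colocated
            // on the same physical box as 'node'.
            if (!candidate.equals(node)) {
                failovers.add(candidate);
            }
        }
        return failovers;
    }

    public static void main(String[] args) {
        List<String> view = List.of("A", "B", "C", "D");
        System.out.println(chooseFailovers(view, "C", 2)); // [D, A]
    }
}
```

      Making the policy an interface keeps this pluggable, as suggested above.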

      (Question: Given a JGroups address how do we know if it is on the same physical box as another JGroups address? If the machine has multiple NICs this may be tricky. Maybe we need to propagate some other kind of machine id in the state too? Or we propagate a list of NICs per box in the state.)

      The JGroups address would be needed when doing in memory message persistence for replicating messages from one server node to another.

      When changes in the failover map occur due to nodes joining or leaving the group, these need to be propagated to clients. This can probably be done on some kind of special control channel that we multiplex on top of the transport, assuming we use our own multiplexing transport.

      When a failover occurs, the first failover node detects the failure of the node via a change in view, and then takes over responsibility for the failed node's partial queues.

      If there are any persistent messages to load, they are then loaded from storage. Loading from storage won't be necessary if in-memory message replication is being used, since the messages will already be in memory on the failover node.

      Around the same time, the client will receive a connection exception in the connection listener and assume the server has failed. (Should we retry the current server first in case the failure was transient and the server hasn't really failed? E.g. if someone temporarily disconnected the network cable.)

      If the client determines the server really has failed, it tries another server based on its client-side load balancing policy. This server may not be the correct failover server for the failed node, due to difficulties in synchronizing the client and server side failover mappings.

      In this case the server tells the client the locator of the correct server, and the client tries to connect there. This process continues until the client finds the correct server or a maximum number of tries is reached, at which point it gives up.
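      The redirect protocol could look something like this sketch, where the redirect table stands in for a server replying with the locator of the correct failover node (node names and method names are hypothetical):

```java
import java.util.Map;

public class FailoverReconnect {

    // Follows server-supplied redirects starting from the load balancing
    // policy's first pick. A null entry in 'redirects' means the server
    // accepted the connection (it is the correct failover node).
    public static String findFailoverServer(String firstPick,
                                            Map<String, String> redirects,
                                            int maxTries) {
        String current = firstPick;
        for (int tries = 0; tries < maxTries; tries++) {
            String redirect = redirects.get(current);
            if (redirect == null) {
                return current; // correct server found
            }
            current = redirect; // follow the redirect and retry
        }
        return null; // gave up after maxTries
    }

    public static void main(String[] args) {
        // nodeB is not the correct failover node and redirects to nodeC.
        Map<String, String> redirects = Map.of("nodeB", "nodeC");
        System.out.println(findFailoverServer("nodeB", redirects, 3)); // nodeC
    }
}
```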

      When the client connects to the failover server the server failover process (reloading of queues) may not have completed yet. In this case the server will block the invocation until the failover is complete. (I.e. it won't send a response to the connect until it is complete).
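      One simple way to block the connect until the failover process has finished is a latch; this is a sketch with hypothetical class and method names, not a committed design:

```java
import java.util.concurrent.CountDownLatch;

public class FailoverGate {

    private final CountDownLatch failoverDone = new CountDownLatch(1);

    // Called by the failover process once the queues have been reloaded.
    public void failoverComplete() {
        failoverDone.countDown();
    }

    // Called on the connect invocation path: blocks until failover is done,
    // so no response is sent to the client before the node is ready.
    public void awaitFailover() throws InterruptedException {
        failoverDone.await();
    }

    public static void main(String[] args) throws InterruptedException {
        FailoverGate gate = new FailoverGate();
        Thread failoverProcess = new Thread(() -> {
            // ... reload queues from storage here ...
            gate.failoverComplete();
        });
        failoverProcess.start();
        gate.awaitFailover(); // the connect blocks here until complete
        System.out.println("connected");
    }
}
```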

      Once the client has successfully connected to the correct server it then recreates any session, consumer, producer and browser objects that existed before failure for the failed connection.

      It then sends a list of <message_id, destination> pairs corresponding to the unacked messages in each of the sessions on the failed connection. Based on this, the server recreates the delivery lists in the server consumer delegates.
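      Rebuilding the delivery lists from those pairs could be as simple as grouping by destination; this sketch uses a hypothetical UnackedMessage pair type rather than the real wire format:

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class DeliveryRecovery {

    // Hypothetical <message_id, destination> pair reported by the client.
    static class UnackedMessage {
        final long messageId;
        final String destination;
        UnackedMessage(long messageId, String destination) {
            this.messageId = messageId;
            this.destination = destination;
        }
    }

    // Rebuilds per-destination delivery lists on the failover server,
    // preserving the order in which the client reported the messages.
    public static Map<String, List<Long>> rebuildDeliveries(List<UnackedMessage> unacked) {
        Map<String, List<Long>> deliveries = new LinkedHashMap<>();
        for (UnackedMessage m : unacked) {
            deliveries.computeIfAbsent(m.destination, d -> new ArrayList<>())
                      .add(m.messageId);
        }
        return deliveries;
    }

    public static void main(String[] args) {
        List<UnackedMessage> unacked = List.of(
            new UnackedMessage(1L, "queueA"),
            new UnackedMessage(2L, "queueA"),
            new UnackedMessage(3L, "queueB"));
        System.out.println(rebuildDeliveries(unacked)); // {queueA=[1, 2], queueB=[3]}
    }
}
```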

      If the failed node is subsequently resurrected, it is not such a simple matter to just move the connections back to the original node, since there may be unacknowledged messages in live sessions. If we move the connections then any non-persistent messages might get redelivered.

      Therefore we can only safely move back connections if there are no unacked messages in any sessions.

      This is probably part of a bigger question of how we redistribute connections over many nodes when we suddenly add a lot of nodes to the cluster.

      For the first pass we should probably not bother, since this is tricky. However, if we want to be able to automatically spread load smoothly and get benefits when adding new nodes to a cluster with already-created connections, we should consider this.

      We should also consider being able to bring down a node smoothly from the management console without losing sessions - i.e. move them transparently to another node. Again this is not a high priority but something to think about.

      Any more thoughts?

      (BTW We should probably put all of this in a wiki...)

        • 2. Re: Summary of approach to HA/Failover
          Brian Stansberry Master


          "timfox" wrote:
          (Question: Given a JGroups address how do we know if it is on the same physical box as another JGroups address? If the machine has multiple NICs this may be tricky. Maybe we need to propagate some other kind of machine id in the state too? Or we propagate a list of NICs per box in the state.)

          This is quite easy to handle due to the java.net.NetworkInterface.getNetworkInterfaces() method, from which all addresses on the machine can be discovered. The o.j.cache.buddyreplication.NextMemberBuddyLocator.isColocated() method uses this.
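          A minimal version of that check, iterating the machine's interfaces via java.net.NetworkInterface.getNetworkInterfaces() (the class name here is made up; the real logic lives in NextMemberBuddyLocator.isColocated()):

```java
import java.net.InetAddress;
import java.net.NetworkInterface;
import java.util.Collections;

public class ColocationCheck {

    // True if 'addr' is bound to one of this machine's network interfaces,
    // i.e. the address is colocated on this physical box.
    public static boolean isLocalAddress(InetAddress addr) throws Exception {
        for (NetworkInterface nic : Collections.list(NetworkInterface.getNetworkInterfaces())) {
            for (InetAddress local : Collections.list(nic.getInetAddresses())) {
                if (local.equals(addr)) {
                    return true;
                }
            }
        }
        return false;
    }

    public static void main(String[] args) throws Exception {
        // The loopback address should be found on the loopback interface.
        System.out.println(isLocalAddress(InetAddress.getLoopbackAddress()));
    }
}
```

          This handles the multiple-NIC case without propagating a separate machine id in the state.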

          • 4. Re: Summary of approach to HA/Failover
            Brian Stansberry Master

            I don't think the DistributedReplicantManager code is a very good fit for what you're describing. The DRM is really based on this kind of data structure:

            Outer Map
            -- Key = ejb1
            -- Value = Inner Map
            -------------- Key = node1 identifier
            -------------- Value = replicant (RMI stub, InvokerLocator, your struct etc.)
            -------------- Key = node2 identifier
            -------------- Value = replicant for targeting node2
            -- Key = ejb2
            -- Value = Inner Map

            Code that "writes" to the DRM (e.g. an EJB deployment process) doesn't care about order; it just publishes a replicant for ejb1.

            Code that "reads" the DRM usually wants a List of the replicants associated with ejb1. But DRM provides this by iterating through the Inner Map and ordering the responses based on the current JGroups view. Thus the ordering is fixed.

            Messaging needs:

            -- Key = node1
            -- Value = List
            ----------- element = struct for node3
            ----------- element = struct for node4

            Here the code that writes has already prepared a properly ordered list based on a pluggable policy. The code that "reads" the DRM needs to get that same list.
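            In types, the difference boils down to this sketch (node ids are hypothetical placeholders for JGroups Addresses, and the struct values are simplified to strings):

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class FailoverMapDemo {

    // Writer side: publishes an already-ordered failover list for a node.
    // Unlike DRM, which reorders replicants by the current view, readers
    // must see exactly the order the writer chose.
    static Map<String, List<String>> publish(String node, List<String> orderedFailovers) {
        Map<String, List<String>> failoverMap = new HashMap<>();
        failoverMap.put(node, List.copyOf(orderedFailovers)); // order preserved
        return failoverMap;
    }

    public static void main(String[] args) {
        System.out.println(publish("node1", List.of("node3", "node4")).get("node1"));
        // [node3, node4]
    }
}
```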

            Trying to change DRM to make a generic class that can handle both use cases is probably more effort than just creating something new.