Currently, replication between the live and backup nodes is done on a single thread. Each operation is replicated and executed on the backup, and when the result is returned to the live node, the operation is executed on the live node before the result is returned to the client (if applicable). This ensures the same ordering of state changes on live and backup, but it effectively makes all operations single threaded, reducing throughput in a clustered set-up.
Instead we should implement multi-threaded replication. If replicated operations are executed concurrently on the backup, it is trickier to ensure that state changes on any object occur in the same order as on the live node, but it is possible.
This is how we can do it.
For the following types of objects it is important that their state changes occur in the same order on live and backup:
1) Sessions (S)
2) Queues (Q)
3) Post offices (P)
Operations that occur on the live node are either:
a) Initiated externally - e.g. an operation like send message or acknowledge coming from a client
b) Initiated internally - e.g. an asynchronous delivery, a depage, or an asynchronous message expiry from the reaper
As any of these operations are being executed they may affect the state of one or more stateful objects, e.g. a send operation might affect the state of S1, Q1, Q2, S2, and S3 (as a sent message is routed to Q1 and Q2 and immediately delivered to S2 and S3).
When the operation is replicated to the backup, we must ensure that the state changes on any affected stateful objects occur in the same order as they did on the live node.
To do this, as the operation is executed and the state changes occur, each stateful object is locked in turn and a sequence number is obtained from it, telling us that this is the nth state change on that object.
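A minimal sketch of the live-node side of this scheme might look as follows. The class and method names are hypothetical, not part of any existing codebase; the point is that the per-object lock serialises state changes while the counter records their order:

```java
import java.util.concurrent.locks.ReentrantLock;

// Hypothetical stateful object (a session, queue, or post office) that
// hands out a per-object state-change sequence number while locked.
class StatefulObject {
    private final ReentrantLock lock = new ReentrantLock();
    private long sequence = 0;

    // Called on the live node as an operation mutates this object's state.
    // The lock serialises state changes; the returned number records that
    // this is the nth state change on this object.
    long lockAndNextSequence() {
        lock.lock();
        try {
            return ++sequence;
        } finally {
            lock.unlock();
        }
    }
}
```

In practice the lock would be held for the duration of the state change rather than just the counter increment, but the ordering guarantee comes from the same lock/counter pair.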
So a send operation might end up with a set of state sequence numbers:
S1(12), Q1(102), Q2(32), S2(74), S3(1038)
This tells us the operation caused the 12th state change on object S1, followed by the 102nd state change on object Q1, etc.
The list of sequence numbers can be stored in a thread local as the operation executes on the live node.
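The thread-local collection could be sketched as below, assuming one operation per thread; the class and method names are illustrative only:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical per-operation record of the sequence numbers obtained on the
// live node. A thread local suffices because each operation runs on one thread.
class ReplicationContext {

    static class SequenceEntry {
        final String objectId; // e.g. "S1", "Q1"
        final long sequence;   // nth state change on that object
        SequenceEntry(String objectId, long sequence) {
            this.objectId = objectId;
            this.sequence = sequence;
        }
    }

    private static final ThreadLocal<List<SequenceEntry>> ENTRIES =
            ThreadLocal.withInitial(ArrayList::new);

    // Called each time a stateful object hands out a sequence number.
    static void record(String objectId, long sequence) {
        ENTRIES.get().add(new SequenceEntry(objectId, sequence));
    }

    // Drained when the operation is replicated to the backup, so the list
    // of sequence numbers can be sent along with the operation.
    static List<SequenceEntry> drain() {
        List<SequenceEntry> entries = ENTRIES.get();
        ENTRIES.remove();
        return entries;
    }
}
```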
When the operation is executed on the live node, we delay sending any data back to the client until we have replicated the operation to the backup and received a response that the replication has been processed.
To ensure the state changes occur in the same order on the backup, when we replicate the operation we also replicate the list of state sequence numbers. When the operation is executed on the backup, we attempt to obtain the locks on the stateful objects.
Say we try to get the lock on object Q1, but the current sequence number of Q1 on the backup is 100; the attempt to get the lock will block until the 101st state change has been executed, since we require this operation to be the 102nd state change.
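The backup-side blocking could be implemented with a monitor wait, roughly as in this sketch (hypothetical names again): an operation tagged as the nth state change on an object waits until the object has seen its n-1th change.

```java
// Hypothetical backup-side counterpart of a stateful object. An operation
// replicated with sequence number n blocks until n - 1 state changes have
// been applied, guaranteeing the live node's per-object ordering.
class BackupStatefulObject {
    private long sequence = 0; // number of state changes applied so far

    synchronized void awaitAndApply(long expectedSequence, Runnable stateChange) {
        // Wait until this operation is the next state change in line.
        while (sequence != expectedSequence - 1) {
            try {
                wait();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException("interrupted while waiting", e);
            }
        }
        stateChange.run();
        sequence = expectedSequence;
        notifyAll(); // wake any operation waiting for the next sequence number
    }
}
```

A usage sketch: if a replication thread calls `awaitAndApply(102, ...)` while the object's counter is at 100, it parks inside `wait()`; the thread that applies change 101 calls `notifyAll()`, and the blocked thread then proceeds as the 102nd change.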
Using this technique we can ensure that state changes are applied in the same order on live and backup without being limited to a single thread.
In many cases the objects will already be at or near the expected sequence numbers, so the replications can proceed in parallel.
In case of failure of the live node, operations may block indefinitely on the backup, waiting for replications of missing state changes that will never arrive.
To handle this we can time out on the lock and reverse the state changes made so far.
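A bounded-wait variant of the backup-side object sketched above could signal the caller to roll back; the names and the rollback-by-return-value convention are assumptions, not a prescribed design:

```java
// Hypothetical failover handling: bound the wait so a backup operation whose
// predecessor never arrives (because the live node died) can be rolled back.
class TimedBackupObject {
    private long sequence = 0; // number of state changes applied so far

    // Returns true if the change was applied; false if the wait for the
    // missing predecessor timed out, in which case the caller should
    // reverse any state changes the operation has already made elsewhere.
    synchronized boolean tryAwaitAndApply(long expectedSequence,
                                          Runnable stateChange,
                                          long timeoutMillis) {
        long deadline = System.currentTimeMillis() + timeoutMillis;
        while (sequence != expectedSequence - 1) {
            long remaining = deadline - System.currentTimeMillis();
            if (remaining <= 0) {
                return false; // predecessor never arrived; caller rolls back
            }
            try {
                wait(remaining);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return false;
            }
        }
        stateChange.run();
        sequence = expectedSequence;
        notifyAll();
        return true;
    }
}
```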