3 Replies Latest reply on Feb 15, 2013 11:28 AM by borges

How is "split brain" handled with the Share-Nothing configuration ?

mmonrocq Feb 15, 2013 3:08 AM

Hello,

I am currently evaluating HornetQ and so far it seems great, but some aspects of failover handling in the Share-Nothing configuration remain unclear to me, especially how it handles network partitions and intermittent link failures.

From the documentation, chapter 39.1.2 Data Replication we get:

Much like in the shared-store case, when the live server stops or crashes, its replicating backup will become active and take over its duties. Specifically, the backup will become active when it loses connection to its live server. This can be problematic because this can also happen because of a temporary network problem. In order to address this issue, the backup will try to determine whether it still can connect to the other servers in the cluster. If it can connect to more than half the servers, it will become active, if more than half the servers also disappeared with the live, the backup will wait and try reconnecting with the live. This avoids a split brain situation.
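
For readers following along, the rule in the passage above can be sketched as a small decision function. This is only an illustration of the rule as the documentation words it, not HornetQ code: the class, enum and method names are hypothetical, and how "the other servers" are counted is an assumption. The documentation does not say what happens when exactly half of the servers are reachable, so the sketch conservatively waits in that case.

```java
/**
 * Sketch of the failover rule as worded in the quoted documentation passage.
 * Not HornetQ code: the class, enum and method names are hypothetical.
 */
final class BackupQuorumSketch {

    enum Decision { BECOME_ACTIVE, WAIT_AND_RETRY_LIVE }

    /**
     * @param reachableOthers other cluster servers the backup can still connect to
     * @param totalOthers     other cluster servers the backup knew about
     *                        (assumption: counted without the backup itself)
     */
    static Decision decide(int reachableOthers, int totalOthers) {
        if (reachableOthers < 0 || totalOthers < 0 || reachableOthers > totalOthers) {
            throw new IllegalArgumentException("invalid server counts");
        }
        // "More than half", written with integer arithmetic to avoid rounding.
        if (2 * reachableOthers > totalOthers) {
            return Decision.BECOME_ACTIVE;
        }
        // Assumption: the exactly-half case is not covered by the documentation,
        // so this sketch treats it like "more than half disappeared" and waits.
        return Decision.WAIT_AND_RETRY_LIVE;
    }

    public static void main(String[] args) {
        System.out.println(decide(3, 4)); // majority reachable
        System.out.println(decide(1, 4)); // majority gone
        System.out.println(decide(2, 4)); // exactly half (assumed: wait)
    }
}
```

Note that this check runs on the backup only; nothing in the quoted passage describes an equivalent check on the live server.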

So, regarding Network Partitions:

I therefore assume that the live server periodically checks the quorum itself, since otherwise it could be isolated by a network partition and still serve messages. I also assume that there is some kind of timeout before the backup takes over, to give the live server time to realize that it is isolated and stop serving clients. Is this accurate? Can those delays be configured?

And regarding Intermittent link failures:

Suppose I have a live server L, a backup server B and a passive server P in a cluster of 3 (quorum-size 2). What happens if the link between L and B is down, but both can still connect to P? (Note: in this case, both have a quorum of 2.)

Apart from those unclear areas, I have been very impressed with the performance, especially with the efficiency of the paging mechanism. Congrats!

1. Re: How is "split brain" handled with the Share-Nothing configuration ?

borges Feb 15, 2013 11:28 AM (in response to mmonrocq)
Hi,

See my answers inline.
M. Monrocq wrote:

> From the documentation, chapter 39.1.2 Data Replication we get:
>
> Much like in the shared-store case, when the live server stops or crashes, its replicating backup will become active and take over its duties. Specifically, the backup will become active when it loses connection to its live server. This can be problematic because this can also happen because of a temporary network problem. In order to address this issue, the backup will try to determine whether it still can connect to the other servers in the cluster. If it can connect to more than half the servers, it will become active, if more than half the servers also disappeared with the live, the backup will wait and try reconnecting with the live. This avoids a split brain situation.
>
> So, regarding Network Partitions:
>
> I therefore assume that the live server periodically checks the quorum itself, since otherwise it could be isolated by a network partition and still serve messages. I also assume that there is some kind of timeout before the backup takes over, to give the live server time to realize that it is isolated and stop serving clients. Is this accurate? Can those delays be configured?

Short answer is that this is unsupported.

The current implementation suffices when the servers are on the same network, but it does not really deal with a partition. We could add something to the cluster broadcasts to make either the active-backup[1] or the live server exit upon sight of one another. Both solutions are bad in the sense that both servers may already have been serving client requests. The idea of making the live server check the cluster and then decide to stop serving requests, on the assumption that the backup is operational somewhere else, doesn't sound very attractive. I reckon this is a situation we might have to deal with, but certainly not in this first release.

[1]: By active-backup I mean the replicating backup server that has become active.

> And regarding Intermittent link failures:
>
> Suppose I have a live server L, a backup server B and a passive server P in a cluster of 3 (quorum-size 2). What happens if the link between L and B is down, but both can still connect to P? (Note: in this case, both have a quorum of 2.)

The best way to address intermittent link failures would be to increase the time-to-live (TTL) and ping intervals.
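
To illustrate why a longer TTL helps here, below is a minimal sketch of a TTL-based liveness check. It is not HornetQ code and does not use HornetQ configuration names: the class and method names and all the numbers are hypothetical. A connection counts as expired once no ping has been seen for longer than the TTL, so a TTL longer than the expected link outage keeps the backup from treating the live server as gone.

```java
/**
 * Sketch of a TTL-based liveness check. Not HornetQ code: the class name,
 * method name and all values below are hypothetical.
 */
final class PingTtlSketch {

    /** Returns true if no ping has been seen for longer than ttlMillis. */
    static boolean isConnectionExpired(long lastPingMillis, long nowMillis, long ttlMillis) {
        return nowMillis - lastPingMillis > ttlMillis;
    }

    public static void main(String[] args) {
        long lastPing = 0L;                  // time of the last ping received
        long outageMillis = 5_000L;          // hypothetical intermittent link outage
        long now = lastPing + outageMillis;  // moment the link comes back

        // With a short TTL the outage is long enough to look like a dead peer.
        System.out.println(isConnectionExpired(lastPing, now, 2_000L));

        // With a TTL longer than the outage the connection is still considered alive.
        System.out.println(isConnectionExpired(lastPing, now, 60_000L));
    }
}
```

The trade-off is that a longer TTL also delays detection of a real crash by the same amount.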

> Apart from those unclear areas, I have been very impressed with the performance, especially with the efficiency of the paging mechanism. Congrats!

Thanks!

Keep in mind that replication is at its very first release (or rather, this implementation of replication in HornetQ is at its very first release). We hope to improve the logic around the "missing live server" scenario, just as we hope to improve support for other aspects of replication.
2. Re: How is "split brain" handled with the Share-Nothing configuration ?

gaohoward Feb 17, 2013 8:19 PM (in response to mmonrocq)

> So, regarding Network Partitions:
>
> I therefore assume that the live server periodically checks the quorum itself, since otherwise it could be isolated by a network partition and still serve messages. I also assume that there is some kind of timeout before the backup takes over, to give the live server time to realize that it is isolated and stop serving clients. Is this accurate? Can those delays be configured?

I think it's the backup that decides when to fail over. The live servers don't care about quorum-related status. When a live server is isolated from the others by a network partition, its replicating backup has no way to know whether the live is really dead or still alive but temporarily unreachable. So the backup uses the quorum to decide.

> And regarding Intermittent link failures:
>
> Suppose I have a live server L, a backup server B and a passive server P in a cluster of 3 (quorum-size 2). What happens if the link between L and B is down, but both can still connect to P? (Note: in this case, both have a quorum of 2.)

I'm not sure I understand the term 'passive server', but I assume it is just another live server in the cluster, right? In that case, if B can connect to P but fails to connect to L, it considers that the network between B and L may be temporarily broken and tries to reconnect to L.
3. Re: How is "split brain" handled with the Share-Nothing configuration ?

gaohoward Feb 17, 2013 8:25 PM (in response to mmonrocq)

In either case the live server doesn't check quorum status. The backup is responsible for that.

In the second case, if B can still connect to P, B will try to reconnect to L and will not fail over.

Howard