3 Replies Latest reply on Feb 15, 2013 11:28 AM by borges

    How is "split brain" handled with the Share-Nothing configuration?

    mmonrocq

      Hello,

       

      I am currently evaluating HornetQ and so far it seems great, but some aspects of failover handling in the Share-Nothing configuration are unclear to me, especially how it copes with network partitions and intermittent link failures.

      The documentation, chapter 39.1.2 (Data Replication), says:

      Much like in the shared-store case, when the live server stops or crashes, its replicating backup will become active and take over its duties. Specifically, the backup will become active when it loses connection to its live server. This can be problematic because this can also happen because of a temporary network problem. In order to address this issue, the backup will try to determine whether it still can connect to the other servers in the cluster. If it can connect to more than half the servers, it will become active, if more than half the servers also disappeared with the live, the backup will wait and try reconnecting with the live. This avoids a split brain situation.
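
Read literally, the quoted rule is a strict-majority check. The Java sketch below is only my reading of that paragraph, not HornetQ's actual implementation: `QuorumSketch`, `hasMajority`, and `decide` are hypothetical names, and it assumes that the backup counts itself among the reachable servers and that the cluster size is known in advance.

```java
/** Hypothetical sketch of the majority rule described in the quoted paragraph. */
public final class QuorumSketch {

    /** What a backup does after losing its connection to the live server. */
    public enum Decision { ACTIVATE, WAIT_AND_RETRY }

    private QuorumSketch() {}

    /**
     * Returns true when the reachable servers (assumed to include the caller)
     * form a strict majority of the cluster.
     */
    public static boolean hasMajority(int reachableIncludingSelf, int clusterSize) {
        if (clusterSize <= 0 || reachableIncludingSelf < 0
                || reachableIncludingSelf > clusterSize) {
            throw new IllegalArgumentException("invalid server counts");
        }
        // Integer arithmetic avoids rounding issues with clusterSize / 2.
        return 2 * reachableIncludingSelf > clusterSize;
    }

    /** Activate on a majority; otherwise wait and retry the live server. */
    public static Decision decide(int reachableIncludingSelf, int clusterSize) {
        return hasMajority(reachableIncludingSelf, clusterSize)
                ? Decision.ACTIVATE
                : Decision.WAIT_AND_RETRY;
    }
}
```

With a cluster of 3, `decide(2, 3)` yields `ACTIVATE` and `decide(1, 3)` yields `WAIT_AND_RETRY`.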

       

       

      Regarding network partitions:

      I suppose that the live server periodically checks the quorum itself, since otherwise, in case of a network partition, it could become isolated and still serve messages. I also suppose that there is some kind of timeout before the backup takes over, to give the live server time to realize that it is isolated and stop serving clients. Is this accurate? Can those delays be configured?

      Regarding intermittent link failures:

      Suppose I have a live server L, a backup server B and a passive server P in a cluster of 3 (quorum size 2). What happens if the link between L and B is down, but both can still connect to P? (Note: in this case, both L and B see a quorum of 2.)
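
To illustrate why this case worries me, here is a small Java sketch of the L/B/P scenario under a plain majority rule. It is not HornetQ code: `LinkFailureSketch` and its methods are hypothetical, each server is assumed to count itself, and only direct links are considered (no relaying through P).

```java
import java.util.List;
import java.util.Set;

/** Hypothetical model of a 3-node cluster with one broken link. */
public final class LinkFailureSketch {

    private LinkFailureSketch() {}

    /** Order-independent key for the link between two servers. */
    public static String linkKey(String a, String b) {
        return a.compareTo(b) < 0 ? a + "-" + b : b + "-" + a;
    }

    /** Counts the servers a node reaches directly, including itself. */
    public static int reachable(String node, List<String> cluster, Set<String> brokenLinks) {
        int count = 0;
        for (String other : cluster) {
            if (other.equals(node) || !brokenLinks.contains(linkKey(node, other))) {
                count++;
            }
        }
        return count;
    }

    /** Strict-majority check over the whole cluster. */
    public static boolean hasMajority(String node, List<String> cluster, Set<String> brokenLinks) {
        return 2 * reachable(node, cluster, brokenLinks) > cluster.size();
    }

    public static void main(String[] args) {
        List<String> cluster = List.of("L", "B", "P");
        // Only the L-B link is down; both still reach P.
        Set<String> broken = Set.of(linkKey("L", "B"));
        for (String node : cluster) {
            System.out.println(node + " reaches " + reachable(node, cluster, broken)
                    + " of " + cluster.size()
                    + ", majority=" + hasMajority(node, cluster, broken));
        }
    }
}
```

In this model L and B each reach 2 of 3 servers, so both pass the majority check: L has no reason to stop serving and B has no reason to stay passive. A majority count alone cannot tell the two apart, which is why I am asking what HornetQ actually does here.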

       

       

      Apart from those unclear areas, I have been very impressed with the performance, especially the efficiency of the paging mechanism. Congratulations!