What is split brain and why do you need to worry about it?

Split brain is a state of a server cluster where nodes diverge from each other and conflict when handling incoming I/O operations. The servers may record the same data inconsistently or compete for resources. The cluster will usually shut off while the nodes wait for direction on how to resolve the conflict, which means downtime for your servers. If the nodes keep running instead, the result can be even worse: data corruption.

What causes split brain?

Split brain can occur because of a network partition. A network partition happens when the nodes in a cluster lose the ability to communicate with each other while staying connected to the rest of the network, so each node incorrectly concludes that the other is offline. Both nodes then believe they should accept incoming requests, unaware that the other is still functioning, and whatever data comes in or is modified ends up corrupted. In a two-node cluster, this only happens if the cluster is configured for availability. If a two-node cluster is configured for consistency, it will go down when a partition occurs.

Split brain can also occur in a master-slave cluster configured for failover. If the master node briefly goes offline, the secondary server promotes itself. If the original master then comes back online still believing it is the "master", both nodes try to handle incoming operations. This power struggle is likely to corrupt data.

Nodes in a cluster send out packets of information at regular intervals to tell the other nodes that they are still there and running. They do this over a heartbeat network, though the name can be misleading, as there usually isn't a separate network connection. A heartbeat network doesn't prevent network partitions, but it does let a cluster detect when a partition occurs or when a node goes down, so that nodes can shut down and prevent data corruption.
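To make the heartbeat idea concrete, here is a minimal sketch of how a node might track when it last heard from each peer. The class name, the 3-second timeout, and the node names are all made up for illustration, and real cluster software is considerably more involved.

```python
from dataclasses import dataclass, field


@dataclass
class HeartbeatMonitor:
    """Hypothetical helper: remembers when each peer was last heard from."""
    timeout: float  # seconds of silence before a peer is suspected (assumed value)
    last_seen: dict = field(default_factory=dict)

    def record(self, peer: str, now: float) -> None:
        """Store the arrival time of a heartbeat from `peer`."""
        self.last_seen[peer] = now

    def suspected(self, now: float) -> list:
        """Return the peers that have been silent for longer than the timeout."""
        return sorted(p for p, t in self.last_seen.items() if now - t > self.timeout)


monitor = HeartbeatMonitor(timeout=3.0)
monitor.record("node-b", now=0.0)
monitor.record("node-c", now=0.0)
monitor.record("node-b", now=2.0)   # node-b keeps sending heartbeats
print(monitor.suspected(now=4.0))   # ['node-c']: silent for 4 seconds
```

Note that a missed heartbeat only says a peer is unreachable. On its own it cannot tell a crashed node from a partitioned one, which is why detection is usually paired with shutting down or with a vote.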

As mentioned above, the only reliable way to prevent data corruption when a network partition occurs in a two-node cluster is downtime. Clusters with an odd number of nodes, however, can use a majority vote to prevent split brain and keep running: only the side of a partition that still holds more than half of the nodes carries on. They do this by reaching a quorum.

What is a quorum?

A quorum is the minimum number of members needed to establish a consensus. Imagine you're in a meeting and you have to vote on something. For the vote to pass, you need 2 out of 3 people to agree, or 3 out of 5, and so on. It's the same with Ceph monitors: they must establish a consensus about the data and the cluster map.
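The "2 out of 3, 3 out of 5" rule is a strict majority, which can be sketched in a few lines. The function names here are hypothetical and are not taken from Ceph or any other cluster software.

```python
def quorum_size(cluster_size: int) -> int:
    """Smallest number of nodes that forms a strict majority."""
    return cluster_size // 2 + 1


def has_quorum(reachable: int, cluster_size: int) -> bool:
    """True if a group of `reachable` nodes (counting itself) holds a majority."""
    return reachable >= quorum_size(cluster_size)


for n in (2, 3, 5):
    print(n, "nodes -> quorum of", quorum_size(n))

# Two-node cluster split 1 | 1: neither side has a majority.
print(has_quorum(1, 2), has_quorum(1, 2))   # False False

# Three-node cluster split 2 | 1: only the larger side keeps running.
print(has_quorum(2, 3), has_quorum(1, 3))   # True False
```

In a two-node cluster the quorum is both nodes, so any partition leaves each side short of a majority. With three nodes, one side of a partition can still reach the required two.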

A quorum is reached by each node in the cluster having a "vote" on what information is correct, with the information only ever being recorded once a majority consensus is reached. It operates on several computer science principles.

Those principles are meant to guarantee data consistency by ensuring that multiple conflicting copies of the data are never recorded. Only one consistent copy is ever accepted, and for this to work the cluster needs an odd number of nodes, so that a majority can always vote the minority down.
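As a rough illustration of "only record what the majority agrees on", the sketch below counts one vote per node and accepts a value only if more than half of the nodes report it. This is a simplification under stated assumptions (every node votes exactly once, and the function name is invented), not how any particular consensus protocol is implemented.

```python
from collections import Counter
from typing import Optional


def agreed_value(votes: list) -> Optional[str]:
    """Return the value backed by a strict majority of votes, or None if there is none."""
    if not votes:
        return None
    value, count = Counter(votes).most_common(1)[0]
    # Strict majority: more than half of all votes must match.
    return value if count > len(votes) // 2 else None


# Two nodes that disagree are stuck: no value wins.
print(agreed_value(["balance=100", "balance=90"]))                 # None

# Three nodes: the two that agree outvote the one that diverged.
print(agreed_value(["balance=100", "balance=100", "balance=90"]))  # balance=100
```

The two-node case has nothing to break the tie, while the three-node case lets the majority overrule the node that fell out of step.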

Still having trouble understanding?

Here is an analogy that could make things clearer.

Imagine you're at a town council meeting, about to give a proposal to two council members. While you're giving your proposal, one member briefly stops paying attention. When he starts paying attention again, he writes the wrong information in his notes, believing it to be correct.

Later, you request a response from the two members, but they have different information and don't know whose is correct. The members need to either stop there or risk losing the correct information and using the incorrect version. With just two members, the town council has no way to reach an agreement. Whenever they vote against each other on whose information is correct, the result is a stalemate.

Now imagine you gave your proposal to a three-member council. When one of the members stops paying attention and later compares notes, the other two members correct them and tell them to fix the mistake.

That is essentially a simplified version of split brain in which you are the client communicating with the cluster. The important takeaway is that for the cluster to continue working, there needs to be an odd number of members (or server nodes) so that they can reach an agreement.