How servers negotiate with each other: Raft distributed consensus algorithm

When clusters reach hundreds, and sometimes thousands, of machines, the question arises of the consistency of server states relative to each other. The Raft Distributed Consensus Algorithm provides the most stringent consistency guarantee possible. In this article, we will consider Raft from the point of view of an engineer and try to answer the questions “How?” And “Why?” It works.









Article author: Dmitry Pavlushin (developer Dodo Pizza Engineering).


Raft is a distributed consensus algorithm that is needed so that several participants can jointly decide whether an event occurred or not, and what followed.



The data served by the Raft cluster is a log consisting of records. When the user wants to change the data stored in the cluster, he tries to add a new record to the log with the command:





These commands are executed by distributed state machines. For simplicity and clarity, in the framework of this article, we will assume that these records are simply given when reading to a client who, based on the events that have taken place, restores the current state of the system (see Event sourcing) .



To ensure consensus in Raft, a leader is first selected on whom responsibility for managing the distributed log will be laid. The leader accepts requests from clients and replicates them to other servers in the cluster. If the leader fails, a new leader will be selected in the cluster. This is if in a short sentence in three sentences. Details will follow.



Basic concepts



  1. Server states In the Raft cluster, each server at any given time is in one of three states:

    • Leader (leader) - processes all client requests, is the source of truth of all data in the log, supports the follower log.
    • Follower (follower) is a passive server that only “listens” to new log entries from the leader and redirects all incoming requests from clients to the leader. In fact, it is a hot-standby replica of the leader.
    • Candidate (candidate) is a special state of the server, possible only during the selection of a new leader.


    During normal operation in a cluster, only one server is the leader, all the rest are its followers.



    About asynchronism
    It is worth noting here that condition is a relative concept. Due to the fact that the servers communicate asynchronously, different servers can observe the transitions of other servers from one state to another at different times.

  2. Raft divides time into segments of arbitrary length, called deadlines . Each term has a monotonically increasing number. The term begins with the election of a leader when one or more servers become candidates. If the candidate receives the majority of votes, he becomes a leader until the end of this period. If the votes are divided, and none of the candidates can get the majority of the votes, a timeout is triggered, and this period ends. After this, a new term begins with new candidates and elections. This situation is called split vote. An example is illustrated by term number three in the following diagram:





    The term number serves as the logical timestamp in the Raft cluster. It helps servers determine which information is more relevant at the moment.



    Server interaction rules and terms
    • Each server tracks the number of its current term.
    • The server includes its expiration number in each sent message.
    • If the server receives a message with a lesser term number than its own, then it ignores this message.
    • If the server receives a message with a longer deadline number than its own, then it updates its deadline number to match the received one.
    • If a candidate or leader receives a message with a longer deadline number than his own, then he understands that other servers have already initiated a new deadline, and his deadline is no longer relevant. Therefore, it goes from the current state to the “follower” state in addition to updating its number.


  3. Server Communication The servers in Raft interact by exchanging requests and responses. The basic algorithm uses only two types of calls:



    • RequestVote is used by candidates during the election. The request contains the candidate’s term number and metadata about the candidate’s log, discussed in more detail below. The response contains the term number of the responding server and the value “true” if the server votes for the candidate; False if the server votes against the candidate.
    • AppendEntries is used by the leader for log replication, as well as for the heartbeat mechanism. The request contains the leader’s term number, a collection of records that need to be added to the log (or an empty collection in the case of heartbeat), some metadata about the leader’s log, also discussed in more detail below. The response contains the follower term number and the value “true” if the follower successfully added entries to his log; “False” if adding entries to the log failed.


Work algorithm



1. Choose a leader



Raft relies on heartbeat to determine when it is time to start a new election. The follower remains the follower as long as he receives messages from the current leader or candidate. The leader periodically sends all other heartbeat servers.



If the follower does not receive any messages for a while, he will quite naturally assume that the leader is dead, which means it's time to take the initiative in his hands. At this point, the former follower initiates the election.



To initiate the election, the follower increments its term number, switches to the “candidate” state, votes for itself, and then sends the “RequestVote” request to all other servers. After that, the candidate waits for one of three events:



  1. The candidate receives the majority of votes (including his own) and wins the election. Each server votes only once in each term, for the first candidate to be reached (with some exceptions, discussed below), therefore, only one candidate can get the majority of votes in a specific term. The winning server becomes the leader, starts sending heartbeat and serving client requests to the cluster.
  2. The candidate receives a message from the current leader of the current term or from any server with an older term . In this case, the candidate understands that the elections in which he runs are no longer relevant. He has no choice but to recognize a new leader / new term and go into a state of follower.
  3. A candidate does not receive a majority of the votes for a certain timeout. This can happen when several followers become candidates, and the votes are divided among them so that not one gets the majority. In this case, the term ends without a leader, and the candidate immediately begins new elections for the next term.


2. We replicate logs



When a leader is selected, he is responsible for managing the distributed log. The leader accepts requests from clients containing some teams. The leader puts in his log a new record containing the command, and then sends "AppendEntries" to all followers in order to replicate the record with the new record.



When the record is successfully replicated on most servers, the leader begins to consider the record closed and responds to the client. The leader keeps track of which record is the last. He sends the number of this record to AppendEntries (including heartbeat) so that followers can commit the record to themselves.



In case the leader cannot reach out to some followers, he will retrace the AppendEntries to infinity. The following picture shows how the logs are organized in the Raft cluster:







Each box is one entry in the log. Each record stores one command, for example, x ← 3 assign the value 3 to the key x. The record also stores the number of the term in which it was generated. In the picture, this is indicated by a number at the top of the square. The color indication of the squares also means the term number. Each record has a serial number (log index).



3. We guarantee the reliability of the algorithm



So far, from what we have examined, it is not clear how Raft can give at least some guarantees. However, the algorithm provides a set of properties that together guarantee the reliability of its execution:





Links to materials for further study






All Articles