Schrödinger's Cat Without a Box: The Consensus Problem in Distributed Systems

So let's imagine. 5 cats are locked in the room, and in order to go wake up the owner, they need to all agree together on this, because they can open the door only by leaning on it with five of them. If one of the cats is Schrödinger's cat and the other cats are unaware of his decision, the question arises: "How can they do this?"

In this article, I will tell you in simple terms about the theoretical component of the world of distributed systems and the principles of their work. And also superficially consider the main idea underlying Paxos'a.

Schrödinger's Cat Without a Box: The Consensus Problem in Distributed Systems

When developers use cloud infrastructures, various databases, work in clusters of a large number of nodes, they are confident that the data will be consistent, safe and always available. But where are the guarantees?

In fact, the guarantees that we have are supplier guarantees. They are described in the documentation something like this: “This service is quite reliable, it has a given SLA, don’t worry, everything will work distributed as you expect.”

We tend to believe in the best, because smart uncles from big companies assured us that everything will be fine. We do not ask the question: why, in fact, can this work at all? Is there any formal justification for the correct operation of such systems?

I recently went to distributed computing school and was very inspired by this theme. Lectures at school were more like classes in mathematical analysis than anything related to computer systems. But this is exactly how the most important algorithms that we use every day, without suspecting it, were proved at one time.

Most modern distributed systems use the Paxos consensus algorithm and its various modifications. The coolest thing is that the validity and, in principle, the very possibility of the existence of this algorithm can be proved simply with a pen and paper. At the same time, in practice, the algorithm is used in large systems operating on a huge number of nodes in the clouds.

Light illustration of what will be discussed next: the problem of two generalsLet's take a look at the warm-up task of two generals.

We have two armies - red and white. White troops are based in the besieged city. Red troops led by generals A1 and A2 are located on two sides of the city. The task of the redheads is to attack the white city and win. However, the army of each red general individually is smaller than the army of whites.

Schrödinger's Cat Without a Box: The Consensus Problem in Distributed Systems

Victory conditions for the redheads: both generals must attack at the same time in order to have a numerical advantage over the whites. To do this, generals A1 and A2 need to agree with each other. If everyone attacks separately, the redheads will lose.

To negotiate, generals A1 and A2 can send messengers to each other through the territory of the white city. The messenger may successfully reach an allied general or may be intercepted by the enemy. Question: is there such a sequence of communications between the red-haired generals (the sequence of sending messengers from A1 to A2 and vice versa from A2 to A1), in which they are guaranteed to agree on an attack at hour X. Here, by guarantees it is understood that both generals will have unequivocal confirmation that an ally (another general) will attack exactly at the appointed time X.

Suppose A1 sends a messenger to A2 with the message: "Let's attack today at midnight!". General A1 cannot attack without confirmation from General A2. If the messenger from A1 has reached, then general A2 sends a confirmation with the message: "Yes, let's fill up the whites today." But now General A2 does not know whether his messenger has reached or not, he has no guarantees whether the attack will be simultaneous. Now General A2 again needs confirmation.

If we describe their communication further, it turns out the following: no matter how many message exchange cycles there are, there is no way to guarantee both generals that their messages have been received (assuming that either of the messengers can be intercepted).

The two generals problem is a great illustration of a very simple distributed system where there are two nodes with unreliable communication. So we do not have a 100% guarantee that they are synchronized. About similar problems only on a larger scale later in the article.

Introducing the concept of distributed systems

A distributed system is a group of computers (hereinafter referred to as nodes) that can exchange messages. Each individual node is some autonomous entity. A node can process tasks on its own, but in order to communicate with other nodes, it needs to send and receive messages.

How messages are specifically implemented, what protocols are used - this does not interest us in this context. It is important that the nodes of a distributed system can exchange data with each other by sending messages.

The definition itself doesn't look very complicated, but keep in mind that a distributed system has a number of attributes that will be important to us.

Attributes of distributed systems

  1. Concurrency – the possibility of occurrence of simultaneous or competitive events in the system. Moreover, we will consider that events that occurred on two different nodes are potentially concurrent until we have a clear order in which these events occur. And usually we don't.
  2. No global clock. We do not have a clear order of events due to the lack of a global clock. In the ordinary world of people, we are used to the fact that we have hours and time absolutely. Everything changes when it comes to distributed systems. Even ultra-precise atomic clocks have drift, and there are situations where we cannot tell which of two events happened first. Therefore, we cannot rely on time either.
  3. Independent failure of system nodes. There is another problem: something can go wrong simply because our nodes are not eternal. The hard drive may fail, the virtual machine in the cloud may reboot, the network may blink and messages will be lost. Moreover, there are situations when the nodes work, but at the same time they work against the system. The last class of problems even received a separate name: the problem Byzantine generals. The most popular example of a distributed system with such a problem is Blockchain. But today we will not consider this special class of problems. We will be interested in situations in which just one or more nodes can fail.
  4. Communication models (messaging models) between nodes. We have already found out that nodes communicate by exchanging messages. There are two well-known messaging models: synchronous and asynchronous.

Models of communication between nodes in distributed systems

Synchronous model - we know for sure that there is a finite known delta of time for which the message is guaranteed to reach from one node to another. If this time is up and the message has not arrived, we can safely say that the node has failed. In such a model, we have a predictable waiting time.

Asynchronous model – in asynchronous models, we assume that the waiting time is finite, but there is no time delta after which it can be guaranteed that the node has failed. Those. the waiting time for a message from a node can be arbitrarily long. This is an important definition, and we will talk about it further.

The concept of consensus in distributed systems

Before formally defining the concept of consensus, consider an example of a situation where we need it, namely − State Machine Replication.

We have some distributed log. We would like it to be consistent and contain identical data on all nodes of a distributed system. When one of the nodes learns a new value that it is going to write to the log, its task is to offer this value to all other nodes so that the log is updated on all nodes, and the system switches to a new consistent state. At the same time, it is important that the nodes agree among themselves: all nodes agree that the proposed new value is correct, all nodes accept this value, and only in this case can everyone log a new value.

In other words: none of the nodes objected that they had more up-to-date information, and the proposed value was incorrect. An agreement between nodes and agreement on a single correct accepted value is the consensus in a distributed system. Next, we will talk about algorithms that allow a distributed system to reach consensus with guaranteed.
Schrödinger's Cat Without a Box: The Consensus Problem in Distributed Systems
More formally, we can define a consensus algorithm (or simply a consensus algorithm) as some function that takes a distributed system from state A to state B. Moreover, this state is accepted by all nodes, and all nodes can confirm it. As it turns out, this task is not at all as trivial as it seems at first glance.

Properties of the consensus algorithm

The consensus algorithm must have three properties in order for the system to continue to exist and have some progress in the transition from state to state:

  1. Agreement – all correctly working nodes must take the same value (in articles this property is also found as a safety property). All nodes that are now functioning (not out of order and not lost contact with the rest) must come to an agreement and accept some final common value.

    It is important to understand here that the nodes in the distributed system we are considering want to agree. That is, we are now talking about systems in which something can simply fail (for example, some node fails), but in this system there are definitely no nodes that deliberately work against others (the task of the Byzantine generals). Due to this property, the system remains consistent.

  2. Integrity - if all correctly working nodes offer the same value v, so each correctly working node must accept this value v.
  3. Termination - all correctly working nodes will eventually take on some value (liveness property), which allows the algorithm to have progress in the system. Each individual node that works correctly must sooner or later accept the final value and confirm it: "For me, this value is true, I agree with the whole system."

An example of how the consensus algorithm works

While the properties of the algorithm may not be entirely clear. Therefore, we will illustrate with an example what stages the simplest consensus algorithm goes through in a system with a synchronous messaging model, in which all nodes function as expected, messages are not lost and nothing breaks (does this really happen?).

  1. It all starts with a marriage proposal (Propose). Suppose a client connects to a node called "Node 1" and starts a transaction, passing a new value to the node - O. From now on, we will call "Node 1" propose. As the proposer "Node 1" now has to notify the whole system that he has fresh data, and he sends messages to all other nodes: "Look! I received the value "O", and I want to write it down! Please confirm that you will also record "O" in your log."

    Schrödinger's Cat Without a Box: The Consensus Problem in Distributed Systems

  2. The next stage is voting for the proposed value (Voting). What is it for? It may happen that other nodes received more recent information, and they have data on the same transaction.

    Schrödinger's Cat Without a Box: The Consensus Problem in Distributed Systems

    When the node "Node 1" sends its propuse, the other nodes check their logs for data on this event. If there is no conflict, the nodes announce: “Yes, I have no other data for this event. The 'O' value is the most up-to-date information we deserve."

    In any other case, the nodes can respond to "Node 1": "Listen! I have more recent data on this transaction. Not "Oh", but something better."

    At the voting stage, the nodes come to a decision: either they all accept the same value, or one of them votes against, denoting that he has more recent data.

  3. If the voting round was successful, and everyone was in favor, then the system moves to a new stage - acceptance of the value (Accept). "Node 1" collects all the responses from other nodes and reports: "Everyone agreed on the value 'O'! Now I officially declare that "O" is our new meaning, the same for all! Write it down in your notebook, don't forget. Write to your log!"

    Schrödinger's Cat Without a Box: The Consensus Problem in Distributed Systems

  4. The rest of the nodes send confirmation (Accepted) that they have written down the value “O” for themselves, nothing new has been received during this time (a kind of two-phase commit). After this momentous event, we consider that the distributed transaction has completed.
    Schrödinger's Cat Without a Box: The Consensus Problem in Distributed Systems

Thus, the consensus algorithm in a simple case consists of four steps: propose, vote (voting), acceptance (accept), confirmation of acceptance (accepted).

If at some step we could not reach agreement, then the algorithm is restarted, taking into account the information provided by the nodes that refused to confirm the proposed value.

Consensus algorithm in asynchronous system

Before that, everything was smooth, because it was about the synchronous messaging model. But we know that in the modern world we are used to doing everything asynchronously. How does a similar algorithm work in a system with an asynchronous messaging model, where we believe that the waiting time for a response from a node can be arbitrarily long (by the way, a node failure can also be considered as an example when a node can respond arbitrarily long ).

Now that we know how the consensus algorithm works in principle, the question is for those inquisitive readers who have reached this point: how many nodes in a system of N nodes with an asynchronous message model can go down so that the system can still reach consensus?

The correct answer and rationale behind the spoiler.Correct answer: 0. If even one node in an asynchronous system goes down, the system will not be able to reach consensus. This statement is proven in the well-known FLP theorem (1985, Fischer, Lynch, Paterson, link to the original at the end of the article): “The impossibility of reaching a distributed consensus if at least one node fails.”
Schrödinger's Cat Without a Box: The Consensus Problem in Distributed Systems
Guys, then we have a problem, we are used to the fact that everything is asynchronous with us. And here it is. How to continue to live?

We have just talked about theory, about mathematics. What does "consensus cannot be reached" mean, translating from mathematical language into ours - engineering? This means that "cannot always be achieved", i.e. there is a case in which consensus is not achievable. And what is this case?

This is exactly the violation of the liveness property described above. We do not have a common agreement, and the system cannot progress (cannot terminate in a finite time) in the case where we do not have a response from all nodes. Because in an asynchronous system we don't have a predictable response time and we can't know if a node is down or just taking a long time to respond.

But in practice, we can find a solution. Let our algorithm be able to run for a long time in case of failures (it can potentially run indefinitely). But in most situations, when most of the nodes are functioning correctly, we will have progress in the system.

In practice, we are dealing with partially synchronous communication models. Partial synchronism is understood as follows: in the general case, we have an asynchronous model, but a certain concept of “global stabilization time” of a certain point in time is formally introduced.

This moment in time may not come for an arbitrarily long time, but one day it must come. The virtual alarm clock will ring, and from that moment on we can predict the time delta in which the messages will reach. From this point on, the system turns from asynchronous to synchronous. In practice, we deal with such systems.

Paxos algorithm solves consensus problems

Paxos is a family of algorithms that solve the consensus problem for partially synchronous systems, provided that some nodes can fail. The author of Paxos is Leslie Lamport. He proposed a formal proof of the existence and correctness of the algorithm in 1989.

But the proof turned out to be by no means trivial. The first publication was released only in 1998 (33 pages) describing the algorithm. As it turned out, it was extremely difficult to understand, and in 2001 an explanation was published for the article, which took 14 pages. The volumes of publications are given in order to show that in fact the problem of consensus is not at all simple, and behind such algorithms there is a huge work of the smartest people.

It is interesting that Leslie Lampor himself in his lecture noted that in the second article-explanation there is one statement, one line (he did not specify which one), which can be interpreted in different ways. And because of this, a large number of modern Paxos implementations do not work quite correctly.

A detailed analysis of the work of Paxos will take more than one article, so I will try to convey the main idea of ​​​​the algorithm very briefly. In the links at the end of my article you will find materials for further diving into this topic.

Roles at Paxos

The Paxos algorithm has a concept of roles. Consider three main ones (there are modifications with additional roles):

  1. Proposers (there may also be terms: leaders or coordinators). These are the guys who learn about some new meaning from the user and take on the role of leader. Their task is to launch a round of proposals for a new value and coordinate further actions of the nodes. Moreover, Paxos allows the presence of several leaders in certain situations.
  2. Acceptors (Voters). These are the nodes that vote to accept or reject a particular value. Their role is very important, because the decision depends on them: what state the system will go (or not go) to after the next stage of the consensus algorithm.
  3. Learners. Nodes that simply accept and write the new accepted value when the state of the system has changed. They do not make decisions, they just receive data and can give it to the end user.

One node can combine several roles in different situations.

The concept of quorum

We assume that we have a system of N nodes. And most of them F nodes may fail. If F nodes fail, then we should have at least 2F+1 acceptor nodes.

This is necessary so that we always, even in the worst situation, “good”, correctly working nodes have a majority. That is F+1 "good" nodes that agreed, and the final value will be accepted. Otherwise, there may be a situation where different local groups will take on different meanings and will not be able to agree among themselves. So we need an absolute majority to win the vote.

General idea of ​​the Paxos consensus algorithm

The Paxos algorithm assumes two large phases, which in turn are divided into two steps each:

  1. Phase 1a: Prepare. During the preparation stage, the leader (proposer) informs all nodes: “We are starting a new voting stage. We have a new round. The number of this round is n. We'll start voting now." So far, it just reports the start of a new cycle, but does not report the new value. The task of this stage is to initiate a new round and inform everyone of its unique number. The round number is important, it must be greater than all previous voting numbers from all previous leaders. Since it is thanks to the round number that other nodes in the system will understand how fresh the leader’s data is. Probably other nodes already have voting results from much later rounds and will simply tell the leader that he is behind the times.
  2. Phase 1b: Promise. When the acceptor nodes have received the number of the new voting stage, two outcomes are possible:
    • The number n of the new vote is greater than the number of any of the previous votes in which the acceptor participated. Then the acceptor sends a promise to the leader that it will no longer participate in any votes with a number less than n. If the acceptor has already voted for something (that is, it has already accepted some value in the second phase), then it appends the accepted value and the number of the vote in which it participated to its promise.
    • Otherwise, if the acceptor already knows about the high-numbered vote, it can simply ignore the prepare step and not respond to the leader.
  3. Phase 2a: Accept. The leader needs to wait for a response from the quorum (most of the nodes in the system) and, if the required number of responses are received, then he has two options for the development of events:
    • Some of the acceptors have submitted values ​​that they have already voted for. In this case, the leader chooses the value from the vote with the highest number. Let's call this value x, and send a message like this to all nodes: "Accept (n, x)", where the first value is the voting number from its own Propose step, and the second value is what everyone gathered for, i.e. value for which, in fact, we vote.
    • If none of the acceptors sent any values, but they simply promised to vote in this round, the leader can invite them to vote for their value, the value for which he became a leader at all. Let's call it y. It sends to all nodes a message of the form: "Accept (n, y)", by analogy with the previous outcome.
  4. Phase 2b: Accepted. Further, the acceptor nodes, upon receiving the message "Accept (...)", from the leader agree with it (send confirmation to all nodes that they agree with the new value) only if they have not promised some (other ) the leader to participate in voting with the number of the round n' > notherwise they ignore the confirmation prompt.

    If the majority of nodes answered the leader, and they all confirmed the new value, then the new value is considered accepted. Hooray! If the majority is not typed or there are nodes that refused to accept the new value, then everything starts over.

This is how the Paxos algorithm works. Each of these stages has many subtleties, we practically did not consider various types of failures, problems of multiple leaders, and much more, but the purpose of this article is only to introduce the reader to the world of distributed computing at the top level.

It is also worth noting that Paxos is not the only one of its kind, there are other algorithms, for example, Raft, but this is a topic for another article.

Links to materials for further study

Novice level:

Leslie Lamport Level:

Source: habr.com

Add a comment