HighLoad++, Mikhail Tyulenev (MongoDB): Causal consistency: from theory to practice

HighLoad++ Siberia 2019. Krasnoyarsk Hall. June 25, 12:00.


It happens that practical requirements conflict with theory, which leaves out aspects that matter for a commercial product. This talk presents the process of selecting and combining different approaches to building Causal consistency components, based on academic research and driven by the requirements of a commercial product. Attendees will learn about existing theoretical approaches to logical clocks, dependency tracking, system security, and clock synchronization, and why MongoDB settled on certain solutions.

Mikhail Tyulenev (hereinafter MT): – I will talk about Causal consistency, a feature that we worked on in MongoDB. I work in the distributed systems group; we built it about two years ago.


In the process I had to get acquainted with a large amount of academic research, because this feature is well studied. It turned out that not a single article fits what a production database requires, because of the very specific requirements that probably exist in any production application.

I will talk about how we, as consumers of academic research, prepare something from it that we can then present to our users as a ready-made dish that is convenient and safe to use.

Causal consistency. Let's define concepts

To begin with, I want to say in general terms what Causal consistency is. There are two characters - Leonard and Penny (TV series "The Big Bang Theory"):


Let's say Penny is in Europe and Leonard wants to throw a surprise party for her. He can't think of anything better than removing her from his friend list and sending an update to all his friends on the feed: “Let's make Penny happy!” (she is in Europe; while she is asleep she does not see any of this and cannot see it, because she has no access). At the last moment he deletes the post, erases it from the Feed, and restores her access, so that she notices nothing and there is no scandal.
That's all fine, but now assume the system is distributed and things went slightly wrong. It may happen, for example, that the restriction of Penny's access takes effect only after the post has already appeared, if the two events are not connected by causality. This is an example of a case where Causal consistency is required to implement a business function.

In fact, these are rather non-trivial properties of the database - very few people support them. Let's move on to models.

Consistency Models

What is a consistency model in databases? It is the set of guarantees a distributed system makes about what data the client can receive and in what order.

In principle, all consistency models boil down to how similar a distributed system is to a system that runs, for example, on a single node on a laptop: how closely a system running on thousands of geo-distributed nodes resembles a laptop, where all these properties hold automatically, in principle.

Therefore, consistency models apply only to distributed systems. All the systems that existed before and scaled vertically did not experience such problems. There was one buffer cache, and everything was always read from it.

Strong model

Actually, the very first model is Strong (or linearizability, as it is often called). This is a consistency model that ensures that every change, once it is confirmed to have happened, is visible to all users of the system.

This creates a global order of all events in the database. This is a very strong consistency property, and it is generally very expensive. However, it is very well supported. It is just very expensive and slow, so it is rarely used. This is called linearizability.

There is another, more powerful property that is supported in Spanner - it is called External Consistency. We'll talk about it a little later.

Causal

The next one is Causal, which is exactly what I was talking about. There are several more sub-levels between Strong and Causal that I won't go into, but they all boil down to Causal. This is an important model because it is the strongest of all the models that still work in the presence of network partitions.

Causal is actually a situation in which events are connected by a causal relationship. Very often it is perceived as Read Your Own Writes from the client's point of view: if a client has observed some values, it cannot then see values that lie in the past. Then prefix reads come into play, and it all comes down to the same thing.
Causal as a consistency model is a partial ordering of events on the server, in which events from all clients are observed in the same sequence. In this case, Leonard and Penny.

Eventual

The third model is Eventual Consistency. This is what absolutely all distributed systems support, the minimal model that makes sense at all. It means the following: when we have some changes to the data, at some point they become consistent.

Nothing is said about when that moment occurs; otherwise it would turn into External Consistency, which would be a completely different story. Nevertheless, this is a very popular model, the most common one. By default, all users of distributed systems get Eventual Consistency.

I want to give some comparative examples:


What do these arrows mean?

  • Latency. As the strength of consistency increases, latency grows for obvious reasons: you need to make more writes and get confirmation from all the hosts and nodes participating in the cluster that the data is already there. Accordingly, Eventual Consistency gives the fastest answer, because there, as a rule, you can even commit to memory, and in principle that is enough.
  • Availability. If we understand this as the ability of the system to respond in the presence of network breaks, partitions, or some kind of failure, then fault tolerance increases as the consistency model gets weaker, since it is enough for us that one host is alive and hands out some data. Eventual Consistency guarantees nothing at all about the data: it can be anything.
  • Anomalies. At the same time, the number of anomalies increases. In Strong Consistency they should almost never exist at all, while in Eventual Consistency they can be anything. The question arises: why do people choose Eventual Consistency if it contains anomalies? The answer is that Eventual Consistency models are still applicable: the anomalies exist, for example, only for a short period of time; it is possible to read from the master and get more or less consistent data; and it is often possible to fall back to stronger consistency models. In practice this works, and often the number of anomalies is limited in time.

CAP theorem

When you see the words consistency and availability, what comes to mind? That's right, the CAP theorem! Now I want to dispel a myth... It's not me: it is Martin Kleppmann who wrote a wonderful article and a wonderful book.


The CAP theorem is a principle formulated in the 2000s: Consistency, Availability, Partition tolerance - pick any two, you cannot have all three. It was a principle. It was proved as a theorem a few years later by Gilbert and Lynch. Then it began to be used as a mantra: systems began to be classified as CA, CP, AP, and so on.

This theorem was actually proved for the following cases... Firstly, Availability was considered not as a continuous value from zero to one hundred (0 - the system is "dead", 100 - it responds quickly; that is how we are used to thinking of it), but as a property of the algorithm which guarantees that, for all its executions, it returns data.

There is not a word about response time at all! An algorithm that returns data after 100 years is a perfectly fine available algorithm in terms of the CAP theorem.
Second: the theorem was proved for changes in the value of a single key, treated as a linearizable register. This means that in practice it is hardly applicable, because the real models are different: Eventual Consistency, Strong Consistency (maybe).

What is the point of all this? That the CAP theorem, in the form in which it has been proved, is practically inapplicable and rarely used. In its theoretical form it does limit things in some way. What remains is a certain principle that is intuitively true but, generally speaking, has not been proved in this form.

Causal consistency is the strongest model

What is happening now is that you can get all three things: Consistency and Availability even in the presence of Partitions. In particular, Causal consistency is the strongest consistency model that still works in the presence of Partitions (breaks in the network). That is why it is of such great interest, and that is why we took it up.


Firstly, it simplifies the work of application developers, in particular thanks to strong support from the server: all the writes that happen within one client are guaranteed to be seen in the same sequence by another client. Secondly, it tolerates partitions.

The inner kitchen of MongoDB

Remembering that ready-made dish, let's move to the kitchen. I will talk about the system model, namely what MongoDB is, for those who are hearing about this database for the first time.


MongoDB is a distributed system that supports horizontal scaling, that is, sharding; and within each shard it also supports data redundancy, i.e. replication.

Sharding in "MongoDB" (not a relational database) performs automatic balancing, that is, each collection of documents (or "table" in terms of relational data) into pieces, and the server automatically moves them between shards.

The Query Router, which distributes queries, is the component the client works through. It already knows where which data is located and directs every request to the correct shard.

Another important point: MongoDB is single-master. There is one Primary; it accepts the writes for the keys it owns. Multi-master writes are not possible.

We made release 4.2, and new interesting things appeared there. In particular, they put Lucene search - executable Java, essentially - directly into Mongo, and it became possible to search through Lucene there, just like in Elasticsearch.

They also made a new product, Charts; it is also available on Atlas (Mongo's own cloud). There is a Free Tier, so you can play around with it. I really liked Charts: data visualization, very intuitive.

Causal consistency ingredients

I counted about 230 articles published on this topic, starting from Leslie Lamport's. Now, from memory, I will recount some parts of these materials to you.


It all started with an article by Leslie Lamport written in the 1970s. As you can see, some research on this topic is still ongoing. Right now Causal consistency is enjoying renewed interest because of the development of distributed systems.

Restrictions

What are the restrictions? This is actually one of the main points, because the restrictions a production system imposes are very different from the restrictions that exist in academic articles. The latter are often quite artificial.


  • Firstly, "MongoDB" is a single master, as I said before (this greatly simplifies).
  • We believe that the system should support about 10 thousand shards. We cannot make any architectural decisions that will explicitly limit this value.
  • We have a cloud, but we assume that a person should still have the opportunity when he downloads the binary, runs it on his laptop, and everything works fine.
  • We assume something that is rarely used in Research: external clients can do anything. MongoDB is open source. Accordingly, clients can be so smart, angry - they may want to break everything. We hypothesize that the Byzantine Feilors may be occurring.
  • For external clients that are outside the perimeter, this is an important limitation: if this feature is disabled, then no performance degradation should be observed.
  • Another point is generally anti-academic: the compatibility of previous versions and future ones. Old drivers must support new updates, and the database must support old drivers.

In general, all this imposes restrictions.

Causal consistency components

I will now talk about some of the components. If we consider Causal consistency as a whole, we can identify blocks. We chose from works that belong to a particular block: dependency tracking, the choice of clocks, how those clocks can be synchronized with each other, and how we provide security. This is a rough plan of what I will talk about:


Full Dependency Tracking

Why is it needed? So that when the data is replicated, every record, every data change carries information about which changes it depends on. The very first and naive approach is for each message containing a record to carry information about the previous messages:


In this example, the numbers in curly brackets are record numbers. Sometimes these records are even transferred in their entirety together with their values, sometimes only some versions are transferred. The bottom line is that each change carries information about the previous ones (obviously, it carries all of it within itself).

Why did we decide not to use this approach (full tracking)? Obviously, because it is impractical: any change in a social network depends on all previous changes in that social network, so every update would have to carry, say, all of "Facebook" or "VKontakte" with it. Nevertheless, there is a lot of research on Full Dependency Tracking, mostly predating social networks; for some situations it really works.

Explicit Dependency Tracking

The next approach is more limited. Here too information is transferred, but only the explicit dependencies. What depends on what is usually determined by the application. When data is replicated, a query returns a response only once the previous dependencies have been satisfied, that is, made visible. This is the essence of how Causal consistency works.


The system sees that record 5 depends on records 1, 2, 3, 4, so it waits until all those previous changes have made it into the database before the client can see the change to Penny's access.

This also does not suit us, because there is still too much information, and it would slow things down. There is another approach...

Lamport Clock

These are very old. The Lamport Clock means that these dependencies are folded into a scalar function, which is called a Lamport clock.

A scalar function is just an abstract number, often referred to as logical time. This counter is incremented with each event, and the counter currently known to a process is sent along with every message. It is clear that processes can be out of sync and can have completely different times. Nevertheless, through such an exchange of messages the system somehow balances the clocks. What happens in this case?
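To make this concrete, here is a minimal Lamport clock sketch in Python; the names are illustrative and this is not MongoDB code. Each process keeps a scalar counter, advances it on every local event, attaches it to every outgoing message, and on receipt jumps ahead of the largest value it has seen.

```python
class LamportClock:
    """Scalar logical clock: a counter that only moves forward."""

    def __init__(self):
        self.time = 0

    def tick(self):
        # A local event (e.g. a write) advances logical time.
        self.time += 1
        return self.time

    def send(self):
        # Every outgoing message carries the counter known to the sender.
        return self.tick()

    def receive(self, remote_time):
        # On receipt, jump ahead of whatever the sender has already seen.
        self.time = max(self.time, remote_time) + 1
        return self.time


friends, feed = LamportClock(), LamportClock()
t_unfriend = friends.send()      # Friends: Penny is removed, logical time 1
feed.receive(t_unfriend)         # Feed learns about it and jumps past 1
t_post = feed.send()             # the "party" post gets a larger timestamp
assert t_post > t_unfriend       # A happened-before B  =>  LT(A) < LT(B)
```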

I split that big shard in two to make it clearer: Friends can live on one node that contains a piece of one collection, and Feed can live on another node that contains a piece of another collection. Do you see how they can get out of order? Feed will say "Replicated" first, and only then Friends. If the system does not provide a guarantee that the Feed will not be shown until the dependencies in the Friends collection have also been delivered, then we will get exactly the situation I mentioned.

You can see how the counter time on the Feed increases logically:


So the main property of the Lamport Clock and of Causal consistency (explained via the Lamport Clock) is this: if we have events A and B, and event B depends on event A*, then the LogicalTime of event A is less than the LogicalTime of event B.

* Sometimes people also say that A happened-before B, that is, A happened before B: this relation partially orders the entire set of events that ever happened.

The converse is not true. This is actually one of the main disadvantages of the Lamport Clock: it gives only a partial order. There is a notion of concurrent events, that is, events for which neither (A happened before B) nor (B happened before A) holds. An example is the parallel addition of someone else as a friend (not even of Leonard, but of Sheldon, for example).
This is the property that is often used when working with Lamport clocks: you look at the values and conclude that maybe these events are dependent. Because one direction does hold: if LogicalTime A is less than LogicalTime B, then B cannot have happened before A; and if it is greater, then maybe it did.

Vector Clock

The logical development of the Lamport clock is the vector clock. The difference is that each node keeps its own separate counter, and the counters are transmitted together as a vector.
In this case, you can see that index zero of the vector is responsible for Feed and index one for Friends (one component per node). And now they will increase: the zero index (Feed's) is incremented on a write - 1, 2, 3:
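As an illustration, here is a toy two-component vector clock (index 0 for Feed, index 1 for Friends, as on the slide); this is a generic sketch, not anything taken from MongoDB.

```python
class VectorClock:
    """One counter per node; the whole vector travels with every message."""

    def __init__(self, node_id, size):
        self.node_id = node_id
        self.clock = [0] * size

    def tick(self):
        # A local event advances only this node's own component.
        self.clock[self.node_id] += 1
        return list(self.clock)

    def receive(self, remote):
        # Component-wise maximum, then advance our own component.
        self.clock = [max(a, b) for a, b in zip(self.clock, remote)]
        self.clock[self.node_id] += 1

    @staticmethod
    def concurrent(a, b):
        # Neither happened-before the other: the vectors are incomparable.
        return (any(x < y for x, y in zip(a, b)) and
                any(x > y for x, y in zip(a, b)))


feed, friends = VectorClock(0, 2), VectorClock(1, 2)
a = friends.tick()                    # [0, 1]: Penny removed in Friends
b = feed.tick()                       # [1, 0]: an unrelated Feed event
print(VectorClock.concurrent(a, b))   # True: concurrency is now detectable
```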


Why is the Vector Clock better? Because it allows you to figure out which events are concurrent and when they occur on different nodes. This is very important for a sharded system like MongoDB. However, we did not choose it, even though it is a wonderful thing, it works great, and it would probably have suited us...

If we have 10 thousand shards, we cannot transfer 10 thousand components; even if we compress them or come up with something else, the useful payload will still be many times smaller than the volume of this entire vector. So, gritting our teeth with a heavy heart, we abandoned this approach and moved on to another.

Spanner TrueTime. Atomic clocks

I said that there would be a story about Spanner. This is a cool thing, straight out of the 21st century: atomic clocks, GPS synchronization.

What's the idea? Spanner is a Google system that has recently even become available to ordinary people (they attached SQL to it). Every transaction there gets a timestamp. Since the time is synchronized*, each event can be assigned a specific time: the atomic clocks give a wait interval, after which a different time is guaranteed to have "happened".


Thus, simply by writing to the database and waiting out that interval, serializability of the event is automatically guaranteed. They have the strongest consistency model imaginable in principle: External Consistency.

* This is the main problem with Lamport clocks: they are never synchronized across distributed systems. They can diverge; even with NTP they still don't work very well. Spanner has atomic clocks, and the synchronization, it seems, is down to microseconds.

Why didn't we choose it? We do not assume that our users have atomic clocks built in. When they appear built into every laptop and there is some super-cool GPS synchronization, then yes... For now the best that is available is Amazon and base stations, for fanatics... So we used other clocks.

Hybrid Clock

This is what actually ticks in MongoDB and ensures Causal consistency. Why are they hybrid? A hybrid clock is a scalar value, but it has two components:


  • The first is the Unix epoch time (how many seconds have passed since the "beginning of the computer world").
  • The second is an increment, also a 32-bit unsigned int.

That is, in fact, all. The approach is this: the part responsible for time is kept in sync with the wall clock; every time an update occurs, that part is synchronized with the clock, so the time is always more or less correct, and the increment makes it possible to distinguish events that happened at the same moment.
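A rough sketch of how such a hybrid value can tick; this is the assumed logic of a hybrid logical clock, not the actual MongoDB implementation. The high part follows the wall clock, and the increment disambiguates events that land in the same second.

```python
import time


class HybridClock:
    """(seconds since the Unix epoch, 32-bit increment) packed into one 64-bit value."""

    def __init__(self):
        self.seconds = 0
        self.counter = 0

    def now(self):
        wall = int(time.time())
        if wall > self.seconds:
            # Physical time moved forward: take it and reset the increment.
            self.seconds, self.counter = wall, 0
        else:
            # Same (or lagging) second: only the increment separates events.
            self.counter += 1
        return (self.seconds << 32) | self.counter

    def observe(self, remote):
        # Never move backwards relative to a value seen in a message.
        r_sec, r_cnt = remote >> 32, remote & 0xFFFFFFFF
        if (r_sec, r_cnt) > (self.seconds, self.counter):
            self.seconds, self.counter = r_sec, r_cnt


clock = HybridClock()
t1, t2 = clock.now(), clock.now()
assert t2 > t1    # two events in the same second still get a total order
```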

Why is this important for MongoDB? Because it lets you do things like point-in-time backup restore: events are indexed by time. This matters whenever particular events are needed; for the database, events are the changes that occurred in certain intervals, at certain points in time.

The biggest reason I will tell only you (please don't tell anyone)! We did it this way because this is what ordered, indexed data looks like in the MongoDB OpLog. The OpLog is a data structure that contains absolutely all changes in the database: they first go into the OpLog and are then applied to Storage itself, in the case of a replica set or a shard.

That was the main reason. Still, there are also practical requirements for developing the database, which means it should be simple: little code, as few broken things as possible that need to be rewritten and tested. The fact that our oplogs were already indexed by hybrid clocks helped a lot and allowed us to make the right choice. It really paid off and somehow magically worked on the very first prototype. It was very cool!

Clock synchronization

Several synchronization methods are described in the scientific literature. I am talking about synchronization when we have two different shards. Within a single replica set no synchronization is needed: it is a "single master"; we have an OpLog into which all changes go, so everything is already sequentially ordered within the OpLog itself. But if we have two different shards, time synchronization is important. This is where a vector clock would have helped more! But we don't have one.


The second option is heartbeats. It is possible to exchange signals every unit of time. But heartbeats are too slow; we would not be able to give our clients good latency.

True Time is, of course, a wonderful thing. But, again, that is probably the future... Although it can already be done in Atlas: there are already fast "Amazon" time synchronizers. But it won't be available to everyone.

Gossiping is when all messages include the time. This is roughly what we use. Every message between nodes - a driver, a data node, a router, absolutely every element and component of MongoDB - carries a clock value that flows along with it. The hybrid time value is everywhere, and it is transmitted. 64 bits? The bandwidth allows it, it is possible.
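Conceptually, gossiping the clock means that every message in either direction piggybacks the highest time its sender has seen, and the receiver merges it in. A hypothetical sketch (the $clusterTime field name mirrors MongoDB's wire protocol; the rest is illustrative):

```python
class Node:
    def __init__(self):
        self.cluster_time = 0          # highest hybrid time this node has seen

    def outgoing(self, message):
        # Every message (query, reply, heartbeat) piggybacks the clock value.
        message["$clusterTime"] = self.cluster_time
        return message

    def incoming(self, message):
        # The receiver never lets its notion of time move backwards.
        self.cluster_time = max(self.cluster_time,
                                message.get("$clusterTime", 0))


router, shard = Node(), Node()
shard.cluster_time = 42
reply = shard.outgoing({"ok": 1})
router.incoming(reply)                 # the router now knows about time 42
```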

How does it all work together?

Here I consider a single replica set to make it a bit simpler. There is a Primary and a Secondary. The Secondary does replication and is not always fully in sync with the Primary.

An insert happens on the Primary with some time value. The insert advances the internal counter to 11 if that is the maximum, or it checks the clock value and syncs to the clock if the clock value is greater. This is what makes it possible to order by time.

After the write, an important moment occurs. The clock in MongoDB is incremented only on a write to the OpLog. That is the event that changes the state of the system. In absolutely all classical articles, an event is considered to be a message arriving at a node: a message has arrived, which means the system has changed its state.

This is because during research it is not really possible to understand how a message will be interpreted. We know for sure that if it is not reflected in the OpLog, then it will not be interpreted in any way, and only an entry in the OpLog changes the state of the system. This simplifies everything for us: it simplifies the model, allows ordering within a single replica set, and gives many other useful things.

The value that has been written to the OpLog is returned: we know that this value is already in the OpLog and its time is 12. Now, say, a read starts from another node (a Secondary), and it passes afterClusterTime in the message. It says: "I want everything that happened at least after 12, or at twelve."

This is what is called a causally consistent cut. In theory there is such a concept: a slice of time that is consistent within itself. In this case we can say it is the state of the system as observed at time 12.

Right now there is nothing there yet, because this essentially mimics the situation when you are waiting for the Secondary to replicate data from the Primary. It waits... And now the data has arrived, and it returns these values back.
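A simplified sketch of what the Secondary conceptually does with afterClusterTime (illustrative logic, not the server code): it does not answer until its own applied oplog time has caught up with the requested value.

```python
import threading


class Secondary:
    def __init__(self):
        self.applied_time = 0                 # time of the last applied oplog entry
        self.store = {}
        self._caught_up = threading.Condition()

    def apply_oplog_entry(self, entry_time, key, value):
        # Replication applies entries in order and advances the applied time.
        with self._caught_up:
            self.store[key] = value
            self.applied_time = entry_time
            self._caught_up.notify_all()

    def read(self, key, after_cluster_time):
        # "I want everything that happened at or after time T", so wait for T.
        with self._caught_up:
            self._caught_up.wait_for(
                lambda: self.applied_time >= after_cluster_time)
            return self.store.get(key)


sec = Secondary()
sec.apply_oplog_entry(12, "penny_access", "restricted")
print(sec.read("penny_access", after_cluster_time=12))   # returns immediately
```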


That's pretty much how it all works. Almost.

What does "almost" mean? Let's suppose that there is some person who has read and understood how it all works. I realized that every time ClusterTime happens, it updates the internal logical clock, and then the next record increases by one. This function takes 20 lines. Let's say this person transmits the largest possible 64-bit number, minus one.

Why "minus one"? Because the internal clock will be substituted into this value (obviously, this is the largest possible and more than the current time), then an entry will occur in the "Oplog", and the clock will be incremented by one more - and there will already be a maximum value in general (there are just all units, there is nowhere further , unsaint ints).

It is clear that after that the system becomes absolutely unusable for anything. It can only be unloaded and cleaned by hand, a lot of manual work. Complete unavailability:


Moreover, if this gets replicated somewhere else, the entire cluster simply goes down. An absolutely unacceptable situation that anyone can arrange very quickly and easily! So we considered this moment one of the most important. How do we prevent it?
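To make the failure mode concrete, here is a toy illustration under the simplified tick rule sketched earlier (adopt the largest time ever seen, then add one on a write); the numbers and the function are purely illustrative.

```python
MAX_U64 = 2**64 - 1


def tick_on_write(local_time, gossiped_time):
    # Simplified rule: adopt the largest time ever seen, then add one for the write.
    advanced = max(local_time, gossiped_time)
    if advanced + 1 > MAX_U64:
        raise OverflowError("cluster time exhausted: the node can no longer write")
    return advanced + 1


t = tick_on_write(12, MAX_U64 - 1)     # a malicious client gossips max minus one
print(t == MAX_U64)                    # True: the clock is now saturated
try:
    tick_on_write(t, 0)                # the very next write has nowhere to go
except OverflowError as err:
    print(err)
```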

Our way is to sign clusterTime

This is how it is transmitted in the message (before the blue text). But we also began to generate a signature (blue text):


The signature is generated with a key that is stored inside the database, inside the secure perimeter; it is generated and rotated automatically (users do not see it). A hash is produced, and every message is signed when created and validated when received.
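As a sketch of the idea only (not MongoDB's actual key management or hash choice), signing and validating a cluster time with a keyed HMAC could look like this:

```python
import hashlib
import hmac
import os

SECRET_KEY = os.urandom(32)   # kept inside the perimeter, never shown to clients


def sign_cluster_time(cluster_time: int) -> bytes:
    payload = cluster_time.to_bytes(8, "big")
    return hmac.new(SECRET_KEY, payload, hashlib.sha256).digest()


def validate(cluster_time: int, signature: bytes) -> bool:
    # A client cannot forge a larger time without knowing the key.
    return hmac.compare_digest(sign_cluster_time(cluster_time), signature)


t = 12
sig = sign_cluster_time(t)
assert validate(t, sig)
assert not validate(2**64 - 2, sig)    # a spoofed huge time fails verification
```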
People probably have a question: "How much does this slow things down?" I told you that it should work fast, especially in the absence of this feature.

What does using Causal consistency mean in this case? It means passing the afterClusterTime parameter. Without it, the clock values are simply passed along anyway: gossiping, since version 3.6, is always on.

If we kept constantly generating signatures, it would slow down the system even when the feature is not used, which does not match our approach and requirements. So what did we do?

Do it fast!

It is a fairly simple thing, but an interesting trick. I will share it; maybe someone will find it interesting.
We have a cache that stores the signed data. All data goes through the cache. The cache signs not the specific time but a Range. When some value comes in, we compute its Range by masking the last 16 bits, and we sign that value:


With such a signature we speed up the system (conditionally) by roughly 65 thousand times. It works great: in experiments, the time was actually reduced by a factor of 10 thousand when we have sequential updates. Clearly, when the times are scattered this does not work, but in most practical cases it does. The combination of the Range signature and caching solved the security problem.
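A sketch of the trick with an assumed cache layout: mask off the low 16 bits so that all times in the same range share a single signature, and cache that signature so the HMAC is computed once per range instead of once per value.

```python
import hashlib
import hmac
import os

SECRET_KEY = os.urandom(32)
RANGE_MASK = ~0xFFFF & (2**64 - 1)      # drop the last 16 bits of the time
_signature_cache = {}


def sign_range(cluster_time: int) -> bytes:
    range_start = cluster_time & RANGE_MASK
    cached = _signature_cache.get(range_start)
    if cached is None:
        payload = range_start.to_bytes(8, "big")
        cached = hmac.new(SECRET_KEY, payload, hashlib.sha256).digest()
        _signature_cache[range_start] = cached   # 65 536 consecutive times reuse it
    return cached


assert sign_range(0x10000) is sign_range(0x1FFFF)       # same range, one HMAC computed
assert sign_range(0x20000) is not sign_range(0x10000)   # a new range gets its own signature
```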

What have we learned?

Lessons we learned from this:

  • You need to read materials, stories, articles, because there is a lot of interesting stuff out there. When we work on some feature (especially now, when we have done transactions, etc.), we need to read and understand. It takes time, but it is actually very useful because it makes clear where we stand. We didn't invent anything new, we just took the ingredients.

    In general, there is a certain difference in mindset at an academic conference (SIGMOD, for example): everyone is focused on new ideas. What is the novelty of our algorithm? There is nothing especially new here. Rather, the novelty lies in how we blended existing approaches together. Therefore, the first thing is to read the classics, starting with Lamport.

  • In production, the requirements are completely different. I am sure that many of you are not dealing with "spherical" databases in an abstract vacuum, but with normal, real things that have availability, latency, and fault tolerance problems.
  • The last thing is that we had to consider different ideas and combine several completely different articles into a single approach. The idea of signing, for example, came from an article that considered the Paxos protocol: for non-Byzantine failures it is inside the authorization protocol, and for Byzantine ones it is outside the authorization protocol... In general, this is exactly what we ended up doing.

    There is absolutely nothing new here! But as soon as we mixed it all together... It's like saying that the recipe for Olivier salad is nonsense, because eggs, mayonnaise, and cucumbers have already been invented... It's about the same story.


I'll finish with this. Thank you!

Questions

Question from the audience (hereinafter Q): – Thank you, Mikhail, for the talk! The topic of time is interesting. You use gossiping. It was said that everyone has their own time, everyone knows their local time. As I understand it, we have a driver; there can be many clients with drivers, query planners too, and many shards as well... What does the system slide into if we suddenly get a discrepancy: someone decides they are a minute ahead, someone a minute behind? Where do we end up?

MT: – Great question indeed! I just want to say a word about shards. If I understand the question correctly, we have the following situation: there is shard 1 and shard 2, reads happen from both shards, and they have a discrepancy; they do not interact with each other, because the times they know are different, especially the times they have in their oplogs.
Let's say shard 1 has made a million records, shard 2 has done nothing at all, and the request comes to both shards. The first one has afterClusterTime over a million. In such a situation, as I explained, shard 2 would never respond at all.

Q: – I wanted to know how they synchronize and choose one logical time.

MT: – Very easy to synchronize. When afterClusterTime comes to a shard and it does not find that time in its oplog, it initiates a no-op write. That is, it manually raises its time to that value. This means it has no events matching the request; it creates such an event artificially and thereby becomes causally consistent.
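Roughly, the mechanism looks like this (a sketch of the logic just described, not the server implementation):

```python
class Shard:
    def __init__(self):
        self.oplog = []            # list of (time, operation) pairs
        self.latest_time = 0

    def append_noop(self, cluster_time):
        # An artificial, empty event that only carries the requested time.
        self.oplog.append((cluster_time, {"op": "noop"}))
        self.latest_time = cluster_time

    def read(self, after_cluster_time):
        if self.latest_time < after_cluster_time:
            # No real event matches the request: create one by hand so the
            # shard catches up and becomes causally consistent with the caller.
            self.append_noop(after_cluster_time)
        return [op for t, op in self.oplog if t >= after_cluster_time]


idle_shard = Shard()                                   # "shard 2 did nothing at all"
print(idle_shard.read(after_cluster_time=1_000_000))   # answers instead of hanging forever
```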

Q: – And what if, after that, some events arrive that were lost somewhere in the network?

MT: – The shard is arranged so that they will not arrive, since it is a single master. If it has already written this down, they will not arrive any more; they will come after. It cannot happen that something gets stuck somewhere, then it does the no-op write, and then those events arrive and Causal consistency is violated. When it does the no-op write, they should all come afterwards (it will wait for them).


Q: – I have a few questions about queues. Causal consistency assumes there is a certain queue of actions that need to be performed. What happens if one of our packets disappears? Here comes the 10th, the 11th... The 12th is gone, and everyone else is waiting for it to arrive. And suddenly our machine died; we can't do anything. Is there a maximum queue length that accumulates before it gets executed? What fatal failure occurs when any single state is lost? Moreover, if we record that there is some previous state, then we should somehow be able to start from it? But it was never pushed to us!

MT: – That's a great question too! What do we do? MongoDB has the concept of quorum writes and quorum reads. In what cases can a message disappear? When a write is not quorum or when a read is not quorum (some kind of garbage can also stick around).
Regarding Causal consistency, a large experimental check was performed, and the result was that when writes and reads are not quorum, Causal consistency violations do occur. Exactly what you describe!

Our advice is to use at least quorum reads when using Causal consistency. In that case nothing will be lost, even if the quorum write is lost... This is an orthogonal issue: if the user does not want data to be lost, a quorum write should be used. Causal consistency does not guarantee durability. Durability is guaranteed by replication and the machinery associated with replication.

Q: – When we create an instance that performs sharding for us (not a master but a slave, respectively), does it rely on the Unix time of its own machine or on the time of the "master"? Does it synchronize once at the start or periodically?

MT: – Let me clarify. A shard (i.e. a horizontal partition) always has a Primary. A shard can have a "master" and replicas. But the shard is always writable, because it must support some domain of keys (the shard has a Primary).

Q: – So it all depends purely on the "master"? Is the "master" time always used?

MT: – Yes. Figuratively speaking, the clock ticks when a record is written on the "master", into the OpLog.

Q: – We have a client that connects, and it does not need to know anything about the time?

MT: – You really don't need to know anything! If we talk about how it works on the client: when the client wants to use Causal consistency, it needs to open a session. Now everything is there: transactions in the session, retryable writes... A session is an ordering of the logical events that happen with a client.

If it opens such a session and says there that it wants Causal consistency (or if the session supports Causal consistency by default), everything works automatically. The driver remembers this time and increments it when it receives a new message. It remembers what the previous response from the server that handed out the data returned. The next request will contain afterClusterTime ("time greater than this").

The client does not need to know absolutely anything! It is completely transparent to it. If people use these features, what does it give them? First, you can safely read from secondaries: you can write to the Primary and read from geographically replicated secondaries and be sure it works. At the same time, sessions that recorded on the Primary can even be transferred to a Secondary, that is, you can use not one session but several.
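For example, with the Python driver all of this is just a session. A typical usage sketch (assuming a reachable replica set; PyMongo exposes causal_consistency on start_session, and the driver tracks operationTime/clusterTime and sends afterClusterTime by itself; the database, collection, and connection string here are illustrative):

```python
import pymongo
from pymongo.read_concern import ReadConcern
from pymongo.write_concern import WriteConcern

client = pymongo.MongoClient("mongodb://localhost:27017")   # illustrative connection string
feed = client.social.feed.with_options(
    read_concern=ReadConcern("majority"),     # quorum reads, as advised above
    write_concern=WriteConcern("majority"))

# Opening a session is all the application has to do; causal consistency is
# handled by the driver behind the scenes.
with client.start_session(causal_consistency=True) as session:
    feed.insert_one({"user": "leonard", "post": "Surprise party!"}, session=session)

    # This read may go to a geographically replicated secondary and is still
    # guaranteed to observe the insert above.
    secondary_feed = feed.with_options(
        read_preference=pymongo.ReadPreference.SECONDARY)
    doc = secondary_feed.find_one({"user": "leonard"}, session=session)
    print(doc)
```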

Q: – A new layer of computer science is strongly connected with the topic of Eventual consistency: CRDTs (Conflict-free Replicated Data Types). Have you considered integrating these data types into the database, and what can you say about it?

MT: – Good question! CRDTs make sense when there are write conflicts; MongoDB has a single master, so there are none.

Q: – I have a question from devops. In the real world there are insidious situations when a Byzantine failure occurs, and malicious people inside the protected perimeter start poking into the protocol and sending specially crafted packets?


MT: – Malicious people inside the perimeter are like a Trojan horse! Malicious people inside the perimeter can do a lot of bad things.

Q: – It's clear that leaving a hole in the server through which, roughly speaking, you can shove a zoo of elephants and collapse the entire cluster forever... Recovering manually would take time... This, to put it mildly, is wrong. On the other hand, here is what is curious: in real life, in practice, do such internal attacks actually happen?

MT: – Since I rarely encounter security breaches in real life, I can't say whether they happen. But if we talk about the development philosophy, we think of it this way: we have a perimeter provided by the people who do security; it is a castle, a wall; and inside the perimeter you can do whatever you want. It is clear that there are users who can only view and users who can delete the directory.

Depending on the rights, the damage users can do might be a mouse, or it might be an elephant. It is clear that a user with full rights can do anything at all. A user with limited rights can do much less harm. In particular, it cannot break the system.

Q: – Inside the protected perimeter, someone got the idea to send unexpected protocols to the server in order to bring the server to its knees, and if you're lucky, the entire cluster... Can it be that "good"?

MT: – I have never heard of such things. The fact that you can bring a server down this way is no secret. Bringing it down from inside, from within the protocol, as an authorized user who can write something like that into a message... in fact it is impossible, because it will still be verified. It is possible to disable this authentication for users who do not want it, but then it is their problem; roughly speaking, they tore down the walls themselves, and you can shove an elephant in there that will crush everything... In general, you can dress up as a repairman, walk in, and pull everything out!

Q: – Thank you for the talk. Sergey (Yandex). Mongo has a constant that limits the number of voting members in a Replica Set, and this constant is 7 (seven). Why is it a constant? Why isn't it a parameter of some kind?

MT: – We also have Replica Sets with 40 nodes. There is always a majority. I don't know which version...

Q: – In a Replica Set you can run non-voting members, but the voting members are limited to 7. How, in that case, do you survive a shutdown if our Replica Set is stretched across 3 data centers? One data center can easily shut down and another machine can fall out.

MT: – This is already a bit beyond the scope of the report. This is a general question. Maybe I can tell you later.



Source: habr.com
