Five students and three distributed key-value stores

Or how we wrote the C++ client library for ZooKeeper, etcd and Consul KV

In the world of distributed systems, there are a number of typical tasks: storing information about the composition of the cluster, managing the configuration of nodes, detecting failed nodes, electing a leader, and so on. To solve these problems, special distributed systems - coordination services - have been created. Here we will be interested in three of them: ZooKeeper, etcd and Consul. Of all the rich functionality of Consul, we will focus on Consul KV.

In fact, all these systems are fault-tolerant linearizable key-value stores. Although their data models have significant differences, which we will discuss later, they allow us to solve the same practical problems. Obviously, each application that uses a coordination service is tied to one of them, which can lead to having to maintain, in one data center, several systems that solve the same tasks for different applications.

The idea of solving this problem originated at an Australian consulting agency, and it fell to us - a small team of students - to implement it. That is what I am going to talk about.

We managed to create a library that provides a common interface for working with ZooKeeper, etcd and Consul KV. The library is written in C++, but there are plans to port it to other languages.

Data models

To develop a common interface for three different systems, you need to understand what they have in common and how they differ. Let's figure it out.

ZooKeeper

The keys are organized into a tree and are called nodes (znodes). Accordingly, for a node you can get a list of its children. The operations of creating a znode (create) and changing a value (setData) are separated: you can read and change values only for existing keys. Watches can be attached to the operations of checking the existence of a node, reading a value, and getting children. A watch is a one-time trigger that fires when the version of the corresponding data on the server changes. Ephemeral nodes are used to detect failures. They are bound to the session of the client that created them. When the client closes the session or stops notifying ZooKeeper of its existence, these nodes are automatically deleted. Simple transactions are supported: a set of operations that either all succeed or all fail if at least one of them cannot be applied.
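
As a small illustration of ephemeral nodes and watches, here is a sketch using the official ZooKeeper C client (the paths, values and server address are made up for the example, the parent node is assumed to exist, and error handling is omitted):

    #include <zookeeper/zookeeper.h>
    #include <cstdio>

    // One-time watch callback: fires once when the watched node is created,
    // changed or deleted, e.g. when the owning session dies and the
    // ephemeral node disappears.
    void on_worker_event(zhandle_t*, int type, int state, const char* path, void*) {
        std::printf("event type=%d state=%d path=%s\n", type, state, path ? path : "");
    }

    int main() {
        zhandle_t* zh = zookeeper_init("127.0.0.1:2181", nullptr, 10000,
                                       nullptr, nullptr, 0);

        // An ephemeral node lives only as long as the session that created it
        // (assuming "/workers" already exists).
        zoo_create(zh, "/workers/worker-1", "alive", 5,
                   &ZOO_OPEN_ACL_UNSAFE, ZOO_EPHEMERAL, nullptr, 0);

        // A client can attach a one-time watch to the existence check.
        struct Stat stat;
        zoo_wexists(zh, "/workers/worker-1", on_worker_event, nullptr, &stat);

        zookeeper_close(zh);
    }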

etcd

The developers of this system were clearly inspired by ZooKeeper, and therefore did everything differently. There is no hierarchy of keys here, but they form a lexicographically ordered set. You can get or remove all keys belonging to some range. Such a structure may seem strange, but it is actually very expressive, and a hierarchical view is easily emulated through it.

Etcd does not have a standard compare-and-set operation, but there is something better: transactions. Of course, they exist in all three systems, but transactions in etcd are especially good. They consist of three blocks: check, success, failure. The first block contains a set of conditions, the second and third contain operations. The transaction is executed atomically: if all conditions are true, the success block is executed, otherwise the failure block. In API version 3.3, success and failure blocks can contain nested transactions, so you can atomically execute conditional constructs of almost arbitrary nesting depth. To learn more about the available checks and operations, see the documentation.
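
To make the check/success/failure structure concrete, here is a rough sketch of such a transaction built directly from the messages generated from etcd's rpc.proto (the header name and the key "/foo" are assumptions of this example, not liboffkv code): it creates a key only if it does not exist yet and reads the current value otherwise.

    #include <grpcpp/grpcpp.h>
    #include "rpc.grpc.pb.h"   // generated from etcd's rpc.proto; the file name may differ

    using etcdserverpb::Compare;
    using etcdserverpb::RequestOp;
    using etcdserverpb::TxnRequest;

    // If the key "/foo" does not exist yet (its create revision is 0),
    // create it; otherwise read the current value.
    TxnRequest make_create_or_get() {
        TxnRequest txn;

        Compare* cmp = txn.add_compare();
        cmp->set_key("/foo");
        cmp->set_target(Compare::CREATE);     // compare the create revision...
        cmp->set_result(Compare::EQUAL);      // ...for equality...
        cmp->set_create_revision(0);          // ...with 0, i.e. "does not exist"

        RequestOp* put = txn.add_success();   // executed if the check passes
        put->mutable_request_put()->set_key("/foo");
        put->mutable_request_put()->set_value("bar");

        RequestOp* get = txn.add_failure();   // executed otherwise
        get->mutable_request_range()->set_key("/foo");

        return txn;
    }

    // The request is then sent through the generated stub, roughly:
    //   grpc::ClientContext ctx;
    //   etcdserverpb::TxnResponse resp;
    //   kv_stub->Txn(&ctx, make_create_or_get(), &resp);
    //   // resp.succeeded() tells which branch was executed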

Watches also exist here, although they are a bit more complex and are reusable. That is, after setting a watch on a range of keys, you will receive all updates in this range, not just the first one, until you cancel the watch. In etcd, the equivalent of ZooKeeper client sessions are leases.

Consul KV

Here, too, there is no strict hierarchical structure, but Consul can create the appearance that one exists: you can get and delete all keys with a specified prefix, that is, work with a "subtree" of the key. Such queries are called recursive. In addition, Consul can select only the keys that do not contain the specified character after the prefix, which corresponds to getting the immediate "children". But it is worth remembering that this is precisely the appearance of a hierarchical structure: it is quite possible to create a key whose parent does not exist, or to delete a key that has children; the children will continue to be stored in the system.

Instead of watches, Consul has blocking HTTP queries. In fact, these are ordinary calls to the data reading method for which, along with the rest of the parameters, the last known version of the data is specified. If the current version of the corresponding data on the server is greater than the specified one, the response is returned immediately; otherwise it is returned when the value changes. There are also sessions that can be attached to keys at any time. It is worth noting that, unlike etcd and ZooKeeper, where deleting a session leads to the deletion of the associated keys, there is a mode in which the session is simply detached from them. Transactions are available, without branches, but with all sorts of checks.
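
As an illustration, a blocking read through Consul's HTTP API can be sketched with libcurl roughly as follows (the address, key and index value are made up; in real code the index is taken from the X-Consul-Index response header of the previous read, and adding the recurse parameter would fetch the whole prefix):

    #include <curl/curl.h>
    #include <iostream>
    #include <string>

    // Collect the response body into a std::string.
    static size_t write_cb(char* data, size_t size, size_t nmemb, void* userp) {
        static_cast<std::string*>(userp)->append(data, size * nmemb);
        return size * nmemb;
    }

    int main() {
        const std::string key = "service/config";   // hypothetical key
        long long last_index = 42;                   // from X-Consul-Index

        curl_global_init(CURL_GLOBAL_DEFAULT);
        CURL* curl = curl_easy_init();

        // A blocking query: the server holds the request (up to `wait`)
        // until the data's index becomes greater than `index`.
        std::string url = "http://127.0.0.1:8500/v1/kv/" + key +
                          "?index=" + std::to_string(last_index) + "&wait=30s";
        std::string body;
        curl_easy_setopt(curl, CURLOPT_URL, url.c_str());
        curl_easy_setopt(curl, CURLOPT_WRITEFUNCTION, write_cb);
        curl_easy_setopt(curl, CURLOPT_WRITEDATA, &body);

        if (curl_easy_perform(curl) == CURLE_OK)
            std::cout << body << std::endl;   // JSON array with the key's data
        // (reading the new X-Consul-Index would need a header callback)

        curl_easy_cleanup(curl);
        curl_global_cleanup();
    }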

Putting it all together

ZooKeeper has the strictest data model. The expressive range queries available in etcd cannot be effectively emulated in either ZooKeeper or Consul. Trying to take the best from all the services, we got an interface almost equivalent to the ZooKeeper interface with the following significant exceptions:

  • sequence, container and TTL nodes are not supported
  • ACLs are not supported
  • the set method creates the key if it does not exist (in ZK setData returns an error in this case)
  • set and cas methods are separated (in ZK they are essentially the same)
  • the erase method deletes the node together with its subtree (in ZK delete returns an error if the node has children)
  • there is only one version for each key, the value version (in ZK there are three of them)

The rejection of sequential nodes is due to the fact that etcd and Consul do not have built-in support for them, and on top of the resulting library interface, they can be easily implemented by the user.

Implementing behavior similar to ZooKeeper's when removing a node would require maintaining a separate child counter for each key in etcd and Consul. Since we tried to avoid storing meta-information, it was decided to delete the entire subtree.
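
To give a feel for the result, here is a rough sketch of what a client interface with these restrictions might look like. It is purely illustrative: the names, signatures and the lease flag are assumptions of this example, not the actual liboffkv API.

    #include <cstdint>
    #include <string>
    #include <vector>

    // Illustrative only: not the real liboffkv interface.
    struct Value {
        std::string data;
        int64_t version;   // the single per-key version mentioned above
    };

    class Client {
    public:
        // Creates the key; fails if it already exists. `lease` makes it
        // ephemeral, i.e. tied to the client session (like ZK ephemeral nodes).
        virtual void create(const std::string& key, const std::string& value,
                            bool lease = false) = 0;

        // Unlike ZooKeeper's setData, creates the key if it does not exist.
        virtual int64_t set(const std::string& key, const std::string& value) = 0;

        // Compare-and-set: writes only if the stored version equals `version`.
        virtual bool cas(const std::string& key, const std::string& value,
                         int64_t version) = 0;

        virtual Value get(const std::string& key) = 0;
        virtual std::vector<std::string> get_children(const std::string& key) = 0;

        // Deletes the key together with its whole subtree.
        virtual void erase(const std::string& key) = 0;

        virtual ~Client() = default;
    };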

Implementation details

Let us consider in more detail some aspects of the implementation of the library interface in different systems.

Hierarchy in etcd

Maintaining a hierarchical view in etcd turned out to be one of the most interesting tasks. Range queries make it easy to get a list of keys with a given prefix. For example, if you want everything that starts with "/foo", you ask for the range ["/foo", "/fop"). But this would return the entire subtree of the key, which might be unacceptable if the subtree is large. At first we planned to use the key conversion mechanism implemented in zetcd. It involves prepending to the key a single byte equal to the depth of the node in the tree. For example:

"/foo" -> "u01/foo"
"/foo/bar" -> "u02/foo/bar"

Then getting all immediate children of the key "/foo" is possible by requesting the range ["\u02/foo/", "\u02/foo0"). Yes, in ASCII "0" comes right after "/".

But how do we implement removal of a node in this case? It turns out that you need to delete all ranges of the form ["\uXX/foo/", "\uXX/foo0") for XX from 01 to FF. And here we ran into the limit on the number of operations within a single transaction.

As a result, a simple key conversion system was invented, which made it possible to efficiently implement both deleting a key and obtaining a list of children. It is enough to add a special character before the last token. For example:

"/very" -> "/u00very"
"/very/long" -> "/very/u00long"
"/very/long/path" -> "/very/long/u00path"

Then removing the key "/very" turns into deleting "/\u00very" and the range ["/very/", "/very0"), and getting all its children turns into a query for keys in the range ["/very/\u00", "/very/\u01").
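
A minimal sketch of this transformation (the marker byte follows the article's "\u00"; the helper names are illustrative, and the actual library internals may differ):

    #include <string>
    #include <utility>

    constexpr char kMarker = '\x00';

    // "/very/long/path" -> "/very/long/" + marker + "path"
    std::string transform(const std::string& path) {
        std::size_t pos = path.rfind('/');
        return path.substr(0, pos + 1) + kMarker + path.substr(pos + 1);
    }

    // Immediate children of `path` live in ["path/\u00", "path/\u01").
    std::pair<std::string, std::string> children_range(const std::string& path) {
        return {path + "/" + kMarker, path + "/" + static_cast<char>(kMarker + 1)};
    }

    // Deleting `path` means deleting transform(path) plus the whole subtree
    // range ["path/", "path0"); '0' is the next ASCII character after '/'.
    std::pair<std::string, std::string> subtree_range(const std::string& path) {
        return {path + "/", path + "0"};
    }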

Deleting a key in ZooKeeper

As I already mentioned, in ZooKeeper you cannot delete a node if it has children, while we want to delete a key together with its subtree. What can we do? We act optimistically. First we traverse the subtree recursively, getting the children of each node with a separate query. Then we build a transaction that attempts to delete all nodes of the subtree in the correct order. Of course, changes can occur between reading the subtree and deleting it; in that case the transaction will fail. Moreover, the subtree can change during the reading itself: requesting the children of the next node may return an error if, for example, that node has already been deleted. In both cases we repeat the whole process from the beginning.
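
A sketch of this logic using the official ZooKeeper C client (liboffkv itself uses the ZooKeeper C++ library, but the structure is the same; error handling is reduced to retrying, and a real implementation would also stop if the root itself no longer exists):

    #include <zookeeper/zookeeper.h>
    #include <string>
    #include <vector>

    // Depth-first collection of the subtree; parents come before children.
    static bool collect(zhandle_t* zh, const std::string& path,
                        std::vector<std::string>& out) {
        out.push_back(path);
        String_vector children{};
        if (zoo_get_children(zh, path.c_str(), 0, &children) != ZOK)
            return false;   // e.g. the node was deleted concurrently -> retry
        bool ok = true;
        for (int i = 0; ok && i < children.count; ++i)
            ok = collect(zh, path + "/" + children.data[i], out);
        deallocate_String_vector(&children);
        return ok;
    }

    bool erase_subtree(zhandle_t* zh, const std::string& root) {
        for (;;) {
            std::vector<std::string> nodes;
            if (!collect(zh, root, nodes))
                continue;   // the subtree changed while we were reading it

            // Delete children before parents within a single multi-op transaction.
            std::vector<zoo_op_t> ops(nodes.size());
            for (std::size_t i = 0; i < nodes.size(); ++i)
                zoo_delete_op_init(&ops[i], nodes[nodes.size() - 1 - i].c_str(), -1);

            std::vector<zoo_op_result_t> results(ops.size());
            if (zoo_multi(zh, int(ops.size()), ops.data(), results.data()) == ZOK)
                return true;
            // A concurrent create/delete broke the transaction: start over.
        }
    }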

This approach makes deleting a key very inefficient if it has children, and even more so if the application continues to work with the subtree, deleting and creating keys. However, this allowed us not to complicate the implementation of other methods in etcd and Consul.

set in ZooKeeper

In ZooKeeper, there are separate methods that work with the tree structure (create, delete, getChildren) and methods that work with the data in nodes (setData, getData). All of them have strict preconditions: create returns an error if the node already exists, delete and setData - if it does not exist yet. We needed a set method that can be called without thinking about whether the key is present.

One option is to take an optimistic approach, as with deletion: check whether the node exists; if it does, call setData, otherwise create; if the latter returned an error, repeat everything from the beginning. The first thing to notice is that the existence check is pointless: you can call create right away. Success means that the node did not exist and has been created. Otherwise create returns the corresponding error, after which you need to call setData. Of course, between the calls the node can be removed by a concurrent call, and then setData will also return an error. In that case you could repeat everything again, but is it worth it?

If both methods return an error, then we know for sure that a concurrent deletion has taken place. Let's imagine that this deletion happened after the call to set. Then whatever value we try to set is already erased. So we can assume that set was successful, even if nothing was actually written.
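
A compact sketch of the resulting set logic, again on top of the ZooKeeper C client (this is an illustration, not the actual liboffkv code; -1 means "any version" in the C API):

    #include <zookeeper/zookeeper.h>
    #include <string>

    int set(zhandle_t* zh, const std::string& path, const std::string& value) {
        // Try to create first: success means the key did not exist.
        int rc = zoo_create(zh, path.c_str(), value.data(), int(value.size()),
                            &ZOO_OPEN_ACL_UNSAFE, 0, nullptr, 0);
        if (rc != ZNODEEXISTS)
            return rc;        // created, or a real error

        // The node exists: overwrite its value.
        rc = zoo_set(zh, path.c_str(), value.data(), int(value.size()), -1);
        if (rc == ZNONODE)
            return ZOK;       // concurrently deleted: treat set as successful
        return rc;
    }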

More technical details

In this section, we will digress from distributed systems and talk about coding.
One of the customer's main requirements was cross-platform support: at least one of the services had to work on each of Linux, macOS and Windows. Initially we developed only for Linux and started testing on the other systems later. This caused a lot of problems, and for some time it was completely unclear how to approach them. As a result, all three coordination services are now supported on Linux and macOS, and only Consul KV on Windows.

From the very beginning, we tried to use ready-made libraries to access the services. In the case of ZooKeeper, the choice fell on ZooKeeper C++, which eventually failed to compile on Windows. This, however, is not surprising: the library is positioned as Linux-only. For Consul, the only option was ppconsul; we had to add support for sessions and transactions to it. For etcd, we never found a full-fledged library supporting the latest version of the protocol, so we simply generated a gRPC client.

Inspired by the asynchronous interface of the ZooKeeper C++ library, we decided to make our interface asynchronous as well. ZooKeeper C++ uses future/promise primitives for this. In the STL they are, unfortunately, implemented very modestly. For example, there is no then method that would apply the passed function to the result of a future once it becomes available. In our case, such a method is needed to convert the result into the format of our library. To get around this problem, we had to implement our own simple thread pool, since, at the customer's request, we could not use heavy third-party libraries such as Boost.

Our implementation of then works as follows. When it is called, an additional promise/future pair is created. The new future is returned, and the passed one is placed in a queue together with the corresponding function and the additional promise. A thread from the pool picks several futures from the queue and polls them using wait_for. When a result becomes available, the corresponding function is called and its return value is passed to the promise.
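
A much-simplified sketch of this mechanism, with a single worker thread instead of a pool (all names are illustrative, not the actual liboffkv internals):

    #include <atomic>
    #include <chrono>
    #include <deque>
    #include <functional>
    #include <future>
    #include <memory>
    #include <mutex>
    #include <thread>
    #include <utility>

    class Continuations {
    public:
        Continuations() : worker_([this] { loop(); }) {}
        ~Continuations() { stop_ = true; worker_.join(); }

        // then(future, f): returns a new future that will hold f(result).
        template <class T, class F>
        auto then(std::future<T> fut, F f) {
            using R = decltype(f(std::declval<T>()));
            auto prom = std::make_shared<std::promise<R>>();
            auto src  = std::make_shared<std::future<T>>(std::move(fut));
            std::future<R> res = prom->get_future();
            push([src, f, prom] {
                // Poll the source future; if it is not ready yet, re-queue.
                if (src->wait_for(std::chrono::milliseconds(1)) !=
                    std::future_status::ready)
                    return false;
                prom->set_value(f(src->get()));
                return true;
            });
            return res;
        }

    private:
        void push(std::function<bool()> task) {
            std::lock_guard<std::mutex> lk(mtx_);
            tasks_.push_back(std::move(task));
        }

        void loop() {
            while (!stop_) {
                std::function<bool()> task;
                {
                    std::lock_guard<std::mutex> lk(mtx_);
                    if (!tasks_.empty()) {
                        task = std::move(tasks_.front());
                        tasks_.pop_front();
                    }
                }
                if (!task)
                    std::this_thread::sleep_for(std::chrono::milliseconds(1));
                else if (!task())
                    push(std::move(task));   // the result is not ready yet
            }
        }

        std::deque<std::function<bool()>> tasks_;
        std::mutex mtx_;
        std::atomic<bool> stop_{false};
        std::thread worker_;
    };

With such a helper, a call like then(std::move(raw_future), convert) returns a future of the converted result, which is exactly what the library interface needs.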

We used the same thread pool to make requests to etcd and Consul. This means that several different threads can work with the underlying libraries. ppconsul is not thread-safe, so calls to it are protected by locks.
You can work with gRPC from multiple threads, but there are subtleties. In etcd, watches are implemented via gRPC streams, which are bidirectional channels for messages of a certain type. The library creates a single stream for all watches and a single thread that processes incoming messages. At the same time, gRPC forbids parallel writes to a stream. This means that when initializing or deleting a watch, you have to wait until the previous request has completed before sending the next one. We use condition variables for synchronization.
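
For example, serializing writes to a single stream can be sketched with a small guard like this (illustrative only, not the actual code):

    #include <condition_variable>
    #include <mutex>

    // A writer calls begin_write() before writing to the stream and
    // write_finished() once the previous write has been confirmed.
    class WriteGate {
    public:
        void begin_write() {
            std::unique_lock<std::mutex> lk(mtx_);
            cv_.wait(lk, [this] { return !in_flight_; });
            in_flight_ = true;
        }
        void write_finished() {
            { std::lock_guard<std::mutex> lk(mtx_); in_flight_ = false; }
            cv_.notify_one();
        }
    private:
        std::mutex mtx_;
        std::condition_variable cv_;
        bool in_flight_ = false;
    };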

Conclusion

See for yourself: liboffkv.

Our team: Raed Romanov, Ivan Glushenkov, Dmitry Kamaldinov, Victor Krapivensky, Vitaly Ivanin.

Source: habr.com
