HighLoad++, Yuri Nasretdinov (VKontakte): how VK inserts data into ClickHouse from tens of thousands of servers

HighLoad++ Moscow 2018, Congress Hall. November 9, 15:00

Abstracts and presentation: http://www.highload.ru/moscow/2018/abstracts/4066

Yuri Nasretdinov (VKontakte): the talk covers our experience adopting ClickHouse at VK - why we needed it, how much data we store, how we write it, and so on.

Additional materials: using ClickHouse as a replacement for ELK, BigQuery and TimescaleDB

Yuri Nasretdinov: - Hi everyone! My name is Yuri Nasretdinov, as I have already been introduced. I work at VKontakte. I will talk about how we insert data into ClickHouse from our fleet of tens of thousands of servers.

What are logs and why collect them?

Here is what I will cover: what we did, why we needed ClickHouse and why we chose it, and roughly what performance you can get without any special tuning. I will also talk about buffer tables, about the problems we had with them, and about the solutions we built and open-sourced - KittenHouse and Lighthouse.

Why did we need to do anything at all (everything is always great at VKontakte, right?)? We wanted to collect debug logs (hundreds of terabytes of data), and it seemed it might be convenient to compute statistics over them; and we have a fleet of tens of thousands of servers from which all of this has to be collected.

Why did we decide we needed something new? We certainly already had solutions for storing logs. By the way, there is a public page called "VK Backend" - I highly recommend subscribing to it.

What is "logs"? It is an engine that returns empty arrays. Engines at VK are what others call microservices. And here is a smiling sticker (quite a lot of likes). How so? Well, keep listening!

What can be used to store logs? It is impossible not to mention Hadoop. Then, for example, rsyslog (storing logs in files). Then LSD. Who knows what LSD is? No, not that LSD. It also stores files. And, finally, ClickHouse - a strange option of sorts.

ClickHouse and competitors: requirements and capabilities

What do we want? We don't want to worry much about operations: it should work out of the box, ideally with minimal configuration. We want to write a lot, and write quickly. And we want to keep the data for months and years - that is, for a long time. Say someone comes to us with a problem - "Something isn't working for us here" - and it happened 3 months ago; we want to be able to see what was going on 3 months ago. Data compression is an obvious plus - it reduces the amount of space the data takes up.

And we have one more interesting requirement: we sometimes write the output of commands (logs, for example), and it can easily exceed 4 kilobytes. Also, if this thing works over UDP, it does not have to spend anything on connections - there is no per-connection overhead - and for a large number of servers that is a plus.

Let's see what open source offers us. First, we have Logs Engine - our own engine; in principle, it can do everything, it can even write long lines. But it does not compress data transparently - we could compress large columns ourselves if we wanted to, which, of course, we don't (if we can avoid it). The only problem is that it can only serve what fits in memory; to read the rest, you need to fetch the engine's binlog, and that is quite slow.

What are the other options? For example, Hadoop. Ease of operation... Who thinks Hadoop is easy to set up? There are no problems with writing, of course. With reading, questions sometimes arise. On the whole, I would say rather no, especially for logs. Long-term storage - certainly yes; data compression - yes; long lines - clearly you can write them. But writing from a large number of servers... you still have to build something yourself!

Rsyslog. In fact, we used it as a fallback so that we could read logs without dumping the binlog, but it cannot handle long lines: in principle, it cannot write more than 4 kilobytes. Data compression you likewise have to do yourself. Reading would be from files.

Then there is LSD, developed at Badoo. It is essentially the same as rsyslog: it supports long lines, but it cannot work over UDP and, because of that, unfortunately, quite a lot of things would need to be rewritten. LSD would have to be redesigned to be able to write from tens of thousands of servers.

And here is a funny option - ElasticSearch. How to put it? It is good at reading - it reads quickly - but at writing it is not so great. First, if it compresses data at all, it does so very weakly; most likely, a full-fledged search requires data structures larger than the original volume. It is hard to operate, and problems often arise with it. And, again, for writing into Elastic we would have to do everything ourselves.

ClickHouse, of course, is the ideal option. The only thing is that writing from tens of thousands of servers is a problem. But at least it is a single problem, and we can try to solve it. The rest of the talk is about this problem. What kind of performance can you expect from ClickHouse?

How will we insert? MergeTree

Who among you has not heard of ClickHouse and doesn't know what it is? I should tell you, shouldn't I? It is very fast. Insertion there runs at 1-2 gigabits per second, with bursts up to 10 gigabits per second, and it can actually sustain that on this configuration: two 6-core Xeons (that is, not even the most powerful), 256 gigabytes of RAM, 20 terabytes in RAID (nothing tuned, default settings). Alexey Milovidov, a ClickHouse developer, is probably sitting here crying that we didn't tune anything (everything worked for us as it was). Accordingly, you can get a scan speed of, say, about 6 billion rows per second if the data compresses well. If you do a LIKE '%...%' match on a text string - 100 million rows per second; that is, it seems quite fast.

How will we insert? Well, you know that at VK everything is in PHP. We will insert from every PHP worker over HTTP into ClickHouse, into a MergeTree table, one request per record. Who sees the problem with this scheme? For some reason not everyone raised their hands. Let me explain.

First, there are a lot of servers - so there will be a lot of connections (bad). Second, it is better to insert into MergeTree no more than about once per second. Who knows why? Okay, I'll say a bit more about this in a moment. Another interesting point: we are not doing analytics, we don't need to enrich the data, and we don't need intermediate servers - we want to insert directly into ClickHouse (the more direct, the better).

So how does insertion into a MergeTree work, and why is it better to insert no more than once a second? ClickHouse is a columnar database and sorts data by primary key, so each insert creates a new part: a separate directory with a set of files on disk, at least one file per column, in which the data is sorted in ascending order of the primary key. Then the next insert arrives, and in the background these parts are merged into larger ones. Since the data is sorted, two sorted files can be merged without much memory consumption.

But, as you might guess, if you create new parts on every insert, ClickHouse (or rather, your server) will quickly choke, so it is recommended to insert in large batches. Accordingly, we never ran the first scheme in production. We went straight to scheme number 2:

Imagine roughly a thousand servers on which we launched this; on them there is just PHP. On each server runs our local agent, which we called KittenHouse; it keeps one connection to ClickHouse and inserts data every few seconds - not into the MergeTree directly, but into a buffer table, which exists precisely to avoid inserting straight into the MergeTree.

Working with buffer tables

What are buffer tables? They are pieces of memory that are sharded (so you can insert into them frequently). A buffer table consists of several pieces, each of which works as an independent buffer and is flushed independently (with many pieces in the buffer, you get many inserts per second). You can read from these tables - then you read the union of the buffer contents and the destination table - but at that moment writes are blocked, so it is better not to read from them. And buffer tables show very good QPS: up to 3 thousand QPS at insert you will have no problems at all. Obviously, if the server loses power, the data can be lost, because it lived only in memory.

At the same time, the buffer scheme complicates ALTER: you first have to drop the old buffer table with the old schema (the data will not be lost, because it is flushed before the table is dropped). Then you ALTER the destination table and create the buffer table again. While the buffer table does not exist, your data has nowhere to go - though you can at least keep it on disk locally in the meantime.

What is KittenHouse and how does it work?

So what is KittenHouse? It is a proxy. Guess which language it is written in? I gathered the most hyped topics in one talk: ClickHouse, Go... maybe I'll remember something else. Yes, it is written in Go, because I am not very good at writing C and don't want to.

Accordingly, it keeps a connection to each server and can buffer in memory. For example, if we write error logs to ClickHouse and ClickHouse does not keep up with the inserts (say, there is too much data), then we don't balloon in memory - we simply throw away the rest. Because if we are writing several gigabits per second of errors, we can probably afford to drop some. KittenHouse can do that. It also supports reliable delivery: it writes to disk on the local machine and, every couple of seconds or so, tries to deliver the data from that file. At first we used the plain Values format - a text format, as in regular SQL, not some binary format.

But then this happened. We were using reliable delivery and writing logs; then (it was a kind of test cluster) ClickHouse was taken down for several hours and brought back up, and an insert flood arrived from a thousand servers. It turned out that ClickHouse still had a thread-per-connection model, so a thousand connections with active insertion drove the load average on the server to about one and a half thousand. Surprisingly, the server accepted the requests, and the data did eventually get inserted, but it was very hard for the server to cope.

Adding nginx

A solution to the thread-per-connection problem is nginx. We put nginx in front of ClickHouse, set up balancing across two replicas at the same time (which also doubled the insertion speed, although it is not certain that it should have), and limited the number of connections to each upstream - beyond about 50 connections there seems to be no point in inserting anyway.

Then we realized this scheme has a drawback: we have only one nginx. If that nginx goes down, then despite the replicas we lose data, or at least cannot write anywhere. That is why we built our own load balancing. We also confirmed that ClickHouse really is suitable for logs, and our daemon began writing its own logs into ClickHouse too - very convenient, to be honest. We still use it for other daemons.

Then we discovered an interesting problem: if you use a not-quite-standard way of inserting in SQL mode, a full-fledged AST-based SQL parser kicks in, and it is rather slow. So we added settings to make sure that never happens. We did load balancing and health checks, so that if one node dies we still get the data through. We accumulated enough tables that we needed several ClickHouse clusters. And we started thinking about other uses: for example, we wanted to write logs from nginx modules, but they cannot speak our RPC. Well, we would like to teach them to send somehow - for example, to receive events over UDP on localhost and then forward them to ClickHouse.

One step away from a solution

The final scheme looked like this (version four of the scheme): on each ClickHouse server, nginx sits in front of ClickHouse (on the same server, no less) and simply proxies requests to localhost with a limit of 50 connections. And this scheme was already quite workable; with it everything was pretty good.

We lived like this for about a month. Everyone was happy, adding tables and adding more... In general, it turned out that the way we added buffer tables was not exactly optimal (let's put it that way). We made 16 chunks in each table with a flush interval of a couple of seconds; we had 20 tables with 8 inserts per second going to each - and at that point the writes into ClickHouse began to stall. They stopped going through at all... By default nginx has an interesting behavior: if the upstream runs out of connections, it simply returns 502 to all new requests.

And here we got (I just looked at the logs in ClickHouse itself) somewhere around half a percent of requests failing. Disk utilization was high, and there were many merges. Well, what did I do? Naturally, I did not bother figuring out exactly why the connections to the upstream were running out.

Replacing nginx with a reverse proxy

I decided we needed to manage this ourselves rather than leave it to nginx - nginx does not know what tables exist in ClickHouse - and replaced nginx with a reverse proxy, which I also wrote myself.

What does it do? It is based on the Go library fasthttp, so it is fast - almost as fast as nginx. (Sorry, Igor, if you are here. Note: Igor Sysoev is the Russian programmer who created the nginx web server.) It can tell what kind of query it is handling - INSERT or SELECT - and holds separate connection pools for the different query types.

Accordingly, even if we cannot keep up with insert requests, the SELECTs will still get through, and vice versa. And it groups data destined for the buffer tables with a small buffer of its own: if there were any errors - syntax errors and so on - they do not contaminate much of the other data. When we inserted straight into the buffer tables we had small batches, and a syntax error spoiled only that small piece; here an error already affects a larger buffer. "Small" is 1 megabyte, so not that small.

Insertion is synchronous, and the proxy essentially replaces nginx - it does the same thing nginx did before, so the local KittenHouse does not need to change. And since it uses fasthttp, it is very fast: you can push more than 100 thousand single-row insert requests per second through the reverse proxy. In theory you could insert one row at a time into the KittenHouse reverse proxy, but of course we don't do that.

The scheme came to look like this: KittenHouse; the reverse proxy groups many requests into per-table batches; and the buffer tables, in turn, insert them into the destination tables.

Killer is temporary, Kitten is permanent

Then there was an interesting problem... Have any of you used fasthttp? Who used fasthttp with POST requests? You probably shouldn't have, because it buffers the request body by default, and our buffer size was set to 16 megabytes. At some point insertion stopped keeping up, 16-megabyte chunks started arriving from all the tens of thousands of servers, and they were all buffered in memory before being handed to ClickHouse. So memory ran out, the Out-Of-Memory Killer came and killed the reverse proxy (or ClickHouse, which in theory could "eat" even more than the reverse proxy). The cycle repeated. Not a pleasant problem. Though we only hit it after a few months of operation.

What did I do? Again, I don't really like figuring out what exactly happened. I think it is fairly obvious that you should not buffer into memory. I could not patch fasthttp, although I tried. But I found a way to avoid patching anything and invented my own HTTP method - I called it KITTEN. Well, it is logical: VK, kitten... what else?

If a request arrives at the server with the KITTEN method, the server should answer "meow" - logically enough. If it answers that, it is assumed to understand the protocol, and then I hijack the connection (fasthttp has such a method), and the connection goes into "raw" mode. Why do I need that? I want to control how reading from the TCP connections happens. TCP has a wonderful property: if nobody reads on the other side, the writer starts to wait, and no memory is particularly spent on this.

And I read from about 50 clients at a time (fifty, because fifty should definitely be enough, even if the insert comes from another DC)... With this approach memory consumption dropped at least 20-fold; to be honest, I could not measure exactly how much, because it no longer made sense (it fell to the level of noise). The protocol is binary: a table name plus data, with no HTTP headers, so I did not use a websocket (I don't need to talk to browsers - I made a protocol that fits our needs). And everything became fine.

Buffer tables are sad

Recently we came across another interesting property of buffer tables. And this problem is much more painful than the others. Imagine the following situation: you already use ClickHouse actively, you have dozens of ClickHouse servers, and you have some queries that take a very long time to read (say, more than 60 seconds); and you come and run an ALTER at that moment... The ALTER will not start until the SELECTs that began before it finish - probably some peculiarity of how ClickHouse works here. Maybe it can be fixed? Or not?

In general, it is clear that by itself this is not such a big problem, but with buffer tables it becomes more painful. Because if, say, your ALTER times out (and it can time out on another host - not yours, but a replica's), then... You have already dropped the buffer table; your ALTER timed out (or failed with some error), but the data has to keep flowing, so you create the buffer tables again (with the same schema the parent table had); then the ALTER does eventually complete, and the buffer table now differs in schema from the parent. Depending on what the ALTER was, inserts may no longer reach that buffer table at all - and that is very sad.

There is also this table (maybe someone has noticed it) - in new versions of ClickHouse it is called query_thread_log. In some version it was on by default. Here we accumulated 840 million records in it over a couple of months (100 gigabytes). That is because inserts were logged there (maybe now, by the way, they are not). As I told you, our inserts are small, and there were a great many inserts into the buffer tables. Clearly this can be disabled - I am just telling you what I saw on our server. Why mention it? It is another argument against buffer tables! Spotty is very sad.

And who knew that this comrade's name was Spotty? VK employees raised their hands. OK.

Plans for KittenHouse

People don't usually share plans, do they? Suddenly you don't fulfill them and don't look so good in others' eyes. But I'll take the risk! We want to do the following: buffer tables, it seems to me, are still a crutch, and we need to buffer the inserts ourselves. But we still don't want to buffer on disk, so we will buffer the inserts in memory.

Accordingly, an insert will no longer be synchronous: it will work like a buffer table, insert into the parent table (well, some time later), and report over a separate channel which inserts went through and which did not.

Why can't we keep the synchronous insert? It is much more convenient. The thing is, if you insert from 10 thousand hosts, everything is fine: you get a little from each host, each inserts about once per second, everything works. But I want this scheme to also work from, say, two machines, so you can pour data at high speed - maybe not squeezing the maximum out of ClickHouse, but writing at least 100 megabytes per second from one machine through the reverse proxy. The scheme has to scale both to a large number of hosts and to a small one, so we cannot afford to wait a second on every insert; it has to be asynchronous. And likewise the confirmations should arrive asynchronously, after the insert has actually happened. Then we will know whether it went through or not.

The most important thing is that in this scheme we know for certain whether an insert went through or not. Imagine the following situation: you have a buffer table, you wrote something into it, and then, say, the table goes read-only while trying to flush the buffer. Where does the data go? It stays in the buffer. But we cannot be sure of that - what if there is some other error, because of which the data does not stay in the buffer... (Addressing Alexey Milovidov, Yandex, ClickHouse developer.) Or will it? Always? Alexey assures us everything will be fine. We have no reason not to believe him. But still: if we don't use buffer tables, there will definitely be no problems with them. Creating twice as many tables is inconvenient too, though in principle not a big problem. That's the plan.

Let's talk about reading

Now let's talk about reading. Here, too, we wrote our own tool. Why write your own tool, you might ask... Who has used Tabix? Somehow few hands went up... And who is satisfied with Tabix's performance? Well, we are not, and it is not very convenient for viewing data. For analytics it is fine, but for simply viewing data it is clearly not optimized. So I wrote my own interface.

It is very simple - it can only read data. It cannot draw graphs; it cannot do anything else. But it shows what we need: for example, how many rows a table has and how much space it takes (without a per-column breakdown). A very basic interface is exactly what we need.

It looks very much like Sequel Pro, but built on Twitter Bootstrap - the second version at that. You ask: "Yuri, why version two?" What year is it? 2018? Well, I wrote this long ago for MySQL, just changed a couple of lines in the queries, and it started working with ClickHouse - special thanks for that! Because the parser is very similar to MySQL's, and the queries are very similar - very convenient, especially at first.

It can filter tables, show a table's structure and contents, lets you sort and filter by columns, and shows the query that was produced and the affected rows (how many matched) - the basic things for viewing data. And it is pretty fast.

There is also an editor. I honestly tried to steal the whole editor from Tabix, but I couldn't. Still, it works, more or less. Basically, that's all.

ClickHouse is suitable for logs

I want to tell you that ClickHouse, despite all the problems described, is very well suited for logs. Most importantly, it solves our problem: it is very fast and lets you filter logs by columns. In principle, buffer tables have not performed well, but usually nobody knows why... Maybe now you know better where you will run into problems.

TCP? In general, at VK it is customary to use UDP. And when I used TCP... Of course, nobody told me: "Yuri, what are you doing! You can't do that, you need UDP." It turned out TCP is not so scary. The only thing is, if you have tens of thousands of active connections writing to you, you need to prepare a bit more carefully; but it is doable, and quite easily.

I promised to publish KittenHouse and Lighthouse at HighLoad Siberia if everyone subscribed to our "VK Backend" public page... And you know, not everyone subscribed... Of course, I won't demand that you subscribe. There are too many of you; someone might even take offense. But still - subscribe, please (and here I have to make cat's eyes at you). Here is the link to it, by the way. Thank you very much! Here is our GitHub. With ClickHouse your hair will be soft and silky.

Moderator: - Friends, now the questions. Right after we hand over the certificate of appreciation and your presentation on VHS.

Yuri Nasretdinov (hereinafter - YUN): - And how could you have recorded my talk on VHS if it has only just ended?

Moderator: - Well, you also cannot fully determine in advance whether ClickHouse will work or not! Friends, 5 minutes for questions!

Questions

Question from the audience (hereinafter - Q): - Good afternoon. Thank you very much for the talk. I have two questions. I'll start with a frivolous one: does the number of letters t in the name "Kittenhouse" on the diagrams (3, 4, 7...) affect the satisfaction of the cats?

YUN: - The number of what?

Q: - The letter t. In some places there are three t's, in others more.

YUN: - Didn't I fix that? Well, of course it does! Those are different products - I've been deceiving you all this time. Okay, I'm kidding - it doesn't matter. Ah, right here! No, it's the same thing; I just made a typo.

Q: - Thank you. Now the serious question. As far as I understand, buffer tables in ClickHouse live exclusively in memory; they are not buffered to disk and are therefore not persistent.

YUN: - Yes.

Q: - And at the same time your client buffers to disk, which implies some delivery guarantee for these same logs. But ClickHouse does not guarantee it at all. Explain how the guarantee works, by what mechanism? This mechanism, in more detail.

YUN: - Yes, in theory there is no contradiction here, because you can actually detect that ClickHouse has gone down in a million different ways. When ClickHouse crashes (after an unclean shutdown), you can, roughly speaking, rewind the log you have written a little and start replaying from the moment when everything was definitely fine. Say, you rewind a minute back - that is, you assume that everything got flushed within a minute.

Q: - That is, KittenHouse keeps a longer window and, in case of a crash, can recognize it and rewind?

YUN: - That's in theory. In practice we don't do it, and "reliable delivery" means anywhere from zero to infinity deliveries; on average, once. It suits us that if ClickHouse crashes for some reason or the servers reboot, we lose a little. In all other cases nothing happens.

Q: - Hello. From the very start of the talk it seemed to me that you would actually use UDP. You have HTTP and all that... And most of the problems you described, as I understand it, were caused by that particular choice...

YUN: - That we use TCP?

W: – Essentially, yes.

YUN: - No.

W: - You had problems with fasthttp, you had problems with connections. If you had just used UDP, you would have saved yourself some time. Well, there would have been problems with long messages or something else...

YUN: - With what?

W: - With long messages, since they might not fit into the MTU, and so on... Well, that would bring problems of its own. The question is: why not UDP after all?

YUN: - I believe that the people who developed TCP/IP are much smarter than me and know better than I do how to sequence packets (so that they actually arrive), regulate the send window on the fly, avoid overloading the network, and get feedback about what the other side has not read... All of these problems, in my opinion, would exist with UDP, only I would have to write even more code than I already have in order to reimplement all of it myself, and most likely badly. I don't even like writing in C, let alone...

W: - But it's convenient! You just send it and don't wait for anything: it's completely asynchronous. A notification comes back that everything is fine, which means it arrived; if it doesn't come, that's bad.

YUN: - I need both: I need to be able to ship with guaranteed delivery and without guaranteed delivery. These are two different scenarios. Some logs I must not lose, or must not lose within reason.

W: - I won't take up any more time. This would need a longer discussion. Thank you.

Leader: - Who has questions, hands in the air!

W: - Hi, I'm Sasha. Somewhere in the middle of the report I got the feeling that, besides plain TCP, you could have used a ready-made solution, some kind of Kafka.

YUN: - Well... I already said that I don't want to use intermediate servers, because... with Kafka it turns out that we have ten thousand hosts, actually more, tens of thousands of hosts, and hitting Kafka without any proxies can hurt too. Besides, and most importantly, it still adds latency and extra hosts that you need to run. And I don't want to have them. I want...

W: - And in the end that's exactly what you got.

YUN: - No, there are no extra hosts! It all runs on the ClickHouse hosts themselves.

W: - Well, and KittenHouse in reverse-proxy mode - where does it live?

YUN: - On the ClickHouse host, and it doesn't write anything to disk.

W: - Let us suppose.

Leader: - Does that satisfy you? Can we wrap up?

W: - Yes, we can. In fact, there are a lot of crutches here to achieve the same thing, and the previous answer on the topic of TCP, in my opinion, contradicts this situation. It just feels like the whole thing could have been knocked together in much less time.

YUN: - Another reason I didn't want to use Kafka is that there were quite a few complaints in the ClickHouse Telegram chat that, for example, messages from Kafka were being lost. Not by Kafka itself, but by the integration of Kafka and ClickHouse, or something was simply not wired up right. Roughly speaking, we would then have had to write a Kafka client as well. I don't think that would have been a simpler or more reliable solution.

W: - Tell me, why didn't you try some kind of queue, some common message bus? Since you say that everything works asynchronously, you could push the logs themselves through a queue and receive the responses asynchronously through a queue as well?

YUN: - Could you suggest which queues could be used?

W: - Any, even without a guarantee of ordering. Some Redis, RMQ...

YUN: - I have a feeling that Redis most likely cannot handle the insert volume, even on a single host (to say nothing of several servers), that ClickHouse handles. I can't back this up with any evidence (I haven't benchmarked it), but it seems to me that Redis is not the best solution here. In principle, this system can be viewed as an improvised message queue, but one tailored specifically to ClickHouse.

Leader: - Yuri, thank you very much. I propose we end the questions and answers here and announce which of those who asked a question will get the book.

YUN: - I would like to give the book to the first person who asked a question.

Leader: - Wonderful! Great! Fabulous! Thanks a lot!

Source: habr.com
