Moving to ClickHouse: 3 years later

Three years ago, Viktor Tarnavsky and Alexey Milovidov from Yandex took the HighLoad++ stage to tell how good ClickHouse is and how it doesn't slow down. On the next stage was Alexander Zaitsev with a talk about migrating to ClickHouse from another analytical DBMS, concluding that ClickHouse is, of course, good, but not very convenient. When in 2016 LifeStreet, where Alexander worked at the time, moved its multi-petabyte analytical system to ClickHouse, it was a fascinating "yellow brick road" full of unknown dangers: ClickHouse back then looked like a minefield.

Three years later, ClickHouse has become much better. In that time Alexander founded Altinity, which not only helps dozens of projects move to ClickHouse, but also improves the product itself together with colleagues from Yandex. ClickHouse is still not a carefree walk, but it is no longer a minefield.

Alexander has been working with distributed systems since 2003, developing large projects on MySQL, Oracle and Vertica. At HighLoad++ 2019, Alexander, one of the pioneers of ClickHouse adoption, described what this DBMS is today. We will learn about the main features of ClickHouse: how it differs from other systems and when it is most effective to use it. Using examples, we will look at fresh, project-proven practices for building systems based on ClickHouse.


Retrospective: what happened 3 years ago

Three years ago we moved LifeStreet to ClickHouse from another analytical database. The ad network's analytics migration looked like this:

  • June 2016. ClickHouse appeared in open source, and our project started;
  • August. Proof of concept: a large ad network, infrastructure and 200-300 terabytes of data;
  • October. First production data;
  • December. Full production load: 10-50 billion events per day;
  • June 2017. Successful migration of users to ClickHouse, 2.5 petabytes of data on a cluster of 60 servers.

As the migration progressed, we came to understand that ClickHouse is a good system that is pleasant to work with, but that it is an internal Yandex project. Hence the nuances: Yandex serves its own internal customers first and only then the community and the needs of external users, and back then ClickHouse had not reached the enterprise level in many functional areas. That is why in March 2017 we founded Altinity: to make ClickHouse even faster and more convenient not only for Yandex but also for other users. Now we:

  • Train users and help build solutions based on ClickHouse, so that customers avoid painful mistakes and end up with a working solution;
  • Provide 24/7 support for ClickHouse installations;
  • Develop our own ecosystem projects;
  • Actively commit to ClickHouse itself, responding to requests from users who want to see certain features.

And of course, we help with migrations to ClickHouse from MySQL, Vertica, Oracle, Greenplum, Redshift and other systems. We have been involved in a wide variety of migrations, and they have all been successful.


Why move to ClickHouse at all

It doesn't slow down! That's the main reason. ClickHouse is a very fast database across different scenarios:


Random quotes from people who work with ClickHouse.

Scalability. In some other databases you can achieve good performance on a single machine, but ClickHouse can be scaled not only vertically but also horizontally, simply by adding servers. Not everything works as smoothly as we would like, but it works: you can grow the system as your business grows. It is important that we are not limited by today's decision, and there is always room to develop.

Portability. There is no lock-in. With Amazon Redshift, for example, it is hard to move anywhere else. ClickHouse, on the other hand, can run on your laptop or a server, be deployed to the cloud, or go into Kubernetes: there are no restrictions on the infrastructure. This is convenient for everyone, and it is a great advantage that many other similar databases cannot boast.

Flexibility. ClickHouse is not tied to a single use case such as Yandex.Metrica; it is being developed and used in more and more projects and industries. It can be extended with new features to solve new problems. For example, storing logs in a database is considered bad manners, which is why Elasticsearch was invented for that. But thanks to ClickHouse's flexibility, you can store logs in it too, and often it works even better than Elasticsearch: ClickHouse needs 10 times less hardware.

Free open source. You don't have to pay for anything. No need to negotiate permission to install the system on your laptop or server. There are no hidden fees. At the same time, no other open-source database technology can compete with ClickHouse in speed: MySQL, MariaDB and Greenplum are all much slower.

Community and drive. ClickHouse has a great community: meetups, chats, and Alexey Milovidov, who charges us all with his energy and optimism.

Moving to ClickHouse

To switch to ClickHouse from something else, you need only three things:

  • Understand the limitations of ClickHouse and what it is not suitable for.
  • Use the strengths of the technology and its greatest advantages.
  • Experiment. Even knowing how ClickHouse works, it is not always possible to predict when it will be faster, when slower, when better and when worse. So try it.

The problem with moving

There is only one "but": when you move to ClickHouse from something else, something usually goes wrong. We are used to practices and things that work in our favorite database. For example, anyone working with SQL databases considers the following set of features mandatory:

  • transactions;
  • constraints;
  • consistency;
  • indexes;
  • UPDATE/DELETE;
  • NULLs;
  • milliseconds;
  • automatic type conversions;
  • multiple joins;
  • arbitrary partitions;
  • cluster management tools.

The set seems mandatory, but three years ago ClickHouse had none of these features! Now less than half remain unimplemented: transactions, constraints, consistency, millisecond precision and type casting.

And the main thing is that in ClickHouse some standard practices and approaches either don't work or work differently from what we are used to. Everything that appears in ClickHouse follows the "ClickHouse way": its features differ from other databases. For example:

  • Indexes skip data rather than select it.
  • UPDATE/DELETE are asynchronous, not synchronous.
  • There are multiple joins, but no query planner. How they are then executed is generally unclear to people coming from the database world.
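For example, a DELETE in ClickHouse is issued as a mutation that rewrites data parts in the background (a sketch; the table and the filter are illustrative, using the cpu table defined later in this article):

```sql
-- DELETE in ClickHouse is an asynchronous "mutation": the statement
-- returns immediately, and data parts are rewritten in the background.
ALTER TABLE cpu DELETE WHERE created_date < '2019-01-01';

-- The progress of a mutation can be checked in a system table:
SELECT command, is_done FROM system.mutations WHERE table = 'cpu';
```

Until `is_done` becomes 1, queries may still see some of the "deleted" rows.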

ClickHouse Scenarios

In 1960, Eugene Wigner, an American mathematician of Hungarian origin, wrote the article "The Unreasonable Effectiveness of Mathematics in the Natural Sciences", arguing that the world around us is for some reason well described by mathematical laws. Mathematics is an abstract science, yet physical laws expressed in mathematical form are not trivial, and Wigner emphasized how strange that is.

From my point of view, ClickHouse is the same kind of oddity. To paraphrase Wigner: what is amazing is the unreasonable effectiveness of ClickHouse in a wide variety of analytical applications!


For example, take a Real-Time Data Warehouse into which data is loaded almost continuously, and which we want to query with about a second of latency. By all means use ClickHouse, because it was designed for exactly this scenario. That is how ClickHouse is used not only on the web, but also in marketing and financial analytics, AdTech and fraud detection. In a real-time data warehouse a complex structured schema such as a "star" or "snowflake" is used, with many tables and JOINs (sometimes multiple ones), and the data is typically produced and updated by other systems.

Let's take another scenario, time series: monitoring of devices and networks, usage statistics, the Internet of Things. Here we deal with fairly simple events ordered in time. ClickHouse was not originally developed for this, but it has performed well, and large companies use ClickHouse as a repository for monitoring data. To see whether ClickHouse suits time series, we made a benchmark based on the approach and results of InfluxDB and TimescaleDB, specialized time-series databases. It turned out that ClickHouse, even without optimization for such tasks, wins on this foreign field too.


In time series a narrow table is usually used: several small columns. A lot of data can come from monitoring (millions of records per second), and it usually arrives in small inserts (real-time streaming). So we need a different insert pattern, and the queries themselves have their own specifics.

Log management. Collecting logs in a database is usually considered bad, but in ClickHouse it can be done, with the caveats described above. Many companies use ClickHouse for exactly this. Here a flat wide table is used, where we store whole logs (for example, as JSON) or cut them into fields. Data is usually loaded in large batches (files), and queries search by some field.

For each of these tasks, specialized databases are usually used. ClickHouse alone can do it all, and so well that it outperforms them. Let's now take a closer look at the time-series scenario and how to "cook" ClickHouse for it.

Time Series

This is currently the main scenario for which ClickHouse is considered a standard solution. A time series is a set of time-ordered events representing changes in some process over time. For example, it can be a heart rate over a day or the number of processes in a system. Everything that produces time ticks with some measurements is a time series.


Most such events come from monitoring. This can be not only web monitoring, but also real devices: cars, industrial systems, IoT, factories, or the unmanned taxis into whose trunks Yandex is already putting ClickHouse servers.

For example, there are companies that collect data from ships. Every few seconds, sensors on a container ship send hundreds of measurements. Engineers study them, build models and try to understand how efficiently the vessel is used, because a container ship should not stand idle for a second. Any downtime is money wasted, so it is important to predict routes so that port stays are minimal.

Specialized time-series databases are now on the rise. The DB-Engines site ranks databases in various ways, and they can be viewed by type:


The fastest-growing type is time-series databases. Graph databases are growing too, but time series have grown faster over the past few years. Typical representatives of this family are InfluxDB, Prometheus, KDB, TimescaleDB (built on PostgreSQL), and solutions from Amazon. ClickHouse can be used here too, and it is. Let me give a few public examples.

One of the pioneers is CloudFlare (a CDN provider). They monitor their CDN through ClickHouse (DNS requests, HTTP requests) with a huge load of 6 million events per second. Everything goes through Kafka into ClickHouse, which provides real-time dashboards of events in the system.

Comcast, one of the leaders in US telecommunications (Internet, digital television, telephony), built a similar CDN control system within the open-source project Apache Traffic Control to handle their huge data volumes. ClickHouse is used as the analytics backend.

Percona built ClickHouse into their PMM to store monitoring data from various MySQL instances.

Specific requirements

Time-series databases have their own specific requirements.

  • Fast insertion from many agents. We need to insert data from many streams very quickly. ClickHouse does this well, because its inserts are all non-blocking. Every INSERT is a new file on disk, and small inserts can be buffered one way or another. In ClickHouse it is better to insert data in large batches rather than one row at a time.
  • Flexible schema. In time series we usually don't know the data structure completely in advance. It is possible to build a monitoring system for a specific application, but it is then hard to reuse for another one. A more flexible schema is needed, and ClickHouse allows this even though it is a strongly typed database.
  • Efficient storage and "forgetting" data. A time series usually involves a huge amount of data, so it must be stored as efficiently as possible. In InfluxDB, for example, good compression is the key feature. Besides storage, you also need to be able to "forget" old data and do downsampling: automatic computation of aggregates.
  • Fast queries on aggregated data. Sometimes it is interesting to look at the last 5 minutes with millisecond accuracy, but on monthly data, minute or second granularity may be unnecessary: general statistics suffice. Support for this is necessary, otherwise a query over 3 months will run for a very long time even in ClickHouse.
  • "Last point, as of" queries. These are typical time-series queries: look at the last measurement or the state of the system at a point in time t. They are not very pleasant queries for a database, but they also need to be supported.
  • "Gluing" time series. A time series is a series in time. Given two time series, they often need to be joined and correlated. Doing this is inconvenient in most databases, especially with unaligned series: here are some timestamps, there are others. You can take averages, but a hole may still remain, so it is unclear what to do.

Let's see how these requirements are met in ClickHouse.

Schema

In ClickHouse, a schema for time series can be designed in different ways, depending on how regular the data is. It is possible to build a system on regular data when we know all the metrics in advance; for example, CloudFlare did this with CDN monitoring, producing a well-optimized system. You can also build a more general system that monitors the entire infrastructure and different services. With irregular data we do not know in advance what we are monitoring, and this is probably the most common case.

Regular data. Columns. The schema is simple: columns with the necessary types:

CREATE TABLE cpu (
  created_date Date DEFAULT today(),  
  created_at DateTime DEFAULT now(),  
  time String,  
  tags_id UInt32,  /* join to dim_tag */
  usage_user Float64,  
  usage_system Float64,  
  usage_idle Float64,  
  usage_nice Float64,  
  usage_iowait Float64,  
  usage_irq Float64,  
  usage_softirq Float64,  
  usage_steal Float64,  
  usage_guest Float64,  
  usage_guest_nice Float64
) ENGINE = MergeTree(created_date, (tags_id, created_at), 8192);

This is a regular table that monitors some kind of system load activity (user, system, idle, nice). Simple and convenient, but not flexible. If we want a more flexible schema, we can use arrays.

Irregular data. Arrays:

CREATE TABLE cpu_alc (
  created_date Date,  
  created_at DateTime,  
  time String,  
  tags_id UInt32,  
  metrics Nested(
    name LowCardinality(String),  
    value Float64
  )
) ENGINE = MergeTree(created_date, (tags_id, created_at), 8192);

The Nested structure is essentially two arrays: metrics.name and metrics.value. Here you can store arbitrary monitoring data as an array of names and an array of measurements for each event. For further optimization, several such structures can be created instead of one: for example, one for float values and another for int values, because int values can be stored more efficiently.

But such a structure is harder to access. You have to use a special construction, with functions that first look up the index and then fetch the array element:

SELECT max(metrics.value[indexOf(metrics.name,'usage_user')]) FROM ...

But it still works fast enough. Another way to store irregular data is by rows.

Irregular data. Rows. In this traditional way, without arrays, names and values are stored as separate rows. If 5 measurements arrive from one device at once, 5 rows are generated in the database:

CREATE TABLE cpu_rlc (
  created_date Date,  
  created_at DateTime,  
  time String,  
  tags_id UInt32,  
  metric_name LowCardinality(String),  
  metric_value Float64
) ENGINE = MergeTree(created_date, (metric_name, tags_id, created_at), 8192);


SELECT 
    maxIf(metric_value, metric_name = 'usage_user'),
    ... 
FROM cpu_rlc
WHERE metric_name IN ('usage_user', ...)

ClickHouse copes with this: it has special extensions in ClickHouse SQL. For example, maxIf is a special function that computes the maximum of a metric when a condition is met. You can write several such expressions in one query and compute values for several metrics at once.
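A fuller version of such a query might look like this (a sketch over the cpu_rlc table above; the metric names follow the column schema from earlier):

```sql
-- One -If combinator aggregate per metric pivots rows back
-- into columns in a single pass over the table.
SELECT
    tags_id,
    maxIf(metric_value, metric_name = 'usage_user')   AS max_usage_user,
    maxIf(metric_value, metric_name = 'usage_system') AS max_usage_system,
    avgIf(metric_value, metric_name = 'usage_idle')   AS avg_usage_idle
FROM cpu_rlc
WHERE metric_name IN ('usage_user', 'usage_system', 'usage_idle')
GROUP BY tags_id
```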

Let's compare three approaches:



For a test dataset I also measured "data size on disk". With columns we get the smallest data size: maximum compression and maximum query speed, but we pay by having to fix all the metrics up front.

With arrays, things are a little worse. The data still compresses well, and an irregular schema can be stored. But ClickHouse is a columnar database, and when we start storing everything in an array it effectively becomes a row database: we pay for flexibility with efficiency. For any operation, the whole array has to be read into memory and the desired element found in it, and as the array grows, speed degrades.

In one of the companies using this approach (Uber, for example), arrays are cut into pieces of 128 elements. Data for several thousand metrics, 200 TB of data per day, is stored not in one array but in 10 or 30 arrays with special storage logic.

The simplest approach is rows. But the data compresses poorly, the table is large, and even when queries touch several metrics, ClickHouse does not work optimally.

Hybrid schema

Suppose we have chosen the array schema. But if we know that most of our dashboards show only the user and system metrics, we can additionally materialize those metrics into columns from the array at the table level, like this:

CREATE TABLE cpu_alc (
  created_date Date,  
  created_at DateTime,  
  time String,  
  tags_id UInt32,  
  metrics Nested(
    name LowCardinality(String),  
    value Float64
  ),
  usage_user Float64 
             MATERIALIZED metrics.value[indexOf(metrics.name,'usage_user')],
  usage_system Float64 
             MATERIALIZED metrics.value[indexOf(metrics.name,'usage_system')]
) ENGINE = MergeTree(created_date, (tags_id, created_at), 8192);

On insert, ClickHouse computes them automatically. This way you can combine business with pleasure: the schema is flexible and general, but we pulled out the most frequently used columns. Note that this did not require changing the insert pipeline or ETL, which continues to insert arrays into the table. We just did an ALTER TABLE, added a couple of columns, and got a hybrid, faster schema that can be used right away.
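The same hybrid can be reached from the plain array table with an ALTER (a sketch; it mirrors the MATERIALIZED definitions shown above):

```sql
-- Materialized columns can be added to an existing array-based table
-- without touching the insert pipeline: they are computed on insert.
ALTER TABLE cpu_alc
    ADD COLUMN usage_user Float64
        MATERIALIZED metrics.value[indexOf(metrics.name, 'usage_user')],
    ADD COLUMN usage_system Float64
        MATERIALIZED metrics.value[indexOf(metrics.name, 'usage_system')];
```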

Codecs and compression

For time series it matters how well you pack the data, because the volume of information can be very large. ClickHouse has a set of tools to achieve 1:10, 1:20 and sometimes better compression. That means 1 TB of uncompressed data takes 50-100 GB on disk. Smaller size is good: data can be read and processed faster.

To achieve a high level of compression, ClickHouse supports a set of specialized codecs (the example below uses DoubleDelta and Gorilla) on top of general-purpose compression.


An example table:

CREATE TABLE benchmark.cpu_codecs_lz4 (
    created_date Date DEFAULT today(), 
    created_at DateTime DEFAULT now() Codec(DoubleDelta, LZ4), 
    tags_id UInt32, 
    usage_user Float64 Codec(Gorilla, LZ4), 
    usage_system Float64 Codec(Gorilla, LZ4), 
    usage_idle Float64 Codec(Gorilla, LZ4), 
    usage_nice Float64 Codec(Gorilla, LZ4), 
    usage_iowait Float64 Codec(Gorilla, LZ4), 
    usage_irq Float64 Codec(Gorilla, LZ4), 
    usage_softirq Float64 Codec(Gorilla, LZ4), 
    usage_steal Float64 Codec(Gorilla, LZ4), 
    usage_guest Float64 Codec(Gorilla, LZ4), 
    usage_guest_nice Float64 Codec(Gorilla, LZ4), 
    additional_tags String DEFAULT ''
)
ENGINE = MergeTree(created_date, (tags_id, created_at), 8192);

Here we define the DoubleDelta codec in one case and Gorilla in the other, and always add LZ4 compression on top. As a result, the size of the data on disk is greatly reduced.


The same data takes up very different amounts of space depending on the codecs and compression used:

  • in a GZIP file on disk;
  • in ClickHouse without codecs, but with ZSTD compression;
  • in ClickHouse with LZ4 and ZSTD codecs and compression.

It can be seen that tables with codecs take up much less space.

Size matters

No less important is selecting the correct data type.


In all the examples above I used Float64. But if we had chosen Float32, that would be even better. This was well demonstrated by the guys from Percona in the article linked above. It is important to use the most compact type that fits the task, not so much for size on disk as for query speed. ClickHouse is very sensitive to this.

If you can use Int32 instead of Int64, expect an almost twofold increase in performance. The data takes up less memory, and all the "arithmetic" works much faster. Internally, ClickHouse is a very strictly typed system that makes the most of what modern hardware provides.
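A hedged illustration of the idea (a hypothetical table, written in the newer MergeTree syntax with an explicit ORDER BY):

```sql
-- Prefer the narrowest type that still fits the data:
-- less disk, less memory, faster "arithmetic".
CREATE TABLE metrics_narrow (
    created_at DateTime,     -- second precision, 4 bytes
    tags_id UInt32,          -- instead of UInt64
    metric_value Float32     -- instead of Float64, if precision allows
) ENGINE = MergeTree
ORDER BY (tags_id, created_at);
```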

Aggregation and Materialized Views

Aggregation and materialized views allow you to maintain aggregates for different needs:


For example, you may have non-aggregated source data, and you can attach various materialized views to it with automatic summation through the special engine SummingMergeTree (SMT). SMT is a special aggregating data structure that computes aggregates automatically. Raw data is inserted into the database, automatically aggregated, and dashboards can use it right away.
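One possible shape of such a view over the cpu table defined earlier (the view name and the choice of aggregates are mine, not from the talk):

```sql
-- Per-minute aggregates, summed automatically as data is inserted
-- and merged; dashboards can query cpu_by_minute directly.
CREATE MATERIALIZED VIEW cpu_by_minute
ENGINE = SummingMergeTree
ORDER BY (tags_id, minute)
AS SELECT
    tags_id,
    toStartOfMinute(created_at) AS minute,
    sum(usage_user) AS sum_usage_user,
    count()         AS samples
FROM cpu
GROUP BY tags_id, minute;
```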

TTL - "forget" old data

How to "forget" data that is no longer needed? clickhouse knows how to do it. When creating tables, you can specify TTL expressions: for example, that we store minute data for one day, daily data for 30 days, and never touch weekly or monthly data:

CREATE TABLE aggr_by_minute
…
TTL time + INTERVAL 1 DAY

CREATE TABLE aggr_by_day
…
TTL time + INTERVAL 30 DAY

CREATE TABLE aggr_by_week
…
/* no TTL */

Multi-tier: partitioning data across disks

Developing this idea further, data can be stored in ClickHouse in different places. Suppose we want to keep hot data for the last week on a very fast local SSD and move older historical data elsewhere. In ClickHouse this is now possible:


You can configure a storage policy so that ClickHouse automatically moves data to another storage when certain conditions are met.

But that's not all. At the level of a particular table, you can define rules for exactly when data moves to cold storage. For example, data stays on a very fast disk for 7 days, and everything older is moved to a slow one. This is good because it keeps the system at maximum performance while controlling costs and not spending money on cold data:

CREATE TABLE 
... 
TTL date + INTERVAL 7 DAY TO VOLUME 'cold_volume', 
    date + INTERVAL 180 DAY DELETE

Unique features of ClickHouse

Almost everywhere, ClickHouse has quirks like these, but they are offset by exclusive features that other databases lack. For example, here are some of ClickHouse's unique features:

  • Arrays. ClickHouse has very good support for arrays, along with the ability to perform complex computations on them.
  • Aggregating data structures. This is one of ClickHouse's "killer features". Although the Yandex folks say "we don't want to aggregate data", everything gets aggregated in ClickHouse, because it's fast and convenient.
  • Materialized views. Together with aggregating data structures, materialized views enable convenient real-time aggregation.
  • ClickHouse SQL. This is an SQL language extension with additional and exclusive features available only in ClickHouse. Previously it was an extension on the one hand and a limitation on the other. Now almost all the shortcomings compared to SQL-92 have been removed; now it's just an extension.
  • Lambda expressions. Does any other database have them?
  • ML support. Different databases have this, some better, some worse.
  • Open source. We can extend ClickHouse together. ClickHouse now has about 500 contributors, and that number keeps growing.

Tricky Queries

In ClickHouse there are many different ways to do the same thing. For example, there are three different ways to return the last value from the cpu table (there is also a fourth, but it is even more exotic).

The first shows how convenient ClickHouse queries are when you want to check that a tuple is contained in a subquery. This is something I personally really missed in other databases: if I want to compare something with a subquery, other databases only let me compare a scalar, and for several columns I have to write a JOIN. In ClickHouse you can use a tuple:

SELECT *
  FROM cpu 
 WHERE (tags_id, created_at) IN 
    (SELECT tags_id, max(created_at)
        FROM cpu 
        GROUP BY tags_id)

The second way does the same, but uses the aggregate function argMax:

SELECT 
    argMax(usage_user, created_at),
    argMax(usage_system, created_at),
...
 FROM cpu 
In ClickHouse there are several dozen aggregate functions, and with combinators, by the laws of combinatorics, you get around a thousand of them. argMax is one of them: the query returns the value of usage_user at which the maximum created_at is reached. The third way uses ASOF JOIN:

SELECT now() as created_at,
       cpu.*
  FROM (SELECT DISTINCT tags_id from cpu) base 
  ASOF LEFT JOIN cpu USING (tags_id, created_at)

ASOF JOIN - "gluing" rows with different times. This is a unique feature for databases and is only available in kdb+. If there are two time series with different times, ASOF JOIN allows them to be shifted and glued in one request. For each value in one time series, the nearest value in another is found, and they are returned on the same line:


Analytic Functions

In standard SQL:2003 you can write this:

SELECT origin,
       timestamp,
       timestamp - LAG(timestamp, 1) OVER (PARTITION BY origin ORDER BY timestamp) AS duration,
       timestamp - MIN(timestamp) OVER (PARTITION BY origin ORDER BY timestamp) AS startseq_duration,
       ROW_NUMBER() OVER (PARTITION BY origin ORDER BY timestamp) AS sequence,
       COUNT() OVER (PARTITION BY origin ORDER BY timestamp) AS nb
  FROM mytable
ORDER BY origin, timestamp;

In ClickHouse this is not possible: it does not support standard SQL:2003 and probably never will. Instead, in ClickHouse the custom is to write it through arrays.


I promised lambdas, and here they are. The ClickHouse analogue of the analytic query in standard SQL:2003 computes the difference between two timestamps, the duration, the row number: everything we usually get from analytic functions. In ClickHouse we compute them through arrays: first we collapse the data into an array, then do whatever we want on the array, and then unfold it back. It is not very convenient and requires at least a love of functional programming, but it is very flexible.
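A sketch of what such an array-based rewrite might look like (my reconstruction, not the exact query from the talk; it computes the per-origin gap between consecutive timestamps from the SQL:2003 example above):

```sql
-- Collapse each origin's rows into a sorted array, compute the
-- difference to the previous element with a lambda, then unfold
-- the arrays back into rows with ARRAY JOIN.
SELECT origin, ts, duration
FROM (
    SELECT
        origin,
        arraySort(groupArray(toUInt32(timestamp))) AS ts_arr,
        -- lambda: pair each timestamp with its predecessor
        arrayMap((t, prev) -> t - prev, ts_arr,
                 arrayPushFront(arrayPopBack(ts_arr), ts_arr[1])) AS diff_arr
    FROM mytable
    GROUP BY origin
)
ARRAY JOIN ts_arr AS ts, diff_arr AS duration
ORDER BY origin, ts
```

Where LAG would produce NULL for the first row of each group, this version yields 0.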

Special functions

Besides that, ClickHouse has many specialized functions. For example, how do you determine how many sessions are running simultaneously? A typical monitoring task is to determine the maximum load with a single query. ClickHouse has a special function for exactly this purpose.

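For instance, the maxIntersections aggregate solves the concurrent-sessions task in one query (a sketch; the sessions table and its columns are hypothetical):

```sql
-- maxIntersections returns the maximum number of intervals
-- [start_time, end_time] that overlap at any single moment,
-- i.e. the peak number of concurrent sessions.
SELECT maxIntersections(start_time, end_time) AS peak_concurrent
FROM sessions
```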

In general, ClickHouse has special functions for many purposes:

  • runningDifference, runningAccumulate, neighbor;
  • sumMap(key, value);
  • timeSeriesGroupSum(uid, timestamp, value);
  • timeSeriesGroupRateSum(uid, timestamp, value);
  • skewPop, skewSamp, kurtPop, kurtSamp;
  • WITH FILL / WITH TIES;
  • simpleLinearRegression, stochasticLinearRegression.

And this is not the complete list: there are 500-600 functions in total. Hint: all functions in ClickHouse are listed in a system table (not all are documented, but all are interesting):

SELECT * FROM system.functions ORDER BY name

ClickHouse stores a lot of information about itself, including log tables: query_log, the trace log, the log of operations on data parts (part_log), the metrics log, and the system log it usually writes to disk. The metrics log is a time series inside ClickHouse itself: the database can play the role of a time-series database, thus "devouring" itself.


This is also a unique thing: since we do such a good job with time series, why not store everything we need in ClickHouse itself? We don't need Prometheus; we keep everything in ClickHouse, connect Grafana, and monitor ourselves. However, if ClickHouse goes down, we won't see why, so in practice this is usually not done.

One large cluster or many small ClickHouses

Which is better: one large cluster or many small ClickHouses? The traditional DWH approach is a large cluster in which a schema is allocated to each application. You go to the database administrator, ask for a schema, and you get one:


In ClickHouse you can do it differently: each application can have its own ClickHouse.


We no longer need a big DWH monster and unaccommodating admins. We can give each application its own ClickHouse, and a developer can set it up himself, since ClickHouse is very easy to install and does not require complex administration.


But if we have many ClickHouses and need to spin them up often, this process should be automated. For that we can use, for example, Kubernetes and clickhouse-operator. In Kubernetes, ClickHouse can be deployed "on click": click a button, run the manifest, and the database is ready. You can immediately create a schema, start loading metrics into it, and 5 minutes later a Grafana dashboard is ready. It's that simple!

The result?

So, ClickHouse is:

  • Fast. Everyone knows this.
  • Simple. Somewhat debatable, but I believe it is hard in training, easy in battle. If you understand how ClickHouse works, everything is very simple.
  • Universal. It suits different scenarios: DWH, time series, log storage. But it is not an OLTP database, so don't try to do short inserts and point reads there.
  • Interesting. Probably anyone who works with ClickHouse has experienced many interesting moments, in both the good and the bad sense. For example, a new release comes out and everything stops working; or you struggle with a task for two days, and after a question in the Telegram chat it is solved in two minutes; or, as at the conference during Alexey Milovidov's talk, a screenshot from ClickHouse broke the HighLoad++ broadcast. These things happen all the time and make our life with ClickHouse bright and interesting!

The presentation can be viewed here.


The long-awaited meeting of developers of high-load systems at HighLoad++ will take place on November 9 and 10 in Skolkovo. Finally, it will be an offline conference (albeit with all precautions), as the energy of HighLoad++ cannot be packaged online.

For the conference, we find and show you cases about the maximum possibilities of technology: HighLoad++ was, is and will be the only place where you can learn in two days how Facebook, Yandex, VKontakte, Google and Amazon work.

Having held our meetings without interruption since 2007, this year we will meet for the 14th time. During this time the conference has grown 10 times; last year the industry's key event gathered 3339 participants and 165 speakers for talks and meetups, with 16 tracks running at the same time.
Last year there were 20 buses, 5280 liters of tea and coffee, 1650 liters of fruit drinks and 10200 bottles of water. And also 2640 kilograms of food and 16 000 plates and cups. By the way, with the money raised from recycled paper, we planted 25 oak seedlings 🙂

Tickets can be bought here, conference news is available here, and you can talk to us on all social networks: Telegram, Facebook, VKontakte and Twitter.

Source: habr.com
