ClickHouse Database for Humans, or Alien Technologies

Aleksey Lizunov, Head of the Competence Center for Remote Service Channels of the Directorate of Information Technologies of the MKB

As an alternative to the ELK stack (Elasticsearch, Logstash, Kibana), we are researching the use of the ClickHouse database as a data store for logs.

In this article, we would like to share our experience with the ClickHouse database and the preliminary results of the pilot operation. Let us say right away: the results were impressive.


Next, we will describe in more detail how our system is configured and what components it consists of. But first, a few words about this database in general and why it is worth paying attention to. ClickHouse is a high-performance analytical columnar database from Yandex. It is used in Yandex services; initially it was the main data store for Yandex.Metrica. It is open source and free. From a developer's point of view, I have always wondered how they implemented it, because the data volumes are fantastically large. And the user interface of Metrica itself is very flexible and fast. On first acquaintance with this database, the impression is: "Well, finally! Made for people!", from the installation process all the way to sending queries.

This database has a very low entry threshold. Even an average-skilled developer can install it in a few minutes and start using it. Everything just works. Even people who are new to Linux can quickly handle the installation and perform the simplest operations. If earlier, at the words Big Data, Hadoop, Google BigTable, or HDFS, an ordinary developer imagined terabytes and petabytes and assumed that some superhumans handle the configuration and development of these systems, then with the advent of ClickHouse we got a simple, understandable tool with which you can solve a previously unattainable range of tasks. It takes only one fairly average machine and five minutes to install. In other words, we got a database like MySQL, say, but built for storing billions of records. A kind of super-archiver with an SQL language. It is as if people had been handed alien weapons.

About our logging system

To collect information, we use IIS log files of web applications in the standard format (we are also parsing application logs, but the main goal at the pilot stage is to collect IIS logs).

For various reasons, we could not completely abandon the ELK stack, and we continue to use the LogStash and Filebeat components, which have proven themselves well and work quite reliably and predictably.

The general logging scheme is shown in the figure below:

[Figure: general logging scheme]

A peculiarity of writing data to ClickHouse is that records should be inserted infrequently (once per second) in large batches. This, apparently, is the most "problematic" part you run into when you first start working with ClickHouse: the scheme becomes a little more complicated.
The LogStash plugin that inserts data directly into ClickHouse helped a lot here. This component is deployed on the same server as the database itself. Generally speaking, this is not recommended, but from a practical point of view it spares us from provisioning separate servers while the plugin runs on the same machine. We have not observed any failures or resource conflicts with the database. In addition, the plugin has a retry mechanism in case of errors, and when an error occurs it writes to disk the batch of data that could not be inserted (the file format is convenient: after editing, you can easily insert the corrected batch using clickhouse-client).
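
To make this requirement concrete, here is a minimal, purely illustrative sketch (with made-up values) of what such a batch insert looks like in ClickHouse SQL; in our scheme the Logstash plugin builds batches like this automatically (up to 10,000 rows, flushed roughly once per second, see the flush_size and idle_flush_time settings in its configuration later in the article).

-- illustrative values only; columns not listed receive default values
INSERT INTO log_web (logdate, logdatetime, fld_app_name, fld_app_module,
                     method, uriStem, response, timetaken)
VALUES
    ('2019-07-11', '2019-07-11 10:00:01', 'site1.domain.ru', 'web', 'GET',  '/',         '200', 15),
    ('2019-07-11', '2019-07-11 10:00:01', 'site1.domain.ru', 'web', 'GET',  '/api/data', '200', 120),
    ('2019-07-11', '2019-07-11 10:00:02', 'site1.domain.ru', 'web', 'POST', '/api/data', '500', 350);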

A complete list of the software used in the scheme is given below.

List of software used

Nginx
  Reverse proxy to restrict access by ports and handle authorization.
  Currently not used in the scheme.
  https://nginx.org/ru/download.html
  https://nginx.org/download/nginx-1.16.0.tar.gz

FileBeat
  Ships log files.
  https://www.elastic.co/downloads/beats/filebeat (distribution for Windows 64-bit)
  https://artifacts.elastic.co/downloads/beats/filebeat/filebeat-7.3.0-windows-x86_64.zip

Logstash
  Log collector. Collects logs from FileBeat, as well as from the RabbitMQ queue (for servers located in the DMZ).
  https://www.elastic.co/products/logstash
  https://artifacts.elastic.co/downloads/logstash/logstash-7.0.1.rpm

Logstash-output-clickhouse
  Logstash plugin for transferring logs to the ClickHouse database in batches.
  https://github.com/mikechris/logstash-output-clickhouse
  /usr/share/logstash/bin/logstash-plugin install logstash-output-clickhouse
  /usr/share/logstash/bin/logstash-plugin install logstash-filter-prune
  /usr/share/logstash/bin/logstash-plugin install logstash-filter-multiline

ClickHouse
  Log storage. https://clickhouse.yandex/docs/ru/
  https://packagecloud.io/Altinity/clickhouse/packages/el/7/clickhouse-server-19.5.3.8-1.el7.x86_64.rpm
  https://packagecloud.io/Altinity/clickhouse/packages/el/7/clickhouse-client-19.5.3.8-1.el7.x86_64.rpm
  Note. Since August 2018, "normal" rpm builds for RHEL have been available in the Yandex repository, so you can try using them. At the time of installation, we were using the packages built by Altinity.

Grafana
  Log visualization; setting up dashboards.
  https://grafana.com/
  https://grafana.com/grafana/download (Red Hat & CentOS, 64-bit, latest version)

ClickHouse datasource for Grafana 4.6+
  Plugin for Grafana with a ClickHouse data source.
  https://grafana.com/plugins/vertamedia-clickhouse-datasource
  https://grafana.com/api/plugins/vertamedia-clickhouse-datasource/versions/1.8.1/download

Logstash (in the DMZ)
  Log router from FileBeat to the RabbitMQ queue.
  Note. Unfortunately, FileBeat does not output directly to RabbitMQ, so an intermediate link in the form of Logstash is required.
  https://www.elastic.co/products/logstash
  https://artifacts.elastic.co/downloads/logstash/logstash-7.0.1.rpm

RabbitMQ
  Message queue. This is the log buffer in the DMZ.
  https://www.rabbitmq.com/download.html
  https://github.com/rabbitmq/rabbitmq-server/releases/download/v3.7.14/rabbitmq-server-3.7.14-1.el7.noarch.rpm

Erlang Runtime (required for RabbitMQ)
  Erlang runtime, required for RabbitMQ to work.
  http://www.erlang.org/download.html
  https://www.rabbitmq.com/install-rpm.html#install-erlang
  http://www.erlang.org/downloads/21.3

The server configuration with the ClickHouse database is presented in the following table:

Configuration
  HDD: 40 GB
  RAM: 8 GB
  Processor: Core 2, 2 GHz
  Note: pay attention to the tips for operating the ClickHouse database (https://clickhouse.yandex/docs/ru/operations/tips/)

General system software
  OS: Red Hat Enterprise Linux Server (Maipo)
  JRE (Java 8)

As you can see, this is an ordinary workstation.

The structure of the table for storing logs is as follows:

log_web.sql

CREATE TABLE log_web (
  logdate Date,
  logdatetime DateTime CODEC(Delta, LZ4HC),
   
  fld_log_file_name LowCardinality( String ),
  fld_server_name LowCardinality( String ),
  fld_app_name LowCardinality( String ),
  fld_app_module LowCardinality( String ),
  fld_website_name LowCardinality( String ),
 
  serverIP LowCardinality( String ),
  method LowCardinality( String ),
  uriStem String,
  uriQuery String,
  port UInt32,
  username LowCardinality( String ),
  clientIP String,
  clientRealIP String,
  userAgent String,
  referer String,
  response String,
  subresponse String,
  win32response String,
  timetaken UInt64
   
  , uriQuery__utm_medium String
  , uriQuery__utm_source String
  , uriQuery__utm_campaign String
  , uriQuery__utm_term String
  , uriQuery__utm_content String
  , uriQuery__yclid String
  , uriQuery__region String
 
) Engine = MergeTree()
PARTITION BY toYYYYMM(logdate)
ORDER BY (fld_app_name, fld_app_module, logdatetime)
SETTINGS index_granularity = 8192;

We use the default index granularity and partitioning by month. All the fields essentially correspond to the IIS log entries for HTTP requests. Note that there are separate fields for storing UTM tags (they are parsed out of the query string field at the insertion stage).
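
For example, a query like the following (the values are made up) benefits from both points: the logdate filter touches only one monthly partition, and the UTM fields, already parsed at insert time, can be aggregated directly.

SELECT
    uriQuery__utm_source,
    count() AS hits
FROM log_web
WHERE (logdate >= '2019-07-01') AND (logdate < '2019-08-01')
    AND (uriQuery__utm_source != '')
GROUP BY uriQuery__utm_source
ORDER BY hits DESC
LIMIT 10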

Several system fields have also been added to the table to store information about the system, its component, and the server. See the table below for a description of these fields. We store logs for several systems in one table.

fld_app_name
  Application/system name.
  Valid values:
    • site1.domain.com External site 1
    • site2.domain.com External site 2
    • internal-site1.domain.local Internal site 1
  Example: site1.domain.com

fld_app_module
  System module.
  Valid values:
    • web - Website
    • svc - Web site service
    • intgr - Integration web service
    • bo - Admin (BackOffice)
  Example: web

fld_website_name
  Site name in IIS.
  Several systems can be deployed on one server, or even several instances of one system module.
  Example: web-main

fld_server_name
  Server name.
  Example: web1.domain.com

fld_log_file_name
  Path to the log file on the server.
  Example: C:\inetpub\logs\LogFiles\W3SVC1\u_ex190711.log

This allows us to efficiently build graphs in Grafana, for example, to view requests from the frontend of a particular system. This is similar to the site counter in Yandex.Metrica.
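
A typical query behind such a graph might look roughly like this (a sketch with made-up filter values; the SQL that the Grafana plugin actually generates is shown later in the article):

SELECT
    toStartOfMinute(logdatetime) AS t,
    count() AS requests
FROM log_web
WHERE (fld_app_name = 'site1.domain.ru') AND (fld_app_module = 'web')
    AND (logdate = today())
GROUP BY t
ORDER BY t ASC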

Here are some statistics on the use of the database for two months.

Number of records broken down by systems and their components

SELECT
    fld_app_name,
    fld_app_module,
    count(fld_app_name) AS rows_count
FROM log_web
GROUP BY
    fld_app_name,
    fld_app_module
    WITH TOTALS
ORDER BY
    fld_app_name ASC,
    rows_count DESC
 
┌─fld_app_name─────┬─fld_app_module─┬─rows_count─┐
│ site1.domain.ru  │ web            │     131441 │
│ site2.domain.ru  │ web            │    1751081 │
│ site3.domain.ru  │ web            │  106887543 │
│ site3.domain.ru  │ svc            │   44908603 │
│ site3.domain.ru  │ intgr          │    9813911 │
│ site4.domain.ru  │ web            │     772095 │
│ site5.domain.ru  │ web            │   17037221 │
│ site5.domain.ru  │ intgr          │     838559 │
│ site5.domain.ru  │ bo             │       7404 │
│ site6.domain.ru  │ web            │     595877 │
│ site7.domain.ru  │ web            │   27778858 │
└──────────────────┴────────────────┴────────────┘
 
Totals:
┌─fld_app_name─┬─fld_app_module─┬─rows_count─┐
│              │                │  210522593 │
└──────────────┴────────────────┴────────────┘
 
11 rows in set. Elapsed: 4.874 sec. Processed 210.52 million rows, 421.67 MB (43.19 million rows/s., 86.51 MB/s.)

The amount of data on the disk

SELECT
    formatReadableSize(sum(data_uncompressed_bytes)) AS uncompressed,
    formatReadableSize(sum(data_compressed_bytes)) AS compressed,
    sum(rows) AS total_rows
FROM system.parts
WHERE table = 'log_web'
 
┌─uncompressed─┬─compressed─┬─total_rows─┐
│ 54.50 GiB    │ 4.86 GiB   │  211427094 │
└──────────────┴────────────┴────────────┘
 
1 rows in set. Elapsed: 0.035 sec.

Degree of data compression in columns

SELECT
    name,
    formatReadableSize(data_uncompressed_bytes) AS uncompressed,
    formatReadableSize(data_compressed_bytes) AS compressed,
    data_uncompressed_bytes / data_compressed_bytes AS compress_ratio
FROM system.columns
WHERE table = 'log_web'
 
┌─name───────────────────┬─uncompressed─┬─compressed─┬─────compress_ratio─┐
│ logdate                │ 401.53 MiB   │ 1.80 MiB   │ 223.16665968777315 │
│ logdatetime            │ 803.06 MiB   │ 35.91 MiB  │ 22.363966401202305 │
│ fld_log_file_name      │ 220.66 MiB   │ 2.60 MiB   │  84.99905736932571 │
│ fld_server_name        │ 201.54 MiB   │ 50.63 MiB  │  3.980924816977078 │
│ fld_app_name           │ 201.17 MiB   │ 969.17 KiB │ 212.55518183686877 │
│ fld_app_module         │ 201.17 MiB   │ 968.60 KiB │ 212.67805817411906 │
│ fld_website_name       │ 201.54 MiB   │ 1.24 MiB   │  162.7204926761546 │
│ serverIP               │ 201.54 MiB   │ 50.25 MiB  │  4.010824061219731 │
│ method                 │ 201.53 MiB   │ 43.64 MiB  │  4.617721053304486 │
│ uriStem                │ 5.13 GiB     │ 832.51 MiB │  6.311522291936919 │
│ uriQuery               │ 2.58 GiB     │ 501.06 MiB │  5.269731450124478 │
│ port                   │ 803.06 MiB   │ 3.98 MiB   │ 201.91673864241824 │
│ username               │ 318.08 MiB   │ 26.93 MiB  │ 11.812513794583598 │
│ clientIP               │ 2.35 GiB     │ 82.59 MiB  │ 29.132328640073343 │
│ clientRealIP           │ 2.49 GiB     │ 465.05 MiB │  5.478382297052563 │
│ userAgent              │ 18.34 GiB    │ 764.08 MiB │  24.57905114484208 │
│ referer                │ 14.71 GiB    │ 1.37 GiB   │ 10.736792723669906 │
│ response               │ 803.06 MiB   │ 83.81 MiB  │  9.582334090987247 │
│ subresponse            │ 399.87 MiB   │ 1.83 MiB   │  218.4831068635027 │
│ win32response          │ 407.86 MiB   │ 7.41 MiB   │ 55.050315514606815 │
│ timetaken              │ 1.57 GiB     │ 402.06 MiB │ 3.9947395692010637 │
│ uriQuery__utm_medium   │ 208.17 MiB   │ 12.29 MiB  │ 16.936148912472955 │
│ uriQuery__utm_source   │ 215.18 MiB   │ 13.00 MiB  │ 16.548367623199912 │
│ uriQuery__utm_campaign │ 381.46 MiB   │ 37.94 MiB  │ 10.055156353418509 │
│ uriQuery__utm_term     │ 231.82 MiB   │ 10.78 MiB  │ 21.502540454070672 │
│ uriQuery__utm_content  │ 441.34 MiB   │ 87.60 MiB  │  5.038260760449327 │
│ uriQuery__yclid        │ 216.88 MiB   │ 16.58 MiB  │  13.07721335008116 │
│ uriQuery__region       │ 204.35 MiB   │ 9.49 MiB   │  21.52661903446796 │
└────────────────────────┴──────────────┴────────────┴────────────────────┘
 
28 rows in set. Elapsed: 0.005 sec.

Description of the components used

FileBeat. Transferring file logs

This component tracks changes to log files on disk and passes the information on to LogStash. It is installed on all servers where log files are written (usually IIS). It works in tail mode (i.e. it transfers only the records appended to the file), but it can also be configured to transfer entire files, which is useful when you need to load data from previous months: just put the log file in the watched folder and it will be read in its entirety.

If the service is stopped, data simply stops being transferred on to the storage.

An example configuration looks like this:

filebeat.yml

filebeat.inputs:
- type: log
  enabled: true
  paths:
    - C:/inetpub/logs/LogFiles/W3SVC1/*.log
  exclude_files: ['.gz$','.zip$']
  tail_files: true
  ignore_older: 24h
  fields:
    fld_server_name: "site1.domain.ru"
    fld_app_name: "site1.domain.ru"
    fld_app_module: "web"
    fld_website_name: "web-main"
 
- type: log
  enabled: true
  paths:
    - C:/inetpub/logs/LogFiles/__Import/access_log-*
  exclude_files: ['.gz$','.zip$']
  tail_files: false
  fields:
    fld_server_name: "site2.domain.ru"
    fld_app_name: "site2.domain.ru"
    fld_app_module: "web"
    fld_website_name: "web-main"
    fld_logformat: "logformat__apache"
 
 
filebeat.config.modules:
  path: ${path.config}/modules.d/*.yml
  reload.enabled: false
  reload.period: 2s
 
output.logstash:
  hosts: ["log.domain.com:5044"]
 
  ssl.enabled: true
  ssl.certificate_authorities: ["C:/filebeat/certs/ca.pem", "C:/filebeat/certs/ca-issuing.pem"]
  ssl.certificate: "C:/filebeat/certs/site1.domain.ru.cer"
  ssl.key: "C:/filebeat/certs/site1.domain.ru.key"
 
#================================ Processors =====================================
 
processors:
  - add_host_metadata: ~
  - add_cloud_metadata: ~

Logstash. Log collector

This component receives log entries from FileBeat (or via the RabbitMQ queue), parses them, and inserts them into the ClickHouse database in batches.

For insertion into ClickHouse, the Logstash-output-clickhouse plugin is used. The plugin has a request retry mechanism, but for a planned shutdown it is better to stop the service itself. When it is stopped, messages accumulate in the RabbitMQ queue, so if the stop is a long one, it is better to also stop the Filebeats on the servers. In a scheme where RabbitMQ is not used (on the local network Filebeat sends logs directly to Logstash), Filebeat handles temporary unavailability of its output quite acceptably and safely, so for it the outage passes without consequences.

An example configuration looks like this:

log_web__filebeat_clickhouse.conf

input {
 
    beats {
        port => 5044
        type => 'iis'
        ssl => true
        ssl_certificate_authorities => ["/etc/logstash/certs/ca.cer", "/etc/logstash/certs/ca-issuing.cer"]
        ssl_certificate => "/etc/logstash/certs/server.cer"
        ssl_key => "/etc/logstash/certs/server-pkcs8.key"
        ssl_verify_mode => "peer"
 
            add_field => {
                "fld_server_name" => "%{[fields][fld_server_name]}"
                "fld_app_name" => "%{[fields][fld_app_name]}"
                "fld_app_module" => "%{[fields][fld_app_module]}"
                "fld_website_name" => "%{[fields][fld_website_name]}"
                "fld_log_file_name" => "%{source}"
                "fld_logformat" => "%{[fields][fld_logformat]}"
            }
    }
 
    rabbitmq {
        host => "queue.domain.com"
        port => 5671
        user => "q-reader"
        password => "password"
        queue => "web_log"
        heartbeat => 30
        durable => true
        ssl => true
        #ssl_certificate_path => "/etc/logstash/certs/server.p12"
        #ssl_certificate_password => "password"
 
        add_field => {
            "fld_server_name" => "%{[fields][fld_server_name]}"
            "fld_app_name" => "%{[fields][fld_app_name]}"
            "fld_app_module" => "%{[fields][fld_app_module]}"
            "fld_website_name" => "%{[fields][fld_website_name]}"
            "fld_log_file_name" => "%{source}"
            "fld_logformat" => "%{[fields][fld_logformat]}"
        }
    }
 
}
 
filter { 
 
      if [message] =~ "^#" {
        drop {}
      }
 
      if [fld_logformat] == "logformat__iis_with_xrealip" {
     
          grok {
            match => ["message", "%{TIMESTAMP_ISO8601:log_timestamp} %{IP:serverIP} %{WORD:method} %{NOTSPACE:uriStem} %{NOTSPACE:uriQuery} %{NUMBER:port} %{NOTSPACE:username} %{IPORHOST:clientIP} %{NOTSPACE:userAgent} %{NOTSPACE:referer} %{NUMBER:response} %{NUMBER:subresponse} %{NUMBER:win32response} %{NUMBER:timetaken} %{NOTSPACE:xrealIP} %{NOTSPACE:xforwarderfor}"]
          }
      } else {
   
          grok {
             match => ["message", "%{TIMESTAMP_ISO8601:log_timestamp} %{IP:serverIP} %{WORD:method} %{NOTSPACE:uriStem} %{NOTSPACE:uriQuery} %{NUMBER:port} %{NOTSPACE:username} %{IPORHOST:clientIP} %{NOTSPACE:userAgent} %{NOTSPACE:referer} %{NUMBER:response} %{NUMBER:subresponse} %{NUMBER:win32response} %{NUMBER:timetaken}"]
          }
 
      }
 
      date {
        match => [ "log_timestamp", "YYYY-MM-dd HH:mm:ss" ]
          timezone => "Etc/UTC"
        remove_field => [ "log_timestamp", "@timestamp" ]
        target => [ "log_timestamp2" ]
      }
 
        ruby {
            code => "tstamp = event.get('log_timestamp2').to_i
                        event.set('logdatetime', Time.at(tstamp).strftime('%Y-%m-%d %H:%M:%S'))
                        event.set('logdate', Time.at(tstamp).strftime('%Y-%m-%d'))"
        }
 
      if [bytesSent] {
        ruby {
          code => "event.set('kilobytesSent', event.get('bytesSent').to_i / 1024.0)"
        }
      }
 
 
      if [bytesReceived] {
        ruby {
          code => "event.set('kilobytesReceived', event.get('bytesReceived').to_i / 1024.0)"
        }
      }
 
   
        ruby {
            code => "event.set('clientRealIP', event.get('clientIP'))"
        }
        if [xrealIP] {
            ruby {
                code => "event.set('clientRealIP', event.get('xrealIP'))"
            }
        }
        if [xforwarderfor] {
            ruby {
                code => "event.set('clientRealIP', event.get('xforwarderfor'))"
            }
        }
 
      mutate {
        convert => ["bytesSent", "integer"]
        convert => ["bytesReceived", "integer"]
        convert => ["timetaken", "integer"] 
        convert => ["port", "integer"]
 
        add_field => {
            "clientHostname" => "%{clientIP}"
        }
      }
 
        useragent {
            source=> "useragent"
            prefix=> "browser"
        }
 
        kv {
            source => "uriQuery"
            prefix => "uriQuery__"
            allow_duplicate_values => false
            field_split => "&"
            include_keys => [ "utm_medium", "utm_source", "utm_campaign", "utm_term", "utm_content", "yclid", "region" ]
        }
 
        mutate {
            join => { "uriQuery__utm_source" => "," }
            join => { "uriQuery__utm_medium" => "," }
            join => { "uriQuery__utm_campaign" => "," }
            join => { "uriQuery__utm_term" => "," }
            join => { "uriQuery__utm_content" => "," }
            join => { "uriQuery__yclid" => "," }
            join => { "uriQuery__region" => "," }
        }
 
}
 
output { 
  #stdout {codec => rubydebug}
    clickhouse {
      headers => ["Authorization", "Basic abcdsfks..."]
      http_hosts => ["http://127.0.0.1:8123"]
      save_dir => "/etc/logstash/tmp"
      table => "log_web"
      request_tolerance => 1
      flush_size => 10000
      idle_flush_time => 1
        mutations => {
            "fld_log_file_name" => "fld_log_file_name"
            "fld_server_name" => "fld_server_name"
            "fld_app_name" => "fld_app_name"
            "fld_app_module" => "fld_app_module"
            "fld_website_name" => "fld_website_name"
 
            "logdatetime" => "logdatetime"
            "logdate" => "logdate"
            "serverIP" => "serverIP"
            "method" => "method"
            "uriStem" => "uriStem"
            "uriQuery" => "uriQuery"
            "port" => "port"
            "username" => "username"
            "clientIP" => "clientIP"
            "clientRealIP" => "clientRealIP"
            "userAgent" => "userAgent"
            "referer" => "referer"
            "response" => "response"
            "subresponse" => "subresponse"
            "win32response" => "win32response"
            "timetaken" => "timetaken"
             
            "uriQuery__utm_medium" => "uriQuery__utm_medium"
            "uriQuery__utm_source" => "uriQuery__utm_source"
            "uriQuery__utm_campaign" => "uriQuery__utm_campaign"
            "uriQuery__utm_term" => "uriQuery__utm_term"
            "uriQuery__utm_content" => "uriQuery__utm_content"
            "uriQuery__yclid" => "uriQuery__yclid"
            "uriQuery__region" => "uriQuery__region"
        }
    }
 
}

pipelines.yml

# This file is where you define your pipelines. You can define multiple.
# For more information on multiple pipelines, see the documentation:
#   https://www.elastic.co/guide/en/logstash/current/multiple-pipelines.html
 
- pipeline.id: log_web__filebeat_clickhouse
  path.config: "/etc/logstash/log_web__filebeat_clickhouse.conf"

ClickHouse. Log storage

Logs for all systems are stored in one table (see the beginning of the article). It is intended to store information about requests: all the parameters are similar for different formats, such as IIS logs and apache or nginx logs. For application logs, which record, for example, errors, informational messages, and warnings, a separate table with an appropriate structure will be provided (it is currently at the design stage).
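
As an illustration only, such a separate application-log table might look roughly like this (the table name and the set of columns are hypothetical, since the real structure is still being designed):

CREATE TABLE log_app (
  logdate Date,
  logdatetime DateTime CODEC(Delta, LZ4HC),

  fld_app_name LowCardinality( String ),
  fld_app_module LowCardinality( String ),
  fld_server_name LowCardinality( String ),

  level LowCardinality( String ),   -- error / warning / info
  logger String,
  message String
) Engine = MergeTree()
PARTITION BY toYYYYMM(logdate)
ORDER BY (fld_app_name, fld_app_module, logdatetime)
SETTINGS index_granularity = 8192;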

When designing a table, it is very important to decide on the primary key (by which the data will be sorted on disk). The degree of data compression and the query speed depend on it. In our example, the key is
ORDER BY (fld_app_name, fld_app_module, logdatetime)
That is, the name of the system, the name of the system component, and the date of the event. Initially, the date of the event came first; after we moved it to the last place, queries began to work about twice as fast. Changing the primary key requires recreating the table and reloading the data so that ClickHouse re-sorts the data on disk. This is a heavy operation, so it is worth thinking carefully about what should go into the sort key.
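
A sketch of such a reload (the new table name is hypothetical): create a copy with the new sort key, fill it, and swap the names. Note that the CREATE ... AS SELECT form does not carry over column codecs, so in practice it is safer to repeat the full CREATE TABLE statement from log_web.sql, changing only the ORDER BY clause.

CREATE TABLE log_web_v2
ENGINE = MergeTree()
PARTITION BY toYYYYMM(logdate)
ORDER BY (fld_app_name, fld_app_module, logdatetime)
AS SELECT * FROM log_web;

RENAME TABLE log_web TO log_web_old, log_web_v2 TO log_web;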

It should also be noted that the LowCardinality data type appeared in relatively recent versions. When it is used, the size of the compressed data shrinks drastically for fields that have low cardinality (few distinct values).
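
If a suitable column was originally declared as a plain String, it can also be converted in place. For example, the response field holds only a handful of distinct HTTP status codes, so a change like the following could be considered (hypothetical, we have not applied it):

ALTER TABLE log_web MODIFY COLUMN response LowCardinality( String );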

We are currently using version 19.6 and plan to try updating to the latest version. Newer versions have such wonderful features as adaptive granularity, skipping indices, and the DoubleDelta codec, for example.
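
As a rough sketch of what those features could give on this table (we have not applied these changes; note also that on 19.x releases data skipping indices may still require enabling an experimental setting):

-- DoubleDelta is designed for monotonically growing values such as timestamps
ALTER TABLE log_web MODIFY COLUMN logdatetime DateTime CODEC(DoubleDelta, LZ4);

-- a data skipping index on the HTTP status code column
ALTER TABLE log_web ADD INDEX idx_response response TYPE set(0) GRANULARITY 4;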

By default, during installation, the logging level is set to trace. The logs are rotated and archived, but they still grow to a gigabyte. If there is no need for this, you can set the warning level, and the log size shrinks drastically. The logging level is set in the config.xml file:

<!-- Possible levels: https://github.com/pocoproject/poco/blob/develop/Foundation/include/Poco/Logger.h#L105 -->
<level>warning</level>

Some useful commands

Since the original installation packages are built for Debian, for other Linux distributions you need to use the packages built by Altinity.
 
Instructions with links to their repository are available here: https://www.altinity.com/blog/2017/12/18/logstash-with-clickhouse
sudo yum search clickhouse-server
sudo yum install clickhouse-server.noarch
  
1. check the status
sudo systemctl status clickhouse-server
 
2. stop the server
sudo systemctl stop clickhouse-server
 
3. start the server
sudo systemctl start clickhouse-server
 
Start the client in multiline mode (queries are executed after the ";" character):
clickhouse-client --multiline
clickhouse-client --multiline --host 127.0.0.1 --password pa55w0rd
clickhouse-client --multiline --host 127.0.0.1 --port 9440 --secure --user default --password pa55w0rd
 
If an error occurs in a single row, the ClickHouse plugin for Logstash saves the whole batch to the file /tmp/log_web_failed.json.
You can fix this file manually and try to load it into the database by hand:
clickhouse-client --host 127.0.0.1 --password password --query="INSERT INTO log_web FORMAT JSONEachRow" < /tmp/log_web_failed__fixed.json
 
sudo mv /etc/logstash/tmp/log_web_failed.json /etc/logstash/tmp/log_web_failed__fixed.json
sudo chown user_dev /etc/logstash/tmp/log_web_failed__fixed.json
sudo clickhouse-client --host 127.0.0.1 --password password --query="INSERT INTO log_web FORMAT JSONEachRow" < /etc/logstash/tmp/log_web_failed__fixed.json
sudo mv /etc/logstash/tmp/log_web_failed__fixed.json /etc/logstash/tmp/log_web_failed__fixed_.json
 
Exit the command line:
quit;

TLS setup
https://www.altinity.com/blog/2019/3/5/clickhouse-networking-part-2
 
openssl s_client -connect log.domain.com:9440 < /dev/null

Logstash. Log router from FileBeat to the RabbitMQ queue

This component is used to route logs coming from FileBeat to the RabbitMQ queue. There are two points here:

  1. Unfortunately, FileBeat does not have an output plugin for writing directly to RabbitMQ, and such functionality, judging by the issue on their GitHub, is not planned. There is an output plugin for Kafka, but for some reason we cannot use it in our environment.
  2. There are requirements for collecting logs in the DMZ. Because of them, the logs must first be placed in a queue, and then LogStash reads the entries from the queue from outside.

So it is specifically for the case where servers are located in the DMZ that this slightly more complicated scheme has to be used. An example configuration looks like this:

iis_w3c_logs__filebeat_rabbitmq.conf

input {
 
    beats {
        port => 5044
        type => 'iis'
        ssl => true
        ssl_certificate_authorities => ["/etc/pki/tls/certs/app/ca.pem", "/etc/pki/tls/certs/app/ca-issuing.pem"]
        ssl_certificate => "/etc/pki/tls/certs/app/queue.domain.com.cer"
        ssl_key => "/etc/pki/tls/certs/app/queue.domain.com-pkcs8.key"
        ssl_verify_mode => "peer"
    }
 
}
 
output { 
  #stdout {codec => rubydebug}
 
    rabbitmq {
        host => "127.0.0.1"
        port => 5672
        exchange => "monitor.direct"
        exchange_type => "direct"
        key => "%{[fields][fld_app_name]}"
        user => "q-writer"
        password => "password"
        ssl => false
    }
}

RabbitMQ. Message queue

This component is used to buffer log entries in the DMZ. Writing is done via the Filebeat → LogStash chain; reading is done from outside the DMZ via LogStash. When operating through RabbitMQ, about 4 thousand messages per second are processed.

Message routing is configured by system name, i.e. based on the FileBeat configuration data. All messages go into one queue. If for some reason the queuing service is stopped, this does not lead to message loss: the FileBeats will receive connection errors and temporarily suspend sending, and the LogStash instance that reads from the queue will also receive network errors and wait for the connection to be restored. In that case, of course, data is no longer written to the database.

The following instructions are used to create and configure queues:

sudo /usr/local/bin/rabbitmqadmin/rabbitmqadmin declare exchange --vhost=/ name=monitor.direct type=direct
sudo /usr/local/bin/rabbitmqadmin/rabbitmqadmin declare queue --vhost=/ name=web_log durable=true
sudo /usr/local/bin/rabbitmqadmin/rabbitmqadmin --vhost="/" declare binding source="monitor.direct" destination_type="queue" destination="web_log" routing_key="site1.domain.ru"
sudo /usr/local/bin/rabbitmqadmin/rabbitmqadmin --vhost="/" declare binding source="monitor.direct" destination_type="queue" destination="web_log" routing_key="site2.domain.ru"

Grafana. Dashboards

This component is used to visualize monitoring data. In this case, you need to install the ClickHouse datasource for Grafana 4.6+ plugin. We had to tweak it a bit to improve the efficiency of processing SQL filters on the dashboard.

For example, we use variables, and if they are not set in the filter field, we would like the plugin not to generate a WHERE condition of the form ( uriStem = '' AND uriStem != '' ). In that case, ClickHouse would still read the uriStem column. So we tried different options and eventually patched the plugin (the $valueIfEmpty macro) so that for an empty value it returns 1, without referencing the column itself.

And now you can use this query for the graph

$columns(response, count(*) c) from $table where $adhoc
    and $valueIfEmpty($fld_app_name, 1, fld_app_name = '$fld_app_name')
    and $valueIfEmpty($fld_app_module, 1, fld_app_module = '$fld_app_module')
    and $valueIfEmpty($fld_server_name, 1, fld_server_name = '$fld_server_name')
    and $valueIfEmpty($uriStem, 1, uriStem like '%$uriStem%')
    and $valueIfEmpty($clientRealIP, 1, clientRealIP = '$clientRealIP')

which translates to this SQL (note that the empty uriStem fields have been converted to just 1)

SELECT
    t,
    groupArray((response, c)) AS groupArr
FROM
(
    SELECT
        (intDiv(toUInt32(logdatetime), 60) * 60) * 1000 AS t,
        response,
        count(*) AS c
    FROM default.log_web
    WHERE (logdate >= toDate(1565061982)) AND (logdatetime >= toDateTime(1565061982)) AND 1 AND (fld_app_name = 'site1.domain.ru') AND (fld_app_module = 'web') AND 1 AND 1 AND 1
    GROUP BY t, response
    ORDER BY t ASC, response ASC
)
GROUP BY t
ORDER BY t ASC

Conclusion

The appearance of the ClickHouse database has been a landmark event in the market. It was hard to imagine that, completely free of charge, we would in an instant be armed with a powerful and practical tool for working with big data. Of course, as our needs grow (for example, sharding and replication to multiple servers), the scheme will become more complicated. But on first impression, working with this database is very pleasant. You can see that the product is made "for people."

Compared to ElasticSearch, the cost of storing and processing logs is, by our estimate, reduced five to ten times. In other words, if for the current amount of data we would have to set up a cluster of several machines, then with ClickHouse one low-power machine is enough for us. Yes, ElasticSearch also has on-disk data compression mechanisms and other features that can significantly reduce resource consumption, but compared to ClickHouse it would still be more expensive.

Without any special optimizations on our part, on default settings, loading data and selecting it from the database work at an amazing speed. We do not have much data yet (about 200 million records), but the server itself is weak. In the future, we can use this tool for other purposes not related to storing logs: for example, for end-to-end analytics, in the field of security, or machine learning.

At the end, a little about the pros and cons.

Cons

  1. Loading records in large batches. On the one hand this is a feature, but you still have to use additional components to buffer records. This task is not always easy, but it is solvable. And I would like to simplify the scheme.
  2. Some exotic functionality or new features often break in new versions. This is a cause for concern and reduces the desire to upgrade. For example, the Kafka table engine is a very useful feature that allows you to read events directly from Kafka without implementing consumers. But judging by the number of issues on GitHub, we are still wary of using this engine in production. However, if you avoid sudden moves to the side and stick to the main functionality, it works stably.

Pros

  1. It does not slow down.
  2. Low entry threshold.
  3. Open source.
  4. Free.
  5. Scales well (sharding/replication out of the box).
  6. Included in the register of Russian software recommended by the Ministry of Communications.
  7. Official support from Yandex.

Source: habr.com
