The ClickHouse Database for Humans, or Alien Technology

Aleksey Lizunov, Head of the Competence Center for Remote Service Channels, Information Technology Directorate, MKB

As an alternative to the ELK stack (ElasticSearch, Logstash, Kibana), we did some research on using the ClickHouse database as a data store for logs.

In this article, we would like to share our experience of using the ClickHouse database and the preliminary results of the pilot operation. It should be said right away that the results were impressive.


Below, we will describe in more detail how our system is organized and what components it consists of. But first, a few words about this database as a whole, and why it is worth paying attention to. ClickHouse is a high-performance analytical column-oriented database from Yandex. It is used in Yandex services; initially it was the main data storage for Yandex.Metrica. It is an open-source system, free of charge. From a developer's point of view, I had always wondered how they had implemented it, because the data volumes are enormous, and the Metrica user interface itself is remarkably flexible and fast. At the first acquaintance with this database, the impression is: "Well, finally! Made for humans! From the installation process all the way to sending queries."

This database has a very low entry threshold. Even an average-skilled developer can install it in a few minutes and start using it. Everything just works. Even people who are new to Linux can quickly handle the installation and perform the simplest operations. Previously, at the words Big Data, Hadoop, Google BigTable, HDFS, an ordinary developer pictured terabytes or petabytes of data, with superhumans of some kind involved in setting up and developing those systems. With the arrival of ClickHouse, we got a simple, understandable tool that solves a range of tasks that were previously out of reach. It takes just one fairly average machine and five minutes to install. That is, we got a database like, say, MySQL, but designed to store billions of records! A kind of super-archiver with an SQL language. It is as if humans had been handed alien weaponry.

About our logging system

To collect information, IIS log files of web applications in the standard format are used (we are also parsing application logs, but the main goal at the pilot stage is to collect the IIS logs).

For various reasons we could not abandon the ELK stack completely, and we continue to use the LogStash and Filebeat components, which have proven themselves well and work reliably and predictably.

The general logging scheme is shown in the figure below:

[Figure: general logging scheme]

A distinctive feature of writing data to ClickHouse is that records are inserted in large batches at infrequent intervals (once per second). This, apparently, is the most "problematic" part you run into when you first start working with ClickHouse: the scheme becomes a bit more complicated.
The LogStash plugin that inserts data directly into ClickHouse helped a lot here. This component is deployed on the same server as the database itself. Generally speaking, this is not recommended, but from a practical standpoint it saves us from provisioning a separate server while it is deployed on the same one. We have not observed any failures or resource conflicts with the database. In addition, it should be noted that the plugin has a retry mechanism in case of errors. And in case of errors, the plugin writes to disk the batch of data that could not be inserted (the file format is convenient: after editing, you can easily insert the corrected batch using clickhouse-client).
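
To make the write format concrete, here is a minimal sketch of such a batch insert in ClickHouse SQL. The plugin writes through the HTTP interface in JSONEachRow format (the same format used for manual re-inserts later in this article); the two rows and their values are purely illustrative:

INSERT INTO log_web FORMAT JSONEachRow
{"logdate":"2019-07-11","logdatetime":"2019-07-11 10:00:00","fld_app_name":"site1.domain.ru","fld_app_module":"web","method":"GET","uriStem":"/","response":"200","timetaken":15}
{"logdate":"2019-07-11","logdatetime":"2019-07-11 10:00:01","fld_app_name":"site1.domain.ru","fld_app_module":"web","method":"GET","uriStem":"/api/health","response":"200","timetaken":3}

Columns omitted from the JSON are filled with their default values, so a row does not need to carry all 28 columns of the table described below.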

The complete list of software used in the scheme is shown below:

List of software used

NGINX
    Reverse proxy for restricting access by port and organizing authorization.
    (Currently not used in the scheme.)
    https://nginx.org/ru/download.html
    https://nginx.org/download/nginx-1.16.0.tar.gz

FileBeat
    Transfers log files.
    https://www.elastic.co/downloads/beats/filebeat (distribution kit for Windows 64-bit)
    https://artifacts.elastic.co/downloads/beats/filebeat/filebeat-7.3.0-windows-x86_64.zip

LogStash
    Log collector. Used to collect logs from FileBeat, as well as from the RabbitMQ queue (for servers located in the DMZ).
    https://www.elastic.co/products/logstash
    https://artifacts.elastic.co/downloads/logstash/logstash-7.0.1.rpm

Logstash-output-clickhouse
    LogStash plugin for transferring logs to the ClickHouse database in batches.
    https://github.com/mikechris/logstash-output-clickhouse
    /usr/share/logstash/bin/logstash-plugin install logstash-output-clickhouse
    /usr/share/logstash/bin/logstash-plugin install logstash-filter-prune
    /usr/share/logstash/bin/logstash-plugin install logstash-filter-multiline

ClickHouse
    Log storage. https://clickhouse.yandex/docs/ru/
    https://packagecloud.io/Altinity/clickhouse/packages/el/7/clickhouse-server-19.5.3.8-1.el7.x86_64.rpm
    https://packagecloud.io/Altinity/clickhouse/packages/el/7/clickhouse-client-19.5.3.8-1.el7.x86_64.rpm
    Note. Since August 2018, "normal" rpm builds for RHEL have also appeared in the Yandex repository, so you can try using them. At the time of installation, we used the packages built by Altinity.

Grafana
    Log visualization and dashboards.
    https://grafana.com/
    https://grafana.com/grafana/download (Red Hat & CentOS, 64-bit, latest version)

ClickHouse datasource for Grafana 4.6+
    Grafana plugin with a ClickHouse data source.
    https://grafana.com/plugins/vertamedia-clickhouse-datasource
    https://grafana.com/api/plugins/vertamedia-clickhouse-datasource/versions/1.8.1/download

LogStash (DMZ instance)
    Log router from FileBeat to the RabbitMQ queue.
    Note. Unfortunately, FileBeat has no output directly to RabbitMQ, so an intermediate link in the form of Logstash is required.
    https://www.elastic.co/products/logstash
    https://artifacts.elastic.co/downloads/logstash/logstash-7.0.1.rpm

RabbitMQ
    Message queue. Serves as the log buffer in the DMZ.
    https://www.rabbitmq.com/download.html
    https://github.com/rabbitmq/rabbitmq-server/releases/download/v3.7.14/rabbitmq-server-3.7.14-1.el7.noarch.rpm

Erlang Runtime (required for RabbitMQ)
    Erlang runtime, required for RabbitMQ to work.
    http://www.erlang.org/download.html
    https://www.rabbitmq.com/install-rpm.html#install-erlang
    http://www.erlang.org/downloads/21.3

The configuration of the server with the ClickHouse database is shown below:

Configuration
    HDD: 40 GB
    RAM: 8 GB
    CPU: 2 cores @ 2 GHz
    It is worth paying attention to the tips for operating the ClickHouse database (https://clickhouse.yandex/docs/ru/operations/tips/).

General system software
    OS: Red Hat Enterprise Linux Server (Maipo)
    JRE (Java 8)

As you can see, this is an ordinary workstation.

The table structure for storing the logs looks like this:

log_web.sql

CREATE TABLE log_web (
  logdate Date,
  logdatetime DateTime CODEC(Delta, LZ4HC),
   
  fld_log_file_name LowCardinality( String ),
  fld_server_name LowCardinality( String ),
  fld_app_name LowCardinality( String ),
  fld_app_module LowCardinality( String ),
  fld_website_name LowCardinality( String ),
 
  serverIP LowCardinality( String ),
  method LowCardinality( String ),
  uriStem String,
  uriQuery String,
  port UInt32,
  username LowCardinality( String ),
  clientIP String,
  clientRealIP String,
  userAgent String,
  referer String,
  response String,
  subresponse String,
  win32response String,
  timetaken UInt64
   
  , uriQuery__utm_medium String
  , uriQuery__utm_source String
  , uriQuery__utm_campaign String
  , uriQuery__utm_term String
  , uriQuery__utm_content String
  , uriQuery__yclid String
  , uriQuery__region String
 
) Engine = MergeTree()
PARTITION BY toYYYYMM(logdate)
ORDER BY (fld_app_name, fld_app_module, logdatetime)
SETTINGS index_granularity = 8192;

We use the default partitioning (by month) and index granularity. Virtually all the columns correspond to the IIS log entries for logging HTTP requests. Separately, note that there are dedicated columns for storing utm tags (they are parsed at the stage of insertion into the table from the query-string column).
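
Since partitioning is monthly, it is easy to check how the data is laid out in parts on disk. A sketch of such a query against system.parts (the same system table used for the size statistics below):

SELECT
    partition,
    sum(rows) AS rows,
    formatReadableSize(sum(data_compressed_bytes)) AS compressed
FROM system.parts
WHERE (table = 'log_web') AND active
GROUP BY partition
ORDER BY partition ASC

Each partition (one per month, e.g. 201907) can then be detached, dropped, or backed up independently.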

Also, several system columns have been added to the table to store information about the system, component, and server. See the table below for a description of these fields. We store logs for several systems in one table.

fld_app_name
    Application/system name.
    Valid values:
      • site1.domain.com (external site 1)
      • site2.domain.com (external site 2)
      • internal-site1.domain.local (internal site 1)
    Example: site1.domain.com

fld_app_module
    System module.
    Valid values:
      • web - website
      • svc - website web service
      • intgr - integration web service
      • bo - admin (BackOffice)
    Example: web

fld_website_name
    Site name in IIS. Several systems can be deployed on one server, or even several instances of one system module.
    Example: web-main

fld_server_name
    Server name.
    Example: web1.domain.com

fld_log_file_name
    Path to the log file on the server.
    Example: C:\inetpub\logs\LogFiles\W3SVC1\u_ex190711.log

This makes it possible to build graphs in Grafana efficiently. For example, to view requests from the frontend of a particular system. This is similar to the site counter in Yandex.Metrica.
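
For example, a query of the kind that drives such a graph might look like the sketch below (a requests-per-minute curve for one frontend; the system name is one of the illustrative values used throughout this article):

SELECT
    toStartOfMinute(logdatetime) AS t,
    count() AS hits
FROM log_web
WHERE (fld_app_name = 'site1.domain.ru')
    AND (fld_app_module = 'web')
    AND (logdate = today())
GROUP BY t
ORDER BY t ASC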

Here are some statistics on database usage over two months.

Number of records broken down by system and component

SELECT
    fld_app_name,
    fld_app_module,
    count(fld_app_name) AS rows_count
FROM log_web
GROUP BY
    fld_app_name,
    fld_app_module
    WITH TOTALS
ORDER BY
    fld_app_name ASC,
    rows_count DESC
 
┌─fld_app_name─────┬─fld_app_module─┬─rows_count─┐
│ site1.domain.ru  │ web            │     131441 │
│ site2.domain.ru  │ web            │    1751081 │
│ site3.domain.ru  │ web            │  106887543 │
│ site3.domain.ru  │ svc            │   44908603 │
│ site3.domain.ru  │ intgr          │    9813911 │
│ site4.domain.ru  │ web            │     772095 │
│ site5.domain.ru  │ web            │   17037221 │
│ site5.domain.ru  │ intgr          │     838559 │
│ site5.domain.ru  │ bo             │       7404 │
│ site6.domain.ru  │ web            │     595877 │
│ site7.domain.ru  │ web            │   27778858 │
└──────────────────┴────────────────┴────────────┘
 
Totals:
┌─fld_app_name─┬─fld_app_module─┬─rows_count─┐
│              │                │  210522593 │
└──────────────┴────────────────┴────────────┘
 
11 rows in set. Elapsed: 4.874 sec. Processed 210.52 million rows, 421.67 MB (43.19 million rows/s., 86.51 MB/s.)

Amount of data on disk

SELECT
    formatReadableSize(sum(data_uncompressed_bytes)) AS uncompressed,
    formatReadableSize(sum(data_compressed_bytes)) AS compressed,
    sum(rows) AS total_rows
FROM system.parts
WHERE table = 'log_web'
 
┌─uncompressed─┬─compressed─┬─total_rows─┐
│ 54.50 GiB    │ 4.86 GiB   │  211427094 │
└──────────────┴────────────┴────────────┘
 
1 rows in set. Elapsed: 0.035 sec.

Data compression ratio by column

SELECT
    name,
    formatReadableSize(data_uncompressed_bytes) AS uncompressed,
    formatReadableSize(data_compressed_bytes) AS compressed,
    data_uncompressed_bytes / data_compressed_bytes AS compress_ratio
FROM system.columns
WHERE table = 'log_web'
 
┌─name───────────────────┬─uncompressed─┬─compressed─┬─────compress_ratio─┐
│ logdate                │ 401.53 MiB   │ 1.80 MiB   │ 223.16665968777315 │
│ logdatetime            │ 803.06 MiB   │ 35.91 MiB  │ 22.363966401202305 │
│ fld_log_file_name      │ 220.66 MiB   │ 2.60 MiB   │  84.99905736932571 │
│ fld_server_name        │ 201.54 MiB   │ 50.63 MiB  │  3.980924816977078 │
│ fld_app_name           │ 201.17 MiB   │ 969.17 KiB │ 212.55518183686877 │
│ fld_app_module         │ 201.17 MiB   │ 968.60 KiB │ 212.67805817411906 │
│ fld_website_name       │ 201.54 MiB   │ 1.24 MiB   │  162.7204926761546 │
│ serverIP               │ 201.54 MiB   │ 50.25 MiB  │  4.010824061219731 │
│ method                 │ 201.53 MiB   │ 43.64 MiB  │  4.617721053304486 │
│ uriStem                │ 5.13 GiB     │ 832.51 MiB │  6.311522291936919 │
│ uriQuery               │ 2.58 GiB     │ 501.06 MiB │  5.269731450124478 │
│ port                   │ 803.06 MiB   │ 3.98 MiB   │ 201.91673864241824 │
│ username               │ 318.08 MiB   │ 26.93 MiB  │ 11.812513794583598 │
│ clientIP               │ 2.35 GiB     │ 82.59 MiB  │ 29.132328640073343 │
│ clientRealIP           │ 2.49 GiB     │ 465.05 MiB │  5.478382297052563 │
│ userAgent              │ 18.34 GiB    │ 764.08 MiB │  24.57905114484208 │
│ referer                │ 14.71 GiB    │ 1.37 GiB   │ 10.736792723669906 │
│ response               │ 803.06 MiB   │ 83.81 MiB  │  9.582334090987247 │
│ subresponse            │ 399.87 MiB   │ 1.83 MiB   │  218.4831068635027 │
│ win32response          │ 407.86 MiB   │ 7.41 MiB   │ 55.050315514606815 │
│ timetaken              │ 1.57 GiB     │ 402.06 MiB │ 3.9947395692010637 │
│ uriQuery__utm_medium   │ 208.17 MiB   │ 12.29 MiB  │ 16.936148912472955 │
│ uriQuery__utm_source   │ 215.18 MiB   │ 13.00 MiB  │ 16.548367623199912 │
│ uriQuery__utm_campaign │ 381.46 MiB   │ 37.94 MiB  │ 10.055156353418509 │
│ uriQuery__utm_term     │ 231.82 MiB   │ 10.78 MiB  │ 21.502540454070672 │
│ uriQuery__utm_content  │ 441.34 MiB   │ 87.60 MiB  │  5.038260760449327 │
│ uriQuery__yclid        │ 216.88 MiB   │ 16.58 MiB  │  13.07721335008116 │
│ uriQuery__region       │ 204.35 MiB   │ 9.49 MiB   │  21.52661903446796 │
└────────────────────────┴──────────────┴────────────┴────────────────────┘
 
28 rows in set. Elapsed: 0.005 sec.

Description of the components used

FileBeat. Transferring log files

This component tracks changes to log files on disk and passes the information on to LogStash. It is installed on all servers where log files are written (usually IIS). It works in tail mode (i.e. it transfers only the records appended to the file). But it can also be configured separately to transfer whole files. This is handy when you need to load data from previous months: just put the log file into the folder and it will be read in its entirety.

When the service is stopped, data stops being transferred onward to the storage.

An example configuration looks like this:

filebeat.yml

filebeat.inputs:
- type: log
  enabled: true
  paths:
    - C:/inetpub/logs/LogFiles/W3SVC1/*.log
  exclude_files: ['.gz$','.zip$']
  tail_files: true
  ignore_older: 24h
  fields:
    fld_server_name: "site1.domain.ru"
    fld_app_name: "site1.domain.ru"
    fld_app_module: "web"
    fld_website_name: "web-main"
 
- type: log
  enabled: true
  paths:
    - C:/inetpub/logs/LogFiles/__Import/access_log-*
  exclude_files: ['.gz$','.zip$']
  tail_files: false
  fields:
    fld_server_name: "site2.domain.ru"
    fld_app_name: "site2.domain.ru"
    fld_app_module: "web"
    fld_website_name: "web-main"
    fld_logformat: "logformat__apache"
 
 
filebeat.config.modules:
  path: ${path.config}/modules.d/*.yml
  reload.enabled: false
  reload.period: 2s
 
output.logstash:
  hosts: ["log.domain.com:5044"]
 
  ssl.enabled: true
  ssl.certificate_authorities: ["C:/filebeat/certs/ca.pem", "C:/filebeat/certs/ca-issuing.pem"]
  ssl.certificate: "C:/filebeat/certs/site1.domain.ru.cer"
  ssl.key: "C:/filebeat/certs/site1.domain.ru.key"
 
#================================ Processors =====================================
 
processors:
  - add_host_metadata: ~
  - add_cloud_metadata: ~

LogStash. Log collector

This component is designed to receive log entries from FileBeat (or through the RabbitMQ queue), parse them, and insert them in batches into the ClickHouse database.

For inserting into ClickHouse, the Logstash-output-clickhouse plugin is used. The plugin has a retry mechanism, but during a regular shutdown it is better to stop the service itself. When it is stopped, messages accumulate in the RabbitMQ queue, so if the stop is a long one, it is better to stop the Filebeats on the servers as well. In a scheme where RabbitMQ is not used (on the local network, Filebeat sends logs directly to Logstash), the Filebeats behave quite acceptably and safely, so for them the unavailability of the output passes without consequences.

An example configuration looks like this:

log_web__filebeat_clickhouse.conf

input {
 
    beats {
        port => 5044
        type => 'iis'
        ssl => true
        ssl_certificate_authorities => ["/etc/logstash/certs/ca.cer", "/etc/logstash/certs/ca-issuing.cer"]
        ssl_certificate => "/etc/logstash/certs/server.cer"
        ssl_key => "/etc/logstash/certs/server-pkcs8.key"
        ssl_verify_mode => "peer"
 
            add_field => {
                "fld_server_name" => "%{[fields][fld_server_name]}"
                "fld_app_name" => "%{[fields][fld_app_name]}"
                "fld_app_module" => "%{[fields][fld_app_module]}"
                "fld_website_name" => "%{[fields][fld_website_name]}"
                "fld_log_file_name" => "%{source}"
                "fld_logformat" => "%{[fields][fld_logformat]}"
            }
    }
 
    rabbitmq {
        host => "queue.domain.com"
        port => 5671
        user => "q-reader"
        password => "password"
        queue => "web_log"
        heartbeat => 30
        durable => true
        ssl => true
        #ssl_certificate_path => "/etc/logstash/certs/server.p12"
        #ssl_certificate_password => "password"
 
        add_field => {
            "fld_server_name" => "%{[fields][fld_server_name]}"
            "fld_app_name" => "%{[fields][fld_app_name]}"
            "fld_app_module" => "%{[fields][fld_app_module]}"
            "fld_website_name" => "%{[fields][fld_website_name]}"
            "fld_log_file_name" => "%{source}"
            "fld_logformat" => "%{[fields][fld_logformat]}"
        }
    }
 
}
 
filter { 
 
      if [message] =~ "^#" {
        drop {}
      }
 
      if [fld_logformat] == "logformat__iis_with_xrealip" {
     
          grok {
            match => ["message", "%{TIMESTAMP_ISO8601:log_timestamp} %{IP:serverIP} %{WORD:method} %{NOTSPACE:uriStem} %{NOTSPACE:uriQuery} %{NUMBER:port} %{NOTSPACE:username} %{IPORHOST:clientIP} %{NOTSPACE:userAgent} %{NOTSPACE:referer} %{NUMBER:response} %{NUMBER:subresponse} %{NUMBER:win32response} %{NUMBER:timetaken} %{NOTSPACE:xrealIP} %{NOTSPACE:xforwarderfor}"]
          }
      } else {
   
          grok {
             match => ["message", "%{TIMESTAMP_ISO8601:log_timestamp} %{IP:serverIP} %{WORD:method} %{NOTSPACE:uriStem} %{NOTSPACE:uriQuery} %{NUMBER:port} %{NOTSPACE:username} %{IPORHOST:clientIP} %{NOTSPACE:userAgent} %{NOTSPACE:referer} %{NUMBER:response} %{NUMBER:subresponse} %{NUMBER:win32response} %{NUMBER:timetaken}"]
          }
 
      }
 
      date {
        match => [ "log_timestamp", "YYYY-MM-dd HH:mm:ss" ]
          timezone => "Etc/UTC"
        remove_field => [ "log_timestamp", "@timestamp" ]
        target => [ "log_timestamp2" ]
      }
 
        ruby {
            code => "tstamp = event.get('log_timestamp2').to_i
                        event.set('logdatetime', Time.at(tstamp).strftime('%Y-%m-%d %H:%M:%S'))
                        event.set('logdate', Time.at(tstamp).strftime('%Y-%m-%d'))"
        }
 
      if [bytesSent] {
        ruby {
          code => "event.set('kilobytesSent', event.get('bytesSent').to_i / 1024.0)"
        }
      }
 
 
      if [bytesReceived] {
        ruby {
          code => "event.set('kilobytesReceived', event.get('bytesReceived').to_i / 1024.0)"
        }
      }
 
   
        ruby {
            code => "event.set('clientRealIP', event.get('clientIP'))"
        }
        if [xrealIP] {
            ruby {
                code => "event.set('clientRealIP', event.get('xrealIP'))"
            }
        }
        if [xforwarderfor] {
            ruby {
                code => "event.set('clientRealIP', event.get('xforwarderfor'))"
            }
        }
 
      mutate {
        convert => ["bytesSent", "integer"]
        convert => ["bytesReceived", "integer"]
        convert => ["timetaken", "integer"] 
        convert => ["port", "integer"]
 
        add_field => {
            "clientHostname" => "%{clientIP}"
        }
      }
 
        useragent {
            source=> "useragent"
            prefix=> "browser"
        }
 
        kv {
            source => "uriQuery"
            prefix => "uriQuery__"
            allow_duplicate_values => false
            field_split => "&"
            include_keys => [ "utm_medium", "utm_source", "utm_campaign", "utm_term", "utm_content", "yclid", "region" ]
        }
 
        mutate {
            join => { "uriQuery__utm_source" => "," }
            join => { "uriQuery__utm_medium" => "," }
            join => { "uriQuery__utm_campaign" => "," }
            join => { "uriQuery__utm_term" => "," }
            join => { "uriQuery__utm_content" => "," }
            join => { "uriQuery__yclid" => "," }
            join => { "uriQuery__region" => "," }
        }
 
}
 
output { 
  #stdout {codec => rubydebug}
    clickhouse {
      headers => ["Authorization", "Basic abcdsfks..."]
      http_hosts => ["http://127.0.0.1:8123"]
      save_dir => "/etc/logstash/tmp"
      table => "log_web"
      request_tolerance => 1
      flush_size => 10000
      idle_flush_time => 1
        mutations => {
            "fld_log_file_name" => "fld_log_file_name"
            "fld_server_name" => "fld_server_name"
            "fld_app_name" => "fld_app_name"
            "fld_app_module" => "fld_app_module"
            "fld_website_name" => "fld_website_name"
 
            "logdatetime" => "logdatetime"
            "logdate" => "logdate"
            "serverIP" => "serverIP"
            "method" => "method"
            "uriStem" => "uriStem"
            "uriQuery" => "uriQuery"
            "port" => "port"
            "username" => "username"
            "clientIP" => "clientIP"
            "clientRealIP" => "clientRealIP"
            "userAgent" => "userAgent"
            "referer" => "referer"
            "response" => "response"
            "subresponse" => "subresponse"
            "win32response" => "win32response"
            "timetaken" => "timetaken"
             
            "uriQuery__utm_medium" => "uriQuery__utm_medium"
            "uriQuery__utm_source" => "uriQuery__utm_source"
            "uriQuery__utm_campaign" => "uriQuery__utm_campaign"
            "uriQuery__utm_term" => "uriQuery__utm_term"
            "uriQuery__utm_content" => "uriQuery__utm_content"
            "uriQuery__yclid" => "uriQuery__yclid"
            "uriQuery__region" => "uriQuery__region"
        }
    }
 
}

pipelines.yml

# This file is where you define your pipelines. You can define multiple.
# For more information on multiple pipelines, see the documentation:
#   https://www.elastic.co/guide/en/logstash/current/multiple-pipelines.html
 
- pipeline.id: log_web__filebeat_clickhouse
  path.config: "/etc/logstash/log_web__filebeat_clickhouse.conf"

ClickHouse. Log storage

Logs for all systems are stored in one table (see the beginning of the article). It is intended for storing information about requests: all the parameters are similar across different formats, such as IIS logs and apache and nginx logs. For application logs in which, for example, errors, informational messages, and warnings are recorded, a separate table with the appropriate structure will be provided (currently at the design stage).
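
As a rough illustration only (the real structure is still being designed, and the table and column names here are hypothetical), such an application-log table could reuse the same system columns and sort-key pattern:

CREATE TABLE log_app (
    logdate Date,
    logdatetime DateTime,
    fld_server_name LowCardinality( String ),
    fld_app_name LowCardinality( String ),
    fld_app_module LowCardinality( String ),
    level LowCardinality( String ),  -- error / warning / info
    message String
) Engine = MergeTree()
PARTITION BY toYYYYMM(logdate)
ORDER BY (fld_app_name, fld_app_module, logdatetime);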

When designing the table, it is very important to decide on the primary key (by which the data will be sorted during storage). The degree of data compression and the query speed depend on it. In our example, the key is
ORDER BY (fld_app_name, fld_app_module, logdatetime)
That is, the system name, the system component name, and the event date. Initially, the event date came first; after it was moved to the last place, queries started working about twice as fast. Changing the primary key requires re-creating the table and reloading the data, so that ClickHouse re-sorts the data on disk. This is a heavy operation, so it is a good idea to think well in advance about what should go into the sort key.
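
The effect is easy to see on a query that filters by the leading columns of the sort key; such a filter lets ClickHouse skip most of the granules without reading them. An illustrative sketch using the columns of our table:

SELECT count()
FROM log_web
WHERE (fld_app_name = 'site1.domain.ru')
    AND (fld_app_module = 'web')
    AND (logdatetime >= now() - 3600)

A query that filters only by logdatetime, bypassing the first two key columns, has to scan far more data.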

It should also be noted that the LowCardinality data type appeared in relatively recent versions. When it is used, the size of the compressed data shrinks drastically for fields that have low cardinality (few distinct values).
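
An existing String column can be converted with an ALTER. For example, the response column, which holds only a handful of distinct HTTP status codes, looks like a natural candidate (a sketch, not a change we have applied to the table above):

ALTER TABLE log_web MODIFY COLUMN response LowCardinality(String);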

We are currently running version 19.6, and we plan to try updating to the latest version. It has such wonderful features as Adaptive Granularity, skipping indices and the DoubleDelta codec, for example.
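
For reference, this is roughly what those features look like in DDL on versions that support them (hedged sketches, not applied to our 19.6 installation):

-- specialized codec for slowly growing timestamp values
ALTER TABLE log_web MODIFY COLUMN logdatetime DateTime CODEC(DoubleDelta, LZ4);
-- data-skipping index on the HTTP response code
ALTER TABLE log_web ADD INDEX idx_response response TYPE set(100) GRANULARITY 4;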

By default, during installation the logging level is set to trace. The logs are rotated and archived, but at the same time they grow up to a gigabyte. If there is no need for that, you can set the level to warning, and then the size of the logs is reduced drastically. The logging level is set in the config.xml file:

<!-- Possible levels: https://github.com/pocoproject/poco/blob/develop/Foundation/include/Poco/Logger.h#L105 -->
<level>warning</level>

Some useful commands

Since the original installation packages are built for Debian, for other Linux versions you need to use the packages built by Altinity.
 
This link has instructions with links to their repository: https://www.altinity.com/blog/2017/12/18/logstash-with-clickhouse
sudo yum search clickhouse-server
sudo yum install clickhouse-server.noarch
  
1. check the status
sudo systemctl status clickhouse-server
 
2. stop the server
sudo systemctl stop clickhouse-server
 
3. start the server
sudo systemctl start clickhouse-server
 
Starting the client to execute queries in multiline mode (queries are executed after the ";" character)
clickhouse-client --multiline
clickhouse-client --multiline --host 127.0.0.1 --password pa55w0rd
clickhouse-client --multiline --host 127.0.0.1 --port 9440 --secure --user default --password pa55w0rd
 
In case of an error in a single row, the clickhouse plugin for logstash saves the whole batch to the file /tmp/log_web_failed.json
You can fix this file manually and try to load it into the database by hand:
clickhouse-client --host 127.0.0.1 --password password --query="INSERT INTO log_web FORMAT JSONEachRow" < /tmp/log_web_failed__fixed.json
 
sudo mv /etc/logstash/tmp/log_web_failed.json /etc/logstash/tmp/log_web_failed__fixed.json
sudo chown user_dev /etc/logstash/tmp/log_web_failed__fixed.json
sudo clickhouse-client --host 127.0.0.1 --password password --query="INSERT INTO log_web FORMAT JSONEachRow" < /etc/logstash/tmp/log_web_failed__fixed.json
sudo mv /etc/logstash/tmp/log_web_failed__fixed.json /etc/logstash/tmp/log_web_failed__fixed_.json
 
exiting the command line
quit;
## TLS setup
https://www.altinity.com/blog/2019/3/5/clickhouse-networking-part-2
 
openssl s_client -connect log.domain.com:9440 < /dev/null

LogStash. Log router from FileBeat to the RabbitMQ queue

This component is used to route logs coming from FileBeat into the RabbitMQ queue. There are two points here:

  1. Unfortunately, FileBeat has no output plugin for writing directly to RabbitMQ. And such functionality, judging by the issue on their github, is not planned for implementation. There is a plugin for Kafka, but for some reason we cannot use it in-house.
  2. There is a requirement to collect logs in the DMZ. Because of it, the logs are first placed into a queue, and then LogStash reads the entries from the queue from outside.

Therefore, for the case when servers are located in the DMZ, one has to use this slightly more complicated scheme. An example configuration looks like this:

iis_w3c_logs__filebeat_rabbitmq.conf

input {
 
    beats {
        port => 5044
        type => 'iis'
        ssl => true
        ssl_certificate_authorities => ["/etc/pki/tls/certs/app/ca.pem", "/etc/pki/tls/certs/app/ca-issuing.pem"]
        ssl_certificate => "/etc/pki/tls/certs/app/queue.domain.com.cer"
        ssl_key => "/etc/pki/tls/certs/app/queue.domain.com-pkcs8.key"
        ssl_verify_mode => "peer"
    }
 
}
 
output { 
  #stdout {codec => rubydebug}
 
    rabbitmq {
        host => "127.0.0.1"
        port => 5672
        exchange => "monitor.direct"
        exchange_type => "direct"
        key => "%{[fields][fld_app_name]}"
        user => "q-writer"
        password => "password"
        ssl => false
    }
}

RabbitMQ. Message queue

This component is used to buffer log entries in the DMZ. Writing happens through the Filebeat → LogStash chain. Reading is done from outside the DMZ via LogStash. When operating through RabbitMQ, about 4 thousand messages per second are processed.

Message routing is configured by system name, i.e. based on the FileBeat configuration data. All messages go into one queue. If for some reason the queue service is stopped, this will not lead to message loss: the FileBeats will receive connection errors and suspend sending temporarily. And the LogStash that reads from the queue will also receive network errors and wait for the connection to be restored. In this case, of course, the data is simply no longer written to the database for a while.

The following commands are used to create and configure the queue:

sudo /usr/local/bin/rabbitmqadmin/rabbitmqadmin declare exchange --vhost=/ name=monitor.direct type=direct
sudo /usr/local/bin/rabbitmqadmin/rabbitmqadmin declare queue --vhost=/ name=web_log durable=true
sudo /usr/local/bin/rabbitmqadmin/rabbitmqadmin --vhost="/" declare binding source="monitor.direct" destination_type="queue" destination="web_log" routing_key="site1.domain.ru"
sudo /usr/local/bin/rabbitmqadmin/rabbitmqadmin --vhost="/" declare binding source="monitor.direct" destination_type="queue" destination="web_log" routing_key="site2.domain.ru"

Grafana. Dashboards

This component is used to visualize the monitoring data. In this case, you need to install the ClickHouse datasource for Grafana 4.6+ plugin. We had to tweak it a bit to improve the efficiency of processing SQL filters on the dashboard.

For example, we use variables, and if they are not set in the filter field, then we would like the plugin not to generate a condition of the form ( uriStem = '' AND uriStem != '' ) in WHERE. In that case, ClickHouse would still read the uriStem column. So we tried different options and eventually fixed the plugin (the $valueIfEmpty macro) so that in the case of an empty value it returns 1, without referencing the column itself.

And now you can use this query for a graph:

$columns(response, count(*) c) from $table where $adhoc
and $valueIfEmpty($fld_app_name, 1, fld_app_name = '$fld_app_name')
and $valueIfEmpty($fld_app_module, 1, fld_app_module = '$fld_app_module')
and $valueIfEmpty($fld_server_name, 1, fld_server_name = '$fld_server_name')
and $valueIfEmpty($uriStem, 1, uriStem like '%$uriStem%')
and $valueIfEmpty($clientRealIP, 1, clientRealIP = '$clientRealIP')

which translates into this SQL (note that the empty uriStem conditions have been turned into just 1):

SELECT
    t,
    groupArray((response, c)) AS groupArr
FROM (
    SELECT
        (intDiv(toUInt32(logdatetime), 60) * 60) * 1000 AS t,
        response,
        count(*) AS c
    FROM default.log_web
    WHERE (logdate >= toDate(1565061982)) AND (logdatetime >= toDateTime(1565061982)) AND 1 AND (fld_app_name = 'site1.domain.ru') AND (fld_app_module = 'web') AND 1 AND 1 AND 1
    GROUP BY t, response
    ORDER BY t ASC, response ASC
)
GROUP BY t
ORDER BY t ASC

Conclusion

The appearance of the ClickHouse database has become a landmark event on the market. It was hard to imagine that, entirely for free, we would in an instant be armed with a powerful and practical tool for working with big data. Of course, as needs grow (for example, sharding and replication to multiple servers), the scheme will become more complicated. But on first impressions, working with this database is very pleasant. You can see that the product was made "for humans."

Compared to ElasticSearch, the cost of storing and processing logs is estimated to be reduced five- to tenfold. In other words, if for the current amount of data we would have had to set up a cluster of several machines, then with ClickHouse a single low-power machine is enough for us. Yes, of course, ElasticSearch also has on-disk data compression mechanisms and other features that can significantly reduce resource consumption, but compared to ClickHouse this would be more expensive.

Without any special optimizations on our part, on the default settings, loading data into and selecting from the database run at amazing speed. We do not have much data yet (about 200 million records), but the server itself is weak. We may use this tool in the future for other purposes not related to storing logs. For example, for end-to-end analytics, in the field of security, or machine learning.

In closing, a few words about the pros and cons.

Cons

  1. Loading records in large batches. On the one hand, this is a feature, but you still have to use additional components to buffer records. This task is not always easy, but it is solvable. And I would like to simplify the scheme.
  2. Some exotic functionality or new features often break in new versions. This is a cause for concern and reduces the desire to upgrade to a new version. For example, the Kafka table engine is a very useful feature that lets you read events from Kafka directly, without implementing consumers. But judging by the number of Issues on github, we are still wary of using this engine in production. However, if you do not make sudden gestures to the side and stick to the core functionality, then it works stably.

Pros

  1. It does not slow down.
  2. Low entry threshold.
  3. Open source.
  4. Free.
  5. Scales well (sharding/replication out of the box).
  6. Included in the register of Russian software recommended by the Ministry of Communications.
  7. Official support from Yandex.

Source: www.habr.com
