1.1 miliar perjalanan taksi: 108-core ClickHouse klaster

Tarjamahan artikel ieu disiapkeun husus pikeun siswa tangtu "Insinyur Data".

1.1 miliar perjalanan taksi: 108-core ClickHouse klaster

clickhouse mangrupa database columnar open source. Éta lingkungan anu saé dimana ratusan analis tiasa gancang naroskeun data anu lengkep, sanaos puluhan milyar rékaman énggal diasupkeun per dinten. Biaya infrastruktur pikeun ngadukung sistem sapertos kitu tiasa saluhur $ 100 per taun, sareng berpotensi satengahna gumantung kana pamakean. Dina hiji waktos, pamasangan ClickHouse ti Yandex Metrics ngandung 10 triliun rékaman. Salian Yandex, ClickHouse ogé parantos suksés sareng Bloomberg sareng Cloudflare.

Dua taun katukang kuring nyéépkeun analisis komparatif database ngagunakeun hiji mesin, sarta eta janten panggancangna bebas software database Kuring geus kungsi katempo. Saprak harita, pamekar teu eureun nambahkeun fitur, kaasup rojongan pikeun Kafka, HDFS jeung komprési ZStandard. Taun ka tukang maranéhna ditambahkeun rojongan pikeun métode komprési cascading, jeung délta-ti-delta coding janten mungkin. Nalika ngompresi data séri waktos, nilai gauge tiasa dikomprés saé nganggo encoding délta, tapi pikeun konter langkung saé ngagunakeun encoding délta-demi-delta. Komprési anu saé parantos janten konci pikeun pagelaran ClickHouse.

ClickHouse diwangun ku 170 sarébu garis kode C ++, teu kaasup perpustakaan pihak-katilu, tur mangrupakeun salah sahiji codebases database disebarkeun pangleutikna. Dina babandingan, SQLite teu ngarojong distribusi sarta diwangun ku 235 sarébu garis kode C. Salaku tulisan ieu, 207 insinyur geus nyumbang ka ClickHouse, sarta inténsitas commits geus ngaronjatna anyar.

Dina Maret 2017, ClickHouse mimiti ngalaksanakeun log robah salaku cara gampang pikeun ngalacak pangwangunan. Éta ogé peupeus nepi file dokuméntasi monolithic kana hirarki file basis Markdown. Masalah sareng fitur dilacak via GitHub, sareng sacara umum parangkat lunak parantos langkung diaksés dina sababaraha taun ka pengker.

Dina tulisan ieu, kuring bakal ningali kinerja klaster ClickHouse dina AWS EC2 nganggo prosesor 36-inti sareng panyimpenan NVMe.

UPDATE: Saminggu saatos mimiti nyebarkeun tulisan ieu, kuring ngulang ujian kalayan konfigurasi anu langkung saé sareng ngahontal hasil anu langkung saé. Tulisan ieu parantos diropéa pikeun ngagambarkeun parobahan ieu.

Ngaluncurkeun Kluster AWS EC2

Kuring bakal ngagunakeun tilu c5d.9xlarge EC2 instansi pikeun pos ieu. Tiap di antarana ngandung 36 CPUs virtual, 72 GB RAM, 900 GB gudang NVMe SSD tur ngarojong 10 jaringan Gigabit. Éta ngarugikeun $ 1,962 / jam masing-masing di daérah eu-kulon-1 nalika ngajalankeun on demand. Kuring bakal nganggo Ubuntu Server 16.04 LTS salaku sistem operasi.

firewall ieu ngonpigurasi ambéh unggal mesin bisa komunikasi saling tanpa palarangan, sarta ngan alamat IPv4 abdi whitelisted ku SSH dina kluster.

NVMe drive dina kaayaan kesiapan operasional

Pikeun ClickHouse tiasa dianggo, kuring bakal nyiptakeun sistem file dina format EXT4 dina drive NVMe dina unggal server.

$ sudo mkfs -t ext4 /dev/nvme1n1
$ sudo mkdir /ch
$ sudo mount /dev/nvme1n1 /ch

Sakali sagalana geus ngonpigurasi, anjeun tiasa ningali titik Gunung jeung 783 GB spasi sadia on unggal sistem.

$ lsblk

NAME        MAJ:MIN RM   SIZE RO TYPE MOUNTPOINT
loop0         7:0    0  87.9M  1 loop /snap/core/5742
loop1         7:1    0  16.5M  1 loop /snap/amazon-ssm-agent/784
nvme0n1     259:1    0     8G  0 disk
└─nvme0n1p1 259:2    0     8G  0 part /
nvme1n1     259:0    0 838.2G  0 disk /ch

$ df -h

Filesystem      Size  Used Avail Use% Mounted on
udev             35G     0   35G   0% /dev
tmpfs           6.9G  8.8M  6.9G   1% /run
/dev/nvme0n1p1  7.7G  967M  6.8G  13% /
tmpfs            35G     0   35G   0% /dev/shm
tmpfs           5.0M     0  5.0M   0% /run/lock
tmpfs            35G     0   35G   0% /sys/fs/cgroup
/dev/loop0       88M   88M     0 100% /snap/core/5742
/dev/loop1       17M   17M     0 100% /snap/amazon-ssm-agent/784
tmpfs           6.9G     0  6.9G   0% /run/user/1000
/dev/nvme1n1    825G   73M  783G   1% /ch

Dataset anu kuring bakal dianggo dina tés ieu mangrupikeun dump data anu kuring hasilkeun tina 1.1 milyar perjalanan taksi anu dicandak di New York City salami genep taun. Dina blog Hiji Milyar Perjalanan Taksi di Redshift rinci kumaha kuring dikumpulkeun set data ieu. Éta disimpen dina AWS S3, jadi kuring bakal ngonpigurasikeun AWS CLI kalayan aksés sareng konci rahasia kuring.

$ sudo apt update
$ sudo apt install awscli
$ aws configure

Kuring bakal nyetél wates pamundut sakaligus klien ka 100 supados file diunduh langkung gancang tibatan setélan standar.

$ aws configure set 
    default.s3.max_concurrent_requests 
    100

Abdi badé ngaunduh data taksi tina AWS S3 sareng simpen dina drive NVMe dina server munggaran. Dataset ieu ~ 104GB dina format CSV anu dikomprés GZIP.

$ sudo mkdir -p /ch/csv
$ sudo chown -R ubuntu /ch/csv
$ aws s3 sync s3://<bucket>/csv /ch/csv

Pamasangan ClickHouse

Kuring baris install sebaran OpenJDK pikeun Java 8 sakumaha diperlukeun pikeun ngajalankeun Apache ZooKeeper, nu diperlukeun pikeun instalasi disebarkeun of ClickHouse on sakabeh tilu mesin.

$ sudo apt update
$ sudo apt install 
    openjdk-8-jre 
    openjdk-8-jdk-headless

Teras kuring nyetél variabel lingkungan JAVA_HOME.

$ sudo vi /etc/profile
 
export JAVA_HOME=/usr
 
$ source /etc/profile

Kuring lajeng bakal ngagunakeun sistem manajemen pakét Ubuntu pikeun install ClickHouse 18.16.1, glances na ZooKeeper on sakabeh tilu mesin.

$ sudo apt-key adv 
    --keyserver hkp://keyserver.ubuntu.com:80 
    --recv E0C56BD4
$ echo "deb http://repo.yandex.ru/clickhouse/deb/stable/ main/" | 
    sudo tee /etc/apt/sources.list.d/clickhouse.list
$ sudo apt-get update

$ sudo apt install 
    clickhouse-client 
    clickhouse-server 
    glances 
    zookeeperd

Kuring gé nyieun diréktori pikeun ClickHouse sarta ogé ngalakukeun sababaraha overrides konfigurasi dina sakabéh tilu server.

$ sudo mkdir /ch/clickhouse
$ sudo chown -R clickhouse /ch/clickhouse

$ sudo mkdir -p /etc/clickhouse-server/conf.d
$ sudo vi /etc/clickhouse-server/conf.d/taxis.conf

Ieu mangrupikeun overrides konfigurasi anu bakal kuring anggo.

<?xml version="1.0"?>
<yandex>
    <listen_host>0.0.0.0</listen_host>
    <path>/ch/clickhouse/</path>

 <remote_servers>
        <perftest_3shards>
            <shard>
                <replica>
                    <host>172.30.2.192</host>
                    <port>9000</port>
                 </replica>
            </shard>
            <shard>
                 <replica>
                    <host>172.30.2.162</host>
                    <port>9000</port>
                 </replica>
            </shard>
            <shard>
                 <replica>
                    <host>172.30.2.36</host>
                    <port>9000</port>
                 </replica>
            </shard>
        </perftest_3shards>
    </remote_servers>

  <zookeeper-servers>
        <node>
            <host>172.30.2.192</host>
            <port>2181</port>
        </node>
        <node>
            <host>172.30.2.162</host>
            <port>2181</port>
        </node>
        <node>
            <host>172.30.2.36</host>
            <port>2181</port>
        </node>
    </zookeeper-servers>

 <macros>
        <shard>03</shard>
        <replica>01</replica>
    </macros>
</yandex>

Kuring lajeng bakal ngajalankeun ZooKeeper sarta server ClickHouse on sakabeh tilu mesin.

$ sudo /etc/init.d/zookeeper start
$ sudo service clickhouse-server start

Ngunggah data ka ClickHouse

Dina server kahiji kuring bakal nyieun tabel lalampahan (trips), anu bakal nyimpen set data perjalanan taksi nganggo mesin Log.

$ clickhouse-client --host=0.0.0.0
 
CREATE TABLE trips (
    trip_id                 UInt32,
    vendor_id               String,

    pickup_datetime         DateTime,
    dropoff_datetime        Nullable(DateTime),

    store_and_fwd_flag      Nullable(FixedString(1)),
    rate_code_id            Nullable(UInt8),
    pickup_longitude        Nullable(Float64),
    pickup_latitude         Nullable(Float64),
    dropoff_longitude       Nullable(Float64),
    dropoff_latitude        Nullable(Float64),
    passenger_count         Nullable(UInt8),
    trip_distance           Nullable(Float64),
    fare_amount             Nullable(Float32),
    extra                   Nullable(Float32),
    mta_tax                 Nullable(Float32),
    tip_amount              Nullable(Float32),
    tolls_amount            Nullable(Float32),
    ehail_fee               Nullable(Float32),
    improvement_surcharge   Nullable(Float32),
    total_amount            Nullable(Float32),
    payment_type            Nullable(String),
    trip_type               Nullable(UInt8),
    pickup                  Nullable(String),
    dropoff                 Nullable(String),

    cab_type                Nullable(String),

    precipitation           Nullable(Int8),
    snow_depth              Nullable(Int8),
    snowfall                Nullable(Int8),
    max_temperature         Nullable(Int8),
    min_temperature         Nullable(Int8),
    average_wind_speed      Nullable(Int8),

    pickup_nyct2010_gid     Nullable(Int8),
    pickup_ctlabel          Nullable(String),
    pickup_borocode         Nullable(Int8),
    pickup_boroname         Nullable(String),
    pickup_ct2010           Nullable(String),
    pickup_boroct2010       Nullable(String),
    pickup_cdeligibil       Nullable(FixedString(1)),
    pickup_ntacode          Nullable(String),
    pickup_ntaname          Nullable(String),
    pickup_puma             Nullable(String),

    dropoff_nyct2010_gid    Nullable(UInt8),
    dropoff_ctlabel         Nullable(String),
    dropoff_borocode        Nullable(UInt8),
    dropoff_boroname        Nullable(String),
    dropoff_ct2010          Nullable(String),
    dropoff_boroct2010      Nullable(String),
    dropoff_cdeligibil      Nullable(String),
    dropoff_ntacode         Nullable(String),
    dropoff_ntaname         Nullable(String),
    dropoff_puma            Nullable(String)
) ENGINE = Log;

Kuring teras nimba sareng ngamuat unggal file CSV kana méja perjalanan (trips). Di handap ieu réngsé dina 55 menit 10 detik. Saatos operasi ieu, ukuran diréktori data éta 134 GB.

$ time (for FILENAME in /ch/csv/trips_x*.csv.gz; do
            echo $FILENAME
            gunzip -c $FILENAME | 
                clickhouse-client 
                    --host=0.0.0.0 
                    --query="INSERT INTO trips FORMAT CSV"
        done)

Laju impor nyaéta 155 MB eusi CSV anu teu dikomprés per detik. Kuring curiga ieu kusabab bottleneck dina decompression GZIP. Bisa jadi leuwih gancang mun unzip sakabéh file gzipped sajajar ngagunakeun xargs lajeng muka data unzipped. Di handap ieu pedaran ngeunaan naon dilaporkeun salila prosés impor CSV.

$ sudo glances

ip-172-30-2-200 (Ubuntu 16.04 64bit / Linux 4.4.0-1072-aws)                                                                                                 Uptime: 0:11:42
CPU       8.2%  nice:     0.0%                           LOAD    36-core                           MEM      9.8%  active:    5.20G                           SWAP      0.0%
user:     6.0%  irq:      0.0%                           1 min:    2.24                            total:  68.7G  inactive:  61.0G                           total:       0
system:   0.9%  iowait:   1.3%                           5 min:    1.83                            used:   6.71G  buffers:   66.4M                           used:        0
idle:    91.8%  steal:    0.0%                           15 min:   1.01                            free:   62.0G  cached:    61.6G                           free:        0

NETWORK     Rx/s   Tx/s   TASKS 370 (507 thr), 2 run, 368 slp, 0 oth sorted automatically by cpu_percent, flat view
ens5        136b    2Kb
lo         343Mb  343Mb     CPU%  MEM%  VIRT   RES   PID USER        NI S    TIME+ IOR/s IOW/s Command
                           100.4   1.5 1.65G 1.06G  9909 ubuntu       0 S  1:01.33     0     0 clickhouse-client --host=0.0.0.0 --query=INSERT INTO trips FORMAT CSV
DISK I/O     R/s    W/s     85.1   0.0 4.65M  708K  9908 ubuntu       0 R  0:50.60   32M     0 gzip -d -c /ch/csv/trips_xac.csv.gz
loop0          0      0     54.9   5.1 8.14G 3.49G  8091 clickhous    0 S  1:44.23     0   45M /usr/bin/clickhouse-server --config=/etc/clickhouse-server/config.xml
loop1          0      0      4.5   0.0     0     0   319 root         0 S  0:07.50    1K     0 kworker/u72:2
nvme0n1        0     3K      2.3   0.0 91.1M 28.9M  9912 root         0 R  0:01.56     0     0 /usr/bin/python3 /usr/bin/glances
nvme0n1p1      0     3K      0.3   0.0     0     0   960 root       -20 S  0:00.10     0     0 kworker/28:1H
nvme1n1    32.1M   495M      0.3   0.0     0     0  1058 root       -20 S  0:00.90     0     0 kworker/23:1H

Abdi bakal ngosongkeun rohangan dina drive NVMe ku ngahapus file CSV asli sateuacan neraskeun.

$ sudo rm -fr /ch/csv

Ngarobah kana Bentuk Kolom

Mesin Log ClickHouse bakal nyimpen data dina format berorientasi baris. Pikeun naroskeun data langkung gancang, kuring ngarobih kana format kolom nganggo mesin MergeTree.

$ clickhouse-client --host=0.0.0.0

Di handap ieu réngsé dina 34 menit 50 detik. Saatos operasi ieu, ukuran diréktori data éta 237 GB.

CREATE TABLE trips_mergetree
    ENGINE = MergeTree(pickup_date, pickup_datetime, 8192)
    AS SELECT
        trip_id,
        CAST(vendor_id AS Enum8('1' = 1,
                                '2' = 2,
                                'CMT' = 3,
                                'VTS' = 4,
                                'DDS' = 5,
                                'B02512' = 10,
                                'B02598' = 11,
                                'B02617' = 12,
                                'B02682' = 13,
                                'B02764' = 14)) AS vendor_id,
        toDate(pickup_datetime)                 AS pickup_date,
        ifNull(pickup_datetime, toDateTime(0))  AS pickup_datetime,
        toDate(dropoff_datetime)                AS dropoff_date,
        ifNull(dropoff_datetime, toDateTime(0)) AS dropoff_datetime,
        assumeNotNull(store_and_fwd_flag)       AS store_and_fwd_flag,
        assumeNotNull(rate_code_id)             AS rate_code_id,

        assumeNotNull(pickup_longitude)         AS pickup_longitude,
        assumeNotNull(pickup_latitude)          AS pickup_latitude,
        assumeNotNull(dropoff_longitude)        AS dropoff_longitude,
        assumeNotNull(dropoff_latitude)         AS dropoff_latitude,
        assumeNotNull(passenger_count)          AS passenger_count,
        assumeNotNull(trip_distance)            AS trip_distance,
        assumeNotNull(fare_amount)              AS fare_amount,
        assumeNotNull(extra)                    AS extra,
        assumeNotNull(mta_tax)                  AS mta_tax,
        assumeNotNull(tip_amount)               AS tip_amount,
        assumeNotNull(tolls_amount)             AS tolls_amount,
        assumeNotNull(ehail_fee)                AS ehail_fee,
        assumeNotNull(improvement_surcharge)    AS improvement_surcharge,
        assumeNotNull(total_amount)             AS total_amount,
        assumeNotNull(payment_type)             AS payment_type_,
        assumeNotNull(trip_type)                AS trip_type,

        pickup AS pickup,
        pickup AS dropoff,

        CAST(assumeNotNull(cab_type)
            AS Enum8('yellow' = 1, 'green' = 2))
                                AS cab_type,

        precipitation           AS precipitation,
        snow_depth              AS snow_depth,
        snowfall                AS snowfall,
        max_temperature         AS max_temperature,
        min_temperature         AS min_temperature,
        average_wind_speed      AS average_wind_speed,

        pickup_nyct2010_gid     AS pickup_nyct2010_gid,
        pickup_ctlabel          AS pickup_ctlabel,
        pickup_borocode         AS pickup_borocode,
        pickup_boroname         AS pickup_boroname,
        pickup_ct2010           AS pickup_ct2010,
        pickup_boroct2010       AS pickup_boroct2010,
        pickup_cdeligibil       AS pickup_cdeligibil,
        pickup_ntacode          AS pickup_ntacode,
        pickup_ntaname          AS pickup_ntaname,
        pickup_puma             AS pickup_puma,

        dropoff_nyct2010_gid    AS dropoff_nyct2010_gid,
        dropoff_ctlabel         AS dropoff_ctlabel,
        dropoff_borocode        AS dropoff_borocode,
        dropoff_boroname        AS dropoff_boroname,
        dropoff_ct2010          AS dropoff_ct2010,
        dropoff_boroct2010      AS dropoff_boroct2010,
        dropoff_cdeligibil      AS dropoff_cdeligibil,
        dropoff_ntacode         AS dropoff_ntacode,
        dropoff_ntaname         AS dropoff_ntaname,
        dropoff_puma            AS dropoff_puma
    FROM trips;

Ieu kumaha kaluaran glance sapertos nalika operasi:

ip-172-30-2-200 (Ubuntu 16.04 64bit / Linux 4.4.0-1072-aws)                                                                                                 Uptime: 1:06:09
CPU      10.3%  nice:     0.0%                           LOAD    36-core                           MEM     16.1%  active:    13.3G                           SWAP      0.0%
user:     7.9%  irq:      0.0%                           1 min:    1.87                            total:  68.7G  inactive:  52.8G                           total:       0
system:   1.6%  iowait:   0.8%                           5 min:    1.76                            used:   11.1G  buffers:   71.8M                           used:        0
idle:    89.7%  steal:    0.0%                           15 min:   1.95                            free:   57.6G  cached:    57.2G                           free:        0

NETWORK     Rx/s   Tx/s   TASKS 367 (523 thr), 1 run, 366 slp, 0 oth sorted automatically by cpu_percent, flat view
ens5         1Kb    8Kb
lo           2Kb    2Kb     CPU%  MEM%  VIRT   RES   PID USER        NI S    TIME+ IOR/s IOW/s Command
                           241.9  12.8 20.7G 8.78G  8091 clickhous    0 S 30:36.73   34M  125M /usr/bin/clickhouse-server --config=/etc/clickhouse-server/config.xml
DISK I/O     R/s    W/s      2.6   0.0 90.4M 28.3M  9948 root         0 R  1:18.53     0     0 /usr/bin/python3 /usr/bin/glances
loop0          0      0      1.3   0.0     0     0   203 root         0 S  0:09.82     0     0 kswapd0
loop1          0      0      0.3   0.1  315M 61.3M 15701 ubuntu       0 S  0:00.40     0     0 clickhouse-client --host=0.0.0.0
nvme0n1        0     3K      0.3   0.0     0     0     7 root         0 S  0:00.83     0     0 rcu_sched
nvme0n1p1      0     3K      0.0   0.0     0     0   142 root         0 S  0:00.22     0     0 migration/27
nvme1n1    25.8M   330M      0.0   0.0 59.7M 1.79M  2764 ubuntu       0 S  0:00.00     0     0 (sd-pam)

Dina tés panungtungan, sababaraha kolom dirobah sarta diitung ulang. Kuring manggihan yén sababaraha fungsi ieu euweuh gawéna saperti nu diharapkeun dina dataset ieu. Pikeun ngabéréskeun masalah ieu, kuring ngaleungitkeun fungsi anu teu pantes sareng ngamuat data tanpa ngarobih kana jinis anu langkung granular.

Distribusi data sakuliah klaster

Kuring bakal ngadistribusikaeun data ka sadaya tilu titik klaster. Pikeun mimitian, handap kuring bakal nyieun tabel dina sakabéh tilu mesin.

$ clickhouse-client --host=0.0.0.0

CREATE TABLE trips_mergetree_third (
    trip_id                 UInt32,
    vendor_id               String,
    pickup_date             Date,
    pickup_datetime         DateTime,
    dropoff_date            Date,
    dropoff_datetime        Nullable(DateTime),
    store_and_fwd_flag      Nullable(FixedString(1)),
    rate_code_id            Nullable(UInt8),
    pickup_longitude        Nullable(Float64),
    pickup_latitude         Nullable(Float64),
    dropoff_longitude       Nullable(Float64),
    dropoff_latitude        Nullable(Float64),
    passenger_count         Nullable(UInt8),
    trip_distance           Nullable(Float64),
    fare_amount             Nullable(Float32),
    extra                   Nullable(Float32),
    mta_tax                 Nullable(Float32),
    tip_amount              Nullable(Float32),
    tolls_amount            Nullable(Float32),
    ehail_fee               Nullable(Float32),
    improvement_surcharge   Nullable(Float32),
    total_amount            Nullable(Float32),
    payment_type            Nullable(String),
    trip_type               Nullable(UInt8),
    pickup                  Nullable(String),
    dropoff                 Nullable(String),

    cab_type                Nullable(String),

    precipitation           Nullable(Int8),
    snow_depth              Nullable(Int8),
    snowfall                Nullable(Int8),
    max_temperature         Nullable(Int8),
    min_temperature         Nullable(Int8),
    average_wind_speed      Nullable(Int8),

    pickup_nyct2010_gid     Nullable(Int8),
    pickup_ctlabel          Nullable(String),
    pickup_borocode         Nullable(Int8),
    pickup_boroname         Nullable(String),
    pickup_ct2010           Nullable(String),
    pickup_boroct2010       Nullable(String),
    pickup_cdeligibil       Nullable(FixedString(1)),
    pickup_ntacode          Nullable(String),
    pickup_ntaname          Nullable(String),
    pickup_puma             Nullable(String),

    dropoff_nyct2010_gid    Nullable(UInt8),
    dropoff_ctlabel         Nullable(String),
    dropoff_borocode        Nullable(UInt8),
    dropoff_boroname        Nullable(String),
    dropoff_ct2010          Nullable(String),
    dropoff_boroct2010      Nullable(String),
    dropoff_cdeligibil      Nullable(String),
    dropoff_ntacode         Nullable(String),
    dropoff_ntaname         Nullable(String),
    dropoff_puma            Nullable(String)
) ENGINE = MergeTree(pickup_date, pickup_datetime, 8192);

Lajeng abdi bakal pastikeun yén server munggaran bisa ningali sakabeh tilu titik dina kluster.

SELECT *
FROM system.clusters
WHERE cluster = 'perftest_3shards'
FORMAT Vertical;
Row 1:
──────
cluster:          perftest_3shards
shard_num:        1
shard_weight:     1
replica_num:      1
host_name:        172.30.2.192
host_address:     172.30.2.192
port:             9000
is_local:         1
user:             default
default_database:
Row 2:
──────
cluster:          perftest_3shards
shard_num:        2
shard_weight:     1
replica_num:      1
host_name:        172.30.2.162
host_address:     172.30.2.162
port:             9000
is_local:         0
user:             default
default_database:

Row 3:
──────
cluster:          perftest_3shards
shard_num:        3
shard_weight:     1
replica_num:      1
host_name:        172.30.2.36
host_address:     172.30.2.36
port:             9000
is_local:         0
user:             default
default_database:

Lajeng abdi bakal nangtukeun hiji méja anyar dina server munggaran anu dumasar kana schema nu trips_mergetree_third tur ngagunakeun mesin Distribusi.

CREATE TABLE trips_mergetree_x3
    AS trips_mergetree_third
    ENGINE = Distributed(perftest_3shards,
                         default,
                         trips_mergetree_third,
                         rand());

Kuring lajeng bakal nyalin data tina tabel dumasar MergeTree ka sadaya tilu server. Di handap ieu réngsé dina 34 menit 44 detik.

INSERT INTO trips_mergetree_x3
    SELECT * FROM trips_mergetree;

Saatos operasi di luhur, abdi masihan ClickHouse 15 menit pikeun mindahkeun jauh ti tanda tingkat gudang maksimum. Diréktori data mungguh 264 GB, 34 GB sareng 33 GB dina unggal tilu server.

ClickHouse evaluasi kinerja klaster

Naon anu kuring tingali salajengna nyaéta waktos panggancangna kuring ningali ngajalankeun unggal pamundut dina méja sababaraha kali trips_mergetree_x3.

$ clickhouse-client --host=0.0.0.0

Di handap ieu réngsé dina 2.449 detik.

SELECT cab_type, count(*)
FROM trips_mergetree_x3
GROUP BY cab_type;

Di handap ieu réngsé dina 0.691 detik.

SELECT passenger_count,
       avg(total_amount)
FROM trips_mergetree_x3
GROUP BY passenger_count;

Di handap ieu réngsé dina 0 detik.

SELECT passenger_count,
       toYear(pickup_date) AS year,
       count(*)
FROM trips_mergetree_x3
GROUP BY passenger_count,
         year;

Di handap ieu réngsé dina 0.983 detik.

SELECT passenger_count,
       toYear(pickup_date) AS year,
       round(trip_distance) AS distance,
       count(*)
FROM trips_mergetree_x3
GROUP BY passenger_count,
         year,
         distance
ORDER BY year,
         count(*) DESC;

Pikeun babandingan, abdi ngajalankeun queries sarua dina tabel basis MergeTree nu resides solely dina server munggaran.

Evaluasi kinerja hiji titik ClickHouse

Naon anu kuring tingali salajengna nyaéta waktos panggancangna kuring ningali ngajalankeun unggal pamundut dina méja sababaraha kali trips_mergetree_x3.

Di handap ieu réngsé dina 0.241 detik.

SELECT cab_type, count(*)
FROM trips_mergetree
GROUP BY cab_type;

Di handap ieu réngsé dina 0.826 detik.

SELECT passenger_count,
       avg(total_amount)
FROM trips_mergetree
GROUP BY passenger_count;

Di handap ieu réngsé dina 1.209 detik.

SELECT passenger_count,
       toYear(pickup_date) AS year,
       count(*)
FROM trips_mergetree
GROUP BY passenger_count,
         year;

Di handap ieu réngsé dina 1.781 detik.

SELECT passenger_count,
       toYear(pickup_date) AS year,
       round(trip_distance) AS distance,
       count(*)
FROM trips_mergetree
GROUP BY passenger_count,
         year,
         distance
ORDER BY year,
         count(*) DESC;

Refleksi kana hasil

Ieu kahiji kalina yén basis data basis CPU bébas bisa outperform database basis GPU dina tés kuring. Éta database basis GPU geus ngaliwatan dua révisi saprak lajeng, tapi kinerja anu ClickHouse dikirimkeun dina titik tunggal mangkaning pisan impressive.

Dina waktos anu sami, nalika ngalaksanakeun Query 1 dina mesin anu disebarkeun, biaya overhead mangrupikeun urutan anu langkung ageung. Kuring miharep kuring lasut hal dina panalungtikan kuring pikeun pos ieu sabab bakal nice ningali kali query turun nalika kuring nambahkeun leuwih titik kana klaster. Sanajan kitu, éta hébat yén nalika executing queries séjén, kinerja ngaronjat ku ngeunaan 2 kali.

Éta langkung saé ningali ClickHouse mekar nuju tiasa misahkeun neundeun sareng ngitung supados aranjeunna tiasa skala sacara mandiri. Pangrojong HDFS, anu ditambahan taun ka tukang, tiasa janten léngkah ka arah ieu. Dina hal komputasi, upami hiji pamundut tunggal tiasa digancangan ku nambihan langkung seueur titik kana kluster, maka masa depan parangkat lunak ieu terang pisan.

Hatur nuhun pikeun nyandak waktos maca pos ieu. Kuring nawiskeun jasa konsultasi, arsitéktur, sareng pamekaran prakték ka klien di Amérika Kalér sareng Éropa. Upami anjeun hoyong ngabahas kumaha saran kuring tiasa ngabantosan bisnis anjeun, mangga ngahubungi kuring via LinkedIn.

sumber: www.habr.com

Tambahkeun komentar