1.1 Billion Taxi Rides: a 108-core ClickHouse Cluster

A translation of this article was prepared specially for students of the Data Engineering course.

ClickHouse is an open-source, columnar database. It has a sweet spot where hundreds of analysts can query detailed, non-aggregated data quickly, even as tens of billions of new records arrive each day. The infrastructure costs of supporting such a system can come in around $100K per year, and potentially half that depending on usage. At one point, the ClickHouse installation at Yandex Metrica held 10 trillion records. Beyond Yandex, ClickHouse has also found success at Bloomberg and Cloudflare.

Two years ago I benchmarked the database on a single machine, and it was the fastest free database software I had ever seen. Since then, the developers haven't stopped adding features, including support for Kafka, HDFS and ZStandard compression. Last year they added support for cascading compression methods, which made delta-of-delta coding possible. When compressing time-series data, gauge values compress well with delta encoding, while counters do better with delta-of-delta encoding. Good compression has become key to ClickHouse's performance.
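
As a rough illustration of what cascading codecs look like in DDL (a sketch of my own, not from the original post; it assumes a ClickHouse version new enough to support column codecs, which arrived after the 18.16.1 release used below, and the table and column names are invented):

CREATE TABLE sensor_metrics (
    ts      DateTime CODEC(DoubleDelta, ZSTD),  -- timestamps grow steadily: delta-of-delta, then ZSTD
    gauge   Float64  CODEC(Delta, ZSTD),        -- gauge values: plain delta coding, then ZSTD
    counter UInt64   CODEC(DoubleDelta, ZSTD)   -- monotonic counters: delta-of-delta, then ZSTD
) ENGINE = MergeTree
ORDER BY ts;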

ClickHouse is made up of 170 thousand lines of C++ code, excluding third-party libraries, and is one of the smaller distributed database codebases. By comparison, SQLite doesn't support distribution and is made up of 235 thousand lines of C code. As of this writing, 207 engineers have contributed to ClickHouse, and the rate of commits has been accelerating lately.

In March 2017, ClickHouse began keeping a changelog as an easy way to track development. They also broke the monolithic documentation file into a Markdown-based file hierarchy. Issues and feature requests are tracked via GitHub, and overall the software has become far more approachable these past few years.

In this post I'm going to look at ClickHouse cluster performance on AWS EC2 using 36-core CPUs and NVMe storage.

UPDATE: A week after originally publishing this post, I re-ran the benchmark with an improved configuration and achieved much better results. This post has been updated to reflect those changes.

Launching an AWS EC2 Cluster

I'll be using three c5d.9xlarge EC2 instances for this post. Each of them contains 36 virtual CPUs, 72 GB of RAM, 900 GB of NVMe SSD storage and supports 10 Gigabit networking. They cost $1.962 per hour each in the eu-west-1 region when launched on-demand. I'll be using Ubuntu Server 16.04 LTS as the operating system.

The firewall is configured so each machine can communicate with the others without restriction, and only my IPv4 address is whitelisted for SSH access into the cluster.

Getting the NVMe Drives Ready

For ClickHouse to work, I'll create an EXT4-format file system on the NVMe drive of each server.

$ sudo mkfs -t ext4 /dev/nvme1n1
$ sudo mkdir /ch
$ sudo mount /dev/nvme1n1 /ch
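
The mount above won't survive a reboot on its own. If persistence mattered, an /etc/fstab entry along these lines would cover it (my addition — this step isn't in the original post, and since this is instance-store NVMe the data itself remains ephemeral regardless):

$ echo '/dev/nvme1n1 /ch ext4 defaults,nofail 0 2' | \
    sudo tee -a /etc/fstab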

Once everything is set up, you can see the mount point and the 783 GB of space available on each system.

$ lsblk

NAME        MAJ:MIN RM   SIZE RO TYPE MOUNTPOINT
loop0         7:0    0  87.9M  1 loop /snap/core/5742
loop1         7:1    0  16.5M  1 loop /snap/amazon-ssm-agent/784
nvme0n1     259:1    0     8G  0 disk
└─nvme0n1p1 259:2    0     8G  0 part /
nvme1n1     259:0    0 838.2G  0 disk /ch

$ df -h

Filesystem      Size  Used Avail Use% Mounted on
udev             35G     0   35G   0% /dev
tmpfs           6.9G  8.8M  6.9G   1% /run
/dev/nvme0n1p1  7.7G  967M  6.8G  13% /
tmpfs            35G     0   35G   0% /dev/shm
tmpfs           5.0M     0  5.0M   0% /run/lock
tmpfs            35G     0   35G   0% /sys/fs/cgroup
/dev/loop0       88M   88M     0 100% /snap/core/5742
/dev/loop1       17M   17M     0 100% /snap/amazon-ssm-agent/784
tmpfs           6.9G     0  6.9G   0% /run/user/1000
/dev/nvme1n1    825G   73M  783G   1% /ch

The dataset I'll be working with in this benchmark is a data dump I produced from 1.1 billion taxi trips taken in New York City over six years. The Billion Taxi Rides in Redshift blog post details how I collected this dataset. It's stored on AWS S3, so I'll configure the AWS CLI with my access and secret keys.

$ sudo apt update
$ sudo apt install awscli
$ aws configure

I'll set the client's concurrent request limit to 100 so the files download faster than they would with the default settings.

$ aws configure set \
    default.s3.max_concurrent_requests \
    100

I'll download the taxi trips dataset from AWS S3 and store it on the NVMe drive of the first server. The dataset is ~104 GB of GZIP-compressed CSV.

$ sudo mkdir -p /ch/csv
$ sudo chown -R ubuntu /ch/csv
$ aws s3 sync s3://<bucket>/csv /ch/csv

Installing ClickHouse

I'll install the OpenJDK distribution of Java 8, as it's needed to run Apache ZooKeeper, which is required for a distributed ClickHouse setup across all three machines.

$ sudo apt update
$ sudo apt install \
    openjdk-8-jre \
    openjdk-8-jdk-headless

I'll then set the JAVA_HOME environment variable.

$ sudo vi /etc/profile
 
export JAVA_HOME=/usr
 
$ source /etc/profile

I'll use Ubuntu's package manager to install ClickHouse 18.16.1, glances and ZooKeeper on all three machines.

$ sudo apt-key adv \
    --keyserver hkp://keyserver.ubuntu.com:80 \
    --recv E0C56BD4
$ echo "deb http://repo.yandex.ru/clickhouse/deb/stable/ main/" | \
    sudo tee /etc/apt/sources.list.d/clickhouse.list
$ sudo apt-get update

$ sudo apt install \
    clickhouse-client \
    clickhouse-server \
    glances \
    zookeeperd

I'll create a data directory for ClickHouse as well as some configuration overrides on all three servers.

$ sudo mkdir /ch/clickhouse
$ sudo chown -R clickhouse /ch/clickhouse

$ sudo mkdir -p /etc/clickhouse-server/conf.d
$ sudo vi /etc/clickhouse-server/conf.d/taxis.conf

The following are the configuration overrides I'll be using.

<?xml version="1.0"?>
<yandex>
    <listen_host>0.0.0.0</listen_host>
    <path>/ch/clickhouse/</path>

    <remote_servers>
        <perftest_3shards>
            <shard>
                <replica>
                    <host>172.30.2.192</host>
                    <port>9000</port>
                </replica>
            </shard>
            <shard>
                <replica>
                    <host>172.30.2.162</host>
                    <port>9000</port>
                </replica>
            </shard>
            <shard>
                <replica>
                    <host>172.30.2.36</host>
                    <port>9000</port>
                </replica>
            </shard>
        </perftest_3shards>
    </remote_servers>

    <zookeeper-servers>
        <node>
            <host>172.30.2.192</host>
            <port>2181</port>
        </node>
        <node>
            <host>172.30.2.162</host>
            <port>2181</port>
        </node>
        <node>
            <host>172.30.2.36</host>
            <port>2181</port>
        </node>
    </zookeeper-servers>

    <macros>
        <shard>03</shard>
        <replica>01</replica>
    </macros>
</yandex>

I'll then run ZooKeeper and the ClickHouse server on all three machines.

$ sudo /etc/init.d/zookeeper start
$ sudo service clickhouse-server start
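
As a quick check that each server came up cleanly (my own addition, not a step from the original post), the client can be asked for the server version:

$ clickhouse-client --host=0.0.0.0 --query="SELECT version()"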

Loading Data into ClickHouse

On the first server I'll create a trips table that will hold the taxi trips dataset using the Log engine.

$ clickhouse-client --host=0.0.0.0
 
CREATE TABLE trips (
    trip_id                 UInt32,
    vendor_id               String,

    pickup_datetime         DateTime,
    dropoff_datetime        Nullable(DateTime),

    store_and_fwd_flag      Nullable(FixedString(1)),
    rate_code_id            Nullable(UInt8),
    pickup_longitude        Nullable(Float64),
    pickup_latitude         Nullable(Float64),
    dropoff_longitude       Nullable(Float64),
    dropoff_latitude        Nullable(Float64),
    passenger_count         Nullable(UInt8),
    trip_distance           Nullable(Float64),
    fare_amount             Nullable(Float32),
    extra                   Nullable(Float32),
    mta_tax                 Nullable(Float32),
    tip_amount              Nullable(Float32),
    tolls_amount            Nullable(Float32),
    ehail_fee               Nullable(Float32),
    improvement_surcharge   Nullable(Float32),
    total_amount            Nullable(Float32),
    payment_type            Nullable(String),
    trip_type               Nullable(UInt8),
    pickup                  Nullable(String),
    dropoff                 Nullable(String),

    cab_type                Nullable(String),

    precipitation           Nullable(Int8),
    snow_depth              Nullable(Int8),
    snowfall                Nullable(Int8),
    max_temperature         Nullable(Int8),
    min_temperature         Nullable(Int8),
    average_wind_speed      Nullable(Int8),

    pickup_nyct2010_gid     Nullable(Int8),
    pickup_ctlabel          Nullable(String),
    pickup_borocode         Nullable(Int8),
    pickup_boroname         Nullable(String),
    pickup_ct2010           Nullable(String),
    pickup_boroct2010       Nullable(String),
    pickup_cdeligibil       Nullable(FixedString(1)),
    pickup_ntacode          Nullable(String),
    pickup_ntaname          Nullable(String),
    pickup_puma             Nullable(String),

    dropoff_nyct2010_gid    Nullable(UInt8),
    dropoff_ctlabel         Nullable(String),
    dropoff_borocode        Nullable(UInt8),
    dropoff_boroname        Nullable(String),
    dropoff_ct2010          Nullable(String),
    dropoff_boroct2010      Nullable(String),
    dropoff_cdeligibil      Nullable(String),
    dropoff_ntacode         Nullable(String),
    dropoff_ntaname         Nullable(String),
    dropoff_puma            Nullable(String)
) ENGINE = Log;

I'll then decompress and load each of the CSV files into the trips table. The following completed in 55 minutes and 10 seconds. The data directory was 134 GB in size after this operation.

$ time (for FILENAME in /ch/csv/trips_x*.csv.gz; do
            echo $FILENAME
            gunzip -c $FILENAME | \
                clickhouse-client \
                    --host=0.0.0.0 \
                    --query="INSERT INTO trips FORMAT CSV"
        done)
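
As a quick sanity check on the load (my own addition — the original post doesn't show this step), a count of the table should come back at roughly 1.1 billion rows:

$ clickhouse-client --host=0.0.0.0 --query="SELECT count(*) FROM trips"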

The import speed worked out to 155 MB of uncompressed CSV content per second. I suspect this was due to a bottleneck in GZIP decompression. It might have been faster to decompress all the gzip files in parallel using xargs and then load the decompressed data; a sketch of that alternative follows the glances output below. Here is what glances reported during the CSV import process.

$ sudo glances

ip-172-30-2-200 (Ubuntu 16.04 64bit / Linux 4.4.0-1072-aws)                                                                                                 Uptime: 0:11:42
CPU       8.2%  nice:     0.0%                           LOAD    36-core                           MEM      9.8%  active:    5.20G                           SWAP      0.0%
user:     6.0%  irq:      0.0%                           1 min:    2.24                            total:  68.7G  inactive:  61.0G                           total:       0
system:   0.9%  iowait:   1.3%                           5 min:    1.83                            used:   6.71G  buffers:   66.4M                           used:        0
idle:    91.8%  steal:    0.0%                           15 min:   1.01                            free:   62.0G  cached:    61.6G                           free:        0

NETWORK     Rx/s   Tx/s   TASKS 370 (507 thr), 2 run, 368 slp, 0 oth sorted automatically by cpu_percent, flat view
ens5        136b    2Kb
lo         343Mb  343Mb     CPU%  MEM%  VIRT   RES   PID USER        NI S    TIME+ IOR/s IOW/s Command
                           100.4   1.5 1.65G 1.06G  9909 ubuntu       0 S  1:01.33     0     0 clickhouse-client --host=0.0.0.0 --query=INSERT INTO trips FORMAT CSV
DISK I/O     R/s    W/s     85.1   0.0 4.65M  708K  9908 ubuntu       0 R  0:50.60   32M     0 gzip -d -c /ch/csv/trips_xac.csv.gz
loop0          0      0     54.9   5.1 8.14G 3.49G  8091 clickhous    0 S  1:44.23     0   45M /usr/bin/clickhouse-server --config=/etc/clickhouse-server/config.xml
loop1          0      0      4.5   0.0     0     0   319 root         0 S  0:07.50    1K     0 kworker/u72:2
nvme0n1        0     3K      2.3   0.0 91.1M 28.9M  9912 root         0 R  0:01.56     0     0 /usr/bin/python3 /usr/bin/glances
nvme0n1p1      0     3K      0.3   0.0     0     0   960 root       -20 S  0:00.10     0     0 kworker/28:1H
nvme1n1    32.1M   495M      0.3   0.0     0     0  1058 root       -20 S  0:00.90     0     0 kworker/23:1H
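
For reference, the parallel-decompression alternative mentioned above could look something like this (an untested sketch of my own, not from the original post; it assumes the drive has room for the fully decompressed CSVs, and note that gunzip deletes each .gz file once it finishes):

$ ls /ch/csv/trips_x*.csv.gz | \
    xargs -n 1 -P 36 gunzip

$ time (for FILENAME in /ch/csv/trips_x*.csv; do
            clickhouse-client \
                --host=0.0.0.0 \
                --query="INSERT INTO trips FORMAT CSV" < $FILENAME
        done)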

I'll free up space on the NVMe drive by deleting the original CSV files before continuing.

$ sudo rm -fr /ch/csv

Converting to Columnar Form

ClickHouse's Log engine stores data in a row-oriented format. To query the data faster, I'll convert it to a column-oriented format using the MergeTree engine.

$ clickhouse-client --host=0.0.0.0

The following completed in 34 minutes and 50 seconds. The data directory was 237 GB in size after this operation.

CREATE TABLE trips_mergetree
    ENGINE = MergeTree(pickup_date, pickup_datetime, 8192)
    AS SELECT
        trip_id,
        CAST(vendor_id AS Enum8('1' = 1,
                                '2' = 2,
                                'CMT' = 3,
                                'VTS' = 4,
                                'DDS' = 5,
                                'B02512' = 10,
                                'B02598' = 11,
                                'B02617' = 12,
                                'B02682' = 13,
                                'B02764' = 14)) AS vendor_id,
        toDate(pickup_datetime)                 AS pickup_date,
        ifNull(pickup_datetime, toDateTime(0))  AS pickup_datetime,
        toDate(dropoff_datetime)                AS dropoff_date,
        ifNull(dropoff_datetime, toDateTime(0)) AS dropoff_datetime,
        assumeNotNull(store_and_fwd_flag)       AS store_and_fwd_flag,
        assumeNotNull(rate_code_id)             AS rate_code_id,

        assumeNotNull(pickup_longitude)         AS pickup_longitude,
        assumeNotNull(pickup_latitude)          AS pickup_latitude,
        assumeNotNull(dropoff_longitude)        AS dropoff_longitude,
        assumeNotNull(dropoff_latitude)         AS dropoff_latitude,
        assumeNotNull(passenger_count)          AS passenger_count,
        assumeNotNull(trip_distance)            AS trip_distance,
        assumeNotNull(fare_amount)              AS fare_amount,
        assumeNotNull(extra)                    AS extra,
        assumeNotNull(mta_tax)                  AS mta_tax,
        assumeNotNull(tip_amount)               AS tip_amount,
        assumeNotNull(tolls_amount)             AS tolls_amount,
        assumeNotNull(ehail_fee)                AS ehail_fee,
        assumeNotNull(improvement_surcharge)    AS improvement_surcharge,
        assumeNotNull(total_amount)             AS total_amount,
        assumeNotNull(payment_type)             AS payment_type_,
        assumeNotNull(trip_type)                AS trip_type,

        pickup AS pickup,
        dropoff                 AS dropoff,

        CAST(assumeNotNull(cab_type)
            AS Enum8('yellow' = 1, 'green' = 2))
                                AS cab_type,

        precipitation           AS precipitation,
        snow_depth              AS snow_depth,
        snowfall                AS snowfall,
        max_temperature         AS max_temperature,
        min_temperature         AS min_temperature,
        average_wind_speed      AS average_wind_speed,

        pickup_nyct2010_gid     AS pickup_nyct2010_gid,
        pickup_ctlabel          AS pickup_ctlabel,
        pickup_borocode         AS pickup_borocode,
        pickup_boroname         AS pickup_boroname,
        pickup_ct2010           AS pickup_ct2010,
        pickup_boroct2010       AS pickup_boroct2010,
        pickup_cdeligibil       AS pickup_cdeligibil,
        pickup_ntacode          AS pickup_ntacode,
        pickup_ntaname          AS pickup_ntaname,
        pickup_puma             AS pickup_puma,

        dropoff_nyct2010_gid    AS dropoff_nyct2010_gid,
        dropoff_ctlabel         AS dropoff_ctlabel,
        dropoff_borocode        AS dropoff_borocode,
        dropoff_boroname        AS dropoff_boroname,
        dropoff_ct2010          AS dropoff_ct2010,
        dropoff_boroct2010      AS dropoff_boroct2010,
        dropoff_cdeligibil      AS dropoff_cdeligibil,
        dropoff_ntacode         AS dropoff_ntacode,
        dropoff_ntaname         AS dropoff_ntaname,
        dropoff_puma            AS dropoff_puma
    FROM trips;

This is what the glances output looked like during the operation:

ip-172-30-2-200 (Ubuntu 16.04 64bit / Linux 4.4.0-1072-aws)                                                                                                 Uptime: 1:06:09
CPU      10.3%  nice:     0.0%                           LOAD    36-core                           MEM     16.1%  active:    13.3G                           SWAP      0.0%
user:     7.9%  irq:      0.0%                           1 min:    1.87                            total:  68.7G  inactive:  52.8G                           total:       0
system:   1.6%  iowait:   0.8%                           5 min:    1.76                            used:   11.1G  buffers:   71.8M                           used:        0
idle:    89.7%  steal:    0.0%                           15 min:   1.95                            free:   57.6G  cached:    57.2G                           free:        0

NETWORK     Rx/s   Tx/s   TASKS 367 (523 thr), 1 run, 366 slp, 0 oth sorted automatically by cpu_percent, flat view
ens5         1Kb    8Kb
lo           2Kb    2Kb     CPU%  MEM%  VIRT   RES   PID USER        NI S    TIME+ IOR/s IOW/s Command
                           241.9  12.8 20.7G 8.78G  8091 clickhous    0 S 30:36.73   34M  125M /usr/bin/clickhouse-server --config=/etc/clickhouse-server/config.xml
DISK I/O     R/s    W/s      2.6   0.0 90.4M 28.3M  9948 root         0 R  1:18.53     0     0 /usr/bin/python3 /usr/bin/glances
loop0          0      0      1.3   0.0     0     0   203 root         0 S  0:09.82     0     0 kswapd0
loop1          0      0      0.3   0.1  315M 61.3M 15701 ubuntu       0 S  0:00.40     0     0 clickhouse-client --host=0.0.0.0
nvme0n1        0     3K      0.3   0.0     0     0     7 root         0 S  0:00.83     0     0 rcu_sched
nvme0n1p1      0     3K      0.0   0.0     0     0   142 root         0 S  0:00.22     0     0 migration/27
nvme1n1    25.8M   330M      0.0   0.0 59.7M 1.79M  2764 ubuntu       0 S  0:00.00     0     0 (sd-pam)

In the previous benchmark several columns were cast and re-computed. I found that a number of those operations no longer worked as expected on this dataset, so to work around this I removed the offending operations and loaded the data without converting to more granular types.

Distributing Data Across the Cluster

I'll distribute the data across all three nodes of the cluster. To start, I'll create the table below on all three machines.

$ clickhouse-client --host=0.0.0.0

CREATE TABLE trips_mergetree_third (
    trip_id                 UInt32,
    vendor_id               String,
    pickup_date             Date,
    pickup_datetime         DateTime,
    dropoff_date            Date,
    dropoff_datetime        Nullable(DateTime),
    store_and_fwd_flag      Nullable(FixedString(1)),
    rate_code_id            Nullable(UInt8),
    pickup_longitude        Nullable(Float64),
    pickup_latitude         Nullable(Float64),
    dropoff_longitude       Nullable(Float64),
    dropoff_latitude        Nullable(Float64),
    passenger_count         Nullable(UInt8),
    trip_distance           Nullable(Float64),
    fare_amount             Nullable(Float32),
    extra                   Nullable(Float32),
    mta_tax                 Nullable(Float32),
    tip_amount              Nullable(Float32),
    tolls_amount            Nullable(Float32),
    ehail_fee               Nullable(Float32),
    improvement_surcharge   Nullable(Float32),
    total_amount            Nullable(Float32),
    payment_type            Nullable(String),
    trip_type               Nullable(UInt8),
    pickup                  Nullable(String),
    dropoff                 Nullable(String),

    cab_type                Nullable(String),

    precipitation           Nullable(Int8),
    snow_depth              Nullable(Int8),
    snowfall                Nullable(Int8),
    max_temperature         Nullable(Int8),
    min_temperature         Nullable(Int8),
    average_wind_speed      Nullable(Int8),

    pickup_nyct2010_gid     Nullable(Int8),
    pickup_ctlabel          Nullable(String),
    pickup_borocode         Nullable(Int8),
    pickup_boroname         Nullable(String),
    pickup_ct2010           Nullable(String),
    pickup_boroct2010       Nullable(String),
    pickup_cdeligibil       Nullable(FixedString(1)),
    pickup_ntacode          Nullable(String),
    pickup_ntaname          Nullable(String),
    pickup_puma             Nullable(String),

    dropoff_nyct2010_gid    Nullable(UInt8),
    dropoff_ctlabel         Nullable(String),
    dropoff_borocode        Nullable(UInt8),
    dropoff_boroname        Nullable(String),
    dropoff_ct2010          Nullable(String),
    dropoff_boroct2010      Nullable(String),
    dropoff_cdeligibil      Nullable(String),
    dropoff_ntacode         Nullable(String),
    dropoff_ntaname         Nullable(String),
    dropoff_puma            Nullable(String)
) ENGINE = MergeTree(pickup_date, pickup_datetime, 8192);

I'll then make sure the first server can see all three nodes in the cluster.

SELECT *
FROM system.clusters
WHERE cluster = 'perftest_3shards'
FORMAT Vertical;

Row 1:
──────
cluster:          perftest_3shards
shard_num:        1
shard_weight:     1
replica_num:      1
host_name:        172.30.2.192
host_address:     172.30.2.192
port:             9000
is_local:         1
user:             default
default_database:

Row 2:
──────
cluster:          perftest_3shards
shard_num:        2
shard_weight:     1
replica_num:      1
host_name:        172.30.2.162
host_address:     172.30.2.162
port:             9000
is_local:         0
user:             default
default_database:

Row 3:
──────
cluster:          perftest_3shards
shard_num:        3
shard_weight:     1
replica_num:      1
host_name:        172.30.2.36
host_address:     172.30.2.36
port:             9000
is_local:         0
user:             default
default_database:

I'll then define a new table on the first server that's based on the trips_mergetree_third schema and uses the Distributed engine.

CREATE TABLE trips_mergetree_x3
    AS trips_mergetree_third
    ENGINE = Distributed(perftest_3shards,
                         default,
                         trips_mergetree_third,
                         rand());
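
The rand() sharding key scatters rows across the shards uniformly but with no relation to their content. If query locality mattered, hashing a column instead would keep related rows on the same shard — for example (a hypothetical variant, not what this benchmark uses):

CREATE TABLE trips_mergetree_x3_hashed
    AS trips_mergetree_third
    ENGINE = Distributed(perftest_3shards,
                         default,
                         trips_mergetree_third,
                         cityHash64(trip_id));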

I'll then copy the data from the MergeTree table out onto all three servers. The following completed in 34 minutes and 44 seconds.

INSERT INTO trips_mergetree_x3
    SELECT * FROM trips_mergetree;

After the above operation, I gave ClickHouse 15 minutes to compact down from its peak storage mark. The data directories ended up being 264 GB, 34 GB and 33 GB respectively on the three servers.
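
One way to watch that footprint settle (an illustrative query of my own, not something the original post ran) is to ask system.parts for the on-disk size of each table's active parts:

SELECT table,
       formatReadableSize(sum(bytes)) AS size
FROM system.parts
WHERE active
GROUP BY table;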

Benchmarking the ClickHouse Cluster

Next up, I'll look at the fastest times I saw after running each query multiple times against the trips_mergetree_x3 table.

$ clickhouse-client --host=0.0.0.0

The following completed in 2.449 seconds.

SELECT cab_type, count(*)
FROM trips_mergetree_x3
GROUP BY cab_type;

The following completed in 0.691 seconds.

SELECT passenger_count,
       avg(total_amount)
FROM trips_mergetree_x3
GROUP BY passenger_count;

The following completed in 0 seconds.

SELECT passenger_count,
       toYear(pickup_date) AS year,
       count(*)
FROM trips_mergetree_x3
GROUP BY passenger_count,
         year;

The following completed in 0.983 seconds.

SELECT passenger_count,
       toYear(pickup_date) AS year,
       round(trip_distance) AS distance,
       count(*)
FROM trips_mergetree_x3
GROUP BY passenger_count,
         year,
         distance
ORDER BY year,
         count(*) DESC;

For comparison, I ran the same queries on a MergeTree-based table that sits solely on the first server.

Benchmarking a Single ClickHouse Node

Again, I'll look at the fastest times I saw after running each query multiple times, this time against the trips_mergetree table.

The following completed in 0.241 seconds.

SELECT cab_type, count(*)
FROM trips_mergetree
GROUP BY cab_type;

The following completed in 0.826 seconds.

SELECT passenger_count,
       avg(total_amount)
FROM trips_mergetree
GROUP BY passenger_count;

The following completed in 1.209 seconds.

SELECT passenger_count,
       toYear(pickup_date) AS year,
       count(*)
FROM trips_mergetree
GROUP BY passenger_count,
         year;

The following completed in 1.781 seconds.

SELECT passenger_count,
       toYear(pickup_date) AS year,
       round(trip_distance) AS distance,
       count(*)
FROM trips_mergetree
GROUP BY passenger_count,
         year,
         distance
ORDER BY year,
         count(*) DESC;

Thoughts on the Results

This is the first time that free, CPU-based database software has managed to outperform a GPU-based database in my benchmarks. That GPU database has gone through two revisions since, but nonetheless the performance ClickHouse delivered on a single node is very impressive.

That said, when running Query 1 on the distributed engine, the costs are an order of magnitude higher. I'm hoping I've missed something in my research for this post, because it would be nice to see query times drop as I add more nodes to the cluster. It is great, though, that when running the other queries, performance increased by around 2x.

It would be nice to see ClickHouse evolve toward being able to separate storage and compute so they could scale independently. The HDFS support added last year could be a step toward this. As for compute, if a single query can be sped up by adding more nodes to the cluster, then the future of this software is very bright.

Thank you for taking the time to read this post. I offer consulting, architecture and hands-on development services to clients in North America and Europe. If you'd like to discuss how my offerings could help your business, please contact me via LinkedIn.

source: www.habr.com
