1.1 billiún turas tacsaí: braisle 108-lárnach ClickHouse

Ullmhaíodh aistriúchán an ailt go sonrach do mhic léinn an chúrsa Innealtóir Sonraí.

1.1 billiún turas tacsaí: braisle 108-lárnach ClickHouse

Cliceáil Teach is bunachar sonraí colún foinse oscailte é. Is timpeallacht iontach é inar féidir leis na céadta anailísí sonraí mionsonraithe a fhiosrú go tapa, fiú nuair a chuirtear isteach na mílte de thaifid nua in aghaidh an lae. D’fhéadfadh costais bonneagair chun tacú le córas den sórt sin a bheith chomh hard le $100 in aghaidh na bliana, agus b’fhéidir leath sin ag brath ar úsáid. Ag pointe amháin, bhí 10 trilliún taifead i suiteáil ClickHouse ó Yandex Metrics. Chomh maith le Yandex, d'éirigh le ClickHouse le Bloomberg agus Cloudflare.

Dhá bhliain ó shin chaith mé anailís chomparáideach bunachair shonraí ag baint úsáide as meaisín amháin, agus bhí sé is tapúla Bogearraí bunachar sonraí saor in aisce a chonaic mé riamh. Ó shin i leith, níor stop na forbróirí gnéithe a chur leis, lena n-áirítear tacaíocht do chomhbhrú Kafka, HDFS agus ZStandard. Anuraidh chuir siad tacaíocht le modhanna comhbhrú cascáideacha, agus delta-ó-delta bhí códú indéanta. Agus sonraí sraith ama á gcomhbhrú, is féidir luachanna tomhsaire a chomhbhrú go maith ag baint úsáide as ionchódú delta, ach le haghaidh cuntair bheadh ​​sé níos fearr ionchódú delta-by-delta a úsáid. Tá comhbhrú maith anois mar an eochair do fheidhmíocht ClickHouse.

Is éard atá i ClickHouse ná 170 míle líne de chód C++, gan leabharlanna tríú páirtí a áireamh, agus tá sé ar cheann de na bunachair chód bunachar sonraí dáilte is lú. I gcomparáid leis sin, ní thacaíonn SQLite le dáileadh agus tá sé comhdhéanta de 235 míle líne de chód C. Ón scríbhinn seo, chuir 207 innealtóir le ClickHouse, agus tá déine na ngealltanas ag méadú le déanaí.

I mí an Mhárta 2017, thosaigh ClickHouse ag stiúradh loga athrú mar bhealach éasca le súil a choinneáil ar an bhforbairt. Bhris siad freisin an comhad doiciméadaithe monolithic isteach i ordlathas comhad Markdown-bhunaithe. Déantar saincheisteanna agus gnéithe a rianú trí GitHub, agus go ginearálta tá na bogearraí i bhfad níos inrochtana le cúpla bliain anuas.

San Airteagal seo, táim chun breathnú ar fheidhmíocht braisle ClickHouse ar AWS EC2 ag baint úsáide as próiseálaithe 36-lárnach agus stóráil NVMe.

Suas chun dáta: Seachtain tar éis an post seo a fhoilsiú ar dtús, rinne mé an tástáil a athchur le cumraíocht fheabhsaithe agus bhain mé torthaí i bhfad níos fearr amach. Nuashonraíodh an postáil seo chun na hathruithe seo a léiriú.

Ag seoladh Braisle AWS EC2

Beidh trí chás c5d.9xlarge EC2 á n-úsáid agam don phost seo. Tá 36 CPU fíorúil i ngach ceann acu, 72 GB RAM, 900 GB de stóráil NVMe SSD agus tacaíonn sé le líonra 10 Gigabit. Chosain siad $1,962 in aghaidh na huaire an ceann i réigiún ae-iarthar-1 agus iad ag rith ar éileamh. Beidh mé ag baint úsáide as Ubuntu Server 16.04 LTS mar an córas oibriúcháin.

Tá an balla dóiteáin cumraithe ionas gur féidir le gach meaisín cumarsáid a dhéanamh lena chéile gan srianta, agus níl ach mo sheoladh IPv4 bánliostaithe ag SSH sa bhraisle.

Tiomáint NVMe i staid ullmhachta oibriúcháin

Chun ClickHouse a bheith ag obair, cruthóidh mé córas comhaid san fhormáid EXT4 ar thiomáint NVMe ar gach ceann de na freastalaithe.

$ sudo mkfs -t ext4 /dev/nvme1n1
$ sudo mkdir /ch
$ sudo mount /dev/nvme1n1 /ch

Nuair a bheidh gach rud cumraithe, is féidir leat an pointe mount agus an 783 GB de spás atá ar fáil ar gach córas a fheiceáil.

$ lsblk

NAME        MAJ:MIN RM   SIZE RO TYPE MOUNTPOINT
loop0         7:0    0  87.9M  1 loop /snap/core/5742
loop1         7:1    0  16.5M  1 loop /snap/amazon-ssm-agent/784
nvme0n1     259:1    0     8G  0 disk
└─nvme0n1p1 259:2    0     8G  0 part /
nvme1n1     259:0    0 838.2G  0 disk /ch

$ df -h

Filesystem      Size  Used Avail Use% Mounted on
udev             35G     0   35G   0% /dev
tmpfs           6.9G  8.8M  6.9G   1% /run
/dev/nvme0n1p1  7.7G  967M  6.8G  13% /
tmpfs            35G     0   35G   0% /dev/shm
tmpfs           5.0M     0  5.0M   0% /run/lock
tmpfs            35G     0   35G   0% /sys/fs/cgroup
/dev/loop0       88M   88M     0 100% /snap/core/5742
/dev/loop1       17M   17M     0 100% /snap/amazon-ssm-agent/784
tmpfs           6.9G     0  6.9G   0% /run/user/1000
/dev/nvme1n1    825G   73M  783G   1% /ch

Is é an tacar sonraí a úsáidfidh mé sa tástáil seo ná dumpáil sonraí a ghin mé ó 1.1 billiún turas tacsaí a tógadh i gCathair Nua-Eabhrac thar sé bliana. Ar an bhlag Billiún Turas Tacsaithe i Redshift sonraí conas a bhailigh mé an tacar sonraí seo. Stóráiltear iad in AWS S3, mar sin déanfaidh mé an AWS CLI a chumrú le mo chuid eochracha rochtana agus rúnda.

$ sudo apt update
$ sudo apt install awscli
$ aws configure

Socróidh mé teorainn iarratais chomhthráthach an chliaint go 100 ionas go n-íoslódálfar comhaid níos tapúla ná na socruithe réamhshocraithe.

$ aws configure set 
    default.s3.max_concurrent_requests 
    100

Íoslódáil mé an tacar sonraí turas tacsaí ó AWS S3 agus stórálfaidh mé é ar thiomáint NVMe ar an gcéad fhreastalaí. Tá ~104GB sa tacar sonraí seo i bhformáid CSV-chomhbhrúite GZIP.

$ sudo mkdir -p /ch/csv
$ sudo chown -R ubuntu /ch/csv
$ aws s3 sync s3://<bucket>/csv /ch/csv

Suiteáil ClickHouse

Suiteáilfidh mé an dáileadh OpenJDK do Java 8 mar go bhfuil sé riachtanach Apache ZooKeeper a reáchtáil, atá ag teastáil le haghaidh suiteáil scaipthe ClickHouse ar na trí mheaisín.

$ sudo apt update
$ sudo apt install 
    openjdk-8-jre 
    openjdk-8-jdk-headless

Ansin leag mé an athróg timpeallachta JAVA_HOME.

$ sudo vi /etc/profile
 
export JAVA_HOME=/usr
 
$ source /etc/profile

Úsáidfidh mé córas bainistíochta pacáiste Ubuntu ansin chun ClickHouse 18.16.1, amharc agus ZooKeeper a shuiteáil ar na trí mheaisín.

$ sudo apt-key adv 
    --keyserver hkp://keyserver.ubuntu.com:80 
    --recv E0C56BD4
$ echo "deb http://repo.yandex.ru/clickhouse/deb/stable/ main/" | 
    sudo tee /etc/apt/sources.list.d/clickhouse.list
$ sudo apt-get update

$ sudo apt install 
    clickhouse-client 
    clickhouse-server 
    glances 
    zookeeperd

Cruthóidh mé eolaire do ClickHouse agus déanfaidh mé roinnt sáruithe cumraíochta ar na trí fhreastalaí go léir.

$ sudo mkdir /ch/clickhouse
$ sudo chown -R clickhouse /ch/clickhouse

$ sudo mkdir -p /etc/clickhouse-server/conf.d
$ sudo vi /etc/clickhouse-server/conf.d/taxis.conf

Seo iad na sáruithe cumraíochta a bheidh in úsáid agam.

<?xml version="1.0"?>
<yandex>
    <listen_host>0.0.0.0</listen_host>
    <path>/ch/clickhouse/</path>

 <remote_servers>
        <perftest_3shards>
            <shard>
                <replica>
                    <host>172.30.2.192</host>
                    <port>9000</port>
                 </replica>
            </shard>
            <shard>
                 <replica>
                    <host>172.30.2.162</host>
                    <port>9000</port>
                 </replica>
            </shard>
            <shard>
                 <replica>
                    <host>172.30.2.36</host>
                    <port>9000</port>
                 </replica>
            </shard>
        </perftest_3shards>
    </remote_servers>

  <zookeeper-servers>
        <node>
            <host>172.30.2.192</host>
            <port>2181</port>
        </node>
        <node>
            <host>172.30.2.162</host>
            <port>2181</port>
        </node>
        <node>
            <host>172.30.2.36</host>
            <port>2181</port>
        </node>
    </zookeeper-servers>

 <macros>
        <shard>03</shard>
        <replica>01</replica>
    </macros>
</yandex>

Rithfidh mé ZooKeeper agus an freastalaí ClickHouse ar na trí mheaisín ansin.

$ sudo /etc/init.d/zookeeper start
$ sudo service clickhouse-server start

Sonraí á n-uaslódáil chuig ClickHouse

Ar an gcéad fhreastalaí cruthóidh mé tábla turais (trips), a stórálfar tacar sonraí de thurais tacsaí ag baint úsáide as an inneall Log.

$ clickhouse-client --host=0.0.0.0
 
CREATE TABLE trips (
    trip_id                 UInt32,
    vendor_id               String,

    pickup_datetime         DateTime,
    dropoff_datetime        Nullable(DateTime),

    store_and_fwd_flag      Nullable(FixedString(1)),
    rate_code_id            Nullable(UInt8),
    pickup_longitude        Nullable(Float64),
    pickup_latitude         Nullable(Float64),
    dropoff_longitude       Nullable(Float64),
    dropoff_latitude        Nullable(Float64),
    passenger_count         Nullable(UInt8),
    trip_distance           Nullable(Float64),
    fare_amount             Nullable(Float32),
    extra                   Nullable(Float32),
    mta_tax                 Nullable(Float32),
    tip_amount              Nullable(Float32),
    tolls_amount            Nullable(Float32),
    ehail_fee               Nullable(Float32),
    improvement_surcharge   Nullable(Float32),
    total_amount            Nullable(Float32),
    payment_type            Nullable(String),
    trip_type               Nullable(UInt8),
    pickup                  Nullable(String),
    dropoff                 Nullable(String),

    cab_type                Nullable(String),

    precipitation           Nullable(Int8),
    snow_depth              Nullable(Int8),
    snowfall                Nullable(Int8),
    max_temperature         Nullable(Int8),
    min_temperature         Nullable(Int8),
    average_wind_speed      Nullable(Int8),

    pickup_nyct2010_gid     Nullable(Int8),
    pickup_ctlabel          Nullable(String),
    pickup_borocode         Nullable(Int8),
    pickup_boroname         Nullable(String),
    pickup_ct2010           Nullable(String),
    pickup_boroct2010       Nullable(String),
    pickup_cdeligibil       Nullable(FixedString(1)),
    pickup_ntacode          Nullable(String),
    pickup_ntaname          Nullable(String),
    pickup_puma             Nullable(String),

    dropoff_nyct2010_gid    Nullable(UInt8),
    dropoff_ctlabel         Nullable(String),
    dropoff_borocode        Nullable(UInt8),
    dropoff_boroname        Nullable(String),
    dropoff_ct2010          Nullable(String),
    dropoff_boroct2010      Nullable(String),
    dropoff_cdeligibil      Nullable(String),
    dropoff_ntacode         Nullable(String),
    dropoff_ntaname         Nullable(String),
    dropoff_puma            Nullable(String)
) ENGINE = Log;

Bainim agus lódálann mé gach ceann de na comhaid CSV isteach i dtábla turais (trips). Críochnaíodh an méid seo a leanas i 55 nóiméad agus 10 soicind. Tar éis na hoibríochta seo, ba é méid an eolaire sonraí ná 134 GB.

$ time (for FILENAME in /ch/csv/trips_x*.csv.gz; do
            echo $FILENAME
            gunzip -c $FILENAME | 
                clickhouse-client 
                    --host=0.0.0.0 
                    --query="INSERT INTO trips FORMAT CSV"
        done)

Ba é an luas iompórtála ná 155 MB d'ábhar CSV neamh-chomhbhrúite in aghaidh an tsoicind. Tá amhras orm gur tharla sé seo mar gheall ar bhac i ndí-chomhbhrú GZIP. Seans go raibh sé níos tapúla na comhaid gzipped go léir a dhízipeáil ag an am céanna ag baint úsáide as xargs agus ansin na sonraí unzipped a luchtú. Seo thíos cur síos ar an méid a tuairiscíodh le linn phróiseas allmhairithe CSV.

$ sudo glances

ip-172-30-2-200 (Ubuntu 16.04 64bit / Linux 4.4.0-1072-aws)                                                                                                 Uptime: 0:11:42
CPU       8.2%  nice:     0.0%                           LOAD    36-core                           MEM      9.8%  active:    5.20G                           SWAP      0.0%
user:     6.0%  irq:      0.0%                           1 min:    2.24                            total:  68.7G  inactive:  61.0G                           total:       0
system:   0.9%  iowait:   1.3%                           5 min:    1.83                            used:   6.71G  buffers:   66.4M                           used:        0
idle:    91.8%  steal:    0.0%                           15 min:   1.01                            free:   62.0G  cached:    61.6G                           free:        0

NETWORK     Rx/s   Tx/s   TASKS 370 (507 thr), 2 run, 368 slp, 0 oth sorted automatically by cpu_percent, flat view
ens5        136b    2Kb
lo         343Mb  343Mb     CPU%  MEM%  VIRT   RES   PID USER        NI S    TIME+ IOR/s IOW/s Command
                           100.4   1.5 1.65G 1.06G  9909 ubuntu       0 S  1:01.33     0     0 clickhouse-client --host=0.0.0.0 --query=INSERT INTO trips FORMAT CSV
DISK I/O     R/s    W/s     85.1   0.0 4.65M  708K  9908 ubuntu       0 R  0:50.60   32M     0 gzip -d -c /ch/csv/trips_xac.csv.gz
loop0          0      0     54.9   5.1 8.14G 3.49G  8091 clickhous    0 S  1:44.23     0   45M /usr/bin/clickhouse-server --config=/etc/clickhouse-server/config.xml
loop1          0      0      4.5   0.0     0     0   319 root         0 S  0:07.50    1K     0 kworker/u72:2
nvme0n1        0     3K      2.3   0.0 91.1M 28.9M  9912 root         0 R  0:01.56     0     0 /usr/bin/python3 /usr/bin/glances
nvme0n1p1      0     3K      0.3   0.0     0     0   960 root       -20 S  0:00.10     0     0 kworker/28:1H
nvme1n1    32.1M   495M      0.3   0.0     0     0  1058 root       -20 S  0:00.90     0     0 kworker/23:1H

Saorfaidh mé spás ar an tiomántán NVMe trí na comhaid CSV bunaidh a scriosadh sula leanann mé ar aghaidh.

$ sudo rm -fr /ch/csv

Tiontaigh go Foirm Colún

Déanfaidh an t-inneall Log ClickHouse sonraí a stóráil i bhformáid atá dírithe ar a chéile. Chun sonraí a fhiosrú níos tapúla, déanaim é a thiontú go formáid cholún ag baint úsáide as an inneall MergeTree.

$ clickhouse-client --host=0.0.0.0

Críochnaíodh an méid seo a leanas i 34 nóiméad agus 50 soicind. Tar éis na hoibríochta seo, ba é méid an eolaire sonraí ná 237 GB.

CREATE TABLE trips_mergetree
    ENGINE = MergeTree(pickup_date, pickup_datetime, 8192)
    AS SELECT
        trip_id,
        CAST(vendor_id AS Enum8('1' = 1,
                                '2' = 2,
                                'CMT' = 3,
                                'VTS' = 4,
                                'DDS' = 5,
                                'B02512' = 10,
                                'B02598' = 11,
                                'B02617' = 12,
                                'B02682' = 13,
                                'B02764' = 14)) AS vendor_id,
        toDate(pickup_datetime)                 AS pickup_date,
        ifNull(pickup_datetime, toDateTime(0))  AS pickup_datetime,
        toDate(dropoff_datetime)                AS dropoff_date,
        ifNull(dropoff_datetime, toDateTime(0)) AS dropoff_datetime,
        assumeNotNull(store_and_fwd_flag)       AS store_and_fwd_flag,
        assumeNotNull(rate_code_id)             AS rate_code_id,

        assumeNotNull(pickup_longitude)         AS pickup_longitude,
        assumeNotNull(pickup_latitude)          AS pickup_latitude,
        assumeNotNull(dropoff_longitude)        AS dropoff_longitude,
        assumeNotNull(dropoff_latitude)         AS dropoff_latitude,
        assumeNotNull(passenger_count)          AS passenger_count,
        assumeNotNull(trip_distance)            AS trip_distance,
        assumeNotNull(fare_amount)              AS fare_amount,
        assumeNotNull(extra)                    AS extra,
        assumeNotNull(mta_tax)                  AS mta_tax,
        assumeNotNull(tip_amount)               AS tip_amount,
        assumeNotNull(tolls_amount)             AS tolls_amount,
        assumeNotNull(ehail_fee)                AS ehail_fee,
        assumeNotNull(improvement_surcharge)    AS improvement_surcharge,
        assumeNotNull(total_amount)             AS total_amount,
        assumeNotNull(payment_type)             AS payment_type_,
        assumeNotNull(trip_type)                AS trip_type,

        pickup AS pickup,
        pickup AS dropoff,

        CAST(assumeNotNull(cab_type)
            AS Enum8('yellow' = 1, 'green' = 2))
                                AS cab_type,

        precipitation           AS precipitation,
        snow_depth              AS snow_depth,
        snowfall                AS snowfall,
        max_temperature         AS max_temperature,
        min_temperature         AS min_temperature,
        average_wind_speed      AS average_wind_speed,

        pickup_nyct2010_gid     AS pickup_nyct2010_gid,
        pickup_ctlabel          AS pickup_ctlabel,
        pickup_borocode         AS pickup_borocode,
        pickup_boroname         AS pickup_boroname,
        pickup_ct2010           AS pickup_ct2010,
        pickup_boroct2010       AS pickup_boroct2010,
        pickup_cdeligibil       AS pickup_cdeligibil,
        pickup_ntacode          AS pickup_ntacode,
        pickup_ntaname          AS pickup_ntaname,
        pickup_puma             AS pickup_puma,

        dropoff_nyct2010_gid    AS dropoff_nyct2010_gid,
        dropoff_ctlabel         AS dropoff_ctlabel,
        dropoff_borocode        AS dropoff_borocode,
        dropoff_boroname        AS dropoff_boroname,
        dropoff_ct2010          AS dropoff_ct2010,
        dropoff_boroct2010      AS dropoff_boroct2010,
        dropoff_cdeligibil      AS dropoff_cdeligibil,
        dropoff_ntacode         AS dropoff_ntacode,
        dropoff_ntaname         AS dropoff_ntaname,
        dropoff_puma            AS dropoff_puma
    FROM trips;

Seo an chuma a bhí ar an aschur sracfhéachaint le linn na hoibríochta:

ip-172-30-2-200 (Ubuntu 16.04 64bit / Linux 4.4.0-1072-aws)                                                                                                 Uptime: 1:06:09
CPU      10.3%  nice:     0.0%                           LOAD    36-core                           MEM     16.1%  active:    13.3G                           SWAP      0.0%
user:     7.9%  irq:      0.0%                           1 min:    1.87                            total:  68.7G  inactive:  52.8G                           total:       0
system:   1.6%  iowait:   0.8%                           5 min:    1.76                            used:   11.1G  buffers:   71.8M                           used:        0
idle:    89.7%  steal:    0.0%                           15 min:   1.95                            free:   57.6G  cached:    57.2G                           free:        0

NETWORK     Rx/s   Tx/s   TASKS 367 (523 thr), 1 run, 366 slp, 0 oth sorted automatically by cpu_percent, flat view
ens5         1Kb    8Kb
lo           2Kb    2Kb     CPU%  MEM%  VIRT   RES   PID USER        NI S    TIME+ IOR/s IOW/s Command
                           241.9  12.8 20.7G 8.78G  8091 clickhous    0 S 30:36.73   34M  125M /usr/bin/clickhouse-server --config=/etc/clickhouse-server/config.xml
DISK I/O     R/s    W/s      2.6   0.0 90.4M 28.3M  9948 root         0 R  1:18.53     0     0 /usr/bin/python3 /usr/bin/glances
loop0          0      0      1.3   0.0     0     0   203 root         0 S  0:09.82     0     0 kswapd0
loop1          0      0      0.3   0.1  315M 61.3M 15701 ubuntu       0 S  0:00.40     0     0 clickhouse-client --host=0.0.0.0
nvme0n1        0     3K      0.3   0.0     0     0     7 root         0 S  0:00.83     0     0 rcu_sched
nvme0n1p1      0     3K      0.0   0.0     0     0   142 root         0 S  0:00.22     0     0 migration/27
nvme1n1    25.8M   330M      0.0   0.0 59.7M 1.79M  2764 ubuntu       0 S  0:00.00     0     0 (sd-pam)

Sa tástáil dheireanach, rinneadh roinnt colún a thiontú agus a athríomh. Chinn mé nach n-oibríonn cuid de na feidhmeanna seo a thuilleadh mar a bhíothas ag súil leis ar an tacar sonraí seo. Chun an fhadhb seo a réiteach, bhain mé na feidhmeanna míchuí agus luchtaigh mé na sonraí gan a thiontú go cineálacha níos gráinneach.

Dáileadh sonraí ar fud an bhraisle

Dáilfidh mé na sonraí ar na trí nód braisle. Chun tús a chur, thíos cruthóidh mé tábla ar na trí mheaisín.

$ clickhouse-client --host=0.0.0.0

CREATE TABLE trips_mergetree_third (
    trip_id                 UInt32,
    vendor_id               String,
    pickup_date             Date,
    pickup_datetime         DateTime,
    dropoff_date            Date,
    dropoff_datetime        Nullable(DateTime),
    store_and_fwd_flag      Nullable(FixedString(1)),
    rate_code_id            Nullable(UInt8),
    pickup_longitude        Nullable(Float64),
    pickup_latitude         Nullable(Float64),
    dropoff_longitude       Nullable(Float64),
    dropoff_latitude        Nullable(Float64),
    passenger_count         Nullable(UInt8),
    trip_distance           Nullable(Float64),
    fare_amount             Nullable(Float32),
    extra                   Nullable(Float32),
    mta_tax                 Nullable(Float32),
    tip_amount              Nullable(Float32),
    tolls_amount            Nullable(Float32),
    ehail_fee               Nullable(Float32),
    improvement_surcharge   Nullable(Float32),
    total_amount            Nullable(Float32),
    payment_type            Nullable(String),
    trip_type               Nullable(UInt8),
    pickup                  Nullable(String),
    dropoff                 Nullable(String),

    cab_type                Nullable(String),

    precipitation           Nullable(Int8),
    snow_depth              Nullable(Int8),
    snowfall                Nullable(Int8),
    max_temperature         Nullable(Int8),
    min_temperature         Nullable(Int8),
    average_wind_speed      Nullable(Int8),

    pickup_nyct2010_gid     Nullable(Int8),
    pickup_ctlabel          Nullable(String),
    pickup_borocode         Nullable(Int8),
    pickup_boroname         Nullable(String),
    pickup_ct2010           Nullable(String),
    pickup_boroct2010       Nullable(String),
    pickup_cdeligibil       Nullable(FixedString(1)),
    pickup_ntacode          Nullable(String),
    pickup_ntaname          Nullable(String),
    pickup_puma             Nullable(String),

    dropoff_nyct2010_gid    Nullable(UInt8),
    dropoff_ctlabel         Nullable(String),
    dropoff_borocode        Nullable(UInt8),
    dropoff_boroname        Nullable(String),
    dropoff_ct2010          Nullable(String),
    dropoff_boroct2010      Nullable(String),
    dropoff_cdeligibil      Nullable(String),
    dropoff_ntacode         Nullable(String),
    dropoff_ntaname         Nullable(String),
    dropoff_puma            Nullable(String)
) ENGINE = MergeTree(pickup_date, pickup_datetime, 8192);

Ansin déanfaidh mé cinnte go bhfeicfidh an chéad fhreastalaí na trí nóid go léir sa bhraisle.

SELECT *
FROM system.clusters
WHERE cluster = 'perftest_3shards'
FORMAT Vertical;
Row 1:
──────
cluster:          perftest_3shards
shard_num:        1
shard_weight:     1
replica_num:      1
host_name:        172.30.2.192
host_address:     172.30.2.192
port:             9000
is_local:         1
user:             default
default_database:
Row 2:
──────
cluster:          perftest_3shards
shard_num:        2
shard_weight:     1
replica_num:      1
host_name:        172.30.2.162
host_address:     172.30.2.162
port:             9000
is_local:         0
user:             default
default_database:

Row 3:
──────
cluster:          perftest_3shards
shard_num:        3
shard_weight:     1
replica_num:      1
host_name:        172.30.2.36
host_address:     172.30.2.36
port:             9000
is_local:         0
user:             default
default_database:

Ansin saineoidh mé tábla nua ar an gcéad fhreastalaí atá bunaithe ar an scéimre trips_mergetree_third agus úsáideann an t-inneall Dáilte.

CREATE TABLE trips_mergetree_x3
    AS trips_mergetree_third
    ENGINE = Distributed(perftest_3shards,
                         default,
                         trips_mergetree_third,
                         rand());

Déanfaidh mé na sonraí a chóipeáil ansin ón tábla MergeTree atá bunaithe ar na trí fhreastalaí. Críochnaíodh an méid seo a leanas i 34 nóiméad agus 44 soicind.

INSERT INTO trips_mergetree_x3
    SELECT * FROM trips_mergetree;

Tar éis na hoibríochta thuas, thug mé 15 nóiméad ClickHouse chun bogadh ar shiúl ón marc leibhéal stórála uasta. Ba é 264 GB, 34 GB agus 33 GB faoi seach na heolairí sonraí ar gach ceann de na trí fhreastalaí.

Meastóireacht feidhmíochta braisle ClickHouse

Ba é an chéad rud a chonaic mé ina dhiaidh sin an t-am is tapúla atá feicthe agam ag rith gach fiosrúcháin ar bhord go minic trips_mergetree_x3.

$ clickhouse-client --host=0.0.0.0

Chríochnaigh an méid seo a leanas i 2.449 soicind.

SELECT cab_type, count(*)
FROM trips_mergetree_x3
GROUP BY cab_type;

Chríochnaigh an méid seo a leanas i 0.691 soicind.

SELECT passenger_count,
       avg(total_amount)
FROM trips_mergetree_x3
GROUP BY passenger_count;

Chríochnaigh an méid seo a leanas i 0 soicind.

SELECT passenger_count,
       toYear(pickup_date) AS year,
       count(*)
FROM trips_mergetree_x3
GROUP BY passenger_count,
         year;

Chríochnaigh an méid seo a leanas i 0.983 soicind.

SELECT passenger_count,
       toYear(pickup_date) AS year,
       round(trip_distance) AS distance,
       count(*)
FROM trips_mergetree_x3
GROUP BY passenger_count,
         year,
         distance
ORDER BY year,
         count(*) DESC;

Chun comparáid a dhéanamh, rith mé na ceisteanna céanna ar thábla MergeTree-bhunaithe a chónaíonn ar an gcéad fhreastalaí amháin.

Meastóireacht feidhmíochta ar nód ClickHouse amháin

Ba é an chéad rud a chonaic mé ina dhiaidh sin an t-am is tapúla atá feicthe agam ag rith gach fiosrúcháin ar bhord go minic trips_mergetree_x3.

Chríochnaigh an méid seo a leanas i 0.241 soicind.

SELECT cab_type, count(*)
FROM trips_mergetree
GROUP BY cab_type;

Chríochnaigh an méid seo a leanas i 0.826 soicind.

SELECT passenger_count,
       avg(total_amount)
FROM trips_mergetree
GROUP BY passenger_count;

Chríochnaigh an méid seo a leanas i 1.209 soicind.

SELECT passenger_count,
       toYear(pickup_date) AS year,
       count(*)
FROM trips_mergetree
GROUP BY passenger_count,
         year;

Chríochnaigh an méid seo a leanas i 1.781 soicind.

SELECT passenger_count,
       toYear(pickup_date) AS year,
       round(trip_distance) AS distance,
       count(*)
FROM trips_mergetree
GROUP BY passenger_count,
         year,
         distance
ORDER BY year,
         count(*) DESC;

Machnamh ar na torthaí

Is é seo an chéad uair a bhí bunachar sonraí CPU-bhunaithe saor in aisce in ann feidhmíocht níos fearr a bhaint as bunachar sonraí GPU-bhunaithe i mo thástálacha. Tá dhá athbhreithniú déanta ar an mbunachar sonraí GPU-bhunaithe sin ó shin i leith, ach tá an fheidhmíocht a sheachaid ClickHouse ar nód amháin thar a bheith suntasach mar sin féin.

Ag an am céanna, agus Iarratas 1 á fhorghníomhú ar inneall dáilte, tá na forchostais ordú méide níos airde. Tá súil agam gur chaill mé rud éigin i mo thaighde don phost seo mar bheadh ​​sé go deas amanna fiosrúcháin a fheiceáil ag laghdú agus mé ag cur níos mó nóid leis an mbraisle. Mar sin féin, is iontach an rud é gur tháinig méadú 2 uair ar fheidhmíocht nuair a bhí fiosrúcháin eile á ndéanamh.

Bheadh ​​sé go deas ClickHouse a fheiceáil ag forbairt i dtreo a bheith in ann stóráil a scaradh agus a ríomh ionas gur féidir leo scála neamhspleách a dhéanamh. D’fhéadfadh tacaíocht HDFS, a cuireadh leis anuraidh, a bheith ina chéim i dtreo seo. Maidir leis an ríomhaireacht, más féidir ceist amháin a luathú trí níos mó nóid a chur leis an mbraisle, ansin tá todhchaí na bogearraí seo an-gheal.

Go raibh maith agat as an am a ghlacadh chun an post seo a léamh. Cuirim seirbhísí comhairliúcháin, ailtireachta agus forbartha cleachtais ar fáil do chliaint i Meiriceá Thuaidh agus san Eoraip. Más mian leat conas is féidir le mo chuid moltaí cabhrú le do ghnó a phlé, déan teagmháil liom trí LinkedIn.

Foinse: will.com

Add a comment