Analysis of TSDB in Prometheus 2

The time series database (TSDB) in Prometheus 2 is an excellent example of an engineering solution that delivers major improvements over the v2 storage in Prometheus 1 in terms of data ingestion speed, query execution, and resource efficiency. We were implementing Prometheus 2 in Percona Monitoring and Management (PMM), and I had the chance to study the performance of the Prometheus 2 TSDB. In this article I will describe the results of those observations.

Typical Prometheus workload

For anyone used to working with general-purpose databases, the typical Prometheus workload is quite interesting. The ingest rate tends to be stable: the services you monitor usually send roughly the same number of metrics all the time, and the infrastructure changes relatively slowly.
Queries against the data can come from different sources. Some of them, such as alerting, also tend to be stable and predictable. Others, such as user queries, can be bursty, although that is rarely the bulk of the load.

Load test

During testing I focused on the ability to ingest data. I deployed Prometheus 2.3.2 compiled with Go 1.10.1 (as part of PMM 1.14) on Linode using this StackScript. To generate the most realistic load, I used the same StackScript to launch several MySQL nodes running a real workload (the Sysbench TPC-C test), each of which emulated 10 Linux/MySQL nodes.
All of the tests below were run on a Linode server with eight virtual cores and 32 GB of memory, running 20 load simulations that monitor two hundred MySQL instances. In Prometheus terms, that is 800 targets, 440 scrapes per second, 380 thousand samples per second, and 1.7 million active time series.
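To put the quoted figures in proportion, here is a small sketch that derives a few ratios from them. The function and key names are my own, and the ratios are plain averages across all targets, so they say nothing about how individual exporters behave.

```python
# Figures quoted in the text for the test setup.
TARGETS = 800
SCRAPES_PER_SEC = 440
SAMPLES_PER_SEC = 380_000
ACTIVE_SERIES = 1_700_000


def workload_summary(targets, scrapes_per_sec, samples_per_sec, active_series):
    """Derive average ratios from the headline workload numbers."""
    return {
        # How many samples an average scrape delivers.
        "samples_per_scrape": samples_per_sec / scrapes_per_sec,
        # How many active series an average target exposes.
        "series_per_target": active_series / targets,
        # Average time between two samples of the same series.
        "seconds_between_samples": active_series / samples_per_sec,
    }


summary = workload_summary(TARGETS, SCRAPES_PER_SEC, SAMPLES_PER_SEC, ACTIVE_SERIES)
for name, value in summary.items():
    print(f"{name}: {value:.1f}")
```

On these numbers an average scrape carries several hundred samples and an average target exposes a couple of thousand series.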

Design

The usual approach of traditional databases, including the one used by Prometheus 1.x, is to set a memory limit. If that limit is not enough to handle the load, you will see high latencies and some requests will fail. Memory usage in Prometheus 2 is instead governed by the storage.tsdb.min-block-duration setting, which determines how long samples are kept in memory before being flushed to disk (the default is 2 hours). The amount of memory required depends on the number of time series, the number of labels, and the scrape frequency, in addition to the raw ingest rate. On disk, Prometheus aims to use about 3 bytes per sample. Memory requirements, on the other hand, are considerably higher.
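The disk side of this is easy to estimate. The sketch below multiplies the roughly 3 bytes per sample from the text by the 380 thousand samples per second of the test setup; the function name is hypothetical, and index and WAL overhead are ignored, so treat the result as a lower bound rather than a measurement.

```python
BYTES_PER_SAMPLE = 3  # approximate on-disk figure quoted in the text


def disk_bytes(samples_per_sec, seconds, bytes_per_sample=BYTES_PER_SAMPLE):
    """Rough on-disk size of the samples ingested over a time span."""
    return samples_per_sec * seconds * bytes_per_sample


per_block = disk_bytes(380_000, 2 * 3600)   # one default 2-hour block
per_day = disk_bytes(380_000, 24 * 3600)    # one day of ingestion

print(f"per 2h block: {per_block / 1e9:.1f} GB")
print(f"per day:      {per_day / 1e9:.1f} GB")
```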

Although it is possible to change the block size, tuning it by hand is not recommended, so you are forced to give Prometheus as much memory as your workload needs.
If there is not enough memory to support the incoming stream of metrics, Prometheus will crash with an out-of-memory error or the OOM killer will get to it.
Adding swap to postpone the crash when Prometheus runs out of memory does not really help, because using swap causes memory usage to explode. I suspect this has to do with Go, its garbage collector, and the way it interacts with swapping.
Another interesting design decision is that the head block is flushed to disk at fixed clock times rather than at intervals counted from the start of the process.

As you can see in the graph, flushes to disk happen every two hours. If you change the min-block-duration parameter to one hour, the flushes will happen every hour, at half past the hour.
If you want to use this and other graphs in your own Prometheus installation, you can use this dashboard. It was designed for PMM but, with minor adjustments, works with any Prometheus installation.
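The clock alignment can be illustrated with a few lines of arithmetic. This is a minimal sketch that assumes block boundaries fall on multiples of the block duration since the Unix epoch, which is consistent with the minTime and maxTime values in the meta.json sample later in this article. The function name is my own, and the sketch does not model when the flush itself is triggered.

```python
TWO_HOURS_MS = 2 * 60 * 60 * 1000


def block_bounds(ts_ms, block_ms=TWO_HOURS_MS):
    """Return the [start, end) bounds of the aligned block containing ts_ms."""
    start = ts_ms - ts_ms % block_ms
    return start, start + block_ms


# A timestamp that falls inside the first parent block of the meta.json sample.
print(block_bounds(1536475000000))
```

Because the bounds depend only on the timestamp, restarting the server does not shift the block schedule.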
There is an active block, called the head block, that is kept in memory; blocks with older data are accessed through mmap(). This removes the need to configure a separate cache, but it also means you have to leave enough room for the operating system cache if you want to query data older than what the head block can hold.
It also means that the virtual memory usage reported for Prometheus will look very high, which is nothing to worry about.
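For readers unfamiliar with memory-mapped files, here is a generic sketch of the access pattern described above. It is not Prometheus code: the file contents and helper names are made up, and it only shows that a mapped file is read like a byte array while the operating system decides which pages stay cached.

```python
import mmap
import os
import tempfile


def read_range_mmap(path, offset, length):
    """Read `length` bytes at `offset` through a read-only memory mapping."""
    with open(path, "rb") as f:
        # Length 0 maps the whole file; pages are loaded lazily by the OS.
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mapped:
            return bytes(mapped[offset:offset + length])


def demo():
    # Hypothetical stand-in for an on-disk block file.
    fd, path = tempfile.mkstemp()
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(b"header|old-block-data|footer")
        return read_range_mmap(path, 7, 14)
    finally:
        os.remove(path)


print(demo())
```

The whole file counts toward the process's virtual size as soon as it is mapped, even though only the touched pages occupy physical memory, which is why the virtual memory figure looks inflated.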

Another interesting design decision is the use of a WAL (write-ahead log). As you can see from the storage documentation, Prometheus uses the WAL to protect against data loss in a crash. The exact durability guarantees, unfortunately, are not well documented. Prometheus 2.3.2 flushes the WAL to disk every 10 seconds, and this setting is not user-configurable.
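The general idea of a write-ahead log can be shown with a toy example. The record layout below (length, CRC32, payload) and all function names are assumptions made for illustration; this is not the Prometheus WAL format. The sketch appends checksummed records, syncs them to disk, and on replay stops at the first record that is truncated or fails its checksum.

```python
import os
import struct
import tempfile
import zlib

HEADER = struct.Struct("<II")  # payload length, CRC32 of the payload


def append_record(f, payload):
    """Append one checksummed record to an open binary file."""
    f.write(HEADER.pack(len(payload), zlib.crc32(payload)))
    f.write(payload)


def sync(f):
    """Force buffered records to stable storage (a timer would call this)."""
    f.flush()
    os.fsync(f.fileno())


def replay(path):
    """Return (records, valid_bytes), stopping at the first bad record."""
    records, valid = [], 0
    with open(path, "rb") as f:
        while True:
            header = f.read(HEADER.size)
            if len(header) < HEADER.size:
                break
            length, crc = HEADER.unpack(header)
            payload = f.read(length)
            if len(payload) < length or zlib.crc32(payload) != crc:
                break  # corruption: everything from here on is discarded
            records.append(payload)
            valid += HEADER.size + length
    return records, valid


def demo():
    with tempfile.TemporaryDirectory() as d:
        path = os.path.join(d, "wal")
        with open(path, "wb") as f:
            for payload in (b"sample-1", b"sample-2", b"sample-3"):
                append_record(f, payload)
            sync(f)
        # Simulate a torn write by flipping the last byte of the file.
        with open(path, "r+b") as f:
            f.seek(-1, os.SEEK_END)
            last = f.read(1)
            f.seek(-1, os.SEEK_END)
            f.write(bytes([last[0] ^ 0xFF]))
        return replay(path)


print(demo())
```

With a periodic sync, anything written after the last sync can be lost in a crash, which is the trade-off behind a fixed flush interval.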

Compaction

The Prometheus TSDB is designed much like an LSM (Log Structured Merge) store: the head block is periodically flushed to disk, while a compaction process merges several blocks together so that queries do not have to scan too many blocks. Here you can see the number of blocks I observed on the test system after a day of load.

If you want to learn more about the storage, you can inspect the meta.json file, which holds information about the existing blocks and how they came about.

{
       "ulid": "01CPZDPD1D9R019JS87TPV5MPE",
       "minTime": 1536472800000,
       "maxTime": 1536494400000,
       "stats": {
               "numSamples": 8292128378,
               "numSeries": 1673622,
               "numChunks": 69528220
       },
       "compaction": {
               "level": 2,
               "sources": [
                       "01CPYRY9MS465Y5ETM3SXFBV7X",
                       "01CPYZT0WRJ1JB1P0DP80VY5KJ",
                       "01CPZ6NR4Q3PDP3E57HEH760XS"
               ],
               "parents": [
                       {
                               "ulid": "01CPYRY9MS465Y5ETM3SXFBV7X",
                               "minTime": 1536472800000,
                               "maxTime": 1536480000000
                       },
                       {
                               "ulid": "01CPYZT0WRJ1JB1P0DP80VY5KJ",
                               "minTime": 1536480000000,
                               "maxTime": 1536487200000
                       },
                       {
                               "ulid": "01CPZ6NR4Q3PDP3E57HEH760XS",
                               "minTime": 1536487200000,
                               "maxTime": 1536494400000
                       }
               ]
       },
       "version": 1
}
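A few useful numbers can be read straight out of this file. The sketch below uses a trimmed copy of the values above and hypothetical helper names; it computes the time span the block covers, the average ingest rate implied by numSamples, and checks that the parent blocks tile the merged block without gaps.

```python
# Trimmed copy of the meta.json shown above.
META = {
    "minTime": 1536472800000,
    "maxTime": 1536494400000,
    "stats": {"numSamples": 8292128378, "numSeries": 1673622, "numChunks": 69528220},
    "compaction": {
        "level": 2,
        "parents": [
            {"minTime": 1536472800000, "maxTime": 1536480000000},
            {"minTime": 1536480000000, "maxTime": 1536487200000},
            {"minTime": 1536487200000, "maxTime": 1536494400000},
        ],
    },
}


def block_summary(meta):
    """Derive span and average rates from a block's meta.json content."""
    span_s = (meta["maxTime"] - meta["minTime"]) / 1000  # timestamps are in ms
    stats = meta["stats"]
    return {
        "span_hours": span_s / 3600,
        "samples_per_sec": stats["numSamples"] / span_s,
        "samples_per_chunk": stats["numSamples"] / stats["numChunks"],
    }


def parents_are_contiguous(meta):
    """True if the parent blocks cover the merged block with no gap or overlap."""
    parents = sorted(meta["compaction"]["parents"], key=lambda p: p["minTime"])
    if not parents:
        return False
    chained = all(a["maxTime"] == b["minTime"] for a, b in zip(parents, parents[1:]))
    return (chained
            and parents[0]["minTime"] == meta["minTime"]
            and parents[-1]["maxTime"] == meta["maxTime"])


print(block_summary(META))
print(parents_are_contiguous(META))
```

For this block, three 2-hour parents were merged into one 6-hour block, and the implied ingest rate is in line with the roughly 380 thousand samples per second quoted for the test.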

Compactions in Prometheus are tied to the moment the head block is flushed to disk. Several such operations may be carried out at that point.

Compactions do not appear to be throttled in any way and can cause large spikes in disk I/O while they run.

Compactions also cause spikes in CPU load.

This, of course, hurts the performance of the system, and it poses a serious challenge for LSM storage: how do you run compactions so that query performance stays high without causing excessive overhead?
Memory usage during compaction is also quite interesting.

We can see that, after compaction, a large share of memory changes state from Cached to Free, which means potentially valuable data has been evicted from it. I wonder whether posix_fadvise() or some other technique for minimizing cache washout is in use here, or whether the cache was simply freed because the cached blocks were destroyed during compaction.

Crash recovery

Crash recovery takes time, and for understandable reasons. With an incoming stream of about one million samples per second, I had to wait roughly 25 minutes for recovery to complete on an SSD drive.

level=info ts=2018-09-13T13:38:14.09650965Z caller=main.go:222 msg="Starting Prometheus" version="(version=2.3.2, branch=v2.3.2, revision=71af5e29e815795e9dd14742ee7725682fa14b7b)"
level=info ts=2018-09-13T13:38:14.096599879Z caller=main.go:223 build_context="(go=go1.10.1, user=Jenkins, date=20180725-08:58:13)"
level=info ts=2018-09-13T13:38:14.096624109Z caller=main.go:224 host_details="(Linux 4.15.0-32-generic #35-Ubuntu SMP Fri Aug 10 17:58:07 UTC 2018 x86_64 1bee9e9b78cf (none))"
level=info ts=2018-09-13T13:38:14.096641396Z caller=main.go:225 fd_limits="(soft=1048576, hard=1048576)"
level=info ts=2018-09-13T13:38:14.097715256Z caller=web.go:415 component=web msg="Start listening for connections" address=:9090
level=info ts=2018-09-13T13:38:14.097400393Z caller=main.go:533 msg="Starting TSDB ..."
level=info ts=2018-09-13T13:38:14.098718401Z caller=repair.go:39 component=tsdb msg="found healthy block" mint=1536530400000 maxt=1536537600000 ulid=01CQ0FW3ME8Q5W2AN5F9CB7R0R
level=info ts=2018-09-13T13:38:14.100315658Z caller=web.go:467 component=web msg="router prefix" prefix=/prometheus
level=info ts=2018-09-13T13:38:14.101793727Z caller=repair.go:39 component=tsdb msg="found healthy block" mint=1536732000000 maxt=1536753600000 ulid=01CQ78486TNX5QZTBF049PQHSM
level=info ts=2018-09-13T13:38:14.102267346Z caller=repair.go:39 component=tsdb msg="found healthy block" mint=1536537600000 maxt=1536732000000 ulid=01CQ78DE7HSQK0C0F5AZ46YGF0
level=info ts=2018-09-13T13:38:14.102660295Z caller=repair.go:39 component=tsdb msg="found healthy block" mint=1536775200000 maxt=1536782400000 ulid=01CQ7SAT4RM21Y0PT5GNSS146Q
level=info ts=2018-09-13T13:38:14.103075885Z caller=repair.go:39 component=tsdb msg="found healthy block" mint=1536753600000 maxt=1536775200000 ulid=01CQ7SV8WJ3C2W5S3RTAHC2GHB
level=error ts=2018-09-13T14:05:18.208469169Z caller=wal.go:275 component=tsdb msg="WAL corruption detected; truncating" err="unexpected CRC32 checksum d0465484, want 0" file=/opt/prometheus/data/.prom2-data/wal/007357 pos=15504363
level=info ts=2018-09-13T14:05:19.471459777Z caller=main.go:543 msg="TSDB started"
level=info ts=2018-09-13T14:05:19.471604598Z caller=main.go:603 msg="Loading configuration file" filename=/etc/prometheus.yml
level=info ts=2018-09-13T14:05:19.499156711Z caller=main.go:629 msg="Completed loading of configuration file" filename=/etc/prometheus.yml
level=info ts=2018-09-13T14:05:19.499228186Z caller=main.go:502 msg="Server is ready to receive web requests."
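The recovery time can be measured directly from such a log. The sketch below is a hypothetical helper that finds the "Starting TSDB ..." and "TSDB started" lines and subtracts their timestamps at whole-second resolution; the two lines are copied from the log above.

```python
import re
from datetime import datetime

LOG = '''level=info ts=2018-09-13T13:38:14.097400393Z caller=main.go:533 msg="Starting TSDB ..."
level=info ts=2018-09-13T14:05:19.471459777Z caller=main.go:543 msg="TSDB started"'''

# Capture the timestamp down to whole seconds; the fractional part is ignored.
TS_RE = re.compile(r"ts=(\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2})")


def timestamp_of(lines, message):
    """Return the timestamp of the first line carrying the given msg value."""
    for line in lines:
        if f'msg="{message}"' in line:
            match = TS_RE.search(line)
            if match:
                return datetime.strptime(match.group(1), "%Y-%m-%dT%H:%M:%S")
    return None


def recovery_seconds(log_text):
    """Seconds between TSDB start and TSDB ready, or None if a line is missing."""
    lines = log_text.splitlines()
    start = timestamp_of(lines, "Starting TSDB ...")
    end = timestamp_of(lines, "TSDB started")
    if start is None or end is None:
        return None
    return (end - start).total_seconds()


seconds = recovery_seconds(LOG)
print(f"recovery took {seconds / 60:.1f} minutes")
```

For this particular run the gap is about 27 minutes, nearly all of it spent before the WAL corruption message is logged.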

The main problem with the recovery process is its high memory usage. Even though the server may run stably with memory to spare under normal conditions, after a crash it may be unable to recover because of OOM. The only solution I found was to disable scraping, bring the server up, let it recover, and then restart it with scraping enabled.

Warming up

Another behavior to keep in mind is warm-up: lower performance combined with higher resource usage right after start. During some, but not all, starts I observed a heavy load on CPU and memory.

The gaps in the memory usage graph show that Prometheus cannot perform all the configured scrapes from the start, so some data is lost.
I have not pinned down the exact causes of the high CPU and memory load. I suspect it comes from new time series being created in the head block at a high rate.

CPU load spikes

In addition to compactions, which put a heavy load on disk I/O, I noticed sharp spikes in CPU load every two minutes. The spikes last longer when the ingest rate is higher, and they appear to be caused by Go's garbage collector, with at least some cores fully loaded.

These spikes are not merely cosmetic. It appears that when they occur, the internal Prometheus /metrics endpoint becomes unavailable, which produces gaps in the data over exactly those intervals.

You can also see the Prometheus exporter hitting its one-second timeout.

We can see that this correlates with garbage collection (GC).

Conclusion

The TSDB in Prometheus 2 is fast: it can handle millions of time series and, at the same time, hundreds of thousands of samples per second on fairly modest hardware. CPU and disk I/O usage are also impressive. My example showed up to 200,000 metrics per second per CPU core used.

To plan for growth, you need to make sure there is enough memory, and it has to be real memory. The memory usage I observed was about 5 GB per 100,000 samples per second of incoming stream, which together with the operating system cache came to about 8 GB of memory in use.
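As a planning aid, the rule of thumb can be turned into a one-line estimate. This sketch reads the observation as roughly 5 GB of Prometheus memory per 100,000 samples per second and assumes it scales linearly, which the article does not verify; the function name is my own, and operating system cache has to be budgeted on top.

```python
GB_PER_100K_SAMPLES_PER_SEC = 5.0  # rule of thumb taken from the observation above


def estimated_memory_gb(samples_per_sec, gb_per_100k=GB_PER_100K_SAMPLES_PER_SEC):
    """Linear estimate of Prometheus memory use, excluding OS cache."""
    return samples_per_sec / 100_000 * gb_per_100k


# Ingest rate of the test setup described earlier.
print(f"{estimated_memory_gb(380_000):.0f} GB")
```

Applied to the test setup, the estimate leaves a reasonable margin on the 32 GB server used for these measurements.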

Of course, there is still a lot of work to be done to tame the CPU and disk I/O spikes, which is not surprising given how young the Prometheus 2 TSDB is compared with InnoDB, TokuDB, RocksDB, and WiredTiger. All of them had similar problems early in their lives.

source: www.habr.com
