Logs in Kubernetes (and not only) today: expectations and reality

It is 2019, and we still do not have a standard solution for log aggregation in Kubernetes. In this article we want to share, using examples from real practice, what we looked for, the problems we ran into, and how we solved them.

First, though, a caveat: different customers mean very different things by "collecting logs":

  • some want to see security and audit logs;
  • some want centralized logging of the entire infrastructure;
  • and for some it is enough to collect only application logs, excluding, for example, load balancers.

Below is how we implemented these various "wish lists" and what difficulties we encountered.

Theory: about logging tools

Background on the components of a logging system

Logging has come a long way, and along that way the methods for collecting and analyzing logs that we use today were developed. Back in the 1950s, Fortran introduced an analogue of standard input/output streams, which helped programmers debug their programs. These were the first computer logs, and they made life easier for the programmers of that time. Today we see in them the first component of a logging system: the source, or "producer", of logs.

Computer science did not stand still: computer networks appeared, then the first clusters... Complex systems consisting of several computers came into operation. System administrators now had to collect logs from several machines, and in special cases they could add OS kernel messages in case a system failure had to be investigated. To describe centralized log collection, RFC 3164 was published in the early 2000s, standardizing remote_syslog. This is how another important component appeared: the log collector and log storage.

As log volumes grew and web technologies spread, the question arose of which logs should be shown to users in a convenient way. Simple console tools (awk/sed/grep) were replaced by more advanced log viewers, the third component.

The growth in log volume made something else clear as well: logs are needed, but not all of them. Different logs also need different retention levels: some can be lost within a day, while others have to be kept for 5 years. So a component for filtering and routing data streams was added to the logging system; let's call it the filter.

Storage has also made a big leap: from ordinary files to relational databases, and then to document-oriented storage (for example, Elasticsearch). This is how storage became separated from the collector.

In the end, the very concept of a log has broadened into a kind of abstract stream of events that we want to preserve for history. Or rather, in case we need to conduct an investigation or put together an analytical report...

As a result, in a relatively short time, log collection has grown into an important subsystem that can rightfully be called one of the subsections of Big Data.

If ordinary prints could once be enough for a "logging system", the situation has changed a great deal since then.

Kubernetes and logs

When Kubernetes came to the infrastructure, the already existing problem of collecting logs did not pass it by. In some ways it became even more painful: managing the infrastructure platform was not only simplified but also complicated at the same time. Many old services began migrating to microservices. For logging, this means a growing number of log sources, their peculiar life cycle, and the need to trace the relationships between all system components through logs...

Looking ahead, I can state that, unfortunately, there is currently no standardized logging option for Kubernetes that compares favorably with all the others. The most popular schemes in the community are as follows:

  • some deploy the EFK stack (Elasticsearch, Fluentd, Kibana);
  • some try the recently released Loki or use the Logging operator;
  • we (and perhaps not only we?..) are largely satisfied with our own development, loghouse...

As a rule, in K8s clusters (for self-hosted solutions) we use bundles built around Fluentd: either with Elasticsearch and Kibana, or with ClickHouse and loghouse.

However, I will not dwell on instructions for installing and configuring them. Instead, I will focus on their shortcomings and on more general conclusions about the state of logging as a whole.

Practice with logs in K8s

"Everyday logs", how many of you are there?..

Centralized collection of logs from a sufficiently large infrastructure requires considerable resources, which go into collecting, storing, and processing the logs. While operating various projects, we faced a variety of requirements and the operational problems that follow from them.

Let's try ClickHouse

Let's look at centralized storage on a project with an application that generates logs quite actively: more than 5000 lines per second. We start working with its logs by writing them to ClickHouse.

As soon as near-real-time delivery is required, a 4-core server with ClickHouse is already overloaded on the disk subsystem.

This kind of load comes from the fact that we are trying to write to ClickHouse as quickly as possible. The database reacts with increased disk load, which can cause errors like the following:

DB::Exception: Too many parts (300). Merges are processing significantly slower than inserts

The point is that MergeTree tables in ClickHouse (which hold the log data) have their own difficulties with write operations. Data inserted into them creates a temporary part, which is then merged into the main table. As a result, writing is very demanding on the disk, and it is also subject to the limit we were notified about above: no more than 300 sub-parts can be merged in 1 second (in effect, that is 300 inserts per second).

To avoid this behavior, you should write to ClickHouse in chunks that are as large as possible and no more often than once every 2 seconds. However, writing in large batches means writing to ClickHouse less frequently. That, in turn, can lead to buffer overflow and loss of logs. The solution is to increase the Fluentd buffer, but then memory consumption grows as well.
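The trade-off described above can be illustrated with a small sketch. It is not Fluentd's or loghouse's actual code: `BatchingWriter`, its parameters, and the `flush` callback are hypothetical names, and the real insert into ClickHouse is left to the caller. The sketch only shows the idea of writing at most once per interval while a bounded buffer fills up (and starts dropping rows when it overflows).

```python
import time
from typing import Callable, List


class BatchingWriter:
    """Buffers log rows and hands them to `flush` in large, infrequent batches.

    Hypothetical illustration: `flush` stands in for a bulk INSERT.
    """

    def __init__(self, flush: Callable[[List[str]], None],
                 min_interval: float = 2.0,
                 max_buffered: int = 100_000,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._flush = flush
        self._min_interval = min_interval    # write no more often than this
        self._max_buffered = max_buffered    # memory bound of the buffer
        self._clock = clock
        self._rows: List[str] = []
        self._last_flush = clock()
        self.dropped = 0                     # rows lost to buffer overflow

    def add(self, row: str) -> None:
        if len(self._rows) >= self._max_buffered:
            self.dropped += 1                # buffer is full: the row is lost
            return
        self._rows.append(row)
        self.maybe_flush()

    def maybe_flush(self) -> bool:
        """Flush only if the interval has elapsed and there is data."""
        now = self._clock()
        if not self._rows or now - self._last_flush < self._min_interval:
            return False
        batch, self._rows = self._rows, []
        self._flush(batch)                   # one large write instead of many
        self._last_flush = now
        return True
```

A larger `max_buffered` loses fewer rows during bursts but costs more memory, which is exactly the tension between buffer size and memory consumption mentioned above.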

Note: Another problematic aspect of our ClickHouse solution is that partitioning in our case (loghouse) is implemented through external tables joined by a Merge table. As a result, selecting over large time intervals requires excessive RAM, since the metatable iterates over all partitions, even those that obviously do not contain the needed data. However, this approach can now safely be declared obsolete for current versions of ClickHouse (since 18.16).

As a result, it becomes clear that not every project has enough resources to collect logs in real time in ClickHouse (more precisely, allocating them would not be reasonable). In addition, you will need an accumulator, to which we will return later. The case described above is real. At that time we could not offer a reliable and stable solution that would suit the customer and allow logs to be collected with minimal delay...

What about Elasticsearch?

Elasticsearch is known to handle heavy workloads. Let's try it on the same project.

Elasticsearch was able to digest the data stream; however, writing such volumes to it uses the CPU heavily. This is solved by organizing a cluster. Technically this is not a problem, but it turns out that just to run the log collection system we already use about 8 cores and have one more highly loaded component in the system...

Bottom line: this option can be justified, but only if the project is large and its management is ready to spend significant resources on a centralized logging system.

Then a natural question arises:

What logs are really needed?

Let's try to change the approach itself: logs should be informative and, at the same time, should not cover every event in the system.

Say we have a successful online store. Which logs are important? Collecting as much information as possible from, for example, the payment gateway is a great idea. But not all logs from the image slicing service in the product catalog are critical to us: errors and advanced monitoring are enough (for example, the percentage of 500 errors that this component generates).

So we have come to the conclusion that centralized logging is not always justified. Very often the client wants to collect all logs in one place, although in reality only a notional 5% of the whole log, the messages critical to the business, is actually needed:

  • Sometimes it is enough to configure, say, only the size of the container log and an error collector (for example, Sentry).
  • An error notification plus a large local log are often enough to investigate incidents.
  • We have had projects that got by entirely with functional tests and error collection systems. The developers did not need logs as such; they saw everything from the error traces.

An example from life

Another story can serve as a good example. We received a request from the security team of one of our clients, which was already using a commercial solution developed long before Kubernetes was introduced.

We had to "make friends" between the centralized log collection system and the corporate problem detection sensor, QRadar. This system can receive logs via the syslog protocol and can also fetch them from FTP. However, it could not be integrated right away with the remote_syslog plugin for fluentd (as it turned out, we were not the only ones with this problem). The problems with configuring QRadar turned out to be on the side of the client's security team.

As a result, part of the business-critical logs was uploaded to QRadar's FTP, and the other part was forwarded via remote syslog directly from the nodes. For this we even wrote a simple chart; perhaps it will help someone solve a similar problem... Thanks to the resulting scheme, the client received and analyzed the critical logs itself (using its favorite tools), and we were able to reduce the cost of the logging system by keeping only the last month of logs.
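For readers unfamiliar with what goes over the wire in such a remote syslog setup, here is a minimal sketch of building a message in the RFC 3164 layout and sending it over UDP. It is not the chart mentioned above; the function names, the host name, and the tag are made up for illustration, and a real deployment would rely on a syslog daemon or a fluentd plugin instead.

```python
import socket
from datetime import datetime

_MONTHS = ("Jan", "Feb", "Mar", "Apr", "May", "Jun",
           "Jul", "Aug", "Sep", "Oct", "Nov", "Dec")


def format_rfc3164(facility: int, severity: int, when: datetime,
                   hostname: str, tag: str, message: str) -> str:
    """Build '<PRI>Mmm dd hh:mm:ss HOSTNAME TAG: MSG'."""
    pri = facility * 8 + severity  # priority value as defined by RFC 3164
    # The day of month is padded with a space, not a zero.
    timestamp = f"{_MONTHS[when.month - 1]} {when.day:2d} {when:%H:%M:%S}"
    return f"<{pri}>{timestamp} {hostname} {tag}: {message}"


def send_udp(packet: str, host: str, port: int = 514) -> None:
    """Send one syslog packet over UDP (fire and forget, no delivery guarantee)."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(packet.encode("utf-8"), (host, port))


# facility 16 is local0, severity 6 is informational
packet = format_rfc3164(16, 6, datetime(2019, 10, 9, 13, 10, 43),
                        "node-1", "payments", "order paid")
# packet == "<134>Oct  9 13:10:43 node-1 payments: order paid"
```

Since UDP gives no delivery guarantee, this kind of forwarding suits streams where occasional loss is acceptable, or it should be paired with a local buffer.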

Another example shows quite clearly what not to do. One of our clients, when processing each event coming from a user, wrote multiline, unstructured output to the log. As you might guess, such logs were extremely inconvenient both to read and to store.

Criteria for logs

Examples like these lead to the conclusion that, besides choosing a log collection system, you also need to design the logs themselves! What are the requirements here?

  • Logs must be in a machine-readable format (for example, JSON).
  • Logs should be compact, with the ability to change the logging level in order to debug possible problems. At the same time, in production environments systems should run with a logging level such as Warning or Error.
  • Logs must be normalized, that is, within a log object every line must have the same field types.
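These requirements can be sketched with Python's standard `logging` module. This is one possible arrangement, not a prescribed one: the `LOG_LEVEL` environment variable, the `build_logger` helper, and the chosen field names are assumptions made for the example.

```python
import io
import json
import logging
import os


class JsonFormatter(logging.Formatter):
    """Render every record as one JSON object with a fixed set of fields."""

    def format(self, record: logging.LogRecord) -> str:
        entry = {
            "time": self.formatTime(record, "%Y-%m-%dT%H:%M:%S%z"),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        return json.dumps(entry, ensure_ascii=False)


def build_logger(name: str, stream, default_level: str = "WARNING") -> logging.Logger:
    """Create a JSON logger whose level can be raised or lowered via LOG_LEVEL."""
    level_name = os.environ.get("LOG_LEVEL", default_level).upper()
    logger = logging.getLogger(name)
    logger.setLevel(getattr(logging, level_name, logging.WARNING))
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger.handlers = [handler]   # replace handlers so repeated calls do not duplicate output
    logger.propagate = False
    return logger
```

With the default level, `logger.info(...)` calls produce nothing in production, and setting `LOG_LEVEL=DEBUG` for a single deployment turns on detailed output when a problem has to be investigated.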

Unstructured logs can cause problems with loading logs into storage and can bring their processing to a complete halt. As an illustration, here is an example with a 400 error, which many have certainly come across in fluentd logs:

2019-10-29 13:10:43 +0000 [warn]: dump an error event: error_class=Fluent::Plugin::ElasticsearchErrorHandler::ElasticsearchError error="400 - Rejected by Elasticsearch"

The error means that you are sending a field with an unstable type to an index that already has a mapping. The simplest example is a field in the nginx log filled from the variable $upstream_status. It can contain either a number or a string. For example:

{ "ip": "1.2.3.4", "http_user": "-", "request_id": "17ee8a579e833b5ab9843a0aca10b941", "time": "29/Oct/2019:16:18:57 +0300", "method": "GET", "uri": "/staffs/265.png", "protocol": "HTTP/1.1", "status": "200", "body_size": "906", "referrer": "https://example.com/staff", "user_agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.70 Safari/537.36", "request_time": "0.001", "cache_status": "-", "upstream_response_time": "0.001, 0.007", "upstream_addr": "127.0.0.1:9000", "upstream_status": "200", "upstream_response_length": "906", "location": "staff"}
{ "ip": "1.2.3.4", "http_user": "-", "request_id": "47fe42807f2a7d8d5467511d7d553a1b", "time": "29/Oct/2019:16:18:57 +0300", "method": "GET", "uri": "/staff", "protocol": "HTTP/1.1", "status": "200", "body_size": "2984", "referrer": "-", "user_agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.70 Safari/537.36", "request_time": "0.010", "cache_status": "-", "upstream_response_time": "0.001, 0.007", "upstream_addr": "10.100.0.10:9000, 10.100.0.11:9000", "upstream_status": "404, 200", "upstream_response_length": "0, 2984", "location": "staff"}

The logs show that the server 10.100.0.10 responded with a 404 error and the request was passed on to another content storage. As a result, the value in the logs became:

"upstream_response_time": "0.001, 0.007"

This situation is so common that it has even earned a separate mention in the documentation.
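One way to keep such fields type-stable is to normalize them before they reach the storage, so that the value is always a list of numbers regardless of how many upstreams were tried. The sketch below is an assumption about how that could be done in a processing step; the function name is hypothetical, and it treats both commas and colons as separators for the numeric upstream fields only (not for upstream_addr, which itself contains colons).

```python
import re
from typing import Any, Dict

# Separators between upstream attempts; applied only to numeric fields.
_SEPARATOR = re.compile(r"\s*[,:]\s*")

_NUMERIC_FIELDS = (
    ("upstream_status", int),
    ("upstream_response_time", float),
    ("upstream_response_length", int),
)


def normalize_upstream(entry: Dict[str, Any]) -> Dict[str, Any]:
    """Return a copy of the entry where upstream_* values are lists of numbers."""
    result = dict(entry)
    for field, cast in _NUMERIC_FIELDS:
        raw = entry.get(field)
        if not isinstance(raw, str):
            continue  # field is absent or already converted
        parts = _SEPARATOR.split(raw.strip())
        # "-" marks an attempt for which nginx recorded no value
        result[field] = [cast(p) for p in parts if p not in ("", "-")]
    return result
```

After this step "200" becomes [200] and "404, 200" becomes [404, 200], so both lines from the example above carry the same field type.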

What about reliability?

There are cases when all logs without exception are vital. And here the typical log collection schemes for K8s proposed and discussed above have problems.

For example, fluentd cannot collect logs from short-lived containers. In one of our projects, the database migration container lived for less than 4 seconds and was then deleted, according to the corresponding annotation:

"helm.sh/hook-delete-policy": hook-succeeded

Because of this, the migration execution log never made it into the storage. The before-hook-creation policy can help in this case.

Another example is Docker log rotation. Say there is an application that writes to its logs actively. Under normal conditions we manage to process all the logs, but as soon as a problem appears (for example, the one described above with an incorrect format), processing stops and Docker rotates the file. The bottom line is that business-critical logs may be lost.

That is why it is important to separate log streams, embedding the sending of the most valuable ones directly into the application to ensure they are preserved. In addition, it would not hurt to create some kind of "accumulator" of logs that can survive brief storage unavailability while preserving critical messages.
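The idea of an accumulator for the critical stream can be sketched as a bounded queue that keeps messages while the storage is unavailable and retries later. `LogAccumulator` and the `send` callback are hypothetical names; this sketch keeps messages only in memory, so it survives a short storage outage but not a restart of the application itself.

```python
from collections import deque
from typing import Callable, Deque


class LogAccumulator:
    """Holds critical messages until `send` succeeds, preserving their order."""

    def __init__(self, send: Callable[[str], None], capacity: int = 10_000) -> None:
        self._send = send
        # When the queue is full, the oldest message is discarded.
        self._pending: Deque[str] = deque(maxlen=capacity)

    def emit(self, message: str) -> None:
        self._pending.append(message)
        self.drain()

    def drain(self) -> int:
        """Try to deliver queued messages; return how many were delivered."""
        delivered = 0
        while self._pending:
            try:
                self._send(self._pending[0])
            except OSError:
                break  # storage unavailable: keep the message and retry later
            self._pending.popleft()
            delivered += 1
        return delivered

    def __len__(self) -> int:
        return len(self._pending)
```

An application would route only its business-critical events through such an object and call `drain()` periodically, while ordinary logs continue to go to stdout and the usual collector.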

Finally, we must not forget that any subsystem needs to be monitored properly. Otherwise it is easy to end up in a situation where fluentd is in the CrashLoopBackOff state and sends nothing, which promises the loss of important information.

Conclusions

In this article we do not look at SaaS solutions such as Datadog. Many of the problems described here have already been solved, one way or another, by commercial companies that specialize in collecting logs, but not everyone can use SaaS, for various reasons (the main ones being cost and compliance with 152-FZ).

Centralized log collection looks like a simple task at first, but it is not. It is important to remember that:

  • Only critical components need detailed logging, while for other systems monitoring and error collection can be set up instead.
  • Logs in production should be kept minimal so as not to add unnecessary load.
  • Logs must be machine-readable, normalized, and have a strict format.
  • Truly critical logs must be sent in a separate stream, kept apart from the main ones.
  • It is worth considering a log accumulator, which can save you from bursts of high load and make the load on the storage more even.

These simple rules, if applied everywhere, would allow the schemes described above to work, even though they lack an important component (the accumulator). If you do not follow such principles, the task will easily lead you and your infrastructure to yet another highly loaded (and at the same time ineffective) component of the system.

source: www.habr.com
