Fluentd: why it is important to configure the output buffer


Nowadays it is hard to imagine a Kubernetes-based project without the ELK stack, which stores the logs of both the applications and the cluster's system components. In our practice we use the EFK stack, with Fluentd instead of Logstash.

Fluentd is a modern, universal log collector that is steadily gaining popularity and has joined the Cloud Native Computing Foundation, which is why its development is oriented toward use with Kubernetes.

Using Fluentd instead of Logstash does not change the essence of the stack. However, Fluentd has its own specific nuances, which stem from its versatility.

For example, when we started using EFK in a busy project with a high logging rate, we found that some messages were displayed in Kibana several times over. In this article we explain why this happens and how to solve the problem.

The problem of duplicated documents

In our projects, Fluentd is deployed as a DaemonSet (launched automatically as one instance on each node of the Kubernetes cluster) and tracks the containers' stdout logs in /var/log/containers. After collection and processing, the logs are sent as JSON documents to ElasticSearch, which is deployed either as a cluster or standalone, depending on the scale of the project and its requirements for performance and fault tolerance. Kibana serves as the graphical interface.

When using Fluentd with an output buffering plugin, we ran into a situation where some documents in ElasticSearch had exactly the same content and differed only in their identifier. You can verify that this is a repeated message by taking an Nginx log as an example. In the log file, this message exists in a single copy:

127.0.0.1 192.168.0.1 - [28/Feb/2013:12:00:00 +0900] "GET / HTTP/1.1" 200 777 "-" "Opera/12.0" -

However, ElasticSearch contains several documents with this message:

{
  "_index": "test-custom-prod-example-2020.01.02",
  "_type": "_doc",
  "_id": "HgGl_nIBR8C-2_33RlQV",
  "_version": 1,
  "_score": 0,
  "_source": {
    "service": "test-custom-prod-example",
    "container_name": "nginx",
    "namespace": "test-prod",
    "@timestamp": "2020-01-14T05:29:47.599052886+00:00",
    "log": "127.0.0.1 192.168.0.1 - [28/Feb/2013:12:00:00 +0900] \"GET / HTTP/1.1\" 200 777 \"-\" \"Opera/12.0\" -",
    "tag": "custom-log"
  }
}

{
  "_index": "test-custom-prod-example-2020.01.02",
  "_type": "_doc",
  "_id": "IgGm_nIBR8C-2_33e2ST",
  "_version": 1,
  "_score": 0,
  "_source": {
    "service": "test-custom-prod-example",
    "container_name": "nginx",
    "namespace": "test-prod",
    "@timestamp": "2020-01-14T05:29:47.599052886+00:00",
    "log": "127.0.0.1 192.168.0.1 - [28/Feb/2013:12:00:00 +0900] \"GET / HTTP/1.1\" 200 777 \"-\" \"Opera/12.0\" -",
    "tag": "custom-log"
  }
}

Moreover, there can be more than two repetitions.

While this problem is occurring, the Fluentd logs contain a large number of warnings like the following:

2020-01-16 01:46:46 +0000 [warn]: [test-prod] failed to flush the buffer. retry_time=4 next_retry_seconds=2020-01-16 01:46:53 +0000 chunk="59c37fc3fb320608692c352802b973ce" error_class=Fluent::Plugin::ElasticsearchOutput::RecoverableRequestFailure error="could not push logs to Elasticsearch cluster ({:host=>"elasticsearch", :port=>9200, :scheme=>"http", :user=>"elastic", :password=>"obfuscated"}): read timeout reached"

These warnings occur when ElasticSearch cannot return a response to a request within the time set by the request_timeout parameter, which is why the forwarded buffer chunk cannot be cleared. After that, Fluentd tries to send the buffer chunk to ElasticSearch again, and after an arbitrary number of attempts the operation completes successfully:

2020-01-16 01:47:05 +0000 [warn]: [test-prod] retry succeeded. chunk_id="59c37fc3fb320608692c352802b973ce" 
2020-01-16 01:47:05 +0000 [warn]: [test-prod] retry succeeded. chunk_id="59c37fad241ab300518b936e27200747" 
2020-01-16 01:47:05 +0000 [warn]: [test-dev] retry succeeded. chunk_id="59c37fc11f7ab707ca5de72a88321cc2" 
2020-01-16 01:47:05 +0000 [warn]: [test-dev] retry succeeded. chunk_id="59c37fb5adb70c06e649d8c108318c9b" 
2020-01-16 01:47:15 +0000 [warn]: [kube-system] retry succeeded. chunk_id="59c37f63a9046e6dff7e9987729be66f"

However, ElasticSearch treats each of the forwarded buffer chunks as unique and assigns unique _id field values to their documents during indexing. This is how copies of messages appear.
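The mechanism can be illustrated with a toy simulation. This is only a sketch under a stated assumption: that a request which times out on the Fluentd side has nevertheless been indexed on the ElasticSearch side, which is our reading of the behavior described above. The ToyIndex class and the send_with_retry function are hypothetical stand-ins, not part of Fluentd or ElasticSearch.

```python
import uuid


class ToyIndex:
    """Hypothetical stand-in for an index that gives every incoming document a fresh _id."""

    def __init__(self):
        self.docs = {}

    def bulk_index(self, chunk):
        for doc in chunk:
            # A new identifier is generated on every call, even for identical content.
            self.docs[uuid.uuid4().hex] = doc


def send_with_retry(index, chunk, timeouts_before_success):
    """Resend the same chunk until a reply 'arrives in time'.

    Assumption: each timed-out attempt was still indexed on the server side.
    """
    attempts = 0
    while True:
        attempts += 1
        index.bulk_index(chunk)
        if attempts > timeouts_before_success:
            return attempts


index = ToyIndex()
chunk = [{"log": "GET / HTTP/1.1 200", "tag": "custom-log"}]
attempts = send_with_retry(index, chunk, timeouts_before_success=2)
copies = sum(1 for doc in index.docs.values() if doc == chunk[0])
print(attempts, copies)
```

With two timed-out attempts followed by a successful one, the same message ends up stored three times under three different identifiers.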

In Kibana it looks like this:

[Screenshot: duplicated messages in Kibana]

Solving the problem

There are several ways to solve this problem. One of them is the mechanism built into the fluent-plugin-elasticsearch plugin for generating a unique hash for each document. If you use this mechanism, ElasticSearch recognizes repetitions at the forwarding stage and prevents duplicate documents. However, this approach treats the symptom and does not eliminate the timeout error itself, so we decided against it.
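The idea behind hash-based identifiers can be sketched as follows. This is not the plugin's implementation: document_id and index_chunk are hypothetical helpers, and the index is modelled as a plain dictionary. The sketch hashes the document content only, so two genuinely separate events with identical content would also collapse into one here.

```python
import hashlib
import json


def document_id(doc):
    """Derive a stable identifier from the document content (hypothetical helper)."""
    canonical = json.dumps(doc, sort_keys=True, separators=(",", ":"))
    return hashlib.sha1(canonical.encode("utf-8")).hexdigest()


def index_chunk(index, chunk):
    """Toy indexing: writing the same identifier twice overwrites instead of duplicating."""
    for doc in chunk:
        index[document_id(doc)] = doc


index = {}
chunk = [{"log": "GET / HTTP/1.1 200", "tag": "custom-log"}]
index_chunk(index, chunk)  # first attempt, reply lost to a timeout
index_chunk(index, chunk)  # retry of the same chunk
print(len(index))
```

Because the identifier depends only on the content, the retry overwrites the first copy and the index holds a single document. The timeouts themselves still occur, which is the reason given above for not relying on this approach.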

We use a buffering plugin on the Fluentd output to prevent log loss during short-term network problems or bursts of logging. If for some reason ElasticSearch cannot write a document to the index immediately, the document is queued and stored on disk. Therefore, in our case, to eliminate the root cause of the error described above, we need to set correct values for the buffering parameters, so that the Fluentd output buffer is large enough and at the same time can be flushed within the allotted time.

Note that the values of the parameters discussed below are individual for each specific use of buffering in output plugins, because they depend on many factors: how intensively services write messages to the log, disk performance, network channel load and bandwidth. To obtain buffer settings that suit a particular case without being excessive, and to avoid a long blind search, you can use the debugging information that Fluentd writes to its own log during operation and arrive at correct values relatively quickly.

At the time the problem was recorded, the configuration looked like this:

<buffer>
  @type file
  path /var/log/fluentd-buffers/kubernetes.test.buffer
  flush_mode interval
  retry_type exponential_backoff
  flush_thread_count 2
  flush_interval 5s
  retry_forever
  retry_max_interval 30
  chunk_limit_size 8M
  queue_limit_length 8
  overflow_action block
</buffer>

While solving the problem, we selected the values of the following parameters manually:

  • chunk_limit_size - the size of the chunks into which messages in the buffer are split.
  • flush_interval - the time interval after which the buffer is flushed.
  • queue_limit_length - the maximum number of chunks in the queue.
  • request_timeout - the time for which the connection between Fluentd and ElasticSearch is established.

The total buffer size can be calculated by multiplying the queue_limit_length and chunk_limit_size parameters, which can be read as "the maximum number of chunks in the queue, each of which has a given size." If the buffer size is insufficient, the following warning appears in the logs:

2020-01-21 10:22:57 +0000 [warn]: [test-prod] failed to write data into buffer by buffer overflow action=:block

It means that the buffer cannot be flushed within the allotted time and that data arriving at the full buffer is blocked, which leads to the loss of part of the logs.
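As a quick check of the arithmetic, the sketch below computes the total buffer size for the configuration shown earlier (chunk_limit_size 8M, queue_limit_length 8). The parse_size helper is hypothetical and assumes that the k, m and g suffixes denote binary multiples (1024-based).

```python
UNITS = {"k": 1024, "m": 1024 ** 2, "g": 1024 ** 3}


def parse_size(value):
    """Convert a size such as '8M' to bytes (assumes 1024-based k/m/g suffixes)."""
    value = value.strip().lower()
    if value[-1] in UNITS:
        return int(value[:-1]) * UNITS[value[-1]]
    return int(value)


def total_buffer_bytes(chunk_limit_size, queue_limit_length):
    """Total buffer size = size of one chunk * maximum number of chunks in the queue."""
    return parse_size(chunk_limit_size) * queue_limit_length


total = total_buffer_bytes("8M", 8)
print(total // 1024 ** 2, "MiB")
```

For these values the buffer can hold 64 MiB before overflow_action takes effect.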

You can enlarge the buffer in two ways: by increasing either the size of each chunk in the queue or the number of chunks that the queue can hold.

If you set chunk_limit_size to more than 32 megabytes, ElasticSearch will not accept the chunk, because the incoming packet will be too large. Therefore, if you need to enlarge the buffer further, it is better to increase the maximum queue length, queue_limit_length.
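A small sketch of that rule of thumb: keep the chunk size under the 32-megabyte ceiling mentioned above and grow the queue instead. The queue_length_for function and the 256 MiB target are hypothetical and serve only as an illustration.

```python
import math

MAX_CHUNK_BYTES = 32 * 1024 ** 2  # the 32-megabyte ceiling mentioned in the text


def queue_length_for(target_bytes, chunk_bytes):
    """Smallest queue_limit_length that provides at least target_bytes of buffer."""
    if chunk_bytes > MAX_CHUNK_BYTES:
        raise ValueError("chunk size exceeds the 32 MB ceiling")
    return math.ceil(target_bytes / chunk_bytes)


chunk_bytes = 8 * 1024 ** 2      # chunk_limit_size 8M, as in the configuration above
target_bytes = 256 * 1024 ** 2   # hypothetical desired total buffer size
print(queue_length_for(target_bytes, chunk_bytes))
```

Keeping 8M chunks, a 256 MiB buffer would need queue_limit_length 32.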

Once the buffer stops overflowing and only the timeout message remains, you can start increasing the request_timeout parameter. However, if you set it to more than 20 seconds, the following warnings start to appear in the Fluentd logs:

2020-01-21 09:55:33 +0000 [warn]: [test-dev] buffer flush took longer time than slow_flush_log_threshold: elapsed_time=20.85753920301795 slow_flush_log_threshold=20.0 plugin_id="postgresql-dev" 

This message does not affect the operation of the system in any way. It means that flushing the buffer took longer than the value of the slow_flush_log_threshold parameter. This is debugging information, and we use it when choosing the value of request_timeout.

The general selection algorithm is as follows:

  1. Set request_timeout to a value that is guaranteed to be larger than necessary (hundreds of seconds). During tuning, the main criterion that this parameter is set correctly is the disappearance of the timeout warnings.
  2. Wait for messages about exceeding the slow_flush_log_threshold threshold. The elapsed_time field in the warning text shows the actual time it took to flush the buffer.
  3. Set request_timeout to a value larger than the maximum elapsed_time seen during the observation period. We calculate request_timeout as elapsed_time + 50%.
  4. To remove the warnings about long buffer flushes from the log, you can raise the value of slow_flush_log_threshold. We calculate this value as elapsed_time + 25%.
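Steps 2 to 4 can be sketched as a small script. It assumes the warning format shown in the excerpt above; the suggest_timeouts function is a hypothetical helper, and the sample line is abbreviated from that excerpt.

```python
import re

# Matches the elapsed_time field of the slow-flush warning shown above.
ELAPSED_RE = re.compile(r"slow_flush_log_threshold: elapsed_time=([0-9.]+)")


def suggest_timeouts(log_lines):
    """Return (request_timeout, slow_flush_log_threshold) in seconds, or None.

    request_timeout is the largest observed elapsed_time + 50%,
    slow_flush_log_threshold is the largest observed elapsed_time + 25%.
    """
    elapsed = []
    for line in log_lines:
        match = ELAPSED_RE.search(line)
        if match:
            elapsed.append(float(match.group(1)))
    if not elapsed:
        return None
    worst = max(elapsed)
    return worst * 1.5, worst * 1.25


sample = [
    "2020-01-21 09:55:33 +0000 [warn]: [test-dev] buffer flush took longer time "
    "than slow_flush_log_threshold: elapsed_time=20.85753920301795 "
    "slow_flush_log_threshold=20.0",
]
request_timeout, slow_threshold = suggest_timeouts(sample)
print(round(request_timeout, 1), round(slow_threshold, 1))
```

For the single warning above, this gives a request_timeout of roughly 31.3 seconds and a slow_flush_log_threshold of roughly 26.1 seconds. In practice the input would cover the whole observation period, not a single line.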

The final values of these parameters, as noted earlier, are individual for each case. By following the algorithm above, we reliably eliminate the error that leads to repeated messages.

The table below shows how the number of errors per day that lead to duplicated messages changed as we selected the parameter values described above:

Message                      node-1           node-2           node-3           node-4
                             before / after   before / after   before / after   before / after
failed to flush the buffer   1749 / 2         694 / 2          47 / 0           1121 / 2
retry succeeded              410 / 2          205 / 1          24 / 0           241 / 2
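Counts like these can be gathered from a Fluentd log with a short script. The sketch below assumes the warning format shown in the excerpts earlier in the article and groups the counts by the label in square brackets rather than by node; count_warnings is a hypothetical helper, and the sample lines are abbreviated from those excerpts.

```python
import re
from collections import Counter

# Captures the bracketed label and the kind of warning.
LINE_RE = re.compile(
    r"\[warn\]: \[([^\]]+)\] (failed to flush the buffer|retry succeeded)"
)


def count_warnings(log_lines):
    """Count buffer-related warnings per (label, message) pair."""
    counts = Counter()
    for line in log_lines:
        match = LINE_RE.search(line)
        if match:
            counts[(match.group(1), match.group(2))] += 1
    return counts


sample = [
    "2020-01-16 01:46:46 +0000 [warn]: [test-prod] failed to flush the buffer. retry_time=4",
    "2020-01-16 01:47:05 +0000 [warn]: [test-prod] retry succeeded.",
    "2020-01-16 01:47:05 +0000 [warn]: [test-prod] retry succeeded.",
    "2020-01-16 01:47:05 +0000 [warn]: [test-dev] retry succeeded.",
]
counts = count_warnings(sample)
for (label, message), number in sorted(counts.items()):
    print(label, message, number)
```

Running it against one day of a node's log before and after tuning gives figures comparable to one column of the table.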

Note that the resulting settings may stop being adequate as the project grows and the volume of logs increases accordingly. The first sign that the timeout is insufficient is the return of messages about long buffer flushes to the Fluentd log, that is, exceeding the slow_flush_log_threshold threshold. From that point there is still a small margin before request_timeout is exceeded, so it is important to respond to these messages in time and repeat the selection process described above.

Conclusion

Fine-tuning the Fluentd output buffer is one of the main steps in configuring the EFK stack: it determines how stably the stack operates and whether documents are placed in the indices correctly. By following the algorithm described here, you can be confident that all logs are written to the ElasticSearch index in the correct order, without repetitions or losses.


source: www.habr.com
