Yadda muka yi amfani da jinkirin kwafi don dawo da bala'i tare da PostgreSQL

Yadda muka yi amfani da jinkirin kwafi don dawo da bala'i tare da PostgreSQL
Maimaita ba madadin ba ne. Ko babu? Anan ga yadda muka yi amfani da kwafin da aka jinkirta don murmurewa daga share gajerun hanyoyi da gangan.

Kwararrun ababen more rayuwa GitLab ne ke da alhakin aikin GitLab.com - mafi girman misalin GitLab a yanayi. Tare da masu amfani da miliyan 3 da kusan ayyukan miliyan 7, yana ɗaya daga cikin manyan wuraren buɗe tushen SaaS tare da keɓaɓɓen gine-gine. Ba tare da tsarin bayanan PostgreSQL ba, kayan aikin GitLab.com ba za su yi nisa ba, kuma menene muke yi don tabbatar da haƙurin kuskure idan duk wani gazawa lokacin da za a iya rasa bayanai. Yana da wuya irin wannan bala'i ya faru, amma mun shirya sosai kuma mun tanadar da hanyoyin adanawa da kwafi daban-daban.

Maimaita ba hanya ce ta adana bayanan bayanai ba (duba kasa). Amma yanzu za mu ga yadda ake saurin dawo da bayanan da aka goge ba tare da gangan ba ta amfani da kwafin malalaci: kunnawa GitLab.com mai amfani share gajeriyar hanya don aikin gitlab-ce kuma an rasa haɗin kai tare da buƙatun haɗuwa da ayyuka.

Tare da kwafin da aka jinkirta, mun dawo da bayanai a cikin awanni 1,5 kacal. Kalli yadda abin ya faru.

Nuna lokacin dawowa tare da PostgreSQL

PostgreSQL yana da ginanniyar aikin da ke mayar da yanayin bayanan zuwa wani takamaiman lokaci a cikin lokaci. Ana kiranta Maida-in-Lokaci (PITR) kuma yana amfani da hanyoyi iri ɗaya waɗanda ke ci gaba da kwafi har zuwa yau: farawa tare da ingantaccen hoto na gabaɗayan rukunin bayanai (ajiya ta tushe), muna aiwatar da jerin canje-canjen jihohi har zuwa wani lokaci cikin lokaci.

Don amfani da wannan fasalin don madadin sanyi, koyaushe muna yin ainihin ma'ajin bayanai da adana shi a cikin ma'ajiyar bayanai (GitLab archives suna zaune a ciki). Ma'ajiyar girgije ta Google). Muna kuma sa ido kan canje-canje a cikin yanayin bayanan ta hanyar adana bayanan gaba da rubutu (rubuta-gaba log, WAL). Kuma tare da duk wannan a wurin, za mu iya yin PITR don dawo da bala'i: farawa da hoton da aka ɗauka kafin rashin nasara, da kuma amfani da canje-canje daga tarihin WAL har zuwa gazawar.

Menene maimaita maimaitawa?

Lazy kwafi shine aikace-aikacen canje-canje daga WAL tare da jinkiri. Wato cinikin ya faru ne a cikin sa'a guda X, amma zai bayyana a cikin kwafi tare da jinkiri d a cikin awa daya X + d.

PostgreSQL yana da hanyoyi 2 don saita kwafin bayanan bayanan jiki: dawo da madadin da kwafin yawo. Ana dawowa daga rumbun adana bayanai, da gaske yana aiki kamar PITR, amma a ci gaba: koyaushe muna dawo da canje-canje daga tarihin WAL kuma muna amfani da su zuwa kwafi. A kwafin yawo kai tsaye ya dawo da rafin WAL daga mai masaukin bayanai na sama. Mun fi son dawo da kayan tarihi - yana da sauƙin sarrafawa kuma yana da aikin yau da kullun wanda ke ci gaba da gungun samarwa.

Yadda ake saita jinkirin dawowa daga rumbun adana bayanai

Zaɓuɓɓukan farfadowa aka bayyana a cikin fayil recovery.conf... Misali:

standby_mode = 'on'
restore_command = '/usr/bin/envdir /etc/wal-e.d/env /opt/wal-e/bin/wal-e wal-fetch -p 4 "%f" "%p"'
recovery_min_apply_delay = '8h'
recovery_target_timeline = 'latest'

Tare da waɗannan sigogi, mun saita kwafin da aka jinkirta tare da dawo da madadin. Anan ana amfani dashi wal-e don cire sassan WAL (restore_command) daga tarihin, kuma za a yi amfani da canje-canje bayan sa'o'i takwas (XNUMX)recovery_min_apply_delay). Kwafin zai duba don canje-canjen lokaci a cikin tarihin, misali saboda gazawar gungu (recovery_target_timeline).

С recovery_min_apply_delay Kuna iya saita kwafin yawo tare da jinkiri, amma akwai wasu matsuguni a nan waɗanda ke da alaƙa da ramukan kwafi, ra'ayoyin jiran aiki mai zafi, da sauransu. Rumbun tarihin WAL yana ba ku damar guje musu.

Alamar recovery_min_apply_delay ya bayyana kawai a cikin PostgreSQL 9.3. A cikin sigar da ta gabata, don maimaitawa da aka jinkirta kuna buƙatar saita haɗin ayyukan gudanarwa na farfadowa (pg_xlog_replay_pause(), pg_xlog_replay_resume()) ko riƙe sassan WAL a cikin ma'ajiyar bayanai na tsawon lokacin jinkiri.

Ta yaya PostgreSQL ke yin wannan?

Yana da ban sha'awa don ganin yadda PostgreSQL ke aiwatar da farfadowar malalaci. Mu duba recoveryApplyDelay(XlogReaderState). An kira shi daga babban maimaita madauki ga kowane shigarwa daga WAL.

static bool
recoveryApplyDelay(XLogReaderState *record)
{
    uint8       xact_info;
    TimestampTz xtime;
    long        secs;
    int         microsecs;

    /* nothing to do if no delay configured */
    if (recovery_min_apply_delay <= 0)
        return false;

    /* no delay is applied on a database not yet consistent */
    if (!reachedConsistency)
        return false;

    /*
     * Is it a COMMIT record?
     *
     * We deliberately choose not to delay aborts since they have no effect on
     * MVCC. We already allow replay of records that don't have a timestamp,
     * so there is already opportunity for issues caused by early conflicts on
     * standbys.
     */
    if (XLogRecGetRmid(record) != RM_XACT_ID)
        return false;

    xact_info = XLogRecGetInfo(record) & XLOG_XACT_OPMASK;

    if (xact_info != XLOG_XACT_COMMIT &&
        xact_info != XLOG_XACT_COMMIT_PREPARED)
        return false;

    if (!getRecordTimestamp(record, &xtime))
        return false;

    recoveryDelayUntilTime =
        TimestampTzPlusMilliseconds(xtime, recovery_min_apply_delay);

    /*
     * Exit without arming the latch if it's already past time to apply this
     * record
     */
    TimestampDifference(GetCurrentTimestamp(), recoveryDelayUntilTime,
                        &secs, &microsecs);
    if (secs <= 0 && microsecs <= 0)
        return false;

    while (true)
    {
        // Shortened:
        // Use WaitLatch until we reached recoveryDelayUntilTime
        // and then
        break;
    }
    return true;
}

Ƙarshen ƙasa shine cewa jinkirin ya dogara ne akan lokacin jiki da aka rubuta a cikin ma'amala aikata timestamp (xtime). Kamar yadda kake gani, jinkirin ya shafi aikatawa kawai kuma baya shafar sauran shigarwar - duk canje-canje ana amfani da su kai tsaye, kuma ƙaddamarwar ta jinkirta, don haka za mu ga canje-canje ne kawai bayan an saita jinkirin.

Yadda ake amfani da kwafin da aka jinkirta don dawo da bayanai

Bari mu ce muna da tarin bayanai da kwafi tare da jinkirin awoyi takwas wajen samarwa. Bari mu ga yadda za a mai da bayanai ta amfani da misali da gangan share gajerun hanyoyi.

Lokacin da muka koyi game da matsalar, mu an dakatad da maido da kayan tarihin don kwafin da aka jinkirta:

SELECT pg_xlog_replay_pause();

Tare da tsayawa, ba mu da haɗarin cewa kwafin zai maimaita buƙatar DELETE. Abu mai amfani idan kuna buƙatar lokaci don gano komai.

Ma'anar ita ce kwafin da aka jinkirta dole ne ya isa lokacin kafin buƙatar DELETE. Mun kusan san lokacin cirewar jiki. Mun goge recovery_min_apply_delay kuma ya kara da cewa recovery_target_time в recovery.conf. Wannan shine yadda kwafin ya isa daidai lokacin ba tare da bata lokaci ba:

recovery_target_time = '2018-10-12 09:25:00+00'

Tare da tambarin lokaci, yana da kyau a rage yawan wuce haddi don kada a rasa. Gaskiya ne, mafi girma da raguwa, yawancin bayanan da muka rasa. Bugu da kari, idan muka rasa bukatar DELETE, duk abin da za a sake sharewa kuma za ku sake farawa (ko ma ɗaukar ajiyar sanyi don PITR).

Mun sake kunna misalin Postgres da aka jinkirta kuma an maimaita sassan WAL har zuwa ƙayyadadden lokaci. Kuna iya bin diddigin ci gaba a wannan matakin ta yin tambaya:

SELECT
  -- current location in WAL
  pg_last_xlog_replay_location(),
  -- current transaction timestamp (state of the replica)
  pg_last_xact_replay_timestamp(),
  -- current physical time
  now(),
  -- the amount of time still to be applied until recovery_target_time has been reached
  '2018-10-12 09:25:00+00'::timestamptz - pg_last_xact_replay_timestamp() as delay;

Idan tambarin lokaci ya daina canzawa, farfadowar ya cika. Ana iya keɓance aikin recovery_target_actiondon rufewa, haɓaka, ko dakatar da misalin bayan sake gwadawa (an dakatar da shi ta tsohuwa).

Ma'ajiyar bayanai ta koma matsayinta kafin wannan bukata ta rashin sa'a. Yanzu zaku iya, misali, fitar da bayanai. Mun fitar da bayanan lakabin da aka goge da duk hanyoyin haɗin kai zuwa batutuwa da haɗa buƙatun kuma mun matsar da su cikin bayanan samarwa. Idan asarar ta yi girma, zaku iya inganta kwafin kawai kuma kuyi amfani da shi azaman babba. Amma sai duk canje-canje bayan wurin da muka murmure za su ɓace.

Maimakon tambarin lokaci, yana da kyau a yi amfani da ID na ma'amala. Yana da amfani a yi rikodin waɗannan ID ɗin, misali, don maganganun DDL (kamar DROP TABLE), ta hanyar amfani log_statements = 'ddl'. Idan muna da ID na ma'amala, za mu ɗauka recovery_target_xid kuma gudanar da komai har zuwa ma'amala kafin bukatar DELETE.

Komawa aiki abu ne mai sauqi: cire duk canje-canje daga recovery.conf kuma zata sake farawa Postgres. Kwafin nan ba da jimawa ba zai sake samun jinkiri na awanni takwas, kuma mun shirya don matsaloli na gaba.

Fa'idodin farfadowa

Tare da kwafin da aka jinkirta maimakon ajiyar sanyi, ba dole ba ne ku ciyar da sa'o'i don maido da hoton gaba ɗaya daga ma'ajin. Misali, yana ɗaukar mu sa'o'i biyar kafin mu sami cikakken madadin TB 2. Sannan kuma har yanzu dole ne ku yi amfani da duk WAL na yau da kullun don murmurewa zuwa yanayin da ake so (a cikin mafi munin yanayi).

Kwafin da aka jinkirta yana da kyau fiye da madadin sanyi ta hanyoyi biyu:

  1. Babu buƙatar cire duk ainihin madadin daga ma'ajiyar.
  2. Akwai ƙayyadaddun tagar sa'o'i takwas na sassan WAL waɗanda dole ne a maimaita su.

Har ila yau, muna bincika kullum don ganin ko za a iya yin PITR daga WAL, kuma za mu yi sauri lura da cin hanci da rashawa ko wasu matsalolin da ke tattare da tarihin WAL ta hanyar lura da rashin jinkirin kwafin da aka jinkirta.

A cikin wannan misalin, ya ɗauki mu mintuna 50 kafin mu dawo, ma'ana gudun shine 110 GB na bayanan WAL a cikin awa ɗaya (har yanzu ana kunna ma'ajiyar bayanai). Farashin S3). A cikin duka, mun warware matsalar kuma mun dawo da bayanan a cikin sa'o'i 1,5.

Sakamako: inda kwafin da aka jinkirta yana da amfani (kuma inda ba shi da)

Yi amfani da jinkirin maimaitawa azaman taimakon farko idan kun rasa bayanai da gangan kuma kun lura da wannan matsala a cikin daidaitawar jinkiri.

Amma ka tuna: maimaitawa ba madadin ba ne.

Ajiyayyen da kwafi suna da dalilai daban-daban. Ajiyayyen sanyi zai zo da amfani idan kun yi da gangan DELETE ko DROP TABLE. Muna yin ajiyar ajiya daga ajiyar sanyi kuma muna dawo da yanayin tebur na baya ko duka bayanan bayanai. Amma a lokaci guda bukatar DROP TABLE kusan nan take ake sake bugawa a cikin duk kwafi akan gungu mai aiki, don haka kwafi na yau da kullun ba zai taimaka a nan ba. Kwafi da kanta tana adana bayanan da aka samu lokacin da aka yi hayar sabobin guda ɗaya da rarraba kaya.

Ko da tare da kwafin da aka jinkirta, wani lokaci muna buƙatar ainihin madaidaicin sanyi a wuri mai aminci idan gazawar cibiyar bayanai, ɓoyayyun ɓoyayyun, ko wasu al'amuran da ba a gane su nan da nan ba. Maimaitawa kadai ba shi da amfani a nan.

Примечание. A GitLab.com A halin yanzu muna kare kawai daga asarar bayanai a matakin tsarin kuma ba mu dawo da bayanai a matakin mai amfani ba.

source: www.habr.com

Add a comment