My first experience recovering a Postgres database after a failure (invalid page in block 4123007 of relation base/16490)

I want to share with you my first successful experience of restoring a Postgres database to full working order. I got to know the Postgres DBMS half a year ago; before that I had no database administration experience at all.

I work as a DevOps engineer in a large IT company. Our company develops software for high-load services, and I am responsible for operation, maintenance and deployment. I was given a standard task: update an application on one server. The application is written in Django; migrations (changes to the database structure) are run during the update, and before that process we take a full database dump with the standard pg_dump program, just in case.

An unexpected error occurred while taking the dump (Postgres 9.5):

pg_dump: Dumping the contents of table "ws_log_smevlog" failed: PQgetResult() failed.
pg_dump: Error message from server: ERROR: invalid page in block 4123007 of relation base/16490/21396989
pg_dump: The command was: COPY public.ws_log_smevlog [...]
pg_dump: [parallel archiver] a worker process died unexpectedly

The "invalid page in block" error points to problems at the file-system level, which is very bad. Various forums suggested running VACUUM FULL with the zero_damaged_pages option to solve this problem. Well, let's try...
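As a side note, the block number in the message can be mapped to a file on disk. The sketch below assumes a default PostgreSQL build (8 kB pages and 1 GB segment files, so 131072 blocks per segment); locate_block is a hypothetical helper, not something shipped with Postgres.

```shell
# Assumption: default build constants (8 kB pages, 1 GB relation segments).
BLOCK_SIZE=8192
BLOCKS_PER_SEGMENT=131072

# locate_block RELPATH BLOCK
# Prints the segment file that holds BLOCK and the byte offset inside it.
locate_block() {
  local relpath="$1" block="$2" segment offset file
  segment=$(( block / BLOCKS_PER_SEGMENT ))
  offset=$(( (block % BLOCKS_PER_SEGMENT) * BLOCK_SIZE ))
  file="$relpath"
  # Segment 0 has no suffix; later segments are named <filenode>.1, <filenode>.2, ...
  if [ "$segment" -gt 0 ]; then file="$relpath.$segment"; fi
  echo "$file $offset"
}
```

Under those assumptions, locate_block base/16490/21396989 4123007 prints base/16490/21396989.31 489676800, a path relative to the Postgres data directory.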

Preparing for recovery

ATTENTION! Be sure to take a Postgres backup before any attempt to restore your database. If you have a virtual machine, stop the database and take a snapshot. If a snapshot is not possible, stop the database and copy the contents of the Postgres directory (including the WAL files) to a safe place. The main thing in our business is not to make things worse.

Since the database as a whole was working for me, I limited myself to a regular dump that excluded the table with the damaged data (the -T, --exclude-table=TABLE option of pg_dump).

The server was physical, so taking a snapshot was impossible. The backup was taken; let's move on.
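A minimal sketch of such a backup follows. The custom archive format (-F c), the output file name and the DRY_RUN switch are assumptions for illustration; the article does not say which format was used.

```shell
# backup_without_table DB USER TABLE OUTFILE
# Dumps DB while skipping TABLE. With DRY_RUN=1 the command is only printed.
backup_without_table() {
  local db="$1" user="$2" skip="$3" out="$4"
  if [ "${DRY_RUN:-0}" = 1 ]; then
    echo "pg_dump -U $user -d $db -F c -T $skip -f $out"
  else
    # -T / --exclude-table leaves the damaged table out of the dump
    pg_dump -U "$user" -d "$db" -F c -T "$skip" -f "$out"
  fi
}
```

For example, DRY_RUN=1 backup_without_table my_database my_user ws_log_smevlog backup.dump shows the command that would be run without touching the database.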

Checking the file system

Before trying to restore the database, we need to make sure that everything is in order with the file system itself. If there are errors, fix them first, because otherwise you can only make things worse.

In my case, the file system with the database was mounted at "/srv" and its type was ext4.

Stop the database (systemctl stop with the name of your PostgreSQL unit) and check that nothing is using the file system, so that it can be unmounted with the umount command:
lsof +D /srv

I also had to stop the redis database, since it was using "/srv" too. Then I unmounted /srv (umount).

The file system was checked with the e2fsck utility using the -f switch (force checking even if the file system is marked clean).

Next, using the dumpe2fs utility (sudo dumpe2fs /dev/mapper/gu2--sys-srv | grep checked), you can confirm that the check was actually performed.

e2fsck reported no problems at the ext4 file-system level, which means we can continue trying to restore the database, or rather go back to the VACUUM FULL (of course, the file system has to be mounted again and the database started).
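The whole check can be sketched as one sequence. The unit name, mount point and device passed to check_fs are placeholders, and the run helper with its DRY_RUN default is an assumption added so that the steps can be reviewed before anything is executed.

```shell
# run CMD...: prints the command when DRY_RUN=1 (the default here), otherwise executes it.
run() {
  if [ "${DRY_RUN:-1}" = 1 ]; then echo "+ $*"; else "$@"; fi
}

# check_fs UNIT MOUNTPOINT DEVICE
check_fs() {
  local unit="$1" mountpoint="$2" device="$3"
  run systemctl stop "$unit" || return 1
  run lsof +D "$mountpoint"               # should list nothing before unmounting
  run umount "$mountpoint" || return 1
  run e2fsck -f "$device" || return 1     # -f: check even if marked clean
  run mount "$device" "$mountpoint" || return 1
  run systemctl start "$unit"
}
```

Calling check_fs with your own unit, mount point and device prints the six steps; setting DRY_RUN=0 runs them for real (root is required), stopping at the first step that fails so the result can be inspected by hand.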

If you have a physical server, be sure to check the state of the disks (with smartctl -a /dev/XXX) or of the RAID controller to make sure the problem is not at the hardware level. In my case the RAID turned out to be "hardware", so I asked the local administrator to check the RAID status (the server was several hundred kilometres away). He said there were no errors, which means we can definitely start the recovery.

Attempt 1: zero_damaged_pages

We connect to the database via psql with an account that has superuser rights. We need a superuser because only a superuser can change the zero_damaged_pages option. In my case that is postgres:

psql -h 127.0.0.1 -U postgres -s [database_name]

The zero_damaged_pages option is needed to ignore read errors (from the postgrespro website):

Detection of a damaged page header normally causes PostgreSQL to report an error, aborting the current transaction. Setting zero_damaged_pages to on causes the system to instead report a warning, zero out the damaged page in memory, and continue processing. This behavior will destroy data, namely all the rows on the damaged page.

We enable the option and try to run a full vacuum of the table:

VACUUM FULL VERBOSE
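A sketch of what this step might look like as a script. The exact statements typed in the session are an assumption here (the source shows them only partially), and vacuum_sql is a hypothetical helper that just prints SQL for psql to read.

```shell
# vacuum_sql TABLE
# Prints the statements: enable zero_damaged_pages for this session, then vacuum TABLE.
vacuum_sql() {
  local table="$1"
  cat <<SQL
SET zero_damaged_pages = on;
VACUUM FULL VERBOSE $table;
SQL
}

# Usage (requires a superuser connection, as explained above):
#   vacuum_sql ws_log_smevlog | psql -h 127.0.0.1 -U postgres -d my_database
```

Because SET only affects the current session, both statements have to go through the same psql invocation.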

Unfortunately, no luck.

We ran into an error like this:

INFO: vacuuming "public.ws_log_smevlog"
WARNING: invalid page in block 4123007 of relation base/16490/21396989; zeroing out page
ERROR: unexpected chunk number 573 (expected 565) for toast value 21648541 in pg_toast_106070

pg_toast is the mechanism Postgres uses to store "long data" that does not fit into a single page (8 kB by default).

Attempt 2: reindex

The first tip from Google did not help. After a few minutes of searching I found a second tip: reindex the damaged table. I saw this advice in many places, but it did not inspire confidence. Let's reindex:

reindex table ws_log_smevlog

The reindex completed without problems.

However, this did not help: VACUUM FULL crashed with a similar error. Since I am used to failures, I went looking for advice on the Internet and came across a rather interesting article.

Attempt 3: SELECT, LIMIT, OFFSET

The article above suggested going through the table row by row and removing the problematic data. First we need to look through all the rows:

for ((i=0; i<"Number_of_rows_in_nodes"; i++ )); do psql -U "Username" "Database Name" -c "SELECT * FROM nodes LIMIT 1 offset $i" >/dev/null || echo $i; done

In my case the table contained 1 628 991 rows! It would have been wise to take care of data partitioning, but that is a topic for a separate discussion. It was Saturday; I ran this command in tmux and went to bed:

for ((i=0; i<1628991; i++ )); do psql -U my_user -d my_database -c "SELECT * FROM ws_log_smevlog LIMIT 1 offset $i" >/dev/null || echo $i; done

In the morning I decided to check how things were going. To my surprise, I found that after 20 hours only 2% of the data had been scanned! I did not want to wait 50 days. Another complete failure.

But I did not give up. I wondered why the scan was taking so long. From the documentation (again on postgrespro) I learned:

OFFSET says to skip that many rows before beginning to return rows.
If both OFFSET and LIMIT appear, then OFFSET rows are skipped before starting to count the LIMIT rows that are returned.

When using LIMIT, it is important to use an ORDER BY clause that constrains the result rows into a unique order. Otherwise you will get an unpredictable subset of the query's rows.

Obviously, the command above was wrong: first, there was no ORDER BY, so the result could be incorrect. Second, Postgres first has to scan and skip OFFSET rows, and as OFFSET grows, performance degrades further and further.
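A rough way to see the cost: if the query with OFFSET i has to walk i rows and then return one more, a full pass over n rows visits 1 + 2 + ... + n rows in total. The sketch below just evaluates that sum; it is a simplified model that ignores caching and plan details, and offset_rows_visited is a hypothetical name.

```shell
# offset_rows_visited N
# Total rows walked by N queries of the form "LIMIT 1 OFFSET i", i = 0 .. N-1,
# assuming each query reads i skipped rows plus the one it returns.
offset_rows_visited() {
  local n="$1"
  echo $(( n * (n + 1) / 2 ))
}
```

For the 1 628 991 rows of this table the model gives roughly 1.3 trillion row visits, which is why the loop slows down the further it gets.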

Attempt 4: take a dump in text form

Then a seemingly brilliant idea came to mind: take a dump in text form and examine the last row that was written.

But first, a look at the structure of the ws_log_smevlog table.

In our case there is an "id" column that contains a unique identifier (counter) for the row. The plan was as follows:

  1. We start taking a dump in text form (as SQL commands).
  2. At some point the dump is interrupted by the error, but the text file is still saved on disk.
  3. We look at the end of the text file and thereby find the identifier (id) of the last row that was dumped successfully.

I started taking the dump in text form:

pg_dump -U my_user -d my_database -F p -t ws_log_smevlog -f ./my_dump.dump

The dump, as expected, was interrupted with the same error:

pg_dump: Error message from server: ERROR: invalid page in block 4123007 of relation base/16490/21396989

Using tail, I looked at the end of the dump (tail -5 ./my_dump.dump) and found that the dump had stopped at the row with id 186 525. "So the problem is in the row with id 186 526; it is broken and needs to be deleted!" I thought. But when I queried the database with "select * from ws_log_smevlog where id=186529", it turned out that this row was perfectly fine... Rows with ids around 186 530 also read without problems. Another "brilliant idea" had failed.

Later I understood why this happened: when rows are deleted or changed, they are not physically removed but marked as "dead tuples"; then autovacuum comes along, marks those rows as deleted and allows their space to be reused. In other words, if the data in a table changes and autovacuum is enabled, the rows are not stored sequentially.
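The tail check can be sketched as a small helper. It assumes a plain-format dump where the table data are tab-separated COPY rows and id is the first column; last_dumped_id is a hypothetical name. As explained above, the id it prints says little about where the damaged page is, because rows are not laid out in id order.

```shell
# last_dumped_id DUMPFILE
# Prints the first column of the last data row in a plain-text dump.
last_dumped_id() {
  local dump="$1"
  # Drop empty lines and the COPY terminator "\.", keep the last remaining line,
  # then cut out its first tab-separated field.
  grep -v -e '^$' -e '^\\\.$' "$dump" | tail -n 1 | cut -f 1
}
```

If the dump was cut off in the middle of a row, the last line may be incomplete, so the value is worth double-checking by eye.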

Attempt 5: SELECT, FROM, WHERE id=

Failures make us stronger. Never give up; you have to go to the end and believe in yourself and your abilities. So I decided to try another option: simply look through all the records in the database one by one. Knowing the structure of my table (see above), we have an id field that is unique (the primary key). There are 1 628 991 rows in the table and the ids are in order, which means we can go through them one at a time:

for ((i=1; i<1628991; i=$((i+1)) )); do psql -U my_user -d my_database  -c "SELECT * FROM ws_log_smevlog where id=$i" >/dev/null || echo $i; done

In case anyone does not follow, the command works like this: it scans the table row by row and sends stdout to /dev/null; if the SELECT fails, the error text is printed (stderr goes to the console) and the id of the failing row is printed as well (thanks to ||, which runs echo when the select had a problem, that is, when the command's return code is not 0).

Luckily, an index had been created on the id field.
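The same pattern can be written as a reusable function, which also makes it possible to try the loop with a harmless stand-in before pointing it at the database. scan_ids and probe_row are hypothetical names; the psql call repeats the one from the article.

```shell
# scan_ids FIRST LAST PROBE...
# Runs PROBE <id> for every id in FIRST..LAST and prints the ids for which it fails.
scan_ids() {
  local first="$1" last="$2" i
  shift 2
  i="$first"
  while [ "$i" -le "$last" ]; do
    # stdout is discarded, stderr stays visible, and "||" fires on a non-zero exit code
    "$@" "$i" >/dev/null || echo "$i"
    i=$(( i + 1 ))
  done
}

# probe_row ID: one SELECT per id, as in the article
probe_row() {
  psql -U my_user -d my_database -c "SELECT * FROM ws_log_smevlog WHERE id = $1"
}

# Usage: scan_ids 1 1628991 probe_row
```

Any command that takes an id and returns a non-zero exit code on failure can be used in place of probe_row.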

This means that finding the row with the desired id should not take long. In theory it should work. Well, let's run the command in tmux and go to bed.

By morning I found that about 90 000 records had been checked, which is just over 5%. An excellent result compared with the previous method (2%)! But I did not want to wait 20 days...

Attempt 6: SELECT, FROM, WHERE id>= and id<

The customer had an excellent server dedicated to the database: a dual-processor Intel Xeon E5-2697 v2, with as many as 48 threads at our disposal! The load on the server was average; we could take about 20 threads without any problems. There was also plenty of RAM: as much as 384 gigabytes!

So the command needed to be parallelized:

for ((i=1; i<1628991; i=$((i+1)) )); do psql -U my_user -d my_database  -c "SELECT * FROM ws_log_smevlog where id=$i" >/dev/null || echo $i; done

One could write a beautiful, elegant script here, but I chose the fastest way to parallelize: manually split the range 0-1628991 into intervals of 100 000 records and run 16 separate commands of the form:

for ((i=N; i<M; i=$((i+1)) )); do psql -U my_user -d my_database  -c "SELECT * FROM ws_log_smevlog where id=$i" >/dev/null || echo $i; done

But that is not all. In theory, connecting to the database also takes some time and system resources. Opening 1 628 991 connections is not very smart, you will agree. So let's fetch 1000 rows per connection instead of one. As a result, the command turned into this:

for ((i=N; i<M; i=$((i+1000)) )); do psql -U my_user -d my_database  -c "SELECT * FROM ws_log_smevlog where id>=$i and id<$((i+1000))" >/dev/null || echo $i; done

Open 16 windows in a tmux session and run the commands:

1) for ((i=0; i<100000; i=$((i+1000)) )); do psql -U my_user -d my_database  -c "SELECT * FROM ws_log_smevlog where id>=$i and id<$((i+1000))" >/dev/null || echo $i; done
2) for ((i=100000; i<200000; i=$((i+1000)) )); do psql -U my_user -d my_database  -c "SELECT * FROM ws_log_smevlog where id>=$i and id<$((i+1000))" >/dev/null || echo $i; done
…
15) for ((i=1400000; i<1500000; i=$((i+1000)) )); do psql -U my_user -d my_database -c "SELECT * FROM ws_log_smevlog where id>=$i and id<$((i+1000))" >/dev/null || echo $i; done
16) for ((i=1500000; i<1628991; i=$((i+1000)) )); do psql -U my_user -d my_database  -c "SELECT * FROM ws_log_smevlog where id>=$i and id<$((i+1000))" >/dev/null || echo $i; done
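For comparison, here is a sketch of the "beautiful script" that was skipped: it generates the intervals and runs one background job per interval instead of one tmux window each. The function names and the bad_ids_*.log file names are assumptions for illustration.

```shell
# split_range START END CHUNK
# Prints "from to" pairs that cover [START, END) in pieces of CHUNK ids.
split_range() {
  local from="$1" end="$2" chunk="$3" to
  while [ "$from" -lt "$end" ]; do
    to=$(( from + chunk < end ? from + chunk : end ))
    echo "$from $to"
    from="$to"
  done
}

# scan_chunk FROM TO: the loop from the article, 1000 ids per query
scan_chunk() {
  local i="$1" to="$2"
  while [ "$i" -lt "$to" ]; do
    psql -U my_user -d my_database \
      -c "SELECT * FROM ws_log_smevlog WHERE id >= $i AND id < $(( i + 1000 ))" \
      >/dev/null || echo "$i"
    i=$(( i + 1000 ))
  done
}

# scan_parallel START END CHUNK: one background job and one log file per interval
scan_parallel() {
  split_range "$1" "$2" "$3" | {
    while read -r from to; do
      scan_chunk "$from" "$to" > "bad_ids_$from.log" 2>&1 &
    done
    wait
  }
}

# Usage: scan_parallel 0 1628991 100000
```

Note that split_range 0 1628991 100000 yields 17 intervals rather than 16, because the manual version above folds the short tail into the last window.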

A day later I got the first results! Namely (the XXX and ZZZ values were not preserved):

ERROR:  missing chunk number 0 for toast value 37837571 in pg_toast_106070
829000
ERROR:  missing chunk number 0 for toast value XXX in pg_toast_106070
829000
ERROR:  missing chunk number 0 for toast value ZZZ in pg_toast_106070
146000

This means that three rows contain an error. The ids of the first and second problem records were between 829 000 and 830 000, and the id of the third was between 146 000 and 147 000. Next, we just need to find the exact ids of the problem records. To do that, we go through each problem range with a step of 1 and identify the id:

for ((i=829000; i<830000; i=$((i+1)) )); do psql -U my_user -d my_database -c "SELECT * FROM ws_log_smevlog where id=$i" >/dev/null || echo $i; done
829417
ERROR:  unexpected chunk number 2 (expected 0) for toast value 37837843 in pg_toast_106070
829449
for ((i=146000; i<147000; i=$((i+1)) )); do psql -U my_user -d my_database -c "SELECT * FROM ws_log_smevlog where id=$i" >/dev/null || echo $i; done
829417
ERROR:  unexpected chunk number ZZZ (expected 0) for toast value XXX in pg_toast_106070
146911
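The coarse pass and the fine pass can also be combined so that a failing range is narrowed down immediately. This is a sketch under the assumption that both probes are commands returning a non-zero exit code on failure; find_bad_ids and the probe arguments are hypothetical names.

```shell
# find_bad_ids FROM TO STEP RANGE_PROBE ROW_PROBE
# RANGE_PROBE <lo> <hi> checks ids in [lo, hi); ROW_PROBE <id> checks a single id.
find_bad_ids() {
  local lo="$1" end="$2" step="$3" range_probe="$4" row_probe="$5" hi id
  while [ "$lo" -lt "$end" ]; do
    hi=$(( lo + step < end ? lo + step : end ))
    # Coarse pass: one call for the whole range, output and errors discarded
    if ! "$range_probe" "$lo" "$hi" >/dev/null 2>&1; then
      # Fine pass: the range failed, so check every id in it
      id="$lo"
      while [ "$id" -lt "$hi" ]; do
        "$row_probe" "$id" >/dev/null || echo "$id"
        id=$(( id + 1 ))
      done
    fi
    lo="$hi"
  done
}
```

With psql-based probes like the commands above, the fine pass would only run over the few ranges of 1000 ids that reported an error.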

Happy ending

We found the problem rows. We connect to the database via psql and try to delete them:

my_database=# delete from ws_log_smevlog where id=829417;
DELETE 1
my_database=# delete from ws_log_smevlog where id=829449;
DELETE 1
my_database=# delete from ws_log_smevlog where id=146911;
DELETE 1

To my surprise, the records were deleted without any problems, even without the zero_damaged_pages option.

Then I connected to the database, ran VACUUM FULL (I think this was not really necessary), and finally took a backup successfully with pg_dump. The dump completed without errors! The problem was solved in such a crude way. My joy knew no bounds: after so many failures we had managed to find a solution!

Acknowledgments and conclusion

This is how my first experience of restoring a real Postgres database turned out. I will remember it for a long time.

And finally, I would like to thank PostgresPro for translating the documentation into Russian and for their completely free online courses, which helped a lot while I was analyzing the problem.

source: www.habr.com
