My first Postgres database crash recovery experience (invalid page in block 4123007 of relation base/16490)

I want to share my first successful experience of restoring a Postgres database to full working order. I got acquainted with Postgres only half a year ago; before that, I had no database administration experience at all.


I work as a semi-DevOps engineer at a large IT company. Our company develops software for high-load services, and I am responsible for performance, maintenance, and deployment. I was given a standard task: update the application on one of the servers. The application is written in Django, migrations that change the database structure are run during the update, and before that process we take a full database dump with the standard pg_dump utility, just in case.

An unexpected error occurred while dumping (Postgres version is 9.5):

pg_dump: Dumping the contents of table "ws_log_smevlog" failed: PQgetResult() failed.
pg_dump: Error message from server: ERROR: invalid page in block 4123007 of relation base/16490/21396989
pg_dump: The command was: COPY public.ws_log_smevlog [...]
pg_dump: [parallel archiver] a worker process died unexpectedly

Error "invalid page in block" speaks of problems at the file system level, which is very bad. Various forums suggested doing FULL VACUUM with option zero_damaged_pages to solve this problem. Well, guess what…

Preparing for recovery

ATTENTION! Be sure to back up Postgres before any attempt to restore the database. If the database runs on a virtual machine, stop it and take a snapshot. If a snapshot is not possible, stop the database and copy the contents of the Postgres data directory (including the WAL files) to a safe place. The main thing in our business is not to make things worse.

Since the database was generally still working, I limited myself to an ordinary database dump, excluding the table with the corrupted data (the -T / --exclude-table=TABLE option of pg_dump).
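For illustration, such a dump command might look roughly like this (the user, database, and output file names here are assumptions rather than the original ones):

pg_dump -U postgres -d my_database -T ws_log_smevlog -f ./dump_without_broken_table.dump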

The server was physical, so a snapshot was impossible. The backup was taken; we move on.

File system check

Before attempting to restore the database, we need to make sure that the file system itself is in order, and fix any errors found there first, because otherwise we would only make things worse.

In my case, the file system holding the database was mounted at /srv, and its type was ext4.

Stop the database (in my case systemctl stop postgresql@9.5-main.service; the exact unit name depends on your installation) and check that the file system is not in use by anyone and can be unmounted, using the lsof command:
lsof +D /srv

I also had to stop Redis, since it was using /srv as well. Then I unmounted /srv (umount).

The file system check was performed with the e2fsck utility using the -f switch (force checking even if the filesystem is marked clean):

(screenshot: e2fsck output)
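The exact invocation is not preserved in the screenshot, but it would have looked roughly like this (the device name is taken from the dumpe2fs command below):

umount /srv
e2fsck -f /dev/mapper/gu2-sys-srv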

Next, using the dumpe2fs utility (sudo dumpe2fs /dev/mapper/gu2-sys-srv | grep checked), you can make sure that the check was actually performed:

(screenshot: dumpe2fs output)

e2fsck reports that no problems were found at the ext4 file system level, which means we can continue trying to restore the database, or rather return to VACUUM FULL (of course, the file system has to be mounted back first and the database started).

If you have a physical server, be sure to check the state of the disks (via smartctl -a /dev/XXX) or of the RAID controller to make sure the problem is not at the hardware level. In my case the RAID was a hardware one, so I asked the local admin to check its status (the server was several hundred kilometers away from me). He said there were no errors, which meant we could definitely start the recovery.

Attempt 1: zero_damaged_pages

We connect to the database through psql with an account that has superuser rights; the superuser is needed because only a superuser can change the zero_damaged_pages option. In my case it is postgres:

psql -h 127.0.0.1 -U postgres -s [database_name]

The zero_damaged_pages option is needed to ignore read errors (quote from the postgrespro website):

When a corrupted page header is detected, PostgreSQL will usually report an error and abort the current transaction. If zero_damaged_pages is enabled, the system instead issues a warning, zeroes out the damaged page in memory, and continues processing. This behavior destroys data, namely all the rows on the damaged page.

Enable the option and try to do a VACUUM FULL of the tables:

VACUUM FULL VERBOSE
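For completeness, enabling the option in the same psql session looks like this (zero_damaged_pages is an ordinary run-time setting that a superuser can change with SET):

SET zero_damaged_pages = on;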

(screenshot: VACUUM FULL VERBOSE output)
Unfortunately, failure.

We encountered a similar error:

INFO: vacuuming "public.ws_log_smevlog"
WARNING: invalid page in block 4123007 of relation base/16490/21396989; zeroing out page
ERROR: unexpected chunk number 573 (expected 565) for toast value 21648541 in pg_toast_106070

pg_toast is the mechanism Postgres uses to store "long" data that does not fit on a single page (8 kB by default).
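As a side note, you can check which table a given pg_toast relation belongs to with a catalog query along these lines (the toast table name is taken from the error above):

SELECT relname FROM pg_class WHERE reltoastrelid = 'pg_toast.pg_toast_106070'::regclass;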

Attempt 2: reindex

The first piece of advice from Google did not help. After a few more minutes of searching, I found a second tip: reindex the damaged table. I saw this advice in many places, but it did not inspire much confidence. Let's try a reindex anyway:

reindex table ws_log_smevlog

(screenshot: reindex output)

reindex completed without problems.

However, this didn't help either: VACUUM FULL crashed with the same error. Being already used to failure by this point, I kept searching the Internet for advice and came across a rather interesting article.

Attempt 3: SELECT, LIMIT, OFFSET

The article suggested looking through the table row by row and deleting the problematic data. First, it was necessary to scan all the rows:

for ((i=0; i<"Number_of_rows_in_nodes"; i++ )); do psql -U "Username" "Database Name" -c "SELECT * FROM nodes LIMIT 1 offset $i" >/dev/null || echo $i; done

In my case, the table contained 1,628,991 rows! Properly speaking, someone should have taken care of data partitioning, but that is a topic for a separate discussion. It was Saturday; I launched this command in tmux and went to bed:

for ((i=0; i<1628991; i++ )); do psql -U my_user -d my_database -c "SELECT * FROM ws_log_smevlog LIMIT 1 offset $i" >/dev/null || echo $i; done

By morning, I decided to check how things were going. To my surprise, I found that only about 2% of the data had been scanned! I had no desire to wait 50 days. Another complete failure.

But I didn't give up. I wondered why the scan took so long. From the documentation (on postgrespro again) I learned:

OFFSET says to skip that many rows before beginning to return rows.
If both OFFSET and LIMIT are specified, OFFSET rows are skipped before starting to count the LIMIT rows that are returned.

When using LIMIT, it is important to also use the ORDER BY clause so that the result rows are returned in a particular order. Otherwise, unpredictable subsets of rows will be returned.

It is obvious that the command above was flawed: first, there was no ORDER BY, so the result could be wrong; second, Postgres had to scan and skip OFFSET rows on every query, and as OFFSET grew, performance dropped even further.
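For illustration, a corrected (though still slow) form of such a per-row query would look something like this; even with ORDER BY, the server still has to step over all the skipped rows, so each iteration gets more expensive as the offset grows:

SELECT * FROM ws_log_smevlog ORDER BY id LIMIT 1 OFFSET 1000000;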

Attempt 4: dump in text form

Then a seemingly brilliant idea came to my mind: to take a dump in text form and analyze the last written line.

But first, let's get acquainted with the structure of the table ws_log_smevlog:

(screenshot: structure of the ws_log_smevlog table)

In our case there is an id column that holds the unique identifier (counter) of the row. The plan was this:

  1. Start taking a dump in text form (as SQL commands)
  2. At some point the dump would be interrupted by the error, but the text file would still be saved to disk
  3. Look at the end of that text file to find the identifier (id) of the last row that was dumped successfully

I started dumping in text form:

pg_dump -U my_user -d my_database -F p -t ws_log_smevlog -f ./my_dump.dump

The dump, as expected, aborted with the same error:

pg_dump: Error message from server: ERROR: invalid page in block 4123007 of relation base/16490/21396989

Looking at the end of the dump with tail (tail -5 ./my_dump.dump), I found that it broke off at the row with id 186,525. "So the problem is in the row with id 186,526, it is broken and has to be deleted!" I thought. But a query to the database, select * from ws_log_smevlog where id=186529, showed that everything was fine with that row... Rows with ids 186,530 to 186,540 also worked without problems. Another "brilliant idea" had failed. Later I understood why this happened: when rows are deleted or updated, they are not physically removed from the table but are marked as "dead tuples"; autovacuum then comes along and marks that space as available for reuse. In other words, if the data in the table changes and autovacuum is enabled, the rows are not stored strictly in sequential order.
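As a rough way to see this effect, the dead-tuple counters and the time of the last autovacuum run for the table can be checked through the standard statistics view (a sketch; the actual figures were not recorded at the time):

SELECT n_live_tup, n_dead_tup, last_autovacuum FROM pg_stat_user_tables WHERE relname = 'ws_log_smevlog';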

Attempt 5: SELECT, FROM, WHERE id=

Failure makes us stronger. You should never give up; you need to go all the way and believe in yourself and your abilities. So I decided to try another option: simply look through all the records in the database one by one. Knowing the table structure (see above), we have an id field that is unique (the primary key). There are 1,628,991 rows in the table and the ids go in order, which means we can just iterate over them one at a time:

for ((i=1; i<1628991; i=$((i+1)) )); do psql -U my_user -d my_database  -c "SELECT * FROM ws_log_smevlog where id=$i" >/dev/null || echo $i; done

If it is not obvious, the command works as follows: it scans the table row by row and sends stdout to /dev/null; if a SELECT fails, the error text is printed (stderr goes to the console) and the id that caused the error is echoed (thanks to ||, which fires when the return code of the command is non-zero, i.e. the SELECT had a problem).

I was lucky: I had created an index on the id field:

(screenshot: indexes on the ws_log_smevlog table)
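For anyone repeating these steps, the indexes on the table can be listed in psql with:

\d ws_log_smevlog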

This means that finding a row by its id should not take much time. In theory, it should work. Well, let's run the command in tmux and go to sleep.

By morning, I found that about 90,000 entries had been checked, which is just over 5%. An excellent result compared with the previous method (2%)! But I still didn't want to wait that long...

Attempt 6: SELECT, FROM, WHERE id >= and id <

The customer had allocated an excellent server for the database: two Intel Xeon E5-2697 v2 processors, giving us as many as 48 threads at our disposal! The load on the server was moderate, so we could easily take about 20 threads for ourselves. There was also plenty of RAM: a whole 384 gigabytes!

Therefore, the command had to be parallelized:

for ((i=1; i<1628991; i=$((i+1)) )); do psql -U my_user -d my_database  -c "SELECT * FROM ws_log_smevlog where id=$i" >/dev/null || echo $i; done

It was possible to write a beautiful and elegant script here, but I chose the quickest way to parallelize: manually split the range 0-1628991 into intervals of 100,000 records and run 16 separate commands of the form:

for ((i=N; i<M; i=$((i+1)) )); do psql -U my_user -d my_database  -c "SELECT * FROM ws_log_smevlog where id=$i" >/dev/null || echo $i; done

But that's not all. In theory, connecting to the database takes time and system resources as well, and connecting 1,628,991 times was not very reasonable. So let's fetch 1,000 rows per connection instead of one. As a result, the command turned into this:

for ((i=N; i<M; i=$((i+1000)) )); do psql -U my_user -d my_database  -c "SELECT * FROM ws_log_smevlog where id>=$i and id<$((i+1000))" >/dev/null || echo $i; done

Open 16 windows in a tmux session and run the commands:

1) for ((i=0; i<100000; i=$((i+1000)) )); do psql -U my_user -d my_database  -c "SELECT * FROM ws_log_smevlog where id>=$i and id<$((i+1000))" >/dev/null || echo $i; done
2) for ((i=100000; i<200000; i=$((i+1000)) )); do psql -U my_user -d my_database  -c "SELECT * FROM ws_log_smevlog where id>=$i and id<$((i+1000))" >/dev/null || echo $i; done
…
15) for ((i=1400000; i<1500000; i=$((i+1000)) )); do psql -U my_user -d my_database -c "SELECT * FROM ws_log_smevlog where id>=$i and id<$((i+1000))" >/dev/null || echo $i; done
16) for ((i=1500000; i<1628991; i=$((i+1000)) )); do psql -U my_user -d my_database  -c "SELECT * FROM ws_log_smevlog where id>=$i and id<$((i+1000))" >/dev/null || echo $i; done
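As a side note, the same ranges could also be launched from a single script with background jobs instead of separate tmux windows; a minimal sketch under the same assumptions (user, database, and table names as in the commands above):

#!/bin/bash
# Scan one id range, printing the start of every 1000-id block whose SELECT fails.
scan_range() {
  for ((i=$1; i<$2; i+=1000)); do
    psql -U my_user -d my_database \
      -c "SELECT * FROM ws_log_smevlog where id>=$i and id<$((i+1000))" >/dev/null || echo $i
  done
}

# Launch 17 background workers, 100,000 ids each (ids past 1,628,991 simply return no rows).
for start in $(seq 0 100000 1600000); do
  scan_range "$start" $((start+100000)) >> bad_ranges.txt &
done
wait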

A day later, I got the first results! Namely (the exact values of XXX and ZZZ were not preserved):

ERROR:  missing chunk number 0 for toast value 37837571 in pg_toast_106070
829000
ERROR:  missing chunk number 0 for toast value XXX in pg_toast_106070
829000
ERROR:  missing chunk number 0 for toast value ZZZ in pg_toast_106070
146000

This means that three rows contain an error. The ids of the first and second problematic records lie between 829,000 and 830,000, and the id of the third lies between 146,000 and 147,000. Next, we just had to find the exact ids of the problematic records. To do this, we walk through each problematic range with a step of 1 and identify the id:

for ((i=829000; i<830000; i=$((i+1)) )); do psql -U my_user -d my_database -c "SELECT * FROM ws_log_smevlog where id=$i" >/dev/null || echo $i; done
829417
ERROR:  unexpected chunk number 2 (expected 0) for toast value 37837843 in pg_toast_106070
829449
for ((i=146000; i<147000; i=$((i+1)) )); do psql -U my_user -d my_database -c "SELECT * FROM ws_log_smevlog where id=$i" >/dev/null || echo $i; done
ERROR:  unexpected chunk number ZZZ (expected 0) for toast value XXX in pg_toast_106070
146911

Happy ending

We have found the problem rows. We go into the database through psql and try to delete them:

my_database=# delete from ws_log_smevlog where id=829417;
DELETE 1
my_database=# delete from ws_log_smevlog where id=829449;
DELETE 1
my_database=# delete from ws_log_smevlog where id=146911;
DELETE 1

To my surprise, the entries were deleted without any problems, even without the zero_damaged_pages option.

Then I connected to the database, ran VACUUM FULL (I think it was optional), and finally took the backup again with pg_dump. The dump completed without any errors! The problem was solved in this rather crude way. My joy knew no bounds; after so many failures, we had managed to find a solution!
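For reference, the final steps looked roughly like this, with the same assumed user and database names as above; first in psql:

VACUUM FULL ws_log_smevlog;

and then from the shell:

pg_dump -U my_user -d my_database -f ./full_backup.dump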

Acknowledgments and Conclusion

This is my first experience with restoring a real Postgres database. I will remember this experience for a long time.

And finally, I would like to thank PostgresPro for the documentation translated into Russian and for the completely free online courses, which helped a lot during the analysis of the problem.

Source: habr.com
