WAL-G: backups and restores of PostgreSQL DBMS

It has long been known that backing up to SQL dumps (using pg_dump or pg_dumpall) is not a good idea. To back up a PostgreSQL cluster, it is better to use the pg_basebackup command, which makes a binary copy of the database files along with the WAL logs. But once you start studying the whole backup-and-restore process, you realize that you need to reinvent at least a couple of wheels before it works without causing you pain from every direction. WAL-G was developed to alleviate that suffering.

WAL-G is a tool written in Go for backing up and restoring PostgreSQL databases (and, more recently, MySQL/MariaDB, MongoDB and FoundationDB). It supports Amazon S3 storage (and its analogues, for example Yandex Object Storage), as well as Google Cloud Storage, Azure Storage, Swift Object Storage and the plain file system. The whole setup comes down to a few simple steps, but because articles about it are scattered around the Internet, there is no complete how-to that covers all the steps from start to finish (there are several posts on Habr, but many points are missed there).


This article was written primarily to systematize my own knowledge. I am not a DBA and may slip into layman's terms here and there, so any corrections are welcome!

Separately, I note that everything below is relevant and tested for PostgreSQL 12.3 on Ubuntu 18.04, all commands must be executed as a privileged user.

Installation

At the time of this writing, the stable version of WAL-G is v0.2.15 (March 2020). This is what we will use (but if you want to build it yourself from the master branch, the GitHub repository has all the instructions for that). To download and install, run:

#!/bin/bash

curl -L "https://github.com/wal-g/wal-g/releases/download/v0.2.15/wal-g.linux-amd64.tar.gz" -o "wal-g.linux-amd64.tar.gz"
tar -xzf wal-g.linux-amd64.tar.gz
mv wal-g /usr/local/bin/

After that, you need to configure WAL-G first, and then PostgreSQL itself.

WAL-G setup

Amazon S3 will be used as the example backup storage (because it is closer to my servers and very cheap to use). To work with it, you need an S3 bucket and access keys.

All previous WAL-G articles used configuration via environment variables, but as of this release the settings can be placed in a .walg.json file in the home directory of the postgres user. To create it, run the following bash script:

#!/bin/bash

cat > /var/lib/postgresql/.walg.json << EOF
{
    "WALG_S3_PREFIX": "s3://your_bucket/path",
    "AWS_ACCESS_KEY_ID": "key_id",
    "AWS_SECRET_ACCESS_KEY": "secret_key",
    "WALG_COMPRESSION_METHOD": "brotli",
    "WALG_DELTA_MAX_STEPS": "5",
    "PGDATA": "/var/lib/postgresql/12/main",
    "PGHOST": "/var/run/postgresql/.s.PGSQL.5432"
}
EOF
# be sure to change the file owner:
chown postgres: /var/lib/postgresql/.walg.json

Let me explain a little about everything:

  • WALG_S3_PREFIX – the path to your S3 bucket where backups will be uploaded (you can point it at the bucket root or at a folder inside it);
  • AWS_ACCESS_KEY_ID – the S3 access key (when restoring on a test server, these keys must have a read-only policy! This is discussed in more detail in the recovery section);
  • AWS_SECRET_ACCESS_KEY – the S3 secret key;
  • WALG_COMPRESSION_METHOD – the compression method; brotli is the best choice (a golden mean between the resulting size and compression/decompression speed);
  • WALG_DELTA_MAX_STEPS – the number of delta backups before a full backup is created (saves time and download size, but can slightly slow down recovery, so very large values are not advisable);
  • PGDATA – the path to your database data directory (can be found by running pg_lsclusters);
  • PGHOST – the database connection; for a local backup it is better to use a unix socket, as in this example.
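
A syntax error in .walg.json would prevent wal-g from reading its settings, so it is worth validating the file right after generating it. A minimal sketch using Python's stdlib json.tool (here a throwaway sample is validated; on a real server point CONFIG at /var/lib/postgresql/.walg.json instead):

```shell
#!/bin/bash
# Validate that a wal-g config parses as JSON before relying on it.
# Here a throwaway sample is checked; on a real server point CONFIG
# at /var/lib/postgresql/.walg.json instead.
CONFIG="$(mktemp)"
cat > "$CONFIG" << 'EOF'
{
    "WALG_S3_PREFIX": "s3://your_bucket/path",
    "WALG_COMPRESSION_METHOD": "brotli",
    "WALG_DELTA_MAX_STEPS": "5"
}
EOF
if python3 -m json.tool "$CONFIG" > /dev/null 2>&1; then
    echo "config: valid JSON"
else
    echo "config: parse error"
fi
rm -f "$CONFIG"
```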

Other options can be found in the documentation: https://github.com/wal-g/wal-g/blob/v0.2.15/PostgreSQL.md#configuration.
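
If you prefer the older environment-variable style mentioned above (for example, for a one-off run from an interactive shell), the same settings can be exported before calling wal-g. The values below are the same placeholders as in the .walg.json example:

```shell
#!/bin/bash
# Same settings as in .walg.json, but as environment variables
# (the older configuration style); all values are placeholders.
export WALG_S3_PREFIX="s3://your_bucket/path"
export AWS_ACCESS_KEY_ID="key_id"
export AWS_SECRET_ACCESS_KEY="secret_key"
export WALG_COMPRESSION_METHOD="brotli"
export WALG_DELTA_MAX_STEPS="5"
export PGDATA="/var/lib/postgresql/12/main"
export PGHOST="/var/run/postgresql/.s.PGSQL.5432"
```

The file-based configuration is more convenient for cron jobs, since the environment does not have to be re-exported in every script.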

Configuring PostgreSQL

In order for the database's archiver to upload WAL logs to the cloud and to recover from them (if necessary), several parameters need to be set in the configuration file /etc/postgresql/12/main/postgresql.conf. First make sure that none of the settings below are already set to other values, so that the DBMS does not crash when the configuration is reloaded. You can add these options with:

#!/bin/bash

echo "wal_level=replica" >> /etc/postgresql/12/main/postgresql.conf
echo "archive_mode=on" >> /etc/postgresql/12/main/postgresql.conf
echo "archive_command='/usr/local/bin/wal-g wal-push "%p" >> /var/log/postgresql/archive_command.log 2>&1' " >> /etc/postgresql/12/main/postgresql.conf
echo “archive_timeout=60” >> /etc/postgresql/12/main/postgresql.conf
echo "restore_command='/usr/local/bin/wal-g wal-fetch "%f" "%p" >> /var/log/postgresql/restore_command.log 2>&1' " >> /etc/postgresql/12/main/postgresql.conf

# перезагружаем конфиг через отправку SIGHUP сигнала всем процессам БД
killall -s HUP postgres

Description of the parameters to be set:

  • wal_level – how much information to write to the WAL logs; "replica" writes enough for archiving and replication;
  • archive_mode – enables uploading of WAL logs using the command from the archive_command parameter;
  • archive_command – the command for archiving a completed WAL segment;
  • archive_timeout – a segment is normally archived only once it is complete, but if your server changes/adds little data, it makes sense to set a limit in seconds after which archiving is forced (I have intensive writes to the database every second, so I decided not to set this parameter in production);
  • restore_command – the command for fetching a WAL segment from the backup; it will be used if the base backup lacks the latest database changes.

You can read more about all these parameters in the translation of the official documentation: https://postgrespro.ru/docs/postgresql/12/runtime-config-wal.

Setting up a backup schedule

Like it or not, cron is the most convenient way to run backups, and that is what we will configure. Let's start with the command for creating a full backup: in wal-g this is the backup-push launch argument. But first it's better to run this command manually as the postgres user to make sure everything is fine (and there are no access errors):

#!/bin/bash

su - postgres -c '/usr/local/bin/wal-g backup-push /var/lib/postgresql/12/main'

The launch arguments contain the path to the data directory - I remind you that you can find it by running pg_lsclusters.
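
If you need the data directory inside a script, it can be pulled out of pg_lsclusters programmatically. A small sketch assuming the Debian/Ubuntu postgresql-common output layout, where the sixth column is the data directory (the parsing is demonstrated on a sample output line; on a real server substitute the actual pg_lsclusters output, whose header line will not match the filter):

```shell
#!/bin/bash
# Extract the "Data directory" column (the sixth) from pg_lsclusters
# output. SAMPLE mimics one line of output for cluster 12/main; on a
# real server pipe the output of pg_lsclusters through the same awk.
SAMPLE="12 main 5432 online postgres /var/lib/postgresql/12/main /var/log/postgresql/postgresql-12-main.log"
PGDATA="$(echo "$SAMPLE" | awk '$1 == "12" && $2 == "main" {print $6}')"
echo "$PGDATA"    # prints /var/lib/postgresql/12/main
```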

If everything went without errors and the data was uploaded to the S3 storage, you can then configure a periodic run in crontab:

#!/bin/bash

echo "15 4 * * *    /usr/local/bin/wal-g backup-push /var/lib/postgresql/12/main >> /var/log/postgresql/walg_backup.log 2>&1" >> /var/spool/cron/crontabs/postgres
# set the owner and correct permissions on the file
chown postgres: /var/spool/cron/crontabs/postgres
chmod 600 /var/spool/cron/crontabs/postgres

In this example, the backup process starts every day at 4:15 am.

Deleting old backups

You probably don't need to keep absolutely every backup since the Mesozoic era, so it's a good idea to periodically "clean up" your storage (both full backups and WAL logs). We'll do this through a cron job as well:

#!/bin/bash

echo "30 6 * * *    /usr/local/bin/wal-g delete before FIND_FULL $(date -d '-10 days' '+%FT%TZ') --confirm >> /var/log/postgresql/walg_delete.log 2>&1" >> /var/spool/cron/crontabs/postgres
# set the owner and permissions on the file once again (usually this doesn't need to be repeated, but it does no harm)
chown postgres: /var/spool/cron/crontabs/postgres
chmod 600 /var/spool/cron/crontabs/postgres

Cron will run this job every day at 6:30 am, deleting everything (full backups, deltas and WAL segments) older than the last 10 days, but keeping the backup needed so that any point in time after the cutoff date remains reachable via PITR.
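
The cutoff argument that cron substitutes into the delete command is an ISO 8601 timestamp; you can preview what the date expression expands to:

```shell
#!/bin/bash
# The retention cutoff passed to "wal-g delete before FIND_FULL":
# %F expands to YYYY-MM-DD and %T to HH:MM:SS, producing a timestamp
# 10 days in the past, e.g. 2020-06-01T06:30:00Z
date -d '-10 days' '+%FT%TZ'
```

Note that the trailing Z here is a literal suffix while date prints local time; if your servers do not run on UTC, `date -u` gives a true UTC timestamp.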

Restoring from a backup

It's no secret that the key to a healthy database is to periodically restore and check the integrity of the data inside. How to recover using WAL-G - I'll tell you in this section, and we'll talk about checks later.

It's worth noting separately that for restoring to a test environment (anything that is not production) you need to use a read-only account in S3, so as not to accidentally overwrite backups. In the case of WAL-G, the S3 user needs the following permissions in the group policy (Effect: Allow): s3:GetObject, s3:ListBucket, s3:GetBucketLocation. And, of course, don't forget to set archive_mode=off in the postgresql.conf settings file, so that your test database doesn't quietly try to push backups of its own.

Restoration is done with a flick of the wrist that deletes all PostgreSQL data (including users), so please be extremely careful when running the following commands.

#!/bin/bash

# if there is a connection pooler (e.g. pgbouncer), stop it first so that it doesn't spew errors into the log
service pgbouncer stop
# if there is a daemon that restarts crashed processes (e.g. monit), stop its monitoring of the database process (mine is called pgsql12)
monit stop pgsql12
# or stop monitoring entirely
service monit stop
# stop the database itself
service postgresql stop
# delete all data of the current cluster (!!!); it's better to make a copy of it first if there is free disk space
rm -rf /var/lib/postgresql/12/main
# download the backup and unpack it
su - postgres -c '/usr/local/bin/wal-g backup-fetch /var/lib/postgresql/12/main LATEST'
# place the special recovery signal file next to the data (see https://postgrespro.ru/docs/postgresql/12/runtime-config-wal#RUNTIME-CONFIG-WAL-ARCHIVE-RECOVERY ); it must be created as the postgres user
su - postgres -c 'touch /var/lib/postgresql/12/main/recovery.signal'
# start the database so that it initiates the recovery process
service postgresql start

For those who want to verify the recovery process, a small piece of bash magic is prepared below: if recovery runs into problems, the script exits with a non-zero code. In this example, 120 checks are made with a 5-second interval (10 minutes total for recovery) to find out whether the signal file has been deleted (which would mean the recovery was successful):

#!/bin/bash

CHECK_RECOVERY_SIGNAL_ITER=0
while [ ${CHECK_RECOVERY_SIGNAL_ITER} -le 120 ]
do
    if [ ! -f "/var/lib/postgresql/12/main/recovery.signal" ]
    then
        echo "recovery.signal removed"
        break
    fi
    sleep 5
    CHECK_RECOVERY_SIGNAL_ITER=$((CHECK_RECOVERY_SIGNAL_ITER + 1))
done

# if the file still exists after all the checks, fail with an error
if [ -f "/var/lib/postgresql/12/main/recovery.signal" ]
then
    echo "recovery.signal still exists!"
    exit 17
fi
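
If you restore regularly, the wait-and-check logic above can be wrapped into a reusable function. A minimal sketch (the file path, iteration limit and polling interval are parameters; the exit code 17 mirrors the script above):

```shell
#!/bin/bash
# Wait until a file disappears: up to MAX_ITER checks, SLEEP_SEC seconds apart.
# Returns 0 if the file was removed in time, 17 otherwise.
wait_for_removal() {
    local file="$1" max_iter="${2:-120}" sleep_sec="${3:-5}"
    local i=0
    while [ "$i" -lt "$max_iter" ]
    do
        if [ ! -f "$file" ]
        then
            return 0
        fi
        sleep "$sleep_sec"
        i=$((i + 1))
    done
    return 17
}

# usage:
# wait_for_removal /var/lib/postgresql/12/main/recovery.signal 120 5 || exit 17
```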

After a successful restore, do not forget to restart all processes (pgbouncer / monit, etc.).

Data verification after recovery

Be sure to check the integrity of the database after recovery, so that you don't end up in a situation with a broken/corrupted backup. It's best to do this for every archive created, but where and how is up to your imagination (you can spin up separate servers hourly or run the check in CI). At a minimum, you need to check the data and the indexes in the database.

To check the data, it is enough to run it through a dump, and it is better if the database was created with data checksums enabled:

#!/bin/bash

if ! su - postgres -c 'pg_dumpall > /dev/null'
then
    echo 'pg_dumpall failed'
    exit 125
fi

To check the indexes there is the amcheck module; we will borrow the SQL query for it from the WAL-G tests and build a little logic around it:

#!/bin/bash

# write the SQL query for the check to a file in the temp directory
cat > /tmp/amcheck.sql << EOF
CREATE EXTENSION IF NOT EXISTS amcheck;
SELECT bt_index_check(c.oid), c.relname, c.relpages
FROM pg_index i
JOIN pg_opclass op ON i.indclass[0] = op.oid
JOIN pg_am am ON op.opcmethod = am.oid
JOIN pg_class c ON i.indexrelid = c.oid
JOIN pg_namespace n ON c.relnamespace = n.oid
WHERE am.amname = 'btree'
AND c.relpersistence != 't'
AND i.indisready AND i.indisvalid;
EOF
chown postgres: /tmp/amcheck.sql

# add a script that runs the checks on every database in the cluster
# (note the quoted 'EOF': variables and command substitutions are kept
# literal so they expand when the script runs, not when the file is written)
cat > /tmp/run_amcheck.sh << 'EOF'
#!/bin/bash
for DBNAME in $(su - postgres -c 'psql -q -A -t -c "SELECT datname FROM pg_database WHERE datistemplate = false;" ')
do
    echo "Database: ${DBNAME}"
    su - postgres -c "psql -f /tmp/amcheck.sql -v 'ON_ERROR_STOP=1' ${DBNAME}" && EXIT_STATUS=$? || EXIT_STATUS=$?
    if [ "${EXIT_STATUS}" -ne 0 ]
    then
        echo "amcheck failed on DB: ${DBNAME}"
        exit 125
    fi
done
EOF
chmod +x /tmp/run_amcheck.sh

# run the script
/tmp/run_amcheck.sh > /tmp/amcheck.log

# to verify that everything went well, you can check the exit code or grep for the error
if grep 'amcheck failed' "/tmp/amcheck.log"
then
    echo 'amcheck failed: '
    cat /tmp/amcheck.log
    exit 125
fi

Summarizing

I express my gratitude to Andrey Borodin for his help in preparing the publication and special thanks for his contribution to the development of WAL-G!

This concludes this note. I hope I've managed to convey how easy this tool is to set up and how much potential it has for use in your company. I had heard a lot about WAL-G but never found the time to sit down and figure it out; once I had implemented it at home, this article followed naturally.

Separately, it is worth noting that WAL-G can also work with other DBMS: MySQL/MariaDB, MongoDB and FoundationDB.

Source: habr.com
