How We Used Delayed Replication for Disaster Recovery with PostgreSQL

Replication is not a backup. Or is it? Here's how we used a delayed replica to recover data after someone accidentally deleted labels.

GitLab's infrastructure team is responsible for running GitLab.com, the largest GitLab instance in existence. With 3 million users and almost 7 million projects, it is one of the largest single-tenant open-source SaaS sites around. The GitLab.com infrastructure would not get far without PostgreSQL, and we go to great lengths to guard against failures that could lose data. Such a catastrophe is unlikely, but we are well prepared, with a variety of backup and replication mechanisms in place.

Replication is not a backup tool for your database (see below). But here we will see how delayed replication helps to quickly recover accidentally deleted data: a GitLab.com user deleted a label on the gitlab-ce project, losing its associations with merge requests and issues.

With a delayed replica, we recovered the data in just 1.5 hours. Here is how it went.

Point-in-time recovery with PostgreSQL

PostgreSQL has a built-in feature that restores the state of a database to a specific point in time. It is called Point-in-Time Recovery (PITR) and uses the same mechanisms that keep a replica up to date: starting from a reliable snapshot of the entire database cluster (a base backup), we apply a series of state changes up to a certain point in time.

To use this feature for a cold backup, we regularly make a base backup of the database and store it in an archive (GitLab's archives live in Google Cloud Storage). We also capture database state changes by archiving the write-ahead log (WAL). With all this in place, we can do PITR for disaster recovery: start from a snapshot taken before the error and apply the changes from the WAL archive up to the moment just before the failure.
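
As a rough sketch, a one-off PITR restore starts from an unpacked base backup and a recovery.conf along these lines; the restore_command matches the wal-e setup shown below, and the target time here is made up:

restore_command = '/usr/bin/envdir /etc/wal-e.d/env /opt/wal-e/bin/wal-e wal-fetch "%f" "%p"'
recovery_target_time = '2018-10-11 23:00:00+00'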

What is delayed replication?

Delayed replication is the application of changes from the WAL with a delay: a transaction that happened at time X appears on the replica only at time X + d, after the configured delay d.

There are two ways to set up a physical database replica in PostgreSQL: archive recovery and streaming replication. Archive recovery essentially works like PITR, but continuously: we constantly fetch changes from the WAL archive and apply them to the replica. Streaming replication instead fetches the WAL stream directly from the upstream database host. We prefer archive recovery: it is easier to manage, and its performance keeps up with the production cluster.

How to set up delayed archive recovery

Recovery options are described in the recovery.conf file. Example:

standby_mode = 'on'
restore_command = '/usr/bin/envdir /etc/wal-e.d/env /opt/wal-e/bin/wal-e wal-fetch -p 4 "%f" "%p"'
recovery_min_apply_delay = '8h'
recovery_target_timeline = 'latest'

These settings configure a delayed replica with archive recovery. Here wal-e is used to fetch WAL segments from the archive (restore_command), and changes are applied only after eight hours (recovery_min_apply_delay). The replica will follow timeline changes in the archive, for example after a cluster failover (recovery_target_timeline).

With recovery_min_apply_delay you can also set up delayed streaming replication, but there are a couple of pitfalls related to replication slots, hot standby feedback, and so on. The WAL archive avoids them.
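
For comparison, a minimal sketch of a delayed streaming replica; the connection string is an assumption:

standby_mode = 'on'
# hypothetical primary; with streaming, mind replication slots and hot standby feedback
primary_conninfo = 'host=primary.example.com user=replication'
recovery_min_apply_delay = '8h'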

The recovery_min_apply_delay parameter appeared only in PostgreSQL 9.4. In earlier versions, delayed replication requires a combination of the recovery management functions (pg_xlog_replay_pause(), pg_xlog_replay_resume()) or holding WAL segments in the archive for the duration of the delay.
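
On those older versions, the manual equivalent looks roughly like this sketch, driven by an external scheduler rather than typed by hand:

-- stop applying WAL on the replica (segments are still received and kept)
SELECT pg_xlog_replay_pause();
-- ...wait for the desired delay...
-- resume applying WAL
SELECT pg_xlog_replay_resume();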

How does PostgreSQL do it?

It's interesting to see how PostgreSQL implements delayed replay. Let's look at recoveryApplyDelay(XLogReaderState *record). It is called from the main redo apply loop for each record from the WAL.

static bool
recoveryApplyDelay(XLogReaderState *record)
{
    uint8       xact_info;
    TimestampTz xtime;
    long        secs;
    int         microsecs;

    /* nothing to do if no delay configured */
    if (recovery_min_apply_delay <= 0)
        return false;

    /* no delay is applied on a database not yet consistent */
    if (!reachedConsistency)
        return false;

    /*
     * Is it a COMMIT record?
     *
     * We deliberately choose not to delay aborts since they have no effect on
     * MVCC. We already allow replay of records that don't have a timestamp,
     * so there is already opportunity for issues caused by early conflicts on
     * standbys.
     */
    if (XLogRecGetRmid(record) != RM_XACT_ID)
        return false;

    xact_info = XLogRecGetInfo(record) & XLOG_XACT_OPMASK;

    if (xact_info != XLOG_XACT_COMMIT &&
        xact_info != XLOG_XACT_COMMIT_PREPARED)
        return false;

    if (!getRecordTimestamp(record, &xtime))
        return false;

    recoveryDelayUntilTime =
        TimestampTzPlusMilliseconds(xtime, recovery_min_apply_delay);

    /*
     * Exit without arming the latch if it's already past time to apply this
     * record
     */
    TimestampDifference(GetCurrentTimestamp(), recoveryDelayUntilTime,
                        &secs, &microsecs);
    if (secs <= 0 && microsecs <= 0)
        return false;

    while (true)
    {
        /*
         * Shortened: wait on the latch until recoveryDelayUntilTime
         * is reached, then break out of the loop.
         */
        break;
    }
    return true;
}

The bottom line is that the delay is based on the physical time recorded in the commit timestamp of the transaction (xtime). As you can see, the delay applies only to commits and does not affect other records: all changes are applied directly, but the commit is delayed, so we see the changes only after the configured delay.

How to use a delayed replica for data recovery

Suppose we have a production database cluster and a replica with an eight-hour delay. Let's see how to recover data, using the accidental deletion of labels as an example.

When we became aware of the problem, we paused WAL replay on the delayed replica:

SELECT pg_xlog_replay_pause();

With replay paused, there was no risk of the replica repeating the DELETE query. A useful thing if you need time to figure everything out.
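
You can verify that replay is indeed paused (pg_is_xlog_replay_paused() is the 9.x name; in PostgreSQL 10+ this family of functions was renamed to pg_wal_replay_*):

SELECT pg_is_xlog_replay_paused();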

The point is that the delayed replica must reach the moment just before the DELETE query. We knew the approximate physical time of the deletion, so we removed recovery_min_apply_delay and added recovery_target_time to recovery.conf. This way the replica catches up to the right moment without delay:

recovery_target_time = '2018-10-12 09:25:00+00'

With timestamps, it is better to aim a bit early so as not to overshoot. True, the earlier we stop, the more data we lose. And if we replay past the DELETE query, everything is deleted again and we have to start over (or even fall back to a cold backup for PITR).

We restarted the delayed Postgres instance, and WAL segments were replayed up to the specified time. Progress at this stage can be tracked with this query:

SELECT
  -- current location in WAL
  pg_last_xlog_replay_location(),
  -- current transaction timestamp (state of the replica)
  pg_last_xact_replay_timestamp(),
  -- current physical time
  now(),
  -- the amount of time still to be applied until recovery_target_time has been reached
  '2018-10-12 09:25:00+00'::timestamptz - pg_last_xact_replay_timestamp() as delay;

If the timestamp no longer changes, recovery is complete. The behavior can be customized with recovery_target_action, which shuts down, promotes, or pauses the instance once the target is reached (it pauses by default).
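
Putting the recovery phase together, the recovery.conf at this point looked roughly like this (a sketch; the default pause action is made explicit):

standby_mode = 'on'
restore_command = '/usr/bin/envdir /etc/wal-e.d/env /opt/wal-e/bin/wal-e wal-fetch -p 4 "%f" "%p"'
recovery_target_time = '2018-10-12 09:25:00+00'
recovery_target_action = 'pause'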

The database reached the state just before that ill-fated query. Now you can, for example, export data. We exported the deleted label data along with all links to issues and merge requests and moved them to the production database. If the losses are large-scale, you can simply promote the replica and use it as the primary. But then all changes made after the point we recovered to are lost.
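
A minimal sketch of such an export from psql; the labels and label_links table names and the project id are assumptions for illustration, not necessarily the exact GitLab schema:

-- \copy runs client-side, so no superuser access is needed on the replica
\copy (SELECT * FROM labels WHERE project_id = 42) TO 'labels.csv' CSV HEADER
\copy (SELECT * FROM label_links WHERE label_id IN (SELECT id FROM labels WHERE project_id = 42)) TO 'label_links.csv' CSV HEADER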

It is even better to use transaction IDs instead of timestamps. It is useful to log these IDs, for example for DDL statements (such as DROP TABLE), using log_statement = 'ddl'. If we had had a transaction ID, we would have used recovery_target_xid and replayed everything up to the transaction just before the DELETE query.
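
In recovery.conf that would look like this (the transaction ID here is made up):

recovery_target_xid = '1234567'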

Getting back to normal operation is very simple: remove the changes from recovery.conf and restart Postgres. Soon the replica will again have its eight-hour delay, and we are ready for future troubles.

Recovery Benefits

With a delayed replica instead of a cold backup, you do not have to spend hours restoring an entire snapshot from the archive. For example, it takes us five hours just to fetch the entire 2 TB base backup. And then the entire daily WAL still has to be applied to recover to the desired state (in the worst case).

A delayed replica is better than a cold backup in two ways:

  1. You do not need to get the entire base backup from the archive.
  2. Only a fixed eight-hour window of WAL segments needs to be replayed.

Also, since we are constantly checking that the WAL can be replayed with PITR, we would quickly notice corruption or other problems with the WAL archive by monitoring the lag of the delayed replica.
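
The lag itself is easy to watch with a standard query (again using the pre-10 function name):

SELECT now() - pg_last_xact_replay_timestamp() AS replication_delay;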

In this example, the restore took us 50 minutes, which works out to 110 GB of WAL data per hour (the archive was then on AWS S3). In total, we solved the problem and recovered the data in 1.5 hours.

Summary: where a delayed replica is useful (and where not)

Use delayed replication as first aid when you accidentally lose data and notice the disaster within the configured delay.

But keep in mind: replication is not a backup.

Backup and replication serve different purposes. A cold backup comes in handy when you have made an accidental DELETE or DROP TABLE: you fetch the backup from cold storage and restore the previous state of a table or an entire database. But a DROP TABLE is reproduced almost instantly on all replicas of a working cluster, so ordinary replication will not save you here. Replication itself keeps the database available when individual servers fail, and distributes the load.

Even with a delayed replica, we sometimes really need a cold backup in a safe place, in case of a data center failure, silent corruption, or other events that are not noticed immediately. Replication alone is of no use here.

Note: on GitLab.com we currently only protect against data loss at the system level and do not restore data at the user level.

Source: habr.com
