How incremental backups work in PostgreSQL, and how to implement them in 10 minutes
Postgres as an RDBMS offers limited support for incremental backups: they are officially supported via WAL (Write-Ahead Log) archiving, but there is no built-in method for actually implementing them. The DBA is expected to write scripts and cron jobs to manage backups using the built-in tools. Such scripts can be very simple or extremely complex, and they often create overhead for a DBA, especially one migrating to Postgres from an enterprise RDBMS such as Oracle, which ships with powerful backup and recovery tools that make it easy to put a robust backup system in place with minimal effort.
Enter Barman.
Barman is an open source tool maintained by 2ndQuadrant, and is essentially a wrapper for all of Postgres's inbuilt backup and recovery features (the name stands for Backup And Recovery MANager). With this tool, a DBA can specify backup policies in a manner similar to Oracle's RMAN, and not have to worry about maintaining their own backup scripts. Installation is very simple, and so is usage. In this post I will explain how to install and configure Barman, and explain some of how it works under the hood. For more information you can check out the project's official GitHub repository.
First off, what are WAL files?
As in other RDBMSs, Postgres logs transactions in log files, which the database can use to roll changes forward and achieve PITR (Point-In-Time Recovery). As long as the database has access to the log files, it can always use the data in them to roll the database to any point in time. WAL archiving is simply the process of storing the WAL files in a safe location for future use; streaming replication ships WAL to a standby database in real time.
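For a concrete look at WALs, you can ask Postgres which WAL file it is currently writing to (this assumes Postgres 10 or later, where these functions got their current names):
# Show the current WAL write position and the WAL segment file it maps to
psql -c "SELECT pg_current_wal_lsn(), pg_walfile_name(pg_current_wal_lsn());"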
Postgres servers are defined by a data directory, containing all cluster data. To restore a server, you create a new data directory with the datafiles relevant to the desired point in time, and then start the postgres process against the new data directory. You can also cold (offline) clone a postgres cluster by copying the data directory to a new server and starting the instance.
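As a rough sketch of such a cold clone (assuming a data directory of /var/lib/postgresql/data and pg_ctl on the PATH; both vary by install):
# Stop the source cluster so the datafiles are consistent on disk
pg_ctl -D /var/lib/postgresql/data stop
# Copy the entire data directory to the new server
rsync -a /var/lib/postgresql/data/ newhost:/var/lib/postgresql/data/
# On the new server, start postgres against the copied directory
pg_ctl -D /var/lib/postgresql/data start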
For non-incremental backups, Barman uses the replication slot feature in Postgres, which streams the WALs to any specified location, and uses pg_basebackup to hot-backup an entire database cluster.
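For reference, a standalone pg_basebackup run looks roughly like this (the host, user, and target directory are placeholders):
# Hot-backup the whole cluster, streaming the WALs needed to make it consistent
pg_basebackup -h pghost -U postgres -D /data/backup/full -X stream -P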
For incremental backups, Barman uses a different approach, relying on rsync with ssh.
You set an ssh command which connects the Barman server to the database server over passwordless ssh, and rsync handles the transfer of all necessary files.
Something that confused me is that in many other RDBMSs, you specify a full backup interval, say once a week, and then at shorter intervals you take an incremental backup which stores transaction logs of some sort, building on the full backup. Those incremental backups would be useless without the underlying full backup: during a restore, the database restores the full backup and then "rolls forward" the logs, effectively applying the transactions that happened after the full backup was taken. But Barman does not do this, instead using only an incremental backup schedule without specifying a full backup. I couldn't understand how Barman gets away with never taking a full backup to use with the incremental ones. At first I thought that maybe it transparently creates a full one whenever needed, but that didn't make sense to me, so I did some digging. The results are what prompted me to write this explanation post.
The important thing to note here is that Barman is not what creates the incremental backups; rsync is what manages the deltas. This allows for a very simple scheduling system where you only specify an incremental backup schedule, and never need a full database backup schedule. Here's how.
The way rsync handles copying deltas is by making use of hardlinks.
When you copy a file with rsync, it defaults to an incremental copy. rsync checks the destination directory for each source file. If the file does not exist there, rsync copies it. If it exists but differs, rsync replaces it. If it exists and is identical, rsync will not touch it. By default, rsync's "quick check" compares each file's size and last-modified time; with -c it compares checksums instead. From the man page:
-c, --checksum
This changes the way rsync checks if the files have been changed
and are in need of a transfer. Without this option, rsync uses
a "quick check" that (by default) checks if each file’s size and
time of last modification match between the sender and receiver.
This option changes this to compare a 128-bit checksum for each
file that has a matching size. Generating the checksums means
that both sides will expend a lot of disk I/O reading all the
data in the files in the transfer (and this is prior to any
reading that will be done to transfer changed files), so this
can slow things down significantly.
So unchanged files are not transferred. What does this mean? After all, we do need the unchanged files to be present in the new (or backup, in this case) directory as well, so how can rsync not copy them? The trick is hardlinking.
rsync does not copy files that have not changed, but it does create a hardlink to them. So both the original and new directories will contain an identical file. How is this different from simply copying the file, and how does this make the copy incremental? Let's take a look at link structure.
A soft (symbolic) link is a completely separate file which points to an origin file. If the origin file is deleted, the link points nowhere and has no data. A hard link, however, works quite differently.
Say we have a file named FILE1, with the line "I am a file with some data" in it. If we were to create a hardlink from it to a file named FILE2, the two files would be identical in every way; even the inode number would be the same. How is this achieved? We must remember that the files you see in the filesystem on your unix machine are not actually the location of the bits on the disk, but rather pointers, much like an index leaf in a database. So in our example, when you create a hardlink, the operating system will create another pointer named FILE2 that points to the location on disk where "I am a file with some data" is stored. Because the data is the same data and not a copy, the inode is the same. This leads to an interesting point: the hard link FILE2 is not reliant on the original file FILE1 in any way. Even if FILE1 were deleted, FILE2 would still point to the data on disk. So as long as there is even one hard link left, the data will not be lost.
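You can try this yourself in a shell; a minimal demonstration:
echo "I am a file with some data" > FILE1
ln FILE1 FILE2       # a hard link (note: no -s flag)
ls -li FILE1 FILE2   # both names show the same inode number
rm FILE1
cat FILE2            # prints "I am a file with some data": the data survives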
Therefore, when the backup runs for the first time, the entire postgres data directory is cloned. The next time the backup runs, rsync brings in only the changes (in this usage it also deletes files that were deleted on the source) and creates hardlinks to the unchanged files from the first run. Thus, the second run is significantly shorter and takes far less space: the unchanged bits are not written to the disk again. Then, at some point you reach the end of the retention policy you specified in Barman, and the oldest files are deleted. But you don't need to create a new full backup, because the backup that was once incremental is now the full backup. As long as you have even one hardlink, you can safely delete the older ones and your data is intact. It's the same file! Just a different pointer. So you can have your incremental schedule and not have to worry about creating a full backup as well.
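Under the hood this is rsync's --link-dest mechanism, which Barman turns on with reuse_backup = link. A hand-rolled sketch of the same idea, with made-up paths:
# First run: a full copy of the data directory
rsync -a /var/lib/postgresql/data/ /backups/0/
# Later runs: copy only changed files, hardlink unchanged ones to the previous run
rsync -a --delete --link-dest=/backups/0/ /var/lib/postgresql/data/ /backups/1/
# /backups/0/ can now be deleted; the data that /backups/1/ hardlinks to stays on disk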
Now, how do you set up and use Barman? Here is the basic installation and configuration for an incremental backup policy.
The architecture in this example is as follows:
One postgres server, called pghost.
One Barman server, called bmhost.
Each server is a separate Linux machine.
To install barman on bmhost, run the following:
sudo apt-get install barman
or
sudo yum install barman
You may also need to install the package barman-cli.
Packages are also available to download from https://sourceforge.net/projects/pgbarman/files/. You can even build from source if you like.
Set up passwordless ssh between pghost and bmhost, in both directions: the barman user on bmhost must reach postgres@pghost for backups, and the postgres user on pghost must reach barman@bmhost for WAL archiving. Any standard ssh key guide will do.
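A minimal version of that setup might look like this (key type and paths are just one reasonable choice):
# On bmhost, as the barman user: create a key and install it on pghost
ssh-keygen -t rsa -N "" -f ~/.ssh/id_rsa
ssh-copy-id postgres@pghost
# On pghost, as the postgres user: the same in the other direction (for WAL archiving)
ssh-keygen -t rsa -N "" -f ~/.ssh/id_rsa
ssh-copy-id barman@bmhost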
The default location for the main barman configuration file is /etc/barman.conf
Enter the following values:
[barman]
barman_home = /data/barman
barman_user = barman
log_file = /var/log/barman/barman.log
compression = gzip
reuse_backup = link
backup_method = rsync
archiver = on
I'm assuming that /data/ is a separate, secure drive that can be extended; this is where all backups will live, and you don't want to run out of space. To take backups with pg_basebackup over a streaming connection instead of rsync-based incremental ones, change backup_method to 'postgres'.
Each postgres cluster that you want Barman to manage gets its own configuration file. A good practice is to name it hostname.conf. These files are typically stored in /etc/barman.d .
Our server configuration file will contain the following lines:
[pghost]
description = "Postgres server"
ssh_command = ssh postgres@pghost
conninfo = host=pghost user=postgres port=5432
retention_policy_mode = auto
retention_policy = RECOVERY WINDOW OF 7 days
wal_retention_policy = main
Instead of having Barman keep a time window, you can have it hold a number of backups. Do this by changing retention_policy to:
retention_policy = REDUNDANCY 4
Naturally, you can change the number to any integer. If using a time window, you can specify it in weeks or months as well.
Next we allow Barman to connect to the postgres server, by editing Postgres's Host Based Authentication file, pg_hba.conf . You can usually find it in the data directory we mentioned earlier, $PGDATA (on Debian-based installs it may live under /etc/postgresql/ instead).
The file must contain the following line, where the address is the Barman server's IP (the connection comes from bmhost):
host all postgres bmhost_ip_address/32 trust
If you care about security, you should set the method to something other than trust: md5 requires a password, and peer or ident are options as well. More information can be found in the Postgres documentation on pg_hba.conf.
The postgresql.conf , also in $PGDATA, must accept incoming requests from the Barman server. Set:
listen_addresses = '*'
to listen on all interfaces.
Next we need to make sure that WAL streaming is set up correctly.
In the postgresql.conf :
archive_mode = on
archive_command = 'rsync -a %p barman@bmhost:/PATH/TO/WALS/%f'
Replace /PATH/TO/WALS/ with the actual path Barman expects incoming WALs to land in, verifiable by running:
barman show-server pghost | grep incoming_wals_directory
%p expands to the full path of the WAL segment being archived, and %f to its file name.
After changing postgresql.conf we need to apply the new settings. archive_command only needs a reload (pg_ctl reload, or systemctl reload postgresql), but listen_addresses and archive_mode require a full restart. This will bounce your db, so be careful!!!
pg_ctl restart
or
systemctl restart postgresql
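You can confirm the settings took effect with a quick query (run via psql as a superuser on pghost; pg_settings is a standard catalog view):
# Check archiving and connectivity settings from inside Postgres
psql -c "SELECT name, setting FROM pg_settings WHERE name IN ('listen_addresses', 'archive_mode', 'archive_command');"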
On the Barman server you can list all managed servers with
barman list-server
To check a particular server’s status:
barman check pghost
Everything should return “OK” except for the minimum redundancy/retention policy, as we have not yet taken any backups.
If the WAL archive check fails, make sure the database server is rsyncing WALs to the correct path on bmhost.
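A quick sanity check, run as the postgres user on pghost (the host name matches this example setup):
# Verify passwordless ssh works from the database server to the Barman server
ssh barman@bmhost true
# Verify rsync over ssh works end to end (using /tmp so no stray files land in the WAL directory)
touch /tmp/wal-test
rsync -a /tmp/wal-test barman@bmhost:/tmp/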
To force start a new WAL file:
barman switch-wal pghost
There are many guides and tutorials for configuring barman. If you are still stuck, try a web search; there is a lot of good information out there.
Once everything is "OK", back up the server for the first time:
barman backup pghost
You can then check the status of backups for a server with
barman list-backup pghost
Each backup will have an ID with a timestamp that you can check with:
barman show-backup pghost BACKUPID
At this point you can schedule a cron job to run the backups. Besides the backup command itself, you have to schedule the command "barman cron". In recent packaged installs of Barman this entry is generated automatically. If not, set the following:
* * * * * /usr/bin/barman cron
To back up your server once a day at 8 pm:
0 20 * * * /usr/bin/barman backup pghost
To restore your backup, you use the barman recover command to create a data directory:
barman recover <server_name> <backup_id> /path/to/recover/dir
DO NOT point this at the live data directory you are restoring. Instead, set up a new instance. If you MUST restore to the same location, first stop the original server if it is running and remove the $PGDATA contents. It is HIGHLY recommended to restore to a third location first, to verify that the cluster restores properly before replacing your production PGDATA.
If you want to restore to a remote host, which you usually do, add an ssh command to barman recover:
barman recover <server_name> <backup_id> --remote-ssh-command <COMMAND> /path/to/recover/dir
You can add another flag to set a PITR recover:
--target-time TARGET_TIME
Where TARGET_TIME is a standard timestamp: '2019-11-03 13:00:00'
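Putting the pieces together, a remote PITR restore might look like this (the backup ID and paths are examples):
barman recover pghost 20191103T020000 \
  --remote-ssh-command "ssh postgres@pghost" \
  --target-time '2019-11-03 13:00:00' \
  /var/lib/postgresql/recovered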
Once your data directory has been created, use pg_ctl to start the server using the new directory.
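For example, with the path used above:
pg_ctl -D /var/lib/postgresql/recovered start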
That's it. You now have incremental backups managed by Barman.
If you encounter any issues, go over the barman manual:
docs.pgbarman.org/release/2.9/
Good luck, and happy Postgres-ing!!