Trying to solve problems, only to find other problems

pg_rewind Gotchas

pg_rewind Gotchas


When I have started this blog I was planing to write once or twice a month, and the backlog of topics that I have should suffice for a year. Despite of having plenty of topics to write about I seem to have a fair lack of time for doing so. Having said that, this is an attempt to write at least on post this year.

Rebuilding a Slave

It happens from time to time that your master crashes and your slave promotes to a new master. In that situation you have to demote the old master to a slave. In PostgreSQL that is usually done using pg_start_backup/rsync/pg_stop_backup or pg_basebackup. In either case you copy the entire data folder from the new master to the slave. And if it is a large instance you are limited by the disk speed and network bandwidth which results in a really long waiting time for the rebuild process (up to 5 hours in my case). But the old master should be very similar to the new master as they diverged in the recent past, can something revert the changes on the old master and apply the changes after the timeline fork? This smart rewinding is something that pg_rewind does. If you read about it you will find the information that it rewinds the instance to the time of the last checkpoint before the fork. In theory it all sounds wonderful, and if you read a couple of posts from Giuseppe Broccolo (Part1, Part2, Part3) you can get a grasp how this could work in production. But, it is not that simple…


For pg_rewind to work you have to enable wal_log_hints or data checksums (-k or --data-checksums when using initdb). Having page checksums enabled is a new feature added in 9.3 and I advise you to use it because it can help you detect disk problems and avoid database corruption. The wal_log_hints forces your database to behave similarly, but without the checksuming. An interesting part in the documentation about the wal_log_hints is the following:

When this parameter is on, the PostgreSQL server writes the entire content of each disk page to WAL during the first modification of that page after a checkpoint, even for non-critical modifications of so-called hint bits.

If data checksums are enabled, hint bit updates are always WAL-logged and this setting is ignored. You can use this setting to test how much extra WAL-logging would occur if your database had data checksums enabled.

So, there is a way to test how much more disk usage for WAL files you will have if you enable data checksums. I didn’t use it before, because the database was not initialised that way, but I wanted to use this feature for our new instances. Having two new production servers comes very handy when you want to test things.

To have a baseline I run the following commands on a cluster with no data checksums and wal_log_hints = off:

createdb benchdisk
pgbench -i -s 10000 benchdisk
# This created a WAL archive of 121GB

pgbench -c 32 -j 16 -t 100000 benchdisk
# The WAL archive is now 167GB. Increase of 167 - 121 = 46GB

pgbench -c 32 -j 16 -t 100000 -N benchdisk
# The WAL archive is now 209GB. Increase of 209 - 167 = 42GB

Now, I run the same test with either data checksum or wal_log_hints = on, the results were the same:

createdb benchdisk
pgbench -i -s 10000 benchdisk
# This creates a WAL archive of 245GB

pgbench -c 32 -j 16 -t 100000 benchdisk
# The WAL archive is now 292GB. Increase of 292 - 245 = 47GB

pgbench -c 32 -j 16 -t 100000 -N benchdisk
# The WAL archive is now 334GB. Increase of 334 - 292 = 42GB

So, the numbers are really interesting, when creating the database the volume of generated WAL files is doubled. As for the rest of the tests, the numbers are similar. Tweaking checkpoint parameters might reduce this numbers, but the tested setup was already tweaked with high values for checkpoint parameters.

Making it work

After thinking about the numbers from the test I decided to use data checksums despite the possible double disk usage for the WAL archive. I have set up a standard master slave cluster as in our regular production, with all WAL streaming and archiving that we usually have. I forced a failover and then connected to the old master to check how pg_rewind works:

pg_rewind --target-pgdata=/var/lib/postgresql/9.6/main --source-server="postgresql://postgres@server-two/postgres"
stdout: servers diverged at WAL position 0/45B90D00 on timeline 1
could not open file "/var/lib/postgresql/9.6/main/pg_xlog/000000010000000000000040": No such file or directory

could not find previous WAL record at 0/40FFFFD8
Failure, exiting

And here we have another problem, pg_rewind is looking for a WAL file that is not present any more in the pg_xlog folder because it has been removed by PostgreSQL. The removal of WAL files from the pg_xlog folder is expected because we have streamed the WAL on the archive server and the file can now be removed when the number of WAL files in the pg_xlog folder reaches the configured maximum. The problem is that pg_rewind can’t cope with this regular situation. How to fix this? Well, you can copy the file manually and rerun pg_rewind. The misfortune of that solution is that pg_rewind can tell you, after processing the copied file, that it needs the previous one also. There is no way to know how many files you will need in advance. And, requiring manual intervention in something you want to automate is not a solution. Examining the error output of pg_rewind you can find the name of the WAL file you need. It is now just a matter of a simple regex ([0-9A-F]{24}), a wile loop in a bash script and good error handling :) It is not a very good solution, but it works with the available tools.


pg_rewind is a good tool for the job, it reduced our slave rebuilding time from a few hours to a few minutes, and the shorter time reduces additional connectivity issues that could arise in such a long process. To make it usable I had to wrap it into a bash script to circumvent the lack of a parameter similar to the restore_command that you usually use in recovery.conf on the slave. The developers should really try to add this functionality as it is clearly needed in any automated environment. Additionally, we learned that using page checksums can cause an increase in the WAL volume that in some cases is not negligible.

P.S.: If you have noticed the new image style for this article, it is made with Marmoset Hexels, a tool that I’ll try to use for the next articles if my drawing skills improve a bit.

Mladen Marinović

Mladen Marinović

CTO @ Smartivo