Repmgr failover didn’t happen

Nov 2, 2020

In our production system we have repmgr for managing failover for our Postgres databases. However, in one incident we had a while ago, our Primary (master) node went down due to a network problem, and repmgr didn’t perform the failover as expected. On reviewing the logs we saw that one of the standby(slave) instances recognized the master failure, and initiated a vote, but the other nodes ignored it and no failover occurred.

Eventually we discovered the reason why- the “ignoring” nodes and the primary server were all on the same physical compute. (In an Openstack cloud environment). This may be worth looking into!

Repmgr failover didn’t happen

Written by Kobi Rosenstein