Repmgr failover didn’t happen

In our production system we use repmgr to manage failover for our Postgres databases. However, in an incident a while ago, our primary (master) node went down due to a network problem, and repmgr didn’t perform the failover as expected. Reviewing the logs, we saw that one of the standby (slave) instances recognized the primary’s failure and initiated a vote, but the other nodes ignored it and no failover occurred.

Eventually we discovered the reason: the “ignoring” nodes and the primary server were all on the same physical compute host (in an OpenStack cloud environment). If your failover doesn’t trigger, node placement may be worth looking into!
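If you want to check for this kind of placement problem in your own cluster, here is a minimal Python sketch. It assumes you have already collected a mapping of node name to compute host yourself (as far as I know, an OpenStack admin can read this from the `OS-EXT-SRV-ATTR:host` field of `openstack server show`). The node and host names below are made up for illustration, and `colocated_nodes` is my own helper, not part of repmgr or OpenStack.

```python
from collections import defaultdict


def colocated_nodes(node_hosts):
    """Return {host: [nodes]} for every compute host carrying more than one node.

    node_hosts maps a database node name to the compute host it runs on.
    """
    by_host = defaultdict(list)
    for node, host in node_hosts.items():
        by_host[host].append(node)
    # Keep only hosts that carry two or more cluster members.
    return {host: sorted(nodes) for host, nodes in by_host.items() if len(nodes) > 1}


if __name__ == "__main__":
    # Hypothetical placement, similar in shape to the incident described above.
    placement = {
        "pg-primary": "compute-01",
        "pg-standby-1": "compute-02",
        "pg-standby-2": "compute-01",
        "pg-standby-3": "compute-01",
    }
    for host, nodes in colocated_nodes(placement).items():
        print(f"{host} carries several cluster nodes: {', '.join(nodes)}")
```

Any host that shows up in the result is a single point of failure for more than one cluster member, so it is a good place to start looking.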

Kobi Rosenstein

Linux infrastructure guy. This blog chronicles my “gotcha” moments. Each post contains an answer I would have liked to find when trawling Google.