In our production system we have repmgr for managing failover for our Postgres databases. However, in one incident we had a while ago, our Primary (master) node went down due to a network problem, and repmgr didn’t perform the failover as expected. On reviewing the logs we saw that one of the standby(slave) instances recognized the master failure, and initiated a vote, but the other nodes ignored it and no failover occurred.
Eventually we discovered the reason why- the “ignoring” nodes and the primary server were all on the same physical compute. (In an Openstack cloud environment). This may be worth looking into!