Description of problem:
We're starting to see bootstrapping failures in which the VIP ends up on both the bootstrap host and a control plane host at the same time.
In this job, both the bootstrap host and master-1 hold 192.168.111.5: https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.8-e2e-metal-ipi/1389177323177119744
$ cat bootstrap/network/ip-addr.txt | grep 111.5/
inet 192.168.111.5/32 scope global ens3
$ cat control-plane/192.168.111.21/network/ip-addr.txt | grep 111.5/
inet 192.168.111.5/32 scope global enp2s0
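The manual grep above can be generalized to scan a whole log bundle. The following is a minimal sketch, assuming the bundle layout matches the paths shown above (`bootstrap/network/ip-addr.txt` and `control-plane/<ip>/network/ip-addr.txt`); the function names `holds_vip` and `holders_of_vip` are hypothetical and not part of any installer tooling.

```python
import re
from pathlib import Path


def holds_vip(ip_addr_text, vip):
    """Return True if `ip addr` output lists `vip` as an inet/inet6 address."""
    # Anchor on the full address followed by the prefix length so that
    # 192.168.111.5 does not match 192.168.111.50 or 192.168.111.15.
    pattern = re.compile(
        r"^\s*inet6?\s+" + re.escape(vip) + r"/\d+\b",
        re.MULTILINE | re.IGNORECASE,
    )
    return pattern.search(ip_addr_text) is not None


def holders_of_vip(bundle_dir, vip):
    """Return the ip-addr.txt files (relative to the bundle) that list `vip`."""
    root = Path(bundle_dir)
    hits = []
    # Assumed layout: every host directory contains network/ip-addr.txt.
    for path in root.glob("**/network/ip-addr.txt"):
        if holds_vip(path.read_text(errors="replace"), vip):
            hits.append(path.relative_to(root).as_posix())
    return sorted(hits)
```

If `holders_of_vip(".", "192.168.111.5")` returns more than one file for an extracted bundle, the VIP is held by more than one host, which is the failure described here.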
Version-Release number of selected component (if applicable):
How reproducible:
Often; it seems to happen more with IPv6.
The installer log bundle has included networking information since https://github.com/openshift/installer/pull/4892
It looks like the behavior of the unicast_peers config option in keepalived.conf changed from 2.0.10 to 2.1.5. In 2.0.10 if you had an empty unicast_peers config it would still respect unicast traffic from other nodes. In 2.1.5, it seems to ignore traffic from other nodes and will take the VIP regardless of what the other nodes do. There appears to be a race where a master can come up with an empty peer list (even though we try to avoid that).
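To check whether a node rendered its config with an empty peer list, the keepalived.conf text can be inspected directly. This is a rough sketch, not the fix that was shipped: the function name `empty_peer_blocks` is hypothetical, and the regex accepts both `unicast_peer` (the keyword as keepalived spells it, to the best of my knowledge) and `unicast_peers` as written above.

```python
import re

# Matches a peer block and captures its body. Nested braces are not expected
# inside this block, so a simple non-brace match is assumed to be enough.
_PEER_BLOCK = re.compile(r"\bunicast_peers?\s*\{([^{}]*)\}")


def empty_peer_blocks(conf_text):
    """Count unicast peer blocks that contain no addresses.

    Blank lines and comments (keepalived uses '#' and '!') are ignored.
    """
    count = 0
    for body in _PEER_BLOCK.findall(conf_text):
        # Strip trailing comments, then surrounding whitespace, from each line.
        entries = [
            re.split(r"[#!]", line, maxsplit=1)[0].strip()
            for line in body.splitlines()
        ]
        if not any(entries):
            count += 1
    return count
```

A non-zero result on a master's rendered config would be consistent with the race described above, where the node comes up before it has learned about its peers.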
*** Bug 1955082 has been marked as a duplicate of this bug. ***
*** Bug 1936502 has been marked as a duplicate of this bug. ***
We haven't seen any job failures caused by this issue in the last week.
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.
For information on the advisory (Moderate: OpenShift Container Platform 4.8.2 bug fix and security update), and where to find the updated
files, follow the link below.
If the solution does not work for you, open a new bug report.