Bug 1957708

Summary: e2e-metal-ipi and related jobs fail to bootstrap due to multiple VIP's
Product: OpenShift Container Platform Reporter: Stephen Benjamin <stbenjam>
Component: Networking Assignee: Beth White <beth.white>
Networking sub component: runtime-cfg QA Contact: Victor Voronkov <vvoronko>
Status: CLOSED ERRATA Docs Contact:
Severity: urgent    
Priority: urgent CC: akrzos, bnemec, ercohen, jniu, odepaz
Version: 4.8 Keywords: Triaged
Target Milestone: ---   
Target Release: 4.8.0   
Hardware: Unspecified   
OS: Unspecified   
Last Closed: 2021-07-27 23:06:39 UTC Type: Bug
Bug Blocks: 1962949    

Description Stephen Benjamin 2021-05-06 10:40:12 UTC
Description of problem:

We're starting to see bootstrapping failures in which the VIP ends up assigned on both the bootstrap host and a control plane host at the same time.

In this run, both the bootstrap host and master-1 hold 192.168.111.5: https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.8-e2e-metal-ipi/1389177323177119744
$ cat bootstrap/network/ip-addr.txt | grep 111.5/
    inet 192.168.111.5/32 scope global ens3
$ cat control-plane/192.168.111.21/network/ip-addr.txt| grep 111.5/ 
    inet 192.168.111.5/32 scope global enp2s0
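
The same check could be scripted across every host in a log bundle instead of grepping one file at a time. This is only a minimal sketch: it assumes each host's `ip addr` output has already been read into a string, and the names `addresses` and `shared_addresses` are made up for illustration and are not part of the installer or the gather scripts.

```python
import ipaddress
import re
from collections import defaultdict

# Matches lines such as "    inet 192.168.111.5/32 scope global ens3"
# and captures the address without its prefix length.
INET_RE = re.compile(r"^\s*inet6?\s+(\S+?)/\d+\s", re.MULTILINE)


def addresses(ip_addr_text):
    """Return the set of addresses found in `ip addr` output."""
    return set(INET_RE.findall(ip_addr_text))


def shared_addresses(outputs):
    """Map each address held by more than one host to the sorted host names.

    `outputs` maps a host label (hypothetical, e.g. "bootstrap") to the
    text of that host's `ip addr` output. Loopback and link-local
    addresses are skipped because they legitimately repeat across hosts.
    """
    holders = defaultdict(set)
    for host, text in outputs.items():
        for addr in addresses(text):
            ip = ipaddress.ip_address(addr)
            if ip.is_loopback or ip.is_link_local:
                continue
            holders[addr].add(host)
    return {a: sorted(h) for a, h in holders.items() if len(h) > 1}
```

Any address returned by `shared_addresses` is held by more than one host, which is the condition shown by the two grep results above.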

Version-Release number of selected component (if applicable):

4.8 nightly

How reproducible:

Often; it seems to happen more frequently with IPv6.

Additional info:

The installer log bundle has included networking information since https://github.com/openshift/installer/pull/4892

Comment 1 Ben Nemec 2021-05-11 16:37:35 UTC
It looks like the behavior of the unicast_peer config option in keepalived.conf changed between 2.0.10 and 2.1.5. In 2.0.10, if you had an empty unicast_peer config, keepalived would still respect unicast traffic from other nodes. In 2.1.5, it seems to ignore traffic from other nodes and will take the VIP regardless of what the other nodes do. There appears to be a race where a master can come up with an empty peer list (even though we try to avoid that).
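
One way to spot the condition described above is to inspect a rendered keepalived.conf for a peer block with no entries. The following is a rough sketch, not the fix that went into runtime-cfg: it assumes the block is written as `unicast_peer { ... }` (the spelling keepalived.conf uses), treats `#` and `!` as comment markers, and does not handle a `}` inside a comment. The function names are hypothetical.

```python
import re

# Captures the body of each "unicast_peer { ... }" block.
PEER_BLOCK_RE = re.compile(r"unicast_peer\s*\{([^}]*)\}")


def peer_lists(conf_text):
    """Return one list of peer entries per unicast_peer block."""
    lists = []
    for body in PEER_BLOCK_RE.findall(conf_text):
        peers = []
        for line in body.splitlines():
            # Drop trailing comments, then surrounding whitespace.
            line = re.split(r"[#!]", line, maxsplit=1)[0].strip()
            if line:
                peers.append(line)
        lists.append(peers)
    return lists


def has_empty_peer_list(conf_text):
    """True if any unicast_peer block is present but lists no peers."""
    return any(not peers for peers in peer_lists(conf_text))
```

A config with no `unicast_peer` block at all returns False here, since the sketch only flags a block that is present but empty.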

Comment 3 Yossi Boaron 2021-05-19 06:46:34 UTC
*** Bug 1955082 has been marked as a duplicate of this bug. ***

Comment 4 Eran Cohen 2021-05-20 07:25:20 UTC
*** Bug 1936502 has been marked as a duplicate of this bug. ***

Comment 5 Nataf Sharabi 2021-05-25 13:43:12 UTC
We haven't seen any job failures caused by this issue in the last week, especially on IPv6.

Verifying.

Comment 9 errata-xmlrpc 2021-07-27 23:06:39 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.8.2 bug fix and security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:2438