Bug 1957708 - e2e-metal-ipi and related jobs fail to bootstrap due to multiple VIP's
Summary: e2e-metal-ipi and related jobs fail to bootstrap due to multiple VIP's
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Networking
Version: 4.8
Hardware: Unspecified
OS: Unspecified
Priority: urgent
Severity: urgent
Target Milestone: ---
Target Release: 4.8.0
Assignee: Beth White
QA Contact: Victor Voronkov
URL:
Whiteboard:
Duplicates: 1936502 1955082
Depends On:
Blocks: 1962949
Reported: 2021-05-06 10:40 UTC by Stephen Benjamin
Modified: 2022-04-25 05:35 UTC (History)
CC List: 5 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2021-07-27 23:06:39 UTC
Target Upstream Version:
Embargoed:


Links
System | ID | Private | Priority | Status | Summary | Last Updated
Github | openshift baremetal-runtimecfg pull 137 | 0 | None | open | Bug 1957708: Keepalived- verify that unicast peers list isn't empty on master nodes | 2021-05-11 14:50:05 UTC
Red Hat Product Errata | RHSA-2021:2438 | 0 | None | None | None | 2021-07-27 23:07:12 UTC

Description Stephen Benjamin 2021-05-06 10:40:12 UTC
Description of problem:

We're starting to see bootstrapping failures in which the VIP ends up assigned to both the bootstrap host and a control plane host at the same time.

In this run, both the bootstrap host and master-1 hold 192.168.111.5: https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.8-e2e-metal-ipi/1389177323177119744
$ cat bootstrap/network/ip-addr.txt | grep 111.5/
    inet 192.168.111.5/32 scope global ens3
$ cat control-plane/192.168.111.21/network/ip-addr.txt| grep 111.5/ 
    inet 192.168.111.5/32 scope global enp2s0
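
The manual grep above can be generalized to every host in a log bundle. The following is only a rough sketch, not tooling from the installer: `hosts_holding_vip` is a hypothetical helper, the input is assumed to be the text of each host's `ip-addr.txt`, and the `192.168.111.20` entry is made-up sample data added to show a host that does not hold the VIP.

```python
import re


def hosts_holding_vip(ip_addr_outputs, vip):
    """Return the sorted host names whose `ip addr` output lists `vip`.

    ip_addr_outputs maps a host label to the text of its ip-addr.txt.
    """
    # Anchor on "inet <vip>/<prefix>" so 192.168.111.5 does not match 192.168.111.50.
    pattern = re.compile(
        r"^\s*inet6?\s+" + re.escape(vip) + r"/\d+\b", re.MULTILINE
    )
    return sorted(
        host for host, text in ip_addr_outputs.items() if pattern.search(text)
    )


# The first two lines are taken from this report; the third is hypothetical.
outputs = {
    "bootstrap": "    inet 192.168.111.5/32 scope global ens3\n",
    "control-plane/192.168.111.21": "    inet 192.168.111.5/32 scope global enp2s0\n",
    "control-plane/192.168.111.20": "    inet 192.168.111.20/24 scope global enp2s0\n",
}

holders = hosts_holding_vip(outputs, "192.168.111.5")
if len(holders) > 1:
    print("VIP held by more than one host:", ", ".join(holders))
```

More than one holder for the same VIP is the symptom described in this bug.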

Version-Release number of selected component (if applicable):

4.8 nightly

How reproducible:

Often; it seems to happen more frequently with IPv6.

Additional info:

The installer log bundle now has networking information since https://github.com/openshift/installer/pull/4892

Comment 1 Ben Nemec 2021-05-11 16:37:35 UTC
It looks like the behavior of the unicast_peer config option in keepalived.conf changed between 2.0.10 and 2.1.5. In 2.0.10, an instance with an empty unicast_peer list would still respect unicast traffic from other nodes. In 2.1.5, it seems to ignore traffic from other nodes and takes the VIP regardless of what the other nodes do. There appears to be a race in which a master can come up with an empty peer list (even though we try to avoid that).
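
To make the failure mode concrete, here is a minimal sketch of the kind of guard the linked pull request title describes (verifying that the peer list is not empty on master nodes). It is not the code from baremetal-runtimecfg: `unicast_peers` and `has_unicast_peers` are hypothetical helpers, the `vrrp_instance` name and the peer addresses `.20` and `.22` are invented for illustration, and the parsing handles only the simple layout shown.

```python
import re


def unicast_peers(conf_text):
    """Return the addresses in the first unicast_peer block, or None if absent."""
    match = re.search(r"unicast_peer\s*\{([^}]*)\}", conf_text)
    if match is None:
        return None
    peers = []
    for line in match.group(1).splitlines():
        # keepalived.conf comments start with '#' or '!'
        line = re.split(r"[#!]", line, maxsplit=1)[0].strip()
        if line:
            peers.append(line.split()[0])
    return peers


def has_unicast_peers(conf_text):
    """True only when a unicast_peer block exists and lists at least one peer."""
    return bool(unicast_peers(conf_text))


# Hypothetical rendered config for a master; only .21 and the VIP come from this report.
GOOD_CONF = """
vrrp_instance EXAMPLE_API {
    state BACKUP
    interface enp2s0
    virtual_router_id 51
    priority 40
    unicast_src_ip 192.168.111.21
    unicast_peer {
        192.168.111.20
        192.168.111.22
    }
    virtual_ipaddress {
        192.168.111.5/32
    }
}
"""

# The racy case from this comment: the block is rendered with no peers.
EMPTY_CONF = GOOD_CONF.replace("        192.168.111.20\n        192.168.111.22\n", "")

for name, conf in (("good", GOOD_CONF), ("empty", EMPTY_CONF)):
    print(name, "has peers:", has_unicast_peers(conf))
```

With keepalived 2.1.5 behaving as described in this comment, a master running the second config would claim the VIP without regard to the bootstrap host, so a check like this would have to run before the config is applied.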

Comment 3 Yossi Boaron 2021-05-19 06:46:34 UTC
*** Bug 1955082 has been marked as a duplicate of this bug. ***

Comment 4 Eran Cohen 2021-05-20 07:25:20 UTC
*** Bug 1936502 has been marked as a duplicate of this bug. ***

Comment 5 Nataf Sharabi 2021-05-25 13:43:12 UTC
We have not seen any job failures caused by this issue over the last week, in particular on the IPv6 jobs.

Verifying.

Comment 9 errata-xmlrpc 2021-07-27 23:06:39 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.8.2 bug fix and security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:2438

