Bug 1867080

Summary: baremetal: API VIP ends up pointing to bootstrap and a master
Product: OpenShift Container Platform Reporter: Steven Hardy <shardy>
Component: Networking Assignee: Yossi Boaron <yboaron>
Networking sub component: runtime-cfg QA Contact: Victor Voronkov <vvoronko>
Status: CLOSED ERRATA Docs Contact:
Severity: urgent    
Priority: urgent CC: agarcial, asegurap, vvoronko
Version: 4.6 Keywords: Triaged
Target Milestone: ---   
Target Release: 4.6.0   
Hardware: Unspecified   
OS: Unspecified   
Fixed In Version: Doc Type: Bug Fix
Doc Text:
Cause: The Keepalived monitor container periodically compares the current status to the previous status and updates the Keepalived config file accordingly. A bug in the previous-to-current comparison code caused a wrong Keepalived config file to be written.
Consequence: Two nodes owned the API VIP, and as a result the deployment broke.
Fix: The previous-to-current comparison code in the keepalived-monitor container was updated.
Result: Deployment succeeds.
Story Points: ---
Clone Of: Environment:
Last Closed: 2020-10-27 16:25:54 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
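
The Doc Text above describes a monitor that compares the previous status to the current one and rewrites the Keepalived config when they differ. A minimal sketch of that general pattern follows. It is not the actual keepalived-monitor code and does not reproduce the bug itself; `VipState`, its fields, `ConfigMonitor`, and the `render` callback are all hypothetical names chosen for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Tuple


@dataclass(frozen=True)
class VipState:
    """Hypothetical snapshot of the inputs that determine the Keepalived config."""
    bootstrap_active: bool
    peers: Tuple[str, ...]


class ConfigMonitor:
    """Re-render the config only when the observed state changes."""

    def __init__(self, render: Callable[[VipState], None]) -> None:
        self._render = render                        # hypothetical: writes the config file
        self._previous: Optional[VipState] = None    # nothing rendered yet

    def tick(self, current: VipState) -> bool:
        """Compare by value against the last rendered state; return True if rendered."""
        if current == self._previous:
            return False
        self._render(current)
        # Remember what was rendered so the next comparison uses the right baseline.
        self._previous = current
        return True


if __name__ == "__main__":
    rendered = []
    monitor = ConfigMonitor(rendered.append)
    monitor.tick(VipState(True, ("master-0", "master-1")))
    monitor.tick(VipState(True, ("master-0", "master-1")))   # unchanged, no render
    monitor.tick(VipState(False, ("master-0", "master-1")))  # changed, render
    print(len(rendered))
```

The point of the sketch is that the comparison must be by value and the stored baseline must be updated after every render; a mistake in either step can leave a stale config in place.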

Description Steven Hardy 2020-08-07 10:23:26 UTC
Description of problem:

We're seeing bootstrap failures, which I reproduced and traced to the API VIP being up on both the bootstrap VM and a master. As a result, the kube-api endpoint is not accessible to the installer and bootstrapping times out.

$ sudo virsh net-dhcp-leases ostestbm
 Expiry Time          MAC address        Protocol  IP address                Hostname        Client ID or DUID
 2020-08-07 11:51:51  00:41:7c:a4:d3:02  ipv4         master-0        01:00:41:7c:a4:d3:02
 2020-08-07 11:51:50  00:41:7c:a4:d3:06  ipv4         master-1        01:00:41:7c:a4:d3:06
 2020-08-07 11:51:53  00:41:7c:a4:d3:0a  ipv4         master-2        01:00:41:7c:a4:d3:0a
 2020-08-07 11:42:36  52:54:00:0e:5f:2c  ipv4         -               01:52:54:00:0e:5f:2c
$ ssh core@ ip a
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host
       valid_lft forever preferred_lft forever
2: ens3: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc fq_codel state UP group default qlen 1000
    link/ether 52:54:00:0e:5f:2c brd ff:ff:ff:ff:ff:ff
    inet brd scope global dynamic noprefixroute ens3
       valid_lft 2423sec preferred_lft 2423sec
    inet scope global secondary ens3
       valid_lft forever preferred_lft forever
    inet6 fe80::5054:ff:fe0e:5f2c/64 scope link noprefixroute
       valid_lft forever preferred_lft forever
3: ens4: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc fq_codel state UP group default qlen 1000
    link/ether 52:54:00:1a:12:11 brd ff:ff:ff:ff:ff:ff
    inet brd scope global noprefixroute ens4
       valid_lft forever preferred_lft forever
    inet6 fe80::ea56:e09d:7473:abe8/64 scope link noprefixroute
       valid_lft forever preferred_lft forever
$ sudo arping -I ostestbm
ARPING from ostestbm
Unicast reply from [52:54:00:0E:5F:2C]  0.743ms
Unicast reply from [00:41:7C:A4:D3:06]  0.775ms
Unicast reply from [00:41:7C:A4:D3:06]  0.694ms
Unicast reply from [00:41:7C:A4:D3:06]  0.713ms
Unicast reply from [00:41:7C:A4:D3:06]  0.827ms
Unicast reply from [00:41:7C:A4:D3:06]  0.847ms
^CSent 5 probes (1 broadcast(s))
Received 6 response(s)

How reproducible:

Seems flaky: it sometimes works, but we're seeing a lot of similar failures in CI.

Actual results:

API VIP pointing at two hosts :-O

Expected results:

The API VIP should always point to exactly one host.
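
One way to check this expectation automatically is to count the distinct MAC addresses that answer ARP for the VIP. The sketch below parses arping output like the capture above. It assumes each reply line contains the responder's MAC in square brackets; the function names are hypothetical, and the sample is a subset of the replies from this report.

```python
import re
from typing import Set

# Matches the bracketed MAC address in an arping reply line.
MAC_IN_BRACKETS = re.compile(r"\[([0-9A-Fa-f]{2}(?::[0-9A-Fa-f]{2}){5})\]")


def arp_responders(arping_output: str) -> Set[str]:
    """Return the distinct (lower-cased) MAC addresses that replied."""
    macs = set()
    for line in arping_output.splitlines():
        if "reply from" not in line:
            continue  # skip the banner and the summary lines
        match = MAC_IN_BRACKETS.search(line)
        if match:
            macs.add(match.group(1).lower())
    return macs


def vip_has_single_owner(arping_output: str) -> bool:
    """True when exactly one MAC address answered for the VIP."""
    return len(arp_responders(arping_output)) == 1


SAMPLE = """\
Unicast reply from [52:54:00:0E:5F:2C]  0.743ms
Unicast reply from [00:41:7C:A4:D3:06]  0.775ms
Unicast reply from [00:41:7C:A4:D3:06]  0.694ms
"""

if __name__ == "__main__":
    print(sorted(arp_responders(SAMPLE)))
    print(vip_has_single_owner(SAMPLE))
```

For the sample, two responders are found: 52:54:00:0e:5f:2c (ens3 on the bootstrap VM) and 00:41:7c:a4:d3:06 (master-1 in the DHCP leases), so the single-owner check fails.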

Additional info:

The fix in https://github.com/openshift/machine-config-operator/pull/1972 may be related; testing is needed to confirm whether it helps.

Comment 1 Steven Hardy 2020-08-07 11:46:56 UTC
The MCO fix in the additional info doesn't resolve the issue. Testing now with a revert from unicast: https://github.com/openshift/machine-config-operator/pull/1987

Comment 4 Andrea Fasano 2020-08-11 16:33:54 UTC
*** Bug 1866419 has been marked as a duplicate of this bug. ***

Comment 7 errata-xmlrpc 2020-10-27 16:25:54 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (OpenShift Container Platform 4.6 GA Images), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.