Cause:
The keepalived-monitor container periodically compares the current node status to the previous status and updates the Keepalived configuration file accordingly.
A bug in this previous-to-current comparison code produced an incorrect Keepalived configuration file.
Consequence:
Two nodes owned the API VIP, and as a result the deployment broke.
Fix:
Update the previous-to-current comparison code in the keepalived-monitor container.
Result:
The deployment succeeds.
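The fix above concerns how the monitor decides whether to rewrite the Keepalived config. A minimal sketch of that decision, assuming a hypothetical NodeStatus type (field names here are illustrative, not the actual machine-config-operator code): comparing the whole struct at once avoids the class of bug where a missed or mis-compared field leaves a stale config in place.

```go
package main

import (
	"fmt"
	"reflect"
)

// NodeStatus is a hypothetical snapshot of the state the monitor tracks
// between polling intervals.
type NodeStatus struct {
	IsBootstrap  bool
	APIReachable bool
	VIPOwner     string
}

// needsConfigUpdate reports whether the Keepalived config must be
// regenerated. A whole-struct comparison (DeepEqual) catches any field
// change; field-by-field comparisons are where a bug like this hides.
func needsConfigUpdate(prev, cur NodeStatus) bool {
	return !reflect.DeepEqual(prev, cur)
}

func main() {
	prev := NodeStatus{IsBootstrap: true, APIReachable: false, VIPOwner: "bootstrap"}
	cur := NodeStatus{IsBootstrap: true, APIReachable: true, VIPOwner: "master-1"}
	fmt.Println(needsConfigUpdate(prev, cur)) // true: regenerate config
	fmt.Println(needsConfigUpdate(cur, cur))  // false: nothing to do
}
```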
Description of problem:
We're seeing bootstrap failures, which I reproduced and traced to the API VIP being up on both the bootstrap VM and a master; as a result, the kube-api endpoint is not reachable by the installer and bootstrapping times out.
$ sudo virsh net-dhcp-leases ostestbm
Expiry Time MAC address Protocol IP address Hostname Client ID or DUID
-------------------------------------------------------------------------------------------------------------------
2020-08-07 11:51:51 00:41:7c:a4:d3:02 ipv4 192.168.111.20/24 master-0 01:00:41:7c:a4:d3:02
2020-08-07 11:51:50 00:41:7c:a4:d3:06 ipv4 192.168.111.21/24 master-1 01:00:41:7c:a4:d3:06
2020-08-07 11:51:53 00:41:7c:a4:d3:0a ipv4 192.168.111.22/24 master-2 01:00:41:7c:a4:d3:0a
2020-08-07 11:42:36 52:54:00:0e:5f:2c ipv4 192.168.111.28/24 - 01:52:54:00:0e:5f:2c
$ ssh core@192.168.111.28 ip a
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
inet 127.0.0.1/8 scope host lo
valid_lft forever preferred_lft forever
inet6 ::1/128 scope host
valid_lft forever preferred_lft forever
2: ens3: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc fq_codel state UP group default qlen 1000
link/ether 52:54:00:0e:5f:2c brd ff:ff:ff:ff:ff:ff
inet 192.168.111.28/24 brd 192.168.111.255 scope global dynamic noprefixroute ens3
valid_lft 2423sec preferred_lft 2423sec
inet 192.168.111.5/24 scope global secondary ens3
valid_lft forever preferred_lft forever
inet6 fe80::5054:ff:fe0e:5f2c/64 scope link noprefixroute
valid_lft forever preferred_lft forever
3: ens4: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc fq_codel state UP group default qlen 1000
link/ether 52:54:00:1a:12:11 brd ff:ff:ff:ff:ff:ff
inet 172.22.0.2/24 brd 172.22.0.255 scope global noprefixroute ens4
valid_lft forever preferred_lft forever
inet6 fe80::ea56:e09d:7473:abe8/64 scope link noprefixroute
valid_lft forever preferred_lft forever
$ sudo arping -I ostestbm 192.168.111.5
ARPING 192.168.111.5 from 192.168.111.1 ostestbm
Unicast reply from 192.168.111.5 [52:54:00:0E:5F:2C] 0.743ms
Unicast reply from 192.168.111.5 [00:41:7C:A4:D3:06] 0.775ms
Unicast reply from 192.168.111.5 [00:41:7C:A4:D3:06] 0.694ms
Unicast reply from 192.168.111.5 [00:41:7C:A4:D3:06] 0.713ms
Unicast reply from 192.168.111.5 [00:41:7C:A4:D3:06] 0.827ms
Unicast reply from 192.168.111.5 [00:41:7C:A4:D3:06] 0.847ms
^CSent 5 probes (1 broadcast(s))
Received 6 response(s)
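The arping run above received six replies from two distinct MACs (the bootstrap VM at 52:54:00:0E:5F:2C and master-1 at 00:41:7C:A4:D3:06), confirming the split-brain. A small helper to flag this condition from captured arping output (a sketch; it assumes the iputils-arping "Unicast reply from <ip> [<MAC>]" line format shown above):

```go
package main

import (
	"fmt"
	"regexp"
)

// macsAnswering extracts the set of distinct MAC addresses seen in
// iputils-arping "Unicast reply from <ip> [<MAC>] ..." lines.
// More than one distinct MAC means more than one host holds the VIP.
func macsAnswering(output string) []string {
	re := regexp.MustCompile(`\[([0-9A-Fa-f:]{17})\]`)
	seen := map[string]bool{}
	var macs []string
	for _, m := range re.FindAllStringSubmatch(output, -1) {
		if !seen[m[1]] {
			seen[m[1]] = true
			macs = append(macs, m[1])
		}
	}
	return macs
}

func main() {
	out := `Unicast reply from 192.168.111.5 [52:54:00:0E:5F:2C] 0.743ms
Unicast reply from 192.168.111.5 [00:41:7C:A4:D3:06] 0.775ms`
	macs := macsAnswering(out)
	if len(macs) > 1 {
		fmt.Printf("VIP conflict: %d hosts answering: %v\n", len(macs), macs)
	}
}
```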
How reproducible:
Seems flaky; it sometimes works, but we're seeing a lot of similar failures in CI.
Actual results:
The API VIP is pointing at two hosts :-O
Expected results:
The API VIP should always point to exactly one host.
Additional info:
The fix in https://github.com/openshift/machine-config-operator/pull/1972 may be related; testing is needed to confirm whether it helps.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.
For information on the advisory (OpenShift Container Platform 4.6 GA Images), and where to find the updated files, follow the link below.
If the solution does not work for you, open a new bug report.
https://access.redhat.com/errata/RHBA-2020:4196