Bug 1931505
| Field | Value | Field | Value |
|---|---|---|---|
| Summary: | [IPI baremetal] Two nodes hold the VIP post remove and start of the Keepalived container | | |
| Product: | OpenShift Container Platform | Reporter: | Yossi Boaron <yboaron> |
| Component: | Machine Config Operator | Assignee: | Yossi Boaron <yboaron> |
| Status: | CLOSED ERRATA | QA Contact: | Eldar Weiss <eweiss> |
| Severity: | urgent | Docs Contact: | |
| Priority: | high | | |
| Version: | 4.7 | CC: | bnemec, bperkins, mkrejci, pmuller, stbenjam, vvoronko, yobshans |
| Target Milestone: | --- | Keywords: | Triaged |
| Target Release: | 4.8.0 | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | Bug Fix |
| Story Points: | --- | | |
| Clone Of: | | | |
| | 1957015 (view as bug list) | Environment: | |
| Last Closed: | 2021-07-27 22:47:38 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Bug Depends On: | | | |
| Bug Blocks: | 1957015 | | |

Doc Text:

Cause: a bug in keepalived 2.0.10.

Consequence: if the liveness probe kills the keepalived container, any VIPs that were assigned to the node remain configured and are not cleaned up when keepalived restarts.

Fix: clean up the VIPs before starting keepalived (an illustrative sketch of this approach follows below).

Result: only a single node holds the VIP.
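The fix described above, and the MCO workaround referenced in the comments (PR 2511), boils down to removing stale VIPs before keepalived starts. The following is a minimal illustrative sketch of that idea only, not the actual MCO change: the VIP, interface name, and keepalived flags are placeholders.

```bash
#!/bin/bash
# Illustrative sketch of "clean up the VIPs before starting keepalived".
# NOT the actual MCO change: VIP, interface, and flags are placeholders.
API_VIP="12.2.2.2/32"
IFACE="eth0"

# If a previous keepalived instance was killed hard (e.g. by the liveness
# probe), the VIP it added may still be configured on this node. Remove it
# before starting a fresh keepalived; ignore the error if it is absent.
ip addr del "${API_VIP}" dev "${IFACE}" 2>/dev/null || true

# Start keepalived; it re-adds the VIP only if this node wins the VRRP
# election, so only a single node ends up holding the VIP.
exec /usr/sbin/keepalived --dont-fork --log-console
```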
Description
Yossi Boaron
2021-02-22 15:10:45 UTC
According to Ben Nemec, the same problem seems to exist on CentOS 8 as well, but on Fedora 33 (keepalived v2.1.5, 07/13/2020) everything works correctly, so the issue appears to have been fixed in later keepalived versions. To reproduce this, I deployed two CentOS 8 VMs and configured keepalived on them as follows:

```
vrrp_instance ostest_API {
    state BACKUP
    interface eth0
    virtual_router_id 14
    priority 70
    advert_int 1
    nopreempt
    unicast_src_ip 12.1.1.122
    unicast_peer {
        12.1.1.111
    }
    authentication {
        auth_type PASS
        auth_pass ostest_api_vip
    }
    virtual_ipaddress {
        12.2.2.2/32
    }
}
```

The other node has the same config with the unicast addresses flipped appropriately. To trigger the problem, I just ran `killall -9 keepalived` to force a hard shutdown on the node holding the VIP. The VIP correctly fails over to the other node, but it never gets removed from the first one, so you end up with it in two places. When I ran the same flow on Fedora 33, the VIP was correctly unconfigured on the node where I killed keepalived. For the record, the CentOS 8 version of keepalived is 2.0.10, the same as in the OCP container.

https://github.com/openshift/machine-config-operator/pull/2511 provides a workaround for this bug.

Bumping priority and severity as this is now frequently causing CI failures and is likely to break real deployments. Since the workaround will address this in the OCP context, I've opened https://bugzilla.redhat.com/show_bug.cgi?id=1946799 against keepalived itself to track the underlying issue.

Description of problem:
Two master nodes hold the API VIP after the Keepalived container is removed on the master node that holds the VIP.

Version-Release number the bug was found on: 4.7.0-0.nightly-2021-02-06-084550
The bug was also recreated on 4.7.5.

Version-Release number verified on: 4.8.0-0.nightly-2021-04-18-101412

How reproducible:

1) Retrieve the API VIP (possible from install-config.yaml).

2) SSH into the master node (master-X) holding it (possible via `ssh core@<API VIP>`).

3) On the node, use the following command to get the KEEPALIVED_CONTAINER_ID:

```
[core@master-X ~]$ sudo crictl ps | grep keepalived
```

4) Restart the keepalived container by either:

a) Triggering a liveness probe failure of the Keepalived container:

```
[core@master-X ~]$ sudo crictl exec -it KEEPALIVED_CONTAINER_ID /bin/sh
sh-4.4# pidof keepalived
sh-4.4# kill -9 11 8
```

(11 and 8 being the PIDs returned by `pidof keepalived`.)

b) Removing the Keepalived container:

```
[core@master-X ~]$ sudo crictl rm -f KEEPALIVED_CONTAINER_ID
KEEPALIVED_CONTAINER_ID
```

5) Verify that only one of the nodes holds the API VIP by running the following on all master nodes (an illustrative loop for the same check appears after this comment):

for IPv4:
```
[core@master-X ~]$ ip -4 a
```

for IPv6:
```
[core@master-X ~]$ ip -6 a
```

The API VIP should appear in the output only on the master node that now holds the API VIP, not on any of the others.

Actual results:
The API VIP is assigned to another single master node, not the above master-X from which the keepalived container was removed.

Issue returns on this post-fix build:

```
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.8.0-0.nightly-2021-04-22-050208   True        False         4d20h   Cluster version is 4.8.0-0.nightly-2021-04-22-050208
```

Will try to redeploy the build this was verified on.

This is affecting CI and our ability to land patches.

*** Bug 1935159 has been marked as a duplicate of this bug. ***
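Step 5 of the reproduction and the re-verification below check each master by hand; the sketch below is merely an illustrative way to run the same check across all masters in one pass. The hostnames and the VIP value are placeholders, not taken from this environment.

```bash
#!/bin/bash
# Illustrative only: report which master nodes currently have the API VIP
# configured. Hostnames and the VIP value are placeholders.
API_VIP="fd2e:6f44:5dd8::5"

for node in master-0 master-1 master-2; do
  echo "== ${node} =="
  ssh "core@${node}" "ip a | grep -F '${API_VIP}' || echo 'VIP not present'"
done
# Expected (fixed behavior): exactly one node reports the VIP.
# With the keepalived 2.0.10 bug, the VIP stays configured on two nodes.
```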
Version-Release number verified on: 4.8.0-0.nightly-2021-04-30-201824

Re-verified post-fix:

```
[core@master-0-1 ~]$ ip a | grep fd2e:6f44:5dd8::5
    inet6 fd2e:6f44:5dd8::5/128 scope global nodad deprecated
[core@master-0-1 ~]$ sudo crictl ps | grep keepalived
6efe0de2516eb   21eb1783b3937eb370942c57faebab660d05ccf833a6e9ef7cf20ef811e4d98d   6 minutes ago   Running   keepalived   1   c249aaf6f31ff
[core@master-0-1 ~]$ sudo crictl rm -f 6efe0de2516eb
6efe0de2516eb
[core@master-0-1 ~]$ ip a | grep fd2e:6f44:5dd8::5
[core@master-0-1 ~]$
[core@master-0-1 ~]$

[kni@provisionhost-0-0 ~]$ ssh core@fd2e:6f44:5dd8::5
[core@master-0-2 ~]$
[core@master-0-2 ~]$
[core@master-0-2 ~]$ ip a | grep fd2e:6f44:5dd8::5
    inet6 fd2e:6f44:5dd8::5/128 scope global nodad deprecated noprefixroute
```

The API VIP appears on one master node; when the keepalived container on that node is removed, the API VIP is removed from it as well and moves to another master node. This was repeated 3-4 times to stress the change and confirm the issue is finally fixed.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.8.2 bug fix and security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:2438