Description of problem:
Two master nodes hold the API VIP after the Keepalived container is removed on the master node that holds the VIP.

Version-Release number of selected component (if applicable):
4.7.0-0.nightly-2021-02-06-084550

How reproducible:
SSH into the master node (i.e. master-x) that holds the VIP. To force removal and restart of the Keepalived container we can:
A. Trigger a liveness probe failure of the Keepalived container (as a result, Kubelet should remove it and create a new container)
Or
B. Simply remove the container using the crictl command, like so:
sudo crictl rm -f <Keepalived-container-id>

Steps to Reproduce:
1.
2.
3.

Actual results:
The VIP was assigned to another master node (expected behavior) but wasn't removed from master-x. We got to a point where two nodes hold the same VIP.

Expected results:
The VIP should be assigned to another master node and removed from master-x.
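The duplicate-VIP condition described above can be checked mechanically from `ip a` output. A minimal sketch (the `holds_vip` helper and the sample output lines are hypothetical, not from the report; it assumes GNU grep):

```shell
# Hypothetical helper: exit 0 if the given VIP appears in `ip a`-style
# output fed on stdin. Assumes GNU grep (BRE with \? for an optional "6").
holds_vip() {
  local vip="$1"
  grep -q "inet6\? ${vip}/" -
}

# Sample lines as they might look after the failure: master-x kept the
# VIP even though its keepalived container was removed, and master-y
# also claimed it after failover.
sample_master_x='    inet 12.2.2.2/32 scope global eth0'
sample_master_y='    inet 12.2.2.2/32 scope global eth0'

holders=0
for snapshot in "$sample_master_x" "$sample_master_y"; do
  if printf '%s\n' "$snapshot" | holds_vip "12.2.2.2"; then
    holders=$((holders + 1))
  fi
done
echo "nodes holding 12.2.2.2: $holders"   # the bug: 2, expected 1
```

On a real cluster each snapshot would come from running `ip -4 a` (or `ip -6 a`) on one master; the sample strings above just stand in for that output.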
According to Ben Nemec, the same problem seems to exist in CentOS 8 as well, but on Fedora 33 (Keepalived v2.1.5 (07/13,2020)) everything is fine. It seems the issue was fixed in a later version of Keepalived.
To reproduce this, I deployed two CentOS 8 VMs and configured keepalived on them as follows:

vrrp_instance ostest_API {
    state BACKUP
    interface eth0
    virtual_router_id 14
    priority 70
    advert_int 1
    nopreempt
    unicast_src_ip 12.1.1.122
    unicast_peer {
        12.1.1.111
    }
    authentication {
        auth_type PASS
        auth_pass ostest_api_vip
    }
    virtual_ipaddress {
        12.2.2.2/32
    }
}

The other node has the same config with the unicast addresses flipped appropriately. To trigger the problem, I just ran "killall -9 keepalived" to force a hard shutdown on the node holding the VIP. The VIP correctly fails over to the other node, but it never gets removed from the first one, so you end up with it in two places. When I did the same flow on Fedora 33, it correctly unconfigured the IP on the node where I killed keepalived.
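For scripting the same check elsewhere, the VIP can be pulled back out of the config above. A small sketch (the awk extraction is my addition, not part of the original reproduction):

```shell
# Reconstruct the reproduction config from the comment above in a temp
# file, then extract the VIP from it.
cfg=$(mktemp)
cat > "$cfg" <<'EOF'
vrrp_instance ostest_API {
    state BACKUP
    interface eth0
    virtual_router_id 14
    priority 70
    advert_int 1
    nopreempt
    unicast_src_ip 12.1.1.122
    unicast_peer {
        12.1.1.111
    }
    authentication {
        auth_type PASS
        auth_pass ostest_api_vip
    }
    virtual_ipaddress {
        12.2.2.2/32
    }
}
EOF

# The line after "virtual_ipaddress {" carries the VIP; strip the
# leading whitespace and the /32 prefix length.
vip=$(awk '/virtual_ipaddress/ { getline; gsub(/^[ \t]+|\/.*$/, ""); print; exit }' "$cfg")
echo "VIP under test: $vip"   # VIP under test: 12.2.2.2
rm -f "$cfg"
```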
For the record, the CentOS 8 version of keepalived is 2.0.10, the same as in the OCP container.
https://github.com/openshift/machine-config-operator/pull/2511 provides a workaround for this bug.
Bumping priority and severity as this is now frequently causing CI failures and is likely to break real deployments.
Since the workaround will address this in the OCP context, I've opened https://bugzilla.redhat.com/show_bug.cgi?id=1946799 against keepalived itself to track the underlying issue.
Description of problem:
Two master nodes hold the API VIP after the Keepalived container is removed on the master node that holds the VIP.

Version-Release number the bug was found on:
4.7.0-0.nightly-2021-02-06-084550
Bug was recreated on 4.7.5

Version-Release number Verified on:
4.8.0-0.nightly-2021-04-18-101412

How reproducible:
1) Retrieve the API VIP (possible from install-config.yaml)
2) SSH into the master node (master-X) holding it (possible with ssh core@<API VIP>)
3) On the node, use the following command to get the KEEPALIVED_CONTAINER_ID:
[core@master-X ~]$ sudo crictl ps | grep keepalived
4) Restart the keepalived container by either:
a) Triggering a liveness probe failure of the Keepalived container:
[core@master-X ~]$ sudo crictl exec -it KEEPALIVED_CONTAINER_ID /bin/sh
sh-4.4# pidof keepalived
sh-4.4# kill -9 11 8
b) Removing the Keepalived container:
[core@master-X ~]$ sudo crictl rm -f KEEPALIVED_CONTAINER_ID
KEEPALIVED_CONTAINER_ID
5) Verify only one of the nodes holds the API VIP by running the following on all master nodes:
For IPv4:
[core@master-X ~]$ ip -4 a
For IPv6:
[core@master-X ~]$ ip -6 a
The API VIP should only appear in the output of this command when it is run on the master node that now holds the API VIP, not any of the others.

Actual results:
The API VIP is assigned to another single master node, not the above master-X from which the keepalived container was removed.
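Step 5 above can be consolidated into one check. A sketch (the `count_vip_holders` function and the sample addresses are hypothetical; in practice each snapshot would come from `ssh core@<master> ip -4 a` or `ip -6 a`):

```shell
# Hypothetical consolidation of the verification step: given an `ip a`
# snapshot per master, count how many nodes hold the API VIP.
count_vip_holders() {
  local vip="$1"; shift
  local count=0 snapshot
  for snapshot in "$@"; do
    if printf '%s\n' "$snapshot" | grep -q " ${vip}/"; then
      count=$((count + 1))
    fi
  done
  echo "$count"
}

# Sample snapshots (addresses are placeholders): only master-0 took the VIP.
master0='    inet 192.168.111.5/32 scope global ens3'
master1='    inet 192.168.111.20/24 scope global ens3'
master2='    inet 192.168.111.21/24 scope global ens3'

echo "nodes holding the API VIP: $(count_vip_holders 192.168.111.5 "$master0" "$master1" "$master2")"
```

A count of 1 is the expected result; anything higher reproduces the duplicate-VIP failure this bug tracks.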
Issue returns on this post-fix build:

NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.8.0-0.nightly-2021-04-22-050208   True        False         4d20h   Cluster version is 4.8.0-0.nightly-2021-04-22-050208

Will try to redeploy the build this was verified on.
This is affecting CI and our ability to land patches.
*** Bug 1935159 has been marked as a duplicate of this bug. ***
Description of problem:
Two master nodes hold the API VIP after the Keepalived container is removed on the master node that holds the VIP.

Version-Release number Verified on:
4.8.0-0.nightly-2021-04-30-201824

Re-verified post-fix:
[core@master-0-1 ~]$ ip a | grep fd2e:6f44:5dd8::5
    inet6 fd2e:6f44:5dd8::5/128 scope global nodad deprecated
[core@master-0-1 ~]$ sudo crictl ps | grep keepalived
6efe0de2516eb   21eb1783b3937eb370942c57faebab660d05ccf833a6e9ef7cf20ef811e4d98d   6 minutes ago   Running   keepalived   1   c249aaf6f31ff
[core@master-0-1 ~]$ sudo crictl rm -f 6efe0de2516eb
6efe0de2516eb
[core@master-0-1 ~]$ ip a | grep fd2e:6f44:5dd8::5
[core@master-0-1 ~]$

[kni@provisionhost-0-0 ~]$ ssh core@fd2e:6f44:5dd8::5
[core@master-0-2 ~]$ ip a | grep fd2e:6f44:5dd8::5
    inet6 fd2e:6f44:5dd8::5/128 scope global nodad deprecated noprefixroute

The API VIP appears on one master node; when the keepalived container is removed, the API VIP is also removed from that node and comes up on another master node. Repeated 3-4 times to stress the change and confirm the issue is finally fixed.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.8.2 bug fix and security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2021:2438