Bug 1931505 - [IPI baremetal] Two nodes hold the VIP post remove and start of the Keepalived container
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Machine Config Operator
Version: 4.7
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: urgent
Target Milestone: ---
Target Release: 4.8.0
Assignee: Yossi Boaron
QA Contact: Eldar Weiss
URL:
Whiteboard:
Duplicates: 1935159
Depends On:
Blocks: 1957015
 
Reported: 2021-02-22 15:10 UTC by Yossi Boaron
Modified: 2021-07-27 22:48 UTC
CC List: 7 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Cause: A bug in Keepalived 2.0.10.
Consequence: If the liveness probe kills the Keepalived container, any VIPs that were assigned to the system remain and are not cleaned up when Keepalived restarts.
Fix: Clean up the VIPs before starting Keepalived.
Result: Only a single node holds the VIP.
Clone Of:
Clones: 1957015
Environment:
Last Closed: 2021-07-27 22:47:38 UTC
Target Upstream Version:
Embargoed:




Links
Github openshift machine-config-operator pull 2511 (open): Bug 1931505: [on-prem] Cleanup keepalived vips before starting service (last updated 2021-04-06 21:38:21 UTC)
Github openshift machine-config-operator pull 2548 (open): Bug 1931505: [On-prem] - fix grep syntax for Keepalived remove-vips (last updated 2021-04-27 09:31:06 UTC)
Red Hat Product Errata RHSA-2021:2438 (last updated 2021-07-27 22:48:13 UTC)

Description Yossi Boaron 2021-02-22 15:10:45 UTC
Description of problem:
Two master nodes hold the API VIP after the Keepalived container is removed and restarted on the master node that holds the VIP.



Version-Release number of selected component (if applicable):
4.7.0-0.nightly-2021-02-06-084550

How reproducible:

Steps to Reproduce:
1. SSH into the master node (e.g. master-x) that currently holds the VIP.
2. Force the Keepalived container to be removed and restarted, in either of two ways:
   A. Trigger a liveness probe failure of the Keepalived container (as a result, kubelet should remove the container and create a new one), or
   B. Simply remove the container with a crictl command, like so: sudo crictl rm -f <Keepalived-container-id>

Actual results:

The VIP was assigned to another master node (expected behavior) but was not removed from master-x, so we ended up with two nodes holding the same VIP.



Expected results:
The VIP should be assigned to another master node and removed from master-x.

Comment 1 Yossi Boaron 2021-03-25 14:55:18 UTC
According to Ben Nemec, the same problem seems to exist in CentOS 8 as well, but in Fedora 33 (Keepalived v2.1.5, released 07/13/2020) everything is fine.

It seems that the issue has been fixed in later versions of Keepalived.

Comment 2 Ben Nemec 2021-03-29 16:57:58 UTC
To reproduce this, I deployed two CentOS 8 VMs and configured keepalived on them as follows:

vrrp_instance ostest_API {
    state BACKUP
    interface eth0
    virtual_router_id 14
    priority 70
    advert_int 1
    nopreempt
    
    unicast_src_ip 12.1.1.122
    unicast_peer {
        12.1.1.111
    }
    
    authentication {
        auth_type PASS
        auth_pass ostest_api_vip
    }
    virtual_ipaddress {
        12.2.2.2/32
    }
}

The other node has the same config with the unicast addresses flipped appropriately.

To trigger the problem, I just ran "killall -9 keepalived" to force a hard shutdown on the node holding the VIP. The VIP correctly fails over to the other node, but it never gets removed from the first one so you end up with it in two places. When I did the same flow in fedora 33 it correctly unconfigured the IP on the node where I killed keepalived.
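
For reference, a quick way to confirm the duplicate assignment on both test VMs (using the VIP and interface from the config above; nothing here is specific to OCP) is simply:

# Run on both VMs (12.1.1.122 and 12.1.1.111); when the bug is hit, both
# report the VIP after keepalived is killed on the original holder.
ip addr show dev eth0 | grep -F 12.2.2.2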

Comment 3 Ben Nemec 2021-03-30 19:19:34 UTC
For the record, the centos 8 version of keepalived is 2.0.10, same as in the OCP container.

Comment 4 Yossi Boaron 2021-04-05 13:04:30 UTC
https://github.com/openshift/machine-config-operator/pull/2511 provides a workaround for this bug.
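
For illustration only (this is not the actual content of the PR; the config path, interface name and keepalived flags below are assumptions), the "clean up VIPs before starting keepalived" idea can be sketched roughly as:

#!/bin/bash
# Sketch of a pre-start cleanup: remove any VIP that is still configured
# locally before the keepalived process starts.
CONF=/etc/keepalived/keepalived.conf   # assumed config path
IFACE=br-ex                            # assumed interface name

# Collect every address listed in the virtual_ipaddress blocks of the config.
vips=$(awk '/virtual_ipaddress/,/}/' "$CONF" | grep -Eo '[0-9a-fA-F:.]+/[0-9]+')

for vip in $vips; do
    # If a stale VIP is still present on the interface, delete it.
    if ip addr show dev "$IFACE" | grep -qF "${vip%%/*}"; then
        ip addr del "$vip" dev "$IFACE"
    fi
done

exec /usr/sbin/keepalived --dont-fork --log-console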

Comment 5 Ben Nemec 2021-04-06 19:39:12 UTC
Bumping priority and severity as this is now frequently causing ci failures and is likely to break real deployments.

Comment 6 Ben Nemec 2021-04-06 21:39:29 UTC
Since the workaround will address this in the OCP context, I've opened https://bugzilla.redhat.com/show_bug.cgi?id=1946799 against keepalived itself to track the underlying issue.

Comment 8 Eldar Weiss 2021-04-19 11:43:59 UTC
Description of problem:
Two master nodes hold the API VIP after the Keepalived container is removed and restarted on the master node that holds the VIP.

Version-Release number the bug was found on:
4.7.0-0.nightly-2021-02-06-084550

The bug was also reproduced on 4.7.5.

Version-Release number Verified on:
4.8.0-0.nightly-2021-04-18-101412


How reproducible:

1) Retrieve the API VIP (it can be found in install-config.yaml).
2) SSH into the master node (master-X) holding it (e.g. via ssh core@<API VIP>).
3) On the node, use the following command to get the KEEPALIVED_CONTAINER_ID:

[core@master-X ~]$ sudo crictl ps | grep keepalived


4) Restart the keepalived container in either of two ways:

a) Trigger a liveness probe failure of the Keepalived container:

[core@master-X ~]$ sudo crictl exec -it KEEPALIVED_CONTAINER_ID /bin/sh 
sh-4.4# pidof keepalived
sh-4.4# kill -9 11 8 

b) Remove the Keepalived container:

[core@master-X ~]$ sudo crictl rm -f KEEPALIVED_CONTAINER_ID
KEEPALIVED_CONTAINER_ID


5) Verify that only one of the nodes holds the API VIP by running the following on all master nodes:

for IPV4:
[core@master-X ~]$ ip -4 a

for IPV6:
[core@master-X ~]$ ip -6 a

The API VIP should appear in the output of this command only on the master node that now holds it, not on any of the others.
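
A small loop along these lines runs the same check against all masters from one host (the node names and VIP value below are placeholders; substitute the real API VIP):

API_VIP=192.0.2.5   # placeholder, use the VIP from install-config.yaml
for node in master-0 master-1 master-2; do
    echo "== $node =="
    ssh core@"$node" "ip -o addr show | grep -F $API_VIP"
done
# Exactly one node is expected to print a line containing the API VIP.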


Actual results:

The API VIP is assigned to exactly one other master node and is no longer present on master-X, the node from which the keepalived container was removed.

Comment 9 Eldar Weiss 2021-04-27 09:06:55 UTC
Issue returns on this post-fix build:
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.8.0-0.nightly-2021-04-22-050208   True        False         4d20h   Cluster version is 4.8.0-0.nightly-2021-04-22-050208

Will try to redeploy the build this was verified on.

Comment 10 Stephen Benjamin 2021-04-27 11:13:06 UTC
This is affecting CI and our ability to land patches.

Comment 11 Igal Tsoiref 2021-04-27 14:01:47 UTC
*** Bug 1935159 has been marked as a duplicate of this bug. ***

Comment 13 Eldar Weiss 2021-05-02 09:45:01 UTC
Description of problem:
Two master nodes hold the API VIP after the Keepalived container is removed and restarted on the master node that holds the VIP.

Version-Release number Verified on:
4.8.0-0.nightly-2021-04-30-201824

Re-verified post-fix:

[core@master-0-1 ~]$ ip a | grep fd2e:6f44:5dd8::5
    inet6 fd2e:6f44:5dd8::5/128 scope global nodad deprecated 
[core@master-0-1 ~]$ sudo crictl ps | grep keepalived
6efe0de2516eb       21eb1783b3937eb370942c57faebab660d05ccf833a6e9ef7cf20ef811e4d98d                                                         6 minutes ago       Running             keepalived                                    1                   c249aaf6f31ff
[core@master-0-1 ~]$ sudo crictl rm -f 6efe0de2516eb
6efe0de2516eb
[core@master-0-1 ~]$ ip a | grep fd2e:6f44:5dd8::5
[core@master-0-1 ~]$
[core@master-0-1 ~]$ 
[kni@provisionhost-0-0 ~]$ ssh core@fd2e:6f44:5dd8::5
[core@master-0-2 ~]$ 
[core@master-0-2 ~]$ 
[core@master-0-2 ~]$ ip a | grep fd2e:6f44:5dd8::5
    inet6 fd2e:6f44:5dd8::5/128 scope global nodad deprecated noprefixroute 

The API VIP appears on one master node; when the Keepalived container on that node is removed, the VIP is removed from it and comes up on another master node.
Repeated this 3-4 times to stress the change and confirm the issue is finally fixed.

Comment 16 errata-xmlrpc 2021-07-27 22:47:38 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.8.2 bug fix and security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:2438

