Bug 1898877

Summary: keepalived consumes 100% of cpu
Product: OpenShift Container Platform Reporter: Martin Sivák <msivak>
Component: Networking Assignee: Yossi Boaron <yboaron>
Networking sub component: runtime-cfg QA Contact: Eldar Weiss <eweiss>
Status: CLOSED ERRATA Docs Contact:
Severity: high    
Priority: unspecified CC: asegurap, bperkins, rpittau, vvoronko, yboaron, yprokule
Version: 4.6.z Keywords: Triaged, UpcomingSprint
Target Milestone: ---   
Target Release: 4.9.0   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2021-10-18 17:28:52 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Attachments:
  Description: Screenshot from top showing keepalived eating 100% of CPU
  Flags: none

Description Martin Sivák 2020-11-18 09:54:20 UTC
Created attachment 1730508
Screenshot from top showing keepalived eating 100% of CPU

Description of problem:

Keepalived spins on a single-node OpenShift deployment and consumes 100% of one CPU, which slows down the machine. Because the node is already running at the edge of its capacity (most of its resources are dedicated to a DPDK workload in the guaranteed resource class), this contributes to a total overload of the node.


This issue was originally reported against RHEL as https://bugzilla.redhat.com/show_bug.cgi?id=1890626 and was believed to be fixed.

Version-Release number of selected component (if applicable):

Containers:
  keepalived:
    Image:         quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:bd5ec3fe868531b24bcfe6f91df387144744da2edf04ee120eb464c3682821de
    Image ID:      quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:bd5ec3fe868531b24bcfe6f91df387144744da2edf04ee120eb464c3682821de


[msivak@localhost packages]$ oc exec -it keepalived-<node>.redhat.com -n openshift-kni-infra -- sh
Defaulting container name to keepalived.
Use 'oc describe pod/keepalived-<node>.redhat.com -n openshift-kni-infra' to see all of the containers in this pod.
sh-4.4# rpm
rpm          rpm2archive  rpm2cpio     rpmdb        rpmkeys      rpmquery     rpmverify    
sh-4.4# rpm -qi keepalived
Name        : keepalived
Version     : 2.0.10
Release     : 10.el8
Architecture: x86_64
Install Date: Sat Oct 31 16:03:18 2020
Group       : System Environment/Daemons
Size        : 1448487
License     : GPLv2+
Signature   : RSA/SHA256, Mon Feb 24 18:10:55 2020, Key ID 199e2f91fd431d51
Source RPM  : keepalived-2.0.10-10.el8.src.rpm
Build Date  : Mon Feb 24 17:45:36 2020
Build Host  : x86-vm-01.build.eng.bos.redhat.com


How reproducible:

Always on my single-node OpenShift bare-metal setup. It usually happens when I put load on the machine, or overnight.


Steps to Reproduce:
1. Just wait


Additional info:

Yossi Boaron proposed two workarounds:

- Edit /etc/kubernetes/manifests/keepalived.yaml and change the liveness probe to: ps -C keepalived -o pid=,pcpu= | awk --assign maxcpu=75 '$2>maxcpu {exit 1}'
- Comment out the API track scripts and the Ingress script in /etc/kubernetes/static-pod-resources/keepalived/keepalived.conf.tmpl, then restart the keepalived-monitor container to apply the changes: sudo crictl rm -f <container-id>

The first one works, with one caveat: when the machine is overloaded, there is no CPU time left for the liveness probe. The second one seems to work so far.
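
For reference, the CPU check behind the first workaround can be sketched as a small shell function. This is only an illustration of the threshold logic, not the probe shipped in any OpenShift manifest: the name cpu_over_threshold is hypothetical, the sample PIDs and percentages are made up, and the portable awk option -v is used in place of the gawk-specific --assign.

```shell
#!/bin/sh
# Hypothetical helper (not part of OpenShift or keepalived).
# Reads "pid pcpu" pairs on stdin, in the format printed by
#   ps -C keepalived -o pid=,pcpu=
# and exits with status 1 if any process is above the given CPU
# percentage, 0 otherwise.
cpu_over_threshold() {
    awk -v maxcpu="$1" '$2 > maxcpu { exit 1 }'
}

# On a node, the probe command would be equivalent to:
#   ps -C keepalived -o pid=,pcpu= | cpu_over_threshold 75

# Demonstration with made-up sample data:
printf '1234 12.5\n' | cpu_over_threshold 75 && echo "healthy"
printf '1234 99.0\n' | cpu_over_threshold 75 || echo "over threshold: probe would fail"
```

A failing probe makes the kubelet restart the container, which is what clears the spinning process. Note that, according to the ps manual, pcpu is CPU time used divided by the time the process has been running, so it is a lifetime average and not an instantaneous reading; a long-running keepalived may take a while to cross the threshold after it starts spinning.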

Comment 1 Yossi Boaron 2020-12-06 16:04:32 UTC
The root cause of this problem is a bug in the Keepalived RPM (see [1]) that was fixed in keepalived-2.0.10-10.el8_2.1.x86_64.

Since the Keepalived container image in OCP 4.6.x uses a Keepalived RPM version that includes the bug fix (see [2]), we can close this bug.

[1] https://bugzilla.redhat.com/show_bug.cgi?id=1890626
[2] 
sudo crictl exec -i <Keepalived Container ID> yum list installed | grep -i keepalived

keepalived.x86_64                             2.0.10-10.el8_2.1                    @rhel-8-appstream-rpms-x86_64
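
The check in [2] can be scripted roughly as follows. This is a sketch under stated assumptions: keepalived_version is a hypothetical helper name, the sample line is copied from the output above, and the comparison is a plain string match against the fixed build named in this comment, so a later build would need a proper RPM version comparison.

```shell
#!/bin/sh
# Hypothetical helper: reads `yum list installed` output on stdin and
# prints the version-release column of the first keepalived package line.
keepalived_version() {
    awk '$1 ~ /^keepalived\./ { print $2; exit }'
}

# Build that contains the fix, as named in this comment.
FIXED_BUILD="2.0.10-10.el8_2.1"

# On a node, the input would come from:
#   sudo crictl exec -i <Keepalived Container ID> yum list installed
# Here the sample line from the output above is used.
installed=$(printf '%s\n' \
    'keepalived.x86_64   2.0.10-10.el8_2.1   @rhel-8-appstream-rpms-x86_64' \
    | keepalived_version)

if [ "$installed" = "$FIXED_BUILD" ]; then
    echo "keepalived $installed matches the fixed build"
else
    echo "keepalived $installed differs from $FIXED_BUILD; compare manually"
fi
```

For comparison, the rpm -qi output in the description shows version 2.0.10 with release 10.el8, which this string match would report as different from the fixed build.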

Comment 2 Yossi Boaron 2020-12-06 18:24:23 UTC
I verified that the Keepalived RPM on OCP 4.6.4 includes the bug fix.

Comment 8 errata-xmlrpc 2021-10-18 17:28:52 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.9.0 bug fix and security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:3759