Bug 1898877 - keepalived consumes 100% of cpu
Summary: keepalived consumes 100% of cpu
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Networking
Version: 4.6.z
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: ---
Target Release: 4.9.0
Assignee: Yossi Boaron
QA Contact: Eldar Weiss
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2020-11-18 09:54 UTC by Martin Sivák
Modified: 2021-10-18 17:29 UTC (History)

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2021-10-18 17:28:52 UTC
Target Upstream Version:
Embargoed:


Attachments (Terms of Use)
Screenshot from top showing keepalived eating 100% of cpu (101.89 KB, image/png)
2020-11-18 09:54 UTC, Martin Sivák


Links
System ID Private Priority Status Summary Last Updated
Red Hat Bugzilla 1890626 1 None None None 2024-06-13 23:19:41 UTC
Red Hat Product Errata RHSA-2021:3759 0 None None None 2021-10-18 17:29:06 UTC

Description Martin Sivák 2020-11-18 09:54:20 UTC
Created attachment 1730508 [details]
Screenshot from top showing keepalived eating 100% of cpu

Description of problem:

Keepalived runs away on a single-node OpenShift cluster and consumes 100% of one CPU, which slows down the machine. The node is already running at the edge of its capacity (most of its resources are dedicated to a DPDK workload in the Guaranteed resource class), so this contributes to a total overload of the node.


This issue was originally reported against RHEL as https://bugzilla.redhat.com/show_bug.cgi?id=1890626 and was believed to be fixed.

Version-Release number of selected component (if applicable):

Containers:
  keepalived:
    Image:         quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:bd5ec3fe868531b24bcfe6f91df387144744da2edf04ee120eb464c3682821de
    Image ID:      quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:bd5ec3fe868531b24bcfe6f91df387144744da2edf04ee120eb464c3682821de


[msivak@localhost packages]$ oc exec -it keepalived-<node>.redhat.com -n openshift-kni-infra -- sh
Defaulting container name to keepalived.
Use 'oc describe pod/keepalived-<node>.redhat.com -n openshift-kni-infra' to see all of the containers in this pod.
sh-4.4# rpm
rpm          rpm2archive  rpm2cpio     rpmdb        rpmkeys      rpmquery     rpmverify    
sh-4.4# rpm -qi keepalived
Name        : keepalived
Version     : 2.0.10
Release     : 10.el8
Architecture: x86_64
Install Date: Sat Oct 31 16:03:18 2020
Group       : System Environment/Daemons
Size        : 1448487
License     : GPLv2+
Signature   : RSA/SHA256, Mon Feb 24 18:10:55 2020, Key ID 199e2f91fd431d51
Source RPM  : keepalived-2.0.10-10.el8.src.rpm
Build Date  : Mon Feb 24 17:45:36 2020
Build Host  : x86-vm-01.build.eng.bos.redhat.com


How reproducible:

Always on my single-node OpenShift bare-metal setup. It usually happens when I load the machine, or overnight.


Steps to Reproduce:
1. Just wait


Additional info:

Yossi Boaron proposed two workarounds:

- Edit /etc/kubernetes/manifests/keepalived.yaml and change the liveness probe to: ps -C keepalived -o pid=,pcpu= | awk --assign maxcpu=75 '$2>maxcpu {exit 1}'
- Comment out the API track scripts and the Ingress script in /etc/kubernetes/static-pod-resources/keepalived/keepalived.conf.tmpl, then restart the keepalived-monitor container to apply the changes: sudo crictl rm -f <container-id>

The first one works, with one caveat: when the machine is overloaded, there is no CPU time left for the liveness probe. The second one seems to work so far.
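
To illustrate how the liveness probe from the first workaround decides between pass and fail, here is a minimal sketch that feeds the same awk threshold check a sample line instead of live ps output. The function name cpu_below_threshold and the sample PID and CPU values are made up for illustration; -v is used as the portable spelling of gawk's --assign.

```shell
# Hypothetical helper (not part of the keepalived manifest): reads
# "pid pcpu" lines, the format printed by `ps -C keepalived -o pid=,pcpu=`,
# on stdin and exits non-zero if any process exceeds the threshold in $1.
cpu_below_threshold() {
    awk -v maxcpu="$1" '$2 > maxcpu { exit 1 }'
}

# Sample input in place of live ps output (PID and CPU value are invented).
if printf '1234 99.8\n' | cpu_below_threshold 75; then
    echo "probe passes"
else
    echo "probe fails, so the kubelet would restart the container"
fi
```

A non-zero exit status is what makes an exec liveness probe fail, which is why the awk program only needs to exit 1 when the threshold is exceeded.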

Comment 1 Yossi Boaron 2020-12-06 16:04:32 UTC
The root cause of this problem is a bug in the Keepalived RPM (see [1]) that was addressed in keepalived-2.0.10-10.el8_2.1.x86_64.

Since the Keepalived container image in OCP 4.6.x uses the Keepalived RPM version that includes the bug fix (see [2]), we can close this bug.

[1] https://bugzilla.redhat.com/show_bug.cgi?id=1890626
[2] 
sudo crictl exec -i <Keepalived Container ID> yum list installed | grep -i keepalived

keepalived.x86_64                             2.0.10-10.el8_2.1                    @rhel-8-appstream-rpms-x86_64

Comment 2 Yossi Boaron 2020-12-06 18:24:23 UTC
I verified that the Keepalived RPM in OCP 4.6.4 includes the bug fix.

Comment 8 errata-xmlrpc 2021-10-18 17:28:52 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.9.0 bug fix and security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:3759

