1898877 – keepalived consumes 100% of cpu

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	OpenShift Container Platform
Classification:	Red Hat
Component:	Networking
Sub Component:
Version:	4.6.z
Hardware:	Unspecified
OS:	Unspecified
Priority:	unspecified
Severity:	high
Target Milestone:	---
Target Release:	4.9.0
Assignee:	Yossi Boaron
QA Contact:	Eldar Weiss
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:

Reported:	2020-11-18 09:54 UTC by Martin Sivák
Modified:	2021-10-18 17:29 UTC
CC List:	6 users
Fixed In Version:
Doc Type:	If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed:	2021-10-18 17:28:52 UTC
Target Upstream Version:
Embargoed:

Attachments
Screenshot from top showing keepalived eating 100% of cpu (101.89 KB, image/png) 2020-11-18 09:54 UTC, Martin Sivák

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Red Hat Bugzilla	1890626	1	None	None	None	2024-06-13 23:19:41 UTC
Red Hat Product Errata	RHSA-2021:3759	0	None	None	None	2021-10-18 17:29:06 UTC

Description Martin Sivák 2020-11-18 09:54:20 UTC

Created attachment 1730508
Screenshot from top showing keepalived eating 100% of cpu

Description of problem:

On a single-node OpenShift cluster, keepalived starts consuming 100% of one CPU, which slows down the machine. Because the node is already running at the edge of its capacity (most of the resources are dedicated to a DPDK workload in the guaranteed resource class), this contributes to a total overload of the node.


This issue was originally reported on RHEL as https://bugzilla.redhat.com/show_bug.cgi?id=1890626 and was believed to be fixed.

Version-Release number of selected component (if applicable):

Containers:
  keepalived:
    Image:         quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:bd5ec3fe868531b24bcfe6f91df387144744da2edf04ee120eb464c3682821de
    Image ID:      quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:bd5ec3fe868531b24bcfe6f91df387144744da2edf04ee120eb464c3682821de


[msivak@localhost packages]$ oc exec -it keepalived-<node>.redhat.com -n openshift-kni-infra -- sh
Defaulting container name to keepalived.
Use 'oc describe pod/keepalived-<node>.redhat.com -n openshift-kni-infra' to see all of the containers in this pod.
sh-4.4# rpm -qi keepalived
Name        : keepalived
Version     : 2.0.10
Release     : 10.el8
Architecture: x86_64
Install Date: Sat Oct 31 16:03:18 2020
Group       : System Environment/Daemons
Size        : 1448487
License     : GPLv2+
Signature   : RSA/SHA256, Mon Feb 24 18:10:55 2020, Key ID 199e2f91fd431d51
Source RPM  : keepalived-2.0.10-10.el8.src.rpm
Build Date  : Mon Feb 24 17:45:36 2020
Build Host  : x86-vm-01.build.eng.bos.redhat.com


How reproducible:

Always on my single-node OpenShift bare-metal setup. It usually happens when I put the machine under load, or overnight.


Steps to Reproduce:
1. Just wait


Additional info:

Yossi Boaron proposed two workarounds:

- Edit /etc/kubernetes/manifests/keepalived.yaml and change liveness probe to: ps -C keepalived -o pid=,pcpu= | awk --assign maxcpu=75 '$2>maxcpu {exit 1}'
- Comment out the API track scripts and the Ingress script in /etc/kubernetes/static-pod-resources/keepalived/keepalived.conf.tmpl, then restart the keepalived-monitor container to apply the changes: sudo crictl rm -f <container-id>

The first one works, with one caveat: when the machine is overloaded, there is no CPU time left for the liveness probe to run. The second one seems to work so far.
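
For anyone who wants to try the threshold logic of the first workaround outside of the pod, here is a minimal sketch. The function name cpu_below_threshold is hypothetical (it is not part of keepalived or OpenShift), and the sample input lines are made up; only the ps/awk pipeline shape and the 75% limit come from the workaround above. It uses the POSIX -v option instead of --assign so it also runs on non-GNU awk.

```shell
# Hypothetical helper: reads "pid pcpu" lines on stdin (the format produced by
# `ps -C keepalived -o pid=,pcpu=`) and fails if any process exceeds the limit.
# The first argument is the maximum allowed CPU percentage (default 75).
cpu_below_threshold() {
    awk -v maxcpu="${1:-75}" '$2 + 0 > maxcpu + 0 { exit 1 }'
}

# Made-up sample input instead of a live ps call:
printf '1234 12.5\n' | cpu_below_threshold 75 && echo "healthy"
printf '1234 99.9\n' | cpu_below_threshold 75 || echo "unhealthy"

# In a real probe the input would come from ps:
#   ps -C keepalived -o pid=,pcpu= | cpu_below_threshold 75
```

Note that empty input (no keepalived process found) passes the check, because the pipeline's exit status is that of awk, not ps. Also keep in mind that the pcpu column of ps is CPU time divided by the process lifetime, not a current reading, so the check reacts slowly for a long-running process.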

Comment 1 Yossi Boaron 2020-12-06 16:04:32 UTC

The root cause of this problem is a bug in the Keepalived RPM (see [1]) that was addressed in keepalived-2.0.10-10.el8_2.1.x86_64.

Since the Keepalived container image in OCP 4.6.x uses the Keepalived RPM version that includes the bug fix (see [2]), we can close this bug.

[1] https://bugzilla.redhat.com/show_bug.cgi?id=1890626
[2] 
sudo crictl exec -i <Keepalived Container ID> yum list installed | grep -i keepalived

keepalived.x86_64                             2.0.10-10.el8_2.1                    @rhel-8-appstream-rpms-x86_64

Comment 2 Yossi Boaron 2020-12-06 18:24:23 UTC

I verified that the Keepalived RPM includes the bug fix on OCP 4.6.4.

Comment 8 errata-xmlrpc 2021-10-18 17:28:52 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.9.0 bug fix and security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:3759
