Created attachment 1730508
Screenshot from top showing keepalived eating 100% of a CPU

Description of problem:
Keepalived goes wild on a single-node OpenShift and eats 100% of a CPU. This slows down the machine. Since the node is already running at the edge of its capacity (most of the resources are dedicated to a DPDK workload in the guaranteed resource class), this contributes to a total overload of the node.

This issue was originally reported on RHEL as https://bugzilla.redhat.com/show_bug.cgi?id=1890626 and was believed to be fixed.

Version-Release number of selected component (if applicable):

Containers:
  keepalived:
    Image:    quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:bd5ec3fe868531b24bcfe6f91df387144744da2edf04ee120eb464c3682821de
    Image ID: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:bd5ec3fe868531b24bcfe6f91df387144744da2edf04ee120eb464c3682821de

[msivak@localhost packages]$ oc exec -it keepalived-<node>.redhat.com -n openshift-kni-infra -- sh
Defaulting container name to keepalived.
Use 'oc describe pod/keepalived-<node>.redhat.com -n openshift-kni-infra' to see all of the containers in this pod.
sh-4.4# rpm
rpm  rpm2archive  rpm2cpio  rpmdb  rpmkeys  rpmquery  rpmverify
sh-4.4# rpm -qi keepalived
Name        : keepalived
Version     : 2.0.10
Release     : 10.el8
Architecture: x86_64
Install Date: Sat Oct 31 16:03:18 2020
Group       : System Environment/Daemons
Size        : 1448487
License     : GPLv2+
Signature   : RSA/SHA256, Mon Feb 24 18:10:55 2020, Key ID 199e2f91fd431d51
Source RPM  : keepalived-2.0.10-10.el8.src.rpm
Build Date  : Mon Feb 24 17:45:36 2020
Build Host  : x86-vm-01.build.eng.bos.redhat.com

How reproducible:
Always on my single-node OpenShift bare-metal setup. It usually happens when I load the machine, or overnight.

Steps to Reproduce:
1. Just wait

Additional info:
There were two proposed workarounds by Yossi Baron:

- Edit /etc/kubernetes/manifests/keepalived.yaml and change the liveness probe to:
    ps -C keepalived -o pid=,pcpu= | awk --assign maxcpu=75 '$2>maxcpu {exit 1}'

- Comment out the API track scripts and the Ingress script in /etc/kubernetes/static-pod-resources/keepalived/keepalived.conf.tmpl and then restart the keepalived-monitor container to apply the changes:
    sudo crictl rm -f <container-id>

The first one works with one caveat: when the machine is overloaded, there is no CPU time to give to the liveness probe. The second one seems to work so far.
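For anyone who wants to see how the first workaround's probe decides pass or fail without touching the static pod manifest, here is a minimal sketch. It feeds made-up "PID %CPU" lines to the same awk test instead of running ps. The function name check_cpu and the sample values are hypothetical and not part of the proposed workaround. It uses -v, the short form of gawk's --assign.

```shell
# Hypothetical helper: reads "PID %CPU" lines on stdin, the format printed by
# `ps -C keepalived -o pid=,pcpu=`, and exits non-zero as soon as any line
# reports CPU usage above the threshold (default 75, as in the workaround).
check_cpu() {
    awk -v maxcpu="${1:-75}" '$2 > maxcpu { exit 1 }'
}

# Made-up sample line, not a real measurement from the affected node.
if printf '1234 99.9\n' | check_cpu 75; then
    echo "probe passes"
else
    echo "probe fails: keepalived above threshold"
fi
```

A non-zero exit is what makes the kubelet treat the liveness probe as failed and restart the container. The caveat above still applies: on an overloaded node the probe itself may not get scheduled in time.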
The root cause of this problem is a bug in the Keepalived RPM (see [1]) that was addressed in keepalived-2.0.10-10.el8_2.1.x86_64.

Since the Keepalived container image in OCP 4.6.x uses a Keepalived RPM version that includes the bug fix (see [2]), we can close this bug.

[1] https://bugzilla.redhat.com/show_bug.cgi?id=1890626

[2] sudo crictl exec -i <Keepalived Container ID> yum list installed | grep -i keepalived
    keepalived.x86_64    2.0.10-10.el8_2.1    @rhel-8-appstream-rpms-x86_64
I verified that the Keepalived RPM in OCP 4.6.4 includes the bug fix.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.9.0 bug fix and security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2021:3759