Bug 1931467

Summary: Kubelet consuming a large amount of CPU and memory and node becoming unhealthy
Product: OpenShift Container Platform
Reporter: Lucas López Montero <llopezmo>
Component: RHCOS
Assignee: Micah Abbott <miabbott>
Status: CLOSED ERRATA
QA Contact: Michael Nguyen <mnguyen>
Severity: high
Docs Contact:
Priority: medium
Version: 4.6
CC: aos-bugs, bbreard, bhershbe, harpatil, hhei, imcleod, jligon, llong, longman, miabbott, nagrawal, nstielau, rphillips, saniyer, smilner
Target Milestone: ---
Keywords: Reopened
Target Release: 4.8.0
Flags: miabbott: needinfo-
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Cause: Workloads that consumed memory faster than the kernel could reclaim it triggered the OOM killer, which caused nodes to lock up.
Consequence: Nodes were reported as unhealthy.
Fix: The kernel's memory reclamation and its handling of OOM situations were improved.
Result: Nodes are no longer reported as unhealthy during strenuous workloads.
Story Points: ---
Clone Of:
Environment:
Last Closed: 2021-07-27 22:47:38 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:

Comment 21 Harshal Patil 2021-03-02 07:09:44 UTC

*** This bug has been marked as a duplicate of bug 1857446 ***

Comment 36 Micah Abbott 2021-05-19 13:50:42 UTC
The problem discussed in this BZ should be addressed by the fix for BZ#1873759, which was resolved in `kernel-4.18.0-248.el8`, released as part of RHEL 8.4 GA.

RHCOS 48.84.202105182219-0 was built using RHEL 8.4 GA content and was successfully included in a 4.8 nightly release payload.

Moving to MODIFIED.

Comment 38 Michael Nguyen 2021-05-24 13:14:37 UTC
Verified on 4.8.0-0.nightly-2021-05-21-200728. RHCOS 48.84.202105211054-0 ships a newer kernel that includes the fix: kernel-4.18.0-305.el8.x86_64

$ oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.8.0-0.nightly-2021-05-21-200728   True        False         28m     Cluster version is 4.8.0-0.nightly-2021-05-21-200728
$ oc get nodes
NAME                                       STATUS   ROLES    AGE   VERSION
ci-ln-d7rw6f2-f76d1-9wg8c-master-0         Ready    master   47m   v1.21.0-rc.0+c656d63
ci-ln-d7rw6f2-f76d1-9wg8c-master-1         Ready    master   47m   v1.21.0-rc.0+c656d63
ci-ln-d7rw6f2-f76d1-9wg8c-master-2         Ready    master   47m   v1.21.0-rc.0+c656d63
ci-ln-d7rw6f2-f76d1-9wg8c-worker-b-7hgrx   Ready    worker   40m   v1.21.0-rc.0+c656d63
ci-ln-d7rw6f2-f76d1-9wg8c-worker-c-vmrbc   Ready    worker   39m   v1.21.0-rc.0+c656d63
ci-ln-d7rw6f2-f76d1-9wg8c-worker-d-j5l8b   Ready    worker   40m   v1.21.0-rc.0+c656d63
$ oc debug node/ci-ln-d7rw6f2-f76d1-9wg8c-worker-b-7hgrx
Starting pod/ci-ln-d7rw6f2-f76d1-9wg8c-worker-b-7hgrx-debug ...
To use host binaries, run `chroot /host`
If you don't see a command prompt, try pressing enter.
sh-4.2# chroot /host
sh-4.4# rpm -qa kernel
kernel-4.18.0-305.el8.x86_64
sh-4.4# rpm-ostree status
State: idle
Deployments:
* pivot://quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:f620068b78e684b615ac01c5b79d6043bee9727644b1a976d45ae023d49fa850
              CustomOrigin: Managed by machine-config-operator
                   Version: 48.84.202105211054-0 (2021-05-21T10:58:00Z)

  ostree://92ede04b462bc884de5562062fb45e06d803754cbaa466e3a2d34b4ee5e9634b
                   Version: 48.84.202105190318-0 (2021-05-19T03:22:10Z)
sh-4.4# exit
exit
sh-4.2# exit
exit

Removing debug pod ...
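
The kernel check in the session above amounts to comparing the installed kernel build against the first build that carries the fix. Below is a minimal Python sketch of that comparison. It assumes the fix first shipped in kernel-4.18.0-248.el8, as stated in Comment 36. The helper names `parse_kernel_nvr` and `has_fix` are hypothetical. The comparison only looks at the leading numeric fields of the package name, so it is a simplification and not a full implementation of RPM version ordering.

```python
import re
from typing import Tuple

# Assumption: first kernel build carrying the fix, taken from Comment 36.
FIXED_IN = "kernel-4.18.0-248.el8"


def parse_kernel_nvr(nvr: str) -> Tuple[int, ...]:
    """Return (major, minor, patch, release) from a kernel package name.

    Example input: 'kernel-4.18.0-305.el8.x86_64' (output of `rpm -qa kernel`).
    Only the leading numeric fields are parsed; anything after the first
    release number (z-stream suffixes, dist tag, arch) is ignored.
    """
    match = re.match(r"^kernel-(\d+)\.(\d+)\.(\d+)-(\d+)", nvr.strip())
    if match is None:
        raise ValueError(f"unrecognized kernel package name: {nvr!r}")
    return tuple(int(part) for part in match.groups())


def has_fix(installed: str, fixed_in: str = FIXED_IN) -> bool:
    """True if the installed kernel is at or above the first fixed build."""
    return parse_kernel_nvr(installed) >= parse_kernel_nvr(fixed_in)


if __name__ == "__main__":
    # Kernel reported on the verified worker node in this comment.
    print(has_fix("kernel-4.18.0-305.el8.x86_64"))  # True
```

Feeding the function the string printed by `rpm -qa kernel` on each node gives a quick yes/no answer per node. For kernels with z-stream releases (for example a release such as 305.3.1), a proper RPM version comparison would be needed instead of this tuple comparison.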

Comment 43 errata-xmlrpc 2021-07-27 22:47:38 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.8.2 bug fix and security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:2438