Bug 1931467 - Kubelet consuming a large amount of CPU and memory and node becoming unhealthy
Summary: Kubelet consuming a large amount of CPU and memory and node becoming unhealthy
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: RHCOS
Version: 4.6
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: high
Target Milestone: ---
Target Release: 4.8.0
Assignee: Micah Abbott
QA Contact: Michael Nguyen
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2021-02-22 13:35 UTC by Lucas López Montero
Modified: 2022-04-28 13:44 UTC

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Cause: Workloads that consumed memory faster than the kernel could reclaim it triggered the OOM killer, which caused nodes to lock up.
Consequence: Nodes were reported as unhealthy.
Fix: The kernel's memory reclaim and its handling of OOM situations were improved.
Result: Nodes are no longer reported as unhealthy during strenuous workloads.
Clone Of:
Environment:
Last Closed: 2021-07-27 22:47:38 UTC
Target Upstream Version:
Embargoed:
miabbott: needinfo-




Links
System | ID | Private | Priority | Status | Summary | Last Updated
Github | coreos coreos-assembler pull 2818 | 0 | None | Merged | kola/harness: check console for oom-killer | 2022-04-28 13:44:29 UTC
Red Hat Product Errata | RHSA-2021:2438 | 0 | None | None | None | 2021-07-27 22:48:13 UTC

Internal Links: 1962220

Comment 21 Harshal Patil 2021-03-02 07:09:44 UTC

*** This bug has been marked as a duplicate of bug 1857446 ***

Comment 36 Micah Abbott 2021-05-19 13:50:42 UTC
The problem discussed in this BZ should be addressed by the fix for BZ#1873759, which was resolved in `kernel-4.18.0-248.el8`, released as part of RHEL 8.4 GA.

RHCOS 48.84.202105182219-0 was built using RHEL 8.4 GA content and was successfully included in a 4.8 nightly release payload.

Moving to MODIFIED.

Comment 38 Michael Nguyen 2021-05-24 13:14:37 UTC
Verified on 4.8.0-0.nightly-2021-05-21-200728. RHCOS 48.84.202105211054-0 has a newer kernel that includes the fix: kernel-4.18.0-305.el8.x86_64

$ oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.8.0-0.nightly-2021-05-21-200728   True        False         28m     Cluster version is 4.8.0-0.nightly-2021-05-21-200728
$ oc get nodes
NAME                                       STATUS   ROLES    AGE   VERSION
ci-ln-d7rw6f2-f76d1-9wg8c-master-0         Ready    master   47m   v1.21.0-rc.0+c656d63
ci-ln-d7rw6f2-f76d1-9wg8c-master-1         Ready    master   47m   v1.21.0-rc.0+c656d63
ci-ln-d7rw6f2-f76d1-9wg8c-master-2         Ready    master   47m   v1.21.0-rc.0+c656d63
ci-ln-d7rw6f2-f76d1-9wg8c-worker-b-7hgrx   Ready    worker   40m   v1.21.0-rc.0+c656d63
ci-ln-d7rw6f2-f76d1-9wg8c-worker-c-vmrbc   Ready    worker   39m   v1.21.0-rc.0+c656d63
ci-ln-d7rw6f2-f76d1-9wg8c-worker-d-j5l8b   Ready    worker   40m   v1.21.0-rc.0+c656d63
$ oc debug node/ci-ln-d7rw6f2-f76d1-9wg8c-worker-b-7hgrx
Starting pod/ci-ln-d7rw6f2-f76d1-9wg8c-worker-b-7hgrx-debug ...
To use host binaries, run `chroot /host`
If you don't see a command prompt, try pressing enter.
sh-4.2# chroot /host
sh-4.4# rpm -qa kernel
kernel-4.18.0-305.el8.x86_64
sh-4.4# rpm-ostree status
State: idle
Deployments:
* pivot://quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:f620068b78e684b615ac01c5b79d6043bee9727644b1a976d45ae023d49fa850
              CustomOrigin: Managed by machine-config-operator
                   Version: 48.84.202105211054-0 (2021-05-21T10:58:00Z)

  ostree://92ede04b462bc884de5562062fb45e06d803754cbaa466e3a2d34b4ee5e9634b
                   Version: 48.84.202105190318-0 (2021-05-19T03:22:10Z)
sh-4.4# exit
exit
sh-4.2# exit
exit

Removing debug pod ...
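
The manual check in the transcript above (reading the `rpm -qa kernel` output and comparing it with the build named in Comment 36) could be scripted roughly as follows. This is only a sketch: the function name `kernel_has_fix` and the constants are hypothetical, and it assumes that comparing the first numeric component of the release field is enough. That assumption holds for the builds mentioned in this bug (248 and 305) but would not account for a fix backported to a z-stream kernel with a lower release number.

```python
import re

# Assumption: the fix first shipped in kernel-4.18.0-248.el8 (per Comment 36).
FIXED_VERSION = (4, 18, 0)
FIXED_RELEASE = 248

# Matches the leading "kernel-<major>.<minor>.<patch>-<release>" part of a
# package name such as "kernel-4.18.0-305.el8.x86_64".
_KERNEL_RE = re.compile(r"^kernel-(\d+)\.(\d+)\.(\d+)-(\d+)")


def kernel_has_fix(package_name: str) -> bool:
    """Return True if the kernel package is at or past the fixed build."""
    match = _KERNEL_RE.match(package_name.strip())
    if match is None:
        raise ValueError(f"unrecognized kernel package name: {package_name!r}")
    major, minor, patch, release = (int(group) for group in match.groups())
    # Compare the version first, then the leading release number.
    return ((major, minor, patch), release) >= (FIXED_VERSION, FIXED_RELEASE)


if __name__ == "__main__":
    # The kernel reported on the verified worker node in this comment.
    print(kernel_has_fix("kernel-4.18.0-305.el8.x86_64"))  # True
```

The input string is the line printed by `rpm -qa kernel` on the node, so the output of a debug session like the one above could be passed in directly.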

Comment 43 errata-xmlrpc 2021-07-27 22:47:38 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.8.2 bug fix and security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:2438

