Bug 1931467
| Summary: | Kubelet consuming a large amount of CPU and memory and node becoming unhealthy | ||
|---|---|---|---|
| Product: | OpenShift Container Platform | Reporter: | Lucas López Montero <llopezmo> |
| Component: | RHCOS | Assignee: | Micah Abbott <miabbott> |
| Status: | CLOSED ERRATA | QA Contact: | Michael Nguyen <mnguyen> |
| Severity: | high | Docs Contact: | |
| Priority: | medium | ||
| Version: | 4.6 | CC: | aos-bugs, bbreard, bhershbe, harpatil, hhei, imcleod, jligon, llong, longman, miabbott, nagrawal, nstielau, rphillips, saniyer, smilner |
| Target Milestone: | --- | Keywords: | Reopened |
| Target Release: | 4.8.0 | Flags: | miabbott: needinfo- |
| Hardware: | Unspecified | ||
| OS: | Unspecified | ||
| Whiteboard: | |||
| Fixed In Version: | Doc Type: | Bug Fix | |
| Doc Text: |
Cause: Workloads that consumed memory faster than the kernel could reclaim it triggered the OOM killer, which caused nodes to lock up.
Consequence: Nodes were reported as unhealthy.
Fix: The kernel's memory reclaim and its handling of OOM situations were improved.
Result: Nodes are no longer reported as unhealthy during strenuous workloads.
|
Story Points: | --- |
| Clone Of: | Environment: | ||
| Last Closed: | 2021-07-27 22:47:38 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | Category: | --- | |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | |||
|
Comment 21
Harshal Patil
2021-03-02 07:09:44 UTC
The problem discussed in this BZ should be addressed by the fix for BZ#1873759, which was resolved in `kernel-4.18.0-248.el8`, released as part of RHEL 8.4 GA.
RHCOS 48.84.202105182219-0 was built using RHEL 8.4 GA content and was successfully included in a 4.8 nightly release payload. Moving to MODIFIED.
Verified on 4.8.0-0.nightly-2021-05-21-200728. RHCOS 48.84.202105211054-0 has a newer kernel that includes the fix: kernel-4.18.0-305.el8.x86_64
$ oc get clusterversion
NAME VERSION AVAILABLE PROGRESSING SINCE STATUS
version 4.8.0-0.nightly-2021-05-21-200728 True False 28m Cluster version is 4.8.0-0.nightly-2021-05-21-200728
$ oc get nodes
NAME STATUS ROLES AGE VERSION
ci-ln-d7rw6f2-f76d1-9wg8c-master-0 Ready master 47m v1.21.0-rc.0+c656d63
ci-ln-d7rw6f2-f76d1-9wg8c-master-1 Ready master 47m v1.21.0-rc.0+c656d63
ci-ln-d7rw6f2-f76d1-9wg8c-master-2 Ready master 47m v1.21.0-rc.0+c656d63
ci-ln-d7rw6f2-f76d1-9wg8c-worker-b-7hgrx Ready worker 40m v1.21.0-rc.0+c656d63
ci-ln-d7rw6f2-f76d1-9wg8c-worker-c-vmrbc Ready worker 39m v1.21.0-rc.0+c656d63
ci-ln-d7rw6f2-f76d1-9wg8c-worker-d-j5l8b Ready worker 40m v1.21.0-rc.0+c656d63
$ oc debug node/ci-ln-d7rw6f2-f76d1-9wg8c-worker-b-7hgrx
Starting pod/ci-ln-d7rw6f2-f76d1-9wg8c-worker-b-7hgrx-debug ...
To use host binaries, run `chroot /host`
If you don't see a command prompt, try pressing enter.
sh-4.2# chroot /host
sh-4.4# rpm -qa kernel
kernel-4.18.0-305.el8.x86_64
sh-4.4# rpm-ostree status
State: idle
Deployments:
* pivot://quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:f620068b78e684b615ac01c5b79d6043bee9727644b1a976d45ae023d49fa850
CustomOrigin: Managed by machine-config-operator
Version: 48.84.202105211054-0 (2021-05-21T10:58:00Z)
ostree://92ede04b462bc884de5562062fb45e06d803754cbaa466e3a2d34b4ee5e9634b
Version: 48.84.202105190318-0 (2021-05-19T03:22:10Z)
sh-4.4# exit
exit
sh-4.2# exit
exit
Removing debug pod ...
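The manual check in the debug shell above (compare the node's kernel against the build that carries the fix) could be scripted roughly as follows. This is only a sketch: the function name `kernel_has_fix` and the variable `FIXED_KERNEL` are hypothetical, the comparison relies on GNU `sort -V` rather than RPM's own version comparison, and it looks only at the `4.18.0-248` build number quoted in this bug, so it ignores any z-stream backports.

```shell
# Hypothetical helper, not part of any Red Hat tooling.
# Build number taken from this bug: kernel-4.18.0-248.el8
FIXED_KERNEL="4.18.0-248"

kernel_has_fix() {
    # Accepts "kernel-4.18.0-305.el8.x86_64" or plain `uname -r` output.
    local ver="${1#kernel-}"   # drop the optional "kernel-" prefix
    ver="${ver%%.el*}"         # drop the ".el8..." dist/arch suffix
    # sort -V orders version strings; if FIXED_KERNEL sorts first
    # (or the two are equal), the given kernel is at or above the fix.
    [ "$(printf '%s\n%s\n' "$FIXED_KERNEL" "$ver" | sort -V | head -n1)" = "$FIXED_KERNEL" ]
}
```

On a node (after `chroot /host`), it might be used as `kernel_has_fix "$(uname -r)" && echo "kernel includes fix"`.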
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.8.2 bug fix and security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.
https://access.redhat.com/errata/RHSA-2021:2438