Bug 1993980
| Summary: | Kubelet regularly freeze control groups causing issues further down | | |
|---|---|---|---|
| Product: | OpenShift Container Platform | Reporter: | Stephen Benjamin <stbenjam> |
| Component: | Node | Assignee: | Kir Kolyshkin <kir> |
| Node sub component: | Kubelet | QA Contact: | Sunil Choudhary <schoudha> |
| Status: | CLOSED ERRATA | Docs Contact: | |
| Severity: | high | | |
| Priority: | unspecified | CC: | alukiano, aos-bugs, jluhrsen, kir, lakshmi.ravichandran1, nagrawal, sippy, wking |
| Version: | 4.9 | | |
| Target Milestone: | --- | | |
| Target Release: | 4.9.0 | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | | |
| : | 1999273 | Environment: | |
| Last Closed: | 2021-10-18 17:46:26 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Bug Depends On: | | | |
| Bug Blocks: | 1998391, 1999273 | | |
Description
Stephen Benjamin
2021-08-16 13:29:54 UTC
Fixed in https://github.com/opencontainers/runc/releases/tag/v1.0.2 by https://github.com/opencontainers/runc/pull/3167

Now we need to vendor it to all kubernetes releases:

master: https://github.com/kubernetes/kubernetes/pull/104528
1.22: https://github.com/kubernetes/kubernetes/pull/104529
1.21: https://github.com/kubernetes/kubernetes/pull/104530

I think that we should not try to bring it to 1.20 (the runc in there does the freeze so it's affected, but the amount of changes to backport is too high).

*** Bug 1996755 has been marked as a duplicate of this bug. ***

The bug is valid for 4.8 and 4.9. Fixed by:

4.9: https://github.com/openshift/kubernetes/pull/910
4.8: https://github.com/openshift/kubernetes/pull/912

Strictly speaking, we have two bugs here. One is "unit already exists", introduced in runc rc94 and it's fixed in runc 1.0.0. Another is "cgroup freeze", introduced in runc rc92, mostly fixed in runc 1.0.1, fully fixed in runc 1.0.2 (or so we hope).

> Another is "cgroup freeze", introduced in runc rc92, mostly fixed in runc 1.0.1, fully fixed in runc 1.0.2 (or so we hope).

Pardon me, this was introduced in rc91 (runc commit b810da149), not rc92.
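
To make the two version ranges above easier to check at a glance, here is a minimal sketch that maps a runc release to the issues this thread attributes to it. It uses the corrected rc91 starting point for the freeze bug. The release list is hard-coded for illustration, and `RUNC_RELEASES` and `known_issues` are hypothetical names, not part of runc or kubelet. As noted later in the thread, what matters for kubelet is the libcontainer version vendored at build time, not a standalone runc binary.

```python
# Ordered runc releases around the affected range (hard-coded assumption
# for illustration; "rcNN" stands for v1.0.0-rcNN).
RUNC_RELEASES = ["rc90", "rc91", "rc92", "rc93", "rc94", "rc95",
                 "1.0.0", "1.0.1", "1.0.2"]


def known_issues(version):
    """Return the issues from this bug report that apply to a runc release.

    Raises ValueError for a release that is not in RUNC_RELEASES.
    """
    pos = RUNC_RELEASES.index(version)

    def at_least(other):
        return pos >= RUNC_RELEASES.index(other)

    issues = []
    # "unit already exists": introduced in rc94, fixed in 1.0.0.
    if at_least("rc94") and not at_least("1.0.0"):
        issues.append("unit already exists")
    # "cgroup freeze": introduced in rc91, mostly fixed in 1.0.1,
    # fully fixed in 1.0.2.
    if at_least("rc91") and not at_least("1.0.1"):
        issues.append("cgroup freeze")
    elif at_least("1.0.1") and not at_least("1.0.2"):
        issues.append("cgroup freeze (mostly fixed)")
    return issues
```

For example, `known_issues("rc94")` reports both issues, while `known_issues("1.0.2")` returns an empty list.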
*** Bug 1996187 has been marked as a duplicate of this bug. ***

This is not about the version of a standalone runc binary being used; it is about the version of runc's libcontainer imported by kubelet during compilation.

Now, I am not sure whether 4.9.0-0.nightly-2021-09-01-193941 includes https://github.com/openshift/kubernetes/pull/910

Checked again and I see the last error was 5 days ago.
https://search.ci.openshift.org/?search=Unit+.*.slice+already+exists&maxAge=336h&context=1&type=junit&name=4.9&excludeName=&maxMatches=5&maxBytes=20971520&groupBy=job

Noticing a "[sig-arch] events should not repeat pathologically" test failure on 4.9 to 4.10 upgrade CI jobs / s390x from today (2021-09-21):

[sig-arch] events should not repeat pathologically 0s
1 events happened too frequently
event happened 38 times, something is wrong: ns/openshift-machine-api machine/libvirt-s390x-1-3-708-542pk-worker-0-qntpm - reason/Updated Updated Machine libvirt-s390x-1-3-708-542pk-worker-0-qntpm

https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-multiarch-master-nightly-4.10-upgrade-from-nightly-4.9-ocp-remote-libvirt-s390x/1440148923523010560

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.9.0 bug fix and security update), and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:3759
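
For readers who want to see what the CI search link in this thread actually queries, the sketch below decodes its parameters and tries the resulting pattern on a sample message. The sample string is hypothetical and not taken from a real log. Python's `re` module stands in for whatever regex engine the search service uses, which is an assumption.

```python
import re
from urllib.parse import parse_qs, urlparse

# The CI search URL quoted in the comment above.
CI_SEARCH_URL = (
    "https://search.ci.openshift.org/?search=Unit+.*.slice+already+exists"
    "&maxAge=336h&context=1&type=junit&name=4.9&excludeName=&maxMatches=5"
    "&maxBytes=20971520&groupBy=job"
)

# parse_qs decodes "+" as a space and drops blank values such as excludeName.
params = parse_qs(urlparse(CI_SEARCH_URL).query)
pattern = params["search"][0]   # the regular expression being searched for
max_age = params["maxAge"][0]   # look-back window: 336h, i.e. 14 days

# Hypothetical message, made up only to exercise the pattern.
SAMPLE = "Unit kubepods-burstable-pod1234.slice already exists."
matches = re.search(pattern, SAMPLE) is not None
```

Note that the dot before `slice` in the pattern is unescaped, so it matches any character, not only a literal dot. That is harmless for this search but slightly broader than the message text suggests.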