Bug 1993980

Summary: Kubelet regularly freeze control groups causing issues further down
Product: OpenShift Container Platform Reporter: Stephen Benjamin <stbenjam>
Component: Node Assignee: Kir Kolyshkin <kir>
Node sub component: Kubelet QA Contact: Sunil Choudhary <schoudha>
Status: CLOSED ERRATA Docs Contact:
Severity: high    
Priority: unspecified CC: alukiano, aos-bugs, jluhrsen, kir, lakshmi.ravichandran1, nagrawal, sippy, wking
Version: 4.9   
Target Milestone: ---   
Target Release: 4.9.0   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
: 1999273 Environment:
Last Closed: 2021-10-18 17:46:26 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 1998391, 1999273    

Description Stephen Benjamin 2021-08-16 13:29:54 UTC
[sig-arch] events should not repeat pathologically


Since the 1.22 rebase, we've seen these events repeating:

event happened 47 times, something is wrong: ns/openshift-cluster-node-tuning-operator pod/tuned-9vncf node/ip-10-0-203-226.ec2.internal - reason/FailedCreatePodContainer unable to ensure pod container exists: failed to create container for [kubepods burstable pod4724bbf2-ae65-4296-87a2-f36f56b7cc03] : Unit kubepods-burstable-pod4724bbf2_ae65_4296_87a2_f36f56b7cc03.slice already exists.
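
The event above shows both the cgroup name components and the systemd unit name they map to. As a minimal sketch of that mapping (the helper `sliceUnitName` is hypothetical and only reproduces what is visible in the log line; it is not kubelet's actual code):

```go
package main

import (
	"fmt"
	"strings"
)

// sliceUnitName converts cgroup name components, as printed in the event
// message, into a systemd slice unit name. Hypothetical helper: it mirrors
// the mapping visible in the log line, not kubelet's implementation.
func sliceUnitName(components []string) string {
	escaped := make([]string, len(components))
	for i, c := range components {
		// "-" separates hierarchy levels in a slice name, so dashes
		// inside a component (here, the pod UID) become underscores.
		escaped[i] = strings.ReplaceAll(c, "-", "_")
	}
	return strings.Join(escaped, "-") + ".slice"
}

func main() {
	name := sliceUnitName([]string{
		"kubepods", "burstable", "pod4724bbf2-ae65-4296-87a2-f36f56b7cc03",
	})
	fmt.Println(name)
}
```

For the components in the event, this yields the unit name reported as already existing.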


see:
https://search.ci.openshift.org/?search=Unit+.*.slice+already+exists&maxAge=48h&context=1&type=bug%2Bjunit&name=4.9&excludeName=&maxMatches=5&maxBytes=20971520&groupBy=job


There is an upstream Kubernetes issue for this; see https://github.com/kubernetes/kubernetes/issues/104280 for details.

Comment 3 Kir Kolyshkin 2021-08-24 01:17:27 UTC
Now we need to vendor it into all Kubernetes releases:

master: https://github.com/kubernetes/kubernetes/pull/104528
  1.22: https://github.com/kubernetes/kubernetes/pull/104529
  1.21: https://github.com/kubernetes/kubernetes/pull/104530

I don't think we should try to bring it to 1.20: the runc vendored there does the freeze, so 1.20 is affected, but the amount of changes to backport is too high.

Comment 4 Ryan Phillips 2021-08-25 14:40:11 UTC
*** Bug 1996755 has been marked as a duplicate of this bug. ***

Comment 5 Kir Kolyshkin 2021-08-26 21:25:36 UTC
The bug is valid for 4.8 and 4.9.

Fixed by
4.9: https://github.com/openshift/kubernetes/pull/910
4.8: https://github.com/openshift/kubernetes/pull/912

Comment 6 Kir Kolyshkin 2021-08-26 21:37:08 UTC
Strictly speaking, we have two bugs here.

One is "unit already exists", introduced in runc rc94 and fixed in runc 1.0.0.
Another is "cgroup freeze", introduced in runc rc92, mostly fixed in runc 1.0.1, fully fixed in runc 1.0.2 (or so we hope).

Comment 7 Kir Kolyshkin 2021-08-26 21:45:53 UTC
> Another is "cgroup freeze", introduced in runc rc92, mostly fixed in runc 1.0.1, fully fixed in runc 1.0.2 (or so we hope).

Pardon me, this was introduced in rc91 (runc commit b810da149), not rc92.
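
To make the affected ranges concrete, here is a rough sketch that encodes them, using the corrected rc91 starting point for the freeze bug. The release list and the helper names are assumptions made for illustration; 1.0.1 is treated as still affected by the freeze bug because it is only described as "mostly fixed".

```go
package main

import "fmt"

// releases lists runc releases oldest first. Assumption: only the
// releases relevant to this report are included.
var releases = []string{"rc90", "rc91", "rc92", "rc93", "rc94", "rc95", "1.0.0", "1.0.1", "1.0.2"}

// index returns the position of v in releases, or -1 if it is unknown.
func index(v string) int {
	for i, r := range releases {
		if r == v {
			return i
		}
	}
	return -1
}

// affected reports whether v lies in the half-open range [introduced, fixed).
func affected(v, introduced, fixed string) (bool, error) {
	i := index(v)
	if i < 0 {
		return false, fmt.Errorf("unknown runc release %q", v)
	}
	return i >= index(introduced) && i < index(fixed), nil
}

// hasUnitExistsBug: introduced in rc94, fixed in 1.0.0.
func hasUnitExistsBug(v string) (bool, error) { return affected(v, "rc94", "1.0.0") }

// hasFreezeBug: introduced in rc91, fully fixed in 1.0.2 (1.0.1 is only
// "mostly fixed", so it is conservatively counted as affected).
func hasFreezeBug(v string) (bool, error) { return affected(v, "rc91", "1.0.2") }

func main() {
	for _, v := range releases {
		unit, _ := hasUnitExistsBug(v)
		freeze, _ := hasFreezeBug(v)
		fmt.Printf("%-6s unit-already-exists=%-5t cgroup-freeze=%t\n", v, unit, freeze)
	}
}
```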

Comment 8 Kir Kolyshkin 2021-08-27 01:07:21 UTC
*** Bug 1996187 has been marked as a duplicate of this bug. ***

Comment 12 Kir Kolyshkin 2021-09-02 15:04:53 UTC
This is not about the version of the standalone runc binary being used; it is about the version of runc's libcontainer that kubelet imports at compile time.

Now, I am not sure whether 4.9.0-0.nightly-2021-09-01-193941 includes https://github.com/openshift/kubernetes/pull/910

Comment 15 Lakshmi Ravichandran 2021-09-21 11:59:19 UTC
I'm seeing the "[sig-arch] events should not repeat pathologically" test fail on 4.9 to 4.10 upgrade CI jobs on s390x, starting today (2021-09-21):

[sig-arch] events should not repeat pathologically	0s
1 events happened too frequently

event happened 38 times, something is wrong: ns/openshift-machine-api machine/libvirt-s390x-1-3-708-542pk-worker-0-qntpm - reason/Updated Updated Machine libvirt-s390x-1-3-708-542pk-worker-0-qntpm
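
When comparing failures like this one with the original report, it can help to split the event line into its count, locator, reason, and message. The following is a minimal sketch; the regular expression and the `repeatedEvent` type are assumptions based only on the two event lines quoted in this bug, not on the test's source.

```go
package main

import (
	"fmt"
	"regexp"
	"strconv"
)

// eventRe matches lines of the form quoted in this report:
// "event happened N times, something is wrong: <locator> - reason/<Reason> <message>"
var eventRe = regexp.MustCompile(
	`^event happened (\d+) times, something is wrong: (.*?) - reason/(\S+) (.*)$`)

// repeatedEvent holds the fields extracted from one event line.
type repeatedEvent struct {
	Count   int
	Locator string
	Reason  string
	Message string
}

// parseRepeatedEvent extracts the fields from a single event line.
func parseRepeatedEvent(line string) (repeatedEvent, error) {
	m := eventRe.FindStringSubmatch(line)
	if m == nil {
		return repeatedEvent{}, fmt.Errorf("not a repeated-event line: %q", line)
	}
	n, err := strconv.Atoi(m[1])
	if err != nil {
		return repeatedEvent{}, err
	}
	return repeatedEvent{Count: n, Locator: m[2], Reason: m[3], Message: m[4]}, nil
}

func main() {
	line := "event happened 38 times, something is wrong: " +
		"ns/openshift-machine-api machine/libvirt-s390x-1-3-708-542pk-worker-0-qntpm" +
		" - reason/Updated Updated Machine libvirt-s390x-1-3-708-542pk-worker-0-qntpm"
	ev, err := parseRepeatedEvent(line)
	if err != nil {
		fmt.Println("parse error:", err)
		return
	}
	fmt.Printf("count=%d reason=%s locator=%s\n", ev.Count, ev.Reason, ev.Locator)
}
```

Parsed this way, the event here has reason Updated on a Machine object, whereas the original report concerns reason FailedCreatePodContainer on a pod.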

https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-multiarch-master-nightly-4.10-upgrade-from-nightly-4.9-ocp-remote-libvirt-s390x/1440148923523010560

Comment 17 errata-xmlrpc 2021-10-18 17:46:26 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.9.0 bug fix and security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:3759