Bug 1999273

Summary: [4.8] Kubelet regularly freezes control groups, causing issues further down
Product: OpenShift Container Platform
Reporter: Kir Kolyshkin <kir>
Component: Node
Assignee: Kir Kolyshkin <kir>
Node sub component: Kubelet
QA Contact: Sunil Choudhary <schoudha>
Status: CLOSED DUPLICATE
Docs Contact:
Severity: high
Priority: unspecified
CC: alukiano, aos-bugs, jluhrsen, nagrawal, schoudha, sippy, stbenjam
Version: 4.8
Target Milestone: ---
Target Release: 4.8.z
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: 1993980
Environment:
Last Closed: 2021-08-30 19:34:49 UTC
Type: ---
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On: 1993980
Bug Blocks:

Description Kir Kolyshkin 2021-08-30 18:57:07 UTC
+++ This bug was initially created as a clone of Bug #1993980 +++

[sig-arch] events should not repeat pathologically


Since the 1.22 rebase, we've seen these events repeating:

event happened 47 times, something is wrong: ns/openshift-cluster-node-tuning-operator pod/tuned-9vncf node/ip-10-0-203-226.ec2.internal - reason/FailedCreatePodContainer unable to ensure pod container exists: failed to create container for [kubepods burstable pod4724bbf2-ae65-4296-87a2-f36f56b7cc03] : Unit kubepods-burstable-pod4724bbf2_ae65_4296_87a2_f36f56b7cc03.slice already exists.
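The unit name in the event can be read as a transformation of the cgroup name components shown in brackets. The following is a minimal sketch of that mapping, inferred only from the message above rather than copied from the kubelet source; the function name `systemd_slice_name` is hypothetical.

```python
def systemd_slice_name(cgroup_components):
    """Join cgroup name components into a systemd slice unit name.

    Inferred from the event text; not the kubelet implementation.
    """
    # systemd uses "-" in a slice name as a hierarchy separator, so dashes
    # inside a single component are replaced with underscores first.
    escaped = [component.replace("-", "_") for component in cgroup_components]
    return "-".join(escaped) + ".slice"


components = ["kubepods", "burstable", "pod4724bbf2-ae65-4296-87a2-f36f56b7cc03"]
print(systemd_slice_name(components))
```

Run against the components from the event, this yields the same unit name that systemd reports as already existing.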


see:
https://search.ci.openshift.org/?search=Unit+.*.slice+already+exists&maxAge=48h&context=1&type=bug%2Bjunit&name=4.9&excludeName=&maxMatches=5&maxBytes=20971520&groupBy=job
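For readers who want to see what the link actually searches for, here is a small sketch that decodes the query string and tries the search pattern against the tail of the event message quoted above. It assumes the `search` parameter is treated as a regular expression; Python's `re` module stands in for whatever engine the search service uses, so edge-case behavior may differ.

```python
import re
from urllib.parse import parse_qs, urlsplit

SEARCH_URL = (
    "https://search.ci.openshift.org/?search=Unit+.*.slice+already+exists"
    "&maxAge=48h&context=1&type=bug%2Bjunit&name=4.9&excludeName="
    "&maxMatches=5&maxBytes=20971520&groupBy=job"
)

# parse_qs decodes "+" to a space and "%2B" to a literal "+";
# parameters with empty values (excludeName) are dropped by default.
params = parse_qs(urlsplit(SEARCH_URL).query)
pattern = params["search"][0]

# Tail of the event message from this report.
EVENT = (
    "failed to create container for "
    "[kubepods burstable pod4724bbf2-ae65-4296-87a2-f36f56b7cc03] : "
    "Unit kubepods-burstable-pod4724bbf2_ae65_4296_87a2_f36f56b7cc03.slice "
    "already exists."
)

match = re.search(pattern, EVENT)
print(pattern)
print(params["type"][0], params["name"][0], params["maxAge"][0])
print(match is not None)
```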


There is an upstream Kubernetes issue for this; see https://github.com/kubernetes/kubernetes/issues/104280 for details.

--- Additional comment from Eric Paris on 2021-08-19 16:00:18 UTC ---

This bug sets blocker+ without setting a Target Release. This is an invalid state as it is impossible to determine what is being blocked. Please be sure to set Priority, Severity, and Target Release before you attempt to set blocker+

--- Additional comment from Kir Kolyshkin on 2021-08-23 19:53:13 UTC ---

Fixed in https://github.com/opencontainers/runc/releases/tag/v1.0.2 by https://github.com/opencontainers/runc/pull/3167

--- Additional comment from Kir Kolyshkin on 2021-08-24 01:17:27 UTC ---

Now we need to vendor it into all Kubernetes releases:

master: https://github.com/kubernetes/kubernetes/pull/104528
  1.22: https://github.com/kubernetes/kubernetes/pull/104529
  1.21: https://github.com/kubernetes/kubernetes/pull/104530

I think we should not try to bring it to 1.20: the runc vendored there does the freeze, so it is affected, but the number of changes to backport is too large.

--- Additional comment from Ryan Phillips on 2021-08-25 14:40:11 UTC ---



--- Additional comment from Kir Kolyshkin on 2021-08-26 21:25:36 UTC ---

The bug is valid for 4.8 and 4.9.

Fixed by
4.9: https://github.com/openshift/kubernetes/pull/910
4.8: https://github.com/openshift/kubernetes/pull/912

--- Additional comment from Kir Kolyshkin on 2021-08-26 21:37:08 UTC ---

Strictly speaking, we have two bugs here.

One is "unit already exists", introduced in runc rc94 and fixed in runc 1.0.0.
The other is "cgroup freeze", introduced in runc rc92, mostly fixed in runc 1.0.1, fully fixed in runc 1.0.2 (or so we hope).

--- Additional comment from Kir Kolyshkin on 2021-08-26 21:45:53 UTC ---

> Another is "cgroup freeze", introduced in runc rc92, mostly fixed in runc 1.0.1, fully fixed in runc 1.0.2 (or so we hope).

Pardon me, this was introduced in rc91 (runc commit b810da149), not rc92.
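The version ranges from the two comments above (with the rc91 correction applied) can be sketched as a small lookup. This is an illustration only: it assumes the "rcNN" names refer to tags of the form 1.0.0-rcNN, the helper names `runc_key` and `bug_status` are hypothetical, and "fixed" for 1.0.2 carries the same "or so we hope" caveat as the comment.

```python
import re


def runc_key(version):
    """Sort key for runc versions such as '1.0.0-rc94' or 'v1.0.2'."""
    m = re.fullmatch(r"v?(\d+)\.(\d+)\.(\d+)(?:-rc(\d+))?", version)
    if m is None:
        raise ValueError(f"unrecognized runc version: {version!r}")
    major, minor, patch, rc = m.groups()
    # A release candidate sorts before the final release of the same version.
    rc_rank = (0, int(rc)) if rc is not None else (1, 0)
    return (int(major), int(minor), int(patch)) + rc_rank


def bug_status(version):
    """Classify a runc version against the two bugs described in this report."""
    key = runc_key(version)

    # "unit already exists": introduced in rc94, fixed in 1.0.0.
    if key < runc_key("1.0.0-rc94"):
        unit_exists = "not affected"
    elif key < runc_key("1.0.0"):
        unit_exists = "affected"
    else:
        unit_exists = "fixed"

    # "cgroup freeze": introduced in rc91, mostly fixed in 1.0.1, fixed in 1.0.2.
    if key < runc_key("1.0.0-rc91"):
        freeze = "not affected"
    elif key < runc_key("1.0.1"):
        freeze = "affected"
    elif key < runc_key("1.0.2"):
        freeze = "mostly fixed"
    else:
        freeze = "fixed"

    return {"unit already exists": unit_exists, "cgroup freeze": freeze}


for v in ("1.0.0-rc92", "1.0.0-rc95", "1.0.0", "1.0.1", "v1.0.2"):
    print(v, bug_status(v))
```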

--- Additional comment from Kir Kolyshkin on 2021-08-27 01:07:21 UTC ---



--- Additional comment from OpenShift Automated Release Tooling on 2021-08-27 21:41:27 UTC ---

Elliott changed bug status from MODIFIED to ON_QA.

--- Additional comment from OpenShift Automated Release Tooling on 2021-08-27 21:41:29 UTC ---

This bug will be shipped at next planned release date of 4.9 if this is not a GA bug.

Comment 1 Kir Kolyshkin 2021-08-30 19:34:49 UTC
https://bugzilla.redhat.com/show_bug.cgi?id=1998391

*** This bug has been marked as a duplicate of bug 1998391 ***