Bug 1993980 - Kubelet regularly freeze control groups causing issues further down
Summary: Kubelet regularly freeze control groups causing issues further down
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Node
Version: 4.9
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: ---
Target Release: 4.9.0
Assignee: Kir Kolyshkin
QA Contact: Sunil Choudhary
URL:
Whiteboard:
Duplicates: 1996187 1996755
Depends On:
Blocks: 1998391 1999273
Reported: 2021-08-16 13:29 UTC by Stephen Benjamin
Modified: 2021-10-18 17:46 UTC (History)

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Clones: 1999273
Environment:
Last Closed: 2021-10-18 17:46:26 UTC
Target Upstream Version:
Embargoed:




Links
System                   ID                                    Private  Priority  Status  Summary  Last Updated
Github                   kubernetes/kubernetes/issues/104280   0        None      None    None     2021-08-16 13:30:15 UTC
Github                   openshift/kubernetes/pull/910         0        None      None    None     2021-08-26 21:27:32 UTC
Github                   openshift/origin/pull/26409           0        None      None    None     2021-08-18 19:02:33 UTC
Red Hat Product Errata   RHSA-2021:3759                        0        None      None    None     2021-10-18 17:46:41 UTC

Description Stephen Benjamin 2021-08-16 13:29:54 UTC
[sig-arch] events should not repeat pathologically


Since the 1.22 rebase, we've seen these events repeating:

event happened 47 times, something is wrong: ns/openshift-cluster-node-tuning-operator pod/tuned-9vncf node/ip-10-0-203-226.ec2.internal - reason/FailedCreatePodContainer unable to ensure pod container exists: failed to create container for [kubepods burstable pod4724bbf2-ae65-4296-87a2-f36f56b7cc03] : Unit kubepods-burstable-pod4724bbf2_ae65_4296_87a2_f36f56b7cc03.slice already exists.
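
The event text shows how the cgroup name in brackets relates to the systemd unit that "already exists": dashes inside each component become underscores, the components are joined with dashes, and ".slice" is appended. A minimal sketch of that mapping, inferred only from the message above (the function name is hypothetical and this is not kubelet's actual code):

```python
def systemd_slice_name(cgroup_components):
    """Build a systemd slice unit name from cgroup name components.

    Assumption: the rule is inferred from the event text in this report,
    not copied from kubelet or runc sources.
    """
    # A dash separates hierarchy levels in a slice name, so dashes that
    # belong to a component (e.g. in the pod UID) are replaced by underscores.
    escaped = [part.replace("-", "_") for part in cgroup_components]
    return "-".join(escaped) + ".slice"


if __name__ == "__main__":
    components = ["kubepods", "burstable", "pod4724bbf2-ae65-4296-87a2-f36f56b7cc03"]
    print(systemd_slice_name(components))
```

Deriving the unit name this way can make it easier to search node journals for the slice named in a failing event.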


see:
https://search.ci.openshift.org/?search=Unit+.*.slice+already+exists&maxAge=48h&context=1&type=bug%2Bjunit&name=4.9&excludeName=&maxMatches=5&maxBytes=20971520&groupBy=job


This is tracked upstream in Kubernetes; see https://github.com/kubernetes/kubernetes/issues/104280 for details.

Comment 3 Kir Kolyshkin 2021-08-24 01:17:27 UTC
Now we need to vendor it into all Kubernetes releases:

master: https://github.com/kubernetes/kubernetes/pull/104528
  1.22: https://github.com/kubernetes/kubernetes/pull/104529
  1.21: https://github.com/kubernetes/kubernetes/pull/104530

I think that we should not try to bring it to 1.20 (the runc vendored there does the freeze, so it is affected, but the number of changes to backport is too large).

Comment 4 Ryan Phillips 2021-08-25 14:40:11 UTC
*** Bug 1996755 has been marked as a duplicate of this bug. ***

Comment 5 Kir Kolyshkin 2021-08-26 21:25:36 UTC
The bug is valid for 4.8 and 4.9.

Fixed by
4.9: https://github.com/openshift/kubernetes/pull/910
4.8: https://github.com/openshift/kubernetes/pull/912

Comment 6 Kir Kolyshkin 2021-08-26 21:37:08 UTC
Strictly speaking, we have two bugs here.

One is "unit already exists", introduced in runc rc94 and it's fixed in runc 1.0.0. 
Another is "cgroup freeze", introduced in runc rc92, mostly fixed in runc 1.0.1, fully fixed in runc 1.0.2 (or so we hope).

Comment 7 Kir Kolyshkin 2021-08-26 21:45:53 UTC
> Another is "cgroup freeze", introduced in runc rc92, mostly fixed in runc 1.0.1, fully fixed in runc 1.0.2 (or so we hope).

Pardon me, this was introduced in rc91 (runc commit b810da149), not rc92.
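
The version ranges from comments 6 and 7 can be summarized as a small lookup. This is only a sketch: the release list is an assumption about the order of runc releases in this range, the function name is hypothetical, and "fixed" for the freeze bug in 1.0.2 carries the same "or so we hope" caveat as above.

```python
# Assumed release order for the range discussed here (oldest to newest).
RUNC_RELEASES = ["rc90", "rc91", "rc92", "rc93", "rc94", "rc95",
                 "1.0.0", "1.0.1", "1.0.2"]


def _at_least(version, floor):
    """True if `version` is the same as or newer than `floor`."""
    return RUNC_RELEASES.index(version) >= RUNC_RELEASES.index(floor)


def bug_status(version):
    """Return the status of both bugs for a runc/libcontainer release."""
    if version not in RUNC_RELEASES:
        raise ValueError(f"unknown runc release: {version}")

    # "unit already exists": introduced in rc94, fixed in 1.0.0.
    if not _at_least(version, "rc94"):
        unit = "not affected"
    elif not _at_least(version, "1.0.0"):
        unit = "affected"
    else:
        unit = "fixed"

    # "cgroup freeze": introduced in rc91 (per the correction in comment 7),
    # mostly fixed in 1.0.1, fully fixed in 1.0.2.
    if not _at_least(version, "rc91"):
        freeze = "not affected"
    elif not _at_least(version, "1.0.1"):
        freeze = "affected"
    elif version == "1.0.1":
        freeze = "mostly fixed"
    else:
        freeze = "fixed"

    return {"unit already exists": unit, "cgroup freeze": freeze}


if __name__ == "__main__":
    for release in RUNC_RELEASES:
        print(release, bug_status(release))
```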

Comment 8 Kir Kolyshkin 2021-08-27 01:07:21 UTC
*** Bug 1996187 has been marked as a duplicate of this bug. ***

Comment 12 Kir Kolyshkin 2021-09-02 15:04:53 UTC
This is not about the version of the standalone runc binary being used; it is about the version of runc's libcontainer that kubelet imports at compile time.

Now, I am not sure whether 4.9.0-0.nightly-2021-09-01-193941 includes https://github.com/openshift/kubernetes/pull/910
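
One possible way to check which libcontainer version a given kubelet binary was compiled against is to read the module information embedded in the binary with `go version -m`. The sketch below assumes that the kubelet binary in the build carries Go module information and that a Go toolchain is available where the check runs. Neither is confirmed in this report. The helper names and the sample text are hypothetical.

```python
import subprocess

RUNC_MODULE = "github.com/opencontainers/runc"


def parse_module_version(build_info, module=RUNC_MODULE):
    """Return the version recorded for `module` in `go version -m` output.

    Returns None if the module is not listed.
    """
    lines = [line.split() for line in build_info.splitlines()]
    for i, fields in enumerate(lines):
        if len(fields) >= 3 and fields[0] == "dep" and fields[1] == module:
            version = fields[2]
            # A following "=>" line means the module was replaced in go.mod;
            # the replacement is what was actually compiled in.
            nxt = lines[i + 1] if i + 1 < len(lines) else []
            if len(nxt) >= 3 and nxt[0] == "=>":
                version = nxt[2]
            return version
    return None


def kubelet_runc_version(kubelet_path):
    """Run `go version -m` on a kubelet binary and extract the runc version."""
    result = subprocess.run(
        ["go", "version", "-m", kubelet_path],
        check=True, capture_output=True, text=True,
    )
    return parse_module_version(result.stdout)


if __name__ == "__main__":
    # Made-up sample in the shape of `go version -m` output, for illustration only.
    sample = (
        "kubelet: go1.16.6\n"
        "\tpath\tk8s.io/kubernetes/cmd/kubelet\n"
        "\tdep\tgithub.com/opencontainers/runc\tv1.0.2\th1:example=\n"
    )
    print(parse_module_version(sample))
```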

Comment 15 Lakshmi Ravichandran 2021-09-21 11:59:19 UTC
Noticing the "[sig-arch] events should not repeat pathologically" test failing on 4.9 to 4.10 upgrade CI jobs on s390x starting today (2021-09-21):

[sig-arch] events should not repeat pathologically 0s
1 events happened too frequently

event happened 38 times, something is wrong: ns/openshift-machine-api machine/libvirt-s390x-1-3-708-542pk-worker-0-qntpm - reason/Updated Updated Machine libvirt-s390x-1-3-708-542pk-worker-0-qntpm

https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-multiarch-master-nightly-4.10-upgrade-from-nightly-4.9-ocp-remote-libvirt-s390x/1440148923523010560
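
When triaging failures of this test, it can help to split each "event happened N times" line into its count, locator, and reason, so that events with different reasons (for example FailedCreatePodContainer in the description versus Updated here) are not lumped together. A minimal sketch, assuming the message layout seen in this report. The regular expression and names are hypothetical and not taken from the test's source.

```python
import re
from typing import NamedTuple, Optional

# Layout assumed from the messages quoted in this report:
#   event happened <N> times, something is wrong: <locator> - reason/<Reason> <message>
EVENT_RE = re.compile(
    r"event happened (?P<count>\d+) times, something is wrong: "
    r"(?P<locator>.*?) - reason/(?P<reason>\S+) (?P<message>.*)"
)


class RepeatedEvent(NamedTuple):
    count: int
    locator: str
    reason: str
    message: str


def parse_repeated_event(line: str) -> Optional[RepeatedEvent]:
    """Parse one failure line; return None if it does not match the layout."""
    m = EVENT_RE.search(line)
    if m is None:
        return None
    return RepeatedEvent(int(m["count"]), m["locator"], m["reason"], m["message"])


if __name__ == "__main__":
    line = (
        "event happened 38 times, something is wrong: "
        "ns/openshift-machine-api machine/libvirt-s390x-1-3-708-542pk-worker-0-qntpm"
        " - reason/Updated Updated Machine libvirt-s390x-1-3-708-542pk-worker-0-qntpm"
    )
    print(parse_repeated_event(line))
```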

Comment 17 errata-xmlrpc 2021-10-18 17:46:26 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.9.0 bug fix and security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:3759

