1993980 – Kubelet regularly freeze control groups causing issues further down

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	OpenShift Container Platform
Classification:	Red Hat
Component:	Node
Sub Component:
Version:	4.9
Hardware:	Unspecified
OS:	Unspecified
Priority:	unspecified
Severity:	high
Target Milestone:	---
Target Release:	4.9.0
Assignee:	Kir Kolyshkin
QA Contact:	Sunil Choudhary
Duplicates (2):	1996187 1996755
Depends On:
Blocks:	1998391 1999273

Reported:	2021-08-16 13:29 UTC by Stephen Benjamin
Modified:	2021-10-18 17:46 UTC
CC List:	8 users
Clones:	1999273
Environment:
Last Closed:	2021-10-18 17:46:26 UTC
Target Upstream Version:
Embargoed:

Links
System	ID	Priority	Status	Summary	Last Updated
Github	kubernetes kubernetes issues 104280	None	None	None	2021-08-16 13:30:15 UTC
Github	openshift kubernetes pull 910	None	None	None	2021-08-26 21:27:32 UTC
Github	openshift origin pull 26409	None	None	None	2021-08-18 19:02:33 UTC
Red Hat Product Errata	RHSA-2021:3759	None	None	None	2021-10-18 17:46:41 UTC

Description Stephen Benjamin 2021-08-16 13:29:54 UTC

[sig-arch] events should not repeat pathologically


Since the 1.22 rebase, we've seen these events repeating:

event happened 47 times, something is wrong: ns/openshift-cluster-node-tuning-operator pod/tuned-9vncf node/ip-10-0-203-226.ec2.internal - reason/FailedCreatePodContainer unable to ensure pod container exists: failed to create container for [kubepods burstable pod4724bbf2-ae65-4296-87a2-f36f56b7cc03] : Unit kubepods-burstable-pod4724bbf2_ae65_4296_87a2_f36f56b7cc03.slice already exists.
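
The event above pairs a kubelet cgroup name with the systemd unit it collides with. A minimal Python sketch of that naming relationship follows. It is inferred only from the two names in this message; the function name `slice_unit_name` is hypothetical, and the real conversion lives in kubelet's Go code and may handle more cases.

```python
def slice_unit_name(components):
    """Build a systemd slice unit name from cgroup name components.

    Assumption (inferred from the event message, not from kubelet source):
    dashes inside a component become underscores, components are joined
    with dashes, and ".slice" is appended.
    """
    escaped = [c.replace("-", "_") for c in components]
    return "-".join(escaped) + ".slice"


# Components copied from the event message above.
name = slice_unit_name(
    ["kubepods", "burstable", "pod4724bbf2-ae65-4296-87a2-f36f56b7cc03"]
)
print(name)
```

The underscore replacement matters because systemd uses dashes in a slice name to express nesting, so a pod UID with raw dashes would be read as extra hierarchy levels.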


see:
https://search.ci.openshift.org/?search=Unit+.*.slice+already+exists&maxAge=48h&context=1&type=bug%2Bjunit&name=4.9&excludeName=&maxMatches=5&maxBytes=20971520&groupBy=job


There is an upstream Kubernetes issue for this; see https://github.com/kubernetes/kubernetes/issues/104280 for details.

Comment 2 Kir Kolyshkin 2021-08-23 19:53:13 UTC

Fixed in https://github.com/opencontainers/runc/releases/tag/v1.0.2 by https://github.com/opencontainers/runc/pull/3167

Comment 3 Kir Kolyshkin 2021-08-24 01:17:27 UTC

Now we need to vendor it into all supported Kubernetes branches:

master: https://github.com/kubernetes/kubernetes/pull/104528
  1.22: https://github.com/kubernetes/kubernetes/pull/104529
  1.21: https://github.com/kubernetes/kubernetes/pull/104530

I think that we should not try to bring it to 1.20 (the runc vendored there does the freeze, so it is affected, but the number of changes to backport is too high).

Comment 4 Ryan Phillips 2021-08-25 14:40:11 UTC

*** Bug 1996755 has been marked as a duplicate of this bug. ***

Comment 5 Kir Kolyshkin 2021-08-26 21:25:36 UTC

The bug is valid for 4.8 and 4.9.

Fixed by
4.9: https://github.com/openshift/kubernetes/pull/910
4.8: https://github.com/openshift/kubernetes/pull/912

Comment 6 Kir Kolyshkin 2021-08-26 21:37:08 UTC

Strictly speaking, we have two bugs here.

One is "unit already exists", introduced in runc rc94 and fixed in runc 1.0.0.
Another is "cgroup freeze", introduced in runc rc92, mostly fixed in runc 1.0.1, fully fixed in runc 1.0.2 (or so we hope).

Comment 7 Kir Kolyshkin 2021-08-26 21:45:53 UTC

> Another is "cgroup freeze", introduced in runc rc92, mostly fixed in runc 1.0.1, fully fixed in runc 1.0.2 (or so we hope).

Pardon me, this was introduced in rc91 (runc commit b810da149), not rc92.
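
Comments 6 and 7 together give an introduced/fixed range for each of the two bugs. A rough Python sketch of checking a vendored libcontainer version against those ranges follows. It assumes "rcNN" refers to the 1.0.0-rcNN pre-release tags, and the helper names (`parse_version`, `affected_by`) are made up for illustration.

```python
import re


def parse_version(v):
    """Parse 'X.Y.Z' or 'X.Y.Z-rcN' into a sortable tuple.

    Final releases sort after every release candidate of the same X.Y.Z.
    """
    m = re.fullmatch(r"v?(\d+)\.(\d+)\.(\d+)(?:-rc\.?(\d+))?", v)
    if m is None:
        raise ValueError(f"unrecognized version: {v!r}")
    major, minor, patch, rc = m.groups()
    rc_rank = int(rc) if rc is not None else float("inf")
    return (int(major), int(minor), int(patch), rc_rank)


# (introduced, fully fixed) per bug, taken from comments 6 and 7.
# Assumption: "rc94" and "rc91" mean 1.0.0-rc94 and 1.0.0-rc91.
BUGS = {
    "unit already exists": ("1.0.0-rc94", "1.0.0"),
    "cgroup freeze": ("1.0.0-rc91", "1.0.2"),
}


def affected_by(version):
    """Return the names of the bugs whose range contains this version."""
    v = parse_version(version)
    return sorted(
        name
        for name, (introduced, fixed) in BUGS.items()
        if parse_version(introduced) <= v < parse_version(fixed)
    )


for candidate in ["1.0.0-rc93", "1.0.0-rc95", "1.0.1", "1.0.2"]:
    print(candidate, affected_by(candidate))
```

Note that the sketch treats 1.0.1 as still affected by the freeze bug, since the comment describes it as only "mostly fixed" there.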

Comment 8 Kir Kolyshkin 2021-08-27 01:07:21 UTC

*** Bug 1996187 has been marked as a duplicate of this bug. ***

Comment 12 Kir Kolyshkin 2021-09-02 15:04:53 UTC

This is not about the version of the standalone runc binary being used; it is about the version of runc's libcontainer imported by kubelet during compilation.

Now, I am not sure whether 4.9.0-0.nightly-2021-09-01-193941 includes https://github.com/openshift/kubernetes/pull/910
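
Since the relevant version is the libcontainer that kubelet was built against, one place to look is the go.mod of the Kubernetes tree used for the build. A small Python sketch of extracting that version from go.mod text follows. The function name and the sample excerpt are hypothetical, and the sketch does not account for `replace` directives that could override the required version.

```python
import re


def vendored_runc_version(go_mod_text):
    """Return the runc module version required by a go.mod file, or None."""
    pattern = re.compile(
        r"^\s*(?:require\s+)?github\.com/opencontainers/runc\s+(v\S+)",
        re.MULTILINE,
    )
    match = pattern.search(go_mod_text)
    return match.group(1) if match else None


# Hypothetical go.mod excerpt, for illustration only.
SAMPLE_GO_MOD = """\
module k8s.io/kubernetes

require (
    github.com/opencontainers/go-digest v1.0.0
    github.com/opencontainers/runc v1.0.2
    github.com/opencontainers/selinux v1.8.2
)
"""

print(vendored_runc_version(SAMPLE_GO_MOD))
```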

Comment 13 Sunil Choudhary 2021-09-06 10:24:01 UTC

Checked again, and I see the last error was 5 days ago.

https://search.ci.openshift.org/?search=Unit+.*.slice+already+exists&maxAge=336h&context=1&type=junit&name=4.9&excludeName=&maxMatches=5&maxBytes=20971520&groupBy=job
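
For readers unfamiliar with the CI search links in this report, the query string can be decoded to see exactly what was searched. A short Python sketch using only the standard library follows; the meaning of each parameter is not documented here, so the code simply prints them as found.

```python
from urllib.parse import parse_qs, urlsplit

# URL copied from the comment above.
URL = (
    "https://search.ci.openshift.org/?search=Unit+.*.slice+already+exists"
    "&maxAge=336h&context=1&type=junit&name=4.9&excludeName=&maxMatches=5"
    "&maxBytes=20971520&groupBy=job"
)

# keep_blank_values retains empty parameters such as excludeName.
params = parse_qs(urlsplit(URL).query, keep_blank_values=True)
for key, values in params.items():
    print(f"{key} = {values[0]!r}")

# The maxAge window expressed in days (336h / 24).
max_age_days = int(params["maxAge"][0].rstrip("h")) // 24
print("maxAge in days:", max_age_days)
```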

Comment 15 Lakshmi Ravichandran 2021-09-21 11:59:19 UTC

Noticing the "[sig-arch] events should not repeat pathologically" test failing on 4.9 to 4.10 upgrade CI jobs on s390x since today (2021-09-21):

[sig-arch] events should not repeat pathologically (0s)
1 events happened too frequently

event happened 38 times, something is wrong: ns/openshift-machine-api machine/libvirt-s390x-1-3-708-542pk-worker-0-qntpm - reason/Updated Updated Machine libvirt-s390x-1-3-708-542pk-worker-0-qntpm

https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-multiarch-master-nightly-4.10-upgrade-from-nightly-4.9-ocp-remote-libvirt-s390x/1440148923523010560
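
The repeated-event lines quoted in this report share a common shape, so they can be split into fields to compare occurrences. A minimal Python sketch follows; the regular expression is derived only from the two lines quoted in this bug, and the function name `parse_repeated_event` is hypothetical.

```python
import re

EVENT_RE = re.compile(
    r"event happened (?P<count>\d+) times, something is wrong: "
    r"(?P<locator>.*?) - reason/(?P<reason>\S+) (?P<message>.*)",
    re.DOTALL,
)


def parse_repeated_event(line):
    """Split a repeated-event line into count, locator, reason, message."""
    m = EVENT_RE.match(line)
    if m is None:
        return None
    # The locator is a space-separated list of kind/name pairs.
    locator = dict(
        part.split("/", 1) for part in m.group("locator").split() if "/" in part
    )
    return {
        "count": int(m.group("count")),
        "locator": locator,
        "reason": m.group("reason"),
        "message": m.group("message"),
    }


# Line copied from the comment above.
LINE = (
    "event happened 38 times, something is wrong: ns/openshift-machine-api "
    "machine/libvirt-s390x-1-3-708-542pk-worker-0-qntpm - reason/Updated "
    "Updated Machine libvirt-s390x-1-3-708-542pk-worker-0-qntpm"
)

event = parse_repeated_event(LINE)
print(event["count"], event["reason"], event["locator"])
```

Parsed this way, the reason in this comment is Updated, whereas the event in the original description had reason FailedCreatePodContainer.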

Comment 17 errata-xmlrpc 2021-10-18 17:46:26 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.9.0 bug fix and security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:3759