1953102 – kubelet CPU use during an e2e run increased 25% after rebase

Bug 1953102 - kubelet CPU use during an e2e run increased 25% after rebase

Summary: kubelet CPU use during an e2e run increased 25% after rebase

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	OpenShift Container Platform
Classification:	Red Hat
Component:	Node
Sub Component:
Version:	4.8
Hardware:	Unspecified
OS:	Unspecified
Priority:	high
Severity:	high
Target Milestone:	---
Target Release:	4.8.0
Assignee:	Elana Hashman
QA Contact:	Weinan Liu
Docs Contact:
URL:
Whiteboard:
Duplicates (1):	1962186 (view as bug list)
Depends On:
Blocks:
TreeView+	depends on / blocked

Reported:	2021-04-23 21:23 UTC by Clayton Coleman
Modified:	2021-07-27 23:03 UTC (History)
CC List:	14 users (show)
Fixed In Version:
Doc Type:	No Doc Update
Doc Text:
Clone Of:
Environment:
Last Closed:	2021-07-27 23:03:30 UTC
Target Upstream Version:
Embargoed:

Attachments	(Terms of Use)
CPU use over total e2e-gcp run for last 8 weeks (291.35 KB, image/png) 2021-04-23 21:23 UTC, Clayton Coleman	no flags	Details
sum cpu usage per /system.slice process across entire cluster 4/9 (109.72 KB, image/png) 2021-04-23 21:24 UTC, Clayton Coleman	no flags	Details
sum cpu usage per /system.slice process across entire cluster 4/23 (135.14 KB, image/png) 2021-04-23 21:24 UTC, Clayton Coleman	no flags	Details
pprof - 4.7.8 over 30m (172.41 KB, application/gzip) 2021-04-26 23:13 UTC, Elana Hashman	no flags	Details
pprof - 4.8.0-0.ci-2021-04-26-154116 over 30m (218.58 KB, application/gzip) 2021-04-26 23:13 UTC, Elana Hashman	no flags	Details
GetStats comparison 4.7 v 4.8 (205.06 KB, image/png) 2021-04-26 23:21 UTC, Elana Hashman	no flags	Details
Eviction comparison 4.7 v 4.8 (253.09 KB, image/png) 2021-04-26 23:24 UTC, Elana Hashman	no flags	Details
4.8-fc5 CPU usage (168.39 KB, image/png) 2021-06-12 07:14 UTC, Sunil Choudhary	no flags	Details
4.8.0-0.nightly-2021-06-10-224448 CPU usage (170.08 KB, image/png) 2021-06-12 07:16 UTC, Sunil Choudhary	no flags	Details
View All

Links
System	ID	Priority	Status	Summary	Last Updated
Github	kubernetes kubernetes issues 101989	None	open	Kubelet CPU perf regression 1.20->1.21 (25%)	2021-06-03 20:23:41 UTC
Github	openshift kubernetes pull 761	None	closed	Bug 1953102: UPSTREAM: 102147: vendor: bump runc to rc95	2021-06-02 15:27:20 UTC
Red Hat Product Errata	RHSA-2021:2438	None	None	None	2021-07-27 23:03:47 UTC

Description Clayton Coleman 2021-04-23 21:23:13 UTC

Created attachment 1774926 [details]
CPU use over total e2e-gcp run for last 8 weeks

Between 4/10 and 4/15 we saw a significant increase in CPU use from the cluster during runs (linked chart).

Digging in between 4/9 and 4/23 we examined total CPU use across the cluster - one of the most significant component increases was kubelet. The attached graphs show the cpu core seconds used of crio, kubelet, and ovs in those orders across a 6 node gcp cluster - kubelet especially went from 2000 core/seconds on all 6 nodes to 2750 core seconds.

4/9: https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.8-e2e-gcp/1380506424706076672/artifacts/e2e-gcp/gather-extra/artifacts/metrics/

4/23: https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.8-e2e-gcp/1385624321132597248/artifacts/e2e-gcp/gather-extra/artifacts/metrics/

Will open separate bugs for other impacting regressions.

Comment 1 Clayton Coleman 2021-04-23 21:24:01 UTC

Created attachment 1774927 [details]
sum cpu usage per /system.slice process across entire cluster 4/9

Comment 2 Clayton Coleman 2021-04-23 21:24:19 UTC

Created attachment 1774928 [details]
sum cpu usage per /system.slice process across entire cluster 4/23

Comment 3 Elana Hashman 2021-04-23 21:29:34 UTC

My immediate suspicion is the migration to structured logging in 1.21. Next steps for me are to see if this is also an issue upstream and confirm if structured logging is the root cause.

Comment 5 Elana Hashman 2021-04-26 23:13:11 UTC

Created attachment 1775734 [details]
pprof - 4.7.8 over 30m

Attaching pprofs from 30m dumps on some cluster-bot clusters (4.7.8, 4.8.0-0.ci-2021-04-26-154116).

Comment 6 Elana Hashman 2021-04-26 23:13:53 UTC

Created attachment 1775735 [details]
pprof - 4.8.0-0.ci-2021-04-26-154116 over 30m

Comment 7 Elana Hashman 2021-04-26 23:21:18 UTC

Created attachment 1775736 [details]
GetStats comparison 4.7 v 4.8

Attached: screenshot comparing pprof data from 4.7 vs. 4.8. We see a 10s/60% CPU increase.

In particular, the following jumps out to me:

- fscommon.OpenFile -> fscommon.openWithSecureJoin appears new, is hot in the call graph, and appears quite slow. It looks responsible for increased syscall activity. MemoryGroup stats look 30% slower.
- We have some net new stats being gathered that are pretty slow: CpusetGroup adds 4s.

runc got bumped in OpenShift ~Apr. 7 during the rebase.

Ref: libct/cg: introduce and use fscommon.OpenFile https://github.com/opencontainers/runc/pull/2635

Comment 8 Elana Hashman 2021-04-26 23:24:49 UTC

Created attachment 1775737 [details]
Eviction comparison 4.7 v 4.8

Eviction cpu time up 200%.

Appears to be due in part to some new volume stats being gathered.

Slack context: https://coreos.slack.com/archives/CK1AE4ZCK/p1619477015207100?thread_ts=1619211998.146100&cid=CK1AE4ZCK

Comment 9 Elana Hashman 2021-04-26 23:33:47 UTC

Upstream k8s doesn't currently have monitoring for this: https://kubernetes.slack.com/archives/C09QZTRH7/p1619422034034900?thread_ts=1619223170.034800&cid=C09QZTRH7

I will try to get an upstream bug filed to add that support so we can catch these sorts of regressions upstream. I will also test and confirm to see if this affects upstream k8s as well, and file a bug if so.

Comment 10 Elana Hashman 2021-04-26 23:36:34 UTC

Assigning to Kir to investigate cadvisor/runc regression.

Comment 11 Kir Kolyshkin 2021-04-27 03:21:24 UTC

Proposed fix for runc, together with some analysis: https://github.com/opencontainers/runc/pull/2921

Will backport to 4.7 later.

Comment 12 Kir Kolyshkin 2021-04-27 03:33:40 UTC

> Will backport to 4.7 later.

Ughm, I mean, 4.8. Here: https://github.com/projectatomic/runc/pull/48 (currently a draft pending the upstream PR merge, but can be tested).

Comment 13 Elana Hashman 2021-05-03 22:35:20 UTC

https://github.com/opencontainers/runc/pull/2921 with patch has merged.

Waiting on runc94 release in: https://github.com/opencontainers/runc/pull/2927

Then will proceed to bump runc in cadvisor, then cadvisor in k8s, then backport the dep bump.

Comment 14 Ryan Phillips 2021-05-05 19:11:13 UTC

Kir is going to look into a backport into openshift/opencontainer-runc for 4.8.

Comment 15 Kir Kolyshkin 2021-05-05 19:37:14 UTC

A preliminary patch for cadvisor HEAD: https://github.com/google/cadvisor/pull/2863 (will be finalized once rc94 is out).

Comment 16 Elana Hashman 2021-05-06 17:58:14 UTC

Adding a NEEDINFO for you, Kir, just so you can ping me when the rc94 release is done... I'm keeping David Porter up to date upstream as well for the cadvisor release.

Comment 17 Elana Hashman 2021-05-11 01:38:53 UTC

runc94 is out: https://github.com/opencontainers/runc/releases/tag/v1.0.0-rc94

cadvisor has been bumped: https://github.com/google/cadvisor/pull/2869 (still need to cut a release)

cri-o bump: https://github.com/cri-o/cri-o/pull/4882

Comment 18 Elana Hashman 2021-05-18 18:11:16 UTC

Per https://github.com/kubernetes/kubernetes/pull/101888#pullrequestreview-661762880 this attempt at bumping to runc94 hasn't worked because of an issue with the conformance tests, so the PR was reverted. We will need to wait to update runc until we can fix these tests.

Comment 19 Elana Hashman 2021-05-19 23:23:59 UTC

PR for runc95 plus text fixes: https://github.com/kubernetes/kubernetes/pull/102147

Comment 21 Neelesh Agrawal 2021-05-21 18:15:16 UTC

kube 1.21 backport PR https://github.com/kubernetes/kubernetes/pull/102196
This there will be an openshift backport PR to fix this in 4.8.

Comment 23 Ben Bennett 2021-05-26 14:55:41 UTC

*** Bug 1962186 has been marked as a duplicate of this bug. ***

Comment 24 Ken Young 2021-06-03 13:15:56 UTC

Elana / Sunil,

Where are we with this Bug?  Has QA tested it?  Has it been backported to 4.8?

/KenY

Comment 25 Elana Hashman 2021-06-03 20:25:17 UTC

Backports have landed in both Kube 1.21 and OpenShift 4.8. Awaiting QE for OpenShift and for Googlers to verify the upstream bug can be closed.

Comment 26 Sunil Choudhary 2021-06-12 07:14:44 UTC

Created attachment 1790415 [details]
4.8-fc5 CPU usage

Comment 27 Sunil Choudhary 2021-06-12 07:16:42 UTC

Created attachment 1790416 [details]
4.8.0-0.nightly-2021-06-10-224448 CPU usage

Comment 28 Sunil Choudhary 2021-06-12 07:18:58 UTC

Checked by comparing cpu usage between 4.8.0-fc.5 & 4.8.0-0.nightly-2021-06-10-224448 over a period of 24 hours by running some workload

$ oc get clusterversion
NAME      VERSION      AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.8.0-fc.5   True        False         26h     Cluster version is 4.8.0-fc.5

$ oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.8.0-0.nightly-2021-06-10-224448   True        False         25h     Cluster version is 4.8.0-0.nightly-2021-06-10-224448

I see drop in cpu usage by kubelet, crio & ovs

Comment 31 errata-xmlrpc 2021-07-27 23:03:30 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.8.2 bug fix and security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:2438

Note You need to log in before you can comment on or make changes to this bug.