Bug 1953102

Summary: kubelet CPU use during an e2e run increased 25% after rebase
Product: OpenShift Container Platform
Component: Node
Sub component: Kubelet
Version: 4.8
Target Release: 4.8.0
Hardware: Unspecified
OS: Unspecified
Severity: high
Priority: high
Status: CLOSED ERRATA
Reporter: Clayton Coleman <ccoleman>
Assignee: Elana Hashman <ehashman>
QA Contact: Weinan Liu <weinliu>
CC: aos-bugs, danijel.soldo, ehashman, imiller, keyoung, kir, msivak, nagrawal, prubenda, rphillips, schoudha, sreber, weinliu, wking
Doc Type: No Doc Update
Type: Bug
Last Closed: 2021-07-27 23:03:30 UTC
Attachments:
- CPU use over total e2e-gcp run for last 8 weeks
- sum cpu usage per /system.slice process across entire cluster 4/9
- sum cpu usage per /system.slice process across entire cluster 4/23
- pprof - 4.7.8 over 30m
- pprof - 4.8.0-0.ci-2021-04-26-154116 over 30m
- GetStats comparison 4.7 v 4.8
- Eviction comparison 4.7 v 4.8
- 4.8-fc5 CPU usage
- 4.8.0-0.nightly-2021-06-10-224448 CPU usage

Description Clayton Coleman 2021-04-23 21:23:13 UTC
Created attachment 1774926 [details]
CPU use over total e2e-gcp run for last 8 weeks

Between 4/10 and 4/15 we saw a significant increase in CPU use from the cluster during runs (linked chart).

Digging in between 4/9 and 4/23, we examined total CPU use across the cluster; one of the most significant component increases was the kubelet. The attached graphs show the CPU core-seconds used by CRI-O, kubelet, and OVS, in that order, across a 6-node GCP cluster. The kubelet in particular went from ~2000 core-seconds across all 6 nodes to ~2750 core-seconds.
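
For reference, these comparisons come from summing per-unit CPU under /system.slice in the cluster's Prometheus data. A minimal sketch of such a query in Go, assuming the usual cadvisor metric and label names (container_cpu_usage_seconds_total with an id label) and a placeholder Prometheus address:

package main

import (
	"context"
	"fmt"
	"time"

	"github.com/prometheus/client_golang/api"
	v1 "github.com/prometheus/client_golang/api/prometheus/v1"
)

func main() {
	// Placeholder address; a real cluster needs auth on top of this.
	client, err := api.NewClient(api.Config{Address: "http://prometheus.example:9090"})
	if err != nil {
		panic(err)
	}
	// Sum the CPU rate of every systemd unit under /system.slice, cluster-wide.
	query := `sum by (id) (rate(container_cpu_usage_seconds_total{id=~"/system.slice/.+"}[5m]))`
	result, warnings, err := v1.NewAPI(client).Query(context.Background(), query, time.Now())
	if err != nil {
		panic(err)
	}
	if len(warnings) > 0 {
		fmt.Println("warnings:", warnings)
	}
	fmt.Println(result)
}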

4/9: https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.8-e2e-gcp/1380506424706076672/artifacts/e2e-gcp/gather-extra/artifacts/metrics/

4/23: https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.8-e2e-gcp/1385624321132597248/artifacts/e2e-gcp/gather-extra/artifacts/metrics/

Will open separate bugs for other impacting regressions.

Comment 1 Clayton Coleman 2021-04-23 21:24:01 UTC
Created attachment 1774927 [details]
sum cpu usage per /system.slice process across entire cluster 4/9

Comment 2 Clayton Coleman 2021-04-23 21:24:19 UTC
Created attachment 1774928 [details]
sum cpu usage per /system.slice process across entire cluster 4/23

Comment 3 Elana Hashman 2021-04-23 21:29:34 UTC
My immediate suspicion is the migration to structured logging in 1.21. Next steps for me are to see if this is also an issue upstream and confirm if structured logging is the root cause.

Comment 5 Elana Hashman 2021-04-26 23:13:11 UTC
Created attachment 1775734 [details]
pprof - 4.7.8 over 30m

Attaching pprofs from 30m dumps on some cluster-bot clusters (4.7.8, 4.8.0-0.ci-2021-04-26-154116).
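
For anyone reproducing this: the dumps are just long CPU profiles. A minimal sketch of grabbing one, assuming a reachable pprof endpoint (the localhost port and missing auth here are assumptions; the kubelet's real endpoint sits behind its authenticated HTTPS port):

package main

import (
	"io"
	"net/http"
	"os"
	"time"
)

func main() {
	// Client timeout must exceed the requested profile duration.
	client := &http.Client{Timeout: 35 * time.Minute}
	resp, err := client.Get("http://localhost:6060/debug/pprof/profile?seconds=1800")
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()

	out, err := os.Create("kubelet-cpu-30m.pb.gz")
	if err != nil {
		panic(err)
	}
	defer out.Close()

	if _, err := io.Copy(out, resp.Body); err != nil {
		panic(err)
	}
	// Inspect with: go tool pprof kubelet-cpu-30m.pb.gz
}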

Comment 6 Elana Hashman 2021-04-26 23:13:53 UTC
Created attachment 1775735 [details]
pprof - 4.8.0-0.ci-2021-04-26-154116 over 30m

Comment 7 Elana Hashman 2021-04-26 23:21:18 UTC
Created attachment 1775736 [details]
GetStats comparison 4.7 v 4.8

Attached: screenshot comparing pprof data from 4.7 vs. 4.8. We see a ~10s (60%) increase in CPU time.

In particular, the following jump out to me:

- fscommon.OpenFile -> fscommon.openWithSecureJoin is new, hot in the call graph, and quite slow. It looks responsible for the increased syscall activity. MemoryGroup stats look ~30% slower.
- Some net-new stats are being gathered that are pretty slow: CpusetGroup alone adds 4s.

runc got bumped in OpenShift ~Apr. 7 during the rebase.

Ref: libct/cg: introduce and use fscommon.OpenFile https://github.com/opencontainers/runc/pull/2635
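
To make the cost concrete, here is an illustrative sketch (not runc's actual code) contrasting a plain read with a SecureJoin-guarded read. SecureJoin lstats and resolves every path component under the root before the file is even opened, and the stats code pays that price for every cgroup file on every collection:

package main

import (
	"fmt"
	"os"
	"path/filepath"
	"time"

	securejoin "github.com/cyphar/filepath-securejoin"
)

func readPlain(root, name string) ([]byte, error) {
	// One open + read; no per-component symlink resolution.
	return os.ReadFile(filepath.Join(root, name))
}

func readSecure(root, name string) ([]byte, error) {
	// SecureJoin lstats/readlinks each path component under root
	// before the file is opened, multiplying syscalls per read.
	p, err := securejoin.SecureJoin(root, name)
	if err != nil {
		return nil, err
	}
	return os.ReadFile(p)
}

func main() {
	const root = "/sys/fs/cgroup/memory" // cgroup v1 layout; an assumption
	for _, read := range []func(string, string) ([]byte, error){readPlain, readSecure} {
		start := time.Now()
		for i := 0; i < 10000; i++ {
			if _, err := read(root, "memory.stat"); err != nil {
				panic(err)
			}
		}
		fmt.Println(time.Since(start))
	}
}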

Comment 8 Elana Hashman 2021-04-26 23:24:49 UTC
Created attachment 1775737 [details]
Eviction comparison 4.7 v 4.8

Eviction CPU time is up 200%.

This appears to be due in part to new volume stats being gathered.

Slack context: https://coreos.slack.com/archives/CK1AE4ZCK/p1619477015207100?thread_ts=1619211998.146100&cid=CK1AE4ZCK
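
The shape of the regression is the usual one for eviction: its synchronize loop consumes a fresh stats summary on every tick, so any newly gathered per-volume stats get paid for continuously. A hedged sketch of that cost model (not the kubelet's actual code; the paths and interval are illustrative):

package main

import (
	"fmt"
	"time"

	"golang.org/x/sys/unix"
)

func volumeUsage(path string) (uint64, error) {
	var st unix.Statfs_t
	if err := unix.Statfs(path, &st); err != nil {
		return 0, err
	}
	return (st.Blocks - st.Bfree) * uint64(st.Bsize), nil
}

func main() {
	volumes := []string{"/var/lib/kubelet/pods"} // stand-in for per-pod volume dirs
	ticker := time.NewTicker(10 * time.Second)   // monitoring interval
	defer ticker.Stop()

	for range ticker.C {
		// Cost grows with volumes x tick frequency; deeper walks (du-style
		// usage calculation) are far more expensive than a single statfs.
		for _, v := range volumes {
			used, err := volumeUsage(v)
			if err != nil {
				fmt.Println("stat error:", err)
				continue
			}
			fmt.Printf("%s: %d bytes used\n", v, used)
		}
	}
}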

Comment 9 Elana Hashman 2021-04-26 23:33:47 UTC
Upstream k8s doesn't currently have monitoring for this: https://kubernetes.slack.com/archives/C09QZTRH7/p1619422034034900?thread_ts=1619223170.034800&cid=C09QZTRH7

I will try to get an upstream bug filed to add that support so we can catch these sorts of regressions upstream. I will also test whether this affects upstream k8s and file a bug if so.

Comment 10 Elana Hashman 2021-04-26 23:36:34 UTC
Assigning to Kir to investigate cadvisor/runc regression.

Comment 11 Kir Kolyshkin 2021-04-27 03:21:24 UTC
Proposed fix for runc, together with some analysis: https://github.com/opencontainers/runc/pull/2921

Will backport to 4.7 later.
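
For context, a minimal sketch of the direction the fix takes per the PR discussion: open cgroup files relative to a cached /sys/fs/cgroup dirfd with openat2(2) and RESOLVE_BENEATH, so the kernel enforces containment in a single syscall instead of userspace re-resolving the whole path on every read. Requires Linux 5.6+; the details here are illustrative, not the merged code:

package main

import (
	"fmt"
	"os"

	"golang.org/x/sys/unix"
)

var cgroupRootFd int // opened once, reused for every stats read

func init() {
	fd, err := unix.Open("/sys/fs/cgroup", unix.O_DIRECTORY|unix.O_PATH, 0)
	if err != nil {
		panic(err)
	}
	cgroupRootFd = fd
}

func openCgroupFile(relPath string) (*os.File, error) {
	// RESOLVE_BENEATH makes the kernel reject any resolution that would
	// escape the dirfd; no userspace path walking needed.
	fd, err := unix.Openat2(cgroupRootFd, relPath, &unix.OpenHow{
		Flags:   unix.O_RDONLY | unix.O_CLOEXEC,
		Resolve: unix.RESOLVE_BENEATH | unix.RESOLVE_NO_MAGICLINKS,
	})
	if err != nil {
		return nil, err
	}
	return os.NewFile(uintptr(fd), relPath), nil
}

func main() {
	f, err := openCgroupFile("memory/memory.stat") // cgroup v1 layout; an assumption
	if err != nil {
		panic(err)
	}
	defer f.Close()
	buf := make([]byte, 4096)
	n, _ := f.Read(buf)
	fmt.Print(string(buf[:n]))
}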

Comment 12 Kir Kolyshkin 2021-04-27 03:33:40 UTC
> Will backport to 4.7 later.

Ughm, I mean, 4.8. Here: https://github.com/projectatomic/runc/pull/48 (currently a draft pending the upstream PR merge, but can be tested).

Comment 13 Elana Hashman 2021-05-03 22:35:20 UTC
The patch in https://github.com/opencontainers/runc/pull/2921 has merged.

Waiting on the rc94 release in: https://github.com/opencontainers/runc/pull/2927

Then will proceed to bump runc in cadvisor, then cadvisor in k8s, then backport the dep bump.

Comment 14 Ryan Phillips 2021-05-05 19:11:13 UTC
Kir is going to look into a backport into openshift/opencontainer-runc for 4.8.

Comment 15 Kir Kolyshkin 2021-05-05 19:37:14 UTC
A preliminary patch for cadvisor HEAD: https://github.com/google/cadvisor/pull/2863 (will be finalized once rc94 is out).

Comment 16 Elana Hashman 2021-05-06 17:58:14 UTC
Adding a NEEDINFO for you, Kir, just so you can ping me when the rc94 release is done... I'm keeping David Porter up to date upstream as well for the cadvisor release.

Comment 17 Elana Hashman 2021-05-11 01:38:53 UTC
rc94 is out: https://github.com/opencontainers/runc/releases/tag/v1.0.0-rc94

cadvisor has been bumped: https://github.com/google/cadvisor/pull/2869 (still need to cut a release)

cri-o bump: https://github.com/cri-o/cri-o/pull/4882

Comment 18 Elana Hashman 2021-05-18 18:11:16 UTC
Per https://github.com/kubernetes/kubernetes/pull/101888#pullrequestreview-661762880, this attempt at bumping to rc94 hasn't worked because of an issue with the conformance tests, so the PR was reverted. We will need to hold off on updating runc until those tests are fixed.

Comment 19 Elana Hashman 2021-05-19 23:23:59 UTC
PR for rc95 plus test fixes: https://github.com/kubernetes/kubernetes/pull/102147

Comment 21 Neelesh Agrawal 2021-05-21 18:15:16 UTC
Kube 1.21 backport PR: https://github.com/kubernetes/kubernetes/pull/102196
Then there will be an OpenShift backport PR to fix this in 4.8.

Comment 23 Ben Bennett 2021-05-26 14:55:41 UTC
*** Bug 1962186 has been marked as a duplicate of this bug. ***

Comment 24 Ken Young 2021-06-03 13:15:56 UTC
Elana / Sunil,

Where are we with this bug? Has QA tested it? Has it been backported to 4.8?

/KenY

Comment 25 Elana Hashman 2021-06-03 20:25:17 UTC
Backports have landed in both Kube 1.21 and OpenShift 4.8. Awaiting QE for OpenShift and for Googlers to verify the upstream bug can be closed.

Comment 26 Sunil Choudhary 2021-06-12 07:14:44 UTC
Created attachment 1790415 [details]
4.8-fc5 CPU usage

Comment 27 Sunil Choudhary 2021-06-12 07:16:42 UTC
Created attachment 1790416 [details]
4.8.0-0.nightly-2021-06-10-224448 CPU usage

Comment 28 Sunil Choudhary 2021-06-12 07:18:58 UTC
Checked by comparing CPU usage between 4.8.0-fc.5 and 4.8.0-0.nightly-2021-06-10-224448 over a period of 24 hours while running some workload.

$ oc get clusterversion
NAME      VERSION      AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.8.0-fc.5   True        False         26h     Cluster version is 4.8.0-fc.5

$ oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.8.0-0.nightly-2021-06-10-224448   True        False         25h     Cluster version is 4.8.0-0.nightly-2021-06-10-224448

I see a drop in CPU usage by kubelet, CRI-O, and OVS.

Comment 31 errata-xmlrpc 2021-07-27 23:03:30 UTC
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.8.2 bug fix and security update), and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:2438