Created attachment 1774926 [details] CPU use over total e2e-gcp run for last 8 weeks Between 4/10 and 4/15 we saw a significant increase in CPU use from the cluster during runs (linked chart). Digging in between 4/9 and 4/23 we examined total CPU use across the cluster - one of the most significant component increases was kubelet. The attached graphs show the cpu core seconds used of crio, kubelet, and ovs in those orders across a 6 node gcp cluster - kubelet especially went from 2000 core/seconds on all 6 nodes to 2750 core seconds. 4/9: https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.8-e2e-gcp/1380506424706076672/artifacts/e2e-gcp/gather-extra/artifacts/metrics/ 4/23: https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.8-e2e-gcp/1385624321132597248/artifacts/e2e-gcp/gather-extra/artifacts/metrics/ Will open separate bugs for other impacting regressions.
Created attachment 1774927 [details] sum cpu usage per /system.slice process across entire cluster 4/9
Created attachment 1774928 [details] sum cpu usage per /system.slice process across entire cluster 4/23
My immediate suspicion is the migration to structured logging in 1.21. Next steps for me are to see if this is also an issue upstream and confirm if structured logging is the root cause.
Created attachment 1775734 [details] pprof - 4.7.8 over 30m Attaching pprofs from 30m dumps on some cluster-bot clusters (4.7.8, 4.8.0-0.ci-2021-04-26-154116).
Created attachment 1775735 [details] pprof - 4.8.0-0.ci-2021-04-26-154116 over 30m
Created attachment 1775736 [details] GetStats comparison 4.7 v 4.8 Attached: screenshot comparing pprof data from 4.7 vs. 4.8. We see a 10s/60% CPU increase. In particular, the following jumps out to me: - fscommon.OpenFile -> fscommon.openWithSecureJoin appears new, is hot in the call graph, and appears quite slow. It looks responsible for increased syscall activity. MemoryGroup stats look 30% slower. - We have some net new stats being gathered that are pretty slow: CpusetGroup adds 4s. runc got bumped in OpenShift ~Apr. 7 during the rebase. Ref: libct/cg: introduce and use fscommon.OpenFile https://github.com/opencontainers/runc/pull/2635
Created attachment 1775737 [details] Eviction comparison 4.7 v 4.8 Eviction cpu time up 200%. Appears to be due in part to some new volume stats being gathered. Slack context: https://coreos.slack.com/archives/CK1AE4ZCK/p1619477015207100?thread_ts=1619211998.146100&cid=CK1AE4ZCK
Upstream k8s doesn't currently have monitoring for this: https://kubernetes.slack.com/archives/C09QZTRH7/p1619422034034900?thread_ts=1619223170.034800&cid=C09QZTRH7 I will try to get an upstream bug filed to add that support so we can catch these sorts of regressions upstream. I will also test and confirm to see if this affects upstream k8s as well, and file a bug if so.
Assigning to Kir to investigate cadvisor/runc regression.
Proposed fix for runc, together with some analysis: https://github.com/opencontainers/runc/pull/2921 Will backport to 4.7 later.
> Will backport to 4.7 later. Ughm, I mean, 4.8. Here: https://github.com/projectatomic/runc/pull/48 (currently a draft pending the upstream PR merge, but can be tested).
https://github.com/opencontainers/runc/pull/2921 with patch has merged. Waiting on runc94 release in: https://github.com/opencontainers/runc/pull/2927 Then will proceed to bump runc in cadvisor, then cadvisor in k8s, then backport the dep bump.
Kir is going to look into a backport into openshift/opencontainer-runc for 4.8.
A preliminary patch for cadvisor HEAD: https://github.com/google/cadvisor/pull/2863 (will be finalized once rc94 is out).
Adding a NEEDINFO for you, Kir, just so you can ping me when the rc94 release is done... I'm keeping David Porter up to date upstream as well for the cadvisor release.
runc94 is out: https://github.com/opencontainers/runc/releases/tag/v1.0.0-rc94 cadvisor has been bumped: https://github.com/google/cadvisor/pull/2869 (still need to cut a release) cri-o bump: https://github.com/cri-o/cri-o/pull/4882
Per https://github.com/kubernetes/kubernetes/pull/101888#pullrequestreview-661762880 this attempt at bumping to runc94 hasn't worked because of an issue with the conformance tests, so the PR was reverted. We will need to wait to update runc until we can fix these tests.
PR for runc95 plus text fixes: https://github.com/kubernetes/kubernetes/pull/102147
kube 1.21 backport PR https://github.com/kubernetes/kubernetes/pull/102196 This there will be an openshift backport PR to fix this in 4.8.
*** Bug 1962186 has been marked as a duplicate of this bug. ***
Elana / Sunil, Where are we with this Bug? Has QA tested it? Has it been backported to 4.8? /KenY
Backports have landed in both Kube 1.21 and OpenShift 4.8. Awaiting QE for OpenShift and for Googlers to verify the upstream bug can be closed.
Created attachment 1790415 [details] 4.8-fc5 CPU usage
Created attachment 1790416 [details] 4.8.0-0.nightly-2021-06-10-224448 CPU usage
Checked by comparing cpu usage between 4.8.0-fc.5 & 4.8.0-0.nightly-2021-06-10-224448 over a period of 24 hours by running some workload $ oc get clusterversion NAME VERSION AVAILABLE PROGRESSING SINCE STATUS version 4.8.0-fc.5 True False 26h Cluster version is 4.8.0-fc.5 $ oc get clusterversion NAME VERSION AVAILABLE PROGRESSING SINCE STATUS version 4.8.0-0.nightly-2021-06-10-224448 True False 25h Cluster version is 4.8.0-0.nightly-2021-06-10-224448 I see drop in cpu usage by kubelet, crio & ovs
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.8.2 bug fix and security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2021:2438