We noticed a significant regression (~20%) in total cluster CPU between runs before and after the rebase + 8.4 RHCOS changes landed. A portion of that is a kubelet regression tracked in https://bugzilla.redhat.com/show_bug.cgi?id=1953102. Investigating the remainder, we noticed that system.slice went from about 500 core-seconds during a 60m run to about 1750 core-seconds, excluding cri-o/kubelet/ovs (crio and ovs were roughly constant). That is ~1200 extra core-seconds across 6 machines over 60m, so about 60 millicores (60m) of constant extra usage per node. That impacts our SNO (single-node OpenShift) margins, so setting this to high severity.

udisks2 is the biggest new consumer, with auditd being a significant source as well.

Top users before (4/9) — CPU seconds consumed across all nodes by service, normalized to the duration of the run, measured as the average fraction of one core used:

{id="/system.slice/openshift-gcp-routes.service"} 0.05591455369138888
{id="/system.slice/dbus.service"} 0.019401089587777774
{id="/system.slice/systemd-udevd.service"} 0.018729595322222223
{id="/system.slice/systemd-journald.service"} 0.017247115179444446
{id="/system.slice/NetworkManager.service"} 0.017027189633333332
{id="/system.slice/ovsdb-server.service"} 0.011805269392222222
{id="/system.slice/sssd.service"} 0.003392703144166667
{id="/system.slice/systemd-logind.service"} 0.0032346640252777776
{id="/system.slice/polkit.service"} 0.002284859525555556
{id="/system.slice/gcp-routes.service"} 0.0014037745169444445

Top users after (4/23):

{id="/system.slice/udisks2.service"} 0.23363467572083332
{id="/system.slice/auditd.service"} 0.08049729748916666
{id="/system.slice/openshift-gcp-routes.service"} 0.07127809961194445
{id="/system.slice/systemd-udevd.service"} 0.019705834968611113
{id="/system.slice/dbus.service"} 0.01926550800027778
{id="/system.slice/NetworkManager.service"} 0.01904460827472222
{id="/system.slice/systemd-journald.service"} 0.018749651370833335
{id="/system.slice/ovsdb-server.service"} 0.012165161891944444
{id="/system.slice/system-getty.slice/getty"} 0.004353772910555555
{id="/system.slice/system-getty.slice"} 0.00435304708
{id="/system.slice/sssd.service"} 0.003993742236111111
{id="/system.slice/systemd-logind.service"} 0.0027815936383333334
{id="/system.slice/polkit.service"} 0.0023779125527777782
{id="/system.slice/system-serial\x2dgetty.slice"} 0.002189938009722223
{id="/system.slice/system-serial\x2dgetty.slice/serial-getty"} 0.002189938009722223
{id="/system.slice/gcp-routes.service"} 0.0015536997230555555

Before 8.4/rebase:
https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.8-e2e-gcp/1380506424706076672/artifacts/e2e-gcp/gather-extra/artifacts/metrics/

After 8.4/rebase:
https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.8-e2e-gcp/1385624321132597248/artifacts/e2e-gcp/gather-extra/artifacts/metrics/
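For anyone wanting to reproduce these tables against a live cluster: the exact query used above isn't recorded in this comment, so the sketch below is an assumption modeled on the normalization described above and on the unaccounted-CPU query quoted in the next comment. The `prometheus-k8s-0` pod name is the usual openshift-monitoring one, but treat the access setup as illustrative.

```
# Hypothetical reproduction sketch (not necessarily the exact query used above).
# Assumes the standard in-cluster Prometheus pod in openshift-monitoring.
oc port-forward -n openshift-monitoring prometheus-k8s-0 9090 &

# Per-unit CPU over a 60m run: cumulative core-seconds divided by the run
# length in seconds gives the average cores used, summed across all nodes.
# Divide by the node count (6 here) as well for a per-node figure, as the
# query in the next comment does.
QUERY='sum by (id) (increase(container_cpu_usage_seconds_total{container="",id=~"/system.slice/.+"}[60m])) / (60 * 60)'
curl -sG http://localhost:9090/api/v1/query --data-urlencode "query=${QUERY}"
```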
Of note, it appears the CPU not charged to system.slice or kubepods.slice on a node went from about 79.8 millicores to 89.6 millicores. The query is:

(sum by (id) (container_cpu_usage_seconds_total{container="",id="/"})
  - scalar(sum by (id) (container_cpu_usage_seconds_total{container="",id="/system.slice"}))
  - scalar(sum by (id) (container_cpu_usage_seconds_total{container="",id="/kubepods.slice"}))
) / 60 / 60 / 6

Is that pure kernel CPU? Is a ~10% steady-state regression in the kernel something we should notice (good chance whatever udisks2 or auditd are doing is related, of course)?
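One hedged way to follow up on the kernel-CPU question, assuming the same Prometheus access as the sketch above: node_cpu_seconds_total is the standard node_exporter series, so if the unaccounted growth also shows up under mode="system", it is kernel time. The exact windowing is an assumption.

```
# Hypothetical sanity check: does the unaccounted ~10 millicores/node show up
# as kernel time? Compare node_exporter's mode="system" accounting against
# the cgroup-root usage over the same 60m window (average cores per node).
KERNEL='sum by (instance) (increase(node_cpu_seconds_total{mode="system"}[60m])) / (60 * 60)'
ROOT='sum by (instance) (increase(container_cpu_usage_seconds_total{container="",id="/"}[60m])) / (60 * 60)'
curl -sG http://localhost:9090/api/v1/query --data-urlencode "query=${KERNEL}"
curl -sG http://localhost:9090/api/v1/query --data-urlencode "query=${ROOT}"
```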
> Top users after (4/23)
>
> {id="/system.slice/udisks2.service"} 0.23363467572083332

For reference, this is from `udisks2-2.9.0-6.el8`. The service unit is a new one coming with 8.4 content, as I don't see the package in previous RHCOS images based on 8.3 content (thus it wasn't running at all in the "before" measurement). So far I haven't found an explicit reference to it in RHCOS manifests, so I think this is coming in as a transitive dependency of some other RPMs. It seems to have first appeared in `48.84.202104131710-0`.
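One way to confirm the transitive-dependency theory from a debug shell on an affected node might look like the following; rpm's reverse-dependency queries are standard, but the `libudisks2` subpackage name is an assumption.

```
# From `oc debug node/<node>` followed by `chroot /host`:
rpm -q udisks2                    # confirm the package landed in the image
rpm -q --whatrequires udisks2     # which installed RPMs pull it in directly
rpm -q --whatrequires libudisks2  # likewise for the library subpackage, if present
```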
Will be included in the next openshift/os bump.
`udisks2` was removed from RHCOS 4.8 in 48.84.202105071421-0.

Newer builds of RHCOS 4.8 have been included in the OCP 4.8 nightly payloads, so moving this to MODIFIED.
Verified on 4.8.0-0.nightly-2021-05-21-200728. udisks2 is removed from RHCOS 48.84.202105211054-0.

$ oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.8.0-0.nightly-2021-05-21-200728   True        False         10m     Cluster version is 4.8.0-0.nightly-2021-05-21-200728

$ oc get nodes
NAME                                       STATUS   ROLES    AGE   VERSION
ci-ln-d7rw6f2-f76d1-9wg8c-master-0         Ready    master   28m   v1.21.0-rc.0+c656d63
ci-ln-d7rw6f2-f76d1-9wg8c-master-1         Ready    master   28m   v1.21.0-rc.0+c656d63
ci-ln-d7rw6f2-f76d1-9wg8c-master-2         Ready    master   28m   v1.21.0-rc.0+c656d63
ci-ln-d7rw6f2-f76d1-9wg8c-worker-b-7hgrx   Ready    worker   21m   v1.21.0-rc.0+c656d63
ci-ln-d7rw6f2-f76d1-9wg8c-worker-c-vmrbc   Ready    worker   21m   v1.21.0-rc.0+c656d63
ci-ln-d7rw6f2-f76d1-9wg8c-worker-d-j5l8b   Ready    worker   21m   v1.21.0-rc.0+c656d63

$ oc debug node/ci-ln-d7rw6f2-f76d1-9wg8c-worker-b-7hgrx
Starting pod/ci-ln-d7rw6f2-f76d1-9wg8c-worker-b-7hgrx-debug ...
To use host binaries, run `chroot /host`
If you don't see a command prompt, try pressing enter.
sh-4.2# chroot /host
sh-4.4# rpm -qa | grep udisk
sh-4.4# rpm-ostree status
State: idle
Deployments:
* pivot://quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:f620068b78e684b615ac01c5b79d6043bee9727644b1a976d45ae023d49fa850
              CustomOrigin: Managed by machine-config-operator
                   Version: 48.84.202105211054-0 (2021-05-21T10:58:00Z)

  ostree://92ede04b462bc884de5562062fb45e06d803754cbaa466e3a2d34b4ee5e9634b
                   Version: 48.84.202105190318-0 (2021-05-19T03:22:10Z)

sh-4.4# exit
exit
sh-4.2# exit
exit

Removing debug pod ...
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.8.2 bug fix and security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:2438