Description of problem:

The OpenShift API server maintains a cache used to scope project list and watch requests to the namespaces visible to the requesting user. A periodic task runs in each openshift-apiserver process and updates the cache when namespaces, roles, rolebindings, clusterroles, or clusterrolebindings change. The cache can be updated incrementally when a namespace, role, or rolebinding changes, because the effect of those resources is limited to specific namespaces. Changes to clusterroles and clusterrolebindings trigger a full invalidation, since they may affect any or all namespaces. Today, however, the cache sync task always performs a full invalidation, which is particularly expensive on clusters with many namespaces.

Version-Release number of selected component (if applicable):

4

How reproducible:

Always

Steps to Reproduce:

The bug is difficult to observe directly, because the full invalidation still produces correct behavior, but its secondary effect, increased CPU consumption in all openshift-apiserver processes, is easy to observe.

1. Create 100 namespaces (not strictly necessary, but it makes the effect more obvious; see the loop sketch at the end of this comment).
2. Repeatedly update a namespace about once per second (for example, patch an annotation with the current timestamp as its value):

$ while true; do sleep 1; kubectl annotate namespace default --overwrite "timestamp=$(date)"; done

3. While continuing to update the namespace, monitor the CPU utilization metrics for openshift-apiserver:

rate(container_cpu_usage_seconds_total{namespace="openshift-apiserver",container="openshift-apiserver"}[1m])

Actual results:

Significant CPU utilization increase over idle: at least a doubling, and about a 6-7x increase on a cluster with 1000 namespaces.

Expected results:

Little or no change in CPU utilization.
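A minimal loop for creating the namespaces in step 1. The "test-" prefix is an arbitrary choice for this sketch; it happens to match the naming used in the verification below:

$ for i in $(seq 1 100); do kubectl create namespace "test-$i"; done

Cleaning up afterwards is the same loop with "kubectl delete namespace" in place of "kubectl create namespace".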
oc get clusterversion
NAME      VERSION                              AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.12.0-0.nightly-2022-07-27-133042   True        False         4h47m   Cluster version is 4.12.0-0.nightly-2022-07-27-133042

CPU utilisation before creating 1000 namespaces:

oc adm top node
NAME                                                         CPU(cores)   CPU%   MEMORY(bytes)   MEMORY%
rgangwar-28t3-vngb5-master-0.c.openshift-qe.internal         680m         19%    7053Mi          51%
rgangwar-28t3-vngb5-master-1.c.openshift-qe.internal         946m         27%    8818Mi          64%
rgangwar-28t3-vngb5-master-2.c.openshift-qe.internal         1005m        28%    10001Mi         72%
rgangwar-28t3-vngb5-worker-a-5qcr8.c.openshift-qe.internal   325m         9%     3808Mi          27%
rgangwar-28t3-vngb5-worker-b-pngcz.c.openshift-qe.internal   311m         8%     3397Mi          24%
rgangwar-28t3-vngb5-worker-c-6qtpb.c.openshift-qe.internal   207m         5%     2102Mi          15%

CPU utilisation after creating 1000 namespaces:

oc get namespace | grep -i "test-" | wc -l
1000

while true; do sleep 1; oc annotate namespace default --overwrite "timestamp=$(date)"; done

oc adm top node
NAME                                                         CPU(cores)   CPU%   MEMORY(bytes)   MEMORY%
rgangwar-28t3-vngb5-master-0.c.openshift-qe.internal         731m         20%    8259Mi          60%
rgangwar-28t3-vngb5-master-1.c.openshift-qe.internal         919m         26%    10733Mi         78%
rgangwar-28t3-vngb5-master-2.c.openshift-qe.internal         1071m        30%    12024Mi         87%
rgangwar-28t3-vngb5-worker-a-5qcr8.c.openshift-qe.internal   323m         9%     4066Mi          29%
rgangwar-28t3-vngb5-worker-b-pngcz.c.openshift-qe.internal   402m         11%    3403Mi          24%
rgangwar-28t3-vngb5-worker-c-6qtpb.c.openshift-qe.internal   178m         5%     2115Mi          15%

There is not much of a spike in CPU utilisation.
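Note: oc adm top node measures whole-node CPU, which is fairly coarse for this bug. An optional, more targeted check (a suggestion, not part of the verification steps above) is to watch per-container CPU for the API server pods while the annotate loop runs:

$ oc adm top pod -n openshift-apiserver --containers

Alternatively, graph the rate(container_cpu_usage_seconds_total{...}) query from the description in the monitoring console and compare the readings before and after starting the loop.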
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.12.0 bug fix and security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:7399