2111165 – Project auth cache is fully invalidated on changes to namespaces and namespaced RBAC

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	OpenShift Container Platform
Classification:	Red Hat
Component:	openshift-apiserver
Sub Component:
Version:	4.2.0
Hardware:	Unspecified
OS:	Unspecified
Priority:	unspecified
Severity:	medium
Target Milestone:	---
Target Release:	4.12.0
Assignee:	Abu Kashem
QA Contact:	Rahul Gangwar
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:	2111167

Reported:	2022-07-26 15:59 UTC by Ben Luddy
Modified:	2023-01-17 19:54 UTC
CC List:	3 users
Fixed In Version:
Doc Type:	No Doc Update
Doc Text:
Clone Of:
Clones:	2111167
Environment:
Last Closed:	2023-01-17 19:53:34 UTC
Target Upstream Version:
Embargoed:

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Github	openshift openshift-apiserver pull 295	0	None	Merged	Bug 2111165: Stop unnecessary project auth cache invalidations.	2022-07-27 13:49:06 UTC
Red Hat Product Errata	RHSA-2022:7399	0	None	None	None	2023-01-17 19:54:01 UTC

Description Ben Luddy 2022-07-26 15:59:57 UTC

Description of problem:

The OpenShift API server maintains a cache used to scope project list and watch requests to namespaces that are visible to the requesting user. A periodic task runs in each openshift-apiserver process and updates the cache when namespaces, roles, rolebindings, clusterroles, or clusterrolebindings change. The cache can be updated in parts when a namespace, role, or rolebinding changes because the effect of these resources is limited to specific namespaces. Changes to clusterroles and clusterrolebindings perform a full invalidation, since they may impact any or all namespaces.

Today, the cache sync task always performs a full invalidation, even when only a namespace, role, or rolebinding has changed. This is particularly expensive on clusters with many namespaces.
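
The intended difference between a scoped update and a full invalidation can be illustrated with a toy model. This is only a sketch under stated assumptions: the class name ProjectAuthCacheModel, the sync method, and the reviews counter are hypothetical and do not reflect the actual openshift-apiserver code. It only counts how many namespaces would be re-evaluated per change.

```python
# Toy model only: all names here are hypothetical, not the openshift-apiserver implementation.
NAMESPACED_KINDS = {"namespace", "role", "rolebinding"}
CLUSTER_KINDS = {"clusterrole", "clusterrolebinding"}


class ProjectAuthCacheModel:
    """Counts how many namespaces are re-evaluated for each observed change."""

    def __init__(self, namespaces):
        self.namespaces = set(namespaces)
        self.reviews = 0  # per-namespace re-evaluations performed so far

    def sync(self, kind, namespace=None):
        """Return the namespaces re-evaluated for one changed object."""
        if kind in CLUSTER_KINDS:
            # Cluster-scoped RBAC may affect any namespace: full invalidation.
            targets = sorted(self.namespaces)
        elif kind in NAMESPACED_KINDS:
            # Namespaced objects only affect their own namespace.
            targets = [namespace] if namespace in self.namespaces else []
        else:
            raise ValueError(f"unexpected kind: {kind}")
        self.reviews += len(targets)
        return targets
```

In this model, an annotation update on the default namespace re-evaluates one namespace, while a clusterrolebinding change re-evaluates all of them. The behavior reported here corresponds to the first case being handled like the second.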

Version-Release number of selected component (if applicable): 4

How reproducible: Always

Steps to Reproduce:

It's difficult to observe directly, because the full invalidation still produces the correct behavior, but the secondary effect of increased CPU consumption in all openshift-apiserver processes is easy to observe.

1a. Create 100 namespaces (not necessary, but it makes the effect more obvious).

1b. Repeatedly update a namespace about once per second (suggest patching an annotation with a current timestamp as the value).

$ while true; do sleep 1; kubectl annotate namespace default --overwrite "timestamp=$(date)"; done

2. While continuing to update the namespace, monitor the CPU utilization metrics for openshift-apiserver.

rate(container_cpu_usage_seconds_total{namespace="openshift-apiserver",container="openshift-apiserver"}[1m])
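
The namespace creation in step 1a can be scripted. A minimal sketch, assuming kubectl is on the PATH and already pointed at the test cluster; the "test-" name prefix and both helper functions are hypothetical choices for illustration. It defaults to a dry run that only prints the commands.

```python
import subprocess


def create_namespace_commands(count, prefix="test-"):
    """Build one `kubectl create namespace` argv per namespace (prefix is an assumption)."""
    return [["kubectl", "create", "namespace", f"{prefix}{i}"] for i in range(1, count + 1)]


def run_all(commands, dry_run=True):
    """Print each command in dry-run mode; otherwise run it and stop on the first failure."""
    for argv in commands:
        if dry_run:
            print(" ".join(argv))
        else:
            subprocess.run(argv, check=True)


if __name__ == "__main__":
    # Dry run by default; pass dry_run=False to actually create the namespaces.
    run_all(create_namespace_commands(100))
```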

Actual results:

Significant CPU utilization increase over idle: at least a doubling, and I see about a 6-7x increase on a cluster with 1000 namespaces.

Expected results:

Little or no CPU utilization change.

Comment 2 Rahul Gangwar 2022-07-28 13:01:04 UTC

oc get clusterversion   
NAME      VERSION                              AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.12.0-0.nightly-2022-07-27-133042   True        False         4h47m   Cluster version is 4.12.0-0.nightly-2022-07-27-133042

CPU utilisation before creating 1000 namespaces:
 
oc adm top node         
NAME                                                         CPU(cores)   CPU%   MEMORY(bytes)   MEMORY%   
rgangwar-28t3-vngb5-master-0.c.openshift-qe.internal         680m         19%    7053Mi          51%       
rgangwar-28t3-vngb5-master-1.c.openshift-qe.internal         946m         27%    8818Mi          64%       
rgangwar-28t3-vngb5-master-2.c.openshift-qe.internal         1005m        28%    10001Mi         72%       
rgangwar-28t3-vngb5-worker-a-5qcr8.c.openshift-qe.internal   325m         9%     3808Mi          27%       
rgangwar-28t3-vngb5-worker-b-pngcz.c.openshift-qe.internal   311m         8%     3397Mi          24%       
rgangwar-28t3-vngb5-worker-c-6qtpb.c.openshift-qe.internal   207m         5%     2102Mi          15%   

CPU utilisation after creating 1000 namespaces:

oc get namespace|grep -i "test-"|wc -l
    1000

while true; do sleep 1; oc annotate namespace default --overwrite "timestamp=$(date)"; done

oc adm top node
NAME                                                         CPU(cores)   CPU%   MEMORY(bytes)   MEMORY%   
rgangwar-28t3-vngb5-master-0.c.openshift-qe.internal         731m         20%    8259Mi          60%       
rgangwar-28t3-vngb5-master-1.c.openshift-qe.internal         919m         26%    10733Mi         78%       
rgangwar-28t3-vngb5-master-2.c.openshift-qe.internal         1071m        30%    12024Mi         87%       
rgangwar-28t3-vngb5-worker-a-5qcr8.c.openshift-qe.internal   323m         9%     4066Mi          29%       
rgangwar-28t3-vngb5-worker-b-pngcz.c.openshift-qe.internal   402m         11%    3403Mi          24%       
rgangwar-28t3-vngb5-worker-c-6qtpb.c.openshift-qe.internal   178m         5%     2115Mi          15%

There is not much of a spike in CPU utilisation.

Comment 9 errata-xmlrpc 2023-01-17 19:53:34 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.12.0 bug fix and security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:7399
