1959291 – cluster-storage-operator should not rely on external networking for health check

Summary: cluster-storage-operator should not rely on external networking for health check

Keywords:
Status:	CLOSED NOTABUG
Alias:	None
Product:	OpenShift Container Platform
Classification:	Red Hat
Component:	Storage
Sub Component:
Version:	4.8
Hardware:	Unspecified
OS:	Unspecified
Priority:	unspecified
Severity:	high
Target Milestone:	---
Target Release:	---
Assignee:	Jan Safranek
QA Contact:	Qin Ping
Docs Contact:
URL:
Whiteboard:
Depends On:	1959285 1959290
Blocks:	1959292 1959293 1959294

Reported:	2021-05-11 08:18 UTC by Rom Freiman
Modified:	2021-05-13 10:06 UTC
CC List:	5 users
Fixed In Version:
Doc Type:	If docs needed, set a value
Doc Text:
Clone Of:	1959290
Clones:	1959292
Environment:
Last Closed:	2021-05-13 10:06:06 UTC
Target Upstream Version:
Embargoed:

Attachments
cluster-storage-operator audit event (213.24 KB, text/plain) 2021-05-12 20:54 UTC, Rom Freiman

Description Rom Freiman 2021-05-11 08:18:41 UTC

+++ This bug was initially created as a clone of Bug #1959290 +++

+++ This bug was initially created as a clone of Bug #1959285 +++

Apparently, cluster-storage-operator depends on SAR (SubjectAccessReview) as part of its health check, which causes it to be restarted during a kube-apiserver rollout on SNO (single-node OpenShift).


How reproducible:

Using cluster-bot:
1. launch nightly aws,single-node
2. Update audit log verbosity to: AllRequestBodies
3. Wait for the API rollout (oc get kubeapiserver -o=jsonpath='{range .items[0].status.conditions[?(@.type=="NodeInstallerProgressing")]}{.reason}{"\n"}{.message}{"\n"}{end}')
4. Reboot the node to clean up the caches (oc debug node/ip-10-0-136-254.ec2.internal)
5. Wait
6. Grep the audit log: 

oc adm node-logs ip-10-0-128-254.ec2.internal --path=kube-apiserver/audit.log | grep -i health | grep -i subjectaccessreviews | grep -v Unhealth > rbac.log
cat rbac.log  | jq . -C | less -r | grep 'username' | sort | uniq



Actual results:
~/work/installer [master]> cat rbac.log  | jq . -C | less -r | grep 'username' | sort | uniq
    "username": "system:serviceaccount:openshift-cluster-storage-operator:cluster-storage-operator"

Expected results:
The cluster-storage-operator service account should not appear in the output.

Additional info:
Affects SNO stability during an API rollout (certificate rotation).

Comment 1 Jan Safranek 2021-05-12 16:56:25 UTC

I don't understand this bug. cluster-storage-operator does not have any health check (yeah, maybe it should have one...). The operator may create SubjectAccessReviews for its /healthz endpoint (currently unused) or /metrics, but I don't see how a single audit log line could include both "health" and "subjectaccessreviews". Can you please post the complete audit log line?

Comment 2 Jan Safranek 2021-05-12 16:59:57 UTC

Oh, and please include cluster-storage-operator logs too, just in case.

Comment 3 Rom Freiman 2021-05-12 20:54:19 UTC

Created attachment 1782545
cluster-storage-operator audit event

Attaching the audit log

I don't have the CSO log.

Comment 5 Jan Safranek 2021-05-13 09:28:07 UTC

Thanks for the audit line. It's a List on ClusterRoles (/apis/rbac.authorization.k8s.io/v1/clusterroles?limit=500&resourceVersion=0), probably issued when initializing an informer. The response contains all your keywords (some rules allow accessing "/healthz", others allow get/list on SubjectAccessReviews). Still, that does not mean CSO does any form of health check using SubjectAccessReviews.
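
The false positive described above can be illustrated with a minimal sketch that filters on the structured fields of each audit event instead of grepping the whole line. It assumes the log holds one JSON audit event per line with the standard `verb`, `user.username` and `objectRef` fields. The helper names and both sample events are invented for illustration and are not taken from the attached log.

```python
import json


def matches_grep(line: str) -> bool:
    """Mimic: grep -i health | grep -i subjectaccessreviews | grep -v Unhealth."""
    lower = line.lower()
    return (
        "health" in lower
        and "subjectaccessreviews" in lower
        and "Unhealth" not in line
    )


def is_sar_request(event: dict) -> bool:
    """True only if the request itself creates a SubjectAccessReview."""
    ref = event.get("objectRef") or {}
    return (
        event.get("verb") == "create"
        and ref.get("apiGroup") == "authorization.k8s.io"
        and ref.get("resource") == "subjectaccessreviews"
    )


def sar_usernames(lines) -> set:
    """Collect the users that actually issued SubjectAccessReview requests."""
    names = set()
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            event = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip truncated or non-JSON lines
        if is_sar_request(event):
            names.add((event.get("user") or {}).get("username", ""))
    return names


# Invented sample: a ClusterRoles list whose *response body* happens to
# mention both "/healthz" and "subjectaccessreviews".
CSO = "system:serviceaccount:openshift-cluster-storage-operator:cluster-storage-operator"
list_event = json.dumps({
    "verb": "list",
    "user": {"username": CSO},
    "objectRef": {"apiGroup": "rbac.authorization.k8s.io", "resource": "clusterroles"},
    "requestURI": "/apis/rbac.authorization.k8s.io/v1/clusterroles?limit=500&resourceVersion=0",
    "responseObject": {"items": [
        {"rules": [{"nonResourceURLs": ["/healthz"], "verbs": ["get"]}]},
        {"rules": [{"apiGroups": ["authorization.k8s.io"],
                    "resources": ["subjectaccessreviews"],
                    "verbs": ["create"]}]},
    ]},
})

# Invented sample: a genuine SubjectAccessReview request from a hypothetical client.
HYPOTHETICAL_CLIENT = "system:serviceaccount:example:hypothetical-client"
sar_event = json.dumps({
    "verb": "create",
    "user": {"username": HYPOTHETICAL_CLIENT},
    "objectRef": {"apiGroup": "authorization.k8s.io", "resource": "subjectaccessreviews"},
})

print(matches_grep(list_event))
print(sar_usernames([list_event, sar_event]))
```

With this kind of filter, the ClusterRoles list is matched by the keyword grep but is not counted as a SubjectAccessReview request, so the CSO service account would only be reported if it really created one.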

Comment 6 Rom Freiman 2021-05-13 09:30:51 UTC

@jsafrane if this is the case, feel free to close it. Thanks for the explanation.
