Note: This bug is displayed in read-only format because the product is no longer active in Red Hat Bugzilla.

Bug 1959291

Summary: cluster-storage-operator should not rely on external networking for health check
Product: OpenShift Container Platform
Component: Storage
Storage sub component: Operators
Version: 4.8
Reporter: Rom Freiman <rfreiman>
Assignee: Jan Safranek <jsafrane>
QA Contact: Qin Ping <piqin>
Status: CLOSED NOTABUG
Severity: high
Priority: unspecified
CC: aos-bugs, jsafrane, mfojtik, sttts, xxia
Target Milestone: ---
Target Release: ---
Hardware: Unspecified
OS: Unspecified
Doc Type: If docs needed, set a value
Clone Of: 1959290
Clones: 1959292
Last Closed: 2021-05-13 10:06:06 UTC
Bug Depends On: 1959285, 1959290
Bug Blocks: 1959292, 1959293, 1959294
Attachments:
cluster-storage-operator audit event

Description Rom Freiman 2021-05-11 08:18:41 UTC
+++ This bug was initially created as a clone of Bug #1959290 +++

+++ This bug was initially created as a clone of Bug #1959285 +++

Apparently, cluster-storage-operator has a dependency on SubjectAccessReview (SAR) as part of its health check, which causes it to be restarted during a kube-apiserver rollout on single-node OpenShift (SNO).


How reproducible:

Using cluster-bot:
1. Launch nightly aws,single-node.
2. Update the audit log verbosity to AllRequestBodies (see the sketch after the commands below).
3. Wait for an API rollout (oc get kubeapiserver -o=jsonpath='{range .items[0].status.conditions[?(@.type=="NodeInstallerProgressing")]}{.reason}{"\n"}{.message}{"\n"}').
4. Reboot the node to clean up the caches (oc debug node/ip-10-0-136-254.ec2.internal).
5. Wait.
6. Grep the audit log:

oc adm node-logs ip-10-0-128-254.ec2.internal --path=kube-apiserver/audit.log | grep -i health | grep -i subjectaccessreviews | grep -v Unhealth > rbac.log
cat rbac.log  | jq . -C | less -r | grep 'username' | sort | uniq
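
A sketch of step 2, assuming the audit verbosity is set through the cluster-scoped APIServer config resource (profile names as documented for OpenShift 4.x; a kube-apiserver rollout follows the change):

oc patch apiserver cluster --type=merge -p '{"spec":{"audit":{"profile":"AllRequestBodies"}}}'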



Actual results:
~/work/installer [master]> cat rbac.log  | jq . -C | less -r | grep 'username' | sort | uniq
    "username": "system:serviceaccount:openshift-cluster-storage-operator:cluster-storage-operator"

Expected results:
It should not appear

Additional info:
Affects SNO stability during API rollout (certificate rotation).

Comment 1 Jan Safranek 2021-05-12 16:56:25 UTC
I don't understand this bug. cluster-storage-operator does not have any health check (yeah, maybe it should have one...). The operator may create SubjectAccessReviews for its /healthz endpoint (currently unused) or /metrics, but I don't see how a single audit log line could include both "health" and "subjectaccessreviews". Can you please post the complete audit log line?
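
As a sketch, one way to pull any such lines directly (node name taken from the reproduction above; the jq filter uses the standard Kubernetes audit event fields):

oc adm node-logs ip-10-0-128-254.ec2.internal --path=kube-apiserver/audit.log \
  | jq -c 'select(.objectRef.resource == "subjectaccessreviews" and .user.username == "system:serviceaccount:openshift-cluster-storage-operator:cluster-storage-operator")'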

Comment 2 Jan Safranek 2021-05-12 16:59:57 UTC
Oh, and please include cluster-storage-operator logs too, just in case.

Comment 3 Rom Freiman 2021-05-12 20:54:19 UTC
Created attachment 1782545 [details]
cluster-storage-operator audit event

Attaching the audit log

I don't have the CSO log.

Comment 5 Jan Safranek 2021-05-13 09:28:07 UTC
Thanks for the audit line. It's a List on ClusterRoles (/apis/rbac.authorization.k8s.io/v1/clusterroles?limit=500&resourceVersion=0), probably issued when initializing an informer. The response contains all your keywords (some rules allow access to "/healthz", others allow get/list on SubjectAccessReviews), but that does not mean CSO does any form of health check using SubjectAccessReviews.
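
For reference, a sketch of a filter that isolates that informer List (as opposed to an actual SubjectAccessReview request), using the same audit fields as above:

oc adm node-logs ip-10-0-128-254.ec2.internal --path=kube-apiserver/audit.log \
  | jq -c 'select(.verb == "list" and .objectRef.resource == "clusterroles" and .user.username == "system:serviceaccount:openshift-cluster-storage-operator:cluster-storage-operator")'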

Comment 6 Rom Freiman 2021-05-13 09:30:51 UTC
@jsafrane if this is the case, feel free to close it. Thanks for the explanation.