1959292 – kube-controller-manager-operator should not rely on external networking for health check

Bug 1959292 - kube-controller-manager-operator should not rely on external networking for health check

Keywords:
Status:	CLOSED NOTABUG
Alias:	None
Product:	OpenShift Container Platform
Classification:	Red Hat
Component:	kube-controller-manager
Sub Component:
Version:	4.8
Hardware:	Unspecified
OS:	Unspecified
Priority:	high
Severity:	high
Target Milestone:	---
Target Release:	4.8.0
Assignee:	ravig
QA Contact:	zhou ying
Docs Contact:
URL:
Whiteboard:
Depends On:	1959285 1959290 1959291
Blocks:	1959293 1959294

Reported:	2021-05-11 08:20 UTC by Rom Freiman
Modified:	2021-06-08 17:39 UTC
CC List:	6 users
Fixed In Version:
Doc Type:	If docs needed, set a value
Doc Text:
Clone Of:	1959291
Clones:	1959293
Environment:
Last Closed:	2021-06-08 15:58:19 UTC
Target Upstream Version:
Embargoed:

Description Rom Freiman 2021-05-11 08:20:27 UTC

+++ This bug was initially created as a clone of Bug #1959291 +++

+++ This bug was initially created as a clone of Bug #1959290 +++

+++ This bug was initially created as a clone of Bug #1959285 +++

Apparently, kube-controller-manager-operator depends on a SubjectAccessReview (SAR) as part of its health check, which causes it to be restarted during a kube-apiserver rollout on single-node OpenShift (SNO).


How reproducible:

Using cluster-bot:
1. launch nightly aws,single-node
2. Update audit log verbosity to: AllRequestBodies
3. Wait for the API rollout (oc get kubeapiserver -o=jsonpath='{range .items[0].status.conditions[?(@.type=="NodeInstallerProgressing")]}{.reason}{"\n"}{.message}{"\n"}{end}')
4. Reboot the node to clean up the caches (oc debug node/ip-10-0-136-254.ec2.internal)
5. Wait
6. Grep the audit log: 

oc adm node-logs ip-10-0-128-254.ec2.internal --path=kube-apiserver/audit.log | grep -i health | grep -i subjectaccessreviews | grep -v Unhealth > rbac.log
cat rbac.log  | jq . -C | less -r | grep 'username' | sort | uniq
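The two commands in step 6 can be folded into one pipeline that skips the intermediate file and the pager. A minimal sketch, assuming each audit event is a single line of JSON; sar_health_usernames is a hypothetical helper name, not an oc subcommand, and like the original grep it matches every "username" key on a matching line:

```shell
#!/bin/sh
# Sketch: list the distinct usernames behind health-related
# SubjectAccessReview audit events. Reads audit log lines on stdin.
# sar_health_usernames is a hypothetical helper, not part of oc.
sar_health_usernames() {
    grep -i 'health' \
        | grep -i 'subjectaccessreviews' \
        | grep -v 'Unhealth' \
        | grep -oE '"username":[[:space:]]*"[^"]*"' \
        | sort -u
}

# Usage (node name taken from the report):
# oc adm node-logs ip-10-0-128-254.ec2.internal --path=kube-apiserver/audit.log | sar_health_usernames
```

An empty result would correspond to the expected behaviour described in this report.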



Actual results:
~/work/installer [master]> cat rbac.log  | jq . -C | less -r | grep 'username' | sort | uniq
    "username": "system:serviceaccount:openshift-kube-controller-manager-operator:kube-controller-manager-operator"

Expected results:
The kube-controller-manager-operator service account should not appear in these entries.

Additional info:
Affects SNO stability during API rollout (certificate rotation).

Comment 1 ravig 2021-06-08 15:58:19 UTC

Hi Rom,

The health checks we have in KCMO check the KCM endpoint. We only have health checks on port 10257, as you can see here:

https://github.com/openshift/cluster-kube-controller-manager-operator/blob/dc54142035982bc44581936a7e90cdd9ac9ad24e/bindata/v4.1.0/kube-controller-manager/pod.yaml
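To probe that port by hand, something like the following could be run on the control-plane node. This is only a sketch: the /healthz path and the kcm_healthz helper name are assumptions for illustration, and --insecure skips certificate verification.

```shell
#!/bin/sh
# Sketch: query the kube-controller-manager health endpoint on port 10257.
# kcm_healthz is a hypothetical helper; the /healthz path is an assumption.
kcm_healthz() {
    host=${1:-localhost}
    # --insecure: the serving certificate is not verified in this sketch.
    curl --silent --insecure --max-time 5 "https://${host}:10257/healthz"
}

# Usage, from a shell on the node:
# kcm_healthz
```

This request goes straight to the KCM port and does not pass through the API server.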

When KCMO starts, it connects to the API server, and that is perhaps why you are seeing those entries in the audit log.

So I am closing this BZ for now. Feel free to reopen it if you feel otherwise.
