Bug 2118286

Summary: KCMO should not be dependent on monitoring stack
Product: OpenShift Container Platform
Reporter: Filip Krepinsky <fkrepins>
Component: kube-controller-manager
Assignee: Filip Krepinsky <fkrepins>
Status: CLOSED ERRATA
QA Contact: zhou ying <yinzhou>
Severity: high
Docs Contact:
Priority: medium
Version: 4.11
CC: mfojtik, stevsmit, yinzhou
Target Milestone: ---
Target Release: 4.12.0
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
* Previously, the `kube-controller-manager` Operator was reporting `degraded` on environments without a monitoring stack presence. With this update, the `kube-controller-manager` Operator skips checking the monitoring for cues about degradation when the monitoring stack is not present. (link:https://bugzilla.redhat.com/show_bug.cgi?id=2118286[*BZ#2118286*])
Story Points: ---
Clone Of:
Environment:
Last Closed: 2023-01-17 19:54:55 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On:
Bug Blocks: 2118282

Description Filip Krepinsky 2022-08-15 11:44:25 UTC
Description of problem:
KCMO (the kube-controller-manager Operator) should not go Degraded on environments without a monitoring stack.

Comment 3 Filip Krepinsky 2022-08-16 20:22:39 UTC
This is meant for single-node deployments, and removal of `clusteroperators.config.openshift.io monitoring` is needed to test it. Just disabling monitoring is not enough; in that case degradation is expected, as it should signal that something is wrong.
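
For reference, a minimal sketch of that removal step (illustrative; if the cluster-monitoring-operator is still running it may recreate the resource, so monitoring has to be removed first):

oc delete clusteroperators.config.openshift.io monitoring
oc get clusteroperators.config.openshift.io monitoring   # should return NotFound afterwards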

Also, if you raise the log level of the KCM operator, you should see this message:

klog.V(5).Info("Monitoring is disabled in the cluster and a diagnostic of the garbage collector is not working. Please look at the kcm logs for more information to debug the garbage collector further") 

and consequently KCM should not go Degraded.
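
For the log-level part, one way to raise the operator's verbosity is to set it on the operator resource (a sketch, assuming the standard operatorLogLevel field on the kubecontrollermanager/cluster resource; Trace typically maps to verbosity 6, which includes the V(5) message above):

oc patch kubecontrollermanager cluster --type=merge -p '{"spec":{"operatorLogLevel":"Trace"}}'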

Comment 4 zhou ying 2022-08-18 07:10:23 UTC
Tested with the latest payload; two questions:

[root@localhost oc]# oc get node
NAME                                        STATUS   ROLES                         AGE     VERSION
ip-10-0-159-19.us-east-2.compute.internal   Ready    control-plane,master,worker   4h30m   v1.24.0+ed93380
[root@localhost oc]# oc get clusterversion 
NAME      VERSION                                    AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.12.0-0.nightly-arm64-2022-08-17-172700   True        True          22m     Unable to apply 4.12.0-0.nightly-arm64-2022-08-17-172700: some cluster operators are not available

1) After removing the monitoring stack and deleting clusteroperators.config.openshift.io monitoring, KCM is still degraded:
oc get co kube-controller-manager 
NAME                      VERSION                                    AVAILABLE   PROGRESSING   DEGRADED   SINCE   MESSAGE
kube-controller-manager   4.12.0-0.nightly-arm64-2022-08-17-172700   True        False         True       4h21m   GarbageCollectorDegraded: error querying alerts: Post "https://thanos-querier.openshift-monitoring.svc:9091/api/v1/query": dial tcp 172.30.227.26:9091: connect: connection refused

2) Can't see the log message after setting the log level to Trace: klog.V(5).Info("Monitoring is disabled in the cluster and a diagnostic of the garbage collector is not working. Please look at the kcm logs for more information to debug the garbage collector further"). Please see:

oc describe pod/kube-controller-manager-ip-10-0-159-19.us-east-2.compute.internal
Name:                 kube-controller-manager-ip-10-0-159-19.us-east-2.compute.internal
Namespace:            openshift-kube-controller-manager
..
Containers:
  kube-controller-manager:
    Container ID:  cri-o://2a95b5fdd71b21878389d1ea16169b753c120d7bded5e8a7d303e504d8db763a
...
      exec hyperkube kube-controller-manager --openshift-config=/etc/kubernetes/static-pod-resources/configmaps/config/config.yaml \
        --kubeconfig=/etc/kubernetes/static-pod-resources/configmaps/controller-manager-kubeconfig/kubeconfig \
        --authentication-kubeconfig=/etc/kubernetes/static-pod-resources/configmaps/controller-manager-kubeconfig/kubeconfig \
        --authorization-kubeconfig=/etc/kubernetes/static-pod-resources/configmaps/controller-manager-kubeconfig/kubeconfig \
        --client-ca-file=/etc/kubernetes/static-pod-certs/configmaps/client-ca/ca-bundle.crt \
        --requestheader-client-ca-file=/etc/kubernetes/static-pod-certs/configmaps/aggregator-client-ca/ca-bundle.crt -v=6 
...

oc logs -f po/kube-controller-manager-ip-10-0-159-19.us-east-2.compute.internal |grep "Monitoring"
 
oc logs -f po/kube-controller-manager-ip-10-0-159-19.us-east-2.compute.internal |grep "gcwatcher_controller.go"

Comment 5 Filip Krepinsky 2022-08-18 14:06:46 UTC
I think you may have found an issue and we need a fix for it. But can you please check these 2 things first?

1. Please check that clusteroperators.config.openshift.io monitoring is still deleted and was not recreated.
2. You should still see the log message mentioned above, but you have to increase the log level and check the logs of the kube-controller-manager-operator pod (not KCM); see the example commands after this list.
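
For example (illustrative commands; the namespace and deployment name are the install defaults and may differ):

oc get clusteroperators.config.openshift.io monitoring
oc -n openshift-kube-controller-manager-operator logs deployment/kube-controller-manager-operator | grep "Monitoring"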

But the operator should not go degraded, so we have to fix that.

Comment 6 Filip Krepinsky 2022-08-24 14:12:39 UTC
The fix has merged. Additionally, the GarbageCollectorDegraded condition should now show "MonitoringDisabled" as the reason.
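
For example, the new reason could be checked with something like this (a sketch; it assumes the GarbageCollectorDegraded condition is surfaced on the kubecontrollermanager/cluster operator resource, as the aggregated ClusterOperator message in comment 4 suggests):

oc get kubecontrollermanager cluster -o jsonpath='{.status.conditions[?(@.type=="GarbageCollectorDegraded")].reason}'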

Comment 8 zhou ying 2022-09-01 07:04:02 UTC
Tested with the latest payload; the issue has been fixed:

After disabling the monitoring stack:
oc get co kube-controller-manager
NAME                      VERSION                                    AVAILABLE   PROGRESSING   DEGRADED   SINCE   MESSAGE
kube-controller-manager   4.12.0-0.nightly-arm64-2022-08-31-125219   True        False         False      82m
oc logs -f po/kube-controller-manager-operator-78dddd9d74-t87kd  |grep "Monitoring"
I0901 06:59:58.690554       1 gcwatcher_controller.go:129] Monitoring is disabled in the cluster and a diagnostic of the garbage collector is not working. Please look at the kcm logs for more information to debug the garbage collector further

[root@localhost home]# oc get clusterversion 
NAME      VERSION                                    AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.12.0-0.nightly-arm64-2022-08-31-125219   True        False         7m43s   Error while reconciling 4.12.0-0.nightly-arm64-2022-08-31-125219: some resources could not be updated
[root@localhost home]# oc get node
NAME                                         STATUS   ROLES                         AGE   VERSION
ip-10-0-158-144.us-east-2.compute.internal   Ready    control-plane,master,worker   92m   v1.24.0+a097e26

Comment 11 errata-xmlrpc 2023-01-17 19:54:55 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.12.0 bug fix and security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:7399