Description of problem: KCMO should not fail in environments without a monitoring stack
This is meant for single-node deployments, and removal of `clusteroperators.config.openshift.io monitoring` is needed to test it. Just disabling monitoring is not enough; in that case degradation is expected, since it should signal that something is wrong. Also, if you bring the log level of the KCM operator up, you should see this message:

klog.V(5).Info("Monitoring is disabled in the cluster and a diagnostic of the garbage collector is not working. Please look at the kcm logs for more information to debug the garbage collector further")

and consequently KCM should not be degraded.
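Roughly, the steps to exercise this look like the sketch below (the operatorLogLevel field and the deployment/namespace names are my assumptions from a default install, adjust as needed):

# remove the monitoring ClusterOperator entirely (just disabling monitoring is not enough)
oc delete clusteroperator monitoring

# raise the operator's own log level so the V(5) message is emitted
oc patch kubecontrollermanager cluster --type=merge -p '{"spec":{"operatorLogLevel":"Trace"}}'

# expectation: kube-controller-manager stays non-degraded and the operator log carries the hint
oc get co kube-controller-manager
oc logs -n openshift-kube-controller-manager-operator deployment/kube-controller-manager-operator | grep "Monitoring is disabled"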
Tested with the latest payload; two questions:

[root@localhost oc]# oc get node
NAME                                        STATUS   ROLES                         AGE     VERSION
ip-10-0-159-19.us-east-2.compute.internal   Ready    control-plane,master,worker   4h30m   v1.24.0+ed93380

[root@localhost oc]# oc get clusterversion
NAME      VERSION                                    AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.12.0-0.nightly-arm64-2022-08-17-172700   True        True          22m     Unable to apply 4.12.0-0.nightly-arm64-2022-08-17-172700: some cluster operators are not available

1) After removing the monitoring stack and deleting clusteroperators.config.openshift.io monitoring, KCM goes degraded:

oc get co kube-controller-manager
NAME                      VERSION                                    AVAILABLE   PROGRESSING   DEGRADED   SINCE   MESSAGE
kube-controller-manager   4.12.0-0.nightly-arm64-2022-08-17-172700   True        False         True       4h21m   GarbageCollectorDegraded: error querying alerts: Post "https://thanos-querier.openshift-monitoring.svc:9091/api/v1/query": dial tcp 172.30.227.26:9091: connect: connection refused

2) Can't see the log line klog.V(5).Info("Monitoring is disabled in the cluster and a diagnostic of the garbage collector is not working. Please look at the kcm logs for more information to debug the garbage collector further") after setting logLevel==Trace, please see:

oc describe pod/kube-controller-manager-ip-10-0-159-19.us-east-2.compute.internal
Name:        kube-controller-manager-ip-10-0-159-19.us-east-2.compute.internal
Namespace:   openshift-kube-controller-manager
..
Containers:
  kube-controller-manager:
    Container ID:  cri-o://2a95b5fdd71b21878389d1ea16169b753c120d7bded5e8a7d303e504d8db763a
    ...
      exec hyperkube kube-controller-manager --openshift-config=/etc/kubernetes/static-pod-resources/configmaps/config/config.yaml \
        --kubeconfig=/etc/kubernetes/static-pod-resources/configmaps/controller-manager-kubeconfig/kubeconfig \
        --authentication-kubeconfig=/etc/kubernetes/static-pod-resources/configmaps/controller-manager-kubeconfig/kubeconfig \
        --authorization-kubeconfig=/etc/kubernetes/static-pod-resources/configmaps/controller-manager-kubeconfig/kubeconfig \
        --client-ca-file=/etc/kubernetes/static-pod-certs/configmaps/client-ca/ca-bundle.crt \
        --requestheader-client-ca-file=/etc/kubernetes/static-pod-certs/configmaps/aggregator-client-ca/ca-bundle.crt -v=6
    ...

oc logs -f po/kube-controller-manager-ip-10-0-159-19.us-east-2.compute.internal |grep "Monitoring"
oc logs -f po/kube-controller-manager-ip-10-0-159-19.us-east-2.compute.internal |grep "gcwatcher_controller.go"
I think you may have found an issue and we need a fix for it. But can you please check these two things first?

1. Check that clusteroperators.config.openshift.io monitoring is still deleted and was not recreated.
2. You should still see the log message mentioned, but you have to increase the log level and check the logs of the kube-controller-manager-operator pod (not KCM).

Either way, the operator should not go degraded, so we have to fix that.
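Something like the following should cover both checks (a sketch; the deployment and namespace names are assumed from a default install):

# 1) confirm the monitoring ClusterOperator is still gone and was not recreated
oc get clusteroperator monitoring    # expect a NotFound error

# 2) grep for the message in the operator pod, not in the kcm static pod
oc logs -n openshift-kube-controller-manager-operator deployment/kube-controller-manager-operator | grep gcwatcher_controller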
The fix has merged; additionally, the GarbageCollectorDegraded condition should now show "MonitoringDisabled" as the reason.
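One way to verify the reason (a sketch; I am assuming the condition is surfaced on the kubecontrollermanager/cluster operator resource):

oc get kubecontrollermanager cluster -o jsonpath='{.status.conditions[?(@.type=="GarbageCollectorDegraded")].reason}'
# expected per the comment above: MonitoringDisabled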
Tested with the latest payload; the issue has been fixed. After disabling the monitoring stack:

oc get co kube-controller-manager
NAME                      VERSION                                    AVAILABLE   PROGRESSING   DEGRADED   SINCE   MESSAGE
kube-controller-manager   4.12.0-0.nightly-arm64-2022-08-31-125219   True        False         False      82m

oc logs -f po/kube-controller-manager-operator-78dddd9d74-t87kd |grep "Monitoring"
I0901 06:59:58.690554       1 gcwatcher_controller.go:129] Monitoring is disabled in the cluster and a diagnostic of the garbage collector is not working. Please look at the kcm logs for more information to debug the garbage collector further

[root@localhost home]# oc get clusterversion
NAME      VERSION                                    AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.12.0-0.nightly-arm64-2022-08-31-125219   True        False         7m43s   Error while reconciling 4.12.0-0.nightly-arm64-2022-08-31-125219: some resources could not be updated

[root@localhost home]# oc get node
NAME                                         STATUS   ROLES                         AGE   VERSION
ip-10-0-158-144.us-east-2.compute.internal   Ready    control-plane,master,worker   92m   v1.24.0+a097e26
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.12.0 bug fix and security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2022:7399