Bug 2118286
Summary: | KCMO should not be dependent on monitoring stack | ||
---|---|---|---|
Product: | OpenShift Container Platform | Reporter: | Filip Krepinsky <fkrepins> |
Component: | kube-controller-manager | Assignee: | Filip Krepinsky <fkrepins> |
Status: | CLOSED ERRATA | QA Contact: | zhou ying <yinzhou> |
Severity: | high | Docs Contact: | |
Priority: | medium | ||
Version: | 4.11 | CC: | mfojtik, stevsmit, yinzhou |
Target Milestone: | --- | ||
Target Release: | 4.12.0 | ||
Hardware: | Unspecified | ||
OS: | Unspecified | ||
Whiteboard: | |||
Fixed In Version: | Doc Type: | Bug Fix | |
Doc Text: |
* Previously, the `kube-controller-manager` Operator reported `degraded` in environments without a monitoring stack. With this update, the `kube-controller-manager` Operator skips checking the monitoring stack for cues about degradation when the monitoring stack is not present. (link:https://bugzilla.redhat.com/show_bug.cgi?id=2118286[*BZ#2118286*])
|
Story Points: | --- |
Clone Of: | Environment: | ||
Last Closed: | 2023-01-17 19:54:55 UTC | Type: | Bug |
Regression: | --- | Mount Type: | --- |
Documentation: | --- | CRM: | |
Verified Versions: | Category: | --- | |
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
Cloudforms Team: | --- | Target Upstream Version: | |
Bug Depends On: | |||
Bug Blocks: | 2118282 |
Description
Filip Krepinsky
2022-08-15 11:44:25 UTC
This is meant for single-node deployments, and removing `clusteroperators.config.openshift.io monitoring` is needed to test this. Just disabling monitoring is not enough; in that case degradation is expected, because it should signal that something is wrong.

Also, if you raise the log level of the KCM operator, you should see this message:

klog.V(5).Info("Monitoring is disabled in the cluster and a diagnostic of the garbage collector is not working. Please look at the kcm logs for more information to debug the garbage collector further")

and consequently KCM should not be degraded.

Tested with the latest payload; two questions:

[root@localhost oc]# oc get node
NAME STATUS ROLES AGE VERSION
ip-10-0-159-19.us-east-2.compute.internal Ready control-plane,master,worker 4h30m v1.24.0+ed93380

[root@localhost oc]# oc get clusterversion
NAME VERSION AVAILABLE PROGRESSING SINCE STATUS
version 4.12.0-0.nightly-arm64-2022-08-17-172700 True True 22m Unable to apply 4.12.0-0.nightly-arm64-2022-08-17-172700: some cluster operators are not available

1) After removing the monitoring stack and deleting `clusteroperators.config.openshift.io monitoring`, KCM becomes degraded:

oc get co kube-controller-manager
NAME VERSION AVAILABLE PROGRESSING DEGRADED SINCE MESSAGE
kube-controller-manager 4.12.0-0.nightly-arm64-2022-08-17-172700 True False True 4h21m GarbageCollectorDegraded: error querying alerts: Post "https://thanos-querier.openshift-monitoring.svc:9091/api/v1/query": dial tcp 172.30.227.26:9091: connect: connection refused

2) After setting loglevel==Trace, I can't see the log message klog.V(5).Info("Monitoring is disabled in the cluster and a diagnostic of the garbage collector is not working. Please look at the kcm logs for more information to debug the garbage collector further"). Please see:

oc describe pod/kube-controller-manager-ip-10-0-159-19.us-east-2.compute.internal
Name: kube-controller-manager-ip-10-0-159-19.us-east-2.compute.internal
Namespace: openshift-kube-controller-manager
..
Containers:
kube-controller-manager:
Container ID: cri-o://2a95b5fdd71b21878389d1ea16169b753c120d7bded5e8a7d303e504d8db763a
...
exec hyperkube kube-controller-manager --openshift-config=/etc/kubernetes/static-pod-resources/configmaps/config/config.yaml \
--kubeconfig=/etc/kubernetes/static-pod-resources/configmaps/controller-manager-kubeconfig/kubeconfig \
--authentication-kubeconfig=/etc/kubernetes/static-pod-resources/configmaps/controller-manager-kubeconfig/kubeconfig \
--authorization-kubeconfig=/etc/kubernetes/static-pod-resources/configmaps/controller-manager-kubeconfig/kubeconfig \
--client-ca-file=/etc/kubernetes/static-pod-certs/configmaps/client-ca/ca-bundle.crt \
--requestheader-client-ca-file=/etc/kubernetes/static-pod-certs/configmaps/aggregator-client-ca/ca-bundle.crt -v=6
...

oc logs -f po/kube-controller-manager-ip-10-0-159-19.us-east-2.compute.internal |grep "Monitoring"
oc logs -f po/kube-controller-manager-ip-10-0-159-19.us-east-2.compute.internal |grep "gcwatcher_controller.go"

I think you may have found an issue and we need a fix for it. But can you please check these 2 things first?

1. Please check that `clusteroperators.config.openshift.io monitoring` is still deleted and was not recreated.
2. You should still see the logs mentioned, but you have to increase the log level and check the logs in the kube-controller-manager-operator pod (not kcm).

But the operator should not go degraded, so we have to fix that.
The fix merged. Now, additionally, the GarbageCollectorDegraded condition should show "MonitoringDisabled" as a reason.

Tested with the latest payload; the issue is fixed. After disabling the monitoring stack:

oc get co kube-controller-manager
NAME VERSION AVAILABLE PROGRESSING DEGRADED SINCE MESSAGE
kube-controller-manager 4.12.0-0.nightly-arm64-2022-08-31-125219 True False False 82m

oc logs -f po/kube-controller-manager-operator-78dddd9d74-t87kd |grep "Monitoring"
I0901 06:59:58.690554 1 gcwatcher_controller.go:129] Monitoring is disabled in the cluster and a diagnostic of the garbage collector is not working. Please look at the kcm logs for more information to debug the garbage collector further

[root@localhost home]# oc get clusterversion
NAME VERSION AVAILABLE PROGRESSING SINCE STATUS
version 4.12.0-0.nightly-arm64-2022-08-31-125219 True False 7m43s Error while reconciling 4.12.0-0.nightly-arm64-2022-08-31-125219: some resources could not be updated

[root@localhost home]# oc get node
NAME STATUS ROLES AGE VERSION
ip-10-0-158-144.us-east-2.compute.internal Ready control-plane,master,worker 92m v1.24.0+a097e26

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.12.0 bug fix and security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:7399