Bug 2118286 - KCMO should not be dependent on monitoring stack
Summary: KCMO should not be dependent on monitoring stack
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: kube-controller-manager
Version: 4.11
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: high
Target Milestone: ---
Target Release: 4.12.0
Assignee: Filip Krepinsky
QA Contact: zhou ying
URL:
Whiteboard:
Depends On:
Blocks: 2118282
Reported: 2022-08-15 11:44 UTC by Filip Krepinsky
Modified: 2023-01-17 19:55 UTC
CC: 3 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
* Previously, the `kube-controller-manager` Operator was reporting `degraded` on environments without a monitoring stack presence. With this update, the `kube-controller-manager` Operator skips checking the monitoring for cues about degradation when the monitoring stack is not present. (link:https://bugzilla.redhat.com/show_bug.cgi?id=2118286[*BZ#2118286*])
Clone Of:
Environment:
Last Closed: 2023-01-17 19:54:55 UTC
Target Upstream Version:




Links
System ID Private Priority Status Summary Last Updated
Github openshift cluster-kube-controller-manager-operator pull 639 0 None Merged Make KCM-O conditionally dependent on monitoring stack availability 2022-08-24 12:02:20 UTC
Github openshift cluster-kube-controller-manager-operator pull 650 0 None Merged Bug 2118286: always report and reconcile GarbageCollectorDegraded condition 2022-08-24 19:32:17 UTC
Red Hat Product Errata RHSA-2022:7399 0 None None None 2023-01-17 19:55:07 UTC

Description Filip Krepinsky 2022-08-15 11:44:25 UTC
Description of problem:
KCMO should not fail in environments without a monitoring stack.

Comment 3 Filip Krepinsky 2022-08-16 20:22:39 UTC
This is meant for single-node deployments, and removal of `clusteroperators.config.openshift.io monitoring` is needed to test it. Merely disabling monitoring is not enough; in that case degradation is expected, since it should signal that something is wrong.

Also, if you raise the log level of the KCM operator, you should see this message:

klog.V(5).Info("Monitoring is disabled in the cluster and a diagnostic of the garbage collector is not working. Please look at the kcm logs for more information to debug the garbage collector further") 

and consequently KCM should not be reported as degraded.
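The removal-and-verification flow described above might look like the following sketch (the `logLevel` patch targets the `kubecontrollermanager/cluster` operator config; the exact sequence is illustrative, not an official test plan):

```shell
# Remove the monitoring ClusterOperator so KCM-O can detect its absence
# (merely disabling monitoring is not enough, per the comment above).
oc delete clusteroperators.config.openshift.io monitoring

# Raise the operator log level so the klog.V(5) diagnostic becomes visible.
oc patch kubecontrollermanager/cluster --type=merge \
  -p '{"spec":{"logLevel":"Trace"}}'

# KCM should now report Degraded=False.
oc get co kube-controller-manager
```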

Comment 4 zhou ying 2022-08-18 07:10:23 UTC
Tested with the latest payload; two questions:

[root@localhost oc]# oc get node
NAME                                        STATUS   ROLES                         AGE     VERSION
ip-10-0-159-19.us-east-2.compute.internal   Ready    control-plane,master,worker   4h30m   v1.24.0+ed93380
[root@localhost oc]# oc get clusterversion 
NAME      VERSION                                    AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.12.0-0.nightly-arm64-2022-08-17-172700   True        True          22m     Unable to apply 4.12.0-0.nightly-arm64-2022-08-17-172700: some cluster operators are not available

1) After removing the monitoring stack and deleting `clusteroperators.config.openshift.io monitoring`, KCM still becomes degraded:
oc get co kube-controller-manager 
NAME                      VERSION                                    AVAILABLE   PROGRESSING   DEGRADED   SINCE   MESSAGE
kube-controller-manager   4.12.0-0.nightly-arm64-2022-08-17-172700   True        False         True       4h21m   GarbageCollectorDegraded: error querying alerts: Post "https://thanos-querier.openshift-monitoring.svc:9091/api/v1/query": dial tcp 172.30.227.26:9091: connect: connection refused

2) After setting the log level to Trace, I can't see the expected log message (klog.V(5).Info("Monitoring is disabled in the cluster and a diagnostic of the garbage collector is not working. Please look at the kcm logs for more information to debug the garbage collector further")), please see:

oc describe pod/kube-controller-manager-ip-10-0-159-19.us-east-2.compute.internal
Name:                 kube-controller-manager-ip-10-0-159-19.us-east-2.compute.internal
Namespace:            openshift-kube-controller-manager
..
Containers:
  kube-controller-manager:
    Container ID:  cri-o://2a95b5fdd71b21878389d1ea16169b753c120d7bded5e8a7d303e504d8db763a
...
      exec hyperkube kube-controller-manager --openshift-config=/etc/kubernetes/static-pod-resources/configmaps/config/config.yaml \
        --kubeconfig=/etc/kubernetes/static-pod-resources/configmaps/controller-manager-kubeconfig/kubeconfig \
        --authentication-kubeconfig=/etc/kubernetes/static-pod-resources/configmaps/controller-manager-kubeconfig/kubeconfig \
        --authorization-kubeconfig=/etc/kubernetes/static-pod-resources/configmaps/controller-manager-kubeconfig/kubeconfig \
        --client-ca-file=/etc/kubernetes/static-pod-certs/configmaps/client-ca/ca-bundle.crt \
        --requestheader-client-ca-file=/etc/kubernetes/static-pod-certs/configmaps/aggregator-client-ca/ca-bundle.crt -v=6 
...

oc logs -f po/kube-controller-manager-ip-10-0-159-19.us-east-2.compute.internal |grep "Monitoring"
 
oc logs -f po/kube-controller-manager-ip-10-0-159-19.us-east-2.compute.internal |grep "gcwatcher_controller.go"

Comment 5 Filip Krepinsky 2022-08-18 14:06:46 UTC
I think you may have found an issue and we need a fix for it. But can you please check these 2 things first?

1. please check that clusteroperators.config.openshift.io monitoring is still deleted and was not recreated
2. you should still see the logs mentioned, but you have to increase the log level and check the logs of the kube-controller-manager-operator pod (not KCM)

But the operator should not go degraded, so we have to fix that.
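The two checks above, as commands (the namespace is the usual default for this operator; the `deploy/` reference is an assumption, so adjust it if you address the pod directly):

```shell
# 1. Confirm the monitoring ClusterOperator is still gone (expect NotFound).
oc get clusteroperators.config.openshift.io monitoring

# 2. Grep the *operator* pod logs, not the KCM static pod, after raising
#    the log level to Trace.
oc logs -n openshift-kube-controller-manager-operator \
  deploy/kube-controller-manager-operator \
  | grep "Monitoring is disabled in the cluster"
```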

Comment 6 Filip Krepinsky 2022-08-24 14:12:39 UTC
The fix has merged; additionally, the GarbageCollectorDegraded condition should now show "MonitoringDisabled" as the reason.
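One way to confirm the new reason on a fixed cluster (the `-B4` window size is a guess based on the usual condition field ordering; `-o jsonpath` works as well):

```shell
# The GarbageCollectorDegraded condition should now carry the reason
# MonitoringDisabled instead of an alert-query error.
oc get co kube-controller-manager -o yaml \
  | grep -B4 'type: GarbageCollectorDegraded'
```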

Comment 8 zhou ying 2022-09-01 07:04:02 UTC
Tested with the latest payload; the issue has been fixed:

After disabling the monitoring stack:
oc get co kube-controller-manager
NAME                      VERSION                                    AVAILABLE   PROGRESSING   DEGRADED   SINCE   MESSAGE
kube-controller-manager   4.12.0-0.nightly-arm64-2022-08-31-125219   True        False         False      82m
oc logs -f po/kube-controller-manager-operator-78dddd9d74-t87kd  |grep "Monitoring"
I0901 06:59:58.690554       1 gcwatcher_controller.go:129] Monitoring is disabled in the cluster and a diagnostic of the garbage collector is not working. Please look at the kcm logs for more information to debug the garbage collector further

[root@localhost home]# oc get clusterversion 
NAME      VERSION                                    AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.12.0-0.nightly-arm64-2022-08-31-125219   True        False         7m43s   Error while reconciling 4.12.0-0.nightly-arm64-2022-08-31-125219: some resources could not be updated
[root@localhost home]# oc get node
NAME                                         STATUS   ROLES                         AGE   VERSION
ip-10-0-158-144.us-east-2.compute.internal   Ready    control-plane,master,worker   92m   v1.24.0+a097e26

Comment 11 errata-xmlrpc 2023-01-17 19:54:55 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.12.0 bug fix and security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:7399

