Description of problem:
node_exporter pods that cannot run on offline/unavailable nodes are one of the top reasons why CMO goes Degraded. It would make sense for CMO to correlate the number of running node_exporter pods with the status of the nodes, and not go Degraded when node_exporter pods are running on all nodes that are Ready. As an example, if the cluster has N nodes with one node NotReady and (N-1) node_exporter pods running, CMO should report Available rather than Degraded.

Version-Release number of selected component (if applicable):

How reproducible:
Always when nodes are offline.

Steps to Reproduce:
1.
2.
3.

Actual results:

Expected results:

Additional info:
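The proposed correlation can be sketched as a tiny shell function. This is only an illustration of the decision rule, not CMO's actual implementation; the function name is made up, and the commented-out `oc` lines show one assumed way the counts could be obtained on a live cluster.

```shell
#!/bin/sh
# Sketch: report Available as long as a node_exporter pod is running on
# every Ready node, even if some nodes are NotReady.
# On a real cluster the counts might come from something like:
#   ready_nodes=$(oc get nodes --no-headers | grep -cw Ready)
#   running_exporters=$(oc -n openshift-monitoring get pod --no-headers | grep node-exporter | grep -cw Running)
cmo_node_exporter_status() {
  ready_nodes=$1
  running_exporters=$2
  if [ "$running_exporters" -ge "$ready_nodes" ]; then
    echo Available
  else
    echo Degraded
  fi
}

# N=6 nodes, one NotReady -> 5 Ready nodes, 5 exporters running:
cmo_node_exporter_status 5 5   # -> Available
# 6 Ready nodes but only 5 exporters running:
cmo_node_exporter_status 6 5   # -> Degraded
```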
Tested with 4.9.0-0.nightly-2021-07-28-181504; monitoring is no longer reported as DEGRADED due to offline/unavailable nodes. Steps below.

1. Set one node to SchedulingDisabled so that other pods are not affected:
# oc adm cordon ip-10-0-217-156.us-east-2.compute.internal
# oc get node ip-10-0-217-156.us-east-2.compute.internal
NAME                                         STATUS                     ROLES    AGE     VERSION
ip-10-0-217-156.us-east-2.compute.internal   Ready,SchedulingDisabled   worker   4h45m   v1.21.1+8268f88

2. Scale down cluster-version-operator/cluster-monitoring-operator and remove the daemonset:
# oc -n openshift-cluster-version scale deploy cluster-version-operator --replicas=0
# oc -n openshift-monitoring scale deploy cluster-monitoring-operator --replicas=0
# oc -n openshift-monitoring delete daemonset node-exporter

3. Make sure the other pods are normal:
# oc -n openshift-monitoring get pod -o wide
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
alertmanager-main-0 5/5 Running 0 5h8m 10.129.2.8 ip-10-0-134-137.us-east-2.compute.internal <none> <none>
alertmanager-main-1 5/5 Running 0 25m 10.129.2.16 ip-10-0-134-137.us-east-2.compute.internal <none> <none>
alertmanager-main-2 5/5 Running 0 5h8m 10.131.0.17 ip-10-0-176-74.us-east-2.compute.internal <none> <none>
grafana-6c679c5748-vct2g 2/2 Running 0 5h8m 10.129.2.9 ip-10-0-134-137.us-east-2.compute.internal <none> <none>
kube-state-metrics-59f44f65fb-qgghv 3/3 Running 0 5h12m 10.131.0.15 ip-10-0-176-74.us-east-2.compute.internal <none> <none>
openshift-state-metrics-78c5465bcd-bkndb 3/3 Running 0 5h12m 10.131.0.7 ip-10-0-176-74.us-east-2.compute.internal <none> <none>
prometheus-adapter-7d6b95dd6-cbv7h 1/1 Running 0 58m 10.131.0.122 ip-10-0-176-74.us-east-2.compute.internal <none> <none>
prometheus-adapter-7d6b95dd6-zgflb 1/1 Running 0 5h8m 10.129.2.7 ip-10-0-134-137.us-east-2.compute.internal <none> <none>
prometheus-k8s-0 7/7 Running 0 26m 10.131.0.137 ip-10-0-176-74.us-east-2.compute.internal <none> <none>
prometheus-k8s-1 7/7 Running 0 5h7m 10.129.2.11 ip-10-0-134-137.us-east-2.compute.internal <none> <none>
prometheus-operator-cd5899dbc-trcpx 2/2 Running 1 5h14m 10.128.0.40 ip-10-0-176-171.us-east-2.compute.internal <none> <none>
telemeter-client-567dc564fd-pvpcp 3/3 Running 0 5h12m 10.131.0.16 ip-10-0-176-74.us-east-2.compute.internal <none> <none>
thanos-querier-865d44b845-58cnf 5/5 Running 0 4h11m 10.129.2.13 ip-10-0-134-137.us-east-2.compute.internal <none> <none>
thanos-querier-865d44b845-hrxvb 5/5 Running 0 4h11m 10.131.0.40 ip-10-0-176-74.us-east-2.compute.internal <none> <none>

4. Use the following script to stop kubelet, sleep for 20m, then start it again:
# oc debug node/ip-10-0-217-156.us-east-2.compute.internal
sh-4.4# chroot /host
sh-4.4# chmod +x /tmp/run.sh
sh-4.4# /tmp/run.sh &

cat /tmp/run.sh
**************
systemctl stop kubelet
sleep 20m
systemctl start kubelet
**************

5. Scale up cluster-version-operator/cluster-monitoring-operator:
# oc -n openshift-cluster-version scale deploy cluster-version-operator --replicas=1
# oc -n openshift-monitoring scale deploy cluster-monitoring-operator --replicas=1
# oc get node ip-10-0-217-156.us-east-2.compute.internal
NAME                                         STATUS                        ROLES    AGE     VERSION
ip-10-0-217-156.us-east-2.compute.internal   NotReady,SchedulingDisabled   worker   5h12m   v1.21.1+8268f88

6. Make sure the only abnormal pod is the node-exporter pod scheduled on the NotReady node:
# oc -n openshift-monitoring get pod -o wide | grep -Ev "Running|Completed"
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
node-exporter-l884h 0/2 Pending 0 2m58s <none> ip-10-0-217-156.us-east-2.compute.internal <none> <none>

7. Watch for a while; monitoring is not reported as DEGRADED:
# node="ip-10-0-217-156.us-east-2.compute.internal"; while true; do oc get node ${node}; oc -n openshift-monitoring get pod -o wide | grep node-exporter | grep ${node}; oc get co monitoring; oc -n openshift-monitoring get ds; sleep 20s; done
...
NAME                                         STATUS                        ROLES    AGE     VERSION
ip-10-0-217-156.us-east-2.compute.internal   NotReady,SchedulingDisabled   worker   5h19m   v1.21.1+8268f88
node-exporter-l884h 0/2 Pending 0 7m32s <none> ip-10-0-217-156.us-east-2.compute.internal <none> <none>
NAME         VERSION                             AVAILABLE   PROGRESSING   DEGRADED   SINCE   MESSAGE
monitoring   4.9.0-0.nightly-2021-07-28-181504   True        False         False      35m
NAME            DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE SELECTOR            AGE
node-exporter   6         6         5       6            5           kubernetes.io/os=linux   7m42s
...
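The daemonset numbers in the last listing (DESIRED 6, READY 5) line up with the single NotReady node, which is exactly why CMO stays Available. A minimal sketch of that sanity check, with the counts hard-coded from the output above; on a live cluster they could instead be pulled via `oc ... -o jsonpath` from the standard DaemonSet status fields (`desiredNumberScheduled`, `numberReady`):

```shell
#!/bin/sh
# Counts taken from the daemonset/node output above; in practice, e.g.:
#   desired=$(oc -n openshift-monitoring get ds node-exporter -o jsonpath='{.status.desiredNumberScheduled}')
#   ready=$(oc -n openshift-monitoring get ds node-exporter -o jsonpath='{.status.numberReady}')
desired=6          # DESIRED from the ds listing
ready=5            # READY from the ds listing
notready_nodes=1   # the cordoned NotReady node

# With the fix, exporters are missing only on NotReady nodes, so CMO stays Available:
if [ $((desired - ready)) -le "$notready_nodes" ]; then
  echo Available
else
  echo Degraded
fi
```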
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.9.0 bug fix and security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2021:3759
We have found the same issue on 4.8.44. Was this fix backported to 4.8? Thanks.