Bug 1986375 - Avoid CMO being degraded when some nodes aren't available
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Monitoring
Version: 4.9
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: medium
Target Milestone: ---
Target Release: 4.9.0
Assignee: Prashant Balachandran
QA Contact: Junqi Zhao
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2021-07-27 12:04 UTC by Prashant Balachandran
Modified: 2022-11-04 09:07 UTC
CC List: 6 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2021-10-18 17:41:54 UTC
Target Upstream Version:
Embargoed:




Links
Github openshift cluster-monitoring-operator pull 1279 (open): Bug 1986375: adding check for node exporter daemon set (last updated 2021-07-27 15:56:31 UTC)
Red Hat Product Errata RHSA-2021:3759 (last updated 2021-10-18 17:41:55 UTC)

Description Prashant Balachandran 2021-07-27 12:04:02 UTC
Description of problem:
node_exporter pods that cannot run on offline/unavailable nodes are one of the top reasons why CMO goes Degraded. It would make sense for CMO to correlate the number of running node_exporter pods with the status of the nodes and not go Degraded as long as node_exporter pods are running on all nodes that are Ready. For example, if the cluster has N nodes, one of them is NotReady, and (N-1) node_exporter pods are running, then CMO should report Available rather than Degraded.
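
For illustration, the intended correlation can be checked from the command line. The snippet below is only a sketch, assuming oc and jq are available on the client; it is not code from CMO.

**************
# count nodes whose Ready condition is True
ready_nodes=$(oc get nodes -o json \
  | jq '[.items[] | select(any(.status.conditions[]; .type == "Ready" and .status == "True"))] | length')

# ready node_exporter pods, taken from the daemonset status
ready_exporters=$(oc -n openshift-monitoring get daemonset node-exporter \
  -o jsonpath='{.status.numberReady}')

# with the proposed behavior, CMO should stay Available when these two numbers match,
# even if numberReady is below desiredNumberScheduled because some nodes are NotReady
echo "Ready nodes: ${ready_nodes}, ready node-exporter pods: ${ready_exporters}"
**************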

Version-Release number of selected component (if applicable):


How reproducible:
Always when nodes are offline.

Steps to Reproduce:
1.
2.
3.

Actual results:


Expected results:


Additional info:

Comment 2 Junqi Zhao 2021-07-29 08:19:43 UTC
Tested with 4.9.0-0.nightly-2021-07-28-181504; monitoring is not reported as DEGRADED due to offline/unavailable nodes. Steps below.
Set one node to SchedulingDisabled so that other pods are not affected:
# oc adm cordon ip-10-0-217-156.us-east-2.compute.internal
# oc get node ip-10-0-217-156.us-east-2.compute.internal
NAME                                         STATUS                     ROLES    AGE     VERSION
ip-10-0-217-156.us-east-2.compute.internal   Ready,SchedulingDisabled   worker   4h45m   v1.21.1+8268f88

Scale down cluster-version-operator/cluster-monitoring-operator and delete the node-exporter daemonset:
# oc -n openshift-cluster-version scale deploy cluster-version-operator --replicas=0
# oc -n openshift-monitoring scale deploy cluster-monitoring-operator --replicas=0
# oc -n openshift-monitoring delete daemonset node-exporter

Make sure the other pods are normal:
# oc -n openshift-monitoring get pod -o wide
NAME                                       READY   STATUS    RESTARTS   AGE     IP             NODE                                         NOMINATED NODE   READINESS GATES
alertmanager-main-0                        5/5     Running   0          5h8m    10.129.2.8     ip-10-0-134-137.us-east-2.compute.internal   <none>           <none>
alertmanager-main-1                        5/5     Running   0          25m     10.129.2.16    ip-10-0-134-137.us-east-2.compute.internal   <none>           <none>
alertmanager-main-2                        5/5     Running   0          5h8m    10.131.0.17    ip-10-0-176-74.us-east-2.compute.internal    <none>           <none>
grafana-6c679c5748-vct2g                   2/2     Running   0          5h8m    10.129.2.9     ip-10-0-134-137.us-east-2.compute.internal   <none>           <none>
kube-state-metrics-59f44f65fb-qgghv        3/3     Running   0          5h12m   10.131.0.15    ip-10-0-176-74.us-east-2.compute.internal    <none>           <none>
openshift-state-metrics-78c5465bcd-bkndb   3/3     Running   0          5h12m   10.131.0.7     ip-10-0-176-74.us-east-2.compute.internal    <none>           <none>
prometheus-adapter-7d6b95dd6-cbv7h         1/1     Running   0          58m     10.131.0.122   ip-10-0-176-74.us-east-2.compute.internal    <none>           <none>
prometheus-adapter-7d6b95dd6-zgflb         1/1     Running   0          5h8m    10.129.2.7     ip-10-0-134-137.us-east-2.compute.internal   <none>           <none>
prometheus-k8s-0                           7/7     Running   0          26m     10.131.0.137   ip-10-0-176-74.us-east-2.compute.internal    <none>           <none>
prometheus-k8s-1                           7/7     Running   0          5h7m    10.129.2.11    ip-10-0-134-137.us-east-2.compute.internal   <none>           <none>
prometheus-operator-cd5899dbc-trcpx        2/2     Running   1          5h14m   10.128.0.40    ip-10-0-176-171.us-east-2.compute.internal   <none>           <none>
telemeter-client-567dc564fd-pvpcp          3/3     Running   0          5h12m   10.131.0.16    ip-10-0-176-74.us-east-2.compute.internal    <none>           <none>
thanos-querier-865d44b845-58cnf            5/5     Running   0          4h11m   10.129.2.13    ip-10-0-134-137.us-east-2.compute.internal   <none>           <none>
thanos-querier-865d44b845-hrxvb            5/5     Running   0          4h11m   10.131.0.40    ip-10-0-176-74.us-east-2.compute.internal    <none>           <none>

Use the following script to stop kubelet, sleep for 20 minutes, then start it again:
# oc debug node/ip-10-0-217-156.us-east-2.compute.internal
sh-4.4# chroot /host
sh-4.4# chmod +x /tmp/run.sh
sh-4.4# /tmp/run.sh &

cat /tmp/run.sh
**************
systemctl stop kubelet
sleep 20m
systemctl start kubelet
**************
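
The comment does not show how /tmp/run.sh was placed on the node; one way (a sketch, done from the same chroot shell before running it) is a heredoc:

**************
cat > /tmp/run.sh <<'EOF'
systemctl stop kubelet
sleep 20m
systemctl start kubelet
EOF
**************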

Scale up cluster-version-operator/cluster-monitoring-operator:
# oc -n openshift-cluster-version scale deploy cluster-version-operator --replicas=1
# oc -n openshift-monitoring scale deploy cluster-monitoring-operator --replicas=1

# oc get node ip-10-0-217-156.us-east-2.compute.internal
NAME                                         STATUS                        ROLES    AGE     VERSION
ip-10-0-217-156.us-east-2.compute.internal   NotReady,SchedulingDisabled   worker   5h12m   v1.21.1+8268f88

Make sure the only abnormal pod is the node-exporter pod scheduled on the NotReady node:
# oc -n openshift-monitoring get pod -o wide | grep -Ev "Running|Completed"
NAME                                           READY   STATUS    RESTARTS   AGE     IP             NODE                                         NOMINATED NODE   READINESS GATES
node-exporter-l884h                            0/2     Pending   0          2m58s   <none>         ip-10-0-217-156.us-east-2.compute.internal   <none>           <none>

Watch for a while; monitoring is not reported as DEGRADED:
# node="ip-10-0-217-156.us-east-2.compute.internal"; while true; do oc get node ${node}; oc -n openshift-monitoring get pod -o wide | grep node-exporter | grep ${node}; oc get co monitoring; oc -n openshift-monitoring get ds;sleep 20s; done
...
NAME                                         STATUS                        ROLES    AGE     VERSION
ip-10-0-217-156.us-east-2.compute.internal   NotReady,SchedulingDisabled   worker   5h19m   v1.21.1+8268f88
node-exporter-l884h                            0/2     Pending   0          7m32s   <none>         ip-10-0-217-156.us-east-2.compute.internal   <none>           <none>
NAME         VERSION                             AVAILABLE   PROGRESSING   DEGRADED   SINCE   MESSAGE
monitoring   4.9.0-0.nightly-2021-07-28-181504   True        False         False      35m     
NAME            DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE SELECTOR            AGE
node-exporter   6         6         5       6            5           kubernetes.io/os=linux   7m42s
...
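
As an extra spot check (a convenience command, not one of the steps above), the Degraded condition can also be read directly from the ClusterOperator; while the node is NotReady it should remain False:
# oc get co monitoring -o jsonpath='{.status.conditions[?(@.type=="Degraded")].status}{"\n"}'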

Comment 9 errata-xmlrpc 2021-10-18 17:41:54 UTC
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.9.0 bug fix and security update), and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:3759

Comment 10 Jose Gato 2022-11-04 09:07:04 UTC
We have found the same issue on 4.8.44. Was this backported to 4.8?
Thanks.

