Description of problem:
node_exporter pods that cannot run on offline/unavailable nodes are one of the top reasons why CMO goes Degraded. It would make sense for CMO to correlate the number of running node_exporter pods with the status of the nodes, and not go Degraded when node_exporter pods are running on all nodes that are Ready. As an example, if the cluster has N nodes with one node NotReady and (N-1) node_exporter pods running, CMO should report Available rather than Degraded.

Version-Release number of selected component (if applicable):

How reproducible:
Always when nodes are offline.

Steps to Reproduce:
1.
2.
3.

Actual results:

Expected results:

Additional info:
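The proposed correlation can be sketched as a tiny shell function. This is only an illustration of the decision rule, not CMO's actual implementation; the function name is made up, and the commented-out `oc` lines show one assumed way the counts could be obtained on a live cluster.

```shell
#!/bin/sh
# Sketch: report Available as long as a node_exporter pod is running on
# every Ready node, even if some nodes are NotReady.
# On a real cluster the counts might come from something like:
#   ready_nodes=$(oc get nodes --no-headers | grep -cw Ready)
#   running_exporters=$(oc -n openshift-monitoring get pod --no-headers | grep node-exporter | grep -cw Running)
cmo_node_exporter_status() {
  ready_nodes=$1
  running_exporters=$2
  if [ "$running_exporters" -ge "$ready_nodes" ]; then
    echo Available
  else
    echo Degraded
  fi
}

# N=6 nodes, one NotReady -> 5 Ready nodes, 5 exporters running:
cmo_node_exporter_status 5 5   # -> Available
# 6 Ready nodes but only 5 exporters running:
cmo_node_exporter_status 6 5   # -> Degraded
```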
Tested with 4.9.0-0.nightly-2021-07-28-181504; monitoring is no longer reported as DEGRADED due to offline/unavailable nodes. Steps below.

1. Set one node to SchedulingDisabled so that other pods are not affected:
# oc adm cordon ip-10-0-217-156.us-east-2.compute.internal
# oc get node ip-10-0-217-156.us-east-2.compute.internal
NAME                                         STATUS                     ROLES    AGE     VERSION
ip-10-0-217-156.us-east-2.compute.internal   Ready,SchedulingDisabled   worker   4h45m   v1.21.1+8268f88

2. Scale down cluster-version-operator/cluster-monitoring-operator and remove the daemonset:
# oc -n openshift-cluster-version scale deploy cluster-version-operator --replicas=0
# oc -n openshift-monitoring scale deploy cluster-monitoring-operator --replicas=0
# oc -n openshift-monitoring delete daemonset node-exporter

3. Make sure the other pods are normal:
# oc -n openshift-monitoring get pod -o wide
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
alertmanager-main-0 5/5 Running 0 5h8m 10.129.2.8 ip-10-0-134-137.us-east-2.compute.internal <none> <none>
alertmanager-main-1 5/5 Running 0 25m 10.129.2.16 ip-10-0-134-137.us-east-2.compute.internal <none> <none>
alertmanager-main-2 5/5 Running 0 5h8m 10.131.0.17 ip-10-0-176-74.us-east-2.compute.internal <none> <none>
grafana-6c679c5748-vct2g 2/2 Running 0 5h8m 10.129.2.9 ip-10-0-134-137.us-east-2.compute.internal <none> <none>
kube-state-metrics-59f44f65fb-qgghv 3/3 Running 0 5h12m 10.131.0.15 ip-10-0-176-74.us-east-2.compute.internal <none> <none>
openshift-state-metrics-78c5465bcd-bkndb 3/3 Running 0 5h12m 10.131.0.7 ip-10-0-176-74.us-east-2.compute.internal <none> <none>
prometheus-adapter-7d6b95dd6-cbv7h 1/1 Running 0 58m 10.131.0.122 ip-10-0-176-74.us-east-2.compute.internal <none> <none>
prometheus-adapter-7d6b95dd6-zgflb 1/1 Running 0 5h8m 10.129.2.7 ip-10-0-134-137.us-east-2.compute.internal <none> <none>
prometheus-k8s-0 7/7 Running 0 26m 10.131.0.137 ip-10-0-176-74.us-east-2.compute.internal <none> <none>
prometheus-k8s-1 7/7 Running 0 5h7m 10.129.2.11 ip-10-0-134-137.us-east-2.compute.internal <none> <none>
prometheus-operator-cd5899dbc-trcpx 2/2 Running 1 5h14m 10.128.0.40 ip-10-0-176-171.us-east-2.compute.internal <none> <none>
telemeter-client-567dc564fd-pvpcp 3/3 Running 0 5h12m 10.131.0.16 ip-10-0-176-74.us-east-2.compute.internal <none> <none>
thanos-querier-865d44b845-58cnf 5/5 Running 0 4h11m 10.129.2.13 ip-10-0-134-137.us-east-2.compute.internal <none> <none>
thanos-querier-865d44b845-hrxvb 5/5 Running 0 4h11m 10.131.0.40 ip-10-0-176-74.us-east-2.compute.internal <none> <none>

4. Use the following script to stop kubelet, sleep for 20m, then start it again:
# oc debug node/ip-10-0-217-156.us-east-2.compute.internal
sh-4.4# chroot /host
sh-4.4# chmod +x /tmp/run.sh
sh-4.4# /tmp/run.sh &

cat /tmp/run.sh
**************
systemctl stop kubelet
sleep 20m
systemctl start kubelet
**************

5. Scale up cluster-version-operator/cluster-monitoring-operator:
# oc -n openshift-cluster-version scale deploy cluster-version-operator --replicas=1
# oc -n openshift-monitoring scale deploy cluster-monitoring-operator --replicas=1
# oc get node ip-10-0-217-156.us-east-2.compute.internal
NAME                                         STATUS                        ROLES    AGE     VERSION
ip-10-0-217-156.us-east-2.compute.internal   NotReady,SchedulingDisabled   worker   5h12m   v1.21.1+8268f88

6. Make sure the only abnormal pod is the node-exporter pod scheduled on the NotReady node:
# oc -n openshift-monitoring get pod -o wide | grep -Ev "Running|Completed"
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
node-exporter-l884h 0/2 Pending 0 2m58s <none> ip-10-0-217-156.us-east-2.compute.internal <none> <none>

7. Watch for a while; monitoring is not reported as DEGRADED:
# node="ip-10-0-217-156.us-east-2.compute.internal"; while true; do oc get node ${node}; oc -n openshift-monitoring get pod -o wide | grep node-exporter | grep ${node}; oc get co monitoring; oc -n openshift-monitoring get ds; sleep 20s; done
...
NAME                                         STATUS                        ROLES    AGE     VERSION
ip-10-0-217-156.us-east-2.compute.internal   NotReady,SchedulingDisabled   worker   5h19m   v1.21.1+8268f88
node-exporter-l884h 0/2 Pending 0 7m32s <none> ip-10-0-217-156.us-east-2.compute.internal <none> <none>
NAME         VERSION                             AVAILABLE   PROGRESSING   DEGRADED   SINCE   MESSAGE
monitoring   4.9.0-0.nightly-2021-07-28-181504   True        False         False      35m
NAME            DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE SELECTOR            AGE
node-exporter   6         6         5       6            5           kubernetes.io/os=linux   7m42s
...
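The daemonset numbers in the last listing (DESIRED 6, READY 5) line up with the single NotReady node, which is exactly why CMO stays Available. A minimal sketch of that sanity check, with the counts hard-coded from the output above; on a live cluster they could instead be pulled via `oc ... -o jsonpath` from the standard DaemonSet status fields (`desiredNumberScheduled`, `numberReady`):

```shell
#!/bin/sh
# Counts taken from the daemonset/node output above; in practice, e.g.:
#   desired=$(oc -n openshift-monitoring get ds node-exporter -o jsonpath='{.status.desiredNumberScheduled}')
#   ready=$(oc -n openshift-monitoring get ds node-exporter -o jsonpath='{.status.numberReady}')
desired=6          # DESIRED from the ds listing
ready=5            # READY from the ds listing
notready_nodes=1   # the cordoned NotReady node

# With the fix, exporters are missing only on NotReady nodes, so CMO stays Available:
if [ $((desired - ready)) -le "$notready_nodes" ]; then
  echo Available
else
  echo Degraded
fi
```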
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.9.0 bug fix and security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2021:3759
We have found the same issue on 4.8.44. Was this fix backported to 4.8? Thanks.