Bug 2004051 - CMO can report as being Degraded while node-exporter is deployed on all nodes
Summary: CMO can report as being Degraded while node-exporter is deployed on all nodes
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Monitoring
Version: 4.9
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: medium
Target Milestone: ---
Target Release: 4.10.0
Assignee: Prashant Balachandran
QA Contact: Junqi Zhao
Docs Contact: Brian Burt
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2021-09-14 12:31 UTC by Simon Pasquier
Modified: 2022-03-10 16:10 UTC
CC List: 5 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Previously, if the number of daemon set pods for the `node-exporter` agent was not equal to the number of nodes in the cluster, the Cluster Monitoring Operator (CMO) reported a `Degraded` condition. This occurred whenever one of the nodes was not in the `Ready` condition. With this release, the CMO instead verifies that the number of daemon set pods for the `node-exporter` agent is not less than the number of `Ready` nodes in the cluster, which ensures that a `node-exporter` pod is running on every active node. As a result, the CMO no longer reports a `Degraded` condition when one of the nodes is not in a `Ready` state.
Clone Of:
Environment:
Last Closed: 2022-03-10 16:10:15 UTC
Target Upstream Version:
Embargoed:




Links
- GitHub openshift/cluster-monitoring-operator pull 1385 (open): Bug 2004051: changing the condition for error in daemon set creation (last updated 2021-09-16 06:11:26 UTC)
- Red Hat Product Errata RHSA-2022:0056 (last updated 2022-03-10 16:10:40 UTC)

Description Simon Pasquier 2021-09-14 12:31:37 UTC
Description of problem:

https://prow.ci.openshift.org/view/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.10-upgrade-from-stable-4.9-e2e-gcp-upgrade/1436412625549266945 is a test run that failed because CMO was degraded with the following message:

Failed to rollout the stack. Error: updating node-exporter: reconciling node-exporter DaemonSet failed: updating DaemonSet object failed: waiting for DaemonSetRollout of openshift-monitoring/node-exporter: expected 5 ready pods for "node-exporter" daemonset, got 6 

In this example, 1 node was reported as NotReady while the node-exporter daemonset status stated that all 6 nodes were running the daemon pod.

Version-Release number of selected component (if applicable):
4.9

How reproducible:
Sometimes

Steps to Reproduce:
1. Stop the kubelet service on a node and kick off CMO reconciliation

Actual results:
CMO reports degraded=true.

Expected results:
CMO reports degraded=false.


Additional info:
https://github.com/openshift/cluster-monitoring-operator/blob/10c16ae6ead9da2b4c0f68ca8567f4e0ee08a6c4/pkg/client/client.go#L990-L1005
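
For illustration, here is a minimal Go sketch of the kind of strict equality check that produces the error message above. This is not the actual CMO source and the helper name `strictRollout` is hypothetical: with one NotReady node, numberAvailable stays at 5 while desiredNumberScheduled is 6, so a strict comparison fails even though a pod is scheduled on every node.

// Hypothetical sketch, not the actual CMO code: a rollout check that
// demands strict equality between available pods and desired pods, so a
// single NotReady node is enough to report an error.
package main

import (
    "fmt"

    appsv1 "k8s.io/api/apps/v1"
)

func strictRollout(s appsv1.DaemonSetStatus, name string) error {
    if s.NumberAvailable != s.DesiredNumberScheduled {
        return fmt.Errorf("expected %d ready pods for %q daemonset, got %d",
            s.NumberAvailable, name, s.DesiredNumberScheduled)
    }
    return nil
}

func main() {
    // Numbers from the failed run above: 6 nodes, 1 of them NotReady.
    s := appsv1.DaemonSetStatus{DesiredNumberScheduled: 6, NumberAvailable: 5}
    fmt.Println(strictRollout(s, "node-exporter"))
    // Output: expected 5 ready pods for "node-exporter" daemonset, got 6
}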

Comment 4 Junqi Zhao 2021-09-27 10:10:22 UTC
Checked with 4.10.0-0.nightly-2021-09-26-233013: stopped the kubelet service on one node and watched for 20 minutes; with the fix, CMO does not report degraded=true.
# oc get node | grep NotReady
ip-10-0-140-4.us-east-2.compute.internal     NotReady   worker   6h25m   v1.22.0-rc.0+af080cb

# oc -n openshift-monitoring get ds node-exporter
NAME            DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE SELECTOR            AGE
node-exporter   6         6         5       6            5           kubernetes.io/os=linux   115m

# oc -n openshift-monitoring get ds node-exporter -oyaml
...
status:
  currentNumberScheduled: 6
  desiredNumberScheduled: 6
  numberAvailable: 5
  numberMisscheduled: 0
  numberReady: 5
  numberUnavailable: 1
  observedGeneration: 1
  updatedNumberScheduled: 6
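
For comparison, a minimal sketch of the relaxed condition described in the Doc Text (assumed names, not the actual fix): require at least one available pod per Ready node instead of strict equality. Applied to the status above, numberAvailable (5) is not less than the count of Ready nodes (5), so the check passes and CMO stays non-degraded.

// Hypothetical sketch of the relaxed check from the Doc Text, not the
// actual CMO code: accept the rollout as long as the DaemonSet covers
// every Ready node.
package main

import (
    "fmt"

    appsv1 "k8s.io/api/apps/v1"
)

// relaxedRollout passes when available pods cover every Ready node;
// readyNodes must be counted separately, e.g. from the node list.
func relaxedRollout(s appsv1.DaemonSetStatus, readyNodes int32) error {
    if s.NumberAvailable < readyNodes {
        return fmt.Errorf("expected at least %d available pods, got %d",
            readyNodes, s.NumberAvailable)
    }
    return nil
}

func main() {
    // Status shown above: 6 scheduled, 5 available, 1 node NotReady.
    s := appsv1.DaemonSetStatus{DesiredNumberScheduled: 6, NumberAvailable: 5}
    fmt.Println(relaxedRollout(s, 5)) // <nil>: no Degraded condition
}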

# oc get co monitoring -oyaml
...
  - lastTransitionTime: "2021-09-27T07:50:14Z"
    message: 'Prometheus is running without persistent storage which can lead to data
      loss during upgrades and cluster disruptions. Please refer to the official documentation
      to see how to configure storage for Prometheus: https://docs.openshift.com/container-platform/4.8/monitoring/configuring-the-monitoring-stack.html'
    reason: PrometheusDataPersistenceNotConfigured
    status: "False"
    type: Degraded

Comment 8 errata-xmlrpc 2022-03-10 16:10:15 UTC
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.10.3 security update), and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:0056

