Bug 2004051

Summary: CMO can report as being Degraded while node-exporter is deployed on all nodes
Product: OpenShift Container Platform
Reporter: Simon Pasquier <spasquie>
Component: Monitoring
Assignee: Prashant Balachandran <pnair>
Status: CLOSED ERRATA
QA Contact: Junqi Zhao <juzhao>
Severity: medium
Docs Contact: Brian Burt <bburt>
Priority: unspecified
Version: 4.9
CC: amuller, anpicker, aos-bugs, bburt, erooth
Target Milestone: ---
Target Release: 4.10.0
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Previously, if the number of daemon set pods for the `node-exporter` agent was not equal to the number of nodes in the cluster, the Cluster Monitoring Operator (CMO) would report a `Degraded` condition. This occurred when one of the nodes was not in the `Ready` state. With this release, CMO verifies that the number of daemon set pods for the `node-exporter` agent is not less than the number of ready nodes in the cluster, which ensures that a `node-exporter` pod is running on every active node. As a result, CMO no longer reports a `Degraded` condition when one of the nodes is not ready.
Story Points: ---
Last Closed: 2022-03-10 16:10:15 UTC
Type: Bug
Regression: ---

Description Simon Pasquier 2021-09-14 12:31:37 UTC
Description of problem:

https://prow.ci.openshift.org/view/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.10-upgrade-from-stable-4.9-e2e-gcp-upgrade/1436412625549266945 is a test run that failed because CMO was degraded with the following message:

Failed to rollout the stack. Error: updating node-exporter: reconciling node-exporter DaemonSet failed: updating DaemonSet object failed: waiting for DaemonSetRollout of openshift-monitoring/node-exporter: expected 5 ready pods for "node-exporter" daemonset, got 6 

In this example, 1 node was reported as NotReady while the node-exporter daemonset status stated that all 6 nodes were running the daemon pod.

Version-Release number of selected component (if applicable):
4.9

How reproducible:
Sometimes

Steps to Reproduce:
1. Stop the kubelet service on a node and kick off CMO reconciliation

Actual results:
CMO reports degraded=true.

Expected results:
CMO reports degraded=false.


Additional info:
https://github.com/openshift/cluster-monitoring-operator/blob/10c16ae6ead9da2b4c0f68ca8567f4e0ee08a6c4/pkg/client/client.go#L990-L1005
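
For reference, below is a minimal Go sketch of the adjusted rollout check described in the Doc Text. It is not the actual CMO implementation (the package, function, and error wording are illustrative only): instead of requiring the daemonset pod count to exactly match the node count, the check only requires that at least as many node-exporter pods are available as there are Ready linux nodes.

package example

import (
	"context"
	"fmt"

	appsv1 "k8s.io/api/apps/v1"
	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// nodeExporterRolledOut is a hypothetical helper: it returns an error in the
// same spirit as the message quoted above when fewer node-exporter pods are
// available than there are Ready linux nodes.
func nodeExporterRolledOut(ctx context.Context, c kubernetes.Interface, ds *appsv1.DaemonSet) error {
	nodes, err := c.CoreV1().Nodes().List(ctx, metav1.ListOptions{
		LabelSelector: "kubernetes.io/os=linux",
	})
	if err != nil {
		return err
	}

	// Count nodes that report the Ready condition.
	readyNodes := 0
	for _, n := range nodes.Items {
		for _, cond := range n.Status.Conditions {
			if cond.Type == corev1.NodeReady && cond.Status == corev1.ConditionTrue {
				readyNodes++
				break
			}
		}
	}

	// Old behaviour (simplified): strict equality against the total node
	// count, which fails while a node is NotReady but the daemonset status
	// still reports its pod as ready.
	// New behaviour: tolerate the NotReady node as long as every Ready node
	// has an available node-exporter pod.
	if int(ds.Status.NumberAvailable) < readyNodes {
		return fmt.Errorf("expected at least %d available pods for %q daemonset, got %d",
			readyNodes, ds.Name, ds.Status.NumberAvailable)
	}
	return nil
}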

Comment 4 Junqi Zhao 2021-09-27 10:10:22 UTC
Checked with 4.10.0-0.nightly-2021-09-26-233013: stopped the kubelet service on one node and watched for 20 minutes; with the fix, CMO does not report degraded=true.
# oc get node | grep NotReady
ip-10-0-140-4.us-east-2.compute.internal     NotReady   worker   6h25m   v1.22.0-rc.0+af080cb

# oc -n openshift-monitoring get ds node-exporter
NAME            DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE SELECTOR            AGE
node-exporter   6         6         5       6            5           kubernetes.io/os=linux   115m

# oc -n openshift-monitoring get ds node-exporter -oyaml
...
status:
  currentNumberScheduled: 6
  desiredNumberScheduled: 6
  numberAvailable: 5
  numberMisscheduled: 0
  numberReady: 5
  numberUnavailable: 1
  observedGeneration: 1
  updatedNumberScheduled: 6

# oc get co monitoring -oyaml
...
  - lastTransitionTime: "2021-09-27T07:50:14Z"
    message: 'Prometheus is running without persistent storage which can lead to data
      loss during upgrades and cluster disruptions. Please refer to the official documentation
      to see how to configure storage for Prometheus: https://docs.openshift.com/container-platform/4.8/monitoring/configuring-the-monitoring-stack.html'
    reason: PrometheusDataPersistenceNotConfigured
    status: "False"
    type: Degraded

Comment 8 errata-xmlrpc 2022-03-10 16:10:15 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.10.3 security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:0056