Description of problem:
https://prow.ci.openshift.org/view/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.10-upgrade-from-stable-4.9-e2e-gcp-upgrade/1436412625549266945 is a test run that failed because CMO was degraded with the following message:
Failed to rollout the stack. Error: updating node-exporter: reconciling node-exporter DaemonSet failed: updating DaemonSet object failed: waiting for DaemonSetRollout of openshift-monitoring/node-exporter: expected 6 ready pods for "node-exporter" daemonset, got 5
In this example, 1 node was reported as NotReady while the node-exporter daemonset status stated that 6 nodes should be running the daemon pod, so only 5 of the 6 expected pods could become ready.
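To illustrate why a single NotReady node is enough to trip the rollout check, here is a minimal sketch in Python. It is not CMO's actual code (CMO is written in Go); the `DaemonSetStatus` class and `strict_rollout_error` function are hypothetical names, and the only assumption is that the check compares the DaemonSet's desired pod count against its ready pod count.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class DaemonSetStatus:
    """Hypothetical subset of the DaemonSet status fields used by the check."""
    desired_number_scheduled: int  # pods the DaemonSet wants to run
    number_ready: int              # pods currently reporting ready


def strict_rollout_error(name: str, status: DaemonSetStatus) -> Optional[str]:
    """Return an error message if any desired pod is not ready, else None."""
    want = status.desired_number_scheduled
    got = status.number_ready
    if got != want:
        return f'expected {want} ready pods for "{name}" daemonset, got {got}'
    return None


# Six nodes, one of them NotReady: the pod on that node cannot report ready,
# so a strict comparison fails even though nothing is wrong with the rollout.
print(strict_rollout_error("node-exporter", DaemonSetStatus(6, 5)))
```

A strict comparison like this cannot tell a broken rollout apart from a node that is simply down, which is the behaviour this report describes.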
Version-Release number of selected component (if applicable):
Steps to Reproduce:
1. Stop the kubelet service on a node and kick off CMO reconciliation
Actual results:
CMO reports degraded=true.
Expected results:
CMO reports degraded=false.
Checked with 4.10.0-0.nightly-2021-09-26-233013: stopped the kubelet service on one node and watched for 20 minutes. With the fix, CMO does not report degraded=true.
# oc get node | grep NotReady
ip-10-0-140-4.us-east-2.compute.internal NotReady worker 6h25m v1.22.0-rc.0+af080cb
# oc -n openshift-monitoring get ds node-exporter
NAME            DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE SELECTOR            AGE
node-exporter   6         6         5       6            5           kubernetes.io/os=linux   115m
# oc -n openshift-monitoring get ds node-exporter -oyaml
# oc get co monitoring -oyaml
- lastTransitionTime: "2021-09-27T07:50:14Z"
  message: 'Prometheus is running without persistent storage which can lead to data
    loss during upgrades and cluster disruptions. Please refer to the official documentation
    to see how to configure storage for Prometheus: https://docs.openshift.com/container-platform/4.8/monitoring/configuring-the-monitoring-stack.html'
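The verification output above (DESIRED 6, READY 5, one NotReady node) can be read as "every unready pod is explained by an unavailable node". The Python sketch below expresses that reasoning. It is only an illustration and is not claimed to be how the fix is implemented in CMO; the function name `unexplained_unready_pods` is hypothetical, and it assumes a DaemonSet schedules at most one pod per node.

```python
def unexplained_unready_pods(desired: int, ready: int, not_ready_nodes: int) -> int:
    """Count unready pods that cannot be attributed to a NotReady node."""
    missing = max(desired - ready, 0)
    # Each NotReady node can account for at most one missing DaemonSet pod.
    return max(missing - not_ready_nodes, 0)


# Values from the verification run: 6 desired, 5 ready, 1 NotReady node.
unexplained = unexplained_unready_pods(desired=6, ready=5, not_ready_nodes=1)
print("degraded" if unexplained > 0 else "not degraded")
```

With these numbers the result is zero unexplained pods, which matches the observation that CMO stayed degraded=false while the kubelet was stopped.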
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.
For information on the advisory (Moderate: OpenShift Container Platform 4.10.3 security update), and where to find the updated
files, follow the link below.
If the solution does not work for you, open a new bug report.