Description of problem:

https://prow.ci.openshift.org/view/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.10-upgrade-from-stable-4.9-e2e-gcp-upgrade/1436412625549266945 is a test run that failed because CMO was degraded with the following message:

Failed to rollout the stack. Error: updating node-exporter: reconciling node-exporter DaemonSet failed: updating DaemonSet object failed: waiting for DaemonSetRollout of openshift-monitoring/node-exporter: expected 5 ready pods for "node-exporter" daemonset, got 6

In this example, 1 node was reported as NotReady while the node-exporter daemonset status stated that all 6 nodes were running the daemon pod.

Version-Release number of selected component (if applicable):
4.9

How reproducible:
Sometimes

Steps to Reproduce:
1. Stop the kubelet service on a node and kick off CMO reconciliation
2.
3.

Actual results:
CMO reports degraded=true.

Expected results:
CMO reports degraded=false.

Additional info:
https://github.com/openshift/cluster-monitoring-operator/blob/10c16ae6ead9da2b4c0f68ca8567f4e0ee08a6c4/pkg/client/client.go#L990-L1005
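For context, the waiting logic at the lines linked above compares a pod count derived from node state against the count reported in the DaemonSet status, and the two diverge while a node is NotReady. Below is a minimal Go sketch of that failure mode, assuming a client-go clientset; the package, function name, and the exact status field compared are illustrative, not the literal CMO code.

package sketch

import (
	"context"
	"fmt"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// dsRolloutStrict derives the "expected" pod count from the nodes that
// are currently Ready and compares it with the pod count the DaemonSet
// status reports. When a node turns NotReady after its pod was already
// scheduled (5 Ready nodes vs. 6 scheduled pods in this bug), the two
// counts diverge and the check fails even though the DaemonSet is fine.
func dsRolloutStrict(ctx context.Context, client kubernetes.Interface, ns, name string) error {
	nodes, err := client.CoreV1().Nodes().List(ctx, metav1.ListOptions{})
	if err != nil {
		return err
	}

	expected := 0
	for _, n := range nodes.Items {
		for _, cond := range n.Status.Conditions {
			if cond.Type == corev1.NodeReady && cond.Status == corev1.ConditionTrue {
				expected++
			}
		}
	}

	ds, err := client.AppsV1().DaemonSets(ns).Get(ctx, name, metav1.GetOptions{})
	if err != nil {
		return err
	}

	got := int(ds.Status.UpdatedNumberScheduled)
	if expected != got {
		// Reproduces the shape of the reported error message.
		return fmt.Errorf("expected %d ready pods for %q daemonset, got %d", expected, name, got)
	}
	return nil
}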
Checked with 4.10.0-0.nightly-2021-09-26-233013: stopped the kubelet service on one node and watched for 20 minutes. With the fix, CMO does not report degraded=true.

# oc get node | grep NotReady
ip-10-0-140-4.us-east-2.compute.internal   NotReady   worker   6h25m   v1.22.0-rc.0+af080cb

# oc -n openshift-monitoring get ds node-exporter
NAME            DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE SELECTOR            AGE
node-exporter   6         6         5       6            5           kubernetes.io/os=linux   115m

# oc -n openshift-monitoring get ds node-exporter -oyaml
...
status:
  currentNumberScheduled: 6
  desiredNumberScheduled: 6
  numberAvailable: 5
  numberMisscheduled: 0
  numberReady: 5
  numberUnavailable: 1
  observedGeneration: 1
  updatedNumberScheduled: 6

# oc get co monitoring -oyaml
...
  - lastTransitionTime: "2021-09-27T07:50:14Z"
    message: 'Prometheus is running without persistent storage which can lead to data
      loss during upgrades and cluster disruptions. Please refer to the official documentation
      to see how to configure storage for Prometheus: https://docs.openshift.com/container-platform/4.8/monitoring/configuring-the-monitoring-stack.html'
    reason: PrometheusDataPersistenceNotConfigured
    status: "False"
    type: Degraded
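For reference, here is a sketch of the kind of tolerant check that matches the verified behavior above: it trusts the DaemonSet's own status rather than recounting Ready nodes, and allows a bounded number of unavailable pods. This is an assumption about the shape of the fix, not the literal patched code; the package, function name, and maxUnavailable parameter are hypothetical.

package sketch

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// dsRolloutTolerant passes as long as the DaemonSet has observed its
// latest spec, every scheduled pod is up to date, and no more than
// maxUnavailable pods are unavailable. With the status shown above
// (numberUnavailable: 1), a tolerance of 1 keeps CMO from going
// Degraded while a single node is NotReady.
func dsRolloutTolerant(ctx context.Context, client kubernetes.Interface, ns, name string, maxUnavailable int32) error {
	ds, err := client.AppsV1().DaemonSets(ns).Get(ctx, name, metav1.GetOptions{})
	if err != nil {
		return err
	}
	if ds.Status.ObservedGeneration < ds.Generation {
		return fmt.Errorf("daemonset %q has not observed its latest spec yet", name)
	}
	if ds.Status.UpdatedNumberScheduled < ds.Status.DesiredNumberScheduled {
		return fmt.Errorf("daemonset %q: %d of %d pods updated", name,
			ds.Status.UpdatedNumberScheduled, ds.Status.DesiredNumberScheduled)
	}
	if ds.Status.NumberUnavailable > maxUnavailable {
		return fmt.Errorf("daemonset %q: %d pods unavailable, tolerating at most %d",
			name, ds.Status.NumberUnavailable, maxUnavailable)
	}
	return nil
}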
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.10.3 security update), and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:0056