Description of problem:

We have a 4.8 cluster with 3 masters, 3 RHCOS workers, and 2 Windows workers. During an upgrade from 4.8.6-x86_64 --> 4.9.0-0.nightly-2021-08-22-070405, the monitoring operator is degraded and reports the following error message:

Failed to rollout the stack. Error: updating node-exporter: reconciling node-exporter DaemonSet failed: updating DaemonSet object failed: waiting for DaemonSetRollout of openshift-monitoring/node-exporter: expected 8 ready pods for "node-exporter" daemonset, got 6

Profile: 53_IPI on AWS & OVN & WindowsContainer
Upgrade CI job link: https://mastern-jenkins-csb-openshift-qe.apps.ocp4.prod.psi.redhat.com/job/upgrade_CI/16866/console

The nodeSelector for the node-exporter DaemonSet is kubernetes.io/os: linux. The two Windows nodes (ip-10-0-137-24.us-east-2.compute.internal and ip-10-0-158-243.us-east-2.compute.internal) do not carry the kubernetes.io/os: linux label; they are labeled kubernetes.io/os: windows, so node-exporter pods are never scheduled on them. In namespaces/openshift-monitoring/apps/daemonsets.yaml the desiredNumberScheduled is 6, yet the monitoring operator reports "updating DaemonSet object failed: waiting for DaemonSetRollout of openshift-monitoring/node-exporter: expected 8 ready pods for "node-exporter" daemonset, got 6".

# cat namespaces/openshift-monitoring/apps/daemonsets.yaml
...
  status:
    currentNumberScheduled: 6
    desiredNumberScheduled: 6
    numberAvailable: 6
    numberMisscheduled: 0
    numberReady: 6
    observedGeneration: 2
    updatedNumberScheduled: 6

# cat cluster-scoped-resources/config.openshift.io/clusteroperators/monitoring.yaml
  - lastTransitionTime: "2021-08-22T13:58:40Z"
    message: 'Failed to rollout the stack. Error: updating node-exporter: reconciling
      node-exporter DaemonSet failed: updating DaemonSet object failed: waiting for
      DaemonSetRollout of openshift-monitoring/node-exporter: expected 8 ready pods
      for "node-exporter" daemonset, got 6 '
    reason: UpdatingNodeExporterFailed
    status: "True"
    type: Degraded

Version-Release number of selected component (if applicable):
4.9.0-0.nightly-2021-08-22-070405

How reproducible:
2/2

Steps to Reproduce:
1. Install a 4.8 cluster with 3 masters, 3 RHCOS workers and 2 Windows workers
2. Upgrade to 4.9.0-0.nightly-2021-08-22-070405

Actual results:
Monitoring is degraded with the message "expected 8 ready pods for "node-exporter" daemonset, got 6"

Expected results:
Upgrade is successful

Additional info:
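The mismatch can be illustrated with a small sketch. This is not the actual cluster-monitoring-operator code, only an assumption about the failure mode the error message suggests: the correct "expected ready pods" for a DaemonSet is the number of nodes matching its nodeSelector (6 here), while counting every node in the cluster yields 8.

```python
# Illustrative sketch (hypothetical, not the real operator logic): model the
# cluster from this report -- 3 masters + 3 RHCOS workers labeled linux,
# 2 Windows workers labeled windows.
nodes = (
    [{"name": f"master-{i}", "labels": {"kubernetes.io/os": "linux"}} for i in range(3)]
    + [{"name": f"rhcos-worker-{i}", "labels": {"kubernetes.io/os": "linux"}} for i in range(3)]
    + [{"name": f"windows-worker-{i}", "labels": {"kubernetes.io/os": "windows"}} for i in range(2)]
)

# The node-exporter DaemonSet's nodeSelector, as reported above.
node_selector = {"kubernetes.io/os": "linux"}

def matches(labels: dict, selector: dict) -> bool:
    """A node matches when every selector key/value pair is present in its labels."""
    return all(labels.get(k) == v for k, v in selector.items())

# Correct expectation: only nodes matching the selector can run the pod.
expected = sum(matches(n["labels"], node_selector) for n in nodes)

# Buggy expectation implied by the error: counting all nodes, Windows included.
buggy_expected = len(nodes)

print(expected)        # 6 -- matches desiredNumberScheduled in the must-gather
print(buggy_expected)  # 8 -- the "expected 8 ready pods" in the degraded condition
```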
Profile: 53_IPI on AWS & OVN & WindowsContainer

Upgraded from 4.8.6 to 4.9.0-0.nightly-2021-08-26-164418 with Windows nodes; the monitoring upgrade is successful.

# oc get co monitoring
NAME         VERSION                             AVAILABLE   PROGRESSING   DEGRADED   SINCE   MESSAGE
monitoring   4.9.0-0.nightly-2021-08-26-164418   True        False         False      157m

# oc get node -o wide
NAME                                         STATUS   ROLES    AGE    VERSION                       INTERNAL-IP    EXTERNAL-IP   OS-IMAGE                                                       KERNEL-VERSION                 CONTAINER-RUNTIME
ip-10-0-131-61.us-east-2.compute.internal    Ready    worker   116m   v1.21.1-1397+a678cfd2c37e87   10.0.131.61    <none>        Windows Server 2019 Datacenter 10.0.17763.2114                                                docker://20.10.6
ip-10-0-149-100.us-east-2.compute.internal   Ready    worker   110m   v1.21.1-1397+a678cfd2c37e87   10.0.149.100   <none>        Windows Server 2019 Datacenter 10.0.17763.2114                                                docker://20.10.6
ip-10-0-153-190.us-east-2.compute.internal   Ready    master   161m   v1.21.1+9807387               10.0.153.190   <none>        Red Hat Enterprise Linux CoreOS 48.84.202108161759-0 (Ootpa)   4.18.0-305.12.1.el8_4.x86_64   cri-o://1.21.2-11.rhaos4.8.git5d31399.el8
ip-10-0-153-192.us-east-2.compute.internal   Ready    worker   152m   v1.21.1+9807387               10.0.153.192   <none>        Red Hat Enterprise Linux CoreOS 48.84.202108161759-0 (Ootpa)   4.18.0-305.12.1.el8_4.x86_64   cri-o://1.21.2-11.rhaos4.8.git5d31399.el8
ip-10-0-164-193.us-east-2.compute.internal   Ready    worker   154m   v1.21.1+9807387               10.0.164.193   <none>        Red Hat Enterprise Linux CoreOS 48.84.202108161759-0 (Ootpa)   4.18.0-305.12.1.el8_4.x86_64   cri-o://1.21.2-11.rhaos4.8.git5d31399.el8
ip-10-0-170-80.us-east-2.compute.internal    Ready    master   161m   v1.21.1+9807387               10.0.170.80    <none>        Red Hat Enterprise Linux CoreOS 48.84.202108161759-0 (Ootpa)   4.18.0-305.12.1.el8_4.x86_64   cri-o://1.21.2-11.rhaos4.8.git5d31399.el8
ip-10-0-195-55.us-east-2.compute.internal    Ready    master   161m   v1.21.1+9807387               10.0.195.55    <none>        Red Hat Enterprise Linux CoreOS 48.84.202108161759-0 (Ootpa)   4.18.0-305.12.1.el8_4.x86_64   cri-o://1.21.2-11.rhaos4.8.git5d31399.el8
ip-10-0-213-141.us-east-2.compute.internal   Ready    worker   152m   v1.21.1+9807387               10.0.213.141   <none>        Red Hat Enterprise Linux CoreOS 48.84.202108161759-0 (Ootpa)   4.18.0-305.12.1.el8_4.x86_64   cri-o://1.21.2-11.rhaos4.8.git5d31399.el8

# oc -n openshift-monitoring get ds
NAME            DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE SELECTOR            AGE
node-exporter   6         6         6       6            6           kubernetes.io/os=linux   159m

# oc get node -l kubernetes.io/os=linux
NAME                                         STATUS   ROLES    AGE    VERSION
ip-10-0-153-190.us-east-2.compute.internal   Ready    master   163m   v1.21.1+9807387
ip-10-0-153-192.us-east-2.compute.internal   Ready    worker   153m   v1.21.1+9807387
ip-10-0-164-193.us-east-2.compute.internal   Ready    worker   155m   v1.21.1+9807387
ip-10-0-170-80.us-east-2.compute.internal    Ready    master   163m   v1.21.1+9807387
ip-10-0-195-55.us-east-2.compute.internal    Ready    master   163m   v1.21.1+9807387
ip-10-0-213-141.us-east-2.compute.internal   Ready    worker   153m   v1.21.1+9807387

# oc -n openshift-monitoring get pod -o wide | grep node-exporter | awk '{print $7}'
ip-10-0-213-141.us-east-2.compute.internal
ip-10-0-195-55.us-east-2.compute.internal
ip-10-0-164-193.us-east-2.compute.internal
ip-10-0-170-80.us-east-2.compute.internal
ip-10-0-153-190.us-east-2.compute.internal
ip-10-0-153-192.us-east-2.compute.internal
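As a sanity check on the verification output above, the placement can be compared as sets: every node-exporter pod should land on a linux-labeled node, and every linux node should host exactly one pod. A minimal sketch using the hostnames from this report:

```python
# Nodes carrying kubernetes.io/os=linux, from `oc get node -l kubernetes.io/os=linux`.
linux_nodes = {
    "ip-10-0-153-190.us-east-2.compute.internal",
    "ip-10-0-153-192.us-east-2.compute.internal",
    "ip-10-0-164-193.us-east-2.compute.internal",
    "ip-10-0-170-80.us-east-2.compute.internal",
    "ip-10-0-195-55.us-east-2.compute.internal",
    "ip-10-0-213-141.us-east-2.compute.internal",
}

# Nodes hosting node-exporter pods, from the `oc -n openshift-monitoring get pod
# -o wide` output above.
pod_nodes = {
    "ip-10-0-213-141.us-east-2.compute.internal",
    "ip-10-0-195-55.us-east-2.compute.internal",
    "ip-10-0-164-193.us-east-2.compute.internal",
    "ip-10-0-170-80.us-east-2.compute.internal",
    "ip-10-0-153-190.us-east-2.compute.internal",
    "ip-10-0-153-192.us-east-2.compute.internal",
}

# The DaemonSet is fully scheduled exactly when the two sets coincide.
print(pod_nodes == linux_nodes)  # True
print(len(pod_nodes))            # 6 -- matches DESIRED/READY in the ds output
```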
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.9.0 bug fix and security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:3759