1996941 – Monitoring operator is degraded because expected 8 ready pods for "node-exporter" daemonset but got 6 when upgrading windows cluster to 4.9

Bug 1996941 - Monitoring operator is degraded because expected 8 ready pods for "node-exporter" daemonset but got 6 when upgrading windows cluster to 4.9

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	OpenShift Container Platform
Classification:	Red Hat
Component:	Monitoring
Sub Component:
Version:	4.9
Hardware:	Unspecified
OS:	Unspecified
Priority:	high
Severity:	high
Target Milestone:	---
Target Release:	4.9.0
Assignee:	Prashant Balachandran
QA Contact:	Junqi Zhao
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:

Reported:	2021-08-24 04:15 UTC by Yang Yang
Modified:	2021-11-01 10:21 UTC
CC List:	7 users
Fixed In Version:
Doc Type:	If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed:	2021-10-18 17:48:10 UTC
Target Upstream Version:
Embargoed:

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Github	openshift cluster-monitoring-operator pull 1339	0	None	None	None	2021-08-24 10:25:25 UTC
Red Hat Product Errata	RHSA-2021:3759	0	None	None	None	2021-10-18 17:48:37 UTC

Description Yang Yang 2021-08-24 04:15:16 UTC

Description of problem:

We have a 4.8 cluster with 3 masters, 3 RHCOS workers, and 2 Windows workers. During the upgrade from 4.8.6-x86_64 to 4.9.0-0.nightly-2021-08-22-070405, the monitoring operator becomes degraded and reports the following error message:
Failed to rollout the stack. Error: updating node-exporter: reconciling node-exporter DaemonSet failed: updating DaemonSet object failed: waiting for DaemonSetRollout of openshift-monitoring/node-exporter: expected 8 ready pods for "node-exporter" daemonset, got 6

Profile: 53_IPI on AWS & OVN & WindowsContainer
Upgrade ci job link: https://mastern-jenkins-csb-openshift-qe.apps.ocp4.prod.psi.redhat.com/job/upgrade_CI/16866/console

The nodeSelector for the node-exporter DaemonSet is kubernetes.io/os: linux.
The two Windows nodes, ip-10-0-137-24.us-east-2.compute.internal and ip-10-0-158-243.us-east-2.compute.internal,
do not have the label kubernetes.io/os: linux; they are labeled kubernetes.io/os: windows, so node-exporter pods are not scheduled onto the Windows nodes.
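To make the selector arithmetic concrete, here is a minimal Python sketch of how a nodeSelector narrows the set of eligible nodes. It is not the operator's code. The six Linux node names are hypothetical placeholders (the report does not list them for this cluster); the two Windows node names are the ones given above.

```python
# Illustrative only: models nodeSelector matching as a plain label comparison.
NODE_SELECTOR = {"kubernetes.io/os": "linux"}


def matches_node_selector(node_labels, node_selector):
    """Return True if every selector key/value pair is present on the node."""
    return all(node_labels.get(key) == value for key, value in node_selector.items())


def count_eligible(nodes, node_selector):
    """Count the nodes whose labels satisfy the selector."""
    return sum(
        1 for labels in nodes.values() if matches_node_selector(labels, node_selector)
    )


# Hypothetical names for the 3 masters and 3 RHCOS workers.
nodes = {f"linux-node-{i}": {"kubernetes.io/os": "linux"} for i in range(6)}
# The two Windows workers named in this report.
nodes["ip-10-0-137-24.us-east-2.compute.internal"] = {"kubernetes.io/os": "windows"}
nodes["ip-10-0-158-243.us-east-2.compute.internal"] = {"kubernetes.io/os": "windows"}

total = len(nodes)
eligible = count_eligible(nodes, NODE_SELECTOR)
print(f"total nodes: {total}, eligible for node-exporter: {eligible}")
```

With these inputs the total is 8 and the eligible count is 6, which are the two numbers that appear in the operator's error message.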

According to the file namespaces/openshift-monitoring/apps/daemonsets.yaml, desiredNumberScheduled is 6, but the monitoring operator reports “updating DaemonSet object failed: waiting for DaemonSetRollout of openshift-monitoring/node-exporter: expected 8 ready pods for "node-exporter" daemonset, got 6”.
# cat namespaces/openshift-monitoring/apps/daemonsets.yaml
...
  status:
    currentNumberScheduled: 6
    desiredNumberScheduled: 6
    numberAvailable: 6
    numberMisscheduled: 0
    numberReady: 6
    observedGeneration: 2
    updatedNumberScheduled: 6
# cat cluster-scoped-resources/config.openshift.io/clusteroperators/monitoring.yaml
  - lastTransitionTime: "2021-08-22T13:58:40Z"
    message: 'Failed to rollout the stack. Error: updating node-exporter: reconciling
      node-exporter DaemonSet failed: updating DaemonSet object failed: waiting for
      DaemonSetRollout of openshift-monitoring/node-exporter: expected 8 ready pods
      for "node-exporter" daemonset, got 6 '
    reason: UpdatingNodeExporterFailed
    status: "True"
    type: Degraded
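
The mismatch between the two outputs above can be illustrated with a short Python sketch. This report does not show the operator's rollout check or the content of the fix, so the function below is a hypothetical stand-in. It only demonstrates that a target of 8 (the total node count) can never be reached for this DaemonSet, while a target taken from status.desiredNumberScheduled is already met.

```python
# Hypothetical stand-in for a rollout check; not the operator's implementation.
def rollout_error(status, expected_ready):
    """Return an error string if numberReady differs from the target, else None."""
    ready = status["numberReady"]
    if ready != expected_ready:
        return (
            f'expected {expected_ready} ready pods for "node-exporter" '
            f"daemonset, got {ready}"
        )
    return None


# Values copied from the DaemonSet status in daemonsets.yaml above.
status = {
    "currentNumberScheduled": 6,
    "desiredNumberScheduled": 6,
    "numberAvailable": 6,
    "numberMisscheduled": 0,
    "numberReady": 6,
    "updatedNumberScheduled": 6,
}

# Target of 8: all nodes in the cluster, including the 2 Windows workers.
by_node_count = rollout_error(status, expected_ready=8)
# Target of 6: what the DaemonSet controller itself wants to schedule.
by_desired = rollout_error(status, expected_ready=status["desiredNumberScheduled"])

print(by_node_count)
print(by_desired)
```

The first call reproduces the text of the reported message; the second returns None because the DaemonSet is fully rolled out by its own accounting.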


Version-Release number of selected component (if applicable):
4.9.0-0.nightly-2021-08-22-070405

How reproducible:
2/2

Steps to Reproduce:
1. Install a 4.8 cluster with 3 masters, 3 RHCOS workers and 2 Windows workers
2. Upgrade to 4.9.0-0.nightly-2021-08-22-070405

Actual results:
Monitoring is degraded with the message: expected 8 ready pods for "node-exporter" daemonset, got 6

Expected results:
Upgrade is successful

Additional info:

Comment 8 Junqi Zhao 2021-08-27 03:51:25 UTC

Profile: 53_IPI on AWS & OVN & WindowsContainer
Upgraded from 4.8.6 to 4.9.0-0.nightly-2021-08-26-164418 with Windows nodes; the monitoring upgrade is successful.
# oc get co monitoring 
NAME         VERSION                             AVAILABLE   PROGRESSING   DEGRADED   SINCE   MESSAGE
monitoring   4.9.0-0.nightly-2021-08-26-164418   True        False         False      157m    

# oc get node -o wide
NAME                                         STATUS   ROLES    AGE    VERSION                       INTERNAL-IP    EXTERNAL-IP   OS-IMAGE                                                       KERNEL-VERSION                 CONTAINER-RUNTIME
ip-10-0-131-61.us-east-2.compute.internal    Ready    worker   116m   v1.21.1-1397+a678cfd2c37e87   10.0.131.61    <none>        Windows Server 2019 Datacenter                                 10.0.17763.2114                docker://20.10.6
ip-10-0-149-100.us-east-2.compute.internal   Ready    worker   110m   v1.21.1-1397+a678cfd2c37e87   10.0.149.100   <none>        Windows Server 2019 Datacenter                                 10.0.17763.2114                docker://20.10.6
ip-10-0-153-190.us-east-2.compute.internal   Ready    master   161m   v1.21.1+9807387               10.0.153.190   <none>        Red Hat Enterprise Linux CoreOS 48.84.202108161759-0 (Ootpa)   4.18.0-305.12.1.el8_4.x86_64   cri-o://1.21.2-11.rhaos4.8.git5d31399.el8
ip-10-0-153-192.us-east-2.compute.internal   Ready    worker   152m   v1.21.1+9807387               10.0.153.192   <none>        Red Hat Enterprise Linux CoreOS 48.84.202108161759-0 (Ootpa)   4.18.0-305.12.1.el8_4.x86_64   cri-o://1.21.2-11.rhaos4.8.git5d31399.el8
ip-10-0-164-193.us-east-2.compute.internal   Ready    worker   154m   v1.21.1+9807387               10.0.164.193   <none>        Red Hat Enterprise Linux CoreOS 48.84.202108161759-0 (Ootpa)   4.18.0-305.12.1.el8_4.x86_64   cri-o://1.21.2-11.rhaos4.8.git5d31399.el8
ip-10-0-170-80.us-east-2.compute.internal    Ready    master   161m   v1.21.1+9807387               10.0.170.80    <none>        Red Hat Enterprise Linux CoreOS 48.84.202108161759-0 (Ootpa)   4.18.0-305.12.1.el8_4.x86_64   cri-o://1.21.2-11.rhaos4.8.git5d31399.el8
ip-10-0-195-55.us-east-2.compute.internal    Ready    master   161m   v1.21.1+9807387               10.0.195.55    <none>        Red Hat Enterprise Linux CoreOS 48.84.202108161759-0 (Ootpa)   4.18.0-305.12.1.el8_4.x86_64   cri-o://1.21.2-11.rhaos4.8.git5d31399.el8
ip-10-0-213-141.us-east-2.compute.internal   Ready    worker   152m   v1.21.1+9807387               10.0.213.141   <none>        Red Hat Enterprise Linux CoreOS 48.84.202108161759-0 (Ootpa)   4.18.0-305.12.1.el8_4.x86_64   cri-o://1.21.2-11.rhaos4.8.git5d31399.el8

# oc -n openshift-monitoring get ds
NAME            DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE SELECTOR            AGE
node-exporter   6         6         6       6            6           kubernetes.io/os=linux   159m


# oc get node -l kubernetes.io/os=linux
NAME                                         STATUS   ROLES    AGE    VERSION
ip-10-0-153-190.us-east-2.compute.internal   Ready    master   163m   v1.21.1+9807387
ip-10-0-153-192.us-east-2.compute.internal   Ready    worker   153m   v1.21.1+9807387
ip-10-0-164-193.us-east-2.compute.internal   Ready    worker   155m   v1.21.1+9807387
ip-10-0-170-80.us-east-2.compute.internal    Ready    master   163m   v1.21.1+9807387
ip-10-0-195-55.us-east-2.compute.internal    Ready    master   163m   v1.21.1+9807387
ip-10-0-213-141.us-east-2.compute.internal   Ready    worker   153m   v1.21.1+9807387

# oc -n openshift-monitoring get pod -o wide | grep node-exporter | awk '{print $7}'
ip-10-0-213-141.us-east-2.compute.internal
ip-10-0-195-55.us-east-2.compute.internal
ip-10-0-164-193.us-east-2.compute.internal
ip-10-0-170-80.us-east-2.compute.internal
ip-10-0-153-190.us-east-2.compute.internal
ip-10-0-153-192.us-east-2.compute.internal
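
The verification above amounts to checking that the nodes running node-exporter pods are exactly the nodes labeled kubernetes.io/os=linux. A minimal Python sketch of that comparison follows, using the node names from the command output in this comment. The variable names are illustrative, and the lists are pasted by hand rather than fetched from a cluster.

```python
# Node names copied from `oc get node -l kubernetes.io/os=linux` above.
linux_nodes = {
    "ip-10-0-153-190.us-east-2.compute.internal",
    "ip-10-0-153-192.us-east-2.compute.internal",
    "ip-10-0-164-193.us-east-2.compute.internal",
    "ip-10-0-170-80.us-east-2.compute.internal",
    "ip-10-0-195-55.us-east-2.compute.internal",
    "ip-10-0-213-141.us-east-2.compute.internal",
}

# Windows nodes, identified by OS-IMAGE in `oc get node -o wide` above.
windows_nodes = {
    "ip-10-0-131-61.us-east-2.compute.internal",
    "ip-10-0-149-100.us-east-2.compute.internal",
}

# Nodes hosting node-exporter pods, from the `grep node-exporter` output above.
exporter_pod_nodes = [
    "ip-10-0-213-141.us-east-2.compute.internal",
    "ip-10-0-195-55.us-east-2.compute.internal",
    "ip-10-0-164-193.us-east-2.compute.internal",
    "ip-10-0-170-80.us-east-2.compute.internal",
    "ip-10-0-153-190.us-east-2.compute.internal",
    "ip-10-0-153-192.us-east-2.compute.internal",
]

# Linux nodes with no node-exporter pod, and pods on non-Linux nodes.
missing = linux_nodes - set(exporter_pod_nodes)
unexpected = set(exporter_pod_nodes) - linux_nodes
on_windows = set(exporter_pod_nodes) & windows_nodes

print("missing:", sorted(missing))
print("unexpected:", sorted(unexpected))
print("on windows:", sorted(on_windows))
```

All three sets are empty for this data: every Linux node has one node-exporter pod and no pod landed on a Windows node.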

Comment 16 errata-xmlrpc 2021-10-18 17:48:10 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.9.0 bug fix and security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:3759
