Bug 1945458

Summary:	4.8 Metal CI upgrade jobs failing with image-registry pod failing to drain
Product:	OpenShift Container Platform	Reporter:	Yu Qi Zhang <jerzhang>
Component:	Image Registry	Assignee:	Oleg Bulatov <obulatov>
Status:	CLOSED DUPLICATE	QA Contact:	Wenjing Zheng <wzheng>
Severity:	unspecified	Docs Contact:
Priority:	unspecified
Version:	4.8	CC:	aos-bugs, wking
Target Milestone:	---
Target Release:	---
Hardware:	Unspecified
OS:	Unspecified
Whiteboard:
Fixed In Version:		Doc Type:	If docs needed, set a value
Doc Text:		Story Points:	---
Clone Of:		Environment:
Last Closed:	2021-04-01 04:45:03 UTC	Type:	Bug
Regression:	---	Mount Type:	---
Documentation:	---	CRM:
Verified Versions:		Category:	---
oVirt Team:	---	RHEL 7.3 requirements from Atomic Host:
Cloudforms Team:	---	Target Upstream Version:
Embargoed:

Description Yu Qi Zhang 2021-04-01 01:02:11 UTC

Description of problem:
While on build-watcher duty looking at the 4.8 nightly jobs, I noticed that the metal upgrade jobs show a significant degradation in performance. Example job tracker:
https://testgrid.k8s.io/redhat-openshift-ocp-release-4.8-informing#periodic-ci-openshift-release-master-nightly-4.8-e2e-metal-ipi-upgrade

Most recently the jobs started failing on the MCO. Checking the MCO logs, I see:

'Node worker-1 is reporting: "failed to drain node (5 tries): timed out
      waiting for the condition: error when evicting pods/\"image-registry-556b7484d5-pbrqc\"
      -n \"openshift-image-registry\": global timeout reached: 1m30s"'

This degrades the upgrade. I'm not sure why the pod cannot be drained (it fails repeatedly across retries), so I'm opening this bug.
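
For triage across several job runs, it can help to pull the node, pod, namespace, and timeout out of that message mechanically. The following is a minimal sketch, assuming the message text has the same shape as the one quoted above; the function name parse_drain_failure and the regular expression are illustrative and not part of the MCO.

```python
import re

# Message copied from the MachineConfigPool status quoted above. The YAML
# line folding is collapsed to single spaces before matching.
RAW = r'''Node worker-1 is reporting: "failed to drain node (5 tries): timed out
      waiting for the condition: error when evicting pods/\"image-registry-556b7484d5-pbrqc\"
      -n \"openshift-image-registry\": global timeout reached: 1m30s"'''

# Hypothetical pattern, written against this one message shape only.
PATTERN = re.compile(
    r'Node (?P<node>\S+) is reporting: "failed to drain node \((?P<tries>\d+) tries\)'
    r'.*?error when evicting pods/\\"(?P<pod>[^"\\]+)\\" -n \\"(?P<namespace>[^"\\]+)\\"'
    r': global timeout reached: (?P<timeout>\S+?)"'
)


def parse_drain_failure(message):
    """Return the fields of a drain-failure message, or None if it does not match."""
    flat = " ".join(message.split())  # undo line folding and indentation
    match = PATTERN.search(flat)
    return match.groupdict() if match else None


if __name__ == "__main__":
    print(parse_drain_failure(RAW))
```

Messages with a different shape (for example, a different eviction error) return None rather than a partial result.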

Example jobs

https://prow.ci.openshift.org/view/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.8-e2e-metal-ipi-upgrade/1377260372473417728

https://prow.ci.openshift.org/view/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.8-upgrade-from-stable-4.7-e2e-metal-ipi-upgrade/1377201785415929856

See must-gather/cluster-scoped-resources/machineconfig.../machineconfigpools/worker status for the error

Version-Release number of selected component (if applicable):
4.8 metal

How reproducible:
100% across the 3 jobs I looked at

Steps to Reproduce:
See CI

Actual results:
Fail

Expected results:
Pass

Additional info:

Comment 1 W. Trevor King 2021-04-01 04:45:03 UTC

At least [1] has PodDisruptionBudgetAtLimit firing as well.

[1]: https://prow.ci.openshift.org/view/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.8-e2e-metal-ipi-upgrade/1377260372473417728
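
One plausible reading of why that alert matters here: a PodDisruptionBudget that is at its limit allows zero voluntary disruptions, so eviction requests during drain are refused until another replica becomes healthy. The sketch below illustrates that arithmetic only. The PdbStatus class and the replica counts are hypothetical and were not taken from the CI run; the field names mirror the currentHealthy and desiredHealthy fields of the PodDisruptionBudget status.

```python
from dataclasses import dataclass


@dataclass
class PdbStatus:
    """Hypothetical stand-in for the relevant PodDisruptionBudget status fields."""
    current_healthy: int  # pods currently counted as healthy
    desired_healthy: int  # minimum healthy pods the budget requires


def disruptions_allowed(status):
    """Evictions the budget can absorb right now, never below zero."""
    return max(0, status.current_healthy - status.desired_healthy)


def eviction_permitted(status):
    """An eviction is only permitted while at least one disruption is allowed."""
    return disruptions_allowed(status) > 0


if __name__ == "__main__":
    # Assumed numbers for illustration: budget at its limit versus one spare replica.
    at_limit = PdbStatus(current_healthy=1, desired_healthy=1)
    with_spare = PdbStatus(current_healthy=2, desired_healthy=1)
    print(eviction_permitted(at_limit), eviction_permitted(with_spare))
```

If the budget stays at its limit for longer than the drain timeout, every retry would fail the same way, which would be consistent with the repeated drain failures in the description.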

*** This bug has been marked as a duplicate of bug 1944762 ***