1945458 – 4.8 Metal CI upgrade jobs failing with image-registry pod failing to drain

Summary: 4.8 Metal CI upgrade jobs failing with image-registry pod failing to drain

Keywords:
Status:	CLOSED DUPLICATE of bug 1944762
Alias:	None
Product:	OpenShift Container Platform
Classification:	Red Hat
Component:	Image Registry
Sub Component:
Version:	4.8
Hardware:	Unspecified
OS:	Unspecified
Priority:	unspecified
Severity:	unspecified
Target Milestone:	---
Target Release:	---
Assignee:	Oleg Bulatov
QA Contact:	Wenjing Zheng
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:

Reported:	2021-04-01 01:02 UTC by Yu Qi Zhang
Modified:	2021-04-01 04:45 UTC
CC List:	2 users
Fixed In Version:
Doc Type:	If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed:	2021-04-01 04:45:03 UTC
Target Upstream Version:
Embargoed:

Description Yu Qi Zhang 2021-04-01 01:02:11 UTC

Description of problem:
While build-watching the 4.8 nightly jobs, I noticed that the metal upgrade jobs have degraded significantly. Example job tracker:
https://testgrid.k8s.io/redhat-openshift-ocp-release-4.8-informing#periodic-ci-openshift-release-master-nightly-4.8-e2e-metal-ipi-upgrade

Most recently the jobs started failing on the MCO. Checking the MCO logs, I see:

'Node worker-1 is reporting: "failed to drain node (5 tries): timed out
      waiting for the condition: error when evicting pods/\"image-registry-556b7484d5-pbrqc\"
      -n \"openshift-image-registry\": global timeout reached: 1m30s"'

This degrades the upgrade. I am not sure why the pod cannot be drained (it fails repeatedly), so I am opening this bug.
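
For triage across several jobs, the fields in that message can be pulled out mechanically. The following is a minimal sketch, not MCO tooling: the pattern and the helper name parse_drain_failure are hypothetical and were written only against the single message quoted above, with its line wrapping collapsed.

```python
import re

# Message copied from the MCO log excerpt above, with the line wrapping collapsed.
MESSAGE = (
    'Node worker-1 is reporting: "failed to drain node (5 tries): timed out '
    'waiting for the condition: error when evicting pods/\\"image-registry-556b7484d5-pbrqc\\" '
    '-n \\"openshift-image-registry\\": global timeout reached: 1m30s"'
)

# Hypothetical pattern: assumes the wording seen in this one message.
DRAIN_FAILURE = re.compile(
    r'Node (?P<node>\S+) is reporting: "failed to drain node \((?P<tries>\d+) tries\)'
    r'.*?evicting pods/\\?"(?P<pod>[^"\\]+)\\?"'
    r'\s+-n\s+\\?"(?P<namespace>[^"\\]+)\\?"'
    r'.*?global timeout reached: (?P<timeout>[0-9hms]+)',
    re.DOTALL,
)


def parse_drain_failure(message):
    """Return node, tries, pod, namespace, and timeout as a dict, or None if no match."""
    match = DRAIN_FAILURE.search(message)
    if match is None:
        return None
    fields = match.groupdict()
    fields["tries"] = int(fields["tries"])  # the only numeric field
    return fields
```

Calling parse_drain_failure(MESSAGE) yields the node, pod, and namespace named in the log, which makes it easier to check whether other failing runs are blocked on the same pod.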

Example jobs

https://prow.ci.openshift.org/view/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.8-e2e-metal-ipi-upgrade/1377260372473417728

https://prow.ci.openshift.org/view/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.8-upgrade-from-stable-4.7-e2e-metal-ipi-upgrade/1377201785415929856

See must-gather/cluster-scoped-resources/machineconfig.../machineconfigpools/worker status for the error

Version-Release number of selected component (if applicable):
4.8 metal

How reproducible:
100% across 3 jobs I looked at

Steps to Reproduce:
See CI

Actual results:
Fail

Expected results:
Pass

Additional info:

Comment 1 W. Trevor King 2021-04-01 04:45:03 UTC

At least [1] has PodDisruptionBudgetAtLimit firing as well.

[1]: https://prow.ci.openshift.org/view/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.8-e2e-metal-ipi-upgrade/1377260372473417728
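
One possible reading, not confirmed in this bug: if the image-registry PodDisruptionBudget has no disruptions left, the eviction API keeps refusing the drain, which would match the repeated timeouts. A minimal sketch of that arithmetic follows, assuming the standard PodDisruptionBudget status semantics (currentHealthy, desiredHealthy, disruptionsAllowed). The function names and the numbers in the usage note are hypothetical and are not taken from the CI run.

```python
def disruptions_allowed(current_healthy, desired_healthy):
    """Mirror the PDB status field: healthy pods above the required minimum, never negative."""
    return max(0, current_healthy - desired_healthy)


def can_evict(current_healthy, desired_healthy):
    """An eviction is only admitted while at least one disruption is still allowed."""
    return disruptions_allowed(current_healthy, desired_healthy) > 0
```

With hypothetical values, can_evict(1, 1) is False (the budget is at its limit, so a drain would keep retrying), while can_evict(2, 1) is True.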

*** This bug has been marked as a duplicate of bug 1944762 ***