Bug 1945458

Summary: 4.8 Metal CI upgrade jobs failing with image-registry pod failing to drain
Product: OpenShift Container Platform
Reporter: Yu Qi Zhang <jerzhang>
Component: Image Registry
Assignee: Oleg Bulatov <obulatov>
Status: CLOSED DUPLICATE
QA Contact: Wenjing Zheng <wzheng>
Severity: unspecified
Docs Contact:
Priority: unspecified
Version: 4.8
CC: aos-bugs, wking
Target Milestone: ---
Target Release: ---
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2021-04-01 04:45:03 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:

Description Yu Qi Zhang 2021-04-01 01:02:11 UTC
Description of problem:
While build-watching the 4.8 nightly jobs, I noticed that the metal upgrade jobs have degraded significantly. Example job tracker:
https://testgrid.k8s.io/redhat-openshift-ocp-release-4.8-informing#periodic-ci-openshift-release-master-nightly-4.8-e2e-metal-ipi-upgrade

Most recently they started failing on the MCO. Checking the MCO logs, I see:

'Node worker-1 is reporting: "failed to drain node (5 tries): timed out
      waiting for the condition: error when evicting pods/\"image-registry-556b7484d5-pbrqc\"
      -n \"openshift-image-registry\": global timeout reached: 1m30s"'

This degrades the upgrade. I'm not sure why the pod cannot be evicted (the drain fails repeatedly), so I'm opening this bug.
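
When comparing several failed jobs, it can help to pull the node, pod, and namespace out of drain failure messages like the one quoted above. The following is a minimal sketch, assuming the message keeps exactly the wording seen in this bug; `parse_drain_failure`, `DRAIN_FAILURE`, and `SAMPLE` are hypothetical names and not part of the MCO.

```python
import re
from typing import Optional

# Sample copied from the MCO message quoted in this bug, including its
# escaped quotes and wrapped lines.
SAMPLE = (
    'Node worker-1 is reporting: "failed to drain node (5 tries): timed out\n'
    '      waiting for the condition: error when evicting '
    'pods/\\"image-registry-556b7484d5-pbrqc\\"\n'
    '      -n \\"openshift-image-registry\\": global timeout reached: 1m30s"'
)

# Assumption: the wording matches the single message seen in this bug.
# Other releases may phrase the drain failure differently.
DRAIN_FAILURE = re.compile(
    r'Node (?P<node>\S+) is reporting: .*?'
    r'evicting pods/\\?"(?P<pod>[^"\\]+)\\?" '
    r'-n \\?"(?P<namespace>[^"\\]+)\\?": '
    r'global timeout reached: (?P<timeout>\w+)'
)


def parse_drain_failure(message: str) -> Optional[dict]:
    """Return node, pod, namespace, and timeout, or None if nothing matches."""
    # Collapse the line wrapping and indentation into single spaces first.
    flat = " ".join(message.split())
    match = DRAIN_FAILURE.search(flat)
    return match.groupdict() if match else None


if __name__ == "__main__":
    print(parse_drain_failure(SAMPLE))
```

Running the parser over the worker pool status from each job's must-gather would show whether the same pod and namespace are blocking the drain every time.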

Example jobs

https://prow.ci.openshift.org/view/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.8-e2e-metal-ipi-upgrade/1377260372473417728

https://prow.ci.openshift.org/view/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.8-upgrade-from-stable-4.7-e2e-metal-ipi-upgrade/1377201785415929856

See must-gather/cluster-scoped-resources/machineconfig.../machineconfigpools/worker status for the error

Version-Release number of selected component (if applicable):
4.8 metal

How reproducible:
100% across the 3 jobs I looked at

Steps to Reproduce:
See CI

Actual results:
Fail

Expected results:
Pass

Additional info:

Comment 1 W. Trevor King 2021-04-01 04:45:03 UTC
At least [1] has PodDisruptionBudgetAtLimit firing as well.

[1]: https://prow.ci.openshift.org/view/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.8-e2e-metal-ipi-upgrade/1377260372473417728
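
For context on why that alert may be related: a PodDisruptionBudget at its limit allows zero further voluntary disruptions, so eviction requests for the covered pods are refused and a drain keeps retrying until it times out. The sketch below shows only that arithmetic. The replica counts are hypothetical, since the actual budget values for the registry are not recorded in this bug.

```python
def disruptions_allowed(current_healthy: int, desired_healthy: int) -> int:
    """Voluntary disruptions a budget still permits (never negative)."""
    return max(0, current_healthy - desired_healthy)


def eviction_permitted(current_healthy: int, desired_healthy: int) -> bool:
    """An eviction can proceed only while at least one disruption is allowed."""
    return disruptions_allowed(current_healthy, desired_healthy) > 0


if __name__ == "__main__":
    # Hypothetical values: the budget wants 1 healthy pod and only 1 is healthy.
    print(eviction_permitted(current_healthy=1, desired_healthy=1))
    # Hypothetical values: 2 healthy pods with 1 required leaves room to evict.
    print(eviction_permitted(current_healthy=2, desired_healthy=1))
```

If the budget in the affected clusters looks like the first case, the drain cannot succeed until another healthy replica is running elsewhere.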

*** This bug has been marked as a duplicate of bug 1944762 ***