Bug 1945458 - 4.8 Metal CI upgrade jobs failing with image-registry pod failing to drain
Keywords:
Status: CLOSED DUPLICATE of bug 1944762
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Image Registry
Version: 4.8
Hardware: Unspecified
OS: Unspecified
Target Milestone: ---
Assignee: Oleg Bulatov
QA Contact: Wenjing Zheng
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2021-04-01 01:02 UTC by Yu Qi Zhang
Modified: 2021-04-01 04:45 UTC

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2021-04-01 04:45:03 UTC
Target Upstream Version:
Embargoed:


Description Yu Qi Zhang 2021-04-01 01:02:11 UTC
Description of problem:
As build watcher, looking at the 4.8 nightly jobs, I see that the metal upgrade jobs have degraded significantly. Example job tracker:
https://testgrid.k8s.io/redhat-openshift-ocp-release-4.8-informing#periodic-ci-openshift-release-master-nightly-4.8-e2e-metal-ipi-upgrade

Most recently the jobs started failing on the MCO. Checking the MCO logs, I see:

'Node worker-1 is reporting: "failed to drain node (5 tries): timed out
      waiting for the condition: error when evicting pods/\"image-registry-556b7484d5-pbrqc\"
      -n \"openshift-image-registry\": global timeout reached: 1m30s"'

This degrades the upgrade. I am not sure why the pod cannot be evicted (the drain fails repeatedly), so I am opening this bug.
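
For triage across several jobs, it can help to pull the pod, namespace, retry count, and timeout out of the drain error quoted above. The following is a minimal sketch, not MCO code: the regular expression is an assumption derived only from the single message in this report, and `parse_drain_failure` is a hypothetical helper name.

```python
import re

# The message quoted above, joined onto one line with the YAML escaping removed.
MESSAGE = (
    'failed to drain node (5 tries): timed out waiting for the condition: '
    'error when evicting pods/"image-registry-556b7484d5-pbrqc" '
    '-n "openshift-image-registry": global timeout reached: 1m30s'
)

# Assumed pattern, inferred from the one message above; other MCO versions
# may word the error differently.
EVICTION_RE = re.compile(
    r'failed to drain node \((?P<tries>\d+) tries\).*?'
    r'error when evicting pods/"(?P<pod>[^"]+)" -n "(?P<namespace>[^"]+)": '
    r'global timeout reached: (?P<timeout>\S+)'
)


def parse_drain_failure(message):
    """Return pod, namespace, retry count, and timeout, or None if no match."""
    match = EVICTION_RE.search(message)
    if match is None:
        return None
    fields = match.groupdict()
    fields["tries"] = int(fields["tries"])
    return fields


if __name__ == "__main__":
    print(parse_drain_failure(MESSAGE))
```

Running the same helper over the worker pool message from each failing job would show whether it is always the image-registry pod that blocks the drain.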

Example jobs

https://prow.ci.openshift.org/view/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.8-e2e-metal-ipi-upgrade/1377260372473417728

https://prow.ci.openshift.org/view/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.8-upgrade-from-stable-4.7-e2e-metal-ipi-upgrade/1377201785415929856

See must-gather/cluster-scoped-resources/machineconfig.../machineconfigpools/worker status for the error
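
As a rough illustration of what to look for in that status, the sketch below lists the conditions on a MachineConfigPool that are currently `True` and whose type ends in `Degraded`. It assumes the pool has already been parsed into a Python dict (for example with a YAML loader); `degraded_messages` is a hypothetical helper, and `SAMPLE_POOL` is an abbreviated stand-in written for this example, not data copied from the must-gather.

```python
def degraded_messages(pool):
    """Return (type, message) for each True condition whose type ends in 'Degraded'."""
    conditions = pool.get("status", {}).get("conditions", [])
    return [
        (cond.get("type"), cond.get("message", ""))
        for cond in conditions
        if cond.get("status") == "True"
        and cond.get("type", "").endswith("Degraded")
    ]


# Illustrative stand-in for the parsed worker pool (assumed shape and values).
SAMPLE_POOL = {
    "metadata": {"name": "worker"},
    "status": {
        "conditions": [
            {"type": "Updated", "status": "False", "message": ""},
            {
                "type": "NodeDegraded",
                "status": "True",
                "message": 'Node worker-1 is reporting: "failed to drain node (5 tries)"',
            },
        ]
    },
}

if __name__ == "__main__":
    for cond_type, message in degraded_messages(SAMPLE_POOL):
        print(f"{cond_type}: {message}")
```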

Version-Release number of selected component (if applicable):
4.8 metal

How reproducible:
100% across 3 jobs I looked at

Steps to Reproduce:
See CI

Actual results:
Fail

Expected results:
Pass

Additional info:

Comment 1 W. Trevor King 2021-04-01 04:45:03 UTC
At least [1] has PodDisruptionBudgetAtLimit firing as well.

[1]: https://prow.ci.openshift.org/view/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.8-e2e-metal-ipi-upgrade/1377260372473417728
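
The alert is relevant because an eviction is refused while the pod's PodDisruptionBudget allows no disruptions, which would explain a drain that keeps timing out. A minimal sketch of that check follows, using the standard PDB status fields (`currentHealthy`, `desiredHealthy`, `disruptionsAllowed`, `expectedPods`). The `at_limit` function is only an approximation of what the alert name suggests, not the alert's actual expression, and the sample values are hypothetical rather than taken from the failing jobs.

```python
def eviction_blocked(pdb_status):
    """True when the budget currently allows no voluntary disruptions."""
    return pdb_status.get("disruptionsAllowed", 0) == 0


def at_limit(pdb_status):
    """Approximation: healthy pods equal the minimum the budget requires."""
    return (
        pdb_status.get("expectedPods", 0) > 0
        and pdb_status.get("currentHealthy") == pdb_status.get("desiredHealthy")
    )


# Hypothetical status values, for illustration only.
SAMPLE_STATUS = {
    "currentHealthy": 1,
    "desiredHealthy": 1,
    "disruptionsAllowed": 0,
    "expectedPods": 2,
}

if __name__ == "__main__":
    print("at limit:", at_limit(SAMPLE_STATUS))
    print("eviction blocked:", eviction_blocked(SAMPLE_STATUS))
```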

*** This bug has been marked as a duplicate of bug 1944762 ***

