Bug 1971602

Summary: e2e-metal-ipi-upgrade for 4.7 to 4.8 is permafailing
Product: OpenShift Container Platform Reporter: Stephen Benjamin <stbenjam>
Component: Bare Metal Hardware Provisioning Assignee: Arda Guclu <aguclu>
Bare Metal Hardware Provisioning sub component: cluster-baremetal-operator QA Contact: Ori Michaeli <omichael>
Status: CLOSED ERRATA Docs Contact:
Severity: high    
Priority: urgent CC: aguclu, aos-bugs, derekh, rbartal
Version: 4.8 Keywords: OtherQA, Triaged
Target Milestone: ---   
Target Release: 4.9.0   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: Doc Type: No Doc Update
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2021-10-18 17:33:59 UTC Type: Bug

Description Stephen Benjamin 2021-06-14 12:25:13 UTC
Description of problem:

Even after the disk space increase, we're still seeing some jobs fail.

See: 
  https://prow.ci.openshift.org/job-history/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.8-upgrade-from-stable-4.7-e2e-metal-ipi-upgrade

https://testgrid.k8s.io/redhat-openshift-ocp-release-4.8-informing#periodic-ci-openshift-release-master-nightly-4.8-upgrade-from-stable-4.7-e2e-metal-ipi-upgrade


Several tests are failing, and some upgrades do not complete. Please investigate.

Note: OCP has a soft 75m time limit for upgrades, and exceeding it is one of the failing tests. The job is often just a little bit over, so either the job needs to find a way to reduce the upgrade time, or you can add an exception as AWS has. This is a soft limit, though, so I don't think it's the root cause of the latest failures.

Comment 1 Derek Higgins 2021-06-15 11:39:04 UTC
(In reply to Stephen Benjamin from comment #0)

Looking at some of the recent failures, all of the jobs that timed out and reported
how far they got (4 of 7) failed at the same point: "568 of 676 done (84% complete)"

"Cluster did not complete upgrade: timed out waiting for the condition: Working towards 4.8.0-0.nightly-2021-06-12-223426: 568 of 676 done (84% complete)"

The other 3 failures varied.
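
To compare stall points across more runs, the progress part of the timeout message quoted above can be extracted and tallied with a short script. This is only a minimal sketch: it assumes the failure messages have already been collected into a list by hand, and the names parse_progress and tally_stall_points are hypothetical helpers, not part of any CI tooling.

```python
import re
from collections import Counter

# Matches the progress part of the timeout message,
# e.g. "568 of 676 done (84% complete)".
PROGRESS_RE = re.compile(r"(\d+) of (\d+) done \((\d+)% complete\)")


def parse_progress(message):
    """Return (done, total, percent) from a timeout message, or None if absent."""
    match = PROGRESS_RE.search(message)
    if match is None:
        return None
    done, total, percent = (int(group) for group in match.groups())
    return done, total, percent


def tally_stall_points(messages):
    """Count how many messages stalled at each (done, total) point."""
    counts = Counter()
    for message in messages:
        progress = parse_progress(message)
        if progress is not None:
            counts[progress[:2]] += 1
    return counts


if __name__ == "__main__":
    # The message quoted in this comment, used as sample input.
    sample = (
        "Cluster did not complete upgrade: timed out waiting for the condition: "
        "Working towards 4.8.0-0.nightly-2021-06-12-223426: "
        "568 of 676 done (84% complete)"
    )
    print(parse_progress(sample))
    print(tally_stall_points([sample, "some unrelated failure"]))
```

Messages without a progress report are skipped rather than counted, which mirrors the split above between the jobs that reported how far they got and the ones that failed in other ways.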

Comment 4 Ori Michaeli 2021-06-22 07:34:00 UTC
This was verified on CI.

Comment 7 errata-xmlrpc 2021-10-18 17:33:59 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.9.0 bug fix and security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:3759