Bug 1971602

Summary:	e2e-metal-ipi-upgrade for 4.7 to 4.8 is permafailing
Product:	OpenShift Container Platform	Reporter:	Stephen Benjamin <stbenjam>
Component:	Bare Metal Hardware Provisioning	Assignee:	Arda Guclu <aguclu>
Bare Metal Hardware Provisioning sub component:	cluster-baremetal-operator	QA Contact:	Ori Michaeli <omichael>
Status:	CLOSED ERRATA	Docs Contact:
Severity:	high
Priority:	urgent	CC:	aguclu, aos-bugs, derekh, rbartal
Version:	4.8	Keywords:	OtherQA, Triaged
Target Milestone:	---
Target Release:	4.9.0
Hardware:	Unspecified
OS:	Unspecified
Whiteboard:
Fixed In Version:		Doc Type:	No Doc Update
Doc Text:		Story Points:	---
Clone Of:		Environment:
Last Closed:	2021-10-18 17:33:59 UTC	Type:	Bug
Regression:	---	Mount Type:	---
Documentation:	---	CRM:
Verified Versions:		Category:	---
oVirt Team:	---	RHEL 7.3 requirements from Atomic Host:
Cloudforms Team:	---	Target Upstream Version:
Embargoed:

Description Stephen Benjamin 2021-06-14 12:25:13 UTC

Description of problem:

Even after the disk space increase, we're still seeing some jobs fail.

See: 
  https://prow.ci.openshift.org/job-history/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.8-upgrade-from-stable-4.7-e2e-metal-ipi-upgrade

https://testgrid.k8s.io/redhat-openshift-ocp-release-4.8-informing#periodic-ci-openshift-release-master-nightly-4.8-upgrade-from-stable-4.7-e2e-metal-ipi-upgrade


Several tests are failing, and some upgrades are not completing. Please investigate.

Note: OCP has a soft 75m time limit for upgrades, and the test that checks it is one of the failing tests. The upgrade is often just a little bit over the limit, so either the job needs to find a way to reduce the upgrade time, or you can add an exception like the one for AWS. This is a soft limit, though, so I don't think it's the root cause of the latest failures.

Comment 1 Derek Higgins 2021-06-15 11:39:04 UTC

(In reply to Stephen Benjamin from comment #0)

Looking at some of the recent failures, all of the jobs that timed out with a
report of how far they got (4 of 7) failed in the same place: "568 of 676 done (84% complete)".

"Cluster did not complete upgrade: timed out waiting for the condition: Working towards 4.8.0-0.nightly-2021-06-12-223426: 568 of 676 done (84% complete)"

The other 3 failures varied.
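
For anyone repeating this comparison across more runs, here is a minimal sketch of how the stall point could be tallied. It assumes only the message format quoted above; the names parse_progress and group_by_progress are hypothetical, and the failure messages would still need to be collected from the job logs by hand or by some other means.

```python
import re
from collections import Counter

# Matches the progress part of the status message quoted above,
# e.g. "568 of 676 done (84% complete)".
PROGRESS_RE = re.compile(r"(\d+) of (\d+) done \((\d+)% complete\)")


def parse_progress(message):
    """Return (done, total, percent) from a failure message, or None if absent."""
    match = PROGRESS_RE.search(message)
    if match is None:
        return None
    done, total, percent = (int(group) for group in match.groups())
    return done, total, percent


def group_by_progress(messages):
    """Count how many failure messages stopped at each progress point.

    Messages without a progress report are counted under the key None.
    """
    return Counter(parse_progress(message) for message in messages)
```

If most timed-out runs land on the same (done, total) pair, that suggests a single stuck step rather than a generally slow upgrade.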

Comment 4 Ori Michaeli 2021-06-22 07:34:00 UTC

This was verified on CI

Comment 7 errata-xmlrpc 2021-10-18 17:33:59 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.9.0 bug fix and security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:3759