Note: This bug is displayed in read-only format because the product is no longer active in Red Hat Bugzilla.

Bug 1984582

Summary:	Metal IPI jobs are failing a high percentage of the time
Product:	OpenShift Container Platform	Reporter:	Stephen Benjamin <stbenjam>
Component:	Installer	Assignee:	Arda Guclu <aguclu>
Installer sub component:	OpenShift on Bare Metal IPI	QA Contact:	Amit Ugol <augol>
Status:	CLOSED ERRATA	Docs Contact:
Severity:	urgent
Priority:	unspecified	CC:	bfournie, bparees, tsedovic, tsze, wking, yboaron
Version:	4.9	Keywords:	OtherQA, Triaged
Target Milestone:	---
Target Release:	4.9.0
Hardware:	Unspecified
OS:	Unspecified
Whiteboard:
Fixed In Version:		Doc Type:	If docs needed, set a value
Doc Text:		Story Points:	---
Clone Of:		Environment:	job=periodic-ci-openshift-release-master-nightly-4.9-e2e-metal-ipi-ovn-dualstack-local-gateway=all job=periodic-ci-openshift-release-master-nightly-4.9-e2e-metal-ipi-ovn-ipv6=all job=periodic-ci-openshift-release-master-nightly-4.9-e2e-metal-ipi=all job=periodic-ci-openshift-release-master-nightly-4.9-e2e-metal-ipi-virtualmedia=all job=periodic-ci-openshift-release-master-nightly-4.8-e2e-metal-ipi-ovn-dualstack-local-gateway=all job=periodic-ci-openshift-release-master-nightly-4.8-e2e-metal-ipi-ovn-ipv6=all job=periodic-ci-openshift-release-master-nightly-4.8-e2e-metal-ipi=all job=periodic-ci-openshift-release-master-nightly-4.8-e2e-metal-ipi-virtualmedia=all
Last Closed:	2021-10-18 17:40:27 UTC	Type:	Bug
Regression:	---	Mount Type:	---
Documentation:	---	CRM:
Verified Versions:		Category:	---
oVirt Team:	---	RHEL 7.3 requirements from Atomic Host:
Cloudforms Team:	---	Target Upstream Version:
Embargoed:

Description Stephen Benjamin 2021-07-21 16:43:13 UTC

Overall pass rates for baremetal are pretty low, in the 30-40% range.

TestGrid looks pretty bad for metal 4.8 and 4.9 CI:

4.8: https://testgrid.k8s.io/redhat-openshift-ocp-release-4.8-blocking#periodic-ci-openshift-release-master-nightly-4.8-e2e-metal-ipi

4.9: https://testgrid.k8s.io/redhat-openshift-ocp-release-4.9-blocking#periodic-ci-openshift-release-master-nightly-4.9-e2e-metal-ipi


That waterfall-like pattern of 'F's indicates that something fails in most runs, but not the same test each time. That has a high likelihood of being platform-related (or at least related to the on-prem networking).

I filed https://bugzilla.redhat.com/show_bug.cgi?id=1974350 about a similar problem. Someone from the metal platform team should dig into the failures and see if it's something similar.
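As a rough illustration of that reasoning (not a tool used by TestGrid or Sippy), the sketch below takes the set of failed tests for each run and checks whether any single test accounts for most of the failing runs. The function names, the 0.5 threshold, and the sample data are all hypothetical.

```python
from collections import Counter


def failure_spread(runs):
    """Map each test name to the fraction of failing runs in which it failed.

    `runs` is a list of sets; each set holds the failed test names of one run.
    """
    failed_runs = [r for r in runs if r]
    if not failed_runs:
        return {}
    counts = Counter(test for r in failed_runs for test in r)
    return {test: n / len(failed_runs) for test, n in counts.items()}


def looks_like_waterfall(runs, threshold=0.5):
    """True when runs fail but no single test explains most of the failures.

    The threshold is an arbitrary assumption chosen for illustration.
    """
    spread = failure_spread(runs)
    return bool(spread) and max(spread.values()) < threshold


# Hypothetical data: a different test fails in each failing run.
scattered = [{"test-a"}, {"test-b"}, set(), {"test-c"}, {"test-d"}]
# Hypothetical data: the same test fails every time.
consistent = [{"test-a"}, {"test-a", "test-b"}, {"test-a"}]
```

With the scattered data no test fails in more than a quarter of the failing runs, which points at the environment rather than at one test; with the consistent data a single test is the obvious suspect.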

Example test failure:
  https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.8-e2e-metal-ipi/1417753887451910144

If you click on "open stdout" from the "local kubeconfig" test, you'll see networking-related problems:

+ oc --kubeconfig /etc/kubernetes/static-pod-resources/kube-apiserver-certs/secrets/node-kubeconfigs/localhost-recovery.kubeconfig get namespace kube-system
The connection to the server localhost:6443 was refused - did you specify the right host or port?

This could definitely be networking-related. I'd try to correlate the time of that message (08:52:22) with logs in the openshift-kni-infra namespace.
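One way to do that correlation is to keep only the log lines whose timestamp falls within a short window around the failure time. The following is a minimal sketch, assuming each log line starts with an RFC 3339 timestamp (the form produced by `oc logs --timestamps`); the helper names, the 30-second window, the date, and the sample lines are hypothetical and are not taken from the actual job.

```python
from datetime import datetime, timedelta


def parse_ts(line):
    """Return the leading timestamp of a log line, or None if there is none.

    Fractional seconds and the trailing 'Z' are dropped; second-level
    precision is enough for this kind of correlation.
    """
    token = line.split(" ", 1)[0]
    base = token.rstrip("Z").split(".", 1)[0]
    try:
        return datetime.strptime(base, "%Y-%m-%dT%H:%M:%S")
    except ValueError:
        return None


def lines_near(lines, target, window_seconds=30):
    """Keep the lines stamped within `window_seconds` of `target`."""
    window = timedelta(seconds=window_seconds)
    selected = []
    for line in lines:
        ts = parse_ts(line)
        if ts is not None and abs(ts - target) <= window:
            selected.append(line)
    return selected


# Hypothetical log lines; only the 08:52:22 time of day comes from the report.
sample = [
    "2021-07-21T08:40:00.000000000Z unrelated earlier message",
    "2021-07-21T08:52:10.123456789Z hypothetical message before the failure",
    "2021-07-21T08:52:40.000000000Z hypothetical message after the failure",
    "a line without a timestamp",
]
# The date is an assumption; the report only gives the time of day.
failure_time = datetime(2021, 7, 21, 8, 52, 22)
nearby = lines_near(sample, failure_time)
```

The input could come from something like `oc logs --timestamps -n openshift-kni-infra <pod>`, where `<pod>` is a placeholder for whichever pod is being examined.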

Comment 4 Tomas Sedovic 2021-07-27 16:09:11 UTC

Arda had a look, assigning to him for now. Looks like this is caused by several different bugs, some of which are being investigated or merged already.

Comment 5 Arda Guclu 2021-08-10 11:11:19 UTC

There are 2 PRs for 2 flaky tests:

https://github.com/openshift/origin/pull/26377 is for kubeconfig tests.

https://github.com/openshift/origin/pull/26385 is for oc explain tests.

After these PRs are merged, these tests are not expected to fail.

Comment 6 Ben Parees 2021-08-18 18:45:37 UTC

I'm removing this job from origin until it's stable, please revert https://github.com/openshift/release/pull/21200 when this BZ is resolved.

Comment 7 Bob Fournier 2021-08-18 18:49:32 UTC

https://github.com/openshift/origin/pull/26377 has merged but there is another, follow-on PR:
https://github.com/openshift/origin/pull/26407

Comment 8 W. Trevor King 2021-08-25 22:06:07 UTC

origin#26407 has now merged as well.  Both PRs from the previous comment link bug 1986003, which is still POST.  Are we waiting for more of those to land before doing more with this bug?

Comment 9 Stephen Benjamin 2021-08-25 22:30:45 UTC

Metal IPI looks pretty healthy[1], and gating jobs are regularly getting through on 1 try.  IMHO, I'd consider this fixed at this point, and TRT can open bugs for anything else we find if it affects metal.

[1] https://sippy.ci.openshift.org/sippy-ng/jobs/4.9?filters=%7B%22items%22%3A%5B%7B%22id%22%3A1%2C%22columnField%22%3A%22current_runs%22%2C%22operatorValue%22%3A%22%3E%3D%22%2C%22value%22%3A%221%22%7D%2C%7B%22id%22%3A99%2C%22columnField%22%3A%22name%22%2C%22operatorValue%22%3A%22contains%22%2C%22value%22%3A%22metal-ipi%22%7D%5D%2C%22linkOperator%22%3A%22and%22%7D&period=twoDay&sort=desc&sortField=current_pass_percentage

Comment 10 Arda Guclu 2021-08-31 07:54:46 UTC

According to the testgrid, https://testgrid.k8s.io/redhat-openshift-ocp-release-4.9-blocking#periodic-ci-openshift-release-master-nightly-4.9-e2e-metal-ipi

"kubeconfig" failures have been resolved. Other test failures are being tracked in https://bugzilla.redhat.com/show_bug.cgi?id=1998643.

I'm closing this bug as verified.

Comment 13 errata-xmlrpc 2021-10-18 17:40:27 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.9.0 bug fix and security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:3759