Bug 1984582

Summary: Metal IPI jobs are failing a high percentage of the time
Product: OpenShift Container Platform Reporter: Stephen Benjamin <stbenjam>
Component: Installer Assignee: Arda Guclu <aguclu>
Installer sub component: OpenShift on Bare Metal IPI QA Contact: Amit Ugol <augol>
Status: CLOSED ERRATA Docs Contact:
Severity: urgent    
Priority: unspecified CC: bfournie, bparees, tsedovic, tsze, wking, yboaron
Version: 4.9 Keywords: OtherQA, Triaged
Target Milestone: ---   
Target Release: 4.9.0   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
job=periodic-ci-openshift-release-master-nightly-4.9-e2e-metal-ipi-ovn-dualstack-local-gateway=all
job=periodic-ci-openshift-release-master-nightly-4.9-e2e-metal-ipi-ovn-ipv6=all
job=periodic-ci-openshift-release-master-nightly-4.9-e2e-metal-ipi=all
job=periodic-ci-openshift-release-master-nightly-4.9-e2e-metal-ipi-virtualmedia=all
job=periodic-ci-openshift-release-master-nightly-4.8-e2e-metal-ipi-ovn-dualstack-local-gateway=all
job=periodic-ci-openshift-release-master-nightly-4.8-e2e-metal-ipi-ovn-ipv6=all
job=periodic-ci-openshift-release-master-nightly-4.8-e2e-metal-ipi=all
job=periodic-ci-openshift-release-master-nightly-4.8-e2e-metal-ipi-virtualmedia=all
Last Closed: 2021-10-18 17:40:27 UTC Type: Bug

Description Stephen Benjamin 2021-07-21 16:43:13 UTC
Overall pass rates for baremetal are low, roughly in the 30-40% range.

TestGrid shows frequent failures for metal 4.8 and 4.9 CI:

4.8: https://testgrid.k8s.io/redhat-openshift-ocp-release-4.8-blocking#periodic-ci-openshift-release-master-nightly-4.8-e2e-metal-ipi

4.9: https://testgrid.k8s.io/redhat-openshift-ocp-release-4.9-blocking#periodic-ci-openshift-release-master-nightly-4.9-e2e-metal-ipi


That waterfall-like pattern of 'F's indicates that something fails in most runs, but not the same test each time. That is highly likely to be platform-related (or at least related to the on-prem networking).
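
One way to put numbers on that pattern is to compare how often a run has any failure with how often the single worst test fails. The sketch below assumes the per-run failures have already been extracted into sets of test names; the function name `failure_spread` and the sample data are hypothetical and not taken from TestGrid.

```python
from collections import Counter


def failure_spread(runs):
    """Summarise failures across job runs.

    runs: list of sets, one per job run, each holding the names of the
    tests that failed in that run (an empty set means the run passed).
    """
    if not runs:
        raise ValueError("at least one run is required")
    total = len(runs)
    # Runs in which at least one test failed.
    failed_runs = sum(1 for failed in runs if failed)
    # How many runs each individual test failed in.
    per_test = Counter(name for failed in runs for name in failed)
    worst_count = max(per_test.values(), default=0)
    return {
        "run_failure_rate": failed_runs / total,
        "worst_test_failure_rate": worst_count / total,
    }


# Hypothetical data: four of five runs fail, each on a different test.
runs = [{"test-a"}, {"test-b"}, set(), {"test-c"}, {"test-d"}]
spread = failure_spread(runs)
print(spread)
```

A high run failure rate combined with a low worst-test failure rate matches the waterfall shape described above, where no single test accounts for the failures.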

I filed https://bugzilla.redhat.com/show_bug.cgi?id=1974350 about a similar problem. Someone from the metal platform team should dig into the failures and see if it's something similar.

Example test failure:
  https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.8-e2e-metal-ipi/1417753887451910144

If you click on "open stdout" from the "local kubeconfig" test, you'll see networking-related problems:

+ oc --kubeconfig /etc/kubernetes/static-pod-resources/kube-apiserver-certs/secrets/node-kubeconfigs/localhost-recovery.kubeconfig get namespace kube-system
The connection to the server localhost:6443 was refused - did you specify the right host or port?

This could well be networking-related. I'd try to correlate the time of that message (08:52:22) with logs in the openshift-kni-infra namespace.
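
A minimal sketch of that correlation, assuming the pod logs were saved with leading RFC 3339 timestamps (for example via `oc logs --timestamps`) and read into a list of lines. The helper name `lines_near`, the 30-second window, the date, and the sample lines are all hypothetical; only the 08:52:22 time comes from the failure above.

```python
from datetime import datetime, timedelta


def lines_near(log_lines, when, window_seconds=30):
    """Return the log lines whose leading timestamp is within
    window_seconds of `when` (before or after)."""
    window = timedelta(seconds=window_seconds)
    matches = []
    for line in log_lines:
        # Keep "YYYY-MM-DDTHH:MM:SS"; drop fractional seconds and zone.
        stamp = line[:19]
        try:
            ts = datetime.strptime(stamp, "%Y-%m-%dT%H:%M:%S")
        except ValueError:
            continue  # line does not start with a timestamp
        if abs(ts - when) <= window:
            matches.append(line)
    return matches


# Hypothetical log lines; the date and messages are placeholders.
sample = [
    "2021-07-21T08:50:00.000000000Z placeholder message 1",
    "2021-07-21T08:52:20.500000000Z placeholder message 2",
    "2021-07-21T08:52:40.000000000Z placeholder message 3",
    "a line without a timestamp",
]
failure_time = datetime(2021, 7, 21, 8, 52, 22)
nearby = lines_near(sample, failure_time)
for line in nearby:
    print(line)
```

The comparison ignores time zones, so the failure time and the log timestamps are assumed to be in the same zone.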

Comment 4 Tomas Sedovic 2021-07-27 16:09:11 UTC
Arda had a look, assigning to him for now. Looks like this is caused by several different bugs, some of which are being investigated or merged already.

Comment 5 Arda Guclu 2021-08-10 11:11:19 UTC
There are 2 PRs for 2 flaky tests:

https://github.com/openshift/origin/pull/26377 is for kubeconfig tests.

https://github.com/openshift/origin/pull/26385 is for oc explain tests.

After these PRs are merged, these tests are not expected to fail.

Comment 6 Ben Parees 2021-08-18 18:45:37 UTC
I'm removing this job from origin until it's stable. Please revert https://github.com/openshift/release/pull/21200 when this BZ is resolved.

Comment 7 Bob Fournier 2021-08-18 18:49:32 UTC
https://github.com/openshift/origin/pull/26377 has merged but there is another, follow-on PR:
https://github.com/openshift/origin/pull/26407

Comment 8 W. Trevor King 2021-08-25 22:06:07 UTC
origin#26407 has now merged as well.  Both PRs from the previous comment link bug 1986003, which is still POST.  Are we waiting for more of those to land before doing more with this bug?

Comment 10 Arda Guclu 2021-08-31 07:54:46 UTC
According to TestGrid, https://testgrid.k8s.io/redhat-openshift-ocp-release-4.9-blocking#periodic-ci-openshift-release-master-nightly-4.9-e2e-metal-ipi

the "kubeconfig" failures have been resolved. Other test failures are tracked in https://bugzilla.redhat.com/show_bug.cgi?id=1998643.

I'm closing this bug as verified.

Comment 13 errata-xmlrpc 2021-10-18 17:40:27 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.9.0 bug fix and security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:3759