Bug 1984582 - Metal IPI jobs are failing a high percentage of the time
Summary: Metal IPI jobs are failing a high percentage of the time
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Installer
Version: 4.9
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: urgent
Target Milestone: ---
Target Release: 4.9.0
Assignee: Arda Guclu
QA Contact: Amit Ugol
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2021-07-21 16:43 UTC by Stephen Benjamin
Modified: 2021-10-18 17:40 UTC (History)

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
job=periodic-ci-openshift-release-master-nightly-4.9-e2e-metal-ipi-ovn-dualstack-local-gateway=all job=periodic-ci-openshift-release-master-nightly-4.9-e2e-metal-ipi-ovn-ipv6=all job=periodic-ci-openshift-release-master-nightly-4.9-e2e-metal-ipi=all job=periodic-ci-openshift-release-master-nightly-4.9-e2e-metal-ipi-virtualmedia=all job=periodic-ci-openshift-release-master-nightly-4.8-e2e-metal-ipi-ovn-dualstack-local-gateway=all job=periodic-ci-openshift-release-master-nightly-4.8-e2e-metal-ipi-ovn-ipv6=all job=periodic-ci-openshift-release-master-nightly-4.8-e2e-metal-ipi=all job=periodic-ci-openshift-release-master-nightly-4.8-e2e-metal-ipi-virtualmedia=all
Last Closed: 2021-10-18 17:40:27 UTC
Target Upstream Version:
Embargoed:




Links
System ID Private Priority Status Summary Last Updated
Red Hat Product Errata RHSA-2021:3759 0 None None None 2021-10-18 17:40:57 UTC

Description Stephen Benjamin 2021-07-21 16:43:13 UTC
Overall pass rates for baremetal are pretty low -- in the 30-40%ish range.

TestGrid looks pretty bad for metal 4.8 and 4.9 CI -- 

4.8: https://testgrid.k8s.io/redhat-openshift-ocp-release-4.8-blocking#periodic-ci-openshift-release-master-nightly-4.8-e2e-metal-ipi

4.9: https://testgrid.k8s.io/redhat-openshift-ocp-release-4.9-blocking#periodic-ci-openshift-release-master-nightly-4.9-e2e-metal-ipi


That waterfall-like pattern of 'F's indicates that something fails in most runs, but not the same test each time. That has a high likelihood of being platform-related (or at least related to the on-prem networking).
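
As a rough illustration of that reasoning (a sketch, not how TestGrid itself computes anything), the snippet below takes the set of failed tests per run and reports how many runs failed and how much of that the single most frequent test accounts for. The function name, the input shape, and the sample data are all hypothetical. A high failure rate combined with a low top-test share is the "waterfall" signature.

```python
from collections import Counter

def summarize_failures(runs):
    """runs: list of sets, each holding the names of tests that failed in one run.

    Returns (fraction of runs with any failure,
             most frequently failing test,
             fraction of failing runs in which that test failed).
    """
    failing = [r for r in runs if r]
    counts = Counter(test for r in failing for test in r)
    fail_rate = len(failing) / len(runs) if runs else 0.0
    top_test, top_count = counts.most_common(1)[0] if counts else (None, 0)
    top_share = top_count / len(failing) if failing else 0.0
    return fail_rate, top_test, top_share

# Hypothetical data: four of five runs fail, but no single test dominates.
runs = [{"a"}, {"b"}, set(), {"c"}, {"a"}]
fail_rate, top_test, top_share = summarize_failures(runs)
```

If one test showed up in nearly every failing run, the top share would be close to 1.0 and a test-specific bug would be the more likely explanation.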

I filed https://bugzilla.redhat.com/show_bug.cgi?id=1974350 about a similar problem; someone from the metal platform team should dig into the failures and see if it's something similar.

Example test failure:
  https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.8-e2e-metal-ipi/1417753887451910144

If you click on "open stdout" from the "local kubeconfig" test, you'll see networking-related problems:

+ oc --kubeconfig /etc/kubernetes/static-pod-resources/kube-apiserver-certs/secrets/node-kubeconfigs/localhost-recovery.kubeconfig get namespace kube-system
The connection to the server localhost:6443 was refused - did you specify the right host or port?

This could definitely be networking-related. I'd try to correlate the time of that message (08:52:22) with logs in the openshift-kni-infra namespace.
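
One way to do that correlation, sketched under some assumptions: the pod logs were saved with timestamps enabled (for example via `oc logs --timestamps`), so each line starts with an RFC 3339 timestamp. The helper name, the 30-second window, the date, and the sample lines are hypothetical; only the 08:52:22 time comes from the failure above.

```python
from datetime import datetime, timedelta

def lines_near(lines, target, window_seconds=30):
    """Return the log lines whose leading timestamp is within window_seconds of target."""
    window = timedelta(seconds=window_seconds)
    matches = []
    for line in lines:
        try:
            # Parse only the "YYYY-MM-DDTHH:MM:SS" prefix; fractional seconds are ignored.
            stamp = datetime.strptime(line[:19], "%Y-%m-%dT%H:%M:%S")
        except ValueError:
            continue  # skip lines that do not start with a timestamp
        if abs(stamp - target) <= window:
            matches.append(line)
    return matches

# Hypothetical log lines; the date is assumed, the time is taken from the failure message.
sample = [
    "2021-07-21T08:51:00.000000000Z hypothetical line well before the failure",
    "2021-07-21T08:52:10.000000000Z hypothetical line shortly before the failure",
    "2021-07-21T08:52:40.000000000Z hypothetical line shortly after the failure",
    "line without a timestamp",
]
target = datetime(2021, 7, 21, 8, 52, 22)
nearby = lines_near(sample, target)
```

Running the same filter over the logs of each pod in the namespace would show what was happening around the moment the API server connection was refused.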

Comment 4 Tomas Sedovic 2021-07-27 16:09:11 UTC
Arda had a look, assigning to him for now. Looks like this is caused by several different bugs, some of which are being investigated or merged already.

Comment 5 Arda Guclu 2021-08-10 11:11:19 UTC
There are 2 PRs for 2 flaky tests:

https://github.com/openshift/origin/pull/26377 is for kubeconfig tests.

https://github.com/openshift/origin/pull/26385 is for oc explain tests.

After these PRs are merged, these tests are not expected to fail.

Comment 6 Ben Parees 2021-08-18 18:45:37 UTC
I'm removing this job from origin until it's stable; please revert https://github.com/openshift/release/pull/21200 when this BZ is resolved.

Comment 7 Bob Fournier 2021-08-18 18:49:32 UTC
https://github.com/openshift/origin/pull/26377 has merged but there is another, follow-on PR:
https://github.com/openshift/origin/pull/26407

Comment 8 W. Trevor King 2021-08-25 22:06:07 UTC
origin#26407 has now merged as well.  Both PRs from the previous comment link bug 1986003, which is still POST.  Are we waiting for more of those to land before doing more with this bug?

Comment 10 Arda Guclu 2021-08-31 07:54:46 UTC
According to the testgrid, https://testgrid.k8s.io/redhat-openshift-ocp-release-4.9-blocking#periodic-ci-openshift-release-master-nightly-4.9-e2e-metal-ipi

"kubeconfig" failures have been resolved. Other test failures are being tracked in https://bugzilla.redhat.com/show_bug.cgi?id=1998643.

I'm closing this bug as verified.

Comment 13 errata-xmlrpc 2021-10-18 17:40:27 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.9.0 bug fix and security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:3759

