Note: This bug is displayed in read-only format because the product is no longer active in Red Hat Bugzilla.

Bug 2018208

Summary: e2e-metal-ipi-ovn-ipv6 are failing 75% of the time
Product: OpenShift Container Platform
Reporter: Stephen Benjamin <stbenjam>
Component: Installer
Assignee: Bob Fournier <bfournie>
Installer sub component: OpenShift on Bare Metal IPI
QA Contact: Amit Ugol <augol>
Status: CLOSED ERRATA
Docs Contact:
Severity: urgent
Priority: unspecified
CC: augol, bfournie, derekh, dgoodwin, dhellmann, jdelft, lshilin, sdasu, vlaad, wking
Version: 4.8
Keywords: DeliveryBlocker, OtherQA, Triaged
Target Milestone: ---
Target Release: 4.10.0
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version:
Doc Type: No Doc Update
Doc Text:
Story Points: ---
Clone Of:
Environment: job=periodic-ci-openshift-release-master-nightly-4.8-e2e-metal-ipi-ovn-ipv6=all
Last Closed: 2022-03-10 16:22:54 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On:
Bug Blocks: 2018005

Description Stephen Benjamin 2021-10-28 14:06:56 UTC
e2e-metal-ipi-ovn-ipv6 jobs are failing 75% of the time. Beginning on 10/26, these failures have caused all release payloads to be rejected, as three runs in a row haven't been able to make it through the gate.

See https://amd64.ocp.releases.ci.openshift.org/#4.8.0-0.nightly


ART manually overrode 4.8.0-0.nightly-2021-10-27-154740 to be accepted so QE can begin their work, but this job needs to be fixed ASAP. Looking at TestGrid, the set of failing tests doesn't appear to be consistent:

https://testgrid.k8s.io/redhat-openshift-ocp-release-4.8-blocking#periodic-ci-openshift-release-master-nightly-4.8-e2e-metal-ipi-ovn-ipv6


"local kubeconfig "lb-ext.kubeconfig" should be present on all masters and work" is the most common which has it's own bug @ https://bugzilla.redhat.com/show_bug.cgi?id=2018005.

Comment 1 Derek Higgins 2021-10-29 12:01:14 UTC
We've noticed that the kube-apiserver restarts a number of times when the cluster comes up. The restarts eventually stop, but not until a few minutes after the openshift-tests have started:

Oct 27 18:45:01.817027 master-1.ostest.test.metalkube.org hyperkube[3482]: I1027 18:45:01.816945    3482 kubelet.go:1944] "SyncLoop ADD" source="file" pods=[openshift-kube-apiserver/kube-apiserver-master-1.ostest.test.metalkube.org]
Oct 27 18:45:01.874693 master-1.ostest.test.metalkube.org hyperkube[3482]: I1027 18:45:01.874641    3482 kubelet.go:1944] "SyncLoop ADD" source="api" pods=[openshift-kube-apiserver/kube-apiserver-master-1.ostest.test.metalkube.org]
Oct 27 18:57:47.751057 master-1.ostest.test.metalkube.org hyperkube[3482]: I1027 18:57:47.751016    3482 kubelet.go:1944] "SyncLoop ADD" source="file" pods=[openshift-kube-apiserver/kube-apiserver-master-1.ostest.test.metalkube.org]
Oct 27 19:00:07.515483 master-1.ostest.test.metalkube.org hyperkube[3482]: I1027 19:00:07.515443    3482 kubelet.go:1944] "SyncLoop ADD" source="api" pods=[openshift-kube-apiserver/kube-apiserver-master-1.ostest.test.metalkube.org]
Oct 27 19:10:30.514918 master-1.ostest.test.metalkube.org hyperkube[3482]: I1027 19:10:30.514760    3482 kubelet.go:1944] "SyncLoop ADD" source="file" pods=[openshift-kube-apiserver/kube-apiserver-master-1.ostest.test.metalkube.org]
Oct 27 19:12:51.504022 master-1.ostest.test.metalkube.org hyperkube[3482]: I1027 19:12:51.503910    3482 kubelet.go:1944] "SyncLoop ADD" source="api" pods=[openshift-kube-apiserver/kube-apiserver-master-1.ostest.test.metalkube.org]
Oct 27 19:19:00.542545 master-1.ostest.test.metalkube.org hyperkube[3482]: I1027 19:19:00.542401    3482 kubelet.go:1944] "SyncLoop ADD" source="file" pods=[openshift-kube-apiserver/kube-apiserver-master-1.ostest.test.metalkube.org]
Oct 27 19:21:13.508735 master-1.ostest.test.metalkube.org hyperkube[3482]: I1027 19:21:13.508686    3482 kubelet.go:1944] "SyncLoop ADD" source="api" pods=[openshift-kube-apiserver/kube-apiserver-master-1.ostest.test.metalkube.org]

The same thing happens on the AWS jobs, but there it doesn't go on as long and has settled down well before the tests start.
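One way to quantify the restart churn is to count how often the kubelet re-adds the static pod from its on-disk manifest (the source="file" events in the excerpt above). A minimal sketch, assuming the node's journal output has been saved to a file; the helper name and filename are our own:

```shell
# Count kube-apiserver static-pod re-adds seen by the kubelet.
# source="file" means the pod was (re)read from the on-disk manifest,
# so any count above 1 indicates the manifest was rewritten/restarted.
count_api_readds() {
  grep -c 'SyncLoop ADD.*source="file".*kube-apiserver' "$1"
}
```

For the journal excerpt above this reports 4 re-adds over roughly 35 minutes.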

I've proposed a 10-minute delay to the tests (https://github.com/openshift/release/pull/23131).
We can leave it in place for a few days and see if it improves things;
in the meantime we'll confirm why it's happening and whether it's expected.

Comment 2 Bob Fournier 2021-11-02 14:40:14 UTC
With https://github.com/openshift/release/pull/23131 merged, things seem to be marginally better: 3 out of 5 runs (60%) have passed in the last 2 days, and the failures are different from what we saw earlier (https://testgrid.k8s.io/redhat-openshift-ocp-release-4.8-blocking#periodic-ci-openshift-release-master-nightly-4.8-e2e-metal-ipi-ovn-ipv6).

Comment 3 Derek Higgins 2021-11-03 09:38:03 UTC
(In reply to Stephen Benjamin from comment #0)
> e2e-metal-ipi-ovn-ipv6 jobs are failing 75% of the time. 

(In reply to Derek Higgins from comment #1)
> I've proposed a 10 minute delay to the
> tests(https://github.com/openshift/release/pull/23131),
> we can leave it in place for a few days and see if it improves things,
> in the mean time we'll confirm why its happening and if its expected

Since adding the 10m sleep, the failure rate has dropped to about 25% (based on 13 nightly job runs).
I've proposed a replacement that instead waits until the clusteroperators stop progressing (this is already done in the other e2e jobs).
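The replacement gate can be sketched with `oc wait`. This is a minimal sketch of the idea, not the actual contents of the proposed change; the wrapper function name and the 30m timeout are our own illustrative choices:

```shell
# Instead of a fixed sleep, block until no ClusterOperator reports
# Progressing=True. Assumes `oc` and a kubeconfig for the cluster are
# available in the environment.
wait_for_stable_clusteroperators() {
  oc wait clusteroperators --all \
    --for=condition=Progressing=False \
    --timeout="${1:-30m}"
}
```

Unlike a sleep, this returns as soon as the operators settle, and fails the job if they never do within the timeout.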

Comment 7 sdasu 2021-11-08 20:39:16 UTC
Opened https://bugzilla.redhat.com/show_bug.cgi?id=2021324 to track the "oc explain" failures.

Comment 8 Devan Goodwin 2021-11-10 14:31:07 UTC
We updated the PR to link up with this bug; it has now auto-moved to MODIFIED and will hopefully soon be ON_QE.

Comment 14 errata-xmlrpc 2022-03-10 16:22:54 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.10.3 security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:0056