e2e-metal-ipi-ovn-ipv6 jobs are failing 75% of the time. Beginning on 10/26, these jobs began causing all release payloads to be rejected, as 3 runs haven't been able to make it through the gate. See https://amd64.ocp.releases.ci.openshift.org/#4.8.0-0.nightly

ART manually overrode 4.8.0-0.nightly-2021-10-27-154740 to be accepted so QE can begin their work, but this job needs to be fixed ASAP.

Looking at TestGrid, the set of failing tests isn't consistent: https://testgrid.k8s.io/redhat-openshift-ocp-release-4.8-blocking#periodic-ci-openshift-release-master-nightly-4.8-e2e-metal-ipi-ovn-ipv6

The most common failure is "local kubeconfig "lb-ext.kubeconfig" should be present on all masters and work", which has its own bug at https://bugzilla.redhat.com/show_bug.cgi?id=2018005.
We've noticed that the kube-apiserver restarts a number of times while the cluster comes up. This stops after a while, but only a few minutes after openshift-tests has started:

Oct 27 18:45:01.817027 master-1.ostest.test.metalkube.org hyperkube[3482]: I1027 18:45:01.816945 3482 kubelet.go:1944] "SyncLoop ADD" source="file" pods=[openshift-kube-apiserver/kube-apiserver-master-1.ostest.test.metalkube.org]
Oct 27 18:45:01.874693 master-1.ostest.test.metalkube.org hyperkube[3482]: I1027 18:45:01.874641 3482 kubelet.go:1944] "SyncLoop ADD" source="api" pods=[openshift-kube-apiserver/kube-apiserver-master-1.ostest.test.metalkube.org]
Oct 27 18:57:47.751057 master-1.ostest.test.metalkube.org hyperkube[3482]: I1027 18:57:47.751016 3482 kubelet.go:1944] "SyncLoop ADD" source="file" pods=[openshift-kube-apiserver/kube-apiserver-master-1.ostest.test.metalkube.org]
Oct 27 19:00:07.515483 master-1.ostest.test.metalkube.org hyperkube[3482]: I1027 19:00:07.515443 3482 kubelet.go:1944] "SyncLoop ADD" source="api" pods=[openshift-kube-apiserver/kube-apiserver-master-1.ostest.test.metalkube.org]
Oct 27 19:10:30.514918 master-1.ostest.test.metalkube.org hyperkube[3482]: I1027 19:10:30.514760 3482 kubelet.go:1944] "SyncLoop ADD" source="file" pods=[openshift-kube-apiserver/kube-apiserver-master-1.ostest.test.metalkube.org]
Oct 27 19:12:51.504022 master-1.ostest.test.metalkube.org hyperkube[3482]: I1027 19:12:51.503910 3482 kubelet.go:1944] "SyncLoop ADD" source="api" pods=[openshift-kube-apiserver/kube-apiserver-master-1.ostest.test.metalkube.org]
Oct 27 19:19:00.542545 master-1.ostest.test.metalkube.org hyperkube[3482]: I1027 19:19:00.542401 3482 kubelet.go:1944] "SyncLoop ADD" source="file" pods=[openshift-kube-apiserver/kube-apiserver-master-1.ostest.test.metalkube.org]
Oct 27 19:21:13.508735 master-1.ostest.test.metalkube.org hyperkube[3482]: I1027 19:21:13.508686 3482 kubelet.go:1944] "SyncLoop ADD" source="api" pods=[openshift-kube-apiserver/kube-apiserver-master-1.ostest.test.metalkube.org]

The same thing happens on the AWS jobs, but in that case it doesn't go on for as long and has settled down long before the tests start.

I've proposed a 10 minute delay to the tests (https://github.com/openshift/release/pull/23131); we can leave it in place for a few days and see if it improves things. In the meantime we'll confirm why it's happening and whether it's expected.
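For reference, a minimal sketch of the mitigation (the real change is in the openshift/release PR above; where exactly the step lands in the job config is not shown here), plus a hypothetical one-liner for counting how often the kubelet re-added the kube-apiserver static pod on a master:

  #!/bin/bash
  # Sketch only: give the control plane time to settle before
  # openshift-tests starts. 600s matches the proposed 10 minute delay.
  sleep 600

  # Hypothetical diagnostic (run on a master): count kube-apiserver
  # "SyncLoop ADD" events in the kubelet journal.
  journalctl -u kubelet --no-pager \
    | grep '"SyncLoop ADD"' \
    | grep -c 'openshift-kube-apiserver/kube-apiserver'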
With https://github.com/openshift/release/pull/23131 merged, things seem to be marginally better: 3 out of 5 runs (60%) have passed in the last 2 days, and the failures are different from what we saw earlier (https://testgrid.k8s.io/redhat-openshift-ocp-release-4.8-blocking#periodic-ci-openshift-release-master-nightly-4.8-e2e-metal-ipi-ovn-ipv6).
(In reply to Stephen Benjamin from comment #0)
> e2e-metal-ipi-ovn-ipv6 jobs are failing 75% of the time.

(In reply to Derek Higgins from comment #1)
> I've proposed a 10 minute delay to the tests
> (https://github.com/openshift/release/pull/23131); we can leave it in
> place for a few days and see if it improves things. In the meantime
> we'll confirm why it's happening and whether it's expected.

Since adding the 10m sleep, the failure rate has dropped to about 25% (based on 13 nightly job runs). I've proposed a replacement that instead waits until the clusteroperators stop progressing (this is already done in the other e2e jobs); a sketch is below.
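A minimal sketch of what the replacement wait could look like, assuming the step has a working kubeconfig; the timeout value is an assumption and the actual PR may differ:

  #!/bin/bash
  # Sketch: instead of a fixed sleep, block until every clusteroperator
  # reports Progressing=False (the 30m timeout is an assumption).
  oc wait clusteroperators --all \
    --for=condition=Progressing=false \
    --timeout=30m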
Opened https://bugzilla.redhat.com/show_bug.cgi?id=2021324 to track the "oc explain" failures.
We updated the PR to link up with this bug; it has now been auto-moved to MODIFIED and will hopefully soon be ON_QA.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.10.3 security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:0056