e2e-metal-ipi-ovn-ipv6 jobs are failing 75% of the time. Beginning on 10/26, these jobs began causing all release payloads to be rejected, as 3 runs haven't been able to make it through the gate. See https://amd64.ocp.releases.ci.openshift.org/#4.8.0-0.nightly

ART manually overrode 4.8.0-0.nightly-2021-10-27-154740 to be accepted so QE can begin their work, but this job needs to be fixed ASAP.

Looking at TestGrid, the set of failing tests isn't consistent: https://testgrid.k8s.io/redhat-openshift-ocp-release-4.8-blocking#periodic-ci-openshift-release-master-nightly-4.8-e2e-metal-ipi-ovn-ipv6

The most common failure is "local kubeconfig "lb-ext.kubeconfig" should be present on all masters and work", which has its own bug at https://bugzilla.redhat.com/show_bug.cgi?id=2018005.
We've noticed that the kube-apiserver restarts a number of times while the cluster comes up. This stops after a while, but only a few minutes after openshift-tests has started:

Oct 27 18:45:01.817027 master-1.ostest.test.metalkube.org hyperkube[3482]: I1027 18:45:01.816945 3482 kubelet.go:1944] "SyncLoop ADD" source="file" pods=[openshift-kube-apiserver/kube-apiserver-master-1.ostest.test.metalkube.org]
Oct 27 18:45:01.874693 master-1.ostest.test.metalkube.org hyperkube[3482]: I1027 18:45:01.874641 3482 kubelet.go:1944] "SyncLoop ADD" source="api" pods=[openshift-kube-apiserver/kube-apiserver-master-1.ostest.test.metalkube.org]
Oct 27 18:57:47.751057 master-1.ostest.test.metalkube.org hyperkube[3482]: I1027 18:57:47.751016 3482 kubelet.go:1944] "SyncLoop ADD" source="file" pods=[openshift-kube-apiserver/kube-apiserver-master-1.ostest.test.metalkube.org]
Oct 27 19:00:07.515483 master-1.ostest.test.metalkube.org hyperkube[3482]: I1027 19:00:07.515443 3482 kubelet.go:1944] "SyncLoop ADD" source="api" pods=[openshift-kube-apiserver/kube-apiserver-master-1.ostest.test.metalkube.org]
Oct 27 19:10:30.514918 master-1.ostest.test.metalkube.org hyperkube[3482]: I1027 19:10:30.514760 3482 kubelet.go:1944] "SyncLoop ADD" source="file" pods=[openshift-kube-apiserver/kube-apiserver-master-1.ostest.test.metalkube.org]
Oct 27 19:12:51.504022 master-1.ostest.test.metalkube.org hyperkube[3482]: I1027 19:12:51.503910 3482 kubelet.go:1944] "SyncLoop ADD" source="api" pods=[openshift-kube-apiserver/kube-apiserver-master-1.ostest.test.metalkube.org]
Oct 27 19:19:00.542545 master-1.ostest.test.metalkube.org hyperkube[3482]: I1027 19:19:00.542401 3482 kubelet.go:1944] "SyncLoop ADD" source="file" pods=[openshift-kube-apiserver/kube-apiserver-master-1.ostest.test.metalkube.org]
Oct 27 19:21:13.508735 master-1.ostest.test.metalkube.org hyperkube[3482]: I1027 19:21:13.508686 3482 kubelet.go:1944] "SyncLoop ADD" source="api" pods=[openshift-kube-apiserver/kube-apiserver-master-1.ostest.test.metalkube.org]

The same thing happens on the AWS jobs, but in that case it doesn't go on for as long and has settled down long before the tests start.

I've proposed a 10 minute delay to the tests (https://github.com/openshift/release/pull/23131); we can leave it in place for a few days and see if it improves things. In the meantime we'll confirm why it's happening and whether it's expected.
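For reference, a minimal sketch of the mitigation (the real change is in the openshift/release PR above; where exactly the step lands in the job config is not shown here), plus a hypothetical one-liner for counting how often the kubelet re-added the kube-apiserver static pod on a master:

  #!/bin/bash
  # Sketch only: give the control plane time to settle before
  # openshift-tests starts. 600s matches the proposed 10 minute delay.
  sleep 600

  # Hypothetical diagnostic (run on a master): count kube-apiserver
  # "SyncLoop ADD" events in the kubelet journal.
  journalctl -u kubelet --no-pager \
    | grep '"SyncLoop ADD"' \
    | grep -c 'openshift-kube-apiserver/kube-apiserver'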
With https://github.com/openshift/release/pull/23131 merged, things seem to be marginally better: 3 out of 5 runs (60%) have passed in the last 2 days, and the failures are different from what we saw earlier (https://testgrid.k8s.io/redhat-openshift-ocp-release-4.8-blocking#periodic-ci-openshift-release-master-nightly-4.8-e2e-metal-ipi-ovn-ipv6).
(In reply to Stephen Benjamin from comment #0)
> e2e-metal-ipi-ovn-ipv6 jobs are failing 75% of the time.

(In reply to Derek Higgins from comment #1)
> I've proposed a 10 minute delay to the tests
> (https://github.com/openshift/release/pull/23131); we can leave it in
> place for a few days and see if it improves things. In the meantime
> we'll confirm why it's happening and whether it's expected.

Since adding the 10m sleep, the failure rate has dropped to about 25% (based on 13 nightly job runs). I've proposed a replacement that instead waits until the clusteroperators stop progressing (this is already done in the other e2e jobs); a sketch is below.
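A minimal sketch of what the replacement wait could look like, assuming the step has a working kubeconfig; the timeout value is an assumption and the actual PR may differ:

  #!/bin/bash
  # Sketch: instead of a fixed sleep, block until every clusteroperator
  # reports Progressing=False (the 30m timeout is an assumption).
  oc wait clusteroperators --all \
    --for=condition=Progressing=false \
    --timeout=30m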
Opened https://bugzilla.redhat.com/show_bug.cgi?id=2021324 to track the "oc explain" failures.
We updated the PR to link up with this bug; it has now been auto-moved to MODIFIED and will hopefully soon be ON_QA.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.10.3 security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:0056