Bug 2018208
| Summary: | e2e-metal-ipi-ovn-ipv6 are failing 75% of the time | ||
|---|---|---|---|
| Product: | OpenShift Container Platform | Reporter: | Stephen Benjamin <stbenjam> |
| Component: | Installer | Assignee: | Bob Fournier <bfournie> |
| Installer sub component: | OpenShift on Bare Metal IPI | QA Contact: | Amit Ugol <augol> |
| Status: | CLOSED ERRATA | Docs Contact: | |
| Severity: | urgent | ||
| Priority: | unspecified | CC: | augol, bfournie, derekh, dgoodwin, dhellmann, jdelft, lshilin, sdasu, vlaad, wking |
| Version: | 4.8 | Keywords: | DeliveryBlocker, OtherQA, Triaged |
| Target Milestone: | --- | ||
| Target Release: | 4.10.0 | ||
| Hardware: | Unspecified | ||
| OS: | Unspecified | ||
| Whiteboard: | |||
| Fixed In Version: | | Doc Type: | No Doc Update |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | job=periodic-ci-openshift-release-master-nightly-4.8-e2e-metal-ipi-ovn-ipv6=all |
| Last Closed: | 2022-03-10 16:22:54 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | |||
| Bug Depends On: | |||
| Bug Blocks: | 2018005 | ||
Description
Stephen Benjamin
2021-10-28 14:06:56 UTC
We've noticed that the kube-apiserver restarts a number of times when the cluster comes up. This stops after a while, but not until a few minutes after the openshift-tests have started:

Oct 27 18:45:01.817027 master-1.ostest.test.metalkube.org hyperkube[3482]: I1027 18:45:01.816945 3482 kubelet.go:1944] "SyncLoop ADD" source="file" pods=[openshift-kube-apiserver/kube-apiserver-master-1.ostest.test.metalkube.org]
Oct 27 18:45:01.874693 master-1.ostest.test.metalkube.org hyperkube[3482]: I1027 18:45:01.874641 3482 kubelet.go:1944] "SyncLoop ADD" source="api" pods=[openshift-kube-apiserver/kube-apiserver-master-1.ostest.test.metalkube.org]
Oct 27 18:57:47.751057 master-1.ostest.test.metalkube.org hyperkube[3482]: I1027 18:57:47.751016 3482 kubelet.go:1944] "SyncLoop ADD" source="file" pods=[openshift-kube-apiserver/kube-apiserver-master-1.ostest.test.metalkube.org]
Oct 27 19:00:07.515483 master-1.ostest.test.metalkube.org hyperkube[3482]: I1027 19:00:07.515443 3482 kubelet.go:1944] "SyncLoop ADD" source="api" pods=[openshift-kube-apiserver/kube-apiserver-master-1.ostest.test.metalkube.org]
Oct 27 19:10:30.514918 master-1.ostest.test.metalkube.org hyperkube[3482]: I1027 19:10:30.514760 3482 kubelet.go:1944] "SyncLoop ADD" source="file" pods=[openshift-kube-apiserver/kube-apiserver-master-1.ostest.test.metalkube.org]
Oct 27 19:12:51.504022 master-1.ostest.test.metalkube.org hyperkube[3482]: I1027 19:12:51.503910 3482 kubelet.go:1944] "SyncLoop ADD" source="api" pods=[openshift-kube-apiserver/kube-apiserver-master-1.ostest.test.metalkube.org]
Oct 27 19:19:00.542545 master-1.ostest.test.metalkube.org hyperkube[3482]: I1027 19:19:00.542401 3482 kubelet.go:1944] "SyncLoop ADD" source="file" pods=[openshift-kube-apiserver/kube-apiserver-master-1.ostest.test.metalkube.org]
Oct 27 19:21:13.508735 master-1.ostest.test.metalkube.org hyperkube[3482]: I1027 19:21:13.508686 3482 kubelet.go:1944] "SyncLoop ADD" source="api" pods=[openshift-kube-apiserver/kube-apiserver-master-1.ostest.test.metalkube.org]

The same thing happens on the AWS jobs, but there it doesn't go on for as long and has settled down long before the tests start.

I've proposed a 10 minute delay to the tests (https://github.com/openshift/release/pull/23131). We can leave it in place for a few days and see if it improves things. In the meantime we'll confirm why it's happening and whether it's expected.

With https://github.com/openshift/release/pull/23131 merged, things seem to be marginally better: 3 out of 5 (60%) have passed in the last 2 days, and the failures are different from what we saw earlier (https://testgrid.k8s.io/redhat-openshift-ocp-release-4.8-blocking#periodic-ci-openshift-release-master-nightly-4.8-e2e-metal-ipi-ovn-ipv6).

(In reply to Stephen Benjamin from comment #0)
> e2e-metal-ipi-ovn-ipv6 jobs are failing 75% of the time.

(In reply to Derek Higgins from comment #1)
> I've proposed a 10 minute delay to the
> tests(https://github.com/openshift/release/pull/23131),
> we can leave it in place for a few days and see if it improves things,
> in the mean time we'll confirm why its happening and if its expected

Since adding the 10m sleep, the failure rate has dropped to about 25% (based on 13 nightly job runs). I've proposed a replacement that instead waits until the clusteroperators stop progressing (this is already done in the other e2e jobs).

Opened https://bugzilla.redhat.com/show_bug.cgi?id=2021324 to track the "oc explain" failures.

We updated the PR to link up with this bug; it has now been automatically moved to MODIFIED and will hopefully soon be ON_QA.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.10.3 security update), and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:0056
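
The replacement for the fixed 10 minute sleep discussed in the comments (waiting until the clusteroperators stop progressing) could look roughly like the sketch below. This is not the code from the linked pull request, which is not quoted in this report. It assumes that `oc` is on the PATH and logged in to the cluster; the function names, the number of consecutive quiet polls, and the polling interval are all hypothetical choices made for illustration.

```python
import json
import subprocess
import time
from typing import Callable, Dict, List


def progressing_operators(cluster_operators: Dict) -> List[str]:
    """Return the names of ClusterOperators whose Progressing condition is True."""
    names = []
    for item in cluster_operators.get("items", []):
        for cond in item.get("status", {}).get("conditions", []):
            if cond.get("type") == "Progressing" and cond.get("status") == "True":
                names.append(item["metadata"]["name"])
    return names


def fetch_with_oc() -> Dict:
    """Fetch all ClusterOperators as JSON (assumes `oc` is installed and logged in)."""
    result = subprocess.run(
        ["oc", "get", "clusteroperators", "-o", "json"],
        check=True,
        capture_output=True,
        text=True,
    )
    return json.loads(result.stdout)


def wait_until_settled(
    fetch: Callable[[], Dict] = fetch_with_oc,
    stable_polls: int = 3,      # hypothetical: quiet polls required in a row
    interval_s: float = 30.0,   # hypothetical: seconds between polls
    max_polls: int = 60,        # hypothetical: give up after this many polls
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Return True once no operator is Progressing for `stable_polls` consecutive polls.

    Returns False if that does not happen within `max_polls` polls.
    """
    quiet = 0
    for _ in range(max_polls):
        if progressing_operators(fetch()):
            quiet = 0  # something is still rolling out; start counting again
        else:
            quiet += 1
            if quiet >= stable_polls:
                return True
        sleep(interval_s)
    return False
```

A test step would call `wait_until_settled()` before launching openshift-tests and fail early if it returns `False`. Requiring several consecutive quiet polls, rather than a single one, is meant to cover the pattern in the kubelet log above, where the kube-apiserver pod is re-added several minutes after the previous rollout.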