test: Cluster frontend ingress remain available is failing frequently in CI; see search results:

https://search.svc.ci.openshift.org/?maxAge=168h&context=1&type=bug%2Bjunit&name=&maxMatches=5&maxBytes=20971520&groupBy=job&search=Cluster+frontend+ingress+remain+available

For instance:

https://deck-ci.apps.ci.l2s4.p1.openshiftapps.com/view/gs/origin-ci-test/pr-logs/pull/25188/pull-ci-openshift-origin-master-e2e-gcp-upgrade/1275394971154780160

There are several errors in this test similar to:

  Jun 23 12:51:53.917 - 5s E ns/openshift-authentication route/oauth-openshift Route is not responding to GET requests over new connections
Checking the failure message from the reference job [1]:

  Frontends were unreachable during disruption for at least 3m13s of 35m29s (9%)

this is currently sufficient to pass the test/job, although it is not considered completely correct. But:

* PR presubmit jobs are noisy, and it's possible that something changed by that PR broke this behavior, in which case it's the PR author's problem.
* As the failure message says, the test-case failed, but not by enough to fail the job. That job actually failed on:

    fail [github.com/openshift/origin/test/extended/util/disruption/disruption.go:237]: Jun 23 13:17:47.218: Service was unreachable during disruption for at least 53s of 33m45s (3%):

  out of the 'Application behind service load balancer with PDB is not disrupted' test-case, and that's bug 1828858.

However, there are jobs that are actually failing on frontend reachability:

  $ w3m -dump -cols 200 'https://search.svc.ci.openshift.org/?search=fail.*Frontends were unreachable during disruption for at least&type=junit&maxAge=336h&name=release-openshift-' | grep 'failures match'
  release-openshift-origin-installer-e2e-aws-upgrade - 1419 runs, 28% failed, 2% of failures match
  release-openshift-origin-installer-launch-aws - 1032 runs, 56% failed, 0% of failures match
  release-openshift-origin-installer-e2e-gcp-upgrade - 196 runs, 25% failed, 6% of failures match
  release-openshift-origin-installer-e2e-aws-upgrade-4.2-to-4.3 - 19 runs, 37% failed, 14% of failures match
  release-openshift-origin-installer-e2e-aws-upgrade-4.2-nightly-to-4.3-nightly - 18 runs, 33% failed, 17% of failures match
  release-openshift-okd-installer-e2e-aws-upgrade - 568 runs, 37% failed, 0% of failures match

For example, 4.2.36 -> 4.3.0-0.ci-2020-06-26-162041 [2] with:

  fail [github.com/openshift/origin/test/extended/util/disruption/disruption.go:226]: Jun 26 17:51:17.340: Frontends were unreachable during disruption for at least 10m21s of 48m4s (22%):

or 4.6.0-0.ci-2020-06-23-223142 -> 4.6.0-0.ci-2020-06-24-003142 [3] with:

  fail [github.com/openshift/origin/test/extended/util/disruption/disruption.go:237]: Jun 24 01:47:09.073: Frontends were unreachable during disruption for at least 6m22s of 29m23s (22%):

[1]: https://deck-ci.apps.ci.l2s4.p1.openshiftapps.com/view/gs/origin-ci-test/pr-logs/pull/25188/pull-ci-openshift-origin-master-e2e-gcp-upgrade/1275394971154780160
[2]: https://deck-ci.apps.ci.l2s4.p1.openshiftapps.com/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade/1276551522351583232
[3]: https://deck-ci.apps.ci.l2s4.p1.openshiftapps.com/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-gcp-upgrade/1275587702091157504
I’m adding UpcomingSprint, because I was occupied by fixing bugs with higher priority/severity, developing new features with higher priority, or developing new features to improve stability at a macro level. I will revisit this bug next sprint.
https://bugzilla.redhat.com/show_bug.cgi?id=1848264

We are facing a similar issue on ppc64le hardware as well, but the disruption level on the Power platform (>30%) is quite high compared to x86_64 (22%).

Error message:

  fail [github.com/openshift/origin/test/extended/util/disruption/disruption.go:226]: Jun 5 04:04:58.773: Frontends were unreachable during disruption for at least 16m30s of 47m16s (35%):
This appears to be a manifestation of bug 1809668. *** This bug has been marked as a duplicate of bug 1809668 ***