Bug 1850074

Summary: Cluster frontend ingress remain available
Product: OpenShift Container Platform
Reporter: Christian Huffman <chuffman>
Component: Networking
Assignee: Miciah Dashiel Butler Masters <mmasters>
Networking sub component: router
QA Contact: Hongan Li <hongli>
Status: CLOSED DUPLICATE
Severity: urgent    
Priority: urgent
CC: aconstan, amurdaca, aos-bugs, bkhadars, danili, kgarriso, wking
Version: 4.3.z   
Target Milestone: ---   
Target Release: 4.6.0   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard: non-multi-arch
Last Closed: 2020-08-20 16:01:15 UTC
Type: Bug

Description Christian Huffman 2020-06-23 13:48:41 UTC
test: Cluster frontend ingress remain available is failing frequently in CI, see search results:
https://search.svc.ci.openshift.org/?maxAge=168h&context=1&type=bug%2Bjunit&name=&maxMatches=5&maxBytes=20971520&groupBy=job&search=Cluster+frontend+ingress+remain+available

  For instance - https://deck-ci.apps.ci.l2s4.p1.openshiftapps.com/view/gs/origin-ci-test/pr-logs/pull/25188/pull-ci-openshift-origin-master-e2e-gcp-upgrade/1275394971154780160

There are several errors in this test similar to:

  Jun 23 12:51:53.917 - 5s    E ns/openshift-authentication route/oauth-openshift Route is not responding to GET requests over new connections

Comment 3 W. Trevor King 2020-06-29 18:23:09 UTC
Checking the failure message from the reference job [1]:

  Frontends were unreachable during disruption for at least 3m13s of 35m29s (9%), this is currently sufficient to pass the test/job but not considered completely correct:

But:

* PR presubmit jobs are noisy, and it's possible that a change in that PR broke this behavior, in which case it's the PR author's problem.
* As the failure message says, the test-case failed, but not by enough to fail the job.  That job actually failed on:

    fail [github.com/openshift/origin/test/extended/util/disruption/disruption.go:237]: Jun 23 13:17:47.218: Service was unreachable during disruption for at least 53s of 33m45s (3%):

  out of the 'Application behind service load balancer with PDB is not disrupted' test-case, and that's bug 1828858.

However, there are jobs that are actually failing on frontend reachability:

$ w3m -dump -cols 200 'https://search.svc.ci.openshift.org/?search=fail.*Frontends were unreachable during disruption for at least&type=junit&maxAge=336h&name=release-openshift-' | grep 'failures match'
release-openshift-origin-installer-e2e-aws-upgrade - 1419 runs, 28% failed, 2% of failures match
release-openshift-origin-installer-launch-aws - 1032 runs, 56% failed, 0% of failures match
release-openshift-origin-installer-e2e-gcp-upgrade - 196 runs, 25% failed, 6% of failures match
release-openshift-origin-installer-e2e-aws-upgrade-4.2-to-4.3 - 19 runs, 37% failed, 14% of failures match
release-openshift-origin-installer-e2e-aws-upgrade-4.2-nightly-to-4.3-nightly - 18 runs, 33% failed, 17% of failures match
release-openshift-okd-installer-e2e-aws-upgrade - 568 runs, 37% failed, 0% of failures match

For example 4.2.36 -> 4.3.0-0.ci-2020-06-26-162041 [2] with:

  fail [github.com/openshift/origin/test/extended/util/disruption/disruption.go:226]: Jun 26 17:51:17.340: Frontends were unreachable during disruption for at least 10m21s of 48m4s (22%):

or 4.6.0-0.ci-2020-06-23-223142 -> 4.6.0-0.ci-2020-06-24-003142 [3] with:

  fail [github.com/openshift/origin/test/extended/util/disruption/disruption.go:237]: Jun 24 01:47:09.073: Frontends were unreachable during disruption for at least 6m22s of 29m23s (22%):

[1]: https://deck-ci.apps.ci.l2s4.p1.openshiftapps.com/view/gs/origin-ci-test/pr-logs/pull/25188/pull-ci-openshift-origin-master-e2e-gcp-upgrade/1275394971154780160
[2]: https://deck-ci.apps.ci.l2s4.p1.openshiftapps.com/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade/1276551522351583232
[3]: https://deck-ci.apps.ci.l2s4.p1.openshiftapps.com/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-gcp-upgrade/1275587702091157504

Comment 5 Andrew McDermott 2020-07-09 12:11:41 UTC
I’m adding UpcomingSprint, because I was occupied by fixing bugs with
higher priority/severity, developing new features with higher
priority, or developing new features to improve stability at a macro
level. I will revisit this bug next sprint.

Comment 6 Basheer 2020-07-10 05:14:43 UTC
https://bugzilla.redhat.com/show_bug.cgi?id=1848264

We are facing a similar issue on ppc64le hardware as well, but the disruption levels on the Power platform (>30%) are quite high compared to x86_64 (22%).

Error message:
fail [github.com/openshift/origin/test/extended/util/disruption/disruption.go:226]: Jun  5 04:04:58.773: Frontends were unreachable during disruption for at least 16m30s of 47m16s (35%):

Comment 7 Andrew McDermott 2020-07-30 10:06:52 UTC
I’m adding UpcomingSprint, because I was occupied by fixing bugs with
higher priority/severity, developing new features with higher
priority, or developing new features to improve stability at a macro
level. I will revisit this bug next sprint.

Comment 8 Miciah Dashiel Butler Masters 2020-08-20 16:01:15 UTC
This appears to be a manifestation of bug 1809668.

*** This bug has been marked as a duplicate of bug 1809668 ***