Bug 1850074

Summary: Cluster frontend ingress remain available
Product: OpenShift Container Platform
Reporter: Christian Huffman <chuffman>
Component: Networking
Assignee: Miciah Dashiel Butler Masters <mmasters>
Networking sub component: router
QA Contact: Hongan Li <hongli>
Status: CLOSED DUPLICATE
Severity: urgent    
Priority: urgent
CC: aconstan, amurdaca, aos-bugs, bkhadars, danili, kgarriso, wking
Version: 4.3.z   
Target Milestone: ---   
Target Release: 4.6.0   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard: non-multi-arch
Last Closed: 2020-08-20 16:01:15 UTC
Type: Bug

Description Christian Huffman 2020-06-23 13:48:41 UTC
test: Cluster frontend ingress remain available is failing frequently in CI, see search results:
https://search.svc.ci.openshift.org/?maxAge=168h&context=1&type=bug%2Bjunit&name=&maxMatches=5&maxBytes=20971520&groupBy=job&search=Cluster+frontend+ingress+remain+available

  For instance - https://deck-ci.apps.ci.l2s4.p1.openshiftapps.com/view/gs/origin-ci-test/pr-logs/pull/25188/pull-ci-openshift-origin-master-e2e-gcp-upgrade/1275394971154780160

There are several errors in this test similar to:

  Jun 23 12:51:53.917 - 5s    E ns/openshift-authentication route/oauth-openshift Route is not responding to GET requests over new connections

Comment 3 W. Trevor King 2020-06-29 18:23:09 UTC
Checking the failure message from the reference job [1]:

  Frontends were unreachable during disruption for at least 3m13s of 35m29s (9%), this is currently sufficient to pass the test/job but not considered completely correct:

But:

* PR presubmit jobs are noisy, and it's possible that a change in that PR broke this behavior, in which case it's the PR author's problem.
* As the failure message says, the test-case failed, but not by enough to fail the job.  That job actually failed on:

    fail [github.com/openshift/origin/test/extended/util/disruption/disruption.go:237]: Jun 23 13:17:47.218: Service was unreachable during disruption for at least 53s of 33m45s (3%):

  out of the 'Application behind service load balancer with PDB is not disrupted' test-case, and that's bug 1828858.

However, there are jobs that are actually failing on frontend reachability:

$ w3m -dump -cols 200 'https://search.svc.ci.openshift.org/?search=fail.*Frontends were unreachable during disruption for at least&type=junit&maxAge=336h&name=release-openshift-' | grep 'failures match'
release-openshift-origin-installer-e2e-aws-upgrade - 1419 runs, 28% failed, 2% of failures match
release-openshift-origin-installer-launch-aws - 1032 runs, 56% failed, 0% of failures match
release-openshift-origin-installer-e2e-gcp-upgrade - 196 runs, 25% failed, 6% of failures match
release-openshift-origin-installer-e2e-aws-upgrade-4.2-to-4.3 - 19 runs, 37% failed, 14% of failures match
release-openshift-origin-installer-e2e-aws-upgrade-4.2-nightly-to-4.3-nightly - 18 runs, 33% failed, 17% of failures match
release-openshift-okd-installer-e2e-aws-upgrade - 568 runs, 37% failed, 0% of failures match

For example 4.2.36 -> 4.3.0-0.ci-2020-06-26-162041 [2] with:

  fail [github.com/openshift/origin/test/extended/util/disruption/disruption.go:226]: Jun 26 17:51:17.340: Frontends were unreachable during disruption for at least 10m21s of 48m4s (22%):

or 4.6.0-0.ci-2020-06-23-223142 -> 4.6.0-0.ci-2020-06-24-003142 [3] with:

  fail [github.com/openshift/origin/test/extended/util/disruption/disruption.go:237]: Jun 24 01:47:09.073: Frontends were unreachable during disruption for at least 6m22s of 29m23s (22%):

[1]: https://deck-ci.apps.ci.l2s4.p1.openshiftapps.com/view/gs/origin-ci-test/pr-logs/pull/25188/pull-ci-openshift-origin-master-e2e-gcp-upgrade/1275394971154780160
[2]: https://deck-ci.apps.ci.l2s4.p1.openshiftapps.com/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade/1276551522351583232
[3]: https://deck-ci.apps.ci.l2s4.p1.openshiftapps.com/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-gcp-upgrade/1275587702091157504

Comment 5 Andrew McDermott 2020-07-09 12:11:41 UTC
I’m adding UpcomingSprint, because I was occupied by fixing bugs with
higher priority/severity, developing new features with higher
priority, or developing new features to improve stability at a macro
level. I will revisit this bug next sprint.

Comment 6 Basheer 2020-07-10 05:14:43 UTC
https://bugzilla.redhat.com/show_bug.cgi?id=1848264

We are facing a similar issue on ppc64le hardware as well, but the disruption levels on the Power platform (>30%) are quite high compared to x86_64 (22%).

Error message:
fail [github.com/openshift/origin/test/extended/util/disruption/disruption.go:226]: Jun  5 04:04:58.773: Frontends were unreachable during disruption for at least 16m30s of 47m16s (35%):

Comment 7 Andrew McDermott 2020-07-30 10:06:52 UTC
I’m adding UpcomingSprint, because I was occupied by fixing bugs with
higher priority/severity, developing new features with higher
priority, or developing new features to improve stability at a macro
level. I will revisit this bug next sprint.

Comment 8 Miciah Dashiel Butler Masters 2020-08-20 16:01:15 UTC
This appears to be a manifestation of bug 1809668.

*** This bug has been marked as a duplicate of bug 1809668 ***