Bug 1690087
Summary: Could not reach HTTP service through adf61ce70497f11e99af60e4be0b8886-676259996.us-east-1.elb.amazonaws.com:80 after 2m0s

Product: OpenShift Container Platform
Component: Networking
Networking sub component: router
Reporter: Ben Parees <bparees>
Assignee: Miciah Dashiel Butler Masters <mmasters>
QA Contact: Hongan Li <hongli>
Status: CLOSED INSUFFICIENT_DATA
Severity: high
Priority: medium
CC: adam.kaplan, aos-bugs, bbennett, ccoleman, dmace, mmasters, nstielau, rgudimet, weliang, wking
Version: 4.1.0
Target Release: 4.3.0
Hardware: Unspecified
OS: Unspecified
Last Closed: 2019-10-11 16:12:46 UTC
Type: Bug
Description (Ben Parees, 2019-03-18 18:42:36 UTC)
also:

    Mar 18 17:24:13.330: INFO: Got error testing for reachability of http://aaab68b09499c11e9be550a85713d10d-104965029.us-east-1.elb.amazonaws.com:80/echo?msg=hello: Get http://aaab68b09499c11e9be550a85713d10d-104965029.us-east-1.elb.amazonaws.com:80/echo?msg=hello: net/http: request canceled (Client.Timeout exceeded while awaiting headers)

https://openshift-gce-devel.appspot.com/build/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade-4.0/259/

If this is a flake, as noted above, we should know what the flake rate is. Interacting with external services, we will encounter things outside of our control. If this happens infrequently, I think we can move forward with the beta, and consider mitigating this by recreating or waiting longer.

I do not think this is not an issue of interacting w/ external services outside of our control. It was observed in all of these builds over a span of 2-3 days (I don't know if it's been seen since then):

https://openshift-gce-devel.appspot.com/build/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade-4.0/249/
https://openshift-gce-devel.appspot.com/build/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade-4.0/259/
https://openshift-gce-devel.appspot.com/build/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade-4.0/248/
https://openshift-gce-devel.appspot.com/build/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade-4.0/246/

As for the severity implications and mitigations, I defer to the routing team.

> I do not think this is not an issue of interacting w/ external services outside of our control.

Double negatives... I meant to say: I do not think this is an issue of interacting w/ external services outside of our control.
Created attachment 1546775 [details]
Occurrences of this error in CI from 2019-03-19T12:28 to 2019-03-21T20:06 UTC

This occurred in 5 of our 861 failures in *-e2e-aws* jobs across the whole CI system over the past 55 hours. Generated with [1]:

    $ deck-build-log-plot 'Could not reach HTTP service through .*elb.amazonaws.com:80 after'

[1]: https://github.com/wking/openshift-release/tree/debug-scripts/deck-build-log

At 0.05% failure rate I think we can take this off the TestBlocker and BetaBlocker lists. Can anyone +1? But it is definitely still a bug.

I'm inclined to agree, but I think someone from the networking team needs to make the final call based on their understanding of the potential cause and implications.

I have not reproduced the issue yet, and CI does not have pod logs until after the initial point of failure. I did find an upstream report that looks related, indicating that this is a known, as-yet undiagnosed flake upstream: https://github.com/kubernetes/kubernetes/issues/71239

+1 to removing it from the blocker list.

We need to stress test the upstream LB bits from `test/e2e/upgrades/services.go` in our AWS environments to reproduce and diagnose. I don't have enough data at this time to attribute the problem to external DNS, the SDN, etc. The failure seems unrelated directly to ingress controllers: the test creates a bare LoadBalancer Service backed by custom endpoints created during the test in a temporary namespace. However, we must understand the problem, since upstream LoadBalancer Services are key to our ingress controller implementation, and any flakiness around that "primitive" will introduce downstream instability in our ingress solution.

Removing BetaBlocker.

Just checking in here to say we have this message in 16 of our 327 -e2e-aws-upgrade jobs in the past 24 hours (so ~9%).
Recent examples:

https://openshift-gce-devel.appspot.com/build/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade/837
https://openshift-gce-devel.appspot.com/build/origin-ci-test/pr-logs/pull/openshift_installer/1675/pull-ci-openshift-installer-master-e2e-aws-upgrade/140
https://openshift-gce-devel.appspot.com/build/origin-ci-test/pr-logs/pull/openshift_machine-api-operator/274/pull-ci-openshift-machine-api-operator-master-e2e-aws-upgrade/32

Upstream issue was unhelpfully closed as rotten [1].

[1]: https://github.com/kubernetes/kubernetes/issues/71239#issuecomment-485040442

I'm bumping the timeout and moving this back to 4.1.z so we stop seeing these flakes during CI runs. https://github.com/openshift/origin/pull/23160 is for master; I will cherry-pick it under this bug.

Increased the severity; 10% flaking is high.

Haven't observed this in 2 weeks: https://search.svc.ci.openshift.org/?search=Could+not+reach+HTTP+service+through&maxAge=336h&context=2&type=all Please re-open if I'm mistaken.

This bug was found in CI e2e testing, and it has not been observed in 2 weeks. QA cannot run any CI e2e testing. According to comment 15, QA is moving this bug to verified.

Turned up again in 4.1.13->4.1.14 upgrade testing [1,2]. Checking, I see a few other instances from the past 14 days:

    $ curl -s 'https://ci-search-ci-search-next.svc.ci.openshift.org/search?name=-e2e-&maxAge=336h&context=0&search=Could+not+reach+HTTP+service+through+.*.us-east-1.elb.amazonaws.com' | jq -r '. | keys | sort[]'
    https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade-4.1-to-4.2/340
    https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade-4.1-to-4.2/342
    https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade-4.1-to-4.2/344
    https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade-4.1-to-4.2/345
    https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade-4.1-to-4.2/346
    https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade-4.1/417
    https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade-rollback-4.1-to-4.2/142
    https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade-rollback-4.1-to-4.2/143
    https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade-rollback-4.1-to-4.2/145
    https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade-rollback-4.2/228
    https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade/6704
    https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade/6777
    https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/pr-logs/pull/batch/pull-ci-openshift-origin-master-e2e-aws-upgrade/4574
    https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/pr-logs/pull/openshift_cluster-kube-scheduler-operator/168/pull-ci-openshift-cluster-kube-scheduler-operator-master-e2e-aws-upgrade/162
    https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/pr-logs/pull/openshift_machine-config-operator/1105/pull-ci-openshift-machine-config-operator-master-e2e-aws-upgrade/1733

I'm reopening this one because I'm mostly concerned about the 4.1.13->4.1.14 upgrade. But bug 1749448 is about this same error reported against 4.2, and it was closed as a dup of bug 1749446 (still POST), so that may explain the 4.1->4.2 and rollback instances. Bug 1703878 also mentions this error, but the purported fix [3] landed before release-4.1 split off, so it's in all of these recent failures.

[1]: https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade-4.1/417
[2]: https://openshift-release.svc.ci.openshift.org/releasestream/4-stable/release/4.1.15
[3]: https://github.com/openshift/origin/pull/22711

I've started spot-checking these, and so far 2 of them are clearly some sort of SDN issue:

https://gcsweb-ci.svc.ci.openshift.org/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade-4.1-to-4.2/340/artifacts/
https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade-4.1-to-4.2/346

Multiple pods (including all CoreDNS pods, router pods, ingress operator pods, and others) are failing to route packets to the apiserver for a window lasting at least a minute, with errors like:

    dial tcp 172.30.0.1:443: connect: no route to host

In https://gcsweb-ci.svc.ci.openshift.org/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade-4.1-to-4.2/340, the apiserver was last reported reachable by the ingress operator at 2019-09-06T16:54:21.643Z and first observed broken at 17:15:54.752105.
Here's another which has some SDN issue front and center: https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade/6777

I'll spot-check a few more throughout the day, but I suspect the ingress ELB is a red herring here and routing is collateral damage of a networking issue.

Can you check https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade-4.1/417 ? That doesn't involve any 4.2 code, so I'd have expected it not to be affected by 4.2 SDN instability. It looks like all the ones you've looked into so far do involve 4.2.

Dan Winship clued me in to another state in which we would see `no route to host`: when the service has no endpoints yet (e.g. health checks for pods aren't passing). In at least one case [1], the apiserver is dead because it can't talk to etcd, so pods get `no route to host` talking to the apiserver because the apiserver service has no endpoints and thus no route through iptables. My gut says most of these are going to be something similar...

[1] https://gcsweb-ci.svc.ci.openshift.org/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade-4.1-to-4.2/340

(In reply to W. Trevor King from comment #20)
> Can you check https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade-4.1/417 ?
> That doesn't involve any 4.2 code, so I'd have expected it to not be affected by 4.2 SDN instability.

This one's pretty interesting. It looks like the k8s upgrade test setup [1] (executing from the openshift-test binary in, I think, a pod network namespace [2]) creates and verifies a LoadBalancer service by dialing the LB over its external host or IP (host in this case).
The test process (a Go program) fails to resolve the ELB hostname:

    dial tcp: lookup a2b0cc25ed0fc11e9bf400a28fd1874a-909427042.us-east-1.elb.amazonaws.com on 10.142.0.13:53: no such host

The nameserver IP _looks_ like the resolver upstream of the node (Route53). Some things that are possible:

* The ELB existed long enough to get picked up by k8s and the test, but then got deleted before verification
* DNS packets are not going where they should
* Route53 gave us a bad answer

For that particular class of problem, I suspect there's only so much we can do post-mortem without pcaps or AWS state dumps from around the failure time, so adding some diagnostics in a PR and trying to reproduce might be productive...

[1] https://github.com/openshift/origin/blob/master/vendor/k8s.io/kubernetes/test/e2e/upgrades/services.go#L73
[2] https://github.com/openshift/release/blob/master/ci-operator/templates/openshift/installer/cluster-launch-installer-e2e.yaml

Another occurrence of the problem: https://openshift-gce-devel.appspot.com/build/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade/6942/

Haven't seen any evidence of this being a significant issue in the last 2 weeks (according to https://ci-search-ci-search-next.svc.ci.openshift.org), so closing it. We can open a new bug if the problem recurs.

*** Bug 1789440 has been marked as a duplicate of this bug. ***