Bug 1690087

Summary: Could not reach HTTP service through adf61ce70497f11e99af60e4be0b8886-676259996.us-east-1.elb.amazonaws.com:80 after 2m0s
Product: OpenShift Container Platform
Component: Networking
Networking sub component: router
Status: CLOSED INSUFFICIENT_DATA
Severity: high
Priority: medium
Version: 4.1.0
Target Milestone: ---
Target Release: 4.3.0
Hardware: Unspecified
OS: Unspecified
Reporter: Ben Parees <bparees>
Assignee: Miciah Dashiel Butler Masters <mmasters>
QA Contact: Hongan Li <hongli>
CC: adam.kaplan, aos-bugs, bbennett, ccoleman, dmace, mmasters, nstielau, rgudimet, weliang, wking
Last Closed: 2019-10-11 16:12:46 UTC
Type: Bug
Attachments:
Occurrences of this error in CI from 2019-03-19T12:28 to 2019-03-21T20:06 UTC (no flags)

Description Ben Parees 2019-03-18 18:42:36 UTC
Description of problem:
ELB became inaccessible during upgrade testing

fail [k8s.io/kubernetes/test/e2e/framework/service_util.go:857]: Mar 18 13:37:27.884: Could not reach HTTP service through adf61ce70497f11e99af60e4be0b8886-676259996.us-east-1.elb.amazonaws.com:80 after 2m0s

https://openshift-gce-devel.appspot.com/build/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade-4.0/249/


Version-Release number of selected component (if applicable):
4.1


How reproducible:
appears to be a flake

Comment 2 Nick Stielau 2019-03-21 19:43:43 UTC
If this is a flake, as noted above, we should know what the flake rate is. When interacting with external services, we will encounter things outside of our control. If this happens infrequently, I think we can move forward with the beta and consider mitigating it by recreating or waiting longer.

Comment 4 Ben Parees 2019-03-21 19:48:14 UTC
> I do not think this is not an issue of interacting w/ external services outside of our control.

Double negative... I meant to say: I do not think this is an issue of interacting w/ external services outside of our control.

Comment 5 W. Trevor King 2019-03-22 05:49:36 UTC
Created attachment 1546775 [details]
Occurrences of this error in CI from 2019-03-19T12:28 to 2019-03-21T20:06 UTC

This occurred in 5 of our 861 failures in *-e2e-aws* jobs across the whole CI system over the past 55 hours.  Generated with [1]:

  $ deck-build-log-plot 'Could not reach HTTP service through .*elb.amazonaws.com:80 after'

[1]: https://github.com/wking/openshift-release/tree/debug-scripts/deck-build-log

Comment 6 Nick Stielau 2019-03-22 16:09:14 UTC
At a 0.05% failure rate I think we can take this off the TestBlocker and BetaBlocker lists. Can anyone +1?


But it is definitely still a bug.

Comment 7 Ben Parees 2019-03-22 17:04:16 UTC
I'm inclined to agree, but I think someone from the networking team needs to make the final call based on their understanding of the potential cause and implications.

Comment 8 Miciah Dashiel Butler Masters 2019-03-22 17:31:42 UTC
I have not reproduced the issue yet, and CI does not have pod logs until after the initial point of failure.  I did find an upstream report that looks related, indicating that this is a known, as-yet undiagnosed flake upstream: https://github.com/kubernetes/kubernetes/issues/71239

Comment 9 Dan Mace 2019-03-22 17:32:47 UTC
+1 to removing it from the blocker list. We need to stress test the upstream LB bits from `test/e2e/upgrades/services.go` in our AWS environments to reproduce and diagnose. I don't have enough data at this time to attribute the problem to external DNS, the SDN, etc.

The failure seems unrelated directly to ingress controllers — the test is creating a bare LoadBalancer Service backed by custom endpoints created during the test in a temporary namespace. However, we must understand the problem since upstream LoadBalancer Services are key to our ingress controller implementation, and any flakiness around that "primitive" will introduce downstream instability in our ingress solution.
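
For reference, a minimal sketch of that kind of bare LoadBalancer Service setup, which could be used for manual stress testing in an AWS cluster; the namespace, deployment name, and image below are illustrative and not taken from the upstream test:

  # Throwaway namespace with a simple HTTP backend, exposed through a
  # LoadBalancer Service (the cloud provider provisions an ELB on AWS).
  $ kubectl create namespace lb-flake-debug
  $ kubectl -n lb-flake-debug create deployment echo --image=k8s.gcr.io/echoserver:1.10
  $ kubectl -n lb-flake-debug expose deployment echo --type=LoadBalancer --port=80 --target-port=8080
  # Wait for the cloud provider to publish the ELB hostname, then poll it over HTTP.
  $ HOST=$(kubectl -n lb-flake-debug get svc echo -o jsonpath='{.status.loadBalancer.ingress[0].hostname}')
  $ curl -sS --max-time 10 "http://$HOST/" || echo "ELB not reachable yet"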

Comment 10 Nick Stielau 2019-03-22 17:34:09 UTC
Removing BetaBlocker

Comment 12 W. Trevor King 2019-04-26 23:01:34 UTC
Upstream issue was unhelpfully closed as rotten [1].

[1]: https://github.com/kubernetes/kubernetes/issues/71239#issuecomment-485040442

Comment 13 Clayton Coleman 2019-06-13 15:43:17 UTC
I'm bumping the timeout and moving back to 4.1.z so we stop seeing these flakes during CI runs.  https://github.com/openshift/origin/pull/23160 is for master, will cherrypick under this bug.

Comment 14 Clayton Coleman 2019-06-13 15:43:37 UTC
Increased the severity; 10% flaking is high.

Comment 15 Dan Mace 2019-08-06 18:16:10 UTC
Haven't observed this in 2 weeks:

https://search.svc.ci.openshift.org/?search=Could+not+reach+HTTP+service+through&maxAge=336h&context=2&type=all

Please re-open if I'm mistaken.

Comment 17 Weibin Liang 2019-08-06 20:27:20 UTC
This bug was found in CI e2e testing and has not been observed there in 2 weeks. QA cannot run the CI e2e tests itself.
Per comment 15, QA is moving this bug to verified.

Comment 18 W. Trevor King 2019-09-08 19:02:13 UTC
Turned up again in 4.1.13->4.1.14 upgrade testing [1,2].  Checking, I see a few other instances from the past 14 days:

$ curl -s 'https://ci-search-ci-search-next.svc.ci.openshift.org/search?name=-e2e-&maxAge=336h&context=0&search=Could+not+reach+HTTP+service+through+.*.us-east-1.elb.amazonaws.com' | jq -r '. | keys | sort[]'
https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade-4.1-to-4.2/340
https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade-4.1-to-4.2/342
https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade-4.1-to-4.2/344
https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade-4.1-to-4.2/345
https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade-4.1-to-4.2/346
https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade-4.1/417
https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade-rollback-4.1-to-4.2/142
https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade-rollback-4.1-to-4.2/143
https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade-rollback-4.1-to-4.2/145
https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade-rollback-4.2/228
https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade/6704
https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade/6777
https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/pr-logs/pull/batch/pull-ci-openshift-origin-master-e2e-aws-upgrade/4574
https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/pr-logs/pull/openshift_cluster-kube-scheduler-operator/168/pull-ci-openshift-cluster-kube-scheduler-operator-master-e2e-aws-upgrade/162
https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/pr-logs/pull/openshift_machine-config-operator/1105/pull-ci-openshift-machine-config-operator-master-e2e-aws-upgrade/1733

I'm reopening this one because I'm mostly concerned about the 4.1.13->4.1.14 upgrade.  But bug 1749448 is about this same error reported against 4.2 and it was closed as a dup of bug 1749446 (still POST), so that may explain the 4.1->4.2 and rollback instances.  Bug 1703878 also mentions this error, but the purported fix [3] landed before release-4.1 split off so it's in all of these recent failures.


[1]: https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade-4.1/417
[2]: https://openshift-release.svc.ci.openshift.org/releasestream/4-stable/release/4.1.15
[3]: https://github.com/openshift/origin/pull/22711

Comment 19 Dan Mace 2019-09-09 17:14:42 UTC
I've started spot checking these and so far 2 of them are clearly some sort of SDN issue:

https://gcsweb-ci.svc.ci.openshift.org/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade-4.1-to-4.2/340/artifacts/
https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade-4.1-to-4.2/346


Multiple pods (including all CoreDNS pods, router pods, ingress operator pods, and others) are failing to route packets to the apiserver for a window lasting at least a minute, with errors like

   dial tcp 172.30.0.1:443: connect: no route to host


In https://gcsweb-ci.svc.ci.openshift.org/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade-4.1-to-4.2/340, the apiserver was last reported reachable by the ingress operator at 2019-09-06T16:54:21.643Z and first observed broken at 17:15:54.752105.

Here's another which has some SDN issue front and center:

https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade/6777

I'll spot check a few more throughout the day, but I suspect the ingress ELB is a red herring here and routing is collateral damage of a networking issue.

Comment 20 W. Trevor King 2019-09-09 17:20:03 UTC
Can you check https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade-4.1/417?  That doesn't involve any 4.2 code, so I'd have expected it not to be affected by 4.2 SDN instability.  Looks like all the ones you've looked into so far do involve 4.2.

Comment 21 Dan Mace 2019-09-09 17:29:02 UTC
Dan Winship clued me in to another state in which we would see `no route to host` — if the service has no endpoints yet (e.g. health checks for pods aren't passing). In at least one case[1] the apiserver is dead because it can't talk to etcd, and so pods get `no route to host` talking to the apiserver because the apiserver has no endpoints and thus no route through iptables.

My gut says most of these are going to be something similar...

[1] https://gcsweb-ci.svc.ci.openshift.org/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade-4.1-to-4.2/340
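
A rough sketch of how to tell the two cases apart when triaging one of these runs; the commands are illustrative and would be run against the cluster (or on a node) near the failure window:

  # In the no-endpoints case, the kubernetes Service in the default
  # namespace shows no ready addresses for the apiserver:
  $ kubectl -n default get endpoints kubernetes
  # kube-proxy installs a REJECT rule for Services with no endpoints; the
  # rule's comment makes the cause obvious on the node:
  $ iptables-save | grep 'has no endpoints'
  # If endpoints exist and no such rule is present, the "no route to host"
  # points back at genuine SDN/routing breakage rather than a dead apiserver.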

Comment 22 Dan Mace 2019-09-09 20:58:01 UTC
(In reply to W. Trevor King from comment #20)
> Can you check https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade-4.1/417?
> That doesn't involve any 4.2 code, so I'd have expected it not to be
> affected by 4.2 SDN instability. Looks like all the ones you've looked
> into so far do involve 4.2.

This one's pretty interesting. Looks like the k8s upgrade test setup [1] (executing from the openshift-test binary in, I think, a pod network namespace [2]) creates and verifies a LoadBalancer service by dialing the LB over its external hostname or IP (a hostname in this case). The test process (a Go program) fails to resolve the ELB hostname:

    dial tcp: lookup a2b0cc25ed0fc11e9bf400a28fd1874a-909427042.us-east-1.elb.amazonaws.com on 10.142.0.13:53: no such host

The nameserver IP _looks_ like the resolver upstream of the node (Route53). Some things that are possible:

* The ELB existed long enough to get picked up by k8s and the test, but then got deleted before verification
* DNS packets are not going where they should
* Route53 gave us a bad answer

For that particular class of problem, I suspect there's only so much we can do post-mortem without pcaps or AWS state dumps from around the failure time, so adding some diagnostics in a PR and trying to reproduce might be productive...

[1] https://github.com/openshift/origin/blob/master/vendor/k8s.io/kubernetes/test/e2e/upgrades/services.go#L73
[2] https://github.com/openshift/release/blob/master/ci-operator/templates/openshift/installer/cluster-launch-installer-e2e.yaml
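
As a sketch of the kind of diagnostics that would help here (the hostname and resolver IP are copied from this failure and will differ per run; the dig against the node's resolver has to run from inside the cluster, and the AWS CLI call assumes credentials for the cluster's account):

  # Ask the node's upstream resolver and a public resolver for the record,
  # to separate "resolver returned a bad answer" from "record is gone":
  $ dig +short a2b0cc25ed0fc11e9bf400a28fd1874a-909427042.us-east-1.elb.amazonaws.com @10.142.0.13
  $ dig +short a2b0cc25ed0fc11e9bf400a28fd1874a-909427042.us-east-1.elb.amazonaws.com @8.8.8.8
  # Confirm the load balancer still exists in AWS around the failure time:
  $ aws elb describe-load-balancers --region us-east-1 \
      --query 'LoadBalancerDescriptions[].DNSName' --output text | grep a2b0cc25ed0fc11e9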

Comment 25 Dan Mace 2019-10-11 16:12:46 UTC
Haven't seen any evidence of this being a significant issue in the last 2 weeks (according to https://ci-search-ci-search-next.svc.ci.openshift.org), so closing it. We can open a new bug if the problem recurs.

Comment 26 W. Trevor King 2020-01-10 03:34:56 UTC
*** Bug 1789440 has been marked as a duplicate of this bug. ***