Bug 1751916
| Summary: | [ci] Router restarts / oauth fails / registry is unreachable 30 min into e2e run | | |
|---|---|---|---|
| Product: | OpenShift Container Platform | Reporter: | Clayton Coleman <ccoleman> |
| Component: | Networking | Assignee: | Casey Callendrello <cdc> |
| Networking sub component: | openshift-sdn | QA Contact: | zhaozhanqi <zzhao> |
| Status: | CLOSED DUPLICATE | Docs Contact: | |
| Severity: | medium | | |
| Priority: | medium | CC: | adam.kaplan, aos-bugs, bbennett, bparees |
| Version: | 4.2.0 | | |
| Target Milestone: | --- | | |
| Target Release: | 4.3.0 | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2019-11-22 14:20:34 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
Description
Clayton Coleman
2019-09-13 05:15:31 UTC
In a randomly chosen older run, https://storage.googleapis.com/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-4.2/4249/artifacts/e2e-aws/e2e.log:

```
Sep 10 03:25:37.341 I ns/openshift-config-managed secret/router-certs Updated the published router certificates (2 times)
Sep 10 03:25:37.354 I ns/openshift-authentication-operator deployment/authentication-operator Updated Secret/v4-0-config-system-router-certs -n openshift-authentication because it changed (2 times)
Sep 10 03:25:38.353 I ns/openshift-authentication-operator deployment/authentication-operator Updated ConfigMap/v4-0-config-system-cliconfig -n openshift-authentication: cause by changes in data.v4-0-config-system-cliconfig (2 times)
```

It looks like cert rotation isn't maintaining availability for routers?

Quick update:
I think the cert rotation stuff is a red herring. The default router didn't restart in response to cert changes; those cert messages are in response to a new e2e ingresscontroller being created/destroyed (which causes the shared certs secret to get updated).
The registry failures don't seem to involve ingress, but instead show a failure communicating with the registry over its service DNS name and IP:
Log tail:

```
Pulling image image-registry.openshift-image-registry.svc:...d22c5817ee169da72509071d19f789db914b4d52b581f0436e3f0cf ...
Warning: Pull failed, retrying in 5s ...
Warning: Pull failed, retrying in 5s ...
Warning: Pull failed, retrying in 5s ...
error: build error: failed to pull image: After retrying 2...istry.svc:5000/v2/: dial tcp 172.30.16.62:5000: i/o timeout
```
However, the RouteHealthCheck errors floating around could be:
https://bugzilla.redhat.com/show_bug.cgi?id=1709958
https://bugzilla.redhat.com/show_bug.cgi?id=1750953
https://bugzilla.redhat.com/show_bug.cgi?id=1740148
I propose #1709958 should be the tracker for anything related to packet loss during router restarts. Miciah's going to put together a WIP PR to get some CI data on a possible fix he has been designing for some time now.
If this bz was opened first because of the registry error, and if you agree that ingress is not implicated in the registry error, and if the registry error is still important, I think this bug needs to be reassigned. Otherwise it risks getting lost as just another data point for #1709958.
Can we disambiguate this bug from the ingress packet loss issue already being tracked?
Router availability during rollouts is already covered by https://bugzilla.redhat.com/show_bug.cgi?id=1709958 (and we might start folding RouteHealthCheck errors into that for now). For the registry issue (which is what this bug was originally described as being about), it doesn't seem to involve ingress, so I'm reassigning this to registry for triage. It doesn't seem like an SDN issue on the surface, but more investigation is required.

Analyzing this job: https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/pr-logs/pull/openshift_cluster-monitoring-operator/466/pull-ci-openshift-cluster-monitoring-operator-master-e2e-aws/1369

The registry pod log shows the registry was up the entire time: https://storage.googleapis.com/origin-ci-test/pr-logs/pull/openshift_cluster-monitoring-operator/466/pull-ci-openshift-cluster-monitoring-operator-master-e2e-aws/1369/artifacts/e2e-aws/pods/openshift-image-registry_image-registry-64586f8d54-x7lr9_registry.log

first log entry: time="2019-09-13T01:34:23.32679288Z"
last log entry: time="2019-09-13T02:19:23.815146504Z"

I see no indication of any problems in the registry at the time of the failure, which is reported from the build log as:

```
2019-09-13T01:45:32.880013771Z error: build error: failed to pull image: After retrying 2 times, Pull image still failed due to error: while pulling "docker://image-registry.openshift-image-registry.svc:5000/e2e-test-docker-build-pullsecret-nmdzk/image1@sha256:3ba7ed9f7d22c5817ee169da72509071d19f789db914b4d52b581f0436e3f0cf" as "image-registry.openshift-image-registry.svc:5000/e2e-test-docker-build-pullsecret-nmdzk/image1@sha256:3ba7ed9f7d22c5817ee169da72509071d19f789db914b4d52b581f0436e3f0cf": Error initializing source docker://image-registry.openshift-image-registry.svc:5000/e2e-test-docker-build-pullsecret-nmdzk/image1@sha256:3ba7ed9f7d22c5817ee169da72509071d19f789db914b4d52b581f0436e3f0cf: pinging docker registry returned: Get https://image-registry.openshift-image-registry.svc:5000/v2/: dial tcp 172.30.16.62:5000: i/o timeout
```

The registry responded successfully to other requests around the same time period (though they are health check probes, not blob requests).

right before:

```
time="2019-09-13T01:45:31.486655764Z" level=info msg=response go.version=go1.11.13 http.request.host="10.128.2.6:5000" http.request.id=6356cfba-14cc-49f0-afd1-fe493f205cc2 http.request.method=GET http.request.remoteaddr="10.128.2.1:46328" http.request.uri=/healthz http.request.useragent=kube-probe/1.14+ http.response.duration="65.436µs" http.response.status=200 http.response.written=0
```

right after:

```
time="2019-09-13T01:45:33.330541201Z" level=info msg=response go.version=go1.11.13 http.request.host="10.128.2.6:5000" http.request.id=863ce528-a6d8-4f4a-ac52-caff96161c2e http.request.method=GET http.request.remoteaddr="10.128.2.1:46344" http.request.uri=/healthz http.request.useragent=kube-probe/1.14+ http.response.duration="66.651µs" http.response.status=200 http.response.written=0
```

And I see no indication that a request for 3ba7ed9f7d22c5817ee169da72509071d19f789db914b4d52b581f0436e3f0cf (the image id in question) made it to the registry.
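The claim that the failure fell inside the registry log window can be checked by comparing the timestamps directly. `withinWindow` below is a throwaway sketch, not a real tool; the three timestamps are copied verbatim from the logs quoted above.

```go
package main

import (
	"fmt"
	"time"
)

// withinWindow reports whether the build-failure timestamp falls inside the
// interval covered by the registry pod log, i.e. the registry was up (and
// logging) when the pull timed out.
func withinWindow(first, last, event string) (bool, error) {
	f, err := time.Parse(time.RFC3339Nano, first)
	if err != nil {
		return false, err
	}
	l, err := time.Parse(time.RFC3339Nano, last)
	if err != nil {
		return false, err
	}
	e, err := time.Parse(time.RFC3339Nano, event)
	if err != nil {
		return false, err
	}
	return !e.Before(f) && !e.After(l), nil
}

func main() {
	ok, err := withinWindow(
		"2019-09-13T01:34:23.32679288Z",  // first registry log entry
		"2019-09-13T02:19:23.815146504Z", // last registry log entry
		"2019-09-13T01:45:32.880013771Z", // build pull failure
	)
	if err != nil {
		panic(err)
	}
	fmt.Println(ok) // the failure happened while the registry was logging
}
```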
The nearest blob requests in time were:

```
time="2019-09-13T01:42:42.269596413Z" level=info msg=response go.version=go1.11.13 http.request.contenttype=application/octet-stream http.request.host="image-registry.openshift-image-registry.svc:5000" http.request.id=7a357561-bd6c-4e90-beba-91eb2278a680 http.request.method=PUT http.request.remoteaddr="10.131.0.25:50400" http.request.uri="/v2/e2e-test-build-timing-klnpk/test/blobs/uploads/2d491021-36d8-47bc-adc8-4a877fe57129?_state=s9xA9H7g4FNtsHypcBlLgwwys8QGnaHr3ZQfF6xmruF7Ik5hbWUiOiJlMmUtdGVzdC1idWlsZC10aW1pbmcta2xucGsvdGVzdCIsIlVVSUQiOiIyZDQ5MTAyMS0zNmQ4LTQ3YmMtYWRjOC00YTg3N2ZlNTcxMjkiLCJPZmZzZXQiOjI1NTksIlN0YXJ0ZWRBdCI6IjIwMTktMDktMTNUMDE6NDI6NDFaIn0%3D&digest=sha256%3Ab601a4d585dc320852efddaa5b0fed946144036a9f299e0fc9f3d957e51e6c29" http.request.useragent=Go-http-client/1.1 http.response.duration=607.564954ms http.response.status=201 http.response.written=0
```

and

```
time="2019-09-13T01:46:45.828233469Z" level=info msg="response completed" go.version=go1.11.13 http.request.host="image-registry.openshift-image-registry.svc:5000" http.request.id=add3f6a5-df34-43dd-8283-1c7ce7123199 http.request.method=GET http.request.remoteaddr="10.131.0.1:36878" http.request.uri="/v2/openshift/ruby/blobs/sha256:98e3063dda72c6fa2349fb59e83781f00635fe48de124272ab28d0a1a50ae583" http.request.useragent="cri-o/1.14.10-0.12.dev.rhaos4.2.git819260a.el8 go/go1.12.8 os/linux arch/amd64" http.response.contenttype=text/plain http.response.duration=2.444499838s http.response.status=200 http.response.written=7474
```

So they don't tightly bracket the time period in question, but again I see no indication anything was failing in the registry during this time. (It might be worth noting that since the requests it responded to right around the failure time were probes, they would not have gone through the service proxy, I assume, since they'd go straight to the pod. Possibly relevant?)
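The manual search above (looking for the failed pull's digest among the request URIs) can be expressed as a small scan. `digestSeen` is a hypothetical helper, and the sample lines are abbreviated versions of the entries quoted in this bug.

```go
package main

import (
	"bufio"
	"fmt"
	"strings"
)

// digestSeen scans registry access-log lines for a request URI that mentions
// the given blob digest -- the check used above to conclude that the failing
// pull never reached the registry at all.
func digestSeen(logText, digest string) bool {
	sc := bufio.NewScanner(strings.NewReader(logText))
	for sc.Scan() {
		line := sc.Text()
		if strings.Contains(line, "http.request.uri=") && strings.Contains(line, digest) {
			return true
		}
	}
	return false
}

func main() {
	// Abbreviated sample of the registry log entries quoted in this bug.
	sample := `time="2019-09-13T01:45:31Z" http.request.uri=/healthz http.response.status=200
time="2019-09-13T01:46:45Z" http.request.uri="/v2/openshift/ruby/blobs/sha256:98e3063dda72..." http.response.status=200`
	// The digest from the failed pull never shows up in a request URI.
	fmt.Println(digestSeen(sample, "3ba7ed9f7d22c5817ee169da72509071d19f789db914b4d52b581f0436e3f0cf"))
}
```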
Another potential data point: https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/pr-logs/pull/23825/pull-ci-openshift-origin-master-e2e-aws-builds/3021

I can rule out the registry as the issue here: it came up, the pod behaved normally, and blobs were able to be pushed to imagestreams. Later, when the imagestream tag was used in a deployment pull, the pull failed with "no route to host". Moving this to the Networking - SDN component.

Update from out-of-band debugging: Danw noticed that the iptables dump is missing the service IP in question. That suggests that something got lost in the informer chain. Taking a further look.

Yup, looks like it never saw the service - https://storage.googleapis.com/origin-ci-test/pr-logs/pull/23825/pull-ci-openshift-origin-master-e2e-aws-builds/3021/artifacts/e2e-aws-builds/pods/openshift-sdn_sdn-xbz6n_sdn.log has no "svc add" line for openshift-image-registry/image-registry.

And, from what I can tell, it's not a hybrid proxy bug. That's because the code that logs the service add is *before* any of our unidling logic:

```
endpoint informer   service informer
        +                  +
        |                  |
        +--------+---------+
                 |
        +--------+---------+
        |     OsdnProxy    |
        +--------+---------+
                 |
        +--------+---------+
        |   HybridProxier  |
        +--------+---------+
                 |
        +--------+---------+
        |                  |
+-------+-------+  +-------+--------+
| iptablesProxy |  | userspaceProxy |
+---------------+  +----------------+
```

The log line I would expect to see is in the OsdnProxy, and this particular codepath is logicless. It's just log-and-forward. So that means either there is an informer bug (unlikely) or there's an error in how we use them. I don't see any obvious race conditions. The flow, as far as I can make out, is:

1. Create the informers with NewSharedInformerFactory.
2. Create the proxy.
3. Add the proxy's event handler to the informer.
4. Start the informers.

Which is... fine. Totally fine. I'm looking now to see if there's some way that the informers were started beforehand.
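The four-step flow above, and the synthetic-Add behavior the following comment relies on, can be modeled with a toy informer. This is NOT client-go; it is a minimal stand-in demonstrating why registration order alone should not lose a "svc add": handlers registered before Start see the store at Start, and handlers registered after Start get the store replayed as synthetic adds.

```go
package main

import "fmt"

type handlerFunc func(obj string)

// toyInformer is a drastically simplified stand-in for a client-go shared
// informer: a store of objects plus a list of event handlers.
type toyInformer struct {
	store    []string
	handlers []handlerFunc
	started  bool
}

// Add records a new object and, once started, notifies all handlers.
func (s *toyInformer) Add(obj string) {
	s.store = append(s.store, obj)
	if s.started {
		for _, h := range s.handlers {
			h(obj)
		}
	}
}

// AddEventHandler registers h; if the informer is already running, the
// existing store is replayed to h as synthetic adds.
func (s *toyInformer) AddEventHandler(h handlerFunc) {
	s.handlers = append(s.handlers, h)
	if s.started {
		for _, obj := range s.store {
			h(obj) // synthetic Add()
		}
	}
}

// Start replays the store to every handler registered so far.
func (s *toyInformer) Start() {
	s.started = true
	for _, h := range s.handlers {
		for _, obj := range s.store {
			h(obj)
		}
	}
}

// run wires things up in the order described above (informer, handler,
// start) and returns what an early and a late handler each observed.
func run() (string, string) {
	inf := &toyInformer{}
	var early, late []string
	// The OsdnProxy-style log-and-forward handler, registered before Start.
	inf.AddEventHandler(func(obj string) { early = append(early, "svc add "+obj) })
	inf.Start()
	inf.Add("openshift-image-registry/image-registry")
	// A handler registered late still sees the service via a synthetic add.
	inf.AddEventHandler(func(obj string) { late = append(late, obj) })
	return early[0], late[0]
}

func main() {
	a, b := run()
	fmt.Println(a) // svc add openshift-image-registry/image-registry
	fmt.Println(b) // openshift-image-registry/image-registry
}
```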
Normally this shouldn't matter, since adding a new handler to a shared informer results in a synthetic Add() against the store. However, there's a particularly silly and useless abstraction layer in the upstream proxy that undoes this behavior if somehow there's a bug. Which I very much doubt.

Filed https://github.com/openshift/sdn/pull/35 with extra logging to see if we're somehow tickling this.

*** This bug has been marked as a duplicate of bug 1765280 ***