Bug 1751916
| Summary: | [ci] Router restarts / oauth fails / registry is unreachable 30 min into e2e run | | |
|---|---|---|---|
| Product: | OpenShift Container Platform | Reporter: | Clayton Coleman <ccoleman> |
| Component: | Networking | Assignee: | Casey Callendrello <cdc> |
| Networking sub component: | openshift-sdn | QA Contact: | zhaozhanqi <zzhao> |
| Status: | CLOSED DUPLICATE | Docs Contact: | |
| Severity: | medium | | |
| Priority: | medium | CC: | adam.kaplan, aos-bugs, bbennett, bparees |
| Version: | 4.2.0 | | |
| Target Milestone: | --- | | |
| Target Release: | 4.3.0 | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2019-11-22 14:20:34 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
Description
Clayton Coleman
2019-09-13 05:15:31 UTC
In a randomly chosen older run, https://storage.googleapis.com/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-4.2/4249/artifacts/e2e-aws/e2e.log:

```
Sep 10 03:25:37.341 I ns/openshift-config-managed secret/router-certs Updated the published router certificates (2 times)
Sep 10 03:25:37.354 I ns/openshift-authentication-operator deployment/authentication-operator Updated Secret/v4-0-config-system-router-certs -n openshift-authentication because it changed (2 times)
Sep 10 03:25:38.353 I ns/openshift-authentication-operator deployment/authentication-operator Updated ConfigMap/v4-0-config-system-cliconfig -n openshift-authentication: cause by changes in data.v4-0-config-system-cliconfig (2 times)
```

It looks like cert rotation isn't maintaining availability for routers?

Quick update:
I think the cert rotation stuff is a red herring. The default router didn't restart in response to cert changes; those cert messages are in response to a new e2e ingresscontroller being created/destroyed (which causes the shared certs secret to get updated).
The registry failures don't seem to involve ingress, but instead show a failure communicating with the registry over its service DNS name and IP:
Log tail:

```
Pulling image image-registry.openshift-image-registry.svc:...d22c5817ee169da72509071d19f789db914b4d52b581f0436e3f0cf ...
Warning: Pull failed, retrying in 5s ...
Warning: Pull failed, retrying in 5s ...
Warning: Pull failed, retrying in 5s ...
error: build error: failed to pull image: After retrying 2...istry.svc:5000/v2/: dial tcp 172.30.16.62:5000: i/o timeout
```
However, the RouteHealthCheck errors floating around could be:
https://bugzilla.redhat.com/show_bug.cgi?id=1709958
https://bugzilla.redhat.com/show_bug.cgi?id=1750953
https://bugzilla.redhat.com/show_bug.cgi?id=1740148
I propose #1709958 should be the tracker for anything related to packet loss during router restarts. Miciah's going to put together a WIP PR to get some CI data on a possible fix he has been designing for some time now.
If this bz was opened first because of the registry error, and if you agree that ingress is not implicated in the registry error, and if the registry error is still important, I think this bug needs to be reassigned. Otherwise it risks getting lost as just another data point for #1709958.
Can we disambiguate this bug from the ingress packet loss issue already being tracked?
Router availability during rollouts is already covered by https://bugzilla.redhat.com/show_bug.cgi?id=1709958 (and we might start folding RouteHealthCheck errors into that for now). For the registry issue (which is what this bug was originally described as being about), it doesn't seem to involve ingress, so I'm reassigning this to registry for triage. It doesn't seem like an SDN issue on the surface, but more investigation is required.

Analyzing this job: https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/pr-logs/pull/openshift_cluster-monitoring-operator/466/pull-ci-openshift-cluster-monitoring-operator-master-e2e-aws/1369

The registry pod log shows the registry was up the entire time: https://storage.googleapis.com/origin-ci-test/pr-logs/pull/openshift_cluster-monitoring-operator/466/pull-ci-openshift-cluster-monitoring-operator-master-e2e-aws/1369/artifacts/e2e-aws/pods/openshift-image-registry_image-registry-64586f8d54-x7lr9_registry.log

first log entry: time="2019-09-13T01:34:23.32679288Z"
last log entry: time="2019-09-13T02:19:23.815146504Z"

I see no indication of any problems in the registry at the time of the failure, which is reported from the build log as:

```
2019-09-13T01:45:32.880013771Z error: build error: failed to pull image: After retrying 2 times, Pull image still failed due to error: while pulling "docker://image-registry.openshift-image-registry.svc:5000/e2e-test-docker-build-pullsecret-nmdzk/image1@sha256:3ba7ed9f7d22c5817ee169da72509071d19f789db914b4d52b581f0436e3f0cf" as "image-registry.openshift-image-registry.svc:5000/e2e-test-docker-build-pullsecret-nmdzk/image1@sha256:3ba7ed9f7d22c5817ee169da72509071d19f789db914b4d52b581f0436e3f0cf": Error initializing source docker://image-registry.openshift-image-registry.svc:5000/e2e-test-docker-build-pullsecret-nmdzk/image1@sha256:3ba7ed9f7d22c5817ee169da72509071d19f789db914b4d52b581f0436e3f0cf: pinging docker registry returned: Get https://image-registry.openshift-image-registry.svc:5000/v2/: dial tcp 172.30.16.62:5000: i/o timeout
```

The registry responded successfully to other requests around the same time period (though they are health check probes, not blob requests).

right before:

```
time="2019-09-13T01:45:31.486655764Z" level=info msg=response go.version=go1.11.13 http.request.host="10.128.2.6:5000" http.request.id=6356cfba-14cc-49f0-afd1-fe493f205cc2 http.request.method=GET http.request.remoteaddr="10.128.2.1:46328" http.request.uri=/healthz http.request.useragent=kube-probe/1.14+ http.response.duration="65.436µs" http.response.status=200 http.response.written=0
```

right after:

```
time="2019-09-13T01:45:33.330541201Z" level=info msg=response go.version=go1.11.13 http.request.host="10.128.2.6:5000" http.request.id=863ce528-a6d8-4f4a-ac52-caff96161c2e http.request.method=GET http.request.remoteaddr="10.128.2.1:46344" http.request.uri=/healthz http.request.useragent=kube-probe/1.14+ http.response.duration="66.651µs" http.response.status=200 http.response.written=0
```

And I see no indication that a request for 3ba7ed9f7d22c5817ee169da72509071d19f789db914b4d52b581f0436e3f0cf (the image id in question) made it to the registry.
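The claim that the failure fell inside the registry log window can be checked by comparing the timestamps directly. `withinWindow` below is a throwaway sketch, not a real tool; the three timestamps are copied verbatim from the logs quoted above.

```go
package main

import (
	"fmt"
	"time"
)

// withinWindow reports whether the build-failure timestamp falls inside the
// interval covered by the registry pod log, i.e. the registry was up (and
// logging) when the pull timed out.
func withinWindow(first, last, event string) (bool, error) {
	f, err := time.Parse(time.RFC3339Nano, first)
	if err != nil {
		return false, err
	}
	l, err := time.Parse(time.RFC3339Nano, last)
	if err != nil {
		return false, err
	}
	e, err := time.Parse(time.RFC3339Nano, event)
	if err != nil {
		return false, err
	}
	return !e.Before(f) && !e.After(l), nil
}

func main() {
	ok, err := withinWindow(
		"2019-09-13T01:34:23.32679288Z",  // first registry log entry
		"2019-09-13T02:19:23.815146504Z", // last registry log entry
		"2019-09-13T01:45:32.880013771Z", // build pull failure
	)
	if err != nil {
		panic(err)
	}
	fmt.Println(ok) // the failure happened while the registry was logging
}
```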
The nearest blob requests in time were:

```
time="2019-09-13T01:42:42.269596413Z" level=info msg=response go.version=go1.11.13 http.request.contenttype=application/octet-stream http.request.host="image-registry.openshift-image-registry.svc:5000" http.request.id=7a357561-bd6c-4e90-beba-91eb2278a680 http.request.method=PUT http.request.remoteaddr="10.131.0.25:50400" http.request.uri="/v2/e2e-test-build-timing-klnpk/test/blobs/uploads/2d491021-36d8-47bc-adc8-4a877fe57129?_state=s9xA9H7g4FNtsHypcBlLgwwys8QGnaHr3ZQfF6xmruF7Ik5hbWUiOiJlMmUtdGVzdC1idWlsZC10aW1pbmcta2xucGsvdGVzdCIsIlVVSUQiOiIyZDQ5MTAyMS0zNmQ4LTQ3YmMtYWRjOC00YTg3N2ZlNTcxMjkiLCJPZmZzZXQiOjI1NTksIlN0YXJ0ZWRBdCI6IjIwMTktMDktMTNUMDE6NDI6NDFaIn0%3D&digest=sha256%3Ab601a4d585dc320852efddaa5b0fed946144036a9f299e0fc9f3d957e51e6c29" http.request.useragent=Go-http-client/1.1 http.response.duration=607.564954ms http.response.status=201 http.response.written=0
```

and

```
time="2019-09-13T01:46:45.828233469Z" level=info msg="response completed" go.version=go1.11.13 http.request.host="image-registry.openshift-image-registry.svc:5000" http.request.id=add3f6a5-df34-43dd-8283-1c7ce7123199 http.request.method=GET http.request.remoteaddr="10.131.0.1:36878" http.request.uri="/v2/openshift/ruby/blobs/sha256:98e3063dda72c6fa2349fb59e83781f00635fe48de124272ab28d0a1a50ae583" http.request.useragent="cri-o/1.14.10-0.12.dev.rhaos4.2.git819260a.el8 go/go1.12.8 os/linux arch/amd64" http.response.contenttype=text/plain http.response.duration=2.444499838s http.response.status=200 http.response.written=7474
```

So they don't tightly bracket the time period in question, but again I see no indication anything was failing in the registry during this time. (It might be worth noting that since the requests it responded to right around the failure time were probes, they would not have gone through the service proxy, I assume, since they'd go straight to the pod. Possibly relevant?)
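The manual search above (looking for the failed pull's digest among the request URIs) can be expressed as a small scan. `digestSeen` is a hypothetical helper, and the sample lines are abbreviated versions of the entries quoted in this bug.

```go
package main

import (
	"bufio"
	"fmt"
	"strings"
)

// digestSeen scans registry access-log lines for a request URI that mentions
// the given blob digest -- the check used above to conclude that the failing
// pull never reached the registry at all.
func digestSeen(logText, digest string) bool {
	sc := bufio.NewScanner(strings.NewReader(logText))
	for sc.Scan() {
		line := sc.Text()
		if strings.Contains(line, "http.request.uri=") && strings.Contains(line, digest) {
			return true
		}
	}
	return false
}

func main() {
	// Abbreviated sample of the registry log entries quoted in this bug.
	sample := `time="2019-09-13T01:45:31Z" http.request.uri=/healthz http.response.status=200
time="2019-09-13T01:46:45Z" http.request.uri="/v2/openshift/ruby/blobs/sha256:98e3063dda72..." http.response.status=200`
	// The digest from the failed pull never shows up in a request URI.
	fmt.Println(digestSeen(sample, "3ba7ed9f7d22c5817ee169da72509071d19f789db914b4d52b581f0436e3f0cf"))
}
```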
Another potential data point: https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/pr-logs/pull/23825/pull-ci-openshift-origin-master-e2e-aws-builds/3021

I can rule out the registry as the issue here: it came up, the pod behaved normally, and blobs were able to be pushed to imagestreams. Later, when the imagestream tag was used in a deployment pull, the pull failed with "no route to host". Moving this to the Networking - SDN component.

Update from out-of-band debugging: Danw noticed that the iptables dump is missing the service IP in question. That suggests that something got lost in the informer chain. Taking a further look.

Yup, looks like it never saw the service - https://storage.googleapis.com/origin-ci-test/pr-logs/pull/23825/pull-ci-openshift-origin-master-e2e-aws-builds/3021/artifacts/e2e-aws-builds/pods/openshift-sdn_sdn-xbz6n_sdn.log has no "svc add" line for openshift-image-registry/image-registry.

And, from what I can tell, it's not a hybrid proxy bug. That's because the code that logs the service add is *before* any of our unidling logic:

```
endpoint informer   service informer
        +                  +
        |                  |
        +--------+---------+
                 |
        +--------+---------+
        |     OsdnProxy    |
        +--------+---------+
                 |
        +--------+---------+
        |   HybridProxier  |
        +--------+---------+
                 |
        +--------+---------+
        |                  |
+-------+-------+  +-------+--------+
| iptablesProxy |  | userspaceProxy |
+---------------+  +----------------+
```

The log line I would expect to see is in the OsdnProxy, and this particular codepath is logicless. It's just log-and-forward. So that means either there is an informer bug (unlikely) or there's an error in how we use them. I don't see any obvious race conditions. The flow, as far as I can make out, is:

1. Create the informers with NewSharedInformerFactory.
2. Create the proxy.
3. Add the proxy's event handler to the informer.
4. Start the informers.

Which is... fine. Totally fine. I'm looking now to see if there's some way that the informers were started beforehand.
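The four-step flow above, and the synthetic-Add behavior the following comment relies on, can be modeled with a toy informer. This is NOT client-go; it is a minimal stand-in demonstrating why registration order alone should not lose a "svc add": handlers registered before Start see the store at Start, and handlers registered after Start get the store replayed as synthetic adds.

```go
package main

import "fmt"

type handlerFunc func(obj string)

// toyInformer is a drastically simplified stand-in for a client-go shared
// informer: a store of objects plus a list of event handlers.
type toyInformer struct {
	store    []string
	handlers []handlerFunc
	started  bool
}

// Add records a new object and, once started, notifies all handlers.
func (s *toyInformer) Add(obj string) {
	s.store = append(s.store, obj)
	if s.started {
		for _, h := range s.handlers {
			h(obj)
		}
	}
}

// AddEventHandler registers h; if the informer is already running, the
// existing store is replayed to h as synthetic adds.
func (s *toyInformer) AddEventHandler(h handlerFunc) {
	s.handlers = append(s.handlers, h)
	if s.started {
		for _, obj := range s.store {
			h(obj) // synthetic Add()
		}
	}
}

// Start replays the store to every handler registered so far.
func (s *toyInformer) Start() {
	s.started = true
	for _, h := range s.handlers {
		for _, obj := range s.store {
			h(obj)
		}
	}
}

// run wires things up in the order described above (informer, handler,
// start) and returns what an early and a late handler each observed.
func run() (string, string) {
	inf := &toyInformer{}
	var early, late []string
	// The OsdnProxy-style log-and-forward handler, registered before Start.
	inf.AddEventHandler(func(obj string) { early = append(early, "svc add "+obj) })
	inf.Start()
	inf.Add("openshift-image-registry/image-registry")
	// A handler registered late still sees the service via a synthetic add.
	inf.AddEventHandler(func(obj string) { late = append(late, obj) })
	return early[0], late[0]
}

func main() {
	a, b := run()
	fmt.Println(a) // svc add openshift-image-registry/image-registry
	fmt.Println(b) // openshift-image-registry/image-registry
}
```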
Normally this shouldn't matter, since adding a new handler to a shared informer results in a synthetic Add() against the store. However, there's a particularly silly and useless abstraction layer in the upstream proxy that undoes this behavior if somehow there's a bug. Which I very much doubt.

Filed https://github.com/openshift/sdn/pull/35 with extra logging to see if we're somehow tickling this.

*** This bug has been marked as a duplicate of bug 1765280 ***