Moved from https://github.com/openshift/origin/issues/18317. Seen in multiple CI runs on the GCP cluster, e.g. https://openshift-gce-devel.appspot.com/build/origin-ci-test/pr-logs/pull/18229/test_pull_request_origin_extended_conformance_gce/15151/#conformanceareanetworkingfeaturerouter-openshift-routers-the-haproxy-router-should-serve-the-correct-routes-when-scoped-to-a-single-namespace-and-label-set-suiteopenshiftconformanceparallel

The exec pod was trying to curl the router (which was up) but was never able to establish a connection:

```
/tmp/openshift/build-rpms/rpm/BUILD/origin-3.9.0/_output/local/go/src/github.com/openshift/origin/test/extended/router/scoped.go:42
Expected error:
    <*errors.errorString | 0xc420bf1170>: {
        s: "last response from server was not 200:\n",
    }
    last response from server was not 200:
not to have occurred
/tmp/openshift/build-rpms/rpm/BUILD/origin-3.9.0/_output/local/go/src/github.com/openshift/origin/test/extended/router/scoped.go:79
```

The router was up a few seconds after the exec pod was created, but the exec pod was never able to connect to the destination pod:

```
Jan 27 23:50:48.628: INFO: Running '/data/src/github.com/openshift/origin/_output/local/bin/linux/amd64/kubectl --server=https://internal-api.prtest-5a37c28-15151.origin-ci-int-gce.dev.rhcloud.com:8443 --kubeconfig=/tmp/cluster-admin.kubeconfig exec --namespace=extended-test-scoped-router-s9pmd-2sj74 execpod -- /bin/sh -c
set -e
for i in $(seq 1 180); do
  code=$( curl -k -s -o /dev/null -w '%{http_code}\n' --header 'Host: 172.16.2.45' "http://172.16.2.45:1936/healthz" ) || rc=$?
  if [[ "${rc:-0}" -eq 0 ]]; then
    echo $code
    if [[ $code -eq 200 ]]; then
      exit 0
    fi
    if [[ $code -ne 503 ]]; then
      exit 1
    fi
  else
    echo "error ${rc}" 1>&2
  fi
  sleep 1
done
'
Jan 27 23:53:53.739: INFO: stderr: "error 7\nerror 7\nerror 7\n[... "error 7" repeated ~180 times, once per attempt ...]"
```

Logs from the router:

```
Jan 27 23:53:53.828: INFO: Scoped Router test should serve the correct routes when scoped to a single namespace and label set
[Suite:openshift/conformance/parallel] - waiting for the healthz endpoint to respond logs:
I0127 23:51:07.732698 1 template.go:260] Starting template router (v3.9.0-alpha.3+5ec2b2f-218)
I0127 23:51:07.733031 1 merged_client_builder.go:123] Using in-cluster configuration
I0127 23:51:07.737366 1 reflector.go:202] Starting reflector *core.Service (10m0s) from github.com/openshift/origin/pkg/router/template/service_lookup.go:32
I0127 23:51:07.737420 1 reflector.go:240] Listing and watching *core.Service from github.com/openshift/origin/pkg/router/template/service_lookup.go:32
I0127 23:51:07.739146 1 router.go:154] Creating a new template router, writing to /var/lib/haproxy/router
I0127 23:51:07.739275 1 router.go:228] Template router will coalesce reloads within 5s of each other
I0127 23:51:07.739318 1 router.go:278] Router default cert from router container
I0127 23:51:07.739327 1 router.go:215] Reading persisted state
I0127 23:51:07.739355 1 router.go:219] Committing state
I0127 23:51:07.739363 1 router.go:333] Writing the router state
I0127 23:51:07.739855 1 router.go:340] Writing the router config
I0127 23:51:07.741933 1 router.go:354] Reloading the router
E0127 23:51:07.760463 1 reflector.go:205] github.com/openshift/origin/pkg/router/template/service_lookup.go:32: Failed to list *core.Service: services is forbidden: User "system:serviceaccount:extended-test-scoped-router-s9pmd-2sj74:default" cannot list services at the cluster scope: User "system:serviceaccount:extended-test-scoped-router-s9pmd-2sj74:default" cannot list all services in the cluster
I0127 23:51:07.816597 1 router.go:441] Router reloaded:
 - Checking http://localhost:80 ...
 - Health check ok : 0 retry attempt(s).
```

The OVS health check on the s4sd node failed immediately after the router started:

```
Jan 27 23:50:31 ci-prtest-5a37c28-15151-ig-n-s4sd origin-node[2074]: F0127 23:50:31.053104 2074 healthcheck.go:96] SDN healthcheck detected unhealthy OVS server, restarting: Link not found
Jan 27 23:50:31 ci-prtest-5a37c28-15151-ig-n-s4sd systemd[1]: origin-node.service: main process exited, code=exited, status=255/n/a
Jan 27 23:50:31 ci-prtest-5a37c28-15151-ig-n-s4sd systemd[1]: Unit origin-node.service entered failed state.
Jan 27 23:50:31 ci-prtest-5a37c28-15151-ig-n-s4sd systemd[1]: origin-node.service failed.
Jan 27 23:50:36 ci-prtest-5a37c28-15151-ig-n-s4sd systemd[1]: origin-node.service holdoff time over, scheduling restart.
Jan 27 23:50:36 ci-prtest-5a37c28-15151-ig-n-s4sd systemd[1]: Starting OpenShift Node...
```

Also seen in https://openshift-gce-devel.appspot.com/build/origin-ci-test/pr-logs/pull/18331/test_pull_request_origin_extended_conformance_gce/15216/ on node ci-prtest-5a37c28-15216-ig-n-t1xb. I suspect this is happening in many CI runs on GCP. Is this GCP-specific? openshift-sdn-specific? Is the health check too aggressive?
Looks like a non-trivial fraction of our flakes (1/3? 1/4?) are from this or something like it.
Also https://openshift-gce-devel.appspot.com/build/origin-ci-test/pr-logs/pull/18323/test_pull_request_origin_extended_conformance_gce/15217/:

```
Jan 29 16:18:21 ci-prtest-5a37c28-15217-ig-n-sqrh origin-node[2102]: F0129 16:18:21.842955 2102 healthcheck.go:96] SDN healthcheck detected unhealthy OVS server, restarting:
Jan 29 16:18:21 ci-prtest-5a37c28-15217-ig-n-sqrh systemd[1]: origin-node.service: main process exited, code=exited, status=255/n/a
Jan 29 16:18:22 ci-prtest-5a37c28-15217-ig-n-sqrh systemd[1]: Unit origin-node.service entered failed state.
Jan 29 16:18:22 ci-prtest-5a37c28-15217-ig-n-sqrh systemd[1]: origin-node.service failed.
Jan 29 16:18:27 ci-prtest-5a37c28-15217-ig-n-sqrh systemd[1]: origin-node.service holdoff time over, scheduling restart.
Jan 29 16:18:27 ci-prtest-5a37c28-15217-ig-n-sqrh systemd[1]: Starting OpenShift Node...
```
This looks like another symptom of https://github.com/openshift/origin/issues/15991 (which has suddenly started popping up everywhere, though we don't know why). The "Link not found" error suggests that the health check is being run from a thread that has leaked into the wrong network namespace. (And in the second and third examples linked above, the SDN health check failure is preceded by errors like `CNI request failed with status 400: 'error on port veth70fa69b6: "could not open network device veth70fa69b6 (No such device)"'` even though the logs suggest that device definitely should exist — again pointing at code running in the wrong namespace.) Will be fixed by https://github.com/openshift/origin/pull/18355.
Verified in openshift v3.9.0-0.48.0; cannot reproduce the issue.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2018:0489