Bug 1539987

Summary:	Under load, openshift-sdn reports "link not found" and fails health check and restarts itself on GCP
Product:	OpenShift Container Platform	Reporter:	Clayton Coleman <ccoleman>
Component:	Networking	Assignee:	Dan Winship <danw>
Status:	CLOSED ERRATA	QA Contact:	Meng Bo <bmeng>
Severity:	high	Docs Contact:
Priority:	medium
Version:	3.9.0	CC:	aos-bugs, bbennett, hongli
Target Milestone:	---
Target Release:	3.9.0
Hardware:	Unspecified
OS:	Unspecified
Whiteboard:
Fixed In Version:		Doc Type:	Bug Fix
Doc Text:	Cause: Bugs in golang's handling of network namespaces could cause problems after setting up a new pod. Consequence: Multiple forms of sporadic networking-related failures in the OpenShift node service, including (1) atomic-node-service spontaneously restarting due to the OVS health check failing, (2) the node reporting an incorrect IP address to the master and thus becoming inaccessible for a period of time, (3) temporary pod creation failures that would eventually resolve themselves without user intervention. Fix: All pod-setup and pod-teardown operations in a pod's network namespace are now performed in a separate process from the main atomic-openshift-node service. Result: No sporadic network failures	Story Points:	---
Clone Of:		Environment:
Last Closed:	2018-03-28 14:23:55 UTC	Type:	Bug
Regression:	---	Mount Type:	---
Documentation:	---	CRM:
Verified Versions:		Category:	---
oVirt Team:	---	RHEL 7.3 requirements from Atomic Host:
Cloudforms Team:	---	Target Upstream Version:
Embargoed:

Description Clayton Coleman 2018-01-30 01:28:31 UTC

Moved from https://github.com/openshift/origin/issues/18317

Seen in multiple CI runs on GCP cluster.

https://openshift-gce-devel.appspot.com/build/origin-ci-test/pr-logs/pull/18229/test_pull_request_origin_extended_conformance_gce/15151/#conformanceareanetworkingfeaturerouter-openshift-routers-the-haproxy-router-should-serve-the-correct-routes-when-scoped-to-a-single-namespace-and-label-set-suiteopenshiftconformanceparallel

exec pod was trying to curl the router (which was up) but wasn't able to create a connection.

```
/tmp/openshift/build-rpms/rpm/BUILD/origin-3.9.0/_output/local/go/src/github.com/openshift/origin/test/extended/router/scoped.go:42
Expected error:
    <*errors.errorString | 0xc420bf1170>: {
        s: "last response from server was not 200:\n",
    }
    last response from server was not 200:
    
not to have occurred
/tmp/openshift/build-rpms/rpm/BUILD/origin-3.9.0/_output/local/go/src/github.com/openshift/origin/test/extended/router/scoped.go:79
```

Router was up a few seconds after the exec pod was created, but it was never able to connect to the destination pod:

```
Jan 27 23:50:48.628: INFO: Running '/data/src/github.com/openshift/origin/_output/local/bin/linux/amd64/kubectl --server=https://internal-api.prtest-5a37c28-15151.origin-ci-int-gce.dev.rhcloud.com:8443 --kubeconfig=/tmp/cluster-admin.kubeconfig exec --namespace=extended-test-scoped-router-s9pmd-2sj74 execpod -- /bin/sh -c 
		set -e
		for i in $(seq 1 180); do
			code=$( curl -k -s -o /dev/null -w '%{http_code}\n' --header 'Host: 172.16.2.45' "http://172.16.2.45:1936/healthz" ) || rc=$?
			if [[ "${rc:-0}" -eq 0 ]]; then
				echo $code
				if [[ $code -eq 200 ]]; then
					exit 0
				fi
				if [[ $code -ne 503 ]]; then
					exit 1
				fi
			else
				echo "error ${rc}" 1>&2
			fi
			sleep 1
		done
		'
Jan 27 23:53:53.739: INFO: stderr: "error 7\nerror 7\nerror 7\nerror 7\nerror 7\nerror 7\nerror 7\nerror 7\nerror 7\nerror 7\nerror 7\nerror 7\nerror 7\nerror 7\nerror 7\nerror 7\nerror 7\nerror 7\nerror 7\nerror 7\nerror 7\nerror 7\nerror 7\nerror 7\nerror 7\nerror 7\nerror 7\nerror 7\nerror 7\nerror 7\nerror 7\nerror 7\nerror 7\nerror 7\nerror 7\nerror 7\nerror 7\nerror 7\nerror 7\nerror 7\nerror 7\nerror 7\nerror 7\nerror 7\nerror 7\nerror 7\nerror 7\nerror 7\nerror 7\nerror 7\nerror 7\nerror 7\nerror 7\nerror 7\nerror 7\nerror 7\nerror 7\nerror 7\nerror 7\nerror 7\nerror 7\nerror 7\nerror 7\nerror 7\nerror 7\nerror 7\nerror 7\nerror 7\nerror 7\nerror 7\nerror 7\nerror 7\nerror 7\nerror 7\nerror 7\nerror 7\nerror 7\nerror 7\nerror 7\nerror 7\nerror 7\nerror 7\nerror 7\nerror 7\nerror 7\nerror 7\nerror 7\nerror 7\nerror 7\nerror 7\nerror 7\nerror 7\nerror 7\nerror 7\nerror 7\nerror 7\nerror 7\nerror 7\nerror 7\nerror 7\nerror 7\nerror 7\nerror 7\nerror 7\nerror 7\nerror 7\nerror 7\nerror 7\nerror 7\nerror 7\nerror 7\nerror 7\nerror 7\nerror 7\nerror 7\nerror 7\nerror 7\nerror 7\nerror 7\nerror 7\nerror 7\nerror 7\nerror 7\nerror 7\nerror 7\nerror 7\nerror 7\nerror 7\nerror 7\nerror 7\nerror 7\nerror 7\nerror 7\nerror 7\nerror 7\nerror 7\nerror 7\nerror 7\nerror 7\nerror 7\nerror 7\nerror 7\nerror 7\nerror 7\nerror 7\nerror 7\nerror 7\nerror 7\nerror 7\nerror 7\nerror 7\nerror 7\nerror 7\nerror 7\nerror 7\nerror 7\nerror 7\nerror 7\nerror 7\nerror 7\nerror 7\nerror 7\nerror 7\nerror 7\nerror 7\nerror 7\nerror 7\nerror 7\nerror 7\nerror 7\nerror 7\nerror 7\nerror 7\nerror 7\nerror 7\nerror 7\nerror 7\nerror 7\nerror 7\nerror 7\n"
```

Logs from router

```
Jan 27 23:53:53.828: INFO: Scoped Router test should serve the correct routes when scoped to a single namespace and label set [Suite:openshift/conformance/parallel] - waiting for the healthz endpoint to respond logs:
 I0127 23:51:07.732698       1 template.go:260] Starting template router (v3.9.0-alpha.3+5ec2b2f-218)
I0127 23:51:07.733031       1 merged_client_builder.go:123] Using in-cluster configuration
I0127 23:51:07.737366       1 reflector.go:202] Starting reflector *core.Service (10m0s) from github.com/openshift/origin/pkg/router/template/service_lookup.go:32
I0127 23:51:07.737420       1 reflector.go:240] Listing and watching *core.Service from github.com/openshift/origin/pkg/router/template/service_lookup.go:32
I0127 23:51:07.739146       1 router.go:154] Creating a new template router, writing to /var/lib/haproxy/router
I0127 23:51:07.739275       1 router.go:228] Template router will coalesce reloads within 5s of each other
I0127 23:51:07.739318       1 router.go:278] Router default cert from router container
I0127 23:51:07.739327       1 router.go:215] Reading persisted state
I0127 23:51:07.739355       1 router.go:219] Committing state
I0127 23:51:07.739363       1 router.go:333] Writing the router state
I0127 23:51:07.739855       1 router.go:340] Writing the router config
I0127 23:51:07.741933       1 router.go:354] Reloading the router
E0127 23:51:07.760463       1 reflector.go:205] github.com/openshift/origin/pkg/router/template/service_lookup.go:32: Failed to list *core.Service: services is forbidden: User "system:serviceaccount:extended-test-scoped-router-s9pmd-2sj74:default" cannot list services at the cluster scope: User "system:serviceaccount:extended-test-scoped-router-s9pmd-2sj74:default" cannot list all services in the cluster
I0127 23:51:07.816597       1 router.go:441] Router reloaded:
 - Checking http://localhost:80 ...
 - Health check ok : 0 retry attempt(s).
```

The s4sd node ovs health check failed immediately after the router started:

```
Jan 27 23:50:31 ci-prtest-5a37c28-15151-ig-n-s4sd origin-node[2074]: F0127 23:50:31.053104    2074 healthcheck.go:96] SDN healthcheck detected unhealthy OVS server, restarting: Link not found
Jan 27 23:50:31 ci-prtest-5a37c28-15151-ig-n-s4sd systemd[1]: origin-node.service: main process exited, code=exited, status=255/n/a
Jan 27 23:50:31 ci-prtest-5a37c28-15151-ig-n-s4sd systemd[1]: Unit origin-node.service entered failed state.
Jan 27 23:50:31 ci-prtest-5a37c28-15151-ig-n-s4sd systemd[1]: origin-node.service failed.
Jan 27 23:50:36 ci-prtest-5a37c28-15151-ig-n-s4sd systemd[1]: origin-node.service holdoff time over, scheduling restart.
Jan 27 23:50:36 ci-prtest-5a37c28-15151-ig-n-s4sd systemd[1]: Starting OpenShift Node...
```

Also seen in https://openshift-gce-devel.appspot.com/build/origin-ci-test/pr-logs/pull/18331/test_pull_request_origin_extended_conformance_gce/15216/ on node ci-prtest-5a37c28-15216-ig-n-t1xb. I suspect this is happening in many CI runs on GCP.

Is this GCP specific, openshift-sdn specific, a too aggressive check?

Comment 1 Clayton Coleman 2018-01-30 01:30:45 UTC

Looks like a non-trivial amount of our flakes (1/3?  1/4?) are from this or something like it.

Comment 2 Clayton Coleman 2018-01-30 01:38:54 UTC

Also https://openshift-gce-devel.appspot.com/build/origin-ci-test/pr-logs/pull/18323/test_pull_request_origin_extended_conformance_gce/15217/

Jan 29 16:18:21 ci-prtest-5a37c28-15217-ig-n-sqrh origin-node[2102]: F0129 16:18:21.842955    2102 healthcheck.go:96] SDN healthcheck detected unhealthy OVS server, restarting:
Jan 29 16:18:21 ci-prtest-5a37c28-15217-ig-n-sqrh systemd[1]: origin-node.service: main process exited, code=exited, status=255/n/a
Jan 29 16:18:22 ci-prtest-5a37c28-15217-ig-n-sqrh systemd[1]: Unit origin-node.service entered failed state.
Jan 29 16:18:22 ci-prtest-5a37c28-15217-ig-n-sqrh systemd[1]: origin-node.service failed.
Jan 29 16:18:27 ci-prtest-5a37c28-15217-ig-n-sqrh systemd[1]: origin-node.service holdoff time over, scheduling restart.
Jan 29 16:18:27 ci-prtest-5a37c28-15217-ig-n-sqrh systemd[1]: Starting OpenShift Node...

Comment 3 Dan Winship 2018-02-01 16:17:37 UTC

This looks like another symptom of https://github.com/openshift/origin/issues/15991 (which has suddenly started popping up everywhere, though we don't know why now). The "link not found" suggests that the healthcheck is getting run from a thread that has leaked into the wrong network namespace. (And in the 2nd and 3rd examples linked above, the SDN health check failure is preceeded by errors like "CNI request failed with status 400: 'error on port veth70fa69b6: "could not open network device veth70fa69b6 (No such device)"'" when the logs suggest that that device definitely should exist, again suggesting that code was running in the wrong namespace.)

Will be fixed by https://github.com/openshift/origin/pull/18355.

Comment 5 Hongan Li 2018-02-23 08:36:05 UTC

verified in openshift v3.9.0-0.48.0 and cannot reproduce the issue

Comment 8 errata-xmlrpc 2018-03-28 14:23:55 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2018:0489