Bug 1754154
Summary: Loopback is accidentally torn down.

| Field | Value | Field | Value |
|---|---|---|---|
| Product: | OpenShift Container Platform | Reporter: | Kirsten Garrison <kgarriso> |
| Component: | Networking | Assignee: | Casey Callendrello <cdc> |
| Networking sub component: | openshift-sdn | QA Contact: | zhaozhanqi <zzhao> |
| Status: | CLOSED ERRATA | Docs Contact: | |
| Severity: | urgent | | |
| Priority: | urgent | CC: | arghosh, bbennett, dcbw, eparis, mrhodes, mzali, trankin, travi |
| Version: | 4.2.0 | | |
| Target Milestone: | --- | | |
| Target Release: | 4.3.0 | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | | |
| | 1754638 (view as bug list) | Environment: | |
| Last Closed: | 2020-01-23 11:06:39 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Bug Depends On: | | | |
| Bug Blocks: | 1754638 | | |
Description
Kirsten Garrison
2019-09-21 09:05:56 UTC
Is this related to https://bugzilla.redhat.com/show_bug.cgi?id=1753801?

Something odd is going on. Node ip-10-0-152-60.ec2.internal is not ready. This is because openshift-sdn hasn't written the SDN configuration file, which is expected. However, looking at the logs of the SDN pod on the node, it is happily starting up and getting killed just before it writes its configuration file: https://storage.googleapis.com/origin-ci-test/pr-logs/pull/openshift_machine-config-operator/1127/pull-ci-openshift-machine-config-operator-master-e2e-aws-op/3522/artifacts/e2e-aws-op/pods/openshift-sdn_sdn-6qckt_sdn.log

```
I0921 07:07:05.159854   12152 cmd.go:173] openshift-sdn network plugin registering startup
I0921 07:07:05.159932   12152 cmd.go:177] openshift-sdn network plugin ready
I0921 07:07:16.224353   12152 roundrobin.go:310] LoadBalancerRR: Setting endpoints for openshift-sdn/sdn:metrics to [10.0.134.176:9101 10.0.136.46:9101 10.0.137.153:9101 10.0.138.235:9101 10.0.152.60:9101 10.0.153.234:9101]
I0921 07:07:16.224390   12152 roundrobin.go:240] Delete endpoint 10.0.152.60:9101 for service "openshift-sdn/sdn:metrics"
I0921 07:07:16.224439   12152 proxy.go:331] hybrid proxy: syncProxyRules start
I0921 07:07:16.378919   12152 proxy.go:334] hybrid proxy: mainProxy.syncProxyRules complete
I0921 07:07:16.438034   12152 proxier.go:367] userspace proxy: processing 0 service events
I0921 07:07:16.438055   12152 proxier.go:346] userspace syncProxyRules took 59.113113ms
I0921 07:07:16.438066   12152 proxy.go:337] hybrid proxy: unidlingProxy.syncProxyRules complete
interrupt: Gracefully shutting down ...
```
And if you watch the multus logs:

```
2019-09-21T07:07:05+0000 Using /host/var/run/multus/cni/net.d/80-openshift-network.conf as a source to generate the Multus configuration
2019-09-21T07:07:05+0000 Config file created @ /host/etc/cni/net.d/00-multus.conf
{
  "cniVersion": "0.3.1",
  "name": "multus-cni-network",
  "type": "multus",
  "namespaceIsolation": true,
  "logLevel": "verbose",
  "binDir": "/opt/multus/bin",
  "kubeconfig": "/etc/kubernetes/cni/net.d/multus.d/multus.kubeconfig",
  "delegates": [
    {
      "cniVersion": "0.3.1",
      "name": "openshift-sdn",
      "type": "openshift-sdn"
    }
  ]
}
2019-09-21T07:07:05+0000 Continuing watch loop after configuration regeneration...
2019-09-21T07:07:34+0000 Master plugin @ /host/var/run/multus/cni/net.d/80-openshift-network.conf has been deleted. Performing cleanup...
2019-09-21T07:07:34+0000 Multus configuration intentionally invalidated to prevent pods from being scheduled.
```

Okay, that deletion is the shutdown handler for the openshift-sdn pod removing the config file. So: what is killing openshift-sdn? Looking...

Aha. The kubelet is killing it because it failed liveness:

```
Sep 21 07:07:13 ip-10-0-152-60 hyperkube[1365]: I0921 07:07:13.494388    1365 prober.go:112] Liveness probe for "sdn-6qckt_openshift-sdn(8be9db67-dc35-11e9-a4c5-12d5b2c68656):sdn" failed (failure): Get http://10.0.152.60:10256/healthz: net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)
Sep 21 07:07:13 ip-10-0-152-60 hyperkube[1365]: I0921 07:07:13.494618    1365 event.go:209] Event(v1.ObjectReference{Kind:"Pod", Namespace:"openshift-sdn", Name:"sdn-6qckt", UID:"8be9db67-dc35-11e9-a4c5-12d5b2c68656", APIVersion:"v1", ResourceVersion:"41661", FieldPath:"spec.containers{sdn}"}): type: 'Warning' reason: 'Unhealthy' Liveness probe failed: Get http://10.0.152.60:10256/healthz: net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)
```

Overloaded node and bad luck? Something else? All of the other nodes are ready.
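The multus log lines above describe a watch loop: a config is generated from a "master plugin" file, and when that file disappears, the generated config is intentionally invalidated so no new pods are scheduled onto a dead network. A rough, self-contained sketch of that behavior (paths and the invalid-marker content are stand-ins, not the real `/host/etc/cni/net.d` files or the actual Multus entrypoint logic):

```python
# Sketch of the watch-loop behavior seen in the multus logs: regenerate the
# Multus config from the master plugin file, or invalidate it if the master
# plugin has been deleted. All paths here are temp-dir stand-ins.
import json
import os
import tempfile

INVALID_MARKER = {"invalid": "master plugin removed"}  # hypothetical sentinel

def sync_once(master_path, generated_path):
    """One iteration: regenerate the config, or invalidate it on deletion."""
    if os.path.exists(master_path):
        with open(master_path) as f:
            delegate = json.load(f)
        conf = {
            "cniVersion": "0.3.1",
            "name": "multus-cni-network",
            "type": "multus",
            "delegates": [delegate],
        }
    else:
        # Master plugin deleted: "Performing cleanup", as in the 07:07:34 line.
        conf = INVALID_MARKER
    with open(generated_path, "w") as f:
        json.dump(conf, f)
    return conf

tmp = tempfile.mkdtemp()
master = os.path.join(tmp, "80-openshift-network.conf")
generated = os.path.join(tmp, "00-multus.conf")

with open(master, "w") as f:
    json.dump({"cniVersion": "0.3.1", "name": "openshift-sdn",
               "type": "openshift-sdn"}, f)

generated_conf = sync_once(master, generated)    # normal regeneration
os.remove(master)                                # SDN shutdown handler removes it
invalidated_conf = sync_once(master, generated)  # loop invalidates the config
print(invalidated_conf)
```

The point of the invalidation step is that a missing delegate should fail pod scheduling loudly rather than silently falling back to a broken network.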
Hey Kirsten, is this happening for any other sorts of tests, or just the MCD? Because openshift-sdn started failing liveness checks exactly after the MCD wrote its new config...

```
Sep 21 07:03:03 ip-10-0-152-60 root[3392]: machine-config-daemon[2054]: Disk currentConfig rendered-worker-e5739f218cf71a7afcdca926fb119202 overrides node annotation rendered-worker-683505cbc36356e2029b40f24b44100c
Sep 21 07:03:03 ip-10-0-152-60 logger[3393]: rendered-worker-e5739f218cf71a7afcdca926fb119202
Sep 21 07:03:03 ip-10-0-152-60 root[3394]: machine-config-daemon[2054]: completed update for config rendered-worker-e5739f218cf71a7afcdca926fb119202
Sep 21 07:03:03 ip-10-0-152-60 hyperkube[1365]: I0921 07:03:03.493375    1365 prober.go:112] Liveness probe for "sdn-6qckt_openshift-sdn(8be9db67-dc35-11e9-a4c5-12d5b2c68656):sdn" failed (failure): Get http://10.0.152.60:10256/healthz: net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)
Sep 21 07:03:03 ip-10-0-152-60 hyperkube[1365]: I0921 07:03:03.493687    1365 event.go:209] Event(v1.ObjectReference{Kind:"Pod", Namespace:"openshift-sdn", Name:"sdn-6qckt", UID:"8be9db67-dc35-11e9-a4c5-12d5b2c68656", APIVersion:"v1", ResourceVersion:"41661", FieldPath:"spec.containers{sdn}"}): type: 'Warning' reason: 'Unhealthy' Liveness probe failed: Get http://10.0.152.60:10256/healthz: net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)
```

Thoughts?

Summary: sdn-6qckt is running on the node and seems ready.

- 07:01 - node reboots into rendered-worker-e5739f...
- 07:02:42 - kubelet starts setting up sdn-6qckt
- 07:02:52 - liveness probe succeeds
- 07:03:03 - liveness probe for sdn-6qckt fails due to timeout. It never succeeds again.

The liveness probe is a direct connection over loopback. It does not go over the pod network, as both are host-network. If this is failing at the network level, something is very wrong.
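For context on the failure mode above: the kubelet's HTTP liveness probe is just an HTTP GET against the endpoint with a short timeout, and a timeout counts as a failure. A minimal self-contained sketch of that check (the local server and port here are stand-ins, not the real openshift-sdn health server on :10256):

```python
# Sketch of an HTTP liveness check: GET the healthz endpoint with a timeout,
# and treat any timeout or error as a probe failure. The server below is a
# stand-in so the example runs anywhere.
import http.server
import threading
import urllib.error
import urllib.request

class Healthz(http.server.BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path == "/healthz":
            body = b"ok"
            self.send_response(200)
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)
        else:
            self.send_response(404)
            self.end_headers()

    def log_message(self, *args):  # keep the sketch quiet
        pass

def probe(url, timeout=1.0):
    """Return True only if the endpoint answers 200 within the timeout."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # A timeout here is the "Client.Timeout exceeded while awaiting
        # headers" failure seen in the kubelet logs above.
        return False

server = http.server.HTTPServer(("127.0.0.1", 0), Healthz)
threading.Thread(target=server.serve_forever, daemon=True).start()
port = server.server_address[1]

healthy = probe(f"http://127.0.0.1:{port}/healthz")
unhealthy = probe(f"http://127.0.0.1:{port}/nope")  # 404 -> probe failure
server.shutdown()
print(healthy, unhealthy)
```

Because both the kubelet and the sdn pod are host-network, this GET to the node's own address never leaves the machine, which is why a persistent timeout points at loopback itself rather than the pod network.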
Additional debugging has shown that this *almost* always happens for aws-ci-op for the MCO, but not always. The exception: https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/pr-logs/pull/23839/pull-ci-openshift-origin-master-e2e-aws-upgrade/4844

Found the problem. `lo` is down. Now, the question: why?

AHA: from the cri-o log:

```
Error deleting network: failed to Statfs "/proc/0/ns/net": no such file or directory
```

I suspect we're tearing down containers and hitting pid reuse.

Hi Casey, to answer your question above: we're only seeing it in the MCD (but the MCD is what initiates the update/reboot functions). In those e2e tests, the MCD applies an update, reboots the node, and verifies that the state of the node is updated as expected.

So, some follow-up analysis: the recent ocicni changes finally enabled tearing down loopback (because the cache files were sitting around). However, there is an issue with ocicni in that it caches pids across reboots. After a reboot, the kubelet cleans up old sandboxes. This causes cri-o to do a CNI DEL of the container... with the namespace /proc/<pid>/ns/net. However, if <pid> has been recycled (because this is a reboot), we end up setting `lo` down... on the node.

Updated to 4.3.0; will clone into 4.2.0.

Verified this bug on 4.3.0-0.ci-2019-10-08-225758; the node works well.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2020:0062
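The pid-reuse hazard analyzed above can be sketched in miniature. One common guard (illustrative only; this is not necessarily the actual ocicni/cri-o fix) is to cache the process start time alongside the pid and refuse to use `/proc/<pid>/ns/net` when the start time no longer matches, since after a reboot every cached start time is stale. The "proc table" below is an in-memory stand-in for `/proc/<pid>/stat` so the example is self-contained:

```python
# Illustrative sketch of guarding a cached netns handle against pid reuse:
# record the pid's start time when caching, and verify it before trusting
# /proc/<pid>/ns/net. On a real system the start time is field 22 of
# /proc/<pid>/stat; here a dict stands in for the proc table.

# proc[pid] = start_time
proc_before_reboot = {4242: 1000}  # sandbox process started at tick 1000
proc_after_reboot = {4242: 7}      # same pid recycled by an early boot process

def cache_netns_handle(proc, pid):
    """Record pid plus its start time, like a CNI cache entry would."""
    return {"pid": pid, "start_time": proc[pid]}

def netns_path_for_delete(proc, entry):
    """Return /proc/<pid>/ns/net only if the pid still names the same process."""
    pid = entry["pid"]
    if proc.get(pid) != entry["start_time"]:
        # Pid is gone or recycled: skip the CNI DEL against this namespace
        # instead of tearing down lo in whatever netns the pid now lives in.
        return None
    return f"/proc/{pid}/ns/net"

entry = cache_netns_handle(proc_before_reboot, 4242)

# Same boot: the handle is still valid, teardown may proceed.
same_boot = netns_path_for_delete(proc_before_reboot, entry)

# After reboot: pid 4242 was recycled, so the stale handle is rejected and
# the node's loopback is never touched.
after_reboot = netns_path_for_delete(proc_after_reboot, entry)
print(same_boot, after_reboot)
```

Skipping the DEL for a stale handle is safe here because the reboot already destroyed the sandbox's network namespace; there is nothing left to tear down.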