Bug 1754638 - Missing CNI default network
Summary: Missing CNI default network
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Networking
Version: 4.2.0
Hardware: Unspecified
OS: Unspecified
Priority: urgent
Severity: urgent
Target Milestone: ---
Target Release: 4.2.0
Assignee: Casey Callendrello
QA Contact: zhaozhanqi
URL:
Whiteboard:
Duplicates: 1753801 1754125 1754434 1754646 1791011
Depends On: 1754154
Blocks:
 
Reported: 2019-09-23 19:17 UTC by Ben Bennett
Modified: 2023-09-07 20:39 UTC
CC: 16 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of: 1754154
Environment:
Last Closed: 2019-10-16 06:41:50 UTC
Target Upstream Version:
Embargoed:




Links
GitHub openshift/containernetworking-plugins pull 15 (closed): "Bug 1754638: loopback: don't tear down" (last updated 2021-01-07 20:15:19 UTC)
Red Hat Product Errata RHBA-2019:2922 (last updated 2019-10-16 06:42:07 UTC)

Description Ben Bennett 2019-09-23 19:17:56 UTC
+++ This bug was initially created as a clone of Bug #1754154 +++

Description of problem:

A flake that has started to hit a PR regularly in CI and is blocking it:

Seen in the workers' journal after a reboot:
https://storage.googleapis.com/origin-ci-test/pr-logs/pull/openshift_machine-config-operator/1127/pull-ci-openshift-machine-config-operator-master-e2e-aws-op/3520/artifacts/e2e-aws-op/nodes/workers-journal:

```
Sep 21 03:00:28 ip-10-0-140-173 hyperkube[1346]: I0921 03:00:28.784508    1346 setters.go:521] Node became not ready: {Type:Ready Status:False LastHeartbeatTime:2019-09-21 03:00:28.784452389 +0000 UTC m=+71.300956518 LastTransitionTime:2019-09-21 03:00:28.784452389 +0000 UTC m=+71.300956518 Reason:KubeletNotReady Message:runtime network not ready: NetworkReady=false reason:NetworkPluginNotReady message:Network plugin returns error: Missing CNI default network}
Sep 21 03:00:28 ip-10-0-140-173 hyperkube[1346]: I0921 03:00:28.785392    1346 event.go:209] Event(v1.ObjectReference{Kind:"Node", Namespace:"", Name:"ip-10-0-140-173.ec2.internal", UID:"ip-10-0-140-173.ec2.internal", APIVersion:"", ResourceVersion:"", FieldPath:""}): type: 'Normal' reason: 'NodeNotReady' Node ip-10-0-140-173.ec2.internal status is now: NodeNotReady
....
Sep 21 03:00:58 ip-10-0-140-173 hyperkube[1346]: I0921 03:00:58.812503    1346 setters.go:521] Node became not ready: {Type:Ready Status:False LastHeartbeatTime:2019-09-21 03:00:58.812452071 +0000 UTC m=+101.328956195 LastTransitionTime:2019-09-21 03:00:58.812452071 +0000 UTC m=+101.328956195 Reason:KubeletNotReady Message:runtime network not ready: NetworkReady=false reason:NetworkPluginNotReady message:Network plugin returns error: Missing CNI default network}
```

I believe this is causing the Kubelet health check to fail:
https://storage.googleapis.com/origin-ci-test/pr-logs/pull/openshift_machine-config-operator/1127/pull-ci-openshift-machine-config-operator-master-e2e-aws-op/3520/artifacts/e2e-aws-op/pods/openshift-machine-config-operator_machine-config-daemon-gklq9_machine-config-daemon.log

```
W0921 03:00:22.780592    2246 daemon.go:573] Failed kubelet health check: Get http://localhost:10248/healthz: dial tcp 127.0.0.1:10248: i/o timeout
W0921 03:01:22.781017    2246 daemon.go:573] Failed kubelet health check: Get http://localhost:10248/healthz: context deadline exceeded
W0921 03:02:22.781451    2246 daemon.go:573] Failed kubelet health check: Get http://localhost:10248/healthz: context deadline exceeded
W0921 03:02:22.781489    2246 daemon.go:508] Got an error from auxiliary tools: kubelet health failure threshold reached
```


See also: https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/pr-logs/pull/openshift_machine-config-operator/1127/pull-ci-openshift-machine-config-operator-master-e2e-aws-op/3522

https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/pr-logs/pull/openshift_machine-config-operator/1127/pull-ci-openshift-machine-config-operator-master-e2e-aws-op/3520

--- Additional comment from Kirsten Garrison on 2019-09-21 05:14:25 EDT ---

Is this related to?: https://bugzilla.redhat.com/show_bug.cgi?id=1753801

--- Additional comment from Casey Callendrello on 2019-09-23 09:41:49 EDT ---

Something odd is going on. Node ip-10-0-152-60.ec2.internal is not ready. This is because openshift-sdn hasn't written the SDN configuration file, which is expected.

However, looking at the logs of the SDN pod on the node, it is happily starting up and getting killed just before it writes its configuration file:

https://storage.googleapis.com/origin-ci-test/pr-logs/pull/openshift_machine-config-operator/1127/pull-ci-openshift-machine-config-operator-master-e2e-aws-op/3522/artifacts/e2e-aws-op/pods/openshift-sdn_sdn-6qckt_sdn.log

I0921 07:07:05.159854   12152 cmd.go:173] openshift-sdn network plugin registering startup
I0921 07:07:05.159932   12152 cmd.go:177] openshift-sdn network plugin ready
I0921 07:07:16.224353   12152 roundrobin.go:310] LoadBalancerRR: Setting endpoints for openshift-sdn/sdn:metrics to [10.0.134.176:9101 10.0.136.46:9101 10.0.137.153:9101 10.0.138.235:9101 10.0.152.60:9101 10.0.153.234:9101]
I0921 07:07:16.224390   12152 roundrobin.go:240] Delete endpoint 10.0.152.60:9101 for service "openshift-sdn/sdn:metrics"
I0921 07:07:16.224439   12152 proxy.go:331] hybrid proxy: syncProxyRules start
I0921 07:07:16.378919   12152 proxy.go:334] hybrid proxy: mainProxy.syncProxyRules complete
I0921 07:07:16.438034   12152 proxier.go:367] userspace proxy: processing 0 service events
I0921 07:07:16.438055   12152 proxier.go:346] userspace syncProxyRules took 59.113113ms
I0921 07:07:16.438066   12152 proxy.go:337] hybrid proxy: unidlingProxy.syncProxyRules complete
interrupt: Gracefully shutting down ...

And if you watch the multus logs:
2019-09-21T07:07:05+0000 Using /host/var/run/multus/cni/net.d/80-openshift-network.conf as a source to generate the Multus configuration
2019-09-21T07:07:05+0000 Config file created @ /host/etc/cni/net.d/00-multus.conf
{ "cniVersion": "0.3.1", "name": "multus-cni-network", "type": "multus", "namespaceIsolation": true, "logLevel": "verbose", "binDir": "/opt/multus/bin", "kubeconfig": "/etc/kubernetes/cni/net.d/multus.d/multus.kubeconfig", "delegates": [ { "cniVersion": "0.3.1", "name": "openshift-sdn", "type": "openshift-sdn" } ] }
2019-09-21T07:07:05+0000 Continuing watch loop after configuration regeneration...
2019-09-21T07:07:34+0000 Master plugin @ /host/var/run/multus/cni/net.d/80-openshift-network.conf has been deleted. Performing cleanup...
2019-09-21T07:07:34+0000 Multus configuration intentionally invalidated to prevent pods from being scheduled.

Okay, that is the shutdown handler for the openshift-sdn pod removing the configuration file. So: what is killing openshift-sdn? Looking...

--- Additional comment from Casey Callendrello on 2019-09-23 09:45:39 EDT ---

Aha. The kubelet is killing it because it failed its liveness probe:

Sep 21 07:07:13 ip-10-0-152-60 hyperkube[1365]: I0921 07:07:13.494388    1365 prober.go:112] Liveness probe for "sdn-6qckt_openshift-sdn(8be9db67-dc35-11e9-a4c5-12d5b2c68656):sdn" failed (failure): Get http://10.0.152.60:10256/healthz: net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)
Sep 21 07:07:13 ip-10-0-152-60 hyperkube[1365]: I0921 07:07:13.494618    1365 event.go:209] Event(v1.ObjectReference{Kind:"Pod", Namespace:"openshift-sdn", Name:"sdn-6qckt", UID:"8be9db67-dc35-11e9-a4c5-12d5b2c68656", APIVersion:"v1", ResourceVersion:"41661", FieldPath:"spec.containers{sdn}"}): type: 'Warning' reason: 'Unhealthy' Liveness probe failed: Get http://10.0.152.60:10256/healthz: net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)

Overloaded node and bad luck? Something else? All of the other nodes are ready.

--- Additional comment from Casey Callendrello on 2019-09-23 10:17:05 EDT ---

Hey Kirsten, is this happening for any other sorts of tests, or just the MCD? Because openshift-sdn started failing liveness checks exactly after the MCD wrote its new config...

Sep 21 07:03:03 ip-10-0-152-60 root[3392]: machine-config-daemon[2054]: Disk currentConfig rendered-worker-e5739f218cf71a7afcdca926fb119202 overrides node annotation rendered-worker-683505cbc36356e2029b40f24b44100c
Sep 21 07:03:03 ip-10-0-152-60 logger[3393]: rendered-worker-e5739f218cf71a7afcdca926fb119202
Sep 21 07:03:03 ip-10-0-152-60 root[3394]: machine-config-daemon[2054]: completed update for config rendered-worker-e5739f218cf71a7afcdca926fb119202
Sep 21 07:03:03 ip-10-0-152-60 hyperkube[1365]: I0921 07:03:03.493375    1365 prober.go:112] Liveness probe for "sdn-6qckt_openshift-sdn(8be9db67-dc35-11e9-a4c5-12d5b2c68656):sdn" failed (failure): Get http://10.0.152.60:10256/healthz: net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)
Sep 21 07:03:03 ip-10-0-152-60 hyperkube[1365]: I0921 07:03:03.493687    1365 event.go:209] Event(v1.ObjectReference{Kind:"Pod", Namespace:"openshift-sdn", Name:"sdn-6qckt", UID:"8be9db67-dc35-11e9-a4c5-12d5b2c68656", APIVersion:"v1", ResourceVersion:"41661", FieldPath:"spec.containers{sdn}"}): type: 'Warning' reason: 'Unhealthy' Liveness probe failed: Get http://10.0.152.60:10256/healthz: net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)


Thoughts?

--- Additional comment from Casey Callendrello on 2019-09-23 10:45:33 EDT ---

Summary:

sdn-6qckt is running on the node and seems ready.

07:01 - node reboots into rendered-worker-e5739f...
07:02:42 - kubelet starts setting up sdn-6qckt
07:02:52 - liveness probe succeeds
07:03:03 - liveness probe for sdn-6qckt fails due to timeout. It never succeeds again.

The liveness probe is a direct connection over loopback. It does not go over the pod network, as both are host-network. If this is failing at the network level, something is very wrong.

--- Additional comment from Casey Callendrello on 2019-09-23 12:38:07 EDT ---

Additional debugging has shown that this *almost* always happens for aws-ci-op for the MCO, but not always.

The exception: https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/pr-logs/pull/23839/pull-ci-openshift-origin-master-e2e-aws-upgrade/4844

--- Additional comment from Casey Callendrello on 2019-09-23 13:11:50 EDT ---

Found the problem. `lo` is down. Now, the question: why?
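
For illustration, a minimal Go sketch (not part of the original investigation; only the interface name is assumed) of the check this finding boils down to: whether the node's loopback interface is administratively up. While `lo` is down, anything bound to 127.0.0.1, such as the kubelet healthz endpoint or the SDN liveness endpoint, simply times out.

```go
// Minimal diagnostic sketch: report whether the loopback interface is up.
package main

import (
	"fmt"
	"net"
	"os"
)

func main() {
	lo, err := net.InterfaceByName("lo")
	if err != nil {
		fmt.Fprintln(os.Stderr, "lookup lo:", err)
		os.Exit(1)
	}
	if lo.Flags&net.FlagUp == 0 {
		// This is the state the node was found in: connections to
		// 127.0.0.1 (kubelet healthz, SDN liveness) time out.
		fmt.Println("lo is DOWN")
		os.Exit(2)
	}
	fmt.Println("lo is up")
}
```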

--- Additional comment from Casey Callendrello on 2019-09-23 13:50:39 EDT ---

AHA: from crio log:

"Error deleting network: failed to Statfs "/proc/0/ns/net": no such file or directory"


I suspect we're tearing down containers and hitting pid reuse.

--- Additional comment from Kirsten Garrison on 2019-09-23 13:57:35 EDT ---

Hi Casey,

To answer your question above, we are only seeing it in the MCD (but the MCD is what initiates the update/reboot functions). In those e2e tests, the MCD applies an update, reboots the node, and verifies that the state of the node is updated as expected.

--- Additional comment from Casey Callendrello on 2019-09-23 14:40:17 EDT ---

https://github.com/openshift/containernetworking-plugins/pull/15

--- Additional comment from Casey Callendrello on 2019-09-23 14:51:28 EDT ---

So, some follow-up analysis:

The recent oci-cni changes finally enabled tearing down loopback (because the cache files were sitting around). However, there is an issue with oci-cni: it caches PIDs across reboots.

After a reboot, the kubelet cleans up old sandboxes. This causes cri-o to do a CNI DEL of the container... with the namespace /proc/<pid>/ns/net. However, if <pid> has been recycled (because this is a reboot), we end up setting lo down... on the node.
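
For illustration, a hedged Go sketch of one possible guard against this failure mode (hypothetical; this is not the code in openshift/containernetworking-plugins#15): before tearing down `lo` during a CNI DEL, check whether the cached /proc/<pid>/ns/net path actually resolves to the host's network namespace, which is exactly what happens when the PID has been recycled after a reboot. Two paths refer to the same namespace when the device and inode of the namespace files match.

```go
// Hypothetical guard for a CNI DEL path: never touch lo if the target netns
// turns out to be the host's network namespace (e.g. because of a recycled PID).
package main

import (
	"fmt"
	"os"
	"syscall"
)

// sameNetNS reports whether two netns paths refer to the same network
// namespace by comparing the device and inode of the namespace files.
func sameNetNS(a, b string) (bool, error) {
	var sa, sb syscall.Stat_t
	if err := syscall.Stat(a, &sa); err != nil {
		return false, err
	}
	if err := syscall.Stat(b, &sb); err != nil {
		return false, err
	}
	return sa.Dev == sb.Dev && sa.Ino == sb.Ino, nil
}

func main() {
	target := "/proc/12345/ns/net" // hypothetical netns path from a stale cache entry
	hostNS := "/proc/1/ns/net"     // PID 1 always lives in the host netns

	same, err := sameNetNS(target, hostNS)
	if err != nil {
		fmt.Fprintln(os.Stderr, "stat netns:", err)
		os.Exit(1)
	}
	if same {
		fmt.Println("refusing to tear down loopback: target netns is the host netns")
		return
	}
	fmt.Println("safe to tear down loopback in", target)
}
```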

Comment 1 Ben Bennett 2019-09-23 19:37:04 UTC
We need to fix this properly in master.  Perhaps by caching and checking the PID's start time.
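
A rough sketch of that idea, under the assumption that the cache would store the PID's start time (field 22 of /proc/<pid>/stat, in clock ticks since boot) next to the PID itself; if the value read back at DEL time differs from the cached one, the PID has been recycled and the cached netns path must not be trusted. The function name and layout here are illustrative only.

```go
// Illustrative helper: read the start time of a PID from /proc/<pid>/stat so
// it could be cached alongside the PID and compared later to detect reuse.
package main

import (
	"fmt"
	"os"
	"strconv"
	"strings"
)

// pidStartTime returns field 22 (starttime) of /proc/<pid>/stat. The comm
// field may contain spaces and parentheses, so parsing starts after the
// last ')'.
func pidStartTime(pid int) (uint64, error) {
	data, err := os.ReadFile(fmt.Sprintf("/proc/%d/stat", pid))
	if err != nil {
		return 0, err
	}
	s := string(data)
	i := strings.LastIndex(s, ")")
	if i < 0 {
		return 0, fmt.Errorf("malformed stat for pid %d", pid)
	}
	fields := strings.Fields(s[i+1:])
	// fields[0] is field 3 (state), so field 22 (starttime) is fields[19].
	if len(fields) < 20 {
		return 0, fmt.Errorf("short stat for pid %d", pid)
	}
	return strconv.ParseUint(fields[19], 10, 64)
}

func main() {
	st, err := pidStartTime(os.Getpid())
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	fmt.Println("start time (clock ticks since boot):", st)
}
```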

Comment 2 Kirsten Garrison 2019-09-23 19:53:51 UTC
*** Bug 1754646 has been marked as a duplicate of this bug. ***

Comment 3 Ben Bennett 2019-09-23 20:13:34 UTC
*** Bug 1754434 has been marked as a duplicate of this bug. ***

Comment 4 Ben Bennett 2019-09-23 20:14:43 UTC
*** Bug 1753801 has been marked as a duplicate of this bug. ***

Comment 5 Kirsten Garrison 2019-09-24 00:11:14 UTC
This is still in POST, but the PR (https://github.com/openshift/containernetworking-plugins/pull/15/) did merge; we had label/bot issues earlier, so I am moving this to MODIFIED manually.

Comment 7 zhaozhanqi 2019-09-25 06:14:56 UTC
Verified this bug on 4.2.0-0.nightly-2019-09-24-025718.
The node works well in this build.

Comment 8 Michal Fojtik 2019-09-30 11:49:38 UTC
*** Bug 1754125 has been marked as a duplicate of this bug. ***

Comment 9 Casey Callendrello 2019-10-03 15:42:52 UTC
*** Bug 1754125 has been marked as a duplicate of this bug. ***

Comment 10 Colin Walters 2019-10-03 19:41:28 UTC
> We need to fix this properly in master.  Perhaps by caching and checking the PID's start time.

I don't think anything involving PIDs should be stored persistently on disk.  Use /run for that stuff.
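
A small sketch of that suggestion, with hypothetical paths (this is not the real oci-cni cache layout): bookkeeping written under /run lives on a tmpfs, so it is gone after a reboot and stale PIDs can never be replayed.

```go
// Hypothetical example: keep PID/netns bookkeeping under /run (tmpfs), so it
// cannot survive a reboot.
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

func main() {
	cacheDir := "/run/example-cni/results" // hypothetical path; empty after every reboot
	if err := os.MkdirAll(cacheDir, 0o700); err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	entry := filepath.Join(cacheDir, "sandbox-1234.json")
	if err := os.WriteFile(entry, []byte(`{"pid":1234,"netns":"/proc/1234/ns/net"}`), 0o600); err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	fmt.Println("cached under tmpfs:", entry)
}
```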

Comment 11 errata-xmlrpc 2019-10-16 06:41:50 UTC
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory, and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2019:2922

Comment 12 Orit Wasserman 2020-02-17 16:44:07 UTC
*** Bug 1791011 has been marked as a duplicate of this bug. ***

Comment 13 Arnab Ghosh 2020-05-12 12:29:53 UTC
@team, the customer (CU) is using 4.2.10 and is still facing this issue.

~~~
Apr 30 17:28:13 ip-10-134-209-71 hyperkube[4410]: E0430 17:28:13.307796    4410 pod_workers.go:190] Error syncing pod 3de782af-8a5b-11ea-ad85-1225addc5c9f ("dns-default-l5gqs_openshift-dns(3de782af-8a5b-11ea-ad85-1225addc5c9f)"), skipping: network is not ready: runtime network not ready: NetworkReady=false reason:NetworkPluginNotReady message:Network plugin returns error: Missing CNI default network
Apr 30 17:28:13 ip-10-134-209-71 hyperkube[4410]: I0430 17:28:13.307867    4410 event.go:209] Event(v1.ObjectReference{Kind:"Pod", Namespace:"openshift-dns", Name:"dns-default-l5gqs", UID:"3de782af-8a5b-11ea-ad85-1225addc5c9f", APIVersion:"v1", ResourceVersion:"4786039", FieldPath:""}): type: 'Warning' reason: 'NetworkNotReady' network is not ready: runtime network not ready: NetworkReady=false reason:NetworkPluginNotReady message:Network plugin returns error: Missing CNI default network
Apr 30 17:28:13 ip-10-134-209-71 hyperkube[4410]: E0430 17:28:13.625586    4410 kubelet.go:2180] Container runtime network not ready: NetworkReady=false reason:NetworkPluginNotReady message:Network plugin returns error: Missing CNI default network
~~~

