Description of problem:

Flake that has started to hit a PR regularly in CI and is blocking. Seeing in workers journal after a reboot: https://storage.googleapis.com/origin-ci-test/pr-logs/pull/openshift_machine-config-operator/1127/pull-ci-openshift-machine-config-operator-master-e2e-aws-op/3520/artifacts/e2e-aws-op/nodes/workers-journal

```
Sep 21 03:00:28 ip-10-0-140-173 hyperkube[1346]: I0921 03:00:28.784508 1346 setters.go:521] Node became not ready: {Type:Ready Status:False LastHeartbeatTime:2019-09-21 03:00:28.784452389 +0000 UTC m=+71.300956518 LastTransitionTime:2019-09-21 03:00:28.784452389 +0000 UTC m=+71.300956518 Reason:KubeletNotReady Message:runtime network not ready: NetworkReady=false reason:NetworkPluginNotReady message:Network plugin returns error: Missing CNI default network}
Sep 21 03:00:28 ip-10-0-140-173 hyperkube[1346]: I0921 03:00:28.785392 1346 event.go:209] Event(v1.ObjectReference{Kind:"Node", Namespace:"", Name:"ip-10-0-140-173.ec2.internal", UID:"ip-10-0-140-173.ec2.internal", APIVersion:"", ResourceVersion:"", FieldPath:""}): type: 'Normal' reason: 'NodeNotReady' Node ip-10-0-140-173.ec2.internal status is now: NodeNotReady
....
Sep 21 03:00:58 ip-10-0-140-173 hyperkube[1346]: I0921 03:00:58.812503 1346 setters.go:521] Node became not ready: {Type:Ready Status:False LastHeartbeatTime:2019-09-21 03:00:58.812452071 +0000 UTC m=+101.328956195 LastTransitionTime:2019-09-21 03:00:58.812452071 +0000 UTC m=+101.328956195 Reason:KubeletNotReady Message:runtime network not ready: NetworkReady=false reason:NetworkPluginNotReady message:Network plugin returns error: Missing CNI default network}
```

I believe this is causing the Kubelet health check to fail: https://storage.googleapis.com/origin-ci-test/pr-logs/pull/openshift_machine-config-operator/1127/pull-ci-openshift-machine-config-operator-master-e2e-aws-op/3520/artifacts/e2e-aws-op/pods/openshift-machine-config-operator_machine-config-daemon-gklq9_machine-config-daemon.log

```
W0921 03:00:22.780592 2246 daemon.go:573] Failed kubelet health check: Get http://localhost:10248/healthz: dial tcp 127.0.0.1:10248: i/o timeout
W0921 03:01:22.781017 2246 daemon.go:573] Failed kubelet health check: Get http://localhost:10248/healthz: context deadline exceeded
W0921 03:02:22.781451 2246 daemon.go:573] Failed kubelet health check: Get http://localhost:10248/healthz: context deadline exceeded
W0921 03:02:22.781489 2246 daemon.go:508] Got an error from auxiliary tools: kubelet health failure threshold reached
```

See also:
https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/pr-logs/pull/openshift_machine-config-operator/1127/pull-ci-openshift-machine-config-operator-master-e2e-aws-op/3522
https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/pr-logs/pull/openshift_machine-config-operator/1127/pull-ci-openshift-machine-config-operator-master-e2e-aws-op/3520
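For context on the second log excerpt: the MCD's kubelet health check is essentially an HTTP probe of the kubelet's healthz endpoint on localhost:10248 that trips after several consecutive failures. A minimal sketch of that pattern in Go (this is not the MCD's actual code; the interval, timeout, and threshold values are assumptions chosen to roughly match the log spacing above):

```go
// Sketch of an MCD-style kubelet health check: poll localhost:10248/healthz
// with a per-request timeout and stop once a failure threshold is reached.
// All constants here are illustrative assumptions, not the MCD's real values.
package main

import (
	"context"
	"fmt"
	"net/http"
	"time"
)

func kubeletHealthy(ctx context.Context) error {
	ctx, cancel := context.WithTimeout(ctx, 4*time.Second)
	defer cancel()
	req, err := http.NewRequestWithContext(ctx, http.MethodGet, "http://localhost:10248/healthz", nil)
	if err != nil {
		return err
	}
	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		return err // e.g. "dial tcp 127.0.0.1:10248: i/o timeout"
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		return fmt.Errorf("kubelet healthz returned %d", resp.StatusCode)
	}
	return nil
}

func main() {
	const maxFailures = 3             // illustrative threshold
	const checkInterval = time.Minute // roughly matches the ~1 min spacing in the log
	failures := 0
	for {
		if err := kubeletHealthy(context.Background()); err != nil {
			failures++
			fmt.Printf("Failed kubelet health check: %v\n", err)
			if failures >= maxFailures {
				fmt.Println("kubelet health failure threshold reached")
				return
			}
		} else {
			failures = 0
		}
		time.Sleep(checkInterval)
	}
}
```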
Is this related to https://bugzilla.redhat.com/show_bug.cgi?id=1753801?
Something odd is going on. Node ip-10-0-152-60.ec2.internal is not ready. This is because openshift-sdn hasn't written the SDN configuration file, which is expected. However, looking at the logs of the SDN pod on the node, it is happily starting up and getting killed just before it writes its configuration file: https://storage.googleapis.com/origin-ci-test/pr-logs/pull/openshift_machine-config-operator/1127/pull-ci-openshift-machine-config-operator-master-e2e-aws-op/3522/artifacts/e2e-aws-op/pods/openshift-sdn_sdn-6qckt_sdn.log

```
I0921 07:07:05.159854 12152 cmd.go:173] openshift-sdn network plugin registering startup
I0921 07:07:05.159932 12152 cmd.go:177] openshift-sdn network plugin ready
I0921 07:07:16.224353 12152 roundrobin.go:310] LoadBalancerRR: Setting endpoints for openshift-sdn/sdn:metrics to [10.0.134.176:9101 10.0.136.46:9101 10.0.137.153:9101 10.0.138.235:9101 10.0.152.60:9101 10.0.153.234:9101]
I0921 07:07:16.224390 12152 roundrobin.go:240] Delete endpoint 10.0.152.60:9101 for service "openshift-sdn/sdn:metrics"
I0921 07:07:16.224439 12152 proxy.go:331] hybrid proxy: syncProxyRules start
I0921 07:07:16.378919 12152 proxy.go:334] hybrid proxy: mainProxy.syncProxyRules complete
I0921 07:07:16.438034 12152 proxier.go:367] userspace proxy: processing 0 service events
I0921 07:07:16.438055 12152 proxier.go:346] userspace syncProxyRules took 59.113113ms
I0921 07:07:16.438066 12152 proxy.go:337] hybrid proxy: unidlingProxy.syncProxyRules complete
interrupt: Gracefully shutting down ...
```

And if you watch the multus logs:

```
2019-09-21T07:07:05+0000 Using /host/var/run/multus/cni/net.d/80-openshift-network.conf as a source to generate the Multus configuration
2019-09-21T07:07:05+0000 Config file created @ /host/etc/cni/net.d/00-multus.conf
{
  "cniVersion": "0.3.1",
  "name": "multus-cni-network",
  "type": "multus",
  "namespaceIsolation": true,
  "logLevel": "verbose",
  "binDir": "/opt/multus/bin",
  "kubeconfig": "/etc/kubernetes/cni/net.d/multus.d/multus.kubeconfig",
  "delegates": [
    {
      "cniVersion": "0.3.1",
      "name": "openshift-sdn",
      "type": "openshift-sdn"
    }
  ]
}
2019-09-21T07:07:05+0000 Continuing watch loop after configuration regeneration...
2019-09-21T07:07:34+0000 Master plugin @ /host/var/run/multus/cni/net.d/80-openshift-network.conf has been deleted. Performing cleanup...
2019-09-21T07:07:34+0000 Multus configuration intentionally invalidated to prevent pods from being scheduled.
```

Okay, that last part is the shutdown handler for the openshift-sdn pod removing the config file. So: what is killing openshift-sdn? Looking...
Aha. The kubelet is killing it because it failed liveness:

```
Sep 21 07:07:13 ip-10-0-152-60 hyperkube[1365]: I0921 07:07:13.494388 1365 prober.go:112] Liveness probe for "sdn-6qckt_openshift-sdn(8be9db67-dc35-11e9-a4c5-12d5b2c68656):sdn" failed (failure): Get http://10.0.152.60:10256/healthz: net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)
Sep 21 07:07:13 ip-10-0-152-60 hyperkube[1365]: I0921 07:07:13.494618 1365 event.go:209] Event(v1.ObjectReference{Kind:"Pod", Namespace:"openshift-sdn", Name:"sdn-6qckt", UID:"8be9db67-dc35-11e9-a4c5-12d5b2c68656", APIVersion:"v1", ResourceVersion:"41661", FieldPath:"spec.containers{sdn}"}): type: 'Warning' reason: 'Unhealthy' Liveness probe failed: Get http://10.0.152.60:10256/healthz: net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)
```

Overloaded node and bad luck? Something else? All of the other nodes are ready.
Hey Kirsten, is this happening for any other sorts of tests, or just the MCD? Because openshift-sdn started failing liveness checks exactly after the MCD wrote its new config...

```
Sep 21 07:03:03 ip-10-0-152-60 root[3392]: machine-config-daemon[2054]: Disk currentConfig rendered-worker-e5739f218cf71a7afcdca926fb119202 overrides node annotation rendered-worker-683505cbc36356e2029b40f24b44100c
Sep 21 07:03:03 ip-10-0-152-60 logger[3393]: rendered-worker-e5739f218cf71a7afcdca926fb119202
Sep 21 07:03:03 ip-10-0-152-60 root[3394]: machine-config-daemon[2054]: completed update for config rendered-worker-e5739f218cf71a7afcdca926fb119202
Sep 21 07:03:03 ip-10-0-152-60 hyperkube[1365]: I0921 07:03:03.493375 1365 prober.go:112] Liveness probe for "sdn-6qckt_openshift-sdn(8be9db67-dc35-11e9-a4c5-12d5b2c68656):sdn" failed (failure): Get http://10.0.152.60:10256/healthz: net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)
Sep 21 07:03:03 ip-10-0-152-60 hyperkube[1365]: I0921 07:03:03.493687 1365 event.go:209] Event(v1.ObjectReference{Kind:"Pod", Namespace:"openshift-sdn", Name:"sdn-6qckt", UID:"8be9db67-dc35-11e9-a4c5-12d5b2c68656", APIVersion:"v1", ResourceVersion:"41661", FieldPath:"spec.containers{sdn}"}): type: 'Warning' reason: 'Unhealthy' Liveness probe failed: Get http://10.0.152.60:10256/healthz: net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)
```

Thoughts?
Summary: sdn-6qckt is running on the node and seems ready.

07:01 - node reboots into rendered-worker-e5739f...
07:02:42 - kubelet starts setting up sdn-6qckt
07:02:52 - liveness probe succeeds
07:03:03 - liveness probe for sdn-6qckt fails due to timeout. It never succeeds again.

The liveness probe is a direct connection over loopback. It does not go over the pod network, as both the kubelet and the SDN pod are host-network. If this is failing at the network level, something is very wrong.
Additional debugging has shown that this *almost* always happens for e2e-aws-op for the MCO, but not always. The exception: https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/pr-logs/pull/23839/pull-ci-openshift-origin-master-e2e-aws-upgrade/4844
Found the problem. `lo` is down. Now, the question: why?
AHA: from the crio log:

```
Error deleting network: failed to Statfs "/proc/0/ns/net": no such file or directory
```

I suspect we're tearing down containers and hitting pid reuse.
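The "failed to Statfs" wording suggests the teardown path validates a pid-based netns path before entering it. A rough sketch of that check in Go, assuming the /proc/&lt;pid&gt;/ns/net path was cached before the reboot (illustrative only, not crio's actual code):

```go
// Sketch of a pid-based netns validation: Statfs on /proc/<pid>/ns/net fails
// with "no such file or directory" once the pid is gone (as with pid 0 above),
// but if the pid has been reused by another process the check passes and
// teardown would proceed against the wrong namespace.
package main

import (
	"fmt"

	"golang.org/x/sys/unix"
)

func checkNetNS(pid int) error {
	path := fmt.Sprintf("/proc/%d/ns/net", pid)
	var st unix.Statfs_t
	if err := unix.Statfs(path, &st); err != nil {
		// e.g. failed to Statfs "/proc/0/ns/net": no such file or directory
		return fmt.Errorf("failed to Statfs %q: %w", path, err)
	}
	// A recycled pid passes this check even though the path now refers to a
	// completely different (possibly the host's) network namespace.
	return nil
}

func main() {
	if err := checkNetNS(0); err != nil {
		fmt.Println("Error deleting network:", err)
	}
}
```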
Hi Casey,

To answer your question above, we're only seeing it in the MCD (but the MCD is what initiates the update/reboot functions). In those e2e tests, the MCD applies an update, reboots the node, and verifies that the state of the node is updated as expected.
https://github.com/openshift/containernetworking-plugins/pull/15
So, some follow-up analysis: the recent oci-cni changes finally enabled tearing down loopback (because the cache files were sitting around). However, there is an issue where oci-cni caches pids across reboots. After a reboot, the kubelet cleans up old sandboxes. This causes cri-o to do a CNI DEL of the container... with the namespace /proc/<pid>/ns/net. However, if <pid> has been recycled (because this is a reboot), we end up setting lo down... on the node.
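To make that mechanism concrete, here is a minimal Go sketch of a loopback-style CNI DEL, under the assumptions that the cached netns path is pid-based and that teardown simply downs lo in whatever namespace that path resolves to (illustrative, not the actual plugin code):

```go
// Sketch of the failure mechanism: a CNI DEL for loopback enters the
// namespace it is handed and sets "lo" down. If the cached /proc/<pid>/ns/net
// path now belongs to a host-namespace process (pid reuse after reboot), the
// interface that gets downed is the node's own loopback.
package main

import (
	"log"

	"github.com/containernetworking/plugins/pkg/ns"
	"github.com/vishvananda/netlink"
)

// teardownLoopback mirrors what a loopback DEL roughly does: switch into the
// target network namespace and bring "lo" down. netnsPath is the value the
// runtime passes, e.g. "/proc/<cached-pid>/ns/net".
func teardownLoopback(netnsPath string) error {
	return ns.WithNetNSPath(netnsPath, func(_ ns.NetNS) error {
		link, err := netlink.LinkByName("lo")
		if err != nil {
			return err
		}
		// After a reboot with a recycled pid, this is the HOST's lo.
		return netlink.LinkSetDown(link)
	})
}

func main() {
	// Hypothetical stale value read from a pre-reboot cache file.
	if err := teardownLoopback("/proc/1234/ns/net"); err != nil {
		log.Fatal(err)
	}
}
```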
Updated to 4.3.0; will clone to 4.2.0.
Verified this bug on 4.3.0-0.ci-2019-10-08-225758; the node works well.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2020:0062