Bug 1754154
Summary: Loopback is accidentally torn down.

| Field | Value | Field | Value |
|---|---|---|---|
| Product: | OpenShift Container Platform | Reporter: | Kirsten Garrison <kgarriso> |
| Component: | Networking | Assignee: | Casey Callendrello <cdc> |
| Networking sub component: | openshift-sdn | QA Contact: | zhaozhanqi <zzhao> |
| Status: | CLOSED ERRATA | Docs Contact: | |
| Severity: | urgent | | |
| Priority: | urgent | CC: | arghosh, bbennett, dcbw, eparis, mrhodes, mzali, trankin, travi |
| Version: | 4.2.0 | | |
| Target Milestone: | --- | | |
| Target Release: | 4.3.0 | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | | |
| | 1754638 (view as bug list) | Environment: | |
| Last Closed: | 2020-01-23 11:06:39 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Bug Depends On: | | | |
| Bug Blocks: | 1754638 | | |
Description
Kirsten Garrison
2019-09-21 09:05:56 UTC
Is this related to https://bugzilla.redhat.com/show_bug.cgi?id=1753801?

Something odd is going on. Node ip-10-0-152-60.ec2.internal is not ready. This is because openshift-sdn hasn't written the SDN configuration file, which is expected. However, looking at the logs of the SDN pod on the node, it is happily starting up and getting killed just before it writes its configuration file: https://storage.googleapis.com/origin-ci-test/pr-logs/pull/openshift_machine-config-operator/1127/pull-ci-openshift-machine-config-operator-master-e2e-aws-op/3522/artifacts/e2e-aws-op/pods/openshift-sdn_sdn-6qckt_sdn.log

```
I0921 07:07:05.159854   12152 cmd.go:173] openshift-sdn network plugin registering startup
I0921 07:07:05.159932   12152 cmd.go:177] openshift-sdn network plugin ready
I0921 07:07:16.224353   12152 roundrobin.go:310] LoadBalancerRR: Setting endpoints for openshift-sdn/sdn:metrics to [10.0.134.176:9101 10.0.136.46:9101 10.0.137.153:9101 10.0.138.235:9101 10.0.152.60:9101 10.0.153.234:9101]
I0921 07:07:16.224390   12152 roundrobin.go:240] Delete endpoint 10.0.152.60:9101 for service "openshift-sdn/sdn:metrics"
I0921 07:07:16.224439   12152 proxy.go:331] hybrid proxy: syncProxyRules start
I0921 07:07:16.378919   12152 proxy.go:334] hybrid proxy: mainProxy.syncProxyRules complete
I0921 07:07:16.438034   12152 proxier.go:367] userspace proxy: processing 0 service events
I0921 07:07:16.438055   12152 proxier.go:346] userspace syncProxyRules took 59.113113ms
I0921 07:07:16.438066   12152 proxy.go:337] hybrid proxy: unidlingProxy.syncProxyRules complete
interrupt: Gracefully shutting down ...
```
And if you watch the multus logs:

```
2019-09-21T07:07:05+0000 Using /host/var/run/multus/cni/net.d/80-openshift-network.conf as a source to generate the Multus configuration
2019-09-21T07:07:05+0000 Config file created @ /host/etc/cni/net.d/00-multus.conf
{
  "cniVersion": "0.3.1",
  "name": "multus-cni-network",
  "type": "multus",
  "namespaceIsolation": true,
  "logLevel": "verbose",
  "binDir": "/opt/multus/bin",
  "kubeconfig": "/etc/kubernetes/cni/net.d/multus.d/multus.kubeconfig",
  "delegates": [
    {
      "cniVersion": "0.3.1",
      "name": "openshift-sdn",
      "type": "openshift-sdn"
    }
  ]
}
2019-09-21T07:07:05+0000 Continuing watch loop after configuration regeneration...
2019-09-21T07:07:34+0000 Master plugin @ /host/var/run/multus/cni/net.d/80-openshift-network.conf has been deleted. Performing cleanup...
2019-09-21T07:07:34+0000 Multus configuration intentionally invalidated to prevent pods from being scheduled.
```

Okay, that deletion is the shutdown handler for the openshift-sdn pod removing the config file. So: what is killing openshift-sdn? Looking...

Aha. The kubelet is killing it because it failed liveness:

```
Sep 21 07:07:13 ip-10-0-152-60 hyperkube[1365]: I0921 07:07:13.494388    1365 prober.go:112] Liveness probe for "sdn-6qckt_openshift-sdn(8be9db67-dc35-11e9-a4c5-12d5b2c68656):sdn" failed (failure): Get http://10.0.152.60:10256/healthz: net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)
Sep 21 07:07:13 ip-10-0-152-60 hyperkube[1365]: I0921 07:07:13.494618    1365 event.go:209] Event(v1.ObjectReference{Kind:"Pod", Namespace:"openshift-sdn", Name:"sdn-6qckt", UID:"8be9db67-dc35-11e9-a4c5-12d5b2c68656", APIVersion:"v1", ResourceVersion:"41661", FieldPath:"spec.containers{sdn}"}): type: 'Warning' reason: 'Unhealthy' Liveness probe failed: Get http://10.0.152.60:10256/healthz: net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)
```

Overloaded node and bad luck? Something else? All of the other nodes are ready.
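The multus log lines above describe a watch loop: a config is generated from a "master plugin" file, and when that file disappears, the generated config is intentionally invalidated so no new pods are scheduled onto a dead network. A rough, self-contained sketch of that behavior (paths and the invalid-marker content are stand-ins, not the real `/host/etc/cni/net.d` files or the actual Multus entrypoint logic):

```python
# Sketch of the watch-loop behavior seen in the multus logs: regenerate the
# Multus config from the master plugin file, or invalidate it if the master
# plugin has been deleted. All paths here are temp-dir stand-ins.
import json
import os
import tempfile

INVALID_MARKER = {"invalid": "master plugin removed"}  # hypothetical sentinel

def sync_once(master_path, generated_path):
    """One iteration: regenerate the config, or invalidate it on deletion."""
    if os.path.exists(master_path):
        with open(master_path) as f:
            delegate = json.load(f)
        conf = {
            "cniVersion": "0.3.1",
            "name": "multus-cni-network",
            "type": "multus",
            "delegates": [delegate],
        }
    else:
        # Master plugin deleted: "Performing cleanup", as in the 07:07:34 line.
        conf = INVALID_MARKER
    with open(generated_path, "w") as f:
        json.dump(conf, f)
    return conf

tmp = tempfile.mkdtemp()
master = os.path.join(tmp, "80-openshift-network.conf")
generated = os.path.join(tmp, "00-multus.conf")

with open(master, "w") as f:
    json.dump({"cniVersion": "0.3.1", "name": "openshift-sdn",
               "type": "openshift-sdn"}, f)

generated_conf = sync_once(master, generated)    # normal regeneration
os.remove(master)                                # SDN shutdown handler removes it
invalidated_conf = sync_once(master, generated)  # loop invalidates the config
print(invalidated_conf)
```

The point of the invalidation step is that a missing delegate should fail pod scheduling loudly rather than silently falling back to a broken network.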
Hey Kirsten, is this happening for any other sorts of tests, or just the MCD? Because openshift-sdn started failing liveness checks exactly after the MCD wrote its new config...

```
Sep 21 07:03:03 ip-10-0-152-60 root[3392]: machine-config-daemon[2054]: Disk currentConfig rendered-worker-e5739f218cf71a7afcdca926fb119202 overrides node annotation rendered-worker-683505cbc36356e2029b40f24b44100c
Sep 21 07:03:03 ip-10-0-152-60 logger[3393]: rendered-worker-e5739f218cf71a7afcdca926fb119202
Sep 21 07:03:03 ip-10-0-152-60 root[3394]: machine-config-daemon[2054]: completed update for config rendered-worker-e5739f218cf71a7afcdca926fb119202
Sep 21 07:03:03 ip-10-0-152-60 hyperkube[1365]: I0921 07:03:03.493375    1365 prober.go:112] Liveness probe for "sdn-6qckt_openshift-sdn(8be9db67-dc35-11e9-a4c5-12d5b2c68656):sdn" failed (failure): Get http://10.0.152.60:10256/healthz: net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)
Sep 21 07:03:03 ip-10-0-152-60 hyperkube[1365]: I0921 07:03:03.493687    1365 event.go:209] Event(v1.ObjectReference{Kind:"Pod", Namespace:"openshift-sdn", Name:"sdn-6qckt", UID:"8be9db67-dc35-11e9-a4c5-12d5b2c68656", APIVersion:"v1", ResourceVersion:"41661", FieldPath:"spec.containers{sdn}"}): type: 'Warning' reason: 'Unhealthy' Liveness probe failed: Get http://10.0.152.60:10256/healthz: net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)
```

Thoughts?

Summary: sdn-6qckt is running on the node and seems ready.

- 07:01 - node reboots into rendered-worker-e5739f...
- 07:02:42 - kubelet starts setting up sdn-6qckt
- 07:02:52 - liveness probe succeeds
- 07:03:03 - liveness probe for sdn-6qckt fails due to timeout. It never succeeds again.

The liveness probe is a direct connection over loopback. It does not go over the pod network, as both are host-network. If this is failing at the network level, something is very wrong.
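For context on the failure mode above: the kubelet's HTTP liveness probe is just an HTTP GET against the endpoint with a short timeout, and a timeout counts as a failure. A minimal self-contained sketch of that check (the local server and port here are stand-ins, not the real openshift-sdn health server on :10256):

```python
# Sketch of an HTTP liveness check: GET the healthz endpoint with a timeout,
# and treat any timeout or error as a probe failure. The server below is a
# stand-in so the example runs anywhere.
import http.server
import threading
import urllib.error
import urllib.request

class Healthz(http.server.BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path == "/healthz":
            body = b"ok"
            self.send_response(200)
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)
        else:
            self.send_response(404)
            self.end_headers()

    def log_message(self, *args):  # keep the sketch quiet
        pass

def probe(url, timeout=1.0):
    """Return True only if the endpoint answers 200 within the timeout."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # A timeout here is the "Client.Timeout exceeded while awaiting
        # headers" failure seen in the kubelet logs above.
        return False

server = http.server.HTTPServer(("127.0.0.1", 0), Healthz)
threading.Thread(target=server.serve_forever, daemon=True).start()
port = server.server_address[1]

healthy = probe(f"http://127.0.0.1:{port}/healthz")
unhealthy = probe(f"http://127.0.0.1:{port}/nope")  # 404 -> probe failure
server.shutdown()
print(healthy, unhealthy)
```

Because both the kubelet and the sdn pod are host-network, this GET to the node's own address never leaves the machine, which is why a persistent timeout points at loopback itself rather than the pod network.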
Additional debugging has shown that this *almost* always happens for aws-ci-op for the MCO, but not always. The exception: https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/pr-logs/pull/23839/pull-ci-openshift-origin-master-e2e-aws-upgrade/4844

Found the problem. `lo` is down. Now, the question: why?

AHA: from the cri-o log:

```
Error deleting network: failed to Statfs "/proc/0/ns/net": no such file or directory
```

I suspect we're tearing down containers and hitting pid reuse.

Hi Casey, to answer your question above: we're only seeing it in the MCD (but the MCD is what initiates the update/reboot functions). In those e2e tests, the MCD applies an update, reboots the node, and verifies that the state of the node is updated as expected.

So, some follow-up analysis: the recent ocicni changes finally enabled tearing down loopback (because the cache files were sitting around). However, there is an issue with ocicni in that it caches pids across reboots. After a reboot, the kubelet cleans up old sandboxes. This causes cri-o to do a CNI DEL of the container... with the namespace /proc/<pid>/ns/net. However, if <pid> has been recycled (because this is a reboot), we end up setting `lo` down... on the node.

Updated to 4.3.0; will clone into 4.2.0.

Verified this bug on 4.3.0-0.ci-2019-10-08-225758; the node works well.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2020:0062
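The pid-reuse hazard analyzed above can be sketched in miniature. One common guard (illustrative only; this is not necessarily the actual ocicni/cri-o fix) is to cache the process start time alongside the pid and refuse to use `/proc/<pid>/ns/net` when the start time no longer matches, since after a reboot every cached start time is stale. The "proc table" below is an in-memory stand-in for `/proc/<pid>/stat` so the example is self-contained:

```python
# Illustrative sketch of guarding a cached netns handle against pid reuse:
# record the pid's start time when caching, and verify it before trusting
# /proc/<pid>/ns/net. On a real system the start time is field 22 of
# /proc/<pid>/stat; here a dict stands in for the proc table.

# proc[pid] = start_time
proc_before_reboot = {4242: 1000}  # sandbox process started at tick 1000
proc_after_reboot = {4242: 7}      # same pid recycled by an early boot process

def cache_netns_handle(proc, pid):
    """Record pid plus its start time, like a CNI cache entry would."""
    return {"pid": pid, "start_time": proc[pid]}

def netns_path_for_delete(proc, entry):
    """Return /proc/<pid>/ns/net only if the pid still names the same process."""
    pid = entry["pid"]
    if proc.get(pid) != entry["start_time"]:
        # Pid is gone or recycled: skip the CNI DEL against this namespace
        # instead of tearing down lo in whatever netns the pid now lives in.
        return None
    return f"/proc/{pid}/ns/net"

entry = cache_netns_handle(proc_before_reboot, 4242)

# Same boot: the handle is still valid, teardown may proceed.
same_boot = netns_path_for_delete(proc_before_reboot, entry)

# After reboot: pid 4242 was recycled, so the stale handle is rejected and
# the node's loopback is never touched.
after_reboot = netns_path_for_delete(proc_after_reboot, entry)
print(same_boot, after_reboot)
```

Skipping the DEL for a stale handle is safe here because the reboot already destroyed the sandbox's network namespace; there is nothing left to tear down.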