Created attachment 1265449 [details]
system log from node

Description of problem:
When creating pods in batches of 75 simultaneously, some of the pods in each batch trigger the following error in the node syslog:

Mar 22 12:18:17 ip-172-31-43-89 atomic-openshift-node: E0322 12:18:17.651013 25624 docker_manager.go:378] NetworkPlugin cni failed on the status hook for pod 'pausepods22' - Unexpected command output Device "eth0" does not exist.

In each batch of 75, usually 2-15 pods get this error.

Version-Release number of selected component (if applicable):
3.5.0.55

How reproducible:
Always

Steps to Reproduce:
1. Set up an OpenShift cluster
2. Download the pod-pause.json file attached to this bz to the master
3. for i in {1..75}; do cat pod-pause.json | sed -e s/NAME/pod$i/g | oc create -f -; done
4. Look at the node logs. This message should show up a few times.

Actual results:
Mar 22 13:01:12 ip-172-31-43-89 atomic-openshift-node: E0322 13:01:12.443538 34645 docker_manager.go:378] NetworkPlugin cni failed on the status hook for pod 'pod1' - Unexpected command output Device "eth0" does not exist.

Expected results:
No errors for successful pod creation

Additional info:
Created attachment 1265451 [details]
pod definition for reproducer
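For anyone trying to reproduce without the attachment, a minimal pause-pod definition along these lines should behave equivalently; this is a stand-in sketch, not the attached file, and the pause image tag is an assumption:

# Hypothetical stand-in for the attached pod-pause.json; the NAME token
# matches the sed substitution in the reproducer loop, and the pause image
# is only an example.
cat > pod-pause.json <<'EOF'
{
  "apiVersion": "v1",
  "kind": "Pod",
  "metadata": { "name": "NAME" },
  "spec": {
    "containers": [
      {
        "name": "pause",
        "image": "gcr.io/google_containers/pause-amd64:3.0"
      }
    ]
  }
}
EOF

# Same reproducer loop as in the description:
for i in {1..75}; do cat pod-pause.json | sed -e s/NAME/pod$i/g | oc create -f -; done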
This is a known race in kubelet/openshift-node between the pod status check (which prints the message) and the network setup goroutines in kubelet.

Kubelet's "GenericPLEG" (Pod Lifecycle Event Generator) runs on a 1-second loop and asks the runtime (docker) for pod status, including the IP address, which is where this message is generated. It calls code that asks docker directly (GetPodStatus()) for pod status, and if the pod status is "running" according to docker, it tries to read the IP address and fails. Docker reports the container as "running" because the container really has started; it just hasn't had its network set up yet. But docker doesn't know this, because it's not involved in network setup. And there's the gap.

For the time being we should just suppress the message by ignoring the error in kubelet. Eventually the runtimes need to ensure that GetPodNetworkStatus() doesn't even try to get the IP address until networking is actually started.
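To make the gap concrete, here is a rough shell approximation of the effective check the status hook performs; the container name filter and exact flags are illustrative, not the actual kubelet code path:

# Sketch only: pick the infra container for one of the reproducer pods.
CID=$(docker ps -q --filter name=k8s_POD_pausepods22 | head -1)
docker inspect -f '{{.State.Running}}' "$CID"     # docker already reports "true"
PID=$(docker inspect -f '{{.State.Pid}}' "$CID")
# The status hook effectively runs something like this inside the pod's
# network namespace; until CNI setup finishes, eth0 does not exist yet:
nsenter -t "$PID" -n -F -- ip -4 addr show eth0   # -> Device "eth0" does not exist.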
Upstream pull request to suppress the message here: https://github.com/kubernetes/kubernetes/pull/44067
Lowering the severity since this is just logspam until the above PR merges into Kubernetes and makes it back to OpenShift.
Closed upstream #44067 in favor of https://github.com/kubernetes/kubernetes/pull/43879 which has the same end result.
*** Bug 1451466 has been marked as a duplicate of this bug. ***
43879 was reverted upstream, and instead reworked and merged as https://github.com/kubernetes/kubernetes/pull/46823. Waiting on PM for backport priority (if any).
Closing since no-one has screamed.
Another customer is hitting the same error. However, this customer gets the message every time he deploys even a single pod, and all of the pods fail to be created.

Env:
OpenShift 3.5.5.31 (brand new install)
1 master + 1 node, containerized install, on AWS

I'll add logs in private.
There are similar BZs and upstream issues, but I couldn't find any useful information in them:

Cannot deploy PODs on secondary nodes : v1.2.0 containerised
https://bugzilla.redhat.com/show_bug.cgi?id=1344388

Containerized hyperkube kubelet addNDotsOption: ResolvConfPath Warning
https://github.com/kubernetes/kubernetes/issues/23144

Pods getting lost during deployment creation
https://github.com/kubernetes/kubernetes/issues/34007

Unable to start runc when setting uidmappings/gidmappings
https://github.com/opencontainers/runc/issues/252
Tested on latest OCP 3.7:
oc v3.7.0-0.125.0
kubernetes v1.7.0+695f48a16f

Still seeing a lot of status hook errors in the node log:

Sep 06 22:40:44 ip-172-18-25-39.ec2.internal atomic-openshift-node[8052]: W0906 22:40:44.912816 8052 docker_sandbox.go:342] failed to read pod IP from plugin/docker: NetworkPlugin cni failed on the status hook for pod "docker-registry-1-deploy_default": Unexpected command output nsenter: cannot open /proc/9216/ns/net: No such file or directory
Sep 06 22:41:26 ip-172-18-25-39.ec2.internal atomic-openshift-node[8052]: W0906 22:41:26.062326 8052 docker_sandbox.go:342] failed to read pod IP from plugin/docker: NetworkPlugin cni failed on the status hook for pod "registry-console-1-deploy_default": Unexpected command output nsenter: cannot open /proc/9424/ns/net: No such file or directory
Sep 06 22:44:03 ip-172-18-25-39.ec2.internal atomic-openshift-node[8052]: W0906 22:44:03.611015 8052 docker_sandbox.go:342] failed to read pod IP from plugin/docker: NetworkPlugin cni failed on the status hook for pod "nodejs-mongodb-example-1-deploy_install-test": Unexpected command output nsenter: cannot open /proc/13759/ns/net: No such file or directory
Sep 06 23:27:13 ip-172-18-25-39.ec2.internal atomic-openshift-node[8052]: W0906 23:27:13.925903 8052 docker_sandbox.go:342] failed to read pod IP from plugin/docker: NetworkPlugin cni failed on the status hook for pod "registry-console-1-pqtk2_default": CNI failed to retrieve network namespace path: Cannot find network namespace for the terminated container "9f0fb0146774592fb08c4d75ccfd6f5f05a0f3826e75795d351418b48c00e70d"
Any chance you can attach more of the node logs showing this problem? Would it be possible to reproduce it at --loglevel=5 for the node?
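For reference, one way to bump the node log level on an OCP 3.x RPM install and capture the logs is shown below; the sysconfig path and existing --loglevel option are assumptions based on a standard atomic-openshift-node setup, so adjust as needed (e.g. for containerized nodes):

# Assumes /etc/sysconfig/atomic-openshift-node already contains a
# --loglevel=N in its OPTIONS line (typical for an RPM install).
sed -i 's/--loglevel=[0-9]*/--loglevel=5/' /etc/sysconfig/atomic-openshift-node
systemctl restart atomic-openshift-node
# After reproducing, capture the relevant window:
journalctl -u atomic-openshift-node --since "10 minutes ago" > node-loglevel5.log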
Another upstream PR to fine-tune the network teardown and hopefully suppress these messages further. https://github.com/kubernetes/kubernetes/pull/52864
Backport of that PR to origin: https://github.com/openshift/origin/pull/16602
This has been merged into origin.
Tested on OCP:
openshift v3.7.0-0.126.6
kubernetes v1.7.0+80709908fd

Still getting the status hook errors:

Oct 09 05:04:38 qe-ui-registry-37-node-registry-router-1 atomic-openshift-node[17445]: W1009 05:04:38.899016 17445 docker_sandbox.go:347] failed to read pod IP from plugin/docker: NetworkPlugin cni failed on the status hook for pod "router-1-deploy_default": Unexpected command output nsenter: cannot open /proc/18119/ns/net: No such file or directory
Oct 09 05:05:17 qe-ui-registry-37-node-registry-router-1 atomic-openshift-node[17445]: W1009 05:05:17.192946 17445 docker_sandbox.go:347] failed to read pod IP from plugin/docker: NetworkPlugin cni failed on the status hook for pod "docker-registry-1-deploy_default": Unexpected command output nsenter: cannot open /proc/18754/ns/net: No such file or directory

Full node log has been attached.
Created attachment 1336594 [details] full node log
Fixing a typo in #comment 23: tested on OCP openshift v3.7.0-0.143.1, kubernetes v1.7.0+80709908fd.

Will retest the bug after getting the errata build.
Tested on latest OCP:
openshift v3.7.7
kubernetes v1.7.6+a08f5eeb62

Nov 14 20:37:23 qe-hongli-37-node-registry-router-1 atomic-openshift-node[24065]: W1114 20:37:23.705155 24065 docker_sandbox.go:348] failed to read pod IP from plugin/docker: NetworkPlugin cni failed on the status hook for pod "mongodb-1-deploy_install-test": Unexpected command output nsenter: cannot open /proc/43760/ns/net: No such file or directory
Nov 14 20:38:57 qe-hongli-37-node-registry-router-1 atomic-openshift-node[24065]: W1114 20:38:57.108625 24065 docker_sandbox.go:348] failed to read pod IP from plugin/docker: NetworkPlugin cni failed on the status hook for pod "nodejs-mongodb-example-1-deploy_install-test": Unexpected command output nsenter: cannot open /proc/44438/ns/net: No such file or directory
Nov 14 22:15:58 qe-hongli-37-node-registry-router-1 atomic-openshift-node[24065]: W1114 22:15:58.334006 24065 docker_sandbox.go:348] failed to read pod IP from plugin/docker: NetworkPlugin cni failed on the status hook for pod "pod10_d1": CNI failed to retrieve network namespace path: Cannot find network namespace for the terminated container "eb829d2dd4850a628c74891f7229662b02dbbb326abd09304f2d6a7415de2f6d"
Nov 14 22:18:31 qe-hongli-37-node-registry-router-1 atomic-openshift-node[24065]: W1114 22:18:31.865810 24065 docker_sandbox.go:348] failed to read pod IP from plugin/docker: NetworkPlugin cni failed on the status hook for pod "pod9_d1": Unexpected command output nsenter: cannot open /proc/59581/ns/net: No such file or directory
One more instance of this was found upstream: https://github.com/kubernetes/kubernetes/pull/59301
See also:
https://github.com/openshift/origin/issues/18414
https://github.com/openshift/origin/pull/18425
Still waiting on upstream review...
Just waiting on origin review now: https://github.com/openshift/origin/pull/18425
https://github.com/openshift/origin/pull/18425 has been merged to origin.
Any idea why this is stalled in MODIFIED?
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 1000 days