Bug 1434950

Summary: NetworkPlugin cni failed on status hook - eth0 does not exist
Product: OpenShift Container Platform Reporter: Mike Fiedler <mifiedle>
Component: NetworkingAssignee: Dan Williams <dcbw>
Status: CLOSED CURRENTRELEASE QA Contact: Meng Bo <bmeng>
Severity: low Docs Contact:
Priority: medium    
Version: 3.5.0CC: aloughla, aos-bugs, atragler, bbennett, dcbw, gferrazs, jupierce, mifiedle, nnosenzo, pparsons, smunilla, tatanaka, tcarlin, yadu
Target Milestone: ---Keywords: Reopened
Target Release: 3.7.z   
Hardware: x86_64   
OS: Linux   
Whiteboard:
Fixed In Version: Doc Type: No Doc Update
Doc Text:
undefined
Story Points: ---
Clone Of: Environment:
Last Closed: 2018-07-17 14:54:42 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Attachments:
Description Flags
system log from node
none
pod definition for reproducer
none
full node log none

Description Mike Fiedler 2017-03-22 17:05:12 UTC
Created attachment 1265449 [details]
system log from node

Description of problem:

When creating pods in batches of 75 simultaneously, in each batch some of the pods get the following error in the node syslog:

Mar 22 12:18:17 ip-172-31-43-89 atomic-openshift-node: E0322 12:18:17.651013   25624 docker_manager.go:378] NetworkPlugin cni failed on the status hook for pod 'pausepods22' - Unexpected command output Device "eth0" does not 
exist.

In each batch of 75, usually 2-15 get this error.

Version-Release number of selected component (if applicable): 3.5.0.55


How reproducible: Always


Steps to Reproduce:
1.  Set up an OpenShift cluster
2.  Download the pod-pause.json file attached to this bz to the master
3.  for i in {1..75}; do cat pod-pause.json | sed -e s/NAME/pod$i/g | oc create -f -; done
4.  Look at the node logs.  This message should pop a few times.

Actual results:

Mar 22 13:01:12 ip-172-31-43-89 atomic-openshift-node: E0322 13:01:12.443538   34645 docker_manager.go:378] NetworkPlugin cni failed on the status hook for pod 'pod1' - Unexpected command output Device "eth0" does not exist.


Expected results:

No errors for successful pod creation

Additional info:

Comment 1 Mike Fiedler 2017-03-22 17:05:45 UTC
Created attachment 1265451 [details]
pod definition for reproducer

Comment 2 Dan Williams 2017-04-05 00:15:54 UTC
This is a known race in kubelet/openshift-node between some pod status checking (which prints the message) and network setup goroutines in kubelet.

Kubelet's "GenericPLEG" (Pod Lifecycle Event Generator) runs on a 1 second loop and asks the runtime (docker) for pod status including the IP address, which is where that message is generated from.  It calls code that asks docker directly (GetPodStatus()) for pod status, and if the pod status is "running" according to docker, it tries to read the IP address and fails.

Docker reports the container as "running" because the container really is started, it just hasn't had network set up yet.  But docker doesn't know this because it's not involved in network setup.

And there's the gap.

For the time being we should just suppress the message by ignoring the error in kubelet.  Eventually the runtimes need to ensure that GetPodNetworkStatus() doesn't even try to get the IP address until networking is actually started.

Comment 3 Dan Williams 2017-04-05 00:50:56 UTC
Upstream pull request to suppress the message here:

https://github.com/kubernetes/kubernetes/pull/44067

Comment 4 Ben Bennett 2017-04-19 13:57:06 UTC
Lowering the severity since this is just logspam until the above PR merges into Kubernetes and makes it back to OpenShift.

Comment 5 Dan Williams 2017-05-17 16:27:32 UTC
Closed upstream #44067 in favor of https://github.com/kubernetes/kubernetes/pull/43879 which has the same end result.

Comment 6 Dan Williams 2017-06-09 20:34:40 UTC
*** Bug 1451466 has been marked as a duplicate of this bug. ***

Comment 7 Dan Williams 2017-06-19 16:18:50 UTC
43879 was reverted upstream, and instead reworked and merged as https://github.com/kubernetes/kubernetes/pull/46823.

Waiting on PM for backport priority (if any).

Comment 8 Ben Bennett 2017-06-22 21:06:28 UTC
Closing since no-one has screamed.

Comment 11 Takayoshi Tanaka 2017-07-19 03:59:41 UTC
Another customer is facing the same error. However, this customer always gets the message when he deploys one pod and all the pods failed to create. 

Env: OpenShift 3.5.5.31 (brand new install)
1 master + 1 node with containerized on AWS

I'll add logs in private.

Comment 13 Takayoshi Tanaka 2017-07-19 04:02:19 UTC
There are similar BZ and upstream issues. However, I couldn't find any useful information from them.

Cannot deploy PODs on secondary nodes : v1.2.0 containerised 
https://bugzilla.redhat.com/show_bug.cgi?id=1344388


Containerized hyperkube kubelet addNDotsOption: ResolvConfPath Warning 
https://github.com/kubernetes/kubernetes/issues/23144

Pods getting lost during deployment creation
https://github.com/kubernetes/kubernetes/issues/34007

Unable to start runc when setting uidmappings/gidmappings
https://github.com/opencontainers/runc/issues/252

Comment 15 Yan Du 2017-09-07 08:34:10 UTC
Test on latest OCP 3.7
oc v3.7.0-0.125.0
kubernetes v1.7.0+695f48a16f

Still met a lot of status hook error in node log:
Sep 06 22:40:44 ip-172-18-25-39.ec2.internal atomic-openshift-node[8052]: W0906 22:40:44.912816    8052 docker_sandbox.go:342] failed to read pod IP from plugin/docker: NetworkPlugin cni failed on the status hook for pod "docker-registry-1-deploy_default": Unexpected command output nsenter: cannot open /proc/9216/ns/net: No such file or directory
Sep 06 22:41:26 ip-172-18-25-39.ec2.internal atomic-openshift-node[8052]: W0906 22:41:26.062326    8052 docker_sandbox.go:342] failed to read pod IP from plugin/docker: NetworkPlugin cni failed on the status hook for pod "registry-console-1-deploy_default": Unexpected command output nsenter: cannot open /proc/9424/ns/net: No such file or directory
Sep 06 22:44:03 ip-172-18-25-39.ec2.internal atomic-openshift-node[8052]: W0906 22:44:03.611015    8052 docker_sandbox.go:342] failed to read pod IP from plugin/docker: NetworkPlugin cni failed on the status hook for pod "nodejs-mongodb-example-1-deploy_install-test": Unexpected command output nsenter: cannot open /proc/13759/ns/net: No such file or directory
Sep 06 23:27:13 ip-172-18-25-39.ec2.internal atomic-openshift-node[8052]: W0906 23:27:13.925903    8052 docker_sandbox.go:342] failed to read pod IP from plugin/docker: NetworkPlugin cni failed on the status hook for pod "registry-console-1-pqtk2_default": CNI failed to retrieve network namespace path: Cannot find network namespace for the terminated container "9f0fb0146774592fb08c4d75ccfd6f5f05a0f3826e75795d351418b48c00e70d"

Comment 16 Dan Williams 2017-09-19 17:01:37 UTC
Any chance you can attach more of the node logs showing this problem?  Would it be possible to reproduce it at --loglevel=5 for the node?

Comment 19 Dan Williams 2017-09-21 20:03:28 UTC
Another upstream PR to fine-tune the network teardown and hopefully suppress these messages further.

https://github.com/kubernetes/kubernetes/pull/52864

Comment 20 Dan Williams 2017-09-28 17:24:17 UTC
Backport of that PR to origin: https://github.com/openshift/origin/pull/16602

Comment 21 Dan Williams 2017-09-29 22:07:00 UTC
this has been merged into origin

Comment 23 Yan Du 2017-10-10 02:39:46 UTC
Test on OCP
openshift v3.7.0-0.126.6
kubernetes v1.7.0+80709908fd

Still could get the status hook errors:

Oct 09 05:04:38 qe-ui-registry-37-node-registry-router-1 atomic-openshift-node[17445]: W1009 05:04:38.899016   17445 docker_sandbox.go:347] failed to read pod IP from plugin/docker: NetworkPlugin cni failed on the status hook for pod "router-1-deploy_default": Unexpected command output nsenter: cannot open /proc/18119/ns/net: No such file or directory
Oct 09 05:05:17 qe-ui-registry-37-node-registry-router-1 atomic-openshift-node[17445]: W1009 05:05:17.192946   17445 docker_sandbox.go:347] failed to read pod IP from plugin/docker: NetworkPlugin cni failed on the status hook for pod "docker-registry-1-deploy_default": Unexpected command output nsenter: cannot open /proc/18754/ns/net: No such file or directory

Full node log have attached.

Comment 24 Yan Du 2017-10-10 02:42:47 UTC
Created attachment 1336594 [details]
full node log

Comment 25 Yan Du 2017-10-10 02:46:03 UTC
fix typo: #comment 23 tested on OCP 
openshift v3.7.0-0.143.1 
kubernetes v1.7.0+80709908fd

And will retest the bug after get the errata build.

Comment 26 Yan Du 2017-11-15 03:26:11 UTC
Test on latest OCP 
openshift v3.7.7
kubernetes v1.7.6+a08f5eeb62

Nov 14 20:37:23 qe-hongli-37-node-registry-router-1 atomic-openshift-node[24065]: W1114 20:37:23.705155   24065 docker_sandbox.go:348] failed to read pod IP from plugin/docker: NetworkPlugin cni failed on the status hook for pod "mongodb-1-deploy_install-test": Unexpected command output nsenter: cannot open /proc/43760/ns/net: No such file or directory
Nov 14 20:38:57 qe-hongli-37-node-registry-router-1 atomic-openshift-node[24065]: W1114 20:38:57.108625   24065 docker_sandbox.go:348] failed to read pod IP from plugin/docker: NetworkPlugin cni failed on the status hook for pod "nodejs-mongodb-example-1-deploy_install-test": Unexpected command output nsenter: cannot open /proc/44438/ns/net: No such file or directory
Nov 14 22:15:58 qe-hongli-37-node-registry-router-1 atomic-openshift-node[24065]: W1114 22:15:58.334006   24065 docker_sandbox.go:348] failed to read pod IP from plugin/docker: NetworkPlugin cni failed on the status hook for pod "pod10_d1": CNI failed to retrieve network namespace path: Cannot find network namespace for the terminated container "eb829d2dd4850a628c74891f7229662b02dbbb326abd09304f2d6a7415de2f6d"
Nov 14 22:18:31 qe-hongli-37-node-registry-router-1 atomic-openshift-node[24065]: W1114 22:18:31.865810   24065 docker_sandbox.go:348] failed to read pod IP from plugin/docker: NetworkPlugin cni failed on the status hook for pod "pod9_d1": Unexpected command output nsenter: cannot open /proc/59581/ns/net: No such file or directory

Comment 29 Dan Williams 2018-02-03 05:07:01 UTC
One more instance of this was found upstream: https://github.com/kubernetes/kubernetes/pull/59301

Comment 31 Dan Williams 2018-02-23 22:58:42 UTC
Still waiting on upstream review...

Comment 32 Dan Williams 2018-03-20 16:58:00 UTC
Just waiting on origin review now: https://github.com/openshift/origin/pull/18425

Comment 33 Dan Williams 2018-03-24 13:42:27 UTC
https://github.com/openshift/origin/pull/18425 has been merged to origin.

Comment 34 Ben Bennett 2018-05-01 14:41:52 UTC
Any idea why this is stalled in MODIFIED?

Comment 37 Red Hat Bugzilla 2023-09-14 03:55:27 UTC
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 1000 days