Bug 1451110

Summary: Pods stuck in ContainerCreating with CNI errors in node logs
Product: OpenShift Container Platform
Component: Networking
Version: 3.6.0
Status: CLOSED DUPLICATE
Severity: medium
Priority: medium
Hardware: All
OS: Linux
Type: Bug
Reporter: Mike Fiedler <mifiedle>
Assignee: Dan Williams <dcbw>
QA Contact: Meng Bo <bmeng>
CC: aloughla, aos-bugs, atragler, bbennett, jeder, mifiedle, wabouham, zzhao
Last Closed: 2017-06-09 20:29:22 UTC

Attachments: Node log (attachment 1279103)

Description Mike Fiedler 2017-05-15 20:00:46 UTC
Created attachment 1279103 [details]
Node log

Description of problem:

Running a scale cluster of 100 nodes, 1000 namespaces, and 4000 running pods. During the scale-up, 3 pods got stuck in the ContainerCreating state. All three were on the same node, and the node logs showed a sequence of errors like:

May 15 15:14:05 svt-n-1-13 atomic-openshift-node: W0515 15:14:05.916590   33367 docker_sandbox.go:263] NetworkPlugin cni failed on the status hook for pod "deploymentconfig0-1-deploy_svt922": Unexpected command output nsenter: cannot open : No such file or directory
May 15 15:15:07 svt-n-1-13 atomic-openshift-node: E0515 15:15:07.374371   33367 cni.go:257] Error adding network: CNI request failed with status 400: 'failed to find netid for namespace: svt922 in vnid map
May 15 15:15:07 svt-n-1-13 atomic-openshift-node: E0515 15:15:07.374435   33367 cni.go:211] Error while adding to cni network: CNI request failed with status 400: 'failed to find netid for namespace: svt922 in vnid map
May 15 15:15:07 svt-n-1-13 atomic-openshift-node: E0515 15:15:07.708740   33367 remote_runtime.go:86] RunPodSandbox from runtime service failed: rpc error: code = 2 desc = NetworkPlugin cni failed to set up pod "deploymentconfig0-1-deploy_svt922" network: CNI request failed with status 400: 'failed to find netid for namespace: svt922 in vnid map
May 15 15:15:07 svt-n-1-13 atomic-openshift-node: E0515 15:15:07.708799   33367 kuberuntime_sandbox.go:54] CreatePodSandbox for pod "deploymentconfig0-1-deploy_svt922(239ce3d0-39a2-11e7-82e0-fa163e9d2633)" failed: rpc error: code = 2 desc = NetworkPlugin cni failed to set up pod "deploymentconfig0-1-deploy_svt922" network: CNI request failed with status 400: 'failed to find netid for namespace: svt922 in vnid map
May 15 15:15:07 svt-n-1-13 atomic-openshift-node: E0515 15:15:07.708817   33367 kuberuntime_manager.go:619] createPodSandbox for pod "deploymentconfig0-1-deploy_svt922(239ce3d0-39a2-11e7-82e0-fa163e9d2633)" failed: rpc error: code = 2 desc = NetworkPlugin cni failed to set up pod "deploymentconfig0-1-deploy_svt922" network: CNI request failed with status 400: 'failed to find netid for namespace: svt922 in vnid map
May 15 15:15:07 svt-n-1-13 atomic-openshift-node: E0515 15:15:07.708858   33367 pod_workers.go:182] Error syncing pod 239ce3d0-39a2-11e7-82e0-fa163e9d2633 ("deploymentconfig0-1-deploy_svt922(239ce3d0-39a2-11e7-82e0-fa163e9d2633)"), skipping: failed to "CreatePodSandbox" for "deploymentconfig0-1-deploy_svt922(239ce3d0-39a2-11e7-82e0-fa163e9d2633)" with CreatePodSandboxError: "CreatePodSandbox for pod \"deploymentconfig0-1-deploy_svt922(239ce3d0-39a2-11e7-82e0-fa163e9d2633)\" failed: rpc error: code = 2 desc = NetworkPlugin cni failed to set up pod \"deploymentconfig0-1-deploy_svt922\" network: CNI request failed with status 400: 'failed to find netid for namespace: svt922 in vnid map\n'"
May 15 15:16:31 svt-n-1-13 atomic-openshift-node: W0515 15:16:31.682161   33367 docker_sandbox.go:263] NetworkPlugin cni failed on the status hook for pod "deploymentconfig0-1-deploy_svt922": Unexpected command output nsenter: cannot open : No such file or directory
May 15 15:16:31 svt-n-1-13 atomic-openshift-node: W0515 15:16:31.693794   33367 docker_sandbox.go:263] NetworkPlugin cni failed on the status hook for pod "deploymentconfig0-1-deploy_svt922": Unexpected command output nsenter: cannot open : No such file or directory
May 15 15:16:31 svt-n-1-13 atomic-openshift-node: W0515 15:16:31.708684   33367 docker_sandbox.go:263] NetworkPlugin cni failed on the status hook for pod "deploymentconfig0-1-deploy_svt922": Unexpected command output nsenter: cannot open : No such file or directory
May 15 15:16:31 svt-n-1-13 atomic-openshift-node: W0515 15:16:31.721026   33367 docker_sandbox.go:263] NetworkPlugin cni failed on the status hook for pod "deploymentconfig0-1-deploy_svt922": Unexpected command output nsenter: cannot open : No such file or directory
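
The same failure typically also surfaces in the stuck pod's events on the API side. A hedged example for one of the affected pods (pod and namespace names taken from the log lines above; the exact event wording varies by release):

  # Inspect the stuck pod's events for the same CNI / CreatePodSandbox failure
  oc describe pod deploymentconfig0-1-deploy -n svt922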



Version-Release number of selected component (if applicable): 3.6.74


How reproducible: Unknown. Will report in this bug if it happens during the next run.


Steps to Reproduce:
1. Set up a 100-node cluster.
2. Run cluster-loader (https://github.com/openshift/svt/tree/master/openshift_scalability) with this configuration: https://github.com/openshift/svt/blob/master/openshift_scalability/config/pyconfigMasterVirtScalePause.yaml
3. At the end of the run, look for pods stuck in ContainerCreating (for example, with the query sketched below).
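
A minimal way to run step 3 across all namespaces (assuming cluster-admin access with the oc client; column positions may differ between releases):

  # Print namespace and name of every pod stuck in ContainerCreating
  oc get pods --all-namespaces --no-headers | awk '$4 == "ContainerCreating" {print $1, $2}'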

Actual results:

3 pods stuck in ContainerCreating

Expected results:

All pods created successfully

Additional info:

Full node logs are attached. Search for svt922, svt914, and svt899; those are the namespaces of the failed/hung pods.
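
A quick way to pull those entries out of the attached log, or straight from the node's journal (node.log here is just a placeholder name for the attachment; the unit name matches the log prefix above):

  # Search the attached node log for the affected namespaces
  grep -E 'svt922|svt914|svt899' node.log

  # Or query the journal on the node itself
  journalctl -u atomic-openshift-node | grep -E 'svt922|svt914|svt899'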

Comment 3 Dan Williams 2017-06-03 03:06:50 UTC
If you see this again, please:

1) oc get netnamespace -o wide
2) after that, modify the atomic-openshift-node systemd service file in /etc/systemd/system/atomic-openshift-node.service and set --loglevel=5 and restart.  Then wait for the problem to appear again.
3) Or better yet, provision the cluster with --loglevel=5 on all the nodes.

In any case, it's very likely some errors earlier are causing the "failed to find netid for namespace", and we should figure out what those are.
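
A rough sketch of those steps on an affected node (assuming the unit file already passes a --loglevel flag; if not, add it to the ExecStart line by hand, and note that some installs keep node options in /etc/sysconfig/atomic-openshift-node instead):

  # 1) Capture the current namespace-to-VNID mapping
  oc get netnamespace -o wide

  # 2) Bump node verbosity to 5 and restart the node service
  sed -i 's/--loglevel=[0-9]*/--loglevel=5/' /etc/systemd/system/atomic-openshift-node.service
  systemctl daemon-reload
  systemctl restart atomic-openshift-node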

Comment 4 Dan Williams 2017-06-09 20:29:22 UTC
(In reply to Dan Williams from comment #3)
> If you see this again, please:
> 
> 1) oc get netnamespace -o wide
> 2) after that, modify the atomic-openshift-node systemd service file in
> /etc/systemd/system/atomic-openshift-node.service and set --loglevel=5 and
> restart.  Then wait for the problem to appear again.
> 3) Or better yet, provision the cluster with --loglevel=5 on all the nodes.
> 
> In any case, it's very likely some errors earlier are causing the "failed to
> find netid for namespace", and we should figure out what those are.

So, having debugged the "failed to find netid for namespace" issue as part of bug 1451902, this is very likely due to the kubelet being blocked by docker and the SDN code not being given time to run, with the result that some events arrive out of order.

I'm going to dupe this bug to that one for now; if we solve the docker blockage issue and still see the same "failed to find netid" errors, we can un-dupe and proceed.

*** This bug has been marked as a duplicate of bug 1451902 ***

Comment 5 Ben Bennett 2017-06-22 19:17:14 UTC
*** Bug 1461370 has been marked as a duplicate of this bug. ***