Created attachment 1279103 [details]
Node log

Description of problem:

Running a scale cluster of 100 nodes, 1000 namespaces, and 4000 running pods. During the scale-up, 3 pods got stuck in the ContainerCreating state. All three were on the same node, whose logs show a sequence of errors like:

May 15 15:14:05 svt-n-1-13 atomic-openshift-node: W0515 15:14:05.916590 33367 docker_sandbox.go:263] NetworkPlugin cni failed on the status hook for pod "deploymentconfig0-1-deploy_svt922": Unexpected command output nsenter: cannot open : No such file or directory
May 15 15:15:07 svt-n-1-13 atomic-openshift-node: E0515 15:15:07.374371 33367 cni.go:257] Error adding network: CNI request failed with status 400: 'failed to find netid for namespace: svt922 in vnid map
May 15 15:15:07 svt-n-1-13 atomic-openshift-node: E0515 15:15:07.374435 33367 cni.go:211] Error while adding to cni network: CNI request failed with status 400: 'failed to find netid for namespace: svt922 in vnid map
May 15 15:15:07 svt-n-1-13 atomic-openshift-node: E0515 15:15:07.708740 33367 remote_runtime.go:86] RunPodSandbox from runtime service failed: rpc error: code = 2 desc = NetworkPlugin cni failed to set up pod "deploymentconfig0-1-deploy_svt922" network: CNI request failed with status 400: 'failed to find netid for namespace: svt922 in vnid map
May 15 15:15:07 svt-n-1-13 atomic-openshift-node: E0515 15:15:07.708799 33367 kuberuntime_sandbox.go:54] CreatePodSandbox for pod "deploymentconfig0-1-deploy_svt922(239ce3d0-39a2-11e7-82e0-fa163e9d2633)" failed: rpc error: code = 2 desc = NetworkPlugin cni failed to set up pod "deploymentconfig0-1-deploy_svt922" network: CNI request failed with status 400: 'failed to find netid for namespace: svt922 in vnid map
May 15 15:15:07 svt-n-1-13 atomic-openshift-node: E0515 15:15:07.708817 33367 kuberuntime_manager.go:619] createPodSandbox for pod "deploymentconfig0-1-deploy_svt922(239ce3d0-39a2-11e7-82e0-fa163e9d2633)" failed: rpc error: code = 2 desc = NetworkPlugin cni failed to set up pod "deploymentconfig0-1-deploy_svt922" network: CNI request failed with status 400: 'failed to find netid for namespace: svt922 in vnid map
May 15 15:15:07 svt-n-1-13 atomic-openshift-node: E0515 15:15:07.708858 33367 pod_workers.go:182] Error syncing pod 239ce3d0-39a2-11e7-82e0-fa163e9d2633 ("deploymentconfig0-1-deploy_svt922(239ce3d0-39a2-11e7-82e0-fa163e9d2633)"), skipping: failed to "CreatePodSandbox" for "deploymentconfig0-1-deploy_svt922(239ce3d0-39a2-11e7-82e0-fa163e9d2633)" with CreatePodSandboxError: "CreatePodSandbox for pod \"deploymentconfig0-1-deploy_svt922(239ce3d0-39a2-11e7-82e0-fa163e9d2633)\" failed: rpc error: code = 2 desc = NetworkPlugin cni failed to set up pod \"deploymentconfig0-1-deploy_svt922\" network: CNI request failed with status 400: 'failed to find netid for namespace: svt922 in vnid map\n'"
May 15 15:16:31 svt-n-1-13 atomic-openshift-node: W0515 15:16:31.682161 33367 docker_sandbox.go:263] NetworkPlugin cni failed on the status hook for pod "deploymentconfig0-1-deploy_svt922": Unexpected command output nsenter: cannot open : No such file or directory
May 15 15:16:31 svt-n-1-13 atomic-openshift-node: W0515 15:16:31.693794 33367 docker_sandbox.go:263] NetworkPlugin cni failed on the status hook for pod "deploymentconfig0-1-deploy_svt922": Unexpected command output nsenter: cannot open : No such file or directory
May 15 15:16:31 svt-n-1-13 atomic-openshift-node: W0515 15:16:31.708684 33367 docker_sandbox.go:263] NetworkPlugin cni failed on the status hook for pod "deploymentconfig0-1-deploy_svt922": Unexpected command output nsenter: cannot open : No such file or directory
May 15 15:16:31 svt-n-1-13 atomic-openshift-node: W0515 15:16:31.721026 33367 docker_sandbox.go:263] NetworkPlugin cni failed on the status hook for pod "deploymentconfig0-1-deploy_svt922": Unexpected command output nsenter: cannot open : No such file or directory

Version-Release number of selected component (if applicable):
3.6.74

How reproducible:
Unknown. Will report in this bug if it happens during the next run.

Steps to Reproduce:
1. Provision a 100-node cluster.
2. Run cluster-loader (https://github.com/openshift/svt/tree/master/openshift_scalability) with this configuration: https://github.com/openshift/svt/blob/master/openshift_scalability/config/pyconfigMasterVirtScalePause.yaml
3. At the end of the run, look for pods stuck in ContainerCreating.

Actual results:
3 pods stuck in ContainerCreating.

Expected results:
All pods created successfully.

Additional info:
Full node logs attached. Search for svt922, svt914, and svt899; those are the namespaces of the failed/hung pods.
If you see this again, please:

1) oc get netnamespace -o wide
2) after that, modify the atomic-openshift-node systemd service file in /etc/systemd/system/atomic-openshift-node.service to set --loglevel=5 and restart. Then wait for the problem to appear again.
3) Or better yet, provision the cluster with --loglevel=5 on all the nodes.

In any case, it's very likely that some earlier errors are causing the "failed to find netid for namespace" failure, and we should figure out what those are.
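For reference, a minimal sketch of steps 1 and 2 above (the sed assumes a --loglevel flag already appears in the service file; adjust to the actual ExecStart line on your install):

  # Dump the netnamespace table to compare against the "not in vnid map" error
  oc get netnamespace -o wide

  # On the affected node: raise the node log level to 5 and restart the service
  sed -i 's/--loglevel=[0-9]*/--loglevel=5/' /etc/systemd/system/atomic-openshift-node.service
  systemctl daemon-reload
  systemctl restart atomic-openshift-node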
(In reply to Dan Williams from comment #3)
> If you see this again, please:
>
> 1) oc get netnamespace -o wide
> 2) after that, modify the atomic-openshift-node systemd service file in
> /etc/systemd/system/atomic-openshift-node.service to set --loglevel=5 and
> restart. Then wait for the problem to appear again.
> 3) Or better yet, provision the cluster with --loglevel=5 on all the nodes.
>
> In any case, it's very likely that some earlier errors are causing the
> "failed to find netid for namespace" failure, and we should figure out what
> those are.

Having debugged the "failed to find netid for namespace" issue as part of bug 1451902, this is very likely the kubelet being blocked by docker and the SDN code not being given time to run, so some events arrive out of order. I'm going to dupe this bug to that one for now; if we solve the docker blockage and still see the same "failed to find netid" error, we can un-dupe and proceed.

*** This bug has been marked as a duplicate of bug 1451902 ***
*** Bug 1461370 has been marked as a duplicate of this bug. ***