Description of problem:
Deployments always fail with the error "Failed create pod sandbox".

Version-Release number of selected component (if applicable):
openshift v3.7.0-0.178.2
kubernetes v1.7.6+a08f5eeb62

How reproducible:
30%

Steps to Reproduce:
1. Log in to the online cluster and create a project.
2. Create an app: `oc new-app --docker-image=openshift/deployment-example --name=ab-example-a -l ab-example\=true --env=SUBTITLE\=shardA`
3. Check the app status.

Actual results:
3. The deployment fails with the error "Failed create pod sandbox"; after 13 minutes the pod is still not running:
ab-example-b-1-deploy   0/1   ContainerCreating   0   13m

Events:
  FirstSeen  LastSeen  Count  From                                                   SubObjectPath  Type     Reason                  Message
  ---------  --------  -----  ----                                                   -------------  ----     ------                  -------
  1m         1m        1      default-scheduler                                                     Normal   Scheduled               Successfully assigned ab-example-a-1-deploy to ip-172-31-71-195.us-east-2.compute.internal
  1m         1m        1      kubelet, ip-172-31-71-195.us-east-2.compute.internal                  Normal   SuccessfulMountVolume   MountVolume.SetUp succeeded for volume "deployer-token-fz5lj"
  39s        39s       1      kubelet, ip-172-31-71-195.us-east-2.compute.internal                  Warning  FailedCreatePodSandBox  Failed create pod sandbox.
  38s        38s       1      kubelet, ip-172-31-71-195.us-east-2.compute.internal                  Normal   SandboxChanged          Pod sandbox changed, it will be killed and re-created.

Expected results:
3. The deployment succeeds.

Additional info:
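For anyone triaging the same symptom, a minimal sketch of the commands that collect the information above (the pod name is from this example; the node service on these clusters is atomic-openshift-node):

```
# Find which node the stuck deployer pod landed on and see the sandbox events
oc get pod ab-example-a-1-deploy -o wide
oc describe pod ab-example-a-1-deploy

# On that node, pull the recent kubelet/SDN log lines around the failure
journalctl -u atomic-openshift-node --since "30 min ago" | grep -iE 'sandbox|cni'
```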
Created attachment 1348418 [details] The JSON for the resources.
Created attachment 1348420 [details] The logs for the api and controller services
Created attachment 1348429 [details] The log from the node service
Created attachment 1348430 [details] The log from the docker service
```
Nov 06 03:32:34 ip-172-31-71-195.us-east-2.compute.internal atomic-openshift-node[107021]: E1106 03:32:34.586744  107021 pod_workers.go:186] Error syncing pod 63a56225-c2a2-11e7-a0b8-0203ad7dfcd7 ("pi-2x6cx_1d2hj(63a56225-c2a2-11e7-a0b8-0203ad7dfcd7)"), skipping: failed to "CreatePodSandbox" for "pi-2x6cx_1d2hj(63a56225-c2a2-11e7-a0b8-0203ad7dfcd7)" with CreatePodSandboxError: "CreatePodSandbox for pod \"pi-2x6cx_1d2hj(63a56225-c2a2-11e7-a0b8-0203ad7dfcd7)\" failed: rpc error: code = 2 desc = NetworkPlugin cni failed to set up pod \"pi-2x6cx_1d2hj\" network: CNI request failed with status 400: 'pods \"pi-2x6cx\" not found\n'"
Nov 06 03:32:30 ip-172-31-71-195.us-east-2.compute.internal atomic-openshift-node[107021]: I1106 03:32:30.008082  107021 kubelet.go:2047] Failed to delete pod "pull-11060330z-ly-1-deploy_ops-health-monitoring(ce057b2b-c2a2-11e7-a0b8-0203ad7dfcd7)", err: pod not found
```
I can see the above log entries in node_service.log
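A quick triage sketch using the pod and namespace names from the log above (run the journalctl command on the affected node):

```
# How often is CNI setup failing on this node?
journalctl -u atomic-openshift-node | grep -c 'CNI request failed with status 400'

# Does the pod the CNI plugin complains about still exist on the apiserver?
oc get pod pi-2x6cx -n 1d2hj -o wide
```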
Still reproduced on free-int v3.7.0-0.191.0 with oc client v3.7.0-0.196.0

$ oc create -f https://raw.githubusercontent.com/openshift-qe/v3-testfiles/master/pods/pod_with_two_containers.json
pod "doublecontainers" created

$ oc get po
NAME               READY     STATUS              RESTARTS   AGE
doublecontainers   0/2       ContainerCreating   0          17m

$ oc get event
LASTSEEN   FIRSTSEEN   COUNT   NAME               KIND   SUBOBJECT   TYPE      REASON                   SOURCE                                  MESSAGE
20m        20m         1       doublecontainers   Pod                Normal    Scheduled                default-scheduler                       Successfully assigned doublecontainers to ip-172-31-49-44.ec2.internal
20m        20m         1       doublecontainers   Pod                Normal    SuccessfulMountVolume    kubelet, ip-172-31-49-44.ec2.internal   MountVolume.SetUp succeeded for volume "default-token-fs49s"
20m        20m         1       doublecontainers   Pod                Normal    SuccessfulMountVolume    kubelet, ip-172-31-49-44.ec2.internal   MountVolume.SetUp succeeded for volume "tmp"
43s        20m         39      doublecontainers   Pod                Warning   FailedCreatePodSandBox   kubelet, ip-172-31-49-44.ec2.internal   Failed create pod sandbox.
12m        20m         12      doublecontainers   Pod                Normal    SandboxChanged           kubelet, ip-172-31-49-44.ec2.internal   Pod sandbox changed, it will be killed and re-created.
@Michal Fojtik: I checked both free-int and free-stg with `systemctl status cri-o`; the output is "Unit cri-o.service could not be found", and `runc list` shows no containers either.
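For reference, a minimal sketch of that runtime check, run directly on a node (the docker check at the end is an extra sanity step, not something quoted above):

```
systemctl status cri-o    # on free-int/free-stg: "Unit cri-o.service could not be found"
runc list                 # empty here
systemctl status docker   # confirms the node is using docker as the container runtime
```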
This problem blocked QE testing: when it happened, it was very difficult to create a pod successfully, so the environment was almost unusable.
The node logs are packed (1870 occurrences) with the following:

cni.go:294] Error adding network: CNI request failed with status 400: 'failed to find netid for namespace: <namespace> in vnid map'

AFAICT, every RunPodSandbox/CreatePodSandbox call is hitting this. Sending to Networking.
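To cross-check a node against the master's view of the vnid map, something like this should help (requires cluster-admin; the grep pattern is just the error string above):

```
# NetID assignments as the master sees them
oc get netnamespaces

# How many vnid lookups have failed on this node
journalctl -u atomic-openshift-node | grep -c 'failed to find netid for namespace'
```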
*** Bug 1510307 has been marked as a duplicate of this bug. ***
I can't reproduce this on free-int. @dcbw is going to add monitoring points to the SDN code so that we have some visibility into this problem for 3.7.0, and based on what that turns up we will fix it in 3.7.1.
@dcbw's PR for improved metrics: https://github.com/openshift/origin/pull/17239
We hope PR https://github.com/openshift/origin/pull/17243 will solve this issue.
https://github.com/openshift/origin/pull/17239 merged on Wednesday, moving to MODIFIED
Reproduced this issue on free-stg.

Version:
Server https://api.free-stg.openshift.com:443
openshift v3.7.4
kubernetes v1.7.6+a08f5eeb62
1. Create an app
# oc new-app ruby https://github.com/openshift/ruby-hello-world.git -n xiu

2. Check the pod status
# oc get pods
NAME                       READY     STATUS     RESTARTS   AGE
ruby-hello-world-1-build   0/1       Init:0/2   0          25m

# oc describe pods ruby-hello-world-1-build
---
25m   25m   1    default-scheduler                                      Normal    Scheduled                Successfully assigned ruby-hello-world-1-build to ip-172-31-71-195.us-east-2.compute.internal
25m   25m   1    kubelet, ip-172-31-71-195.us-east-2.compute.internal   Normal    SuccessfulMountVolume    MountVolume.SetUp succeeded for volume "docker-socket"
25m   25m   1    kubelet, ip-172-31-71-195.us-east-2.compute.internal   Normal    SuccessfulMountVolume    MountVolume.SetUp succeeded for volume "crio-socket"
25m   25m   1    kubelet, ip-172-31-71-195.us-east-2.compute.internal   Normal    SuccessfulMountVolume    MountVolume.SetUp succeeded for volume "builder-dockercfg-vxqc4-push"
25m   25m   1    kubelet, ip-172-31-71-195.us-east-2.compute.internal   Normal    SuccessfulMountVolume    MountVolume.SetUp succeeded for volume "builder-token-pxzc9"
25m   25m   1    kubelet, ip-172-31-71-195.us-east-2.compute.internal   Normal    SuccessfulMountVolume    MountVolume.SetUp succeeded for volume "buildworkdir"
23m   14m   11   kubelet, ip-172-31-71-195.us-east-2.compute.internal   Normal    SandboxChanged           Pod sandbox changed, it will be killed and re-created.
24m   3m    19   kubelet, ip-172-31-71-195.us-east-2.compute.internal   Warning   FailedCreatePodSandBox   Failed create pod sandbox.
The issue is not always reproducible, but it seems easy to reproduce when the pod is scheduled to node ip-172-31-71-195.us-east-2.compute.internal. For example, pods are running within 10s on other nodes:

# oc create -f https://raw.githubusercontent.com/openshift-qe/v3-testfiles/master/routing/list_for_pods.json
replicationcontroller "test-rc" created
service "service-secure" created
service "service-unsecure" created
# oc get pod -o wide
NAME            READY     STATUS    RESTARTS   AGE       IP             NODE
test-rc-6qkpb   1/1       Running   0          9s        10.128.1.101   ip-172-31-76-218.us-east-2.compute.internal
test-rc-v9q8n   1/1       Running   0          9s        10.131.1.65    ip-172-31-74-247.us-east-2.compute.internal

but pods stay in ContainerCreating for more than 30 minutes on node ip-172-31-71-195:

# oc create -f https://raw.githubusercontent.com/openshift-qe/v3-testfiles/master/routing/list_for_pods.json
replicationcontroller "test-rc" created
service "service-secure" created
service "service-unsecure" created
# oc get pod -o wide
NAME            READY     STATUS              RESTARTS   AGE       IP        NODE
test-rc-d6wm5   0/1       ContainerCreating   0          33m       <none>    ip-172-31-71-195.us-east-2.compute.internal
test-rc-mrbxb   0/1       ContainerCreating   0          33m       <none>    ip-172-31-71-195.us-east-2.compute.internal
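To confirm the failure follows the node rather than the workload, a rough sketch that pins a test pod to the suspect node via spec.nodeName (the pod name is arbitrary; the image is the one used earlier in this bug):

```
cat <<'EOF' | oc create -f -
apiVersion: v1
kind: Pod
metadata:
  name: sandbox-probe
spec:
  nodeName: ip-172-31-71-195.us-east-2.compute.internal   # bypass the scheduler
  containers:
  - name: example
    image: openshift/deployment-example
EOF

oc get pod sandbox-probe -o wide
oc describe pod sandbox-probe | tail -n 20
```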
https://github.com/openshift/origin/pull/17239 adds a metric for netid failures but won't fix the issue. Changing the status back to ASSIGNED; will move to MODIFIED once https://github.com/openshift/origin/pull/17243 is merged.
*** Bug 1512370 has been marked as a duplicate of this bug. ***
Thanks for the info. That node helped narrow things down. It had 44 bad sandbox files, so every minute it was trying to delete the pods, getting the bad checksum, and stopping.

We think you can reproduce it with:
- Start up 100 pods
- Stop the node (but not the master)
- "oc delete pods --all"
- On the node:
  - "docker kill" all currently-running containers
  - perl -pi -e 's/"checksum":[0-9]*/"checksum":999/;' /var/lib/dockershim/sandbox/*
- Then start the node

https://bugzilla.redhat.com/show_bug.cgi?id=1512370#c4 is related if the problem occurs after you make the corrupt files.

BUT even after all of the corrupt sandbox files are cleaned, node ip-172-31-71-195.us-east-2.compute.internal is slow.
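A rough sketch of those steps as shell commands, assuming a docker-based node with checkpoints under /var/lib/dockershim/sandbox and the atomic-openshift-node unit (the replication controller name is only an example):

```
# On the master: create load
oc scale rc test-rc --replicas=100

# On the node: stop the node service, then delete the pods from the master
systemctl stop atomic-openshift-node
oc delete pods --all                      # run on the master

# Still on the node: kill running containers and corrupt the sandbox checkpoints
docker kill $(docker ps -q)
perl -pi -e 's/"checksum":[0-9]*/"checksum":999/;' /var/lib/dockershim/sandbox/*

# Restart the node service
systemctl start atomic-openshift-node
```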
(In reply to Ben Bennett from comment #34)
> Thanks for the info. That node helped narrow things down. It had 44 bad
> sandbox files so every minute was trying to delete the pods, getting the bad
> checksum, and stopping.
>
> We think you can reproduce it with:
>  - Start up 100 pods
>  - Stop the node (but not the master)
>  - "oc delete pods --all"
>  - On the node:
>    - "docker kill" all currently-running containers
>    - perl -pi -e 's/"checksum":[0-9]*/"checksum":999/;' /var/lib/dockershim/sandbox/*
>  - Then start the node
>
> https://bugzilla.redhat.com/show_bug.cgi?id=1512370#c4 is related.
>
> BUT even after all of the corrupt sandbox files are cleaned node
> ip-172-31-71-195.us-east-2.compute.internal is slow.

Hi Ben,
I can see similar errors like "checkpoint is corrupted" in the logs when reproducing with those steps, but after that new pods can still be created successfully rather than getting stuck in "ContainerCreating" status. I tested on OCP v3.7.7; full node logs are attached.
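For the record, checks along these lines can confirm the corrupted checkpoints on the node:

```
# Count the checkpoint corruption errors
journalctl -u atomic-openshift-node | grep -ci 'checkpoint is corrupted'

# How many sandbox checkpoint files are left on disk
ls /var/lib/dockershim/sandbox/ | wc -l
```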
My current theory is that there's a resource leak of some sort and so you have to leave things running with the bad checkpoint files for "a long time" before you start to see problems. I don't know how long "a long time" is. (More than an hour, less than a month...)
Commits pushed to master at https://github.com/openshift/origin

https://github.com/openshift/origin/commit/e833dc98cbb82ac21d30c2f4724d8b29ab30096b
Bug 1509799 - Fix WaitAndGetVNID() in sdn node
- Existing 1 sec timeout for WaitAndGetVNID() may not be sufficient for loaded nodes (I might have chosen this value due to lack of enough info). Now the timeout is bumped to 5 secs.
- WaitAndGetVNID() also queries the api server if needed (do not expect this to happen with the new timeout) to differentiate between these 2 cases:
  (1) Genuine error: Namespace doesn't exist so netid won't be present
  (2) Propagation issue: Namespace exists but watch NetNamespace event not propagated to the node. May be insufficient timeout or event missed. Newly added VnidNotFoundErrors metric will give more insights.

https://github.com/openshift/origin/commit/24f9407cb255064d36c31566d5d1a4b00e4a6b22
Merge pull request #17243 from pravisankar/fix-get-vnid
Automatic merge from submit-queue (batch tested with PRs 17012, 17243).
Bug 1509799 - Fix WaitAndGetVNID() in sdn node
- Existing 1 sec timeout for WaitAndGetVNID() may not be sufficient for loaded nodes (I might have chosen this value due to lack of enough info). Now the timeout is bumped to 5 secs.
- WaitAndGetVNID() also queries the api server if needed (do not expect this to happen often with the new timeout) to differentiate between these 2 cases:
  (1) Genuine error: Namespace doesn't exist so netid won't be present
  (2) Propagation issue: Namespace exists but watch NetNamespace event not propagated to the node. May be insufficient timeout or event missed. Newly added VnidNotFoundErrors metric will give more insights.
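Once nodes run a build that includes these commits, the new counter should be visible in the node's Prometheus output. A sketch of how one might look for it (the exact metric name, and whether SDN metrics are served on the kubelet's :10250/metrics endpoint, are assumptions, not something stated in this bug):

```
TOKEN=$(oc whoami -t)
# Assumption: SDN metrics are registered on the kubelet metrics endpoint (port 10250)
curl -sk -H "Authorization: Bearer $TOKEN" \
  https://ip-172-31-71-195.us-east-2.compute.internal:10250/metrics | grep -i vnid
```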
On the free-stg cluster, pods fail to start up with "failed to create pod sandbox".
OpenShift Master: v3.7.9 (online version 3.6.0.78)
Kubernetes Master: v1.7.6+a08f5eeb62
Server https://api.free-int.openshift.com:443
openshift v3.8.18

Still able to reproduce this bug sometimes in free-int on the node below:
ip-172-31-59-87.ec2.internal
v3.8.18 is only on the control plane (and cri-o nodes) in free-int. I suspect this fix can only be tested in an environment where all nodes are running v3.8 or beyond.
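A quick way to see which nodes actually carry the fix (the VERSION column from oc get nodes is the per-node kubelet version, independent of the master version):

```
oc get nodes -o wide     # VERSION column shows each node's kubelet/openshift level

# or directly on a node:
openshift version
```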
XiuJuan Wang: Is this still blocking? Have we been able to test when everything is running v3.8 or later?
Ravi: Eric observes that the node processes on the master were updated. Can you take a look at this again please since it's a test blocker.
As Justin pointed out in comment#45, masters are running v3.8.31 but nodes are still on v3.7.9. The proposed fix is for openshift-node and is available in v3.8.0+ (https://github.com/openshift/origin/commit/e833dc98cbb82ac21d30c2f4724d8b29ab30096b). So this cannot be tested until all the nodes are upgraded to an openshift version that has the fix.
I don't see that node listed in ohi... what type is it?
Sorry... I was on the wrong cluster; that node is ip-172-31-59-87.ec2.internal. So it is running the wrong version and needs to be running 3.8 to have the fix.
Verified this bug on:
Server https://internal.api.free-int.openshift.com:443
openshift v3.9.0-0.21.0
kubernetes v1.9.1+a0ce1bc657

Did not find the issue.
Attaching a case for a customer who has encountered this in a standalone OCP cluster:

2018-01-29 10:03:07 -0600 CST   2018-01-29 10:02:58 -0600 CST   9   app-lab-6-build   Pod     Warning   FailedCreatePodSandBox   kubelet, node1.example.com   Failed create pod sandbox.
2018-01-29 10:03:08 -0600 CST   2018-01-29 10:02:59 -0600 CST   9   app-lab-6-build   Pod     Normal    SandboxChanged           kubelet, node1.example.com   Pod sandbox changed, it will be killed and re-created.
2018-01-29 10:04:05 -0600 CST   2018-01-29 10:04:05 -0600 CST   1   app-lab-6         Build   Normal    BuildCancelled           build-controller             Build app-mon-lab/app-lab-6 has been cancelled
2018-01-29 10:07:46 -0600 CST   2018-01-29 10:07:46 -0600 CST   1   app-lab-7-build   Pod     Normal    Scheduled                default-scheduler            Successfully assigned app-lab-7-build to node1.example.com

Is the fix something that one could implement manually via config, or will this require waiting on a patch? (The commit looks like it uses hardcoded values: https://github.com/openshift/origin/commit/e833dc98cbb82ac21d30c2f4724d8b29ab30096b)
@stwalter As mentioned in comment#48, unfortunately this is not a config change, and the patch is only available in v3.8.0+ releases.
@Steven Walter: "FailedCreatePodSandBox" is an extremely vague/broad error. You should not assume that the problem your customer is seeing is necessarily any of the problems that were addressed in this bug without doing further debugging to narrow down what was causing the pod creation failures. (In particular, some of the problems here only apply to "OpenShift-Online-sized" clusters with hundreds of nodes and tens of thousands of users.)
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2018:0489