Bug 1509799 - [OnlineStarter][free-stg][free-int]"Failed create pod sandbox"---deploy failed
Summary: [OnlineStarter][free-stg][free-int]"Failed create pod sandbox"---deploy failed
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Networking
Version: 3.7.0
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: ---
Target Release: 3.9.0
Assignee: Ravi Sankar
QA Contact: Meng Bo
URL:
Whiteboard:
Duplicates: 1510307 1512370
Depends On:
Blocks: 1512717 1540606 1540608
Reported: 2017-11-06 06:21 UTC by zhou ying
Modified: 2018-06-16 09:29 UTC
CC List: 33 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Cause: The OpenShift node did not wait long enough for the VNID; the master assigns the VNID, and it can take a while to propagate to the node. Consequence: Pod creation fails. Fix: Increase the timeout for fetching the VNID on the node from 1 second to 5 seconds. Result: Pod creation should succeed.
Clone Of:
Clones: 1512717 1540606 1540608
Environment:
Last Closed: 2018-03-28 14:09:47 UTC
Target Upstream Version:
Embargoed:


Attachments
The json about the resources. (27.65 KB, text/plain)
2017-11-06 06:23 UTC, zhou ying
The log about api and controller services (5.40 MB, application/x-gzip)
2017-11-06 06:30 UTC, zhaliu
The log about node services (3.96 MB, application/x-gzip)
2017-11-06 07:36 UTC, zhaliu
The log about docker services (5.11 MB, application/x-gzip)
2017-11-06 07:39 UTC, zhaliu


Links
System ID                                     Last Updated
Origin (Github) 17243                         2018-01-31 14:02:22 UTC
Red Hat Knowledge Base (Solution) 3337301     2018-01-29 20:12:05 UTC
Red Hat Product Errata RHBA-2018:0489         2018-03-28 14:10:51 UTC

Description zhou ying 2017-11-06 06:21:16 UTC
Description of problem:
Deployment fails with the error: Failed create pod sandbox.

Version-Release number of selected component (if applicable):
openshift v3.7.0-0.178.2
kubernetes v1.7.6+a08f5eeb62

How reproducible:
30%

Steps to Reproduce:
1. Login online, and create project;
2. Create app:
  `oc new-app --docker-image=openshift/deployment-example --name=ab-example-a -l ab-example\=true --env=SUBTITLE\=shardA`
3. Check the app status.

Actual results:
3. App deployment failed with error: Failed create pod sandbox. After 13 minutes, the pod is still not running:

ab-example-b-1-deploy   0/1       ContainerCreating   0          13m


Events:
  FirstSeen     LastSeen        Count   From                                                    SubObjectPath   Type            Reason                  Message
  ---------     --------        -----   ----                                                    -------------   --------        ------                  -------
  1m            1m              1       default-scheduler                                                       Normal          Scheduled               Successfully assigned ab-example-a-1-deploy to ip-172-31-71-195.us-east-2.compute.internal
  1m            1m              1       kubelet, ip-172-31-71-195.us-east-2.compute.internal                    Normal          SuccessfulMountVolume   MountVolume.SetUp succeeded for volume "deployer-token-fz5lj"
  39s           39s             1       kubelet, ip-172-31-71-195.us-east-2.compute.internal                    Warning         FailedCreatePodSandBox  Failed create pod sandbox.
  38s           38s             1       kubelet, ip-172-31-71-195.us-east-2.compute.internal                    Normal          SandboxChanged          Pod sandbox changed, it will be killed and re-created.

Expected results:
3. App deployment succeeds.

Additional info:

Comment 1 zhou ying 2017-11-06 06:23:40 UTC
Created attachment 1348418 [details]
The json about the resources.

Comment 2 zhaliu 2017-11-06 06:30:58 UTC
Created attachment 1348420 [details]
The log about api and controller services

Comment 3 zhaliu 2017-11-06 07:36:32 UTC
Created attachment 1348429 [details]
The log about node services

Comment 4 zhaliu 2017-11-06 07:39:29 UTC
Created attachment 1348430 [details]
The log about docker services

Comment 6 Antonio Murdaca 2017-11-06 11:50:36 UTC
```
151040 Nov 06 03:32:34 ip-172-31-71-195.us-east-2.compute.internal atomic-openshift-node[107021]: E1106 03:32:34.586744  107021 pod_workers.go:186] Error syncing pod 63a56225-c2a2-11e7-a0b8-0203ad7dfcd7 ("pi-2x6cx_1d2hj(63a56225-c2a2-11e7-a0b8-0203ad7dfcd7)"), skipping: failed to "CreatePodSandbox" for "pi-2x6cx_1d2hj(63a56225-c2a2-11e7-a0b8-0203ad7dfcd7)" with CreatePodSandboxError: "CreatePodSandbox for pod \"pi-2x6cx_1d2hj(63a56225-c2a2-11e7-a0b8-0203ad7dfcd7)\" failed: rpc error: code = 2 desc = NetworkPlugin cni failed to set up pod \"pi-2x6cx_1d2hj\" network: CNI request failed with status 400: 'pods \"pi-2x6cx\" not found\n'"

151026 Nov 06 03:32:30 ip-172-31-71-195.us-east-2.compute.internal atomic-openshift-node[107021]: I1106 03:32:30.008082  107021 kubelet.go:2047] Failed to delete pod "pull-11060330z-ly-1-deploy_ops-health-monitoring(ce057b2b-c2a2-11e7-a0b8-0203ad7dfcd7)", err: pod not found
```

I can see the above logs in node_service.log

Comment 10 XiaochuanWang 2017-11-07 06:52:43 UTC
Still reproduced on free-int v3.7.0-0.191.0 with oc client v3.7.0-0.196.0

$ oc create -f https://raw.githubusercontent.com/openshift-qe/v3-testfiles/master/pods/pod_with_two_containers.json
pod "doublecontainers" created

$ oc get po
NAME               READY     STATUS              RESTARTS   AGE
doublecontainers   0/2       ContainerCreating   0          17m

$ oc get event
LASTSEEN   FIRSTSEEN   COUNT     NAME               KIND      SUBOBJECT   TYPE      REASON                   SOURCE                                  MESSAGE
20m        20m         1         doublecontainers   Pod                   Normal    Scheduled                default-scheduler                       Successfully assigned doublecontainers to ip-172-31-49-44.ec2.internal
20m        20m         1         doublecontainers   Pod                   Normal    SuccessfulMountVolume    kubelet, ip-172-31-49-44.ec2.internal   MountVolume.SetUp succeeded for volume "default-token-fs49s" 
20m        20m         1         doublecontainers   Pod                   Normal    SuccessfulMountVolume    kubelet, ip-172-31-49-44.ec2.internal   MountVolume.SetUp succeeded for volume "tmp" 
43s        20m         39        doublecontainers   Pod                   Warning   FailedCreatePodSandBox   kubelet, ip-172-31-49-44.ec2.internal   Failed create pod sandbox.
12m        20m         12        doublecontainers   Pod                   Normal    SandboxChanged           kubelet, ip-172-31-49-44.ec2.internal   Pod sandbox changed, it will be killed and re-created.

Comment 11 zhaliu 2017-11-07 08:09:37 UTC
@Michal Fojtik

Checked both free-int and free-stg with the command "systemctl status cri-o"; the output is:
"Unit cri-o.service could not be found"

The command "runc list" shows no containers either.

Comment 12 zhaliu 2017-11-07 08:21:43 UTC
This problem blocks QE testing. When it happens, it is very difficult to create a pod successfully, so the environment is almost unusable.

Comment 13 Seth Jennings 2017-11-08 03:02:46 UTC
The node logs are packed (1870 occurrences) with the following:

cni.go:294] Error adding network: CNI request failed with status 400: 'failed to find netid for namespace: <namespace> in vnid map

afaict, every RunPodSandbox/CreatePodSandbox call is hitting this.

Sending to Networking.

Comment 14 Ben Bennett 2017-11-08 13:36:51 UTC
*** Bug 1510307 has been marked as a duplicate of this bug. ***

Comment 15 Ben Bennett 2017-11-08 18:16:00 UTC
I can't reproduce this on free-int.  @dcbw is going to add monitoring points to the SDN code so that we have some visibility into this problem for 3.7.0, and based on what that turns up we will fix it in 3.7.1.

Comment 16 Ben Bennett 2017-11-08 18:49:33 UTC
@dcbw's PR for improved metrics: https://github.com/openshift/origin/pull/17239

Comment 17 Ravi Sankar 2017-11-09 05:05:56 UTC
We hope PR https://github.com/openshift/origin/pull/17243 will solve this issue.

Comment 21 Ben Bennett 2017-11-10 16:24:01 UTC
https://github.com/openshift/origin/pull/17239 merged on Wednesday, moving to MODIFIED

Comment 28 XiuJuan Wang 2017-11-13 05:55:09 UTC
Reproduced this issue on free-stg:

Server https://api.free-stg.openshift.com:443
openshift v3.7.4
kubernetes v1.7.6+a08f5eeb62

Comment 29 XiuJuan Wang 2017-11-13 07:41:11 UTC
1. Create an app:
# oc new-app ruby https://github.com/openshift/ruby-hello-world.git -n xiu
2. Check the pod status:
# oc get pods
NAME                       READY     STATUS     RESTARTS   AGE
ruby-hello-world-1-build   0/1       Init:0/2   0          25m
# oc describe pods ruby-hello-world-1-build
---
  25m		25m		1	default-scheduler							Normal		Scheduled		Successfully assigned ruby-hello-world-1-build to ip-172-31-71-195.us-east-2.compute.internal
  25m		25m		1	kubelet, ip-172-31-71-195.us-east-2.compute.internal			Normal		SuccessfulMountVolume	MountVolume.SetUp succeeded for volume "docker-socket" 
  25m		25m		1	kubelet, ip-172-31-71-195.us-east-2.compute.internal			Normal		SuccessfulMountVolume	MountVolume.SetUp succeeded for volume "crio-socket" 
  25m		25m		1	kubelet, ip-172-31-71-195.us-east-2.compute.internal			Normal		SuccessfulMountVolume	MountVolume.SetUp succeeded for volume "builder-dockercfg-vxqc4-push" 
  25m		25m		1	kubelet, ip-172-31-71-195.us-east-2.compute.internal			Normal		SuccessfulMountVolume	MountVolume.SetUp succeeded for volume "builder-token-pxzc9" 
  25m		25m		1	kubelet, ip-172-31-71-195.us-east-2.compute.internal			Normal		SuccessfulMountVolume	MountVolume.SetUp succeeded for volume "buildworkdir" 
  23m		14m		11	kubelet, ip-172-31-71-195.us-east-2.compute.internal			Normal		SandboxChanged		Pod sandbox changed, it will be killed and re-created.
  24m		3m		19	kubelet, ip-172-31-71-195.us-east-2.compute.internal			Warning		FailedCreatePodSandBox	Failed create pod sandbox.

Comment 30 Hongan Li 2017-11-13 08:04:51 UTC
The issue cannot always be reproduced, but it seems easy to reproduce when a pod is scheduled to the node ip-172-31-71-195.us-east-2.compute.internal.

For example, pods are Running within 10s on other nodes:
# oc create -f  https://raw.githubusercontent.com/openshift-qe/v3-testfiles/master/routing/list_for_pods.json
replicationcontroller "test-rc" created
service "service-secure" created
service "service-unsecure" created
# oc get pod -o wide
NAME            READY     STATUS    RESTARTS   AGE       IP             NODE
test-rc-6qkpb   1/1       Running   0          9s        10.128.1.101   ip-172-31-76-218.us-east-2.compute.internal
test-rc-v9q8n   1/1       Running   0          9s        10.131.1.65    ip-172-31-74-247.us-east-2.compute.internal

but pods stay in ContainerCreating status for more than 30 minutes on node ip-172-31-71-195:
# oc create -f  https://raw.githubusercontent.com/openshift-qe/v3-testfiles/master/routing/list_for_pods.json
replicationcontroller "test-rc" created
service "service-secure" created
service "service-unsecure" created

# oc get pod -o wide
NAME            READY     STATUS              RESTARTS   AGE       IP        NODE
test-rc-d6wm5   0/1       ContainerCreating   0          33m       <none>    ip-172-31-71-195.us-east-2.compute.internal
test-rc-mrbxb   0/1       ContainerCreating   0          33m       <none>    ip-172-31-71-195.us-east-2.compute.internal

Comment 32 Ravi Sankar 2017-11-13 15:38:33 UTC
https://github.com/openshift/origin/pull/17239 adds a metric for netid failures
but does not fix the issue. Changing status back to ASSIGNED; will move to MODIFIED once https://github.com/openshift/origin/pull/17243 is merged.
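
(For reference, a minimal sketch of how a failure counter along the lines of the VnidNotFoundErrors metric could be registered and incremented with the Prometheus Go client; the metric and subsystem names below are assumptions, not the exact names used in the origin PR.)

```
// Illustrative sketch only: registering and bumping a counter for VNID
// lookup failures with the Prometheus Go client. Names are assumed, not
// taken from the actual origin code.
package metrics

import "github.com/prometheus/client_golang/prometheus"

var vnidNotFoundErrors = prometheus.NewCounter(prometheus.CounterOpts{
	Namespace: "openshift",
	Subsystem: "sdn_pod",                // assumed subsystem name
	Name:      "vnid_not_found_errors",  // assumed metric name
	Help:      "Number of VNID lookup failures on the node.",
})

func init() {
	// Register the counter so it is exposed on the node's metrics endpoint.
	prometheus.MustRegister(vnidNotFoundErrors)
}

// RecordVnidNotFound is called whenever a namespace's VNID cannot be resolved.
func RecordVnidNotFound() {
	vnidNotFoundErrors.Inc()
}
```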

Comment 33 Seth Jennings 2017-11-13 19:49:19 UTC
*** Bug 1512370 has been marked as a duplicate of this bug. ***

Comment 34 Ben Bennett 2017-11-13 23:43:55 UTC
Thanks for the info.  That node helped narrow things down.  It had 44 bad sandbox files, so every minute it was trying to delete the pods, hitting the bad checksum, and stopping.

We think you can reproduce it with:
  - Start up 100 pods
  - Stop the node (but not the master)
  - "oc delete pods --all"
  - On the node: 
    - "docker kill" all currently-running containers
    - perl -pi -e 's/"checksum":[0-9]*/"checksum":999/;' /var/lib/dockershim/sandbox/*
    - Then start the node

https://bugzilla.redhat.com/show_bug.cgi?id=1512370#c4 is related if the problem occurs after you create the corrupt files.

BUT even after all of the corrupt sandbox files are cleaned node ip-172-31-71-195.us-east-2.compute.internal is slow.

Comment 35 Hongan Li 2017-11-14 08:41:16 UTC
(In reply to Ben Bennett from comment #34)
> Thanks for the info.  That node helped narrow things down.  It had 44 bad
> sandbox files so every minute was trying to delete the pods, getting the bad
> checksum, and stopping.
> 
> We think you can reproduce it with:
>   - Start up 100 pods
>   - Stop the node (but not the master)
>   - "oc delete pods --all"
>   - On the node: 
>     - "docker kill" all currently-running containers
>     - perl -pi -e 's/"checksum":[0-9]*/"checksum":999/;'
> /var/lib/dockershim/sandbox/*
>     - Then start the node
> 
> https://bugzilla.redhat.com/show_bug.cgi?id=1512370#c4 is related.  If the
> problem occurs after you make corrupt files.
> 
> BUT even after all of the corrupt sandbox files are cleaned node
> ip-172-31-71-195.us-east-2.compute.internal is slow.

Hi Ben, I see similar errors like "checkpoint is corrupted" in the logs when reproducing with those steps, but after that new pods can be created successfully instead of getting stuck in "ContainerCreating" status.

I tested on OCP v3.7.7; the full node logs are attached.

Comment 37 Dan Winship 2017-11-14 16:34:47 UTC
My current theory is that there's a resource leak of some sort and so you have to leave things running with the bad checkpoint files for "a long time" before you start to see problems. I don't know how long "a long time" is. (More than an hour, less than a month...)

Comment 38 openshift-github-bot 2017-11-18 16:18:15 UTC
Commits pushed to master at https://github.com/openshift/origin

https://github.com/openshift/origin/commit/e833dc98cbb82ac21d30c2f4724d8b29ab30096b
Bug 1509799 - Fix WaitAndGetVNID() in sdn node

- Existing 1 sec timeout for WaitAndGetVNID() may not be sufficient
  for loaded nodes (I might have chosen this value due to lack of enough info).
  Now the timeout is bumped to 5 secs.

- WaitAndGetVNID() also queries the api server if needed (do not expect this
  to happen with the new timeout) to differentiate between these 2 cases:
  (1) Genuine error: Namespace doesn't exist so netid won't be present
  (2) Propagation issue: Namespace exists but watch NetNamespace event not
      propagated to the node. May be insufficient timeout or event missed.
      Newly added VnidNotFoundErrors metric will give more insights.

https://github.com/openshift/origin/commit/24f9407cb255064d36c31566d5d1a4b00e4a6b22
Merge pull request #17243 from pravisankar/fix-get-vnid

Automatic merge from submit-queue (batch tested with PRs 17012, 17243).

Bug 1509799 - Fix WaitAndGetVNID() in sdn node

- Existing 1 sec timeout for WaitAndGetVNID() may not be sufficient
  for loaded nodes (I might have chosen this value due to lack of enough info).
  Now the timeout is bumped to 5 secs.

- WaitAndGetVNID() also queries the api server if needed (do not expect this
  to happen often with the new timeout) to differentiate between these 2 cases:
  (1) Genuine error: Namespace doesn't exist so netid won't be present
  (2) Propagation issue: Namespace exists but watch NetNamespace event not
      propagated to the node. May be insufficient timeout or event missed.
      Newly added VnidNotFoundErrors metric will give more insights.
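
(For context, a minimal, hypothetical sketch of the retry-then-fallback pattern the commit message describes: poll the node's local VNID map up to the new 5-second timeout, then fall back to an API-server lookup to distinguish a genuinely missing namespace from a watch event that has not yet propagated. The types and helper names are illustrative stand-ins, not the actual origin code.)

```
// Hedged sketch of the WaitAndGetVNID behavior described above. The vnidMap
// type, the apiLookup callback, and the constants are hypothetical stand-ins.
package main

import (
	"fmt"
	"sync"
	"time"
)

const (
	pollInterval = 100 * time.Millisecond
	vnidTimeout  = 5 * time.Second // bumped from 1s to 5s by the fix
)

type vnidMap struct {
	mu    sync.Mutex
	vnids map[string]uint32
}

func (m *vnidMap) get(namespace string) (uint32, bool) {
	m.mu.Lock()
	defer m.mu.Unlock()
	id, ok := m.vnids[namespace]
	return id, ok
}

// waitAndGetVNID polls the local cache until the VNID shows up or the timeout
// expires. On timeout it falls back to apiLookup (standing in for a direct
// NetNamespace query against the API server) to tell apart a genuinely missing
// namespace from a watch event that has not propagated to the node yet.
func waitAndGetVNID(m *vnidMap, namespace string, apiLookup func(string) (uint32, error)) (uint32, error) {
	deadline := time.Now().Add(vnidTimeout)
	for time.Now().Before(deadline) {
		if id, ok := m.get(namespace); ok {
			return id, nil
		}
		time.Sleep(pollInterval)
	}
	// Fallback: ask the API server directly; this is also where a
	// "vnid not found" metric could be incremented before returning.
	id, err := apiLookup(namespace)
	if err != nil {
		return 0, fmt.Errorf("failed to find netid for namespace: %s in vnid map", namespace)
	}
	return id, nil
}

func main() {
	m := &vnidMap{vnids: map[string]uint32{"demo": 42}}
	id, err := waitAndGetVNID(m, "demo", func(string) (uint32, error) { return 0, fmt.Errorf("not found") })
	fmt.Println(id, err) // 42 <nil>
}
```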

Comment 39 Junqi Zhao 2017-12-01 05:39:14 UTC
On the free-stg cluster, pods fail to start up: failed to create pod sandbox.
OpenShift Master:v3.7.9 (online version 3.6.0.78)
Kubernetes Master:v1.7.6+a08f5eeb62

Comment 44 XiuJuan Wang 2017-12-20 10:10:10 UTC
Server https://api.free-int.openshift.com:443
openshift v3.8.18

This bug can still occasionally be reproduced in free-int on the node below:
ip-172-31-59-87.ec2.internal

Comment 45 Justin Pierce 2017-12-20 13:31:08 UTC
v3.8.18 is only on the control plane (and cri-o nodes) in free-int. I suspect this fix can only be tested in an environment where all nodes are running v3.8 or beyond.

Comment 46 Ben Bennett 2018-01-16 15:06:47 UTC
XiuJuan Wang: Is this still blocking?  Have we been able to test when everything is running v3.8 or later?

Comment 47 Ben Bennett 2018-01-16 15:38:04 UTC
Ravi: Eric observes that the node processes on the master were updated.  Can you take a look at this again please since it's a test blocker.

Comment 48 Ravi Sankar 2018-01-17 03:07:55 UTC
As Justin pointed out in comment #45, masters are running v3.8.31 but nodes are still on v3.7.9. The proposed fix is for openshift-node and is available in v3.8.0+ (https://github.com/openshift/origin/commit/e833dc98cbb82ac21d30c2f4724d8b29ab30096b). So this cannot be tested until all the nodes are upgraded to an OpenShift version that has the fix.

Comment 49 Ben Bennett 2018-01-17 21:22:14 UTC
I don't see that node listed in ohi... what type is it?

Comment 50 Ben Bennett 2018-01-17 21:28:57 UTC
Sorry... I was on the wrong cluster, that is ip-172-31-59-87.ec2.internal

So, it is running the wrong version and needs to be running 3.8 to have the fix.

Comment 53 zhaozhanqi 2018-01-25 06:54:11 UTC
Verified this bug on 
Server https://internal.api.free-int.openshift.com:443
openshift v3.9.0-0.21.0
kubernetes v1.9.1+a0ce1bc657

Did not find the issue.

Comment 54 Steven Walter 2018-01-29 20:06:04 UTC
Attaching case for customer who has encountered this in a standalone OCP cluster:

2018-01-29 10:03:07 -0600 CST   2018-01-29 10:02:58 -0600 CST   9         app-lab-6-build   Pod                   Warning   FailedCreatePodSandBox   kubelet, node1.example.com   Failed create pod sandbox.
2018-01-29 10:03:08 -0600 CST   2018-01-29 10:02:59 -0600 CST   9         app-lab-6-build   Pod                   Normal    SandboxChanged           kubelet, node1.example.com   Pod sandbox changed, it will be killed and re-created.
2018-01-29 10:04:05 -0600 CST   2018-01-29 10:04:05 -0600 CST   1         app-lab-6         Build                 Normal    BuildCancelled           build-controller                        Build app-mon-lab/app-lab-6 has been cancelled
2018-01-29 10:07:46 -0600 CST   2018-01-29 10:07:46 -0600 CST   1         app-lab-7-build   Pod                   Normal    Scheduled                default-scheduler                       Successfully assigned app-lab-7-build to node1.example.com

Is the fix something that could be implemented manually via configuration, or will it require waiting for a patch? (The commit looks like it uses hardcoded values: https://github.com/openshift/origin/commit/e833dc98cbb82ac21d30c2f4724d8b29ab30096b)

Comment 55 Ravi Sankar 2018-01-29 20:19:45 UTC
@stwalter As mentioned in comment #48, unfortunately this is not a configuration change; the patch is available in the v3.8.0+ releases.

Comment 56 Dan Winship 2018-01-31 14:17:37 UTC
@Steven Walter: "FailedCreatePodSandBox" is an extremely vague/broad error. You should not assume that the problem your customer is seeing is necessarily any of the problems that were addressed in this bug without doing further debugging to narrow down what was causing the pod creation failures.

(In particular, some of the problems here only apply to "OpenShift-Online-sized" clusters with hundreds of nodes and tens of thousands of users.)

Comment 68 errata-xmlrpc 2018-03-28 14:09:47 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2018:0489

