Description of problem:

In a larger-scale deployment (32 nodes), when deploying 2000 pods, pods started going into RunContainerError and CrashLoopBackOff at around 1000 pods on the 4.x cluster. Moreover, only 1999 of the 2000 pods were created.

Version-Release number of selected component (if applicable):

$ oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.0.0-0.nightly-2019-03-04-234414   True        False         2d20h   Cluster version is 4.0.0-0.nightly-2019-03-04-234414

How reproducible:

Create ~2000 pods in a 32-node cluster.

Actual results:

NAME                          READY   STATUS              RESTARTS   AGE
nginx-passthrough-007-qvv2s   0/1     CrashLoopBackOff    3          5m13s
nginx-passthrough-008-kpjp2   1/1     Running             1          5m6s
nginx-passthrough-009-bhj6r   1/1     Running             0          5m5s
nginx-passthrough-010-v6656   1/1     Running             0          5m
nginx-passthrough-011-htlct   1/1     Running             0          4m59s
nginx-passthrough-012-ctfdq   1/1     Running             0          4m55s
nginx-passthrough-013-qlhv6   1/1     Running             0          4m50s
nginx-passthrough-014-dqtjg   1/1     Running             0          4m45s
nginx-passthrough-015-q4mzk   1/1     Running             0          4m41s
nginx-passthrough-016-nmvtl   0/1     CrashLoopBackOff    4          4m41s
nginx-passthrough-017-bsc8r   1/1     Running             1          4m40s
nginx-passthrough-018-5rfzw   1/1     Running             0          4m31s
nginx-passthrough-019-xxz9r   1/1     Running             0          4m29s
nginx-passthrough-020-8g98g   1/1     Running             1          4m23s
nginx-passthrough-021-wsvmg   1/1     Running             1          4m20s
nginx-passthrough-022-ht9q2   1/1     Running             3          4m12s
nginx-passthrough-023-b44v2   1/1     Running             0          3m33s
nginx-passthrough-024-pj6rp   1/1     Running             3          3m33s
nginx-passthrough-025-8zkgw   1/1     Running             2          3m30s
nginx-passthrough-026-zsjrm   0/1     RunContainerError   4          3m31s
nginx-passthrough-027-4crgg   1/1     Running             1          3m33s
nginx-passthrough-028-x7mc2   1/1     Running             1          3m32s
nginx-passthrough-029-slbk8   1/1     Running             0          3m32s
nginx-passthrough-030-7vkxl   1/1     Running             0          3m34s
nginx-passthrough-031-hxkdz   1/1     Running             2          3m35s
nginx-passthrough-032-wq5xg   1/1     Running             1          3m35s
nginx-passthrough-033-vnsbm   1/1     Running             2          3m30s
nginx-passthrough-034-7szlx   1/1     Running             1          3m35s
nginx-passthrough-035-fdn5j   1/1     Running             0          3m30s
nginx-passthrough-036-s95cw   1/1     Running             0          3m33s
nginx-passthrough-037-zrbbs   0/1     CrashLoopBackOff    2          3m34s
nginx-passthrough-038-brk6d   1/1     Running             0          3m33s
nginx-passthrough-039-76mk8   1/1     Running             1          3m35s

86m   Warning   Failed           pod/nginx-passthrough-026-zsjrm   Error: container nginx-passthrough is not in created state: stopped
87m   Warning   Failed           pod/nginx-passthrough-026-zsjrm   Error: container create failed: nsenter: could not ensure we are a cloned binary: Argument list too long container_linux.go:336: starting container process caused "process_linux.go:279: running exec setns process for init caused \"exit status 23\""
87m   Normal    SandboxChanged   pod/nginx-passthrough-026-zsjrm   Pod sandbox changed, it will be killed and re-created.

Expected results:

Containers are created without the errors above.

Additional info:

Eventually, all 1999 pods were running after cycling through CrashLoopBackOff and RunContainerError for a while. The one pod that was never created should have been in the server-tls-passthrough-006 namespace. The errors above are from the server-tls-passthrough-005 namespace. Full event logs attached.
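For anyone trying to reproduce this, a minimal sketch of one way to spread ~2000 pods across projects with a bash loop over oc. The project/pod naming scheme and the image below are illustrative assumptions, not the exact tooling used for this report (the original pods have generated name suffixes such as nginx-passthrough-007-qvv2s, so they were presumably created from replication controllers or templates rather than bare pods):

#!/bin/bash
# Illustrative sketch only: create 2000 pods (50 projects x 40 pods) on the cluster.
# Project names, pod names, and the image are assumptions, not the original test tooling.
for ns in $(seq -f "server-tls-passthrough-%03g" 1 50); do
  oc new-project "${ns}" >/dev/null
  for i in $(seq -f "%03g" 1 40); do
    oc create -n "${ns}" -f - <<EOF
apiVersion: v1
kind: Pod
metadata:
  name: nginx-passthrough-${i}
spec:
  containers:
  - name: nginx-passthrough
    image: nginx
EOF
  done
done

# Then watch for pods falling into the failure states reported above:
oc get pods --all-namespaces | egrep 'CrashLoopBackOff|RunContainerError'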
Created attachment 1543980 [details] Event logs
*** Bug 1688719 has been marked as a duplicate of this bug. ***
Can we please get the kubelet logs as well?
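In case it helps with collecting them, a sketch of two ways to pull the kubelet journal off an RHCOS node (assumes cluster-admin access; <node-name> is a placeholder):

# Via oc adm node-logs, if available in this oc build:
oc adm node-logs <node-name> -u kubelet > kubelet-<node-name>.log

# Or via a debug pod and journalctl on the host:
oc debug node/<node-name> -- chroot /host journalctl -u kubelet --no-pager > kubelet-<node-name>.log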
We have created a new CRI-O package with changes that we believe should help fix this problem. Can you grab cri-o 1.12.10 and try this out again, please?
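Once the new package lands, one way to confirm which cri-o build each node is actually running is to read the standard node status fields, e.g.:

# Container runtime version per node, straight from the node status:
oc get nodes -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.nodeInfo.containerRuntimeVersion}{"\n"}{end}'

# Or simply:
oc get nodes -o wide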
We still need to wait for a new build; the latest build we can get still ships cri-o 1.12.9:

# oc get nodes -o wide
NAME                                              STATUS   ROLES    AGE     VERSION             INTERNAL-IP    EXTERNAL-IP   OS-IMAGE                                           KERNEL-VERSION         CONTAINER-RUNTIME
ip-10-0-130-158.ap-northeast-1.compute.internal   Ready    worker   3h29m   v1.12.4+30e6a0f55   10.0.130.158   <none>        Red Hat Enterprise Linux CoreOS 410.8.20190322.0   4.18.0-80.el8.x86_64   cri-o://1.12.9-1.rhaos4.0.gitaac6be5.el7
ip-10-0-139-60.ap-northeast-1.compute.internal    Ready    master   3h38m   v1.12.4+30e6a0f55   10.0.139.60    <none>        Red Hat Enterprise Linux CoreOS 410.8.20190322.0   4.18.0-80.el8.x86_64   cri-o://1.12.9-1.rhaos4.0.gitaac6be5.el7
ip-10-0-147-127.ap-northeast-1.compute.internal   Ready    worker   3h29m   v1.12.4+30e6a0f55   10.0.147.127   <none>        Red Hat Enterprise Linux CoreOS 410.8.20190322.0   4.18.0-80.el8.x86_64   cri-o://1.12.9-1.rhaos4.0.gitaac6be5.el7
ip-10-0-150-44.ap-northeast-1.compute.internal    Ready    master   3h39m   v1.12.4+30e6a0f55   10.0.150.44    <none>        Red Hat Enterprise Linux CoreOS 410.8.20190322.0   4.18.0-80.el8.x86_64   cri-o://1.12.9-1.rhaos4.0.gitaac6be5.el7
ip-10-0-172-20.ap-northeast-1.compute.internal    Ready    master   3h39m   v1.12.4+30e6a0f55   10.0.172.20    <none>        Red Hat Enterprise Linux CoreOS 410.8.20190322.0   4.18.0-80.el8.x86_64   cri-o://1.12.9-1.rhaos4.0.gitaac6be5.el7
ip-10-0-174-242.ap-northeast-1.compute.internal   Ready    worker   3h30m   v1.12.4+30e6a0f55   10.0.174.242   <none>        Red Hat Enterprise Linux CoreOS 410.8.20190322.0   4.18.0-80.el8.x86_64   cri-o://1.12.9-1.rhaos4.0.gitaac6be5.el7

# oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.0.0-0.nightly-2019-03-28-030453   True        False         4h3m    Cluster version is 4.0.0-0.nightly-2019-03-28-030453
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2019:0758