Description of problem:

In a larger-scale deployment (32 nodes), when deploying 2000 pods, pods started going into RunContainerError and CrashLoopBackOff at around 1000 pods on the 4.x cluster. Moreover, only 1999 of the 2000 pods were created.

Version-Release number of selected component (if applicable):

$ oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.0.0-0.nightly-2019-03-04-234414   True        False         2d20h   Cluster version is 4.0.0-0.nightly-2019-03-04-234414

How reproducible:

Create ~2000 pods in a 32-node cluster.

Actual results:

NAME                          READY   STATUS              RESTARTS   AGE
nginx-passthrough-007-qvv2s   0/1     CrashLoopBackOff    3          5m13s
nginx-passthrough-008-kpjp2   1/1     Running             1          5m6s
nginx-passthrough-009-bhj6r   1/1     Running             0          5m5s
nginx-passthrough-010-v6656   1/1     Running             0          5m
nginx-passthrough-011-htlct   1/1     Running             0          4m59s
nginx-passthrough-012-ctfdq   1/1     Running             0          4m55s
nginx-passthrough-013-qlhv6   1/1     Running             0          4m50s
nginx-passthrough-014-dqtjg   1/1     Running             0          4m45s
nginx-passthrough-015-q4mzk   1/1     Running             0          4m41s
nginx-passthrough-016-nmvtl   0/1     CrashLoopBackOff    4          4m41s
nginx-passthrough-017-bsc8r   1/1     Running             1          4m40s
nginx-passthrough-018-5rfzw   1/1     Running             0          4m31s
nginx-passthrough-019-xxz9r   1/1     Running             0          4m29s
nginx-passthrough-020-8g98g   1/1     Running             1          4m23s
nginx-passthrough-021-wsvmg   1/1     Running             1          4m20s
nginx-passthrough-022-ht9q2   1/1     Running             3          4m12s
nginx-passthrough-023-b44v2   1/1     Running             0          3m33s
nginx-passthrough-024-pj6rp   1/1     Running             3          3m33s
nginx-passthrough-025-8zkgw   1/1     Running             2          3m30s
nginx-passthrough-026-zsjrm   0/1     RunContainerError   4          3m31s
nginx-passthrough-027-4crgg   1/1     Running             1          3m33s
nginx-passthrough-028-x7mc2   1/1     Running             1          3m32s
nginx-passthrough-029-slbk8   1/1     Running             0          3m32s
nginx-passthrough-030-7vkxl   1/1     Running             0          3m34s
nginx-passthrough-031-hxkdz   1/1     Running             2          3m35s
nginx-passthrough-032-wq5xg   1/1     Running             1          3m35s
nginx-passthrough-033-vnsbm   1/1     Running             2          3m30s
nginx-passthrough-034-7szlx   1/1     Running             1          3m35s
nginx-passthrough-035-fdn5j   1/1     Running             0          3m30s
nginx-passthrough-036-s95cw   1/1     Running             0          3m33s
nginx-passthrough-037-zrbbs   0/1     CrashLoopBackOff    2          3m34s
nginx-passthrough-038-brk6d   1/1     Running             0          3m33s
nginx-passthrough-039-76mk8   1/1     Running             1          3m35s

86m   Warning   Failed           pod/nginx-passthrough-026-zsjrm   Error: container nginx-passthrough is not in created state: stopped
87m   Warning   Failed           pod/nginx-passthrough-026-zsjrm   Error: container create failed: nsenter: could not ensure we are a cloned binary: Argument list too long container_linux.go:336: starting container process caused "process_linux.go:279: running exec setns process for init caused \"exit status 23\""
87m   Normal    SandboxChanged   pod/nginx-passthrough-026-zsjrm   Pod sandbox changed, it will be killed and re-created.

Expected results:

Containers are created without the errors above.

Additional info:

Eventually, all 1999 pods were running after cycling through CrashLoopBackOff and RunContainerError for a while. The one pod that was never created should have been in the server-tls-passthrough-006 namespace. The errors above are from the server-tls-passthrough-005 namespace. Full event logs attached.
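For anyone trying to reproduce this, a minimal sketch of one way to spread ~2000 pods across projects with a bash loop over oc. The project/pod naming scheme and the image below are illustrative assumptions, not the exact tooling used for this report (the original pods have generated name suffixes such as nginx-passthrough-007-qvv2s, so they were presumably created from replication controllers or templates rather than bare pods):

#!/bin/bash
# Illustrative sketch only: create 2000 pods (50 projects x 40 pods) on the cluster.
# Project names, pod names, and the image are assumptions, not the original test tooling.
for ns in $(seq -f "server-tls-passthrough-%03g" 1 50); do
  oc new-project "${ns}" >/dev/null
  for i in $(seq -f "%03g" 1 40); do
    oc create -n "${ns}" -f - <<EOF
apiVersion: v1
kind: Pod
metadata:
  name: nginx-passthrough-${i}
spec:
  containers:
  - name: nginx-passthrough
    image: nginx
EOF
  done
done

# Then watch for pods falling into the failure states reported above:
oc get pods --all-namespaces | egrep 'CrashLoopBackOff|RunContainerError'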
Created attachment 1543980 [details] Event logs
*** Bug 1688719 has been marked as a duplicate of this bug. ***
Can we please get the kubelet logs as well?
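In case it helps with collecting them, a sketch of two ways to pull the kubelet journal off an RHCOS node (assumes cluster-admin access; <node-name> is a placeholder):

# Via oc adm node-logs, if available in this oc build:
oc adm node-logs <node-name> -u kubelet > kubelet-<node-name>.log

# Or via a debug pod and journalctl on the host:
oc debug node/<node-name> -- chroot /host journalctl -u kubelet --no-pager > kubelet-<node-name>.log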
We have created a new CRI-O package with changes that we believe should help fix this problem. Can you grab cri-o 1.12.10 and try this out again, please?
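Once the new package lands, one way to confirm which cri-o build each node is actually running is to read the standard node status fields, e.g.:

# Container runtime version per node, straight from the node status:
oc get nodes -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.nodeInfo.containerRuntimeVersion}{"\n"}{end}'

# Or simply:
oc get nodes -o wide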
We still need to wait for a new build; the latest build we can get still ships cri-o 1.12.9:

# oc get nodes -o wide
NAME                                              STATUS   ROLES    AGE     VERSION             INTERNAL-IP    EXTERNAL-IP   OS-IMAGE                                           KERNEL-VERSION         CONTAINER-RUNTIME
ip-10-0-130-158.ap-northeast-1.compute.internal   Ready    worker   3h29m   v1.12.4+30e6a0f55   10.0.130.158   <none>        Red Hat Enterprise Linux CoreOS 410.8.20190322.0   4.18.0-80.el8.x86_64   cri-o://1.12.9-1.rhaos4.0.gitaac6be5.el7
ip-10-0-139-60.ap-northeast-1.compute.internal    Ready    master   3h38m   v1.12.4+30e6a0f55   10.0.139.60    <none>        Red Hat Enterprise Linux CoreOS 410.8.20190322.0   4.18.0-80.el8.x86_64   cri-o://1.12.9-1.rhaos4.0.gitaac6be5.el7
ip-10-0-147-127.ap-northeast-1.compute.internal   Ready    worker   3h29m   v1.12.4+30e6a0f55   10.0.147.127   <none>        Red Hat Enterprise Linux CoreOS 410.8.20190322.0   4.18.0-80.el8.x86_64   cri-o://1.12.9-1.rhaos4.0.gitaac6be5.el7
ip-10-0-150-44.ap-northeast-1.compute.internal    Ready    master   3h39m   v1.12.4+30e6a0f55   10.0.150.44    <none>        Red Hat Enterprise Linux CoreOS 410.8.20190322.0   4.18.0-80.el8.x86_64   cri-o://1.12.9-1.rhaos4.0.gitaac6be5.el7
ip-10-0-172-20.ap-northeast-1.compute.internal    Ready    master   3h39m   v1.12.4+30e6a0f55   10.0.172.20    <none>        Red Hat Enterprise Linux CoreOS 410.8.20190322.0   4.18.0-80.el8.x86_64   cri-o://1.12.9-1.rhaos4.0.gitaac6be5.el7
ip-10-0-174-242.ap-northeast-1.compute.internal   Ready    worker   3h30m   v1.12.4+30e6a0f55   10.0.174.242   <none>        Red Hat Enterprise Linux CoreOS 410.8.20190322.0   4.18.0-80.el8.x86_64   cri-o://1.12.9-1.rhaos4.0.gitaac6be5.el7

# oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.0.0-0.nightly-2019-03-28-030453   True        False         4h3m    Cluster version is 4.0.0-0.nightly-2019-03-28-030453
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2019:0758