Bug 1965900

Summary: Pods are getting stuck in ContainerCreating/ContainerCreateError/Terminating status
Product: OpenShift Container Platform
Reporter: Rutvik <rkshirsa>
Component: Node
Assignee: Peter Hunt <pehunt>
Node sub component: CRI-O
QA Contact: Weinan Liu <weinliu>
Status: CLOSED ERRATA
Severity: high
Priority: medium
CC: aos-bugs
Version: 3.11.0
Target Release: 3.11.z
Hardware: Unspecified
OS: Unspecified
Fixed In Version: cri-o-1.11.16-0.15.rhaos3.11.gitd7a399f.el7
Last Closed: 2021-08-25 15:16:51 UTC
Type: Bug

Description Rutvik 2021-05-31 05:24:28 UTC

Description of problem:

The symptoms match those reported in https://bugzilla.redhat.com/show_bug.cgi?id=1787148

In this case the customer uses cri-o as the container runtime, so it is not clear whether the fix delivered for docker in that BZ applies here.


Version-Release number of selected component (if applicable):
# rpm -qa | grep systemd
systemd-libs-219-62.el7_6.5.x86_64
systemd-sysv-219-62.el7_6.5.x86_64
oci-systemd-hook-0.1.18-3.git8787307.el7_6.x86_64
systemd-219-62.el7_6.5.x86_64
---
cri-tools-1.11.1-2.rhaos3.11.gitedabfb5.el7.x86_64
criu-3.12-2.el7.x86_64
cri-o-1.11.16-0.8.dev.rhaos3.11.git6d43aae.el7.x86_64
---
atomic-openshift-docker-excluder-3.11.188-1.git.0.db0eaa8.el7.noarch
docker-1.13.1-94.gitb2f74b2.el7.x86_64 
docker-client-1.13.1-94.gitb2f74b2.el7.x86_64
---
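
To check whether a node already carries the fix, the installed cri-o build can be compared with the "Fixed In Version" value from the bug header. The following is a minimal sketch and not part of the original report: version_lt is a hypothetical helper, and it relies on GNU sort -V, which only approximates RPM's own version ordering.

```shell
# Hypothetical helper: succeeds when version $1 sorts strictly before $2.
# Assumption: GNU `sort -V` ordering is close enough to RPM's comparison here.
version_lt() {
    [ "$1" != "$2" ] &&
        [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$1" ]
}

# Version-release strings taken from this bug report.
fixed="1.11.16-0.15.rhaos3.11.gitd7a399f.el7"        # "Fixed In Version"
reported="1.11.16-0.8.dev.rhaos3.11.git6d43aae.el7"  # customer's cri-o package

if version_lt "$reported" "$fixed"; then
    echo "cri-o $reported predates the fix"
else
    echo "cri-o $reported includes the fix"
fi
```

On a live node the installed value could be read with `rpm -q --qf '%{VERSION}-%{RELEASE}\n' cri-o` rather than hard-coded.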


How reproducible:
Intermittent; the problem occurs at random.

Actual results:

---
May 06 14:26:39 e2n1-1-worker atomic-openshift-node[16190]: E0506 02:26:39.026250   16190 pod_workers.go:186] Error syncing pod a19eaedc-ae2e-11eb-a2e3-009bfd250fac ("f8331805-7754-4893-ac73-540c1adb4fbf-65025-1620280080-brffp_zen(a19eaedc-ae2e-11eb-a2e3-009bfd250fac)"), skipping: failed to ensure that the pod: a19eaedc-ae2e-11eb-a2e3-009bfd250fac cgroups exist and are correctly applied: failed to create container for [kubepods besteffort poda19eaedc-ae2e-11eb-a2e3-009bfd250fac] : Argument list too long
---

Number of "Argument list too long" errors in each worker node's log:
$ cat e1n4-1-worker/e1n4-1-worker.atomic.log | grep -i "Argument list too long" | wc -l
280244

$ cat e2n1-1-worker/e2n1-1-worker.atomic.log |  grep -i "Argument list too long" | wc -l
489776

$ cat e2n2-1-worker/e2n2-1-worker.atomic.log |  grep -i "Argument list too long" | wc -l
362871
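
The per-node counts above can be gathered in one pass. This is a small sketch under the assumption that the logs are laid out as in the commands above; count_matches is a hypothetical helper name, and it uses grep -c in place of the cat | grep | wc -l pipeline.

```shell
# Hypothetical helper: print "<count><TAB><file>" for each log file,
# counting lines that contain the pattern (case-insensitive).
count_matches() {
    pattern="$1"
    shift
    for f in "$@"; do
        # grep -c prints 0 (and exits non-zero) when nothing matches,
        # so the count is still usable.
        printf '%s\t%s\n' "$(grep -ci -- "$pattern" "$f")" "$f"
    done
}

# Example invocation (paths as used in this report), worst node first:
#   count_matches "Argument list too long" */*.atomic.log | sort -rn
```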


Expected results:
Pods should not get stuck in ContainerCreating, ContainerCreateError, or Terminating status.

Additional info:
While the problem was occurring, the customer force-deleted the affected pods (those in Terminating status) with "oc delete pod xxx --force --grace-period=0"; the pods were then recreated and came up healthy.
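
The workaround can be applied to several pods at once. The sketch below is an illustration only and was not supplied by the customer: print_force_delete_cmds is a hypothetical helper that assumes the default column layout of "oc get pods --all-namespaces --no-headers" (NAMESPACE NAME READY STATUS ...), and it prints the delete commands instead of running them so they can be reviewed first.

```shell
# Hypothetical helper: read `oc get pods --all-namespaces --no-headers`
# output on stdin and print one force-delete command per pod whose STATUS
# (4th column) is in the comma-separated list given as $1.
print_force_delete_cmds() {
    awk -v statuses="${1:-Terminating}" '
        BEGIN {
            n = split(statuses, s, ",")
            for (i = 1; i <= n; i++) want[s[i]] = 1
        }
        ($4 in want) {
            printf "oc delete pod %s -n %s --force --grace-period=0\n", $2, $1
        }
    '
}

# Example invocation (prints commands only, nothing is deleted):
#   oc get pods --all-namespaces --no-headers | print_force_delete_cmds Terminating
```

This only clears the stuck pods; it does not address the underlying cgroup error.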

Comment 1 Peter Hunt 2021-06-01 18:58:27 UTC

fix for cri-o is attached

Comment 2 Peter Hunt 2021-06-11 18:57:08 UTC

oops forgot about this, I'll try to push the PR through

Comment 3 Peter Hunt 2021-07-02 20:36:45 UTC


ci is still wonky, hopefully I'll have cycles to fix it next sprint

Comment 4 Peter Hunt 2021-07-23 19:59:18 UTC

alas I did not

Comment 5 Peter Hunt 2021-08-09 18:04:15 UTC

pr merged

Comment 10 errata-xmlrpc 2021-08-25 15:16:51 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform
3.11.z security and bug fix update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:3193