Bug 2028153

Summary:	Unable to ensure pod container exists: failed to create container for [kubepods burstable ...] : Argument list too long
Product:	OpenShift Container Platform	Reporter:	Mridul Markandey <mmarkand>
Component:	Node	Assignee:	Kir Kolyshkin <kir>
Node sub component:	CRI-O	QA Contact:	Sunil Choudhary <schoudha>
Status:	CLOSED WONTFIX	Docs Contact:
Severity:	high
Priority:	high	CC:	aos-bugs, bhoppus, bsmitley, cshepher, dgautam, hgomes, jcrumple, kir, kyankovi, mfiedler, nagrawal, pehunt, sbelmasg, vwalek, wrussell
Version:	4.7	Keywords:	Reopened
Target Milestone:	---
Target Release:	---
Hardware:	Unspecified
OS:	Unspecified
Whiteboard:
Fixed In Version:		Doc Type:	If docs needed, set a value
Doc Text:		Story Points:	---
Clone Of:
Clones:	2036853		Environment:
Last Closed:	2024-04-30 18:04:53 UTC	Type:	Bug
Regression:	---	Mount Type:	---
Documentation:	---	CRM:
Verified Versions:		Category:	---
oVirt Team:	---	RHEL 7.3 requirements from Atomic Host:
Cloudforms Team:	---	Target Upstream Version:
Embargoed:
Bug Depends On:	2039326
Bug Blocks:

Description Mridul Markandey 2021-12-01 15:42:45 UTC

Description of problem:
The customer is getting the following error message:
~~~
unable to ensure pod container exists: failed to create container for [kubepods burstable pod82c5a11c-1287-4dbb-b534-7650db60c22b] : Argument list too long
~~~
After applying the workaround as mentioned in the KCS[1], the issue is resolved. 

[1] https://access.redhat.com/solutions/4620671

On the original bug, it was mentioned that the issue was fixed in version 4.6.12. However, the customer is still facing it on RHOCP v4.8.15 and wants a permanent fix delivered through an erratum.

I am sharing the below data for the analysis of the problem:

1. sos_report from the nodes facing this issue.

2. Must-gather from the problematic node. With the node name in $node, it can be collected with:

`oc adm must-gather --node-name $node`

The issue is observed only on the worker nodes as of now.

3. Screenshot of the screen, showing the issue.

Let me know if any more information is needed from the customer's environment.


Version-Release number of selected component (if applicable): RHOCP v4.8.15, RHCOS nodes


How reproducible:
NA

Steps to Reproduce:
1.
2.
3.

Actual results:
The customer is getting the below error message:
~~~
unable to ensure pod container exists: failed to create container for [kubepods burstable pod82c5a11c-1287-4dbb-b534-7650db60c22b] : Argument list too long
~~~

Expected results:
The customer should not be facing this issue.

Additional info:
As requested by the engineering team, I have opened this bug as a clone of https://bugzilla.redhat.com/show_bug.cgi?id=1897337 and have shared all the requested data for the analysis.

Comment 16 Kir Kolyshkin 2021-12-17 23:00:28 UTC

In addition to 

> sudo systemctl daemon-reload

suggested by Neelesh earlier, you can also try

> sudo systemctl reset-failed

which may help.

In any case, please collect the output of

> sudo systemctl list-units --all

BEFORE trying any workarounds.

Comment 17 Kir Kolyshkin 2021-12-18 02:11:05 UTC

If there are many failed .mount units in the "systemctl list-units --all" output, backporting https://github.com/systemd/systemd/pull/10980 may help.
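To judge whether the collected output is dominated by failed .mount units, the saved text can be tallied by unit type and state. The following is only a minimal sketch: it assumes the default column layout (UNIT, LOAD, ACTIVE, SUB, DESCRIPTION) and the "●" marker that systemd prints in front of failed units. The function name and the sample unit names are made up for illustration.

```python
from collections import Counter


def count_units(list_units_output: str) -> Counter:
    """Tally units by (type suffix, ACTIVE state).

    Expects the text printed by `systemctl list-units --all`.
    Header and legend lines are skipped because their first
    field contains no "." and so cannot be a unit name.
    """
    counts = Counter()
    for line in list_units_output.splitlines():
        # Failed units are prefixed with a bullet; drop it so the
        # columns line up with those of healthy units.
        fields = line.replace("●", " ").split()
        if len(fields) < 4 or "." not in fields[0]:
            continue
        unit, active = fields[0], fields[2]
        suffix = unit.rsplit(".", 1)[1]
        counts[(suffix, active)] += 1
    return counts


# Hypothetical sample; real output would be read from a saved file.
sample = """\
  UNIT                   LOAD   ACTIVE SUB     DESCRIPTION
  crio.service           loaded active running Container Runtime
● run-example-1.mount    loaded failed failed  /run/example/1
● run-example-2.mount    loaded failed failed  /run/example/2
  var.mount              loaded active mounted /var

LOAD   = Reflects whether the unit definition was properly loaded.
4 loaded units listed.
"""

for (suffix, state), n in sorted(count_units(sample).items()):
    print(f"{suffix:10s} {state:10s} {n}")
```

A large count for ("mount", "failed") relative to everything else would match the situation described here.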

Comment 22 Kir Kolyshkin 2021-12-22 00:54:11 UTC

> The majority of those

To be specific, 121587 out of 129434

> are from runc CVE fix

Again, to be specific, I meant CVE-2019-5736, for which the default mitigation
is to bind mount /proc/self/exe (see
https://github.com/opencontainers/runc/blob/master/libcontainer/nsenter/cloned_binary.c#L399).

This mount is performed every time runc needs to enter the container (i.e. whenever
runc start/run/exec is run). Systemd sees the new mount and creates a mount unit.

Due to a bug in systemd this unit is never removed. The bug is presumably fixed upstream
in https://github.com/systemd/systemd/pull/10980; I am working on a backport.
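For a sense of scale, the arithmetic behind these numbers can be sketched as follows. The two counts are the ones quoted above. The growth estimate is a simplification that assumes every runc entry into a container leaves exactly one mount unit behind, and the probe rate is a hypothetical workload rather than data from the customer's environment.

```python
# Figures quoted in this comment.
runc_mount_units = 121_587
total_units = 129_434

share = runc_mount_units / total_units
print(f"{share:.1%} of all units come from the runc bind mount")


def leaked_units(entries_per_hour: float, hours: float) -> int:
    """Units accumulated if each runc start/run/exec leaves one behind.

    Simplified model: one leftover mount unit per container entry,
    none of which is ever cleaned up.
    """
    return int(entries_per_hour * hours)


# Hypothetical workload: a single exec probe firing every 10 seconds,
# which is 360 container entries per hour.
per_day = leaked_units(entries_per_hour=360, hours=24)
print(f"{per_day} leftover units per day from one such probe")
```

Under that assumption a handful of frequently executed probes is enough to reach six-figure unit counts within days of node uptime.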

Comment 23 Kir Kolyshkin 2021-12-22 00:55:00 UTC

Side note: opened a PR to add "systemctl list-units --all" to sosreport: https://github.com/sosreport/sos/pull/2809

Comment 24 Kir Kolyshkin 2021-12-22 02:07:33 UTC

systemd backport: https://github.com/redhat-plumbers/systemd-rhel8/pull/244

Comment 36 Kir Kolyshkin 2022-01-27 17:20:22 UTC

> Can you please guide us about the next steps to follow? Are we planning for a permanent fix for this issue or running cronjob is the only available workaround at the moment?

Yes, a permanent fix is in progress (see comment 31 above).

Comment 41 Amarjit 2022-02-14 06:14:16 UTC

Hello Kir Kolyshkin,

Could you please confirm in which OpenShift 4.7 version this bug (#2028153) is resolved?
I have attached it to case no. 03126124.


Best Regards,
Amarjit Das

Comment 44 Amarjit 2022-02-16 14:56:36 UTC

Hello Neelesh,

Thank you for your update in comment 38.

Since bug #2028153 has been fixed, does that mean we no longer need the workaround on OpenShift 4.7.41? Please confirm.


Best Regards,
Amarjit Das

Comment 51 Kir Kolyshkin 2022-02-28 19:47:33 UTC

The systemd fix landed in systemd-239-45.el8_4.7, which is still not released. Releasing it is the subject of #2039326. Once released, RHCOS should pick it up.

Comment 61 Neelesh Agrawal 2022-03-21 13:20:31 UTC

systemd-239-45.el8_4.8 with the fixes is available in
4.11 nightly
4.10.4
4.9.24
4.8.35
4.7.45

or later builds.
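The list above can be turned into a quick check of whether a given cluster version should already carry the fixed systemd. This is a rough sketch: the minimum z-streams are copied from this comment, the helper name is made up, and treating every 4.11 or later build as fixed is an assumption based on the "4.11 nightly" entry.

```python
import re

# Minimum z-stream per minor release, as listed in this comment.
FIXED_IN = {(4, 7): 45, (4, 8): 35, (4, 9): 24, (4, 10): 4}


def has_systemd_fix(version: str) -> bool:
    """Return True if the OCP version is at or past the listed build."""
    m = re.match(r"(\d+)\.(\d+)\.(\d+)", version)
    if m is None:
        raise ValueError(f"unrecognized version: {version!r}")
    major, minor, patch = (int(g) for g in m.groups())
    if (major, minor) >= (4, 11):
        # Assumption: the fix was in 4.11 nightlies, so 4.11+ has it.
        return True
    minimum = FIXED_IN.get((major, minor))
    # Minors not in the list (for example 4.6) are reported as not fixed.
    return minimum is not None and patch >= minimum


for v in ("4.8.15", "4.7.41", "4.7.45", "4.10.4"):
    print(v, has_systemd_fix(v))
```

With these inputs the version from the original report (4.8.15) and the 4.7.41 build asked about earlier both come out as not yet fixed.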

Comment 69 Sunil Choudhary 2022-06-27 12:53:08 UTC

I see this bug was fixed in https://bugzilla.redhat.com/show_bug.cgi?id=1984406 with systemd-239-45.el8. Marking it verified.

Comment 75 Rory Thrasher 2024-04-30 18:04:53 UTC

OCP is no longer using Bugzilla and this bug appears to have been left in an orphaned state. If the bug is still relevant, please open a new issue in the OCPBUGS Jira project: https://issues.redhat.com/projects/OCPBUGS/summary

Comment 76 Red Hat Bugzilla 2024-08-29 04:25:11 UTC

The needinfo request[s] on this closed bug have been removed as they have been unresolved for 120 days.