Bug 1942375

Summary:	CRI-O failing with error "reserving ctr name"
Product:	OpenShift Container Platform	Reporter:	Rutvik <rkshirsa>
Component:	Node	Assignee:	Kir Kolyshkin <kir>
Node sub component:	CRI-O	QA Contact:	MinLi <minmli>
Status:	CLOSED ERRATA	Docs Contact:
Severity:	high
Priority:	high	CC:	abraj, aos-bugs, kir, pehunt, rkshirsa
Version:	4.6.z
Target Milestone:	---
Target Release:	4.8.0
Hardware:	Unspecified
OS:	Unspecified
Whiteboard:
Fixed In Version:	runc-1.0.0-86.rhaos4.6.git23384e2	Doc Type:	If docs needed, set a value
Doc Text:		Story Points:	---
Clone Of:		Environment:
Last Closed:	2021-07-27 22:55:12 UTC	Type:	Bug
Regression:	---	Mount Type:	---
Documentation:	---	CRM:
Verified Versions:		Category:	---
oVirt Team:	---	RHEL 7.3 requirements from Atomic Host:
Cloudforms Team:	---	Target Upstream Version:
Embargoed:

Description Rutvik 2021-03-24 09:45:48 UTC

Description of problem:

Following the z-stream fix https://bugzilla.redhat.com/show_bug.cgi?id=1934656#c7, the customer has successfully upgraded the cluster to v4.6.21, but the bare metal workers are still hitting the CRI-O issue, which leaves pods stuck in either the ContainerCreating or the Terminating phase.

Mar 23 20:42:22 [host_44] crio[3612]: time="2021-03-23 20:42:22.278230942Z" level=warning msg="Error reserving ctr name k8s_application_app_name3 for id 55f93365d7453a9f6e72aecc5baf3de1b4424871a9eeed771ff29dd3741c6411: name is reserved"

Mar 23 20:23:39 [host_44] crio[2973]: time="2021-03-23 20:23:39.206376994Z" level=warning msg="Stopping container cdfd6fb1057a7fe40f0b7a67882fa645dfc978470bdecce403c898f5fc11dce6 with stop signal timed out: timeout reached after 30 seconds waiting for container process to exit"
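
To gauge how often each warning occurs on a node, the crio journal can be tallied per container. The following is a minimal sketch, not a supported tool: it assumes the journal has already been exported to plain text lines, the regular expressions are modeled only on the two messages quoted above (other CRI-O versions may word them differently), and the function name summarize is hypothetical.

```python
import re
from collections import Counter

# Patterns modeled on the two warnings quoted in this report (assumption:
# the message wording is stable across the CRI-O builds being inspected).
RESERVED_RE = re.compile(
    r"Error reserving ctr name (?P<name>\S+) for id (?P<id>[0-9a-f]+): name is reserved"
)
STOP_TIMEOUT_RE = re.compile(
    r"Stopping container (?P<id>[0-9a-f]+) with stop signal timed out"
)


def summarize(lines):
    """Count 'name is reserved' warnings per container name and
    stop-signal timeouts per container id."""
    reserved = Counter()
    stop_timeouts = Counter()
    for line in lines:
        match = RESERVED_RE.search(line)
        if match:
            reserved[match.group("name")] += 1
            continue
        match = STOP_TIMEOUT_RE.search(line)
        if match:
            stop_timeouts[match.group("id")] += 1
    return reserved, stop_timeouts
```

A container name that shows up repeatedly in the first counter is one whose creation request keeps being retried while the earlier attempt still holds the name.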


Version-Release number of selected component (if applicable):
v4.6.21

How reproducible:
Always on BareMetal nodes

Actual results:
level=warning msg="Error reserving ctr name

Expected results:
Pods should not be stuck in the ContainerCreating phase.

Additional info:
This issue usually affects only the bare metal workers, which are more heavily used than the other workers.

Comment 4 Peter Hunt 2021-04-01 19:45:26 UTC

Kir, can you take a look and see if there's anything fishy about runc here?

Comment 5 MinLi 2021-04-02 03:04:23 UTC

A similar bug in 4.6: https://bugzilla.redhat.com/show_bug.cgi?id=1934656

Comment 8 Kir Kolyshkin 2021-04-20 23:40:07 UTC

This might be a dupe of #1903228 -- alas, I don't have anything to say at this time.

Comment 9 Kir Kolyshkin 2021-04-27 22:48:15 UTC

Copying the status update I have provided at https://bugzilla.redhat.com/show_bug.cgi?id=1903228#c35:

It would make sense to test if my fix (upstream: https://github.com/opencontainers/runc/pull/2918, 4.6 backport: https://github.com/projectatomic/runc/pull/47) helps or not. I see that both PRs were merged, but I'm not sure if RPMs are available.

Comment 10 Peter Hunt 2021-04-28 15:41:27 UTC

RPMs are available now.

Comment 14 MinLi 2021-05-13 06:04:06 UTC

$ oc get clusterversion 
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.8.0-0.nightly-2021-05-12-122225   True        False         102m    Cluster version is 4.8.0-0.nightly-2021-05-12-122225


sh-4.4# chroot /host 
sh-4.4# rpm -qa | grep runc 
runc-1.0.0-95.rhaos4.8.gitcd80260.el8.x86_64

Comment 17 Kir Kolyshkin 2021-05-27 02:47:23 UTC

An additional fix (https://github.com/projectatomic/runc/pull/52) went into runc-1.0.0-86.rhaos4.6.git23384e2, please re-test with that one.
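
To confirm that a node is running that build or a later one before re-testing, the output of `rpm -qa | grep runc` can be compared against the fixed version string. This is a rough sketch under stated assumptions: the leading integer of the release field is assumed to increase with each build inside one rhaos stream, builds from different streams (for example rhaos4.6 and rhaos4.8) are treated as not comparable, and the helper names parse_runc_nvr and has_fix are hypothetical.

```python
import re

# Matches strings such as "runc-1.0.0-86.rhaos4.6.git23384e2", optionally
# followed by a dist tag and architecture (".el8.x86_64").
NVR_RE = re.compile(
    r"^runc-(?P<version>[\d.]+)-(?P<build>\d+)"
    r"\.rhaos(?P<stream>\d+\.\d+)\.git(?P<commit>[0-9a-f]+)"
)


def parse_runc_nvr(nvr):
    """Split a runc package string into (version, build, stream, commit)."""
    match = NVR_RE.match(nvr)
    if match is None:
        raise ValueError(f"unrecognized runc package string: {nvr!r}")
    return (
        match.group("version"),
        int(match.group("build")),
        match.group("stream"),
        match.group("commit"),
    )


def has_fix(installed, fixed="runc-1.0.0-86.rhaos4.6.git23384e2"):
    """Return True or False when both packages share a version and rhaos
    stream, and None when they cannot be compared by build number alone."""
    inst_version, inst_build, inst_stream, _ = parse_runc_nvr(installed)
    fix_version, fix_build, fix_stream, _ = parse_runc_nvr(fixed)
    if inst_version != fix_version or inst_stream != fix_stream:
        return None  # different stream: check that stream's own fixed build
    return inst_build >= fix_build
```

A result of None means the node is on another stream, so the build number alone does not say whether the fix is present.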

Comment 21 errata-xmlrpc 2021-07-27 22:55:12 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.8.2 bug fix and security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:2438