Bug 1942375

Summary: CRI-O failing with error "reserving ctr name"
Product: OpenShift Container Platform Reporter: Rutvik <rkshirsa>
Component: Node Assignee: Kir Kolyshkin <kir>
Node sub component: CRI-O QA Contact: MinLi <minmli>
Status: CLOSED ERRATA Docs Contact:
Severity: high    
Priority: high CC: abraj, aos-bugs, kir, pehunt, rkshirsa
Version: 4.6.z   
Target Milestone: ---   
Target Release: 4.8.0   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: runc-1.0.0-86.rhaos4.6.git23384e2 Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2021-07-27 22:55:12 UTC Type: Bug

Description Rutvik 2021-03-24 09:45:48 UTC
Description of problem:

Following the z-stream fix in https://bugzilla.redhat.com/show_bug.cgi?id=1934656#c7, the customer successfully upgraded the cluster to v4.6.21, but the bare metal workers are still hitting the CRI-O issue, which leaves pods stuck in either the ContainerCreating or the Terminating phase.

Mar 23 20:42:22 [host_44] crio[3612]: time="2021-03-23 20:42:22.278230942Z" level=warning msg="Error reserving ctr name k8s_application_app_name3 for id 55f93365d7453a9f6e72aecc5baf3de1b4424871a9eeed771ff29dd3741c6411: name is reserved"

Mar 23 20:23:39 [host_44] crio[2973]: time="2021-03-23 20:23:39.206376994Z" level=warning msg="Stopping container cdfd6fb1057a7fe40f0b7a67882fa645dfc978470bdecce403c898f5fc11dce6 with stop signal timed out: timeout reached after 30 seconds waiting for container process to exit"


Version-Release number of selected component (if applicable):
v4.6.21

How reproducible:
Always on BareMetal nodes

Actual results:
level=warning msg="Error reserving ctr name

Expected results:
Pods should not be stuck in the ContainerCreating phase.

Additional info:
This issue usually affects only the bare metal workers, which are used more heavily than the other workers.
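
For triage, the two warnings quoted above can be tallied from a saved CRI-O journal. The following is only a rough sketch: the function name `summarize` and the returned dictionary layout are made up for illustration, and the patterns assume the message text matches the two log lines in this report exactly.

```python
import re
from collections import Counter
from typing import Iterable

# Patterns are taken from the two warnings quoted in this report.
# Assumption: other CRI-O versions may word these messages differently.
_RESERVE = re.compile(
    r"Error reserving ctr name (?P<name>\S+) for id (?P<id>[0-9a-f]+): name is reserved"
)
_STOP_TIMEOUT = re.compile(
    r"Stopping container (?P<id>[0-9a-f]+) with stop signal timed out"
)


def summarize(lines: Iterable[str]) -> dict:
    """Count 'name is reserved' warnings per container name and collect
    the IDs of containers whose stop signal timed out (hypothetical helper)."""
    reserved = Counter()
    stop_timeouts = set()
    for line in lines:
        match = _RESERVE.search(line)
        if match:
            reserved[match["name"]] += 1
            continue
        match = _STOP_TIMEOUT.search(line)
        if match:
            stop_timeouts.add(match["id"])
    return {
        "reserved_names": dict(reserved),
        "stop_timeouts": sorted(stop_timeouts),
    }


if __name__ == "__main__":
    import sys

    # Reads journal text from standard input and prints the summary.
    print(summarize(sys.stdin))
```

The script reads journal text on standard input, so it could be fed the saved output of `journalctl -u crio` from an affected node (assuming the unit is named `crio`, as the `crio[...]` prefix in the logs suggests). A container name that shows up many times points at a creation request that keeps being retried.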

Comment 4 Peter Hunt 2021-04-01 19:45:26 UTC
Kir, can you take a look and see if there's anything fishy about runc here?

Comment 5 MinLi 2021-04-02 03:04:23 UTC
A similar bug in 4.6: https://bugzilla.redhat.com/show_bug.cgi?id=1934656

Comment 8 Kir Kolyshkin 2021-04-20 23:40:07 UTC
This might be a dupe of #1903228 -- alas, I don't have anything to say at this time.

Comment 9 Kir Kolyshkin 2021-04-27 22:48:15 UTC
Copying the status update I have provided at https://bugzilla.redhat.com/show_bug.cgi?id=1903228#c35:

It would make sense to test if my fix (upstream: https://github.com/opencontainers/runc/pull/2918, 4.6 backport: https://github.com/projectatomic/runc/pull/47) helps or not. I see that both PRs were merged, but I'm not sure if RPMs are available.

Comment 10 Peter Hunt 2021-04-28 15:41:27 UTC
RPMs are available now

Comment 14 MinLi 2021-05-13 06:04:06 UTC
$ oc get clusterversion 
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.8.0-0.nightly-2021-05-12-122225   True        False         102m    Cluster version is 4.8.0-0.nightly-2021-05-12-122225


sh-4.4# chroot /host 
sh-4.4# rpm -qa | grep runc 
runc-1.0.0-95.rhaos4.8.gitcd80260.el8.x86_64

Comment 17 Kir Kolyshkin 2021-05-27 02:47:23 UTC
An additional fix (https://github.com/projectatomic/runc/pull/52) went into runc-1.0.0-86.rhaos4.6.git23384e2, please re-test with that one.
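
When re-testing, it may help to check mechanically whether the runc package on a node is at least the build named above. This is a sketch under stated assumptions: the helper names `parse_runc_nvr` and `has_fix` are hypothetical, the pattern only covers package names shaped like the two quoted in this bug, and build numbers are compared only within the same `rhaos` stream, since nothing in this report says that build numbers line up across streams.

```python
import re
from typing import NamedTuple, Optional


class RuncBuild(NamedTuple):
    version: str  # upstream version, e.g. "1.0.0"
    build: int    # leading number of the RPM release field, e.g. 86
    stream: str   # product stream, e.g. "rhaos4.6"
    commit: str   # abbreviated git commit, e.g. "23384e2"


# Assumption: names look like runc-<version>-<build>.<stream>.git<commit>[...]
_NVR = re.compile(
    r"^runc-(?P<version>[\d.]+)-(?P<build>\d+)"
    r"\.(?P<stream>rhaos\d+\.\d+)\.git(?P<commit>[0-9a-f]+)"
)


def parse_runc_nvr(nvr: str) -> Optional[RuncBuild]:
    """Split a runc package name into its parts, or return None if it
    does not match the expected shape."""
    match = _NVR.match(nvr.strip())
    if match is None:
        return None
    return RuncBuild(
        match["version"], int(match["build"]), match["stream"], match["commit"]
    )


def has_fix(installed: str, fixed: str) -> Optional[bool]:
    """Return True/False when both builds share a version and stream,
    and None when they cannot be compared by build number alone."""
    a, b = parse_runc_nvr(installed), parse_runc_nvr(fixed)
    if a is None or b is None:
        return None
    if a.stream != b.stream or a.version != b.version:
        return None
    return a.build >= b.build


if __name__ == "__main__":
    fixed = "runc-1.0.0-86.rhaos4.6.git23384e2"
    # Package name as reported by `rpm -qa | grep runc` in comment 14.
    print(has_fix("runc-1.0.0-95.rhaos4.8.gitcd80260.el8.x86_64", fixed))
```

For the 4.8 package from comment 14 the helper returns `None` rather than guessing, because that build is on a different stream than the 4.6 build that carries the fix; whether it contains the additional fix has to be confirmed from the package changelog or the git commit.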

Comment 21 errata-xmlrpc 2021-07-27 22:55:12 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.8.2 bug fix and security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:2438