Bug 1779421 - CRI-O failing with: error reserving ctr name
Summary: CRI-O failing with: error reserving ctr name
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Node
Version: 4.3.0
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: ---
Target Release: 4.5.0
Assignee: Peter Hunt
QA Contact: MinLi
URL:
Whiteboard:
Duplicates: 1784819
Depends On:
Blocks: 1806000 1934656
 
Reported: 2019-12-03 23:40 UTC by W. Trevor King
Modified: 2023-09-15 00:20 UTC
CC List: 19 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Clones: 1806000
Environment:
Last Closed: 2020-07-13 17:12:18 UTC
Target Upstream Version:
Embargoed:




Links:
Github openshift installer pull 3153 (closed): Bug 1779421: Bump RHCOS to 44.81.202002211631-0 (last updated 2021-02-19 13:18:07 UTC)
Red Hat Product Errata RHBA-2020:2409 (last updated 2020-07-13 17:12:47 UTC)

Internal Links: 1785399

Description W. Trevor King 2019-12-03 23:40:27 UTC
4.3 promotion job [1]:

Dec  3 10:40:57.971: INFO: At 2019-12-03 10:40:46 +0000 UTC - event for pod-configmaps-f0097d09-4b22-4b9d-a613-5e53e7b80375: {kubelet glr6vj7q-e8966-jhd4b-worker-w5m2g} Failed: Error: error reserving ctr name k8s_configmap-volume-test_pod-configmaps-f0097d09-4b22-4b9d-a613-5e53e7b80375_e2e-configmap-555_fd69ab6d-5cf9-47b4-b9e5-8581caf3a634_0 for id b7ab2ec6c5f0f8eac4ae1159ddb0e87a3917ef00b03ca21f02905c69f13b721f

In this case it led to:

Failing tests:

[Feature:Builds][timing] capture build stages and durations  should record build stages and durations for s2i [Suite:openshift/conformance/parallel]
[Feature:DeploymentConfig] deploymentconfigs with multiple image change triggers [Conformance] should run a successful deployment with a trigger used by different containers [Suite:openshift/conformance/parallel/minimal]
[k8s.io] Container Lifecycle Hook when create a pod with lifecycle hook should execute poststart exec hook properly [NodeConformance] [Conformance] [Suite:openshift/conformance/parallel/minimal] [Suite:k8s]
[k8s.io] Container Runtime blackbox test when starting a container that exits should run with the expected status [NodeConformance] [Conformance] [Suite:openshift/conformance/parallel/minimal] [Suite:k8s]
[sig-cli] Kubectl client Simple pod should contain last line of the log [Suite:openshift/conformance/parallel] [Suite:k8s]
[sig-network] DNS should provide DNS for pods for Subdomain [Conformance] [Suite:openshift/conformance/parallel/minimal] [Suite:k8s]
[sig-network] NetworkPolicy [LinuxOnly] NetworkPolicy between server and client should enforce policy based on NamespaceSelector with MatchExpressions[Feature:NetworkPolicy] [Suite:openshift/conformance/parallel] [Suite:k8s] [Skipped:Network/OpenShiftSDN/Multitenant]
[sig-network] NetworkPolicy [LinuxOnly] NetworkPolicy between server and client should stop enforcing policies after they are deleted [Feature:NetworkPolicy] [Suite:openshift/conformance/parallel] [Suite:k8s] [Skipped:Network/OpenShiftSDN/Multitenant]
[sig-storage] CSI mock volume CSI workload information using mock driver contain ephemeral=true when using inline volume [Suite:openshift/conformance/parallel] [Suite:k8s]
[sig-storage] ConfigMap should be consumable from pods in volume [NodeConformance] [Conformance] [Suite:openshift/conformance/parallel/minimal] [Suite:k8s]
[sig-storage] ConfigMap should be consumable from pods in volume as non-root with FSGroup [LinuxOnly] [NodeFeature:FSGroup] [Suite:openshift/conformance/parallel] [Suite:k8s]
[sig-storage] EmptyDir volumes should support (root,0777,tmpfs) [LinuxOnly] [NodeConformance] [Conformance] [Suite:openshift/conformance/parallel/minimal] [Suite:k8s]
[sig-storage] In-tree Volumes [Driver: cinder] [Testpattern: Dynamic PV (default fs)] subPath should be able to unmount after the subpath directory is deleted [Suite:openshift/conformance/parallel] [Suite:k8s]
[sig-storage] In-tree Volumes [Driver: gluster] [Testpattern: Inline-volume (default fs)] volumes should allow exec of files on the volume [Suite:openshift/conformance/parallel] [Suite:k8s]
[sig-storage] In-tree Volumes [Driver: gluster] [Testpattern: Inline-volume (default fs)] volumes should store data [Suite:openshift/conformance/parallel] [Suite:k8s]
[sig-storage] In-tree Volumes [Driver: gluster] [Testpattern: Pre-provisioned PV (default fs)] volumes should store data [Suite:openshift/conformance/parallel] [Suite:k8s]
[sig-storage] In-tree Volumes [Driver: local][LocalVolumeType: block] [Testpattern: Pre-provisioned PV (default fs)] subPath should support existing single file [LinuxOnly] [Suite:openshift/conformance/parallel] [Suite:k8s]
[sig-storage] In-tree Volumes [Driver: local][LocalVolumeType: blockfs] [Testpattern: Pre-provisioned PV (default fs)] subPath should support existing directories when readOnly specified in the volumeSource [Suite:openshift/conformance/parallel] [Suite:k8s]
[sig-storage] In-tree Volumes [Driver: local][LocalVolumeType: dir-link] [Testpattern: Pre-provisioned PV (filesystem volmode)] volumeMode should not mount / map unused volumes in a pod [Suite:openshift/conformance/parallel] [Suite:k8s]
[sig-storage] Secrets optional updates should be reflected in volume [NodeConformance] [Conformance] [Suite:openshift/conformance/parallel/minimal] [Suite:k8s]

Seems to happen every few days [2].  Possibly mitigated by rebooting the node [3].  CRI-O code generating the error string is [4].

[1]: https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-openstack-4.3/510
[2]: https://search.svc.ci.openshift.org/?search=Failed:%20Error:%20error%20reserving%20ctr%20name
[3]: https://github.com/IBM/ibm-spectrum-scale-csi-operator/issues/54#issuecomment-555167595
[4]: https://github.com/cri-o/cri-o/blob/2de09c8b80545b77ba6fa49a1f66b681f9a11755/internal/lib/container_server.go#L562
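
For readers unfamiliar with the code at [4]: CRI-O keeps a registrar of reserved container names, and a create request fails if the name is already held by another container ID. A minimal Go sketch of that pattern (hypothetical types and names, not the actual CRI-O implementation):

```
package main

import (
	"fmt"
	"sync"
)

// nameRegistrar tracks which container names are currently reserved,
// mapping each name to the container ID that holds it.
type nameRegistrar struct {
	mu       sync.Mutex
	reserved map[string]string // name -> container ID
}

func newNameRegistrar() *nameRegistrar {
	return &nameRegistrar{reserved: make(map[string]string)}
}

// ReserveName fails if the name is already held, which is what surfaces
// to the kubelet as "error reserving ctr name ... for id ...".
func (r *nameRegistrar) ReserveName(name, id string) error {
	r.mu.Lock()
	defer r.mu.Unlock()
	if holder, ok := r.reserved[name]; ok {
		return fmt.Errorf("error reserving ctr name %s for id %s: name is reserved by %s", name, id, holder)
	}
	r.reserved[name] = id
	return nil
}

// ReleaseName frees the reservation; if this is skipped on a failed or
// timed-out create, later retries keep hitting the reservation error.
func (r *nameRegistrar) ReleaseName(name string) {
	r.mu.Lock()
	defer r.mu.Unlock()
	delete(r.reserved, name)
}

func main() {
	reg := newNameRegistrar()
	fmt.Println(reg.ReserveName("k8s_configmap-volume-test_pod_ns_uid_0", "id-1")) // <nil>
	fmt.Println(reg.ReserveName("k8s_configmap-volume-test_pod_ns_uid_0", "id-2")) // reservation error
}
```

Releasing the reservation on every failure path is what keeps later retries from hitting this error.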

Comment 1 W. Trevor King 2019-12-03 23:41:46 UTC
Ah, I forgot CRI-O is the Node component.

Comment 2 W. Trevor King 2019-12-04 16:12:41 UTC
I dunno if CRI-O/RHCOS has a different flow, but for most OpenShift components we need a bug with a master-ward Target Release (4.4 at the moment) blocking backports to already-forked-off release branches like 4.3.  That way we ensure we don't fix something in 4.3 and regress after forgetting to fix it in 4.4.

Comment 3 Peter Hunt 2019-12-11 21:29:15 UTC
This bug has certainly existed for a while, and isn't very high priority. I am going to defer it to 4.4.

In the meantime, I merged PRs in 1.16 and master (https://github.com/cri-o/cri-o/pull/3035 and https://github.com/cri-o/cri-o/pull/3036) to print the actual error.

Let's let that soak for a bit to find more instances, and then I will look into the root cause.
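
For illustration, the kind of change those PRs make is to wrap the underlying cause into the reported message instead of dropping it; a minimal sketch under that assumption (hypothetical names, not the actual cri-o/cri-o diff):

```
package main

import (
	"errors"
	"fmt"
)

// errNameReserved stands in for the underlying reason a reservation fails.
var errNameReserved = errors.New("name is reserved")

// reserveCtrName is a stand-in for the create path. The old message dropped
// the cause; the new one wraps it so the kubelet event shows why the name
// is still held.
func reserveCtrName(name, id string) error {
	underlying := errNameReserved // pretend the registrar rejected the reservation

	// Old behaviour (cause swallowed):
	//   return fmt.Errorf("error reserving ctr name %s for id %s", name, id)

	// New behaviour (cause included):
	return fmt.Errorf("error reserving ctr name %s for id %s: %w", name, id, underlying)
}

func main() {
	err := reserveCtrName("k8s_example_pod_ns_uid_0", "0123abcd")
	fmt.Println(err)                             // ...: name is reserved
	fmt.Println(errors.Is(err, errNameReserved)) // true: the cause is preserved
}
```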

Comment 4 Tom Sweeney 2019-12-11 22:44:57 UTC
Bumping the target release.

Comment 5 John Strunk 2019-12-20 14:55:51 UTC
*** Bug 1784819 has been marked as a duplicate of this bug. ***

Comment 7 David Eads 2020-02-20 14:16:46 UTC
We are still seeing this and it's leading to difficult diagnosis across the stack. Sometimes it is obvious, like https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-gcp-4.4/932, but most of the time some random pod somewhere gets stuck and we see issues like https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-gcp-serial-4.4/937, where the true cause is buried in an event that says:

```
 message: 'Failed to create pod sandbox: rpc error: code = Unknown desc = error reserving
    pod name k8s_apiserver-6qxj7_openshift-apiserver_1f85b2c0-967f-4e20-bede-cede12def172_0
    for id 061e104adb2cf9e49d873704ed7825d41ec36859809098419845a4efb1673dfd: name
    is reserved'
```

Comment 10 agopi 2020-03-31 19:45:53 UTC
Noticed the same issue when trying to create a high number of pods with 4.3.9 on AWS (OpenShiftSDN); +1 to what David Eads has said. This is especially hard to notice if you're not using naked pod object definitions, as you'd just wait for a specified number of pods to come up, when in reality pods are being terminated in the background due to the error while new pods slowly take their place.

Comment 11 Ben Parees 2020-04-01 14:48:01 UTC
Why was the 4.4 BZ (https://bugzilla.redhat.com/show_bug.cgi?id=1806000) closed as "not a bug" when we identified a necessary fix for 4.5?

Also, assuming there is a fix to be made here, please backport this to 4.3 as it is also severely impacted.

Comment 12 Tom Sweeney 2020-04-01 15:56:10 UTC
Ryan Phillips, you closed https://bugzilla.redhat.com/show_bug.cgi?id=1806000. Based on Ben's comment (https://bugzilla.redhat.com/show_bug.cgi?id=1779421#c11), should we reopen that and assign it to Peter?

Comment 13 MinLi 2020-04-03 09:14:20 UTC
verified on 4.5.0-0.nightly-2020-03-29-224016

sh-4.4# cat /etc/os-release 
NAME="Red Hat Enterprise Linux CoreOS"
VERSION="45.81.202003292027-0"
VERSION_ID="4.5"
OPENSHIFT_VERSION="4.5"
RHEL_VERSION="8.1"
PRETTY_NAME="Red Hat Enterprise Linux CoreOS 45.81.202003292027-0 (Ootpa)"
ID="rhcos"
ID_LIKE="rhel fedora"
ANSI_COLOR="0;31"
HOME_URL="https://www.redhat.com/"
BUG_REPORT_URL="https://bugzilla.redhat.com/"
REDHAT_BUGZILLA_PRODUCT="OpenShift Container Platform"
REDHAT_BUGZILLA_PRODUCT_VERSION="4.5"
REDHAT_SUPPORT_PRODUCT="OpenShift Container Platform"
REDHAT_SUPPORT_PRODUCT_VERSION="4.5"
OSTREE_VERSION='45.81.202003292027-0'

sh-4.4# crictl version 
Version:  0.1.0
RuntimeName:  cri-o
RuntimeVersion:  1.17.0-9.dev.rhaos4.4.gitdfc8414.el8
RuntimeApiVersion:  v1alpha1

sh-4.4# rpm -qa | grep cri-o
cri-o-1.17.0-9.dev.rhaos4.4.gitdfc8414.el8.x86_64

Comment 14 Peter Hunt 2020-04-06 17:57:42 UTC
(In reply to Ben Parees from comment #11)
> Why was the 4.4 BZ(https://bugzilla.redhat.com/show_bug.cgi?id=1806000)
> closed as "not a bug" when we identified a necessary fix for 4.5?
> 
> Also, assuming there is a fix to be made here, please backport this to 4.3
> as it is also severely impacted.

The issue in that bug and the one here are slightly different, though they produce the same symptoms.

This error ultimately appears when a timeout occurs on a pod/container create request the kubelet makes to CRI-O. The kubelet retries while CRI-O is still processing the original request, and CRI-O sees there's a duplicate name reservation and errors out.
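
A minimal sketch of that race, with a toy name registrar standing in for CRI-O's (hypothetical names; the timings are illustrative, not the kubelet's real runtime-request timeout):

```
package main

import (
	"context"
	"fmt"
	"sync"
	"time"
)

// A compact stand-in for CRI-O's name reservation.
var (
	mu       sync.Mutex
	reserved = map[string]string{} // name -> container ID
)

func reserveName(name, id string) error {
	mu.Lock()
	defer mu.Unlock()
	if holder, ok := reserved[name]; ok {
		return fmt.Errorf("error reserving ctr name %s for id %s: name is reserved by %s", name, id, holder)
	}
	reserved[name] = id
	return nil
}

func releaseName(name string) {
	mu.Lock()
	defer mu.Unlock()
	delete(reserved, name)
}

// createContainer simulates the server side: reserve the name up front,
// then do slow work (image mount, cgroup setup, ...), then release.
func createContainer(name, id string, work time.Duration) error {
	if err := reserveName(name, id); err != nil {
		return err
	}
	time.Sleep(work)
	releaseName(name)
	return nil
}

func main() {
	name := "k8s_example_pod_ns_uid_0"

	// First request: the caller's deadline expires before the slow create
	// finishes, so the "kubelet" gives up and retries while the server is
	// still holding the name reservation.
	ctx, cancel := context.WithTimeout(context.Background(), 100*time.Millisecond)
	defer cancel()
	done := make(chan error, 1)
	go func() { done <- createContainer(name, "id-1", 2*time.Second) }()
	select {
	case <-ctx.Done():
		fmt.Println("kubelet: create timed out, retrying")
	case err := <-done:
		fmt.Println("create finished:", err)
	}

	// The retry arrives while the original request is still in flight and
	// fails with the reservation error seen in the events above.
	fmt.Println(createContainer(name, "id-2", 0))
}
```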

In 4.4/4.5, there was a bug that caused this to happen more often in CRI-O, because it leaked name reservations on a failed create in some cases. That change was reverted in [1].

We saw it more, even after the aforementioned change was reverted, because resource limits weren't properly being applied to pods [2] and kubepods.slice wasn't properly configured [3].

I think one of those three cases was what you described as a necessary fix for 4.5. For 4.3: [1] does not apply (the patch never appeared in 4.3), and [2] and [3] already have backports.

https://bugzilla.redhat.com/show_bug.cgi?id=1785399 (referred to as a possible duplicate) does not seem to be related to any of these. It is more of a stress test condition: when deploying 2000 pods, CRI-O times out more often, causing the kubelet to retry more often, causing more timeouts. In investigating the 2000-pod case, there were other discoveries indicating that limits in OVS were the underlying issue, not necessarily CRI-O (if network creation takes too long, pod creation will time out).

I am not sure there are other fixes we know about for this situation in 4.3 that have not already been applied. I am also not really sure why https://bugzilla.redhat.com/show_bug.cgi?id=1806000 was closed as NOTABUG. Maybe it should have been CURRENTRELEASE, given that [1] was merged and fixed the CRI-O bug?

[1] https://github.com/cri-o/cri-o/pull/3183
[2] https://bugzilla.redhat.com/show_bug.cgi?id=1812709
[3] https://github.com/openshift/origin/pull/24611

Comment 17 errata-xmlrpc 2020-07-13 17:12:18 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:2409

Comment 18 Red Hat Bugzilla 2023-09-15 00:20:01 UTC
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 500 days.

