4.3 promotion job [1]:

Dec 3 10:40:57.971: INFO: At 2019-12-03 10:40:46 +0000 UTC - event for pod-configmaps-f0097d09-4b22-4b9d-a613-5e53e7b80375: {kubelet glr6vj7q-e8966-jhd4b-worker-w5m2g} Failed: Error: error reserving ctr name k8s_configmap-volume-test_pod-configmaps-f0097d09-4b22-4b9d-a613-5e53e7b80375_e2e-configmap-555_fd69ab6d-5cf9-47b4-b9e5-8581caf3a634_0 for id b7ab2ec6c5f0f8eac4ae1159ddb0e87a3917ef00b03ca21f02905c69f13b721f

In this case it led to:

Failing tests:

[Feature:Builds][timing] capture build stages and durations should record build stages and durations for s2i [Suite:openshift/conformance/parallel]
[Feature:DeploymentConfig] deploymentconfigs with multiple image change triggers [Conformance] should run a successful deployment with a trigger used by different containers [Suite:openshift/conformance/parallel/minimal]
[k8s.io] Container Lifecycle Hook when create a pod with lifecycle hook should execute poststart exec hook properly [NodeConformance] [Conformance] [Suite:openshift/conformance/parallel/minimal] [Suite:k8s]
[k8s.io] Container Runtime blackbox test when starting a container that exits should run with the expected status [NodeConformance] [Conformance] [Suite:openshift/conformance/parallel/minimal] [Suite:k8s]
[sig-cli] Kubectl client Simple pod should contain last line of the log [Suite:openshift/conformance/parallel] [Suite:k8s]
[sig-network] DNS should provide DNS for pods for Subdomain [Conformance] [Suite:openshift/conformance/parallel/minimal] [Suite:k8s]
[sig-network] NetworkPolicy [LinuxOnly] NetworkPolicy between server and client should enforce policy based on NamespaceSelector with MatchExpressions[Feature:NetworkPolicy] [Suite:openshift/conformance/parallel] [Suite:k8s] [Skipped:Network/OpenShiftSDN/Multitenant]
[sig-network] NetworkPolicy [LinuxOnly] NetworkPolicy between server and client should stop enforcing policies after they are deleted [Feature:NetworkPolicy] [Suite:openshift/conformance/parallel] [Suite:k8s] [Skipped:Network/OpenShiftSDN/Multitenant]
[sig-storage] CSI mock volume CSI workload information using mock driver contain ephemeral=true when using inline volume [Suite:openshift/conformance/parallel] [Suite:k8s]
[sig-storage] ConfigMap should be consumable from pods in volume [NodeConformance] [Conformance] [Suite:openshift/conformance/parallel/minimal] [Suite:k8s]
[sig-storage] ConfigMap should be consumable from pods in volume as non-root with FSGroup [LinuxOnly] [NodeFeature:FSGroup] [Suite:openshift/conformance/parallel] [Suite:k8s]
[sig-storage] EmptyDir volumes should support (root,0777,tmpfs) [LinuxOnly] [NodeConformance] [Conformance] [Suite:openshift/conformance/parallel/minimal] [Suite:k8s]
[sig-storage] In-tree Volumes [Driver: cinder] [Testpattern: Dynamic PV (default fs)] subPath should be able to unmount after the subpath directory is deleted [Suite:openshift/conformance/parallel] [Suite:k8s]
[sig-storage] In-tree Volumes [Driver: gluster] [Testpattern: Inline-volume (default fs)] volumes should allow exec of files on the volume [Suite:openshift/conformance/parallel] [Suite:k8s]
[sig-storage] In-tree Volumes [Driver: gluster] [Testpattern: Inline-volume (default fs)] volumes should store data [Suite:openshift/conformance/parallel] [Suite:k8s]
[sig-storage] In-tree Volumes [Driver: gluster] [Testpattern: Pre-provisioned PV (default fs)] volumes should store data [Suite:openshift/conformance/parallel] [Suite:k8s]
[sig-storage] In-tree Volumes [Driver: local][LocalVolumeType: block] [Testpattern: Pre-provisioned PV (default fs)] subPath should support existing single file [LinuxOnly] [Suite:openshift/conformance/parallel] [Suite:k8s]
[sig-storage] In-tree Volumes [Driver: local][LocalVolumeType: blockfs] [Testpattern: Pre-provisioned PV (default fs)] subPath should support existing directories when readOnly specified in the volumeSource [Suite:openshift/conformance/parallel] [Suite:k8s]
[sig-storage] In-tree Volumes [Driver: local][LocalVolumeType: dir-link] [Testpattern: Pre-provisioned PV (filesystem volmode)] volumeMode should not mount / map unused volumes in a pod [Suite:openshift/conformance/parallel] [Suite:k8s]
[sig-storage] Secrets optional updates should be reflected in volume [NodeConformance] [Conformance] [Suite:openshift/conformance/parallel/minimal] [Suite:k8s]

Seems to happen every few days [2]. Possibly mitigated by rebooting the node [3]. The CRI-O code generating the error string is at [4].

[1]: https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-openstack-4.3/510
[2]: https://search.svc.ci.openshift.org/?search=Failed:%20Error:%20error%20reserving%20ctr%20name
[3]: https://github.com/IBM/ibm-spectrum-scale-csi-operator/issues/54#issuecomment-555167595
[4]: https://github.com/cri-o/cri-o/blob/2de09c8b80545b77ba6fa49a1f66b681f9a11755/internal/lib/container_server.go#L562
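For anyone unfamiliar with the error, here is a minimal Go sketch of the name-reservation pattern behind it. This is not the actual CRI-O code from [4], just an illustrative analogue; the NameRegistrar type and its methods are made up for the example. The point is that the name stays reserved for the first ID until it is explicitly released, so a retried create for the same container (with a fresh ID) fails with exactly this kind of message.

```go
package main

import (
	"fmt"
	"sync"
)

// NameRegistrar is an illustrative stand-in for the index the runtime keeps
// of container names it has already handed out; it is not the real CRI-O type.
type NameRegistrar struct {
	mu    sync.Mutex
	names map[string]string // name -> container ID holding the reservation
}

func NewNameRegistrar() *NameRegistrar {
	return &NameRegistrar{names: map[string]string{}}
}

// Reserve fails if the name is already held by a different ID, which is the
// condition behind "error reserving ctr name ... for id ...".
func (r *NameRegistrar) Reserve(name, id string) error {
	r.mu.Lock()
	defer r.mu.Unlock()
	if owner, ok := r.names[name]; ok && owner != id {
		return fmt.Errorf("name %s is reserved for %s", name, owner)
	}
	r.names[name] = id
	return nil
}

// Release frees the name again, e.g. when the container is removed or the
// original create attempt is cleaned up.
func (r *NameRegistrar) Release(name string) {
	r.mu.Lock()
	defer r.mu.Unlock()
	delete(r.names, name)
}

func main() {
	reg := NewNameRegistrar()
	_ = reg.Reserve("k8s_configmap-volume-test_pod-configmaps-...", "id-1") // first create attempt
	// A retried create for the same container but a new ID hits the error:
	if err := reg.Reserve("k8s_configmap-volume-test_pod-configmaps-...", "id-2"); err != nil {
		fmt.Println("error reserving ctr name:", err)
	}
}
```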
Ah, I forgot CRI-O is the Node component.
I dunno if CRI-O/RHCOS has a different flow, but for most OpenShift components we need a bug with a master-ward Target Release (4.4 at the moment) blocking backports to already-forked-off release branches like 4.3. That way we ensure we don't fix something in 4.3 and regress after forgetting to fix it in 4.4.
This bug has certainly existed for a while, and it isn't very high priority, so I am going to defer it to 4.4. In the meantime, I merged PRs into 1.16 and master (https://github.com/cri-o/cri-o/pull/3035 and https://github.com/cri-o/cri-o/pull/3036) to print the actual error. Let's let that soak for a bit to find more instances, and then I will look into the root cause.
bumping the target release.
*** Bug 1784819 has been marked as a duplicate of this bug. ***
We are still seeing this and it's leading to difficult diagnosis across the stack. Sometimes it is obvious, like https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-gcp-4.4/932, but most of the time some random pod somewhere gets stuck and we see issues like https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-gcp-serial-4.4/937, where the true cause is buried in an event that says:

```
message: 'Failed to create pod sandbox: rpc error: code = Unknown desc = error reserving pod name k8s_apiserver-6qxj7_openshift-apiserver_1f85b2c0-967f-4e20-bede-cede12def172_0 for id 061e104adb2cf9e49d873704ed7825d41ec36859809098419845a4efb1673dfd: name is reserved'
```
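For anyone digging through a live cluster for these, here is a small client-go sketch that surfaces the buried reservation events. It assumes a local kubeconfig and permission to list events across namespaces, and is only an illustration, roughly equivalent to grepping the output of `oc get events -A`:

```go
package main

import (
	"context"
	"fmt"
	"log"
	"strings"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// Assumes the default kubeconfig location ($HOME/.kube/config); adjust as needed.
	config, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		log.Fatal(err)
	}
	clientset, err := kubernetes.NewForConfig(config)
	if err != nil {
		log.Fatal(err)
	}

	// List events in all namespaces and keep only the name-reservation failures.
	events, err := clientset.CoreV1().Events("").List(context.TODO(), metav1.ListOptions{})
	if err != nil {
		log.Fatal(err)
	}
	for _, ev := range events.Items {
		if strings.Contains(ev.Message, "name is reserved") ||
			strings.Contains(ev.Message, "error reserving") {
			fmt.Printf("%s/%s: %s\n", ev.Namespace, ev.InvolvedObject.Name, ev.Message)
		}
	}
}
```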
Noticed the same issue when trying to create a high number of pods with 4.3.9 on AWS (OpenShiftSDN); +1 to what David Eads has said. This is especially easy to miss if you're not using naked pod object definitions, because you'd just wait for the specified number of pods to come up, while in reality pods are being terminated in the background due to the error and new pods slowly take their place.
Why was the 4.4 BZ (https://bugzilla.redhat.com/show_bug.cgi?id=1806000) closed as "not a bug" when we identified a necessary fix for 4.5? Also, assuming there is a fix to be made here, please backport this to 4.3, as it is also severely impacted.
Ryan Phillips, you closed https://bugzilla.redhat.com/show_bug.cgi?id=1806000 based on Ben's comment (https://bugzilla.redhat.com/show_bug.cgi?id=1779421#c11). Should we reopen that and assign it to Peter?
verified on 4.5.0-0.nightly-2020-03-29-224016

```
sh-4.4# cat /etc/os-release
NAME="Red Hat Enterprise Linux CoreOS"
VERSION="45.81.202003292027-0"
VERSION_ID="4.5"
OPENSHIFT_VERSION="4.5"
RHEL_VERSION="8.1"
PRETTY_NAME="Red Hat Enterprise Linux CoreOS 45.81.202003292027-0 (Ootpa)"
ID="rhcos"
ID_LIKE="rhel fedora"
ANSI_COLOR="0;31"
HOME_URL="https://www.redhat.com/"
BUG_REPORT_URL="https://bugzilla.redhat.com/"
REDHAT_BUGZILLA_PRODUCT="OpenShift Container Platform"
REDHAT_BUGZILLA_PRODUCT_VERSION="4.5"
REDHAT_SUPPORT_PRODUCT="OpenShift Container Platform"
REDHAT_SUPPORT_PRODUCT_VERSION="4.5"
OSTREE_VERSION='45.81.202003292027-0'
sh-4.4# crictl version
Version: 0.1.0
RuntimeName: cri-o
RuntimeVersion: 1.17.0-9.dev.rhaos4.4.gitdfc8414.el8
RuntimeApiVersion: v1alpha1
sh-4.4# rpm -qa | grep cri-o
cri-o-1.17.0-9.dev.rhaos4.4.gitdfc8414.el8.x86_64
```
(In reply to Ben Parees from comment #11)
> Why was the 4.4 BZ (https://bugzilla.redhat.com/show_bug.cgi?id=1806000)
> closed as "not a bug" when we identified a necessary fix for 4.5?
>
> Also, assuming there is a fix to be made here, please backport this to 4.3,
> as it is also severely impacted.

The issue in that bug and the one here are slightly different, though they hit the same symptoms. This error ultimately appears when a pod/container create request the kubelet makes to CRI-O times out: the kubelet retries while CRI-O is still processing the original request, and CRI-O sees the duplicate and errors out.

In 4.4/4.5 there was a bug that caused this to happen more often in CRI-O, because it leaked on failure to create in some cases. That change was reverted in [1]. We continued to see it after the revert because resource limits weren't being properly applied to pods [2], and because kubepods.slice wasn't properly configured [3]. I think one of those three cases is what you described as a necessary fix for 4.5.

For 4.3: [1] does not apply (the patch never appeared in 4.3), and [2] and [3] already have backports.

https://bugzilla.redhat.com/show_bug.cgi?id=1785399 (referred to as a possible duplicate) does not seem to be related to any of these. It is more of a stress-test condition: when deploying 2000 pods, CRI-O times out more often, causing the kubelet to retry more often, causing more timeouts. In investigating the 2000-pod case, there were other discoveries indicating that limits in OVS were the underlying issue, not necessarily CRI-O (if network creation takes too long, pod creation will time out).

I am not sure there are other fixes we know about for this situation in 4.3 that have not already been applied. I am also not really sure why https://bugzilla.redhat.com/show_bug.cgi?id=1806000 was closed as NOTABUG. Maybe it should have been CURRENTVERSION, given that [1] was merged and fixed the CRI-O bug?

[1] https://github.com/cri-o/cri-o/pull/3183
[2] https://bugzilla.redhat.com/show_bug.cgi?id=1812709
[3] https://github.com/openshift/origin/pull/24611
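To make the timeout/retry interaction described above concrete, here is a toy Go sketch. It is not kubelet or CRI-O code; the slowRuntime type and the durations are invented for illustration. The caller gives up on a create before the runtime finishes it, and the retry then trips over the still-held name reservation:

```go
package main

import (
	"errors"
	"fmt"
	"sync"
	"time"
)

// slowRuntime stands in for the runtime side: it reserves the container name
// up front and keeps working on the create long after the caller gave up.
type slowRuntime struct {
	mu       sync.Mutex
	reserved map[string]bool
}

func (r *slowRuntime) CreateContainer(name string, workTime time.Duration) error {
	r.mu.Lock()
	if r.reserved[name] {
		r.mu.Unlock()
		return errors.New("error reserving ctr name " + name + ": name is reserved")
	}
	r.reserved[name] = true
	r.mu.Unlock()

	time.Sleep(workTime) // e.g. waiting on slow networking or storage setup
	return nil
}

func main() {
	rt := &slowRuntime{reserved: map[string]bool{}}
	name := "k8s_apiserver-6qxj7_openshift-apiserver_..._0"

	// First attempt: the caller only waits 1s, but the create needs 3s.
	done := make(chan error, 1)
	go func() { done <- rt.CreateContainer(name, 3*time.Second) }()

	select {
	case err := <-done:
		fmt.Println("first attempt finished in time:", err)
	case <-time.After(1 * time.Second):
		fmt.Println("first attempt timed out on the caller side; retrying")
	}

	// The retry arrives while the original create is still in flight, so it
	// trips over the existing reservation instead of starting fresh.
	if err := rt.CreateContainer(name, 0); err != nil {
		fmt.Println("retry:", err)
	}
}
```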
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2020:2409