Bug 1785399

Summary: Under conditions of heavy pod creation, creation fails with "error reserving pod name ...: name is reserved"
Product: OpenShift Container Platform Reporter: Robert Krawitz <rkrawitz>
Component: Node    Assignee: Peter Hunt <pehunt>
Node sub component: CRI-O QA Contact: Sunil Choudhary <schoudha>
Status: CLOSED ERRATA Docs Contact:
Severity: urgent    
Priority: urgent CC: aarapov, abodhe, abraj, akamra, andbartl, andcosta, aos-bugs, apjagtap, ayakoubo, bbennett, bjarolim, braander, ckavili, dosmith, dporter, dsanzmor, ekasprzy, eparis, hongkliu, itbrown, izhang, jcallen, jinjli, jmalde, jokerman, jsafrane, jtaleric, jtanenba, kramdoss, lbac, llopezmo, mapandey, mifiedle, mwasher, naoto30, nbhatt, nelluri, nnosenzo, openshift-bugs-escalate, pehunt, pkhaire, pmagotra, rcernin, rperiyas, rphillips, rsevilla, sbhavsar, scuppett, shsaxena, sople, sparpate, syang, tkatarki, vpagar, wking, xmorano, yhuang
Version: 4.3.0    Flags: abodhe: needinfo+
Target Milestone: ---   
Target Release: 4.7.0   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard: aos-scalability-43
Fixed In Version:    Doc Type: Bug Fix
Doc Text:
Cause: Systems under load can run into a situation where pod or container creation requests from the kubelet to CRI-O take longer than the kubelet expects, causing the request to time out. The kubelet then re-requests that resource (even though CRI-O is still working on creating it), causing new pod/container creation requests to fail with a "name is reserved" message. Once CRI-O finishes the original request, it notices the request timed out, cleans up the failed pod/container, and starts the process over.
Consequence: Pod and container creation can stall for a long time, and the kubelet reports many "name is reserved" errors. This also puts further load on an already overloaded node.
Fix: CRI-O now saves its progress on container/pod creations that time out due to system load. It also stalls new requests from the kubelet, so fewer "name is reserved" errors are reported.
Result: When systems are under load, CRI-O now does everything it can to slow down the kubelet and reduce load on the system. Previously, a "name is reserved" error was posted every 2 seconds; these are now stalled to every couple of minutes. Also, because CRI-O saves its work instead of cleaning up and retrying, the overall load on the node is reduced and the kubelet and CRI-O should reconcile faster.
Note: This does not solve the underlying problem of the node being resource starved, but it does mitigate the effect of resource starvation.
Story Points: ---
Clone Of:
: 1913411 (view as bug list) Environment:
Last Closed: 2021-02-24 15:10:53 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 1887744    
Attachments:
  oc describe for the pods that did not come up with 4.4.0-0.nightly-2020-04-22-135638 (flags: none)

Description Robert Krawitz 2019-12-19 20:03:13 UTC
Created attachment 1646673 [details]
Node log from the worker node in question

Description of problem:

While attempting to create (schematically)

- namespace
  count: 100
  deployments:
    count: 2
    routes:
      count: 1
    secrets:
      count: 20
    pods:
      - name: server
        count: 1
        containers:
          count: 1
      - name: client
        count: 4
        containers:
          count: 5

Three of the pods (all part of the same deployment, and all on the same node)
failed to start, remaining in ContainerCreating state as follows:

Dec 19 18:27:51 ip-10-0-150-32 hyperkube[7786]: I1219 18:27:51.616549    7786 kuberuntime_manager.go:409] No sandbox for pod "clusterbuster-1-1-client-5576594889-74nsr_clusterbuster-1(60ce3d1a-0a37-4dd3-bb3b-c213861f1cc0)" can be found. Need to start a new one
Dec 19 18:27:51 ip-10-0-150-32 hyperkube[7786]: I1219 18:27:51.616567    7786 kuberuntime_manager.go:634] computePodActions got {KillPod:true CreateSandbox:true SandboxID: Attempt:0 NextInitContainerToStart:nil ContainersToStart:[0 1 2 3 4] ContainersToKill:map[] EphemeralContainersToStart:[]} for pod "clusterbuster-1-1-client-5576594889-74nsr_clusterbuster-1(60ce3d1a-0a37-4dd3-bb3b-c213861f1cc0)"
Dec 19 18:27:51 ip-10-0-150-32 hyperkube[7786]: E1219 18:27:51.617936    7786 remote_runtime.go:105] RunPodSandbox from runtime service failed: rpc error: code = Unknown desc = error reserving pod name k8s_clusterbuster-1-1-client-5576594889-74nsr_clusterbuster-1_60ce3d1a-0a37-4dd3-bb3b-c213861f1cc0_0 for id a7761081c11a8d9ca5c3db6f4b0e562ae7042d9893e3bba0d54e782dc86396e1: name is reserved
Dec 19 18:27:51 ip-10-0-150-32 hyperkube[7786]: E1219 18:27:51.618062    7786 pod_workers.go:191] Error syncing pod 60ce3d1a-0a37-4dd3-bb3b-c213861f1cc0 ("clusterbuster-1-1-client-5576594889-74nsr_clusterbuster-1(60ce3d1a-0a37-4dd3-bb3b-c213861f1cc0)"), skipping: failed to "CreatePodSandbox" for "clusterbuster-1-1-client-5576594889-74nsr_clusterbuster-1(60ce3d1a-0a37-4dd3-bb3b-c213861f1cc0)" with CreatePodSandboxError: "CreatePodSandbox for pod \"clusterbuster-1-1-client-5576594889-74nsr_clusterbuster-1(60ce3d1a-0a37-4dd3-bb3b-c213861f1cc0)\" failed: rpc error: code = Unknown desc = error reserving pod name k8s_clusterbuster-1-1-client-5576594889-74nsr_clusterbuster-1_60ce3d1a-0a37-4dd3-bb3b-c213861f1cc0_0 for id a7761081c11a8d9ca5c3db6f4b0e562ae7042d9893e3bba0d54e782dc86396e1: name is reserved"
Dec 19 18:27:51 ip-10-0-150-32 hyperkube[7786]: I1219 18:27:51.618087    7786 event.go:255] Event(v1.ObjectReference{Kind:"Pod", Namespace:"clusterbuster-1", Name:"clusterbuster-1-1-client-5576594889-74nsr", UID:"60ce3d1a-0a37-4dd3-bb3b-c213861f1cc0", APIVersion:"v1", ResourceVersion:"90293", FieldPath:""}): type: 'Warning' reason: 'FailedCreatePodSandBox' Failed create pod sandbox: rpc error: code = Unknown desc = error reserving pod name k8s_clusterbuster-1-1-client-5576594889-74nsr_clusterbuster-1_60ce3d1a-0a37-4dd3-bb3b-c213861f1cc0_0 for id a7761081c11a8d9ca5c3db6f4b0e562ae7042d9893e3bba0d54e782dc86396e1: name is reserved

Observed the following in the node log for the node in question:

Dec 19 18:27:40 ip-10-0-150-32 hyperkube[7786]: E1219 18:27:40.159974    7786 remote_runtime.go:105] RunPodSandbox from runtime service failed: rpc error: code = DeadlineExceeded desc = context deadline exceeded
Dec 19 18:27:40 ip-10-0-150-32 hyperkube[7786]: E1219 18:27:40.160055    7786 kuberuntime_sandbox.go:68] CreatePodSandbox for pod "clusterbuster-1-1-client-5576594889-74nsr_clusterbuster-1(60ce3d1a-0a37-4dd3-bb3b-c213861f1cc0)" failed: rpc error: code = DeadlineExceeded desc = context deadline exceeded
Dec 19 18:27:40 ip-10-0-150-32 hyperkube[7786]: E1219 18:27:40.160075    7786 kuberuntime_manager.go:710] createPodSandbox for pod "clusterbuster-1-1-client-5576594889-74nsr_clusterbuster-1(60ce3d1a-0a37-4dd3-bb3b-c213861f1cc0)" failed: rpc error: code = DeadlineExceeded desc = context deadline exceeded
Dec 19 18:27:40 ip-10-0-150-32 hyperkube[7786]: E1219 18:27:40.160133    7786 pod_workers.go:191] Error syncing pod 60ce3d1a-0a37-4dd3-bb3b-c213861f1cc0 ("clusterbuster-1-1-client-5576594889-74nsr_clusterbuster-1(60ce3d1a-0a37-4dd3-bb3b-c213861f1cc0)"), skipping: failed to "CreatePodSandbox" for "clusterbuster-1-1-client-5576594889-74nsr_clusterbuster-1(60ce3d1a-0a37-4dd3-bb3b-c213861f1cc0)" with CreatePodSandboxError: "CreatePodSandbox for pod \"clusterbuster-1-1-client-5576594889-74nsr_clusterbuster-1(60ce3d1a-0a37-4dd3-bb3b-c213861f1cc0)\" failed: rpc error: code = DeadlineExceeded desc = context deadline exceeded"

Version-Release number of selected component (if applicable): 4.4.0-0.ci-2019-12-19-155944

This is on AWS us-east-1, with both workers and masters of type m5.8xlarge.


How reproducible:

See Steps to Reproduce below.


Steps to Reproduce:
1. set-worker-parameters -p 500 # Allow 500 pods/node
2. clusterbuster -P server -b 5 -p 10 -D .01 -M 1 -N 100 -r 4 -d 2 -c 5 -m 16000 -v -s 10 -x

Actual results: A small number of pods remain in ContainerCreating with the above error (I suspect it might take multiple tries to reproduce).


Expected results: All pods should be created.

Comment 7 Stephen Cuppett 2020-01-13 12:49:44 UTC
Setting target release to 4.4 (active development branch). Where fixes are required/requested, BZ clones to backport target releases will need to be created.

Comment 13 Peter Hunt 2020-03-05 21:26:38 UTC
I am moving this out of 4.4 because (1) it was reproduced in 4.3, so it is not a release blocker, and (2) it seems to be out of scope of what can be handled before code freeze.

Comment 16 W. Trevor King 2020-04-18 02:33:24 UTC
Recent example from CI [1]:

Apr 16 23:29:42.761 I ns/openshift-etcd-operator pod/etcd-operator-7c79777cb-b68s2 node/ci-op-r2nn9-m-2.c.openshift-gce-devel-ci.internal reason/Scheduled
Apr 16 23:30:16.748 W ns/openshift-etcd-operator pod/etcd-operator-7c79777cb-b68s2 node/ci-op-r2nn9-m-2.c.openshift-gce-devel-ci.internal reason/FailedCreatePodSandBox Failed to create pod sandbox: rpc error: code = Unknown desc = failed to create pod network sandbox k8s_etcd-operator-7c79777cb-b68s2_openshift-etcd-operator_f229fad5-1c24-4ec2-a0c7-21077ecde07d_0(a1b7810bbba06908a991b7c0fdf2fb2d994abe68848d9a21397809e9ba7bfb0c): netplugin failed with no error message
Apr 16 23:30:32.527 I ns/openshift-etcd-operator pod/etcd-operator-7c79777cb-b68s2 reason/AddedInterface Add eth0 [10.129.0.65/23]

[1]: https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/pr-logs/pull/openshift_cluster-etcd-operator/309/pull-ci-openshift-cluster-etcd-operator-master-e2e-gcp-upgrade/1389

Comment 18 Itzik Brown 2020-04-21 06:27:13 UTC
I see this issue in 4.4

Comment 19 Casey Callendrello 2020-04-22 10:00:59 UTC
Hit this issue in a CI run, with (theoretically) disastrous effects: no sandboxes/pods could be re-created on a freshly rebooted node.

https://gcsweb-ci.svc.ci.openshift.org/gcs/origin-ci-test/pr-logs/pull/openshift_machine-config-operator/1670/pull-ci-openshift-machine-config-operator-master-e2e-gcp-op/1940/artifacts/e2e-gcp-op/

See workers-journal, node ci-op-d4rtc-w-b-xrmnd, reboot time 19:33.

This caused a rebooted node to become unmanageable. No bueno.

Comment 20 Casey Callendrello 2020-04-22 11:14:12 UTC
Hit this again immediately after a CI node rebooted. Found some very weird things:

Something very odd is happening - the containers are created before the pod is!?!?

Apr 22 11:00:50 ci-op-mbwl2-w-d-l98qb.c.openshift-gce-devel-ci.internal hyperkube[1602]: I0422 11:00:50.080375    1602 event.go:278] Event(v1.ObjectReference{Kind:"Pod", Namespace:"openshift-dns", Name:"dns-default-7gdsk", UID:"8622fba2-8aa9-413f-8f77-ac21e483de2f", APIVersion:"v1", ResourceVersion:"28850", FieldPath:""}): type: 'Warning' reason: 'FailedCreatePodSandBox' Failed to create pod sandbox: rpc error: code = Unknown desc = reserving pod sandbox name: error reserving pod name k8s_dns-default-7gdsk_openshift-dns_8622fba2-8aa9-413f-8f77-ac21e483de2f_1 for id d951756474982a09253f1f430d60b9dd98e73261fb39506ca2a3ddd8ba1d4014: name is reserved

sh-4.4# crictl ps -a  | grep dns
CONTAINER           IMAGE                                                                                                            CREATED             STATE               NAME                    ATTEMPT             POD ID
583c737881fe2       2d40a50ced18875fbec9ed13c4f31fdf41e93170db5c0d3e16de1ae711e12ceb                                                 16 minutes ago      Exited              dns-node-resolver       1                   5b3e2033407f6
3461898976e88       3c26b54d473580a7d8318725ca8efb2ae43deeb5ee09310d40ee1a1abd17986e                                                 16 minutes ago      Exited              dns                     1                   5b3e2033407f6

sh-4.4# crictl pods | grep dns
POD ID              CREATED             STATE               NAME                                                         NAMESPACE  
5b3e2033407f6       6 minutes ago       NotReady            dns-default-7gdsk                                            openshift-dns 

sh-4.4# uptime
 11:10:46 up 7 min,  0 users,  load average: 0.00, 0.04, 0.01

sh-4.4# crictl inspectp 5b3e2033407f6
{
  "status": {
    "id": "5b3e2033407f661fccc3d60c3626c90e1ca29acbfffd7b1c6fcab63be286e7b1",
    "metadata": {
      "attempt": 1,
      "name": "dns-default-7gdsk",
      "namespace": "openshift-dns",
      "uid": "8622fba2-8aa9-413f-8f77-ac21e483de2f"
    },
    "state": "SANDBOX_NOTREADY",
    "createdAt": "2020-04-22T11:03:34.629652518Z",
    "network": {
      "additionalIps": [],
      "ip": ""
    },
    "linux": {
      "namespaces": {
        "options": {
          "ipc": "POD",
          "network": "POD",
          "pid": "CONTAINER"
        }
      }
    },
    "labels": {
      "controller-revision-hash": "77dbd897bb",
      "dns.operator.openshift.io/daemonset-dns": "default",
      "io.kubernetes.container.name": "POD",
      "io.kubernetes.pod.name": "dns-default-7gdsk",
      "io.kubernetes.pod.namespace": "openshift-dns",
      "io.kubernetes.pod.uid": "8622fba2-8aa9-413f-8f77-ac21e483de2f",
      "pod-template-generation": "1"
    },
    "annotations": {
      "k8s.v1.cni.cncf.io/network-status": "[{\n    \"name\": \"openshift-sdn\",\n    \"interface\": \"eth0\",\n    \"ips\": [\n        \"10.129.2.2\"\n    ],\n    \"default\": true,\n    \"dns\": {}\n}]",
      "k8s.v1.cni.cncf.io/networks-status": "[{\n    \"name\": \"openshift-sdn\",\n    \"interface\": \"eth0\",\n    \"ips\": [\n        \"10.129.2.2\"\n    ],\n    \"default\": true,\n    \"dns\": {}\n}]",
      "kubernetes.io/config.seen": "2020-04-22T10:53:46.78104917Z",
      "kubernetes.io/config.source": "api"
    },
    "runtimeHandler": ""
  }
}

sh-4.4# crictl inspect 3461898976e88
{
  "status": {
    "id": "3461898976e888e587d984b0ad14260e2fd079533e8fa5c35384b5e7c1cd7a56",
    "metadata": {
      "attempt": 1,
      "name": "dns"
    },
    "state": "CONTAINER_EXITED",
    "createdAt": "2020-04-22T10:53:54.257996902Z",
    "startedAt": "2020-04-22T10:53:54.290027815Z",
    "finishedAt": "2020-04-22T10:54:16.652291369Z",
    "exitCode": 0,
    "image": {
      "image": "registry.svc.ci.openshift.org/ci-op-iqxg1qhq/stable@sha256:156211b7caac58589cfd2a498863527996ab23350556a2d2ddff2f0916c77665"
    },
    "imageRef": "registry.svc.ci.openshift.org/ci-op-iqxg1qhq/stable@sha256:156211b7caac58589cfd2a498863527996ab23350556a2d2ddff2f0916c77665",
    "reason": "Completed",
    "message": "",
    "labels": {
      "io.kubernetes.container.name": "dns",
      "io.kubernetes.pod.name": "dns-default-7gdsk",
      "io.kubernetes.pod.namespace": "openshift-dns",
      "io.kubernetes.pod.uid": "8622fba2-8aa9-413f-8f77-ac21e483de2f"
    },
    "annotations": {
      "io.kubernetes.container.hash": "c341caf",
      "io.kubernetes.container.ports": "[{\"name\":\"dns\",\"containerPort\":5353,\"protocol\":\"UDP\"},{\"name\":\"dns-tcp\",\"containerPort\":5353,\"protocol\":\"TCP\"}]",
      "io.kubernetes.container.restartCount": "1",
      "io.kubernetes.container.terminationMessagePath": "/dev/termination-log",
      "io.kubernetes.container.terminationMessagePolicy": "FallbackToLogsOnError",
      "io.kubernetes.pod.terminationGracePeriod": "30"
    },
    "mounts": [],
    "logPath": "/var/log/pods/openshift-dns_dns-default-7gdsk_8622fba2-8aa9-413f-8f77-ac21e483de2f/dns/1.log"
  },
  "info": {
    "sandboxID": "5b3e2033407f661fccc3d60c3626c90e1ca29acbfffd7b1c6fcab63be286e7b1",
    "pid": 3457,
    "runtimeSpec": {
    ...
    "annotations": {
        "io.container.manager": "cri-o",
        "io.kubernetes.container.hash": "c341caf",
        "io.kubernetes.container.name": "dns",
        "io.kubernetes.container.ports": "[{\"name\":\"dns\",\"containerPort\":5353,\"protocol\":\"UDP\"},{\"name\":\"dns-tcp\",\"containerPort\":5353,\"protocol\":\"TCP\"}]",
        "io.kubernetes.container.restartCount": "1",
        "io.kubernetes.container.terminationMessagePath": "/dev/termination-log",
        "io.kubernetes.container.terminationMessagePolicy": "FallbackToLogsOnError",
        "io.kubernetes.cri-o.Annotations": "{\"io.kubernetes.container.hash\":\"c341caf\",\"io.kubernetes.container.ports\":\"[{\\\"name\\\":\\\"dns\\\",\\\"containerPort\\\":5353,\\\"protocol\\\":\\\"UDP\\\"},{\\\"name\\\":\\\"dns-tcp\\\",\\\"containerPort\\\":5353,\\\"protocol\\\":\\\"TCP\\\"}]\",\"io.kubernetes.container.restartCount\":\"1\",\"io.kubernetes.container.terminationMessagePath\":\"/dev/termination-log\",\"io.kubernetes.container.terminationMessagePolicy\":\"FallbackToLogsOnError\",\"io.kubernetes.pod.terminationGracePeriod\":\"30\"}",
        "io.kubernetes.cri-o.ContainerID": "3461898976e888e587d984b0ad14260e2fd079533e8fa5c35384b5e7c1cd7a56",
        "io.kubernetes.cri-o.ContainerType": "container",
        "io.kubernetes.cri-o.Created": "2020-04-22T10:53:54.139786664Z",
        "io.kubernetes.cri-o.IP.0": "10.129.2.2",
        "io.kubernetes.cri-o.Image": "3c26b54d473580a7d8318725ca8efb2ae43deeb5ee09310d40ee1a1abd17986e",
        "io.kubernetes.cri-o.ImageName": "registry.svc.ci.openshift.org/ci-op-iqxg1qhq/stable@sha256:156211b7caac58589cfd2a498863527996ab23350556a2d2ddff2f0916c77665",
        "io.kubernetes.cri-o.ImageRef": "registry.svc.ci.openshift.org/ci-op-iqxg1qhq/stable@sha256:156211b7caac58589cfd2a498863527996ab23350556a2d2ddff2f0916c77665",
        "io.kubernetes.cri-o.Labels": "{\"io.kubernetes.container.name\":\"dns\",\"io.kubernetes.pod.name\":\"dns-default-7gdsk\",\"io.kubernetes.pod.namespace\":\"openshift-dns\",\"io.kubernetes.pod.uid\":\"8622fba2-8aa9-413f-8f77-ac21e483de2f\"}",
        "io.kubernetes.cri-o.LogPath": "/var/log/pods/openshift-dns_dns-default-7gdsk_8622fba2-8aa9-413f-8f77-ac21e483de2f/dns/1.log",
        "io.kubernetes.cri-o.Metadata": "{\"name\":\"dns\",\"attempt\":1}",
        "io.kubernetes.cri-o.MountPoint": "/var/lib/containers/storage/overlay/847e9be68abd5c672020690abffc9a64b6d42498e1642d2fda2980000c8a5d9d/merged",
        "io.kubernetes.cri-o.Name": "k8s_dns_dns-default-7gdsk_openshift-dns_8622fba2-8aa9-413f-8f77-ac21e483de2f_1",
        "io.kubernetes.cri-o.ResolvPath": "/var/run/containers/storage/overlay-containers/5b3e2033407f661fccc3d60c3626c90e1ca29acbfffd7b1c6fcab63be286e7b1/userdata/resolv.conf",
        "io.kubernetes.cri-o.SandboxID": "5b3e2033407f661fccc3d60c3626c90e1ca29acbfffd7b1c6fcab63be286e7b1",
        "io.kubernetes.cri-o.SandboxName": "k8s_dns-default-7gdsk_openshift-dns_8622fba2-8aa9-413f-8f77-ac21e483de2f_1",
        "io.kubernetes.cri-o.SeccompProfilePath": "",
        "io.kubernetes.cri-o.Stdin": "false",
        "io.kubernetes.cri-o.StdinOnce": "false",
        "io.kubernetes.cri-o.TTY": "false",
        "io.kubernetes.cri-o.Volumes": "[{\"container_path\":\"/etc/coredns\",\"host_path\":\"/var/lib/kubelet/pods/8622fba2-8aa9-413f-8f77-ac21e483de2f/volumes/kubernetes.io~configmap/config-volume\",\"readonly\":true},{\"container_path\":\"/etc/hosts\",\"host_path\":\"/var/lib/kubelet/pods/8622fba2-8aa9-413f-8f77-ac21e483de2f/etc-hosts\",\"readonly\":false},{\"container_path\":\"/dev/termination-log\",\"host_path\":\"/var/lib/kubelet/pods/8622fba2-8aa9-413f-8f77-ac21e483de2f/containers/dns/845cbb32\",\"readonly\":false},{\"container_path\":\"/var/run/secrets/kubernetes.io/serviceaccount\",\"host_path\":\"/var/lib/kubelet/pods/8622fba2-8aa9-413f-8f77-ac21e483de2f/volumes/kubernetes.io~secret/dns-token-7h6dr\",\"readonly\":true}]",
        "io.kubernetes.pod.name": "dns-default-7gdsk",
        "io.kubernetes.pod.namespace": "openshift-dns",
        "io.kubernetes.pod.terminationGracePeriod": "30",
        "io.kubernetes.pod.uid": "8622fba2-8aa9-413f-8f77-ac21e483de2f",
        "org.systemd.property.TimeoutStopUSec": "uint64 30000000"
      },

Comment 21 Casey Callendrello 2020-04-22 11:50:43 UTC
For testing, I did a "crictl rm -a" and everything started back up.

Comment 22 Ryan Phillips 2020-04-22 15:59:46 UTC
A new CRI-O build hit the release images yesterday that should help with this issue. A search on search.ci today shows only one instance of this in the last 24 hours.

Comment 23 Ryan Phillips 2020-04-22 17:40:15 UTC
We are going to split out Casey's bug report into [1] to fix the "name is already reserved" errors across reboots.

1. https://bugzilla.redhat.com/show_bug.cgi?id=1826895

Comment 24 Robert Krawitz 2020-04-22 23:04:42 UTC
With build 4.4.0-0.nightly-2020-04-22-135638, still observing numerous errors of the forms:

clusterbuster-67-0-client-5546759dbc-n2m4l.desc:Output: Failed to start transient scope unit: Input/output error
clusterbuster-67-0-client-5546759dbc-n2m4l.desc:  Warning  FailedMount  53m  kubelet, ip-10-0-174-124.us-east-2.compute.internal  MountVolume.SetUp failed for volume "secret-clusterbuster-67-0-54" : mount failed: exit status 1

clusterbuster-7-0-client-6997c588bf-d8hdg.desc:  Warning  Failed       52m (x4 over 54m)   kubelet, ip-10-0-153-164.us-east-2.compute.internal  (combined from similar events): Error: container create failed: time="2020-04-22T22:00:43Z" level=error msg="systemd cgroup flag passed, but systemd support for managing cgroups is not available"

clusterbuster-7-0-client-6997c588bf-tzn47.desc:  Warning  Failed       53m (x17 over 56m)     kubelet, ip-10-0-174-124.us-east-2.compute.internal  (combined from similar events): Error: error reserving ctr name k8s_c1_clusterbuster-7-0-client-6997c588bf-tzn47_clusterbuster-7_2ffc7012-a6eb-4cf2-aff8-28196232f614_0 for id ae12985175c08173c7e6f074f488d43e02a3f41ada2fa297437017055e40a846: name is reserved

Attaching 'oc describe pod' output for all pods not running, and 'oc adm node-logs' for the worker nodes.

Comment 25 Robert Krawitz 2020-04-22 23:06:53 UTC
Created attachment 1680997 [details]
oc describe for the pods that did not come up with 4.4.0-0.nightly-2020-04-22-135638

Comment 28 Ryan Phillips 2020-05-11 22:40:51 UTC
Robert: Could you retest this with 4.4? We believe this issue may be cleared up with recent crio and networking patches.

Comment 30 Robert Krawitz 2020-05-15 14:50:06 UTC
Still fails:

Events:
  Type     Reason                  Age                   From                                                Message
  ----     ------                  ----                  ----                                                -------
  Normal   Scheduled               <unknown>             default-scheduler                                   Successfully assigned clusterbuster-94/clusterbuster-94-0-server-d49bb6b67-zsk7s to ip-10-0-132-78.us-west-2.compute.internal
  Warning  FailedCreatePodSandBox  15m                   kubelet, ip-10-0-132-78.us-west-2.compute.internal  Failed to create pod sandbox: rpc error: code = Unknown desc = error reserving pod name k8s_clusterbuster-94-0-server-d49bb6b67-zsk7s_clusterbuster-94_7223e994-79c2-499a-92f5-0500dc777044_0 for id f0f4369462edd57d9e3df12440db76543a37de3f5039b8dd4941342852ccac73: name is reserved
  Warning  FailedCreatePodSandBox  15m                   kubelet, ip-10-0-132-78.us-west-2.compute.internal  Failed to create pod sandbox: rpc error: code = Unknown desc = error reserving pod name k8s_clusterbuster-94-0-server-d49bb6b67-zsk7s_clusterbuster-94_7223e994-79c2-499a-92f5-0500dc777044_0 for id 536452e35a353e296e80fc074e01b550cb995677f78b31ac8cced669d1495e0d: name is reserved
  Warning  FailedCreatePodSandBox  15m                   kubelet, ip-10-0-132-78.us-west-2.compute.internal  Failed to create pod sandbox: rpc error: code = Unknown desc = error reserving pod name k8s_clusterbuster-94-0-server-d49bb6b67-zsk7s_clusterbuster-94_7223e994-79c2-499a-92f5-0500dc777044_0 for id df843a29604ced431ceb8247f6696f78d27e79fd3462e90bdbcf0974384baaf7: name is reserved
  Warning  FailedCreatePodSandBox  14m                   kubelet, ip-10-0-132-78.us-west-2.compute.internal  Failed to create pod sandbox: rpc error: code = Unknown desc = error reserving pod name k8s_clusterbuster-94-0-server-d49bb6b67-zsk7s_clusterbuster-94_7223e994-79c2-499a-92f5-0500dc777044_0 for id 3ecd2e5ca4c465cc0fbf983894e7bc49b3f918a6b67dae209092eeadeb462950: name is reserved
  Warning  FailedCreatePodSandBox  14m                   kubelet, ip-10-0-132-78.us-west-2.compute.internal  Failed to create pod sandbox: rpc error: code = Unknown desc = error reserving pod name k8s_clusterbuster-94-0-server-d49bb6b67-zsk7s_clusterbuster-94_7223e994-79c2-499a-92f5-0500dc777044_0 for id f92e9bba16e05da9157264a436797560666e9aa8a38fb1fa9d822e444dfffa81: name is reserved
  Warning  FailedCreatePodSandBox  14m                   kubelet, ip-10-0-132-78.us-west-2.compute.internal  Failed to create pod sandbox: rpc error: code = Unknown desc = error reserving pod name k8s_clusterbuster-94-0-server-d49bb6b67-zsk7s_clusterbuster-94_7223e994-79c2-499a-92f5-0500dc777044_0 for id 6e765421f8c771127d1b5dd2bae9818d33ccfc37146ffa529f2d12f5c17977fc: name is reserved
  Warning  FailedCreatePodSandBox  14m                   kubelet, ip-10-0-132-78.us-west-2.compute.internal  Failed to create pod sandbox: rpc error: code = Unknown desc = error reserving pod name k8s_clusterbuster-94-0-server-d49bb6b67-zsk7s_clusterbuster-94_7223e994-79c2-499a-92f5-0500dc777044_0 for id 7db67ebe8c88d72a0a38808e907e97c72b0a36d9120c9ee0029344edd298e906: name is reserved
  Warning  FailedCreatePodSandBox  13m                   kubelet, ip-10-0-132-78.us-west-2.compute.internal  Failed to create pod sandbox: rpc error: code = Unknown desc = error reserving pod name k8s_clusterbuster-94-0-server-d49bb6b67-zsk7s_clusterbuster-94_7223e994-79c2-499a-92f5-0500dc777044_0 for id d3f4486da4dab3f76664a9f587a4a18dfd1e542eefd11d8882bb8144787374f4: name is reserved
  Warning  FailedCreatePodSandBox  13m                   kubelet, ip-10-0-132-78.us-west-2.compute.internal  Failed to create pod sandbox: rpc error: code = Unknown desc = error reserving pod name k8s_clusterbuster-94-0-server-d49bb6b67-zsk7s_clusterbuster-94_7223e994-79c2-499a-92f5-0500dc777044_0 for id 0f9ed52678c6db30927084429c07b7799033dfb6f0f7ecd5158f7e79eef25364: name is reserved
  Warning  FailedCreatePodSandBox  2m38s (x52 over 11m)  kubelet, ip-10-0-132-78.us-west-2.compute.internal  (combined from similar events): Failed to create pod sandbox: rpc error: code = Unknown desc = error reserving pod name k8s_clusterbuster-94-0-server-d49bb6b67-zsk7s_clusterbuster-94_7223e994-79c2-499a-92f5-0500dc777044_0 for id 29d501da962ece8da651a7d2359ec4843e3fd0c85e4fd1592974dc5facb8fc59: name is reserved
[4(rkrawitz)||||{!8}<rkrawitz>/home/rkrawitz]
$ oc version
Client Version: v4.4.0
Server Version: 4.4.4
Kubernetes Version: v1.17.1
[4(rkrawitz)||||{!9}<rkrawitz>/home/rkrawitz]
$ oc get pods -A -l clusterbuster |grep -v Running
NAMESPACE          NAME                                         READY   STATUS                 RESTARTS   AGE
clusterbuster-17   clusterbuster-17-1-client-78bc87fc6c-s66hs   4/5     CreateContainerError   0          22m
clusterbuster-25   clusterbuster-25-1-client-656c9df978-lc5x4   4/5     CreateContainerError   0          22m
clusterbuster-28   clusterbuster-28-0-client-765fb8865f-lgvrk   4/5     CreateContainerError   0          21m
clusterbuster-43   clusterbuster-43-0-client-77947f4bfb-2jp58   4/5     CreateContainerError   0          22m
clusterbuster-50   clusterbuster-50-0-client-64cb9bb9f5-l6ncg   0/5     ContainerCreating      0          21m
clusterbuster-50   clusterbuster-50-0-server-7fd5d9f684-cbzkj   0/1     ContainerCreating      0          21m
clusterbuster-63   clusterbuster-63-0-client-6f7c9897d9-7ps7b   0/5     ContainerCreating      0          21m
clusterbuster-63   clusterbuster-63-0-client-6f7c9897d9-dvlpp   0/5     ContainerCreating      0          21m
clusterbuster-63   clusterbuster-63-0-client-6f7c9897d9-zglsf   0/5     ContainerCreating      0          21m
clusterbuster-63   clusterbuster-63-0-server-6c5749c597-rvxlr   0/1     ContainerCreating      0          21m
clusterbuster-70   clusterbuster-70-0-client-b5f7f7c9b-6rqsh    0/5     ContainerCreating      0          21m
clusterbuster-70   clusterbuster-70-0-client-b5f7f7c9b-lkjn2    0/5     ContainerCreating      0          21m
clusterbuster-70   clusterbuster-70-0-client-b5f7f7c9b-stmd2    0/5     ContainerCreating      0          21m
clusterbuster-70   clusterbuster-70-0-server-84784fcf68-9qqqg   0/1     ContainerCreating      0          21m
clusterbuster-8    clusterbuster-8-1-client-69dc5896fb-7zn4d    0/5     ContainerCreating      0          21m
clusterbuster-8    clusterbuster-8-1-client-69dc5896fb-ljztb    0/5     ContainerCreating      0          21m
clusterbuster-8    clusterbuster-8-1-client-69dc5896fb-m6qwm    0/5     ContainerCreating      0          21m
clusterbuster-8    clusterbuster-8-1-server-7fbf86d848-46d9v    0/1     ContainerCreating      0          21m
clusterbuster-80   clusterbuster-80-1-server-dbc9988bc-lh4h6    0/1     ContainerCreating      0          21m
clusterbuster-94   clusterbuster-94-0-server-d49bb6b67-zsk7s    0/1     ContainerCreating      0          21m

Comment 33 Peter Hunt 2020-06-17 17:21:42 UTC
The work on tracing this is ongoing.

Comment 34 ian zhang 2020-07-06 19:52:56 UTC
Hi team, 

I'm seeing a similar bug: when one of the pods is OOMKilled, the pod gets stuck in `CreateContainerError`.

```
Events:
  Type     Reason   Age                      From                                  Message
  ----     ------   ----                     ----                                  -------
  Warning  Failed   41m (x15945 over 3d9h)   kubelet, ip-10-0-159-69.ec2.internal  (combined from similar events): Error: error reserving ctr name k8s_multicluster-operators-hub-subscription_multicluster-operators-hub-subscription-6dccfbc68d-s887d_open-cluster-management_b76d10cb-e880-4b93-965c-46541d0f79fe_3 for id 63150107244ed77008c71037c31a526fe8d06b06b88457cfa9369e87106d654d: name is reserved
  Normal   Pulling  92s (x16101 over 3d17h)  kubelet, ip-10-0-159-69.ec2.internal  Pulling image "quay.io/open-cluster-management/multicluster-operators-subscription@sha256:f78cde6d1dee42da5a93acae7d1e464d64a42d8584c1464e465e1d17e2b915e0"
```

Is there a way I can provide some logs to help you move forward? I have to manually delete the pod to get a new pod created...

Comment 35 ian zhang 2020-07-06 19:55:15 UTC
I'm at OCP 4.4.10

Comment 36 Peter Hunt 2020-07-07 19:52:11 UTC
We have found this happens most often when the cloud providers are throttling IOPs. Unfortunately, if that's happening, there's little we (the node team) can do. It's largely up to the application developer to reduce the number of sync() calls (or writes to disk in general).

If you run into this problem again, it would be helpful to check with your cloud provider if you're having your IOPs throttled. If not, gathering information from CRI-O, kubelet, and the networking namespaces will help us find if there's another bottleneck causing this.

Comment 37 W. Trevor King 2020-07-07 20:40:47 UTC
> We have found this happens most often when the cloud providers are throttling IOPs. Unfortunately, if that's happening, there's little we (the node team) can do.

Is this a CRI-O race that IOPs throttling exposes?  I can't think of any other reason why IOPs throttling would lead to CRI-O getting confused about whether the name was already reserved or not.

Comment 38 Peter Hunt 2020-07-07 20:55:23 UTC
we're still investigating the precise root cause, but the common case seems to be:

kubelet asks CRI-O "you have 4 minutes to make this container/pod"
CRI-O works on that, and gets stuck somewhere
kubelet says "you've taken more than 4 minutes, I'm going to assume my request went into the void, and ask again"
CRI-O says "I already reserved that name, because I'm still working on that request"

So the big question is where CRI-O is getting stuck. We have, so far, found a correlation between this happening and cloud IOPs throttling, and are looking into mitigations and making sure it's not CRI-O's fault.
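
As an illustration of the flow described above, here is a minimal Go sketch of a name-reservation registry. This is not CRI-O's actual implementation; the `Registrar` type and the example names are hypothetical. The point it shows is that the original request holds the name while it keeps working past the caller's deadline, so a retried request for the same name is rejected with "name is reserved" until the original finishes or its partial state is cleaned up.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// ErrNameReserved mirrors the "name is reserved" failure mode discussed above.
var ErrNameReserved = errors.New("name is reserved")

// Registrar is a toy name-reservation registry: a name can be held by at most
// one in-flight creation request at a time.
type Registrar struct {
	mu       sync.Mutex
	reserved map[string]bool
}

func NewRegistrar() *Registrar {
	return &Registrar{reserved: make(map[string]bool)}
}

// Reserve marks a name as in use, failing if another request already holds it.
func (r *Registrar) Reserve(name string) error {
	r.mu.Lock()
	defer r.mu.Unlock()
	if r.reserved[name] {
		return ErrNameReserved
	}
	r.reserved[name] = true
	return nil
}

// Release frees the name, e.g. after the original creation finishes or its
// partial state is cleaned up.
func (r *Registrar) Release(name string) {
	r.mu.Lock()
	defer r.mu.Unlock()
	delete(r.reserved, name)
}

func main() {
	reg := NewRegistrar()
	name := "k8s_example-pod_example-ns_uid_0" // hypothetical pod name

	// The original creation request reserves the name and keeps working.
	fmt.Println("original request reserve:", reg.Reserve(name)) // <nil>

	// The caller's deadline expires and it retries; the name is still held by
	// the original request, so the retry fails.
	fmt.Println("retried request reserve:", reg.Reserve(name)) // name is reserved

	// Only after the original request completes (or is cleaned up) and releases
	// the name can a new reservation succeed.
	reg.Release(name)
	fmt.Println("reserve after release:", reg.Reserve(name)) // <nil>
}
```

In other words, the error comes from the retry colliding with the still-in-flight original request, not from a second pod genuinely claiming the same name.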

Comment 39 Phil Cameron 2020-07-07 21:01:10 UTC
The error message doesn't appear to help understand what is happening. Maybe a more informative error message could be generated.

Comment 40 W. Trevor King 2020-07-07 21:08:32 UTC
So identifying where CRI-O is stuck, as you say, would help.  But if CRI-O is just slow, and not stuck, giving it a way to gracefully pick up the new kubelet request "ah, I was already working on that one.  I'll get back to you once it's ready" seems better than saying "name is reserved, buzz off".  Not sure how hard that would be to implement in CRI-O, but it seems at least theoretically possible :p.  Is the trouble figuring out whether the new request is truly a re-queue attempt, or just something that happens to use the same pod name but diverges for other properties?  That might be a CRI limitation; maybe there should be request IDs that the kubelet would need to return on re-queue for non-idempotent requests?

Comment 41 Robert Krawitz 2020-07-07 22:31:19 UTC
On two bare metal nodes (32 and 40 cores respectively, SATA SSD, 192 GB RAM), I ran

clusterbuster -P server -b 5 -p 10 -D .01 -M 1 -N 90 -r 4 -d 2 -c 5 -m 16000 -v -s 10 -x

This attempts to create 900 pods across the two nodes.  I did not get any of the "error reserving name" messages.  I did have four pods fail to start as follows, which looks more like https://bugzilla.redhat.com/show_bug.cgi?id=1789973:



  Normal   Scheduled    24m   default-scheduler   Successfully assigned clusterbuster-5/clusterbuster-5-0-client-7759c5bcbf-z24h9 to worker000
  Warning  FailedMount  24m   kubelet, worker000  MountVolume.SetUp failed for volume "secret-clusterbuster-5-0-5" : mount failed: exit status 1
Mounting command: systemd-run
Mounting arguments: --description=Kubernetes transient mount for /var/lib/kubelet/pods/86524197-f91e-41ce-b067-bacbba9eb605/volumes/kubernetes.io~secret/secret-clusterbuster-5-0-5 --scope -- mount -t tmpfs tmpfs /var/lib/kubelet/pods/86524197-f91e-41ce-b067-bacbba9eb605/volumes/kubernetes.io~secret/secret-clusterbuster-5-0-5
Output: Failed to start transient scope unit: Input/output error
  Warning  FailedMount  24m  kubelet, worker000  MountVolume.SetUp failed for volume "secret-clusterbuster-5-0-0" : mount failed: exit status 1
Mounting command: systemd-run
Mounting arguments: --description=Kubernetes transient mount for /var/lib/kubelet/pods/86524197-f91e-41ce-b067-bacbba9eb605/volumes/kubernetes.io~secret/secret-clusterbuster-5-0-0 --scope -- mount -t tmpfs tmpfs /var/lib/kubelet/pods/86524197-f91e-41ce-b067-bacbba9eb605/volumes/kubernetes.io~secret/secret-clusterbuster-5-0-0
Output: Failed to start transient scope unit: Input/output error
  Warning  FailedMount  24m  kubelet, worker000  MountVolume.SetUp failed for volume "secret-clusterbuster-5-0-4" : mount failed: exit status 1
Mounting command: systemd-run
Mounting arguments: --description=Kubernetes transient mount for /var/lib/kubelet/pods/86524197-f91e-41ce-b067-bacbba9eb605/volumes/kubernetes.io~secret/secret-clusterbuster-5-0-4 --scope -- mount -t tmpfs tmpfs /var/lib/kubelet/pods/86524197-f91e-41ce-b067-bacbba9eb605/volumes/kubernetes.io~secret/secret-clusterbuster-5-0-4
Output: Failed to start transient scope unit: Input/output error
  Warning  FailedMount  24m  kubelet, worker000  MountVolume.SetUp failed for volume "secret-clusterbuster-5-0-3" : mount failed: exit status 1
Mounting command: systemd-run
Mounting arguments: --description=Kubernetes transient mount for /var/lib/kubelet/pods/86524197-f91e-41ce-b067-bacbba9eb605/volumes/kubernetes.io~secret/secret-clusterbuster-5-0-3 --scope -- mount -t tmpfs tmpfs /var/lib/kubelet/pods/86524197-f91e-41ce-b067-bacbba9eb605/volumes/kubernetes.io~secret/secret-clusterbuster-5-0-3
Output: Failed to start transient scope unit: Input/output error
  Warning  FailedCreatePodContainer  23m                        kubelet, worker000  unable to ensure pod container exists: failed to create container for [kubepods besteffort pod86524197-f91e-41ce-b067-bacbba9eb605] : Message recipient disconnected from message bus without replying
  Warning  FailedMount               17m                        kubelet, worker000  Unable to attach or mount volumes: unmounted volumes=[secret-clusterbuster-5-0-6], unattached volumes=[secret-clusterbuster-5-0-0 secret-clusterbuster-5-0-5 secret-clusterbuster-5-0-3 secret-clusterbuster-5-0-6 secret-clusterbuster-5-0-7 default-token-f9wh8 secret-clusterbuster-5-0-1 secret-clusterbuster-5-0-2 secret-clusterbuster-5-0-4 secret-clusterbuster-5-0-9 secret-clusterbuster-5-0-8]: timed out waiting for the condition
  Warning  FailedMount               15m                        kubelet, worker000  Unable to attach or mount volumes: unmounted volumes=[secret-clusterbuster-5-0-6], unattached volumes=[secret-clusterbuster-5-0-2 secret-clusterbuster-5-0-6 secret-clusterbuster-5-0-0 secret-clusterbuster-5-0-7 secret-clusterbuster-5-0-9 default-token-f9wh8 secret-clusterbuster-5-0-3 secret-clusterbuster-5-0-1 secret-clusterbuster-5-0-4 secret-clusterbuster-5-0-5 secret-clusterbuster-5-0-8]: timed out waiting for the condition
  Warning  FailedMount               12m                        kubelet, worker000  Unable to attach or mount volumes: unmounted volumes=[secret-clusterbuster-5-0-6], unattached volumes=[secret-clusterbuster-5-0-4 secret-clusterbuster-5-0-8 secret-clusterbuster-5-0-9 secret-clusterbuster-5-0-0 secret-clusterbuster-5-0-2 secret-clusterbuster-5-0-3 secret-clusterbuster-5-0-7 secret-clusterbuster-5-0-1 secret-clusterbuster-5-0-6 default-token-f9wh8 secret-clusterbuster-5-0-5]: timed out waiting for the condition
  Warning  FailedMount               10m                        kubelet, worker000  Unable to attach or mount volumes: unmounted volumes=[secret-clusterbuster-5-0-6], unattached volumes=[secret-clusterbuster-5-0-3 secret-clusterbuster-5-0-8 secret-clusterbuster-5-0-5 secret-clusterbuster-5-0-9 secret-clusterbuster-5-0-0 secret-clusterbuster-5-0-1 secret-clusterbuster-5-0-4 secret-clusterbuster-5-0-6 secret-clusterbuster-5-0-7 default-token-f9wh8 secret-clusterbuster-5-0-2]: timed out waiting for the condition
  Warning  FailedMount               8m12s                      kubelet, worker000  Unable to attach or mount volumes: unmounted volumes=[secret-clusterbuster-5-0-6], unattached volumes=[secret-clusterbuster-5-0-1 secret-clusterbuster-5-0-7 secret-clusterbuster-5-0-8 default-token-f9wh8 secret-clusterbuster-5-0-0 secret-clusterbuster-5-0-2 secret-clusterbuster-5-0-6 secret-clusterbuster-5-0-5 secret-clusterbuster-5-0-9 secret-clusterbuster-5-0-3 secret-clusterbuster-5-0-4]: timed out waiting for the condition
  Warning  FailedMount               <invalid> (x4 over 5m57s)  kubelet, worker000  (combined from similar events): Unable to attach or mount volumes: unmounted volumes=[secret-clusterbuster-5-0-6], unattached volumes=[secret-clusterbuster-5-0-3 secret-clusterbuster-5-0-4 secret-clusterbuster-5-0-5 secret-clusterbuster-5-0-7 secret-clusterbuster-5-0-9 secret-clusterbuster-5-0-1 secret-clusterbuster-5-0-6 default-token-f9wh8 secret-clusterbuster-5-0-8 secret-clusterbuster-5-0-0 secret-clusterbuster-5-0-2]: timed out waiting for the condition

Comment 42 Robert Krawitz 2020-07-07 22:33:22 UTC
When I've seen this, after a while there is no evidence of any progress (leave it running for hours and no more pods ever successfully get going).

BTW in comment #41 these were internal lab nodes, running 4.5.0-rc.1.

Comment 43 Peter Hunt 2020-07-09 16:58:56 UTC
The work here is ongoing.

Comment 44 Ramakrishnan Periyasamy 2020-08-20 09:43:42 UTC
Observed this problem during OCS scale testing.

With 1200+ app pods running in the cluster, I performed a reboot of the worker node where the app pods were running; the pods failed to come back up and were stuck in ContainerCreating or CreateContainerError state.

namespace-test-ecb08c55a6a0476fa2ed3bd27841363e    pod-test-rbd-e6ca6d8a582840789bd2a0f02ea9f77e             0/1     ContainerCreating      0          27h     <none>         compute-0         <none>           <none>
namespace-test-ecb08c55a6a0476fa2ed3bd27841363e    pod-test-rbd-fa31c9f796a0433eb57a4c74cc952643             0/1     CreateContainerError   0          27h     10.130.4.6     compute-0         <none>           <none>
namespace-test-d6e181cc799449788ad4857e3c5815c1    pod-test-rbd-9270d3c7e97a4e1f90399f8dd6be800d             0/1     CreateContainerError   0          43h     10.130.4.15    compute-0         <none>           <none>
namespace-test-d6e181cc799449788ad4857e3c5815c1    pod-test-rbd-a1eedbff72a34076b470f6cb50fb5b66             0/1     CreateContainerError   0          41h     10.130.4.16    compute-0         <none>           <none>
namespace-test-ecb08c55a6a0476fa2ed3bd27841363e    pod-test-cephfs-971ac22ac9984ee8b68f610e573f6c68          0/1     ContainerCreating      0          30h     <none>         compute-0         <none>           <none>

Describe command output of a pod in the CreateContainerError state:
Events:
  Type     Reason                 Age                    From                Message
  ----     ------                 ----                   ----                -------
  Normal   SuccessfulMountVolume  55m                    kubelet, compute-0  MapVolume.MapPodDevice succeeded for volume "pvc-106aee15-39b3-4c6a-855f-32d7d1917047" globalMapPath "/var/lib/kubelet/plugins/kubernetes.io/csi/volumeDevices/pvc-106aee15-39b3-4c6a-855f-32d7d1917047/dev"
  Normal   SuccessfulMountVolume  55m                    kubelet, compute-0  MapVolume.MapPodDevice succeeded for volume "pvc-106aee15-39b3-4c6a-855f-32d7d1917047" volumeMapPath "/var/lib/kubelet/pods/15531a60-281a-42fc-b099-0efea77deb66/volumeDevices/kubernetes.io~csi"
  Normal   AddedInterface         55m                    multus              Add eth0 [10.130.4.6/22]
  Warning  Failed                 52m                    kubelet, compute-0  Error: reserving container name: error reserving ctr name k8s_web-server_pod-test-rbd-fa31c9f796a0433eb57a4c74cc952643_namespace-test-ecb08c55a6a0476fa2ed3bd27841363e_15531a60-281a-42fc-b099-0efea77deb66_0 for id 8e5faa0757ca48607a2858a96ef686ba59b41b6cd4643bb7b3ba3916dcb7b9f7: name is reserved
  Warning  Failed                 52m                    kubelet, compute-0  Error: reserving container name: error reserving ctr name k8s_web-server_pod-test-rbd-fa31c9f796a0433eb57a4c74cc952643_namespace-test-ecb08c55a6a0476fa2ed3bd27841363e_15531a60-281a-42fc-b099-0efea77deb66_0 for id e5a40cc24adb9ad4b56453ed31dc399253f8f4445b95a78096449505611e35a7: name is reserved
  Warning  Failed                 52m                    kubelet, compute-0  Error: reserving container name: error reserving ctr name k8s_web-server_pod-test-rbd-fa31c9f796a0433eb57a4c74cc952643_namespace-test-ecb08c55a6a0476fa2ed3bd27841363e_15531a60-281a-42fc-b099-0efea77deb66_0 for id 75684aa526168039a875a9d355daba78928da45d17b4978d7035b5e13db0a9f0: name is reserved
  Warning  Failed                 52m                    kubelet, compute-0  Error: reserving container name: error reserving ctr name k8s_web-server_pod-test-rbd-fa31c9f796a0433eb57a4c74cc952643_namespace-test-ecb08c55a6a0476fa2ed3bd27841363e_15531a60-281a-42fc-b099-0efea77deb66_0 for id 56c0c56bb9d435ae931f8be21ac31fb853fb99aca1c84c69a44406c424c979e6: name is reserved
  Warning  Failed                 51m                    kubelet, compute-0  Error: reserving container name: error reserving ctr name k8s_web-server_pod-test-rbd-fa31c9f796a0433eb57a4c74cc952643_namespace-test-ecb08c55a6a0476fa2ed3bd27841363e_15531a60-281a-42fc-b099-0efea77deb66_0 for id 1cdd9025563a5bd9dc2db07b42371143ce5a8e3ad5be9334701ea9038ce4fc13: name is reserved

Describe output of a pod in the ContainerCreating state:
Events:
  Type     Reason                  Age                   From                Message
  ----     ------                  ----                  ----                -------
  Normal   AddedInterface          65m                   multus              Add eth0 [10.130.4.33/22]
  Warning  FailedCreatePodSandBox  61m                   kubelet, compute-0  Failed to create pod sandbox: rpc error: code = DeadlineExceeded desc = context deadline exceeded
  Warning  FailedCreatePodSandBox  61m                   kubelet, compute-0  Failed to create pod sandbox: rpc error: code = Unknown desc = reserving pod sandbox name: error reserving pod name k8s_pod-test-cephfs-971ac22ac9984ee8b68f610e573f6c68_namespace-test-ecb08c55a6a0476fa2ed3bd27841363e_f4dc3d4b-da69-4739-a1ba-376a5229da72_0 for id 527b6fb01df3b396b2af10e504806e2a0a48669dc2add310067a97b8e62d6ed4: name is reserved
  Warning  FailedCreatePodSandBox  61m                   kubelet, compute-0  Failed to create pod sandbox: rpc error: code = Unknown desc = reserving pod sandbox name: error reserving pod name k8s_pod-test-cephfs-971ac22ac9984ee8b68f610e573f6c68_namespace-test-ecb08c55a6a0476fa2ed3bd27841363e_f4dc3d4b-da69-4739-a1ba-376a5229da72_0 for id 47c98334447b6e97bfad2b41981ac7f7534346d5910a0698c3571106c5075ff3: name is reserved
  Warning  FailedCreatePodSandBox  61m                   kubelet, compute-0  Failed to create pod sandbox: rpc error: code = Unknown desc = reserving pod sandbox name: error reserving pod name k8s_pod-test-cephfs-971ac22ac9984ee8b68f610e573f6c68_namespace-test-ecb08c55a6a0476fa2ed3bd27841363e_f4dc3d4b-da69-4739-a1ba-376a5229da72_0 for id 9ab25490e7dd78cab3ba74731206dbaee72ae1e798d1413cc350a6884da5bf5e: name is reserved
  Warning  FailedCreatePodSandBox  61m                   kubelet, compute-0  Failed to create pod sandbox: rpc error: code = Unknown desc = reserving pod sandbox name: error reserving pod name k8s_pod-test-cephfs-971ac22ac9984ee8b68f610e573f6c68_namespace-test-ecb08c55a6a0476fa2ed3bd27841363e_f4dc3d4b-da69-4739-a1ba-376a5229da72_0 for id 896236a6cbd2e19f3cac259e70b2e1c765262358f557fb40d8b92d700fbd1bb7: name is reserved
  Warning  FailedCreatePodSandBox  60m                   kubelet, compute-0  Failed to create pod sandbox: rpc error: code = Unknown desc = reserving pod sandbox name: error reserving pod name k8s_pod-test-cephfs-971ac22ac9984ee8b68f610e573f6c68_namespace-test-ecb08c55a6a0476fa2ed3bd27841363e_f4dc3d4b-da69-4739-a1ba-376a5229da72_0 for id 2ef187adced2eab3c5a98c2699b3767ecaf29dc8f6ca6462b2fba97c65528123: name is reserved
  Warning  FailedCreatePodSandBox  60m                   kubelet, compute-0  Failed to create pod sandbox: rpc error: code = Unknown desc = reserving pod sandbox name: error reserving pod name k8s_pod-test-cephfs-971ac22ac9984ee8b68f610e573f6c68_namespace-test-ecb08c55a6a0476fa2ed3bd27841363e_f4dc3d4b-da69-4739-a1ba-376a5229da72_0 for id 1762694ca90f7c514ec552674baf332fe317ad0869aeebe66b6a5961377870a5: name is reserved
  Warning  FailedCreatePodSandBox  60m                   kubelet, compute-0  Failed to create pod sandbox: rpc error: code = Unknown desc = reserving pod sandbox name: error reserving pod name k8s_pod-test-cephfs-971ac22ac9984ee8b68f610e573f6c68_namespace-test-ecb08c55a6a0476fa2ed3bd27841363e_f4dc3d4b-da69-4739-a1ba-376a5229da72_0 for id a09655c06651241a2e9eadee480ce10353a1078c285fd96eeac7637fbd99ddff: name is reserved
  Warning  FailedCreatePodSandBox  60m                   kubelet, compute-0  Failed to create pod sandbox: rpc error: code = Unknown desc = reserving pod sandbox name: error reserving pod name k8s_pod-test-cephfs-971ac22ac9984ee8b68f610e573f6c68_namespace-test-ecb08c55a6a0476fa2ed3bd27841363e_f4dc3d4b-da69-4739-a1ba-376a5229da72_0 for id 4bf5d712dc904ce123db8a40891e5f8c23aca40bf097baacbab8027c8af5bde8: name is reserved
  Warning  FailedCreatePodSandBox  110s (x245 over 57m)  kubelet, compute-0  (combined from similar events): Failed to create pod sandbox: rpc error: code = Unknown desc = reserving pod sandbox name: error reserving pod name k8s_pod-test-cephfs-971ac22ac9984ee8b68f610e573f6c68_namespace-test-ecb08c55a6a0476fa2ed3bd27841363e_f4dc3d4b-da69-4739-a1ba-376a5229da72_0 for id fb15404e021712aa46faf7deeee6c0b6f0ca59c249870ea728627810bce08830: name is reserved

OCP Version:
Client Version: 4.5.0-0.nightly-2020-07-31-162914
Server Version: 4.5.0-0.nightly-2020-08-10-150345
Kubernetes Version: v1.18.3+002a51f

OCS Version:
ocs-operator.v4.5.0-508.ci

Comment 45 krishnaram Karthick 2020-08-20 10:50:56 UTC
I see a lot of great progress made on this bug over time. Do we have any approximate timeline for when we'd have a fix? Some of our OCS scale tests seem to have failed with this issue recently during OCS 4.5 testing.

Comment 47 Peter Hunt 2020-08-20 19:43:39 UTC
The timeline is hard to tell. We have some fixes that went into OpenShift 4.5.4 that reduce the likelihood of a leak. We are also changing the error string in 4.6 so it's more informative (https://github.com/cri-o/cri-o/pull/4002). This most likely happens now because some other resource (node IOPs/SDN) takes longer than the kubelet wants.

The correct thing for CRI-O to do when the kubelet attempts to retry requests is to ignore them (which is what "error reserving pod name..." is saying), so the node doesn't get further inundated with pod creation requests.

If the node eventually reconciles, then there is no bug. We are keeping this bug around to make sure we are indeed eventually reconciling as expected.

Comment 51 Ben Bennett 2020-09-25 20:49:31 UTC
*** Bug 1881868 has been marked as a duplicate of this bug. ***

Comment 54 Peter Hunt 2020-10-02 19:36:54 UTC
I am making some progress here. Currently, CRI-O undoes all of its progress when it detects a timeout. I am working on having it instead save that progress and return the newly provisioned resource (in this bug, the pod) on the subsequent request. You can check the progress in the attached GitHub issue.

I hope to have this implemented and tested next sprint. It won't totally remove the underlying problems (nodes get overloaded), but it should allow CRI-O and the kubelet to reconcile faster.
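
For illustration, here is a minimal Go sketch of the mitigation being described, under stated assumptions: the `Creator` type, its names, and the timings are hypothetical and are not CRI-O's code. The idea it shows is that instead of discarding work when the caller's deadline expires, the in-flight creation is cached by name and keeps running in the background, so the retry re-attaches to it and picks up the finished resource.

```go
package main

import (
	"context"
	"fmt"
	"sync"
	"time"
)

// result stands in for whatever the slow creation produces (here, a sandbox ID).
type result struct {
	id string
}

// inflight tracks a creation that may outlive the caller's deadline.
type inflight struct {
	done chan struct{}
	res  result
}

// Creator caches in-flight and completed work keyed by name, so a retry after a
// timeout picks up the saved progress instead of starting over.
type Creator struct {
	mu   sync.Mutex
	work map[string]*inflight
}

func NewCreator() *Creator {
	return &Creator{work: make(map[string]*inflight)}
}

// Create starts (or re-attaches to) the creation of `name`. If the caller's
// context expires first, the work keeps running and is returned on a retry.
func (c *Creator) Create(ctx context.Context, name string, slowCreate func() result) (result, error) {
	c.mu.Lock()
	w, ok := c.work[name]
	if !ok {
		w = &inflight{done: make(chan struct{})}
		c.work[name] = w
		go func() { // keep working even if the original caller gives up
			w.res = slowCreate()
			close(w.done)
		}()
	}
	c.mu.Unlock()

	select {
	case <-w.done:
		return w.res, nil
	case <-ctx.Done():
		// Progress is kept; a later retry will block above and then succeed.
		return result{}, ctx.Err()
	}
}

func main() {
	c := NewCreator()
	slow := func() result { time.Sleep(300 * time.Millisecond); return result{id: "sandbox-123"} }

	// First attempt times out, mimicking the client's deadline expiring.
	ctx1, cancel1 := context.WithTimeout(context.Background(), 100*time.Millisecond)
	defer cancel1()
	_, err := c.Create(ctx1, "pod-a", slow)
	fmt.Println("first attempt:", err) // context deadline exceeded

	// The retry re-attaches to the saved work and gets the finished resource.
	ctx2, cancel2 := context.WithTimeout(context.Background(), time.Second)
	defer cancel2()
	res, err := c.Create(ctx2, "pod-a", slow)
	fmt.Println("retry:", res.id, err) // sandbox-123 <nil>
}
```

Compared with the earlier name-reservation sketch, a timed-out request here no longer forces a cleanup-and-restart cycle; the retry simply waits for the saved work to complete.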

Comment 56 Douglas Smith 2020-10-13 19:48:54 UTC
*** Bug 1881678 has been marked as a duplicate of this bug. ***

Comment 62 Michael Washer 2020-11-06 06:48:35 UTC
I'm experiencing a similar issue to this in OCP 4.4.

Now a node will not take any pods, with logs like:
`Error: error reserving ctr name k8s_* for id *: name is reserved`
...
`Nov 01 14:05:17 wb09.ocp01.nfvdev.tlabs.ca crio[3024]: time="2020-11-01 14:05:17.881016617Z" level=error msg="Error while removing pod from CNI network \"multus-cni-network\": context deadline exceeded"`
...
`Oct 31 09:02:32 wb09.ocp01.nfvdev.tlabs.ca crio[3024]: time="2020-10-31 09:02:32.176622070Z" level=warning msg="unknown CNI cache file /var/lib/cni/results/openshift-sdn-4fcf55fee891dc032527ff655e145ac184b123b3531a80d7d33339e8790d8fe4-eth0 kind \"\""`

Is there currently a workaround or is this bug resolved?

Comment 63 Peter Hunt 2020-11-06 14:22:00 UTC
I'm actively working on a mitigation, but that problem in particular looks like a Multus issue, which is a bit different from the root of this bug that I'm fixing: `on systems with load, kubelet's requests to cri-o time out, and this causes further system load`.

Comment 67 Naga Ravi Chaitanya Elluri 2020-11-17 13:45:16 UTC
This is a blocker for OCP 4.6 Performance and Scale testing. We are hitting this issue during all the tests that involve heavy pod creation.

Comment 68 Peter Hunt 2020-11-17 19:45:18 UTC
@nelluri, is this a scale regression between 4.5 and 4.6? CRI-O has *always* had this issue, so I am hesitant to call it a blocker. If the issue is worse, I would think it's because the general load on nodes is higher, making this situation happen more often. I do not think that makes this a blocker

That said, I am hoping to have changes that *mitigate* this for 4.7. Backporting to 4.6 is possible, but we'll need some time to make sure the changes don't destabilize anything. But the changes will never completely solve this problem, as the underlying problem is the CRI-O request is taking too long because of something out of its control (network creation, disk access, etc). With a sufficient level of load, this will happen no matter the mitigations.

Comment 69 Joe Talerico 2020-11-17 21:09:27 UTC
(In reply to Peter Hunt from comment #68)
> @nelluri, is this a scale regression between 4.5 and 4.6? CRI-O
> has *always* had this issue, so I am hesitant to call it a blocker. If the
> issue is worse, I would think it's because the general load on nodes is
> higher, making this situation happen more often. I do not think that makes
> this a blocker
> 
> That said, I am hoping to have changes that *mitigate* this for 4.7.
> Backporting to 4.6 is possible, but we'll need some time to make sure the
> changes don't destabilize anything. But the changes will never completely
> solve this problem, as the underlying problem is the CRI-O request is taking
> too long because of something out of its control (network creation, disk
> access, etc). With a sufficient level of load, this will happen no matter
> the mitigations.

Hey Peter - I asked Ravi to escalate this as a blocker. Carrying this bug for multiple releases is the reason I asked him to do that. In the 4.6 testing, the issue did seem to present itself much more.

Comment 70 Martin André 2020-11-19 10:28:57 UTC
*** Bug 1881868 has been marked as a duplicate of this bug. ***

Comment 71 Peter Hunt 2020-11-19 18:06:04 UTC
For posterity: we opted not to set this as a blocker, but rather to update the documentation to better describe how to mitigate the issue. 90% of the time, when someone hits this, they're getting their disk IOPs throttled. The best way to solve it is to increase the capacity of the nodes or reduce the amount of work the nodes are doing.

Comment 76 Douglas Smith 2020-12-01 20:19:06 UTC
*** Bug 1896969 has been marked as a duplicate of this bug. ***

Comment 80 Peter Hunt 2020-12-04 22:02:45 UTC
A PR to mitigate this problem is upstream now. I will work on getting it in next sprint.

Comment 83 Jan Safranek 2020-12-15 14:48:56 UTC
Today's example from CI (4.4 on OpenStack): https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-openstack-4.4/1338374880525029376

The whole job is full of errors like this:

> error reserving pod name k8s_external-provisioner-v52cr_e2e-volume-9613_66ff2101-b719-485c-b02e-cbf8bac25c3c_0 for id 2957ac00678135cbf3526e5db32cff195334ad36ce78fb83b2dd2694fd6e1c9b: name is reserved

Comment 84 Jan Safranek 2020-12-15 14:49:17 UTC
*** Bug 1907595 has been marked as a duplicate of this bug. ***

Comment 85 Peter Hunt 2020-12-15 21:24:45 UTC
So the PR that will mitigate this issue has merged into 4.7.

Note, this will not fix the underlying issue. If the cloud provider is throttling IOPs, or if some other resource is starved (CPU, network), then RunPodSandbox requests will still time out. Hopefully, however, this change will make the error less obnoxious and also allow CRI-O and the kubelet to reconcile faster.

For QE: I have some instructions on how to verify the fix worked here https://github.com/cri-o/cri-o/pull/4394#issuecomment-735934262
specifically minding the note at the bottom "both clusters did eventually fail (the test case specifically had to overload it to trigger this situation) but with the patches we were able to get more pods running (558R vs 432R). Also note in the "with patch" case, we have 0X which means 0 container create errors vs 83X or 83 containers that were spamming kubelet logs with "name is reserved" errors"

This also means that when you do an `oc get pod/$pod` for a pod that's taking a long time to create, it should have at most one or two of these "name is reserved" failures, rather than one every two seconds.

Comment 106 Catherine_H 2021-01-22 05:08:40 UTC
Hi Team,

Will this bug be fixed in 4.6.x?

Comment 112 Michael Burke 2021-02-11 15:23:09 UTC
If this BZ needs to be included in the 4.7 release notes as a bug fix, please enter Doc Text by EOB 2/12. Thank you, Michael

Comment 114 errata-xmlrpc 2021-02-24 15:10:53 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.7.0 security, bug fix, and enhancement update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2020:5633

Comment 117 Emmanuel Kasper 2022-03-28 09:25:05 UTC
The same symptom can again happen on a patched cluster (4.8.35); see https://bugzilla.redhat.com/show_bug.cgi?id=2069058 for follow-up.

Comment 122 Red Hat Bugzilla 2023-09-15 01:29:01 UTC
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 365 days