Bug 1834877 - OCP 4.3.13 --- Error: the container name (...) is already in use by (...) But container does not exist
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Node
Version: 4.3.0
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: medium
Target Milestone: ---
Target Release: 4.3.z
Assignee: Peter Hunt
QA Contact: Sunil Choudhary
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2020-05-12 15:06 UTC by Andreas Karis
Modified: 2023-10-06 19:59 UTC
CC List: 6 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2020-06-03 03:30:45 UTC
Target Upstream Version:
Embargoed:




Links
System ID Private Priority Status Summary Last Updated
Red Hat Bugzilla 1775647 0 high CLOSED Podman can't reuse a container name, even if the container that was using it is no longer around 2024-01-06 04:27:13 UTC
Red Hat Product Errata RHBA-2020:2256 0 None None None 2020-06-03 03:31:38 UTC

Description Andreas Karis 2020-05-12 15:06:02 UTC
Description of problem:
OCP 4.3.13 --- Error: the container name (...) is already in use by (...) But container does not exist



Version-Release number of selected component (if applicable):


How reproducible:


Steps to Reproduce:
1.
2.
3.

Actual results:


Expected results:


Additional info:

Comment 8 Andreas Karis 2020-05-12 15:18:15 UTC
https://github.com/containers/libpod/issues/2240

Comment 9 Andreas Karis 2020-05-12 15:22:34 UTC
Workaround:

# podman rm --storage 44f2b...
44f2bd7e...


# podman rm --storage 84c017aa6...
84c017a...
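
Generalized, the workaround is to look up the ID that containers/storage still holds for the conflicting name and remove it at the storage level. A minimal sketch, assuming the containers.json path shown in comment 11 below; the container name is a placeholder to be taken from the "already in use" error message:
~~~
# Placeholder: the full container name from the "already in use" error message
NAME="k8s_operator_<...>_0"

# Find the storage ID that still owns that name (path as shown in comment 11)
ID=$(jq -r --arg name "$NAME" \
  '.[] | select(.names[]? == $name) | .id' \
  /var/lib/containers/storage/overlay-containers/containers.json)

# Remove the leftover entry directly from containers/storage
podman rm --storage "$ID"
~~~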

Comment 11 Andreas Karis 2020-05-12 16:42:21 UTC
Below is an anonymized version of the issue (to have a public trace of this):

------------------------------------------------------------------------

Pod cannot spawn due to:
~~~
the container name "k8s_operator_service-ca-operator-aaaaaaaaaa-bbbbb_openshift-service-ca-operator_dddddddd-eeee-ffff-gggg-hhhhhhhhhhhh_0" is already in use by "vvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvv". You have to remove that container to be able to reuse that name.: that name is already in use
~~~

Pods show a CreateContainerError status:
~~~
# oc get pods -A -o wide | grep -iv runn | grep -iv compl
NAMESPACE                                               NAME                                                              READY   STATUS                 RESTARTS   AGE     IP               NODE                                     NOMINATED NODE   READINESS GATES
openshift-service-ca-operator                           service-ca-operator-aaaaaaaaaa-bbbbb                              0/1     CreateContainerError   0          114m    192.168.3.34      master01.example.com   <none>           <none>
~~~

Events for the pod show:
~~~
# oc describe pod -n openshift-service-ca-operator service-ca-operator-aaaaaaaaaa-bbbbb  | tail -n 20
  service-ca-operator-token-ccccc:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  service-ca-operator-token-ccccc
    Optional:    false
QoS Class:       Burstable
Node-Selectors:  node-role.kubernetes.io/master=
Tolerations:     node-role.kubernetes.io/master:NoSchedule
                 node.kubernetes.io/memory-pressure:NoSchedule
                 node.kubernetes.io/not-ready:NoExecute for 120s
                 node.kubernetes.io/unreachable:NoExecute for 120s
Events:
  Type     Reason                  Age                   From                                             Message
  ----     ------                  ----                  ----                                             -------
  Normal   Scheduled               <unknown>             default-scheduler                                Successfully assigned openshift-service-ca-operator/service-ca-operator-aaaaaaaaaa-bbbbb to master01.example.com
  Warning  FailedCreatePodSandBox  111m                  kubelet, master01.example.com  Failed create pod sandbox: rpc error: code = DeadlineExceeded desc = context deadline exceeded
  Warning  FailedCreatePodSandBox  111m                  kubelet, master01.example.com  Failed create pod sandbox: rpc error: code = Unknown desc = error reserving pod name k8s_service-ca-operator-aaaaaaaaaa-bbbbb_openshift-service-ca-operator_dddddddd-eeee-ffff-gggg-hhhhhhhhhhhh_0 for id xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx: name is reserved
  Warning  FailedCreatePodSandBox  111m                  kubelet, master01.example.com  Failed create pod sandbox: rpc error: code = Unknown desc = error reserving pod name k8s_service-ca-operator-aaaaaaaaaa-bbbbb_openshift-service-ca-operator_dddddddd-eeee-ffff-gggg-hhhhhhhhhhhh_0 for id yyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyy: name is reserved
  Warning  Failed                  111m                  kubelet, master01.example.com  Error: relabel failed /var/run/containers/storage/overlay-containers/zzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzz/userdata/resolv.conf: lstat /var/run/containers/storage/overlay-containers/zzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzz/userdata/resolv.conf: no such file or directory
  Warning  Failed                  109m (x10 over 111m)  kubelet, master01.example.com  Error: the container name "k8s_operator_service-ca-operator-aaaaaaaaaa-bbbbb_openshift-service-ca-operator_dddddddd-eeee-ffff-gggg-hhhhhhhhhhhh_0" is already in use by "vvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvv". You have to remove that container to be able to reuse that name.: that name is already in use
  Normal   Pulled                  78s (x506 over 111m)  kubelet, master01.example.com  Container image "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:<checksum>" already present on machine
~~~

When inspecting the node, one can see that the container with that ID does not exist:
~~~
[root@master01 ~]# crictl ps -a | grep "vvvvvvv"
[root@master01 ~]# podman ps -a | grep "vvvvvv"
[root@master01 ~]# crictl ps -a | grep ca-operator
[root@master01 ~]# podman ps -a | grep ca-operator
~~~

However, the container does exist in `/var/lib/containers/storage/overlay-containers/containers.json`:
~~~
[root@master01 ~]# jq . /var/lib/containers/storage/overlay-containers/containers.json | grep "vvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvv"  -B1 -A12
  {
    "id": "vvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvv",
    "names": [
      "k8s_operator_service-ca-operator-aaaaaaaaaa-bbbbb_openshift-service-ca-operator_dddddddd-eeee-ffff-gggg-hhhhhhhhhhhh_0"
    ],
    "image": "<image UUID>",
    "layer": "<layer UUID>",
    "metadata": "{\"pod-name\":\"k8s_service-ca-operator-aaaaaaaaaa-bbbbb_openshift-service-ca-operator_dddddddd-eeee-ffff-gggg-hhhhhhhhhhhh_0\",\"pod-id\":\"zzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzz\",\"image-name\":\"<image UUID>\",\"image-id\":\"<image UUID>\",\"name\":\"k8s_operator_service-ca-operator-aaaaaaaaaa-bbbbb_openshift-service-ca-operator_dddddddd-eeee-ffff-gggg-hhhhhhhhhhhh_0\",\"metadata-name\":\"operator\",\"created-at\":1589288627}",
    "created": "2020-05-12T13:03:47.639544652Z",
    "flags": {
      "MountLabel": "system_u:object_r:container_file_t:s0:c699,c769",
      "ProcessLabel": "system_u:system_r:container_t:s0:c699,c769"
    }
  },
~~~

Furthermore, it is not possible to delete the container:
~~~
[root@master01 ~]# crictl rm vvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvv
Removing the container "vvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvv" failed: rpc error: code = Unknown desc = container with ID starting with vvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvv not found: ID does not exist
[root@master01 ~]# podman rm vvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvv
Error: no container with name or ID vvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvv found: no such container
[root@master01 ~]#
~~~
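
Since neither crictl nor podman knows the container, the only remaining handle is the containers/storage entry itself, so the `podman rm --storage` workaround from comment 9 is what clears it. A hedged example using the anonymized ID from this trace; after removal, the kubelet's next retry should be able to reuse the name:
~~~
# Remove the orphaned entry directly from containers/storage
# (vvvv... is the anonymized container ID used throughout this trace)
podman rm --storage vvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvv
~~~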

Comment 12 Tom Sweeney 2020-05-12 18:14:33 UTC
Maybe I missed it above, but was Podman and/or Buildah in use in this environment along with CRI-O when this happened? I.e., is it possible that one of those other tools created the container and CRI-O later tried to delete it? Or did you install Podman after the error happened, as part of your research? Also, can you share a few more details of the commands you ran before the error popped up? Did you first notice it when you ran the `# oc describe pod...` command noted in your report, or while doing something else?

I've added Matt Heon to this BZ in case he has any insights.

Comment 13 Matthew Heon 2020-05-12 18:48:33 UTC
Based on the description here, it doesn't sound like Podman is being used to create containers, only CRI-O (as instructed by OpenShift). Given that, it sounds like this is an issue very similar to the Podman issue, but triggered by CRI-O: an incomplete deletion leaves a container that still exists in c/storage even after CRI-O and OpenShift believe it has been removed.

It would support my hypothesis if a pod/container with the given name had been created and removed prior to this failure occurring.
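
One way to probe this hypothesis on an affected node is to compare what CRI-O knows about with what containers/storage has recorded. A hedged sketch using only the commands already shown in this report; the container name is a placeholder:
~~~
# Placeholder: the conflicting container name from the error message
NAME="k8s_operator_<...>_0"

# containers/storage still lists the name ...
jq -r '.[].names[]' /var/lib/containers/storage/overlay-containers/containers.json | grep -F "$NAME"

# ... while CRI-O has no matching container (expected: no output, as in comment 11)
crictl ps -a | grep -i service-ca-operator
~~~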

Comment 14 Andreas Karis 2020-05-13 08:44:30 UTC
Hi,

>> Maybe I missed it in the above, but was Podman and/or Buildah in use in this environment along with CRI-O when this happened?
>> Or did you install Podman after the error happened due to your research

Podman is installed on our OpenShift nodes but it's only used by the toolbox debugging tool as far as I know. So when I ran `podman ps -a`, I could see a container with name `toolbox`, but nothing else.

>> I.e. is it possible that one of those other tools created the container and then CRI-O tried to later delete it?  

The container name strongly suggests that container "k8s_operator_service-ca-operator-aaaaaaaaaa-bbbbb_openshift-service-ca-operator_dddddddd-eeee-ffff-gggg-hhhhhhhhhhhh_0" was created by kubernetes. The alternative would be that someone manually created a container with the same name through podman and triggered the bug that way, which is unlikely.
It's far more likely that this was caused by kubernetes.

>> Also, can you share a few more details of the commands you used before the error popped up?
>> Did you first notice it when you did the `# oc describe pod...` command as noted in your report or doing something else?

We ran a few failover tests. Ironically, we hard rebooted the other 2 master nodes sequentially. We pulled the power on the etcd leader node and verified the cluster and failover times. After verification, we brought that master node back up. Then we repeated the test with the new etcd leader: we pulled its power and saw that OpenShift had some issues stabilizing the cluster for about 30 minutes. We waited, and once the cluster was stable we brought the shut-down node back up. After bringing the other nodes back up, we ended up with alerts on the OCP web console indicating that 2 containers could not start up due to this issue. We did not touch the master node in question (of course, OpenShift / kubernetes did). I ran the CLI commands `oc get pods` / `oc describe pod` after we noticed the issue on the web interface.

>> it sounds like this is an issue very similar to the Podman issue, but being triggered by CRI-O 

That's what I speculate might be the case too.

>> It would support my hypothesis if a pod/container with the given name had been created and removed prior to this failure occurring.

I can't back this up at the moment but given the failover tests I'd say this is likely.

Comment 15 Peter Hunt 2020-05-21 18:28:16 UTC
We have had many fixes in this area that went into 4.3.19. Can this cluster be upgraded to that release to see whether it mitigates these problems?
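
For reference, a hedged sketch of checking for and starting such an update from the CLI, assuming the standard `oc adm upgrade` workflow; the exact target version should be taken from the available-updates list:
~~~
# Show the current cluster version and the updates available in the configured channel
oc adm upgrade

# Move the cluster to the suggested z-stream release (version here is the one named above)
oc adm upgrade --to=4.3.19
~~~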

Comment 20 errata-xmlrpc 2020-06-03 03:30:45 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:2256

