Description of problem:
OCP 4.3.13 --- Error: the container name (...) is already in use by (...)
But container does not exist

Version-Release number of selected component (if applicable):

How reproducible:

Steps to Reproduce:
1.
2.
3.

Actual results:

Expected results:

Additional info:
Similar issues:
https://github.com/containers/libpod/issues/2553
https://bugzilla.redhat.com/show_bug.cgi?id=1757845
https://github.com/containers/libpod/issues/2240
Workaround:
~~~
# podman rm --storage 44f2b...
44f2bd7e...
# podman rm --storage 84c017aa6...
84c017a...
~~~
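For context, a minimal sketch (not part of the original workaround) of how the stale storage entry's ID can be looked up before removing it. The jq filter and the "ca-operator" name pattern are illustrative assumptions; the containers.json path is the one shown later in this report:
~~~
# Illustrative only: list the IDs of storage entries whose name matches a pattern,
# then remove the stale entry directly from containers/storage.
jq -r '.[] | select(.names[] | test("ca-operator")) | .id' \
    /var/lib/containers/storage/overlay-containers/containers.json
# podman rm --storage <ID printed above>
~~~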
Below is an anonymized version of the issue (to have a public trace of this).
------------------------------------------------------------------------
The pod cannot spawn due to:
~~~
the container name "k8s_operator_service-ca-operator-aaaaaaaaaa-bbbbb_openshift-service-ca-operator_dddddddd-eeee-ffff-gggg-hhhhhhhhhhhh_0" is already in use by "vvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvv". You have to remove that container to be able to reuse that name.: that name is already in use
~~~

Pods show a create container error:
~~~
# oc get pods -A -o wide | grep -iv runn | grep -iv compl
NAMESPACE                       NAME                                   READY   STATUS                 RESTARTS   AGE    IP             NODE                   NOMINATED NODE   READINESS GATES
openshift-service-ca-operator   service-ca-operator-aaaaaaaaaa-bbbbb   0/1     CreateContainerError   0          114m   192.168.3.34   master01.example.com   <none>           <none>
~~~

Events for the pod show:
~~~
# oc describe pod -n openshift-service-ca-operator service-ca-operator-aaaaaaaaaa-bbbbb | tail -n 20
  service-ca-operator-token-ccccc:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  service-ca-operator-token-ccccc
    Optional:    false
QoS Class:       Burstable
Node-Selectors:  node-role.kubernetes.io/master=
Tolerations:     node-role.kubernetes.io/master:NoSchedule
                 node.kubernetes.io/memory-pressure:NoSchedule
                 node.kubernetes.io/not-ready:NoExecute for 120s
                 node.kubernetes.io/unreachable:NoExecute for 120s
Events:
  Type     Reason                  Age                   From                           Message
  ----     ------                  ----                  ----                           -------
  Normal   Scheduled               <unknown>             default-scheduler              Successfully assigned openshift-service-ca-operator/service-ca-operator-aaaaaaaaaa-bbbbb to master01.example.com
  Warning  FailedCreatePodSandBox  111m                  kubelet, master01.example.com  Failed create pod sandbox: rpc error: code = DeadlineExceeded desc = context deadline exceeded
  Warning  FailedCreatePodSandBox  111m                  kubelet, master01.example.com  Failed create pod sandbox: rpc error: code = Unknown desc = error reserving pod name k8s_service-ca-operator-aaaaaaaaaa-bbbbb_openshift-service-ca-operator_dddddddd-eeee-ffff-gggg-hhhhhhhhhhhh_0 for id xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx: name is reserved
  Warning  FailedCreatePodSandBox  111m                  kubelet, master01.example.com  Failed create pod sandbox: rpc error: code = Unknown desc = error reserving pod name k8s_service-ca-operator-aaaaaaaaaa-bbbbb_openshift-service-ca-operator_dddddddd-eeee-ffff-gggg-hhhhhhhhhhhh_0 for id yyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyy: name is reserved
  Warning  Failed                  111m                  kubelet, master01.example.com  Error: relabel failed /var/run/containers/storage/overlay-containers/zzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzz/userdata/resolv.conf: lstat /var/run/containers/storage/overlay-containers/zzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzz/userdata/resolv.conf: no such file or directory
  Warning  Failed                  109m (x10 over 111m)  kubelet, master01.example.com  Error: the container name "k8s_operator_service-ca-operator-aaaaaaaaaa-bbbbb_openshift-service-ca-operator_dddddddd-eeee-ffff-gggg-hhhhhhhhhhhh_0" is already in use by "vvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvv". You have to remove that container to be able to reuse that name.: that name is already in use
  Normal   Pulled                  78s (x506 over 111m)  kubelet, master01.example.com  Container image "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:<checksum>" already present on machine
~~~

When inspecting the node, one can see that the container with that ID does not exist:
~~~
[root@master01 ~]# crictl ps -a | grep "vvvvvvv"
[root@master01 ~]# podman ps -a | grep "vvvvvv"
[root@master01 ~]# crictl ps -a | grep ca-operator
[root@master01 ~]# podman ps -a | grep ca-operator
~~~

However, the container does exist in `/var/lib/containers/storage/overlay-containers/containers.json`:
~~~
[root@master01 ~]# jq . /var/lib/containers/storage/overlay-containers/containers.json | grep "vvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvv" -B1 -A12
  {
    "id": "vvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvv",
    "names": [
      "k8s_operator_service-ca-operator-aaaaaaaaaa-bbbbb_openshift-service-ca-operator_dddddddd-eeee-ffff-gggg-hhhhhhhhhhhh_0"
    ],
    "image": "<image UUID>",
    "layer": "<layer UUID>",
    "metadata": "{\"pod-name\":\"k8s_service-ca-operator-aaaaaaaaaa-bbbbb_openshift-service-ca-operator_dddddddd-eeee-ffff-gggg-hhhhhhhhhhhh_0\",\"pod-id\":\"zzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzz\",\"image-name\":\"<image UUID>\",\"image-id\":\"<image UUID>\",\"name\":\"k8s_operator_service-ca-operator-aaaaaaaaaa-bbbbb_openshift-service-ca-operator_dddddddd-eeee-ffff-gggg-hhhhhhhhhhhh_0\",\"metadata-name\":\"operator\",\"created-at\":1589288627}",
    "created": "2020-05-12T13:03:47.639544652Z",
    "flags": {
      "MountLabel": "system_u:object_r:container_file_t:s0:c699,c769",
      "ProcessLabel": "system_u:system_r:container_t:s0:c699,c769"
    }
  },
~~~

Furthermore, it is not possible to delete the container:
~~~
[root@master01 ~]# crictl rm vvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvv
Removing the container "vvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvv" failed: rpc error: code = Unknown desc = container with ID starting with vvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvv not found: ID does not exist
[root@master01 ~]# podman rm vvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvv
Error: no container with name or ID vvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvv found: no such container
[root@master01 ~]#
~~~
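If several such entries are suspected, a sketch along these lines can cross-check containers/storage against CRI-O. The exact commands are an assumption (they require `jq`, `crictl`, and `comm` on the node) and are not taken from the original report:
~~~
# IDs known to containers/storage (includes pod infra containers)
jq -r '.[].id' /var/lib/containers/storage/overlay-containers/containers.json | sort > /tmp/storage-ids
# IDs known to CRI-O: regular containers plus pod sandboxes
{ crictl ps -aq; crictl pods -q; } | sort > /tmp/crio-ids
# Entries present in storage but unknown to CRI-O, i.e. candidates for the leak described above
comm -23 /tmp/storage-ids /tmp/crio-ids
~~~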
Maybe I missed it in the above, but were Podman and/or Buildah in use in this environment along with CRI-O when this happened? I.e., is it possible that one of those other tools created the container and CRI-O then tried to delete it later? Or did you install Podman after the error happened, as part of your research? Also, can you share a few more details of the commands you used before the error popped up? Did you first notice it when you ran the `# oc describe pod...` command noted in your report, or while doing something else? I've added Matt Heon to this BZ in case he has any insights.
Based on the description here, it doesn't sound like Podman is being used to create containers, only CRI-O (as instructed by OpenShift). Given that, this looks like an issue very similar to the Podman one, but triggered by CRI-O: an incomplete deletion leaves a container that still exists in c/storage even after CRI-O and OpenShift believe it has been removed. It would support my hypothesis if a pod/container with the given name had been created and removed prior to this failure occurring.
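One way to check that hypothesis from the node, as a sketch that assumes CRI-O and the kubelet log to the systemd journal; the grep patterns simply reuse the anonymized names from this report:
~~~
# Look for earlier create/remove activity for the affected container name;
# a prior removal attempt would support the incomplete-deletion theory.
journalctl -u crio --no-pager | grep "k8s_operator_service-ca-operator-aaaaaaaaaa-bbbbb"
journalctl -u kubelet --no-pager | grep "service-ca-operator-aaaaaaaaaa-bbbbb"
~~~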
Hi,

>> Maybe I missed it in the above, but was Podman and/or Buildah in use in this environment along with CRI-O when this happened?
>> Or did you install Podman after the error happened due to your research?

Podman is installed on our OpenShift nodes, but as far as I know it is only used by the toolbox debugging tool. When I ran `podman ps -a`, I could see a container named `toolbox`, but nothing else.

>> I.e. is it possible that one of those other tools created the container and then CRI-O tried to later delete it?

The container name strongly suggests that container "k8s_operator_service-ca-operator-aaaaaaaaaa-bbbbb_openshift-service-ca-operator_dddddddd-eeee-ffff-gggg-hhhhhhhhhhhh_0" was created by Kubernetes. The alternative would be that someone manually created a container with the same name through Podman (unlikely) and triggered the bug through Podman (unlikely). It is far more likely that this was caused by Kubernetes.

>> Also, can you share a few more details of the commands you used before the error popped up?
>> Did you first notice it when you did the `# oc describe pod...` command as noted in your report or doing something else?

We ran a few failover tests. Ironically, we hard rebooted the other two master nodes sequentially. We pulled the power on the etcd leader node, then verified the cluster and the failover times. After verification we brought that master node back. Then we repeated the test with the new etcd leader: pulled the power, saw that OpenShift had some issues stabilizing the cluster for about 30 minutes, and waited. Once the cluster was stable, we brought the shut-down node back up. After bringing the other nodes back up, we ended up with alerts on the OCP web console indicating that two containers could not start due to this issue. We did not touch the master node in question (of course, OpenShift / Kubernetes did). I ran the CLI commands `oc get pods` / `oc describe pod` after we noticed the issue on the web interface.

>> it sounds like this is an issue very similar to the Podman issue, but being triggered by CRI-O

That's what I speculate might be the case too.

>> It would support my hypothesis if a pod/container with the given name had been created and removed prior to this failure occurring.

I can't back this up at the moment, but given the failover tests I'd say this is likely.
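For reference, a sketch of the kind of checks typically used to confirm cluster health between such failover steps; the exact commands used were not recorded in this report, so these are illustrative only:
~~~
# Illustrative health checks between failover steps
oc get nodes
oc get clusteroperators
oc get pods -A -o wide | grep -iv runn | grep -iv compl
~~~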
We have had many fixes in this area that went into 4.3.19. Can this cluster be upgraded to that release to see whether it mitigates these problems?
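A hedged sketch of how such an update is typically requested from the CLI; the target version is the one suggested above, and whether it is offered depends on the cluster's update channel:
~~~
# Show the current version and the updates available in the configured channel
oc adm upgrade
# Request the update if 4.3.19 is offered
oc adm upgrade --to=4.3.19
~~~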
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2020:2256