Description of problem:
If a pod is redeployed on a node, the old pod gets stuck in "Terminating" before the new pod is created. The following error is logged many times in the journal:

~~~
Jan 23 15:33:31 worker.ocp.example.com dockerd-current[6688]: time="2020-01-23T15:33:31.794262342+09:00" level=warning msg="container kill failed because of 'container not found' or 'no such process': Cannot kill container xxx...xxx: rpc error: code = 2 desc = containerd: container not found"
~~~

There were only 14 "docker-runc-current" processes in the ps command output, but docker info reported 2214 running containers.

~~~
$ grep -c docker-runc-current ps
14

$ cat docker_info
Containers: 2228
Running: 2214
Paused: 0
Stopped: 14
Images: 88
Server Version: 1.13.1
Storage Driver: overlay2
Backing Filesystem: xfs
:
Swarm: inactive
Runtimes: docker-runc runc
Default Runtime: docker-runc
Init Binary: /usr/libexec/docker/docker-init-current
containerd version: (expected: aa8187dbd3b7ad67d8e5e3a15115d3eef43a7ed1)
runc version: 9c3c5f853ebf0ffac0d087e94daef462133b69c7 (expected: 9df8b306d01f59d3a8029be411de015b7304dd8f)
init version: fec3683b971d9c3ef73f284f176672c44b448662 (expected: 949e6facb77383876aeff8a6944dde66b3089574)
:
Docker Root Dir: /docker
:
~~~

Version-Release number of selected component (if applicable):
openshift-ansible-3.11.146-1.git.0.fcedb45.el7.noarch
docker-1.13.1-103.git7f2769b.el7.x86_64
systemd-219-67.el7_7.1.x86_64

How reproducible:
N/A

Steps to Reproduce:
1.
2.
3.

Actual results:
The pod cannot be redeployed because the old pod is stuck in "Terminating" status.

Expected results:
The pod can be redeployed without any issue.

Additional info:
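The comparison above (the "Running:" count from docker info versus the number of "docker-runc-current" lines in the ps output) can be scripted when checking other nodes. This is only a minimal sketch: the function names are made up for illustration, and it assumes the two outputs were saved to text files such as the `docker_info` and `ps` files used above. A large gap is a hint that docker's bookkeeping and the actual processes disagree, not proof of this bug.

```python
import re
import sys


def running_containers(docker_info_text: str) -> int:
    """Return the number on the 'Running:' line of saved `docker info` output."""
    match = re.search(r"^\s*Running:\s*(\d+)\s*$", docker_info_text, re.MULTILINE)
    if match is None:
        raise ValueError("no 'Running:' line found in docker info output")
    return int(match.group(1))


def runtime_process_count(ps_text: str, runtime: str = "docker-runc-current") -> int:
    """Count ps lines that mention the runtime, like `grep -c` does."""
    return sum(1 for line in ps_text.splitlines() if runtime in line)


def container_gap(docker_info_text: str, ps_text: str) -> int:
    """Containers docker reports as running minus runtime lines seen in ps."""
    return running_containers(docker_info_text) - runtime_process_count(ps_text)


if __name__ == "__main__":
    # Usage (file names are whatever the outputs were saved as):
    #   python check_gap.py docker_info ps
    with open(sys.argv[1]) as info_file, open(sys.argv[2]) as ps_file:
        print("gap:", container_gap(info_file.read(), ps_file.read()))
```

With the numbers from this report, the gap would be 2214 - 14 = 2200.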
Looks like there is another instance of this problem in a new BZ: https://bugzilla.redhat.com/show_bug.cgi?id=1796451
Alex Jia, can you please update this PR per this comment? https://bugzilla.redhat.com/show_bug.cgi?id=1795881#c16
Is this BZ also resolved by https://access.redhat.com/errata/RHSA-2020:1234 ?
@Alex, I guess my Slack message did not reach you. #1 Could you provide a YAML file that I can use to reproduce the issue? #2 I see the BZ is still ASSIGNED. Is it already fixed, or are we still trying to reproduce it?
@Daein, could you provide a YAML file that we can use to reproduce the issue?
@Weinan, there is no YAML reproducer, because I could not reproduce this issue in my test lab. AFAIK only the customers' OCP had this issue. They said the issue occurred while some pods were being restarted by scaling replicas from xx -> 0 and then from 0 -> xx.
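For anyone who wants to try the scale cycle described above, here is a rough sketch that scales a resource to 0 and back with `oc scale`. It is not a confirmed reproducer (the issue was never reproduced in a lab), and the resource name, namespace, and replica count are placeholders. It defaults to a dry run that only prints the commands, and it does not wait for pods to terminate between the two steps.

```python
import subprocess


def scale_cycle_commands(resource: str, replicas: int, namespace: str) -> list:
    """Build the two `oc scale` commands: down to 0, then back up to `replicas`."""
    base = ["oc", "scale", resource, "-n", namespace]
    return [base + ["--replicas=0"], base + ["--replicas=%d" % replicas]]


def run_scale_cycle(resource: str, replicas: int, namespace: str, dry_run: bool = True) -> None:
    """Print the commands (dry run) or execute them in order against the cluster."""
    for cmd in scale_cycle_commands(resource, replicas, namespace):
        if dry_run:
            print(" ".join(cmd))
        else:
            subprocess.run(cmd, check=True)


if __name__ == "__main__":
    # Placeholder names; replace with a real deployment config and project.
    run_scale_cycle("dc/example-app", 3, "example-project")
```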
OCP 3.11 install blocked by https://bugzilla.redhat.com/show_bug.cgi?id=1876873#c1
Thank you for continuing to use Red Hat OpenShift. As part of a wider bug review, this bug has been evaluated and we have determined that at this time we do not plan to progress it. As such, we will be closing this bug. If you have need for continued assistance on this issue, please reopen the bug with additional context on why it needs to be reconsidered.