Bug 1795881

Summary: Pod stuck with "Terminating" with container kill failed because of "container not found" or "no such process"
Product: OpenShift Container Platform
Reporter: Daein Park <dapark>
Component: Containers
Assignee: Tom Sweeney <tsweeney>
Status: CLOSED WONTFIX
QA Contact: Weinan Liu <weinliu>
Severity: medium
Docs Contact:
Priority: unspecified
Version: 3.11.0
CC: ajia, aos-bugs, ddarrah, dornelas, dwalsh, jnovy, jokerman, mslee, nagrawal, pasik, tsweeney, wjiang
Target Milestone: ---
Target Release: 3.11.z
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2020-10-07 20:53:31 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On:
Bug Blocks: 1186913

Description Daein Park 2020-01-29 05:42:21 UTC
Description of problem:

When a pod is redeployed on a node, the old pod gets stuck in "Terminating" before the new pod is created.
The following error appears many times in the journal logs.

~~~
Jan 23 15:33:31 worker.ocp.example.com dockerd-current[6688]: time="2020-01-23T15:33:31.794262342+09:00" level=warning msg="container kill failed because of 'container not found' or 'no such process': Cannot kill container xxx...xxx: rpc error: code = 2 desc = containerd: container not found"
~~~
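
To get a sense of how often this warning repeats and for which containers, the saved journal can be tallied per container ID. The following is a minimal sketch, not part of the original report: the function name `count_kill_failures` is hypothetical, and the pattern is based only on the single warning line quoted above, so other dockerd versions may format the message differently.

```python
import re
from collections import Counter

# Pattern derived from the one dockerd warning quoted in this report.
# The container ID is taken to be the token between
# "Cannot kill container " and the following colon (an assumption).
KILL_FAILED = re.compile(
    r"container kill failed because of 'container not found' "
    r"or 'no such process': Cannot kill container (?P<cid>[^:]+):"
)


def count_kill_failures(lines):
    """Count kill-failed warnings per container ID in an iterable of log lines."""
    counts = Counter()
    for line in lines:
        match = KILL_FAILED.search(line)
        if match:
            counts[match.group("cid").strip()] += 1
    return counts
```

For example, passing an open file handle of an exported journal (any text file with one log entry per line) returns a `Counter` whose `most_common()` lists the containers that dockerd retried most often.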

There were only 14 "docker-runc-current" processes in the ps command output, but docker info reported 2214 running containers.

~~~
$ grep -c docker-runc-current ps
14

$ cat docker_info 
Containers: 2228
 Running: 2214
 Paused: 0
 Stopped: 14
Images: 88
Server Version: 1.13.1
Storage Driver: overlay2
 Backing Filesystem: xfs
:
Swarm: inactive
Runtimes: docker-runc runc
Default Runtime: docker-runc
Init Binary: /usr/libexec/docker/docker-init-current
containerd version:  (expected: aa8187dbd3b7ad67d8e5e3a15115d3eef43a7ed1)
runc version: 9c3c5f853ebf0ffac0d087e94daef462133b69c7 (expected: 9df8b306d01f59d3a8029be411de015b7304dd8f)
init version: fec3683b971d9c3ef73f284f176672c44b448662 (expected: 949e6facb77383876aeff8a6944dde66b3089574)
:
Docker Root Dir: /docker
:
~~~
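
The comparison above can also be scripted against the same two saved files. This is a rough sketch under stated assumptions: the helper names are hypothetical, the inputs are the text of saved `ps` and `docker info` output as shown above, and the number of lines mentioning `docker-runc-current` is treated only as an approximate indicator of live containers.

```python
import re


def running_containers(docker_info_text):
    """Return the 'Running:' count from saved `docker info` output, or None."""
    match = re.search(r"^\s*Running:\s*(\d+)\s*$", docker_info_text, re.MULTILINE)
    return int(match.group(1)) if match else None


def runtime_process_count(ps_text, name="docker-runc-current"):
    """Count lines that mention the runtime process, like `grep -c`."""
    return sum(1 for line in ps_text.splitlines() if name in line)


def running_without_process(docker_info_text, ps_text):
    """Containers docker reports as running minus runtime processes seen in ps."""
    running = running_containers(docker_info_text)
    if running is None:
        raise ValueError("no 'Running:' line found in docker info output")
    return running - runtime_process_count(ps_text)
```

With the figures in this report (2214 running, 14 matching processes) the difference is 2200, which suggests that docker's view of running containers has drifted far from what actually exists on the node.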

Version-Release number of selected component (if applicable):

openshift-ansible-3.11.146-1.git.0.fcedb45.el7.noarch
docker-1.13.1-103.git7f2769b.el7.x86_64
systemd-219-67.el7_7.1.x86_64

How reproducible:

N/A

Steps to Reproduce:
1.
2.
3.

Actual results:

The pod cannot be redeployed because it is stuck in "Terminating" status.

Expected results:

The pod can be redeployed without any issue.

Additional info:

Comment 8 Tom Sweeney 2020-01-30 23:41:56 UTC
Looks like another instance of this problem in a new BZ, https://bugzilla.redhat.com/show_bug.cgi?id=1796451

Comment 18 Tom Sweeney 2020-06-08 20:11:54 UTC
Alex Jia, can you please update this PR per this comment? https://bugzilla.redhat.com/show_bug.cgi?id=1795881#c16

Comment 19 Dale Bewley 2020-06-12 23:00:23 UTC
Is this BZ also resolved by https://access.redhat.com/errata/RHSA-2020:1234 ?

Comment 26 Weinan Liu 2020-09-07 09:59:24 UTC
@Alex,
I guess my slack message did not reach you.
#1 Could you provide a YAML file I can use to reproduce the issue?
#2 I see the BZ is still ASSIGNED. Is it already fixed, or are we just trying to get it reproduced?

Comment 28 Weinan Liu 2020-09-09 15:02:31 UTC
@Daein, could you provide a YAML file we can use to reproduce the issue?

Comment 29 Daein Park 2020-09-10 02:51:52 UTC
@Weinan, there is no YAML file to reproduce it, because I could not reproduce this issue in my test lab. AFAIK only the customers' OCP had this issue.
They said the issue occurred while some pods were being restarted by scaling replicas from xx -> 0 and then from 0 -> xx.

Comment 30 Weinan Liu 2020-09-10 03:22:31 UTC
OCP 3.11 install blocked by https://bugzilla.redhat.com/show_bug.cgi?id=1876873#c1

Comment 33 Stephen Cuppett 2020-10-07 20:53:31 UTC
Thank you for continuing to use Red Hat OpenShift.  As part of a wider bug review, this bug has been evaluated and we have determined that at this time we do not plan to progress it.  As such, we will be closing this bug.  If you have need for continued assistance on this issue, please reopen the bug with additional context on why it needs to be reconsidered.