1795881 – Pod stuck with "Terminating" with container kill failed because of "container not found" or "no such process"

Keywords:
Status:	CLOSED WONTFIX
Alias:	None
Product:	OpenShift Container Platform
Classification:	Red Hat
Component:	Containers
Sub Component:
Version:	3.11.0
Hardware:	Unspecified
OS:	Unspecified
Priority:	unspecified
Severity:	medium
Target Milestone:	---
Target Release:	3.11.z
Assignee:	Tom Sweeney
QA Contact:	Weinan Liu
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:	1186913

Reported:	2020-01-29 05:42 UTC by Daein Park
Modified:	2023-03-24 16:52 UTC
CC List:	12 users
Fixed In Version:
Doc Type:	If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed:	2020-10-07 20:53:31 UTC
Target Upstream Version:
Embargoed:

Description Daein Park 2020-01-29 05:42:21 UTC

Description of problem:

When a pod is redeployed on a node, the old pod gets stuck in "Terminating" before the new pod is created.
The following error is logged many times in the journal.

~~~
Jan 23 15:33:31 worker.ocp.example.com dockerd-current[6688]: time="2020-01-23T15:33:31.794262342+09:00" level=warning msg="container kill failed because of 'container not found' or 'no such process': Cannot kill container xxx...xxx: rpc error: code = 2 desc = containerd: container not found"
~~~
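To gauge how often the warning above occurs, and for which containers, a saved journal can be summarized. This is a minimal sketch, assuming the journal was exported to a plain-text file (for example with `journalctl -u docker > journal`; the unit name `docker` and the file name `journal` are assumptions). The function names `count_kill_failures` and `kill_failures_by_container` are hypothetical helpers, not part of any tool.

```shell
#!/bin/sh
# Hypothetical helper: print the total number of "container kill failed"
# warnings found in a saved journal file.
count_kill_failures() {
    grep -c "container kill failed because of 'container not found' or 'no such process'" "$1"
}

# Hypothetical helper: print the number of warnings per container ID,
# most frequent first. The ID is taken from "Cannot kill container <id>:".
kill_failures_by_container() {
    sed -n 's/.*Cannot kill container \([^:]*\):.*/\1/p' "$1" \
        | sort | uniq -c | sort -rn
}
```

Usage: `count_kill_failures journal` prints a single number, and `kill_failures_by_container journal` prints one `count id` line per container.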

The ps command output showed only 14 "docker-runc-current" processes, but docker info reported 2214 running containers.

~~~
$ grep -c docker-runc-current ps
14

$ cat docker_info 
Containers: 2228
 Running: 2214
 Paused: 0
 Stopped: 14
Images: 88
Server Version: 1.13.1
Storage Driver: overlay2
 Backing Filesystem: xfs
:
Swarm: inactive
Runtimes: docker-runc runc
Default Runtime: docker-runc
Init Binary: /usr/libexec/docker/docker-init-current
containerd version:  (expected: aa8187dbd3b7ad67d8e5e3a15115d3eef43a7ed1)
runc version: 9c3c5f853ebf0ffac0d087e94daef462133b69c7 (expected: 9df8b306d01f59d3a8029be411de015b7304dd8f)
init version: fec3683b971d9c3ef73f284f176672c44b448662 (expected: 949e6facb77383876aeff8a6944dde66b3089574)
:
Docker Root Dir: /docker
:
~~~
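The mismatch shown above can be checked mechanically. The following is a minimal sketch, assuming the `ps` output and the `docker info` output were saved to files such as `ps` and `docker_info`, as in the listing above. The function name `compare_container_counts` is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: compare the number of docker-runc-current processes
# in a saved ps listing with the "Running:" count in saved docker info output.
# Returns 0 when the counts match and 1 when they differ.
compare_container_counts() {
    ps_file=$1
    info_file=$2

    # Count lines mentioning the runtime process (same as the grep -c above).
    procs=$(grep -c docker-runc-current "$ps_file")

    # Take the number from the " Running: N" line; default to 0 if absent.
    running=$(awk '$1 == "Running:" { print $2; exit }' "$info_file")
    running=${running:-0}

    echo "docker-runc-current processes: $procs"
    echo "running containers per docker info: $running"

    if [ "$procs" -ne "$running" ]; then
        echo "MISMATCH: difference of $((running - procs))"
        return 1
    fi
    echo "OK: counts match"
}
```

Usage: `compare_container_counts ps docker_info`. A large difference, as in this report, suggests that docker's bookkeeping still lists containers whose runtime processes no longer exist.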

Version-Release number of selected component (if applicable):

openshift-ansible-3.11.146-1.git.0.fcedb45.el7.noarch
docker-1.13.1-103.git7f2769b.el7.x86_64
systemd-219-67.el7_7.1.x86_64

How reproducible:

N/A

Steps to Reproduce:
1.
2.
3.

Actual results:

The pod cannot be redeployed because the old pod is stuck in "Terminating" status.

Expected results:

The pod can be redeployed without any issue.

Additional info:

Comment 8 Tom Sweeney 2020-01-30 23:41:56 UTC

Looks like another instance of this problem in a new BZ, https://bugzilla.redhat.com/show_bug.cgi?id=1796451

Comment 18 Tom Sweeney 2020-06-08 20:11:54 UTC

Alex Jia, can you please update this PR per this comment? https://bugzilla.redhat.com/show_bug.cgi?id=1795881#c16

Comment 19 Dale Bewley 2020-06-12 23:00:23 UTC

Is this BZ also resolved by https://access.redhat.com/errata/RHSA-2020:1234 ?

Comment 26 Weinan Liu 2020-09-07 09:59:24 UTC

@Alex,
I guess my Slack message did not reach you.
#1 Could you provide the YAML file so that I can reproduce the issue?
#2 I see the BZ is still ASSIGNED. Is it already fixed, or are we just trying to get it reproduced?

Comment 28 Weinan Liu 2020-09-09 15:02:31 UTC

@Daein, could you provide a YAML file with which we can reproduce the issue?

Comment 29 Daein Park 2020-09-10 02:51:52 UTC

@Weinan, there is no reproducer YAML, because I could not reproduce this issue in my test lab. AFAIK, only the customer's OCP cluster had this issue.
They said the issue occurred while some pods were being restarted by scaling replicas from xx -> 0 and then from 0 -> xx.

Comment 30 Weinan Liu 2020-09-10 03:22:31 UTC

The OCP 3.11 install is blocked by https://bugzilla.redhat.com/show_bug.cgi?id=1876873#c1

Comment 33 Stephen Cuppett 2020-10-07 20:53:31 UTC

Thank you for continuing to use Red Hat OpenShift.  As part of a wider bug review, this bug has been evaluated and we have determined that at this time we do not plan to progress it.  As such, we will be closing this bug.  If you have need for continued assistance on this issue, please reopen the bug with additional context on why it needs to be reconsidered.
