Bug 1357121 - [extras-rhel-7.3.0] Docker ps -a shows dead pods that can't be removed
Summary: [extras-rhel-7.3.0] Docker ps -a shows dead pods that can't be removed
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Linux 7
Classification: Red Hat
Component: docker
Version: 7.3
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: high
Target Milestone: rc
Target Release: 7.3
Assignee: Lokesh Mandvekar
QA Contact: atomic-bugs@redhat.com
URL:
Whiteboard:
Duplicates: 1356993
Depends On:
Blocks: OSOPS_V3
 
Reported: 2016-07-15 19:47 UTC by Eric Jones
Modified: 2019-12-16 06:07 UTC
CC List: 14 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Docker 1.9.1
Last Closed: 2016-11-04 09:08:45 UTC
Target Upstream Version:
Embargoed:




Links
System ID Private Priority Status Summary Last Updated
Red Hat Product Errata RHSA-2016:2634 0 normal SHIPPED_LIVE Moderate: docker security and bug fix update 2016-11-03 20:51:48 UTC

Description Eric Jones 2016-07-15 19:47:05 UTC
Description of problem:
A Docker container is showing as dead but cannot be removed.
The atomic-openshift-node logs indicate the following:

Jul 12 10:02:17 cwb02ddkrapp03 atomic-openshift-node: W0712 10:02:17.769789   66939 container_gc.go:116] Failed to remove dead container "/k8s_POD.f388fd4e_kmb-8-bj1r6_ibxchannels_62fd0c50-446d-11e6-8e88-005056950cd0_cf2bc8fe": API error (500): Unable to remove filesystem for 776f22a1c39cb5359f6e6146c6eef95c1c76f2eb695ee245d6aafdb0776d3e0b: remove /var/lib/docker/containers/776f22a1c39cb5359f6e6146c6eef95c1c76f2eb695ee245d6aafdb0776d3e0b/shm: device or resource busy


# docker ps -a | grep 7b16c41 
7b16c41cf582  openshift3/ose-pod:v3.2.0.46  "/pod"     19 hours ago        Dead


# docker rm 7b16c41
Error response from daemon: Unable to remove filesystem for 7b16c41cf5826d2e783ecaf76e75fc61776a6a7886c7a70d13e24f91bd214d3b: remove /var/lib/docker/containers/7b16c41cf5826d2e783ecaf76e75fc61776a6a7886c7a70d13e24f91bd214d3b/shm: device or resource busy




Version-Release number of selected component (if applicable):
Docker 1.9.1
OSe 3.2.0

Comment 1 Vivek Goyal 2016-07-19 13:43:29 UTC
Need some more information about the setup.

- Are you running the docker daemon in the host mount namespace, or in a separate mount namespace with a slave relationship? If you are using systemd, you can check the docker.service unit file to see whether "MountFlags=slave" is specified.
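
For example, one quick way to check (a rough sketch; this assumes systemd and the stock unit path):

# Show the unit file systemd is actually using and look for MountFlags
systemctl cat docker.service | grep -i MountFlags

# Or inspect the installed unit file directly
grep -i MountFlags /usr/lib/systemd/system/docker.service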

Comment 2 Vivek Goyal 2016-07-19 14:03:46 UTC
Mrunal, I have a question about shm. In this case the directory /var/lib/.../shm/ could not be removed, and I think the reason is that it is mounted in some other mount namespace.

There are two possibilities.

- This mount point has leaked into some other namespace.
- Or we have intentionally leaked this mount point into some other namespace so that
  two containers can share this mount point.

I remember that you had done some work in this area. Can you shed some light on this?

Comment 3 Vivek Goyal 2016-07-19 14:12:35 UTC
I think this issue depends on kernel bug 1347821. The directory removal failed because the directory is mounted in some other mount namespace.
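
For reference, a rough way to see which processes still hold such a mount in their namespace (the shm pattern is illustrative; substitute the real container ID):

# List PIDs whose mount namespace still contains a leaked container shm mount
grep -l 'containers/.*/shm' /proc/[0-9]*/mountinfo 2>/dev/null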

Comment 4 Andy Goldstein 2016-08-09 16:54:03 UTC
*** Bug 1356993 has been marked as a duplicate of this bug. ***

Comment 5 Mrunal Patel 2016-08-15 16:38:42 UTC
I agree that this is tied to the kernel mount namespace bug. I think getting the mounts in the container, host and docker mount namespaces will help.
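
Something along these lines should capture all three views (a sketch; <daemon-pid> and <container-pid> are placeholders, e.g. obtained via "docker inspect -f '{{.State.Pid}}' <container>"):

# Host mount namespace (as seen by PID 1)
cat /proc/1/mountinfo
# Docker daemon's mount namespace
cat /proc/<daemon-pid>/mountinfo
# Container's mount namespace
cat /proc/<container-pid>/mountinfo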

Comment 6 Ian McLeod 2016-08-19 21:08:10 UTC
(In reply to Mrunal Patel from comment #5)
> I agree that this is tied to the kernel mount namespace bug. I think getting
> the mounts in the container, host and docker mount namespaces will help.

Mrunal,

Do we expect that any of the recent changes to the mount namespace(s) in our docker 1.10 packages will fix this?  If so, can we point people to a particular build and/or commit?

Comment 7 Mrunal Patel 2016-08-22 17:38:11 UTC
Ian, I just talked with Vivek and he agrees that adding MountFlags=slave should help with this. He is also going to work on a garbage-collection workaround to remove these containers, since backported kernel fixes aren't going to happen.
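
For anyone who wants to try that now, here is a minimal sketch of a systemd drop-in (the drop-in file name is arbitrary; paths assume a stock RHEL 7 systemd setup):

mkdir -p /etc/systemd/system/docker.service.d
cat > /etc/systemd/system/docker.service.d/mountflags.conf <<'EOF'
[Service]
MountFlags=slave
EOF
systemctl daemon-reload
systemctl restart docker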

Comment 10 Vivek Goyal 2016-08-24 18:28:16 UTC
Docker containers can't be removed right now most likely because their mount points have leaked into another application's mount namespace, such as systemd-machined's. There is a high chance that these containers can be cleaned up a little later. If we provide a cron job which periodically cleans up dead containers, will that help?

Also, on Fedora systemd-machined runs in the host mount namespace, so the mount point leak situation does not arise, while on RHEL systemd-machined seems to run in its own mount namespace.

If we modify RHEL 7's systemd-machined to run in the host mount namespace, it will reduce the possibility of facing dead containers.

Comment 11 Vivek Goyal 2016-08-24 19:09:23 UTC
The following entry in "crontab -e" worked for me. Thanks to Dan Walsh for the docker command.

0,10,20,30,40,50 * * * * docker rm $(docker ps -aq -f status=dead)

Comment 12 Vivek Goyal 2016-08-24 20:07:23 UTC
Rather, the following tries to clean up dead containers every 10 minutes.

*/10 * * * * docker rm $(docker ps -aq -f status=dead)

Comment 15 Vivek Goyal 2016-08-26 14:43:07 UTC
I think we can drop an hourly cron job script in /etc/cron.hourly/ to take care of cleaning up dead containers.

We probably require two scripts: one for docker and the other for docker-latest. For now I am testing docker-hourly.cron and docker-latest-hourly.cron on my system.

docker-hourly.cron
===================
#!/bin/bash

# Do nothing if the docker daemon is not running
if ! systemctl is-active --quiet docker; then
  exit 0
fi

# Try to cleanup dead containers
docker rm $(docker ps -aq -f status=dead)


docker-latest-hourly.cron
=========================
#!/bin/bash

# Do nothing if the docker-latest service is not active
if ! systemctl is-active --quiet docker-latest; then
  exit 0
fi

# Try to cleanup dead containers
docker rm $(docker ps -aq -f status=dead)

Comment 16 Daniel Walsh 2016-08-26 14:46:58 UTC
We can package it up in docker-common, and then only have one.  /usr/bin/docker will work for either docker or docker-latest.

Comment 17 Vivek Goyal 2016-08-26 15:18:46 UTC
That's fine too. I will modify the initial check to test that either docker or docker-latest is active, and otherwise do nothing.

Comment 18 Vivek Goyal 2016-08-26 17:20:10 UTC
The following works for me. Lokesh, will you be able to change the docker-common package to include a script named docker-hourly.cron, installed in the /etc/cron.hourly/ directory?

#!/bin/bash

# Do nothing if neither docker nor docker-latest service is running
#if ! systemctl is-active --quiet docker-latest docker; then
if ! systemctl --quiet is-active docker-latest && ! systemctl --quiet is-active docker; then
  exit 0
fi

# Try to cleanup dead containers
docker rm $(docker ps -aq -f status=dead)

Comment 20 Daniel Walsh 2016-08-26 17:39:18 UTC
#if ! systemctl is-active --quiet docker-latest docker; then
if ! systemctl --quiet is-active docker-latest && ! systemctl --quiet is-active docker; then

The first line should work. We need to pick one.

Comment 21 Vivek Goyal 2016-08-26 17:44:03 UTC
I tried "if ! systemctl is-active --quiet docker-latest docker;" with docker-latest running and docker not being installed. It was not working as written
in man pages. I was getting non-zero exit status. As per man page I should
get zero exit status as long as one of the services listed is active. Looks like
there is some bug.

So I switched to the current syntax.

Comment 22 Daniel Walsh 2016-08-26 17:56:46 UTC
That is fine, just remove the initial comment.

Comment 23 Vivek Goyal 2016-08-26 18:22:03 UTC
Here is the updated script.

#!/bin/bash

# Do nothing if neither docker nor docker-latest service is running
if ! systemctl --quiet is-active docker-latest && ! systemctl --quiet is-active docker; then
  exit 0
fi

# Try to cleanup dead containers
docker rm $(docker ps -aq -f status=dead)

Comment 24 Daniel Walsh 2016-08-26 19:04:12 UTC
LGTM

Comment 25 Eric Jones 2016-08-29 15:55:39 UTC
Hi, quick follow up question.

Is this a script that needs to be implemented somewhere in code? Or is this something that users experiencing this issue can use to correct the problem?

Comment 26 Daniel Walsh 2016-08-29 15:58:19 UTC
The script can be shipped as part of the docker package, but it is just a simple bash script. Nothing has to change in the code until we fix the kernel issue that is causing this problem.

Comment 27 Eric Jones 2016-08-29 18:15:15 UTC
So if a customer is currently experiencing this issue, they should be able to run `docker rm $(docker ps -aq -f status=dead)` to remove the stuck, dead containers? Or do they need to run anything first?

Comment 28 Vivek Goyal 2016-08-29 18:18:32 UTC
Customers can try running this. There is no guarantee that the dead containers will go away immediately; one needs to keep trying, and the hope is that over a period of time they will go away.
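
If you want to retry without waiting for a cron run, something like this rough loop is one option (the retry count and sleep interval are arbitrary):

# Retry removing dead containers a few times, pausing between attempts
for i in 1 2 3 4 5; do
  DEAD=$(docker ps -aq -f status=dead)
  [ -z "$DEAD" ] && break
  docker rm $DEAD
  sleep 60
done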

Comment 29 Eric Jones 2016-08-29 18:31:59 UTC
Okay, Thank you @Vivek

Comment 30 Daniel Walsh 2016-09-06 19:23:30 UTC
Lokesh, can we get this into the 7.3 release?

Comment 31 Vivek Goyal 2016-09-06 21:29:16 UTC
Use the following version of the script. I updated it so that it exits with status 0 if there are no dead containers.

#!/bin/bash

# Do nothing if neither docker nor docker-latest service is running
if ! systemctl --quiet is-active docker-latest && ! systemctl --quiet is-active docker; then
  exit 0
fi

# If there are no dead containers, exit.
DEAD_CONTAINERS=$(docker ps -aq -f status=dead)

[ -z "$DEAD_CONTAINERS" ] && exit 0

# Try to cleanup dead containers
docker rm $DEAD_CONTAINERS

Comment 33 Guohua Ouyang 2016-09-23 04:55:45 UTC
Checked that the script is included in docker-common-1.10.3-55.el7.x86_64.

Comment 36 errata-xmlrpc 2016-11-04 09:08:45 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHSA-2016-2634.html

