Bug 1367141 - docker not removing cgroups for exited containers
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: Red Hat Enterprise Linux 7
Classification: Red Hat
Component: docker
Version: 7.2
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: unspecified
Target Milestone: rc
Target Release: ---
Assignee: Matthew Heon
QA Contact: atomic-bugs@redhat.com
URL:
Whiteboard:
Depends On:
Blocks: 1328913 1357081 1496245
 
Reported: 2016-08-15 17:03 UTC by Seth Jennings
Modified: 2020-03-11 15:12 UTC
CC List: 4 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2017-06-30 15:01:37 UTC
Target Upstream Version:




Links
System ID Priority Status Summary Last Updated
Red Hat Bugzilla 1328913 urgent CLOSED Long running reliability tests show network errors on nodes 2021-01-20 06:05:38 UTC
Red Hat Bugzilla 1357081 medium CLOSED openshift node logs thousands of "du: cannot access" 2021-01-20 06:05:38 UTC

Internal Links: 1328913 1357081

Description Seth Jennings 2016-08-15 17:03:27 UTC
Description of problem:

Repeating errors in the cadvisor/kubelet/openshift-node process

Either:

Unable to get network stats from pid 128265: couldn't read network stats: failure opening /proc/128265/net/dev: open /proc/128265/net/dev: no such file or directory

or

failed to collect filesystem stats - du command failed on /rootfs/var/lib/docker/containers/a5febc665c196470ec125d7354103b707604476ad8c31255d54a53cdb9352b41 with output stdout: , stderr: du: cannot access '/rootfs/var/lib/docker/containers/a5febc665c196470ec125d7354103b707604476ad8c31255d54a53cdb9352b41': No such file or directory

These errors follow this error:

Failed to remove orphaned pod "c134c5fe-4d6c-11e6-b1cc-0e227273c3bd" dir; err: remove /var/lib/origin/openshift.local.volumes/pods/c134c5fe-4d6c-11e6-b1cc-0e227273c3bd/volumes/kubernetes.io~secret/deployer-token-0hr8m: device or resource busy

Or this error:

kernel: XFS (dm-95): Unmounting Filesystem
kernel: device-mapper: thin: Deletion of thin device 68560 failed.
forward-journal[18687]: time="2016-07-15T04:55:27.490877319Z" level=error msg="Handler for DELETE /v1.21/containers/5cec00e7355de03fdc8966895223776c1163875b92a7ed05734ea80ce6527436 returned error: Driver devicemapper failed to remove root filesystem 5cec00e7355de03fdc8966895223776c1163875b92a7ed05734ea80ce6527436: Device is Busy"
forward-journal[18687]: time="2016-07-15T04:55:27.491095598Z" level=error msg="HTTP Error" err="Driver devicemapper failed to remove root filesystem 5cec00e7355de03fdc8966895223776c1163875b92a7ed05734ea80ce6527436: Device is Busy" statusCode=500

Version-Release number of selected component (if applicable):
OpenShift 3.1

How reproducible:
No reproduction procedure; it appears to be a race condition.

Steps to Reproduce:
1.
2.
3.

Actual results:

The device mapper thin device fails to unmount, or pod directories fail to be removed, due to a "device busy" error.  As a result, the container's cgroup is not cleaned up properly, and cadvisor (and therefore kubelet and openshift-node) continues monitoring a container whose pid 1 has exited, leading to the failure messages in the logs.
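
A rough way to spot this condition is to compare exited containers against cgroup directories that are still present.  The sketch below is just that, a sketch; the systemd scope layout (docker-<id>.scope under system.slice) and the memory controller mount point are assumptions that may differ per host:

#!/usr/bin/env python
# Sketch: flag exited containers that still have a cgroup directory left behind.
# Assumes the systemd cgroup driver (docker-<id>.scope under system.slice) and
# the memory controller mounted at /sys/fs/cgroup/memory; adjust for your host.
import os
import subprocess

def exited_container_ids():
    # Full (untruncated) IDs are needed to match the cgroup scope names.
    out = subprocess.check_output(
        ["docker", "ps", "-aq", "--no-trunc", "--filter", "status=exited"])
    return out.decode().split()

def main():
    cgroup_root = "/sys/fs/cgroup/memory/system.slice"
    for cid in exited_container_ids():
        scope = os.path.join(cgroup_root, "docker-%s.scope" % cid)
        if os.path.isdir(scope):
            print("orphaned cgroup still present: %s" % scope)

if __name__ == "__main__":
    main()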

Expected results:

Ideally, we would not encounter the "device busy" errors that leave orphaned thin devices everywhere.  But in the case where the error does occur (due to a kernel bug, a race in docker, or anything else), docker should _always_ clean up the container's cgroup so that cadvisor stops monitoring the container.
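
For anyone hitting this, manually cleaning up an orphaned cgroup amounts to removing its now-empty per-controller directories.  A minimal sketch, assuming the same systemd scope layout as above; the container ID is taken from the log output in the description purely as an example:

#!/usr/bin/env python
# Sketch: remove leftover per-controller cgroup directories for one container.
# Only empty cgroups (no tasks, no children) can be removed, which is why
# rmdir (not a recursive delete) is used; it fails with EBUSY/ENOTEMPTY otherwise.
import glob
import os

CONTAINER_ID = "5cec00e7355de03fdc8966895223776c1163875b92a7ed05734ea80ce6527436"

def remove_orphaned_cgroup(cid):
    pattern = "/sys/fs/cgroup/*/system.slice/docker-%s.scope" % cid
    for path in glob.glob(pattern):
        try:
            os.rmdir(path)
            print("removed %s" % path)
        except OSError as err:
            print("could not remove %s: %s" % (path, err))

if __name__ == "__main__":
    remove_orphaned_cgroup(CONTAINER_ID)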

Additional info:
https://bugzilla.redhat.com/show_bug.cgi?id=1328913
https://bugzilla.redhat.com/show_bug.cgi?id=1357081

Comment 1 Seth Jennings 2016-08-15 17:15:35 UTC
It should be noted that the device mapper error causing the orphaned cgroup issue is only a theory and might not be right.  The real problem is the orphaned cgroup, which was confirmed here:

https://bugzilla.redhat.com/show_bug.cgi?id=1328913#c9

The reason the cgroup is not being removed is still unknown at this point.

Comment 3 Daniel Walsh 2016-10-18 13:51:21 UTC
Mrunal, any ideas here?

Comment 4 Mrunal Patel 2016-10-18 17:53:06 UTC
I think this is related to the issue of mounts leaking into machined (systemd-machined).
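
For reference, a rough way to check whether a container's devicemapper root filesystem is still visible in some other process's mount namespace is to search /proc/*/mountinfo for the container ID (the thin device name under /dev/mapper contains it).  A minimal sketch, assuming the devicemapper storage driver; the default ID below is the one from the "Device is Busy" error in the description:

#!/usr/bin/env python
# Sketch: list PIDs whose mount namespace still references a container's
# devicemapper root filesystem, by searching /proc/<pid>/mountinfo for the
# container ID. Pass a different ID on the command line if needed.
import glob
import sys

def holders(pattern):
    pids = []
    for path in glob.glob("/proc/[0-9]*/mountinfo"):
        try:
            with open(path) as f:
                if pattern in f.read():
                    pids.append(path.split("/")[2])
        except IOError:
            continue  # process exited while we were scanning
    return pids

if __name__ == "__main__":
    cid = sys.argv[1] if len(sys.argv) > 1 else \
        "5cec00e7355de03fdc8966895223776c1163875b92a7ed05734ea80ce6527436"
    print("PIDs with the mount still visible: %s"
          % (", ".join(holders(cid)) or "none"))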

Comment 5 Daniel Walsh 2017-06-30 15:01:37 UTC
This should be fixed by oci-umount, I believe.

