Bug 1367141

Summary: docker not removing cgroups for exited containers
Product: Red Hat Enterprise Linux 7 Reporter: Seth Jennings <sjenning>
Component: docker Assignee: Matthew Heon <mheon>
Status: CLOSED CURRENTRELEASE QA Contact: atomic-bugs <atomic-bugs>
Severity: unspecified Docs Contact:
Priority: unspecified    
Version: 7.2 CC: dwalsh, jshivers, lsm5, mpatel
Target Milestone: rc Keywords: Extras
Target Release: ---   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2017-06-30 15:01:37 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 1328913, 1357081, 1496245    

Description Seth Jennings 2016-08-15 17:03:27 UTC
Description of problem:

Repeating errors in cadvisor/kubelet/openshift-node process

Either:

Unable to get network stats from pid 128265: couldn't read network stats: failure opening /proc/128265/net/dev: open /proc/128265/net/dev: no such file or directory

or

failed to collect filesystem stats - du command failed on /rootfs/var/lib/docker/containers/a5febc665c196470ec125d7354103b707604476ad8c31255d54a53cdb9352b41 with output stdout: , stderr: du: cannot access '/rootfs/var/lib/docker/containers/a5febc665c196470ec125d7354103b707604476ad8c31255d54a53cdb9352b41': No such file or directory

These errors follow this error:

Failed to remove orphaned pod "c134c5fe-4d6c-11e6-b1cc-0e227273c3bd" dir; err: remove /var/lib/origin/openshift.local.volumes/pods/c134c5fe-4d6c-11e6-b1cc-0e227273c3bd/volumes/kubernetes.io~secret/deployer-token-0hr8m: device or resource busy

Or this error:

kernel: XFS (dm-95): Unmounting Filesystem
kernel: device-mapper: thin: Deletion of thin device 68560 failed.
forward-journal[18687]: time="2016-07-15T04:55:27.490877319Z" level=error msg="Handler for DELETE /v1.21/containers/5cec00e7355de03fdc8966895223776c1163875b92a7ed05734ea80ce6527436 returned error: Driver devicemapper failed to remove root filesystem 5cec00e7355de03fdc8966895223776c1163875b92a7ed05734ea80ce6527436: Device is Busy"
forward-journal[18687]: time="2016-07-15T04:55:27.491095598Z" level=error msg="HTTP Error" err="Driver devicemapper failed to remove root filesystem 5cec00e7355de03fdc8966895223776c1163875b92a7ed05734ea80ce6527436: Device is Busy" statusCode=500

Version-Release number of selected component (if applicable):
Openshift 3.1

How reproducible:
No reproduction procedure.  Seems to be a race.

Steps to Reproduce:
1.
2.
3.

Actual results:

Unmounting the device mapper thin device, or removing the pod directories, fails with a "device busy" error.  As a result, the container's cgroup is not cleaned up properly, so cadvisor (and therefore kubelet and openshift-node) keeps monitoring a container whose pid 1 has exited, which produces the failure messages in the logs.
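
One way to spot the leftover cgroups described above is to look for container cgroup directories that no longer list any processes. The following is a minimal sketch, not a confirmed diagnostic for this bug. It assumes a cgroup v1 hierarchy, container cgroups named with a "docker-" prefix (as with the systemd cgroup driver), and the memory controller mounted at /sys/fs/cgroup/memory/system.slice. The function name and those paths are illustrative assumptions, not taken from this report.

```python
import os


def find_empty_cgroups(root, prefix="docker-"):
    """Return cgroup directories under `root` whose name starts with
    `prefix` and whose cgroup.procs file lists no processes.

    The "docker-" prefix is an assumption about how container cgroups
    are named on the host; adjust it for your setup.
    """
    orphans = []
    for dirpath, _dirnames, filenames in os.walk(root):
        if not os.path.basename(dirpath).startswith(prefix):
            continue
        if "cgroup.procs" not in filenames:
            continue
        try:
            with open(os.path.join(dirpath, "cgroup.procs")) as f:
                pids = f.read().split()
        except OSError:
            # The cgroup may have been removed while we were scanning.
            continue
        if not pids:
            orphans.append(dirpath)
    return sorted(orphans)


if __name__ == "__main__":
    # Assumed location of container cgroups; not confirmed by this report.
    for path in find_empty_cgroups("/sys/fs/cgroup/memory/system.slice"):
        print(path)
```

An empty cgroup.procs only shows that no process is left in the cgroup. A container that is in the middle of starting or stopping could match briefly, so a hit is a candidate to compare against `docker ps -a`, not proof of a leak.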

Expected results:

Ideally, we would not encounter the "device busy" errors which leave orphaned thin devices everywhere.  But if the error does occur (due to a kernel bug, race in docker, or whatever), docker should _always_ clean up the container cgroup so that cadvisor will stop monitoring the container.

Additional info:
https://bugzilla.redhat.com/show_bug.cgi?id=1328913
https://bugzilla.redhat.com/show_bug.cgi?id=1357081

Comment 1 Seth Jennings 2016-08-15 17:15:35 UTC
It should be noted that the device mapper error causing the orphaned cgroup issue is a theory.  Might not be right.  The real problem is the orphaned cgroup, which was confirmed here:

https://bugzilla.redhat.com/show_bug.cgi?id=1328913#c9

The reason the cgroup is not being removed is really unknown at this point.

Comment 3 Daniel Walsh 2016-10-18 13:51:21 UTC
Mrunal, do you have any ideas here?

Comment 4 Mrunal Patel 2016-10-18 17:53:06 UTC
I think this is related to the issue of mounts leaking into machined.
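
If a mount has leaked into another process's mount namespace, that process should still show it in /proc/<pid>/mountinfo. The sketch below scans every process for a mount point containing a given substring. It is an illustration of the theory above, not a verified reproduction. The function name is hypothetical, and which substring to search for (a container ID, or a pod volume name from the logs) is an assumption about how the mount is named on the host.

```python
import os


def pids_holding_mount(needle, proc_root="/proc"):
    """Return sorted PIDs whose mountinfo has a mount point containing `needle`."""
    holders = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # skip non-process entries such as "self" or "sys"
        path = os.path.join(proc_root, entry, "mountinfo")
        try:
            with open(path) as f:
                lines = f.readlines()
        except OSError:
            # The process exited, or we lack permission to read it.
            continue
        for line in lines:
            fields = line.split()
            # The fifth field (index 4) of a mountinfo line is the mount point.
            if len(fields) > 4 and needle in fields[4]:
                holders.append(int(entry))
                break
    return sorted(holders)


if __name__ == "__main__":
    # Volume name taken from the "device or resource busy" log line above.
    print(pids_holding_mount("deployer-token-0hr8m"))
```

Run as root so that every process's mountinfo is readable. Any PID reported that does not belong to the container itself is a candidate for holding the mount busy.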

Comment 5 Daniel Walsh 2017-06-30 15:01:37 UTC
Should be fixed by oci-umount, I believe.