1367141 – docker not removing cgroups for exited containers

Keywords:
Status:	CLOSED CURRENTRELEASE
Alias:	None
Product:	Red Hat Enterprise Linux 7
Classification:	Red Hat
Component:	docker
Sub Component:
Version:	7.2
Hardware:	Unspecified
OS:	Unspecified
Priority:	unspecified
Severity:	unspecified
Target Milestone:	rc
Target Release:	---
Assignee:	Matthew Heon
QA Contact:	atomic-bugs@redhat.com
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:	1328913 1357081 1496245

Reported:	2016-08-15 17:03 UTC by Seth Jennings
Modified:	2020-03-11 15:12 UTC
CC List:	4 users
Fixed In Version:
Doc Type:	If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed:	2017-06-30 15:01:37 UTC
Target Upstream Version:
Embargoed:

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Red Hat Bugzilla	1328913	0	urgent	CLOSED	Long running reliability tests show network errors on nodes	2021-02-22 00:41:40 UTC
Red Hat Bugzilla	1357081	0	medium	CLOSED	openshift node logs thousands of "du: cannot access"	2021-02-22 00:41:40 UTC

Internal Links: 1328913 1357081

Description Seth Jennings 2016-08-15 17:03:27 UTC

Description of problem:

Repeating errors in cadvisor/kubelet/openshift-node process

Either:

Unable to get network stats from pid 128265: couldn't read network stats: failure opening /proc/128265/net/dev: open /proc/128265/net/dev: no such file or directory

or

failed to collect filesystem stats - du command failed on /rootfs/var/lib/docker/containers/a5febc665c196470ec125d7354103b707604476ad8c31255d54a53cdb9352b41 with output stdout: , stderr: du: cannot access '/rootfs/var/lib/docker/containers/a5febc665c196470ec125d7354103b707604476ad8c31255d54a53cdb9352b41': No such file or directory

Follows this error:

Failed to remove orphaned pod "c134c5fe-4d6c-11e6-b1cc-0e227273c3bd" dir; err: remove /var/lib/origin/openshift.local.volumes/pods/c134c5fe-4d6c-11e6-b1cc-0e227273c3bd/volumes/kubernetes.io~secret/deployer-token-0hr8m: device or resource busy

Or this error:

kernel: XFS (dm-95): Unmounting Filesystem
kernel: device-mapper: thin: Deletion of thin device 68560 failed.
forward-journal[18687]: time="2016-07-15T04:55:27.490877319Z" level=error msg="Handler for DELETE /v1.21/containers/5cec00e7355de03fdc8966895223776c1163875b92a7ed05734ea80ce6527436 returned error: Driver devicemapper failed to remove root filesystem 5cec00e7355de03fdc8966895223776c1163875b92a7ed05734ea80ce6527436: Device is Busy"
forward-journal[18687]: time="2016-07-15T04:55:27.491095598Z" level=error msg="HTTP Error" err="Driver devicemapper failed to remove root filesystem 5cec00e7355de03fdc8966895223776c1163875b92a7ed05734ea80ce6527436: Device is Busy" statusCode=500

Version-Release number of selected component (if applicable):
Openshift 3.1

How reproducible:
No reproduction procedure.  Seems to be a race.

Steps to Reproduce:
1.
2.
3.

Actual results:

Unmounting the device mapper thin device, or removing the pod directories, fails with a "device busy" error.  As a result, the container's cgroup is not cleaned up properly, so cadvisor (and therefore kubelet and openshift-node) keeps monitoring a container whose pid 1 has exited, which leads to the failure messages in the logs.

Expected results:

Ideally, we would not encounter the "device busy" errors, which leave orphaned thin devices everywhere.  But if the error does occur (due to a kernel bug, a race in docker, or whatever), docker should _always_ clean up the container cgroup so that cadvisor stops monitoring the container.
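
As a rough diagnostic for the condition described above, the following Python sketch lists leaf cgroup directories that no longer contain any process. The function name `find_empty_cgroups` and the example root `/sys/fs/cgroup/memory/system.slice` are assumptions for illustration (the actual hierarchy depends on the cgroup driver and controller in use); nothing here is part of docker or cadvisor.

```python
import os


def find_empty_cgroups(root):
    """Return leaf cgroup directories under root whose cgroup.procs lists no PIDs."""
    empty = []
    for dirpath, dirnames, filenames in os.walk(root):
        # Skip the root itself, non-leaf cgroups, and directories without a procs file.
        if dirpath == root or dirnames or "cgroup.procs" not in filenames:
            continue
        with open(os.path.join(dirpath, "cgroup.procs")) as f:
            if not f.read().split():
                empty.append(dirpath)
    return sorted(empty)


if __name__ == "__main__":
    # Assumed location; adjust for the controller and cgroup driver in use.
    for path in find_empty_cgroups("/sys/fs/cgroup/memory/system.slice"):
        print(path)
```

An empty leaf cgroup whose name carries the ID of an exited container would be a candidate for the orphaned state reported here. The sketch only reports such directories and does not remove anything.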

Additional info:
https://bugzilla.redhat.com/show_bug.cgi?id=1328913
https://bugzilla.redhat.com/show_bug.cgi?id=1357081

Comment 1 Seth Jennings 2016-08-15 17:15:35 UTC

It should be noted that the device mapper error causing the orphaned cgroup issue is only a theory and might not be right.  The real problem is the orphaned cgroup, which was confirmed here:

https://bugzilla.redhat.com/show_bug.cgi?id=1328913#c9

The reason the cgroup is not being removed is really unknown at this point.

Comment 3 Daniel Walsh 2016-10-18 13:51:21 UTC

Mrunal, do you have any ideas here?

Comment 4 Mrunal Patel 2016-10-18 17:53:06 UTC

I think that this is related to the issue of mounts leaking into machined.
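
One way to check this theory is to look for processes whose mount namespace still contains a mount for the container. The following is a minimal sketch, assuming the container ID appears in the mount point path. `mounts_matching` and `pids_holding` are hypothetical helper names, and the field position follows the /proc/[pid]/mountinfo layout documented in proc(5).

```python
import os


def mounts_matching(mountinfo_text, needle):
    """Return mount points from mountinfo text whose path contains needle."""
    hits = []
    for line in mountinfo_text.splitlines():
        fields = line.split()
        # The fifth field (index 4) is the mount point.
        if len(fields) > 4 and needle in fields[4]:
            hits.append(fields[4])
    return hits


def pids_holding(needle, proc_root="/proc"):
    """Map each PID to the matching mount points visible in its mount namespace."""
    result = {}
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue
        try:
            with open(os.path.join(proc_root, entry, "mountinfo")) as f:
                hits = mounts_matching(f.read(), needle)
        except OSError:
            # The process exited or is not readable; skip it.
            continue
        if hits:
            result[int(entry)] = hits
    return result
```

Calling `pids_holding` with a container ID (run as root so that every process is readable) would show which processes still see that container's mounts and could therefore be keeping the device busy.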

Comment 5 Daniel Walsh 2017-06-30 15:01:37 UTC

Should be fixed by oci-umount I believe.
