Note: This bug is displayed in read-only format because
the product is no longer active in Red Hat Bugzilla.
RHEL Engineering is moving the tracking of its product development work on RHEL 6 through RHEL 9 to Red Hat Jira (issues.redhat.com). If you're a Red Hat customer, please continue to file support cases via the Red Hat customer portal. If you're not, please head to the "RHEL project" in Red Hat Jira and file new tickets here. Individual Bugzilla bugs in the statuses "NEW", "ASSIGNED", and "POST" are being migrated throughout September 2023. Bugs of Red Hat partners with an assigned Engineering Partner Manager (EPM) are migrated in late September as per pre-agreed dates. Bugs against components "kernel", "kernel-rt", and "kpatch" are only migrated if still in "NEW" or "ASSIGNED". If you cannot log in to RH Jira, please consult article #7032570. That failing, please send an e-mail to the RH Jira admins at rh-issues@redhat.com to troubleshoot your issue as a user management inquiry. The email creates a ServiceNow ticket with Red Hat. Individual Bugzilla bugs that are migrated will be moved to status "CLOSED", resolution "MIGRATED", and set with "MigratedToJIRA" in "Keywords". The link to the successor Jira issue will be found under "Links", have a little "two-footprint" icon next to it, and direct you to the "RHEL project" in Red Hat Jira (issue links are of type "https://issues.redhat.com/browse/RHEL-XXXX", where "X" is a digit). This same link will be available in a blue banner at the top of the page informing you that that bug has been migrated.
Description of problem:
Repeating errors in cadvisor/kubelet/openshift-node process
Either:
Unable to get network stats from pid 128265: couldn't read network stats: failure opening /proc/128265/net/dev: open /proc/128265/net/dev: no such file or directory
or
failed to collect filesystem stats - du command failed on /rootfs/var/lib/docker/containers/a5febc665c196470ec125d7354103b707604 476ad8c31255d54a53cdb9352b41 with output stdout: , stderr: du: cannot access '/rootfs/var/lib/docker/containers/a5febc665c196470ec125d7354103b707604476ad8c31255d54a53cdb9352b41': No such file or directory
Follows this error:
Failed to remove orphaned pod "c134c5fe-4d6c-11e6-b1cc-0e227273c3bd" dir; err: remove /var/lib/origin/openshift.local.volumes/pods/c134c5fe-4d6c-11e6-b1cc-0e227273c3bd/volumes/kubernetes.io~secret/deployer-token-0hr8m: device or resource busy
Or this error:
kernel: XFS (dm-95): Unmounting Filesystem
kernel: device-mapper: thin: Deletion of thin device 68560 failed.
forward-journal[18687]: time="2016-07-15T04:55:27.490877319Z" level=error msg="Handler for DELETE /v1.21/containers/5cec00e7355de03fdc8966895223776c1163875b92a7ed05734ea80ce6527436 returned error: Driver devicemapper failed to remove root filesystem 5cec00e7355de03fdc8966895223776c1163875b92a7ed05734ea80ce6527436: Device is Busy"
forward-journal[18687]: time="2016-07-15T04:55:27.491095598Z" level=error msg="HTTP Error" err="Driver devicemapper failed to remove root filesystem 5cec00e7355de03fdc8966895223776c1163875b92a7ed05734ea80ce6527436: Device is Busy" statusCode=500
Version-Release number of selected component (if applicable):
Openshift 3.1
How reproducible:
No reproduction procedure. Seems to be a race.
Steps to Reproduce:
1.
2.
3.
Actual results:
The device mapper thin device fails to unmount or remove pod directories due to a "device busy" error. This causes the container's cgroup to not be cleaned up properly. This causes cadvisor (and therefore kubelet and openshift-node) to continue monitoring for a container whose pid 1 has exited, leading to failure messages in the logs.
Expected results:
Ideally, we would not encounter the "device busy" errors which leave orphaned thin devices everywhere. But in the case the error does occur (due to a kernel bug, race in docker, or whatever), docker should _always_ clean up the container cgroup so that cadvisor will stop monitoring the container.
Additional info:
https://bugzilla.redhat.com/show_bug.cgi?id=1328913https://bugzilla.redhat.com/show_bug.cgi?id=1357081
Should be noted that the device mapper error causing the orphaned cgroup issue is a theory. Might not be right. The real problem is the orphaned cgoup which was confirmed here:
https://bugzilla.redhat.com/show_bug.cgi?id=1328913#c9
The reason the cgroup is not being removed is really unknown at this point.