Description of problem:
Repeating errors in the cadvisor/kubelet/openshift-node process:
Unable to get network stats from pid 128265: couldn't read network stats: failure opening /proc/128265/net/dev: open /proc/128265/net/dev: no such file or directory
failed to collect filesystem stats - du command failed on /rootfs/var/lib/docker/containers/a5febc665c196470ec125d7354103b707604476ad8c31255d54a53cdb9352b41 with output stdout: , stderr: du: cannot access '/rootfs/var/lib/docker/containers/a5febc665c196470ec125d7354103b707604476ad8c31255d54a53cdb9352b41': No such file or directory
These errors follow this error:
Failed to remove orphaned pod "c134c5fe-4d6c-11e6-b1cc-0e227273c3bd" dir; err: remove /var/lib/origin/openshift.local.volumes/pods/c134c5fe-4d6c-11e6-b1cc-0e227273c3bd/volumes/kubernetes.io~secret/deployer-token-0hr8m: device or resource busy
Or this error:
kernel: XFS (dm-95): Unmounting Filesystem
kernel: device-mapper: thin: Deletion of thin device 68560 failed.
forward-journal: time="2016-07-15T04:55:27.490877319Z" level=error msg="Handler for DELETE /v1.21/containers/5cec00e7355de03fdc8966895223776c1163875b92a7ed05734ea80ce6527436 returned error: Driver devicemapper failed to remove root filesystem 5cec00e7355de03fdc8966895223776c1163875b92a7ed05734ea80ce6527436: Device is Busy"
forward-journal: time="2016-07-15T04:55:27.491095598Z" level=error msg="HTTP Error" err="Driver devicemapper failed to remove root filesystem 5cec00e7355de03fdc8966895223776c1163875b92a7ed05734ea80ce6527436: Device is Busy" statusCode=500
Version-Release number of selected component (if applicable):

Steps to Reproduce:
No reproduction procedure. Seems to be a race.

Additional info:
The device mapper thin device fails to unmount, or pod directories fail to be removed, due to a "device busy" error. As a result, the container's cgroup is not cleaned up properly, so cadvisor (and therefore the kubelet and openshift-node) continues monitoring a container whose pid 1 has exited, producing the failure messages in the logs.
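A minimal way to confirm such an orphaned cgroup for an exited container (a sketch assuming the systemd cgroup driver; <container-id> is a placeholder, and the paths may differ with other drivers):

  # the cgroup directory is still present after pid 1 exited
  ls -d /sys/fs/cgroup/memory/system.slice/docker-<container-id>.scope

  # and it holds no live pids
  cat /sys/fs/cgroup/memory/system.slice/docker-<container-id>.scope/cgroup.procs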
Ideally, we would not encounter the "device busy" errors, which leave orphaned thin devices everywhere. But when the error does occur (due to a kernel bug, a race in docker, or whatever), docker should _always_ clean up the container cgroup so that cadvisor stops monitoring the container.
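As a stopgap only (not the cleanup-in-docker fix argued for above), an empty leftover cgroup can be removed by hand; rmdir fails unless the cgroup is empty, so this sketch is safe to attempt (<container-id> is again a placeholder, paths assume the systemd cgroup driver):

  for d in /sys/fs/cgroup/*/system.slice/docker-<container-id>.scope; do
      # only remove the cgroup if no pids remain in it
      [ -z "$(cat "$d/cgroup.procs" 2>/dev/null)" ] && rmdir "$d"
  done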
It should be noted that the device mapper error causing the orphaned cgroup issue is a theory; it might not be right. The real problem is the orphaned cgroup, which was confirmed here:
The reason the cgroup is not being removed is really unknown at this point.
Mrunal, any ideas here?
I think that this is related to the mounts leaking into machined issue.
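One quick check for that theory (assuming systemd-machined is holding the leaked mounts in its mount namespace; adjust the grep pattern as needed):

  # does machined's mount namespace still hold the container mounts?
  grep devicemapper /proc/$(pidof systemd-machined)/mountinfo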
Should be fixed by oci-umount I believe.