Bug 1357081

Summary: openshift node logs thousands of "du: cannot access"
Product: OpenShift Container Platform
Reporter: Sten Turpin <sten>
Component: Node
Assignee: Seth Jennings <sjenning>
Status: CLOSED INSUFFICIENT_DATA
QA Contact: DeShuai Ma <dma>
Severity: low
Priority: medium
Version: 3.2.0
CC: agoldste, aos-bugs, dma, eparis, jokerman, mmccomas, sjenning
Target Milestone: ---
Target Release: ---
Hardware: Unspecified
OS: Unspecified
Last Closed: 2017-05-24 14:10:14 UTC
Type: Bug
Bug Depends On: 1367141
Bug Blocks: 1303130

Description Sten Turpin 2016-07-15 16:17:28 UTC
Description of problem: openshift node logs thousands of "du: cannot access"


Version-Release number of selected component (if applicable): atomic-openshift-node-3.2.0.44-1.git.0.a4463d9.el7.x86_64


How reproducible: always


Steps to Reproduce:
1. run pods on an openshift node with the above version (also noted on earlier versions)

Actual results:
the below error is logged roughly 6000 times/day on a low-utilization node

Expected results:
this error should be handled

Additional info:
example of a complete log entry:
Jul 15 14:11:52 ip-172-31-3-96.ec2.internal atomic-openshift-node[7724]: E0715 14:11:52.590669    7724 fsHandler.go:106] failed to collect filesystem stats - du command failed on /var/lib/docker/containers/cc8af30222fa33248476635981347a56bb10e7014cd8546f75fe2d6fcb301740 with output stdout: , stderr: du: cannot access ‘/var/lib/docker/containers/cc8af30222fa33248476635981347a56bb10e7014cd8546f75fe2d6fcb301740’: No such file or directory
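
For reference, a rough way to count how often this shows up on a node per day (assuming the node logs go through journald under the atomic-openshift-node unit, as in the entry above):

# rough count of today's du failures on this node
journalctl -u atomic-openshift-node --since today | grep -c 'du: cannot access'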

Comment 1 Andy Goldstein 2016-07-15 16:19:11 UTC
I believe this is fixed in 3.3.

Comment 2 DeShuai Ma 2016-07-18 08:53:53 UTC
Tested on OpenShift 3.3. The error still occurs, but far less often; on one of my nodes there are only two related entries.
openshift v3.3.0.6
kubernetes v1.3.0+57fb9ac
etcd 2.3.0+git

Jul 18 03:51:49 ip-172-18-10-95 atomic-openshift-node: E0718 03:51:49.519355   17412 fsHandler.go:106] failed to collect filesystem stats - du command failed on /rootfs/var/lib/docker/containers/a5febc665c196470ec125d7354103b707604476ad8c31255d54a53cdb9352b41 with output stdout: , stderr: du: cannot access '/rootfs/var/lib/docker/containers/a5febc665c196470ec125d7354103b707604476ad8c31255d54a53cdb9352b41': No such file or directory

Jul 18 04:48:48 ip-172-18-10-95 atomic-openshift-node: E0718 04:48:48.245490   17412 fsHandler.go:106] failed to collect filesystem stats - du command failed on /rootfs/var/lib/docker/containers/27d314be763347414adef76078fc057cc0490aa6a2e3b818df6deef883d46c41 with output stdout: , stderr: du: cannot access '/rootfs/var/lib/docker/containers/27d314be763347414adef76078fc057cc0490aa6a2e3b818df6deef883d46c41': No such file or directory

Comment 3 Andy Goldstein 2016-07-18 13:44:06 UTC
Could you please paste the output of `sudo docker info`?

Also, do you know what those 2 containers are - what were they running, did OpenShift start them, etc?

Comment 4 DeShuai Ma 2016-07-19 07:40:09 UTC
(In reply to Andy Goldstein from comment #3)
> Could you please paste the output of `sudo docker info`?
> 
> Also, do you know what those 2 containers are - what were they running, did
> OpenShift start them, etc?

Those containers were created by OpenShift. For example, I created 10 pods from the same image and then deleted them; only after that did I see this error. A sample log entry and a rough reproduction sketch follow.
Jul 19 00:53:53 ip-172-18-4-118 atomic-openshift-node: E0719 00:53:53.669042    8978 fsHandler.go:106] failed to collect filesystem stats - du command failed on /var/lib/docker/containers/319d3d8950094b7c1d7147ca74e6663b839a3484752e0e154b1dbff71f6c6e25 with output stdout: , stderr: du: cannot access ‘/var/lib/docker/containers/319d3d8950094b7c1d7147ca74e6663b839a3484752e0e154b1dbff71f6c6e25’: No such file or directory
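
A rough sketch of that kind of reproduction, assuming a throwaway project named dma and an arbitrary image (the project name, pod name, and image here are illustrative, not the exact objects used above):

# create 10 pods from the same image in a test project, then delete them
oc new-project dma
oc run besteffort --image=registry.access.redhat.com/rhel7 --replicas=10 -n dma
oc get pods -n dma -w                  # wait until all pods are Running
oc delete all --all -n dma             # tear down the pods and their controller
# watch the node log for the du error while the containers are removed
journalctl -u atomic-openshift-node -f | grep 'du: cannot access'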


[root@ip-172-18-4-118 containers]# docker info
Containers: 0
 Running: 0
 Paused: 0
 Stopped: 0
Images: 5
Server Version: 1.10.3
Storage Driver: devicemapper
 Pool Name: docker-202:2-67110041-pool
 Pool Blocksize: 65.54 kB
 Base Device Size: 10.74 GB
 Backing Filesystem: xfs
 Data file: /dev/loop0
 Metadata file: /dev/loop1
 Data Space Used: 1.692 GB
 Data Space Total: 107.4 GB
 Data Space Available: 22.64 GB
 Metadata Space Used: 2.13 MB
 Metadata Space Total: 2.147 GB
 Metadata Space Available: 2.145 GB
 Udev Sync Supported: true
 Deferred Removal Enabled: false
 Deferred Deletion Enabled: false
 Deferred Deleted Device Count: 0
 Data loop file: /var/lib/docker/devicemapper/devicemapper/data
 WARNING: Usage of loopback devices is strongly discouraged for production use. Either use `--storage-opt dm.thinpooldev` or use `--storage-opt dm.no_warn_on_loop_devices=true` to suppress this warning.
 Metadata loop file: /var/lib/docker/devicemapper/devicemapper/metadata
 Library Version: 1.02.107-RHEL7 (2016-06-09)
Execution Driver: native-0.2
Logging Driver: json-file
Plugins: 
 Volume: local
 Network: null host bridge
 Authorization: rhel-push-plugin
Kernel Version: 3.10.0-327.22.2.el7.x86_64
Operating System: Red Hat Enterprise Linux Server 7.2 (Maipo)
OSType: linux
Architecture: x86_64
Number of Docker Hooks: 2
CPUs: 1
Total Memory: 3.518 GiB
Name: ip-172-18-4-118.ec2.internal
ID: FHHR:XIJ3:XQV7:U5I2:VMF5:PNF6:OJYZ:MV33:PR4T:T3U6:SQ74:4WJV
WARNING: bridge-nf-call-iptables is disabled
WARNING: bridge-nf-call-ip6tables is disabled
Registries: registry.qe.openshift.com (insecure), registry.access.redhat.com (secure), docker.io (secure)

Comment 5 DeShuai Ma 2016-07-19 07:45:12 UTC
Sometimes, when deleting all the pods in a project, the node also logs a "device or resource busy" error.
1. Delete all the pods in the project:
[root@ip-172-18-1-64 ~]# oc delete `oc get pod -n dma -o name` -n dma
pod "besteffort-3iap2" deleted
pod "besteffort-95ct6" deleted
pod "besteffort-9n99g" deleted
pod "besteffort-bpzy0" deleted
pod "besteffort-hdxs8" deleted
pod "besteffort-jkyrd" deleted
pod "besteffort-lqmga" deleted
pod "besteffort-n72uy" deleted
pod "besteffort-qc3xu" deleted
pod "besteffort-ttt3j" deleted

2. Check the node logs:
Jul 19 03:40:57 ip-172-18-4-118 atomic-openshift-node: I0719 03:40:57.451528   35151 kubelet.go:2043] Failed to remove orphaned pod "c134c5fe-4d6c-11e6-b1cc-0e227273c3bd" dir; err: remove /var/lib/origin/openshift.local.volumes/pods/c134c5fe-4d6c-11e6-b1cc-0e227273c3bd/volumes/kubernetes.io~secret/deployer-token-0hr8m: device or resource busy

http://pastebin.test.redhat.com/393492
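
If it helps narrow this down, a quick, generic way to check whether the secret volume's tmpfs mount is still held on the node (the pod UID below is the one from the log entry above):

# a leftover mount under the pod's volume directory would explain the EBUSY on remove
grep openshift.local.volumes /proc/mounts
findmnt -R /var/lib/origin/openshift.local.volumes/pods/c134c5fe-4d6c-11e6-b1cc-0e227273c3bd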

Comment 6 Seth Jennings 2016-08-12 19:27:11 UTC
DeShuai, looking for some clarity here.  Does the du error message continue indefinitely for a particular pod or does it go away within a short period of time?

Comment 7 Seth Jennings 2016-08-12 19:49:13 UTC
I am pretty sure this is related to the orphaning of docker cgroups; see bug 1328913.

The suspected cause of the orphaning is the DM error mentioned in comment #5.

This can be confirmed using the following method:

Take an error like the one described above:

Jul 19 00:53:53 ip-172-18-4-118 atomic-openshift-node: E0719 00:53:53.669042    8978 fsHandler.go:106] failed to collect filesystem stats - du command failed on /var/lib/docker/containers/319d3d8950094b7c1d7147ca74e6663b839a3484752e0e154b1dbff71f6c6e25 with output stdout: , stderr: du: cannot access ‘/var/lib/docker/containers/319d3d8950094b7c1d7147ca74e6663b839a3484752e0e154b1dbff71f6c6e25’: No such file or directory

If the path

/sys/fs/cgroup/cpu,cpuacct/system.slice/319d3d8950094b7c1d7147ca74e6663b839a3484752e0e154b1dbff71f6c6e25

exists, then it is the same problem.

Additional confirmation is that /sys/fs/cgroup/cpu,cpuacct/system.slice/319d3d8950094b7c1d7147ca74e6663b839a3484752e0e154b1dbff71f6c6e25/tasks is empty (no tasks remain in the cgroup).
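
As a concrete sketch of that check (the container ID is the one from the log quoted above; substitute the ID from your own log):

CID=319d3d8950094b7c1d7147ca74e6663b839a3484752e0e154b1dbff71f6c6e25
CG=/sys/fs/cgroup/cpu,cpuacct/system.slice/$CID
# the container is gone from docker, but its cgroup directory is still present
test -d "$CG" && echo "orphaned cgroup present"
# additional confirmation: the cgroup has no tasks left in it
test -d "$CG" && wc -l < "$CG/tasks"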

Comment 8 Seth Jennings 2017-05-05 14:37:58 UTC
Still blocked on
https://bugzilla.redhat.com/show_bug.cgi?id=1367141

Comment 9 Seth Jennings 2017-05-24 14:10:14 UTC
Trail has gone cold on this.  Docker 1.10 is old.  No associated customer issue.  Closing.