Created attachment 1425612 [details]
atomic-openshift-node.service log

Description of problem:
The runtime imageFs value in node stats/summary is wrong; this will cause image GC to never be triggered, even when the runtime image space usage ratio is high.

Version-Release number of selected component (if applicable):
openshift v3.10.0-0.27.0
kubernetes v1.10.0+b81c8f8
docker 1.13.1 with devicemapper
Red Hat Enterprise Linux Server release 7.5 (Maipo)

How reproducible:
Always

Steps to Reproduce:
1. Get node stats/summary
$ oc get --raw /api/v1/nodes/ip-172-18-3-124.ec2.internal/proxy/stats/summary
   "fs": {
    "time": "2018-04-23T09:05:58Z",
    "availableBytes": 15427821568,
    "capacityBytes": 19316867072,
    "usedBytes": 3889045504,
    "inodesFree": 9391639,
    "inodes": 9437184,
    "inodesUsed": 45545
   },
   "runtime": {
    "imageFs": {
     "time": "2018-04-23T09:05:58Z",
     "availableBytes": 15427821568,
     "capacityBytes": 19316867072,
     "usedBytes": 34309890497,
     "inodesFree": 9391639,
     "inodes": 9437184,
     "inodesUsed": 45545
    }
   },

2. On node "ip-172-18-3-124.ec2.internal" check the docker info
[root@ip-172-18-3-124 ~]# docker info
Containers: 16
 Running: 12
 Paused: 0
 Stopped: 4
Images: 68
Server Version: 1.13.1
Storage Driver: devicemapper
 Pool Name: rhel-docker--pool
 Pool Blocksize: 524.3 kB
 Base Device Size: 10.74 GB
 Backing Filesystem: xfs
 Data file:
 Metadata file:
 Data Space Used: 20.75 GB
 Data Space Total: 22.64 GB
 Data Space Available: 1.895 GB
 Metadata Space Used: 4.149 MB
 Metadata Space Total: 46.14 MB
 Metadata Space Available: 41.99 MB
 Thin Pool Minimum Free Space: 2.264 GB
 Udev Sync Supported: true
 Deferred Removal Enabled: true
 Deferred Deletion Enabled: true
 Deferred Deleted Device Count: 0
 Library Version: 1.02.146-RHEL7 (2018-01-22)
Logging Driver: json-file
Cgroup Driver: systemd
Plugins:
 Volume: local
 Network: bridge host macvlan null overlay
 Authorization: rhel-push-plugin
Swarm: inactive
Runtimes: docker-runc runc
Default Runtime: docker-runc
Init Binary: /usr/libexec/docker/docker-init-current
containerd version:  (expected: aa8187dbd3b7ad67d8e5e3a15115d3eef43a7ed1)
runc version: e9c345b3f906d5dc5e8100b05ce37073a811c74a (expected: 9df8b306d01f59d3a8029be411de015b7304dd8f)
init version: 5b117de7f824f3d3825737cf09581645abbe35d4 (expected: 949e6facb77383876aeff8a6944dde66b3089574)
Security Options: seccomp
 WARNING: You're not using the default seccomp profile
 Profile: /etc/docker/seccomp.json
 selinux
Kernel Version: 3.10.0-862.el7.x86_64
Operating System: Red Hat Enterprise Linux Server 7.5 (Maipo)
OSType: linux
Architecture: x86_64
Number of Docker Hooks: 3
CPUs: 8
Total Memory: 31.26 GiB
Name: ip-172-18-3-124.ec2.internal
ID: OVTK:2U5Z:5WFZ:A4EG:ESGK:IQY3:KKBN:42YR:YBHS:BDZO:F3AC:G5FX
Docker Root Dir: /var/lib/docker
Debug Mode (client): false
Debug Mode (server): false
Registry: https://registry.reg-aws.openshift.com:443/v1/
WARNING: bridge-nf-call-iptables is disabled
WARNING: bridge-nf-call-ip6tables is disabled
Experimental: false
Insecure Registries:
 registry.reg-aws.openshift.com:443
 virt-openshift-05.lab.eng.nay.redhat.com:5000
 virt-openshift-05.lab.eng.nay.redhat.com:5001
 asb-registry.usersys.redhat.com:5000
 brew-pulp-docker01.web.prod.ext.phx2.redhat.com:8888
 127.0.0.0/8
Live Restore Enabled: false

[root@ip-172-18-3-124 ~]# lvs
  LV          VG   Attr       LSize   Pool Origin Data%  Meta%  Move Log Cpy%Sync Convert
  docker-pool rhel twi-aot--- <21.09g             91.63  8.99
  root        rhel -wi-ao----  18.00g

3. Check the "DiskPressure" of the node
[root@ip-172-18-15-111 ~]# oc describe no ip-172-18-3-124.ec2.internal | grep DiskPressure
  DiskPressure   False   Mon, 23 Apr 2018 05:19:03 -0400   Sun, 22 Apr 2018 21:24:47 -0400   KubeletHasNoDiskPressure   kubelet has no disk pressure

Actual results:
1. runtime.imageFs differs from the docker space figures; also, this environment uses devicemapper, yet the inode fields are non-zero.
3. DiskPressure is False

Expected results:
1. runtime.imageFs matches the docker space figures
3. DiskPressure should be True

Additional info:
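For context on why this breaks image GC: the kubelet compares imageFs usage against --image-gc-high-threshold (85% by default). A rough sketch, not part of the report, with values copied from the output above (summary numbers and docker info's decimal-GB figures), shows how the mirrored root-fs numbers hide the nearly full thin pool:

```shell
#!/bin/sh
# Sketch: compare the usage ratio the kubelet sees (from the bogus imageFs
# numbers above) with the real thin-pool usage (from "docker info").

pct_used() {
  # percent used = (capacity - available) * 100 / capacity
  awk -v a="$1" -v c="$2" 'BEGIN { printf "%.1f", (c - a) * 100 / c }'
}

# imageFs as reported in stats/summary (these mirror the root fs)
echo "kubelet sees:  $(pct_used 15427821568 19316867072)% used"
# the devicemapper pool per docker info (1.895 GB free of 22.64 GB)
echo "pool actually: $(pct_used 1895000000 22640000000)% used"
```

With these numbers the kubelet computes roughly 20% usage, while the pool is over 91% full (matching the 91.63 Data% that lvs shows), so image GC never fires.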
Robert, could you take a look at this?
Noting that capacityBytes in the stats (19316867072) closely matches the size of the root volume -- that number of bytes is 17.99 GiB.
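For reference, the conversion behind that figure:

```shell
# 19316867072 bytes expressed in GiB
awk 'BEGIN { printf "%.2f GiB\n", 19316867072 / (1024 ^ 3) }'   # prints 17.99 GiB
```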
Could you include output of "df" and "df -i" on ip-172-18-3-124.ec2.internal?
Progress update: attempting to reproduce with 3.10, using oc cluster up with lvm storage. Unable to reproduce with master, as the cluster will not come up. The cluster does come up with 3.10.0 alpha-0, but there the storage information is correct:

  "fs": {
   "time": "2018-04-30T19:34:42Z",
   "availableBytes": 12419698688,
   "capacityBytes": 21003583488,
   "usedBytes": 7493365760,
   "inodesFree": 1107054,
   "inodes": 1310720,
   "inodesUsed": 203666
  },
  "runtime": {
   "imageFs": {
    "time": "2018-04-30T19:34:42Z",
    "availableBytes": 19358810112,
    "capacityBytes": 24469569536,
    "usedBytes": 10023672506
   }
  }

Currently bisecting to determine either where this starts failing or where I become unable to bring up the cluster.
(In reply to Seth Jennings from comment #1)
> Robert, could you take a look at this?

http://pastebin.test.redhat.com/584960
Retest on v3.10.0-0.31.0.

oc v3.10.0-0.31.0
kubernetes v1.10.0+b81c8f8
features: Basic-Auth GSSAPI Kerberos SPNEGO

Server https://ip-172-18-10-141.ec2.internal:8443
openshift v3.10.0-0.31.0
kubernetes v1.10.0+b81c8f8
Unable to bring up a cluster with the current oc cluster up. Could you provide the sequence and configuration that you used to bring your cluster up?
I can also see this:

$ oc get --raw=/api/v1/nodes/node1.lab.variantweb.net/proxy/stats/summary | grep -A 20 fs
  "fs": {
   "time": "2018-05-02T19:32:24Z",
   "availableBytes": 15577264128,
   "capacityBytes": 21463281664,
   "usedBytes": 5886017536,
   "inodesFree": 10443544,
   "inodes": 10485232,
   "inodesUsed": 41688
  },
  "runtime": {
   "imageFs": {
    "time": "2018-05-02T19:32:24Z",
    "availableBytes": 15577264128,
    "capacityBytes": 21463281664,
    "usedBytes": 3244701384,
    "inodesFree": 10443544,
    "inodes": 10485232,
    "inodesUsed": 41688
   }
  },

On node1:

[root@node1 ~]# df -h /
Filesystem      Size  Used Avail Use% Mounted on
/dev/vda1        20G  5.5G   15G  28% /

5.5G matches usedBytes in fs.

# docker info | grep Data
WARNING: Usage of loopback devices is strongly discouraged for production use. Use `--storage-opt dm.thinpooldev` to specify a custom block storage device.
 Data file: /dev/loop0
 Data Space Used: 3.262 GB
 Data Space Total: 107.4 GB
 Data Space Available: 15.58 GB
 Data loop file: /var/lib/docker/devicemapper/devicemapper/data

3.262 GB is somewhat close to imageFs usedBytes. All the other fields between fs and runtime.imageFs are the same.
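A quick way to spot this symptom on a saved summary is to check whether fs and runtime.imageFs report the same availableBytes. A minimal sketch (field extraction with grep/awk rather than a JSON parser; the sample is trimmed from the output above):

```shell
#!/bin/sh
# Flag the bug when imageFs mirrors the node fs. On a devicemapper host the
# two should differ, since the thin pool is a separate device.
summary='
"fs": { "availableBytes": 15577264128 },
"runtime": { "imageFs": { "availableBytes": 15577264128 } }
'
vals=$(printf '%s' "$summary" | grep -o '"availableBytes": [0-9]*' | awk '{ print $2 }')
fs_avail=$(printf '%s\n' "$vals" | sed -n 1p)
img_avail=$(printf '%s\n' "$vals" | sed -n 2p)
if [ "$fs_avail" = "$img_avail" ]; then
  echo "suspicious: imageFs mirrors node fs ($img_avail bytes available)"
else
  echo "ok: imageFs reports its own device"
fi
```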
Please be sure to include the commands you used to set up the storage, /etc/docker/daemon.json, and /etc/sysconfig/docker. Thanks!
I recreated locally with the following steps:

- deploy a 3.10 cluster with openshift-ansible
- choose a compute node for testing
- oc adm drain --ignore-daemonsets <node>
- ssh into <node>
- systemctl stop atomic-openshift-node
- systemctl stop docker
- rm -rf /var/lib/docker/*
- edit /etc/sysconfig/docker-storage with s/overlay2/devicemapper/
- systemctl start docker
- "docker info | grep Storage" to make sure devicemapper is in use
- systemctl start atomic-openshift-node
- exit the node
- oc adm uncordon <node>
- oc get --raw=/api/v1/nodes/<node>/proxy/stats/summary | grep -A 20 fs
In 3.9:

Before any image pull to DM storage:

  "fs": {
   "time": "2018-05-02T20:46:08Z",
   "availableBytes": 19392348160,
   "capacityBytes": 21463281664,
   "usedBytes": 2070933504,
   "inodesFree": 10451574,
   "inodes": 10485232,
   "inodesUsed": 33658
  },
  "runtime": {
   "imageFs": {
    "time": "2018-05-02T20:46:08Z",
    "availableBytes": 17063477248,
    "capacityBytes": 17083400192,
    "usedBytes": 0
   }
  }

After image pull:

  "fs": {
   "time": "2018-05-02T20:48:10Z",
   "availableBytes": 19382808576,
   "capacityBytes": 21463281664,
   "usedBytes": 2080473088,
   "inodesFree": 10451530,
   "inodes": 10485232,
   "inodesUsed": 33702
  },
  "runtime": {
   "imageFs": {
    "time": "2018-05-02T20:48:10Z",
    "availableBytes": 15486418944,
    "capacityBytes": 17083400192,
    "usedBytes": 1420732513
   }
  }

# docker images
REPOSITORY                   TAG      IMAGE ID       CREATED       SIZE
docker.io/openshift/origin   latest   ad6bb5bd6b73   2 hours ago   1.42 GB

# docker info 2>/dev/null | grep "Data Space"
 Data Space Used: 1.597 GB
 Data Space Total: 17.08 GB
 Data Space Available: 15.49 GB

usedBytes matches up with the image size, and the inodes fields are not included. availableBytes matches "Data Space Available" and capacityBytes matches "Data Space Total".

So there is a change in behavior (i.e. a bug) here in 3.10.
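The match can be eyeballed by converting docker's figures back to bytes (docker info prints decimal GB, i.e. 10^9 bytes), a quick sketch:

```shell
#!/bin/sh
# Sketch: docker info prints decimal GB, so 15.49 GB is approximately the
# summary's availableBytes and 17.08 GB its capacityBytes.
to_bytes() { awk -v g="$1" 'BEGIN { printf "%.0f\n", g * 1e9 }'; }
echo "Data Space Available: $(to_bytes 15.49) (summary: 15486418944)"
echo "Data Space Total:     $(to_bytes 17.08) (summary: 17083400192)"
```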
Already fixed in upstream cadvisor, but not updated in kube 1.10, so we inherited the bug :-/

Origin PR: https://github.com/openshift/origin/pull/19601
Verified on v3.10.0-0.46.0:

[root@ip-172-18-0-165 ~]# oc get --raw /api/v1/nodes/ip-172-18-14-141.ec2.internal/proxy/stats/summary | grep -w -A 16 fs
  "fs": {
   "time": "2018-05-16T07:43:18Z",
   "availableBytes": 16132792320,
   "capacityBytes": 19316867072,
   "usedBytes": 3184074752,
   "inodesFree": 9389563,
   "inodes": 9437184,
   "inodesUsed": 47621
  },
  "runtime": {
   "imageFs": {
    "time": "2018-05-16T07:43:18Z",
    "availableBytes": 8842117120,
    "capacityBytes": 13098811392,
    "usedBytes": 6847236375
   }
  },
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2018:1816