Description of problem:

Image GC was lowered to image-gc-high-threshold 85%, but GC fails to run successfully once thin pool metadata usage is over 85%. Container GC is also failing, with over 250 dead Kubernetes-managed containers on the node.

Thin pool state before cleanup:

  docker-pool docker-vg twi-aot--- 299.39g 46.58 88.15 docker-pool_tdata(0)

  --- Logical volume ---
  global/lvdisplay_shows_full_device_path not found in config: defaulting to 0
  LV Name                docker-pool
  VG Name                docker-vg
  LV UUID                NQuJC2-SEa2-WwP2-cN2c-oNc7-EccS-Xfkrj9
  LV Write Access        read/write
  LV Creation host, time d-124.dmz, 2017-03-02 09:42:41 -0800
  LV Pool metadata       docker-pool_tmeta
  LV Pool data           docker-pool_tdata
  LV Status              available
  # open                 40
  LV Size                299.39 GiB
  Allocated pool data    46.58%
  Allocated metadata     88.15%
  Current LE             76645
  Segments               1
  Allocation             inherit
  Read ahead sectors     auto
  - currently set to     8192
  Block device           253:14

  --- Physical volumes ---
  PV Name                /dev/sde1
  PV UUID                TK8YYh-1o9w-7ROC-08iA-fbww-z6Fx-3TRk6h
  PV Status              allocatable
  Total PE / Free PE     76799 / 0

docker info (excerpt):

  Containers: 306
   Running: 38
   Paused: 0
   Stopped: 268
  Images: 2255

Node logs:

May 09 20:44:03 d-124.dmz atomic-openshift-node[3008]: E0509 20:44:03.325337 3008 generic.go:197] GenericPLEG: Unable to retrieve pods: operation timeout: context deadline exceeded
May 09 20:44:03 d-124.dmz atomic-openshift-node[3008]: E0509 20:44:03.443629 3008 kubelet.go:2713] Failed cleaning pods: operation timeout: context deadline exceeded
May 09 21:47:28 d-124.dmz atomic-openshift-node[3008]: E0509 21:47:28.810460 3008 generic.go:197] GenericPLEG: Unable to retrieve pods: operation timeout: context deadline exceeded
May 09 21:47:29 d-124.dmz atomic-openshift-node[3008]: E0509 21:47:29.435884 3008 kubelet.go:2713] Failed cleaning pods: operation timeout: context deadline exceeded
May 10 14:18:09 d-124.dmz atomic-openshift-node[3008]: E0510 14:18:09.343189 3008 generic.go:236] PLEG: Ignoring events for pod scan-e9476/management-infra: operation timeout: context deadline exceeded
May 10 14:18:09 d-124.dmz atomic-openshift-node[3008]: E0510 14:18:09.343316 3008 pod_workers.go:183] Error syncing pod dc1f5ed2-35c5-11e7-9347-0050568379a2, skipping: operation timeout: context deadline exceeded
May 10 14:18:09 d-124.dmz atomic-openshift-node[3008]: E0510 14:18:09.428921 3008 kubelet.go:2713] Failed cleaning pods: operation timeout: context deadline exceeded
May 10 22:10:03 d-124.dmz atomic-openshift-node[3008]: E0510 22:10:03.920124 3008 generic.go:197] GenericPLEG: Unable to retrieve pods: operation timeout: context deadline exceeded
May 10 22:10:04 d-124.dmz atomic-openshift-node[3008]: E0510 22:10:04.552203 3008 kubelet.go:2713] Failed cleaning pods: operation timeout: context deadline exceeded
May 10 22:10:34 d-124.dmz atomic-openshift-node[3008]: E0510 22:10:34.493241 3008 kubelet.go:952] Container garbage collection failed: operation timeout: context deadline exceeded

LVM warnings:

May 10 14:17:10 ociopf-d-124.dmz lvm[2563]: WARNING: Thin pool docker--vg-docker--pool metadata is now 80.02% full.
May 10 19:34:30 ociopf-d-124.dmz lvm[2563]: WARNING: Thin pool docker--vg-docker--pool metadata is now 85.01% full.

Manually cleaned:

[root@d-124 ~]# docker rmi $(docker images --filter "dangling=true" -q --no-trunc)
[root@d-124 ~]# lvs | head -2
  LV          VG        Attr       LSize   Pool Origin Data%  Meta%  Move Log Cpy%Sync Convert
  docker-pool docker-vg twi-aot--- 299.39g 19.60  37.63

Version-Release number of selected component (if applicable):
3.3.1.7

How reproducible:
Very consistent in this environment: GC either never triggers properly, or fails when it does trigger.
Actual results:
GC fails, requiring manual cleanup before the volume fills up.

Expected results:
Successful GC.

Additional info:
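For reference, a minimal sketch of how these GC settings are typically expressed on an OpenShift 3.x node, assuming they were set via kubeletArguments in /etc/origin/node/node-config.yaml (the values below are illustrative, not taken from this environment):

kubeletArguments:
  image-gc-high-threshold:
  - "85"
  image-gc-low-threshold:
  - "80"
  maximum-dead-containers:
  - "20"

The node service (atomic-openshift-node) has to be restarted for changed GC settings to take effect.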
I don't know what "context deadline exceeded" means here. Is there some timeout in openshift/kube? If yes, how do we know the time allowed is sufficient to finish the operation? Can you run "docker info" on the node? How do we know docker is not responding?
I think openshift/kube needs to fill in more details here (especially docker logs) showing that docker is dead before this comes to the platform team.
Maybe docker is not running at all on the node. Without more details about the state of the node, don't assume this is a docker issue.
If the customer has a node in this condition, run:

$ curl -H "Authorization: Bearer $(oc whoami -t)" --insecure https://localhost:10250/metrics | grep -e docker

This should give us some indication of docker's responsiveness as seen from the openshift node process.
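As a rough complementary check (not a documented procedure), timing a couple of docker client calls directly on the node should also show whether docker itself is responding:

# time docker info
# time docker ps -a | wc -l

If these hang or take tens of seconds, the "context deadline exceeded" errors are most likely the kubelet's own client timeout expiring while it waits on docker.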
Going by the docker logs, docker is refusing to create new containers/images because it thinks the thin pool metadata is almost full. You need to either grow the metadata LV or remove some images/containers, which should reduce usage of the metadata volume.

time="2017-08-14T17:28:55.161557210-07:00" level=error msg="Handler for POST /containers/create returned error: devmapper: Thin Pool has 7863 free metadata blocks which is less than minimum required 7884 free metadata blocks. Create more free metadata space in thin pool or use dm.min_free_space option to change behavior"
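A rough sketch of the two options, using the VG/LV names from the lvdisplay output above (the +1G increment is only an example). To grow the thin pool metadata LV, assuming the volume group has free extents (note this PV reports Total PE / Free PE 76799 / 0, so the VG would need to be extended first):

# lvextend --poolmetadatasize +1G docker-vg/docker-pool

Or, to reclaim metadata space by removing dead containers and dangling images:

# docker rm $(docker ps -aq -f status=exited)
# docker rmi $(docker images --filter "dangling=true" -q)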
Verified on OCP 3.7: when image GC triggers, we can see the imageGCManager trying to clean up unused images.

# openshift version
openshift v3.7.0-0.146.0
kubernetes v1.7.6+a08f5eeb62
etcd 3.2.1

Logs:

Oct 10 04:36:30 ip-172-18-4-102 atomic-openshift-node: I1010 04:36:30.520176 32692 image_gc_manager.go:270] [imageGCManager]: Disk usage on "rhel-docker--pool" () is at 28% which is over the high threshold (10%). Trying to free 2904974950 bytes
Oct 10 04:41:30 ip-172-18-4-102 atomic-openshift-node: I1010 04:41:30.710926 32692 image_gc_manager.go:335] [imageGCManager]: Removing image "sha256:220072a1dafa8eb7f5739606393285301888277b21fe65f14614ad86377e6332" to free 208663627 bytes
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2017:3188