Description of problem:
I am running reliability tests that go on for a long time. As part of these tests we create/build/scale/delete apps for an extended period (2-3 weeks). I have been using compute nodes with 50G of disk space for the last few releases. For this run the runtime is CRI-O, and one of the compute nodes ran out of disk space and started evicting pods after a few days. Attaching "oc describe node" output. With other runs where the runtime was docker we did not see this problem.

Disk is consumed primarily by docker:

root@: /var/lib/containers/docker # du -sh *
224K	containers
453M	image
44K	network
53G	overlay2
0	plugins
0	swarm
0	tmp
0	trust
24K	volumes

Version-Release number of selected component (if applicable):
openshift v3.9.0-0.53.0
kubernetes v1.9.1+a0ce1bc657
etcd 3.2.8

How reproducible:
With the cri-o runtime

Steps to Reproduce:
1. Create an OCP cluster with the cri-o runtime
2. Start creating/building/deleting quickstart apps for a long time

Actual results:
Node starts evicting pods

Expected results:
Disk space should be reclaimed.

Additional info:
See node logs and describe node attached.
Derek, I assume this is because when we run in crio mode, imageGC ignores the docker filesystem, so any images that are being pulled down/built by the openshift build process are not monitored/GCed. Is there any way we can make imageGC apply to both the crio and docker filesystems?
Seth may know also.
Created attachment 1405542 [details] describe node
Yes, there is a daemonset that needs to be deployed on cri-o nodes that also run docker builds in order to do container and image GC for docker:

https://github.com/openshift/origin/blob/master/examples/dockergc/dockergc-ds.yaml

If using openshift-ansible, this should be deployed if you set openshift_use_crio=true. To enable it explicitly, set openshift_crio_enable_docker_gc=true.
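For reference, the relevant openshift-ansible inventory settings would look roughly like this (the group name and surrounding layout are illustrative, not copied from a real inventory):

```ini
[OSEv3:vars]
# Use CRI-O as the container runtime on nodes
openshift_use_crio=true
# Explicitly deploy the dockergc daemonset so docker-based builds are GCed
openshift_crio_enable_docker_gc=true
```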
I already had openshift_use_crio=true in my inventory when I created the env. How do I check to see if it's running?
oc get ds --all-namespaces
oc get pods --all-namespaces -l app=dockergc

I can't recall the namespace atm.
NAMESPACE                           NAME                 DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE SELECTOR               AGE
kube-service-catalog                apiserver            1         1         1       1            1           openshift-infra=apiserver   9d
kube-service-catalog                controller-manager   1         1         1       1            1           openshift-infra=apiserver   9d
openshift-template-service-broker   apiserver            1         1         1       1            1           region=infra                9d
When I try to create it in my own project:

root@ip-172-31-58-173: ~ # oc create -f dockergc-ds.yaml
serviceaccount "dockergc" created
daemonset "dockergc" created

root@ip-172-31-58-173: ~ # oc get events
LAST SEEN   FIRST SEEN   COUNT   NAME                        KIND        SUBOBJECT   TYPE      REASON         SOURCE                 MESSAGE
1s          3s           9       dockergc.1519bf015a639e89   DaemonSet               Warning   FailedCreate   daemonset-controller   Error creating: pods "dockergc-" is forbidden: unable to validate against any security context constraint: [spec.volumes[0]: Invalid value: "hostPath": hostPath volumes are not allowed to be used spec.volumes[1]: Invalid value: "hostPath": hostPath volumes are not allowed to be used spec.containers[0].securityContext.privileged: Invalid value: true: Privileged containers are not allowed]
You need to grant higher privs to the SA: https://github.com/openshift/origin/blob/master/examples/dockergc/dockergc-ds.yaml#L8-L9
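A minimal sketch of granting the privileged SCC to the dockergc service account, which clears both the hostPath and privileged-container violations above (the namespace placeholder is an assumption; substitute the project where the daemonset was created):

```
# Replace <project> with the namespace where dockergc-ds.yaml was created.
# -z targets a service account in that namespace rather than a user.
oc adm policy add-scc-to-user privileged -z dockergc -n <project>
```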
Here are the logs from one of those containers run by the ds:

root@ip-172-31-16-7: ~ # oc logs dockergc-l5g5m
Error: unknown command "ex" for "openshift"
Run 'openshift --help' for usage.

I created another cluster today; the installer created the ds, but all the pods are going into CrashLoopBackOff. Raising the sev on this bz.
Marking this TestBlocker as it causes reliability testing with CRI-O to eventually fail with pods evicted because of out of disk due to gc failures.
Ah, thanks for the pod logs. It turns out that "ex" is no longer a subcommand of openshift, but rather of oc. Must have been part of a 3.9 refactor. Can you modify the example DS to use "oc ex" rather than "openshift ex" in the container command and see if it will start?
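For reference, the change to the daemonset container spec would look roughly like this (field layout abbreviated; flags taken from the example DS, not re-derived):

```yaml
# examples/dockergc/dockergc-ds.yaml (container command, abbreviated sketch)
containers:
- name: dockergc
  command:
  - /usr/bin/oc        # was /usr/bin/openshift
  args:
  - ex
  - dockergc
  - --image-gc-low-threshold=60
  - --image-gc-high-threshold=80
  - --minimum-ttl-duration=1h0m0s
```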
More concretely, this https://github.com/sjenning/origin/commit/0dd74d12f9c5ad2a7ce932dedbff9322fe7c8584
dockergc pod logs have the following error after they start running:

I0309 14:31:15.582402       1 dockergc.go:150] gathering disk usage data
E0309 14:31:15.587715       1 dockergc.go:267] garbage collection attempt failed: exit status 1
I0309 14:32:15.587934       1 dockergc.go:150] gathering disk usage data
E0309 14:32:15.590050       1 dockergc.go:267] garbage collection attempt failed: exit status 1
That means that it can't access /var/lib/docker within the pod. Can you exec into the pod and see 1) what user is running pid 1 and 2) can that user list the contents of /var/lib/docker?
Here it is:

UID        PID  PPID  C STIME TTY          TIME CMD
root         1     0  0 13:20 ?        00:00:00 /usr/bin/oc ex dockergc --image-gc-low-threshold=60 --image-gc-high-threshold=80 --minimum-ttl-duration=1h0m0s
root        17     0  0 13:24 pts/0    00:00:00 /bin/sh
root        33    17  0 13:25 pts/0    00:00:00 ps -ef

sh-4.2# ls -l
total 148
drwx------.   18 root root   4096 Mar 19 13:20 containers
drwx------.    3 root root     22 Mar 14 15:48 image
drwxr-x---.    3 root root     19 Mar 14 15:48 network
drwxr-xr-x. 1405 root root 118784 Mar 19 13:20 overlay2
drwx------.    4 root root     32 Mar 14 15:48 plugins
drwx------.    2 root root      6 Mar 14 15:48 swarm
drwx------.    2 root root      6 Mar 14 20:15 tmp
drwx------.    2 root root      6 Mar 14 15:48 trust
drwx------.    2 root root     25 Mar 14 15:48 volumes
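As an aside, the thresholds on that command line describe a high/low water-mark scheme: GC kicks in when disk usage crosses the high threshold and removes images until usage falls to the low threshold. A minimal sketch of that decision (the logic only, not dockergc's actual code; 'usage' is a stand-in for the percentage used on the docker filesystem):

```shell
#!/bin/sh
# Water-mark check as configured by --image-gc-high-threshold=80
# and --image-gc-low-threshold=60 on the dockergc command line.
high=80
low=60
usage=85   # example value; in practice derived from filesystem stats

if [ "$usage" -ge "$high" ]; then
  echo "usage ${usage}% >= ${high}%: run GC until usage <= ${low}%"
else
  echo "usage ${usage}% < ${high}%: nothing to do"
fi
```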
Hi Vikas, the PR has been merged. Please help check whether this bug can be verified, thanks!
Sorry, wrong target release. This is fixed in openshift-ansible master i.e. 3.10. If you are using the release-3.9 branch of openshift-ansible, I did not backport the fix because cri-o is not officially supported in 3.9. Please test against 3.10.
Opened PR for 3.9.z as well: https://github.com/openshift/openshift-ansible/pull/8236
Hi Vikas, please help check whether this bug has been fixed, thanks!
On the following version of the openshift cluster where I am running tests, I don't see this problem:

openshift v3.10.12
Verified on the following version; docker-gc is working and there are no errors in the pod logs.

openshift v3.9.40
kubernetes v1.9.1+a0ce1bc657
etcd 3.2.16
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2018:2335