Bug 1552827
| Summary: | cri-o runtime docker fills up disk space | ||||||
|---|---|---|---|---|---|---|---|
| Product: | OpenShift Container Platform | Reporter: | Vikas Laad <vlaad> | ||||
| Component: | Node | Assignee: | Seth Jennings <sjenning> | ||||
| Status: | CLOSED ERRATA | QA Contact: | Vikas Laad <vlaad> | ||||
| Severity: | high | Docs Contact: | |||||
| Priority: | unspecified | ||||||
| Version: | 3.9.0 | CC: | aos-bugs, bparees, decarr, jforrest, jokerman, mifiedle, mmccomas, sjenning, vlaad, wmeng, wsun | ||||
| Target Milestone: | --- | Keywords: | TestBlocker | ||||
| Target Release: | 3.9.z | ||||||
| Hardware: | Unspecified | ||||||
| OS: | Unspecified | ||||||
| Whiteboard: | |||||||
| Fixed In Version: | Doc Type: | No Doc Update | |||||
| Doc Text: | | Story Points: | --- | ||||
| Clone Of: | Environment: | ||||||
| Last Closed: | 2018-08-09 22:13:46 UTC | Type: | Bug | ||||
| Regression: | --- | Mount Type: | --- | ||||
| Documentation: | --- | CRM: | |||||
| Verified Versions: | Category: | --- | |||||
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |||||
| Cloudforms Team: | --- | Target Upstream Version: | |||||
| Embargoed: | |||||||
| Bug Depends On: | 1587899, 1599240 | ||||||
| Bug Blocks: | |||||||
| Attachments: | ||||||
Description
Vikas Laad
2018-03-07 19:54:01 UTC
Derek, I assume this is because when we run in crio mode, imageGC ignores the docker filesystem, so any images that are being pulled down/built by the openshift build process are not monitored/GCed. Is there any way we can make imageGC apply to both the crio and docker filesystems? Seth may know also.

Created attachment 1405542: describe node

Yes, there is a daemonset that needs to be deployed on crio nodes that also run docker builds in order to do container and image GC for docker: https://github.com/openshift/origin/blob/master/examples/dockergc/dockergc-ds.yaml

If using openshift-ansible, this should deploy if you set
openshift_use_crio=true
In order to do it explicitly:
openshift_crio_enable_docker_gc=true

I already had openshift_use_crio=true in my inventory when I created the env. How do I check to see if it's running?

oc get ds --all-namespaces
oc get pods --all-namespaces -l app=dockergc
I can't recall the namespace atm.

NAMESPACE NAME DESIRED CURRENT READY UP-TO-DATE AVAILABLE NODE SELECTOR AGE
kube-service-catalog apiserver 1 1 1 1 1 openshift-infra=apiserver 9d
kube-service-catalog controller-manager 1 1 1 1 1 openshift-infra=apiserver 9d
openshift-template-service-broker apiserver 1 1 1 1 1 region=infra 9d

When I try to create it in my own project:

root@ip-172-31-58-173: ~ # oc create -f dockergc-ds.yaml
serviceaccount "dockergc" created
daemonset "dockergc" created
root@ip-172-31-58-173: ~ # oc get events
LAST SEEN FIRST SEEN COUNT NAME KIND SUBOBJECT TYPE REASON SOURCE MESSAGE
1s 3s 9 dockergc.1519bf015a639e89 DaemonSet Warning FailedCreate daemonset-controller Error creating: pods "dockergc-" is forbidden: unable to validate against any security context constraint: [spec.volumes[0]: Invalid value: "hostPath": hostPath volumes are not allowed to be used spec.volumes[1]: Invalid value: "hostPath": hostPath volumes are not allowed to be used spec.containers[0].securityContext.privileged: Invalid value: true: Privileged containers are not allowed]

You need to grant higher privs to the SA: https://github.com/openshift/origin/blob/master/examples/dockergc/dockergc-ds.yaml#L8-L9

Here are the logs from one of those containers run by the ds:

root@ip-172-31-16-7: ~ # oc logs dockergc-l5g5m
Error: unknown command "ex" for "openshift"
Run 'openshift --help' for usage.
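
The check, create, and grant steps discussed above can be collected into one small script. This is only a sketch: the function names and the `NAMESPACE` variable are hypothetical, the manifest is assumed to be saved locally as `dockergc-ds.yaml`, and the `privileged` SCC is assumed because the rejected pod requested `privileged: true` and hostPath volumes.

```shell
#!/bin/sh
# Sketch only: NAMESPACE and both function names are hypothetical helpers,
# not part of openshift-ansible or the example manifest.
NAMESPACE="${NAMESPACE:-default}"

# List daemonsets and any dockergc pods across all namespaces.
check_dockergc() {
    oc get ds --all-namespaces
    oc get pods --all-namespaces -l app=dockergc
}

# Create the service account and daemonset from the example manifest, then
# grant the privileged SCC to the dockergc service account. The daemonset
# controller keeps retrying pod creation, so the pods should be admitted
# once the grant is in place.
deploy_dockergc() {
    oc create -f dockergc-ds.yaml -n "$NAMESPACE"
    oc adm policy add-scc-to-user privileged -z dockergc -n "$NAMESPACE"
}
```

After sourcing the script, `NAMESPACE=myproject deploy_dockergc` followed by `check_dockergc` should show whether the pods were admitted; `myproject` is a placeholder.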

I created another cluster today. The installer created the ds, but all the pods are going into CrashLoopBackOff. Raising the sev on this bz.

Marking this TestBlocker as it causes reliability testing with CRI-O to eventually fail, with pods evicted because of out of disk due to gc failures.

Ah, thanks for the pod logs. Turns out that "ex" is no longer a subcommand of openshift, but rather of oc. Must have been part of a 3.9 refactor. Can you modify the example DS to use "oc ex" rather than "openshift ex" in the container command and see if it will start?

More concretely, this: https://github.com/sjenning/origin/commit/0dd74d12f9c5ad2a7ce932dedbff9322fe7c8584

The dockergc pod logs show the following error after the pods start running:

I0309 14:31:15.582402 1 dockergc.go:150] gathering disk usage data
E0309 14:31:15.587715 1 dockergc.go:267] garbage collection attempt failed: exit status 1
I0309 14:32:15.587934 1 dockergc.go:150] gathering disk usage data
E0309 14:32:15.590050 1 dockergc.go:267] garbage collection attempt failed: exit status 1

That means that it can't access /var/lib/docker within the pod. Can you exec into the pod and see 1) what user is running pid 1 and 2) can that user list the contents of /var/lib/docker?

Here it is:

UID PID PPID C STIME TTY TIME CMD
root 1 0 0 13:20 ? 00:00:00 /usr/bin/oc ex dockergc --image-gc-low-threshold=60 --image-gc-high-threshold=80 --minimum-ttl-duration=1h0m0s
root 17 0 0 13:24 pts/0 00:00:00 /bin/sh
root 33 17 0 13:25 pts/0 00:00:00 ps -ef

sh-4.2# ls -l
total 148
drwx------. 18 root root 4096 Mar 19 13:20 containers
drwx------. 3 root root 22 Mar 14 15:48 image
drwxr-x---. 3 root root 19 Mar 14 15:48 network
drwxr-xr-x. 1405 root root 118784 Mar 19 13:20 overlay2
drwx------. 4 root root 32 Mar 14 15:48 plugins
drwx------. 2 root root 6 Mar 14 15:48 swarm
drwx------. 2 root root 6 Mar 14 20:15 tmp
drwx------. 2 root root 6 Mar 14 15:48 trust
drwx------. 2 root root 25 Mar 14 15:48 volumes

Hi Vikas, the PR has been merged. Please help check if this bug can be verified, thanks!

Sorry, wrong target release. This is fixed in openshift-ansible master, i.e. 3.10. If you are using the release-3.9 branch of openshift-ansible, I did not backport the fix because cri-o is not officially supported in 3.9. Please test against 3.10.

Opened PR for 3.9.z as well: https://github.com/openshift/openshift-ansible/pull/8236

Hi Vikas, please help check if this bug has been fixed, thanks!

I don't see this problem on the following version of the openshift cluster where I am running tests:
openshift v3.10.12

Verified on the following version; docker-gc is working and there are no errors in the pod logs.
openshift v3.9.40
kubernetes v1.9.1+a0ce1bc657
etcd 3.2.16

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.
https://access.redhat.com/errata/RHBA-2018:2335
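
For readers wondering what the dockergc flags seen in the process listing (`--image-gc-high-threshold=80`, `--image-gc-low-threshold=60`) imply, here is a minimal sketch. It assumes dockergc follows the same high/low threshold policy as kubelet image GC: start collecting once usage reaches the high threshold, and free space until usage is back at the low threshold. That assumption is not confirmed anywhere in this report, and both function names are hypothetical.

```shell
#!/bin/sh
# Sketch only: assumes a kubelet-style high/low threshold policy.
# Function names are hypothetical and not taken from dockergc.
HIGH_THRESHOLD=80   # value of --image-gc-high-threshold in the pod command
LOW_THRESHOLD=60    # value of --image-gc-low-threshold in the pod command

# Print how many percentage points of disk usage should be freed for a
# given integer usage percent; prints 0 when usage is below the high threshold.
percent_to_free() {
    usage="$1"
    if [ "$usage" -ge "$HIGH_THRESHOLD" ]; then
        echo $((usage - LOW_THRESHOLD))
    else
        echo 0
    fi
}

# Print the usage percent of the filesystem holding the given path
# (defaults to /var/lib/docker), using POSIX df output.
docker_disk_usage() {
    df -P "${1:-/var/lib/docker}" | awk 'NR==2 { sub("%", "", $5); print $5 }'
}
```

On a node, `percent_to_free "$(docker_disk_usage)"` gives a rough idea of whether a collection would be due under this assumed policy.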