Bug 1552827

Summary:

With cri-o runtime, docker fills up disk space

Product:

OpenShift Container Platform

Reporter:

Vikas Laad <vlaad>

Component:

Node

Assignee:

Seth Jennings <sjenning>

Status:

CLOSED ERRATA

QA Contact:

Vikas Laad <vlaad>

Severity:

high

Docs Contact:

Priority:

unspecified

Version:

3.9.0

CC:

aos-bugs, bparees, decarr, jforrest, jokerman, mifiedle, mmccomas, sjenning, vlaad, wmeng, wsun

Target Milestone:

---

Keywords:

TestBlocker

Target Release:

3.9.z

Hardware:

Unspecified

OS:

Unspecified

Whiteboard:

Fixed In Version:

Doc Type:

No Doc Update

Doc Text:

undefined

Story Points:

---

Clone Of:

Environment:

Last Closed:

2018-08-09 22:13:46 UTC

Type:

Bug

Regression:

---

Mount Type:

---

Documentation:

---

CRM:

Verified Versions:

Category:

---

oVirt Team:

---

RHEL 7.3 requirements from Atomic Host:

Cloudforms Team:

---

Target Upstream Version:

Embargoed:

Bug Depends On:

1587899, 1599240

Bug Blocks:

Attachments:

Description	Flags
describe node	none

Description Vikas Laad 2018-03-07 19:54:01 UTC

Description of problem:
I am running long-running reliability tests. As part of these tests we create/build/scale/delete apps for an extended period of time (2-3 weeks). I have been using compute nodes with 50G of disk space for the last few releases. For this run the runtime is CRI-O; one of the compute nodes ran out of disk space and started evicting pods after a few days. Attaching "oc describe node" output. In other runs where the runtime was docker we did not see this problem. Disk is consumed primarily by docker:

root@: /var/lib/containers/docker # du -sh *
224K    containers
453M    image
44K     network
53G     overlay2
0       plugins
0       swarm
0       tmp
0       trust
24K     volumes


Version-Release number of selected component (if applicable):
openshift v3.9.0-0.53.0
kubernetes v1.9.1+a0ce1bc657
etcd 3.2.8

How reproducible:
with cri-o runtime

Steps to Reproduce:
1. Create OCP cluster with cri-o runtime
2. Start creating/building/deleting quickstart apps for a long time

Actual results:
Node starts evicting pods

Expected results:
Disk space should be reclaimed.

Additional info:
See node logs and describe node attached.

Comment 1 Ben Parees 2018-03-07 20:16:08 UTC

Derek, I assume this is because when we run in crio mode, imageGC ignores the docker filesystem, so any images that are being pulled down/built by the openshift build process are not monitored/GCed.

Is there any way we can make imageGC apply to both the crio and docker filesystems?

Comment 2 Ben Parees 2018-03-07 20:16:26 UTC

Seth may know also.

Comment 3 Vikas Laad 2018-03-07 20:28:33 UTC

Created attachment 1405542 [details]
describe node

Comment 5 Seth Jennings 2018-03-07 20:40:53 UTC

Yes, there is a daemonset that needs to be deployed on crio nodes that also run docker builds in order to do container and image GC for docker:
https://github.com/openshift/origin/blob/master/examples/dockergc/dockergc-ds.yaml

If using openshift-ansible, this should deploy if you set
openshift_use_crio=true

To enable it explicitly, set
openshift_crio_enable_docker_gc=true

Comment 6 Vikas Laad 2018-03-07 20:43:04 UTC

I already had openshift_use_crio=true in my inventory when I created the env. How do I check whether it's running?

Comment 7 Seth Jennings 2018-03-07 20:54:35 UTC

oc get ds --all-namespaces
oc get pods --all-namespaces -l app=dockergc

I can't recall the namespace atm

Comment 8 Vikas Laad 2018-03-07 20:57:09 UTC

NAMESPACE                           NAME                   DESIRED   CURRENT   READY     UP-TO-DATE   AVAILABLE   NODE SELECTOR               AGE
kube-service-catalog                apiserver              1         1         1         1            1           openshift-infra=apiserver   9d
kube-service-catalog                controller-manager     1         1         1         1            1           openshift-infra=apiserver   9d
openshift-template-service-broker   apiserver              1         1         1         1            1           region=infra                9d

Comment 9 Vikas Laad 2018-03-07 21:00:10 UTC

When I try to create it in my own project:

root@ip-172-31-58-173: ~ # oc create -f dockergc-ds.yaml 
serviceaccount "dockergc" created
daemonset "dockergc" created
root@ip-172-31-58-173: ~ # oc get events
LAST SEEN   FIRST SEEN   COUNT     NAME                        KIND        SUBOBJECT   TYPE      REASON         SOURCE                 MESSAGE
1s          3s           9         dockergc.1519bf015a639e89   DaemonSet               Warning   FailedCreate   daemonset-controller   Error creating: pods "dockergc-" is forbidden: unable to validate against any security context constraint: [spec.volumes[0]: Invalid value: "hostPath": hostPath volumes are not allowed to be used spec.volumes[1]: Invalid value: "hostPath": hostPath volumes are not allowed to be used spec.containers[0].securityContext.privileged: Invalid value: true: Privileged containers are not allowed]

Comment 10 Seth Jennings 2018-03-07 21:02:35 UTC

You need to grant higher privs to the SA:
https://github.com/openshift/origin/blob/master/examples/dockergc/dockergc-ds.yaml#L8-L9
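A minimal sketch of that grant, assuming the service account is named dockergc (as created in comment 9) and that the caller has cluster-admin rights. The function name and the namespace argument are placeholders for illustration, not part of the example manifest:

```shell
# Sketch: grant the privileged SCC to the dockergc service account so the
# daemonset's pods may use hostPath volumes and run privileged.
# $1 is the project (namespace) that holds the daemonset.
grant_dockergc_scc() {
  namespace=$1
  oc adm policy add-scc-to-user privileged -z dockergc -n "$namespace"
}

# Usage (project name is a placeholder):
#   grant_dockergc_scc my-project
```

The -z flag names a service account in the given namespace, so the grant does not have to spell out the full system:serviceaccount:... user name.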

Comment 11 Vikas Laad 2018-03-08 16:16:24 UTC

Here are the logs from one of the containers run by the ds:

root@ip-172-31-16-7: ~ # oc logs dockergc-l5g5m  
Error: unknown command "ex" for "openshift"
Run 'openshift --help' for usage.

I created another cluster today. The installer created the ds, but all the pods are going into CrashLoopBackOff. Raising the severity of this bz.

Comment 12 Mike Fiedler 2018-03-08 16:19:00 UTC

Marking this TestBlocker, as it causes reliability testing with CRI-O to eventually fail: pods are evicted when the node runs out of disk due to gc failures.

Comment 13 Seth Jennings 2018-03-08 17:50:52 UTC

Ah, thanks for the pod logs.

Turns out that "ex" is no longer a subcommand of openshift, but rather of oc.  Must have been part of a 3.9 refactor.

Can you modify the example DS to use "oc ex" rather than "openshift ex" in the container command and see if it will start?

Comment 14 Seth Jennings 2018-03-08 18:17:25 UTC

More concretely, this
https://github.com/sjenning/origin/commit/0dd74d12f9c5ad2a7ce932dedbff9322fe7c8584

Comment 16 Vikas Laad 2018-03-09 14:34:05 UTC

The dockergc pod logs show the following error after the pods start running:
I0309 14:31:15.582402       1 dockergc.go:150] gathering disk usage data
E0309 14:31:15.587715       1 dockergc.go:267] garbage collection attempt failed: exit status 1
I0309 14:32:15.587934       1 dockergc.go:150] gathering disk usage data
E0309 14:32:15.590050       1 dockergc.go:267] garbage collection attempt failed: exit status 1

Comment 17 Seth Jennings 2018-03-16 21:27:53 UTC

That means that it can't access /var/lib/docker within the pod.  Can you exec into the pod and see 1) what user is running pid 1 and 2) can that user list the contents of /var/lib/docker?
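One way to run both checks from outside the pod, as a sketch. The function name is made up for illustration, and the pod name must be taken from "oc get pods"; it assumes ps and ls are available in the container image:

```shell
# Sketch: show which user runs pid 1 in a dockergc pod and whether
# /var/lib/docker can be listed from inside it.
# $1 is the pod name; add -n <namespace> if the pod is in another project.
check_dockergc_pod() {
  pod=$1
  oc exec "$pod" -- ps -ef
  oc exec "$pod" -- ls -l /var/lib/docker
}

# Usage (pod name is a placeholder):
#   check_dockergc_pod dockergc-xxxxx
```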

Comment 18 Vikas Laad 2018-03-19 13:25:54 UTC

Here it is

UID         PID   PPID  C STIME TTY          TIME CMD
root          1      0  0 13:20 ?        00:00:00 /usr/bin/oc ex dockergc --image-gc-low-threshold=60 --image-gc-high-threshold=80 --minimum-ttl-duration=1h0m0s
root         17      0  0 13:24 pts/0    00:00:00 /bin/sh
root         33     17  0 13:25 pts/0    00:00:00 ps -ef


sh-4.2# ls -l
total 148
drwx------.   18 root root   4096 Mar 19 13:20 containers
drwx------.    3 root root     22 Mar 14 15:48 image
drwxr-x---.    3 root root     19 Mar 14 15:48 network
drwxr-xr-x. 1405 root root 118784 Mar 19 13:20 overlay2
drwx------.    4 root root     32 Mar 14 15:48 plugins
drwx------.    2 root root      6 Mar 14 15:48 swarm
drwx------.    2 root root      6 Mar 14 20:15 tmp
drwx------.    2 root root      6 Mar 14 15:48 trust
drwx------.    2 root root     25 Mar 14 15:48 volumes
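
For reference, the --image-gc-low-threshold=60 and --image-gc-high-threshold=80 flags in the command line above can be read as follows, assuming dockergc uses the same high/low semantics as the kubelet's image GC (an assumption, not confirmed in this bug): once disk usage reaches the high threshold, GC tries to free enough space to bring usage back down to the low threshold. The function name and the integer units are hypothetical:

```shell
# Sketch only: assumed high/low threshold semantics, modeled on kubelet image GC.
# bytes_to_free USED CAPACITY HIGH_PCT LOW_PCT
# Prints the amount GC would try to reclaim, in the same unit as USED,
# or 0 if usage is still below HIGH_PCT.
bytes_to_free() {
  used=$1
  capacity=$2
  high=$3
  low=$4
  # Integer percentage of the disk currently in use.
  pct=$(( $used * 100 / $capacity ))
  if [ "$pct" -lt "$high" ]; then
    echo 0
    return 0
  fi
  # Reclaim enough to bring usage down to the low threshold.
  echo $(( $used - $capacity * $low / 100 ))
}

# Usage: a 50G disk with 45G used and thresholds 80/60.
#   bytes_to_free 45 50 80 60
```

Under these assumptions a 50G node starts collecting at 40G used and aims for 30G, which is why a working GC should keep a long-running test from reaching eviction.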

Comment 22 Wei Sun 2018-04-28 02:18:14 UTC

Hi Vikas,

The PR has been merged. Please help check whether this bug can be verified, thanks!

Comment 24 Seth Jennings 2018-05-02 17:01:42 UTC

Sorry, wrong target release.

This is fixed in openshift-ansible master i.e. 3.10.  If you are using the release-3.9 branch of openshift-ansible, I did not backport the fix because cri-o is not officially supported in 3.9.

Please test against 3.10.

Comment 26 Seth Jennings 2018-05-02 18:29:21 UTC

Opened PR for 3.9.z as well:
https://github.com/openshift/openshift-ansible/pull/8236

Comment 29 Wei Sun 2018-07-31 08:35:33 UTC

Hi Vikas, please help check whether this bug has been fixed, thanks!

Comment 30 Vikas Laad 2018-07-31 13:37:21 UTC

I don't see this problem on the following version of the openshift cluster where I am running tests:

openshift v3.10.12

Comment 32 Vikas Laad 2018-08-02 18:32:39 UTC

Verified on the following version: docker-gc is working and there are no errors in the pod logs.

openshift v3.9.40
kubernetes v1.9.1+a0ce1bc657
etcd 3.2.16

Comment 34 errata-xmlrpc 2018-08-09 22:13:46 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2018:2335