Description of problem:

When running reliability tests for more than 3 days, one of the nodes became NotReady because the openvswitch service fails to start in a containerized install of OpenShift. Noticed the following errors in the log:

Aug 16 11:28:11 ip-172-31-32-196 openvswitch: docker: Error response from daemon: devmapper: Thin Pool has 1224 free metadata blocks which is less than minimum required 1228 free metadata blocks. Create more free metadata space in thin pool or use dm.min_free_space option to change behavior.

After this error, openvswitch.service does not start:

Aug 16 11:57:39 ip-172-31-32-196 systemd: Failed to start openvswitch.service.
Aug 16 11:57:39 ip-172-31-32-196 systemd: Dependency failed for atomic-openshift-node.service.
Aug 16 11:57:39 ip-172-31-32-196 systemd: Job atomic-openshift-node.service/start failed with result 'dependency'.
Aug 16 11:57:39 ip-172-31-32-196 systemd: Unit openvswitch.service entered failed state.
Aug 16 11:57:39 ip-172-31-32-196 systemd: openvswitch.service failed.
Aug 16 11:57:44 ip-172-31-32-196 systemd: openvswitch.service holdoff time over, scheduling restart.
Aug 16 11:57:44 ip-172-31-32-196 systemd: Cannot add dependency job for unit atomic-openshift-master.service, ignoring: Unit atomic-openshift-master.service failed to load: No such file or directory.
Aug 16 11:57:44 ip-172-31-32-196 systemd: Cannot add dependency job for unit atomic-openshift-master.service, ignoring: Unit atomic-openshift-master.service failed to load: No such file or directory.
Aug 16 11:57:44 ip-172-31-32-196 systemd: Started atomic-openshift-node-dep.service.
Aug 16 11:57:44 ip-172-31-32-196 systemd: Starting atomic-openshift-node-dep.service...
Aug 16 11:57:44 ip-172-31-32-196 systemd: Starting openvswitch.service...
Aug 16 11:57:44 ip-172-31-32-196 openvswitch: Failed to remove container (openvswitch): Error response from daemon: No such container: openvswitch
Aug 16 11:57:44 ip-172-31-32-196 openvswitch: docker: Error response from daemon: Conflict. The name "/openvswitch" is already in use by container 0263d2a59521c92a6130b80ec1f92f5bbf5d1ace5270787ba43736c90a1f0b07. You have to remove (or rename) that container to be able to reuse that name..
Aug 16 11:57:44 ip-172-31-32-196 openvswitch: See '/usr/bin/docker-current run --help'.
Aug 16 11:57:44 ip-172-31-32-196 systemd: openvswitch.service: main process exited, code=exited, status=125/n/a
Aug 16 11:57:47 ip-172-31-32-196 lvm[746]: Rounding pool metadata size to boundary between physical extents: 12.00 MiB
Aug 16 11:57:47 ip-172-31-32-196 lvm[746]: Insufficient free space: 3802 extents needed, but only 1452 available
Aug 16 11:57:47 ip-172-31-32-196 lvm[746]: Failed to extend thin docker_vg-docker--pool.
Aug 16 11:57:49 ip-172-31-32-196 openvswitch: Failed to stop container (openvswitch): Error response from daemon: No such container: openvswitch

'docker ps' with that container id returns nothing on that node:
docker ps -q | grep openvswitch

Version-Release number of selected component (if applicable):
openshift v3.3.0.18
kubernetes v1.3.0+507d3a7
etcd 2.3.0+git

root@ip-172-31-36-93: ~ # docker info
Containers: 3
 Running: 3
 Paused: 0
 Stopped: 0
Images: 29
Server Version: 1.10.3
Storage Driver: devicemapper
 Pool Name: docker_vg-docker--pool
 Pool Blocksize: 524.3 kB
 Base Device Size: 10.74 GB
 Backing Filesystem: xfs
 Data file:
 Metadata file:
 Data Space Used: 5.716 GB
 Data Space Total: 12.87 GB
 Data Space Available: 7.152 GB
 Metadata Space Used: 1.225 MB
 Metadata Space Total: 33.55 MB
 Metadata Space Available: 32.33 MB
 Udev Sync Supported: true
 Deferred Removal Enabled: true
 Deferred Deletion Enabled: true
 Deferred Deleted Device Count: 0
 Library Version: 1.02.107-RHEL7 (2015-10-14)
Execution Driver: native-0.2
Logging Driver: json-file
Plugins:
 Volume: local
 Network: null host bridge
 Authorization: rhel-push-plugin
Kernel Version: 3.10.0-394.el7.x86_64
Operating System: Red Hat Enterprise Linux Server 7.2 (Maipo)
OSType: linux
Architecture: x86_64
Number of Docker Hooks: 2
CPUs: 4
Total Memory: 15.26 GiB
Name: ip-172-31-36-93.us-west-2.compute.internal
ID: 22OV:QWPO:T3UM:D5VM:2WS4:WAEZ:GPAT:M2PA:PT5M:UAL7:TEHE:INWD
WARNING: bridge-nf-call-iptables is disabled
Registries: registry.qe.openshift.com (insecure), registry.access.redhat.com (secure), docker.io (secure)

Steps to Reproduce:
1. Create a few projects
2. Keep rebuilding/scaling/redeploying them
3. Node becomes NotReady

Additional info:
1. Image/build/deployment pruning was done every day to clean up unused data
Please provide details on which commands you used for pruning.
oadm prune deployments --orphans --keep-complete=5 --keep-failed=1 --keep-younger-than=60m
oadm prune builds --orphans --keep-complete=5 --keep-failed=1 --keep-younger-than=60m
oadm prune images --keep-tag-revisions=3 --keep-younger-than=60m --confirm
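(For reference, a minimal sketch of how such daily pruning could be scheduled; the cron file name, times, and log path are assumptions rather than what was actually used, and note that 'oadm prune' only deletes anything when --confirm is passed:)

    # /etc/cron.d/openshift-prune (hypothetical): run the same prune commands nightly as root
    0 3 * * * root oadm prune deployments --orphans --keep-complete=5 --keep-failed=1 --keep-younger-than=60m --confirm >> /var/log/oadm-prune.log 2>&1
    10 3 * * * root oadm prune builds --orphans --keep-complete=5 --keep-failed=1 --keep-younger-than=60m --confirm >> /var/log/oadm-prune.log 2>&1
    20 3 * * * root oadm prune images --keep-tag-revisions=3 --keep-younger-than=60m --confirm >> /var/log/oadm-prune.log 2>&1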
Pruning removes data from etcd related to builds, deployments, and images. It also removes image layers from the registry's storage. It does **not** remove anything from the docker daemon's thin pool, which is what is apparently having an issue. That said, if you look at the output from 'docker info', it appears that everything should be OK. Sorry I can't be more helpful here. Perhaps vgoyal could?
One minor clarification: if 'oadm prune' deletes a build or a deployment and containers exist for the associated pods, those containers will be deleted, and anything in the containers' COW space will be deleted as well, which would free space in the thin pool.
Is there any other cleanup recommended to make sure this does not happen?
I'm still confused as to why you got this error. Your output from 'docker info' appears to show plenty of space. How soon after you got the initial openvswitch error did you run 'docker info'?

Kube/OpenShift will automatically prune non-running containers as needed, and there are settings you can tweak for when that kicks in. It will also automatically prune images if it is running low on space.
When I ran 'docker info', openvswitch was still not starting.
After lowering dm.min_free_space, things started working.
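(For anyone else hitting this, a minimal sketch of how that knob can be lowered on RHEL; the 5% value is only an illustrative assumption, the documented default is 10%, and lowering it just delays the failure rather than freeing any space:)

    # /etc/sysconfig/docker-storage: add dm.min_free_space to the existing options,
    # assuming docker-storage-setup created the docker_vg-docker--pool thin pool shown above
    DOCKER_STORAGE_OPTIONS="--storage-opt dm.thinpooldev=/dev/mapper/docker_vg-docker--pool --storage-opt dm.min_free_space=5%"

    # restart docker for the option to take effect
    systemctl restart docker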
Vikas, can you run docker on this system and, while docker is running, run the "dmsetup status" and "lvs -a" commands and paste the output here?
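(For reference, a minimal sketch of running those diagnostics, assuming the docker_vg-docker--pool thin pool from the docker info output above; adjust the VG/LV names to the host:)

    # data_percent and metadata_percent show how full each part of the thin pool is
    lvs -a -o lv_name,vg_name,data_percent,metadata_percent docker_vg

    # raw thin-pool status: after the transaction id come
    # used/total metadata blocks and then used/total data blocks
    dmsetup status docker_vg-docker--pool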
Hi Vivek, I do not have this cluster around any more. I am going to start the app reliability tests today and will update this bug if I hit this problem again.
Andy, I think you are talking about the following settings; we have them on all the nodes. I will stop doing pruning, I guess, because these settings should take care of cleanup automatically.

  image-gc-high-threshold:
  - '80'
  image-gc-low-threshold:
  - '70'
  max-pods:
  - '250'
  maximum-dead-containers:
  - '20'
  maximum-dead-containers-per-container:
  - '1'
  minimum-container-ttl-duration:
  - 10s

Please let me know if there is anything else I should do.

(In reply to Andy Goldstein from comment #7)
> I'm still confused as to why you got this error. Your output from 'docker
> info' appears to show plenty of space. How soon after you got the initial
> openvswitch error did you run 'docker info'?
>
> Kube/OpenShift will automatically prune non-running containers as needed,
> and there are settings you can tweak for when that kicks in. It will also
> automatically prune images if it is running low on space.
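(For context, on OpenShift 3.x these garbage-collection settings normally live under kubeletArguments in the node's /etc/origin/node/node-config.yaml; a minimal sketch of that layout, assuming the standard config file location, with the values quoted above:)

    kubeletArguments:
      image-gc-high-threshold:
      - '80'
      image-gc-low-threshold:
      - '70'
      maximum-dead-containers:
      - '20'
      maximum-dead-containers-per-container:
      - '1'
      minimum-container-ttl-duration:
      - '10s'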
Yes, but you should continue to run 'oadm prune' so it can get rid of completed builds and deployments and their associated pods/containers.
@vikas, have you been able to reproduce the problem? I think your thin pool just filled up and that's why docker refused to start new containers. Lowering min_free_space just allows you to go a little further until you fill the last remaining free space. So there should be a good mechanism in OpenShift to keep track of free space and keep cleaning images/containers to make sure there is sufficient free space in the thin pool. After that, either stop sending jobs to that node or add more storage to that node.
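(As a rough illustration of that kind of free-space watchdog, here is a minimal sketch; the LV name, the 80% threshold, and the cleanup action are all assumptions to adapt per node:)

    #!/bin/bash
    # Warn (and optionally cordon the node) when the docker thin pool is nearly full.
    POOL="docker_vg/docker-pool"   # hypothetical vg/lv name; adjust to match the host
    THRESHOLD=80

    data=$(lvs --noheadings -o data_percent "$POOL" | tr -d ' ')
    meta=$(lvs --noheadings -o metadata_percent "$POOL" | tr -d ' ')

    # compare only the integer part of the reported percentages
    if [ "${data%.*}" -ge "$THRESHOLD" ] || [ "${meta%.*}" -ge "$THRESHOLD" ]; then
        echo "WARNING: thin pool $POOL usage: data=${data}% metadata=${meta}%" >&2
        # e.g. stop scheduling new pods here until space is reclaimed or storage is added
        # oadm manage-node "$(hostname)" --schedulable=false
    fi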
@Vivek, I am still running the tests and will update the bug if/when I encounter the issue again. If not, I guess we will close this bug.
I am not able to reproduce this issue after a couple of reliability runs on the containerized install. Closing this bug.