Description of problem:
On one cluster, 25-50% of builds are failing with this build error:

build error: Failed to push image. Response from registry is: open /dev/mapper/docker-253:4-33595986-5b6aba0f60e86a8734b53c09d19763302917956ed4cbc49ded701c722e38ad5b: no such file or directory

We have a docker registry running on a single infra node. The docker service has been restarted, but the issue persists.

Version-Release number of selected component (if applicable):
oc v3.2.1.15-8-gc402626
kubernetes v1.2.0-36-g4a3f9c5
atomic-openshift-3.2.1.15-1.git.8.c402626.el7.x86_64
docker-1.9.1-40.el7.x86_64

How reproducible:
25-50% of the time so far

Steps to Reproduce:
1. Create a new project and new app of any type.
2. Watch the build logs for the error.

Actual results:
Sometimes the build fails with the error.

Expected results:
Build should succeed every time.

Additional info:
Can you paste the "docker info" output as well?
I suspect the following fix might help: https://github.com/projectatomic/docker/pull/188 However, this fix is available only in docker-1.10. Can you please try docker-1.10 and see if the problem still happens?
Alternatively, on docker-1.9, try disabling the deferred device removal feature and see if that works. You will have to remove "--storage-opt dm.use_deferred_removal=true" from /etc/sysconfig/docker-storage and restart docker.
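To illustrate the edit suggested above, here is a minimal Python sketch that strips one "--storage-opt" pair from an options string. It assumes the options live in a single space-separated string, as in the DOCKER_STORAGE_OPTIONS value usually found in /etc/sysconfig/docker-storage. The helper name drop_storage_opt and the example string are hypothetical and not taken from the affected hosts; the sketch only transforms the string and does not touch the file or restart docker.

```python
import shlex


def drop_storage_opt(options: str, key: str) -> str:
    """Return the options string without any '--storage-opt <key>=...' pair."""
    tokens = shlex.split(options)
    kept = []
    i = 0
    while i < len(tokens):
        tok = tokens[i]
        # Form 1: "--storage-opt" and "key=value" are separate tokens.
        if (tok == "--storage-opt" and i + 1 < len(tokens)
                and tokens[i + 1].split("=", 1)[0] == key):
            i += 2
            continue
        # Form 2: "--storage-opt=key=value" is a single token.
        prefix = "--storage-opt="
        if tok.startswith(prefix) and tok[len(prefix):].split("=", 1)[0] == key:
            i += 1
            continue
        kept.append(tok)
        i += 1
    return " ".join(kept)


# Hypothetical options string; the real value on a given host will differ.
example = ("--storage-driver devicemapper --storage-opt dm.fs=xfs "
           "--storage-opt dm.use_deferred_removal=true "
           "--storage-opt dm.thinpooldev=/dev/mapper/docker_vg-docker--pool")

print(drop_storage_opt(example, "dm.use_deferred_removal"))
```

The result would still have to be written back to the file by hand, followed by a docker restart, as described in the comment above.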
Containers: 17
Images: 265
Server Version: 1.9.1
Storage Driver: devicemapper
 Pool Name: docker_vg-docker--pool
 Pool Blocksize: 524.3 kB
 Base Device Size: 3.221 GB
 Backing Filesystem: xfs
 Data file:
 Metadata file:
 Data Space Used: 13.01 GB
 Data Space Total: 212.6 GB
 Data Space Available: 199.6 GB
 Metadata Space Used: 5.825 MB
 Metadata Space Total: 218.1 MB
 Metadata Space Available: 212.3 MB
 Udev Sync Supported: true
 Deferred Removal Enabled: true
 Deferred Deletion Enabled: true
 Deferred Deleted Device Count: 0
 Library Version: 1.02.107-RHEL7 (2016-06-09)
Execution Driver: native-0.2
Logging Driver: json-file
Kernel Version: 3.10.0-327.22.2.el7.x86_64
Operating System: Red Hat Enterprise Linux Server 7.2 (Maipo)
CPUs: 4
Total Memory: 15.26 GiB
Name: ip-172-31-10-24.ec2.internal
ID: MO2H:GBUK:PRXK:7JAZ:S6K5:6CX3:E56B:O3NL:V5DN:H2LJ:FFS5:6JWM
WARNING: bridge-nf-call-iptables is disabled
WARNING: bridge-nf-call-ip6tables is disabled
We have our registry pods locked to two specific nodes; on each of those nodes, I re-initialized Docker storage with deferred deletion and deferred removal disabled. The exact same error (down to the UID in the DM device) persisted. The error also persisted after I re-enabled deferred removal/deletion, rebooted the hosts, and re-initialized storage again.
Further info: The STI build we're deploying is run from a script every half-hour as an end-to-end test. The failure rate is roughly 90%. We're doing the STI build from this repo: https://github.com/openshift/nodejs-ex
Is it possible to enable debug in the docker daemon (-D flag) and restart the daemon? Once the problem happens again, please collect the journal logs and attach them to the bug. I would like to have a look at the logs and see if I can spot something.
Does anybody know what this ID "5b6aba0f60e86a8734b53c09d19763302917956ed4cbc49ded701c722e38ad5b" is? Is it an image ID? If yes, could this be a race with image deletion, where some other component in the system tried to delete the image while we were trying to push it?
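As an aid to tracking down that ID, the following Python sketch pulls the pieces out of the device path in the error message. It assumes the devicemapper naming pattern docker-<major>:<minor>-<inode>-<id>, which matches the path reported in this bug but is not confirmed anywhere in the thread. The function name parse_dm_device and the field names are hypothetical.

```python
import re

# Assumed layout: /dev/mapper/docker-<major>:<minor>-<inode>-<64 hex chars>
DEVICE_RE = re.compile(
    r"/dev/mapper/docker-(?P<major>\d+):(?P<minor>\d+)"
    r"-(?P<inode>\d+)-(?P<dev_id>[0-9a-f]{64})"
)


def parse_dm_device(message: str):
    """Extract device fields from an error message, or return None if absent."""
    m = DEVICE_RE.search(message)
    if m is None:
        return None
    return {
        "major": int(m.group("major")),
        "minor": int(m.group("minor")),
        "inode": int(m.group("inode")),
        "dev_id": m.group("dev_id"),
    }


# Error text copied from the bug description.
error = ("build error: Failed to push image. Response from registry is: open "
         "/dev/mapper/docker-253:4-33595986-"
         "5b6aba0f60e86a8734b53c09d19763302917956ed4cbc49ded701c722e38ad5b"
         ": no such file or directory")

info = parse_dm_device(error)
print(info["dev_id"])
```

The extracted ID could then be compared against the output of "docker images --no-trunc" and "docker ps -a --no-trunc" on each node to see which host holds the matching image or container.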
The error message also says "Response from registry is". So is this an error message from the registry? Should the registry developers have a look? Or could somebody who knows the openshift side better break it down a little in terms of docker commands, so that I can begin to understand the workflow?
Despite the error pretty specifically pointing at the Docker registry, there was a bad image on one of our compute nodes, which are separated from the registry. Once Vivek pointed me to the right node, I was able to wipe docker storage on that node and get STI builds back to normal.