Description of problem:
-----------------------
This is on an OCP cluster of 5 nodes: 1 etcd, 1 master, 1 infra node, and 2 application nodes, running on AWS EC2 m4.xlarge instances.

NOTE: This cluster was installed with docker 1.10.3.

During multi-day reliability testing (where multiple projects are created over time, users added/removed, builds executed, etc.), on day 2 we started seeing intermittent build errors for several of the sample applications (dancer-mysql-example, cakephp-example, django-psql-example, etc.) at a 35% failure rate:

error: Execution of post execute step failed
warning: Failed to remove container "80a380004979dd6536e6b76fe3460d0391090d96418d022b763b3744d54a7c23": Error response from daemon: Driver devicemapper failed to remove root filesystem 80a380004979dd6536e6b76fe3460d0391090d96418d022b763b3744d54a7c23: mount still active
error: build error: building walid/dancer-mysql-example-1:02773ab0 failed when committing the image due to error: Cannot connect to the Docker daemon. Is the docker daemon running on this host?

It is unclear what is causing the "mount still active" error.

Version-Release number of selected component (if applicable):
-------------------------------------------------------------
oc v3.4.0.32+d349492
kubernetes v1.4.0+776c994
features: Basic-Auth GSSAPI Kerberos SPNEGO

openshift v3.4.0.32+d349492
kubernetes v1.4.0+776c994

docker 1.10.3

How reproducible:
-----------------
Reproducible almost at will.

Steps to Reproduce:
===================
1. Install OCP 3.4.0.32 cluster with docker 1.10.3
2. Run reliability testing, then manually create a new project
3. Inside the new project, run: oc new-app dancer-mysql-example

Right now on this cluster I can reproduce it the first time I create a new project and a new app.
Actual results:
===============
# oc get pods
NAME                           READY     STATUS    RESTARTS   AGE
dancer-mysql-example-1-build   0/1       Error     0          38m
database-1-h939j               1/1       Running   0          38m

Expected results:
=================
# oc get pods
NAME                           READY     STATUS      RESTARTS   AGE
dancer-mysql-example-1-build   0/1       Completed   0          38m
dancer-mysql-example-1-pk317   1/1       Running     0          38m
database-1-h939j               1/1       Running     0          38m

Additional info:
================
Examining /usr/lib/systemd/system/docker.service shows we are running with MountFlags=slave.

On the node where the build failed, I do not see a docker container with the same ID as the container it failed to remove ("80a380004979dd6536e6b76fe3460d0391090d96418d022b763b3744d54a7c23", or "80a38000497" in the errored build log, attached).

Attempt at finding the offending process using the technique described in https://bugzilla.redhat.com/show_bug.cgi?id=1391665#c7:

# find /proc/*/mounts | xargs grep 80a38000
grep: /proc/130827/mounts: No such file or directory

I also ran docker ps -a | grep <project_name> and tried the find command above on the two exited containers:

# docker ps -a | grep walid
783c97d7aabf  registry.access.redhat.com/rhscl/mysql-56-rhel7@sha256:0d32a738023a7e76e5df41a69e4c77cae80ed60676f7b5455dc70364896cc32b  "container-entrypoint"  About an hour ago  Up About an hour  k8s_mysql.486ea395_database-1-h939j_walid_15a55b28-bd8d-11e6-a6b0-02b95abd7a23_168846b8
0cdc1c55c3f5  registry.ops.openshift.com/openshift3/ose-pod:v3.4.0.32  "/pod"  About an hour ago  Up About an hour  k8s_POD.f7ee6ba_database-1-h939j_walid_15a55b28-bd8d-11e6-a6b0-02b95abd7a23_356b4f22
5e8ac33d0b6b  registry.ops.openshift.com/openshift3/ose-sti-builder:v3.4.0.32  "/usr/bin/openshift-s"  About an hour ago  Exited (1) 54 minutes ago  k8s_sti-build.6cb594e0_dancer-mysql-example-1-build_walid_fd35c41b-bd8c-11e6-a6b0-02b95abd7a23_6a76125d
081230e27790  registry.ops.openshift.com/openshift3/ose-pod:v3.4.0.32  "/pod"  About an hour ago  Exited (0) 54 minutes ago  k8s_POD.f7ee6ba_dancer-mysql-example-1-build_walid_fd35c41b-bd8c-11e6-a6b0-02b95abd7a23_4cbc6d38

# find /proc/*/mounts | xargs grep "08123"
grep: /proc/128523/mounts: No such file or directory
# find /proc/*/mounts | xargs grep 081230e27790
grep: /proc/129424/mounts: No such file or directory
# find /proc/*/mounts | xargs grep 5e8ac33d0b6b
grep: /proc/130174/mounts: No such file or directory

docker-current PID: 9434
# ls -l /proc/9434/ns/mnt
lrwxrwxrwx. 1 root root 0 Dec  8 16:43 /proc/9434/ns/mnt -> mnt:[4026532301]
# ls -l /proc/self/ns/mnt
lrwxrwxrwx. 1 root root 0 Dec  8 17:18 /proc/self/ns/mnt -> mnt:[4026531840]
# ls -l /proc/$$/ns/mnt
lrwxrwxrwx. 1 root root 0 Dec  8 17:17 /proc/91851/ns/mnt -> mnt:[4026531840]

Please email me if you need access to this testbed. If you send me your id_rsa.pub public key, I can add it to the authorized_keys file on my nodes so you can ssh in. Access info in the next comment.
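The search-and-compare steps above can be sketched as one small script. This is a hedged sketch of the technique from bz#1391665#c7, not the exact tooling used; it assumes any process still pinning the container's root filesystem will show the container ID somewhere in its /proc/<pid>/mounts.

```shell
#!/bin/sh
# Find processes whose mount table still references a container, then show
# each holder's mount namespace next to PID 1's. The default ID below is the
# one from this bug's "mount still active" error.
CONTAINER_ID="${1:-80a38000}"

# PIDs whose /proc/<pid>/mounts still mentions the container ID.
pids=$(grep -l "$CONTAINER_ID" /proc/[0-9]*/mounts 2>/dev/null |
       sed 's|/proc/\([0-9]*\)/mounts|\1|')

for pid in $pids; do
    # readlink prints e.g. mnt:[4026531840]; a number differing from PID 1's
    # means the holder lives in a private mount namespace (as happens with
    # daemons started under MountFlags=slave).
    echo "pid=$pid ns=$(readlink "/proc/$pid/ns/mnt") init_ns=$(readlink /proc/1/ns/mnt)"
done
[ -n "$pids" ] || echo "no process currently holds a mount for $CONTAINER_ID"
```

If the loop prints a PID whose ns differs from init's, that process (or a daemon sharing its namespace) is the likely reason devicemapper reports "mount still active".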
The root cause of this bug appears to be this error:

error: build error: building walid/dancer-mysql-example-1:02773ab0 failed when committing the image due to error: Cannot connect to the Docker daemon. Is the docker daemon running on this host?

That is the first error that occurs. After it occurs, we attempt to remove the container, which results in this error:

error: Execution of post execute step failed
warning: Failed to remove container "80a380004979dd6536e6b76fe3460d0391090d96418d022b763b3744d54a7c23": Error response from daemon: Driver devicemapper failed to remove root filesystem 80a380004979dd6536e6b76fe3460d0391090d96418d022b763b3744d54a7c23: mount still active

The errors are printed in reverse order because of how our error-handling logic is processed. Note that even though we couldn't commit the container, I would still expect us to be able to remove it, so both errors are a problem; the main problem, though, is the failure to commit the container.
Raised https://bugzilla.redhat.com/show_bug.cgi?id=1405272 for docker-storage-setup setting an unsupported option in /etc/sysconfig/docker-storage on RHEL 7.3.1
Mike, upstream docker-storage-setup has already been fixed to determine dynamically whether the underlying kernel supports deferred_deletion and to set the option accordingly. We just need to make sure that both the docker-1.12 and docker-1.10 builds have the latest docker-storage-setup. Lokesh?
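As a quick check on a node, the deferred removal/deletion state the daemon actually negotiated can be read back from `docker info`. A hedged sketch: the field names below are what devicemapper-backed daemons of this era print, and `deferred_status` is a helper name introduced here, not part of any tooling.

```shell
# Hypothetical helper: parse `docker info`-style text on stdin and print the
# devicemapper deferred removal/deletion flags, one per line.
deferred_status() {
    awk -F': ' '/Deferred (Removal|Deletion) Enabled/ {
        gsub(/^ +/, "", $1)   # docker info indents storage-driver fields
        print $1 "=" $2
    }'
}

# Against a live daemon (requires docker, so left commented out here):
# docker info 2>/dev/null | deferred_status
```

If deferred deletion shows as disabled on a kernel that should support it, the sysconfig written by docker-storage-setup is the first place to look.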
Vivek, do you want it with the overlay patch that got merged yesterday for RHEL 7.3.2?
Dan, I am merging Shishir's changes for the docker root volume now. We should not pull that one in yet; it is very new code. I think we should pull in up to the following commit:

commit 516cb9c0bc14883f46ef2362a3f5abd4d6c20b1e
Author: Vivek Goyal <vgoyal>
Date:   Tue Nov 15 09:13:45 2016 -0500

    Let lvm create metadata volume automatically
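One way to verify that a given docker-storage-setup checkout includes everything up to that commit is `git merge-base --is-ancestor`. A sketch only: `includes_commit` and the repo path are placeholders introduced here, not part of the build tooling.

```shell
# Hypothetical helper: succeed if commit $2 is an ancestor of HEAD in the
# repository at $1, i.e. the tree contains all history up to that commit.
includes_commit() {
    git -C "$1" merge-base --is-ancestor "$2" HEAD
}

# Example (the path is a placeholder for wherever d-s-s is checked out):
# includes_commit /root/docker-storage-setup 516cb9c0bc14883f46ef2362a3f5abd4d6c20b1e \
#     && echo "build includes the commit"
```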
Should be fixed in all versions of OCP 3.5. For testing purposes, use OCP v3.5.0.17 or newer, and docker 1.10.3 or newer.
It works fine on OCP 3.5 (openshift v3.5.0.17+c55cf2b); moving to VERIFIED.