Description of problem: Installing an HA containerized environment on basic RHEL (*not* Atomic) in AWS seems to almost always fail to start docker on some system the first time running the openshift-ansible config playbooks. Re-running seems to fix it every time. I have been able to get onto one of the systems affected immediately after it happens and docker seems to be doing something after the install fails, looked like mkfs commands. Version-Release number of selected component (if applicable): Using latest openshift master, around openshift-ansible-3.0.90-1. On the affected host: docker-1.8.2-10.el7.x86_64 How reproducible: I believe it's frequent, but not 100% reproducible. However I virtually always see it with an HA install in AWS. (3 masters, 2 nodes, 1 load balancer) It seems to show up earlier sometimes, later others, so I think it does not affect every host or even the same host every time. Steps to Reproduce: 1. Install an HA env in AWS using RHEL server, but containerized. Will attach sample inventory (ansible-playbook -b -i ./hosts ~/src/openshift-ansible/playbooks/byo/openshift-cluster/config.yml) Actual results: TASK: [docker | Start the docker service] ************************************* failed: [ec2-54-175-159-33.compute-1.amazonaws.com] => {"failed": true} msg: Job for docker.service failed because the control process exited with error code. See "systemctl status docker.service" and "journalctl -xe" for details. failed: [ec2-54-173-164-148.compute-1.amazonaws.com] => {"failed": true} msg: Job for docker.service failed because the control process exited with error code. See "systemctl status docker.service" and "journalctl -xe" for details. FATAL: all hosts have already failed -- aborting PLAY RECAP ******************************************************************** to retry, use: --limit @/home/dgoodwin/config.retry ec2-52-207-229-152.compute-1.amazonaws.com : ok=39 changed=12 unreachable=0 failed=0 ec2-54-152-241-168.compute-1.amazonaws.com : ok=295 changed=85 unreachable=0 failed=0 ec2-54-172-46-115.compute-1.amazonaws.com : ok=468 changed=128 unreachable=0 failed=0 ec2-54-173-164-148.compute-1.amazonaws.com : ok=47 changed=6 unreachable=0 failed=1 ec2-54-175-159-33.compute-1.amazonaws.com : ok=47 changed=6 unreachable=0 failed=1 ec2-54-85-74-39.compute-1.amazonaws.com : ok=295 changed=85 unreachable=0 failed=0 localhost : ok=28 changed=0 unreachable=0 failed=0 ansible-playbook -b -i ./hosts 61.35s user 29.50s system 2% cpu 53:53.15 total docker logs on the failing host: [root@ip-172-18-13-184 ec2-user]# journalctl -u docker -- Logs begin at Tue 2016-05-17 07:54:48 EDT, end at Tue 2016-05-17 09:16:02 EDT. -- May 17 08:59:30 ip-172-18-13-184.ec2.internal systemd[1]: Starting Docker Application Container Engine... May 17 08:59:30 ip-172-18-13-184.ec2.internal docker[13322]: time="2016-05-17T08:59:30.036001327-04:00" level=info msg="Listening for HTTP on unix (/var/run/docker.sock)" May 17 08:59:31 ip-172-18-13-184.ec2.internal docker[13322]: time="2016-05-17T08:59:31.952741731-04:00" level=error msg="WARNING: No --storage-opt dm.thinpooldev specified, using loopback; this configuration is strongly discouraged for production use" May 17 09:00:30 ip-172-18-13-184.ec2.internal systemd[1]: docker.service start operation timed out. Terminating. May 17 09:00:30 ip-172-18-13-184.ec2.internal systemd[1]: docker.service: main process exited, code=exited, status=2/INVALIDARGUMENT May 17 09:00:30 ip-172-18-13-184.ec2.internal systemd[1]: Failed to start Docker Application Container Engine. May 17 09:00:30 ip-172-18-13-184.ec2.internal systemd[1]: Unit docker.service entered failed state. May 17 09:00:30 ip-172-18-13-184.ec2.internal systemd[1]: docker.service failed. May 17 09:00:30 ip-172-18-13-184.ec2.internal systemd[1]: docker.service holdoff time over, scheduling restart. May 17 09:00:56 ip-172-18-13-184.ec2.internal systemd[1]: Starting Docker Application Container Engine... May 17 09:00:56 ip-172-18-13-184.ec2.internal docker[13372]: time="2016-05-17T09:00:56.823813309-04:00" level=info msg="Listening for HTTP on unix (/var/run/docker.sock)" May 17 09:00:56 ip-172-18-13-184.ec2.internal docker[13372]: time="2016-05-17T09:00:56.833578526-04:00" level=error msg="WARNING: No --storage-opt dm.thinpooldev specified, using loopback; this configuration is strongly discouraged for production use" May 17 09:01:56 ip-172-18-13-184.ec2.internal systemd[1]: docker.service start operation timed out. Terminating. May 17 09:01:56 ip-172-18-13-184.ec2.internal systemd[1]: docker.service: main process exited, code=exited, status=2/INVALIDARGUMENT May 17 09:01:56 ip-172-18-13-184.ec2.internal systemd[1]: Failed to start Docker Application Container Engine. May 17 09:01:56 ip-172-18-13-184.ec2.internal systemd[1]: Unit docker.service entered failed state. May 17 09:01:56 ip-172-18-13-184.ec2.internal systemd[1]: docker.service failed. May 17 09:01:57 ip-172-18-13-184.ec2.internal systemd[1]: docker.service holdoff time over, scheduling restart. May 17 09:02:26 ip-172-18-13-184.ec2.internal systemd[1]: Starting Docker Application Container Engine... May 17 09:02:26 ip-172-18-13-184.ec2.internal docker[13434]: time="2016-05-17T09:02:26.872750954-04:00" level=info msg="Listening for HTTP on unix (/var/run/docker.sock)" May 17 09:02:26 ip-172-18-13-184.ec2.internal docker[13434]: time="2016-05-17T09:02:26.954054985-04:00" level=error msg="WARNING: No --storage-opt dm.thinpooldev specified, using loopback; this configuration is strongly discouraged for production use" May 17 09:02:59 ip-172-18-13-184.ec2.internal docker[13434]: time="2016-05-17T09:02:59.853750926-04:00" level=info msg="[graphdriver] using prior storage driver \"devicemapper\"" May 17 09:02:59 ip-172-18-13-184.ec2.internal docker[13434]: time="2016-05-17T09:02:59.865893991-04:00" level=info msg="Option DefaultDriver: bridge" May 17 09:02:59 ip-172-18-13-184.ec2.internal docker[13434]: time="2016-05-17T09:02:59.865926970-04:00" level=info msg="Option DefaultNetwork: bridge" May 17 09:03:00 ip-172-18-13-184.ec2.internal docker[13434]: time="2016-05-17T09:03:00.133639768-04:00" level=warning msg="Running modprobe bridge nf_nat br_netfilter failed with message: modprobe: WARNING: Module br_netfilter not found.\ninsmod /lib/mod May 17 09:03:00 ip-172-18-13-184.ec2.internal docker[13434]: time="2016-05-17T09:03:00.138066032-04:00" level=info msg="Firewalld running: false" May 17 09:03:00 ip-172-18-13-184.ec2.internal docker[13434]: time="2016-05-17T09:03:00.788056494-04:00" level=info msg="Loading containers: start." May 17 09:03:00 ip-172-18-13-184.ec2.internal docker[13434]: time="2016-05-17T09:03:00.788250595-04:00" level=info msg="Loading containers: done." May 17 09:03:00 ip-172-18-13-184.ec2.internal docker[13434]: time="2016-05-17T09:03:00.788271061-04:00" level=info msg="Daemon has completed initialization" May 17 09:03:00 ip-172-18-13-184.ec2.internal docker[13434]: time="2016-05-17T09:03:00.788296510-04:00" level=info msg="Docker daemon" commit="a01dc02/1.8.2" execdriver=native-0.2 graphdriver=devicemapper version=1.8.2-el7 May 17 09:03:00 ip-172-18-13-184.ec2.internal systemd[1]: Started Docker Application Container Engine. May 17 09:11:54 ip-172-18-13-184.ec2.internal docker[13434]: time="2016-05-17T09:11:54.843808024-04:00" level=info msg="GET /v1.20/containers/json"
This is reportedly a known issue with using Docker 1.8, and the unmodified setup for docker storage. Closing.