1336807 – Docker sometimes fails to start on first run of openshift-ansible

Bug 1336807 - Docker sometimes fails to start on first run of openshift-ansible

Summary: Docker sometimes fails to start on first run of openshift-ansible

Keywords:
Status:	CLOSED WONTFIX
Alias:	None
Product:	OpenShift Container Platform
Classification:	Red Hat
Component:	Installer
Sub Component:
Version:	3.2.0
Hardware:	Unspecified
OS:	Unspecified
Priority:	unspecified
Severity:	medium
Target Milestone:	---
Target Release:	---
Assignee:	Jason DeTiberus
QA Contact:	Ma xiaoqiang
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:
TreeView+	depends on / blocked

Reported:	2016-05-17 13:29 UTC by Devan Goodwin
Modified:	2016-07-04 00:46 UTC (History)
CC List:	4 users (show)
Fixed In Version:
Doc Type:	If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed:	2016-05-17 19:00:29 UTC
Target Upstream Version:
Embargoed:

Attachments	(Terms of Use)

Description Devan Goodwin 2016-05-17 13:29:16 UTC

Description of problem:

Installing an HA containerized environment on basic RHEL (*not* Atomic) in AWS seems to almost always fail to start docker on some system the first time running the openshift-ansible config playbooks.

Re-running seems to fix it every time.

I have been able to get onto one of the systems affected immediately after it happens and docker seems to be doing something after the install fails, looked like mkfs commands.


Version-Release number of selected component (if applicable):

Using latest openshift master, around openshift-ansible-3.0.90-1.

On the affected host: docker-1.8.2-10.el7.x86_64


How reproducible:

I believe it's frequent, but not 100% reproducible. However I virtually always see it with an HA install in AWS. (3 masters, 2 nodes, 1 load balancer) It seems to show up earlier sometimes, later others, so I think it does not affect every host or even the same host every time.

Steps to Reproduce:

1. Install an HA env in AWS using RHEL server, but containerized. Will attach sample inventory (ansible-playbook -b -i ./hosts ~/src/openshift-ansible/playbooks/byo/openshift-cluster/config.yml)

Actual results:

TASK: [docker | Start the docker service] ************************************* 
failed: [ec2-54-175-159-33.compute-1.amazonaws.com] => {"failed": true}
msg: Job for docker.service failed because the control process exited with error code. See "systemctl status docker.service" and "journalctl -xe" for details.

failed: [ec2-54-173-164-148.compute-1.amazonaws.com] => {"failed": true}
msg: Job for docker.service failed because the control process exited with error code. See "systemctl status docker.service" and "journalctl -xe" for details.


FATAL: all hosts have already failed -- aborting

PLAY RECAP ******************************************************************** 
           to retry, use: --limit @/home/dgoodwin/config.retry

ec2-52-207-229-152.compute-1.amazonaws.com : ok=39   changed=12   unreachable=0    failed=0   
ec2-54-152-241-168.compute-1.amazonaws.com : ok=295  changed=85   unreachable=0    failed=0   
ec2-54-172-46-115.compute-1.amazonaws.com : ok=468  changed=128  unreachable=0    failed=0   
ec2-54-173-164-148.compute-1.amazonaws.com : ok=47   changed=6    unreachable=0    failed=1   
ec2-54-175-159-33.compute-1.amazonaws.com : ok=47   changed=6    unreachable=0    failed=1   
ec2-54-85-74-39.compute-1.amazonaws.com : ok=295  changed=85   unreachable=0    failed=0   
localhost                  : ok=28   changed=0    unreachable=0    failed=0   

ansible-playbook -b -i ./hosts   61.35s user 29.50s system 2% cpu 53:53.15 total


docker logs on the failing host:

[root@ip-172-18-13-184 ec2-user]# journalctl -u docker
-- Logs begin at Tue 2016-05-17 07:54:48 EDT, end at Tue 2016-05-17 09:16:02 EDT. --
May 17 08:59:30 ip-172-18-13-184.ec2.internal systemd[1]: Starting Docker Application Container Engine...
May 17 08:59:30 ip-172-18-13-184.ec2.internal docker[13322]: time="2016-05-17T08:59:30.036001327-04:00" level=info msg="Listening for HTTP on unix (/var/run/docker.sock)"
May 17 08:59:31 ip-172-18-13-184.ec2.internal docker[13322]: time="2016-05-17T08:59:31.952741731-04:00" level=error msg="WARNING: No --storage-opt dm.thinpooldev specified, using loopback; this configuration is strongly discouraged for production use"
May 17 09:00:30 ip-172-18-13-184.ec2.internal systemd[1]: docker.service start operation timed out. Terminating.
May 17 09:00:30 ip-172-18-13-184.ec2.internal systemd[1]: docker.service: main process exited, code=exited, status=2/INVALIDARGUMENT
May 17 09:00:30 ip-172-18-13-184.ec2.internal systemd[1]: Failed to start Docker Application Container Engine.
May 17 09:00:30 ip-172-18-13-184.ec2.internal systemd[1]: Unit docker.service entered failed state.
May 17 09:00:30 ip-172-18-13-184.ec2.internal systemd[1]: docker.service failed.
May 17 09:00:30 ip-172-18-13-184.ec2.internal systemd[1]: docker.service holdoff time over, scheduling restart.
May 17 09:00:56 ip-172-18-13-184.ec2.internal systemd[1]: Starting Docker Application Container Engine...
May 17 09:00:56 ip-172-18-13-184.ec2.internal docker[13372]: time="2016-05-17T09:00:56.823813309-04:00" level=info msg="Listening for HTTP on unix (/var/run/docker.sock)"
May 17 09:00:56 ip-172-18-13-184.ec2.internal docker[13372]: time="2016-05-17T09:00:56.833578526-04:00" level=error msg="WARNING: No --storage-opt dm.thinpooldev specified, using loopback; this configuration is strongly discouraged for production use"
May 17 09:01:56 ip-172-18-13-184.ec2.internal systemd[1]: docker.service start operation timed out. Terminating.
May 17 09:01:56 ip-172-18-13-184.ec2.internal systemd[1]: docker.service: main process exited, code=exited, status=2/INVALIDARGUMENT
May 17 09:01:56 ip-172-18-13-184.ec2.internal systemd[1]: Failed to start Docker Application Container Engine.
May 17 09:01:56 ip-172-18-13-184.ec2.internal systemd[1]: Unit docker.service entered failed state.
May 17 09:01:56 ip-172-18-13-184.ec2.internal systemd[1]: docker.service failed.
May 17 09:01:57 ip-172-18-13-184.ec2.internal systemd[1]: docker.service holdoff time over, scheduling restart.
May 17 09:02:26 ip-172-18-13-184.ec2.internal systemd[1]: Starting Docker Application Container Engine...
May 17 09:02:26 ip-172-18-13-184.ec2.internal docker[13434]: time="2016-05-17T09:02:26.872750954-04:00" level=info msg="Listening for HTTP on unix (/var/run/docker.sock)"
May 17 09:02:26 ip-172-18-13-184.ec2.internal docker[13434]: time="2016-05-17T09:02:26.954054985-04:00" level=error msg="WARNING: No --storage-opt dm.thinpooldev specified, using loopback; this configuration is strongly discouraged for production use"
May 17 09:02:59 ip-172-18-13-184.ec2.internal docker[13434]: time="2016-05-17T09:02:59.853750926-04:00" level=info msg="[graphdriver] using prior storage driver \"devicemapper\""
May 17 09:02:59 ip-172-18-13-184.ec2.internal docker[13434]: time="2016-05-17T09:02:59.865893991-04:00" level=info msg="Option DefaultDriver: bridge"
May 17 09:02:59 ip-172-18-13-184.ec2.internal docker[13434]: time="2016-05-17T09:02:59.865926970-04:00" level=info msg="Option DefaultNetwork: bridge"
May 17 09:03:00 ip-172-18-13-184.ec2.internal docker[13434]: time="2016-05-17T09:03:00.133639768-04:00" level=warning msg="Running modprobe bridge nf_nat br_netfilter failed with message: modprobe: WARNING: Module br_netfilter not found.\ninsmod /lib/mod
May 17 09:03:00 ip-172-18-13-184.ec2.internal docker[13434]: time="2016-05-17T09:03:00.138066032-04:00" level=info msg="Firewalld running: false"
May 17 09:03:00 ip-172-18-13-184.ec2.internal docker[13434]: time="2016-05-17T09:03:00.788056494-04:00" level=info msg="Loading containers: start."
May 17 09:03:00 ip-172-18-13-184.ec2.internal docker[13434]: time="2016-05-17T09:03:00.788250595-04:00" level=info msg="Loading containers: done."
May 17 09:03:00 ip-172-18-13-184.ec2.internal docker[13434]: time="2016-05-17T09:03:00.788271061-04:00" level=info msg="Daemon has completed initialization"
May 17 09:03:00 ip-172-18-13-184.ec2.internal docker[13434]: time="2016-05-17T09:03:00.788296510-04:00" level=info msg="Docker daemon" commit="a01dc02/1.8.2" execdriver=native-0.2 graphdriver=devicemapper version=1.8.2-el7
May 17 09:03:00 ip-172-18-13-184.ec2.internal systemd[1]: Started Docker Application Container Engine.
May 17 09:11:54 ip-172-18-13-184.ec2.internal docker[13434]: time="2016-05-17T09:11:54.843808024-04:00" level=info msg="GET /v1.20/containers/json"

Comment 2 Devan Goodwin 2016-05-17 19:00:29 UTC

This is reportedly a known issue with using Docker 1.8, and the unmodified setup for docker storage. Closing.

Note You need to log in before you can comment on or make changes to this bug.