Bug 1263394

Summary:	Docker does not start up on the node
Product:	Red Hat Enterprise Linux 7	Reporter:	Veer Muchandi <veer>
Component:	docker	Assignee:	Lokesh Mandvekar <lsm5>
Status:	CLOSED CURRENTRELEASE	QA Contact:	atomic-bugs <atomic-bugs>
Severity:	high	Docs Contact:
Priority:	unspecified
Version:	7.1	CC:	aos-bugs, dwalsh, jlothian, jokerman, lsm5, mmccomas, sdodson, vgoyal
Target Milestone:	rc	Keywords:	Extras, Reopened
Target Release:	7.1
Hardware:	Unspecified
OS:	Linux
Whiteboard:
Fixed In Version:	docker-1.9.1-2.el7_2	Doc Type:	Bug Fix
Doc Text:		Story Points:	---
Clone Of:
Clones:	1264193 (view as bug list)		Environment:
Last Closed:	2016-01-07 21:59:24 UTC	Type:	Bug
Regression:	---	Mount Type:	---
Documentation:	---	CRM:
Verified Versions:		Category:	---
oVirt Team:	---	RHEL 7.3 requirements from Atomic Host:
Cloudforms Team:	---	Target Upstream Version:
Embargoed:
Bug Depends On:
Bug Blocks:	1264193

Description Veer Muchandi 2015-09-15 18:09:06 UTC

Description of problem:
Nodes have been running for a few weeks now. Suddenly one node went down and Docker does not start. Again same thing happened to another node.

Version-Release number of selected component (if applicable):
3.0.1

How reproducible:
This happened in the existing OpenShift environment on its own. I cannot recreate the situation. I can provide the logs from the node that went down. 

Steps to Reproduce:
1.
2.
3.

Actual results:
Node is supposed to run without issues. Docker should start up. 

Expected results:
Docker does not start.

Additional info:

# systemctl start docker
Job for docker.service failed. See 'systemctl status docker.service' and 'journalctl -xn' for details.


[root@node1 cloud-user]# journalctl -xn -u docker -l
-- Logs begin at Mon 2015-09-14 20:42:58 EDT, end at Mon 2015-09-14 21:28:05 EDT. --
Sep 14 20:44:56 node1.demov3.osecloud.com docker[1137]: time="2015-09-14T20:44:56-04:00" level=info msg="Received signal 'terminated', starting shutdown of docker..."
Sep 14 20:44:56 node1.demov3.osecloud.com systemd[1]: Failed to start Docker Application Container Engine.
-- Subject: Unit docker.service has failed
-- Defined-By: systemd
-- Support: http://lists.freedesktop.org/mailman/listinfo/systemd-devel
-- 
-- Unit docker.service has failed.
-- 
-- The result is failed.
Sep 14 20:44:56 node1.demov3.osecloud.com systemd[1]: Unit docker.service entered failed state.
Sep 14 21:26:35 node1.demov3.osecloud.com systemd[1]: Starting Docker Application Container Engine...
-- Subject: Unit docker.service has begun with start-up
-- Defined-By: systemd
-- Support: http://lists.freedesktop.org/mailman/listinfo/systemd-devel
-- 
-- Unit docker.service has begun starting up.
Sep 14 21:26:35 node1.demov3.osecloud.com docker[1317]: time="2015-09-14T21:26:35-04:00" level=info msg="+job serveapi(unix:///var/run/docker.sock)"
Sep 14 21:26:35 node1.demov3.osecloud.com docker[1317]: time="2015-09-14T21:26:35-04:00" level=info msg="Listening for HTTP on unix (/var/run/docker.sock)"
Sep 14 21:28:05 node1.demov3.osecloud.com systemd[1]: docker.service operation timed out. Terminating.
Sep 14 21:28:05 node1.demov3.osecloud.com docker[1317]: time="2015-09-14T21:28:05-04:00" level=info msg="Received signal 'terminated', starting shutdown of docker..."
Sep 14 21:28:05 node1.demov3.osecloud.com systemd[1]: Failed to start Docker Application Container Engine.
-- Subject: Unit docker.service has failed
-- Defined-By: systemd
-- Support: http://lists.freedesktop.org/mailman/listinfo/systemd-devel
-- 
-- Unit docker.service has failed.
-- 
-- The result is failed.
Sep 14 21:28:05 node1.demov3.osecloud.com systemd[1]: Unit docker.service entered failed state.

Comment 3 Daniel Walsh 2015-09-15 19:37:06 UTC

Could you be out of space?

Comment 4 Veer Muchandi 2015-09-15 20:24:16 UTC

There is space
# df -h
Filesystem      Size  Used Avail Use% Mounted on
/dev/vda1        20G  5.5G   15G  28% /
devtmpfs        3.9G     0  3.9G   0% /dev
tmpfs           3.9G     0  3.9G   0% /dev/shm
tmpfs           3.9G   17M  3.9G   1% /run
tmpfs           3.9G     0  3.9G   0% /sys/fs/cgroup

Comment 5 Daniel Walsh 2015-09-15 20:46:42 UTC

What happens if you run 

docker -d

Comment 6 Veer Muchandi 2015-09-15 20:57:49 UTC

# docker -d
INFO[0000] +job serveapi(unix:///var/run/docker.sock)   
INFO[0000] Listening for HTTP on unix (/var/run/docker.sock) 
ERRO[0000] WARNING: No --storage-opt dm.thinpooldev specified, using loopback; this configuration is strongly discouraged for production use

Comment 8 Daniel Walsh 2015-09-15 21:11:22 UTC

Ok So docker is working without all of the other flags.

vivek what are the standard docker flags on RHEL?

docker -d --selinux-enabled --storage-opt ...

Comment 9 Daniel Walsh 2015-09-15 21:12:53 UTC

Could this be running out of the 10 gig of space in the devicemapper?

Comment 10 Vivek Goyal 2015-09-15 21:21:40 UTC

Try --storage-opt dm.basesize=10G and see if same problem happens.

People have complained that mkfs.xfs is taking time with 100G thin devices when loopback devices are being used on top of cloud storage. And system times out.

In my testing it was taking 30 seoconds. Primarily it is slower storage problem.

Comment 11 Vivek Goyal 2015-09-15 21:23:12 UTC

For dm.basesize=10G to be effective, you will have to setup a fresh instance of docker. If you restart already setup instance with new parameter, it will not help.

Comment 12 Veer Muchandi 2015-09-15 21:39:14 UTC

# docker -d --storage-opt dm.basesize=10G
ERRO[0000] WARNING: No --storage-opt dm.thinpooldev specified, using loopback; this configuration is strongly discouraged for production use 
ERRO[0000] Unable to delete device: Error running DeleteDevice dm_task_run failed 
INFO[0000] +job serveapi(unix:///var/run/docker.sock)   
INFO[0000] Listening for HTTP on unix (/var/run/docker.sock) 
INFO[0008] +job init_networkdriver()                    
INFO[0008] -job init_networkdriver() = OK (0)           
INFO[0008] Loading containers: start.                   
..........................................................................................................................................................................................................................................
INFO[0008] Loading containers: done.                    
INFO[0008] docker daemon: 1.6.2 ba1f6c3/1.6.2; execdriver: native-0.2; graphdriver: devicemapper 
INFO[0008] +job acceptconnections()                     
INFO[0008] -job acceptconnections() = OK (0)            
INFO[0008] Daemon has completed initialization

Comment 13 Veer Muchandi 2015-09-15 22:46:23 UTC

after running docker -d, I tried starting docker again with systemctl start docker and it starts
Not sure what the difference is but it at least starts up.

Comment 14 Vivek Goyal 2015-09-16 12:08:52 UTC

Delays happen only first time when mkfs.xfs is being done. If you have run docker with "-d" successfully, that means mkfs is done and when you run next time using systemd it succeeds as it does not go through mkfs step.

Comment 15 Daniel Walsh 2015-09-16 12:32:12 UTC

If you are running on a loopback device, you should really consider moving to a physical device, We are heavily recommending that people do not run loopback devmapper in production.

Comment 16 Veer Muchandi 2015-09-17 22:25:36 UTC

Inspite of docker service coming up after I run docker -d, it seems to be causing issues. I am not able to start any pods on openshift after this. They are all in pending status.

Comment 17 Daniel Walsh 2015-09-18 13:26:41 UTC

Does something like 

docker run --rm fedora echo hello

Work?

Comment 18 Veer Muchandi 2015-09-18 13:49:16 UTC

Dan, Last night I ended up scrapping these boxes as I need OpenShift. I am reinstalling the whole thing again,

Comment 19 Daniel Walsh 2015-09-28 17:52:22 UTC

Ok reopen if it happens again.

Comment 20 Scott Dodson 2015-10-22 16:05:53 UTC

docker-1.8.2-7.el7.x86_64 seems to have reduced the timeout to 60 seconds which is less than the systemd default of 90 seconds. When this happened I'm unable to start docker after cleaning up /var/lib/docker.

I believe the intention was to increase the timeout rather than decrease it so I think this should be increased to 2min or more.

On my laptop I can reliably create the 100G loopback device in 90 seconds but not 60 seconds. Cloud storage may be considerably slower than the SSD in my laptop however.

I'll try to run a few rounds of tests in AWS and an OpenStack environment but I may not have time to do so.

Comment 21 Daniel Walsh 2015-10-28 17:42:59 UTC

Lokesh lets set the timeout to 5 minutes.

Comment 22 Scott Dodson 2015-11-19 18:42:26 UTC

*** Bug 1230389 has been marked as a duplicate of this bug. ***