Bug 1256877

Summary:	PODs are failing to be deployed because docker couldn’t pull the images
Product:	Red Hat Enterprise Linux 7	Reporter:	Akram Ben Aissi <abenaiss>
Component:	docker	Assignee:	Daniel Walsh <dwalsh>
Status:	CLOSED ERRATA	QA Contact:	atomic-bugs <atomic-bugs>
Severity:	high	Docs Contact:
Priority:	unspecified
Version:	7.1	CC:	abenaiss, aos-bugs, dmcphers, dwalsh, erich, jkaur, jokerman, lsm5, lsu, mmccomas, pep
Target Milestone:	rc	Keywords:	Extras
Target Release:	7.1
Hardware:	All
OS:	Linux
Whiteboard:
Fixed In Version:		Doc Type:	Bug Fix
Doc Text:	Cause: The docker daemon prior to docker-1.9 could hang when pulling multiple container images in parallel. Consequence: This caused OpenShift to fail to deploy PODs in production. Fix: docker-1.9 includes a fix that allows images to be pulled in parallel. Result: OpenShift should be able to pull the images for all of its PODs.	Story Points:	---
Clone Of:
Clones:	1278143 (view as bug list)		Environment:
Last Closed:	2016-03-31 23:22:18 UTC	Type:	Bug
Regression:	---	Mount Type:	---
Documentation:	---	CRM:
Verified Versions:		Category:	---
oVirt Team:	---	RHEL 7.3 requirements from Atomic Host:
Cloudforms Team:	---	Target Upstream Version:
Embargoed:
Bug Depends On:
Bug Blocks:	1278143

Description Akram Ben Aissi 2015-08-25 16:00:36 UTC

Description of problem:

PODs are failing to be deployed because docker couldn’t pull the 
images.

Version-Release number of selected component (if applicable):
3.0.0.0

How reproducible:
Some PODs failed to be deployed because docker couldn’t pull the 
images. We reproduced the issue outside of OpenShift by manually pulling the 
docker images:

Steps to Reproduce:
1. create pods
2. some will fail to be deployed because docker couldn’t pull the images
3. Perform a docker pull manually

Actual results:

Subsequent deployment attempts always fail
The manual docker pull also fails

Expected results:
Subsequent pulls should be retried and should succeed, whether triggered from OpenShift or run manually with docker

Additional info:
One workaround (at least to get unstuck) is to go into /var/lib/docker and try to find any references to the layer in question and delete them. Sometimes pulling then works, while other times clearing /var/lib/docker is the only way to go.
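
Before deleting anything, a read-only search for the layer can show what would be affected. The following is a minimal sketch, not a supported procedure: the helper name find_layer_refs is hypothetical, the layout under /var/lib/docker differs between docker versions and storage drivers, and the layer ID has to be supplied by the reader (for example, the ID that appears in the failing mount path).

```shell
# Hypothetical helper: list paths under a docker root directory whose name or
# contents mention a given layer ID. It only reads; it deletes nothing.
find_layer_refs() (
    root=$1
    layer_id=$2
    {
        # Files and directories whose name contains the layer ID.
        find "$root" -name "*${layer_id}*" 2>/dev/null
        # Regular files whose contents mention the layer ID.
        grep -rl -- "$layer_id" "$root" 2>/dev/null
    } | sort -u
)
```

It could be called as find_layer_refs /var/lib/docker LAYER_ID. The content search may be slow on hosts where the storage driver keeps large data files under that directory.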

Comment 2 Paul Weil 2015-08-25 17:16:04 UTC

Akram, are there bugs in the registry or docker logs?

Comment 3 Paul Weil 2015-08-25 17:16:42 UTC

Sorry, meant to say "are there errors in the logs"

Comment 4 Akram Ben Aissi 2015-08-25 17:22:45 UTC

Oh, sorry Paul, I also raised a ticket with GSS with the details, and I forgot to include them here.

Here are the logs and details we have:

Some PODs failed to be deployed because docker couldn’t pull the 
images. We reproduced the issue outside of OpenShift by manually pulling the 
docker images:

=============================================================================
# docker pull acs_fwk/consul_agent:1.0.0

d67fc5cee6fb: Already exists 
Trying to pull repository dockerhub.rnd.amadeus.net:5002/acs_fwk/consul_agent ... not found
Trying to pull repository registry.access.redhat.com/acs_fwk/consul_agent ... not found
Trying to pull repository docker.io/acs_fwk/consul_agent ... not found
FATA[0009] Error: image acs_fwk/consul_agent:1.0.0 not found 
=============================================================================

The docker logs are showing the following errors:

=============================================================================
Aug 24 08:00:17 ose3-int-node2.figaro.amadeus.net docker[11633]: time="2015-08-24T08:00:17Z" level=error msg="Error from V2 registry: Error mounting '/dev/mapper/docker-253:1-92289210-01a55a23614c5d12131fc781e21314a54a52f8f4575165c26f4e
Aug 24 08:00:17 ose3-int-node2.figaro.amadeus.net docker[11633]: Error: image acs_fwk/consul_agent:1.0.0 not found
…
Aug 24 08:00:19 ose3-int-node2.figaro.amadeus.net docker[11633]: time="2015-08-24T08:00:19Z" level=error msg="Error from V2 registry: Error mounting '/dev/mapper/docker-253:1-92289210-01a55a23614c5d12131fc781e21314a54a52f8f4575165c26f4e
Aug 24 08:00:19 ose3-int-node2.figaro.amadeus.net docker[11633]: Error: image acs_fwk/consul_agent:1.0.0 not found

=============================================================================
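
To see which devices the mount errors refer to, and how often, the daemon log can be filtered. This is a rough sketch under assumptions: the function name mount_error_devices is hypothetical, it expects each log message on a single line, and it matches only the "Error mounting '<device>" wording shown in the excerpt above.

```shell
# Hypothetical filter: read daemon log lines on stdin and print each distinct
# device named in an "Error mounting" message, with the number of occurrences.
mount_error_devices() {
    grep -o "Error mounting '[^' ]*" \
        | sed "s/^Error mounting '//" \
        | sort | uniq -c | sort -rn
}
```

On a host where docker runs as a systemd unit named docker (an assumption about the setup), the input could come from journalctl -u docker piped into mount_error_devices.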

strace confirms that the error is at mount time:

=============================================================================
[pid 11690] 
mount("/dev/mapper/docker-253:1-92289210-01a55a23614c5d12131fc781e21314a54a52f8f4575165c26f4e62c2c0653cd5", 
"/var/lib/docker/devicemapper/mnt/01a55a23614c5d12131fc781e21314a54a52f8f4575165c26f4e62c2c0653cd5", 
"ext4", MS_MGC_VAL, "discard") = -1 EINVAL (Invalid argument)
[pid 11690] 
mount("/dev/mapper/docker-253:1-92289210-01a55a23614c5d12131fc781e21314a54a52f8f4575165c26f4e62c2c0653cd5", 
"/var/lib/docker/devicemapper/mnt/01a55a23614c5d12131fc781e21314a54a52f8f4575165c26f4e62c2c0653cd5", 
"ext4", MS_MGC_VAL, NULL) = -1 EINVAL (Invalid argument)
[pid 13796] 
mount("/dev/mapper/docker-253:1-92289210-01a55a23614c5d12131fc781e21314a54a52f8f4575165c26f4e62c2c0653cd5", 
"/var/lib/docker/devicemapper/mnt/01a55a23614c5d12131fc781e21314a54a52f8f4575165c26f4e62c2c0653cd5", 
"ext4", MS_MGC_VAL, "discard") = -1 EINVAL (Invalid argument)
[pid 13796] 
mount("/dev/mapper/docker-253:1-92289210-01a55a23614c5d12131fc781e21314a54a52f8f4575165c26f4e62c2c0653cd5", 
"/var/lib/docker/devicemapper/mnt/01a55a23614c5d12131fc781e21314a54a52f8f4575165c26f4e62c2c0653cd5", 
"ext4", MS_MGC_VAL, NULL) = -1 EINVAL (Invalid argument)
=============================================================================


The same “docker pull” command works on some boxes and fails on other 
boxes that have exactly the same configuration (all are nodes configured by 
openshift-ansible).

Once a “docker pull” has failed, every subsequent “docker pull” of the 
same image on that box also fails.

Restarting docker doesn’t recover.

Rebooting the whole VM doesn’t recover.

The only way to recover from the situation is to empty /var/lib/docker.

Comment 5 Jhon Honce 2015-10-13 18:57:22 UTC

It is believed that the fix for https://github.com/docker/docker/issues/9718 addresses this issue and that the code will ship with Docker 1.9.

Comment 8 Daniel Walsh 2015-10-15 17:13:32 UTC

Fixed in docker-1.9

Comment 12 Luwen Su 2016-02-03 08:43:10 UTC

Works fine in docker-1.9.1-15.el7.x86_64:

``
# Start one background pull per image; the images share base layers.
for image in mesosphere/chronos:chronos-2.3.4-1.0.81.ubuntu1404-mesos-0.22.1-1.0.ubuntu1404 mesosphere/mesos-master:0.22.1-1.0.ubuntu1404 mesosphere/mesos-slave:0.22.1-1.0.ubuntu1404 mesosphere/marathon:v0.8.2-RC4 ; do
    docker pull "$image" &
done
# Block until every background pull has finished.
wait

``

The trigger is pulling multiple images that share the same base images in parallel,
so I used the mesosphere images for this test.
For more details and steps, see https://github.com/docker/docker/issues/9718
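
A plain "wait" does not report which of the background pulls failed. A variant that waits on each pull separately might look like the sketch below. It is only an illustration: the function name pull_in_parallel and the PULL_CMD override are hypothetical and were not part of the verification above.

```shell
# Sketch: pull several images in parallel and report which pulls failed.
# PULL_CMD is a hypothetical override so the loop can be tried without a
# docker daemon; when unset, "docker pull" is used.
pull_in_parallel() (
    pull_cmd=${PULL_CMD:-docker pull}
    pids=""
    for image in "$@"; do
        # Intentionally unquoted so "docker pull" splits into two words.
        $pull_cmd "$image" >/dev/null 2>&1 &
        # Remember the PID together with the image it belongs to.
        pids="$pids $!:$image"
    done
    failed=0
    for entry in $pids; do
        pid=${entry%%:*}      # text before the first colon
        image=${entry#*:}     # everything after the first colon
        if ! wait "$pid"; then
            echo "FAILED: $image"
            failed=$((failed + 1))
        fi
    done
    # Return success only if every pull succeeded.
    [ "$failed" -eq 0 ]
)
```

The function is called with the image names as arguments and returns a non-zero status when at least one pull failed, which makes it usable in a retry loop or a test script.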

Comment 14 errata-xmlrpc 2016-03-31 23:22:18 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHBA-2016-0536.html