Bug 1256877 - PODs are failing to be deployed because docker couldn’t pull the images
Status: CLOSED ERRATA
Product: Red Hat Enterprise Linux 7
Classification: Red Hat
Component: docker
Version: 7.1
Hardware: All Linux
Priority: unspecified
Severity: high
Target Milestone: rc
Target Release: 7.1
Assigned To: Daniel Walsh
QA Contact: atomic-bugs@redhat.com
Keywords: Extras
Depends On:
Blocks: 1278143
Reported: 2015-08-25 12:00 EDT by Akram Ben Aissi
Modified: 2016-05-03 07:29 EDT

See Also:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Cause: Before docker-1.9, the docker daemon could hang when pulling multiple container images in parallel. Consequence: OpenShift failed to deploy pods in production. Fix: docker-1.9 includes a fix that allows images to be pulled in parallel. Result: OpenShift should be able to deploy all of its pods' images.
Story Points: ---
Clone Of:
Clones: 1278143
Environment:
Last Closed: 2016-03-31 19:22:18 EDT
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---


Attachments: None
Description Akram Ben Aissi 2015-08-25 12:00:36 EDT
Description of problem:

PODs are failing to be deployed because docker couldn’t pull the 
images.

Version-Release number of selected component (if applicable):
3.0.0.0

How reproducible:
Some PODs fail to be deployed because docker cannot pull the images.
We reproduced the issue outside of OpenShift by pulling the docker
images manually.

Steps to Reproduce:
1. Create pods.
2. Some pods fail to deploy because docker cannot pull the images.
3. Run a docker pull manually for the same image.

Actual results:

Subsequent deployment attempts always fail.
The manual docker pull also fails.

Expected results:
Subsequent pulls should be retried and should succeed, whether triggered from OpenShift or run manually with docker.

Additional info:
One workaround (at least to get unstuck) is to go into /var/lib/docker and try to find any references to the layer in question and delete them. Sometimes pulling then works, while other times clearing /var/lib/docker is the only way to go.
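
A minimal sketch of the first half of that workaround: listing the paths under the docker root whose name or content mentions a given layer ID, so they can be reviewed before anything is deleted. The function name `find_layer_refs` and its arguments are hypothetical, and which of the listed files are safe to remove depends on the storage driver in use; the sketch only prints candidates.

```shell
#!/bin/sh
# Hypothetical helper: list paths under a docker root directory that mention
# a layer ID, either in their name or in their content.
# It only prints candidates; it deletes nothing.
find_layer_refs() {
    docker_root=$1   # e.g. /var/lib/docker
    layer_id=$2      # full or abbreviated layer ID

    {
        # Files and directories whose name contains the layer ID.
        find "$docker_root" -name "*${layer_id}*" 2>/dev/null
        # Files whose content contains the layer ID (fixed-string match).
        grep -rlF -- "$layer_id" "$docker_root" 2>/dev/null
    } | sort -u
}
```

It could be called as `find_layer_refs /var/lib/docker <layer-id>`, with docker stopped first so the listing does not change while it is being reviewed.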
Comment 2 Paul Weil 2015-08-25 13:16:04 EDT
Akram, are there bugs in the registry or docker logs?
Comment 3 Paul Weil 2015-08-25 13:16:42 EDT
Sorry, meant to say "are there errors in the logs"
Comment 4 Akram Ben Aissi 2015-08-25 13:22:45 EDT
Oh, sorry Paul, I also raised a ticket with GSS with the details, and I forgot to include them here.

Here are the logs and details we have:

Some PODs failed to be deployed because docker couldn’t pull the 
images. We reproduced the issue outside of OpenShift by manually pulling the 
docker images:

=============================================================================
# docker pull acs_fwk/consul_agent:1.0.0

d67fc5cee6fb: Already exists 
 not foundpull repository dockerhub.rnd.amadeus.net:5002/acs_fwk/consul_agent 
...
Trying to pull repository registry.access.redhat.com/acs_fwk/consul_agent ... not found
Trying to pull repository docker.io/acs_fwk/consul_agent ... not found
FATA[0009] Error: image acs_fwk/consul_agent:1.0.0 not found 
=============================================================================

The docker logs are showing the following errors:

=============================================================================
Aug 24 08:00:17 ose3-int-node2.figaro.amadeus.net docker[11633]: time="2015-08-24T08:00:17Z" level=error msg="Error from V2 registry: Error mounting '/dev/mapper/docker-253:1-92289210-01a55a23614c5d12131fc781e21314a54a52f8f4575165c26f4e
Aug 24 08:00:17 ose3-int-node2.figaro.amadeus.net docker[11633]: Error: image acs_fwk/consul_agent:1.0.0 not found
…
Aug 24 08:00:19 ose3-int-node2.figaro.amadeus.net docker[11633]: time="2015-08-24T08:00:19Z" level=error msg="Error from V2 registry: Error mounting '/dev/mapper/docker-253:1-92289210-01a55a23614c5d12131fc781e21314a54a52f8f4575165c26f4e
Aug 24 08:00:19 ose3-int-node2.figaro.amadeus.net docker[11633]: Error: image acs_fwk/consul_agent:1.0.0 not found

=============================================================================

strace confirms that the error is at mount time:

=============================================================================
[pid 11690] 
mount("/dev/mapper/docker-253:1-92289210-01a55a23614c5d12131fc781e21314a54a52f8f4575165c26f4e62c2c0653cd5", 
"/var/lib/docker/devicemapper/mnt/01a55a23614c5d12131fc781e21314a54a52f8f4575165c26f4e62c2c0653cd5", 
"ext4", MS_MGC_VAL, "discard") = -1 EINVAL (Invalid argument)
[pid 11690] 
mount("/dev/mapper/docker-253:1-92289210-01a55a23614c5d12131fc781e21314a54a52f8f4575165c26f4e62c2c0653cd5", 
"/var/lib/docker/devicemapper/mnt/01a55a23614c5d12131fc781e21314a54a52f8f4575165c26f4e62c2c0653cd5", 
"ext4", MS_MGC_VAL, NULL) = -1 EINVAL (Invalid argument)
[pid 13796] 
mount("/dev/mapper/docker-253:1-92289210-01a55a23614c5d12131fc781e21314a54a52f8f4575165c26f4e62c2c0653cd5", 
"/var/lib/docker/devicemapper/mnt/01a55a23614c5d12131fc781e21314a54a52f8f4575165c26f4e62c2c0653cd5", 
"ext4", MS_MGC_VAL, "discard") = -1 EINVAL (Invalid argument)
[pid 13796] 
mount("/dev/mapper/docker-253:1-92289210-01a55a23614c5d12131fc781e21314a54a52f8f4575165c26f4e62c2c0653cd5", 
"/var/lib/docker/devicemapper/mnt/01a55a23614c5d12131fc781e21314a54a52f8f4575165c26f4e62c2c0653cd5", 
"ext4", MS_MGC_VAL, NULL) = -1 EINVAL (Invalid argument)
=============================================================================
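
As a rough sketch of how such a trace could be gathered and summarized: strace can be restricted to the `mount` syscall, and the saved output can then be checked for EINVAL failures. The helper name `count_einval_mounts`, the `DOCKER_PID` variable, and the trace file path are assumptions made for illustration, not part of the original report.

```shell
#!/bin/sh
# One way such a trace could be captured (assumes strace is installed and
# DOCKER_PID holds the docker daemon's PID; the output path is arbitrary):
#   strace -f -e trace=mount -o /tmp/docker-mount.trace -p "$DOCKER_PID"

# Hypothetical helper: count the lines in a saved trace that report a call
# failing with EINVAL. With -e trace=mount, only mount() calls are recorded,
# so this is the number of failed mount attempts.
count_einval_mounts() {
    trace_file=$1
    grep -c -- '= -1 EINVAL' "$trace_file"
}
```

A non-zero count on an affected node, compared with zero on a healthy node running the same pull, would support the conclusion that the failure happens at mount time.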


The same “docker pull” command works on some boxes and fails on some other 
boxes that have exactly the same configuration (both are nodes configured by 
openshift-ansible)

Once a “docker pull” has failed, every subsequent “docker pull” of the 
same image also fails.

Restarting docker doesn’t recover.

Rebooting the whole VM doesn’t recover.

The only way to recover from the situation is to empty /var/lib/docker.
Comment 5 Jhon Honce 2015-10-13 14:57:22 EDT
It is believed that the fix for https://github.com/docker/docker/issues/9718 addresses this issue; that code will ship with Docker 1.9.
Comment 8 Daniel Walsh 2015-10-15 13:13:32 EDT
Fixed in docker-1.9
Comment 12 Luwen Su 2016-02-03 03:43:10 EST
Works fine in docker-1.9.1-15.el7.x86_64:

``
# Pull several images that share base layers, all in parallel.
for image in mesosphere/chronos:chronos-2.3.4-1.0.81.ubuntu1404-mesos-0.22.1-1.0.ubuntu1404 mesosphere/mesos-master:0.22.1-1.0.ubuntu1404 mesosphere/mesos-slave:0.22.1-1.0.ubuntu1404 mesosphere/marathon:v0.8.2-RC4 ; do
    docker pull "$image" &
done
# Block until every background pull has finished.
wait
``

The trigger is pulling multiple images that share the same base images in parallel,
so I used the mesosphere images for this.
For more details and steps, see https://github.com/docker/docker/issues/9718
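
A possible variant of the reproducer above that also reports which pulls failed, sketched under the assumption that a failed pull makes `docker pull` exit non-zero. The function name `pull_in_parallel` is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: pull every image given as an argument in parallel and
# report which pulls failed. Returns non-zero if any pull failed.
pull_in_parallel() {
    pids=""
    for image in "$@"; do
        docker pull "$image" >/dev/null 2>&1 &
        # Remember the background PID together with its image name.
        pids="$pids $!:$image"
    done

    failed=0
    for entry in $pids; do
        pid=${entry%%:*}     # text before the first colon
        image=${entry#*:}    # text after the first colon (keeps the tag)
        # "wait <pid>" returns the exit status of that background job.
        if ! wait "$pid"; then
            echo "pull failed: $image"
            failed=1
        fi
    done
    return "$failed"
}
```

For example, `pull_in_parallel mesosphere/mesos-master:0.22.1-1.0.ubuntu1404 mesosphere/mesos-slave:0.22.1-1.0.ubuntu1404` would print one line per failed image and return 0 only if both pulls succeeded. Note that on the affected versions the pull was reported to hang, in which case `wait` would block rather than report a failure.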
Comment 14 errata-xmlrpc 2016-03-31 19:22:18 EDT
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHBA-2016-0536.html
