Bug 1256877
| Summary: | PODs are failing to be deployed because docker couldn’t pull the images | | |
|---|---|---|---|
| Product: | Red Hat Enterprise Linux 7 | Reporter: | Akram Ben Aissi <abenaiss> |
| Component: | docker | Assignee: | Daniel Walsh <dwalsh> |
| Status: | CLOSED ERRATA | QA Contact: | atomic-bugs <atomic-bugs> |
| Severity: | high | Docs Contact: | |
| Priority: | unspecified | | |
| Version: | 7.1 | CC: | abenaiss, aos-bugs, dmcphers, dwalsh, erich, jkaur, jokerman, lsm5, lsu, mmccomas, pep |
| Target Milestone: | rc | Keywords: | Extras |
| Target Release: | 7.1 | | |
| Hardware: | All | | |
| OS: | Linux | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | Bug Fix |
| Doc Text: | Cause: The docker daemon prior to docker-1.9 had a problem pulling multiple container images in parallel; it would hang. Consequence: This caused OpenShift to fail to deploy pods in production. Fix: docker-1.9 contains a fix that allows the images to be pulled in parallel. Result: OpenShift should be able to deploy all of its pod images. | Story Points: | --- |
| Clone Of: | | | |
| : | 1278143 (view as bug list) | Environment: | |
| Last Closed: | 2016-03-31 23:22:18 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Bug Depends On: | | | |
| Bug Blocks: | 1278143 | | |
Description
Akram Ben Aissi
2015-08-25 16:00:36 UTC
Akram, are there bugs in the registry or docker logs?

Sorry, meant to say "are there errors in the logs?"

Oh, sorry Paul, I also raised a ticket with GSS with the details, and I forgot to include them here.
Here are the logs we’ve got.
To go deeper into the details, here is what we have: some pods failed to deploy
because docker couldn’t pull the images. We reproduced the issue outside of
OpenShift by manually pulling the docker images:
=============================================================================
# docker pull acs_fwk/consul_agent:1.0.0
d67fc5cee6fb: Already exists
Trying to pull repository dockerhub.rnd.amadeus.net:5002/acs_fwk/consul_agent ... not found
...
Trying to pull repository registry.access.redhat.com/acs_fwk/consul_agent ... not found
Trying to pull repository docker.io/acs_fwk/consul_agent ... not found
FATA[0009] Error: image acs_fwk/consul_agent:1.0.0 not found
=============================================================================
The docker logs show the following errors:
=============================================================================
Aug 24 08:00:17 ose3-int-node2.figaro.amadeus.net docker[11633]:
time="2015-08-24T08:00:17Z" level=error msg="Error from V2 registry: Error
mounting
'/dev/mapper/docker-253:1-92289210-01a55a23614c5d12131fc781e21314a54a52f8f4575165c26f4e
Aug 24 08:00:17 ose3-int-node2.figaro.amadeus.net docker[11633]: Error: image
acs_fwk/consul_agent:1.0.0 not found
…
Aug 24 08:00:19 ose3-int-node2.figaro.amadeus.net docker[11633]:
time="2015-08-24T08:00:19Z" level=error msg="Error from V2 registry: Error
mounting
'/dev/mapper/docker-253:1-92289210-01a55a23614c5d12131fc781e21314a54a52f8f4575165c26f4e
Aug 24 08:00:19 ose3-int-node2.figaro.amadeus.net docker[11633]: Error: image
acs_fwk/consul_agent:1.0.0 not found
=============================================================================
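These daemon-side “Error from V2 registry: Error mounting …” messages suggest that the “not found” reported by `docker pull` is really a local devicemapper mount failure, not a registry lookup failure. A minimal diagnostic sketch for confirming that on an affected node, assuming docker runs as a systemd unit with the devicemapper storage driver (the timestamp is simply taken from the logs above):

```
# Pull the daemon-side mount errors out of the journal.
journalctl -u docker --since "2015-08-24 07:55" | grep -i "error mounting"

# Check the storage driver and thin-pool state the daemon is using.
docker info | grep -A 5 "Storage Driver"

# List the per-layer thin devices and their status.
dmsetup ls | grep docker
dmsetup status | grep docker
```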
strace confirms that the error is at mount time:
=============================================================================
[pid 11690]
mount("/dev/mapper/docker-253:1-92289210-01a55a23614c5d12131fc781e21314a54a52f8f4575165c26f4e62c2c0653cd5",
"/var/lib/docker/devicemapper/mnt/01a55a23614c5d12131fc781e21314a54a52f8f4575165c26f4e62c2c0653cd5",
"ext4", MS_MGC_VAL, "discard") = -1 EINVAL (Invalid argument)
[pid 11690]
mount("/dev/mapper/docker-253:1-92289210-01a55a23614c5d12131fc781e21314a54a52f8f4575165c26f4e62c2c0653cd5",
"/var/lib/docker/devicemapper/mnt/01a55a23614c5d12131fc781e21314a54a52f8f4575165c26f4e62c2c0653cd5",
"ext4", MS_MGC_VAL, NULL) = -1 EINVAL (Invalid argument)
[pid 13796]
mount("/dev/mapper/docker-253:1-92289210-01a55a23614c5d12131fc781e21314a54a52f8f4575165c26f4e62c2c0653cd5",
"/var/lib/docker/devicemapper/mnt/01a55a23614c5d12131fc781e21314a54a52f8f4575165c26f4e62c2c0653cd5",
"ext4", MS_MGC_VAL, "discard") = -1 EINVAL (Invalid argument)
[pid 13796]
mount("/dev/mapper/docker-253:1-92289210-01a55a23614c5d12131fc781e21314a54a52f8f4575165c26f4e62c2c0653cd5",
"/var/lib/docker/devicemapper/mnt/01a55a23614c5d12131fc781e21314a54a52f8f4575165c26f4e62c2c0653cd5",
"ext4", MS_MGC_VAL, NULL) = -1 EINVAL (Invalid argument)
=============================================================================
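The strace shows the mount(2) call itself returning EINVAL, first with the “discard” option and then with no options at all, which points at the ext4 filesystem on the thin device rather than at the mount flags. One way to confirm this is to retry the same mount by hand; a sketch, assuming the device path from the strace above and an arbitrary scratch mount point:

```
# Retry the failing mount manually (device name taken from the strace above).
DEV=/dev/mapper/docker-253:1-92289210-01a55a23614c5d12131fc781e21314a54a52f8f4575165c26f4e62c2c0653cd5
MNT=/mnt/docker-layer-test
mkdir -p "$MNT"

# Same calls docker makes: ext4 with the discard option, then without options.
mount -t ext4 -o discard "$DEV" "$MNT" || echo "mount with discard failed: $?"
mount -t ext4 "$DEV" "$MNT"            || echo "plain mount failed: $?"

# If both fail with EINVAL, check whether the device still carries a readable
# ext4 superblock; a damaged superblock also produces EINVAL at mount time.
dumpe2fs -h "$DEV" || echo "no readable ext4 superblock on $DEV"
```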
The same “docker pull” command works on some boxes and fails on other boxes
that have exactly the same configuration (both are nodes configured by
openshift-ansible).
Once a “docker pull” has failed, any subsequent “docker pull” of the same image
keeps failing.
Restarting docker does not recover.
Rebooting the whole VM does not recover.
The only way to recover from the situation is to empty /var/lib/docker.
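Emptying /var/lib/docker discards every local image, container and volume on the node, so it is a last resort. A minimal sketch of that recovery, assuming a systemd-managed docker service (the final pull is just the image from this report):

```
# Last-resort recovery: wipe docker's local state. This destroys all local
# images, containers and volumes on the node.
systemctl stop docker
rm -rf /var/lib/docker/*
systemctl start docker

# The node then has to re-pull everything it needs, e.g.:
docker pull acs_fwk/consul_agent:1.0.0
```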
It is believed https://github.com/docker/docker/issues/9718 addresses this issue and that the fix will ship with Docker 1.9.

Fixed in docker-1.9.

In docker-1.9.1-15.el7.x86_64 this works fine:

```
for image in mesosphere/chronos:chronos-2.3.4-1.0.81.ubuntu1404-mesos-0.22.1-1.0.ubuntu1404 mesosphere/mesos-master:0.22.1-1.0.ubuntu1404 mesosphere/mesos-slave:0.22.1-1.0.ubuntu1404 mesosphere/marathon:v0.8.2-RC4 ; do
docker pull $image &
done
wait
```
The trigger is pulling multiple images that share the same base layers in
parallel, so I used the mesosphere images for this.
For more details and steps, see https://github.com/docker/docker/issues/9718.
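To check that all four parallel pulls actually completed on a fixed daemon, something like the following can be run after the `wait`; a small sketch reusing the same image list:

```
# Verify each image from the parallel-pull test landed locally.
for image in mesosphere/chronos:chronos-2.3.4-1.0.81.ubuntu1404-mesos-0.22.1-1.0.ubuntu1404 \
             mesosphere/mesos-master:0.22.1-1.0.ubuntu1404 \
             mesosphere/mesos-slave:0.22.1-1.0.ubuntu1404 \
             mesosphere/marathon:v0.8.2-RC4 ; do
  docker inspect --format '{{.Id}}' "$image" > /dev/null \
    && echo "OK: $image" \
    || echo "MISSING: $image"
done
```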
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHBA-2016-0536.html