Bug 1331038 - Pods are stuck in pending state due to failed image pulling
Summary: Pods are stuck in pending state due to failed image pulling
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Build
Version: 3.1.0
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: urgent
Target Milestone: ---
Target Release: ---
Assignee: Cesar Wong
QA Contact: Wang Haoran
URL:
Whiteboard:
Depends On:
Blocks: 1267746
Reported: 2016-04-27 13:56 UTC by Miheer Salunke
Modified: 2019-11-14 07:52 UTC
CC List: 14 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2016-05-11 13:33:19 UTC
Target Upstream Version:
Embargoed:




Links
System ID Private Priority Status Summary Last Updated
Red Hat Product Errata RHSA-2016:1038 0 normal SHIPPED_LIVE Moderate: openshift security update 2016-05-11 17:32:46 UTC
Red Hat Product Errata RHSA-2016:1064 0 normal SHIPPED_LIVE Important: Red Hat OpenShift Enterprise 3.2 security, bug fix, and enhancement update 2016-05-12 20:19:17 UTC

Comment 5 Andy Goldstein 2016-04-27 18:58:59 UTC
This is happening in 3.1.1.6 with the docker build strategy. This is what's happening (via OSE's docker builder; I'm listing the rough equivalent CLI steps):

1. docker build -t $registry/$project/$image:latest .
2. docker push $registry/$project/$image:latest
3. in parallel:
  3a. an image change trigger kicks off a deployment that happens to land on the same node, which does 'docker pull $registry/$project/$image@sha256:...'
  3b. docker rmi $registry/$project/$image:latest

The removal of the image tagged :latest happens at about the same time that the image is being pulled by its sha256 digest. We see in the journalctl output for docker that the image removal is issued a bit before the pull by digest occurs. The removal deletes any layers not in use by another image or container, while the pull by digest is trying to fetch those same layers at the same time.
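
A rough shell sketch of that race, assuming placeholder registry/project/image names (parsing the digest out of the push output is also an assumption and may vary with the docker client version):

  # Hypothetical reproduction; all names are placeholders.
  IMAGE=registry.example.com:5000/myproject/myimage
  docker build -t $IMAGE:latest .                                   # step 1: build
  DIGEST=$(docker push $IMAGE:latest | awk '/digest:/ {print $3}')  # step 2: push, capture digest
  docker pull $IMAGE@$DIGEST &                                      # step 3a: pull by digest
  docker rmi $IMAGE:latest &                                        # step 3b: remove the :latest tag
  wait
  # If rmi deletes shared layers while the pull is still fetching them,
  # the pull fails and the deployment pod stays pending on an image pull error.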

Comment 8 Nicolas Dordet 2016-04-28 16:43:19 UTC
Version is 3.1.1

Comment 20 Wang Haoran 2016-05-03 05:51:49 UTC
Verified with:
openshift v3.1.1.6-43-gf583589
kubernetes v1.1.0-origin-1107-g4c8e6f4
etcd 2.1.2

Comment 27 errata-xmlrpc 2016-05-11 13:33:19 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2016:1038

Comment 32 Cesar Wong 2016-05-23 12:36:33 UTC
Brenton, sorry, I don't have a theory or explanation for why, in the TSI case, restarting the node could have made the patch start working.

While debugging with Matt, we did verify two things:

1) The image that the builds were using was the image that contained the fix. We verified this by running /usr/bin/origin version with the image of one of the completed build containers and checking its output.

2) The symptoms we were seeing were consistent with the bug that was fixed in the new builder image. After a build completed, the image was no longer present in the node's local Docker storage, stalling the pre-deployment pod.
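
For reference, such a version check might look like the following; the image reference here is a placeholder, and in practice it would be taken from the completed build container:

  # Hypothetical check; the image reference is a placeholder.
  docker run --rm registry.example.com:5000/openshift3/ose-docker-builder:v3.1.1.6 \
    /usr/bin/origin version
  # The output reports the builder's version string, e.g.
  # 'openshift v3.1.1.6-43-gf583589' as in comment 20.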

