Bug 1331038

Summary: Pods are stuck in pending state due to failed image pulling
Product: OpenShift Container Platform
Reporter: Miheer Salunke <misalunk>
Component: Build
Assignee: Cesar Wong <cewong>
Status: CLOSED ERRATA
QA Contact: Wang Haoran <haowang>
Severity: urgent
Docs Contact:
Priority: unspecified
Version: 3.1.0
CC: agoldste, aos-bugs, bleanhar, bvincell, cewong, erich, jkaur, jokerman, mmccomas, ndordet, pep, simon.gunzenreiner, tdawson, xtian
Target Milestone: ---
Target Release: ---
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2016-05-11 13:33:19 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On:
Bug Blocks: 1267746

Comment 5 Andy Goldstein 2016-04-27 18:58:59 UTC
This is happening in 3.1.1.6 with the docker build strategy. This is what's happening (via OSE's docker builder; I'm listing the rough equivalent CLI steps):

1. docker build -t $registry/$project/$image:latest
2. docker push $registry/$project/$image:latest
3. in parallel:
  3a. the image change trigger kicks off a deployment that happens to land on the same node; this does 'docker pull $registry/$project/$image@sha256:...'
  3b. docker rmi $registry/$project/$image:latest

The removal of the image tagged :latest happens at about the same time that the image is being pulled by its sha256 digest. We see in the journalctl output for docker that the image removal is issued a bit before the pull by digest occurs. The image removal removes layers not in use by any other image/container, and the pull by digest is trying to pull them down at the same time.
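
The race described above can be illustrated with a small toy model. This is only a sketch under stated assumptions, not Docker's actual storage code: the ToyImageStore class, the two-phase pull helpers, the layer names and the example/app references are all hypothetical, and the interleaving shown is one plausible ordering rather than the one confirmed from the logs. It assumes a pull first decides which layers are already present locally and only later registers the image reference.

```python
class ToyImageStore:
    """Toy model of a node-local image store (hypothetical, not Docker's code)."""

    def __init__(self):
        self.layers = set()  # layer IDs currently on disk
        self.refs = {}       # tag or digest reference -> list of layer IDs

    def add_image(self, ref, layer_ids):
        self.layers.update(layer_ids)
        self.refs[ref] = list(layer_ids)

    def remove_image(self, ref):
        """Drop a reference and delete layers no other reference uses."""
        layer_ids = self.refs.pop(ref)
        still_used = {l for ids in self.refs.values() for l in ids}
        for layer in layer_ids:
            if layer not in still_used:
                self.layers.discard(layer)

    def image_complete(self, ref):
        return ref in self.refs and all(l in self.layers for l in self.refs[ref])


def pull_begin(store, layer_ids):
    """Phase 1 of the toy pull: work out which layers must be downloaded."""
    return [l for l in layer_ids if l not in store.layers]


def pull_finish(store, ref, layer_ids, to_download):
    """Phase 2 of the toy pull: store downloaded layers, register the reference."""
    store.layers.update(to_download)
    store.refs[ref] = list(layer_ids)


def run(interleaved):
    """Return True if the image pulled by digest ends up usable on the node."""
    tag = "example/app:latest"           # hypothetical tag left by the build
    digest = "example/app@sha256:abc"    # hypothetical digest reference
    layers = ["layer-a", "layer-b"]

    store = ToyImageStore()
    store.add_image(tag, layers)         # steps 1-2: build and push leave the image locally

    if interleaved:
        missing = pull_begin(store, layers)   # pull sees the layers, downloads nothing
        store.remove_image(tag)               # rmi deletes the "unused" layers
        pull_finish(store, digest, layers, missing)
    else:
        store.remove_image(tag)               # rmi completes first
        missing = pull_begin(store, layers)   # pull sees nothing, downloads everything
        pull_finish(store, digest, layers, missing)

    return store.image_complete(digest)
```

In this model, run(interleaved=True) leaves the digest reference pointing at layers that were deleted underneath it, while run(interleaved=False) ends with a complete image, which is why serializing the removal and the pull avoids the problem.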

Comment 8 Nicolas Dordet 2016-04-28 16:43:19 UTC
Version is 3.1.1

Comment 20 Wang Haoran 2016-05-03 05:51:49 UTC
Verified with:
openshift v3.1.1.6-43-gf583589
kubernetes v1.1.0-origin-1107-g4c8e6f4
etcd 2.1.2

Comment 27 errata-xmlrpc 2016-05-11 13:33:19 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2016:1038

Comment 32 Cesar Wong 2016-05-23 12:36:33 UTC
Brenton, sorry, I don't have a theory or explanation for why, in the TSI case, restarting the node could have made the patch start working.

While debugging with Matt, we did verify two things:

1) The image that the builds were using was the image that contained the fix. We did this by looking at the output of /usr/bin/origin version using the image of one of the completed build containers.

2) The symptoms we were seeing were consistent with the bug that was fixed in the new builder image. After a build completed, the image was no longer present in the local Docker, stalling the pre-deployment pod.