Bug 1331038 - Pods are stuck in pending state due to failed image pulling
Summary: Pods are stuck in pending state due to failed image pulling
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Build
Version: 3.1.0
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: urgent
Target Milestone: ---
Target Release: ---
Assignee: Cesar Wong
QA Contact: Wang Haoran
URL:
Whiteboard:
Depends On:
Blocks: 1267746
Reported: 2016-04-27 13:56 UTC by Miheer Salunke
Modified: 2019-11-14 07:52 UTC
CC List: 14 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2016-05-11 13:33:19 UTC
Target Upstream Version:
Embargoed:




Links
System ID Private Priority Status Summary Last Updated
Red Hat Product Errata RHSA-2016:1038 0 normal SHIPPED_LIVE Moderate: openshift security update 2016-05-11 17:32:46 UTC
Red Hat Product Errata RHSA-2016:1064 0 normal SHIPPED_LIVE Important: Red Hat OpenShift Enterprise 3.2 security, bug fix, and enhancement update 2016-05-12 20:19:17 UTC

Comment 5 Andy Goldstein 2016-04-27 18:58:59 UTC
This is happening in 3.1.1.6 with the docker build strategy. This is what's happening (via OSE's docker builder; I'm listing the rough equivalent CLI steps):

1. docker build -t $registry/$project/$image:latest .
2. docker push $registry/$project/$image:latest
3. in parallel:
  3a. an image change trigger kicks off a deployment that happens to land on the same node, which does 'docker pull $registry/$project/$image@sha256:...'
  3b. docker rmi $registry/$project/$image:latest

The removal of the image tagged :latest happens at about the same time that the image is being pulled by its sha256 digest. We see in the journalctl output for docker that the image removal is issued a bit before the pull by digest occurs. The removal deletes any layers not in use by another image or container, while the pull by digest is trying to fetch those same layers at the same time.
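
A rough shell sketch of that race, assuming placeholder registry/project/image names (parsing the digest out of the push output is also an assumption and may vary with the docker client version):

  # Hypothetical reproduction; all names are placeholders.
  IMAGE=registry.example.com:5000/myproject/myimage
  docker build -t $IMAGE:latest .                                   # step 1: build
  DIGEST=$(docker push $IMAGE:latest | awk '/digest:/ {print $3}')  # step 2: push, capture digest
  docker pull $IMAGE@$DIGEST &                                      # step 3a: pull by digest
  docker rmi $IMAGE:latest &                                        # step 3b: remove the :latest tag
  wait
  # If rmi deletes shared layers while the pull is still fetching them,
  # the pull fails and the deployment pod stays pending on an image pull error.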

Comment 8 Nicolas Dordet 2016-04-28 16:43:19 UTC
Version is 3.1.1

Comment 20 Wang Haoran 2016-05-03 05:51:49 UTC
Verified with:
openshift v3.1.1.6-43-gf583589
kubernetes v1.1.0-origin-1107-g4c8e6f4
etcd 2.1.2

Comment 27 errata-xmlrpc 2016-05-11 13:33:19 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2016:1038

Comment 32 Cesar Wong 2016-05-23 12:36:33 UTC
Brenton, sorry, I don't have a theory or explanation for why, in the TSI case, restarting the node could have made the patch start working.

While debugging with Matt, we did verify two things:

1) The image that the builds were using was the image that contained the fix. We verified this by running /usr/bin/origin version with the image of one of the completed build containers and checking its output.

2) The symptoms we were seeing were consistent with the bug that was fixed in the new builder image. After a build completed, the image was no longer present in the node's local Docker storage, stalling the pre-deployment pod.
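
For reference, such a version check might look like the following; the image reference here is a placeholder, and in practice it would be taken from the completed build container:

  # Hypothetical check; the image reference is a placeholder.
  docker run --rm registry.example.com:5000/openshift3/ose-docker-builder:v3.1.1.6 \
    /usr/bin/origin version
  # The output reports the builder's version string, e.g.
  # 'openshift v3.1.1.6-43-gf583589' as in comment 20.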

