Bug 1446925 - Image pull error from registry.access.redhat.com
Summary: Image pull error from registry.access.redhat.com
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Build
Version: 3.4.1
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: ---
Target Release: 3.4.z
Assignee: Cesar Wong
QA Contact: Wenjing Zheng
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2017-04-30 15:10 UTC by Javier Ramirez
Modified: 2020-08-13 09:08 UTC
5 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Cause: When doing an incremental build, the s2i builder pulls its builder image before calling the save-artifacts script, and does not ensure that the builder image is still present when it calls assemble.
Consequence: This leaves a window of time between the start of the build and the invocation of the assemble script in which the image can be removed. If the image is removed, the build fails.
Fix: Add a call to ensure that the builder image exists immediately before calling the assemble script.
Result: The chance of the assemble script running without an available builder image is greatly reduced.
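The fix can be sketched as a pull-if-missing guard run immediately before the assemble step (a minimal shell sketch; `ensure_image` is a hypothetical helper name, and the actual fix lives in the Go code of source-to-image, not in shell):

```shell
# Hypothetical sketch of the guard added before the assemble step:
# re-check the builder image and pull it if it has been removed since
# the start of the build (e.g. by image garbage collection).
ensure_image() {
  image="$1"
  if ! docker inspect --type image "$image" >/dev/null 2>&1; then
    echo "image $image missing locally, pulling"
    docker pull "$image"
  fi
}

# Usage (right before running assemble):
#   ensure_image "$BUILDER_IMAGE"
```

In s2i itself this corresponds to running the existing check/pull logic a second time, right before assemble, rather than only once at the start of the build.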
Clone Of:
Environment:
Last Closed: 2017-07-11 10:47:38 UTC
Target Upstream Version:
Embargoed:


Attachments (Terms of Use)
failed build log (1.59 MB, text/plain)
2017-04-30 15:11 UTC, Javier Ramirez


Links
System ID Private Priority Status Summary Last Updated
Github https://github.com/openshift/source-to-image/pull/743 None None None 2020-08-19 16:03:39 UTC
Red Hat Product Errata RHBA-2017:1640 0 normal SHIPPED_LIVE OpenShift Container Platform 3.5 and 3.4 bug fix update 2017-07-11 14:47:16 UTC

Description Javier Ramirez 2017-04-30 15:10:45 UTC
Description of problem:

On various occasions, builds fail because the image is reported as not available in the registry, but pulling it manually from the command line works.


Version-Release number of selected component (if applicable):
3.4.1.7

How reproducible:
Intermittently

Steps to Reproduce:
1.
2.
3.

Actual results:
- First it fails to inspect the image:
I0427 12:38:46.688268       1 docker.go:524] error inspecting image registry.access.redhat.com/jboss-webserver-3/webserver30-tomcat8-openshift@sha256:b82967f5349971b43a2c839df255317d2cbef0c2f7cfa86e449b323afe69c28b: Error: No such image: registry.access.redhat.com/jboss-webserver-3/webserver30-tomcat8-openshift@sha256:b82967f5349971b43a2c839df255317d2cbef0c2f7cfa86e449b323afe69c28b

- Then it cannot find it locally and pulls it:
I0427 12:38:46.688338       1 docker.go:511] Image "registry.access.redhat.com/jboss-webserver-3/webserver30-tomcat8-openshift@sha256:b82967f5349..." not available locally, pulling ...

- Then it uses the locally available image:
I0427 12:38:49.294607       1 docker.go:515] Using locally available image "registry.access.redhat.com/jboss-webserver-3/webserver30-tomcat8-openshift@sha256:b82967f5349..."
I0427 12:38:49.302982       1 docker.go:515] Using locally available image "registry.access.redhat.com/jboss-webserver-3/webserver30-tomcat8-openshift@sha256:b82967f5349..."

- But it finally fails when trying to get the image metadata:
I0427 12:41:30.502064 1 docker.go:899] error: Unable to get image metadata for registry.access.redhat.com/jboss-webserver-3/webserver30-tomcat8-openshift@sha256:b82967f5349971b43a2c839df255317d2cbef0c2f7cfa86e449b323afe69c28b: Error: No such image: registry.access.redhat.com/jboss-webserver-3/webserver30-tomcat8-openshift@sha256:b82967f5349971b43a2c839df255317d2cbef0c2f7cfa86e449b323afe69c28b
 

Expected results:
Image pulled successfully.

Additional info:
The same images can be manually pulled with docker pull every time.

Comment 1 Javier Ramirez 2017-04-30 15:11:39 UTC
Created attachment 1275291 [details]
failed build log

Comment 2 Oleg Bulatov 2017-05-03 07:55:58 UTC
I do not see any error in the log that relates to pulling. "Using locally available image" AFAIK means the image was pulled successfully.

Comment 3 Javier Ramirez 2017-05-08 08:40:39 UTC
(In reply to Oleg Bulatov from comment #2)
> I do not see any error in the log that relates to pulling. "Using locally
> available image" AFAIK means the image was pulled successfully.

What about the error where it actually fails:

I0427 12:41:30.502064 1 docker.go:899] error: Unable to get image metadata for registry.access.redhat.com/jboss-webserver-3/webserver30-tomcat8-openshift@sha256:b82967f5349971b43a2c839df255317d2cbef0c2f7cfa86e449b323afe69c28b: Error: No such image: registry.access.redhat.com/jboss-webserver-3/webserver30-tomcat8-openshift@sha256:b82967f5349971b43a2c839df255317d2cbef0c2f7cfa86e449b323afe69c28b

Comment 4 Cesar Wong 2017-05-09 17:15:45 UTC
Looking at the build log, the image was determined not to be present and was pulled at 12:38:46.688338. Log entries at 12:38:49.294607 and 12:38:49.302982 confirm that the image is now present.

However, when trying to instantiate a container with that same image at 12:41:30.502064, the image seems to no longer be present. 

A couple of theories:
1) The image was garbage collected between the time the build started and the time we try to use it to run the assemble script (roughly 3 minutes).

2) The docker daemon lied to us and didn't really pull down the image. The difference between the time we say we're pulling the image and the time we say the image is there is only 3 seconds. Also, there are no log statements showing that the image was pulled, as there are for the previous image for the incremental build (12:39:21.599616).

One change we can make to mitigate this is to check/pull the image right before the assemble script is run. In this case (because of running the save-artifacts script) there are about 3 minutes between the time we check that the image is present and the time we try to run it. Ben, thoughts?

Javier, is this something that you can reproduce at will?

Comment 5 Ben Parees 2017-05-09 17:21:48 UTC
Moving the check/pull closer to when we use it seems like a good idea regardless.

I'd like to think garbage collection would not remove an image that was just pulled down minutes ago (unless the filesystem was absolutely full), but that's a question for whoever owns that logic.


>  The docker daemon lied to us and didn't really pull down the image. The time difference between the time we say that we're pulling the image and the time that we say that the image is there is only 3 sec. Also, there are no log statements showing that the image was pulled as with the previous image for incremental build (12:39:21.599616). 


If that were the case, the check for the image would have failed, right?

Comment 6 Cesar Wong 2017-05-09 17:32:13 UTC
> if that were the case, the check for the image would have failed, right?

The best I could come up with is that somehow the image could have been in the process of getting deleted and, because of the pull call, inspect briefly returned success.

Comment 7 Cesar Wong 2017-05-09 20:20:44 UTC
S2I pull request:

https://github.com/openshift/source-to-image/pull/743

Comment 8 Javier Ramirez 2017-05-11 07:11:43 UTC
(In reply to Cesar Wong from comment #4)
> Looking at the build log, the image was determined not to be present and was
> pulled at 12:38:46.688338. Log entries at 12:38:49.294607 and
> 12:38:49.302982 confirm that the image is now present.
> 
> However, when trying to instantiate a container with that same image at
> 12:41:30.502064, the image seems to no longer be present. 
> 
> A couple of theories:
> 1) The image was garbage collected from the time the build started and the
> time we try to use it to run the assemble script (roughly 3 minutes).
> 
> 2) The docker daemon lied to us and didn't really pull down the image. The
> time difference between the time we say that we're pulling the image and the
> time that we say that the image is there is only 3 sec. Also, there are no
> log statements showing that the image was pulled as with the previous image
> for incremental build (12:39:21.599616). 
> 
> One change we can make to mitigate this is to check/pull the image right
> before the assemble script is run. In this case (because of running the
> save-artifacts script) there's 3 mins between the time we check the image is
> present and the time we try to run it. Ben, thoughts?
> 
> Javier, is this something that you can reproduce at will?
The customer can reproduce it; it usually takes several tries, but it can be reproduced.

Comment 9 Javier Ramirez 2017-05-11 07:12:38 UTC
(In reply to Cesar Wong from comment #7)
> S2I pull request:
> 
> https://github.com/openshift/source-to-image/pull/743

Is there a way to try this fix easily?

Comment 10 Cesar Wong 2017-05-11 15:08:39 UTC
Javier, we first need to get it into origin master so we can have a set of images that include the fix. I will do that today. But after that, it would only be in master, and not in OCP. Is this something that needs to be backported to OCP 3.4.1?

Comment 11 Javier Ramirez 2017-05-11 17:16:32 UTC
(In reply to Cesar Wong from comment #10)
> Javier, we first need to get it into origin master so we can have a set of
> images that include the fix. I will do that today. But after that, it would
> only be in master, and not in OCP. Is this something that needs to be
> backported to OCP 3.4.1?

Cesar, if we are sure that this will fix the issue, then we need to provide something to the customer. If there is a workaround or mitigation procedure, I can discuss with the customer whether a backport would be required; if there is no other option, then yes, we need to backport it to 3.4.

Comment 12 Cesar Wong 2017-05-12 13:32:10 UTC
Hi Javier, so the theory is that the image is somehow getting garbage collected between the time the build begins and the time the assemble script starts. One thing we can check before trying to apply a patch is whether the nodes where the customer runs builds have sufficient space for docker images, or whether the garbage collector is configured too aggressively: https://docs.openshift.org/latest/admin_guide/garbage_collection.html#image-garbage-collection

The other thing to check, when they get a failed build on a particular node, is whether the image that the build log reports as non-existent is actually present on that node.
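That check can be scripted on the node by comparing the digest from the failed build log against what the node's docker daemon reports (a sketch; `image_present` is a hypothetical helper, and the `--format` template assumes a docker version whose `docker images` supports the `.Digest` placeholder):

```shell
# Succeeds if the node's docker daemon still has an image whose
# repo@digest contains the given string (e.g. the digest from the log).
image_present() {
  docker images --digests --format '{{.Repository}}@{{.Digest}}' \
    | grep -qF "$1"
}

# On the node that ran the failed build:
#   image_present 'webserver30-tomcat8-openshift@sha256:<digest from the build log>'
```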

Comment 22 Ben Parees 2017-06-13 12:36:58 UTC
Cesar, did this ever merge? I remember merging some s2i commits in preparation for the backport, at least.

Comment 23 Ben Parees 2017-06-13 12:38:05 UTC
(Also, we need to clone separate bugs for each release we're putting this into.)

Comment 25 Cesar Wong 2017-06-23 17:32:45 UTC
Created https://bugzilla.redhat.com/show_bug.cgi?id=1464537
to track the v3.5 fix.

Comment 27 wewang 2017-06-30 03:12:17 UTC
Verified in:
openshift v3.4.1.44
kubernetes v1.4.0+776c994
etcd 3.1.0-rc.0

steps:
1. #  oc new-app openshift/jboss-webserver30-tomcat8-openshift~https://github.com/jboss-openshift/openshift-quickstarts.git#1.1 --context-dir=tomcat-websocket-chat
2. #  oc get builds
NAME                      TYPE      FROM          STATUS     STARTED                  DURATION
openshift-quickstarts-1   Source    Git@292cc9a   Complete   Less than a second ago   2m29s
3. # oc describe build openshift-quickstarts-1
Name:		openshift-quickstarts-1
Namespace:	wewang
Created:	Less than a second ago
Labels:		app=openshift-quickstarts
		buildconfig=openshift-quickstarts
		openshift.io/build-config.name=openshift-quickstarts
		openshift.io/build.start-policy=Serial
Annotations:	openshift.io/build-config.name=openshift-quickstarts
		openshift.io/build.number=1
		openshift.io/build.pod-name=openshift-quickstarts-1-build

Status:		Complete
Started:	Fri, 30 Jun 2017 11:05:33 CST
Duration:	2m29s

Build Config:	openshift-quickstarts
Build Pod:	openshift-quickstarts-1-build

Strategy:		Source
URL:			https://github.com/jboss-openshift/openshift-quickstarts.git
Ref:			1.1
ContextDir:		tomcat-websocket-chat
Commit:			292cc9a (helloworld-mdb quickstart: CLOUD-199)
Author/Committer:	Kevin Conner
From Image:		DockerImage registry.access.redhat.com/jboss-webserver-3/webserver30-tomcat8-openshift@sha256:c347a9e7286e3c750ce4d43cf931ef8deadfb8c81e330b69cb7b0d88b0d49643
Output to:		ImageStreamTag openshift-quickstarts:latest
Push Secret:		builder-dockercfg-9y3ix

Comment 28 wewang 2017-07-03 02:29:47 UTC
Yes, it would be better to have docs.

Comment 30 errata-xmlrpc 2017-07-11 10:47:38 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2017:1640

