Description of problem:
On various occasions builds fail because the image is reported as not available in the registry, but pulling it manually from the command line works.

Version-Release number of selected component (if applicable): 3.4.1.7

How reproducible: Intermittently

Steps to Reproduce:
1.
2.
3.

Actual results:

- First it fails to inspect the image:

I0427 12:38:46.688268       1 docker.go:524] error inspecting image registry.access.redhat.com/jboss-webserver-3/webserver30-tomcat8-openshift@sha256:b82967f5349971b43a2c839df255317d2cbef0c2f7cfa86e449b323afe69c28b: Error: No such image: registry.access.redhat.com/jboss-webserver-3/webserver30-tomcat8-openshift@sha256:b82967f5349971b43a2c839df255317d2cbef0c2f7cfa86e449b323afe69c28b

- Then it cannot find it locally, so it pulls:

I0427 12:38:46.688338       1 docker.go:511] Image "registry.access.redhat.com/jboss-webserver-3/webserver30-tomcat8-openshift@sha256:b82967f5349..." not available locally, pulling ...

- Then it uses the image locally:

I0427 12:38:49.294607       1 docker.go:515] Using locally available image "registry.access.redhat.com/jboss-webserver-3/webserver30-tomcat8-openshift@sha256:b82967f5349..."
I0427 12:38:49.302982       1 docker.go:515] Using locally available image "registry.access.redhat.com/jboss-webserver-3/webserver30-tomcat8-openshift@sha256:b82967f5349..."

- But it finally fails when trying to get the image metadata:

I0427 12:41:30.502064       1 docker.go:899] error: Unable to get image metadata for registry.access.redhat.com/jboss-webserver-3/webserver30-tomcat8-openshift@sha256:b82967f5349971b43a2c839df255317d2cbef0c2f7cfa86e449b323afe69c28b: Error: No such image: registry.access.redhat.com/jboss-webserver-3/webserver30-tomcat8-openshift@sha256:b82967f5349971b43a2c839df255317d2cbef0c2f7cfa86e449b323afe69c28b

Expected results:
Image pulled successfully.

Additional info:
The same images can be manually pulled with docker pull every time.
Created attachment 1275291 [details] failed build log
I do not see any error in the log that relates to pulling. "Using locally available image" AFAIK means the image was pulled successfully.
(In reply to Oleg Bulatov from comment #2)
> I do not see any error in the log that relates to pulling. "Using locally
> available image" AFAIK means the image was pulled successfully.

What about the failing error:

I0427 12:41:30.502064       1 docker.go:899] error: Unable to get image metadata for registry.access.redhat.com/jboss-webserver-3/webserver30-tomcat8-openshift@sha256:b82967f5349971b43a2c839df255317d2cbef0c2f7cfa86e449b323afe69c28b: Error: No such image: registry.access.redhat.com/jboss-webserver-3/webserver30-tomcat8-openshift@sha256:b82967f5349971b43a2c839df255317d2cbef0c2f7cfa86e449b323afe69c28b
Looking at the build log, the image was determined not to be present and was pulled at 12:38:46.688338. Log entries at 12:38:49.294607 and 12:38:49.302982 confirm that the image is now present.

However, when trying to instantiate a container with that same image at 12:41:30.502064, the image seems to no longer be present.

A couple of theories:

1) The image was garbage collected between the time the build started and the time we try to use it to run the assemble script (roughly 3 minutes).

2) The docker daemon lied to us and didn't really pull down the image. The difference between the time we say we're pulling the image and the time we say the image is there is only 3 seconds. Also, there are no log statements showing that the image was pulled, as there are for the previous image for the incremental build (12:39:21.599616).

One change we can make to mitigate this is to check/pull the image right before the assemble script is run. In this case (because of running the save-artifacts script) there are 3 minutes between the time we check that the image is present and the time we try to run it.

Ben, thoughts?

Javier, is this something that you can reproduce at will?
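For illustration, the check/pull mitigation can be sketched as a standalone shell helper. This is not the actual s2i code (the real fix lives in s2i's Go sources); the function names are hypothetical, and only standard `docker inspect` / `docker pull` CLI calls are used:

```shell
# Hypothetical helper: verify an image reference is known to the docker
# daemon, and re-pull it immediately before use, narrowing the window in
# which image garbage collection could remove it.

image_present() {
  # `docker inspect` exits non-zero when the daemon does not know the image
  docker inspect --type image "$1" >/dev/null 2>&1
}

ensure_image() {
  ref="$1"
  if image_present "$ref"; then
    echo "image present: $ref"
  else
    echo "image missing, pulling: $ref"
    docker pull "$ref" >/dev/null || return 1
    # Re-check after the pull: if this still fails, the daemon "lied" to us
    image_present "$ref" || return 1
  fi
}
```

Running `ensure_image` right before starting the assemble container (rather than only at build start) would close most of the 3-minute window described above.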
Moving the check/pull closer to when we use it seems like a good idea regardless.

I'd like to think garbage collection would not remove an image that was just pulled down minutes ago (unless the filesystem was absolutely full), but that's a question for whoever owns that logic.

> The docker daemon lied to us and didn't really pull down the image. The time
> difference between the time we say that we're pulling the image and the time
> that we say that the image is there is only 3 sec. Also, there are no log
> statements showing that the image was pulled as with the previous image for
> incremental build (12:39:21.599616).

If that were the case, the check for the image would have failed, right?
> if that were the case, the check for the image would have failed, right?

The best I could come up with is that somehow the image could have been in the process of getting deleted and, because of the pull call, inspect briefly returned success.
S2I pull request: https://github.com/openshift/source-to-image/pull/743
(In reply to Cesar Wong from comment #4)
> Looking at the build log, the image was determined not to be present and was
> pulled at 12:38:46.688338. Log entries at 12:38:49.294607 and
> 12:38:49.302982 confirm that the image is now present.
>
> However, when trying to instantiate a container with that same image at
> 12:41:30.502064, the image seems to no longer be present.
>
> A couple of theories:
> 1) The image was garbage collected from the time the build started and the
> time we try to use it to run the assemble script (roughly 3 minutes).
>
> 2) The docker daemon lied to us and didn't really pull down the image. The
> time difference between the time we say that we're pulling the image and the
> time that we say that the image is there is only 3 sec. Also, there are no
> log statements showing that the image was pulled as with the previous image
> for incremental build (12:39:21.599616).
>
> One change we can make to mitigate this is to check/pull the image right
> before the assemble script is run. In this case (because of running the
> save-artifacts script) there's 3 mins between the time we check the image is
> present and the time we try to run it. Ben, thoughts?
>
> Javier, is this something that you can reproduce at will?

The customer can reproduce it; it usually requires several tries, but it can be reproduced.
(In reply to Cesar Wong from comment #7)
> S2I pull request:
>
> https://github.com/openshift/source-to-image/pull/743

Is there a way to try this fix easily?
Javier, we first need to get it into origin master so we can have a set of images that include the fix. I will do that today. But after that, it would only be in master, and not in OCP. Is this something that needs to be backported to OCP 3.4.1?
(In reply to Cesar Wong from comment #10)
> Javier, we first need to get it into origin master so we can have a set of
> images that include the fix. I will do that today. But after that, it would
> only be in master, and not in OCP. Is this something that needs to be
> backported to OCP 3.4.1?

Cesar, if we are sure that this will fix the issue, then we need to provide something to the customer. If there is a workaround or mitigation procedure, I can discuss with the customer whether a backport is required. If there is no other option, then yes, we need to backport it to 3.4.
Hi Javier, so the theory is that somehow the image is getting garbage collected some time between when the build begins and when the assemble script starts.

One thing we can check before trying to apply a patch is whether the nodes where the customer is running builds have sufficient space for docker images, or whether the garbage collector is configured too aggressively:

https://docs.openshift.org/latest/admin_guide/garbage_collection.html#image-garbage-collection

The other thing to check, when they get a failed build on a particular node, is whether the image that the build log reports as non-existent is actually present on that node.
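As a reference point for that check, image garbage collection thresholds on an OpenShift 3.x node are set via kubeletArguments in the node configuration. A sketch of the relevant fragment (the threshold values shown are illustrative, not a recommendation):

```yaml
# Fragment of /etc/origin/node/node-config.yaml (OpenShift 3.x).
# If disk usage on the image filesystem crosses the high threshold,
# the kubelet deletes images until usage drops to the low threshold.
kubeletArguments:
  image-gc-high-threshold:
    - "90"   # percent disk usage at which image GC starts removing images
  image-gc-low-threshold:
    - "80"   # percent disk usage image GC tries to free down to
```

To confirm whether the builder image is on a given node, something like `docker images --digests | grep b82967f5349` run on that node should show the digest if it is present.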
Cesar, did this ever merge? I remember merging some s2i commits in preparation for the backport, at least.
(also we need to clone separate bugs for each release we're putting this into)
Created https://bugzilla.redhat.com/show_bug.cgi?id=1464537 to track the v3.5 fix.
Verified in:
openshift v3.4.1.44
kubernetes v1.4.0+776c994
etcd 3.1.0-rc.0

Steps:

1. # oc new-app openshift/jboss-webserver30-tomcat8-openshift~https://github.com/jboss-openshift/openshift-quickstarts.git#1.1 --context-dir=tomcat-websocket-chat

2. # oc get builds
NAME                      TYPE      FROM         STATUS     STARTED                  DURATION
openshift-quickstarts-1   Source    Git@292cc9a  Complete   Less than a second ago   2m29s

3. # oc describe build openshift-quickstarts-1
Name:           openshift-quickstarts-1
Namespace:      wewang
Created:        Less than a second ago
Labels:         app=openshift-quickstarts
                buildconfig=openshift-quickstarts
                openshift.io/build-config.name=openshift-quickstarts
                openshift.io/build.start-policy=Serial
Annotations:    openshift.io/build-config.name=openshift-quickstarts
                openshift.io/build.number=1
                openshift.io/build.pod-name=openshift-quickstarts-1-build
Status:         Complete
Started:        Fri, 30 Jun 2017 11:05:33 CST
Duration:       2m29s
Build Config:   openshift-quickstarts
Build Pod:      openshift-quickstarts-1-build
Strategy:       Source
URL:            https://github.com/jboss-openshift/openshift-quickstarts.git
Ref:            1.1
ContextDir:     tomcat-websocket-chat
Commit:         292cc9a (helloworld-mdb quickstart: CLOUD-199)
Author/Committer: Kevin Conner
From Image:     DockerImage registry.access.redhat.com/jboss-webserver-3/webserver30-tomcat8-openshift@sha256:c347a9e7286e3c750ce4d43cf931ef8deadfb8c81e330b69cb7b0d88b0d49643
Output to:      ImageStreamTag openshift-quickstarts:latest
Push Secret:    builder-dockercfg-9y3ix
Yes, it is better to have docs.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2017:1640