Bug 1721380 - Builds using openshift imagestreams fail if the import has not completed yet
Summary: Builds using openshift imagestreams fail if the import has not completed yet
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Build
Version: 4.2.0
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: ---
Target Release: 4.2.0
Assignee: Gabe Montero
QA Contact: wewang
URL:
Whiteboard: buildcop
Depends On:
Blocks:
Reported: 2019-06-18 07:32 UTC by Maciej Szulik
Modified: 2020-03-13 05:53 UTC
CC: 5 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2019-10-16 06:32:02 UTC
Target Upstream Version:
Embargoed:


Attachments: None


Links:
Github openshift/origin pull 23277 (closed): Bug 1721380: Increase logs on flaking build test (last updated 2020-04-22 03:36:33 UTC)
Github openshift/origin pull 23568 (closed): Bug 1721380: more exhaustive verification of imagestream import completeness (last updated 2020-04-22 03:36:33 UTC)
Red Hat Product Errata RHBA-2019:2922 (last updated 2019-10-16 06:32:16 UTC)

Description Maciej Szulik 2019-06-18 07:32:06 UTC
This looks like a new instance of an old bug 1392764 or bug 1702743.

I've noticed this happened 13 times in the past 7 days; the last occurrence is https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/pr-logs/pull/openshift_machine-config-operator/795/pull-ci-openshift-machine-config-operator-master-e2e-aws/3961

Looks like the build pod was not ready in time:


Jun 17 22:59:06.996: INFO: Waiting up to 3m0s for all (but 100) nodes to be ready
Jun 17 22:59:07.029: INFO: Running AfterSuite actions on all nodes
Jun 17 22:59:07.029: INFO: Running AfterSuite actions on node 1
fail [github.com/openshift/origin/test/extended/builds/new_app.go:56]: Unexpected error:
    <*errors.errorString | 0xc002c91160>: {
        s: "The build \"a234567890123456789012345678901234567890123456789012345678-1\" status is \"Failed\"",
    }
    The build "a234567890123456789012345678901234567890123456789012345678-1" status is "Failed"
occurred

due to:

error: build error: After retrying 2 times, Pull image still failed due to error: unable to retrieve auth token: invalid username/password

Comment 1 Adam Kaplan 2019-06-25 20:25:13 UTC
Needs additional debugging - CI continues to flake on this test for auth reasons.
The current hypothesis is that the imagestream hasn't finished importing when the test runs. The build therefore tries to pull from registry.redhat.io, and fails because we don't have a pull secret for it.

Debug PR: https://github.com/openshift/origin/pull/23277
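
If this hypothesis holds, the fix direction is to gate test builds on import completion. A minimal sketch of that kind of readiness check, assuming a recent openshift/client-go typed image client (waitForImport and tagImported are made-up helper names, not existing test utilities):

```
// Hedged sketch: wait for an ImageStream import to complete before starting
// a build. Assumes the typed client in github.com/openshift/client-go.
package imagestreamwait

import (
	"context"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/util/wait"

	imagev1 "github.com/openshift/api/image/v1"
	imageclient "github.com/openshift/client-go/image/clientset/versioned"
)

// waitForImport polls until every spec tag has at least one imported item
// in status and the internal registry pullspec is published on the stream.
func waitForImport(c imageclient.Interface, ns, name string) error {
	return wait.PollImmediate(5*time.Second, 2*time.Minute, func() (bool, error) {
		is, err := c.ImageV1().ImageStreams(ns).Get(context.TODO(), name, metav1.GetOptions{})
		if err != nil {
			return false, err
		}
		if is.Status.DockerImageRepository == "" {
			// Internal registry hostname not yet available on the stream.
			return false, nil
		}
		for _, specTag := range is.Spec.Tags {
			if !tagImported(is, specTag.Name) {
				return false, nil
			}
		}
		return true, nil
	})
}

// tagImported reports whether the named tag has an imported image in status.
func tagImported(is *imagev1.ImageStream, tag string) bool {
	for _, st := range is.Status.Tags {
		if st.Tag == tag {
			return len(st.Items) > 0
		}
	}
	return false
}
```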

Comment 2 Adam Kaplan 2019-07-16 22:38:44 UTC
Moving back to POST - PR only added extra logging so we can investigate this flake further.

Comment 3 Hongkai Liu 2019-07-26 18:14:07 UTC
Saw it again.
https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-4.2/2529#0:build-log.txt%3A4135

fail [github.com/openshift/origin/test/extended/builds/new_app.go:56]: Unexpected error:
    <*errors.errorString | 0xc000038a80>: {
        s: "The build \"a234567890123456789012345678901234567890123456789012345678-1\" status is \"Failed\"",
    }
    The build "a234567890123456789012345678901234567890123456789012345678-1" status is "Failed"
occurred
 failed: (1m1s) 2019-07-26T17:18:13 "[Feature:Builds][Conformance] oc new-app  should succeed with a --name of 58 characters [Suite:openshift/conformance/parallel/minimal]"
 started: (1/542/2160) "[sig-storage] In-tree Volumes [Driver


Failing tests:
 [Feature:Builds][Conformance] oc new-app  should succeed with a --name of 58 characters [Suite:openshift/conformance/parallel/minimal]
 Writing JUnit report to /tmp/artifacts/junit/junit_e2e_20190726-174606.xml
 error: 1 fail, 880 pass, 1279 skip (34m39s)
2019/07/26 17:46:09 Container test in pod e2e-aws failed, exit code 1, reason Error
2019/07/26 17:53:31 Copied 222.75Mi of artifacts from e2e-aws to /logs/artifacts/e2e-aws
2019/07/26 17:53:33 Ran for 1h13m6s
error: could not run steps: step e2e-aws failed: template pod "e2e-aws" failed: the pod ci-op-5vchvc8w/e2e-aws failed after 1h11m24s (failed containers: test): ContainerFailed one or more containers exited
 Container test exited with code 1, reason Error

Comment 4 Adam Kaplan 2019-07-29 20:36:16 UTC
We have data now - looks like the build is pulling the nodejs imagestreamtag from registry.redhat.io, rather than the internal registry:

```
I0726 17:17:51.011462       1 sti.go:229] Getting docker auth in paths : [/var/run/secrets/openshift.io/pull]
I0726 17:17:51.011476       1 config.go:131] looking for config.json at /var/run/secrets/openshift.io/pull/config.json
I0726 17:17:51.011518       1 config.go:101] looking for .dockercfg at /var/run/secrets/openshift.io/pull/.dockercfg
I0726 17:17:51.012072       1 config.go:112] found .dockercfg at /var/run/secrets/openshift.io/pull/.dockercfg
Local copy of "registry.redhat.io/rhoar-nodejs/nodejs-10@sha256:9dce2f60b87b2eea351213c3d142716c0a70c894df8b3d2d4425b4933b8f6221" is not present.
Explicitly pulling image registry.redhat.io/rhoar-nodejs/nodejs-10@sha256:9dce2f60b87b2eea351213c3d142716c0a70c894df8b3d2d4425b4933b8f6221
Asked to pull fresh copy of "registry.redhat.io/rhoar-nodejs/nodejs-10@sha256:9dce2f60b87b2eea351213c3d142716c0a70c894df8b3d2d4425b4933b8f6221".
Warning: Pull failed, retrying in 5s ...
```

I don't think the build pod is configured to have a pull secret for registry.redhat.io, even though this information is available on the cluster. I believe this is something on the radar to fix in 4.3.
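
For reference, the resolution in play looks roughly like the sketch below. This is illustrative only, built from the ImageStream status fields quoted above, not the actual build controller code; the empty DockerImageRepository branch is the suspected failure mode:

```
// Illustrative sketch of ImageStreamTag pullspec resolution; not the
// actual build controller code. resolvePullSpec is a hypothetical helper.
package pullspec

import (
	"fmt"

	imagev1 "github.com/openshift/api/image/v1"
)

func resolvePullSpec(is *imagev1.ImageStream, tag string, localPolicy bool) (string, error) {
	var latest *imagev1.TagEvent
	for i := range is.Status.Tags {
		t := is.Status.Tags[i]
		if t.Tag == tag && len(t.Items) > 0 {
			latest = &t.Items[0]
			break
		}
	}
	if latest == nil {
		return "", fmt.Errorf("tag %q has not been imported yet", tag)
	}
	if !localPolicy {
		// Source policy: pull from wherever the image was imported from,
		// e.g. registry.redhat.io/rhoar-nodejs/nodejs-10@sha256:...
		return latest.DockerImageReference, nil
	}
	if is.Status.DockerImageRepository == "" {
		// Suspected failure mode: the internal registry hostname is not
		// yet on the stream, so resolution falls back to the external
		// reference, for which the build pod has no pull secret.
		return latest.DockerImageReference, nil
	}
	// Local policy: pull the same image by digest from the internal registry.
	return is.Status.DockerImageRepository + "@" + latest.Image, nil
}
```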

Comment 5 Ben Parees 2019-07-30 00:27:07 UTC
this is one of our recurring failures in the release-4.2 jobs; it needs to be fixed there so we can have confidence in the stability of our 4.2 stream.

Comment 6 Gabe Montero 2019-07-31 13:34:43 UTC
*** Bug 1734627 has been marked as a duplicate of this bug. ***

Comment 7 Gabe Montero 2019-08-07 21:12:30 UTC
@Ben @Adam fyi, in studying this one a bit after Ben's explanation to us last week, I discovered that the "image stream readiness" checks in the extended tests could be augmented to more closely match what the samples operator does
(Adam has asked general questions in this space wrt this bug in the past ... I've now sufficiently connected some dots)

Presumably this could address the potential timing windows Ben highlighted to us, and better ensure that the build controller's IST to image registry reference resolution will not suffer.

I'm starting with updating the ext test utilities, but this could be a candidate for openshift/library-go inclusion over time, with perhaps some convergence with the samples operator.

But of course chime in either here or on my upcoming PR as you all deem fit.

Comment 8 Gabe Montero 2019-08-07 21:33:14 UTC
It would at least eliminate imagestream readiness as a variable once and for all.

Comment 9 Gabe Montero 2019-08-08 19:58:12 UTC
Ben graciously re-stated his hypothesis for us in
- https://github.com/openshift/origin/pull/23568#issuecomment-519281026
- https://github.com/openshift/origin/pull/23568#issuecomment-519281650
- https://github.com/openshift/origin/pull/23568#issuecomment-519283586

I'm looking into the above.

The extended test case update is in fact not directly pertinent, though it could further narrow different timing windows unrelated to this bug. Will pursue that separately.

Comment 10 Gabe Montero 2019-08-08 22:17:39 UTC
Found some related references to Ben's hypothesis: https://github.com/openshift/openshift-controller-manager/blob/master/pkg/build/controller/build/build_controller.go#L876-L880
most likely from a prior incarnation of it.

But it is only output image related.

Ultimately gets surfaced as an error or event at https://github.com/openshift/openshift-controller-manager/blob/master/pkg/build/controller/build/build_controller.go#L1007-L1019

So minimally copying that approach for all the image references in addition to the output image would help debug / confirm the theory.

Actually relisting directly from the API server and bypassing the watch/informer cache might be a thing we want to do as well.

Thoughts @Ben @Adam ?
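
A rough shape of that relist fallback, as a sketch only (getImageStreamFresh is a hypothetical helper, assuming the typed client and lister from openshift/client-go):

```
// Hedged sketch of the relist idea: trust the informer cache normally, but
// go straight to the API server when the cached stream looks stale
// (missing internal registry hostname).
package freshget

import (
	"context"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"

	imagev1 "github.com/openshift/api/image/v1"
	imageclient "github.com/openshift/client-go/image/clientset/versioned"
	imagev1listers "github.com/openshift/client-go/image/listers/image/v1"
)

func getImageStreamFresh(lister imagev1listers.ImageStreamLister, c imageclient.Interface, ns, name string) (*imagev1.ImageStream, error) {
	is, err := lister.ImageStreams(ns).Get(name)
	if err == nil && is.Status.DockerImageRepository != "" {
		return is, nil // cached copy already carries the registry hostname
	}
	// Anomaly (or cache miss): bypass the watch/informer cache.
	return c.ImageV1().ImageStreams(ns).Get(context.TODO(), name, metav1.GetOptions{})
}
```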

Comment 11 Ben Parees 2019-08-08 22:37:34 UTC
I'm not sure where you're proposing adding the explicit list from the apiserver ... if you're proposing doing that in the build/image controller, I think that's going to be unacceptable from a performance perspective. It'll significantly increase the api calls we're making.

I think there are two things to do here:

1) sanity check that we still think the controller manager is getting restarted when the internal registry gets set

2) assuming it is, and therefore assuming we're just in some very narrow timing window here (though I have to say, it's hard to envision that window), add some sort of canary build.  Create the build pointing to the imagestream, confirm that the resolved imagestream pullspec points to the internal registry, not the external one.  Wait/repeat until it does.  (I can't think of a better way to check if the controller manager cache is "correct", other than to force a rollout of the controller manager and wait for it to complete).

Regardless, definitely the first step is to try to get more confirmation of the theory that the cause is indeed "stale" cached content in the controller manager that pre-dates the presence of the internal registry.
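
Item 2) boils down to a wait/repeat loop on a predicate like the following sketch; the registry service hostname here is an assumption for illustration, not something this bug confirms:

```
// Rough sketch of the canary check in 2): after creating a build from the
// imagestream, wait/repeat until the resolved pullspec points at the
// internal registry rather than the external source.
package canary

import "strings"

func resolvedLocally(resolvedPullSpec string) bool {
	// Anything else (e.g. registry.redhat.io/...) suggests the controller
	// manager cache still pre-dates the internal registry.
	return strings.HasPrefix(resolvedPullSpec, "image-registry.openshift-image-registry.svc")
}
```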

Comment 12 Gabe Montero 2019-08-08 23:03:30 UTC
OK, so agreed on adding similar eventing logic for all the image references, as is done for the output image reference with a missing stream.Status.DockerImageReference when the From points to an ImageStream.

On the explicit list from the build/image controller, certainly *ONLY* do that when we detect the anomaly that stream.Status.DockerImageRepository is *NOT* set.
Presumably that should typically not be the case, right?  If so, and with that caveat, do the concerns about increased api calls remain?

For your item 2) ... you're proposing doing that from the extended tests, right?


For 1), I recently wrote logic to track OCM rollout (https://github.com/openshift/origin/blob/master/test/extended/builds/cluster_config.go#L55-L100),
but can you clarify where the image registry is being set? ... some sort of OCM / OCM controller config object, I would presume ...
Not seeing it in https://github.com/openshift/api/blob/master/openshiftcontrolplane/v1/types.go, for example

Comment 13 Ben Parees 2019-08-09 00:15:07 UTC
> OK, so agreed on adding similar eventing logic for all the images references like is done for the output image reference with a missing stream.Status.DockerImageReference when the From points to an ImageStream.

i'm not sure on this one.  The reason it's done for the push is because when your push target for a build is an imagestream, it always goes to the internal registry (you can't set up an imagestream that points to an external registry and have the build use that imagestream and push to the external registry).  Therefore if you are pushing your build to an imagestream and you don't have an internal registry, it is an error condition because it can't possibly work.

If you're using an imagestream for pull, that build can still succeed if:
1) the imagestream is not using the local reference policy
2) your build is not pushing an image anywhere, or is not pushing to the internal registry.

so it's not an error condition in that case.

So if you do add an error/event, it probably needs to be conditioned not only on the internal registry not being there, but also on the imagestream using the local reference policy.

Even then, it's a bit odd because local reference policy is, at least in some cases, just an optimization (you get to pull from a local registry).  There can be cases where someone defines an imagestream that points to a public registry, and uses the local reference policy.  If the local registry isn't present, you can still resolve+pull that image from the external registry successfully, so it's not an error/blocking condition if the internal registry isn't present, for that scenario.  (But it is perhaps worth a warning to the user that they are likely not getting the experience they intended).

It gets extra complicated when you consider that the consumer of the imagestream may not be the one who defined it.  So they may not be in a position to fix/change the local reference policy in order to clear the error/warning, in the case where they just want to go ahead and pull from the external registry instead.
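
Condensed into one predicate, the condition this comment converges on would look roughly like the sketch below (not the eventual openshift-controller-manager change; shouldWarnLocalRefFallback is a hypothetical name):

```
// Sketch of the warning condition: only surface an event when the pull
// goes through an ImageStreamTag with the Local reference policy and the
// internal registry hostname is not available on the stream.
package warncheck

import imagev1 "github.com/openshift/api/image/v1"

func shouldWarnLocalRefFallback(is *imagev1.ImageStream, tag string) bool {
	if is.Status.DockerImageRepository != "" {
		return false // internal registry pullspec is available; no fallback
	}
	for _, t := range is.Spec.Tags {
		if t.Name == tag {
			return t.ReferencePolicy.Type == imagev1.LocalTagReferencePolicy
		}
	}
	return false
}
```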

Comment 14 Gabe Montero 2019-08-13 14:54:27 UTC
Also note, aside from the additional verification / short circuits in the extended tests prior to build attempts with https://github.com/openshift/origin/pull/23568,
I have https://github.com/openshift/openshift-controller-manager/pull/13 on the merge queue so that the build controller will submit an event if it detects that a build has an imagestreamtag local ref input but the internal registry hostname is not available on the imagestream in question, re: Ben's note in comment 13

Comment 16 Gabe Montero 2019-08-14 20:45:19 UTC
Given this was an e2e flake and only automated test code was modified, I'm moving this to verified ... don't need to spend QA's time here

we can reopen if we see a repro along the same lines, i.e. local ref pull-through having to fall back to the TBR ref

if this test fails along some other lines ... new bug

Comment 18 errata-xmlrpc 2019-10-16 06:32:02 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2019:2922

