This looks like a new instance of old bug 1392764 or bug 1702743. It has happened 13 times in the past 7 days; the most recent occurrence is https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/pr-logs/pull/openshift_machine-config-operator/795/pull-ci-openshift-machine-config-operator-master-e2e-aws/3961

Looks like the build pod was not ready in time:

```
Jun 17 22:59:06.996: INFO: Waiting up to 3m0s for all (but 100) nodes to be ready
Jun 17 22:59:07.029: INFO: Running AfterSuite actions on all nodes
Jun 17 22:59:07.029: INFO: Running AfterSuite actions on node 1
fail [github.com/openshift/origin/test/extended/builds/new_app.go:56]: Unexpected error:
    <*errors.errorString | 0xc002c91160>: {
        s: "The build \"a234567890123456789012345678901234567890123456789012345678-1\" status is \"Failed\"",
    }
    The build "a234567890123456789012345678901234567890123456789012345678-1" status is "Failed"
occurred
```

due to:

```
error: build error: After retrying 2 times, Pull image still failed due to error: unable to retrieve auth token: invalid username/password
```
Needs additional debugging - CI continues to flake on this test for an auth-related reason. The current hypothesis is that the imagestream hasn't finished importing when the test runs. The build therefore tries to pull from registry.redhat.io, and fails because we don't have a pull secret for it. Debug PR: https://github.com/openshift/origin/pull/23277
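For anyone following along, here's a rough sketch (not the actual debug PR code) of the kind of pre-check the hypothesis implies: poll until the imagestream reports an imported tag before kicking off the build. The namespace/name/tag ("openshift"/"nodejs":"10") and the client wiring are assumptions for illustration; the real ext tests get their clients from the test framework.

```go
package main

import (
	"fmt"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/util/wait"
	"k8s.io/client-go/tools/clientcmd"

	imagev1client "github.com/openshift/client-go/image/clientset/versioned"
)

func main() {
	// Hypothetical client wiring for a test context.
	config, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	client := imagev1client.NewForConfigOrDie(config)

	// Poll until the imagestream has at least one imported item for the tag
	// the build will pull from (names match the pullspec in the failure logs).
	err = wait.PollImmediate(2*time.Second, 3*time.Minute, func() (bool, error) {
		is, err := client.ImageV1().ImageStreams("openshift").Get("nodejs", metav1.GetOptions{})
		if err != nil {
			return false, nil // tolerate transient errors and keep polling
		}
		for _, tag := range is.Status.Tags {
			if tag.Tag == "10" && len(tag.Items) > 0 {
				return true, nil
			}
		}
		return false, nil
	})
	if err != nil {
		panic(fmt.Errorf("imagestream never finished importing: %v", err))
	}
	fmt.Println("imagestream tag imported; safe to start the build")
}
```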
Moving back to POST - the PR only added extra logging so we can investigate this flake further.
Saw it again. https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-4.2/2529#0:build-log.txt%3A4135

```
fail [github.com/openshift/origin/test/extended/builds/new_app.go:56]: Unexpected error:
    <*errors.errorString | 0xc000038a80>: {
        s: "The build \"a234567890123456789012345678901234567890123456789012345678-1\" status is \"Failed\"",
    }
    The build "a234567890123456789012345678901234567890123456789012345678-1" status is "Failed"
occurred

failed: (1m1s) 2019-07-26T17:18:13 "[Feature:Builds][Conformance] oc new-app should succeed with a --name of 58 characters [Suite:openshift/conformance/parallel/minimal]"
started: (1/542/2160) "[sig-storage] In-tree Volumes [Driver

Failing tests:

[Feature:Builds][Conformance] oc new-app should succeed with a --name of 58 characters [Suite:openshift/conformance/parallel/minimal]

Writing JUnit report to /tmp/artifacts/junit/junit_e2e_20190726-174606.xml
error: 1 fail, 880 pass, 1279 skip (34m39s)
2019/07/26 17:46:09 Container test in pod e2e-aws failed, exit code 1, reason Error
2019/07/26 17:53:31 Copied 222.75Mi of artifacts from e2e-aws to /logs/artifacts/e2e-aws
2019/07/26 17:53:33 Ran for 1h13m6s
error: could not run steps: step e2e-aws failed: template pod "e2e-aws" failed: the pod ci-op-5vchvc8w/e2e-aws failed after 1h11m24s (failed containers: test): ContainerFailed one or more containers exited
Container test exited with code 1, reason Error
```
We have data now - looks like the build is pulling the nodejs imagestreamtag from registry.redhat.io, rather than the internal registry:

```
I0726 17:17:51.011462 1 sti.go:229] Getting docker auth in paths : [/var/run/secrets/openshift.io/pull]
I0726 17:17:51.011476 1 config.go:131] looking for config.json at /var/run/secrets/openshift.io/pull/config.json
I0726 17:17:51.011518 1 config.go:101] looking for .dockercfg at /var/run/secrets/openshift.io/pull/.dockercfg
I0726 17:17:51.012072 1 config.go:112] found .dockercfg at /var/run/secrets/openshift.io/pull/.dockercfg
Local copy of "registry.redhat.io/rhoar-nodejs/nodejs-10@sha256:9dce2f60b87b2eea351213c3d142716c0a70c894df8b3d2d4425b4933b8f6221" is not present.
Explicitly pulling image registry.redhat.io/rhoar-nodejs/nodejs-10@sha256:9dce2f60b87b2eea351213c3d142716c0a70c894df8b3d2d4425b4933b8f6221
Asked to pull fresh copy of "registry.redhat.io/rhoar-nodejs/nodejs-10@sha256:9dce2f60b87b2eea351213c3d142716c0a70c894df8b3d2d4425b4933b8f6221".
Warning: Pull failed, retrying in 5s ...
```

I don't think the build pod is configured with a pull secret for registry.redhat.io, even though that information is available on the cluster. I believe this is on the radar to fix in 4.3.
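As a quick way to confirm the auth side of this, a small diagnostic sketch that parses the mounted pull secret (the same path the builder logs above show it searching) and reports whether it carries credentials for registry.redhat.io. This is illustrative, not builder code; the struct is the legacy .dockercfg shape.

```go
package main

import (
	"encoding/json"
	"fmt"
	"io/ioutil"
)

// dockercfgEntry is the legacy .dockercfg auth entry shape.
type dockercfgEntry struct {
	Username string `json:"username"`
	Password string `json:"password"`
	Email    string `json:"email"`
	Auth     string `json:"auth"`
}

func main() {
	// Same path the builder logs above show it searching.
	data, err := ioutil.ReadFile("/var/run/secrets/openshift.io/pull/.dockercfg")
	if err != nil {
		panic(err)
	}
	// .dockercfg is a flat JSON map of registry host -> auth entry.
	cfg := map[string]dockercfgEntry{}
	if err := json.Unmarshal(data, &cfg); err != nil {
		panic(err)
	}
	if _, ok := cfg["registry.redhat.io"]; ok {
		fmt.Println("pull secret has registry.redhat.io credentials")
	} else {
		fmt.Println("no registry.redhat.io entry - a fallback pull from it will hit the auth error above")
	}
}
```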
This is one of our recurring failures in the release-4.2 jobs; it needs to be fixed there so we can have confidence in the stability of our 4.2 stream.
*** Bug 1734627 has been marked as a duplicate of this bug. ***
@Ben @Adam fyi - in studying this one a bit after Ben's explanation to us last week, I discovered that the "image stream readiness" checks in the extended tests could be augmented to more closely match what the samples operator does (Adam has asked general questions in this space wrt this bug in the past ... I've now sufficiently connected some dots). Presumably this could address the potential timing windows Ben highlighted to us, and better ensure that the build controller's IST to image registry reference resolution will not suffer. I'm starting by updating the extended test utilities, but this could be a candidate for openshift/library-go inclusion over time, with perhaps some convergence with the samples operator. But of course chime in either here or on my upcoming PR as you all deem fit.
It would at least eliminate imagestream readiness as a variable once and for all (sketch below).
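To make the proposal concrete, a hedged sketch of the augmented readiness helper I have in mind: beyond waiting for tag import (as the existing utilities do), also require that status.dockerImageRepository is populated, since that is what local-reference resolution depends on. The function name is illustrative, not the actual ext test utility.

```go
package readiness

import (
	imagev1 "github.com/openshift/api/image/v1"
)

// ImageStreamReady reports whether a stream is safe to build from: the
// internal-registry pullspec must be published and every spec tag imported.
func ImageStreamReady(is *imagev1.ImageStream) bool {
	// Until status.dockerImageRepository is set, local-reference resolution
	// can fall back to the external pullspec (the failure mode in this bug).
	if len(is.Status.DockerImageRepository) == 0 {
		return false
	}
	// Every spec tag must have at least one imported generation.
	for _, specTag := range is.Spec.Tags {
		imported := false
		for _, statusTag := range is.Status.Tags {
			if statusTag.Tag == specTag.Name && len(statusTag.Items) > 0 {
				imported = true
				break
			}
		}
		if !imported {
			return false
		}
	}
	return true
}
```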
Ben graciously re-stated his hypothesis for us in:
- https://github.com/openshift/origin/pull/23568#issuecomment-519281026
- https://github.com/openshift/origin/pull/23568#issuecomment-519281650
- https://github.com/openshift/origin/pull/23568#issuecomment-519283586

I'm looking into the above. The extended test case update is in fact not directly pertinent, though it could further narrow different timing windows unrelated to this bug. Will pursue that separately.
Found some related references to Ben's hypothesis: https://github.com/openshift/openshift-controller-manager/blob/master/pkg/build/controller/build/build_controller.go#L876-L880 - most likely from a prior incarnation of it, but it only covers the output image. It ultimately gets surfaced as an error or event at https://github.com/openshift/openshift-controller-manager/blob/master/pkg/build/controller/build/build_controller.go#L1007-L1019

So, minimally, copying that approach for all the input image references in addition to the output image would help debug / confirm the theory (see the sketch below). Re-listing directly from the API server and bypassing the watch/informer cache might be a thing we want to do as well. Thoughts @Ben @Adam?
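Something like this standalone sketch for the input-image analog of the existing output-image check - the helper name, event reason, and recorder wiring are illustrative, with the real plumbing living in the build controller:

```go
package buildcontroller

import (
	corev1 "k8s.io/api/core/v1"
	"k8s.io/client-go/tools/record"

	buildv1 "github.com/openshift/api/build/v1"
	imagev1 "github.com/openshift/api/image/v1"
)

// warnOnMissingRegistry mirrors, for an *input* imagestream, the warning the
// controller already raises for output images whose stream has no
// status.dockerImageRepository.
func warnOnMissingRegistry(recorder record.EventRecorder, build *buildv1.Build, stream *imagev1.ImageStream) {
	if len(stream.Status.DockerImageRepository) == 0 {
		recorder.Eventf(build, corev1.EventTypeWarning, "InvalidInputImage",
			"input imagestream %s/%s has no status.dockerImageRepository yet; pull resolution may fall back to the external pullspec",
			stream.Namespace, stream.Name)
	}
}
```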
I'm not sure where you're proposing adding the explicit list from the apiserver... if you're proposing doing that in the build/image controller, I think that's going to be unacceptable from a performance perspective - it would significantly increase the number of api calls we're making.

I think there are two things to do here:

1) sanity check that we still think the controller manager is getting restarted when the internal registry gets set

2) assuming it is, and therefore assuming we're just in some very narrow timing window here (though I have to say, it's hard to envision that window), add some sort of canary build (sketched below): create the build pointing to the imagestream, confirm that the resolved imagestream pullspec points to the internal registry, not the external one, and wait/repeat until it does. (I can't think of a better way to check whether the controller manager cache is "correct", other than to force a rollout of the controller manager and wait for it to complete.)

Regardless, the first step is definitely to try to get more confirmation of the theory that the cause is indeed "stale" cached content in the controller manager that pre-dates the presence of the internal registry.
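To sketch what I mean by a canary (hedged - the build name/namespace and the internal-registry host prefix here are assumptions): create a throwaway build whose output is an ImageStreamTag and poll status.outputDockerImageReference, which the controller fills in from its cached imagestream, until it points at the internal registry. Unlike a direct imagestream GET, this gives a view into the controller's cache.

```go
package main

import (
	"fmt"
	"strings"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/util/wait"
	"k8s.io/client-go/tools/clientcmd"

	buildv1client "github.com/openshift/client-go/build/clientset/versioned"
)

func main() {
	config, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	client := buildv1client.NewForConfigOrDie(config)

	// Assumes a canary build named "cache-canary" whose output is an
	// ImageStreamTag was already created in the (hypothetical) test namespace.
	err = wait.PollImmediate(5*time.Second, 5*time.Minute, func() (bool, error) {
		b, err := client.BuildV1().Builds("e2e-test").Get("cache-canary", metav1.GetOptions{})
		if err != nil {
			return false, nil
		}
		// The controller resolves the output pullspec from its cached
		// imagestream; a stale cache keeps surfacing the external ref.
		ref := b.Status.OutputDockerImageReference
		return strings.HasPrefix(ref, "image-registry.openshift-image-registry.svc"), nil
	})
	if err != nil {
		panic(fmt.Errorf("output never resolved to the internal registry: %v", err))
	}
	fmt.Println("controller cache has observed the internal registry")
}
```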
OK, so agreed on adding similar eventing logic for all the image references, like is done for the output image reference, when the From points to an ImageStream and stream.Status.DockerImageReference is missing.

On the explicit list from the build/image controller: certainly *ONLY* do that when we detect the anomaly that stream.Status.DockerImageRepository is *NOT* set. Presumably that should typically not be the case, right? If so, and with that caveat, do the concerns about increased api calls remain?

For your item 2) ... you're proposing doing that from the extended tests, right?

For 1), I recently wrote logic to track OCM rollout (https://github.com/openshift/origin/blob/master/test/extended/builds/cluster_config.go#L55-L100), but can you clarify where the image registry is being set? ... some sort of OCM / OCM controller config object, I would presume (one guess sketched below) ... I'm not seeing it in https://github.com/openshift/api/blob/master/openshiftcontrolplane/v1/types.go for example.
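One guess on my end (hedged - this may not be the object OCM actually consumes): the registry operator publishes the hostname on the cluster image config, so something like this could at least tell us whether it has been set.

```go
package main

import (
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/tools/clientcmd"

	configclient "github.com/openshift/client-go/config/clientset/versioned"
)

func main() {
	config, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	client := configclient.NewForConfigOrDie(config)

	// images.config.openshift.io is cluster-scoped; "cluster" is the singleton.
	img, err := client.ConfigV1().Images().Get("cluster", metav1.GetOptions{})
	if err != nil {
		panic(err)
	}
	// Empty means the registry operator has not published a hostname yet.
	fmt.Printf("internalRegistryHostname: %q\n", img.Status.InternalRegistryHostname)
}
```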
> OK, so agreed on adding similar eventing logic for all the image references, like is done for the output image reference, when the From points to an ImageStream and stream.Status.DockerImageReference is missing.

I'm not sure on this one. The reason it's done for the push is that when your push target for a build is an imagestream, it always goes to the internal registry (you can't set up an imagestream that points to an external registry and have the build use that imagestream to push to the external registry). Therefore, if you are pushing your build to an imagestream and you don't have an internal registry, it is an error condition because it can't possibly work.

If you're using an imagestream for pull, that build can still succeed if:

1) the imagestream is not using local image reference policy, or

2) your build is not pushing an image anywhere, or is not pushing to the internal registry.

So it's not an error condition in that case. If you do add an error/event, it probably needs to be conditioned not only on the internal registry not being there, but also on the imagestream using the local reference policy (see the sketch below).

Even then, it's a bit odd because local reference policy is, at least in some cases, just an optimization (you get to pull from a local registry). There can be cases where someone defines an imagestream that points to a public registry and uses the local reference policy. If the local registry isn't present, you can still resolve+pull that image from the external registry successfully, so it's not an error/blocking condition if the internal registry isn't present, for that scenario. (But it is perhaps worth a warning to the user that they are likely not getting the experience they intended.)

It gets extra complicated when you consider that the consumer of the imagestream may not be the one who defined it, so they may not be in a position to fix/change the local reference policy in order to clear the error/warning, in the case where they just want to go ahead and pull from the external registry instead.
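To pin down that conditioning, a sketch of the check (the helper name is hypothetical, and it's warning-only, since the pull can still succeed from the external registry):

```go
package main

import (
	"fmt"

	imagev1 "github.com/openshift/api/image/v1"
)

// localRefWithoutRegistry is a hypothetical helper: true only when the tag
// opted into Local reference policy *and* the stream has no internal-registry
// pullspec - the one combination worth warning about per the comment above.
func localRefWithoutRegistry(is *imagev1.ImageStream, tagName string) bool {
	if len(is.Status.DockerImageRepository) != 0 {
		return false // internal registry is wired in; nothing to warn about
	}
	for _, tag := range is.Spec.Tags {
		if tag.Name == tagName {
			return tag.ReferencePolicy.Type == imagev1.LocalTagReferencePolicy
		}
	}
	return false
}

func main() {
	// Minimal example stream with a Local-reference tag and no registry set.
	is := &imagev1.ImageStream{
		Spec: imagev1.ImageStreamSpec{
			Tags: []imagev1.TagReference{{
				Name:            "10",
				ReferencePolicy: imagev1.TagReferencePolicy{Type: imagev1.LocalTagReferencePolicy},
			}},
		},
	}
	if localRefWithoutRegistry(is, "10") {
		fmt.Println("warning: local reference policy without an internal registry; pull will use the external pullspec")
	}
}
```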
Also note, aside from the additional verification / short circuits in the extended tests prior to build attempts in https://github.com/openshift/origin/pull/23568, I have https://github.com/openshift/openshift-controller-manager/pull/13 on the merge queue, so that the build controller will submit an event if it detects that a build has an imagestreamtag local-ref input but the internal registry hostname is not available on the imagestream in question, re: Ben's note in #comment 13.
Given this was an e2e flake and only automated test code was modified, I'm moving this to verified ... no need to spend QA's time here. We can reopen if we see a repro along the same lines (local ref pull-through having to fall back to the TBR ref). If this test fails along some other lines ... new bug.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2019:2922