Description of problem:
If a build uses a pull-through imagestream tag as its FROM image, sometimes the build will not use the internal registry's reference location and instead use the "source" registry's location. This can result in build failures if the source registry requires a pull secret, such as registry.redhat.io.

Version-Release number of selected component (if applicable):
4.1.0

How reproducible:
Sometimes

Steps to Reproduce:
Observed in flakes of the following test suites:
"[Feature:Builds][Conformance] oc new-app should succeed with a --name of 58 characters [Suite:openshift/conformance/parallel/minimal]"

Actual results:
Build uses the source registry's reference location (registry.redhat.io), resulting in failures because the build does not have a valid pull secret for that registry.

Expected results:
Builds succeed because they use the internal registry to pull images.
Suspected root cause is that the openshift controller manager (OCM) is caching the imagestream at a point where the openshift apiserver is not yet using the internal registry's hostname. The hypothesis is that the flaking test runs in a window where the apiserver has since started using the internal registry hostname, but OCM hasn't re-listed its caches and still holds the stale imagestream.

The simplest way to clear the cache is to restart the openshift controller manager. A cluster admin can do this as follows:

```
$ oc delete pods -l app=openshift-controller-manager,controller-manager=true -n openshift-controller-manager
```
Flaking tests have been disabled in https://github.com/openshift/origin/pull/23832. These tests need to be re-enabled for this BZ to be accepted as fixed.
My current thought on how to fix this is to revise the helper logic used in OCM that "resolves" imagestreamtags. Rather than having that logic look for the (decorator-added) local registry field on the imagestream, just have it be aware of what the local registry hostname is (if one exists) and use it. The OCM should already know this value. Then we are not dependent on the apiserver having picked up the internal registry hostname.
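A minimal sketch of that idea, not the actual library-go helper. The `imageStream` type, the `resolveLocalPullSpec` function, and the fallback to the status field when no hostname is configured are all assumptions made for illustration; the hostname in `main` is only an example value.

```go
package main

import "fmt"

// imageStream is a hypothetical, trimmed-down stand-in for the real
// ImageStream type; only the fields this sketch needs are kept.
type imageStream struct {
	Namespace string
	Name      string
	// DockerImageRepository mirrors stream.Status.DockerImageRepository,
	// which is only populated once the apiserver knows the registry hostname.
	DockerImageRepository string
}

// resolveLocalPullSpec (hypothetical) builds the pull spec for a tag whose
// reference policy is Local. It prefers the hostname the controller already
// knows and only falls back to the status field when none is configured.
func resolveLocalPullSpec(s imageStream, tag, internalRegistryHost string) (string, bool) {
	repo := s.DockerImageRepository
	if internalRegistryHost != "" {
		// Not dependent on what the apiserver decorated onto the stream.
		repo = fmt.Sprintf("%s/%s/%s", internalRegistryHost, s.Namespace, s.Name)
	}
	if repo == "" {
		// No internal registry known from either source.
		return "", false
	}
	return repo + ":" + tag, true
}

func main() {
	// A stream cached before the apiserver picked up the hostname:
	// its DockerImageRepository is empty.
	stale := imageStream{Namespace: "openshift", Name: "ruby"}
	spec, ok := resolveLocalPullSpec(stale, "2.3", "image-registry.openshift-image-registry.svc:5000")
	fmt.Println(spec, ok)
}
```

With this shape, a stale cached imagestream still resolves to the internal registry as long as the controller's own configuration carries the hostname.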
Some relevant code links:

https://github.com/openshift/openshift-controller-manager/blob/master/pkg/build/controller/build/build_controller.go#L969
https://github.com/openshift/library-go/blob/master/pkg/image/imageutil/helpers.go#L300-L355
in particular https://github.com/openshift/library-go/blob/master/pkg/image/imageutil/helpers.go#L332

In addition to the `stream.Status.DockerImageRepository`, we can use

https://github.com/openshift/api/blob/master/openshiftcontrolplane/v1/types.go#L239
https://github.com/openshift/api/blob/master/openshiftcontrolplane/v1/types.go#L181

which the build controller already has:

https://github.com/openshift/openshift-controller-manager/blob/master/pkg/cmd/controller/build.go#L77

and it is passed into the strategy-specific create build pod:

https://github.com/openshift/openshift-controller-manager/blob/master/pkg/build/controller/build/build_controller.go#L730
For QE: the theory behind the code change, which the associated extended test exposed, centers on executing a build whose input image uses the local reference policy on an imagestream, so that we get pull-through from the internal registry. If that build ran as soon as the server came up, or as soon as a config change resulted in a restart of the openshift api server, there could be a timing window where the api server does not get the internal registry hostname before the build controller does. In such a case the stream.Status.DockerImageRepository field would be empty.

If you want to spend some time trying to force such a timing window, OK, but otherwise I'm good with either moving to Verify, because the extended test is passing, or performing some basic regression testing of builds using local reference imagestream inputs.
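For the basic regression check, one possible approach is to compare the image reference a build actually pulled against the internal registry hostname. The sketch below is only illustrative: `pullsFromInternalRegistry` is a hypothetical helper, the hostname is an example value, and how the pull spec is collected from the build is left open.

```go
package main

import (
	"fmt"
	"strings"
)

// pullsFromInternalRegistry (hypothetical) reports whether pullSpec points at
// the given internal registry host rather than at the source registry.
func pullsFromInternalRegistry(pullSpec, internalRegistryHost string) bool {
	if internalRegistryHost == "" {
		return false
	}
	// Require the trailing slash so that a host which merely shares a
	// prefix with the internal registry hostname does not match.
	return strings.HasPrefix(pullSpec, internalRegistryHost+"/")
}

func main() {
	host := "image-registry.openshift-image-registry.svc:5000" // example value
	for _, spec := range []string{
		host + "/openshift/ruby:2.3",                    // pull-through via internal registry
		"registry.redhat.io/rhscl/ruby-23-rhel7:latest", // source registry, needs a pull secret
	} {
		fmt.Printf("%s internal=%v\n", spec, pullsFromInternalRegistry(spec, host))
	}
}
```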
e2e passed: https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/pr-logs/pull/23957/pull-ci-openshift-origin-master-e2e-aws-builds/3182

I also tested with the ruby imagestream, whose referencePolicy is Local:

[wewang@Desktop Downloads]$ oc get is ruby -n openshift -o yaml
apiVersion: image.openshift.io/v1
kind: ImageStream
metadata:
  annotations:
    openshift.io/display-name: Ruby
    openshift.io/image.dockerRepositoryCheck: "2019-10-15T00:46:44Z"
    samples.operator.openshift.io/version: 4.3.0-0.ci-2019-10-14-215116
  creationTimestamp: "2019-10-15T00:44:55Z"
  generation: 2
  labels:
    samples.operator.openshift.io/managed: "true"
  name: ruby
  namespace: openshift
  resourceVersion: "15013"
  selfLink: /apis/image.openshift.io/v1/namespaces/openshift/imagestreams/ruby
  uid: 80bf634e-8121-4768-a599-da026625fbf0
spec:
  lookupPolicy:
    local: false
  tags:
  - annotations:
      description: Build and run Ruby 2.3 applications on RHEL 7. For more information
        about using this builder image, including OpenShift considerations, see
        https://github.com/sclorg/s2i-ruby-container/blob/master/2.3/README.md.
      iconClass: icon-ruby
      openshift.io/display-name: Ruby 2.3
      openshift.io/provider-display-name: Red Hat, Inc.
      sampleRepo: https://github.com/sclorg/ruby-ex.git
      supports: ruby:2.3,ruby
      tags: hidden,builder,ruby
      version: "2.3"
    from:
      kind: DockerImage
      name: registry.redhat.io/rhscl/ruby-23-rhel7:latest
    generation: 2
    importPolicy: {}
    name: "2.3"
    referencePolicy:
      type: Local
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2020:0062