Bug 1753731
| Summary: | Builds can use incorrect location for pullthrough imagestream tags | | |
|---|---|---|---|
| Product: | OpenShift Container Platform | Reporter: | Adam Kaplan <adam.kaplan> |
| Component: | Build | Assignee: | Gabe Montero <gmontero> |
| Status: | CLOSED ERRATA | QA Contact: | wewang <wewang> |
| Severity: | medium | Docs Contact: | |
| Priority: | unspecified | | |
| Version: | 4.1.0 | CC: | aos-bugs, bparees, gmontero, wzheng |
| Target Milestone: | --- | | |
| Target Release: | 4.3.0 | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | Bug Fix |
| Doc Text: | Cause: Builds started very soon after an imagestream was created might not leverage local pullthrough imagestream tags when specified.<br>Consequence: The build would attempt to pull the image from the external image registry, and if the build was not set up with the authorization and certificates needed for that registry (on the assumption that it would pull the image from the internal OpenShift registry), the build would fail.<br>Fix: The build controller was updated to detect when its imagestream cache was missing the information needed for local pullthrough imagestream tags and to retrieve that information by other means.<br>Result: Builds expecting to leverage local imagestream tag pullthrough are now able to do so. | | |
| Story Points: | --- | | |
| Clone Of: | | Environment: | |
| Last Closed: | 2020-01-23 11:06:22 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
Description
Adam Kaplan
2019-09-19 16:28:22 UTC
Suspected root cause is that the openshift controller manager caches the imagestream at a point where the openshift apiserver is not yet using the internal registry's hostname. The hypothesis is that the flaking test runs in a window where the apiserver is using the internal registry hostname, but OCM hasn't re-listed its caches.

The simplest way to clear the cache is to restart the openshift controller manager. A cluster admin can do this as follows:

```
$ oc delete pods -l app=openshift-controller-manager,controller-manager=true -n openshift-controller-manager
```

Flaking tests have been disabled in https://github.com/openshift/origin/pull/23832. These tests need to be re-enabled for this BZ to be accepted as fixed.

My current thought on how to fix this is to revise the helper logic used in OCM that "resolves" imagestreamtags: rather than having that logic look for the (decorator-added) local registry field on the imagestream, have it be aware of what the local registry hostname is (if one exists) and use it. The OCM should already know this value. Then we are not dependent on the apiserver having picked up the internal registry hostname.
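The proposed fix above can be sketched as a small standalone Go program. This is a minimal sketch, not the actual openshift-controller-manager code: the type and function names (`imageStreamStatus`, `resolveLocalPullSpec`, `configuredRegistryHost`) are illustrative, and the real logic lives in the helpers linked in the next comment.

```go
package main

import "fmt"

// imageStreamStatus models the one status field relevant here; the real
// ImageStream API type has many more fields.
type imageStreamStatus struct {
	// Populated by the apiserver decorator; may be empty in the race window
	// where the apiserver has not yet picked up the registry hostname.
	DockerImageRepository string
}

// resolveLocalPullSpec sketches the proposed fix: if the cached imagestream
// status lacks the internal registry repository, fall back to the registry
// hostname the controller manager already knows from its own configuration
// instead of depending on the apiserver-decorated field.
func resolveLocalPullSpec(status imageStreamStatus, configuredRegistryHost, namespace, name, tag string) (string, bool) {
	repo := status.DockerImageRepository
	if repo == "" {
		if configuredRegistryHost == "" {
			// No internal registry known at all: cannot build a local
			// pullthrough reference.
			return "", false
		}
		repo = fmt.Sprintf("%s/%s/%s", configuredRegistryHost, namespace, name)
	}
	return fmt.Sprintf("%s:%s", repo, tag), true
}

func main() {
	// Simulate the race: the cache was filled before the apiserver published
	// the internal registry hostname, so the status field is empty.
	spec, ok := resolveLocalPullSpec(imageStreamStatus{},
		"image-registry.openshift-image-registry.svc:5000", "openshift", "ruby", "2.3")
	// Prints: true image-registry.openshift-image-registry.svc:5000/openshift/ruby:2.3
	fmt.Println(ok, spec)
}
```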
Some relevant code links:

- https://github.com/openshift/openshift-controller-manager/blob/master/pkg/build/controller/build/build_controller.go#L969
- https://github.com/openshift/library-go/blob/master/pkg/image/imageutil/helpers.go#L300-L355, in particular https://github.com/openshift/library-go/blob/master/pkg/image/imageutil/helpers.go#L332

In addition to `stream.Status.DockerImageRegistry`, we can use:

- https://github.com/openshift/api/blob/master/openshiftcontrolplane/v1/types.go#L239
- https://github.com/openshift/api/blob/master/openshiftcontrolplane/v1/types.go#L181

which the build controller already has (https://github.com/openshift/openshift-controller-manager/blob/master/pkg/cmd/controller/build.go#L77) and which is passed into the strategy-specific create build pod: https://github.com/openshift/openshift-controller-manager/blob/master/pkg/build/controller/build/build_controller.go#L730

For QE: the theory for the code change that the associated extended test exposed centered around executing a build whose input image leveraged local reference policy on an imagestream, so that we got pullthrough from the internal registry. If that build ran as soon as the server came up, or as soon as a config change resulted in a restart of the openshift api server, there could be a timing window where the api server does not get the internal registry hostname before the build controller does. In such a case the stream.Status.DockerImageRepository field would be empty. If you want to spend some time trying to force such a timing window, OK, but otherwise I'm good with either moving to Verified, because the extended test is passing, or performing some basic regression testing of builds using local reference imagestream inputs.
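The failure mode described above (empty `stream.Status.DockerImageRepository` during the timing window) can be illustrated with a standalone sketch. Types and names here (`tagRef`, `pullSpec`) are hypothetical stand-ins, not the real openshift API structs; the sketch only shows why an empty cached repository drops the build back to the external image reference.

```go
package main

import "fmt"

// tagRef models the two facts that matter for pullthrough resolution.
type tagRef struct {
	ExternalImage  string // e.g. registry.redhat.io/rhscl/ruby-23-rhel7:latest
	ReferenceLocal bool   // referencePolicy: type: Local on the tag
}

// pullSpec returns the image reference a build would pull. With a Local
// reference policy the build should pull through the internal registry,
// but if the cached status repository is empty (the race window), the
// only usable reference is the external image, which the build may lack
// credentials and certificates for.
func pullSpec(cachedRepo string, tr tagRef, tag string) string {
	if tr.ReferenceLocal && cachedRepo != "" {
		return fmt.Sprintf("%s:%s", cachedRepo, tag) // pullthrough via internal registry
	}
	return tr.ExternalImage
}

func main() {
	tr := tagRef{ExternalImage: "registry.redhat.io/rhscl/ruby-23-rhel7:latest", ReferenceLocal: true}
	// Race window: empty cached repository falls back to the external registry.
	fmt.Println(pullSpec("", tr, "2.3")) // registry.redhat.io/rhscl/ruby-23-rhel7:latest
	// Normal case: pullthrough from the internal registry.
	fmt.Println(pullSpec("image-registry.openshift-image-registry.svc:5000/openshift/ruby", tr, "2.3"))
}
```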
e2e passed: https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/pr-logs/pull/23957/pull-ci-openshift-origin-master-e2e-aws-builds/3182, and tested with the ruby imagestream, whose referencePolicy is Local:

```
[wewang@Desktop Downloads]$ oc get is ruby -n openshift -o yaml
apiVersion: image.openshift.io/v1
kind: ImageStream
metadata:
  annotations:
    openshift.io/display-name: Ruby
    openshift.io/image.dockerRepositoryCheck: "2019-10-15T00:46:44Z"
    samples.operator.openshift.io/version: 4.3.0-0.ci-2019-10-14-215116
  creationTimestamp: "2019-10-15T00:44:55Z"
  generation: 2
  labels:
    samples.operator.openshift.io/managed: "true"
  name: ruby
  namespace: openshift
  resourceVersion: "15013"
  selfLink: /apis/image.openshift.io/v1/namespaces/openshift/imagestreams/ruby
  uid: 80bf634e-8121-4768-a599-da026625fbf0
spec:
  lookupPolicy:
    local: false
  tags:
  - annotations:
      description: Build and run Ruby 2.3 applications on RHEL 7. For more information about using this builder image, including OpenShift considerations, see https://github.com/sclorg/s2i-ruby-container/blob/master/2.3/README.md.
      iconClass: icon-ruby
      openshift.io/display-name: Ruby 2.3
      openshift.io/provider-display-name: Red Hat, Inc.
      sampleRepo: https://github.com/sclorg/ruby-ex.git
      supports: ruby:2.3,ruby
      tags: hidden,builder,ruby
      version: "2.3"
    from:
      kind: DockerImage
      name: registry.redhat.io/rhscl/ruby-23-rhel7:latest
    generation: 2
    importPolicy: {}
    name: "2.3"
    referencePolicy:
      type: Local
```

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:0062