Bug 1911470

Summary: ServiceAccount Registry Authfiles Do Not Contain Entries for Public Hostnames
Product: OpenShift Container Platform
Reporter: Steve Kuznetsov <skuznets>
Component: Image Registry
Assignee: Ricardo Maraschini <rmarasch>
Status: CLOSED ERRATA
QA Contact: Wenjing Zheng <wzheng>
Severity: medium
Priority: medium
Version: 4.1.z
CC: aaleman, aos-bugs, ccoleman, hongkliu, rmarasch
Target Release: 4.8.0
Hardware: Unspecified
OS: Unspecified
Doc Type: Bug Fix
Doc Text:
Cause: The automatically created docker config secret did not include credentials for the integrated registry's routes.
Consequence: Because no credentials were present for accessing the registry through its routes, pods attempting to reach the registry through a route failed with an authentication error.
Fix: All configured registry routes are now included in the default docker credential secret.
Result: Pods can reach the integrated registry through any of its routes, since the credentials contain an entry for each route.
Last Closed: 2021-07-27 22:35:38 UTC
Type: Bug
Bug Blocks: 1931856

Description Steve Kuznetsov 2020-12-29 16:06:05 UTC
We have a relatively new cluster:

$ oc --context app.ci get clusterversion version
NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.6.1     True        False         67d     Cluster version is 4.6.1

This cluster has the registry configured to use a custom route as well as a default one:

$ oc --context app.ci get configs.imageregistry.operator.openshift.io cluster -o jsonpath={.spec.defaultRoute}
true
$ oc --context app.ci get configs.imageregistry.operator.openshift.io cluster -o jsonpath={.spec.routes[0].hostname}
registry.ci.openshift.org

Both are live:

$ oc --context app.ci get image.config.openshift.io cluster -o jsonpath={.status.externalRegistryHostnames}
[default-route-openshift-image-registry.apps.ci.l2s4.p1.openshiftapps.com registry.ci.openshift.org]

However, no ServiceAccount on this cluster has entries for the external hostnames in its generated registry authfile:

$ oc --context app.ci --namespace ocp extract secrets/default-dockercfg-x878r --to=- | jq 'keys'
# .dockercfg
[
  "172.30.49.128:5000",
  "image-registry.openshift-image-registry.svc.cluster.local:5000",
  "image-registry.openshift-image-registry.svc:5000"
]
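
As a quick check, the extracted dockercfg can be queried for the external hostname directly (same secret and hostname as above); it returns false here because the entry is missing:

$ oc --context app.ci --namespace ocp extract secrets/default-dockercfg-x878r --to=- | jq 'has("registry.ci.openshift.org")'
# .dockercfg
false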


On a 3.11 cluster, entries are automatically generated for the external hostnames.

Comment 1 Clayton Coleman 2020-12-29 16:29:13 UTC
On an OSD cluster with 2 public routes (correctly configured in the cluster image configuration):

- apiVersion: config.openshift.io/v1
  kind: Image
  metadata:
    annotations:
      release.openshift.io/create-only: "true"
    creationTimestamp: "2020-04-16T19:11:37Z"
    generation: 2
    name: cluster
    resourceVersion: "211280865"
    selfLink: /apis/config.openshift.io/v1/images/cluster
    uid: e6d14209-b45e-40ac-bf51-74b870d7c0ad
  spec:
    externalRegistryHostnames:
    - registry.ci.openshift.org
  status:
    externalRegistryHostnames:
    - default-route-openshift-image-registry.apps.ci.l2s4.p1.openshiftapps.com
    - registry.ci.openshift.org
    internalRegistryHostname: image-registry.openshift-image-registry.svc:5000

the openshiftcontrollermanagers config sends only the internal address to be generated into the pull secret:

$ oc get openshiftcontrollermanagers.operator.openshift.io -o yaml

  spec:
    logLevel: ""
    managementState: Managed
    observedConfig:
      build:
        buildDefaults:
          resources: {}
        imageTemplateFormat:
          format: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:2986a09ed686a571312bcb20d648baac46b422efa072f8b68eb41c7996e94610
      deployer:
        imageTemplateFormat:
          format: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:f42509c18cf5e41201d64cf3a9c1994ffa5318f8d7cee5de45fa2da914e68bbc
      dockerPullSecret:
        internalRegistryHostname: image-registry.openshift-image-registry.svc:5000
      ingress:
        ingressIPNetworkCIDR: ""
    operatorLogLevel: ""
    unsupportedConfigOverrides: null

It should be sending all the public names as well as the internal registry name. That lets someone work with public and private names interchangeably while pods keep working, and the public names allow resiliency if someone sets up a proxy in front of one of them.
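
For comparison, a fix along these lines would presumably leave the observed config carrying the external names as well. A sketch only (registryURLs is the controller-manager config field that served this purpose historically; the eventual fix may wire it differently):

      dockerPullSecret:
        internalRegistryHostname: image-registry.openshift-image-registry.svc:5000
        registryURLs:
        - default-route-openshift-image-registry.apps.ci.l2s4.p1.openshiftapps.com
        - registry.ci.openshift.org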

I'm not positive this is a regression, but this absolutely broke a scenario that worked in 3.11 when we moved to 4.6, so I'm marking it as such (it could break others moving from 3 to 4). We'll need to assess whether this has been broken throughout 4.x or only since 4.6 when deciding whether to backport (if it's broken in all of 4.x, I think fixing only 4.6 is acceptable).

Comment 3 Ricardo Maraschini 2021-01-04 12:12:04 UTC
I can confirm that this is the behavior in 4.1 as well: no extra entry for the external route is created.

Comment 4 Oleg Bulatov 2021-01-04 14:19:30 UTC
What you describe works as designed and is intentional. If you want to use the external name, you need to create a secret manually; this keeps your manifests transferable between clusters. If your manifests are only ever used on one cluster, there is no reason to use the external route. Using external routes may give you the feeling that you can easily transfer your manifests to another cluster, but you can't.

So I'd be very careful with adding external names to these secrets.
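
For reference, the manual approach might look like the following (a sketch; the secret name is illustrative, the token source assumes a service account with pull access, and the username is essentially arbitrary for token auth):

$ oc create secret docker-registry external-registry \
    --docker-server=registry.ci.openshift.org \
    --docker-username=serviceaccount \
    --docker-password="$(oc sa get-token builder)"
$ oc secrets link default external-registry --for=pull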

Steve, what is your use case? You haven't described why you need this. This looks more like an RFE.

> The public names allow resiliency if someone sets up a proxy in front of one of those names.

Clayton, can you elaborate on what kind of proxy you want in front of image-registry.openshift-image-registry.svc?

Comment 5 Hongkai Liu 2021-01-04 14:36:13 UTC
Just want to report a use case (not the same as the one Steve reported for this bug) hit while migrating the CI registry from a 3.11 cluster to 4.6.
https://github.com/openshift/release/pull/14522/files#r548190491
The suggested workaround is to use the internal hostname of the registry's svc, for example an image reference of the form:
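
  image-registry.openshift-image-registry.svc:5000/<namespace>/<imagestream>:<tag>

(the path components are placeholders)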

Comment 6 Oleg Bulatov 2021-01-05 11:24:36 UTC
Migration is expected; you cannot use docker-registry.default.svc:5000 either. If you want to use an external name, you should treat it as an external registry.

Comment 7 Steve Kuznetsov 2021-01-05 15:52:39 UTC
How is this an RFE if it's a regression from previous behavior? I want to be able to use either the internal or the external hostname to refer to the registry; they're identical. We've built up an enormous amount of nonsense automation to re-write secrets to include the external hostname in order to deal with this.
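
That rewriting boils down to duplicating the internal-registry credential under the external hostname, roughly (a sketch reusing the secret and keys from comment 0; the output would then be written back into a pull secret):

$ oc --context app.ci --namespace ocp extract secrets/default-dockercfg-x878r --to=- \
    | jq '. + {"registry.ci.openshift.org": .["172.30.49.128:5000"]}'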

Comment 8 Oleg Bulatov 2021-01-05 17:10:46 UTC
> They're identical.

Except that the traffic for the external hostname goes through the load balancer and the router.

Not every 3.11 feature is supported by 4.x. Storage quota is gone, Alibaba storage is gone, etc. Those can return if somebody asks for them. But so far we have removed external hostnames to protect you from creating configurations that are hard to migrate from one cluster to another. It protects you from using external load balancers when they are not needed. I guess you have a reason why it's better to use external names in your case, but you haven't shared it yet.

As we already have five major releases without this feature, I don't consider it to be a regression. I'm OK with adding an option in 4.8 (or in a version that our PM selects) that will enable the behavior you want.

Comment 9 Steve Kuznetsov 2021-01-08 18:16:35 UTC
Using a known, functional external hostname makes the migration easier, not harder. Since this works in 3.x but does not in 4.x, it is a regression in the product that will break user workloads expecting it to keep working. We should reinstate it - if not by default, at least as opt-in.

Comment 10 Clayton Coleman 2021-01-08 18:26:42 UTC
This is a regression in the product.  The design of the image stream public field is that the public hostname can be pulled by pods.  It must be fixed.  It should be on by default.  There is no downside to having this on by default.
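
For reference, the public field in question surfaces on each image stream as status.publicDockerImageRepository (<name> below is a placeholder):

$ oc get imagestream <name> -o jsonpath='{.status.publicDockerImageRepository}'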

Comment 11 Wenjing Zheng 2021-02-07 09:09:53 UTC
It works with the open PR; results below:
$ oc extract secrets/default-dockercfg-dbn69 --to=- | jq 'keys'
# .dockercfg
[
  "172.30.148.19:5000",
  "default-route-openshift-image-registry.apps.ci-ln-2tc5v1t-f76d1.origin-ci-int-gce.dev.openshift.com",
  "image-registry.openshift-image-registry.svc.cluster.local:5000",
  "image-registry.openshift-image-registry.svc:5000"
]
zhengwenjings-MacBook-Pro:4.0 wzheng$ oc get routes
NAME            HOST/PORT                                                                                             PATH   SERVICES         PORT    TERMINATION   WILDCARD
default-route   default-route-openshift-image-registry.apps.ci-ln-2tc5v1t-f76d1.origin-ci-int-gce.dev.openshift.com          image-registry   <all>   reencrypt     None
myregistry      registry.ci.openshift.org                                                                                    image-registry   <all>   reencrypt     None

Comment 13 Wenjing Zheng 2021-03-02 08:06:36 UTC
$ oc extract secrets/default-dockercfg-82d4h --to=- | jq 'keys'
# .dockercfg
[
  "172.30.67.124:5000",
  "default-route-openshift-image-registry.apps.wxj-c2s32.govcloudemu.devcluster.openshift.com",
  "image-registry.openshift-image-registry.svc.cluster.local:5000",
  "image-registry.openshift-image-registry.svc:5000"
]
[wzheng@preserve-docker-slave 4.8]$ oc get routes
Unable to connect to the server: Service Unavailable
[wzheng@preserve-docker-slave 4.8]$ oc get routes
NAME            HOST/PORT                                                                                    PATH   SERVICES         PORT    TERMINATION   WILDCARD
default-route   default-route-openshift-image-registry.apps.wxj-c2s32.govcloudemu.devcluster.openshift.com          image-registry   <all>   reencrypt     None

Verified on 4.8.0-0.nightly-2021-03-01-143026.

Comment 14 Hongkai Liu 2021-04-13 20:14:43 UTC
Thanks for the fix.

oc get clusterversion
NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.7.6     True        False         25m     Cluster version is 4.7.6

oc get secret default-dockercfg-ftz8j -o yaml | yq -r '.data.".dockercfg"' | base64 -d | jq -r '.|keys[]'
172.30.49.128:5000
default-route-openshift-image-registry.apps.ci.l2s4.p1.openshiftapps.com
image-registry.openshift-image-registry.svc.cluster.local:5000
image-registry.openshift-image-registry.svc:5000
registry.ci.openshift.org

Comment 17 errata-xmlrpc 2021-07-27 22:35:38 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.8.2 bug fix and security update), and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:2438