Bug 2083299 - SRO does not fetch mirrored DTK images in disconnected clusters
Summary: SRO does not fetch mirrored DTK images in disconnected clusters
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Special Resource Operator
Version: 4.10
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: ---
Target Release: 4.11.0
Assignee: Yoni Bettan
QA Contact: Constantin Vultur
URL:
Whiteboard:
Depends On:
Blocks: 2100039
Reported: 2022-05-09 15:59 UTC by Quentin Barrand
Modified: 2023-09-15 01:54 UTC

Fixed In Version:
Doc Type: No Doc Update
Doc Text:
Clone Of:
: 2100039 (view as bug list)
Environment:
Last Closed: 2022-08-10 11:10:43 UTC
Target Upstream Version:
Embargoed:
ybettan: needinfo-




Links
System ID Private Priority Status Summary Last Updated
Github openshift special-resource-operator pull 212 0 None closed Draft: use host config & CA certificates when pulling images 2022-06-22 12:49:35 UTC
Github openshift special-resource-operator pull 226 0 None Merged Bug 2083299: Adding support for disconnected clusters. 2022-06-22 12:49:38 UTC
Red Hat Product Errata RHSA-2022:5069 0 None None None 2022-08-10 11:10:59 UTC

Description Quentin Barrand 2022-05-09 15:59:10 UTC
Description of problem:
Disconnected clusters use an ImageContentSourcePolicy resource to map OCP image names to their mirrors on a custom registry.
The ImageContentSourcePolicy writes the mirroring configuration directly for the container runtimes on the nodes; the redirection happens inside the runtime and is not visible to other Kubernetes components, including SRO.
This means that SRO tries to pull the image referenced in the openshift/driver-toolkit ImageStream from its original location instead of the mirror, which leads to an error.
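
To illustrate the redirection that SRO is missing, here is a minimal Python sketch of how a mirror-aware client could rewrite a digest reference before pulling. It is not SRO's code: the helper name resolve_mirrors, the mirror host mirror.example.com:5000 and the mirror repository path are assumptions made for the example, and the matching rule (longest source prefix, digest references only, original location kept as a last fallback) is a simplified model of what the node runtime does with the generated configuration.

```python
from typing import Dict, List


def _matches(image: str, source: str) -> bool:
    """True if `source` is a repository prefix of `image` (on a /, : or @ boundary)."""
    if not image.startswith(source):
        return False
    rest = image[len(source):]
    return rest == "" or rest[0] in "/:@"


def resolve_mirrors(image: str, mirrors: Dict[str, List[str]]) -> List[str]:
    """Return candidate pull locations: mirrors first, original location last.

    `mirrors` maps a source repository to its mirror repositories, like the
    repositoryDigestMirrors entries of an ImageContentSourcePolicy.
    """
    # Assumption: only digest references are redirected, as with ICSP.
    if "@sha256:" not in image:
        return [image]

    # Pick the longest matching source prefix.
    best = None
    for source in mirrors:
        if _matches(image, source) and (best is None or len(source) > len(best)):
            best = source
    if best is None:
        return [image]

    suffix = image[len(best):]
    return [mirror + suffix for mirror in mirrors[best]] + [image]


if __name__ == "__main__":
    # Hypothetical mirror mapping; the source repository is the one from this report.
    mapping = {
        "quay.io/openshift-release-dev/ocp-v4.0-art-dev": [
            "mirror.example.com:5000/ocp4/release"
        ]
    }
    dtk = (
        "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:"
        "8705564ee52c5b655529860687f73db78daef4eaa54f5fdd3160cfbfc4aac438"
    )
    for candidate in resolve_mirrors(dtk, mapping):
        print(candidate)
```

A client that skips this step goes straight to quay.io, which matches the GET request seen in the log under "Actual results".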


Version-Release number of selected component (if applicable): 4.10.z


Steps to Reproduce:
1. Create a disconnected cluster with an ImageContentSourcePolicy
2. Deploy simple-kmod

Actual results:
SRO crashes with the following stacktrace:

2022-04-11T09:44:11.766Z        INFO    warning          OnError: cannot extract manifest: GET https://quay.io/v2/openshift-release-dev/ocp-v4.0-art-dev/manifests/sha256:8705564ee52c5b655529860687f73db78daef4eaa54f5fdd3160cfbfc4aac438: UNAUTHORIZED: access to the requested resource is not authorized; map[]
2022-04-11T09:44:11.766Z        ERROR   controller.specialresource      Reconciler error        {"reconciler group": "sro.openshift.io", "reconciler kind": "SpecialResource", "name": "infoscale-vtas", "namespace": "", "error": "RECONCILE ERROR: Cannot upgrade special resource: cannot extract last layer for DTK from quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:8705564ee52c5b655529860687f73db78daef4eaa54f5fdd3160cfbfc4aac438: %!w(<nil>)", "errorVerbose": "cannot extract last layer for DTK from quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:8705564ee52c5b655529860687f73db78daef4eaa54f5fdd3160cfbfc4aac438: %!w(<nil>)\nRECONCILE ERROR: Cannot upgrade special resource\ngithub.com/openshift-psap/special-resource-operator/controllers.(*SpecialResourceReconciler).Reconcile\n\t/workspace/controllers/specialresource_controller.go:83\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Reconcile\n\t/workspace/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:114\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler\n\t/workspace/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:311\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/workspace/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:266\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2\n\t/workspace/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:227\nruntime.goexit\n\t/usr/lib/golang/src/runtime/asm_amd64.s:1581"}
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2
        /workspace/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:227


Expected results:
SRO deploys simple-kmod.

Comment 3 Quentin Barrand 2022-05-11 13:41:13 UTC
During today's troubleshooting session, we understood why SRO was failing when trying to pull the DTK image [1] and not the OpenShift Release Image [0], which happens beforehand.
The OpenShift release image is public and can be pulled from quay.io without authentication, while DTK requires authentication.
It appears that the cluster is in fact not really disconnected and can reach quay.io, which makes SRO able to pull public images. However, it fails to pull DTK as no pull secret is configured for quay.io.
We asked the Veritas team to verify their setup and make the cluster actually disconnected as we work on code changes to make SRO aware of image mirroring.

[0] quay.io/openshift-release-dev/ocp-release@sha256:0696e249622b4d07d8f4501504b6c568ed6ba92416176a01a12b7f1882707117
[1] quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:52038f11156a3eac9700be73c6aef7121839727d93a8775c81e17de0ebd15732
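
The behaviour described in this comment can be pictured with a small Python sketch that looks up credentials for a registry in a pull secret using the standard .dockerconfigjson layout. The function name credentials_for, the mirror host and the user/password values are made up for the example; real clients also match repository-scoped keys, which this simplified lookup does not attempt.

```python
import base64
import json
from typing import Optional, Tuple


def credentials_for(registry: str, dockerconfigjson: str) -> Optional[Tuple[str, str]]:
    """Return (user, password) for `registry`, or None if the secret has no entry.

    Simplification: only an exact match on the registry host is attempted.
    """
    auths = json.loads(dockerconfigjson).get("auths", {})
    entry = auths.get(registry)
    if entry is None or "auth" not in entry:
        return None
    # The "auth" field is base64("user:password").
    user, _, password = base64.b64decode(entry["auth"]).decode().partition(":")
    return user, password


if __name__ == "__main__":
    # Hypothetical pull secret that only knows the mirror registry.
    secret = json.dumps({
        "auths": {
            "mirror.example.com:5000": {
                "auth": base64.b64encode(b"user:pass").decode()
            }
        }
    })
    print(credentials_for("mirror.example.com:5000", secret))
    print(credentials_for("quay.io", secret))
```

With a secret like this one, a request to quay.io is sent anonymously: that works for the public release image [0] and is rejected for the DTK image [1].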

Comment 4 yliu1 2022-05-18 22:38:10 UTC
There are 3 things to take into account when dealing with a disconnected registry: the registry cert, the pull secret, and registries.conf, where image content source policies are reflected.
Quentin, could you please confirm that all of these will be resolved in this BZ? (We encountered a failure with the registry cert and will not file a duplicate BZ if this is the place to address all of the above items.)

Comment 5 Quentin Barrand 2022-05-19 09:56:03 UTC
SRO already gets the pull secret. The fix for this BZ will mount the two remaining items you mentioned: registries.conf and custom CAs.

Comment 6 yliu1 2022-05-19 12:56:52 UTC
Thanks for confirming!

Comment 7 mlammon 2022-05-19 15:22:13 UTC
In another environment we also encountered the following error in the SRO logs. My understanding is that this BZ will resolve it as well.
{"level":"error","ts":1652898031.5360885,"logger":"controller.specialresourcemodule","msg":"Reconciler error","reconciler group":"sro.openshift.io","reconciler kind":"SpecialResourceModule","name":"acm-ice","namespace":"","error":"failed to get OCP versions: could not get version info from image 'registry.ocp-edge-cluster-rdu2-0.qe.lab.redhat.com:5000/openshift-release-dev/ocp-release@sha256:5f6a8321f8abfa06209f7abe22a5c1fc72e3d1269941bdbfbf12ec1791905c4e': failed to get manifest's last layer for image 'registry.ocp-edge-cluster-rdu2-0.qe.lab.redhat.com:5000/openshift-release-dev/ocp-release@sha256:5f6a8321f8abfa06209f7abe22a5c1fc72e3d1269941bdbfbf12ec1791905c4e': failed to get layers digests of the image registry.ocp-edge-cluster-rdu2-0.qe.lab.redhat.com:5000/openshift-release-dev/ocp-release@sha256:5f6a8321f8abfa06209f7abe22a5c1fc72e3d1269941bdbfbf12ec1791905c4e: failed to get manifest stream from image registry.ocp-edge-cluster-rdu2-0.qe.lab.redhat.com:5000/openshift-release-dev/ocp-release@sha256:5f6a8321f8abfa06209f7abe22a5c1fc72e3d1269941bdbfbf12ec1791905c4e: failed to get crane manifest from image registry.ocp-edge-cluster-rdu2-0.qe.lab.redhat.com:5000/openshift-release-dev/ocp-release@sha256:5f6a8321f8abfa06209f7abe22a5c1fc72e3d1269941bdbfbf12ec1791905c4e: Get \"https://registry.ocp-edge-cluster-rdu2-0.qe.lab.redhat.com:5000/v2/\": x509: certificate signed by unknown authority","stacktrace":"sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2\n\t/workspace/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:227"}
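
The "x509: certificate signed by unknown authority" error above is what a registry client reports when the mirror's CA is not in its trust store. As a rough Python sketch of the remedy discussed in this bug (making the custom CA available to the client), the helper below adds an extra CA bundle on top of the system trust store. The function name build_tls_context and the bundle path are assumptions for the example and do not reflect how SRO mounts or loads its certificates.

```python
import os
import ssl
from typing import Optional


def build_tls_context(ca_bundle: Optional[str] = None) -> ssl.SSLContext:
    """Return a verifying TLS context, optionally trusting an extra CA bundle."""
    # Start from the system trust store with verification enabled.
    ctx = ssl.create_default_context()
    # Add the custom CA only if the file is actually present (for example,
    # mounted into the pod); otherwise keep the defaults.
    if ca_bundle and os.path.isfile(ca_bundle):
        ctx.load_verify_locations(cafile=ca_bundle)
    return ctx


if __name__ == "__main__":
    # Hypothetical mount path for the cluster's additional trusted CAs.
    context = build_tls_context("/etc/pki/custom-ca/ca-bundle.crt")
    print(context.verify_mode, context.check_hostname)
```

Certificate verification stays enabled in both cases; the sketch only widens the set of trusted authorities rather than disabling checks.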

Comment 8 Quentin Barrand 2022-05-24 15:05:04 UTC
Work ongoing in https://github.com/openshift/special-resource-operator/pull/212.
Assigning to @ybettan as I am going on PTO.

Comment 13 Constantin Vultur 2022-06-22 09:30:50 UTC
Verified with a bundle build in a disconnected environment.

Using custom examples of simple-kmod and acm-ice, the build/daemonset worked fine:

# oc get all -n simple-kmod -o wide
NAME                                                      READY   STATUS      RESTARTS   AGE   IP                NODE                                                  NOMINATED NODE   READINESS GATES
pod/simple-kmod-driver-build-99249f257707c0c3-1-build     0/1     Completed   0          18h   fd01:0:0:1::495   master-0-2.ocp-edge-cluster-hub-0.qe.lab.redhat.com   <none>           <none>
pod/simple-kmod-driver-container-99249f257707c0c3-2ctkz   1/1     Running     0          18h   fd01:0:0:3::9a    master-0-0.ocp-edge-cluster-hub-0.qe.lab.redhat.com   <none>           <none>
pod/simple-kmod-driver-container-99249f257707c0c3-9c9w5   1/1     Running     0          18h   fd01:0:0:1::494   master-0-2.ocp-edge-cluster-hub-0.qe.lab.redhat.com   <none>           <none>
pod/simple-kmod-driver-container-99249f257707c0c3-mvjzh   1/1     Running     0          18h   fd01:0:0:2::87    master-0-1.ocp-edge-cluster-hub-0.qe.lab.redhat.com   <none>           <none>

NAME                                                           DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE SELECTOR                                                                                                 AGE   CONTAINERS                     IMAGES                                                                                                                    SELECTOR
daemonset.apps/simple-kmod-driver-container-99249f257707c0c3   3         3         3       3            3           feature.node.kubernetes.io/kernel-version.full=4.18.0-305.49.1.el8_4.x86_64,node-role.kubernetes.io/worker=   18h   simple-kmod-driver-container   image-registry.openshift-image-registry.svc:5000/simple-kmod/simple-kmod-driver-container:v4.18.0-305.49.1.el8_4.x86_64   app=simple-kmod-driver-container-99249f257707c0c3

NAME                                                                       TYPE     FROM         LATEST
buildconfig.build.openshift.io/simple-kmod-driver-build-99249f257707c0c3   Docker   Dockerfile   1

NAME                                                                   TYPE     FROM         STATUS     STARTED        DURATION
build.build.openshift.io/simple-kmod-driver-build-99249f257707c0c3-1   Docker   Dockerfile   Complete   19 hours ago   1m36s

NAME                                                          IMAGE REPOSITORY                                                                            TAGS                            UPDATED
imagestream.image.openshift.io/simple-kmod-driver-container   image-registry.openshift-image-registry.svc:5000/simple-kmod/simple-kmod-driver-container   v4.18.0-305.49.1.el8_4.x86_64   19 hours ago

acm-ice example

# oc get all -n acm-ice -o wide
NAME                        READY   STATUS      RESTARTS   AGE    IP                NODE                                                  NOMINATED NODE   READINESS GATES
pod/acm-ice-4-8-2-1-build   0/1     Completed   0          124m   fd01:0:0:1::4de   master-0-2.ocp-edge-cluster-hub-0.qe.lab.redhat.com   <none>           <none>

NAME                                           TYPE     FROM         LATEST
buildconfig.build.openshift.io/acm-ice-4-8-2   Docker   Dockerfile   1

NAME                                       TYPE     FROM         STATUS     STARTED       DURATION
build.build.openshift.io/acm-ice-4-8-2-1   Docker   Dockerfile   Complete   2 hours ago   4m12s

Comment 15 errata-xmlrpc 2022-08-10 11:10:43 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Important: OpenShift Container Platform 4.11.0 bug fix and security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:5069

Comment 16 Red Hat Bugzilla 2023-09-15 01:54:38 UTC
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 365 days

