Note: This bug is displayed in read-only format because the product is no longer active in Red Hat Bugzilla.

Bug 2083299

Summary: SRO does not fetch mirrored DTK images in disconnected clusters
Product: OpenShift Container Platform
Reporter: Quentin Barrand <quba>
Component: Special Resource Operator
Assignee: Yoni Bettan <ybettan>
Status: CLOSED ERRATA
QA Contact: Constantin Vultur <cvultur>
Severity: high
Docs Contact:
Priority: high
Version: 4.10
CC: bblock, bthurber, bzvonar, grajaiya, keyoung, mamccoma, mlammon, ybettan, yliu1, yshnaidm
Target Milestone: ---
Keywords: TestBlocker
Target Release: 4.11.0
Flags: ybettan: needinfo-
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version:
Doc Type: No Doc Update
Doc Text:
Story Points: ---
Clone Of:
: 2100039 (view as bug list)
Environment:
Last Closed: 2022-08-10 11:10:43 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On:
Bug Blocks: 2100039

Description Quentin Barrand 2022-05-09 15:59:10 UTC
Description of problem:
Disconnected clusters use an ImageContentSourcePolicy resource to map OCP image names to their mirrors on a custom registry.
The ImageContentSourcePolicy writes the mirroring configuration directly into the container runtime on each node; the redirection is therefore completely transparent to other Kubernetes components, including SRO, which pulls images through its own client rather than through the runtime.
As a result, SRO cannot directly pull the image referenced in the openshift/driver-toolkit ImageStream, which leads to an error.
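
For reference, a minimal ImageContentSourcePolicy of the kind described above might look like the following sketch. The mirror registry hostname is illustrative, not taken from this report; only the source repositories match the ones that appear in the logs below.

```yaml
# Illustrative ImageContentSourcePolicy: redirects pulls of the public
# release/DTK repositories to a mirror on a private registry.
# "mirror.example.com:5000" is a placeholder, not a registry from this bug.
apiVersion: operator.openshift.io/v1alpha1
kind: ImageContentSourcePolicy
metadata:
  name: example-mirror
spec:
  repositoryDigestMirrors:
  - source: quay.io/openshift-release-dev/ocp-v4.0-art-dev
    mirrors:
    - mirror.example.com:5000/openshift-release-dev/ocp-v4.0-art-dev
  - source: quay.io/openshift-release-dev/ocp-release
    mirrors:
    - mirror.example.com:5000/openshift-release-dev/ocp-release
```

The Machine Config Operator renders this resource into registries.conf on each node, which is why the redirection applies to the container runtime but not to components that pull images themselves, such as SRO.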


Version-Release number of selected component (if applicable): 4.10.z


Steps to Reproduce:
1. Create a disconnected cluster with an ImageContentSourcePolicy
2. Deploy simple-kmod

Actual results:
SRO crashes with the following stacktrace:

2022-04-11T09:44:11.766Z        INFO    warning        OnError: cannot extract manifest: GET https://quay.io/v2/openshift-release-dev/ocp-v4.0-art-dev/manifests/sha256:8705564ee52c5b655529860687f73db78daef4eaa54f5fdd3160cfbfc4aac438: UNAUTHORIZED: access to the requested resource is not authorized; map[]
2022-04-11T09:44:11.766Z        ERROR   controller.specialresource      Reconciler error        {"reconciler group": "sro.openshift.io", "reconciler kind": "SpecialResource", "name": "infoscale-vtas", "namespace": "", "error": "RECONCILE ERROR: Cannot upgrade special resource: cannot extract last layer for DTK from quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:8705564ee52c5b655529860687f73db78daef4eaa54f5fdd3160cfbfc4aac438: %!w(<nil>)", "errorVerbose": "cannot extract last layer for DTK from quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:8705564ee52c5b655529860687f73db78daef4eaa54f5fdd3160cfbfc4aac438: %!w(<nil>)\nRECONCILE ERROR: Cannot upgrade special resource\ngithub.com/openshift-psap/special-resource-operator/controllers.(*SpecialResourceReconciler).Reconcile\n\t/workspace/controllers/specialresource_controller.go:83\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Reconcile\n\t/workspace/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:114\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler\n\t/workspace/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:311\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/workspace/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:266\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2\n\t/workspace/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:227\nruntime.goexit\n\t/usr/lib/golang/src/runtime/asm_amd64.s:1581"}
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2
        /workspace/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:227


Expected results:
SRO deploys simple-kmod.

Comment 3 Quentin Barrand 2022-05-11 13:41:13 UTC
During today's troubleshooting session, we understood why SRO was failing when trying to pull the DTK image [1] but not the OpenShift release image [0], which it pulls beforehand.
The OpenShift release image is public and can be pulled from quay.io without authentication, while DTK requires authentication.
It appears that the cluster is in fact not really disconnected and can reach quay.io, which allows SRO to pull public images. However, it fails to pull DTK because no pull secret is configured for quay.io.
We asked the Veritas team to verify their setup and make the cluster actually disconnected while we work on code changes to make SRO aware of image mirroring.

[0] quay.io/openshift-release-dev/ocp-release@sha256:0696e249622b4d07d8f4501504b6c568ed6ba92416176a01a12b7f1882707117
[1] quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:52038f11156a3eac9700be73c6aef7121839727d93a8775c81e17de0ebd15732

Comment 4 yliu1 2022-05-18 22:38:10 UTC
There are 3 things to take into account when dealing with a disconnected registry: the registry certificate, the pull secret, and registries.conf, where image content source policies are reflected.
Quentin, could you please confirm that all of these will be resolved in this BZ? (We encountered a failure with the registry certificate, and would not file a duplicate BZ if this is the place to address all of the above items.)

Comment 5 Quentin Barrand 2022-05-19 09:56:03 UTC
SRO already gets the pull secret. The fix for this BZ will mount the two remaining items you mentioned: registries.conf and custom CAs.
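
The registries.conf mentioned above carries the same mirror mapping in the container runtime's format. A sketch of what such an entry looks like once rendered from an ImageContentSourcePolicy (the mirror location is illustrative, not from this report):

```toml
# Sketch of a containers-registries.conf (v2 format) entry as rendered
# from an ImageContentSourcePolicy; the mirror location is a placeholder.
[[registry]]
  prefix = ""
  location = "quay.io/openshift-release-dev/ocp-v4.0-art-dev"
  # Digest-based mirroring only, matching repositoryDigestMirrors semantics.
  mirror-by-digest-only = true

  [[registry.mirror]]
    location = "mirror.example.com:5000/openshift-release-dev/ocp-v4.0-art-dev"
```

Mounting this file (together with the custom CA bundle) lets SRO resolve the mirrored location the same way the node's container runtime does.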

Comment 6 yliu1 2022-05-19 12:56:52 UTC
Thanks for confirming!

Comment 7 mlammon 2022-05-19 15:22:13 UTC
In another environment we also encountered the following error in the SRO logs. My understanding is that this BZ will resolve this as well.
{"level":"error","ts":1652898031.5360885,"logger":"controller.specialresourcemodule","msg":"Reconciler error","reconciler group":"sro.openshift.io","reconciler kind":"SpecialResourceModule","name":"acm-ice","namespace":"","error":"failed to get OCP versions: could not get version info from image 'registry.ocp-edge-cluster-rdu2-0.qe.lab.redhat.com:5000/openshift-release-dev/ocp-release@sha256:5f6a8321f8abfa06209f7abe22a5c1fc72e3d1269941bdbfbf12ec1791905c4e': failed to get manifest's last layer for image 'registry.ocp-edge-cluster-rdu2-0.qe.lab.redhat.com:5000/openshift-release-dev/ocp-release@sha256:5f6a8321f8abfa06209f7abe22a5c1fc72e3d1269941bdbfbf12ec1791905c4e': failed to get layers digests of the image registry.ocp-edge-cluster-rdu2-0.qe.lab.redhat.com:5000/openshift-release-dev/ocp-release@sha256:5f6a8321f8abfa06209f7abe22a5c1fc72e3d1269941bdbfbf12ec1791905c4e: failed to get manifest stream from image registry.ocp-edge-cluster-rdu2-0.qe.lab.redhat.com:5000/openshift-release-dev/ocp-release@sha256:5f6a8321f8abfa06209f7abe22a5c1fc72e3d1269941bdbfbf12ec1791905c4e: failed to get crane manifest from image registry.ocp-edge-cluster-rdu2-0.qe.lab.redhat.com:5000/openshift-release-dev/ocp-release@sha256:5f6a8321f8abfa06209f7abe22a5c1fc72e3d1269941bdbfbf12ec1791905c4e: Get \"https://registry.ocp-edge-cluster-rdu2-0.qe.lab.redhat.com:5000/v2/\": x509: certificate signed by unknown authority","stacktrace":"sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2\n\t/workspace/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:227"}

Comment 8 Quentin Barrand 2022-05-24 15:05:04 UTC
Work ongoing in https://github.com/openshift/special-resource-operator/pull/212.
Assigning @ybettan as I am going on PTO.

Comment 13 Constantin Vultur 2022-06-22 09:30:50 UTC
Verified with a bundle build in a disconnected environment.

Using custom examples of simple-kmod and acm-ice, the build/daemonset worked fine:

# oc get all -n simple-kmod -o wide
NAME                                                      READY   STATUS      RESTARTS   AGE   IP                NODE                                                  NOMINATED NODE   READINESS GATES
pod/simple-kmod-driver-build-99249f257707c0c3-1-build     0/1     Completed   0          18h   fd01:0:0:1::495   master-0-2.ocp-edge-cluster-hub-0.qe.lab.redhat.com   <none>           <none>
pod/simple-kmod-driver-container-99249f257707c0c3-2ctkz   1/1     Running     0          18h   fd01:0:0:3::9a    master-0-0.ocp-edge-cluster-hub-0.qe.lab.redhat.com   <none>           <none>
pod/simple-kmod-driver-container-99249f257707c0c3-9c9w5   1/1     Running     0          18h   fd01:0:0:1::494   master-0-2.ocp-edge-cluster-hub-0.qe.lab.redhat.com   <none>           <none>
pod/simple-kmod-driver-container-99249f257707c0c3-mvjzh   1/1     Running     0          18h   fd01:0:0:2::87    master-0-1.ocp-edge-cluster-hub-0.qe.lab.redhat.com   <none>           <none>

NAME                                                           DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE SELECTOR                                                                                                 AGE   CONTAINERS                     IMAGES                                                                                                                    SELECTOR
daemonset.apps/simple-kmod-driver-container-99249f257707c0c3   3         3         3       3            3           feature.node.kubernetes.io/kernel-version.full=4.18.0-305.49.1.el8_4.x86_64,node-role.kubernetes.io/worker=   18h   simple-kmod-driver-container   image-registry.openshift-image-registry.svc:5000/simple-kmod/simple-kmod-driver-container:v4.18.0-305.49.1.el8_4.x86_64   app=simple-kmod-driver-container-99249f257707c0c3

NAME                                                                       TYPE     FROM         LATEST
buildconfig.build.openshift.io/simple-kmod-driver-build-99249f257707c0c3   Docker   Dockerfile   1

NAME                                                                   TYPE     FROM         STATUS     STARTED        DURATION
build.build.openshift.io/simple-kmod-driver-build-99249f257707c0c3-1   Docker   Dockerfile   Complete   19 hours ago   1m36s

NAME                                                          IMAGE REPOSITORY                                                                            TAGS                            UPDATED
imagestream.image.openshift.io/simple-kmod-driver-container   image-registry.openshift-image-registry.svc:5000/simple-kmod/simple-kmod-driver-container   v4.18.0-305.49.1.el8_4.x86_64   19 hours ago

acm-ice example:

# oc get all -n acm-ice -o wide
NAME                        READY   STATUS      RESTARTS   AGE    IP                NODE                                                  NOMINATED NODE   READINESS GATES
pod/acm-ice-4-8-2-1-build   0/1     Completed   0          124m   fd01:0:0:1::4de   master-0-2.ocp-edge-cluster-hub-0.qe.lab.redhat.com   <none>           <none>

NAME                                           TYPE     FROM         LATEST
buildconfig.build.openshift.io/acm-ice-4-8-2   Docker   Dockerfile   1

NAME                                       TYPE     FROM         STATUS     STARTED       DURATION
build.build.openshift.io/acm-ice-4-8-2-1   Docker   Dockerfile   Complete   2 hours ago   4m12s

Comment 15 errata-xmlrpc 2022-08-10 11:10:43 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Important: OpenShift Container Platform 4.11.0 bug fix and security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:5069

Comment 16 Red Hat Bugzilla 2023-09-15 01:54:38 UTC
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 365 days