Description of problem:

Disconnected clusters use an ImageContentSourcePolicy resource to map OCP image names to their mirrors on a custom registry. The ImageContentSourcePolicy writes the mirroring configuration for the container runtimes on the nodes directly; the redirection is completely transparent to other Kubernetes components, including SRO. This means that the image referenced in the openshift/driver-toolkit ImageStream cannot be pulled directly by SRO, which leads to an error.

Version-Release number of selected component (if applicable):

4.10.z

Steps to Reproduce:

1. Create a disconnected cluster with an ImageContentSourcePolicy
2. Deploy simple-kmod

Actual results:

SRO crashes with the following stack trace:

2022-04-11T09:44:11.766Z    INFO    warning    OnError: cannot extract manifest: GET https://quay.io/v2/openshift-release-dev/ocp-v4.0-art-dev/manifests/sha256:8705564ee52c5b655529860687f73db78daef4eaa54f5fdd3160cfbfc4aac438: UNAUTHORIZED: access to the requested resource is not authorized; map[]
2022-04-11T09:44:11.766Z    ERROR    controller.specialresource    Reconciler error    {"reconciler group": "sro.openshift.io", "reconciler kind": "SpecialResource", "name": "infoscale-vtas", "namespace": "", "error": "RECONCILE ERROR: Cannot upgrade special resource: cannot extract last layer for DTK from quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:8705564ee52c5b655529860687f73db78daef4eaa54f5fdd3160cfbfc4aac438: %!w(<nil>)", "errorVerbose": "cannot extract last layer for DTK from quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:8705564ee52c5b655529860687f73db78daef4eaa54f5fdd3160cfbfc4aac438: %!w(<nil>)\nRECONCILE ERROR: Cannot upgrade special resource\ngithub.com/openshift-psap/special-resource-operator/controllers.(*SpecialResourceReconciler).Reconcile\n\t/workspace/controllers/specialresource_controller.go:83\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Reconcile\n\t/workspace/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:114\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler\n\t/workspace/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:311\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/workspace/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:266\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2\n\t/workspace/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:227\nruntime.goexit\n\t/usr/lib/golang/src/runtime/asm_amd64.s:1581"}
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2
	/workspace/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:227

Expected results:

SRO deploys simple-kmod.
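To illustrate where the redirection breaks down, here is a minimal Go sketch of what SRO effectively does (not SRO's actual code; only the image reference is taken from the log above). It fetches the manifest with crane, which, unlike the container runtime on the nodes, does not consult registries.conf, so the ICSP mirror is never applied:

// Minimal sketch: fetching the DTK manifest the way SRO reaches out to the
// registry. The reference comes from the openshift/driver-toolkit ImageStream
// and still points at quay.io, not at the mirror configured by the ICSP.
package main

import (
	"fmt"
	"log"

	"github.com/google/go-containerregistry/pkg/crane"
)

func main() {
	dtk := "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:8705564ee52c5b655529860687f73db78daef4eaa54f5fdd3160cfbfc4aac438"

	// On a truly disconnected cluster this request cannot reach quay.io at all;
	// on a semi-connected one it fails with UNAUTHORIZED, as in the log above.
	manifest, err := crane.Manifest(dtk)
	if err != nil {
		log.Fatalf("cannot extract manifest: %v", err)
	}
	fmt.Printf("manifest: %d bytes\n", len(manifest))
}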
During today's troubleshooting session, we understood why SRO was failing when trying to pull the DTK image [1] but not the OpenShift release image [0], which is pulled beforehand. The OpenShift release image is public and can be pulled from quay.io without authentication, while the DTK image requires authentication. It appears that the cluster is in fact not really disconnected and can reach quay.io, which allows SRO to pull public images. However, it fails to pull the DTK because no pull secret is configured for quay.io. We asked the Veritas team to verify their setup and make the cluster actually disconnected while we work on code changes to make SRO aware of image mirroring.

[0] quay.io/openshift-release-dev/ocp-release@sha256:0696e249622b4d07d8f4501504b6c568ed6ba92416176a01a12b7f1882707117
[1] quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:52038f11156a3eac9700be73c6aef7121839727d93a8775c81e17de0ebd15732
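For illustration, a minimal sketch of the missing piece on that cluster: fetching the DTK manifest with credentials from the cluster pull secret instead of anonymously. The mount path and secret layout here are assumptions for the example, not SRO's actual code:

// Minimal sketch: read a .dockerconfigjson pull secret and use its quay.io
// credentials for the manifest fetch that fails anonymously.
package main

import (
	"encoding/base64"
	"encoding/json"
	"log"
	"os"
	"strings"

	"github.com/google/go-containerregistry/pkg/authn"
	"github.com/google/go-containerregistry/pkg/crane"
)

// dockerConfig mirrors the .dockerconfigjson layout of the cluster pull secret.
type dockerConfig struct {
	Auths map[string]struct {
		Auth string `json:"auth"` // base64-encoded "user:password"
	} `json:"auths"`
}

func main() {
	// Assumed mount point of the pull secret inside the operator pod.
	raw, err := os.ReadFile("/var/run/secrets/pull-secret/.dockerconfigjson")
	if err != nil {
		log.Fatal(err)
	}
	var cfg dockerConfig
	if err := json.Unmarshal(raw, &cfg); err != nil {
		log.Fatal(err)
	}

	userpass, err := base64.StdEncoding.DecodeString(cfg.Auths["quay.io"].Auth)
	if err != nil {
		log.Fatal(err)
	}
	parts := strings.SplitN(string(userpass), ":", 2)
	if len(parts) != 2 {
		log.Fatal("unexpected auth entry format for quay.io")
	}

	dtk := "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:52038f11156a3eac9700be73c6aef7121839727d93a8775c81e17de0ebd15732"
	auth := authn.FromConfig(authn.AuthConfig{Username: parts[0], Password: parts[1]})

	if _, err := crane.Manifest(dtk, crane.WithAuth(auth)); err != nil {
		log.Fatalf("authenticated manifest fetch failed: %v", err)
	}
	log.Println("DTK manifest fetched with pull-secret credentials")
}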
There are three things to take into account when dealing with a disconnected registry: the registry cert, the pull secret, and registries.conf, where the image content source policies are reflected. Quentin, could you please confirm that all of these will be resolved in this BZ? (We encountered a failure with the registry cert, and would not want to file a duplicate BZ if this is the place to address all of the above items.)
SRO already gets the pull secret. The fix for this BZ will mount the two remaining items you mentioned: registries.conf and custom CAs.
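For reference, a minimal sketch of what mounting registries.conf makes possible: resolving the ICSP-generated mirror for an image reference before pulling it. The library choice (containers/image sysregistriesv2) and the file path are assumptions for the example, not necessarily what the PR does:

// Minimal sketch: look up the mirror locations that the ImageContentSourcePolicy
// wrote into registries.conf for a given image reference.
package main

import (
	"fmt"
	"log"

	"github.com/containers/image/v5/pkg/sysregistriesv2"
	"github.com/containers/image/v5/types"
)

func main() {
	// Assumed mount point of the node's registries.conf inside the SRO pod.
	sys := &types.SystemContext{
		SystemRegistriesConfPath: "/etc/containers/registries.conf",
	}

	dtk := "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:8705564ee52c5b655529860687f73db78daef4eaa54f5fdd3160cfbfc4aac438"

	// FindRegistry returns the [[registry]] entry with the longest matching
	// prefix; its Mirrors come from the ImageContentSourcePolicy.
	reg, err := sysregistriesv2.FindRegistry(sys, dtk)
	if err != nil {
		log.Fatal(err)
	}
	if reg == nil || len(reg.Mirrors) == 0 {
		fmt.Println("no mirror configured, pulling from the original location")
		return
	}
	for _, m := range reg.Mirrors {
		// Each mirror location would be tried before falling back to quay.io.
		fmt.Printf("candidate mirror location: %s\n", m.Location)
	}
}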
Thanks for confirming!
In another environment we also encountered the following error in the SRO logs. My understanding is that this BZ will resolve it as well.

{"level":"error","ts":1652898031.5360885,"logger":"controller.specialresourcemodule","msg":"Reconciler error","reconciler group":"sro.openshift.io","reconciler kind":"SpecialResourceModule","name":"acm-ice","namespace":"","error":"failed to get OCP versions: could not get version info from image 'registry.ocp-edge-cluster-rdu2-0.qe.lab.redhat.com:5000/openshift-release-dev/ocp-release@sha256:5f6a8321f8abfa06209f7abe22a5c1fc72e3d1269941bdbfbf12ec1791905c4e': failed to get manifest's last layer for image 'registry.ocp-edge-cluster-rdu2-0.qe.lab.redhat.com:5000/openshift-release-dev/ocp-release@sha256:5f6a8321f8abfa06209f7abe22a5c1fc72e3d1269941bdbfbf12ec1791905c4e': failed to get layers digests of the image registry.ocp-edge-cluster-rdu2-0.qe.lab.redhat.com:5000/openshift-release-dev/ocp-release@sha256:5f6a8321f8abfa06209f7abe22a5c1fc72e3d1269941bdbfbf12ec1791905c4e: failed to get manifest stream from image registry.ocp-edge-cluster-rdu2-0.qe.lab.redhat.com:5000/openshift-release-dev/ocp-release@sha256:5f6a8321f8abfa06209f7abe22a5c1fc72e3d1269941bdbfbf12ec1791905c4e: failed to get crane manifest from image registry.ocp-edge-cluster-rdu2-0.qe.lab.redhat.com:5000/openshift-release-dev/ocp-release@sha256:5f6a8321f8abfa06209f7abe22a5c1fc72e3d1269941bdbfbf12ec1791905c4e: Get \"https://registry.ocp-edge-cluster-rdu2-0.qe.lab.redhat.com:5000/v2/\": x509: certificate signed by unknown authority","stacktrace":"sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2\n\t/workspace/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:227"}
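That error is the custom-CA half of the same problem. Again a minimal sketch with an assumed CA bundle path, not the actual fix: trust the mirror registry's CA for the manifest fetch so the "x509: certificate signed by unknown authority" failure above goes away.

// Minimal sketch: add the mirror registry's CA to the trust pool used by crane.
package main

import (
	"crypto/tls"
	"crypto/x509"
	"log"
	"net/http"
	"os"

	"github.com/google/go-containerregistry/pkg/crane"
)

func main() {
	// Assumed mount point of the additional trusted CA bundle inside the pod.
	caPEM, err := os.ReadFile("/etc/pki/ca-trust/source/anchors/mirror-registry-ca.pem")
	if err != nil {
		log.Fatal(err)
	}

	pool, err := x509.SystemCertPool()
	if err != nil {
		pool = x509.NewCertPool()
	}
	if !pool.AppendCertsFromPEM(caPEM) {
		log.Fatal("no certificates found in CA bundle")
	}

	transport := &http.Transport{
		TLSClientConfig: &tls.Config{RootCAs: pool},
	}

	mirrored := "registry.ocp-edge-cluster-rdu2-0.qe.lab.redhat.com:5000/openshift-release-dev/ocp-release@sha256:5f6a8321f8abfa06209f7abe22a5c1fc72e3d1269941bdbfbf12ec1791905c4e"
	if _, err := crane.Manifest(mirrored, crane.WithTransport(transport)); err != nil {
		log.Fatalf("manifest fetch still failing: %v", err)
	}
	log.Println("mirror registry reachable with the custom CA trusted")
}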
Work ongoing in https://github.com/openshift/special-resource-operator/pull/212. Assigning @ybettan as I am going on PTO.
Verified with a bundle build in a disconnected environment. Using custom examples of simple-kmod and acm-ice, the build/DaemonSet worked fine:

# oc get all -n simple-kmod -o wide
NAME                                                   READY   STATUS      RESTARTS   AGE   IP                NODE                                                  NOMINATED NODE   READINESS GATES
pod/simple-kmod-driver-build-99249f257707c0c3-1-build     0/1   Completed   0          18h   fd01:0:0:1::495   master-0-2.ocp-edge-cluster-hub-0.qe.lab.redhat.com   <none>           <none>
pod/simple-kmod-driver-container-99249f257707c0c3-2ctkz   1/1   Running     0          18h   fd01:0:0:3::9a    master-0-0.ocp-edge-cluster-hub-0.qe.lab.redhat.com   <none>           <none>
pod/simple-kmod-driver-container-99249f257707c0c3-9c9w5   1/1   Running     0          18h   fd01:0:0:1::494   master-0-2.ocp-edge-cluster-hub-0.qe.lab.redhat.com   <none>           <none>
pod/simple-kmod-driver-container-99249f257707c0c3-mvjzh   1/1   Running     0          18h   fd01:0:0:2::87    master-0-1.ocp-edge-cluster-hub-0.qe.lab.redhat.com   <none>           <none>

NAME                                                           DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE SELECTOR                                                                                                  AGE   CONTAINERS                     IMAGES                                                                                                                     SELECTOR
daemonset.apps/simple-kmod-driver-container-99249f257707c0c3   3         3         3       3            3           feature.node.kubernetes.io/kernel-version.full=4.18.0-305.49.1.el8_4.x86_64,node-role.kubernetes.io/worker=   18h   simple-kmod-driver-container   image-registry.openshift-image-registry.svc:5000/simple-kmod/simple-kmod-driver-container:v4.18.0-305.49.1.el8_4.x86_64   app=simple-kmod-driver-container-99249f257707c0c3

NAME                                                                        TYPE     FROM         LATEST
buildconfig.build.openshift.io/simple-kmod-driver-build-99249f257707c0c3   Docker   Dockerfile   1

NAME                                                                    TYPE     FROM         STATUS     STARTED        DURATION
build.build.openshift.io/simple-kmod-driver-build-99249f257707c0c3-1   Docker   Dockerfile   Complete   19 hours ago   1m36s

NAME                                                           IMAGE REPOSITORY                                                                             TAGS                            UPDATED
imagestream.image.openshift.io/simple-kmod-driver-container   image-registry.openshift-image-registry.svc:5000/simple-kmod/simple-kmod-driver-container   v4.18.0-305.49.1.el8_4.x86_64   19 hours ago

acm-ice example:

# oc get all -n acm-ice -o wide
NAME                        READY   STATUS      RESTARTS   AGE    IP                NODE                                                  NOMINATED NODE   READINESS GATES
pod/acm-ice-4-8-2-1-build   0/1     Completed   0          124m   fd01:0:0:1::4de   master-0-2.ocp-edge-cluster-hub-0.qe.lab.redhat.com   <none>           <none>

NAME                                           TYPE     FROM         LATEST
buildconfig.build.openshift.io/acm-ice-4-8-2   Docker   Dockerfile   1

NAME                                       TYPE     FROM         STATUS     STARTED       DURATION
build.build.openshift.io/acm-ice-4-8-2-1   Docker   Dockerfile   Complete   2 hours ago   4m12s
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Important: OpenShift Container Platform 4.11.0 bug fix and security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2022:5069
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 365 days.