Description of problem:
If any catalogsource in openshift-marketplace is in a bad state, all operator installation via OLM is blocked.

Version-Release number of selected component (if applicable):
4.10 and 4.8 OSD clusters

How reproducible:
100%

Steps to Reproduce:
1. Break one of the catalogsources in openshift-marketplace.
2. Create a custom catalogsource in a new namespace.
3. Create a subscription with `source` and `sourceNamespace` set for the custom catalogsource.

Actual results:
Operator installation does not happen.

Expected results:
Operator is installed.

Additional info:
Was seen on ROSA/OSD over the weekend of April 14/17 when the image signing component of registry.redhat.io experienced a few hours of outage. All ROSA/OSD cluster installs were impacted. The workaround was to delete or disable the broken catalogsources in openshift-marketplace to allow other operators to install.

Note that there are no dependencies between operators, all custom catalogsources are outside openshift-marketplace, and all subscriptions have source and sourceNamespace set.
Setting this as not a blocker, since it's working as designed. However, we should still aim to improve the UX for this use-case.
Summary of the path forward:

From the OLM side we feel strongly about the promise we make users about the determinism of the resolver. This is why we fail resolution when a catalog source cannot be reached. Rolling back on this could lead to confusion for admins and a large blast radius for problems.

To mitigate the issue above, we suggest adding a mechanism that allows certain namespaces to opt out of using the global catalogs during resolution. This should make it easier for non-CVO-managed namespaces to rely solely on the catalog sources they provide.

Use-cases:

1. Self-management of operators through local catalog sources
In this case, the admin provides all operators in locally namespaced catalog sources. Resolution will be robust to global catalog source failures by ignoring them entirely. Local catalog source errors will still be surfaced and affect resolution.

2. Self-managed + global catalog sources
In this case, if you depend on global catalog sources and there's an issue with them, resolution will fail. This guards against non-deterministic resolution and guarantees to admins that the intended operator will be used independently of the underlying network conditions.

Back-portability:

Since we don't backport API changes, we propose the following compromise:
For OCP versions <= 4.10: the admin can add an annotation (olm.operatorframework.io/exclude-global-catalog-resolution) to the namespace's operator group.
For OCP versions >= 4.11: the OperatorGroup API will include a toggle excludeGlobalCatalogResolution = true | false

P.S. I need to double check the versions. It may well be that in 4.11 we only use the annotation as well and push the OG changes to 4.12.
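The determinism argument above can be illustrated with a toy model. This is only a sketch and not OLM's resolver: the `Catalog` type, both `resolve_*` functions, and the "pick the highest version on offer" rule are assumptions invented for illustration. It shows why silently skipping an unreachable catalog can change which operator version gets installed, while failing resolution cannot.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


@dataclass(frozen=True)
class Catalog:
    """Hypothetical stand-in for a CatalogSource; not an OLM type."""
    name: str
    reachable: bool
    packages: Dict[str, str]  # package name -> version offered


def _version_key(version: str) -> Tuple[int, ...]:
    # Compare dotted versions numerically, e.g. "0.9.4" -> (0, 9, 4).
    return tuple(int(part) for part in version.split("."))


def _best(package: str, catalogs: List[Catalog]) -> Optional[str]:
    candidates = [c.packages[package] for c in catalogs if package in c.packages]
    return max(candidates, key=_version_key) if candidates else None


def resolve_skipping_unreachable(package: str, catalogs: List[Catalog]) -> Optional[str]:
    """Lenient policy: ignore catalogs that cannot be reached."""
    return _best(package, [c for c in catalogs if c.reachable])


def resolve_fail_closed(package: str, catalogs: List[Catalog]) -> Optional[str]:
    """Strict policy, as described above: any unreachable catalog fails resolution."""
    unreachable = [c.name for c in catalogs if not c.reachable]
    if unreachable:
        raise RuntimeError(f"catalogs unreachable: {unreachable}")
    return _best(package, catalogs)


if __name__ == "__main__":
    healthy = [Catalog("global", True, {"etcd": "0.9.4"}),
               Catalog("local", True, {"etcd": "0.9.2"})]
    outage = [Catalog("global", False, {"etcd": "0.9.4"}),
              Catalog("local", True, {"etcd": "0.9.2"})]
    print(resolve_skipping_unreachable("etcd", healthy))  # 0.9.4
    print(resolve_skipping_unreachable("etcd", outage))   # 0.9.2
```

Under the lenient policy the same subscription yields a different version during an outage. Under the strict policy the outcome is either the same version or an explicit error.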
Moving to assigned as this is currently in-progress.
*** Bug 2048197 has been marked as a duplicate of this bug. ***
Upstream PR has merged: https://github.com/operator-framework/operator-lifecycle-manager/pull/2788 This should get pulled in during the next downstream sync. The operatorgroup annotation key is olm.operatorframework.io/exclude-global-namespace-resolution and setting the value to "true" will cause resolution to exclude global catalogs in that namespace.
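For reference, the annotation goes under metadata.annotations of the OperatorGroup. Below is a minimal sketch that builds such a manifest as JSON. The helper name `operator_group_manifest`, the OperatorGroup name `my-og`, and the namespace `my-namespace` are placeholders chosen for illustration, not values from this bug.

```python
import json

EXCLUDE_GLOBAL_ANNOTATION = "olm.operatorframework.io/exclude-global-namespace-resolution"


def operator_group_manifest(name: str, namespace: str, exclude_global: bool = True) -> dict:
    """Build an OperatorGroup manifest that targets its own namespace (hypothetical helper)."""
    manifest = {
        "apiVersion": "operators.coreos.com/v1",
        "kind": "OperatorGroup",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {"targetNamespaces": [namespace]},
    }
    if exclude_global:
        # Annotation values are strings, so the flag is written as "true".
        manifest["metadata"]["annotations"] = {EXCLUDE_GLOBAL_ANNOTATION: "true"}
    return manifest


if __name__ == "__main__":
    # Placeholder names; substitute the real OperatorGroup name and namespace.
    print(json.dumps(operator_group_manifest("my-og", "my-namespace"), indent=2))
```

The printed JSON can be saved to a file and applied with `oc apply -f`, which accepts JSON as well as YAML.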
1, Build a cluster that contains the fixed PR via cluster-bot.
mac:~ jianzhang$ oc get clusterversion
NAME      VERSION                                                   AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.11.0-0.ci.test-2022-06-22-084806-ci-ln-mmpd1lk-latest   True        False         12m     Cluster version is 4.11.0-0.ci.test-2022-06-22-084806-ci-ln-mmpd1lk-latest

2, Install a bad CatalogSource in the openshift-marketplace project.
mac:~ jianzhang$ oc create -f cs-qe.yaml
catalogsource.operators.coreos.com/qe-app-registry created
mac:~ jianzhang$ cat ~/cs-qe.yaml
apiVersion: operators.coreos.com/v1alpha1
kind: CatalogSource
metadata:
  name: qe-app-registry
  namespace: openshift-marketplace
spec:
  displayName: Production Operators
  image: quay.io/openshift-qe-optional-operators/ocp4-index:latest
  publisher: OpenShift QE
  sourceType: grpc
  updateStrategy:
    registryPoll:
      interval: 15m
mac:~ jianzhang$ oc get catalogsource
NAME                  DISPLAY                TYPE   PUBLISHER      AGE
certified-operators   Certified Operators    grpc   Red Hat        32m
community-operators   Community Operators    grpc   Red Hat        32m
qe-app-registry       Production Operators   grpc   OpenShift QE   61s
redhat-marketplace    Red Hat Marketplace    grpc   Red Hat        32m
redhat-operators      Red Hat Operators      grpc   Red Hat        32m
mac:~ jianzhang$ oc get pods
NAME                                    READY   STATUS         RESTARTS      AGE
certified-operators-fpbtc               1/1     Running        0             32m
community-operators-8d6fw               1/1     Running        0             32m
marketplace-operator-5d5cc746d4-skxjn   1/1     Running        1 (26m ago)   35m
qe-app-registry-9mfdg                   0/1     ErrImagePull   0             65s
redhat-marketplace-5wnzx                1/1     Running        0             32m
redhat-operators-k82bt                  1/1     Running        0             32m

3, Subscribe to the etcd operator (from community-operators) in the default project.
mac:~ jianzhang$ oc get sub -A
NAMESPACE   NAME   PACKAGE   SOURCE                CHANNEL
default     etcd   etcd      community-operators   singlenamespace-alpha

It is still blocked.
mac:~ jianzhang$ oc get sub -n default etcd -o yaml
apiVersion: operators.coreos.com/v1alpha1
kind: Subscription
...
  conditions:
  - lastTransitionTime: "2022-06-22T09:39:42Z"
    message: all available catalogsources are healthy
    reason: AllCatalogSourcesHealthy
    status: "False"
    type: CatalogSourcesUnhealthy
  - message: 'failed to populate resolver cache from source qe-app-registry/openshift-marketplace: failed to list bundles: rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing dial tcp 172.30.33.41:50051: i/o timeout"'
    reason: ErrorPreventedResolution
    status: "True"
    type: ResolutionFailed

3-1, Add olm.operatorframework.io/exclude-global-namespace-resolution: "true" to the OperatorGroup.
mac:~ jianzhang$ oc get og default-bk9zf -o yaml
apiVersion: operators.coreos.com/v1
kind: OperatorGroup
metadata:
  annotations:
    olm.operatorframework.io/exclude-global-namespace-resolution: "true"
    olm.providedAPIs: ""
  creationTimestamp: "2022-06-22T09:38:55Z"
  generateName: default-
  generation: 1
  name: default-bk9zf
  namespace: default
  resourceVersion: "47944"
  uid: 8dad82ef-faa8-4963-94f3-2b6ffd768a15
spec:
  targetNamespaces:
  - default
  upgradeStrategy: Default
status:
  lastUpdated: "2022-06-22T09:38:55Z"
  namespaces:
  - default

Nothing changed.
mac:~ jianzhang$ oc get ip
No resources found in default namespace.
mac:~ jianzhang$ oc get csv
No resources found in default namespace.

3-2, Resubscribe it. Got another error: "constraints not satisfiable"
mac:~ jianzhang$ oc get sub
NAME   PACKAGE   SOURCE                CHANNEL
etcd   etcd      community-operators   singlenamespace-alpha
mac:~ jianzhang$ oc get ip
mac:~ jianzhang$ oc get sub etcd -o yaml
...
...
  conditions:
  - lastTransitionTime: "2022-06-22T10:14:05Z"
    message: all available catalogsources are healthy
    reason: AllCatalogSourcesHealthy
    status: "False"
    type: CatalogSourcesUnhealthy
  - message: 'constraints not satisfiable: no operators found from catalog community-operators in namespace openshift-marketplace referenced by subscription etcd, subscription etcd exists'
    reason: ConstraintsNotSatisfiable
    status: "True"
    type: ResolutionFailed
  lastUpdated: "2022-06-22T10:14:05Z"

PS: even if this step works, I still have some concerns:
1) As you know, the OperatorGroup is created automatically when subscribing on the Web console. So how does the user add the annotation? Must they create the OperatorGroup before subscribing?

4, Remove the bad CatalogSource from the openshift-marketplace project, and install it in another project.
mac:~ jianzhang$ oc delete catalogsource qe-app-registry
catalogsource.operators.coreos.com "qe-app-registry" deleted
mac:~ jianzhang$ oc create -f cs-qe.yaml
catalogsource.operators.coreos.com/qe-app-registry created
mac:~ jianzhang$
mac:~ jianzhang$ cat cs-qe.yaml
apiVersion: operators.coreos.com/v1alpha1
kind: CatalogSource
metadata:
  name: qe-app-registry
  namespace: jian
spec:
  displayName: Production Operators
  image: quay.io/openshift-qe-optional-operators/ocp4-index:latest
  publisher: OpenShift QE
  sourceType: grpc
  updateStrategy:
    registryPoll:
      interval: 15m

5, Subscribe to the etcd operator (from community-operators) in the default project.
mac:~ jianzhang$ oc get catalogsource -n jian
NAME              DISPLAY                TYPE   PUBLISHER      AGE
qe-app-registry   Production Operators   grpc   OpenShift QE   3m15s
mac:~ jianzhang$ oc get pods -n jian
NAME                    READY   STATUS         RESTARTS   AGE
qe-app-registry-5c7c4   0/1     ErrImagePull   0          3m21s
mac:~ jianzhang$ oc get sub -n default
NAME   PACKAGE   SOURCE                CHANNEL
etcd   etcd      community-operators   singlenamespace-alpha
mac:~ jianzhang$ oc get ip -n default
NAME            CSV                   APPROVAL    APPROVED
install-hnh8h   etcdoperator.v0.9.4   Automatic   true
mac:~ jianzhang$ oc get csv -n default
NAME                  DISPLAY   VERSION   REPLACES              PHASE
etcdoperator.v0.9.4   etcd      0.9.4     etcdoperator.v0.9.2   Installing
mac:~ jianzhang$ oc get csv -n default
NAME                  DISPLAY   VERSION   REPLACES              PHASE
etcdoperator.v0.9.4   etcd      0.9.4     etcdoperator.v0.9.2   Succeeded

6, Subscribe to the etcd operator (from community-operators) in the "jian" project, where the bad CatalogSource is running.
mac:~ jianzhang$ oc get sub -n jian
NAME   PACKAGE   SOURCE                CHANNEL
etcd   etcd      community-operators   singlenamespace-alpha
  - message: 'failed to populate resolver cache from source qe-app-registry/jian: failed to list bundles: rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing dial tcp 172.30.8.68:50051: i/o timeout"'
    reason: ErrorPreventedResolution
    status: "True"

Change the status to ASSIGNED.
QE did not verify the behavior that this PR is addressing -- the failure Jian saw is unrelated. Per spoke to Jian on slack -- should have a correct QE test shortly. Moving back to POST.
Below is the explanation from Per:
If there is a bad catalog source in the global namespace (openshift-marketplace), this will block subscription resolution across the whole cluster. This doesn't change. Even if you have a custom catalog source in your own namespace and a subscription pointing to it, it will not resolve. Adding the OG annotation will tell the resolver to only consider local catalog sources during resolution for the OG's namespace. So, if you have a local catalog source and a subscription pointing to it, it will resolve once the annotation is added to the OG.

Testing:
>> Test scenario: a bad catalog source in the global namespace and a good catalog source in the user's namespace. Subscribe to an operator from the good catalog source in the local namespace. It works with the OG annotation.

mac:~ jianzhang$ oc get clusterversion
NAME      VERSION                                                   AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.11.0-0.ci.test-2022-06-23-025214-ci-ln-039sl4k-latest   True        False         2m11s   Cluster version is 4.11.0-0.ci.test-2022-06-23-025214-ci-ln-039sl4k-latest

1, Create a bad catalog source in the global namespace.
mac:bug2076323 jianzhang$ oc get pods -n openshift-marketplace
NAME                                                              READY   STATUS             RESTARTS   AGE
certified-operators-zghnf                                         1/1     Running            0          75m
community-operators-zwtvp                                         1/1     Running            0          75m
e8c9651078ae45ddb2807e3a07727d459b82d7def5572a7b7ccaae332b6klgx   0/1     Completed          0          51m
marketplace-operator-5b56956987-l7bhb                             1/1     Running            0          79m
qe-app-registry-fr5t2                                             0/1     ImagePullBackOff   0          77s
redhat-marketplace-dtdg9                                          1/1     Running            0          75m
redhat-operators-b5w2q                                            1/1     Running            0          75m

2, Create a good catalog source in a project called "test".
mac:bug2076323 jianzhang$ oc get catalogsource -n test
NAME                  DISPLAY   TYPE   PUBLISHER   AGE
community-operators             grpc   Red Hat     15m
mac:bug2076323 jianzhang$ oc get pods -n test
NAME                        READY   STATUS    RESTARTS   AGE
community-operators-692x8   1/1     Running   0          15m

3, Create an OG without the annotation.
mac:bug2076323 jianzhang$ oc create -f ~/og.yaml
operatorgroup.operators.coreos.com/default-og created
mac:bug2076323 jianzhang$ cat ~/og.yaml
apiVersion: operators.coreos.com/v1
kind: OperatorGroup
metadata:
  name: default-og
  namespace: test
spec:
  targetNamespaces:
  - test

4, Subscribe to an operator from the good catalog source.
mac:bug2076323 jianzhang$ cat ~/sub-etcd.yaml
apiVersion: operators.coreos.com/v1alpha1
kind: Subscription
metadata:
  name: etcd
  namespace: test
spec:
  channel: singlenamespace-alpha
  installPlanApproval: Automatic
  name: etcd
  source: community-operators
  sourceNamespace: test
  startingCSV: etcdoperator.v0.9.4
mac:bug2076323 jianzhang$ oc create -f ~/sub-etcd.yaml
subscription.operators.coreos.com/etcd created
mac:bug2076323 jianzhang$ oc get sub
NAME   PACKAGE   SOURCE                CHANNEL
etcd   etcd      community-operators   singlenamespace-alpha
mac:bug2076323 jianzhang$ oc get sub etcd -o yaml
...
  conditions:
  - lastTransitionTime: "2022-06-23T04:24:25Z"
    message: all available catalogsources are healthy
    reason: AllCatalogSourcesHealthy
    status: "False"
    type: CatalogSourcesUnhealthy
  - message: 'failed to populate resolver cache from source qe-app-registry/openshift-marketplace: failed to list bundles: rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing dial tcp 172.30.254.241:50051: i/o timeout"'
    reason: ErrorPreventedResolution
    status: "True"
    type: ResolutionFailed

5, Update the OG to add the annotation.
mac:bug2076323 jianzhang$ oc edit og default-og
operatorgroup.operators.coreos.com/default-og edited
mac:bug2076323 jianzhang$ oc get og default-og -o yaml
apiVersion: operators.coreos.com/v1
kind: OperatorGroup
metadata:
  annotations:
    olm.operatorframework.io/exclude-global-namespace-resolution: "true"
    olm.providedAPIs: EtcdBackup.v1beta2.etcd.database.coreos.com,EtcdCluster.v1beta2.etcd.database.coreos.com,EtcdRestore.v1beta2.etcd.database.coreos.com
  creationTimestamp: "2022-06-23T04:23:14Z"
  generation: 1
  name: default-og
  namespace: test
  resourceVersion: "51863"
  uid: b78d9b49-a8b7-41e8-a705-e4ba70e3b687
spec:
  targetNamespaces:
  - test
  upgradeStrategy: Default
status:
  lastUpdated: "2022-06-23T04:23:14Z"
  namespaces:
  - test
mac:bug2076323 jianzhang$ oc get sub
NAME   PACKAGE   SOURCE                CHANNEL
etcd   etcd      community-operators   singlenamespace-alpha
mac:bug2076323 jianzhang$ oc get ip
NAME            CSV                   APPROVAL    APPROVED
install-hn4tt   etcdoperator.v0.9.4   Automatic   true
mac:bug2076323 jianzhang$ oc get csv
NAME                  DISPLAY   VERSION   REPLACES              PHASE
etcdoperator.v0.9.4   etcd      0.9.4     etcdoperator.v0.9.2   Succeeded

The subscription succeeded. Looks good.

>> Test scenario: a bad catalog source in the global namespace and a good catalog source in the user's namespace. Subscribe to an operator from a good catalog source in the global namespace. It failed.

6, Subscribe to an operator from a good catalog source running in the global namespace. It failed, but the error is different.
mac:bug2076323 jianzhang$ cat ~/sub-etcd.yaml
apiVersion: operators.coreos.com/v1alpha1
kind: Subscription
metadata:
  name: etcd
  namespace: test
spec:
  channel: singlenamespace-alpha
  installPlanApproval: Automatic
  name: etcd
  source: community-operators
  sourceNamespace: openshift-marketplace
  startingCSV: etcdoperator.v0.9.4
mac:bug2076323 jianzhang$ oc get sub
NAME   PACKAGE   SOURCE                CHANNEL
etcd   etcd      community-operators   singlenamespace-alpha
mac:bug2076323 jianzhang$ oc get ip
No resources found in test namespace.
mac:bug2076323 jianzhang$ oc get sub etcd -o yaml
apiVersion: operators.coreos.com/v1alpha1
kind: Subscription
metadata:
  creationTimestamp: "2022-06-23T04:32:39Z"
  generation: 1
  labels:
    operators.coreos.com/etcd.test: ""
  name: etcd
  namespace: test
  resourceVersion: "54460"
  uid: a1d3c005-bcb1-4716-8863-a6b36b4006f5
spec:
  channel: singlenamespace-alpha
  installPlanApproval: Automatic
  name: etcd
  source: community-operators
  sourceNamespace: openshift-marketplace
  startingCSV: etcdoperator.v0.9.4
...
  conditions:
  - lastTransitionTime: "2022-06-23T04:32:39Z"
    message: all available catalogsources are healthy
    reason: AllCatalogSourcesHealthy
    status: "False"
    type: CatalogSourcesUnhealthy
  - message: 'constraints not satisfiable: no operators found from catalog community-operators in namespace openshift-marketplace referenced by subscription etcd, subscription etcd exists'
    reason: ConstraintsNotSatisfiable
    status: "True"
    type: ResolutionFailed
  lastUpdated: "2022-06-23T04:32:39Z"

>> Test scenario: a bad and a good catalog source in the local namespace. Subscribe to an operator from the good one in the local namespace. It failed.

mac:bug2076323 jianzhang$ oc get catalogsource
NAME                  DISPLAY                TYPE   PUBLISHER      AGE
community-operators                          grpc   Red Hat        11s
qe-app-registry       Production Operators   grpc   OpenShift QE   42m
mac:bug2076323 jianzhang$ oc get pods
NAME                        READY   STATUS             RESTARTS   AGE
community-operators-692x8   1/1     Running            0          35s
qe-app-registry-jwkkp       0/1     ImagePullBackOff   0          43m
qe-app-registry-vrwzq       0/1     ImagePullBackOff   0          27m

1, Create a new project called test, and create an OG with the annotation.
mac:bug2076323 jianzhang$ oc get og -o yaml
apiVersion: v1
items:
- apiVersion: operators.coreos.com/v1
  kind: OperatorGroup
  metadata:
    annotations:
      olm.operatorframework.io/exclude-global-namespace-resolution: "true"
    creationTimestamp: "2022-06-23T03:36:46Z"
    generation: 1
    name: default-og
    namespace: test
    resourceVersion: "33658"
    uid: 3627d484-36d8-4501-97d7-62aa653ef5c9
  spec:
    targetNamespaces:
    - test
    upgradeStrategy: Default
  status:
    lastUpdated: "2022-06-23T03:36:46Z"
    namespaces:
    - test
kind: List
metadata:
  resourceVersion: ""

2, Subscribe to the etcd operator from the good catalog source.
mac:bug2076323 jianzhang$ cat ~/sub-etcd.yaml
apiVersion: operators.coreos.com/v1alpha1
kind: Subscription
metadata:
  name: etcd
  namespace: test
spec:
  channel: singlenamespace-alpha
  installPlanApproval: Automatic
  name: etcd
  source: community-operators
  sourceNamespace: test
  startingCSV: etcdoperator.v0.9.4
mac:bug2076323 jianzhang$ oc get sub etcd -o yaml
apiVersion: operators.coreos.com/v1alpha1
kind: Subscription
...
  conditions:
  - lastTransitionTime: "2022-06-23T04:09:04Z"
    message: all available catalogsources are healthy
    reason: AllCatalogSourcesHealthy
    status: "False"
    type: CatalogSourcesUnhealthy
  - message: 'failed to populate resolver cache from source qe-app-registry/test: failed to list bundles: rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing dial tcp 172.30.157.83:50051: i/o timeout"'
    reason: ErrorPreventedResolution
    status: "True"
    type: ResolutionFailed
  lastUpdated: "2022-06-23T04:09:28Z"

>> Test scenario: a bad catalog source in the local namespace. Subscribe to an operator from a good catalog source in the global namespace. It failed.

3, Remove the good catalog source and keep only the bad catalog source in the project.
mac:~ jianzhang$ oc get catalogsource -n test
NAME              DISPLAY                TYPE   PUBLISHER      AGE
qe-app-registry   Production Operators   grpc   OpenShift QE   23s
mac:~ jianzhang$ oc get pods -n test
NAME                    READY   STATUS         RESTARTS   AGE
qe-app-registry-jwkkp   0/1     ErrImagePull   0          30s

4, Subscribe to the etcd operator from the community-operators catalog source running in the global namespace.
mac:~ jianzhang$ cat sub-etcd.yaml
apiVersion: operators.coreos.com/v1alpha1
kind: Subscription
metadata:
  name: etcd
  namespace: test
spec:
  channel: singlenamespace-alpha
  installPlanApproval: Automatic
  name: etcd
  source: community-operators
  sourceNamespace: openshift-marketplace
  startingCSV: etcdoperator.v0.9.4
mac:~ jianzhang$ oc get sub
NAME   PACKAGE   SOURCE                CHANNEL
etcd   etcd      community-operators   singlenamespace-alpha
mac:~ jianzhang$ oc get ip
No resources found in test namespace.
mac:~ jianzhang$ oc get sub etcd -o yaml
apiVersion: operators.coreos.com/v1alpha1
kind: Subscription
...
  conditions:
  - lastTransitionTime: "2022-06-23T03:37:49Z"
    message: all available catalogsources are healthy
    reason: AllCatalogSourcesHealthy
    status: "False"
    type: CatalogSourcesUnhealthy
  - message: 'failed to populate resolver cache from source qe-app-registry/test: failed to list bundles: rpc error: code = DeadlineExceeded desc = context deadline exceeded'
    reason: ErrorPreventedResolution
    status: "True"
    type: ResolutionFailed
  lastUpdated: "2022-06-23T03:40:55Z"

  conditions:
  - lastTransitionTime: "2022-06-23T03:37:49Z"
    message: all available catalogsources are healthy
    reason: AllCatalogSourcesHealthy
    status: "False"
    type: CatalogSourcesUnhealthy
  - message: 'failed to populate resolver cache from source qe-app-registry/test: failed to list bundles: rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing dial tcp 172.30.157.83:50051: i/o timeout"'
    reason: ErrorPreventedResolution
    status: "True"
    type: ResolutionFailed
  lastUpdated: "2022-06-23T03:41:41Z"

So this PR only fixed the first scenario: If there is a
bad catalog source in the global namespace (openshift-marketplace), the user can subscribe to an operator from a good catalog source in their own namespace with the OG annotation. A subscription pointing to a good catalog source in the global namespace is still blocked.

If there is a bad catalog source in the local namespace (the user's namespace), the user cannot subscribe to any operator in this namespace, regardless of whether the subscription points to a good catalog source in the local or the global namespace.

Correct me if I'm wrong, thanks!

Adding the documentation team here; it would be better to document this point in the 4.11 release notes.
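The outcomes observed in the tests above are consistent with a simple rule, sketched below. This is an inference from the test output in this bug and not OLM's actual implementation: the function `subscription_outcome`, its parameters, and the "Resolved" label are hypothetical, and only the two failure reasons are taken from the Subscription conditions shown above.

```python
from typing import Dict, Tuple

GLOBAL_NAMESPACE = "openshift-marketplace"

# (namespace, catalog source name), e.g. ("test", "community-operators")
CatalogKey = Tuple[str, str]


def subscription_outcome(
    catalogs: Dict[CatalogKey, bool],
    og_namespace: str,
    source: CatalogKey,
    exclude_global: bool,
) -> str:
    """Predict a Subscription's outcome; `catalogs` maps each catalog to its health."""
    # Local catalogs are always considered; global ones only without the annotation.
    considered = {
        key: healthy
        for key, healthy in catalogs.items()
        if key[0] == og_namespace
        or (key[0] == GLOBAL_NAMESPACE and not exclude_global)
    }
    # Any unhealthy catalog that is still considered blocks resolution.
    if not all(considered.values()):
        return "ErrorPreventedResolution"
    # A catalog that was excluded cannot supply operators.
    if source not in considered:
        return "ConstraintsNotSatisfiable"
    return "Resolved"


if __name__ == "__main__":
    bad_global = {
        (GLOBAL_NAMESPACE, "qe-app-registry"): False,
        (GLOBAL_NAMESPACE, "community-operators"): True,
        ("test", "community-operators"): True,
    }
    local = ("test", "community-operators")
    print(subscription_outcome(bad_global, "test", local, exclude_global=False))
    print(subscription_outcome(bad_global, "test", local, exclude_global=True))
```

With the bad catalog in the global namespace, the first call reports ErrorPreventedResolution and the second reports Resolved, which matches steps 4 and 5 of the first test scenario.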
*** Bug 2082676 has been marked as a duplicate of this bug. ***
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Important: OpenShift Container Platform 4.11.0 bug fix and security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2022:5069