Description of problem:
When one Catalog Source fails to run, the user cannot use another Catalog Source. We define a Subscription that uses the working Catalog Source, but OLM still reads the failed one. That's unreasonable, because we specify the `source` in the Subscription explicitly. Why must OLM read all catalog sources?

Version-Release number of selected component (if applicable):
mac:kubernetes jianzhang$ oc get clusterversion
NAME      VERSION                              AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.10.0-0.nightly-2022-01-28-125342   True        False         24h     Cluster version is 4.10.0-0.nightly-2022-01-28-125342

mac:kubernetes jianzhang$ oc exec deploy/catalog-operator -- olm --version
OLM version: 0.19.0
git commit: d795a1d8ebe4419f8d007018a5d19f4a07b6e977

How reproducible:
always

Steps to Reproduce:
1. Disable the default Catalog Sources.
mac:kubernetes jianzhang$ oc patch operatorhub cluster -p '{"spec": {"disableAllDefaultSources": true}}' --type=merge

2. Install a custom catalog source called "community-operators". It fails to run because of an ImagePullBackOff error.
mac:kubernetes jianzhang$ oc get pods -n openshift-marketplace
NAME                                    READY   STATUS             RESTARTS      AGE
community-operators-6wj49               0/1     ImagePullBackOff   0             23h
community-operators-qtmbg               0/1     ImagePullBackOff   0             23h
marketplace-operator-86d8985bf8-pcdlw   1/1     Running            1 (24h ago)   24h
qe-app-registry-cd5gc                   1/1     Running            0             22h

mac:kubernetes jianzhang$ oc get catalogsource community-operators -o yaml
apiVersion: operators.coreos.com/v1alpha1
kind: CatalogSource
metadata:
  annotations:
    kubectl.kubernetes.io/last-applied-configuration: |
      {"apiVersion":"operators.coreos.com/v1alpha1","kind":"CatalogSource","metadata":{"annotations":{},"name":"community-operators","namespace":"openshift-marketplace"},"spec":{"displayName":"Community Operators","image":"ec2-18-116-47-156.us-east-2.compute.amazonaws.com:5000/openshifttest/etcd-index:latest","publisher":"OLM QE","sourceType":"grpc","updateStrategy":{"registryPoll":{"interval":"15m"}}}}
  creationTimestamp: "2022-01-29T02:03:54Z"
  generation: 1
  name: community-operators
  namespace: openshift-marketplace
  resourceVersion: "524767"
  uid: 08795f7b-4a05-4459-8210-b51fe505b948
spec:
  displayName: Community Operators
  image: ec2-18-116-47-156.us-east-2.compute.amazonaws.com:5000/openshifttest/etcd-index:latest
  publisher: OLM QE
  sourceType: grpc
  updateStrategy:
    registryPoll:
      interval: 15m
status:
  connectionState:
    address: community-operators.openshift-marketplace.svc:50051
    lastConnect: "2022-01-30T02:17:46Z"
    lastObservedState: TRANSIENT_FAILURE
  latestImageRegistryPoll: "2022-01-29T02:19:11Z"
  registryService:
    createdAt: "2022-01-29T02:03:54Z"
    port: "50051"
    protocol: grpc
    serviceName: community-operators
    serviceNamespace: openshift-marketplace

3. Install another catalog source called "qe-app-registry".
mac:kubernetes jianzhang$ oc get catalogsource qe-app-registry -o yaml
apiVersion: operators.coreos.com/v1alpha1
kind: CatalogSource
metadata:
  creationTimestamp: "2022-01-29T03:28:47Z"
  generation: 1
  name: qe-app-registry
  namespace: openshift-marketplace
  resourceVersion: "524994"
  uid: ea329c86-b47d-49c3-a4b2-ef722ddeeceb
spec:
  image: ec2-18-116-47-156.us-east-2.compute.amazonaws.com:5000/openshift-qe-optional-operators/ocp4-index:1643421828
  sourceType: grpc
status:
  connectionState:
    address: qe-app-registry.openshift-marketplace.svc:50051
    lastConnect: "2022-01-30T02:18:26Z"
    lastObservedState: READY
  registryService:
    createdAt: "2022-01-29T03:28:47Z"
    port: "50051"
    protocol: grpc
    serviceName: qe-app-registry
    serviceNamespace: openshift-marketplace

4. Subscribe to the aws-efs-csi-driver-operator, which comes from the qe-app-registry catalog source.
mac:kubernetes jianzhang$ oc get sub aws-efs-csi-driver-operator -n openshift-cluster-csi-drivers -o yaml
apiVersion: operators.coreos.com/v1alpha1
kind: Subscription
metadata:
  creationTimestamp: "2022-01-29T12:08:15Z"
  generation: 1
  labels:
    operators.coreos.com/aws-efs-csi-driver-operator.openshift-cluster-csi-drivers: ""
  name: aws-efs-csi-driver-operator
  namespace: openshift-cluster-csi-drivers
  resourceVersion: "237012"
  uid: d850c2c7-e4a5-419a-acb0-4e08b351553e
spec:
  channel: "4.10"
  installPlanApproval: Automatic
  name: aws-efs-csi-driver-operator
  source: qe-app-registry
  sourceNamespace: openshift-marketplace
  startingCSV: aws-efs-csi-driver-operator.4.10.0-202201261535
status:
  catalogHealth:
  - catalogSourceRef:
      apiVersion: operators.coreos.com/v1alpha1
      kind: CatalogSource
      name: community-operators
      namespace: openshift-marketplace
      resourceVersion: "236975"
      uid: 08795f7b-4a05-4459-8210-b51fe505b948
    healthy: true
    lastUpdated: "2022-01-29T12:08:16Z"
  - catalogSourceRef:
      apiVersion: operators.coreos.com/v1alpha1
      kind: CatalogSource
      name: qe-app-registry
      namespace: openshift-marketplace
      resourceVersion: "234398"
      uid: ea329c86-b47d-49c3-a4b2-ef722ddeeceb
    healthy: true
    lastUpdated: "2022-01-29T12:08:16Z"
  conditions:
  - lastTransitionTime: "2022-01-29T12:08:16Z"
    message: all available catalogsources are healthy
    reason: AllCatalogSourcesHealthy
    status: "False"
    type: CatalogSourcesUnhealthy
  - message: 'error using catalog community-operators (in namespace openshift-marketplace): failed to list bundles: rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing dial tcp 172.30.36.150:50051: connect: no route to host"'
    reason: ErrorPreventedResolution
    status: "True"
    type: ResolutionFailed
  lastUpdated: "2022-01-29T12:08:16Z"

Actual results:
1. That's unreasonable. The aws-efs-csi-driver-operator subscription specifies the `qe-app-registry` Catalog Source, but OLM still reads the `community-operators` Catalog Source. That causes the subscription to fail.
  - message: 'error using catalog community-operators (in namespace openshift-marketplace): failed to list bundles: rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing dial tcp 172.30.36.150:50051: connect: no route to host"'
    reason: ErrorPreventedResolution
    status: "True"
    type: ResolutionFailed
  lastUpdated: "2022-01-29T12:08:16Z"

2. The community-operators Catalog Source is not working, but the subscription status reports it as healthy, as follows:
  catalogHealth:
  - catalogSourceRef:
      apiVersion: operators.coreos.com/v1alpha1
      kind: CatalogSource
      name: community-operators
      namespace: openshift-marketplace
      resourceVersion: "236975"
      uid: 08795f7b-4a05-4459-8210-b51fe505b948
    healthy: true
    lastUpdated: "2022-01-29T12:08:16Z"

mac:kubernetes jianzhang$ oc get pods -n openshift-marketplace
NAME                        READY   STATUS             RESTARTS   AGE
community-operators-6wj49   0/1     ImagePullBackOff   0          23h
community-operators-qtmbg   0/1     ImagePullBackOff   0          23h

Expected results:
1. OLM should read the specified `source` directly, not all catalog sources.
Or, even if one catalog source fails, it should not prevent the user from using the others.
2. The subscription should display the correct Catalog Source status.

Additional info:
mac:kubernetes jianzhang$ oc get catalogsource -n openshift-marketplace
NAME                  DISPLAY               TYPE   PUBLISHER   AGE
community-operators   Community Operators   grpc   OLM QE      23h
qe-app-registry                             grpc               22h

mac:kubernetes jianzhang$ oc get svc -n openshift-marketplace
NAME                           TYPE        CLUSTER-IP       EXTERNAL-IP   PORT(S)             AGE
community-operators            ClusterIP   172.30.36.150    <none>        50051/TCP           23h
marketplace-operator-metrics   ClusterIP   172.30.150.171   <none>        8383/TCP,8081/TCP   24h
qe-app-registry                ClusterIP   172.30.216.180   <none>        50051/TCP           22h

I also tested it with the latest nightly payload; the same issue occurs.
[cloud-user@preserve-olm-env jian]$ oc get clusterversion
NAME      VERSION                              AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.10.0-0.nightly-2022-01-29-094046   True        False         45m     Cluster version is 4.10.0-0.nightly-2022-01-29-094046

[cloud-user@preserve-olm-env jian]$ oc -n openshift-operator-lifecycle-manager exec deploy/catalog-operator -- olm --version
OLM version: 0.19.0
git commit: 5863540f44addf07e564b2e7c833c8a5f85841e7

Workaround:
Remove the faulty Catalog Source. The subscription then runs successfully.
mac:kubernetes jianzhang$ oc get catalogsource -n openshift-marketplace
NAME              DISPLAY   TYPE   PUBLISHER   AGE
qe-app-registry             grpc               23h

mac:kubernetes jianzhang$ oc get sub -n openshift-cluster-csi-drivers
NAME                          PACKAGE                       SOURCE            CHANNEL
aws-efs-csi-driver-operator   aws-efs-csi-driver-operator   qe-app-registry   stable

mac:kubernetes jianzhang$ oc get csv -n openshift-cluster-csi-drivers
NAME                                              DISPLAY                       VERSION               REPLACES   PHASE
aws-efs-csi-driver-operator.4.10.0-202201261535   AWS EFS CSI Driver Operator   4.10.0-202201261535              Succeeded
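
The workaround depends on knowing which Catalog Source is the faulty one, and the subscription's `catalogHealth` cannot be trusted for that (see actual result 2). A minimal sketch, assuming the JSON list form of `oc get catalogsource -o json` and the `status.connectionState.lastObservedState` field shown in the YAML above, that flags every Catalog Source whose state is not READY. The function name and the inline sample are illustrative only and are not part of OLM or `oc`.

```python
def unhealthy_catalog_sources(catalog_list):
    """Return (namespace, name, state) for each CatalogSource not in READY state.

    `catalog_list` is the parsed JSON of a CatalogSource list, i.e. a dict
    with an "items" array. A missing connection state is reported as UNKNOWN.
    """
    problems = []
    for item in catalog_list.get("items", []):
        meta = item.get("metadata", {})
        state = (
            item.get("status", {})
            .get("connectionState", {})
            .get("lastObservedState", "UNKNOWN")
        )
        if state != "READY":
            problems.append((meta.get("namespace"), meta.get("name"), state))
    return problems


# Sample data trimmed from the two CatalogSource objects in this report.
sample = {
    "items": [
        {
            "metadata": {"name": "community-operators",
                         "namespace": "openshift-marketplace"},
            "status": {"connectionState": {"lastObservedState": "TRANSIENT_FAILURE"}},
        },
        {
            "metadata": {"name": "qe-app-registry",
                         "namespace": "openshift-marketplace"},
            "status": {"connectionState": {"lastObservedState": "READY"}},
        },
    ]
}

for namespace, name, state in unhealthy_catalog_sources(sample):
    print(f"{namespace}/{name}: {state}")
```

To run it against a cluster, the same function could be fed with the parsed output of `oc get catalogsource -n openshift-marketplace -o json` (for example via `json.load(sys.stdin)`) instead of the inline sample.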
Bug 2076323 has been fixed, but that fix does not resolve this bug. Details: https://bugzilla.redhat.com/show_bug.cgi?id=2076323#c16

For this bug: as a cluster admin, I am aware of a bad catalog source in the global namespace, and I hope it won't block users from using the good ones. For example, if redhat-operators crashed, the user should still be able to subscribe to an operator from a good one, such as certified-operators. A bad catalog source shouldn't block users from using the other, good ones. Reopening it.

mac:~ jianzhang$ oc get catalogsource -n openshift-marketplace
NAME                  DISPLAY                TYPE   PUBLISHER      AGE
certified-operators   Certified Operators    grpc   Red Hat        8h
community-operators   Community Operators    grpc   Red Hat        8h
qe-app-registry       Production Operators   grpc   OpenShift QE   7h38m
qitang-operators                             grpc                  4h6m
redhat-marketplace    Red Hat Marketplace    grpc   Red Hat        8h
redhat-operators      Red Hat Operators      grpc   Red Hat        8h
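
The behaviour requested in this comment can be illustrated with a toy model. This is only a sketch of the expectation, not OLM's resolver: `resolve`, `CatalogUnavailable`, and the callable-per-catalog layout are hypothetical names invented for the example, and it deliberately ignores dependency resolution across catalogs. An unreachable catalog is downgraded to a warning unless it is the catalog named in the subscription's `source`.

```python
class CatalogUnavailable(Exception):
    """Raised by a (hypothetical) catalog whose registry cannot be reached."""


def resolve(subscription, catalogs):
    """Toy resolution that tolerates failures in catalogs the subscription does not use.

    `catalogs` maps a catalog name to a zero-argument callable that returns
    a list of package names or raises CatalogUnavailable.
    Returns (found, warnings).
    """
    warnings = []
    packages = None
    for name, list_packages in catalogs.items():
        try:
            listed = list_packages()
        except CatalogUnavailable as err:
            if name == subscription["source"]:
                raise  # the requested catalog itself is broken: fail loudly
            warnings.append(f"skipped catalog {name}: {err}")
            continue
        if name == subscription["source"]:
            packages = listed
    if packages is None:
        raise KeyError(f"catalog {subscription['source']} not found")
    return subscription["name"] in packages, warnings


def broken_catalog():
    # Stands in for the community-operators registry from this report.
    raise CatalogUnavailable("no route to host")


found, warnings = resolve(
    {"name": "aws-efs-csi-driver-operator", "source": "qe-app-registry"},
    {
        "community-operators": broken_catalog,
        "qe-app-registry": lambda: ["aws-efs-csi-driver-operator"],
    },
)
print(found, warnings)
```

In this model the subscription from the report resolves, and the failure of the unrelated catalog is surfaced as a warning instead of a ResolutionFailed condition.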
Hi,
For bug 2076323, I tested 4 scenarios, and only one of them works; the others fail. For example, the scenario below fails: the bad catalog source still blocks the resolver from reconciling the good ones.

>>> test scenario: a bad catalog source in the global namespace, and a good catalog source in the user's namespace. Then subscribe to an operator from a good catalog source in the global namespace. It failed.

Details: https://bugzilla.redhat.com/show_bug.cgi?id=2076323#c16