Description of problem:

An operator failed to install on OCP 4.10.13. The Subscription is in the "ResolutionFailed" state, but the error message points to the wrong CatalogSource. This occurred on 3 clusters out of ~2200 deployed in scale testing.

The sequence of events leading to the failure occurs during the initial zero-touch-provisioning configuration of the cluster. The configuration performs these steps (among others) very close together in time:

1. Disable the default catalogsources in OperatorHub
2. Create a new CatalogSource (different name than default ones) pointing to a disconnected registry
3. Create the subscription using the new CatalogSource

The spec and error from the Subscription:

spec:
  channel: stable
  installPlanApproval: Manual
  name: sriov-network-operator
  source: rh-du-operators
  sourceNamespace: openshift-marketplace
status:
  catalogHealth:
  - catalogSourceRef:
      apiVersion: operators.coreos.com/v1alpha1
      kind: CatalogSource
      name: rh-du-operators
      namespace: openshift-marketplace
      resourceVersion: "23058"
      uid: 40ed88f4-fea9-4dff-8778-8b59d6869046
    healthy: true
    lastUpdated: "2022-05-05T15:41:19Z"
  conditions:
  - lastTransitionTime: "2022-05-05T15:41:19Z"
    message: all available catalogsources are healthy
    reason: AllCatalogSourcesHealthy
    status: "False"
    type: CatalogSourcesUnhealthy
  - message: 'error using catalog redhat-operators (in namespace openshift-marketplace): failed to list bundles: rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing dial tcp: lookup redhat-operators.openshift-marketplace.svc on [fd02::a]:53: server misbehaving"'
    reason: ErrorPreventedResolution
    status: "True"
    type: ResolutionFailed
  lastUpdated: "2022-05-05T15:41:19Z"

Note that the Subscription's source is rh-du-operators, but the error message points to redhat-operators.
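With ~2200 clusters, spotting the affected ones by eye is tedious. The following is a minimal sketch of how the mismatch described above could be detected automatically. It is not part of OLM or of the original report: the function name `mismatched_catalogs` is hypothetical, and the regular expression is based only on the error text quoted in this report, so other OLM versions may word the message differently.

```python
import re

# Assumption: the message format matches the error quoted in this report
# ("error using catalog <name> (in namespace <ns>): ...").
_CATALOG_RE = re.compile(r"error using catalog (\S+) \(in namespace ([^\s)]+)\)")


def mismatched_catalogs(subscription):
    """Return (name, namespace) pairs named in active ResolutionFailed
    conditions that differ from the Subscription's spec.source."""
    spec = subscription.get("spec", {})
    wanted = (spec.get("source"), spec.get("sourceNamespace"))
    found = []
    for cond in subscription.get("status", {}).get("conditions", []):
        # Only look at ResolutionFailed conditions that are currently true.
        if cond.get("type") != "ResolutionFailed" or cond.get("status") != "True":
            continue
        for match in _CATALOG_RE.finditer(cond.get("message", "")):
            pair = (match.group(1), match.group(2))
            if pair != wanted and pair not in found:
                found.append(pair)
    return found


# Trimmed-down copy of the Subscription from this report.
example = {
    "spec": {
        "source": "rh-du-operators",
        "sourceNamespace": "openshift-marketplace",
    },
    "status": {
        "conditions": [
            {
                "type": "ResolutionFailed",
                "status": "True",
                "reason": "ErrorPreventedResolution",
                "message": "error using catalog redhat-operators "
                           "(in namespace openshift-marketplace): "
                           "failed to list bundles",
            }
        ]
    },
}

print(mismatched_catalogs(example))
# prints [('redhat-operators', 'openshift-marketplace')]
```

The input dictionary could come from parsing the JSON output of `oc get subscriptions.operators.coreos.com <name> -n <namespace> -o json`. An empty list means the failing catalog, if any, is the one the Subscription actually references.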
The CatalogSource:

apiVersion: operators.coreos.com/v1alpha1
kind: CatalogSource
metadata:
  annotations:
    target.workload.openshift.io/management: '{"effect": "PreferredDuringScheduling"}'
  creationTimestamp: "2022-05-05T15:36:04Z"
  generation: 1
  name: rh-du-operators
  namespace: openshift-marketplace
  resourceVersion: "763377"
  uid: 40ed88f4-fea9-4dff-8778-8b59d6869046
spec:
  displayName: disconnected-redhat-operators
  image: e24-h01-000-r640.rdu2.scalelab.redhat.com:5000/olm-mirror/redhat-operator-index:v4.9
  publisher: Red Hat
  sourceType: grpc
status:
  connectionState:
    address: rh-du-operators.openshift-marketplace.svc:50051
    lastConnect: "2022-05-06T18:01:25Z"
    lastObservedState: READY
  registryService:
    createdAt: "2022-05-05T15:36:04Z"
    port: "50051"
    protocol: grpc
    serviceName: rh-du-operators
    serviceNamespace: openshift-marketplace

Version-Release number of selected component (if applicable):
4.10.13

How reproducible:
Unknown. 3 out of 2200 clusters resulted in this state.

Steps to Reproduce:
Issue occurred in GitOps ZTP deployment of cluster in scale test lab. The sequence of actions around this subscription includes:
1. Disable the default catalogsources in OperatorHub
2. Create a new CatalogSource (different name than default ones) pointing to a disconnected registry
3. Create the subscription using the new CatalogSource

Actual results:
Operator fails to install

Expected results:
Operator installs

Additional info:
Marking it as non-blocker because it occurs in very few cases. Seems like a race condition and we should investigate it, though.
Marking it as a non-blocker, as it's not on 4.11.
Sorry for the delay. The error stems from the fact that OLM expects all CatalogSources to be online when it resolves the Subscription. I've managed to reproduce this by:

- bringing up crc
- oc scale --replicas 0 -n openshift-cluster-version deployments/cluster-version-operator
- editing one of the marketplace catsrc to point to a bad image (breaking it)
- creating a healthy catsrc in the default namespace
- creating an og and a sub (ref. the local catsrc) in the default namespace and watching it fail for the reasons cited above
- disabling all of the default catalog sources:
  oc patch OperatorHub cluster --type json -p '[{"op": "add", "path": "/spec/disableAllDefaultSources", "value": true}]'

After this, the subscription seems to be stuck in an error state.

My best guess is that, due to the quick timing of your execution path, the sub resolves before the default catsrcs go away. As a workaround, it might be worth recreating the sub, or waiting a few seconds before creating it.

We should investigate how/if we should get the OLM reconciler to re-resolve subs on catalog source deletions.
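One way to read the "wait before creating the sub" workaround is to poll until the default CatalogSources are actually gone, rather than sleeping for a fixed time. The sketch below illustrates that idea only; it has not been verified to avoid the race in this bug. The helper names are hypothetical, and the set of default source names is an assumption based on the catalogs a stock OpenShift cluster ships in openshift-marketplace. The `oc` call is isolated in `list_catalog_sources` so that the polling logic can be exercised without a cluster.

```python
import json
import subprocess
import time

# Assumption: these are the default CatalogSources on the cluster in question.
DEFAULT_SOURCES = {
    "redhat-operators",
    "certified-operators",
    "community-operators",
    "redhat-marketplace",
}


def list_catalog_sources(namespace="openshift-marketplace"):
    """Return the names of the CatalogSources currently in the namespace."""
    result = subprocess.run(
        ["oc", "get", "catalogsources.operators.coreos.com",
         "-n", namespace, "-o", "json"],
        check=True, capture_output=True, text=True,
    )
    return {item["metadata"]["name"] for item in json.loads(result.stdout)["items"]}


def wait_for_defaults_removed(list_names=list_catalog_sources,
                              timeout=120.0, interval=2.0,
                              sleep=time.sleep, clock=time.monotonic):
    """Poll until no default CatalogSource remains.

    Returns True once none of DEFAULT_SOURCES is listed, or False if the
    timeout (in seconds) expires first.
    """
    deadline = clock() + timeout
    while True:
        remaining = DEFAULT_SOURCES & set(list_names())
        if not remaining:
            return True
        if clock() >= deadline:
            return False
        sleep(interval)
```

A provisioning script could call `wait_for_defaults_removed()` after patching OperatorHub and create the Subscription only if it returns True. The timeout and interval defaults are arbitrary placeholders, not tuned values.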
I'm going to close this as a duplicate of https://bugzilla.redhat.com/show_bug.cgi?id=2076323

From 4.11 you'll be able to annotate the operator group to exclude global catalogs from the resolution in the namespace. This should ameliorate the issues ^^

*** This bug has been marked as a duplicate of bug 2076323 ***