Bug 2082676 - Subscription ResolutionFailed, error message points to wrong CatalogSource
Summary: Subscription ResolutionFailed, error message points to wrong CatalogSource
Keywords:
Status: CLOSED DUPLICATE of bug 2076323
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: OLM
Version: 4.10
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: ---
Target Release: ---
Assignee: Per da Silva
QA Contact: Jian Zhang
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2022-05-06 18:14 UTC by Ian Miller
Modified: 2022-07-07 10:01 UTC
CC List: 1 user

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2022-07-07 10:01:57 UTC
Target Upstream Version:
Embargoed:



Description Ian Miller 2022-05-06 18:14:03 UTC
Description of problem:
An operator failed to install on OCP 4.10.13. The Subscription shows it is in the "ResolutionFailed" state; however, the error message points to the wrong CatalogSource. This occurred on 3 clusters out of ~2200 deployed in scale testing. The sequence of events leading to the failure occurs during initial zero-touch-provisioning configuration of the cluster. The configuration performs these steps (among others) very close together in time (sketched as CLI commands after the list):
1. Disable the default catalogsources in OperatorHub
2. Create a new CatalogSource (different name than default ones) pointing to a disconnected registry
3. Create the subscription using the new CatalogSource
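
For reference, a minimal sketch of these steps as oc commands, built from the manifests shown later in this report. The OperatorHub patch is the same command used in comment 4; the Subscription's namespace is an assumption, since the report does not show its metadata:

# 1. Disable the default CatalogSources in OperatorHub
oc patch OperatorHub cluster --type json \
  -p '[{"op": "add", "path": "/spec/disableAllDefaultSources", "value": true}]'

# 2. Create the new CatalogSource pointing at the disconnected registry
oc apply -f - <<EOF
apiVersion: operators.coreos.com/v1alpha1
kind: CatalogSource
metadata:
  name: rh-du-operators
  namespace: openshift-marketplace
spec:
  displayName: disconnected-redhat-operators
  image: e24-h01-000-r640.rdu2.scalelab.redhat.com:5000/olm-mirror/redhat-operator-index:v4.9
  publisher: Red Hat
  sourceType: grpc
EOF

# 3. Create the Subscription against the new CatalogSource
oc apply -f - <<EOF
apiVersion: operators.coreos.com/v1alpha1
kind: Subscription
metadata:
  name: sriov-network-operator
  namespace: openshift-sriov-network-operator   # assumed; not shown in the report
spec:
  channel: stable
  installPlanApproval: Manual
  name: sriov-network-operator
  source: rh-du-operators
  sourceNamespace: openshift-marketplace
EOF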

The spec and error from the Subscription:

spec:
  channel: stable
  installPlanApproval: Manual
  name: sriov-network-operator
  source: rh-du-operators
  sourceNamespace: openshift-marketplace
status:
  catalogHealth:
  - catalogSourceRef:
      apiVersion: operators.coreos.com/v1alpha1
      kind: CatalogSource
      name: rh-du-operators
      namespace: openshift-marketplace
      resourceVersion: "23058"
      uid: 40ed88f4-fea9-4dff-8778-8b59d6869046
    healthy: true
    lastUpdated: "2022-05-05T15:41:19Z"
  conditions:
  - lastTransitionTime: "2022-05-05T15:41:19Z"
    message: all available catalogsources are healthy
    reason: AllCatalogSourcesHealthy
    status: "False"
    type: CatalogSourcesUnhealthy
  - message: 'error using catalog redhat-operators (in namespace openshift-marketplace):
      failed to list bundles: rpc error: code = Unavailable desc = connection error:
      desc = "transport: Error while dialing dial tcp: lookup redhat-operators.openshift-marketplace.svc
      on [fd02::a]:53: server misbehaving"'
    reason: ErrorPreventedResolution
    status: "True"
    type: ResolutionFailed
  lastUpdated: "2022-05-05T15:41:19Z"


Note that the Subscription's CatalogSource is rh-du-operators, but the error message points to redhat-operators.

The CatalogSource:
apiVersion: operators.coreos.com/v1alpha1
kind: CatalogSource
metadata:
  annotations:
    target.workload.openshift.io/management: '{"effect": "PreferredDuringScheduling"}'
  creationTimestamp: "2022-05-05T15:36:04Z"
  generation: 1
  name: rh-du-operators
  namespace: openshift-marketplace
  resourceVersion: "763377"
  uid: 40ed88f4-fea9-4dff-8778-8b59d6869046
spec:
  displayName: disconnected-redhat-operators
  image: e24-h01-000-r640.rdu2.scalelab.redhat.com:5000/olm-mirror/redhat-operator-index:v4.9
  publisher: Red Hat
  sourceType: grpc
status:
  connectionState:
    address: rh-du-operators.openshift-marketplace.svc:50051
    lastConnect: "2022-05-06T18:01:25Z"
    lastObservedState: READY
  registryService:
    createdAt: "2022-05-05T15:36:04Z"
    port: "50051"
    protocol: grpc
    serviceName: rh-du-operators
    serviceNamespace: openshift-marketplace



Version-Release number of selected component (if applicable): 4.10.13


How reproducible: unknown; 3 out of ~2200 clusters ended up in this state.


Steps to Reproduce:
The issue occurred during GitOps ZTP deployment of clusters in a scale test lab. The sequence of actions around this Subscription includes:
1. Disable the default catalogsources in OperatorHub
2. Create a new CatalogSource (different name than default ones) pointing to a disconnected registry
3. Create the subscription using the new CatalogSource

Actual results: Operator fails to install


Expected results: Operator installs


Additional info:

Comment 2 Per da Silva 2022-05-18 11:02:53 UTC
Marking this as a non-blocker because it occurs in very few cases. It does seem like a race condition, though, and we should investigate it.

Comment 3 Per da Silva 2022-05-18 12:12:44 UTC
Marking it as a non-blocker; it's not on 4.11.

Comment 4 Per da Silva 2022-06-13 13:21:25 UTC
Sorry for the delay. The error stems from the fact that OLM expects all CatalogSources to be online when it resolves the Subscription. I've managed to reproduce this by:
 - bringing up crc
 - oc scale --replicas 0 -n openshift-cluster-version deployments/cluster-version-operator
 - editing one of the default marketplace CatalogSources to point to a bad image (breaking it)
 - creating a healthy CatalogSource in the default namespace
 - creating an OperatorGroup and a Subscription (referencing the local CatalogSource) in the default namespace and watching it fail for the reasons cited above
 - disabling all of the default catalog sources: oc patch OperatorHub cluster --type json -p '[{"op": "add", "path": "/spec/disableAllDefaultSources", "value": true}]'
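
Consolidated into a rough shell sketch; the broken-image value and the names of the healthy CatalogSource, OperatorGroup, and Subscription in the default namespace are illustrative, not from this report:

# Bring up crc first, then stop the CVO so it won't revert edits to the default CatalogSources
oc scale --replicas 0 -n openshift-cluster-version deployments/cluster-version-operator

# Break one of the default marketplace CatalogSources (the bad image value is illustrative)
oc patch catalogsource redhat-operators -n openshift-marketplace --type merge \
  -p '{"spec":{"image":"quay.io/example/broken-index:v1"}}'

# Create a healthy CatalogSource in the default namespace (index image is illustrative)
oc apply -f - <<EOF
apiVersion: operators.coreos.com/v1alpha1
kind: CatalogSource
metadata:
  name: test-catsrc
  namespace: default
spec:
  sourceType: grpc
  image: quay.io/example/healthy-index:v1
EOF

# Create an OperatorGroup and a Subscription referencing the local CatalogSource
oc apply -f - <<EOF
apiVersion: operators.coreos.com/v1
kind: OperatorGroup
metadata:
  name: test-og
  namespace: default
EOF

oc apply -f - <<EOF
apiVersion: operators.coreos.com/v1alpha1
kind: Subscription
metadata:
  name: test-sub
  namespace: default
spec:
  channel: stable          # channel and package name are illustrative
  name: some-operator
  source: test-catsrc
  sourceNamespace: default
EOF

# Watch the Subscription fail resolution, then disable all default catalog sources
oc patch OperatorHub cluster --type json \
  -p '[{"op": "add", "path": "/spec/disableAllDefaultSources", "value": true}]'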

The Subscription seems to be stuck in an error state.

My best guess is that, due to the quick timing of your execution path, the Subscription resolves before the default CatalogSources go away.

As a workaround, it might be worth recreating the Subscription, or waiting a few seconds before creating it.
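
For example (the namespace is a placeholder, and subscription.yaml stands for the original manifest):

oc delete subscription sriov-network-operator -n <namespace>
oc apply -f subscription.yaml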

We should investigate whether/how to get the OLM reconciler to re-resolve Subscriptions on CatalogSource deletions.

Comment 5 Per da Silva 2022-07-07 10:01:57 UTC
I'm going to close this as a duplicate of https://bugzilla.redhat.com/show_bug.cgi?id=2076323
From 4.11 onwards, you'll be able to annotate the OperatorGroup to exclude global catalogs from resolution in the namespace. This should ameliorate the issue ^^
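
For illustration only: the annotation key below is a placeholder, since the real key is introduced by bug 2076323 and is not quoted in this report. The idea is an OperatorGroup annotated to opt its namespace out of global catalog resolution:

apiVersion: operators.coreos.com/v1
kind: OperatorGroup
metadata:
  name: my-og                    # illustrative
  namespace: my-namespace        # illustrative
  annotations:
    # placeholder key; see bug 2076323 for the actual annotation shipped in 4.11
    example.openshift.io/exclude-global-namespace-resolution: "true"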

*** This bug has been marked as a duplicate of bug 2076323 ***

