Bug 2048197 - The bad catalog source should not block the resolver reconcile other good ones
Status: CLOSED DUPLICATE of bug 2076323
Product: OpenShift Container Platform
Classification: Red Hat
Component: OLM
Version: 4.10
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: medium
Assignee: Per da Silva
QA Contact: Jian Zhang
Reported: 2022-01-30 03:27 UTC by Jian Zhang
Modified: 2022-07-01 19:30 UTC (History)
Last Closed: 2022-07-01 19:30:10 UTC

Description Jian Zhang 2022-01-30 03:27:38 UTC
Description of problem:
When one Catalog Source fails to run, users cannot use any other Catalog Source. We defined a Subscription that uses a working Catalog Source, but OLM still reads the failing one. That is unreasonable, because the Subscription specifies its `source` explicitly. Why must OLM read all Catalog Sources?

Version-Release number of selected component (if applicable):
mac:kubernetes jianzhang$ oc get clusterversion
NAME      VERSION                              AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.10.0-0.nightly-2022-01-28-125342   True        False         24h     Cluster version is 4.10.0-0.nightly-2022-01-28-125342
mac:kubernetes jianzhang$ oc exec deploy/catalog-operator -- olm --version
OLM version: 0.19.0
git commit: d795a1d8ebe4419f8d007018a5d19f4a07b6e977

How reproducible:
always

Steps to Reproduce:
1. Disable the default Catalog Sources
mac:kubernetes jianzhang$ oc patch operatorhub cluster -p '{"spec": {"disableAllDefaultSources": true}}' --type=merge

2. Install a custom catalog source called "community-operators". It fails to run because of an ImagePullBackOff error.

mac:kubernetes jianzhang$ oc get pods -n openshift-marketplace
NAME                                    READY   STATUS             RESTARTS      AGE
community-operators-6wj49               0/1     ImagePullBackOff   0             23h
community-operators-qtmbg               0/1     ImagePullBackOff   0             23h
marketplace-operator-86d8985bf8-pcdlw   1/1     Running            1 (24h ago)   24h
qe-app-registry-cd5gc                   1/1     Running            0             22h

mac:kubernetes jianzhang$ oc get catalogsource  community-operators -o yaml 
apiVersion: operators.coreos.com/v1alpha1
kind: CatalogSource
metadata:
  annotations:
    kubectl.kubernetes.io/last-applied-configuration: |
      {"apiVersion":"operators.coreos.com/v1alpha1","kind":"CatalogSource","metadata":{"annotations":{},"name":"community-operators","namespace":"openshift-marketplace"},"spec":{"displayName":"Community Operators","image":"ec2-18-116-47-156.us-east-2.compute.amazonaws.com:5000/openshifttest/etcd-index:latest","publisher":"OLM QE","sourceType":"grpc","updateStrategy":{"registryPoll":{"interval":"15m"}}}}
  creationTimestamp: "2022-01-29T02:03:54Z"
  generation: 1
  name: community-operators
  namespace: openshift-marketplace
  resourceVersion: "524767"
  uid: 08795f7b-4a05-4459-8210-b51fe505b948
spec:
  displayName: Community Operators
  image: ec2-18-116-47-156.us-east-2.compute.amazonaws.com:5000/openshifttest/etcd-index:latest
  publisher: OLM QE
  sourceType: grpc
  updateStrategy:
    registryPoll:
      interval: 15m
status:
  connectionState:
    address: community-operators.openshift-marketplace.svc:50051
    lastConnect: "2022-01-30T02:17:46Z"
    lastObservedState: TRANSIENT_FAILURE
  latestImageRegistryPoll: "2022-01-29T02:19:11Z"
  registryService:
    createdAt: "2022-01-29T02:03:54Z"
    port: "50051"
    protocol: grpc
    serviceName: community-operators
    serviceNamespace: openshift-marketplace

3. Install another catalog source called "qe-app-registry".

mac:kubernetes jianzhang$ oc get catalogsource qe-app-registry -o yaml
apiVersion: operators.coreos.com/v1alpha1
kind: CatalogSource
metadata:
  creationTimestamp: "2022-01-29T03:28:47Z"
  generation: 1
  name: qe-app-registry
  namespace: openshift-marketplace
  resourceVersion: "524994"
  uid: ea329c86-b47d-49c3-a4b2-ef722ddeeceb
spec:
  image: ec2-18-116-47-156.us-east-2.compute.amazonaws.com:5000/openshift-qe-optional-operators/ocp4-index:1643421828
  sourceType: grpc
status:
  connectionState:
    address: qe-app-registry.openshift-marketplace.svc:50051
    lastConnect: "2022-01-30T02:18:26Z"
    lastObservedState: READY
  registryService:
    createdAt: "2022-01-29T03:28:47Z"
    port: "50051"
    protocol: grpc
    serviceName: qe-app-registry
    serviceNamespace: openshift-marketplace


4. Subscribe to the aws-efs-csi-driver-operator, which comes from the qe-app-registry catalog source.
mac:kubernetes jianzhang$ oc get sub aws-efs-csi-driver-operator -n openshift-cluster-csi-drivers -o yaml
apiVersion: operators.coreos.com/v1alpha1
kind: Subscription
metadata:
  creationTimestamp: "2022-01-29T12:08:15Z"
  generation: 1
  labels:
    operators.coreos.com/aws-efs-csi-driver-operator.openshift-cluster-csi-drivers: ""
  name: aws-efs-csi-driver-operator
  namespace: openshift-cluster-csi-drivers
  resourceVersion: "237012"
  uid: d850c2c7-e4a5-419a-acb0-4e08b351553e
spec:
  channel: "4.10"
  installPlanApproval: Automatic
  name: aws-efs-csi-driver-operator
  source: qe-app-registry
  sourceNamespace: openshift-marketplace
  startingCSV: aws-efs-csi-driver-operator.4.10.0-202201261535
status:
  catalogHealth:
  - catalogSourceRef:
      apiVersion: operators.coreos.com/v1alpha1
      kind: CatalogSource
      name: community-operators
      namespace: openshift-marketplace
      resourceVersion: "236975"
      uid: 08795f7b-4a05-4459-8210-b51fe505b948
    healthy: true
    lastUpdated: "2022-01-29T12:08:16Z"
  - catalogSourceRef:
      apiVersion: operators.coreos.com/v1alpha1
      kind: CatalogSource
      name: qe-app-registry
      namespace: openshift-marketplace
      resourceVersion: "234398"
      uid: ea329c86-b47d-49c3-a4b2-ef722ddeeceb
    healthy: true
    lastUpdated: "2022-01-29T12:08:16Z"
  conditions:
  - lastTransitionTime: "2022-01-29T12:08:16Z"
    message: all available catalogsources are healthy
    reason: AllCatalogSourcesHealthy
    status: "False"
    type: CatalogSourcesUnhealthy
  - message: 'error using catalog community-operators (in namespace openshift-marketplace):
      failed to list bundles: rpc error: code = Unavailable desc = connection error:
      desc = "transport: Error while dialing dial tcp 172.30.36.150:50051: connect:
      no route to host"'
    reason: ErrorPreventedResolution
    status: "True"
    type: ResolutionFailed
  lastUpdated: "2022-01-29T12:08:16Z"
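
The `ResolutionFailed` message above names the catalog that blocked resolution. As a minimal sketch, the shell function below extracts that catalog name so it can be compared with the Subscription's `spec.source`. `blocking_catalog` is a hypothetical helper, not part of `oc` or OLM, and it assumes the message keeps the `error using catalog <name> (in namespace <ns>)` wording shown in this report.

```shell
# Hypothetical helper: read a ResolutionFailed message on stdin and print
# the name of the catalog it blames. Assumes the message wording
# "error using catalog <name> (in namespace <ns>)" seen in this report.
blocking_catalog() {
  # In a basic regular expression, \( \) capture the catalog name,
  # while the unescaped ( ) match the literal parentheses in the message.
  sed -n 's/.*error using catalog \([^ ]*\) (in namespace [^)]*).*/\1/p'
}
```

One possible way to feed it, assuming the Subscription from step 4, is `oc get sub aws-efs-csi-driver-operator -n openshift-cluster-csi-drivers -o jsonpath='{.status.conditions[?(@.type=="ResolutionFailed")].message}' | blocking_catalog`. If the printed name differs from `spec.source` (here `qe-app-registry`), the resolution was blocked by a catalog the Subscription does not reference.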

Actual results:
1. That's unreasonable. The aws-efs-csi-driver-operator Subscription specifies the `qe-app-registry` Catalog Source, but OLM still reads the `community-operators` Catalog Source, which causes the Subscription to fail.

  - message: 'error using catalog community-operators (in namespace openshift-marketplace):
      failed to list bundles: rpc error: code = Unavailable desc = connection error:
      desc = "transport: Error while dialing dial tcp 172.30.36.150:50051: connect:
      no route to host"'
    reason: ErrorPreventedResolution
    status: "True"
    type: ResolutionFailed
  lastUpdated: "2022-01-29T12:08:16Z"

2. The community-operators Catalog Source is not working, but the Subscription status reports it as healthy, as follows:

  catalogHealth:
  - catalogSourceRef:
      apiVersion: operators.coreos.com/v1alpha1
      kind: CatalogSource
      name: community-operators
      namespace: openshift-marketplace
      resourceVersion: "236975"
      uid: 08795f7b-4a05-4459-8210-b51fe505b948
    healthy: true
    lastUpdated: "2022-01-29T12:08:16Z"

mac:kubernetes jianzhang$ oc get pods -n openshift-marketplace
NAME                                    READY   STATUS             RESTARTS      AGE
community-operators-6wj49               0/1     ImagePullBackOff   0             23h
community-operators-qtmbg               0/1     ImagePullBackOff   0             23h


Expected results:
1. OLM should read the specified `source` directly rather than all catalog sources. Or, even if one catalog source fails, it should not prevent the user from using the others.

2. The Subscription should display the correct Catalog Source status.
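
To cross-check the health that a Subscription reports against the Catalog Sources themselves, one option is to look at each Catalog Source's `status.connectionState.lastObservedState`, the field shown in the YAML above. The sketch below is a hypothetical helper (`not_ready_sources` is not an `oc` or OLM command) that reads a two-column `NAME STATE` table on stdin and prints every source whose state is anything other than `READY`.

```shell
# Hypothetical helper: read a "NAME STATE" table (with a header row) on stdin
# and print the names of catalog sources whose state is not READY.
not_ready_sources() {
  # NR > 1 skips the header row; $1 is the name, $2 the observed state.
  awk 'NR > 1 && $2 != "READY" { print $1 }'
}
```

A table in that shape can be produced with `oc get catalogsource -n openshift-marketplace -o custom-columns=NAME:.metadata.name,STATE:.status.connectionState.lastObservedState | not_ready_sources`. With the two Catalog Sources from this report, `community-operators` (state `TRANSIENT_FAILURE`) would be listed and `qe-app-registry` would not.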

Additional info:

mac:kubernetes jianzhang$ oc get catalogsource -n openshift-marketplace
NAME                  DISPLAY               TYPE   PUBLISHER   AGE
community-operators   Community Operators   grpc   OLM QE      23h
qe-app-registry                             grpc               22h


mac:kubernetes jianzhang$ oc get svc -n openshift-marketplace
NAME                           TYPE        CLUSTER-IP       EXTERNAL-IP   PORT(S)             AGE
community-operators            ClusterIP   172.30.36.150    <none>        50051/TCP           23h
marketplace-operator-metrics   ClusterIP   172.30.150.171   <none>        8383/TCP,8081/TCP   24h
qe-app-registry                ClusterIP   172.30.216.180   <none>        50051/TCP           22h

I also tested it on the latest nightly payload and hit the same issue.
[cloud-user@preserve-olm-env jian]$ oc get clusterversion
NAME      VERSION                              AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.10.0-0.nightly-2022-01-29-094046   True        False         45m     Cluster version is 4.10.0-0.nightly-2022-01-29-094046
[cloud-user@preserve-olm-env jian]$ oc -n openshift-operator-lifecycle-manager  exec deploy/catalog-operator -- olm --version
OLM version: 0.19.0
git commit: 5863540f44addf07e564b2e7c833c8a5f85841e7

Workaround:
Remove the failing Catalog Source. The subscription then runs successfully.

mac:kubernetes jianzhang$ oc get catalogsource -n openshift-marketplace
NAME              DISPLAY   TYPE   PUBLISHER   AGE
qe-app-registry             grpc               23h
mac:kubernetes jianzhang$ oc get sub -n openshift-cluster-csi-drivers 
NAME                          PACKAGE                       SOURCE            CHANNEL
aws-efs-csi-driver-operator   aws-efs-csi-driver-operator   qe-app-registry   stable
mac:kubernetes jianzhang$ oc get csv -n openshift-cluster-csi-drivers 
NAME                                              DISPLAY                       VERSION               REPLACES   PHASE
aws-efs-csi-driver-operator.4.10.0-202201261535   AWS EFS CSI Driver Operator   4.10.0-202201261535              Succeeded

Comment 3 Jian Zhang 2022-06-23 07:29:37 UTC
Bug 2076323 has been fixed, but that fix does not resolve this bug. Details: https://bugzilla.redhat.com/show_bug.cgi?id=2076323#c16

For this bug: as a cluster admin, I am aware of a bad catalog source in the global namespace, and I expect that it won't block users from using the good ones.
For example, if redhat-operators crashes, the user should still be able to subscribe to an operator from a good catalog source, such as certified-operators. The bad catalog source shouldn't block the user from using the other good ones. Reopening.

mac:~ jianzhang$ oc get catalogsource -n openshift-marketplace
NAME                  DISPLAY                TYPE   PUBLISHER      AGE
certified-operators   Certified Operators    grpc   Red Hat        8h
community-operators   Community Operators    grpc   Red Hat        8h
qe-app-registry       Production Operators   grpc   OpenShift QE   7h38m
qitang-operators                             grpc                  4h6m
redhat-marketplace    Red Hat Marketplace    grpc   Red Hat        8h
redhat-operators      Red Hat Operators      grpc   Red Hat        8h

Comment 5 Jian Zhang 2022-06-24 01:07:24 UTC
Hi, 

For bug 2076323, I tested 4 scenarios, and only one works; the others failed. For example, the scenario below failed: the bad catalog source still blocks the resolver from reconciling the other good ones.

>>> Test scenario: a bad catalog source in the global namespace and a good catalog source in the user's namespace. Subscribing to an operator from a good catalog source in the global namespace failed.

Details: https://bugzilla.redhat.com/show_bug.cgi?id=2076323#c16

