Bug 1947212

Summary: Upgrades may get stuck if permissions reduced
Product: OpenShift Container Platform Reporter: Ben Luddy <bluddy>
Component: OLM Assignee: Kevin Rizza <krizza>
OLM sub component: OLM QA Contact: Jian Zhang <jiazha>
Status: CLOSED DEFERRED Docs Contact:
Severity: medium    
Priority: medium CC: anbhatta, jkeister, pegoncal
Version: 4.6 Keywords: Reopened, Triaged
Target Milestone: ---   
Target Release: ---   
Hardware: Unspecified   
OS: Unspecified   
Last Closed: 2023-03-09 01:02:04 UTC Type: Bug

Description Ben Luddy 2021-04-07 23:00:12 UTC
Description of problem:

During an upgrade, if the scope of an operator's permissions is reduced (e.g. removal of an entry from the ClusterServiceVersion's .spec.permissions or .spec.clusterPermissions), both the old and the new ClusterServiceVersion may enter phase "Pending" and stay there indefinitely. These symptoms were first reported with a different root cause in https://bugzilla.redhat.com/show_bug.cgi?id=1934080.
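To make "reduced scope" concrete, here is a minimal sketch that expands two rule lists into (apiGroup, resource, verb) tuples and reports what the newer list no longer grants. The helper names (expand, removed_permissions) are hypothetical and not part of OLM, and the comparison is deliberately naive: it ignores wildcards, resourceNames, and nonResourceURLs.

```python
from typing import Dict, List, Set, Tuple

Rule = Dict[str, List[str]]
Grant = Tuple[str, str, str]  # (apiGroup, resource, verb)


def expand(rules: List[Rule]) -> Set[Grant]:
    """Flatten RBAC-style rules into individual (apiGroup, resource, verb) grants."""
    return {
        (group, resource, verb)
        for rule in rules
        for group in rule.get("apiGroups", [])
        for resource in rule.get("resources", [])
        for verb in rule.get("verbs", [])
    }


def removed_permissions(old_rules: List[Rule], new_rules: List[Rule]) -> Set[Grant]:
    """Grants present in the old rules but missing from the new ones."""
    return expand(old_rules) - expand(new_rules)


# Illustrative rules: the newer CSV drops "list" on configmaps.
old_rules = [{"apiGroups": [""], "resources": ["configmaps"], "verbs": ["get", "list"]}]
new_rules = [{"apiGroups": [""], "resources": ["configmaps"], "verbs": ["get"]}]

print(sorted(removed_permissions(old_rules, new_rules)))
```

A non-empty result means the upgrade narrows the operator's permissions, which is the precondition for the behavior described in this bug.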

The timeline looks something like this, assuming CSV "old" in "Replacing" and CSV "new" in "Pending", with an in-progress InstallPlan:

1. An InstallPlan Role step is applied, removing some rules in an update to the existing Role.
2. CSV "old" is reconciled, its RBAC requirements are no longer met, and it transitions from "Replacing" to "Pending". Without intervention, that requirement will remain unsatisfied because InstallPlan execution is a one-way process.
3. CSV "new" is reconciled. Because its .spec.replaces is "old", and "old" exists, it refuses to progress unless "old" is in phase "Replacing".
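The three steps above can be sketched as a toy state machine. This is a simplified model for illustration only, not OLM's actual reconciliation code: the CSV class, the reconcile function, and the "Succeeded" transition are assumptions standing in for the real install flow, and permissions are reduced to a set of verbs.

```python
from dataclasses import dataclass
from typing import Dict, FrozenSet, Optional


@dataclass
class CSV:
    name: str
    phase: str
    required: FrozenSet[str]        # verbs this CSV needs (toy stand-in for RBAC rules)
    replaces: Optional[str] = None  # mirrors .spec.replaces


def reconcile(csv: CSV, csvs: Dict[str, CSV], granted: FrozenSet[str]) -> str:
    """One toy reconcile pass; returns the CSV's phase afterwards."""
    if csv.replaces is not None:
        old = csvs.get(csv.replaces)
        # Step 3: the replacing CSV waits unless the replaced CSV is in "Replacing".
        if old is not None and old.phase != "Replacing":
            return csv.phase
    if not csv.required <= granted:
        # Step 2: unmet RBAC requirements move the CSV to "Pending".
        csv.phase = "Pending"
    elif csv.phase == "Pending":
        csv.phase = "Succeeded"  # stand-in for the rest of the install flow
    return csv.phase


old = CSV("old", "Replacing", frozenset({"get", "list"}))
new = CSV("new", "Pending", frozenset({"get"}), replaces="old")
csvs = {"old": old, "new": new}

# Step 1: the InstallPlan Role step has already removed "list" from the Role.
granted = frozenset({"get"})

for _ in range(3):  # repeated reconciliation does not help
    reconcile(old, csvs, granted)
    reconcile(new, csvs, granted)

print(old.phase, new.phase)
```

In this model neither CSV can leave "Pending": "old" needs a verb that nothing will restore, and "new" is gated on "old" being in "Replacing".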

Version-Release number of selected component (if applicable): 4.6

How reproducible:

Intermittently. It depends on which side wins the race between catalog-operator applying steps from an InstallPlan and olm-operator reconciling the CSVs.

Steps to Reproduce:

1. Create an index image containing a channel with two entries. The first entry should define some permission in its CSV, for example:

      clusterPermissions:
      - serviceAccountName: service-account
        rules:
        - apiGroups:
          - ""
          resources:
          - configmaps
          verbs:
          - get
          - list

and the second should reduce the scope of that permission, for example:

      clusterPermissions:
      - serviceAccountName: service-account
        rules:
        - apiGroups:
          - ""
          resources:
          - configmaps
          verbs:
          - get

2. Create a CatalogSource pointing to the index that was created in (1).
3. Create a Subscription (and OperatorGroup if necessary) for the CatalogSource created in (2) with .spec.startingCSV set to the name of the _first_ entry.
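For steps (2) and (3), the manifests could look roughly like the sketch below, which builds them as Python dicts and prints a single JSON List. Every name here is a placeholder assumption, not taken from this report: the namespace, index image, package name, channel, and CSV name must be replaced with the values from the index built in (1). An OperatorGroup is not included and may need to be created separately.

```python
import json

# All of the following values are hypothetical placeholders.
NAMESPACE = "repro"
INDEX_IMAGE = "quay.io/example/repro-index:latest"
PACKAGE = "example-operator"
CHANNEL = "stable"
FIRST_CSV = "example-operator.v0.1.0"  # name of the first channel entry

catalog_source = {
    "apiVersion": "operators.coreos.com/v1alpha1",
    "kind": "CatalogSource",
    "metadata": {"name": "repro-catalog", "namespace": NAMESPACE},
    "spec": {"sourceType": "grpc", "image": INDEX_IMAGE},
}

subscription = {
    "apiVersion": "operators.coreos.com/v1alpha1",
    "kind": "Subscription",
    "metadata": {"name": "repro-sub", "namespace": NAMESPACE},
    "spec": {
        "name": PACKAGE,
        "channel": CHANNEL,
        "source": catalog_source["metadata"]["name"],
        "sourceNamespace": NAMESPACE,
        "installPlanApproval": "Automatic",
        # Start at the first entry so the upgrade to the second entry follows.
        "startingCSV": FIRST_CSV,
    },
}

manifest = {"apiVersion": "v1", "kind": "List", "items": [catalog_source, subscription]}
print(json.dumps(manifest, indent=2))
```

The output can be piped to `oc apply -f -`. Setting installPlanApproval to "Automatic" is an assumption made so that the upgrade proceeds without manual approval.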

Actual results:

The first entry is installed, then (sometimes) the upgrade to the second entry never completes -- both "old" and "new" CSVs show phase "Pending" indefinitely.

Expected results:

The first entry is installed, then an upgrade to the second entry succeeds (the first CSV is automatically deleted and the second CSV has a good status).

Comment 1 Per da Silva 2022-01-11 20:36:57 UTC

*** This bug has been marked as a duplicate of bug 1942818 ***

Comment 2 Per da Silva 2022-01-12 00:32:45 UTC
The above closure was erroneous. Re-opening.

Comment 5 Shiftzilla 2023-03-09 01:02:04 UTC
OpenShift has moved to Jira for its defect tracking! This bug can now be found in the OCPBUGS project in Jira.

https://issues.redhat.com/browse/OCPBUGS-8859