Bug 1857877 - Operator upgrades can delete existing CSV before completion
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: OLM
Version: 4.4
Hardware: Unspecified
OS: Unspecified
Priority: urgent
Severity: urgent
Target Milestone: ---
Target Release: 4.7.0
Assignee: Vu Dinh
QA Contact: Jian Zhang
URL:
Whiteboard:
Duplicates: 1904585 1907586
Depends On:
Blocks: 1904583
Reported: 2020-07-16 17:28 UTC by Nick Hale
Modified: 2021-02-24 15:15 UTC (History)

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Cause: OLM deletes the existing CSV before the operator upgrade is completed. Consequence: The new CSV is stuck in the Pending state. Fix: OLM now checks the ServiceAccount's ownership to ensure that the new ServiceAccount has been created for the new CSV before transitioning the new CSV into the Succeeded state. Result: The existing CSV is not deleted until the new CSV correctly reaches the Succeeded state.
Clone Of:
Environment:
Last Closed: 2021-02-24 15:13:58 UTC
Target Upstream Version:


Links
System | ID | Private | Priority | Status | Summary | Last Updated
Github | operator-framework operator-lifecycle-manager pull 1881 | 0 | None | closed | Bug 1857877: check the service account owner in the requirement | 2021-02-15 08:49:43 UTC
Red Hat Product Errata | RHSA-2020:5633 | 0 | None | None | None | 2021-02-24 15:15:02 UTC

Description Nick Hale 2020-07-16 17:28:32 UTC
Description of problem:

When an InstallPlan fails to apply a CRD during an Operator upgrade, after it has already applied the new CSV, the new CSV can temporarily transition to Succeeded. This causes OLM to erroneously garbage collect the CSV being replaced, along with required resources that the new CSV has not yet adopted (their application is blocked in the InstallPlan by the failed CRD application).

The end result is that the new Operator is stuck permanently in the Pending state, missing a subset of its required resources.

Version-Release number of selected component (if applicable): 4.4.8


How reproducible: Always


Steps to Reproduce:
1. Build a catalog containing an operator with two bundles in a package/channel 
   - 0.0.1: Requiring a CRD "Foo" and specifying permissions on a ServiceAccount "sa"
   - 0.0.2: Requiring the same CRD "Foo", with an invalid OpenAPI schema, and specifying permissions on a ServiceAccount "sa", replacing 0.0.1
2. Create a Namespace and OperatorGroup compatible with both CSVs
3. Create a CatalogSource referencing the catalog in the Namespace
4. Create a Subscription in the Namespace on the package/channel with a manual approval strategy and startingCSV set to 0.0.1
5. Approve the resulting InstallPlan and wait for the 0.0.1 CSV to transition to Succeeded
6. Approve the next InstallPlan to be generated

Actual results:

- CSV 0.0.1 is deleted along with ServiceAccount "sa"
- CSV 0.0.2 is in a Pending state, with conditions that show it has transitioned to Succeeded

Expected results:

- CSV 0.0.1 and CSV 0.0.2 are present
- CSV 0.0.2 has never transitioned to Succeeded
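
The expected results above could be checked mechanically. The following is a minimal sketch, assuming the output of `oc get csv -n <namespace> -o json` has been loaded into a Python dict. The function name, the CSV names foo.v0.0.1 / foo.v0.0.2, and the sample data are hypothetical; only the idea of inspecting the CSV phase and its recorded conditions comes from this report.

```python
def check_upgrade_state(csv_list, old_name, new_name):
    """Return a list of violations of the expected results (empty if none)."""
    by_name = {item["metadata"]["name"]: item
               for item in csv_list.get("items", [])}
    problems = []

    # Both the replaced CSV and the new CSV are expected to be present.
    for name in (old_name, new_name):
        if name not in by_name:
            problems.append(f"{name} is missing")

    # The new CSV must never have reached Succeeded, neither as its
    # current phase nor in its recorded condition history.
    new_csv = by_name.get(new_name)
    if new_csv is not None:
        status = new_csv.get("status", {})
        seen = [c.get("phase") for c in status.get("conditions", [])]
        if status.get("phase") == "Succeeded" or "Succeeded" in seen:
            problems.append(f"{new_name} transitioned to Succeeded")
    return problems


# Hypothetical data modelled on the "Actual results" of this report:
# the old CSV is gone and the new one is Pending after a brief Succeeded.
BUGGY_STATE = {"items": [{
    "metadata": {"name": "foo.v0.0.2"},
    "status": {"phase": "Pending",
               "conditions": [{"phase": "Pending"},
                              {"phase": "Succeeded"},
                              {"phase": "Pending"}]},
}]}

if __name__ == "__main__":
    for problem in check_upgrade_state(BUGGY_STATE, "foo.v0.0.1", "foo.v0.0.2"):
        print(problem)
```

An empty list means the cluster state matches the expected results; any entry names the expectation that was violated.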


Additional info:

Customer Report:

Succeeded new CSV: https://gist.github.com/alexeykazakov/2f16daab0b14c83b1852f8a93cbf47bd

InstallPlan blocked on CRD upgrade issue: https://gist.github.com/alexeykazakov/fa3a224b48091a6d21c5c886666bec22

Comment 8 Jian Zhang 2020-12-08 08:29:36 UTC
I can reproduce this on cluster version 4.7.0-0.nightly-2020-12-04-013308.

1, Create the index image for etcd version 0.9.2.
[root@preserve-olm-env etcd]# opm alpha bundle build -c alpha -e alpha -d ./0.9.2/ -o -b docker -p etcd  -t quay.io/olmqe/etcd-bundle:0.9.2-sa
...
[root@preserve-olm-env etcd]# docker push quay.io/olmqe/etcd-bundle:0.9.2-sa
The push refers to repository [quay.io/olmqe/etcd-bundle]
1f7e5652ecb7: Pushed 
f9cde18c30f6: Pushed 
0.9.2-sa: digest: sha256:5aedf81994df417ea9a051738d499e7bd66b9faf1bf74be528d92b8a35fbae20 size: 732
[root@preserve-olm-env etcd]# opm index add -b quay.io/olmqe/etcd-bundle:0.9.2-sa -t quay.io/olmqe/etcd-index:0.9.2-sa
INFO[0000] building the index                            bundles="[quay.io/olmqe/etcd-bundle:0.9.2-sa]"
[root@preserve-olm-env etcd]# docker push quay.io/olmqe/etcd-index:0.9.2-sa 
The push refers to repository [quay.io/olmqe/etcd-index]
...

2, Modify the etcdcluster CRD for etcd version 0.9.4 by adding an invalid OpenAPI schema: https://github.com/jianzhangbjz/community-operators/tree/bug-1857877/community-operators/etcd/0.9.4
1) Create a bundle image
[root@preserve-olm-env etcd]# opm alpha bundle build -c alpha -e alpha -d ./0.9.4/ -o -b docker -p etcd  -t quay.io/olmqe/etcd-bundle:0.9.4-sa
...
2) Add the bundle image to the 0.9.2 index image and generate a new index image: quay.io/olmqe/etcd-index:0.9.4-sa
[root@preserve-olm-env etcd]# opm index add -f quay.io/olmqe/etcd-index:0.9.2-sa --mode semver -c docker -b quay.io/olmqe/etcd-bundle:0.9.4-sa -t quay.io/olmqe/etcd-index:0.9.4-sa
INFO[0000] building the index                            bundles="[quay.io/olmqe/etcd-bundle:0.9.4-sa]"
...

3, Consume this index image on the cluster.
[root@preserve-olm-env etcd]# cat /data/cs-etcd.yaml 
apiVersion: operators.coreos.com/v1alpha1
kind: CatalogSource
metadata:
  name: etcd-test
  namespace: openshift-marketplace
spec:
  displayName: Jian Test
  publisher: Jian
  sourceType: grpc
  image: quay.io/olmqe/etcd-index:0.9.4-sa
  updateStrategy:
    registryPoll:
      interval: 10m

[root@preserve-olm-env etcd]# oc get catalogsource -n openshift-marketplace
NAME                  DISPLAY                TYPE   PUBLISHER      AGE
...
etcd-test             Jian Test              grpc   Jian           94m

4, Subscribe to the etcd operator with manual approval.

[root@preserve-olm-env etcd]# cat /data/og.yaml 
apiVersion: operators.coreos.com/v1
kind: OperatorGroup
metadata:
  name: test-og
  namespace: default
spec:
  targetNamespaces:
  - default
[root@preserve-olm-env etcd]# cat /data/sub-0.9.2.yaml 
apiVersion: operators.coreos.com/v1alpha1
kind: Subscription
metadata:
  name: etcd-sub
  namespace: default
spec:
  installPlanApproval: Manual
  channel: alpha
  name: etcd
  source: etcd-test
  sourceNamespace: openshift-marketplace
  startingCSV: etcdoperator.v0.9.2

[root@preserve-olm-env etcd]# oc get sub -n default
NAME       PACKAGE   SOURCE      CHANNEL
etcd-sub   etcd      etcd-test   alpha
[root@preserve-olm-env etcd]# oc get ip -n default
NAME            CSV                   APPROVAL   APPROVED
install-mc4cw   etcdoperator.v0.9.2   Manual     false

[root@preserve-olm-env etcd]# oc get ip
NAME            CSV                   APPROVAL   APPROVED
install-jfnrv   etcdoperator.v0.9.4   Manual     false
install-mc4cw   etcdoperator.v0.9.2   Manual     true
[root@preserve-olm-env etcd]# oc get csv
NAME                                           DISPLAY                            VERSION                 REPLACES   PHASE
etcdoperator.v0.9.2                            etcd                               0.9.2                              Succeeded

5, Approve the 0.9.4 InstallPlan.
[root@preserve-olm-env etcd]# oc get ip
NAME            CSV                   APPROVAL   APPROVED
install-jfnrv   etcdoperator.v0.9.4   Manual     true
install-mc4cw   etcdoperator.v0.9.2   Manual     true

[root@preserve-olm-env etcd]# oc get csv
NAME                                           DISPLAY                            VERSION                 REPLACES              PHASE
etcdoperator.v0.9.2                            etcd                               0.9.2                                         Replacing
etcdoperator.v0.9.4                            etcd                               0.9.4                   etcdoperator.v0.9.2   Installing

[root@preserve-olm-env etcd]#  oc get csv
NAME                                           DISPLAY                            VERSION                 REPLACES              PHASE
etcdoperator.v0.9.4                            etcd                               0.9.4                   etcdoperator.v0.9.2   Pending

The etcd-operator ServiceAccount was deleted.
[root@preserve-olm-env etcd]# oc get sa
 NAME       SECRETS   AGE
builder    2         4h59m
default    2         5h11m
deployer   2         4h59m

[root@preserve-olm-env etcd]#  oc get csv etcdoperator.v0.9.4  -o yaml
apiVersion: operators.coreos.com/v1alpha1
kind: ClusterServiceVersion
...
  phase: Pending
  reason: RequirementsNotMet
  requirementStatus:
...
  - group: ""
    kind: ServiceAccount
    message: Service account does not exist
    name: etcd-operator
    status: NotPresent
    version: v1
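
To pull the unmet entries out of a longer requirementStatus list like the one above, something along these lines could be used. This is a minimal sketch, assuming the CSV has been fetched with `-o json` and parsed into a dict. The helper name is hypothetical, the set of "unmet" statuses is limited to the two values that appear in this report, and the extra CRD entry in the sample is made up for illustration.

```python
# Status values that appear in this report for unmet requirements.
UNMET_STATUSES = {"NotPresent", "PresentNotSatisfied"}


def unmet_requirements(csv):
    """Return (kind, name, message) for every unmet requirement of a CSV."""
    requirements = csv.get("status", {}).get("requirementStatus", [])
    return [(r.get("kind"), r.get("name"), r.get("message"))
            for r in requirements
            if r.get("status") in UNMET_STATUSES]


# Sample modelled on the output above. The ServiceAccount entry is taken
# from the report; the CRD entry is hypothetical filler for the elided part.
SAMPLE_CSV = {"status": {
    "phase": "Pending",
    "reason": "RequirementsNotMet",
    "requirementStatus": [
        {"group": "apiextensions.k8s.io", "kind": "CustomResourceDefinition",
         "name": "example.crd (hypothetical)", "status": "Present",
         "version": "v1"},
        {"group": "", "kind": "ServiceAccount",
         "message": "Service account does not exist",
         "name": "etcd-operator", "status": "NotPresent", "version": "v1"},
    ],
}}

if __name__ == "__main__":
    for kind, name, message in unmet_requirements(SAMPLE_CSV):
        print(f"{kind}/{name}: {message}")
```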

Test it on a cluster that contains the fix PR:
Cluster version is 4.7.0-0.nightly-2020-12-07-232943
[root@preserve-olm-env data]# oc -n openshift-operator-lifecycle-manager  exec catalog-operator-8649b7f8d5-f4lhq -- olm --version
OLM version: 0.17.0
git commit: 4ee4e876522c4d1b97e59d96588b2468149673eb

Rerun the above steps: 3, 4, 5

[root@preserve-olm-env data]# oc get sa
NAME            SECRETS   AGE
builder         2         31m
default         2         45m
deployer        2         31m
etcd-operator   2         2m14s
[root@preserve-olm-env data]# oc get csv
NAME                  DISPLAY   VERSION   REPLACES              PHASE
etcdoperator.v0.9.2   etcd      0.9.2                           Replacing
etcdoperator.v0.9.4   etcd      0.9.4     etcdoperator.v0.9.2   Pending
...

The ServiceAccount still exists and its owner is the v0.9.2 CSV.
[root@preserve-olm-env data]# oc get sa etcd-operator -o yaml
apiVersion: v1
imagePullSecrets:
...
  ownerReferences:
  - apiVersion: operators.coreos.com/v1alpha1
    blockOwnerDeletion: false
    controller: false
    kind: ClusterServiceVersion
    name: etcdoperator.v0.9.2
    uid: 6f4527b0-e200-49be-ab5d-7c3c387bc441

The error message is "Service account is not owned by this ClusterServiceVersion". LGTM.

[root@preserve-olm-env data]# oc get csv etcdoperator.v0.9.4 -o yaml
apiVersion: operators.coreos.com/v1alpha1
kind: ClusterServiceVersion
metadata:
  annotations:
...
  - group: ""
    kind: ServiceAccount
    message: Service account is not owned by this ClusterServiceVersion
    name: etcd-operator
    status: PresentNotSatisfied
    version: v1
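
The ownership check behind this requirement status can be modelled roughly as below. This is only an illustrative Python sketch, not the OLM implementation (OLM is written in Go; the actual change is in the linked pull 1881). The function name is hypothetical, and the handling of a ServiceAccount with no CSV owner at all (treated as acceptable here) is an assumption that this report does not confirm. The two messages and the status values NotPresent and PresentNotSatisfied are the ones shown in the outputs in this comment.

```python
def service_account_status(sa, csv_name):
    """Return (status, message) for a ServiceAccount required by csv_name.

    sa is the parsed ServiceAccount object, or None if it does not exist.
    """
    if sa is None:
        return "NotPresent", "Service account does not exist"

    owners = sa.get("metadata", {}).get("ownerReferences", [])
    csv_owners = [o.get("name") for o in owners
                  if o.get("kind") == "ClusterServiceVersion"]

    # ASSUMPTION: a ServiceAccount that no CSV owns is accepted as-is.
    if not csv_owners or csv_name in csv_owners:
        return "Present", None

    # Owned by some other CSV (for example the one being replaced): the
    # new CSV must not count it as its own yet.
    return ("PresentNotSatisfied",
            "Service account is not owned by this ClusterServiceVersion")


# ServiceAccount as shown in the fixed-cluster output above.
SA_FROM_REPORT = {"metadata": {
    "name": "etcd-operator",
    "ownerReferences": [{
        "apiVersion": "operators.coreos.com/v1alpha1",
        "kind": "ClusterServiceVersion",
        "name": "etcdoperator.v0.9.2",
    }],
}}

if __name__ == "__main__":
    print(service_account_status(SA_FROM_REPORT, "etcdoperator.v0.9.4"))
```

With a check like this in the requirement evaluation, the new CSV stays in Pending while the ServiceAccount still belongs to the CSV being replaced, so the old CSV and its resources are not garbage collected prematurely.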

Verified.

Comment 10 Vu Dinh 2020-12-21 22:03:37 UTC
*** Bug 1907586 has been marked as a duplicate of this bug. ***

Comment 11 Ankita Thomas 2021-01-11 17:30:00 UTC
*** Bug 1904585 has been marked as a duplicate of this bug. ***

Comment 13 errata-xmlrpc 2021-02-24 15:13:58 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.7.0 security, bug fix, and enhancement update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2020:5633

