Bug 1934080
| Summary: | Both old and new Clusterlogging CSVs stuck in Pending during upgrade | | |
|---|---|---|---|
| Product: | OpenShift Container Platform | Reporter: | Jonas Nordell <jnordell> |
| Component: | OLM | Assignee: | Ben Luddy <bluddy> |
| OLM sub component: | OLM | QA Contact: | kuiwang |
| Status: | CLOSED ERRATA | Docs Contact: | |
| Severity: | high | | |
| Priority: | high | CC: | assingh, bluddy, cpassare, dgautam, ekasprzy, jmalde, kiyyappa, krizza, ksathe, kuiwang, mmohan, mpandey, naygupta, nhale, scolange, tflannag, xzha |
| Version: | 4.6 | Keywords: | Triaged |
| Target Milestone: | --- | | |
| Target Release: | 4.8.0 | | |
| Hardware: | All | | |
| OS: | Linux | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | Bug Fix |
| Doc Text: | Cause: During an operator upgrade, the owner reference on any associated ServiceAccount objects is updated to point to the new ClusterServiceVersion instead of the old one. Consequence: A race condition between the olm-operator (which reconciles ClusterServiceVersions) and the catalog-operator (which executes InstallPlans) can mark the old CSV as "Pending/RequirementsNotMet" because of the ServiceAccount ownership change. The upgrade then cannot complete, because the new CSV waits indefinitely for the old CSV to report a healthy status. Fix: Instead of replacing the owner reference outright, the new owner is appended to any existing owners. Result: The same ServiceAccount can satisfy the requirements of both the old and the new ClusterServiceVersion. | | |
| Story Points: | --- | | |
| Clone Of: | | | |
| : | 1949139 (view as bug list) | Environment: | |
| Last Closed: | 2021-07-27 22:49:00 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Bug Depends On: | | | |
| Bug Blocks: | 1949139 | | |
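
Per the Doc Text above, after the fix the ServiceAccount is expected to carry owner references to both the old and the new CSV while an upgrade is in flight. A quick way to observe this during verification might be the following (a sketch only; the namespace and ServiceAccount name are the ones from this bug, and the output format is illustrative):

# list kind and name of every owner on the operator's ServiceAccount
oc -n openshift-logging get sa cluster-logging-operator \
  -o jsonpath='{range .metadata.ownerReferences[*]}{.kind}{"\t"}{.name}{"\n"}{end}'

Both clusterlogging CSV names should be listed until the old CSV is removed at the end of the upgrade.
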
verify it on 4.8

[root@preserve-olm-env 1934080]# oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.8.0-0.nightly-2021-04-13-171608   True        False         7h42m   Cluster version is 4.8.0-0.nightly-2021-04-13-171608
[root@preserve-olm-env 1934080]# oc get pod -n openshift-operator-lifecycle-manager
NAME                               READY   STATUS    RESTARTS   AGE
catalog-operator-fd9cb85b6-swpzl   1/1     Running   0          7h38m
olm-operator-77956d5d6b-vc5r2      1/1     Running   0          7h38m
packageserver-849744b889-4lv27     1/1     Running   0          7h38m
packageserver-849744b889-qfqc7     1/1     Running   0          7h39m
[root@preserve-olm-env 1934080]# oc exec catalog-operator-fd9cb85b6-swpzl -n openshift-operator-lifecycle-manager -- olm --version
OLM version: 0.17.0
git commit: 873f908ed63b71ed264c3264c34d8308a7830f52

First, build the index image with the following commands:

brew list-builds --package=cluster-logging-operator-metadata-container --sort-key=Build --state=COMPLETE --quiet --after=2021-02-14
brew --noauth call --json getBuild cluster-logging-operator-metadata-container-v4.6.0.202104091041.p0-1
brew --noauth call --json getBuild cluster-logging-operator-metadata-container-v4.6.0.202104061129.p0-1
opm index add --bundles registry-proxy.engineering.redhat.com/rh-osbs/openshift-ose-cluster-logging-operator-bundle@sha256:a28ea61e6202e9bec77bfe4382b622f124bcf5836f3081864b3ff2f638697345,registry-proxy.engineering.redhat.com/rh-osbs/openshift-ose-cluster-logging-operator-bundle@sha256:5a45df7acd5580786c985b037ab864ee93d52dd5013c863cc59aaf6c98c6dc73 --tag quay.io/openshifttest/cluster-logging-index:v1 --mode semver -c podman
podman push quay.io/openshifttest/cluster-logging-index:v1

Second, install and upgrade it several times:

[root@preserve-olm-env 1934080]# cat og1.yaml
kind: OperatorGroup
apiVersion: operators.coreos.com/v1
metadata:
  name: og1
  namespace: default
spec:
  targetNamespaces:
  - default
[root@preserve-olm-env 1934080]# cat catsrc.yaml
apiVersion: operators.coreos.com/v1alpha1
kind: CatalogSource
metadata:
  name: logging
  namespace: default
spec:
  displayName: "logging Operators"
  image: quay.io/openshifttest/cluster-logging-index:v1
  publisher: QE
  sourceType: grpc
[root@preserve-olm-env 1934080]# cat sub.yaml
apiVersion: operators.coreos.com/v1alpha1
kind: Subscription
metadata:
  name: logging
  namespace: default
spec:
  source: logging
  sourceNamespace: default
  channel: "4.6"
  installPlanApproval: Automatic
  name: cluster-logging
  startingCSV: clusterlogging.4.6.0-202104061129.p0
[root@preserve-olm-env 1934080]# oc apply -f og1.yaml
operatorgroup.operators.coreos.com/og1 created
[root@preserve-olm-env 1934080]# oc apply -f catsrc.yaml
catalogsource.operators.coreos.com/logging created
[root@preserve-olm-env 1934080]# oc apply -f sub.yaml
subscription.operators.coreos.com/logging created
[root@preserve-olm-env 1934080]# oc get ip
NAME            CSV                                    APPROVAL    APPROVED
install-q6h6v   clusterlogging.4.6.0-202104061129.p0   Automatic   true
install-wtnhn   clusterlogging.4.6.0-202104091041.p0   Automatic   true
[root@preserve-olm-env 1934080]# oc get csv
NAME                                   DISPLAY           VERSION                 REPLACES                               PHASE
clusterlogging.4.6.0-202104091041.p0   Cluster Logging   4.6.0-202104091041.p0   clusterlogging.4.6.0-202104061129.p0   Succeeded
[root@preserve-olm-env 1934080]# oc delete sub logging
subscription.operators.coreos.com "logging" deleted
[root@preserve-olm-env 1934080]# oc delete csv clusterlogging.4.6.0-202104091041.p0
clusterserviceversion.operators.coreos.com "clusterlogging.4.6.0-202104091041.p0" deleted
[root@preserve-olm-env 1934080]# oc delete og og1
operatorgroup.operators.coreos.com "og1" deleted
[root@preserve-olm-env 1934080]# oc delete catsrc logging
catalogsource.operators.coreos.com "logging" deleted
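
Because the original failure was a race that did not occur on every upgrade, repeating the install/upgrade/cleanup cycle above increases the chance of exercising it. A rough loop along these lines could be used (a sketch only; the iteration count and sleep are arbitrary, and the file and resource names are the ones shown above):

for i in $(seq 1 10); do
  oc apply -f og1.yaml -f catsrc.yaml -f sub.yaml
  sleep 300   # give the automatic upgrade to the newer CSV time to finish
  oc get csv -n default
  oc delete -f sub.yaml
  oc delete csv clusterlogging.4.6.0-202104091041.p0 -n default
  oc delete -f catsrc.yaml -f og1.yaml
done
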
"og1" deleted [root@preserve-olm-env 1934080]# oc delete catsrc logging catalogsource.operators.coreos.com "logging" deleted -- *** Bug 1924970 has been marked as a duplicate of this bug. *** Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.8.2 bug fix and security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2021:2438 |
Description of problem:
In an OCP 4.6 cluster an automatic upgrade of the clusterlogging operator failed. Both the old and the new CSV are now stuck in Phase: Pending.

oc get csv
NAME                                           DISPLAY                            VERSION                 REPLACES                                       PHASE
clusterlogging.4.6.0-202101230113.p0           Cluster Logging                    4.6.0-202101230113.p0   clusterlogging.4.6.0-202101162152.p0           Pending
clusterlogging.4.6.0-202101301510.p0           Cluster Logging                    4.6.0-202101301510.p0   clusterlogging.4.6.0-202101230113.p0           Pending
elasticsearch-operator.4.6.0-202101300140.p0   OpenShift Elasticsearch Operator   4.6.0-202101300140.p0   elasticsearch-operator.4.6.0-202101230113.p0   Succeeded

When checking the 202101301510 CSV, it does not really say why:

  replaces: clusterlogging.4.6.0-202101230113.p0
  version: 4.6.0-202101301510.p0
status:
  conditions:
  - lastTransitionTime: "2021-02-08T15:26:55Z"
    lastUpdateTime: "2021-02-08T15:26:55Z"
    message: requirements not yet checked
    phase: Pending
    reason: RequirementsUnknown

But when I check the 202101230113 CSV, it states:

  - lastTransitionTime: "2021-02-04T08:13:20Z"
    lastUpdateTime: "2021-02-04T08:13:20Z"
    message: install strategy completed with no errors
    phase: Succeeded
    reason: InstallSucceeded
  - lastTransitionTime: "2021-02-08T15:26:45Z"   <------------- Started failing
    lastUpdateTime: "2021-02-08T15:26:45Z"
    message: requirements no longer met
    phase: Failed
    reason: RequirementsNotMet
  - lastTransitionTime: "2021-02-08T15:26:50Z"
    lastUpdateTime: "2021-02-08T15:26:50Z"
    message: requirements not met
    phase: Pending
    reason: RequirementsNotMet
  - group: ""
    kind: ServiceAccount
    message: Service account is not owned by this ClusterServiceVersion
    name: cluster-logging-operator
    status: PresentNotSatisfied
    version: v1

And when I check the SA:

    manager: olm
    operation: Update
    time: "2021-02-08T15:26:46Z"
  name: cluster-logging-operator
  namespace: openshift-logging
  ownerReferences:
  - apiVersion: operators.coreos.com/v1alpha1
    blockOwnerDeletion: false
    controller: false
    kind: ClusterServiceVersion
    name: clusterlogging.4.6.0-202101301510.p0
    uid: 58956251-875e-49b2-b11d-526d098c0258

So OLM changed the ownerReference on the SA to clusterlogging.4.6.0-202101301510.p0 before clusterlogging.4.6.0-202101230113.p0 was uninstalled/removed. This upset clusterlogging.4.6.0-202101230113.p0 and everything is now stuck in a deadlock. It is possible to work around this by manually changing the ownerReference on the SA, but the customer would like to understand why this is happening. It has happened before with a different operator.

Version-Release number of selected component (if applicable):
OCP 4.6

How reproducible:
It has happened with different operators, so it does not seem to be specific to clusterlogging, but it cannot be reproduced on demand.

Steps to Reproduce:
1.
2.
3.

Actual results:

Expected results:

Additional info:
Split off from https://bugzilla.redhat.com/show_bug.cgi?id=1924970
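
The workaround mentioned above (manually fixing the ownerReference on the ServiceAccount) can be applied, for example, by adding an owner reference for the old CSV back onto the ServiceAccount, which mirrors what the eventual fix does automatically. A minimal sketch, assuming the object names from this bug and that the old CSV still exists so its UID can be read:

OLD_CSV=clusterlogging.4.6.0-202101230113.p0
OLD_UID=$(oc -n openshift-logging get csv "$OLD_CSV" -o jsonpath='{.metadata.uid}')
# assumes .metadata.ownerReferences already exists on the SA (in this bug it points at the new CSV)
oc -n openshift-logging patch serviceaccount cluster-logging-operator --type=json -p "[
  {\"op\": \"add\", \"path\": \"/metadata/ownerReferences/-\",
   \"value\": {\"apiVersion\": \"operators.coreos.com/v1alpha1\",
               \"kind\": \"ClusterServiceVersion\",
               \"name\": \"${OLD_CSV}\",
               \"uid\": \"${OLD_UID}\",
               \"blockOwnerDeletion\": false,
               \"controller\": false}}
]"

Once the old CSV's ServiceAccount requirement is satisfied again, the expectation is that the upgrade can proceed and OLM removes the old CSV as part of the normal replacement flow.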