Bug 1934080
| Summary: | Both old and new Clusterlogging CSVs stuck in Pending during upgrade | | |
|---|---|---|---|
| Product: | OpenShift Container Platform | Reporter: | Jonas Nordell <jnordell> |
| Component: | OLM | Assignee: | Ben Luddy <bluddy> |
| OLM sub component: | OLM | QA Contact: | kuiwang |
| Status: | CLOSED ERRATA | Docs Contact: | |
| Severity: | high | | |
| Priority: | high | CC: | assingh, bluddy, cpassare, dgautam, ekasprzy, jmalde, kiyyappa, krizza, ksathe, kuiwang, mmohan, mpandey, naygupta, nhale, scolange, tflannag, xzha |
| Version: | 4.6 | Keywords: | Triaged |
| Target Milestone: | --- | | |
| Target Release: | 4.8.0 | | |
| Hardware: | All | | |
| OS: | Linux | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | Bug Fix |
| Doc Text: | Cause: During an operator upgrade, the owner reference on any associated ServiceAccount objects is updated to point to the new ClusterServiceVersion instead of the old one. Consequence: A race condition between the olm-operator (which reconciles ClusterServiceVersions) and the catalog-operator (which executes InstallPlans) can mark the old CSV as "Pending/RequirementsNotMet" because of the ServiceAccount ownership change. The upgrade then cannot complete, because the new CSV waits indefinitely for the old CSV to report a healthy status. Fix: Instead of replacing the owner reference outright, the new owner is appended to any existing owners. Result: The same ServiceAccount can satisfy the requirements of both the old and the new ClusterServiceVersion. | | |
| Story Points: | --- | | |
| Clone Of: | | | |
| : | 1949139 (view as bug list) | Environment: | |
| Last Closed: | 2021-07-27 22:49:00 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Bug Depends On: | | | |
| Bug Blocks: | 1949139 | | |
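
Per the Doc Text above, after the fix the ServiceAccount is expected to carry owner references to both the old and the new CSV while an upgrade is in flight. A quick way to observe this during verification might be the following (a sketch only; the namespace and ServiceAccount name are the ones from this bug, and the output format is illustrative):

# list kind and name of every owner on the operator's ServiceAccount
oc -n openshift-logging get sa cluster-logging-operator \
  -o jsonpath='{range .metadata.ownerReferences[*]}{.kind}{"\t"}{.name}{"\n"}{end}'

Both clusterlogging CSV names should be listed until the old CSV is removed at the end of the upgrade.
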
verify it on 4.8

[root@preserve-olm-env 1934080]# oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.8.0-0.nightly-2021-04-13-171608   True        False         7h42m   Cluster version is 4.8.0-0.nightly-2021-04-13-171608
[root@preserve-olm-env 1934080]# oc get pod -n openshift-operator-lifecycle-manager
NAME                               READY   STATUS    RESTARTS   AGE
catalog-operator-fd9cb85b6-swpzl   1/1     Running   0          7h38m
olm-operator-77956d5d6b-vc5r2      1/1     Running   0          7h38m
packageserver-849744b889-4lv27     1/1     Running   0          7h38m
packageserver-849744b889-qfqc7     1/1     Running   0          7h39m
[root@preserve-olm-env 1934080]# oc exec catalog-operator-fd9cb85b6-swpzl -n openshift-operator-lifecycle-manager -- olm --version
OLM version: 0.17.0
git commit: 873f908ed63b71ed264c3264c34d8308a7830f52

First, build the index image with the following commands:

brew list-builds --package=cluster-logging-operator-metadata-container --sort-key=Build --state=COMPLETE --quiet --after=2021-02-14
brew --noauth call --json getBuild cluster-logging-operator-metadata-container-v4.6.0.202104091041.p0-1
brew --noauth call --json getBuild cluster-logging-operator-metadata-container-v4.6.0.202104061129.p0-1
opm index add --bundles registry-proxy.engineering.redhat.com/rh-osbs/openshift-ose-cluster-logging-operator-bundle@sha256:a28ea61e6202e9bec77bfe4382b622f124bcf5836f3081864b3ff2f638697345,registry-proxy.engineering.redhat.com/rh-osbs/openshift-ose-cluster-logging-operator-bundle@sha256:5a45df7acd5580786c985b037ab864ee93d52dd5013c863cc59aaf6c98c6dc73 --tag quay.io/openshifttest/cluster-logging-index:v1 --mode semver -c podman
podman push quay.io/openshifttest/cluster-logging-index:v1

Second, install and upgrade it several times:

[root@preserve-olm-env 1934080]# cat og1.yaml
kind: OperatorGroup
apiVersion: operators.coreos.com/v1
metadata:
  name: og1
  namespace: default
spec:
  targetNamespaces:
  - default
[root@preserve-olm-env 1934080]# cat catsrc.yaml
apiVersion: operators.coreos.com/v1alpha1
kind: CatalogSource
metadata:
  name: logging
  namespace: default
spec:
  displayName: "logging Operators"
  image: quay.io/openshifttest/cluster-logging-index:v1
  publisher: QE
  sourceType: grpc
[root@preserve-olm-env 1934080]# cat sub.yaml
apiVersion: operators.coreos.com/v1alpha1
kind: Subscription
metadata:
  name: logging
  namespace: default
spec:
  source: logging
  sourceNamespace: default
  channel: "4.6"
  installPlanApproval: Automatic
  name: cluster-logging
  startingCSV: clusterlogging.4.6.0-202104061129.p0
[root@preserve-olm-env 1934080]# oc apply -f og1.yaml
operatorgroup.operators.coreos.com/og1 created
[root@preserve-olm-env 1934080]# oc apply -f catsrc.yaml
catalogsource.operators.coreos.com/logging created
[root@preserve-olm-env 1934080]# oc apply -f sub.yaml
subscription.operators.coreos.com/logging created
[root@preserve-olm-env 1934080]# oc get ip
NAME            CSV                                    APPROVAL    APPROVED
install-q6h6v   clusterlogging.4.6.0-202104061129.p0   Automatic   true
install-wtnhn   clusterlogging.4.6.0-202104091041.p0   Automatic   true
[root@preserve-olm-env 1934080]# oc get csv
NAME                                   DISPLAY           VERSION                 REPLACES                               PHASE
clusterlogging.4.6.0-202104091041.p0   Cluster Logging   4.6.0-202104091041.p0   clusterlogging.4.6.0-202104061129.p0   Succeeded
[root@preserve-olm-env 1934080]# oc delete sub logging
subscription.operators.coreos.com "logging" deleted
[root@preserve-olm-env 1934080]# oc delete csv clusterlogging.4.6.0-202104091041.p0
clusterserviceversion.operators.coreos.com "clusterlogging.4.6.0-202104091041.p0" deleted
[root@preserve-olm-env 1934080]# oc delete og og1
operatorgroup.operators.coreos.com "og1" deleted
[root@preserve-olm-env 1934080]# oc delete catsrc logging
catalogsource.operators.coreos.com "logging" deleted
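
Because the original failure was a race that did not occur on every upgrade, repeating the install/upgrade/cleanup cycle above increases the chance of exercising it. A rough loop along these lines could be used (a sketch only; the iteration count and sleep are arbitrary, and the file and resource names are the ones shown above):

for i in $(seq 1 10); do
  oc apply -f og1.yaml -f catsrc.yaml -f sub.yaml
  sleep 300   # give the automatic upgrade to the newer CSV time to finish
  oc get csv -n default
  oc delete -f sub.yaml
  oc delete csv clusterlogging.4.6.0-202104091041.p0 -n default
  oc delete -f catsrc.yaml -f og1.yaml
done
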
"og1" deleted [root@preserve-olm-env 1934080]# oc delete catsrc logging catalogsource.operators.coreos.com "logging" deleted -- *** Bug 1924970 has been marked as a duplicate of this bug. *** Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.8.2 bug fix and security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2021:2438 |
Description of problem:
In an OCP 4.6 cluster an automatic upgrade of the clusterlogging operator failed. Both the old and the new CSV are now stuck in Phase: Pending.

oc get csv
NAME                                           DISPLAY                            VERSION                 REPLACES                                       PHASE
clusterlogging.4.6.0-202101230113.p0           Cluster Logging                    4.6.0-202101230113.p0   clusterlogging.4.6.0-202101162152.p0           Pending
clusterlogging.4.6.0-202101301510.p0           Cluster Logging                    4.6.0-202101301510.p0   clusterlogging.4.6.0-202101230113.p0           Pending
elasticsearch-operator.4.6.0-202101300140.p0   OpenShift Elasticsearch Operator   4.6.0-202101300140.p0   elasticsearch-operator.4.6.0-202101230113.p0   Succeeded

When checking the 202101301510 CSV, it does not really say why:

  replaces: clusterlogging.4.6.0-202101230113.p0
  version: 4.6.0-202101301510.p0
status:
  conditions:
  - lastTransitionTime: "2021-02-08T15:26:55Z"
    lastUpdateTime: "2021-02-08T15:26:55Z"
    message: requirements not yet checked
    phase: Pending
    reason: RequirementsUnknown

But when I check the 202101230113 CSV, it states:

  - lastTransitionTime: "2021-02-04T08:13:20Z"
    lastUpdateTime: "2021-02-04T08:13:20Z"
    message: install strategy completed with no errors
    phase: Succeeded
    reason: InstallSucceeded
  - lastTransitionTime: "2021-02-08T15:26:45Z"   <------------- Started failing
    lastUpdateTime: "2021-02-08T15:26:45Z"
    message: requirements no longer met
    phase: Failed
    reason: RequirementsNotMet
  - lastTransitionTime: "2021-02-08T15:26:50Z"
    lastUpdateTime: "2021-02-08T15:26:50Z"
    message: requirements not met
    phase: Pending
    reason: RequirementsNotMet
  - group: ""
    kind: ServiceAccount
    message: Service account is not owned by this ClusterServiceVersion
    name: cluster-logging-operator
    status: PresentNotSatisfied
    version: v1

And when I check the SA:

    manager: olm
    operation: Update
    time: "2021-02-08T15:26:46Z"
  name: cluster-logging-operator
  namespace: openshift-logging
  ownerReferences:
  - apiVersion: operators.coreos.com/v1alpha1
    blockOwnerDeletion: false
    controller: false
    kind: ClusterServiceVersion
    name: clusterlogging.4.6.0-202101301510.p0
    uid: 58956251-875e-49b2-b11d-526d098c0258

So OLM changed the ownerReference on the SA to clusterlogging.4.6.0-202101301510.p0 before clusterlogging.4.6.0-202101230113.p0 was uninstalled/removed. This upset clusterlogging.4.6.0-202101230113.p0 and everything is now stuck in a deadlock. It is possible to work around this by manually changing the ownerReference on the SA, but the customer would like to understand why this is happening. It has happened before with a different operator.

Version-Release number of selected component (if applicable):
OCP 4.6

How reproducible:
It has happened with different operators, so it does not seem to be specific to clusterlogging, but it cannot be reproduced on demand.

Steps to Reproduce:
1.
2.
3.

Actual results:

Expected results:

Additional info:
Split off from https://bugzilla.redhat.com/show_bug.cgi?id=1924970
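
The workaround mentioned above (manually fixing the ownerReference on the ServiceAccount) can be applied, for example, by adding an owner reference for the old CSV back onto the ServiceAccount, which mirrors what the eventual fix does automatically. A minimal sketch, assuming the object names from this bug and that the old CSV still exists so its UID can be read:

OLD_CSV=clusterlogging.4.6.0-202101230113.p0
OLD_UID=$(oc -n openshift-logging get csv "$OLD_CSV" -o jsonpath='{.metadata.uid}')
# assumes .metadata.ownerReferences already exists on the SA (in this bug it points at the new CSV)
oc -n openshift-logging patch serviceaccount cluster-logging-operator --type=json -p "[
  {\"op\": \"add\", \"path\": \"/metadata/ownerReferences/-\",
   \"value\": {\"apiVersion\": \"operators.coreos.com/v1alpha1\",
               \"kind\": \"ClusterServiceVersion\",
               \"name\": \"${OLD_CSV}\",
               \"uid\": \"${OLD_UID}\",
               \"blockOwnerDeletion\": false,
               \"controller\": false}}
]"

Once the old CSV's ServiceAccount requirement is satisfied again, the expectation is that the upgrade can proceed and OLM removes the old CSV as part of the normal replacement flow.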