Bug 1905299 - OLM fails to update operator
Summary: OLM fails to update operator
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: OLM
Version: 4.5
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: medium
Target Milestone: ---
Target Release: 4.7.0
Assignee: Vu Dinh
QA Contact: kuiwang
URL:
Whiteboard:
Depends On:
Blocks: 1907586
 
Reported: 2020-12-08 01:32 UTC by Alexey Kazakov
Modified: 2021-02-24 15:40 UTC
CC: 5 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Cause: Previously, Operator updates could result in Operator pods being deployed before a new service account was created. Consequence: The pod could be deployed by using the existing service account and would fail to start with insufficient permissions. Fix: A check has been added to verify that a new service account exists before the cluster service version (CSV) is moved from a `Pending` to `Installing` state. Result: If a new service account does not exist, the CSV remains in a `Pending` state which prevents the deployment from being updated.
Clone Of:
Environment:
Last Closed: 2021-02-24 15:40:33 UTC
Target Upstream Version:
Embargoed:




Links
- Github operator-framework/operator-lifecycle-manager pull 1904 (closed): Bug 1905299: fix(olm): Verify ServiceAccount ownership before installing deployment (last updated 2021-02-16 17:07:42 UTC)
- Red Hat Product Errata RHSA-2020:5633 (last updated 2021-02-24 15:40:50 UTC)
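The pull request above verifies ServiceAccount ownership before the deployment is installed (per the Doc Text: the CSV stays in Pending until the new SA exists). A rough way to check the same condition by hand from the CLI; the SA name and namespace are placeholders, and the jsonpath query only illustrates the condition, it is not OLM's actual logic:

# Does the operator's ServiceAccount exist, and which CSV owns it?
oc get sa <operator-sa> -n <namespace> \
  -o jsonpath='{.metadata.ownerReferences[?(@.kind=="ClusterServiceVersion")].name}'
# A NotFound error, or the old CSV's name, means the SA for the new CSV is not
# in place yet; with the fix, OLM keeps the CSV in Pending instead of rolling
# out the deployment.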

Description Alexey Kazakov 2020-12-08 01:32:14 UTC
Description of problem:

We have an operator that has been deployed and automatically updated on our OCP cluster for more than a year. However, since Dec 4 every update of the operator results in a missing SA token, which breaks the deployment.

The operator doesn't go through OperatorHub; instead, we create a CatalogSource in the respective namespace and let OLM install/update the operator that way.
Basically we install the operator by applying https://github.com/codeready-toolchain/toolchain-infra/blob/master/config/operator_deploy.yaml


Version-Release number of selected component (if applicable): 4.4.20


How reproducible:
We have two clusters with identical operator installations. One is OCP 4.4.20, which we installed more than a year ago and have kept updating since. We can now reproduce this issue every time on this cluster (not sure whether it started happening after we updated it to 4.4.20). We didn't have this issue on this cluster before.

The other cluster is OSD 4.4.16, which we installed about a month ago. We can't reproduce this issue on that cluster.


Steps to Reproduce:

1. Install a previous version of the operator: https://quay.io/repository/codeready-toolchain/hosted-toolchain-index?tab=tags
For reference, this is how we install the latest one:
- git clone https://github.com/codeready-toolchain/toolchain-infra 
- cd toolchain-infra/config
- NAME=member-operator OPERATOR_NAME=toolchain-member-operator NAMESPACE=toolchain-member-operator envsubst < ./operator_deploy.yaml | oc apply -f -
2. Update the operator to the latest version

Actual results:

The operator is updated, but the deployment is crash-looping (truncated log):
(*zapLogger).Error\n\t/tmp/go/pkg/mod/github.com/go-logr/zapr.1/zapr.go:128\nmain.main\n\t/source/go/src/github.com/codeready-toolchain/host-operator/cmd/manager/main.go:92\nruntime.main\n\t/usr/local/go/src/runtime/proc.go:203"}

Deleting the pod or re-installing the operator helps.
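For reference, this is roughly how we inspect the failure from the CLI (pod name and namespace are placeholders for our actual ones):

# Confirm the crash loop, then read the log of the previous (failed) container
oc get pods -n <operator-namespace>
oc logs <failed-pod-name> -n <operator-namespace> --previous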



Expected results:

The operator pod starts successfully.



Additional info:
It fails as if the SA were not present, but it is. Vu Dinh was able to reproduce the issue and suspects that the SA is created after the deployment.

We update the operator quite often, at least a few times a week for the last year or so. Not sure if that plays any role here. OLM seems to create many resources (config maps, secrets, etc.) for every update and never cleans them up. Maybe that contributes to this timing issue somehow.
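A quick way to check the suspected ordering, as a diagnostic sketch (resource names are placeholders):

# If the pod's creation timestamp predates the SA's, the deployment was rolled
# out before the new SA (and its token) existed.
oc get sa <operator-sa> -n <operator-namespace> -o jsonpath='{.metadata.creationTimestamp}'
oc get pod <operator-pod> -n <operator-namespace> -o jsonpath='{.metadata.creationTimestamp}'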

Comment 1 Vu Dinh 2020-12-08 03:11:07 UTC
This is quite a strange scenario. For some reason, the new SA is created after the new deployment pod has already started spinning up. If you delete the failed pod, the ReplicaSet will spin up a new pod, and it will succeed.
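In shell terms, the workaround looks like this (pod name and namespace are placeholders):

# Delete the failed pod; the ReplicaSet immediately creates a replacement,
# which starts cleanly because the new SA exists by that point.
oc delete pod <failed-pod-name> -n <operator-namespace>
oc get pods -n <operator-namespace> -w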

Comment 2 Matous Jobanek 2020-12-08 14:18:53 UTC
Hi @vdinh 
Thanks a lot for looking into this issue. Is there any progress on it? As you can imagine, we cannot manually delete the failed pod for every operator update.

Comment 3 Matous Jobanek 2020-12-10 09:52:15 UTC
This issue is becoming critical: it also affects our production OSD cluster (OpenShift version 4.5.16).

@vdinh could you please give me any update on this?

Comment 4 Steve Gutz 2020-12-10 16:25:54 UTC
Since this is holding up a release and is going to impact revenue unless it is resolved, can we get a bit more insight into the problem and when it is going to be fixed? If we're not asking the right question, then please let us know what the right question is.

Comment 6 kuiwang 2020-12-15 03:32:31 UTC
Verified on 4.7. LGTM.

--
[root@preserve-olm-env 1905299]# oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.7.0-0.nightly-2020-12-14-165231   True        False         93m     Cluster version is 4.7.0-0.nightly-2020-12-14-165231
[root@preserve-olm-env 1905299]# oc get pod -n openshift-operator-lifecycle-manager
NAME                                READY   STATUS    RESTARTS   AGE
catalog-operator-66cf979978-k58km   1/1     Running   0          89m
olm-operator-55d756959-v9vzn        1/1     Running   0          89m
packageserver-597d7f4fb-jckjw       1/1     Running   0          89m
packageserver-597d7f4fb-kgtsj       1/1     Running   0          90m
[root@preserve-olm-env 1905299]# oc exec catalog-operator-66cf979978-k58km -n openshift-operator-lifecycle-manager -- olm --version
OLM version: 0.17.0
git commit: 4b66803055a8ab611447c33ed86e755ad39cb313
[root@preserve-olm-env 1905299]# 

[root@preserve-olm-env 1905299]# cat og-single.yaml 
kind: OperatorGroup
apiVersion: operators.coreos.com/v1
metadata:
  name: og-single1
  namespace: default
spec:
  targetNamespaces:
  - default
[root@preserve-olm-env 1905299]# oc apply -f og-single.yaml 
operatorgroup.operators.coreos.com/og-single1 created
[root@preserve-olm-env 1905299]# cat catsrc.yaml 
apiVersion: operators.coreos.com/v1alpha1
kind: CatalogSource
metadata:
  labels:
    opsrc-provider: codeready-toolchain
  name: hosted-toolchain-operators
  namespace: default
spec:
  sourceType: grpc
  image: quay.io/codeready-toolchain/hosted-toolchain-index:latest
  displayName: Hosted Toolchain Operators
  updateStrategy:
    registryPoll:
      interval: 5m
[root@preserve-olm-env 1905299]# oc apply -f catsrc.yaml 
catalogsource.operators.coreos.com/hosted-toolchain-operators created
[root@preserve-olm-env 1905299]# 

[root@preserve-olm-env 1905299]# cat sub1.yaml 
apiVersion: operators.coreos.com/v1alpha1
kind: Subscription
metadata:
  name: host-operator
  namespace: default
spec:
  channel: staging
  installPlanApproval: Automatic
  name: toolchain-host-operator
  source: hosted-toolchain-operators
  sourceNamespace: default
  startingCSV: toolchain-host-operator.v0.0.302-134-commit-3f1ed73-e1d3119
[root@preserve-olm-env 1905299]# oc apply -f sub1.yaml 
subscription.operators.coreos.com/host-operator created
[root@preserve-olm-env 1905299]# 

[root@preserve-olm-env 1905299]# oc get sub
NAME            PACKAGE                   SOURCE                       CHANNEL
host-operator   toolchain-host-operator   hosted-toolchain-operators   staging
[root@preserve-olm-env 1905299]# oc get ip
NAME            CSV                                                           APPROVAL    APPROVED
install-cttrg   toolchain-host-operator.v0.0.302-134-commit-3f1ed73-e1d3119   Automatic   true
install-hlgmw   toolchain-host-operator.v0.0.303-134-commit-a512840-e1d3119   Automatic   true
install-q8m5s   toolchain-host-operator.v0.0.304-134-commit-7723fcf-e1d3119   Automatic   true
[root@preserve-olm-env 1905299]# oc get csv
NAME                                                          DISPLAY                   VERSION                              REPLACES                                                      PHASE
toolchain-host-operator.v0.0.303-134-commit-a512840-e1d3119   Toolchain Host Operator   0.0.303-134-commit-a512840-e1d3119   toolchain-host-operator.v0.0.302-134-commit-3f1ed73-e1d3119   Replacing
toolchain-host-operator.v0.0.304-134-commit-7723fcf-e1d3119   Toolchain Host Operator   0.0.304-134-commit-7723fcf-e1d3119   toolchain-host-operator.v0.0.303-134-commit-a512840-e1d3119   Installing
[root@preserve-olm-env 1905299]# oc get ip
NAME            CSV                                                           APPROVAL    APPROVED
install-262fz   toolchain-host-operator.v0.0.306-135-commit-c3ceb05-f0f86eb   Automatic   true
install-6g84v   toolchain-host-operator.v0.0.308-136-commit-ab38d4a-386dc5d   Automatic   true
install-jh5pk   toolchain-host-operator.v0.0.305-135-commit-aca313a-f0f86eb   Automatic   true
install-pvzwp   toolchain-host-operator.v0.0.307-136-commit-74f7fad-386dc5d   Automatic   true
install-xvvlg   toolchain-host-operator.v0.0.306-136-commit-c3ceb05-386dc5d   Automatic   true
[root@preserve-olm-env 1905299]# oc get csv
NAME                                                          DISPLAY                   VERSION                              REPLACES                                                      PHASE
toolchain-host-operator.v0.0.307-136-commit-74f7fad-386dc5d   Toolchain Host Operator   0.0.307-136-commit-74f7fad-386dc5d   toolchain-host-operator.v0.0.306-136-commit-c3ceb05-386dc5d   Replacing
toolchain-host-operator.v0.0.308-136-commit-ab38d4a-386dc5d   Toolchain Host Operator   0.0.308-136-commit-ab38d4a-386dc5d   toolchain-host-operator.v0.0.307-136-commit-74f7fad-386dc5d   Installing
[root@preserve-olm-env 1905299]# oc get ip
NAME            CSV                                                           APPROVAL    APPROVED
install-6h4vl   toolchain-host-operator.v0.0.314-140-commit-5c442dc-633c7ba   Automatic   true
install-j2hxx   toolchain-host-operator.v0.0.313-140-commit-a1632a7-633c7ba   Automatic   true
install-lrmvt   toolchain-host-operator.v0.0.316-140-commit-05b62d3-633c7ba   Automatic   true
install-mfn4x   toolchain-host-operator.v0.0.315-140-commit-8e834dc-633c7ba   Automatic   true
install-r5c24   toolchain-host-operator.v0.0.316-141-commit-05b62d3-a2ed2a7   Automatic   true
[root@preserve-olm-env 1905299]# oc get csv
NAME                                                          DISPLAY                   VERSION                              REPLACES                                                      PHASE
toolchain-host-operator.v0.0.316-141-commit-05b62d3-a2ed2a7   Toolchain Host Operator   0.0.316-141-commit-05b62d3-a2ed2a7   toolchain-host-operator.v0.0.316-140-commit-05b62d3-633c7ba   Succeeded
--

Comment 7 Alexey Kazakov 2020-12-15 05:12:22 UTC
Any chance to backport it to 4.6?

Comment 12 errata-xmlrpc 2021-02-24 15:40:33 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.7.0 security, bug fix, and enhancement update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2020:5633

