Description of problem:
We have an operator that has been deployed and automatically updated on our OCP cluster for more than a year. Since Dec 4, however, every update of the operator ends up with a missing ServiceAccount token, which leaves the deployment broken. The operator doesn't go through OperatorHub; instead we create a CatalogSource in the respective namespace and let OLM install/update the operator from it. Basically, we install the operator by applying https://github.com/codeready-toolchain/toolchain-infra/blob/master/config/operator_deploy.yaml

Version-Release number of selected component (if applicable):
4.4.20

How reproducible:
We have two clusters with an identical operator installation. One is OCP 4.4.20, which we installed more than a year ago and have kept updating (it is now on 4.4.20). We can reproduce this issue every time on this cluster (not sure whether it started only after we updated it to 4.4.20); we didn't have this issue on it before. The other cluster is OSD 4.4.16, which we installed a month or so ago. We can't reproduce the issue on that cluster.

Steps to Reproduce:
1. Install some previous version of the operator: https://quay.io/repository/codeready-toolchain/hosted-toolchain-index?tab=tags
   For reference, this is how we install the latest one:
   - git clone https://github.com/codeready-toolchain/toolchain-infra
   - cd toolchain-infra/config
   - NAME=member-operator OPERATOR_NAME=toolchain-member-operator NAMESPACE=toolchain-member-operator envsubst < ./operator_deploy.yaml | oc apply -f -
2. Update the operator to the latest version.

Actual results:
The operator is updated but the deployment is crash looping:

(*zapLogger).Error\n\t/tmp/go/pkg/mod/github.com/go-logr/zapr.1/zapr.go:128\nmain.main\n\t/source/go/src/github.com/codeready-toolchain/host-operator/cmd/manager/main.go:92\nruntime.main\n\t/usr/local/go/src/runtime/proc.go:203"}

Deleting the pod or re-installing the operator helps.

Expected results:
The operator pod starts successfully.

Additional info:
It fails as if the SA were not present, but it is. Vu Dinh was able to reproduce this and guesses that the SA is created after the deployment. We update the operator pretty often, a few times a week at least for the last year or so; not sure if that plays any role here. OLM seems to create many resources (config maps, secrets, etc.) for every update and never cleans them up. Maybe that contributes to this timing issue somehow.
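For anyone who wants to confirm the suspected ordering on their own cluster, comparing the ServiceAccount's creationTimestamp with the failing pod's startTime should show whether the SA really appeared after the pod. This is only a sketch; the ServiceAccount name and pod label below are assumptions based on our member-operator install, so adjust them to your environment:

# when the SA was created (hypothetical SA name "member-operator")
oc -n toolchain-member-operator get sa member-operator -o jsonpath='{.metadata.creationTimestamp}{"\n"}'
# when the operator pod(s) started (hypothetical label "name=member-operator")
oc -n toolchain-member-operator get pod -l name=member-operator -o jsonpath='{range .items[*]}{.metadata.name}{" "}{.status.startTime}{"\n"}{end}'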
This is quite a strange scenario. It seems for some reason the new SA is being created after the new deployment pod is already spinning up. If you delete the failed pod, the ReplicaSet will spin up a new pod and it will succeed.
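Until the ordering is fixed, that workaround can at least be scripted so nobody has to do it by hand after each update. A minimal sketch, assuming the operator pod carries an operator-sdk-style label name=member-operator (verify the actual label with --show-labels and adjust the namespace):

# delete the crash-looping operator pod; the owning ReplicaSet recreates it,
# and by then the new ServiceAccount exists, so the replacement pod starts fine
oc -n toolchain-member-operator delete pod -l name=member-operator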
Hi @vdinh, thanks a lot for looking into this issue. Is there any progress on it? As you can imagine, we cannot manually delete the failed pod after every operator update.
This issue is becoming critical: it also affects our production OSD cluster (OpenShift version 4.5.16). @vdinh, could you please give me an update on this?
Since this is holding up a release and will impact revenue unless it gets resolved, can we get a bit more insight into the problem and when it is going to be fixed? If we're not asking the right question, please let us know what the right question is.
Verified on 4.7. LGTM.
--
[root@preserve-olm-env 1905299]# oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.7.0-0.nightly-2020-12-14-165231   True        False         93m     Cluster version is 4.7.0-0.nightly-2020-12-14-165231

[root@preserve-olm-env 1905299]# oc get pod -n openshift-operator-lifecycle-manager
NAME                                READY   STATUS    RESTARTS   AGE
catalog-operator-66cf979978-k58km   1/1     Running   0          89m
olm-operator-55d756959-v9vzn        1/1     Running   0          89m
packageserver-597d7f4fb-jckjw       1/1     Running   0          89m
packageserver-597d7f4fb-kgtsj       1/1     Running   0          90m

[root@preserve-olm-env 1905299]# oc exec catalog-operator-66cf979978-k58km -n openshift-operator-lifecycle-manager -- olm --version
OLM version: 0.17.0
git commit: 4b66803055a8ab611447c33ed86e755ad39cb313

[root@preserve-olm-env 1905299]# cat og-single.yaml
kind: OperatorGroup
apiVersion: operators.coreos.com/v1
metadata:
  name: og-single1
  namespace: default
spec:
  targetNamespaces:
  - default
[root@preserve-olm-env 1905299]# oc apply -f og-single.yaml
operatorgroup.operators.coreos.com/og-single1 created

[root@preserve-olm-env 1905299]# cat catsrc.yaml
apiVersion: operators.coreos.com/v1alpha1
kind: CatalogSource
metadata:
  labels:
    opsrc-provider: codeready-toolchain
  name: hosted-toolchain-operators
  namespace: default
spec:
  sourceType: grpc
  image: quay.io/codeready-toolchain/hosted-toolchain-index:latest
  displayName: Hosted Toolchain Operators
  updateStrategy:
    registryPoll:
      interval: 5m
[root@preserve-olm-env 1905299]# oc apply -f catsrc.yaml
catalogsource.operators.coreos.com/hosted-toolchain-operators created

[root@preserve-olm-env 1905299]# cat sub1.yaml
apiVersion: operators.coreos.com/v1alpha1
kind: Subscription
metadata:
  name: host-operator
  namespace: default
spec:
  channel: staging
  installPlanApproval: Automatic
  name: toolchain-host-operator
  source: hosted-toolchain-operators
  sourceNamespace: default
  startingCSV: toolchain-host-operator.v0.0.302-134-commit-3f1ed73-e1d3119
[root@preserve-olm-env 1905299]# oc apply -f sub1.yaml
subscription.operators.coreos.com/host-operator created

[root@preserve-olm-env 1905299]# oc get sub
NAME            PACKAGE                   SOURCE                       CHANNEL
host-operator   toolchain-host-operator   hosted-toolchain-operators   staging

[root@preserve-olm-env 1905299]# oc get ip
NAME            CSV                                                            APPROVAL    APPROVED
install-cttrg   toolchain-host-operator.v0.0.302-134-commit-3f1ed73-e1d3119   Automatic   true
install-hlgmw   toolchain-host-operator.v0.0.303-134-commit-a512840-e1d3119   Automatic   true
install-q8m5s   toolchain-host-operator.v0.0.304-134-commit-7723fcf-e1d3119   Automatic   true

[root@preserve-olm-env 1905299]# oc get csv
NAME                                                           DISPLAY                   VERSION                              REPLACES                                                       PHASE
toolchain-host-operator.v0.0.303-134-commit-a512840-e1d3119   Toolchain Host Operator   0.0.303-134-commit-a512840-e1d3119   toolchain-host-operator.v0.0.302-134-commit-3f1ed73-e1d3119   Replacing
toolchain-host-operator.v0.0.304-134-commit-7723fcf-e1d3119   Toolchain Host Operator   0.0.304-134-commit-7723fcf-e1d3119   toolchain-host-operator.v0.0.303-134-commit-a512840-e1d3119   Installing

[root@preserve-olm-env 1905299]# oc get ip
NAME            CSV                                                            APPROVAL    APPROVED
install-262fz   toolchain-host-operator.v0.0.306-135-commit-c3ceb05-f0f86eb   Automatic   true
install-6g84v   toolchain-host-operator.v0.0.308-136-commit-ab38d4a-386dc5d   Automatic   true
install-jh5pk   toolchain-host-operator.v0.0.305-135-commit-aca313a-f0f86eb   Automatic   true
install-pvzwp   toolchain-host-operator.v0.0.307-136-commit-74f7fad-386dc5d   Automatic   true
install-xvvlg   toolchain-host-operator.v0.0.306-136-commit-c3ceb05-386dc5d   Automatic   true

[root@preserve-olm-env 1905299]# oc get csv
NAME                                                           DISPLAY                   VERSION                              REPLACES                                                       PHASE
toolchain-host-operator.v0.0.307-136-commit-74f7fad-386dc5d   Toolchain Host Operator   0.0.307-136-commit-74f7fad-386dc5d   toolchain-host-operator.v0.0.306-136-commit-c3ceb05-386dc5d   Replacing
toolchain-host-operator.v0.0.308-136-commit-ab38d4a-386dc5d   Toolchain Host Operator   0.0.308-136-commit-ab38d4a-386dc5d   toolchain-host-operator.v0.0.307-136-commit-74f7fad-386dc5d   Installing

[root@preserve-olm-env 1905299]# oc get ip
NAME            CSV                                                            APPROVAL    APPROVED
install-6h4vl   toolchain-host-operator.v0.0.314-140-commit-5c442dc-633c7ba   Automatic   true
install-j2hxx   toolchain-host-operator.v0.0.313-140-commit-a1632a7-633c7ba   Automatic   true
install-lrmvt   toolchain-host-operator.v0.0.316-140-commit-05b62d3-633c7ba   Automatic   true
install-mfn4x   toolchain-host-operator.v0.0.315-140-commit-8e834dc-633c7ba   Automatic   true
install-r5c24   toolchain-host-operator.v0.0.316-141-commit-05b62d3-a2ed2a7   Automatic   true

[root@preserve-olm-env 1905299]# oc get csv
NAME                                                           DISPLAY                   VERSION                              REPLACES                                                       PHASE
toolchain-host-operator.v0.0.316-141-commit-05b62d3-a2ed2a7   Toolchain Host Operator   0.0.316-141-commit-05b62d3-a2ed2a7   toolchain-host-operator.v0.0.316-140-commit-05b62d3-633c7ba   Succeeded
--
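For completeness, after the last InstallPlan the original symptom can be double-checked by confirming that the operator pod is Running with no restarts and that its ServiceAccount already exists. A quick sketch of such a check; the pod label name=host-operator is an assumption, verify it with --show-labels:

# list the SA, deployment and pods created by the CSV in the install namespace
oc -n default get sa,deploy,pod
# print restart counts of the operator pod(s) (hypothetical label "name=host-operator")
oc -n default get pod -l name=host-operator -o jsonpath='{range .items[*]}{.metadata.name}{" restarts="}{.status.containerStatuses[0].restartCount}{"\n"}{end}'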
Any chance of backporting this to 4.6?
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.7.0 security, bug fix, and enhancement update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2020:5633