Bug 1896102

Summary: OLM not updating operator to the next version due to a stuck installplan in the "UpgradePending" state
Product: OpenShift Container Platform Reporter: James Harrington <jaharrin>
Component: OLM Assignee: Evan Cordell <ecordell>
OLM sub component: OLM QA Contact: Jian Zhang <jiazha>
Status: CLOSED DUPLICATE Docs Contact:
Severity: unspecified    
Priority: unspecified CC: bjarolim, bmontgom, krizza, lseelye, mmazur, nraghava, openshift-bugs-escalate, scuppett
Version: 4.5 Keywords: ServiceDeliveryImpact
Target Milestone: ---   
Target Release: ---   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2020-11-09 21:43:56 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:

Description James Harrington 2020-11-09 18:13:21 UTC
Description of problem:

OLM isn't upgrading the cloud-ingress-operator operator on the cluster to the latest version. The subscription status shows `currentCSV` as "cloud-ingress-operator.v0.1.175-e727583"; however, the CSV on the cluster is "cloud-ingress-operator.v0.1.177-8cad995".

The subscription status shows that the installplan install-spvzh for CSV version cloud-ingress-operator.v0.1.175-e727583 is in the "UpgradePending" state.

Looking at the installplans on the cluster, we see that install-f9mcs, which is for CSV version cloud-ingress-operator.v0.1.177-8cad995, was approved and installed, as was install-spvzh for cloud-ingress-operator.v0.1.175-e727583.
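The mismatch described above (a Subscription whose `currentCSV` is not the CSV actually on the cluster) can be checked with a small helper. This is only a sketch: the function name `check_subscription_drift` is made up for illustration, and the Subscription name is a placeholder because this report does not state it. It assumes the usual OLM Subscription status fields `currentCSV` and `state`.

```shell
# check_subscription_drift NAMESPACE SUBSCRIPTION
# Hypothetical helper: prints the Subscription's currentCSV and state, then
# reports whether a CSV with that name exists in the namespace.
# SUBSCRIPTION is a placeholder; the report does not give the real name.
check_subscription_drift() {
  ns="$1"
  sub="$2"
  current=$(oc get subscriptions.operators.coreos.com "$sub" -n "$ns" -o jsonpath='{.status.currentCSV}')
  state=$(oc get subscriptions.operators.coreos.com "$sub" -n "$ns" -o jsonpath='{.status.state}')
  on_cluster=$(oc get csv -n "$ns" -o jsonpath='{.items[*].metadata.name}')

  echo "subscription currentCSV: $current (state: $state)"
  echo "CSVs on cluster: $on_cluster"

  # jsonpath joins the CSV names with spaces, so word-split and compare each.
  for name in $on_cluster; do
    if [ "$name" = "$current" ]; then
      echo "OK: currentCSV is present on cluster"
      return 0
    fi
  done
  echo "DRIFT: currentCSV is not among the CSVs on cluster"
  return 1
}
```

Called as `check_subscription_drift openshift-cloud-ingress-operator <subscription-name>`, it returns non-zero when the Subscription points at a CSV that is not installed, which is the situation reported here.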


$ oc get ip -n openshift-cloud-ingress-operator install-spvzh -o json | jq '.status | "\(.conditions) \(.phase)"'

"[{\"lastTransitionTime\":\"2020-06-19T15:20:38Z\",\"lastUpdateTime\":\"2020-06-19T15:20:38Z\",\"status\":\"True\",\"type\":\"Installed\"}] Complete"

$ oc get ip -n openshift-cloud-ingress-operator install-f9mcs -o json | jq '.status | "\(.conditions) \(.phase)"'

"[{\"lastTransitionTime\":\"2020-06-19T15:20:44Z\",\"lastUpdateTime\":\"2020-06-19T15:20:44Z\",\"status\":\"True\",\"type\":\"Installed\"}] Complete"

Please NOTE:

Full disclosure: this operator's CSV references a CRD that it doesn't deploy. That CRD is present on the cluster and the requirements for the CSV are satisfied. We are fixing this.


Version-Release number of selected component (if applicable):

oc get pods -n openshift-operator-lifecycle-manager -o json | jq '.items[].spec.containers[].image'
"quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:d97a825602c5285fc6847534aeb7ead1b99059b709c513ad806686b52e27d2b4"
"quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:d97a825602c5285fc6847534aeb7ead1b99059b709c513ad806686b52e27d2b4"
"quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:d97a825602c5285fc6847534aeb7ead1b99059b709c513ad806686b52e27d2b4"
"quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:d97a825602c5285fc6847534aeb7ead1b99059b709c513ad806686b52e27d2b4"


How reproducible:

Not every time; unable to reproduce reliably at the moment.

Steps to Reproduce:
1.
2.
3.

Actual results:

OLM appears to be stuck and cannot move the installplan install-spvzh into an AtLatestKnown state

Expected results:

OLM to upgrade the cloud-ingress-operator

Additional info:

The catalog pod is returning a newer version, cloud-ingress-operator.v0.1.179-ae0b008, which replaces cloud-ingress-operator.v0.1.177-8cad995:

oc run grpcurl-query -n openshift-operator-lifecycle-manager --rm=true  --restart=Never --attach=true --image=quay.io/rogbas/grpcurl -- -plaintext 10.204.132.87:50051  api.Registry/ListBundles | jq -c '. |select(.replaces|match("8cad995"))| {packageName,csvName,channelName,replaces}'
{"packageName":"cloud-ingress-operator","csvName":"cloud-ingress-operator.v0.1.179-ae0b008","channelName":"production","replaces":"cloud-ingress-operator.v0.1.177-8cad995"}

Install plans on cluster 

oc get ip -n openshift-cloud-ingress-operator
NAME            CSV                                       APPROVAL    APPROVED
install-f9mcs   cloud-ingress-operator.v0.1.177-8cad995   Automatic   true
install-mxw2p   cloud-ingress-operator.v0.1.172-64a442f   Automatic   true
install-spvzh   cloud-ingress-operator.v0.1.175-e727583   Automatic   true
install-v7fnl   cloud-ingress-operator.v0.1.174-184d837   Automatic   true
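The default `oc get ip` columns above do not show each plan's phase. A rough way to pull the name and phase of the install plan(s) for one CSV is sketched below. The function name `installplans_for_csv` is hypothetical, and the sketch assumes the InstallPlan fields `.spec.clusterServiceVersionNames` and `.status.phase`; it only looks at the first CSV listed in each plan.

```shell
# installplans_for_csv NAMESPACE CSV_NAME
# Hypothetical helper: prints "NAME<TAB>PHASE" for every InstallPlan whose
# first listed CSV matches CSV_NAME.
installplans_for_csv() {
  ns="$1"
  csv="$2"
  tab=$(printf '\t')
  oc get installplan -n "$ns" \
    -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.clusterServiceVersionNames[0]}{"\t"}{.status.phase}{"\n"}{end}' |
  while IFS="$tab" read -r name plan_csv phase; do
    # Keep only the rows for the requested CSV.
    if [ "$plan_csv" = "$csv" ]; then
      printf '%s\t%s\n' "$name" "$phase"
    fi
  done
}
```

For example, `installplans_for_csv openshift-cloud-ingress-operator cloud-ingress-operator.v0.1.175-e727583` would list the plan that the Subscription still references.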

CSV on cluster 

oc get csv -n openshift-cloud-ingress-operator
NAME                                               DISPLAY                           VERSION           REPLACES                                           PHASE
cloud-ingress-operator.v0.1.177-8cad995            cloud-ingress-operator   

CSV interesting metadata:

oc get csv cloud-ingress-operator.v0.1.177-8cad995 -n openshift-cloud-ingress-operator -o json | jq '.status.requirementStatus[] | "\(.name) \(.kind) \(.status)"' 
"subjectpermissions.managed.openshift.io CustomResourceDefinition Present"
"cloud-ingress-operator ServiceAccount Present"

oc get csv cloud-ingress-operator.v0.1.177-8cad995 -n openshift-cloud-ingress-operator -o json | jq '.status.requirementStatus[] | select(.dependents!=null) | .dependents[] | "\(.kind) \(.status)"'
"PolicyRule Satisfied"
"PolicyRule Satisfied"
"PolicyRule Satisfied"
"PolicyRule Satisfied"
"PolicyRule Satisfied"
"PolicyRule Satisfied"
"PolicyRule Satisfied"
"PolicyRule Satisfied"
"PolicyRule Satisfied"
"PolicyRule Satisfied"
"PolicyRule Satisfied"

Comment 3 Kevin Rizza 2020-11-09 21:43:56 UTC
Based on our investigation, I'm going to close this out as a duplicate of https://bugzilla.redhat.com/show_bug.cgi?id=1860185

While bug 1860185 is already resolved, it appears that the problem state was triggered during an install on a previous version. A reinstall will resolve this problem, and it won't be encountered again, because the cluster's current OCP version now includes the fix.

*** This bug has been marked as a duplicate of bug 1860185 ***
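
Comment 3 does not spell out what the reinstall involves. One common reading for an OLM-managed operator is to delete the Subscription and the CSV and then recreate the Subscription. The sketch below only prints the commands for review and does not run them. The function name `print_reinstall_steps`, the Subscription name, and the manifest path are all placeholders, not values taken from this report.

```shell
# print_reinstall_steps NAMESPACE SUBSCRIPTION CSV MANIFEST
# Hypothetical helper: prints (does not execute) one possible reinstall
# sequence. SUBSCRIPTION and MANIFEST are placeholders supplied by the caller.
print_reinstall_steps() {
  ns="$1"
  sub="$2"
  csv="$3"
  manifest="$4"
  # Remove the Subscription first so OLM stops reconciling the operator.
  echo "oc delete subscriptions.operators.coreos.com $sub -n $ns"
  # Remove the installed CSV, which removes the operator deployment.
  echo "oc delete csv $csv -n $ns"
  # Recreate the Subscription from the caller's own manifest.
  echo "oc apply -f $manifest"
}
```

Printing rather than executing keeps the destructive steps under the operator's control; how the Subscription is recreated depends on how the operator was deployed in the first place.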