Bug 2058417

Summary: Subscription has olm.generated-by annotation but missing InstallPlanRef and installplan
Product: OpenShift Container Platform Reporter: Neil Girard <ngirard>
Component: OLM Assignee: Per da Silva <pegoncal>
OLM sub component: OLM QA Contact: Jian Zhang <jiazha>
Status: CLOSED NOTABUG Docs Contact:
Severity: high    
Priority: medium CC: anbhatta, nmanos, oarribas, piotr.godowski, tilos, wojciech.polnik
Version: 4.6 Keywords: Triaged
Target Milestone: --- Flags: anbhatta: needinfo-
Target Release: ---   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2022-07-22 15:14:27 UTC Type: Bug
Bug Blocks: 2078543    

Description Neil Girard 2022-02-24 20:54:06 UTC
Description of problem:

Somehow OLM is leaving Subscription objects without an InstallPlanRef and installplan in their status. As a result, future updates of existing operators and installations of new operators are unable to generate install plans.

Looking at the catalog-operator log, you will see:

~~~
2022-02-16T18:02:32.250375793Z time="2022-02-16T18:02:32Z" level=warning msg="unable to get installplan from cache" channel=stable id=7HkBf installplan=install-lhmkp namespace=ibm-common-services pkg=cloud-native-postgresql source=ibm-operator-catalog sub=cloud-native-postgresql-stable-ibm-operator-catalog-openshift-marketplace
~~~
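
The warning above is a run of key=value fields, so the affected subscription, namespace and install plan can be pulled out mechanically. The following is only a minimal sketch: the function name `parse_olm_log_line` is hypothetical, and it assumes quoted values contain no embedded quotes or apostrophes (otherwise `shlex` would fail to split the line).

```python
import shlex


def parse_olm_log_line(line: str) -> dict:
    """Split a key=value style log line into a dict of fields."""
    fields = {}
    for token in shlex.split(line):
        # Tokens without "=" (for example a leading timestamp) are skipped.
        key, sep, value = token.partition("=")
        if sep:
            fields[key] = value
    return fields


# Log line copied from the catalog-operator warning in this report.
line = (
    'time="2022-02-16T18:02:32Z" level=warning '
    'msg="unable to get installplan from cache" channel=stable id=7HkBf '
    'installplan=install-lhmkp namespace=ibm-common-services '
    'pkg=cloud-native-postgresql source=ibm-operator-catalog '
    'sub=cloud-native-postgresql-stable-ibm-operator-catalog-openshift-marketplace'
)
fields = parse_olm_log_line(line)
print(fields["namespace"], fields["sub"], fields["installplan"])
```

Collecting the `namespace` and `sub` values from every such warning gives a list of subscriptions worth inspecting.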

Once this error is cleared (by removal of the annotation "olm.generated-by" from the subscription(s) with missing InstallPlanRef and installplan objects under status) the OLM is able to once again start generating new install plans for updates / new installs.
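
To find candidates for this workaround, the condition described here (annotation present, both install plan references missing from status) can be checked against the output of `oc get subscriptions.operators.coreos.com -A -o json`. This is a rough sketch, not a supported tool: the helper name `find_stuck_subscriptions` and the sample data are hypothetical, and the JSON field names `status.installPlanRef` and `status.installplan` are assumed from the OLM v1alpha1 Subscription API.

```python
ANNOTATION = "olm.generated-by"


def find_stuck_subscriptions(sub_list: dict) -> list:
    """Return (namespace, name) pairs for subscriptions matching the symptom."""
    stuck = []
    for sub in sub_list.get("items", []):
        meta = sub.get("metadata") or {}
        annotations = meta.get("annotations") or {}
        status = sub.get("status") or {}
        has_annotation = ANNOTATION in annotations
        # Assumed field names; both references must be absent or empty.
        missing_refs = not status.get("installPlanRef") and not status.get("installplan")
        if has_annotation and missing_refs:
            stuck.append((meta.get("namespace"), meta.get("name")))
    return stuck


# Hypothetical sample shaped like `oc get subscriptions -o json` output.
sample = {
    "items": [
        {
            "metadata": {
                "name": "stuck-sub",
                "namespace": "ibm-common-services",
                "annotations": {"olm.generated-by": "install-lhmkp"},
            },
            "status": {},
        },
        {
            "metadata": {
                "name": "healthy-sub",
                "namespace": "ibm-common-services",
                "annotations": {"olm.generated-by": "install-abcde"},
            },
            "status": {
                "installPlanRef": {"name": "install-abcde"},
                "installplan": {"name": "install-abcde"},
            },
        },
    ]
}
print(find_stuck_subscriptions(sample))
```

In practice the dict would come from `json.load(sys.stdin)` with the `oc` output piped in; the sample above only illustrates the shape.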

Version-Release number of selected component (if applicable):

OCP 4.6 and above. I've seen it on several OCP versions.

How reproducible:

Difficult to reproduce, but customers seem to hit it in namespaces with many operators installed.

Steps to Reproduce:
1. N/A

Actual results:

Creating a new subscription does not result in an installplan / csv being created.

Expected results:

Creating a subscription results in installplan and csv being created.

Additional info:

Logs are included in attached cases and can be pulled from there.

Comment 2 Noam Manos 2022-04-25 12:26:28 UTC
(In reply to Neil Girard from comment #0)
> 
> Once this error is cleared (by removal of the annotation "olm.generated-by"
> from the subscription(s) with missing InstallPlanRef and installplan objects
> under status) the OLM is able to once again start generating new install
> plans for updates / new installs.


I had a similar issue on OCP 4.8 with other operators. I found that deleting the previous Subscription and CSV from the namespace
of the operator that needed to be upgraded fixed the missing InstallPlan generation:

$ oc delete subs --all -n ocm --wait
subscription.operators.coreos.com "old-operator-subscription" deleted

$ oc delete csv --all -n ocm --wait
clusterserviceversion.operators.coreos.com "old-operator.v2.4.2" deleted

$ oc describe subs/new-operator-subscription -n ocm

  Install Plan Generation:  1
  Install Plan Ref:
    API Version:       operators.coreos.com/v1alpha1
    Kind:              InstallPlan
    Name:              install-mxw59
    Namespace:         ocm
    Resource Version:  215607
    UID:               43162fc6-c91e-4394-a836-1399c7465a11
  Installplan:
    API Version:  operators.coreos.com/v1alpha1
    Kind:         InstallPlan
    Name:         install-mxw59
    Uuid:         43162fc6-c91e-4394-a836-1399c7465a11
  Last Updated:   2022-04-25T08:24:08Z
  State:          UpgradePending


# Note:
The most important step was deleting the old operator CSV. Deleting just the old Subscription did not resolve the issue.

Comment 4 Anik 2022-07-22 15:14:27 UTC
From the description in the CC 03262802: 

```
We are trying to update from CP4BA IF007 to IF010 and also changing to the pinned catalog version."
We update the catalog and ran the "./update_subscrpition.sh" from the "ibm-cp-automation-3.2.10.tgz" the operator still only offered the old install plan (IP) 
We removed the IP and now new IP was created. Based on this we would expect a new IP 
Based on this bug https://bugzilla.redhat.com/show_bug.cgi?id=1841175 does OLM create a new install plan once it is deleted. Is it fixed for version 4.8.27
```

It is unclear why the Subscription was updated to begin with (the script update_subscription.sh is not documented in the case). Besides, if updating the catalog (with the expectation that a newer version will be installed as a result) does not create a new InstallPlan, the Subscription status needs to be checked for possible errors/warnings. Deleting the existing InstallPlan was the wrong step to take (it is not documented anywhere), and it is what broke the Subscription. As a result, reinstalling the operator was what fixed the issue.


As a final clarification, the fix[1] for bz 1841175 was to recreate the InstallPlan, IF the InstallPlan was deleted BEFORE approval. Since the customer in this case was expecting an upgrade of their operator, my guess is that the InstallPlan was already approved (possibly with `Automatic` approval). So that fix is irrelevant to this case. Besides, that fix was included in OCP 4.7.z, which means it is included in 4.8.27.

Since the issue was encountered due to the wrong steps taken, closing this bug as NOTABUG.


[1] https://github.com/operator-framework/operator-lifecycle-manager/pull/1874

Comment 5 Tilo 2022-08-15 17:54:05 UTC
Hi @anbhatta 

There was a TON of information lost in the redirect from IBM support team A to team B to RH support team Z. Let me try to feed this in.

"./update_subscription.sh" is packaged here:
https://github.com/IBM/cloud-pak/blob/master/repo/case/ibm-cp-automation/3.2.10/ibm-cp-automation-3.2.10.tgz

wget -nv https://github.com/IBM/cloud-pak/raw/master/repo/case/ibm-cp-automation/3.2.10/ibm-cp-automation-3.2.10.tgz
tar xzf ibm-cp-automation-3.2.10.tgz ; tar xzf ibm-cp-automation/inventory/cp4aOperatorSdk/files/deploy/crs/cert-k8s-*.tar 

find the script here: cert-kubernetes\scripts\update_subscription.sh


The issue happened on OCP 4.8.27 and 4.8.41.

IP Approval was set to manual. 

The situation was that we had an IP for IBM CP4BA IF008 but did not want to install it (we left the IP sitting, as it was on manual approval).
Then IF010 was released and we wanted to take it. IBM also switched to using a pinned catalog (via the above script).
After we applied this, new NS were offered the right version, IF010, but the existing NS did not get a new IP offering IF010.

IBM gave no information on whether the pending IF008 IP should be denied or deleted.

We deleted the IP but still nothing happened. This was due to the annotation "olm.generated-by", which seemed to block OLM from generating a new IP.
After removing the annotation "olm.generated-by" we got a new IP with the right version, IF010, and could install it.

The Subs which had the annotation "olm.generated-by" were created by the "main" operator CP4BA (the above cert-kubernetes\scripts\cp4a-clusteradmin-setup.sh installed the "main" CP4BA operator).

Comment 6 piotr.godowski 2022-12-16 15:29:46 UTC
We hit this issue once again in one of our customers' production environments, while attempting a production upgrade during the year-end holiday season.
It is really putting us in difficult situations with customers, so I am asking the RH OLM team: could the documented recovery procedure be automated?

I do understand the complexity of OLM code to prevent the issue, but can we consider a self-healing solution to this problem, to avoid customer production issues?

The documented recovery procedure I mean is this one: https://access.redhat.com/solutions/6957109

Comment 7 wojciech.polnik 2023-03-31 18:26:25 UTC
Faced the same issue.
In my case the devworkspace-operator-fast-redhat-operators-openshift-marketplace operator was bricked.

E0331 18:17:34.540772       1 queueinformer_operator.go:290] sync "openshift-operators" failed: installplans.operators.coreos.com "install-l24ds" not found
time="2023-03-31T18:17:35Z" level=warning msg="unable to get installplan from cache" channel=fast id=ll1ad installplan=install-l24ds namespace=openshift-operators pkg=devworkspace-operator source=redhat-operators sub=devworkspace-operator-fast-redhat-operators-openshift-marketplace

To get it working, I deleted the olm.generated-by annotation from this operator's subscription.
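
For anyone scripting this step, here is a small sketch that only builds and prints the command rather than running it. The helper name `annotation_removal_command` is hypothetical; it relies on the `oc annotate` convention that a key followed by `-` removes that annotation, and it takes the namespace and subscription name from the log lines above.

```python
import shlex


def annotation_removal_command(namespace: str, name: str,
                               annotation: str = "olm.generated-by") -> list:
    """Build the argv that removes one annotation from a Subscription."""
    return [
        "oc", "annotate", "subscriptions.operators.coreos.com", name,
        "-n", namespace,
        annotation + "-",  # a trailing "-" asks oc to remove the key
    ]


cmd = annotation_removal_command(
    "openshift-operators",
    "devworkspace-operator-fast-redhat-operators-openshift-marketplace",
)
# Print for review; run it manually once the subscription is confirmed stuck.
print(shlex.join(cmd))
```

Printing the command instead of executing it keeps the change reviewable, which seems prudent on a production cluster.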

Comment 8 Anik 2023-08-15 17:27:26 UTC
It looks like all the customer reports are in closed status at this point. I'm guessing the workaround provided by the KCS article has been sufficient to alleviate the situations that the customers ran into.