Bug 1980755 - Subscription and CSV can't bind each other
Summary: Subscription and CSV can't bind each other
Keywords:
Status: CLOSED DEFERRED
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: OLM
Version: 4.7
Hardware: All
OS: All
Priority: medium
Severity: medium
Target Milestone: ---
Target Release: ---
Assignee: Kevin Rizza
QA Contact: Jian Zhang
URL:
Whiteboard:
Duplicates: 2093176
Depends On:
Blocks:
 
Reported: 2021-07-09 12:57 UTC by Jiaming Hu
Modified: 2023-09-18 00:28 UTC (History)

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2023-03-09 01:04:23 UTC
Target Upstream Version:
Embargoed:




Links
System ID Private Priority Status Summary Last Updated
Github 2201 0 None None None 2021-07-09 12:57:09 UTC

Description Jiaming Hu 2021-07-09 12:57:09 UTC
Description of problem:

This is an intermittent defect we observe during operator install and upgrade: the operator Subscription and CSV fail to bind to each other. On further investigation, we found that it happens when OLM fails to update the subscription status.


How reproducible:

It cannot be reproduced every time; it appears when the subscription status fails to be updated at the point the InstallPlan is created.


Actual results:

What I observe is that the operator's CSV is created, but the subscription status is never updated. As a result, even after the install plan completes, the subscription remains in the `unknown` status and the CSV remains in the `Cannot Update` status.

It also blocks the catalog operator from reconciling other operators.


Expected results:

I expect the operator to be deployed or upgraded successfully.


Additional info:

The root cause is figured out in this github issue: https://github.com/operator-framework/operator-lifecycle-manager/issues/2201
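
For anyone who wants to confirm the symptom, this is roughly what we check: the Subscription status fields that should point at the CSV and InstallPlan are never populated. Below is a minimal diagnostic sketch using the Kubernetes dynamic client (not OLM code; the `olmdiag` package and `PrintSubscriptionBinding` helper are hypothetical, and the status field names are the standard Subscription status fields as I understand them):

```
// Minimal diagnostic sketch (not OLM code): print the Subscription status
// fields that should bind it to its CSV and InstallPlan. In this bug those
// fields stay empty even after the CSV has been created.
package olmdiag

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/apis/meta/v1/unstructured"
	"k8s.io/apimachinery/pkg/runtime/schema"
	"k8s.io/client-go/dynamic"
)

var subscriptionGVR = schema.GroupVersionResource{
	Group:    "operators.coreos.com",
	Version:  "v1alpha1",
	Resource: "subscriptions",
}

// PrintSubscriptionBinding fetches one Subscription and prints the status
// fields that tie it to a CSV and an InstallPlan.
func PrintSubscriptionBinding(ctx context.Context, c dynamic.Interface, ns, name string) error {
	sub, err := c.Resource(subscriptionGVR).Namespace(ns).Get(ctx, name, metav1.GetOptions{})
	if err != nil {
		return err
	}
	state, _, _ := unstructured.NestedString(sub.Object, "status", "state")
	currentCSV, _, _ := unstructured.NestedString(sub.Object, "status", "currentCSV")
	installedCSV, _, _ := unstructured.NestedString(sub.Object, "status", "installedCSV")
	planName, _, _ := unstructured.NestedString(sub.Object, "status", "installPlanRef", "name")
	fmt.Printf("state=%q currentCSV=%q installedCSV=%q installPlan=%q\n",
		state, currentCSV, installedCSV, planName)
	return nil
}
```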

Comment 1 Kevin Rizza 2021-07-12 14:08:57 UTC
Updating to blocker- and setting priority and severity to medium.

How frequently are you running into this issue? On review of the upstream issue, it appears that there is already retry logic baked into that function, and this particular edge case will only occur when there is an extended period during which OLM is unable to reach the API server. Aside from bumping the retry count, the only remedy for this problem is to actually resolve the bundle identity problem in the new OLM APIs; otherwise, once this routine has run its course, OLM will no longer have the context to properly associate the new CSV with the new Subscription.

Comment 2 Jiaming Hu 2021-07-15 19:02:29 UTC
Thanks for your update.

This issue does not happen very frequently, but once it happens it blocks the install and upgrade of other operators, and it is hard to detect because the operator's deployment itself comes up successfully.

Given the retry mechanism, I believe OLM can prevent the problem to a certain extent. I will keep monitoring to see if it happens again.

Comment 3 Jiaming Hu 2021-07-28 19:31:45 UTC
Hi Kevin,

We can see that this issue happens when multiple operators are upgraded simultaneously. I am wondering if we can make the subscription status update retry logic more robust, considering that once the status update fails because of a network issue or some other unexpected situation, it breaks the upgrade and installation of other operators.

Could we change the retry logic from the default backoff (which only retries 5 times when conflicts happen) to retrying until the update succeeds? A rough sketch of what I mean is below.
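
To illustrate the idea, a minimal sketch (not the actual OLM code) using the client-go conflict-retry helper: `retry.DefaultRetry` gives up after 5 attempts, while a larger `wait.Backoff` would keep retrying. The `setCurrentCSV` helper and the choice of `status.currentCSV` as the field being written are just examples for illustration.

```
// Sketch only: re-read the latest Subscription on every attempt and retry
// the status update while the API server keeps returning a conflict.
// retry.DefaultRetry allows 5 steps; pass a larger wait.Backoff to retry longer.
package olmretry

import (
	"context"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/apis/meta/v1/unstructured"
	"k8s.io/apimachinery/pkg/runtime/schema"
	"k8s.io/client-go/dynamic"
	"k8s.io/client-go/util/retry"
)

var subGVR = schema.GroupVersionResource{
	Group:    "operators.coreos.com",
	Version:  "v1alpha1",
	Resource: "subscriptions",
}

// setCurrentCSV records the resolved CSV name in the Subscription status,
// retrying on "the object has been modified" conflicts.
func setCurrentCSV(ctx context.Context, c dynamic.Interface, ns, name, csv string) error {
	return retry.RetryOnConflict(retry.DefaultRetry, func() error {
		sub, err := c.Resource(subGVR).Namespace(ns).Get(ctx, name, metav1.GetOptions{})
		if err != nil {
			return err
		}
		if err := unstructured.SetNestedField(sub.Object, csv, "status", "currentCSV"); err != nil {
			return err
		}
		_, err = c.Resource(subGVR).Namespace(ns).UpdateStatus(ctx, sub, metav1.UpdateOptions{})
		return err
	})
}
```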

Thanks,
Jiaming

Comment 5 Per da Silva 2022-04-06 12:59:29 UTC
We think the effort to fix this is far greater than the value we'd derive from it. Closing as WONTFIX. Should this happen with far greater frequency, please re-open.

Comment 6 Jiaming Hu 2022-04-13 15:15:04 UTC
I believe we still need a solution for this issue: even though it doesn't happen every time, its impact is large. It blocks the installation and upgrade of all the operators in the same namespace.

Comment 7 JAGADEESWAR GANGARAJU 2022-05-20 13:49:11 UTC
We also ran into the same issue and our install got stuck. We had to manually delete our operator, and then the install started again.


Thanks,
Jag.

Comment 11 Daniel Fan 2022-11-14 22:27:10 UTC
We hit this issue today in 2 production environments while doing a seamless upgrade within the same channel.

```
ERROR: catalog-operator: I1114 18:59:21.690684 1 event.go:282] Event(v1.ObjectReference{Kind:"Namespace", Namespace:"", Name:"ibm-common-services", UID:"49c68d94-76a3-4600-94e9-4086727deec5", APIVersion:"v1", ResourceVersion:"210893", FieldPath:""}): type: 'Warning' reason: 'ResolutionFailed' constraints not satisfiable: clusterserviceversion ibm-common-service-operator.v3.19.5 exists and is not referenced by a subscription, subscription ibm-common-service-operator exists, subscription ibm-common-service-operator requires @existing/ibm-common-services//ibm-common-service-operator.v3.19.6, @existing/ibm-common-services//ibm-common-service-operator.v3.19.6 and @existing/ibm-common-services//ibm-common-service-operator.v3.19.5 originate from package ibm-common-service-operator
```

When OLM fails to update the subscription resource (usually it fails to add the CSV information to the subscription status), it shows the following error message:

```
error="Operation cannot be fulfilled on subscriptions.operators.coreos.com \"operand-deployment-lifecycle-manager-app\": the object has been modified; please apply your changes to the latest version and try again" 

```

When OLM then tries to reconcile the request, it fails at the operator resolution stage.

It observes that an operator CSV exists in the cluster, but it is not referenced in any subscription status.
  - it shows in the logs as @existing/cp4d//operand-deployment-lifecycle-manager.v1.20.0
It observes that there is a subscription without a CSV bound, so a new CSV is expected.
  - it shows in the logs as opencloud-operators/openshift-marketplace/v3.22/operand-deployment-lifecycle-manager.v1.20.0

Those two CSVs provide the same CRDs (such as OperandRegistry), which causes the conflict.
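
For reference, the stuck state can be detected with something along these lines (a hypothetical helper, not part of OLM): list the CSVs in the namespace and flag any that no Subscription references via status.currentCSV or status.installedCSV, i.e. the "exists and is not referenced by a subscription" state from the resolver error above.

```
// Hypothetical detection helper (not part of OLM): report CSVs that no
// Subscription in the namespace references via status.currentCSV or
// status.installedCSV.
package olmdiag

import (
	"context"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/apis/meta/v1/unstructured"
	"k8s.io/apimachinery/pkg/runtime/schema"
	"k8s.io/client-go/dynamic"
)

var (
	csvGVR  = schema.GroupVersionResource{Group: "operators.coreos.com", Version: "v1alpha1", Resource: "clusterserviceversions"}
	subsGVR = schema.GroupVersionResource{Group: "operators.coreos.com", Version: "v1alpha1", Resource: "subscriptions"}
)

// UnreferencedCSVs lists CSV names in the namespace that no Subscription
// status points at.
func UnreferencedCSVs(ctx context.Context, c dynamic.Interface, ns string) ([]string, error) {
	subs, err := c.Resource(subsGVR).Namespace(ns).List(ctx, metav1.ListOptions{})
	if err != nil {
		return nil, err
	}
	referenced := map[string]bool{}
	for _, s := range subs.Items {
		for _, field := range []string{"currentCSV", "installedCSV"} {
			if v, found, _ := unstructured.NestedString(s.Object, "status", field); found && v != "" {
				referenced[v] = true
			}
		}
	}
	csvs, err := c.Resource(csvGVR).Namespace(ns).List(ctx, metav1.ListOptions{})
	if err != nil {
		return nil, err
	}
	var orphans []string
	for _, item := range csvs.Items {
		if !referenced[item.GetName()] {
			orphans = append(orphans, item.GetName())
		}
	}
	return orphans, nil
}
```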

Comment 16 Glenn Marcy 2022-12-06 21:23:16 UTC
Adding this comment from a thread on the olm-dev channel on Kubernetes Slack:

I have reached a tentative conclusion after several days of continuous testing that the version of OLM I'm using from OCP 4.11.9 is working, and that previous releases of OCP included an OLM with an intermittent container crash.

Since there have been multiple occasions where new code added to the catalog operator resulted in such a crash, and the OCP process for choosing a version of OLM to ship has been unlucky more than once, I am wondering if there is any way to handle this type of failure better?

Comment 19 piotr.godowski 2022-12-16 15:12:47 UTC
We hit this issue once again in one of our customers' production environments while attempting a production upgrade during the year-end holiday season.
It is really putting us in difficult situations with customers, so I am asking the RH OLM team whether the documented recovery procedure could perhaps be automated.

I do understand the complexity of changing the OLM code to prevent the issue, but can we consider a self-healing solution to this problem, to avoid customer production issues?

Comment 20 jkeister 2023-02-08 21:57:33 UTC
*** Bug 2093176 has been marked as a duplicate of this bug. ***

Comment 21 Shiftzilla 2023-03-09 01:04:23 UTC
OpenShift has moved to Jira for its defect tracking! This bug can now be found in the OCPBUGS project in Jira.

https://issues.redhat.com/browse/OCPBUGS-8914

Comment 22 Red Hat Bugzilla 2023-09-18 00:28:12 UTC
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 120 days

