Bug 2178619 - odf-operator failing to resolve its sub-dependencies leaving the ocs-consumer/provider addon in a failed and halted state
Summary: odf-operator failing to resolve its sub-dependencies leaving the ocs-consumer/provider addon in a failed and halted state
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat OpenShift Data Foundation
Classification: Red Hat Storage
Component: odf-operator
Version: 4.10
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: ---
Target Release: ODF 4.13.0
Assignee: Nitin Goyal
QA Contact: Itzhak
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2023-03-15 13:02 UTC by Yashvardhan Kukreja
Modified: 2023-12-08 04:32 UTC
CC: 6 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2023-06-21 15:24:39 UTC
Embargoed:


Attachments


Links
System ID Private Priority Status Summary Last Updated
Github red-hat-storage odf-operator pull 301 0 None open controllers: do not use odf sub cache 2023-03-16 11:57:00 UTC
Github red-hat-storage odf-operator pull 302 0 None open Bug 2178619:[release-4.13] controllers: do not use odf sub cache 2023-03-16 12:56:48 UTC
Red Hat Product Errata RHBA-2023:3742 0 None None None 2023-06-21 15:25:57 UTC

Description Yashvardhan Kukreja 2023-03-15 13:02:38 UTC
Description of problem (please be as detailed as possible and provide log
snippets):

Whenever the Subscription of odf-operator is deleted and re-created as part of the "OLM dance" troubleshooting process, `odf-operator-controller-manager` continues to create the Subscriptions of its sub-dependencies with ownerReferences that carry the old/stale/no-longer-existing `uid` of the previous `odf-operator` Subscription.

Here the "sub-dependencies" are ocs-operator, mcg-operator, odf-csi-addons-operator.

Due to this, as soon as those sub-dependencies' Subscriptions are created, kube-controller-manager's garbage collector instantly deletes them, because it recognises that their ownerReferences point to an object that no longer exists.

Subsequently, `odf-operator-controller-manager` reacts to this garbage collection by re-creating the sub-dependencies' Subscriptions, again with the non-existent ownerReferences.

This causes a hot loop of creation, deletion and re-creation of the sub-dependencies' Subscriptions, never leaving them in a stable state in which they could resolve and create the sub-dependencies' workloads.

This ends up leaving the entire ocs-consumer/provider stack in an unresolved state, halting all day-2 operations such as upgrading the addon, as well as multiple other capabilities of the addon.

Root cause

This happens because `odf-operator-controller-manager`, only when it starts for the first time, looks at the `odf-operator` Subscription present in the cluster and permanently caches its `metadata.uid`, on the assumption that the Subscription with that `uid` will be left untouched throughout the controller's lifecycle. Therefore, every time it creates/reconciles its sub-dependency Subscriptions, it uses that cached `uid` of the odf-operator Subscription instead of GET-ing the `uid` of the latest `odf-operator` Subscription present in the cluster.

Ref - https://github.com/red-hat-storage/odf-operator/blob/main/controllers/subscriptions.go#L115-L117 
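
For illustration, here is a minimal controller-runtime sketch of the intended behaviour (this is not the actual odf-operator code; the helper name and the hard-coded Subscription name/namespace are assumptions for the example):

package controllers

import (
	"context"

	operatorsv1alpha1 "github.com/operator-framework/api/pkg/operators/v1alpha1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/types"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// ownerRefForOdfSubscription (hypothetical helper) builds the ownerReference for the
// sub-dependency Subscriptions by GET-ing the odf-operator Subscription on every call,
// instead of reusing a uid cached once at controller start-up.
func ownerRefForOdfSubscription(ctx context.Context, c client.Client) (*metav1.OwnerReference, error) {
	sub := &operatorsv1alpha1.Subscription{}
	key := types.NamespacedName{Name: "odf-operator", Namespace: "openshift-storage"}
	if err := c.Get(ctx, key, sub); err != nil {
		// Surfacing "odf-operator subscription not found" here is preferable to
		// creating sub-dependency Subscriptions owned by a stale uid.
		return nil, err
	}
	return &metav1.OwnerReference{
		APIVersion: "operators.coreos.com/v1alpha1",
		Kind:       "Subscription",
		Name:       sub.GetName(),
		UID:        sub.GetUID(), // always the live uid, even after a delete/re-create
	}, nil
}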



Version of all relevant components (if applicable):

odf-operator - 4.10.5


Does this issue impact your ability to continue to work with the product
(please explain in detail what is the user impact)?

Yes. Due to this issue, day-2 operations such as upgrades of the addon, and even the connection establishment between ocs-consumer and ocs-provider, end up broken, because ocs-operator, the component responsible for that connection establishment, is one of the sub-dependencies that gets stuck in this unstable and unresolved state.


Is there any workaround available to the best of your knowledge?

The workaround is to restart the `odf-operator-controller-manager` pods so that they refresh the wrongly cached `metadata.uid` with the one pointing to the latest `odf-operator` Subscription existing in the cluster.
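
For example (the pod name is a placeholder, as in the other commands in this report):
`oc delete pod -n openshift-storage <odf-operator-controller-manager pod name>`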



Rate from 1 - 5 the complexity of the scenario you performed that caused this
bug (1 - very simple, 5 - very complex)?


Is this issue reproducible?

Yes

Can this issue be reproduced from the UI?

- No


If this is a regression, please provide more details to justify this:


Steps to Reproduce:

Perform the OLM dance of odf-operator and its sub-dependencies as per the following steps:

0. Get the `metadata.uid` of the odf-operator Subscription and make a note of it
`oc get subscription -n openshift-storage -o json <odf-operator subscription name> | jq '.metadata.uid'`

1. Delete the odf-operator CSV: 
`oc delete csv -n openshift-storage --cascade=orphan <odf-operator-CSV name>`

2. Delete the sub-dependency CSV:
`oc delete csv -n openshift-storage --cascade=orphan <ocs-operator-csv name> <mcg-operator-csv name> <odf-csi-addons-operator-csv name>`

3. Delete the odf-operator Subscriptions:
`oc delete subscription -n openshift-storage --cascade=orphan <odf-operator-subscription name>`

4. Delete the sub-dependency Subscriptions:
`oc delete subscription -n openshift-storage --cascade=orphan <ocs-operator-subscription name> <mcg-operator-subscription name> <odf-csi-addons-operator-subscription name>`

5. Get the `odf-operator` Subscription re-created by running the following sub-steps:
   5.1 Delete the CSV of ocs-osd-deployer
       `oc delete csv -n openshift-storage --cascade=orphan <ocs-osd-deployer csv name>`
   5.2 Delete the Subscription of addon-ocs-consumer or addon-ocs-provider
       `oc delete subscription -n openshift-storage --cascade=orphan addon-ocs-consumer/addon-ocs-provider`

6. The addon-operator should automatically re-create the addon-ocs-consumer/addon-ocs-provider Subscription deleted in Step 5.2.

7. The re-creation of the `addon-ocs-consumer`/`addon-ocs-provider` Subscription should automatically lead to the creation of a new `odf-operator` Subscription with a new `metadata.uid`; make a note of it. It will definitely differ from the old `metadata.uid` you noted in Step 0.

8. Confirm Steps 6 and 7 by running
`oc get subscription -n openshift-storage`

9. Finally, run the following command and you will notice that the `ocs-operator`, `mcg-operator` and `odf-csi-addons-operator` Subscriptions are stuck in a hot loop of being created, deleted and re-created:

`oc get subscriptions -n openshift-storage -o custom-columns="NAME":.metadata.name,"OWNER-UID":'.metadata.ownerReferences[0].uid' --watch | grep "ocs-operator\|mcg\|odf-csi"`

You will notice that the OWNER-UID column shows the value you made a note of in Step 0, i.e. the `uid` of the old `odf-operator` Subscription.

Ideally, the `metadata.uid` of the newest `odf-operator` Subscription (Step 7) should have been used and displayed under the OWNER-UID column.
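
A quick way to compare the two uids directly (placeholders as in the steps above; `jsonpath` is used here instead of `jq` purely for convenience):

`oc get subscription -n openshift-storage <odf-operator subscription name> -o jsonpath='{.metadata.uid}'`
`oc get subscription -n openshift-storage <ocs-operator subscription name> -o jsonpath='{.metadata.ownerReferences[0].uid}'`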



Actual results:

The CSVs of mcg-operator, ocs-operator and odf-csi-addons-operator are never stably created.

`oc get csvs -n openshift-storage` shows no entries for the CSVs of the above operators.



Expected results:

The CSVs of mcg-operator, ocs-operator and odf-csi-addons-operator are stably created.

`oc get csvs -n openshift-storage` shows the entries for the CSVs of the above operators.


Additional info:

Comment 2 Yashvardhan Kukreja 2023-03-15 15:57:30 UTC
Hi Nitin, if you look at the "Steps to Reproduce" carefully, I mentioned that the deletion of the CSVs happens with `--cascade=orphan`, which leaves the `odf-operator-controller-manager` behind.

We do that as a part of the "OLM Dance" troubleshooting process because we want every step of ours to be absolutely non-destructive.

>  My question will be why the sub was deleted in the first place

Regarding this question, a Subscription has to be deleted so that it can be re-created and its re-creation can further lead to OLM re-creating its CSV, which is the ultimate goal of OLM Dance process.

Comment 3 Nitin Goyal 2023-03-16 04:42:09 UTC
> Regarding this question, a Subscription has to be deleted so that it can be re-created and its re-creation can further lead to OLM re-creating its CSV, which is the ultimate goal of OLM Dance process.

Why were you deleting and creating the SUB and CSV? What difference will it make? It is already there.

Comment 4 Yashvardhan Kukreja 2023-03-16 04:54:07 UTC
As I mentioned, Nitin, it is for the sake of performing the "OLM Dance" troubleshooting process.

Without bothering you with too many details: there are times when the relationship between an existing Subscription and its CSV breaks, which causes OLM to mark the existing Subscription as "Unsatisfied". The consequence is that operations like upgrades can't be performed on such Subscriptions.

To fix this, there is a process called the "OLM Dance" in which we delete the CSV and the Subscription, then re-create the Subscription (which automatically leads to the re-creation of the CSV as well) to refresh the relationship between the Subscription and the CSV, so that in the eyes of OLM the Subscription reaches a "Satisfied" state and becomes capable of normal operations, like upgrades, again.

This is just how OLM and its dependency resolution work, and OLM never guarantees the immutability of Subscriptions like `odf-operator` anyway.

Therefore, odf-operator shouldn't cache the uid of the `odf-operator` Subscription just once on the assumption that it will never change.

Hope this explains.

Comment 6 Yashvardhan Kukreja 2023-03-16 06:13:53 UTC
The OLM Dance process tends to be performed fairly frequently. Moreover, Nitin, this is indeed a bug (a missed edge case which can have critical implications) in odf-operator. Even if we don't perform the OLM Dance often, a cluster-admin can do the same thing.

Long story short, this is a critical bug which is capable of incurring extra toil on SREs and hampering day-2 operations.

Therefore, we don't want to rely on workarounds just hoping that this doesn't happen, because after all they are just "workarounds" and not an actual solution, and as they say, hope ain't a strategy ;)

Comment 7 Nitin Goyal 2023-03-16 11:52:36 UTC
Thank you for bringing this issue to my attention and for your concern regarding the potential impact on our SREs and day 2 operations.

After carefully assessing the situation, I have concluded that the probability of encountering this bug is low, and its impact on our operations is not significant. We do have a simple workaround in place that addresses the issue.

While I understand that you consider this bug to be critical, I believe that it is not at a critical level. However, I will prioritize addressing this in our next release, which is scheduled for 4.13. Unfortunately, I cannot guarantee that we will backport this fix to older versions of the system such as 4.10.

Thank you again for bringing this to my attention and for your diligence in identifying this issue.

Comment 13 Itzhak 2023-05-29 14:56:45 UTC
I need more information about the platform details.
From my understanding, we need to test it with Managed Service, OCP 4.11, and ODF 4.11 with provider and consumer clusters. 
Is that correct? 
If not, please provide more details about the platform that needs to be tested.

Comment 14 Nitin Goyal 2023-05-30 04:48:19 UTC
The fix is in 4.13 only, so you need to test it on odf 4.13.

Comment 15 Nitin Goyal 2023-05-31 12:48:48 UTC
Verification steps on the product cluster:

oc get sub
oc get csv
oc delete sub odf-operator

oc get sub
oc get csv
oc get pods

oc get sub -w (wait for some time and see if it is trying to create faulty subs with the wrong uid)

oc logs odf-operator-controller-manager-**** manager (please check whether it is complaining in the logs about the subs not being found)


Please perform these steps on both the 4.12 and 4.13 clusters. On 4.13 you should see the error in the logs, and on 4.12 you should not see the error.

Comment 16 Itzhak 2023-06-01 16:40:12 UTC
We checked the steps above with a 4.12 cluster and got the following results.

$ oc get sub -w
NAME           PACKAGE        SOURCE             CHANNEL
mcg-operator   mcg-operator   redhat-operators   stable-4.12
ocs-operator   ocs-operator   redhat-operators   stable-4.12
mcg-operator   mcg-operator   redhat-operators   stable-4.12
ocs-operator   ocs-operator   redhat-operators   stable-4.12
odf-csi-addons-operator   odf-csi-addons-operator   redhat-operators   stable-4.12
mcg-operator              mcg-operator              redhat-operators   stable-4.12
ocs-operator              ocs-operator              redhat-operators   stable-4.12
odf-csi-addons-operator   odf-csi-addons-operator   redhat-operators   stable-4.12
ocs-operator              ocs-operator              redhat-operators   stable-4.12
mcg-operator              mcg-operator              redhat-operators   stable-4.12
odf-csi-addons-operator   odf-csi-addons-operator   redhat-operators   stable-4.12
mcg-operator              mcg-operator              redhat-operators   stable-4.12
ocs-operator              ocs-operator              redhat-operators   stable-4.12
odf-csi-addons-operator   odf-csi-addons-operator   redhat-operators   stable-4.12
odf-csi-addons-operator   odf-csi-addons-operator   redhat-operators   stable-4.1


And when checking the steps above with an AWS 4.13 cluster, we got the following results, as expected:
$ oc get sub -w

When checking the logs of the odf-operator-controller-manager-**** manager container, we saw the following output:
2023-06-01T13:16:34Z    ERROR    controllers.StorageSystem    failed to ensure subscription    {"instance": "openshift-storage/ocs-storagecluster-storagesystem", "Subscription": "mcg-operator", "error": "odf-operator subscription not found"}

Comment 19 errata-xmlrpc 2023-06-21 15:24:39 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Red Hat OpenShift Data Foundation 4.13.0 enhancement and bug fix update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2023:3742

Comment 20 Red Hat Bugzilla 2023-12-08 04:32:46 UTC
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 120 days

