Description of problem (please be as detailed as possible and provide log snippets):

Whenever the Subscription of odf-operator is deleted and re-created as part of the OLM dance troubleshooting process, `odf-operator-controller-manager` continues to create the Subscriptions of its sub-dependencies with an ownerReference pointing at the `odf-operator` Subscription's old, stale, no-longer-existing `uid`. Here the "sub-dependencies" are ocs-operator, mcg-operator and odf-csi-addons-operator.

As soon as those sub-dependency Subscriptions are created, kube-controller-manager's garbage collector instantly deletes them, because it recognises that the newly created Subscriptions have non-existent ownerReferences. `odf-operator-controller-manager` then reacts to this garbage collection by re-creating the sub-dependency Subscriptions, again with non-existent ownerReferences. This causes a hot loop of creation, deletion and re-creation of the sub-dependency Subscriptions, never leaving them in a stable state where they can resolve themselves and create the sub-dependency workloads. This leaves the entire ocs-consumer/provider stack in an unresolved state, halting all day-2 operations such as upgrading the addon, along with multiple other capabilities of the addon.

Root cause:

When `odf-operator-controller-manager` starts, it looks up the `odf-operator` Subscription present in the cluster once and caches its `metadata.uid` permanently, on the assumption that that Subscription, with that `uid`, will be left untouched throughout the manager's lifecycle. Therefore, every time it creates or reconciles its sub-dependency Subscriptions, it uses that cached `uid` of the odf-operator Subscription, instead of rightfully GET-ing the `uid` of the latest `odf-operator` Subscription present in the cluster.
Ref - https://github.com/red-hat-storage/odf-operator/blob/main/controllers/subscriptions.go#L115-L117

Version of all relevant components (if applicable): odf-operator - 4.10.5

Does this issue impact your ability to continue to work with the product (please explain in detail what is the user impact)?
Yes. Due to this issue, any day-2 operation, such as upgrading the addon, ends up broken, as does the connection establishment between ocs-consumer and ocs-provider, because ocs-operator, the component responsible for that connection, is one of the sub-dependencies stuck in the unstable, unresolved state.

Is there any workaround available to the best of your knowledge?
Yes: restart the `odf-operator-controller-manager` pods so that they refresh the wrongly cached `metadata.uid` with the one correctly pointing at the latest `odf-operator` Subscription existing in the cluster.

Rate from 1 - 5 the complexity of the scenario you performed that caused this bug (1 - very simple, 5 - very complex)?

Can this issue be reproduced? Yes
Can this issue reproduce from the UI? No
If this is a regression, please provide more details to justify this:

Steps to Reproduce:

Perform the OLM dance of odf-operator and its sub-dependencies as per the following steps:

0. Get the `metadata.uid` of the odf-operator Subscription and make a note of it:
`oc get subscription -n openshift-storage -o json <odf-operator subscription name> | jq '.metadata.uid'`

1. Delete the odf-operator CSV:
`oc delete csv -n openshift-storage --cascade=orphan <odf-operator-CSV name>`

2. Delete the sub-dependency CSVs:
`oc delete csv -n openshift-storage --cascade=orphan <ocs-operator-csv name> <mcg-operator-csv name> <odf-csi-addons-operator-csv name>`

3. Delete the odf-operator Subscription:
`oc delete subscription -n openshift-storage --cascade=orphan <odf-operator-subscription name>`

4. Delete the sub-dependency Subscriptions:
`oc delete subscription -n openshift-storage --cascade=orphan <ocs-operator-subscription name> <mcg-operator-subscription name> <odf-csi-addons-operator-subscription name>`

5. Get the `odf-operator` Subscription re-created by running the following sub-steps:
5.1 Delete the CSV of ocs-osd-deployer:
`oc delete csv -n openshift-storage --cascade=orphan <ocs-osd-deployer csv name>`
5.2 Delete the Subscription of addon-ocs-consumer or addon-ocs-provider:
`oc delete subscription -n openshift-storage --cascade=orphan addon-ocs-consumer/addon-ocs-provider`

6. The addon-operator should automatically re-create the addon-ocs-consumer/addon-ocs-provider Subscription deleted in step 5.2.

7. The re-creation of the `addon-ocs-consumer`/`addon-ocs-provider` Subscription should automatically lead to the creation of a new `odf-operator` Subscription with a new `metadata.uid`; make a note of it. It will definitely differ from the old `metadata.uid` you noted in step 0.

8. Confirm steps 6 and 7 by running `oc get subscription -n openshift-storage`.

9. Finally, run the following command, and you will notice that `ocs-operator`, `mcg-operator` and `odf-csi-addons-operator` are stuck in a hot loop of getting created, deleted and re-created:
`oc get subscriptions -n openshift-storage -o custom-columns="NAME":.metadata.name,"OWNER-UID":'.metadata.ownerReferences[0].uid' --watch | grep "ocs-operator\|mcg\|odf-csi"`
You will notice that the OWNER-UID column shows the value you noted in step 0, i.e. the `uid` of the old `odf-operator` Subscription. Ideally, the `metadata.uid` of the newest `odf-operator` Subscription (step 7) should have been used and displayed under the OWNER-UID column.

Actual results:
The CSVs of mcg-operator, ocs-operator and odf-csi-addons-operator are never stably created; `oc get csvs -n openshift-storage` shows no entries for the CSVs of the above operators.
Expected results:
The CSVs of mcg-operator, ocs-operator and odf-csi-addons-operator are stably created; `oc get csvs -n openshift-storage` shows the entries for the CSVs of the above operators.

Additional info:
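The stale-cache root cause described above can be illustrated with a minimal, self-contained Go sketch. All type and function names here are hypothetical stand-ins, not the actual odf-operator code; the point is only the difference between resolving the owner `uid` once at startup versus on every reconcile.

```go
package main

import "fmt"

// Subscription is a minimal stand-in for the Kubernetes object; only the
// fields relevant to the ownerReference problem are modelled.
type Subscription struct {
	Name string
	UID  string
}

// cluster mimics the live API server state: Subscriptions keyed by name.
type cluster struct {
	subs map[string]Subscription
}

func (c *cluster) get(name string) (Subscription, bool) {
	s, ok := c.subs[name]
	return s, ok
}

// buggyController mimics the pre-fix behaviour: it resolves the owner UID
// once and reuses the cached value for every child it creates afterwards.
type buggyController struct {
	cachedOwnerUID string
}

func (b *buggyController) ownerUID(c *cluster) string {
	if b.cachedOwnerUID == "" {
		if owner, ok := c.get("odf-operator"); ok {
			b.cachedOwnerUID = owner.UID // cached forever
		}
	}
	return b.cachedOwnerUID
}

// fixedController resolves the owner UID on every call, so a deleted and
// re-created odf-operator Subscription is picked up immediately.
type fixedController struct{}

func (fixedController) ownerUID(c *cluster) string {
	owner, _ := c.get("odf-operator")
	return owner.UID
}

func main() {
	c := &cluster{subs: map[string]Subscription{
		"odf-operator": {Name: "odf-operator", UID: "uid-old"},
	}}
	buggy := &buggyController{}
	fmt.Println(buggy.ownerUID(c)) // resolves and caches uid-old

	// OLM dance: the odf-operator Subscription is deleted and re-created,
	// which always yields a new metadata.uid.
	c.subs["odf-operator"] = Subscription{Name: "odf-operator", UID: "uid-new"}

	// The cached value is now dangling: children created with it get an
	// ownerReference to a non-existent object and are garbage-collected.
	fmt.Println(buggy.ownerUID(c))
	fmt.Println(fixedController{}.ownerUID(c))
}
```

After the owner is re-created, the buggy controller still returns `uid-old` while the fixed one returns `uid-new`, which is exactly the divergence visible in the OWNER-UID column of the watch command in step 9.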
Hi Nitin, if you look at the "Steps to Reproduce" carefully, you'll see that the CSV deletions happen with `--cascade=orphan`, which leaves the `odf-operator-controller-manager` behind. We do that as part of the "OLM dance" troubleshooting process because we want every step of ours to be absolutely non-destructive.

> My question will be why the sub was deleted in the first place

Regarding this question, a Subscription has to be deleted so that it can be re-created, and its re-creation can further lead to OLM re-creating its CSV, which is the ultimate goal of the OLM dance process.
> Regarding this question, a Subscription has to be deleted so that it can be re-created and its re-creation can further lead to OLM re-creating its CSV, which is the ultimate goal of OLM Dance process.

Why were you deleting and creating the SUB and CSV? What difference will it make? It is already there.
As I mentioned, Nitin, it is for the sake of performing the "OLM dance" troubleshooting process. Without bothering you with too many details: there are times when the relationship between an existing Subscription and its CSV breaks, which causes OLM to mark the existing Subscription as "Unsatisfied". The consequence is that operations like upgrades can't be performed on such Subscriptions. To fix this, there is a process called the "OLM dance", in which we delete the CSV and the Subscription, then re-create the Subscription (which automatically leads to the re-creation of the CSV as well), so as to refresh the relationship between the Subscription and the CSV. In the eyes of OLM, the Subscription then reaches a "Satisfied" state, making it capable of normal operations like upgrades again. This is just how OLM and its dependency resolution work, and OLM never guarantees the immutability of Subscriptions like that of `odf-operator` anyway. Therefore, odf-operator shouldn't cache the `uid` of the `odf-operator` Subscription just once on the assumption that it will never change. Hope this explains it.
The OLM dance process tends to be fairly frequent. Moreover, Nitin, this is indeed a bug (a missed edge case which can have critical implications) in odf-operator, so even if we don't perform the OLM dance often, a cluster-admin can trigger the same situation. Long story short, this is a critical bug capable of incurring extra toil on SREs and hampering day-2 operations. Therefore, we don't want to rely on workarounds and just hope this doesn't happen; after all, they are just "workarounds" and not an actual solution, and as they say, hope ain't a strategy ;)
Thank you for bringing this issue to my attention and for your concern regarding the potential impact on our SREs and day 2 operations. After carefully assessing the situation, I have concluded that the probability of encountering this bug is low, and its impact on our operations is not significant. We do have a simple workaround in place that addresses the issue. While I understand that you consider this bug to be critical, I believe that it is not at a critical level. However, I will prioritize addressing this in our next release, which is scheduled for 4.13. Unfortunately, I cannot guarantee that we will backport this fix to older versions of the system such as 4.10. Thank you again for bringing this to my attention and for your diligence in identifying this issue.
I need more information about the platform details. From my understanding, we need to test it with Managed Service, OCP 4.11, and ODF 4.11 with provider and consumer clusters. Is that correct? If not, please provide more details about the platform that needs to be tested.
The fix is in 4.13 only, so you need to test it on ODF 4.13.
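For context, the shape of the fixed behaviour (as suggested by the "odf-operator subscription not found" error seen during verification below) can be sketched in Go. This is a hedged illustration under assumed names, not the actual 4.13 code: before ensuring any sub-dependency Subscription, look up the current `odf-operator` Subscription and fail if it is missing, rather than reusing a UID cached at startup.

```go
package main

import (
	"errors"
	"fmt"
)

// Subscription is a minimal, hypothetical stand-in for the real object.
type Subscription struct {
	Name     string
	UID      string
	OwnerUID string
}

var errOwnerNotFound = errors.New("odf-operator subscription not found")

// ensureChild builds a child Subscription owned by the *current*
// odf-operator Subscription, or returns an error when the owner is
// missing, instead of silently attaching a stale owner UID.
func ensureChild(live map[string]Subscription, child string) (Subscription, error) {
	owner, ok := live["odf-operator"]
	if !ok {
		return Subscription{}, errOwnerNotFound
	}
	return Subscription{Name: child, OwnerUID: owner.UID}, nil
}

func main() {
	live := map[string]Subscription{}
	// Owner missing: the reconcile surfaces an error (compare the
	// "failed to ensure subscription" log line) instead of hot-looping.
	if _, err := ensureChild(live, "mcg-operator"); err != nil {
		fmt.Println("failed to ensure subscription:", err)
	}
	// Once the owner is re-created, the child gets the fresh UID.
	live["odf-operator"] = Subscription{Name: "odf-operator", UID: "uid-new"}
	child, _ := ensureChild(live, "mcg-operator")
	fmt.Println(child.OwnerUID)
}
```

The design point is that an explicit error while the owner is absent is self-healing (the next reconcile succeeds once the owner exists), whereas a stale cached UID produces children that the garbage collector deletes forever.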
Verification steps on the product cluster:

oc get sub
oc get csv
oc delete sub odf-operator
oc get sub
oc get csv
oc get pods
oc get sub -w (wait for some time and see whether it tries to create faulty subs with the wrong uid)
oc logs odf-operator-controller-manager-**** manager (please check whether it is complaining about the subs not being found in the logs)

Please perform these steps on both the 4.12 and 4.13 clusters. On 4.13 you should see the error in the logs, and on 4.12 you should not see the error.
We checked the steps above with a 4.12 cluster and got the following results.

$ oc get sub -w
NAME                      PACKAGE                   SOURCE             CHANNEL
mcg-operator              mcg-operator              redhat-operators   stable-4.12
ocs-operator              ocs-operator              redhat-operators   stable-4.12
mcg-operator              mcg-operator              redhat-operators   stable-4.12
ocs-operator              ocs-operator              redhat-operators   stable-4.12
odf-csi-addons-operator   odf-csi-addons-operator   redhat-operators   stable-4.12
mcg-operator              mcg-operator              redhat-operators   stable-4.12
ocs-operator              ocs-operator              redhat-operators   stable-4.12
odf-csi-addons-operator   odf-csi-addons-operator   redhat-operators   stable-4.12
ocs-operator              ocs-operator              redhat-operators   stable-4.12
mcg-operator              mcg-operator              redhat-operators   stable-4.12
odf-csi-addons-operator   odf-csi-addons-operator   redhat-operators   stable-4.12
mcg-operator              mcg-operator              redhat-operators   stable-4.12
ocs-operator              ocs-operator              redhat-operators   stable-4.12
odf-csi-addons-operator   odf-csi-addons-operator   redhat-operators   stable-4.12
odf-csi-addons-operator   odf-csi-addons-operator   redhat-operators   stable-4.1

And when checking the steps above with an AWS 4.13 cluster, we got the following results, as expected:

$ oc get sub -w

When checking the logs of the odf-operator-controller-manager-**** manager we saw the following output:

2023-06-01T13:16:34Z ERROR controllers.StorageSystem failed to ensure subscription {"instance": "openshift-storage/ocs-storagecluster-storagesystem", "Subscription": "mcg-operator", "error": "odf-operator subscription not found"}
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Red Hat OpenShift Data Foundation 4.13.0 enhancement and bug fix update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2023:3742
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 120 days