Bug 2178619
| Summary: | odf-operator failing to resolve its sub-dependencies leaving the ocs-consumer/provider addon in a failed and halted state | | |
|---|---|---|---|
| Product: | [Red Hat Storage] Red Hat OpenShift Data Foundation | Reporter: | Yashvardhan Kukreja <ykukreja> |
| Component: | odf-operator | Assignee: | Nitin Goyal <nigoyal> |
| Status: | CLOSED ERRATA | QA Contact: | Itzhak <ikave> |
| Severity: | high | Docs Contact: | |
| Priority: | unspecified | | |
| Version: | 4.10 | CC: | kramdoss, muagarwa, nigoyal, ocs-bugs, odf-bz-bot, owasserm |
| Target Milestone: | --- | Keywords: | AutomationTriaged |
| Target Release: | ODF 4.13.0 | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2023-06-21 15:24:39 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
Description
Yashvardhan Kukreja
2023-03-15 13:02:38 UTC
Hi Nitin, if you read the "Steps to reproduce" carefully, you will see that the CSV is deleted with `--cascade=orphan`, which leaves the `odf-operator-controller-manager` behind.
We do that as part of the "OLM Dance" troubleshooting process because we want every step of ours to be absolutely non-destructive.
> My question will be why the sub was deleted in the first place
Regarding this question, a Subscription has to be deleted so that it can be re-created, and its re-creation in turn leads OLM to re-create its CSV, which is the ultimate goal of the OLM Dance process.
> Regarding this question, a Subscription has to be deleted so that it can be re-created, and its re-creation in turn leads OLM to re-create its CSV, which is the ultimate goal of the OLM Dance process.
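For readers unfamiliar with the sequence described above (delete the CSV with `--cascade=orphan`, delete the Subscription, then re-create the Subscription so that OLM re-creates the CSV), here is a minimal shell sketch. It is not an official runbook. The function names, the `DRY_RUN` switch, and all four arguments (namespace, Subscription name, CSV name, and the path to a clean Subscription manifest) are assumptions introduced for illustration; none of them come from this bug.

```shell
# Hypothetical helper: print the command when DRY_RUN=1, otherwise execute it.
run() {
  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

# olm_dance NAMESPACE SUBSCRIPTION CSV MANIFEST
# MANIFEST is assumed to be a clean Subscription manifest prepared by the caller.
olm_dance() {
  local ns="$1" sub="$2" csv="$3" manifest="$4"
  # Delete the CSV but orphan its dependents, so the operator Deployment keeps running.
  run oc delete csv "$csv" -n "$ns" --cascade=orphan
  # Delete the Subscription so that it can be re-created.
  run oc delete sub "$sub" -n "$ns"
  # Re-create the Subscription; OLM is then expected to re-create the CSV.
  run oc create -f "$manifest" -n "$ns"
}

# Example (prints the commands instead of running them):
#   DRY_RUN=1 olm_dance openshift-storage odf-operator <csv-name> <subscription.yaml>
```

Note that the re-created Subscription is a new object with a new UID, which is the situation this bug is about.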
Why were you deleting and creating the SUB and CSV? What difference will it make? It is already there.
As I mentioned, Nitin, it is for the sake of performing the “OLM Dance” troubleshooting process. Without bothering you with too many details, there are times when the relationship between an existing Subscription and its CSV breaks, which causes OLM to mark the existing Subscription as “Unsatisfied”. The consequence is that operations like upgrades can't be performed on such Subscriptions. To fix this, there is a process called the “OLM Dance” in which we delete the CSV and the Subscription, then re-create the Subscription (which automatically leads to the re-creation of the CSV as well). This refreshes the relationship between the Subscription and the CSV so that, in the eyes of OLM, the Subscription reaches a “Satisfied” state and is again capable of normal operations like upgrades.
This is just how OLM and its dependency resolution work, and OLM never guarantees the immutability of Subscriptions like that of ‘odf-operator’ anyway. Therefore, odf-operator shouldn’t cache the UID of the odf-operator Subscription just once on the hope/assumption that it will never change. Hope this explains.
The OLM Dance process tends to be fairly frequent. Moreover, Nitin, this is indeed a bug (a missed edge case which can have critical implications) in odf-operator. So even if we don’t perform the OLM Dance often, a cluster-admin can do the same. Long story short, this is a critical bug which is capable of incurring extra toil on SREs and hampering day 2 operations. Therefore, we don’t want to rely on workarounds and just hope that this doesn’t happen because, after all, they are just “workarounds” and not an actual solution and, as they say, hope ain’t a strategy ;)
Thank you for bringing this issue to my attention and for your concern regarding the potential impact on our SREs and day 2 operations. After carefully assessing the situation, I have concluded that the probability of encountering this bug is low, and its impact on our operations is not significant. We do have a simple workaround in place that addresses the issue. While I understand that you consider this bug to be critical, I believe that it is not at a critical level. However, I will prioritize addressing this in our next release, which is scheduled for 4.13. Unfortunately, I cannot guarantee that we will backport this fix to older versions of the system such as 4.10. Thank you again for bringing this to my attention and for your diligence in identifying this issue.
I need more information about the platform details. From my understanding, we need to test it with Managed Service, OCP 4.11, and ODF 4.11 with provider and consumer clusters. Is that correct? If not, please provide more details about the platform that needs to be tested.
The fix is in 4.13 only, so you need to test it on ODF 4.13.
Verification steps on the product cluster:
oc get sub
oc get csv
oc delete sub odf-operator
oc get sub
oc get csv
oc get pods
oc get sub -w (wait for some time and see if it is trying to create faulty subs with the wrong uid)
oc logs odf-operator-controller-manager-**** manager (please see if it is complaining in the logs about the subs not being found)
Please perform these steps on the 4.12 and 4.13 clusters. In 4.13 you should see the error in the logs, and in 4.12 you should not see the error.
We checked the steps above with a 4.12 cluster and got the following results.
$ oc get sub -w
NAME                      PACKAGE                   SOURCE             CHANNEL
mcg-operator              mcg-operator              redhat-operators   stable-4.12
ocs-operator              ocs-operator              redhat-operators   stable-4.12
mcg-operator              mcg-operator              redhat-operators   stable-4.12
ocs-operator              ocs-operator              redhat-operators   stable-4.12
odf-csi-addons-operator   odf-csi-addons-operator   redhat-operators   stable-4.12
mcg-operator              mcg-operator              redhat-operators   stable-4.12
ocs-operator              ocs-operator              redhat-operators   stable-4.12
odf-csi-addons-operator   odf-csi-addons-operator   redhat-operators   stable-4.12
ocs-operator              ocs-operator              redhat-operators   stable-4.12
mcg-operator              mcg-operator              redhat-operators   stable-4.12
odf-csi-addons-operator   odf-csi-addons-operator   redhat-operators   stable-4.12
mcg-operator              mcg-operator              redhat-operators   stable-4.12
ocs-operator              ocs-operator              redhat-operators   stable-4.12
odf-csi-addons-operator   odf-csi-addons-operator   redhat-operators   stable-4.12
odf-csi-addons-operator   odf-csi-addons-operator   redhat-operators   stable-4.1
And when checking the steps above with an AWS 4.13 cluster, we got the following results, as expected:
$ oc get sub -w
When checking the logs of the odf-operator-controller-manager-**** manager we saw the following output:
2023-06-01T13:16:34Z ERROR controllers.StorageSystem failed to ensure subscription {"instance": "openshift-storage/ocs-storagecluster-storagesystem", "Subscription": "mcg-operator", "error": "odf-operator subscription not found"}
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Red Hat OpenShift Data Foundation 4.13.0 enhancement and bug fix update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2023:3742
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 120 days
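The verification steps listed in this thread could be collected into a small script along the following lines. This is only a sketch: the function names, the `DRY_RUN` switch, and the namespace argument are assumptions added for illustration, and the blocking `oc get sub -w` step is left to be run by hand. The log check simply searches for the error string quoted in the 4.13 result.

```shell
# Hypothetical helper: print the command when DRY_RUN=1, otherwise execute it.
run() {
  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

# delete_and_list NAMESPACE
# Runs the non-blocking verification steps from the thread in order.
delete_and_list() {
  local ns="$1"
  run oc get sub -n "$ns"
  run oc get csv -n "$ns"
  run oc delete sub odf-operator -n "$ns"
  run oc get sub -n "$ns"
  run oc get csv -n "$ns"
  run oc get pods -n "$ns"
}

# log_has_sub_not_found
# Reads operator logs on stdin; succeeds if the error seen on 4.13 is present.
log_has_sub_not_found() {
  grep -q 'odf-operator subscription not found'
}

# Example against a cluster (the pod name must be supplied by the reader):
#   oc logs <odf-operator-controller-manager pod> -c manager -n <namespace> \
#     | log_has_sub_not_found && echo "error logged"
```

Per the thread, the error is expected in the logs on 4.13 and not on 4.12.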