Created attachment 1756372 [details]
rook operator log

Description of problem (please be detailed as possible and provide log snippets):
=======================================================================
In an OCS dynamic-mode cluster on VMware, created 2 OBCs and then initiated storagecluster deletion (without deleting the OBCs, to test whether they block uninstall).

Working as expected: until I deleted the NooBaa-based OBC, NooBaa deletion was stuck, halting storagecluster deletion. Once I deleted only the NooBaa OBC, the uninstall progressed towards cephcluster deletion.

>> Not working as expected: even with mode: graceful, storagecluster and cephcluster deletion succeeded although the RGW-based OBC still existed on the cluster. Rook should have checked for its existence and blocked cephcluster deletion, similar to the way it blocks cephcluster deletion when PVCs (cephfs/rbd) exist.

Storagecluster annotation:
-----------------------------
  annotations:
    uninstall.ocs.openshift.io/cleanup-policy: delete
    uninstall.ocs.openshift.io/mode: graceful

======= obc ==========
NAME      STORAGE-CLASS                 PHASE   AGE
obc-rgw   ocs-storagecluster-ceph-rgw   Bound   7s
obcs-nb   openshift-storage.noobaa.io   Bound   45s

Because of the presence of the OBC, the namespace deletion got stuck and I had to manually patch the finalizers of the OBC-related ConfigMap, the Secret, and the OBC itself.

Version of all relevant components (if applicable):
===============================================
OCS = 4.7.0-258.ci
OCP = 4.7.0-0.nightly-2021-02-09-224509

Does this issue impact your ability to continue to work with the product (please explain in detail what is the user impact)?
=================================================================
Due to the leftovers, we have to manually patch the finalizers in the OBC resources for the namespace deletion to succeed.

Is there any workaround available to the best of your knowledge?
=========================================================
NA

Rate from 1 - 5 the complexity of the scenario you performed that caused this bug (1 - very simple, 5 - very complex)?

Is this issue reproducible?
==============================
Tested once

Can this issue reproduce from the UI?
=====================================
NA

If this is a regression, please provide more details to justify this:
===================================================
Not sure

Steps to Reproduce:
=======================
1. Create one RGW-based and one NooBaa-based OBC using the UI.
2. Keeping them intact, initiate storagecluster deletion:

   $ date --utc; oc delete storagecluster --all; date --utc
   Thu Feb 11 11:26:44 UTC 2021
   storagecluster.ocs.openshift.io "ocs-storagecluster" deleted
   Thu Feb 11 12:06:43 UTC 2021

3. Storagecluster deletion stays stuck in the Deleting state, since NooBaa is waiting for the OBC bucket to be removed.
4. Delete only the NooBaa OBC from the UI/CLI. NooBaa deletion succeeds and the uninstall progresses to delete the cephcluster.
5. Check the progress of storagecluster deletion: cephcluster deletion does not get stuck for the RGW OBC.

Actual results:
=================
With graceful mode on, storagecluster (and therefore cephcluster) deletion succeeds even when an RGW-based OBC exists.

Expected results:
===================
cephcluster deletion should have been stuck, waiting for the user to delete the RGW OBC first.

Additional info:
===================
Only remaining resources after storagecluster deletion:
-------------------------------------------------
$ oc get cm
NAME      DATA   AGE
obc-rgw   5      133m

[nberry@localhost dynamic-258.ci]$ oc get secret
NAME      TYPE     DATA   AGE
obc-rgw   Opaque   2      134m

$ oc get obc
NAME      STORAGE-CLASS                 PHASE   AGE
obc-rgw   ocs-storagecluster-ceph-rgw   Bound   43m
--------------
======= storagecluster ==========
No resources found in openshift-storage namespace.
--------------
======= cephcluster ==========
No resources found in openshift-storage namespace.
======= PV ====
No resources found

======= backingstore ==========
No resources found in openshift-storage namespace.

======= bucketclass ==========
No resources found in openshift-storage namespace.

======= obc ==========
NAME      STORAGE-CLASS                 PHASE   AGE
obc-rgw   ocs-storagecluster-ceph-rgw   Bound   43m

======= oc get cephobjectstore and user ==========
No resources found in openshift-storage namespace.
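For reference, the manual cleanup mentioned above (removing the finalizers from the leftover OBC, its ConfigMap, and its Secret so that namespace deletion can complete) looks roughly like the sketch below. Resource names match the obc-rgw leftovers shown above; the namespace is an assumption and should be adjusted. Clearing finalizers by hand is a last resort, not a supported procedure:

```shell
# Last-resort cleanup sketch: strip the finalizers that keep the
# leftover OBC resources (and hence the namespace) from being deleted.
# Names match the obc-rgw leftovers above; the namespace is assumed
# and must be adjusted to wherever the OBC was created.
NS=openshift-storage

for kind in obc configmap secret; do
  oc patch "$kind" obc-rgw -n "$NS" --type=merge \
    -p '{"metadata":{"finalizers":null}}'
done
```

A merge patch that sets metadata.finalizers to null removes the finalizer list entirely, letting the already-issued delete proceed.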
Since this is not a regression, it only affects uninstall, and a workaround exists, moving to 4.8.
Per discussion on the upstream design doc for cleanup, this would be a potentially breaking behavior change, so we need to wait for an upstream minor release (v1.7) to make it. Thus, moving to 4.9.
Upstream design doc to solve this issue as well as a related class of issues: https://github.com/rook/rook/pull/7885
Branch https://github.com/red-hat-data-services/rook/tree/release-4.9 has the fix from a resync. Moving to MODIFIED.
Created two OBCs via UI
-------------------------------------------------------------------
[asandler@fedora ~]$ oc get obc -A
No resources found
[asandler@fedora ~]$ oc get obc -A
NAMESPACE   NAME   STORAGE-CLASS                 PHASE   AGE
default     obc1   ocs-storagecluster-ceph-rgw   Bound   15s
default     obc2   openshift-storage.noobaa.io   Bound   3s

Deleting storagecluster
-------------------------------------------------------------
[asandler@fedora ~]$ oc delete storagecluster ocs-storagecluster -n openshift-storage
storagecluster.ocs.openshift.io "ocs-storagecluster" deleted
[asandler@fedora ~]$ oc get storagecluster -A
NAMESPACE           NAME                 AGE     PHASE      EXTERNAL   CREATED AT             VERSION
openshift-storage   ocs-storagecluster   4h15m   Deleting              2021-08-23T20:02:18Z   4.8.0
-----> stuck on Deleting

Deleting NooBaa OBC
-----------------------------------------------------------
[asandler@fedora ~]$ oc get obc -A
NAMESPACE   NAME   STORAGE-CLASS                 PHASE   AGE
default     obc1   ocs-storagecluster-ceph-rgw   Bound   12m
[asandler@fedora ~]$ oc get storagecluster -A
NAMESPACE           NAME                 AGE     PHASE      EXTERNAL   CREATED AT             VERSION
openshift-storage   ocs-storagecluster   4h17m   Deleting              2021-08-23T20:02:18Z   4.8.0
-------> still stuck

Deleting RGW OBC
------------------------------------------------------------------
[asandler@fedora ~]$ oc get obc -A
No resources found
[asandler@fedora ~]$ oc get storagecluster -A
No resources found

* PVCs were deleted too, to prevent the storagecluster from being stuck because of them.

Moving to VERIFIED.
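As the verification above shows, the uninstall now waits until all OBCs (and PVCs) are gone. A quick pre-uninstall check along these lines can confirm nothing will block deletion; this is a hypothetical helper, assumes an authenticated `oc` session, and the storage class names are the default OCS ones:

```shell
# Hypothetical pre-uninstall check: list anything that would block
# storagecluster deletion, i.e. OBCs in any namespace and PVCs bound
# to the default OCS storage classes. Assumes a logged-in `oc` session.
echo "--- OBCs (should be empty before uninstall) ---"
oc get obc -A --no-headers 2>/dev/null || echo "none"

echo "--- PVCs on OCS storage classes (should be empty) ---"
oc get pvc -A --no-headers 2>/dev/null \
  | grep -E 'ocs-storagecluster-(ceph-rbd|cephfs|ceph-rgw)' \
  || echo "none"
```

Running this before `oc delete storagecluster` avoids the deletion hanging indefinitely in the Deleting phase.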
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: Red Hat OpenShift Data Foundation 4.9.0 enhancement, security, and bug fix update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2021:5086