Bug 2005040 - Uninstallation of ODF StorageSystem via OCP Console fails, gets stuck in Terminating state
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat OpenShift Data Foundation
Classification: Red Hat Storage
Component: ocs-operator
Version: 4.9
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: medium
Target Milestone: ---
Target Release: ODF 4.14.0
Assignee: Mudit Agarwal
QA Contact: Jilju Joy
URL:
Whiteboard:
Duplicates: 2049309 (view as bug list)
Depends On: 2060897
Blocks: 1943527 1974344 2000941 2029744 2049309
 
Reported: 2021-09-16 15:14 UTC by Martin Bukatovic
Modified: 2023-08-09 17:00 UTC
CC: 17 users

Fixed In Version: 4.10.0-175
Doc Type: No Doc Update
Doc Text:
Clone Of:
Environment:
Last Closed: 2023-06-21 15:22:14 UTC
Embargoed:


Attachments


Links
System ID Private Priority Status Summary Last Updated
Github noobaa noobaa-operator pull 755 0 None Merged Fix for noobaa CR deletion (uninstall) flow 2022-01-05 09:42:06 UTC
Github noobaa noobaa-operator pull 760 0 None Merged Backport to 5.9: Fix CR dependency on uninstall 2022-01-05 09:42:06 UTC
Github red-hat-storage ocs-operator pull 1563 0 None open uninstall: delete CephCluster at the start of uninstall 2022-02-24 20:32:20 UTC
Github red-hat-storage ocs-operator pull 1566 0 None open Bug 2005040: [release-4.10] uninstall: delete CephCluster at the start of uninstall 2022-03-01 08:40:08 UTC
Github red-hat-storage rook pull 310 0 None Merged Bug 2005040: Treat cluster as not existing if the cleanup policy is set 2022-01-05 09:42:06 UTC
Github rook rook pull 9041 0 None Merged core: Treat cluster as not existing if the cleanup policy is set 2022-01-05 09:42:08 UTC
Red Hat Product Errata RHBA-2023:3742 0 None None None 2023-06-21 15:22:53 UTC

Description Martin Bukatovic 2021-09-16 15:14:46 UTC
Description of problem
======================

It's not possible to uninstall ODF StorageSystem via OCP Console web UI.

This is a regression compared to OCS 4.8.

Version-Release number of selected component
============================================

OCP 4.9.0-0.nightly-2021-09-14-200602
LSO 4.9.0-202109132154
ODF 4.9.0-139.ci

How reproducible
================

2/2

Steps to Reproduce
==================

1. Install OCP cluster.
2. Install OCS/ODF (OpenShift Data Foundation) operator.
3. Install LSO operator.
4. Start "Create a StorageSystem" wizard in OCP Console web UI and complete
   the process.
5. Wait for the StorageSystem to be installed.
6. Initiate removal of StorageSystem via OCP Console.

Actual results
==============

Uninstallation fails, StorageSystem gets stuck in Terminating state (see
screenshot #1).

Expected results
================

Uninstallation finishes with success.

Additional info
===============

This is declared as a test blocker since it slows down testing in a significant
way: instead of quick uninstallation of StorageSystem, we have to remove the
whole cluster to be able to retry StorageSystem installation again or check
StorageSystem with different configuration. Please suggest a workaround which
mitigates this problem (quick and reliable way to remove storage system) to
drop test blocker status.

Comment 5 Nitin Goyal 2021-09-16 15:57:07 UTC
StorageSystem waits on deletion of the StorageCluster. If the StorageCluster still exists, then you cannot delete the StorageSystem at all without removing finalizers. So the question is: does the StorageCluster still exist? Was uninstalling the StorageCluster working in the previous release with the same type of deployment?
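A quick way to answer that is to list both CRs and their finalizers (a sketch assuming the default openshift-storage namespace used throughout this report):

```
# Does the StorageCluster (and StorageSystem) still exist?
oc get storagesystem,storagecluster -n openshift-storage

# Which finalizers are still holding the StorageCluster?
oc get storagecluster -n openshift-storage -o jsonpath='{.items[*].metadata.finalizers}'
```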

Comment 6 Martin Bukatovic 2021-09-17 16:00:00 UTC
Ah, it seems that the root cause behind this problem is a bit more severe and
complex, and that some design overhaul will be necessary. Let me explain what
I mean:

In OCS 4.8, the operator clearly advertised the StorageCluster CRD as the most
important API it provides, and when you wanted to install an OCS-managed Ceph
cluster, you used the "Create StorageCluster" button to start the "Create
StorageCluster" wizard. The StorageCluster resource was then clearly present in
the OCS operator UI. If you decided to uninstall the cluster later, you used
the "remove" operation on the StorageCluster resource.

In ODF 4.9, the operator advertises the StorageSystem CRD as the only API it
provides, and when you want to have a cluster installed, you do so via the
"Create StorageSystem" button and wizard. When the installation finishes, you
have the StorageSystem resource clearly presented in the UI instead of
StorageCluster. One would expect that this resource is the main point for the
user to understand the overall status of the storage system, and one would
also assume it's the right place to uninstall it.

While the StorageCluster is still there, it's basically hidden away so that the
user won't see it via the UI (see e.g. BZ 2004030).

So it seems that the design of the new CRDs and the UI doesn't align well. The
question is how we want to resolve it:

- Do we want to have StorageCluster as the main resource for the storage admin
  to work with? Then we need to change the way the StorageCluster CRD works, and
  make it a better representation of the storage system as a whole.

- Do we want the storage admin to understand the components of a StorageSystem,
  so that it naturally occurs to them that to uninstall it, one needs to remove
  the StorageCluster first? Then we need to redesign most of the UI, and make sure
  that we don't allow the delete operation on the StorageSystem CR (via k8s validation).

Imho the 1st option makes more sense, but I don't see the whole picture here at
this moment.

Comment 7 Jose A. Rivera 2021-09-17 17:35:47 UTC
For context, I have to emphasize that the current backend design was taking into consideration scenarios outside of the UI, especially ones of headless automation. However, the StorageSystem CR is almost entirely a convenience API for the Console to deal with both OCS and IBM with some level of abstraction, deriving from multiple high-level discussions between product owners many months ago. It provides basically no technical benefit otherwise. As such, I'm open to doing what we can to make its interactions with the UI more agreeable, as long as the overall design doesn't suffer because of it.

This product has a history, from its inception, of designing its UI to abstract and hide. IMO it's basically a race to the bottom of "what's the absolutely bare minimum amount of information we need to expose", leading to all sorts of headaches and conversations around vague definitions of actual users. The direction this BZ has taken is a perfect example of this.

All that said, at this point this is more a product decision than a technical one. We're currently stuck with whatever UI is in the actual Console itself, but if I remember right everything outside the installation wizard will be coming in our dynamic plugin, so we still have some time and flexibility there. With that in mind we have to get a decision on just how much we want to lean on StorageSystems or StorageClusters as the primary interface for our UI customers. Obviously for existing customers upgrading this may be something of a jump, so that also needs to be considered. Honestly, my ideal would be to not rely on any CRD-based interface at all as it would give us much more flexibility in terms of displayed names and placements of a variety of UI elements, but I'm pretty sure that's out of the question for this release.

Comment 8 Martin Bukatovic 2021-09-17 22:38:10 UTC
I can confirm that the suggested workaround (removing StorageCluster resource before removing StorageSystem) works fine.
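For reference, the ordering described above expressed as oc commands (a sketch; resource names as they appear later in this report):

```
# Workaround as understood here: remove the StorageCluster first,
# then the StorageSystem.
oc delete storagecluster ocs-storagecluster -n openshift-storage --wait=true
oc delete storagesystem ocs-storagecluster-storagesystem -n openshift-storage --wait=true
```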

Dropping TestBlocker keyword.

Comment 9 Martin Bukatovic 2021-09-17 22:48:27 UTC
I can imagine that the discussion leading towards the current design was not easy nor straightforward. That said, it seems to me that we need to consider keeping the StorageCluster CR visible in the new ODF UI. It would be really nice if we could do this by tweaking the UI code we ship in the operator, as Jose pointed out. There are other problems, such as BZ 2005014, which are caused by the redesign.

Comment 10 Martin Bukatovic 2021-09-18 00:31:10 UTC
(In reply to Martin Bukatovic from comment #8)
> I can confirm that the suggested workaround (removing StorageCluster
> resource before removing StorageSystem) works fine.
> 
> Dropping TestBlocker keyword.

The datapoint above was observed during removal of a storage cluster CR which got stuck in the Error state during installation.

When I try the scenario with a successfully installed cluster, I get StorageCluster CR stuck in Deleting phase.

Command `oc describe storagecluster` reports:

```
Events:
  Type     Reason            Age   From                       Message
  ----     ------            ----  ----                       -------
  Warning  UninstallPending  15m   controller_storagecluster  uninstall: Waiting on NooBaa system noobaa to be deleted
  Warning  UninstallPending  15m   controller_storagecluster  Uninstall: Waiting for Ceph RGW Route storagecluster-cephobjectstore to be deleted
  Warning  UninstallPending  15m   controller_storagecluster  uninstall: Waiting for CephObjectStoreUser storagecluster-cephobjectstoreuser to be deleted
  Warning  UninstallPending  15m   controller_storagecluster  uninstall: Waiting for CephObjectStore storagecluster-cephobjectstore to be deleted
```

Has something changed? I was able to remove StorageCluster this way before.

Comment 11 Nitin Goyal 2021-09-20 07:46:43 UTC
(In reply to Martin Bukatovic from comment #8)
> I can confirm that the suggested workaround (removing StorageCluster
> resource before removing StorageSystem) works fine.
> 
> Dropping TestBlocker keyword.

I think I was not clear enough; let me explain again. I did not suggest removing the StorageCluster resource before removing the StorageSystem. odf-operator itself issues the delete and waits for it to complete, so no manual deletion of the StorageCluster is required.

It is like how CephCluster deletion works in 4.8: upon deletion of the StorageCluster, the StorageCluster waits for the CephCluster to be deleted. Now we have one more layer on top, which is the StorageSystem. I hope that clears all doubts regarding the uninstall.
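In other words, the supported flow is a single delete at the top of the chain, with each operator cleaning up the layer below it. A sketch (namespace and command as used elsewhere in this report):

```
# Delete only the top-level StorageSystem; odf-operator deletes the
# StorageCluster, which in turn deletes the CephCluster and the rest.
oc delete -n openshift-storage storagesystem --all --wait=true

# Watch the lower layers being cleaned up by their operators.
oc get storagecluster,cephcluster -n openshift-storage -w
```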

By removing finalizers I meant the same approach we use for the CephCluster, in case something is not right during StorageCluster deletion.


(In reply to Martin Bukatovic from comment #10)

> Has something changed? I was able to remove StorageCluster this way before.

On the ocs-operator side, we just changed the order of CephCluster deletion, but on the Rook side a lot has changed. Blaine can help you with the Rook doubts.

ocs-operator PR to change the order:
https://github.com/red-hat-storage/ocs-operator/pull/1293

Comment 12 Jose A. Rivera 2021-09-20 14:25:17 UTC
Hmm... it seems this BZ has somewhat evolved.

Martin, can you explicitly outline the testing process you're using, including any UI actions or CLI commands? If simply doing something like `oc delete storagecluster` isn't working, that is probably worth investigating via a must-gather.

Comment 13 Martin Bukatovic 2021-09-20 15:00:08 UTC
(In reply to Jose A. Rivera from comment #12)
> Hmm... it seems this BZ has somewhat evolved.
> 
> Martin, can you explicitly outline the testing process you're using,
> including any UI actions or CLI commands? If simply doing something like `oc
> delete storagecluster` isn't working, that is probably worth investigating
> via a must-gather.

The reproducer from the bug description (and the must gather referenced in
comment 4) still applies.

Sorry for the confusion on my side (which Nitin cleared up in comment 11).
I noticed the problem from comment 10 when I tried an invalid procedure
(I misunderstood Nitin's comment). I'm not sure if it's related
to the actual bug as reported here (that would be visible in a must gather
though) and whether it's worth chasing that use case (in a separate bz
maybe?) as well.

Comment 14 Martin Bukatovic 2021-09-22 09:08:27 UTC
Providing QE ack based on a triage meeting on 2021-09-21.

We have agreement that removal of the StorageSystem CR should remove the cluster; this bug should stay focused on uninstallation.

Comment 20 Martin Bukatovic 2021-10-04 14:35:21 UTC
The same behaviour can be observed when one tries to remove the storage cluster without the OCP Console.

The delete request gets stuck and the storage cluster moves into the Deleting phase, but nothing is actually removed:

```
$ oc delete storagecluster/ocs-storagecluster -n openshift-storage
storagecluster.ocs.openshift.io "ocs-storagecluster" deleted
^C
$  oc get storagecluster -n openshift-storage
NAME                 AGE   PHASE      EXTERNAL   CREATED AT             VERSION
ocs-storagecluster   15m   Deleting              2021-10-04T14:08:34Z   4.9.0
```

Checking details via `oc describe` shows the same set of warnings:

```
$ oc describe storagecluster -n openshift-storage | tail -5
  Normal   CreationSucceeded  21m    StorageCluster controller  StorageSystem ocs-storagecluster-storagesystem created for the StorageCluster ocs-storagecluster.
  Warning  UninstallPending   9m12s  controller_storagecluster  uninstall: Waiting on NooBaa system noobaa to be deleted
  Warning  UninstallPending   9m12s  controller_storagecluster  Uninstall: Waiting for Ceph RGW Route ocs-storagecluster-cephobjectstore to be deleted
  Warning  UninstallPending   9m11s  controller_storagecluster  uninstall: Waiting for CephObjectStoreUser ocs-storagecluster-cephobjectstoreuser to be deleted
  Warning  UninstallPending   9m10s  controller_storagecluster  uninstall: Waiting for CephObjectStore ocs-storagecluster-cephobjectstore to be deleted
```

I was using 4.9.0-164.ci

Comment 21 Martin Bukatovic 2021-10-04 14:50:06 UTC
Additional information
======================

When I tried to list the items controller_storagecluster is waiting to be deleted, I noticed one discrepancy:

- controller_storagecluster is waiting for CephObjectStoreUser/ocs-storagecluster-cephobjectstoreuser to be deleted
- but I see that there is CephObjectStoreUser/noobaa-ceph-objectstore-user instead

```
$ oc get CephObjectStoreUser -n openshift-storage
NAME                           AGE
noobaa-ceph-objectstore-user   37m
```

I'm not sure if this is expected or not; it could be unrelated to the problem I see here.

Comment 22 Martin Bukatovic 2021-10-04 14:52:05 UTC
Reattaching TestBlocker keyword, as there is no workaround other than removal and reinstallation of the whole OpenShift cluster.

Comment 23 Martin Bukatovic 2021-10-05 13:12:05 UTC
Asking Nimrod whether it would be possible to come up with some workaround, which would allow a storage cluster to be removed.

Comment 24 Nimrod Becker 2021-10-05 13:30:09 UTC
Manually, we can create a different BS and remove the one on top of RGW (provided it was not used for buckets; if it was, those need to be deleted first).

As for a code fix: when NooBaa goes down it will also (or at least should) remove the objectstoreuser. From the comments above it seems like NooBaa is stuck in uninstall as well (or have I mixed up the real repro and the non-repro scenarios?).

Comment 26 Alexander Indenbaum 2021-10-13 12:50:28 UTC
According to the NooBaa operator logs provided by Martin in the must-gather, during uninstall:
1. The NooBaa CR was removed: Not Found: NooBaa \"noobaa\"
2. The CephCluster CR continues to exist, even after NooBaa CR removal: Exists: \"ocs-storagecluster-cephcluster\"
3. The NooBaa operator watches the CephCluster CR in order to react to Ceph cluster capacity changes. This controller was added in PR 511, https://github.com/noobaa/noobaa-operator/pull/511

I am not sure about the flow: is it expected that the CephCluster CR continues to exist even after the NooBaa CR is removed?
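One way to confirm this ordering from the cluster itself (a sketch; the deployment name matches the noobaa-operator pod seen in the pod listings in this report):

```
# Check which of the two CRs still exist after the uninstall was initiated.
oc get noobaa,cephcluster -n openshift-storage

# Inspect the NooBaa operator's own view from its log.
oc logs -n openshift-storage deploy/noobaa-operator | grep -iE 'not found|exists'
```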

Comment 27 Alexander Indenbaum 2021-10-14 13:56:51 UTC
There might be a race condition with regard to `noobaa-ceph-objectstore-user` creation during system uninstall.

I pushed NooBaa operator image based on https://github.com/noobaa/noobaa-operator/pull/755 to:
quay.io/baum/noobaa-operator:bz_2005040_Oct_14_2021  

It would be interesting to see whether the termination issue is reproducible with this change.

Comment 29 Martin Bukatovic 2021-10-19 19:08:34 UTC
Verifying with:

- OCP 4.9.0-0.nightly-2021-10-19-063835
- LSO 4.9.0-202110012022
- ODF 4.9.0-193.ci 

I tried to initiate storagecluster removal, and I see that after about 5 minutes the cluster is stuck in the Terminating state:

```
$ oc get storagecluster -n openshift-storage
NAME                 AGE    PHASE      EXTERNAL   CREATED AT             VERSION
ocs-storagecluster   129m   Deleting              2021-10-19T16:56:56Z   4.9.0
```

Output of oc describe shows the following events:

```
  Phase:  Deleting
  Related Objects:
    API Version:       ceph.rook.io/v1
    Kind:              CephCluster
    Name:              ocs-storagecluster-cephcluster
    Namespace:         openshift-storage
    Resource Version:  114601
    UID:               47e67e29-b8c9-44b3-823c-2205fa412b8e
    API Version:       noobaa.io/v1alpha1
    Kind:              NooBaa
    Name:              noobaa
    Namespace:         openshift-storage
    Resource Version:  114984
    UID:               3f840ecd-e358-40cc-aac8-11e19b9dc899
Events:
  Type     Reason            Age   From                       Message
  ----     ------            ----  ----                       -------
  Warning  UninstallPending  20m   controller_storagecluster  uninstall: Waiting on NooBaa system noobaa to be deleted
  Warning  UninstallPending  20m   controller_storagecluster  uninstall: Waiting for CephObjectStoreUser ocs-storagecluster-cephobjectstoreuser to be deleted
  Warning  UninstallPending  20m   controller_storagecluster  uninstall: Waiting for CephObjectStore ocs-storagecluster-cephobjectstore to be deleted
```

Compared to comment 20, there is one fewer warning, but otherwise the problem is still here.

List of pods running:

```
$ oc get pods -n openshift-storage 
NAME                                                              READY   STATUS      RESTARTS       AGE
csi-cephfsplugin-7ckrs                                            3/3     Running     0              117m
csi-cephfsplugin-cg84n                                            3/3     Running     0              117m
csi-cephfsplugin-jqsjq                                            3/3     Running     0              117m
csi-cephfsplugin-provisioner-5b4d988899-mg8pj                     6/6     Running     0              117m
csi-cephfsplugin-provisioner-5b4d988899-nvqsr                     6/6     Running     0              117m
csi-cephfsplugin-s252p                                            3/3     Running     0              117m
csi-cephfsplugin-wv572                                            3/3     Running     0              117m
csi-cephfsplugin-zh5lt                                            3/3     Running     0              117m
csi-rbdplugin-bgjjq                                               3/3     Running     0              117m
csi-rbdplugin-bql8h                                               3/3     Running     0              117m
csi-rbdplugin-fpwr7                                               3/3     Running     0              117m
csi-rbdplugin-hlq4p                                               3/3     Running     0              117m
csi-rbdplugin-nhmlt                                               3/3     Running     0              117m
csi-rbdplugin-provisioner-676987456c-s2cg7                        6/6     Running     0              117m
csi-rbdplugin-provisioner-676987456c-tzd6h                        6/6     Running     0              117m
csi-rbdplugin-vzldh                                               3/3     Running     0              117m
noobaa-operator-5895464d68-hgmht                                  1/1     Running     0              121m
ocs-metrics-exporter-6b8887d6ff-wqvnj                             1/1     Running     0              121m
ocs-operator-84cbfbcc97-7sc9p                                     1/1     Running     0              121m
odf-console-797d6f968f-8ljbf                                      1/1     Running     0              122m
odf-operator-controller-manager-58849f95c7-6q2dh                  2/2     Running     1 (120m ago)   122m
rook-ceph-crashcollector-compute-0-5df8f8fd78-zbz8x               1/1     Running     0              115m
rook-ceph-crashcollector-compute-1-c5b68564c-glhqf                1/1     Running     0              115m
rook-ceph-crashcollector-compute-2-7b649f48fb-jbv8t               1/1     Running     0              116m
rook-ceph-crashcollector-compute-3-85655499c9-7vchr               1/1     Running     0              116m
rook-ceph-crashcollector-compute-4-6d65d9d48b-mz6cn               1/1     Running     0              116m
rook-ceph-crashcollector-compute-5-6459f445f-8d5zd                1/1     Running     0              115m
rook-ceph-mds-ocs-storagecluster-cephfilesystem-a-5ff896dcbk2g2   2/2     Running     0              115m
rook-ceph-mds-ocs-storagecluster-cephfilesystem-b-6c784564bk7cf   2/2     Running     0              115m
rook-ceph-mgr-a-7b9bf457fd-j9vss                                  2/2     Running     0              116m
rook-ceph-mon-a-655b6888f6-x9g64                                  2/2     Running     0              117m
rook-ceph-mon-b-667f56f7d5-vl2lt                                  2/2     Running     0              116m
rook-ceph-mon-c-697c784ccc-pd5sb                                  2/2     Running     0              116m
rook-ceph-operator-6bf667c8cf-zb4t6                               1/1     Running     0              121m
rook-ceph-osd-0-5f65c96569-5p89p                                  2/2     Running     0              116m
rook-ceph-osd-1-7545dbd854-j2rdh                                  2/2     Running     0              116m
rook-ceph-osd-2-5965dfc57d-tc9pn                                  2/2     Running     0              116m
rook-ceph-osd-3-54d9f9bb9c-rzzsm                                  2/2     Running     0              116m
rook-ceph-osd-4-c67b555dc-r97n7                                   2/2     Running     0              116m
rook-ceph-osd-5-74dbd866c4-6klp9                                  2/2     Running     0              115m
rook-ceph-osd-6-755c76cf77-5kh7j                                  2/2     Running     0              115m
rook-ceph-osd-7-85c8dd5b64-jv59v                                  2/2     Running     0              115m
rook-ceph-osd-prepare-ocs-deviceset-localblock-0-data---1-fwp5l   0/1     Completed   0              116m
rook-ceph-osd-prepare-ocs-deviceset-localblock-0-data---1-lql6x   0/1     Completed   0              116m
rook-ceph-osd-prepare-ocs-deviceset-localblock-0-data---1-zv5h8   0/1     Completed   0              116m
rook-ceph-osd-prepare-ocs-deviceset-localblock-1-data---1-qfdtk   0/1     Completed   0              116m
rook-ceph-osd-prepare-ocs-deviceset-localblock-1-data---1-vx8c5   0/1     Completed   0              116m
rook-ceph-osd-prepare-ocs-deviceset-localblock-1-data---1-xpj8m   0/1     Completed   0              116m
rook-ceph-osd-prepare-ocs-deviceset-localblock-2-data---1-gcdb5   0/1     Completed   0              116m
rook-ceph-osd-prepare-ocs-deviceset-localblock-2-data---1-qwbht   0/1     Completed   0              112m
rook-ceph-osd-prepare-ocs-deviceset-localblock-2-data---1-wbsvg   0/1     Completed   0              116m
rook-ceph-rgw-ocs-storagecluster-cephobjectstore-a-59b5695sv9qt   2/2     Running     0              115m
rook-ceph-tools-77cb57894f-vl4pv                                  1/1     Running     0              115m
```

>>> ASSIGNED

Comment 32 Sébastien Han 2021-10-20 12:26:50 UTC
At a quick glance, it seems that one bucket is still present, "nb.1634662874470.apps.mbukatov-1019b.qe.rh-ocs.com"; that's the reason why the ceph-object-store-user-controller failed to remove the user and the CephObjectStoreUser CR.

See the error:

2021-10-19T18:47:21.747629646Z 2021-10-19 18:47:21.747556 E | ceph-object-store-user-controller: failed to reconcile failed to delete ceph object user "noobaa-ceph-objectstore-user": failed to delete ceph object user "noobaa-ceph-objectstore-user".: BucketAlreadyExists tx00000000000000000059e-00616f12b9-3a60-ocs-storagecluster-cephobjectstore 3a60-ocs-storagecluster-cephobjectstore-ocs-storagecluster-cephobjectstore

So the bucket must be cleaned up in order to proceed with the deletion.
Assigning to Blaine since this is related to the recent "dependent" patch.

Martin, where is this bucket coming from? OBC?
Thanks
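For anyone debugging this, the leftover buckets and users can be listed from the Rook toolbox (a sketch; assumes the rook-ceph-tools deployment that appears in the pod listing above):

```
# List RGW users and buckets to see what is blocking the
# CephObjectStoreUser / CephObjectStore deletion.
oc -n openshift-storage rsh deploy/rook-ceph-tools radosgw-admin user list
oc -n openshift-storage rsh deploy/rook-ceph-tools radosgw-admin bucket list
```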

Comment 33 Sébastien Han 2021-10-20 13:55:48 UTC
I'm fine keeping it since I started the initial evaluation of the bug.

Comment 34 Danny 2021-10-20 14:03:35 UTC
Hi Sebastien,

This bucket is created by NooBaa to use as the default backing-store. It is created via the S3 API and not an OBC.
As far as I remember, by design, NooBaa does not delete the buckets/containers it uses as storage for backing-stores (regardless of the type: AWS/Azure/RGW, etc.).

Comment 35 Martin Bukatovic 2021-10-20 14:46:55 UTC
(In reply to Sébastien Han from comment #32)
> Martin, where is this bucket coming from? OBC?

I haven't created any buckets myself. It is created by NooBaa as explained by Danny in comment 34.

Comment 36 Sébastien Han 2021-10-20 15:15:35 UTC
(In reply to Martin Bukatovic from comment #35)
> (In reply to Sébastien Han from comment #32)
> > Martin, where is this bucket coming from? OBC?
> 
> I haven't created any buckets myself. It is created by NooBaa as explained
> by Danny in comment 34.

Ok thanks, Martin and Danny. It looks like we are catching this now because Rook has become more protective of the resources it creates.
If we want to force the deletion, the finalizer must be removed, or NooBaa should remove the bucket during uninstallation.

José, is this something ocs-op could do (remove the finalizer)?

Comment 38 Travis Nielsen 2021-10-25 15:20:04 UTC
Seb, Blaine, and I discussed this and we don't see that Rook upstream should accommodate the forced uninstall case with a setting in the CR. The finalizers are the protection for the cluster. If the protection is not desired, the finalizers should be force removed, which means the OCS operator really is the only place this could happen.

Comment 39 Blaine Gardner 2021-10-26 19:27:25 UTC
We don't want to set in place mechanisms for upstream administrators to potentially destroy their data accidentally. We have put a lot of design and intention into Rook to have better default-safe behaviors for user data.

After discussion between myself, Jose, and Travis, we found that we can make an uninstallation optimization to meet the needs expressed here:

If  : we are deleting a CephObjectStore (could be any dependent resource, but let's keep this for the example)
and : the CephCluster has the `yes-really-destroy-data` cleanup policy set
and : the CephCluster has a nonzero deletion timestamp
then: we can treat deletion of the CephObjectStore as though the CephCluster doesn't exist, because we can be pretty sure it will be gone very soon (i.e., just delete the CephObjectStore resource)

I think we will want to track this BZ for OCS-operator and Rook both since both will need to implement some changes. Rook implements the logic above, and OCS-operator needs to delete the CephCluster (with yes-really-destroy-data) and all other resources at the same time in order for Rook to proceed with deletion. I'll move this BZ to Rook for now.
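For illustration, the optimized path expressed as plain oc commands (a sketch only; ocs-operator is expected to drive these steps itself, and the cleanupPolicy field follows the Rook CephCluster API):

```
# 1. Mark the CephCluster for cleanup (data destruction confirmation).
oc patch cephcluster ocs-storagecluster-cephcluster -n openshift-storage --type merge \
  -p '{"spec":{"cleanupPolicy":{"confirmation":"yes-really-destroy-data"}}}'

# 2. Request deletion of the CephCluster and the CephObjectStore together, so the
#    object store controller sees a cluster that is marked for cleanup and being
#    deleted, and therefore skips the user-bucket dependency checks.
oc delete cephcluster ocs-storagecluster-cephcluster -n openshift-storage --wait=false
oc delete cephobjectstore ocs-storagecluster-cephobjectstore -n openshift-storage --wait=false
```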

Comment 41 Anna Sandler 2021-11-15 22:00:48 UTC
Verifying this bug by the updated 4.9 uninstall flow.
After deleting finalizers and dependent resources, using the command "oc delete -n openshift-storage storagesystem --all --wait=true",
the storagecluster and storagesystem were deleted successfully and were not stuck on terminating.

tested on OCP 4.9 on AWS 
openshift-storage                      mcg-operator.v4.9.0   NooBaa Operator               4.9.0                Succeeded
openshift-storage                      ocs-operator.v4.9.0   OpenShift Container Storage   4.9.0                Succeeded
openshift-storage                      odf-operator.v4.9.0   OpenShift Data Foundation     4.9.0                Succeeded

Comment 42 Anna Sandler 2021-11-15 22:07:43 UTC
Verifying this bug by the updated 4.9 uninstall flow.
After deleting finalizers and dependent resources, deleted the storagesystem using the UI.
The storagecluster and storagesystem were deleted successfully and were not stuck on terminating.

The process can be seen in the added attachment.

tested on OCP 4.9 on AWS 
openshift-storage                      mcg-operator.v4.9.0   NooBaa Operator               4.9.0                Succeeded
openshift-storage                      ocs-operator.v4.9.0   OpenShift Container Storage   4.9.0                Succeeded
openshift-storage                      odf-operator.v4.9.0   OpenShift Data Foundation     4.9.0                Succeeded

Comment 44 Jilju Joy 2021-11-16 11:31:48 UTC
The bug status was moved to ON_QA by the errata system. Hit this issue again in a VMware platform internal mode cluster. Changing the bug status to MODIFIED.
@Mudit FYI. There are other bugs where the status was changed by the errata system.

Comment 45 Jilju Joy 2021-11-16 11:49:46 UTC
Hit this issue as mentioned in Comment 14. Details and logs will be shared in next comment.

Comment 46 Jilju Joy 2021-11-16 11:56:45 UTC
The command to delete storagesystem is not completing because storagesystem is waiting for storagecluster to get deleted.

$ oc delete -n openshift-storage storagesystem --all --wait=true
storagesystem.odf.openshift.io "storagesystem-odf" deleted


Storagecluster is not getting deleted.
$ oc get storagecluster
NAME                 AGE   PHASE      EXTERNAL   CREATED AT             VERSION
ocs-storagecluster   28h   Deleting              2021-11-15T05:42:30Z   4.9.0

Events from Storagecluster ocs-storagecluster:

Events:
  Type     Reason            Age   From                       Message
  ----     ------            ----  ----                       -------
  Warning  UninstallPending  31m   controller_storagecluster  uninstall: Waiting on NooBaa system noobaa to be deleted
  Warning  UninstallPending  31m   controller_storagecluster  uninstall: Waiting for CephObjectStoreUser ocs-storagecluster-cephobjectstoreuser to be deleted
  Warning  UninstallPending  31m   controller_storagecluster  uninstall: Waiting for CephObjectStore ocs-storagecluster-cephobjectstore to be deleted


storagecluster is not deleted due to the presence of CephObjectStoreUser and CephObjectStore.
$ oc get CephObjectStoreUser
NAME                           AGE
noobaa-ceph-objectstore-user   28h
$ oc get CephObjectStore
NAME                                 AGE
ocs-storagecluster-cephobjectstore   28h

Events from CephObjectStore ocs-storagecluster-cephobjectstore:

Events:
  Type     Reason           Age   From                         Message
  ----     ------           ----  ----                         -------
  Warning  ReconcileFailed  34m   rook-ceph-object-controller  CephObjectStore "openshift-storage/ocs-storagecluster-cephobjectstore" will not be deleted until all dependents are removed: CephObjectStoreUsers: [noobaa-ceph-objectstore-user], buckets in the object store (could be from ObjectBucketClaims or COSI Buckets): [nb.1636955290789.apps.jijoy-nov15.qe.rh-ocs.com]


Tested in version:
4.9.0-237.ci
4.10.0-0.nightly-2021-11-14-184249

Platform is VMware.
Internal mode cluster.

Comment 47 Jilju Joy 2021-11-16 12:04:08 UTC
Adding to comment #46


$ oc get cephobjectstoreuser -o yaml
apiVersion: v1
items:
- apiVersion: ceph.rook.io/v1
  kind: CephObjectStoreUser
  metadata:
    creationTimestamp: "2021-11-15T05:46:29Z"
    deletionGracePeriodSeconds: 0
    deletionTimestamp: "2021-11-16T09:38:15Z"
    finalizers:
    - cephobjectstoreuser.ceph.rook.io
    generation: 2
    name: noobaa-ceph-objectstore-user
    namespace: openshift-storage
    ownerReferences:
    - apiVersion: noobaa.io/v1alpha1
      blockOwnerDeletion: true
      controller: true
      kind: NooBaa
      name: noobaa
      uid: ead94976-a44e-4ab1-b76f-8acc127f9d41
    resourceVersion: "1228781"
    uid: b298e368-79eb-4b68-ac10-936bf9db9c80
  spec:
    displayName: my display name
    store: ocs-storagecluster-cephobjectstore
  status:
    info:
      secretName: rook-ceph-object-user-ocs-storagecluster-cephobjectstore-noobaa-ceph-objectstore-user
    phase: Ready
kind: List
metadata:
  resourceVersion: ""
  selfLink: ""



$ oc describe cephobjectstore
Name:         ocs-storagecluster-cephobjectstore
Namespace:    openshift-storage
Labels:       <none>
Annotations:  <none>
API Version:  ceph.rook.io/v1
Kind:         CephObjectStore
Metadata:
  Creation Timestamp:             2021-11-15T05:42:31Z
  Deletion Grace Period Seconds:  0
  Deletion Timestamp:             2021-11-16T09:38:16Z
  Finalizers:
    cephobjectstore.ceph.rook.io
  Generation:  2
  Managed Fields:
    API Version:  ceph.rook.io/v1
    Fields Type:  FieldsV1
    fieldsV1:
      f:metadata:
        f:ownerReferences:
          .:
          k:{"uid":"c75b415e-13fa-40fc-8e4c-0bbabdf62275"}:
      f:spec:
        .:
        f:dataPool:
          .:
          f:compressionMode:
          f:erasureCoded:
            .:
            f:codingChunks:
            f:dataChunks:
          f:failureDomain:
          f:mirroring:
          f:quotas:
          f:replicated:
            .:
            f:replicasPerFailureDomain:
            f:size:
            f:targetSizeRatio:
          f:statusCheck:
            .:
            f:mirror:
        f:gateway:
          .:
          f:instances:
          f:placement:
            .:
            f:nodeAffinity:
              .:
              f:requiredDuringSchedulingIgnoredDuringExecution:
                .:
                f:nodeSelectorTerms:
            f:podAntiAffinity:
              .:
              f:preferredDuringSchedulingIgnoredDuringExecution:
              f:requiredDuringSchedulingIgnoredDuringExecution:
            f:tolerations:
          f:port:
          f:priorityClassName:
          f:resources:
            .:
            f:limits:
              .:
              f:cpu:
              f:memory:
            f:requests:
              .:
              f:cpu:
              f:memory:
          f:securePort:
          f:service:
            .:
            f:annotations:
              .:
              f:service.beta.openshift.io/serving-cert-secret-name:
        f:healthCheck:
          .:
          f:bucket:
        f:metadataPool:
          .:
          f:compressionMode:
          f:erasureCoded:
            .:
            f:codingChunks:
            f:dataChunks:
          f:failureDomain:
          f:mirroring:
          f:quotas:
          f:replicated:
            .:
            f:replicasPerFailureDomain:
            f:size:
          f:statusCheck:
            .:
            f:mirror:
        f:zone:
          .:
          f:name:
    Manager:      ocs-operator
    Operation:    Update
    Time:         2021-11-15T05:42:31Z
    API Version:  ceph.rook.io/v1
    Fields Type:  FieldsV1
    fieldsV1:
      f:status:
        f:bucketStatus:
          f:lastChecked:
    Manager:      rook
    Operation:    Update
    Time:         2021-11-16T09:38:10Z
    API Version:  ceph.rook.io/v1
    Fields Type:  FieldsV1
    fieldsV1:
      f:status:
        f:conditions:
        f:phase:
    Manager:      rook
    Operation:    Update
    Subresource:  status
    Time:         2021-11-16T09:38:18Z
  Owner References:
    API Version:           ocs.openshift.io/v1
    Block Owner Deletion:  true
    Controller:            true
    Kind:                  StorageCluster
    Name:                  ocs-storagecluster
    UID:                   c75b415e-13fa-40fc-8e4c-0bbabdf62275
  Resource Version:        1325984
  UID:                     bf664b76-cb79-4420-ae0b-8968ecd41d83
Spec:
  Data Pool:
    Compression Mode:  none
    Erasure Coded:
      Coding Chunks:  0
      Data Chunks:    0
    Failure Domain:   rack
    Mirroring:
    Quotas:
    Replicated:
      Replicas Per Failure Domain:  1
      Size:                         3
      Target Size Ratio:            0.49
    Status Check:
      Mirror:
  Gateway:
    Instances:  1
    Placement:
      Node Affinity:
        Required During Scheduling Ignored During Execution:
          Node Selector Terms:
            Match Expressions:
              Key:       cluster.ocs.openshift.io/openshift-storage
              Operator:  Exists
      Pod Anti Affinity:
        Preferred During Scheduling Ignored During Execution:
          Pod Affinity Term:
            Label Selector:
              Match Expressions:
                Key:       app
                Operator:  In
                Values:
                  rook-ceph-rgw
            Topology Key:  kubernetes.io/hostname
          Weight:          100
        Required During Scheduling Ignored During Execution:
          Label Selector:
            Match Expressions:
              Key:       app
              Operator:  In
              Values:
                rook-ceph-rgw
          Topology Key:  kubernetes.io/hostname
      Tolerations:
        Effect:           NoSchedule
        Key:              node.ocs.openshift.io/storage
        Operator:         Equal
        Value:            true
    Port:                 80
    Priority Class Name:  openshift-user-critical
    Resources:
      Limits:
        Cpu:     2
        Memory:  4Gi
      Requests:
        Cpu:      2
        Memory:   4Gi
    Secure Port:  443
    Service:
      Annotations:
        service.beta.openshift.io/serving-cert-secret-name:  ocs-storagecluster-cos-ceph-rgw-tls-cert
  Health Check:
    Bucket:
  Metadata Pool:
    Compression Mode:  none
    Erasure Coded:
      Coding Chunks:  0
      Data Chunks:    0
    Failure Domain:   rack
    Mirroring:
    Quotas:
    Replicated:
      Replicas Per Failure Domain:  1
      Size:                         3
    Status Check:
      Mirror:
  Zone:
    Name:  
Status:
  Bucket Status:
    Health:        Connected
    Last Changed:  2021-11-15T05:46:55Z
    Last Checked:  2021-11-16T09:38:10Z
  Conditions:
    Last Heartbeat Time:   2021-11-16T11:58:45Z
    Last Transition Time:  2021-11-16T09:38:18Z
    Message:               CephObjectStore "openshift-storage/ocs-storagecluster-cephobjectstore" will not be deleted until all dependents are removed: CephObjectStoreUsers: [noobaa-ceph-objectstore-user], buckets in the object store (could be from ObjectBucketClaims or COSI Buckets): [nb.1636955290789.apps.jijoy-nov15.qe.rh-ocs.com]
    Reason:                ObjectHasDependents
    Status:                True
    Type:                  DeletionIsBlocked
  Info:
    Endpoint:         http://rook-ceph-rgw-ocs-storagecluster-cephobjectstore.openshift-storage.svc:80
    Secure Endpoint:  https://rook-ceph-rgw-ocs-storagecluster-cephobjectstore.openshift-storage.svc:443
  Phase:              Deleting
Events:
  Type     Reason           Age                 From                         Message
  ----     ------           ----                ----                         -------
  Warning  ReconcileFailed  35m                 rook-ceph-object-controller  failed to check for object buckets. failed to get admin ops API context: failed to create or retrieve rgw admin ops user: failed to create object user "rgw-admin-ops-user". error code 1 for object store "ocs-storagecluster-cephobjectstore": failed to create s3 user. . : signal: interrupt
  Warning  ReconcileFailed  35m (x3 over 140m)  rook-ceph-object-controller  CephObjectStore "openshift-storage/ocs-storagecluster-cephobjectstore" will not be deleted until all dependents are removed: CephObjectStoreUsers: [noobaa-ceph-objectstore-user], buckets in the object store (could be from ObjectBucketClaims or COSI Buckets): [nb.1636955290789.apps.jijoy-nov15.qe.rh-ocs.com]

Comment 49 Travis Nielsen 2021-11-16 19:58:21 UTC
The operator log shows that the object store user failed to be deleted because the bucket exists.

2021-11-16 09:38:17.566321 E | ceph-object-store-user-controller: failed to reconcile failed to delete ceph object user "noobaa-ceph-objectstore-user": failed to delete ceph object user "noobaa-ceph-objectstore-user".: BucketAlreadyExists tx00000000000000000561e-0061937c09-5fe4-ocs-storagecluster-cephobjectstore 5fe4-ocs-storagecluster-cephobjectstore-ocs-storagecluster-cephobjectstore
2021-11-16 09:38:17.613729 I | op-mon: parsing mon endpoints: a=172.30.40.140:6789,b=172.30.128.131:6789,c=172.30.35.111:6789
2021-11-16 09:38:17.613837 I | ceph-object-store-user-controller: CephObjectStore "ocs-storagecluster-cephobjectstore" found
2021-11-16 09:38:17.613968 I | ceph-object-store-user-controller: CephObjectStore "ocs-storagecluster-cephobjectstore" found
2021-11-16 09:38:17.980512 E | ceph-object-store-user-controller: failed to reconcile failed to delete ceph object user "noobaa-ceph-objectstore-user": failed to delete ceph object user "noobaa-ceph-objectstore-user".: BucketAlreadyExists tx00000000000000000561f-0061937c09-5fe4-ocs-storagecluster-cephobjectstore 5fe4-ocs-storagecluster-cephobjectstore-ocs-storagecluster-cephobjectstore
2021-11-16 09:38:18.051215 I | ceph-object-controller: CephObjectStore "openshift-storage/ocs-storagecluster-cephobjectstore" will not be deleted until all dependents are removed: CephObjectStoreUsers: [noobaa-ceph-objectstore-user], buckets in the object store (could be from ObjectBucketClaims or COSI Buckets): [nb.1636955290789.apps.jijoy-nov15.qe.rh-ocs.com]
2021-11-16 09:38:18.068891 E | ceph-object-controller: failed to reconcile CephObjectStore "openshift-storage/ocs-storagecluster-cephobjectstore". CephObjectStore "openshift-storage/ocs-storagecluster-cephobjectstore" will not be deleted until all dependents are removed: CephObjectStoreUsers: [noobaa-ceph-objectstore-user], buckets in the object store (could be from ObjectBucketClaims or COSI Buckets): [nb.1636955290789.apps.jijoy-nov15.qe.rh-ocs.com]
2021-11-16 09:38:18.068932 I | op-k8sutil: Reporting Event openshift-storage:ocs-storagecluster-cephobjectstore Warning:ReconcileFailed:CephObjectStore "openshift-storage/ocs-storagecluster-cephobjectstore" will not be deleted until all dependents are removed: CephObjectStoreUsers: [noobaa-ceph-objectstore-user], buckets in the object store (could be from ObjectBucketClaims or COSI Buckets): [nb.1636955290789.apps.jijoy-nov15.qe.rh-ocs.com]

I don't see any OBCs existing in the cluster, so it seems the bucket should have been deleted. If the bucket didn't exist, the user would be deleted, the object store would be cleaned up, and the uninstall could proceed.

Blaine could you take a look?

Comment 50 Blaine Gardner 2021-11-16 20:48:22 UTC
This comment explains why there is a bucket on the user created for NooBaa. NooBaa creates it to be a default bucket and doesn't delete it when NooBaa is deleted. https://bugzilla.redhat.com/show_bug.cgi?id=2005040#c34

-----

The problem, as far as I can tell, is that the CephCluster doesn't have a deletionTimestamp. It has the `cleanupPolicy` set, but not a deletionTimestamp, so the optimized/forced deletion is not happening as intended, based on this comment: https://bugzilla.redhat.com/show_bug.cgi?id=2005040#c39. Both are necessary for Rook to do the optimized/forced delete.

I believe this suggests that we need a change from ocs-operator to request deletion of all resources (including the CephCluster) in order to proceed with the optimized/forced delete strategy.
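Both conditions can be checked directly on the CephCluster CR (a sketch):

```
# Is the cleanup policy set, and has deletion actually been requested?
oc get cephcluster -n openshift-storage -o jsonpath='{range .items[*]}{.metadata.name}{" cleanupPolicy="}{.spec.cleanupPolicy.confirmation}{" deletionTimestamp="}{.metadata.deletionTimestamp}{"\n"}{end}'
```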

-----

I'm confused how the same procedure yielded different results here https://bugzilla.redhat.com/show_bug.cgi?id=2005040#c41 and here https://bugzilla.redhat.com/show_bug.cgi?id=2005040#c46.

@jijoy and @asandler, are these two different uninstall cases somehow?

Comment 52 Jilju Joy 2021-11-17 06:26:58 UTC
Testing was done on the VMware platform, where the cephobjectstoreuser "noobaa-ceph-objectstore-user" will be present.

Comment 53 Blaine Gardner 2021-11-17 16:46:39 UTC
I think I am understanding from the responses that the AWS tests don't install an object store or NooBaa, so they would naturally not be affected by this bug.

Given that the deletion timestamp was missing on the CephCluster resource in Jilju's tests, I believe this still means OCS-Operator needs to make some slight adjustments to ensure it is deleting the CephCluster and all other Rook resources at the same time.

Notably, the following command should cause the OCS-Operator to delete all the Rook resources:
  > $ oc delete -n openshift-storage storagesystem --all --wait=true

Comment 60 Martin Bukatovic 2021-11-25 17:22:42 UTC
(In reply to Anna Sandler from comment #41)
> Verifying this bug by the updated 4.9 uninstall flow
> after deleting finalizers

Right now, the current documentation:

https://gitlab.cee.redhat.com/red-hat-openshift-container-storage-documentation/openshift-data-foundation-documentation/tree/6c33df43168f6c21f7b221e27710684c7ef6788b

doesn't mention anything about removing finalizers. If this information was about
to be added at the moment the above statement was made, I would expect a reference
to a bug or JIRA tracking it.

The note about finalizers also conflicts with Blaine's suggested approach for
the fix noted in comment 39. If we have decided to do something else,
I'm missing a clear statement about that between comment 39 and comment 41.

Only in comment 58 do I see that the decision about this bug was basically not
to fix it.

> the docs are clear

Does it mean that the hack about finalizers won't be necessary?
Could you reference a description of this procedure somewhere?

> tested on OCP 4.9 on AWS
> the flow works as needed and the bug is fixed.

This also needs to be tested on vSphere with LSO. I haven't noted down that
this is vSphere specific, but vSphere is the only on-premise platform
where we can deploy LSO in a way that is usable for testing purposes
(mimicking an on-premise LSO setup; using AWS with or without LSO won't do).

Comment 61 Martin Bukatovic 2021-11-25 17:44:19 UTC
(In reply to Jose A. Rivera from comment #57)
> "It's not a good user experience" is not an argument for blocking a release.

I would not describe this as just bad UX. Problems like this are
unacceptable, and if we let them in, we will end up with a hard-to-maintain
mess in the end.

> If there's no functional harm, no chance of production data corruption
> (we're intentionally destroying access to the data at this point!), and a
> workaround exists,

Do you mean the workaround noted in

https://bugzilla.redhat.com/show_bug.cgi?id=2000941 ?

I haven't found any direct reference to the workaround.
At the moment I'm writing this, I don't see it in the KCS
https://access.redhat.com/articles/6525111 either.

>  it's not a *blocker*.

It is a blocker since it's a regression: as noted in the original bug
report, the procedure explained in the reproducer was working fine before.
Moreover, it has a nontrivial testing impact.

Of course, if the program agrees not to fix it for a particular
reason, that is another question.

Comment 62 Martin Bukatovic 2021-11-25 17:59:31 UTC
I just retried the original reproducer, and can still see the same behaviour:

```
Events:
  Type     Reason            Age    From                       Message
  ----     ------            ----   ----                       -------
  Warning  UninstallPending  3m33s  controller_storagecluster  uninstall: Waiting on NooBaa system noobaa to be deleted
  Warning  UninstallPending  3m33s  controller_storagecluster  uninstall: Waiting for CephObjectStoreUser ocs-storagecluster-cephobjectstoreuser to be deleted
  Warning  UninstallPending  3m32s  controller_storagecluster  uninstall: Waiting for CephObjectStore ocs-storagecluster-cephobjectstore to be deleted
```

Retried with vSphere LSO cluster:

OCP 4.9.0-0.nightly-2021-11-24-090558
LSO 4.9.0-202111151318
OCS 4.9.0-249.ci

Comment 63 Martin Bukatovic 2021-11-25 23:48:17 UTC
I also tried to perform the "workaround" suggested in BZ 2000941 before removing the ODF storage system:

```
$ oc patch -n openshift-storage noobaa/noobaa --type=merge -p '{"metadata": {"finalizers":null}}'
$ oc patch -n openshift-storage backingstore/noobaa-default-backing-store --type=merge -p '{"metadata": {"finalizers":null}}'
$ oc patch -n openshift-storage bucketclasses.noobaa.io/noobaa-default-bucket-class --type=merge -p '{"metadata": {"finalizers":null}}'
```

But I observed no difference; removal of the StorageCluster still got stuck.

It's possible that I got the workaround wrong. In that case, could someone be so kind as to explain a workaround procedure so that the ODF StorageSystem can be consistently removed?

Comment 65 Blaine Gardner 2021-12-01 17:56:00 UTC
Martin and I had a sync-up chat about this. We are both pretty confused and not sure of the best way to proceed. I will lay out the component interactions that are causing this issue to the best of my knowledge, the workarounds I know, and from there we should figure out what to do to resolve the issue.

For 4.9, we should update documentation with a workaround. Otherwise, users will not know how to delete an OCS storage cluster.

For 4.10, we will have to decide whether to fix this in a component or whether documentation is the preferred fix.

---
CURRENT STATE

This issue has arisen in 4.9 because Rook includes broader checks for user data when it deletes object stores. In order to not delete object stores with user data, Rook will block deleting a CephObjectStore if the object store has any user buckets created. In Rook upstream, the way to force delete the CephObjectStore resource (which won't delete the pools for the store) is to remove the finalizer on the CephObjectStore resource.

NooBaa creates at least one bucket in the CephObjectStore and does not delete the bucket when NooBaa is deleted. Because of this, Rook does not remove the CephObjectStore, not wanting to delete user data.

Travis implemented an "optimized" deletion path in Rook which will skip checks for user buckets if three criteria are met:
1. The CephCluster has the `cleanupPolicy` set
2. The CephCluster has a deletion timestamp (it has been requested to be deleted)
3. The CephObjectStore has a deletion timestamp (it has also been requested to be deleted)

From the GChat discussion here https://chat.google.com/room/AAAAREGEba8/9MaG2Ig_TWM, it is my understanding that Jose does not wish to use the optimized deletion path in OCS in order to protect from accidental deletion of user data.

At this point, deleting an OCS cluster with a CephObjectStore and NooBaa will always hang because NooBaa does not delete the bucket when NooBaa is removed, OCS does not force the CephObjectStore to be deleted using the "optimized" deletion path, and Rook will not delete the object store with NooBaa's bucket still there.

---
WORKAROUNDS

There are 2 potential workarounds to this issue to force deletion; both are sketched below.
1. Remove the finalizer from the CephObjectStore
2. Use `oc delete` (or `kubectl delete`) to delete the CephCluster to use the "optimized" deletion path
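
A sketch of both, using resource names from this report (the finalizer patch is the command later confirmed in comment 68):

```
# Workaround 1: drop the finalizer on the CephObjectStore.
oc patch -n openshift-storage cephobjectstore/ocs-storagecluster-cephobjectstore \
  --type=merge -p '{"metadata": {"finalizers":null}}'

# Workaround 2: delete the CephCluster itself so Rook takes the "optimized" path
# (assumes the cleanupPolicy confirmation has already been set on it).
oc delete cephcluster ocs-storagecluster-cephcluster -n openshift-storage
```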

---
WHAT DO WE DO NEXT?

I see 3 possible paths forward; there may be more. I would like to get input from NooBaa, Jose, and Mudit about how to proceed. Options I see:

1. We update documentation to require users to perform one of the workaround steps when they want to remove the OCS storage cluster (bare minimum for 4.9)
2. Change NooBaa's uninstall procedure to remove the bucket(s) it creates in the CephObjectStore when NooBaa is being deleted
3. Use the "optimized" deletion path in ocs-operator

@muagarwa , @jrivera, @nbecker

Comment 66 Nimrod Becker 2021-12-02 09:19:45 UTC
Thanks for the good summary Blaine.

I want to add details regarding 4.10, which might complicate suggested approach #2, but also provide a suggestion out of the mess if we go with #1

As you wrote, NooBaa creates a default BS on top of RGW for on-prem deployments. That is in addition to the fact that a customer can create any number of BackingStores on top of new or existing RGW buckets when RGW is available (internal OR external). Since there is the option of creating a BS on top of an existing RGW bucket, with existing data which was not written via ODF/NooBaa, we get to the same point of protecting user data that Rook implemented... we don't want to delete that data.

In 4.10, to make things a little more complicated, even the default BS won't be "safe" to delete, since we are adding an ability (requested by several customers) to set the default BS they want rather than necessarily keep going with the out-of-the-box default. This means that even the default could now point to a bucket with data not written via ODF/NooBaa.

The only way I see #2 working is by giving the customer (in the UI and CLI flows) a warning in case their default BS is on top of RGW, and asking them to confirm that ALL data would be deleted. This way the customer takes the decision, and if they are OK with that, so should we be. That would mean passing something similar to the "force" option to let the noobaa-operator know about this choice. Even if we go with this path, though, we still need to think about the deletion process: there is no "delete all files" in S3, and a delete-bucket command fails if there is data in it. A client essentially iterates over all objects and deletes them. This can take quite some time, and I'm not sure we would want to wait that long during uninstall. So if we go with this approach (and add the warning for the customer), we would need to think about how we can efficiently delete, or mark for deletion, the bucket and the objects in it.
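For a sense of what that client-side cleanup involves, a hypothetical sketch with the aws CLI against the internal RGW endpoint quoted earlier in this report (bucket name and credentials are placeholders; this is not a documented ODF procedure):

```
# S3 has no "delete bucket with contents": every object must be removed first.
export AWS_ACCESS_KEY_ID=...       # placeholder, from the RGW user's secret
export AWS_SECRET_ACCESS_KEY=...   # placeholder, from the RGW user's secret
ENDPOINT=http://rook-ceph-rgw-ocs-storagecluster-cephobjectstore.openshift-storage.svc:80
BUCKET=nb.example-backing-store-bucket   # placeholder

aws s3 rm "s3://$BUCKET" --recursive --endpoint-url "$ENDPOINT"   # iterate over and delete all objects
aws s3 rb "s3://$BUCKET" --endpoint-url "$ENDPOINT"               # then remove the now-empty bucket
```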

I have to admit that the path Travis implemented sounds more reasonable to me: during uninstall we would mark certain things to let the components know we are uninstalling and that they should behave differently. That sounds like the right approach to me.

Comment 67 Nimrod Becker 2021-12-02 09:25:09 UTC
A small fix to the comment above, which I cannot edit in BZ:

I meant a suggestion out of the mess if we go with #2, not #1.

Comment 68 Martin Bukatovic 2021-12-02 14:10:46 UTC
I can confirm that removing finalizers of ocs-storagecluster-cephobjectstore:

```
oc patch -n openshift-storage CephObjectStore/ocs-storagecluster-cephobjectstore --type=merge -p '{"metadata": {"finalizers":null}}'
```

works as a workaround here.

Comment 70 Mudit Agarwal 2021-12-17 06:09:11 UTC
Nimrod already replied, removing need info on me.

Comment 73 Jose A. Rivera 2022-01-18 16:17:29 UTC
I don't entirely remember the full extent of the discussion on this BZ since it's been a while, and three weeks of PTO basically wiped my brain. That said, I believe we reached a general consensus that the "optimized" deletion strategy is valid and good to go. I don't foresee any changes to ocs-operator to accommodate this, so I think all the work is done??

Giving devel_ack+ and moving to ON_QA. Testing for this is just validating the standard regression procedures.

Comment 74 Martin Bukatovic 2022-01-18 16:18:42 UTC
Will be tested via normal uninstallation procedure.

Comment 77 Blaine Gardner 2022-01-18 21:21:18 UTC
My recollection is that there are changes needed in ocs-operator to enable this. During non-graceful deletion of a cluster, ocs-operator needs to set the `cleanupPolicy` on the `CephCluster`, then issue a delete call to the `CephCluster` before moving on to deleting the remainder of the `Ceph...` resources (chiefly the CephObjectStore).

In the last conversation we had about it, I believe ocs-operator instead tries to delete all of the secondary `Ceph...` resources (including `CephObjectStore`) before deleting the `CephCluster`. IMO, it is worth it for someone from the ocs-operator team to verify this behavior while QA is looking at it, so that we don't get a further delay if QA comes back with a "no pass" result.

@jrivera

Comment 78 Jilju Joy 2022-01-21 20:17:30 UTC
The uninstall cannot be completed without applying the manual workaround of deleting the finalizers. The actual issue is mentioned in comment #77. 


$ oc describe storagesystem ocs-storagecluster-storagesystem | grep Events -A 4
Events:
  Type     Reason           Age   From                      Message
  ----     ------           ----  ----                      -------
  Warning  ReconcileFailed  13m   StorageSystem controller  Waiting for storagecluster.ocs.openshift.io/v1 ocs-storagecluster to be deleted


$ oc describe storagecluster ocs-storagecluster | grep Events -A 10
Events:
  Type     Reason            Age   From                       Message
  ----     ------            ----  ----                       -------
  Warning  UninstallPending  14m   controller_storagecluster  uninstall: Waiting on NooBaa system noobaa to be deleted
  Warning  UninstallPending  14m   controller_storagecluster  Uninstall: Waiting for Ceph RGW Route ocs-storagecluster-cephobjectstore to be deleted
  Warning  UninstallPending  14m   controller_storagecluster  uninstall: Waiting for CephObjectStoreUser ocs-storagecluster-cephobjectstoreuser to be deleted
  Warning  UninstallPending  14m   controller_storagecluster  uninstall: Waiting for CephObjectStoreUser prometheus-user to be deleted
  Warning  UninstallPending  14m   controller_storagecluster  uninstall: Waiting for CephObjectStore ocs-storagecluster-cephobjectstore to be deleted
  
  
$ oc describe CephObjectStore ocs-storagecluster-cephobjectstore | grep Events -A 10
Events:
  Type     Reason           Age                From                         Message
  ----     ------           ----               ----                         -------
  Warning  ReconcileFailed  5s (x82 over 15m)  rook-ceph-object-controller  CephObjectStore "openshift-storage/ocs-storagecluster-cephobjectstore" will not be deleted until all dependents are removed: CephObjectStoreUsers: [noobaa-ceph-objectstore-user], buckets in the object store (could be from ObjectBucketClaims or COSI Buckets): [nb.1642741709887.apps.jijoy-jan21.qe.rh-ocs.com]



$ oc get CephObjectStoreUser noobaa-ceph-objectstore-user
NAME                           AGE
noobaa-ceph-objectstore-user   14h


The command "oc delete CephObjectStoreUser noobaa-ceph-objectstore-user"(deleting manually is still a workaround) will not be completed until the finalizers are removed by running the command given below.

$ oc patch -n openshift-storage CephObjectStoreUser/noobaa-ceph-objectstore-user --type=merge -p '{"metadata": {"finalizers":null}}'
cephobjectstoreuser.ceph.rook.io/noobaa-ceph-objectstore-user patched


Deleting CephObjectStoreUser noobaa-ceph-objectstore-user did not cause CephObjectStore ocs-storagecluster-cephobjectstore to be deleted automatically.

$ oc describe CephObjectStore ocs-storagecluster-cephobjectstore | grep Events -A 10
Events:
  Type     Reason           Age                    From                         Message
  ----     ------           ----                   ----                         -------
  Warning  ReconcileFailed  7m56s (x82 over 23m)   rook-ceph-object-controller  CephObjectStore "openshift-storage/ocs-storagecluster-cephobjectstore" will not be deleted until all dependents are removed: CephObjectStoreUsers: [noobaa-ceph-objectstore-user], buckets in the object store (could be from ObjectBucketClaims or COSI Buckets): [nb.1642741709887.apps.jijoy-jan21.qe.rh-ocs.com]
  Warning  ReconcileFailed  2m54s (x3 over 3m16s)  rook-ceph-object-controller  CephObjectStore "openshift-storage/ocs-storagecluster-cephobjectstore" will not be deleted until all dependents are removed: buckets in the object store (could be from ObjectBucketClaims or COSI Buckets): [nb.1642741709887.apps.jijoy-jan21.qe.rh-ocs.com]


Removed the finalizer:
$ oc patch -n openshift-storage CephObjectStore/ocs-storagecluster-cephobjectstore --type=merge -p '{"metadata": {"finalizers":null}}'
cephobjectstore.ceph.rook.io/ocs-storagecluster-cephobjectstore patched
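(For completeness, the workaround applied above can be condensed into one loop over the stuck resources; the resource names below are the ones from this cluster and would differ elsewhere:)

```
for res in CephObjectStoreUser/noobaa-ceph-objectstore-user \
           CephObjectStore/ocs-storagecluster-cephobjectstore; do
  # Dropping the finalizers forces deletion without waiting for dependents.
  oc patch -n openshift-storage "$res" --type=merge -p '{"metadata":{"finalizers":null}}'
done
```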

Comment 79 Jose A. Rivera 2022-02-04 14:44:42 UTC
*** Bug 2049309 has been marked as a duplicate of this bug. ***

Comment 80 Jose A. Rivera 2022-02-24 20:32:20 UTC
Upstream PR for ocs-operator is posted: https://github.com/red-hat-storage/ocs-operator/pull/1563

After talking it over with Blaine, we're fairly confident that this should resolve the problem. I both love and hate how often tiny changes like this end up being the solution.

Comment 81 Jilju Joy 2022-03-08 05:42:46 UTC
Isn't this bug dependent on the fix for bug #2060897?
According to https://bugzilla.redhat.com/show_bug.cgi?id=2060897#c14, the issue described in bug #2060897 is related to the code changes linked to this bug.

Comment 82 Jilju Joy 2022-03-10 11:17:37 UTC
The command given below got stuck because the StorageCluster is not being deleted.

$ oc delete -n openshift-storage storagesystem --all --wait=true
storagesystem.odf.openshift.io "ocs-storagecluster-storagesystem" deleted


$ oc describe storagesystem ocs-storagecluster-storagesystem | grep Events -A 30
Events:
  Type     Reason           Age   From                      Message
  ----     ------           ----  ----                      -------
  Warning  ReconcileFailed  22m   StorageSystem controller  Waiting for storagecluster.ocs.openshift.io/v1 ocs-storagecluster to be deleted



$ oc describe storagecluster ocs-storagecluster | grep Events -A 30
Events:
  Type     Reason            Age   From                       Message
  ----     ------            ----  ----                       -------
  Warning  UninstallPending  23m   controller_storagecluster  uninstall: Waiting on NooBaa system noobaa to be deleted
  Warning  UninstallPending  23m   controller_storagecluster  uninstall: Waiting for CephObjectStoreUser ocs-storagecluster-cephobjectstoreuser to be deleted
  Warning  UninstallPending  23m   controller_storagecluster  uninstall: Waiting for CephObjectStoreUser prometheus-user to be deleted
  Warning  UninstallPending  23m   controller_storagecluster  uninstall: Waiting for CephObjectStore ocs-storagecluster-cephobjectstore to be deleted
  Warning  UninstallPending  23m   controller_storagecluster  uninstall: Waiting for CephFileSystem ocs-storagecluster-cephfilesystem to be deleted
  Warning  UninstallPending  23m   controller_storagecluster  uninstall: Waiting for CephBlockPool ocs-storagecluster-cephblockpool to be deleted
  Warning  UninstallPending  23m   controller_storagecluster  uninstall: Waiting for CephCluster to be deleted



$ oc get cephcluster
NAME                             DATADIRHOSTPATH   MONCOUNT   AGE     PHASE      MESSAGE                    HEALTH      EXTERNAL
ocs-storagecluster-cephcluster   /var/lib/rook     3          4h59m   Deleting   Deleting the CephCluster   HEALTH_OK   


$ oc describe cephcluster | grep Events -A 30
Events:
  Type     Reason              Age                   From                          Message
  ----     ------              ----                  ----                          -------
  Normal   ReconcileSucceeded  44m (x2 over 4h56m)   rook-ceph-cluster-controller  successfully configured CephCluster "openshift-storage/ocs-storagecluster-cephcluster"
  Warning  ReconcileFailed     24m                   rook-ceph-cluster-controller  CephCluster "openshift-storage/ocs-storagecluster-cephcluster" will not be deleted until all dependents are removed: CephBlockPool: [ocs-storagecluster-cephblockpool], CephFilesystem: [ocs-storagecluster-cephfilesystem], CephObjectStore: [ocs-storagecluster-cephobjectstore]
  Normal   Deleting            20m (x12 over 24m)    rook-ceph-cluster-controller  deleting CephCluster "openshift-storage/ocs-storagecluster-cephcluster"
  Warning  ReconcileFailed     3m30s (x29 over 24m)  rook-ceph-cluster-controller  failed to clean up CephCluster "openshift-storage/ocs-storagecluster-cephcluster": failed to check if volumes exist for CephCluster in namespace "openshift-storage": waiting for csi volume attachments in cluster "openshift-storage" to be cleaned up
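(A quick diagnostic for the state above, purely as a sketch: list whatever CSI attachments and PVs the cleanup check is still waiting on. These are standard Kubernetes resources, nothing specific to this fix:)

```
# VolumeAttachments still referencing the openshift-storage ceph-csi drivers
oc get volumeattachments | grep openshift-storage

# PVs left behind by the ODF storage classes, e.g. the Released
# db-noobaa-db-pg-0 volume shown below
oc get pv | grep ocs-storagecluster
```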
  
  
$ oc get pvc,pv
NAME                                                STATUS   VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS   AGE
persistentvolumeclaim/ocs-deviceset-0-data-05lxrb   Bound    pvc-f79af469-0dd7-4aed-8450-24d3b351b909   100Gi      RWO            thin           4h58m
persistentvolumeclaim/ocs-deviceset-1-data-0d9qd6   Bound    pvc-06a34bc6-b82b-4cd0-a545-ade14399f5c2   100Gi      RWO            thin           4h58m
persistentvolumeclaim/ocs-deviceset-2-data-0dsw7c   Bound    pvc-cfd4b5c9-7a70-4df2-851e-9589ccf9cf7f   100Gi      RWO            thin           4h58m
persistentvolumeclaim/rook-ceph-mon-a               Bound    pvc-4735b6fa-f60e-4d87-9141-a9dd8f4c8b2d   50Gi       RWO            thin           5h1m
persistentvolumeclaim/rook-ceph-mon-b               Bound    pvc-89a13c6b-28f6-432f-a3d9-c4c05dee77b3   50Gi       RWO            thin           5h1m
persistentvolumeclaim/rook-ceph-mon-c               Bound    pvc-276c2678-b663-4299-bc2f-3c5a42f53eaa   50Gi       RWO            thin           5h1m

NAME                                                        CAPACITY   ACCESS MODES   RECLAIM POLICY   STATUS     CLAIM                                           STORAGECLASS                  REASON   AGE
persistentvolume/pvc-06a34bc6-b82b-4cd0-a545-ade14399f5c2   100Gi      RWO            Delete           Bound      openshift-storage/ocs-deviceset-1-data-0d9qd6   thin                                   4h58m
persistentvolume/pvc-276c2678-b663-4299-bc2f-3c5a42f53eaa   50Gi       RWO            Delete           Bound      openshift-storage/rook-ceph-mon-c               thin                                   5h
persistentvolume/pvc-4735b6fa-f60e-4d87-9141-a9dd8f4c8b2d   50Gi       RWO            Delete           Bound      openshift-storage/rook-ceph-mon-a               thin                                   5h
persistentvolume/pvc-89a13c6b-28f6-432f-a3d9-c4c05dee77b3   50Gi       RWO            Delete           Bound      openshift-storage/rook-ceph-mon-b               thin                                   5h
persistentvolume/pvc-b79a0258-68ea-4aea-ae0c-c64b39da16cf   50Gi       RWO            Delete           Released   openshift-storage/db-noobaa-db-pg-0             ocs-storagecluster-ceph-rbd            4h56m
persistentvolume/pvc-cfd4b5c9-7a70-4df2-851e-9589ccf9cf7f   100Gi      RWO            Delete           Bound      openshift-storage/ocs-deviceset-2-data-0dsw7c   thin                                   4h58m
persistentvolume/pvc-f79af469-0dd7-4aed-8450-24d3b351b909   100Gi      RWO            Delete           Bound      openshift-storage/ocs-deviceset-0-data-05lxrb   thin                                   4h58m


$ oc describe pv pvc-b79a0258-68ea-4aea-ae0c-c64b39da16cf | grep Events -A 30
Events:
  Type     Reason              Age                 From                                                                                                                Message
  ----     ------              ----                ----                                                                                                                -------
  Warning  VolumeFailedDelete  45s (x17 over 29m)  openshift-storage.rbd.csi.ceph.com_csi-rbdplugin-provisioner-7b59944d67-pfsjb_21e16f1c-a476-4de0-8456-f47cc5f46d3b  rpc error: code = InvalidArgument desc = provided secret is empty


This issue is reported in the bug #2060897


Tested in version:
ODF 4.10.0-184
ODF 4.10.0-0.nightly-2022-03-09-224546

Tested in VMware.

Comment 83 Mudit Agarwal 2022-03-10 12:32:44 UTC
Because the issue mentioned in the above comment is already tracked by bug #2060897, I am moving this BZ to MODIFIED and out of 4.10.
It can be moved back to ON_QA when bug #2060897 is fixed.

Let me know if I have missed anything here.

Comment 88 Jilju Joy 2022-06-24 14:31:17 UTC
(In reply to Mudit Agarwal from comment #83)
> Because the issue mentioned in the above comment is already tracked by bug
> #2060897, I am moving this BZ to MODIFIED and out of 4.10
> Can be moved back to ON_QA when bug #2060897 is fixed.
Hi Mudit,
Is this bug actually ready for verification? Bug #2060897 is not fixed.
The "Target Release" and "Fixed In Version" fields do not match.
> 
> Let me know if I have missed anything here.

Comment 89 Mudit Agarwal 2022-06-27 05:57:33 UTC
This was moved to ON_QA automatically, moving it back.

Comment 99 Elad 2022-11-15 13:29:29 UTC
Hi Mudit, why has this bug been moved out of 4.12.0? It got all the acks for 4.12.0.

Comment 100 Mudit Agarwal 2022-11-15 13:42:42 UTC
This keeps getting moved across many releases, not just 4.12; we don't have the bandwidth to fix uninstallation and it has low priority.

Comment 133 errata-xmlrpc 2023-06-21 15:22:14 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Red Hat OpenShift Data Foundation 4.13.0 enhancement and bug fix update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2023:3742

