Bug 1916850

Summary: Uninstall 4.7- rook: Storagecluster deletion stuck on a partially created KMS enabled OCS cluster (OSD creation failed)
Product: [Red Hat Storage] Red Hat OpenShift Container Storage Reporter: Neha Berry <nberry>
Component: rook Assignee: Santosh Pillai <sapillai>
Status: CLOSED ERRATA QA Contact: Neha Berry <nberry>
Severity: high Docs Contact:
Priority: unspecified    
Version: 4.7 CC: ebenahar, etamir, jarrpa, madam, muagarwa, nbecker, nigoyal, ocs-bugs, ratamir, sapillai, srozen, tnielsen
Target Milestone: --- Keywords: AutomationBackLog, Regression
Target Release: OCS 4.7.0   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: 4.7.0-714.ci Doc Type: No Doc Update
Doc Text:
Story Points: ---
Clone Of: 1915445 Environment:
Last Closed: 2021-05-19 09:18:16 UTC Type: ---
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On: 1915445    
Bug Blocks:    

Comment 4 Santosh Pillai 2021-01-21 09:24:31 UTC
- This is a case where there are no OSDs, but a cephObjectStore has still been created in the cluster.
- On deletion, a deletionTimestamp gets added to the cephObjectStore CR, but its reconciler never runs, so the finalizer is never removed. As a result, the CR deletion remains stuck forever.
  
```
$ oc get cephobjectstores.ceph.rook.io ocs-storagecluster-cephobjectstore -o yaml
apiVersion: ceph.rook.io/v1
kind: CephObjectStore
metadata:
  creationTimestamp: "2021-01-13T12:52:06Z"
  deletionGracePeriodSeconds: 0
  deletionTimestamp: "2021-01-14T11:28:29Z"
  finalizers:
  - cephobjectstore.ceph.rook.io
```
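
For context, here is a minimal sketch (not Rook's actual code; the type and field names are illustrative) of the standard finalizer pattern in a controller-runtime reconciler. Deletion of the CR only completes once a reconcile runs and removes the finalizer, which is exactly the step that never happens here:

```go
package objectstore

import (
	"context"

	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/client"
	"sigs.k8s.io/controller-runtime/pkg/controller/controllerutil"

	cephv1 "github.com/rook/rook/pkg/apis/ceph.rook.io/v1"
)

const finalizerName = "cephobjectstore.ceph.rook.io"

// reconciler is an illustrative stand-in for the real CephObjectStore controller.
type reconciler struct {
	client client.Client
}

func (r *reconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
	store := &cephv1.CephObjectStore{}
	if err := r.client.Get(ctx, req.NamespacedName, store); err != nil {
		return ctrl.Result{}, client.IgnoreNotFound(err)
	}

	if !store.GetDeletionTimestamp().IsZero() {
		// Deletion path: clean up external resources (pools, realm, ...), then drop
		// the finalizer so Kubernetes can actually delete the object. If the
		// reconciler never reaches this point, the CR stays in Terminating forever.
		controllerutil.RemoveFinalizer(store, finalizerName)
		return ctrl.Result{}, r.client.Update(ctx, store)
	}

	// Normal path: make sure the finalizer is present before creating anything.
	if !controllerutil.ContainsFinalizer(store, finalizerName) {
		controllerutil.AddFinalizer(store, finalizerName)
		if err := r.client.Update(ctx, store); err != nil {
			return ctrl.Result{}, err
		}
	}
	return ctrl.Result{}, nil
}
```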

So, two questions here:

1. Why didn't the cephObjectStore reconciler run when there are no OSDs? (This could just be because of the bad state of the cluster. Needs more investigation.)
2. Having a cephObjectStore (and cephBlockPools and cephFileSystem) without any OSDs is not right. Should OCS avoid creating these resources when there are no OSDs?

IMO, we should be working on #2 above.

@Jose and @Travis WDYT?

Comment 5 Travis Nielsen 2021-01-21 16:27:03 UTC
The CephObjectStore controller in the rook operator must be getting stuck while trying to initialize the object store, since OSDs are required for the object store to complete its reconcile. If you then delete the object store, it likewise gets stuck trying to clean it up. We need a fix so that the operator won't get stuck removing the finalizer on the object store. Rather than checking for the existence of OSDs, it seems reasonable to add a timeout to the ceph commands that are getting stuck during object store cleanup. This timeout would also allow the finalizer to be removed in other scenarios where the OSDs are not responding.

Comment 6 Santosh Pillai 2021-01-22 16:30:55 UTC
The reconciler for the object store is stuck running `radosgw-admin realm get` (https://github.com/rook/rook/blob/83fee8adfbce1ce572a80dd7696ec5bd84ba1df4/pkg/operator/ceph/object/objectstore.go#L346). Because of this, the next reconcile never runs when the CephObjectStore resource is deleted, so the finalizer is never removed. This results in deletion of the cephobjectstore being stuck indefinitely.
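
As a sketch of the kind of fix being discussed (assumed names, not Rook's actual exec package), wrapping the external command in a context with a deadline prevents a hung `radosgw-admin` call from blocking the reconcile indefinitely; the 30-second value is only illustrative:

```go
package objectstore

import (
	"context"
	"os/exec"
	"time"
)

// runRadosgwAdmin is a hypothetical helper: it runs `radosgw-admin` with the given
// arguments but kills the process if it does not finish within the timeout.
func runRadosgwAdmin(args ...string) ([]byte, error) {
	ctx, cancel := context.WithTimeout(context.Background(), 30*time.Second)
	defer cancel()

	// CommandContext terminates the child process when the deadline expires, so the
	// caller gets an error (context.DeadlineExceeded) instead of hanging forever.
	cmd := exec.CommandContext(ctx, "radosgw-admin", args...)
	return cmd.CombinedOutput()
}
```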

Comment 7 Travis Nielsen 2021-01-26 19:01:56 UTC
My recommendation for the object store configuration is:
1. Add a timeout wrapper from the exec package to all the rgw commands
2. If installing the object store, continue to treat a timeout like any other error: fail the reconcile and requeue it to try again.
3. If cleaning up the object store, treat it as best effort. If there are failures cleaning up pools, realms, or other rgw resources, just log the error, continue attempting to clean up all the resources, and proceed to remove the finalizer. If there are multiple timeouts this may take several minutes, but at least the removal will be allowed.

The side effect of step 3 is that in some corner cases, rgw pools or other resources may be left behind if the OSDs are down or PGs are unhealthy. In that case, the admin could connect to the toolbox to clean up the resources if needed. Since there is a workaround for this corner case, this seems acceptable.
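
A minimal sketch of the best-effort cleanup in step 3 (the helper functions are hypothetical stand-ins for the real teardown steps, not Rook's actual code): errors are logged and skipped so that the finalizer is always removed at the end.

```go
package objectstore

import "log"

// cleanupObjectStore tears down the object store on a best-effort basis. The three
// function arguments are hypothetical stand-ins for the real realm/pool teardown and
// finalizer removal; any of them may time out if the OSDs are not responding.
func cleanupObjectStore(deleteRealm, deletePools, removeFinalizer func() error) error {
	if err := deleteRealm(); err != nil {
		// Best effort: log and keep going so one stuck command cannot block deletion.
		log.Printf("failed to delete realm, continuing cleanup: %v", err)
	}
	if err := deletePools(); err != nil {
		log.Printf("failed to delete pools, continuing cleanup: %v", err)
	}
	// Always attempt to remove the finalizer so the CR deletion can complete.
	return removeFinalizer()
}
```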

Comment 8 Travis Nielsen 2021-02-01 23:25:28 UTC
Merged downstream with https://github.com/openshift/rook/pull/157

Comment 14 errata-xmlrpc 2021-05-19 09:18:16 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: Red Hat OpenShift Container Storage 4.7.0 security, bug fix, and enhancement update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:2041