- This is a case where there are no OSDs, but a cephObjectStore has still been created in the cluster.
- On deletion, a deletionTimestamp gets added to the cephObjectStore CR, but its reconciler never runs, so the finalizer is never removed. As a result, the CR deletion remains stuck forever.

```
$ oc get cephobjectstores.ceph.rook.io ocs-storagecluster-cephobjectstore -o yaml
apiVersion: ceph.rook.io/v1
kind: CephObjectStore
metadata:
  creationTimestamp: "2021-01-13T12:52:06Z"
  deletionGracePeriodSeconds: 0
  deletionTimestamp: "2021-01-14T11:28:29Z"
  finalizers:
  - cephobjectstore.ceph.rook.io
```

So two questions here:

1. Why didn't the cephObjectStore reconciler run when there are no OSDs? (This could just be because of the bad state of the cluster; it needs more investigation.)
2. Having a cephObjectStore (and cephBlockPools and cephFileSystem) without any OSDs is not right. Should OCS avoid creating these resources when there are no OSDs?

IMO, we should be working on #2 above. @Jose and @Travis WDYT?
The CephObjectStore controller in the rook operator must be getting stuck when it is trying to initialize the object store, since OSDs are required for the object store to complete its reconcile. Then if you delete the object store, it likewise gets stuck trying to clean it up. We need a fix so that the operator won't get stuck removing the finalizer on the object store. But rather than check for the existence of OSDs, it seems reasonable to add a timeout to the ceph commands that are getting stuck during object store cleanup. This timeout would also allow the finalizer to be removed in other scenarios where the OSDs are not responding.
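For illustration, here is a minimal Go sketch of the kind of command timeout being proposed. The function name, the 30-second value, and the example command are assumptions for this sketch, not rook's actual exec helper:

```go
package main

import (
	"context"
	"fmt"
	"os/exec"
	"time"
)

// runWithTimeout runs an external command and kills it if it does not finish
// within the given timeout, returning an error instead of blocking the
// reconcile forever. This is a sketch, not the rook exec package.
func runWithTimeout(timeout time.Duration, name string, args ...string) (string, error) {
	ctx, cancel := context.WithTimeout(context.Background(), timeout)
	defer cancel()

	out, err := exec.CommandContext(ctx, name, args...).CombinedOutput()
	if ctx.Err() == context.DeadlineExceeded {
		return string(out), fmt.Errorf("%s timed out after %s", name, timeout)
	}
	return string(out), err
}

func main() {
	// Example: the realm query that was observed to hang when OSDs are down.
	out, err := runWithTimeout(30*time.Second, "radosgw-admin", "realm", "get")
	if err != nil {
		fmt.Println("command failed:", err)
		return
	}
	fmt.Println(out)
}
```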
The reconciler for the object store is stuck while running `radosgw-admin realm get` (https://github.com/rook/rook/blob/83fee8adfbce1ce572a80dd7696ec5bd84ba1df4/pkg/operator/ceph/object/objectstore.go#L346). Because of this, the next reconcile never runs when the CephObjectStore resource is deleted and the finalizer is never removed. As a result, deletion of the cephobjectstore stays stuck for a long time.
My recommendation for the object store configuration is:

1. Add a timeout wrapper from the exec package to all the rgw commands.
2. If installing the object store, continue to treat timeouts just like any other error that fails the reconcile and requeues it to try again.
3. If cleaning up the object store, treat it as a best effort (see the sketch below). If there are failures cleaning up pools, realms, or other rgw resources, just log the error, continue with the attempt to clean up all the resources, and proceed to remove the finalizer. If there are multiple timeouts this may take several minutes, but at least the removal will be allowed.

The side effect of step 3 is that in some corner cases, rgw pools or other resources may be left behind if the OSDs are down or PGs are unhealthy. In that case, the admin could connect to the toolbox to clean up the resources if needed. Since there is a workaround in that corner case, this seems acceptable.
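Here is a minimal, self-contained Go sketch of the best-effort cleanup pattern described in step 3. The step names and the simulated timeout error are placeholders, not rook's actual cleanup code:

```go
package main

import (
	"errors"
	"log"
)

// cleanupStep pairs a description with a cleanup action. The steps here are
// illustrative; the real cleanup removes the realm, zone group, zone and pools.
type cleanupStep struct {
	name string
	run  func() error
}

// bestEffortCleanup attempts every step even if earlier ones fail, so that a
// hung or erroring rgw command cannot block removal of the finalizer.
func bestEffortCleanup(steps []cleanupStep) {
	for _, step := range steps {
		if err := step.run(); err != nil {
			// Log and continue instead of aborting the reconcile.
			log.Printf("cleanup step %q failed: %v (continuing)", step.name, err)
		}
	}
}

func main() {
	steps := []cleanupStep{
		{"delete realm", func() error { return errors.New("timed out waiting for OSDs") }},
		{"delete zone", func() error { return nil }},
		{"delete pools", func() error { return nil }},
	}
	bestEffortCleanup(steps)
	log.Println("cleanup finished; finalizer can now be removed")
}
```

The key point is that every step runs regardless of earlier failures, so a single hung rgw command cannot keep the finalizer in place indefinitely.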
Merged downstream with https://github.com/openshift/rook/pull/157
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: Red Hat OpenShift Container Storage 4.7.0 security, bug fix, and enhancement update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2021:2041