- This is a case where there are no OSDs, but a cephObjectStore has still been created in the cluster.
- On deletion, a deletionTimestamp gets added to the cephObjectStore CR, but its reconciler never runs, so the finalizer is never removed. As a result, the CR deletion remains stuck forever.

```
$ oc get cephobjectstores.ceph.rook.io ocs-storagecluster-cephobjectstore -o yaml
apiVersion: ceph.rook.io/v1
kind: CephObjectStore
metadata:
  creationTimestamp: "2021-01-13T12:52:06Z"
  deletionGracePeriodSeconds: 0
  deletionTimestamp: "2021-01-14T11:28:29Z"
  finalizers:
  - cephobjectstore.ceph.rook.io
```

So two questions here:

1. Why didn't the cephObjectStore reconciler run when there are no OSDs? (This could just be because of the bad state of the cluster; it needs more investigation.)
2. Having a cephObjectStore (and cephBlockPools and cephFileSystem) without any OSDs is not right. Should OCS avoid creating these resources when there are no OSDs?

IMO, we should be working on #2 above. @Jose and @Travis WDYT?
The CephObjectStore controller in the rook operator must be getting stuck when it is trying to initialize the object store, since OSDs are required for the object store to complete its reconcile. Then if you delete the object store, it likewise gets stuck trying to clean it up. We need a fix so that the operator won't get stuck removing the finalizer on the object store. But rather than check for the existence of OSDs, it seems reasonable to add a timeout to the ceph commands that are getting stuck during object store cleanup. This timeout would also allow the finalizer to be removed in other scenarios where the OSDs are not responding.
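For illustration, here is a minimal Go sketch of the kind of command timeout being proposed. The function name, the 30-second value, and the example command are assumptions for this sketch, not rook's actual exec helper:

```go
package main

import (
	"context"
	"fmt"
	"os/exec"
	"time"
)

// runWithTimeout runs an external command and kills it if it does not finish
// within the given timeout, returning an error instead of blocking the
// reconcile forever. This is a sketch, not the rook exec package.
func runWithTimeout(timeout time.Duration, name string, args ...string) (string, error) {
	ctx, cancel := context.WithTimeout(context.Background(), timeout)
	defer cancel()

	out, err := exec.CommandContext(ctx, name, args...).CombinedOutput()
	if ctx.Err() == context.DeadlineExceeded {
		return string(out), fmt.Errorf("%s timed out after %s", name, timeout)
	}
	return string(out), err
}

func main() {
	// Example: the realm query that was observed to hang when OSDs are down.
	out, err := runWithTimeout(30*time.Second, "radosgw-admin", "realm", "get")
	if err != nil {
		fmt.Println("command failed:", err)
		return
	}
	fmt.Println(out)
}
```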
The reconciler for the object store is stuck while running `radosgw-admin realm get` (https://github.com/rook/rook/blob/83fee8adfbce1ce572a80dd7696ec5bd84ba1df4/pkg/operator/ceph/object/objectstore.go#L346). Because of this, the next reconcile never runs when the CephObjectStore resource is deleted and the finalizer is never removed. As a result, deletion of the cephobjectstore stays stuck for a long time.
My recommendation for the object store configuration is:

1. Add a timeout wrapper from the exec package to all the rgw commands.
2. If installing the object store, continue to treat timeouts just like any other error that fails the reconcile and requeues it to try again.
3. If cleaning up the object store, treat it as a best effort (see the sketch below). If there are failures cleaning up pools, realms, or other rgw resources, just log the error, continue with the attempt to clean up all the resources, and proceed to remove the finalizer. If there are multiple timeouts this may take several minutes, but at least the removal will be allowed.

The side effect of step 3 is that in some corner cases, rgw pools or other resources may be left behind if the OSDs are down or PGs are unhealthy. In that case, the admin could connect to the toolbox to clean up the resources if needed. Since there is a workaround in that corner case, this seems acceptable.
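Here is a minimal, self-contained Go sketch of the best-effort cleanup pattern described in step 3. The step names and the simulated timeout error are placeholders, not rook's actual cleanup code:

```go
package main

import (
	"errors"
	"log"
)

// cleanupStep pairs a description with a cleanup action. The steps here are
// illustrative; the real cleanup removes the realm, zone group, zone and pools.
type cleanupStep struct {
	name string
	run  func() error
}

// bestEffortCleanup attempts every step even if earlier ones fail, so that a
// hung or erroring rgw command cannot block removal of the finalizer.
func bestEffortCleanup(steps []cleanupStep) {
	for _, step := range steps {
		if err := step.run(); err != nil {
			// Log and continue instead of aborting the reconcile.
			log.Printf("cleanup step %q failed: %v (continuing)", step.name, err)
		}
	}
}

func main() {
	steps := []cleanupStep{
		{"delete realm", func() error { return errors.New("timed out waiting for OSDs") }},
		{"delete zone", func() error { return nil }},
		{"delete pools", func() error { return nil }},
	}
	bestEffortCleanup(steps)
	log.Println("cleanup finished; finalizer can now be removed")
}
```

The key point is that every step runs regardless of earlier failures, so a single hung rgw command cannot keep the finalizer in place indefinitely.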
Merged downstream with https://github.com/openshift/rook/pull/157
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: Red Hat OpenShift Container Storage 4.7.0 security, bug fix, and enhancement update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2021:2041