Bug 1916850
| Summary: | Uninstall 4.7- rook: Storagecluster deletion stuck on a partially created KMS enabled OCS cluster(OSD creation failed) | ||
|---|---|---|---|
| Product: | [Red Hat Storage] Red Hat OpenShift Container Storage | Reporter: | Neha Berry <nberry> |
| Component: | rook | Assignee: | Santosh Pillai <sapillai> |
| Status: | CLOSED ERRATA | QA Contact: | Neha Berry <nberry> |
| Severity: | high | Docs Contact: | |
| Priority: | unspecified | ||
| Version: | 4.7 | CC: | ebenahar, etamir, jarrpa, madam, muagarwa, nbecker, nigoyal, ocs-bugs, ratamir, sapillai, srozen, tnielsen |
| Target Milestone: | --- | Keywords: | AutomationBackLog, Regression |
| Target Release: | OCS 4.7.0 | ||
| Hardware: | Unspecified | ||
| OS: | Unspecified | ||
| Whiteboard: | |||
| Fixed In Version: | 4.7.0-714.ci | Doc Type: | No Doc Update |
| Doc Text: | | Story Points: | --- |
| Clone Of: | 1915445 | Environment: | |
| Last Closed: | 2021-05-19 09:18:16 UTC | Type: | --- |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | |||
| Bug Depends On: | 1915445 | ||
| Bug Blocks: | |||
Comment 4
Santosh Pillai
2021-01-21 09:24:31 UTC
The CephObjectStore controller in the rook operator must be getting stuck while trying to initialize the object store, because OSDs are required for the object store to complete its reconcile. If you then delete the object store, the operator likewise gets stuck trying to clean it up.

We need a fix so that the operator won't get stuck removing the finalizer on the object store. Rather than checking for the existence of OSDs, it seems reasonable to add a timeout to the ceph commands that get stuck during object store cleanup. This timeout would also allow the finalizer to be removed in other scenarios where the OSDs are not responding.

The reconciler for the object store is stuck when running `radosgw-admin realm get` (https://github.com/rook/rook/blob/83fee8adfbce1ce572a80dd7696ec5bd84ba1df4/pkg/operator/ceph/object/objectstore.go#L346). Because of this, the next reconcile never runs when the CephObjectStore resource is deleted, and the finalizer is never removed. As a result, deletion of the cephobjectstore stays stuck for a long time.

My recommendation for the object store configuration is:
1. Add a timeout wrapper from the exec package to all the rgw commands.
2. If installing the object store, continue to treat timeouts like any other error: fail the reconcile and requeue it to try again.
3. If cleaning up the object store, treat it as a best effort. If there are failures cleaning up pools, realms or other rgw resources, just log the error, continue with the attempt to clean up all the resources, and proceed to remove the finalizer. If there are multiple timeouts this may take several minutes, but at least the removal will be allowed.

The side effect of step 3 is that in some corner cases, rgw pools or other resources may be left behind if the OSDs are down or PGs are unhealthy. In that case, the admin could connect to the toolbox to clean up the resources if needed.
Since there is a workaround in the corner case, this seems acceptable.

Merged downstream with https://github.com/openshift/rook/pull/157

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: Red Hat OpenShift Container Storage 4.7.0 security, bug fix, and enhancement update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:2041