Description of problem (please be as detailed as possible and provide log snippets):
----------------------------------------------------------------------------
StorageCluster deletion is stuck on a cluster where some OSD PVCs were in Pending state. This bug seems to be different from Bug 1916850#c4, as the CephObjectStore deletion had not yet been initiated (hence no deletionTimeStamp) because the CephCluster deletion itself was stuck.

ocs-operator log snip
---------------------------
{"level":"info","ts":1611903838.524638,"logger":"controllers.StorageCluster","msg":"Uninstall in progress","Request.Namespace":"openshift-storage","Request.Name":"ocs-storagecluster","Status":"Uninstall: Waiting for cephCluster to be deleted"}

rook log snip
-------------------------------------------------------------------------------
2021-01-29 07:03:59.739807 I | op-osd: orchestration status config map result channel closed, will restart watch.
2021-01-29 07:03:59.954105 I | op-mon: parsing mon endpoints: b=172.30.86.254:6789,c=172.30.45.216:6789,d=172.30.149.26:6789,e=172.30.170.108:6789,a=172.30.49.88:6789
2021-01-29 07:03:59.954208 I | ceph-object-store-user-controller: CephObjectStore "ocs-storagecluster-cephobjectstore" found
2021-01-29 07:04:04.743261 I | op-osd: osd orchestration status for node ocs-deviceset-localblock-2-data-1-rvbvf is starting
2021-01-29 07:04:04.743277 I | op-osd: osd orchestration status for node ocs-deviceset-localblock-2-data-2-sjsfv is starting
2021-01-29 07:04:04.743283 I | op-osd: osd orchestration status for node ocs-deviceset-localblock-3-data-1-lnzqm is starting
2021-01-29 07:04:04.743287 I | op-osd: osd orchestration status for node ocs-deviceset-localblock-3-data-2-4hq5n is starting
2021-01-29 07:04:04.743291 I | op-osd: 8/12 node(s) completed osd provisioning, resource version 2742170
2021-01-29 07:04:04.744237 I | op-osd: orchestration status config map result channel closed, will restart watch.
2021-01-29 07:04:09.748437 I | op-osd: osd orchestration status for node ocs-deviceset-localblock-2-data-1-rvbvf is starting
2021-01-29 07:04:09.748460 I | op-osd: osd orchestration status for node ocs-deviceset-localblock-2-data-2-sjsfv is starting
2021-01-29 07:04:09.748466 I | op-osd: osd orchestration status for node ocs-deviceset-localblock-3-data-1-lnzqm is starting
2021-01-29 07:04:09.748473 I | op-osd: osd orchestration status for node ocs-deviceset-localblock-3-data-2-4hq5n is starting
2021-01-29 07:04:09.748477 I | op-osd: 8/12 node(s) completed osd provisioning, resource version 2742170

Some background on the cluster where the storagecluster deletion was initiated
-------------------------------------------------------------------------------
Created an arbiter-based deployment on VMware LSO with an initial set of 4 OSDs in 2 zones (arbiter zone = c). Due to pending bug fixes in arbiter mode, there were no RGW or MDS pods, and NooBaa was only partially deployed (no backingstore because of the missing RGW pod). Hence the StorageCluster had been stuck in Progressing state since deployment.

Added 1 new disk to each of the 4 OCS nodes from the VMware console and performed Add Capacity from the UI. Due to an issue with Add Capacity from the UI (bug 1920981), OCS tried to add 8 new OSDs instead of 4, so 4 OSD PVCs stayed in Pending state throughout.

In this state, initiated a StorageCluster deletion; it is stuck, waiting for the CephCluster to be deleted.
$ oc get pod,pvc | grep osd | grep -i pending
pod/rook-ceph-osd-prepare-ocs-deviceset-localblock-2-data-1-rvdwvgt   0/1   Pending   0   45h
pod/rook-ceph-osd-prepare-ocs-deviceset-localblock-2-data-2-sjrtc9s   0/1   Pending   0   45h
pod/rook-ceph-osd-prepare-ocs-deviceset-localblock-3-data-1-lnssj84   0/1   Pending   0   45h
pod/rook-ceph-osd-prepare-ocs-deviceset-localblock-3-data-2-4hmv95b   0/1   Pending   0   45h

Version of all relevant components (if applicable):
======================================================
OCS = ocs-operator.v4.7.0-238.ci
OCP = 4.7.0-0.nightly-2021-01-19-095812

Does this issue impact your ability to continue to work with the product (please explain in detail what is the user impact)?
===========================================================================
Yes. Unable to uninstall OCS and re-install.

Is there any workaround available to the best of your knowledge?
====================================================================
NA

Rate from 1 - 5 the complexity of the scenario you performed that caused this bug (1 - very simple, 5 - very complex)?
==================================================================================
4

Is this issue reproducible?
=============================
Tested once

Can this issue reproduce from the UI?
=========================================
OCS was installed from the UI.

If this is a regression, please provide more details to justify this:
=========================================================================
Not sure

Steps to Reproduce:
=================
1. Created an arbiter-based OCS deployment on VMware LSO. Due to pending bug fixes, there were no MDS and RGW pods, and the StorageCluster was in Progressing state.
2. Performed Add Capacity; due to bug 1920981, OCS tried to add 8 OSDs instead of 4. Hence 4 OSDs were stuck in Pending state and rook kept reconciling on them.
3. Initiated uninstall by deleting the storagecluster at 2021-01-29T05:56:03Z.

Actual results:
===================
StorageCluster deletion is stuck indefinitely, waiting for the CephCluster to be deleted.

Expected results:
======================
Deletion of the StorageCluster and cleanup of OCS should succeed.

Additional info:
=======================
PVC before storagecluster deletion
--------------
======= PVC ==========
NAME                                      STATUS    VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS                  AGE
db-noobaa-db-pg-0                         Bound     pvc-0e97d52a-403a-42cc-b40f-a244f0da8f0c   50Gi       RWO            ocs-storagecluster-ceph-rbd   43h
ocs-deviceset-localblock-0-data-0-mjss4   Bound     local-pv-e114baad                          150Gi      RWO            localblock                    43h
ocs-deviceset-localblock-0-data-1-lzrbq   Bound     local-pv-3afb8338                          150Gi      RWO            localblock                    43h
ocs-deviceset-localblock-0-data-2-zxvcz   Bound     local-pv-ccbbe0c3                          150Gi      RWO            localblock                    43h
ocs-deviceset-localblock-1-data-0-gbndv   Bound     local-pv-3cbd97e6                          150Gi      RWO            localblock                    43h
ocs-deviceset-localblock-1-data-1-fpk9v   Bound     local-pv-12330b5b                          150Gi      RWO            localblock                    43h
ocs-deviceset-localblock-1-data-2-tc6lk   Bound     local-pv-9cb2b93f                          150Gi      RWO            localblock                    43h
ocs-deviceset-localblock-2-data-0-gwzqz   Bound     local-pv-119c924                           150Gi      RWO            localblock                    43h
ocs-deviceset-localblock-2-data-1-rvbvf   Pending                                                                        localblock                    43h
ocs-deviceset-localblock-2-data-2-sjsfv   Pending                                                                        localblock                    43h
ocs-deviceset-localblock-3-data-0-2xcwn   Bound     local-pv-d5a8a2cb                          150Gi      RWO            localblock                    43h
ocs-deviceset-localblock-3-data-1-lnzqm   Pending                                                                        localblock                    43h
ocs-deviceset-localblock-3-data-2-4hq5n   Pending                                                                        localblock                    43h
CephCluster received a delete request, but the delete-request reconcile never started. This could be because the previous reconcile was still pending and kept looping:

2021-01-29 07:04:04.743291 I | op-osd: 8/12 node(s) completed osd provisioning, resource version 2742170
2021-01-29 07:04:04.744237 I | op-osd: orchestration status config map result channel closed, will restart watch.
2021-01-29 07:04:09.748437 I | op-osd: osd orchestration status for node ocs-deviceset-localblock-2-data-1-rvbvf is starting
2021-01-29 07:04:09.748460 I | op-osd: osd orchestration status for node ocs-deviceset-localblock-2-data-2-sjsfv is starting
2021-01-29 07:04:09.748466 I | op-osd: osd orchestration status for node ocs-deviceset-localblock-3-data-1-lnzqm is starting
2021-01-29 07:04:09.748473 I | op-osd: osd orchestration status for node ocs-deviceset-localblock-3-data-2-4hq5n is starting
2021-01-29 07:04:09.748477 I | op-osd: 8/12 node(s) completed osd provisioning, resource version 2742170
The reconciler in the rook-ceph operator is stuck while trying to provision the pending OSDs: https://github.com/rook/rook/blob/f86f8300f924b98e7df37207276ee40a4bd33870/pkg/operator/ceph/cluster/osd/status.go#L189-L203

Possible root cause:

- The watcher is trying to watch the `rook-ceph-osd` configMaps with resourceVersion `2742170`. Log snippet: `8/12 node(s) completed osd provisioning, resource version 2742170`
- None of the `rook-ceph-osd` configMaps have the resourceVersion `2742170`:

```
$ oc get cm --selector=app=rook-ceph-osd --selector=status=provisioning -o yaml
apiVersion: v1
items:
- apiVersion: v1
  data:
    status: '{"osds":null,"status":"starting","pvc-backed-osd":true,"message":""}'
  kind: ConfigMap
  metadata:
    creationTimestamp: "2021-01-27T13:33:34Z"
    labels:
      app: rook-ceph-osd
      node: ocs-deviceset-localblock-2-data-1-rvbvf
      status: provisioning
    managedFields:
    - apiVersion: v1
      fieldsType: FieldsV1
      fieldsV1:
        f:data:
          .: {}
          f:status: {}
        f:metadata:
          f:labels:
            .: {}
            f:app: {}
            f:node: {}
            f:status: {}
          f:ownerReferences:
            .: {}
            k:{"uid":"739c2dc2-0acb-4817-a37f-1b8a4f4cdfeb"}:
              .: {}
              f:apiVersion: {}
              f:blockOwnerDeletion: {}
              f:controller: {}
              f:kind: {}
              f:name: {}
              f:uid: {}
      manager: rook
      operation: Update
      time: "2021-01-27T13:33:34Z"
    name: rook-ceph-osd-ocs-deviceset-localblock-2-data-1-rvbvf-status
    namespace: openshift-storage
    ownerReferences:
    - apiVersion: ceph.rook.io/v1
      blockOwnerDeletion: true
      controller: true
      kind: CephCluster
      name: ocs-storagecluster-cephcluster
      uid: 739c2dc2-0acb-4817-a37f-1b8a4f4cdfeb
    resourceVersion: "2742146"
    selfLink: /api/v1/namespaces/openshift-storage/configmaps/rook-ceph-osd-ocs-deviceset-localblock-2-data-1-rvbvf-status
    uid: 25029d6d-5615-4c47-8cfc-18968d2cce1c
- apiVersion: v1
  data:
    status: '{"osds":null,"status":"starting","pvc-backed-osd":true,"message":""}'
  kind: ConfigMap
  metadata:
    creationTimestamp: "2021-01-27T13:33:35Z"
    labels:
      app: rook-ceph-osd
      node: ocs-deviceset-localblock-2-data-2-sjsfv
      status: provisioning
    managedFields:
    - apiVersion: v1
      fieldsType: FieldsV1
      fieldsV1:
        f:data:
          .: {}
          f:status: {}
        f:metadata:
          f:labels:
            .: {}
            f:app: {}
            f:node: {}
            f:status: {}
          f:ownerReferences:
            .: {}
            k:{"uid":"739c2dc2-0acb-4817-a37f-1b8a4f4cdfeb"}:
              .: {}
              f:apiVersion: {}
              f:blockOwnerDeletion: {}
              f:controller: {}
              f:kind: {}
              f:name: {}
              f:uid: {}
      manager: rook
      operation: Update
      time: "2021-01-27T13:33:35Z"
    name: rook-ceph-osd-ocs-deviceset-localblock-2-data-2-sjsfv-status
    namespace: openshift-storage
    ownerReferences:
    - apiVersion: ceph.rook.io/v1
      blockOwnerDeletion: true
      controller: true
      kind: CephCluster
      name: ocs-storagecluster-cephcluster
      uid: 739c2dc2-0acb-4817-a37f-1b8a4f4cdfeb
    resourceVersion: "2742151"
    selfLink: /api/v1/namespaces/openshift-storage/configmaps/rook-ceph-osd-ocs-deviceset-localblock-2-data-2-sjsfv-status
    uid: 0cb5f700-1514-4705-a8de-d29d3d2d62f3
- apiVersion: v1
  data:
    status: '{"osds":null,"status":"starting","pvc-backed-osd":true,"message":""}'
  kind: ConfigMap
  metadata:
    creationTimestamp: "2021-01-27T13:33:35Z"
    labels:
      app: rook-ceph-osd
      node: ocs-deviceset-localblock-3-data-1-lnzqm
      status: provisioning
    managedFields:
    - apiVersion: v1
      fieldsType: FieldsV1
      fieldsV1:
        f:data:
          .: {}
          f:status: {}
        f:metadata:
          f:labels:
            .: {}
            f:app: {}
            f:node: {}
            f:status: {}
          f:ownerReferences:
            .: {}
            k:{"uid":"739c2dc2-0acb-4817-a37f-1b8a4f4cdfeb"}:
              .: {}
              f:apiVersion: {}
              f:blockOwnerDeletion: {}
              f:controller: {}
              f:kind: {}
              f:name: {}
              f:uid: {}
      manager: rook
      operation: Update
      time: "2021-01-27T13:33:35Z"
    name: rook-ceph-osd-ocs-deviceset-localblock-3-data-1-lnzqm-status
    namespace: openshift-storage
    ownerReferences:
    - apiVersion: ceph.rook.io/v1
      blockOwnerDeletion: true
      controller: true
      kind: CephCluster
      name: ocs-storagecluster-cephcluster
      uid: 739c2dc2-0acb-4817-a37f-1b8a4f4cdfeb
    resourceVersion: "2742160"
    selfLink: /api/v1/namespaces/openshift-storage/configmaps/rook-ceph-osd-ocs-deviceset-localblock-3-data-1-lnzqm-status
    uid: 346636e6-ec94-4abf-ac59-a13a340e01dd
- apiVersion: v1
  data:
    status: '{"osds":null,"status":"starting","pvc-backed-osd":true,"message":""}'
  kind: ConfigMap
  metadata:
    creationTimestamp: "2021-01-27T13:33:36Z"
    labels:
      app: rook-ceph-osd
      node: ocs-deviceset-localblock-3-data-2-4hq5n
      status: provisioning
    managedFields:
    - apiVersion: v1
      fieldsType: FieldsV1
      fieldsV1:
        f:data:
          .: {}
          f:status: {}
        f:metadata:
          f:labels:
            .: {}
            f:app: {}
            f:node: {}
            f:status: {}
          f:ownerReferences:
            .: {}
            k:{"uid":"739c2dc2-0acb-4817-a37f-1b8a4f4cdfeb"}:
              .: {}
              f:apiVersion: {}
              f:blockOwnerDeletion: {}
              f:controller: {}
              f:kind: {}
              f:name: {}
              f:uid: {}
      manager: rook
      operation: Update
      time: "2021-01-27T13:33:36Z"
    name: rook-ceph-osd-ocs-deviceset-localblock-3-data-2-4hq5n-status
    namespace: openshift-storage
    ownerReferences:
    - apiVersion: ceph.rook.io/v1
      blockOwnerDeletion: true
      controller: true
      kind: CephCluster
      name: ocs-storagecluster-cephcluster
      uid: 739c2dc2-0acb-4817-a37f-1b8a4f4cdfeb
    resourceVersion: "2742169"
    selfLink: /api/v1/namespaces/openshift-storage/configmaps/rook-ceph-osd-ocs-deviceset-localblock-3-data-2-4hq5n-status
    uid: b4b83b62-0341-4f92-b424-43f54406ed9a
kind: List
metadata:
  resourceVersion: ""
  selfLink: ""
```

Possible Solutions:

- Update the `resourceVersion` every time when checking for the completed nodes here: https://github.com/rook/rook/blob/f86f8300f924b98e7df37207276ee40a4bd33870/pkg/operator/ceph/cluster/osd/status.go#L193
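To illustrate the suspected failure mode, here is a minimal, hypothetical client-go sketch of a list-then-watch loop over the OSD status ConfigMaps. This is not Rook's actual code; the function name and structure are illustrative only. The point is that the resourceVersion passed to Watch() should be refreshed from a recent List(); if the watch is repeatedly restarted with a stale version, the loop can spin without ever observing new events:

```
// Hypothetical sketch, not Rook's actual implementation.
package osdwatch

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// watchOSDStatusConfigMaps (hypothetical helper) waits for the OSD
// provisioning status ConfigMaps to reach a terminal state. The key detail:
// the resourceVersion given to Watch() comes from the most recent List(),
// so a restarted watch never reuses a stale version.
func watchOSDStatusConfigMaps(ctx context.Context, client kubernetes.Interface, namespace string) error {
	for {
		// Re-list on every restart so the resourceVersion is fresh.
		cms, err := client.CoreV1().ConfigMaps(namespace).List(ctx, metav1.ListOptions{
			LabelSelector: "app=rook-ceph-osd",
		})
		if err != nil {
			return fmt.Errorf("failed to list osd status configmaps: %w", err)
		}

		// (In the real operator, the items would be inspected here and the
		// loop would return once all OSDs have completed provisioning.)

		w, err := client.CoreV1().ConfigMaps(namespace).Watch(ctx, metav1.ListOptions{
			LabelSelector:   "app=rook-ceph-osd",
			ResourceVersion: cms.ResourceVersion, // fresh version from the List() above
		})
		if err != nil {
			return fmt.Errorf("failed to watch osd status configmaps: %w", err)
		}

		// Consume events until the channel closes, then loop, re-list, and
		// restart the watch with an updated resourceVersion.
		for range w.ResultChan() {
		}
		w.Stop()
	}
}
```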
The operator has a mechanism to cancel a reconcile so that a long-running reconcile doesn't get stuck and prevent other actions, such as updates to the cluster CR or deletion of the cluster CR as we see in this scenario. The operator log shows this message a number of times, proving that deletion of the cluster CR has been requested:

2021-01-29 05:56:03.777128 I | ceph-cluster-controller: CR "ocs-storagecluster-cephcluster" is going be deleted, cancelling any ongoing orchestration

The FIX: The operator is clearly not cancelling the reconcile as expected. While the operator is in this loop waiting for OSDs, it needs to call controller.CheckForCancelledOrchestration(c.context) to determine whether it should continue; if that returns an error, the reconcile should be cancelled. Currently it looks like the cancellation check is only done while preparing the OSDs.

The operator does time out after 20 minutes of waiting for the OSDs and is then expected to cancel the reconcile. The operator log appears to end about 20 minutes after the cluster deletion was requested, but I'm not clear whether the log was simply cut off or whether the 20-minute timeout was not even sufficient to cancel the reconcile.

I was able to reproduce this with a simple upstream Rook cluster using cluster-on-pvc.yaml, where the storage class was invalid for the device set and the OSD prepare pods stayed Pending forever. After the 20-minute timeout, I did see that the reconcile was cancelled and the cluster was deleted.
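A minimal sketch of the cancellation check described above, assuming the general shape of the fix rather than reproducing the actual Rook change (the PR linked in the next comment has the real implementation). The waitForOSDProvisioning helper, the done channel, and the timeout values are hypothetical; checkForCancelledOrchestration stands in for controller.CheckForCancelledOrchestration:

```
// Hypothetical sketch of the cancellation check, not the actual Rook fix.
package osdwait

import (
	"fmt"
	"time"
)

// waitForOSDProvisioning blocks until the done channel is closed (all OSD
// prepare jobs finished), the overall timeout expires, or the orchestration
// is cancelled. checkForCancelledOrchestration returns an error when the
// cluster CR has been updated or requested for deletion.
func waitForOSDProvisioning(done <-chan struct{}, checkForCancelledOrchestration func() error) error {
	timeout := time.After(20 * time.Minute) // overall wait timeout described above
	ticker := time.NewTicker(10 * time.Second)
	defer ticker.Stop()

	for {
		select {
		case <-done:
			return nil // all OSDs provisioned
		case <-timeout:
			return fmt.Errorf("timed out waiting for OSD provisioning")
		case <-ticker.C:
			// Check on every tick so a pending delete of the cluster CR
			// cancels this wait instead of blocking the reconcile.
			if err := checkForCancelledOrchestration(); err != nil {
				return fmt.Errorf("cancelling current orchestration: %w", err)
			}
		}
	}
}
```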
https://github.com/rook/rook/pull/7112
I see that there is also a problem with control plane cleanup: BZ 1927023 (I created a separate BZ as there seem to be a lot of details involved in each case, and a different reason for the failure).
This BZ fixed the issue with getting stuck waiting for OSDs to be created. In the rook operator log, you would see the following message if the reconcile was cancelled:

CANCELLING CURRENT ORCHESTRATION

Since the mons are crashing in the arbiter cluster, it must be a different issue than for the OSDs. But I don't think we need a BZ to track the uninstall issue since we expect the arbiter issues to be fixed soon, unless we can repro after the arbiter is fixed.
(In reply to Travis Nielsen from comment #12)
> This BZ fixed the issue with getting stuck waiting for OSDs to be created.
> In the rook operator log, you would see the following message if the
> reconcile was cancelled:
> CANCELLING CURRENT ORCHESTRATION
>
> Since the mons are crashing in the arbiter cluster, it must be a different
> issue than for the OSDs. But I don't think we need a BZ to track the
> uninstall issue since we expect the arbiter issues to be fixed soon, unless
> we can repro after the arbiter is fixed.

On a healthy arbiter cluster, started Add Capacity with insufficient disks so that the osd-prepare pods went to Pending state. With the cluster in this state, started the uninstall. The "oc delete -n openshift-storage storagecluster --all --wait=true" command succeeded without any issues. The following messages were observed in the rook operator log:

2021-03-16 14:22:46.918991 E | op-osd: CANCELLING CURRENT ORCHESTRATION
2021-03-16 14:22:46.939752 E | ceph-cluster-controller: failed to reconcile. failed to reconcile cluster "ocs-storagecluster-cephcluster": failed to configure local ceph cluster: failed to create cluster: failed to start ceph osds: 1 failures encountered while running osds on PVCs in namespace "openshift-storage". CANCELLING CURRENT ORCHESTRATION
2021-03-16 14:22:46.942150 I | ceph-cluster-controller: CR "ocs-storagecluster-cephcluster" is going be deleted, cancelling any ongoing orchestration

@Travis, can you please confirm whether these are the expected events in the logs?
Yes, those messages look expected. The cluster uninstall succeeded, right?
(In reply to Travis Nielsen from comment #14)
> Yes, those messages look expected. The cluster uninstall succeeded, right?

Yes, the cluster uninstall succeeded. This is verified on v4.7.0-294.ci. Moving the BZ to Verified state based on the comment 13 test results.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: Red Hat OpenShift Container Storage 4.7.0 security, bug fix, and enhancement update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2021:2041