Bug 1922064 - uninstall on VMware LSO+ arbiter with 4 OSDs in Pending state: Storagecluster deletion stuck, waiting for cephcluster to be deleted
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat OpenShift Container Storage
Classification: Red Hat Storage
Component: rook
Version: 4.7
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: ---
Target Release: OCS 4.7.0
Assignee: Santosh Pillai
QA Contact: Prasad Desala
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2021-01-29 07:17 UTC by Neha Berry
Modified: 2021-06-01 08:48 UTC
CC List: 6 users

Fixed In Version: 4.7.0-721.ci
Doc Type: No Doc Update
Doc Text:
Clone Of:
Environment:
Last Closed: 2021-05-19 09:18:35 UTC
Embargoed:


Attachments:


Links:
- GitHub openshift/rook pull 165 (closed): Bug 1922064: ceph: check for orchestration cancellation while OSDs are pending (last updated 2021-02-17 17:44:44 UTC)
- GitHub rook/rook pull 7112 (closed): ceph: check for orchestration cancellation while waiting for all OSDs to start. (last updated 2021-02-17 17:44:43 UTC)
- Red Hat Product Errata RHSA-2021:2041 (last updated 2021-05-19 09:19:02 UTC)

Description Neha Berry 2021-01-29 07:17:49 UTC
Description of problem (please be as detailed as possible and provide log snippets):
----------------------------------------------------------------------------
Storagecluster deletion is stuck on a cluster where some OSD PVCs were in pending state.

This bug seems to be different from Bug 1916850 comment 4, as the CephObjectStore deletion had not yet been initiated (hence no deletionTimestamp) because the CephCluster deletion itself was stuck.


ocs-operator log snip
---------------------------
{"level":"info","ts":1611903838.524638,"logger":"controllers.StorageCluster","msg":"Uninstall in progress","Request.Namespace":"openshift-storage","Request.Name":"ocs-storagecluster","Status":"Uninstall: Waiting for cephCluster to be deleted"}


rook log snip
-------------------------------------------------------------------------------

2021-01-29 07:03:59.739807 I | op-osd: orchestration status config map result channel closed, will restart watch.
2021-01-29 07:03:59.954105 I | op-mon: parsing mon endpoints: b=172.30.86.254:6789,c=172.30.45.216:6789,d=172.30.149.26:6789,e=172.30.170.108:6789,a=172.30.49.88:6789
2021-01-29 07:03:59.954208 I | ceph-object-store-user-controller: CephObjectStore "ocs-storagecluster-cephobjectstore" found
2021-01-29 07:04:04.743261 I | op-osd: osd orchestration status for node ocs-deviceset-localblock-2-data-1-rvbvf is starting
2021-01-29 07:04:04.743277 I | op-osd: osd orchestration status for node ocs-deviceset-localblock-2-data-2-sjsfv is starting
2021-01-29 07:04:04.743283 I | op-osd: osd orchestration status for node ocs-deviceset-localblock-3-data-1-lnzqm is starting
2021-01-29 07:04:04.743287 I | op-osd: osd orchestration status for node ocs-deviceset-localblock-3-data-2-4hq5n is starting
2021-01-29 07:04:04.743291 I | op-osd: 8/12 node(s) completed osd provisioning, resource version 2742170
2021-01-29 07:04:04.744237 I | op-osd: orchestration status config map result channel closed, will restart watch.
2021-01-29 07:04:09.748437 I | op-osd: osd orchestration status for node ocs-deviceset-localblock-2-data-1-rvbvf is starting
2021-01-29 07:04:09.748460 I | op-osd: osd orchestration status for node ocs-deviceset-localblock-2-data-2-sjsfv is starting
2021-01-29 07:04:09.748466 I | op-osd: osd orchestration status for node ocs-deviceset-localblock-3-data-1-lnzqm is starting
2021-01-29 07:04:09.748473 I | op-osd: osd orchestration status for node ocs-deviceset-localblock-3-data-2-4hq5n is starting
2021-01-29 07:04:09.748477 I | op-osd: 8/12 node(s) completed osd provisioning, resource version 2742170


Some background on the cluster where the storagecluster deletion was initiated
-------------------------------------------------------------------------------


Created an arbiter-based deployment on VMware LSO with an initial set of 4 OSDs across 2 zones (arbiter zone = c). Due to outstanding arbiter bugs, there were no RGW or MDS pods, and NooBaa was only partially deployed (no backingstore because there was no RGW pod). Hence the storage cluster had been stuck in Progressing state since deployment.

Added 1 new disk to each of the 4 OCS nodes from the VMware console and performed add capacity from the UI. Due to an issue with add capacity from the UI (bug 1920981), OCS tried to add 8 new OSDs instead of 4, so 4 OSD PVCs were in Pending state throughout.

In this state, initiated a storagecluster deletion; it is stuck waiting for the cephcluster to be deleted.



$ oc get pod,pvc |grep osd|grep -i pending
pod/rook-ceph-osd-prepare-ocs-deviceset-localblock-2-data-1-rvdwvgt   0/1     Pending     0          45h
pod/rook-ceph-osd-prepare-ocs-deviceset-localblock-2-data-2-sjrtc9s   0/1     Pending     0          45h
pod/rook-ceph-osd-prepare-ocs-deviceset-localblock-3-data-1-lnssj84   0/1     Pending     0          45h
pod/rook-ceph-osd-prepare-ocs-deviceset-localblock-3-data-2-4hmv95b   0/1     Pending     0          45h



Version of all relevant components (if applicable):
======================================================
OCS = ocs-operator.v4.7.0-238.ci
OCP = 4.7.0-0.nightly-2021-01-19-095812

Does this issue impact your ability to continue to work with the product
(please explain in detail what is the user impact)?
===========================================================================
Yes. Unable to uninstall OCS and re-install.

Is there any workaround available to the best of your knowledge?
====================================================================
NA

Rate from 1 - 5 the complexity of the scenario you performed that caused this
bug (1 - very simple, 5 - very complex)?
===================================================================================
4

Is this issue reproducible?
=============================
Tested once


Can this issue be reproduced from the UI?
=========================================
OCS was installed from the UI.

If this is a regression, please provide more details to justify this:
=========================================================================
Not sure

Steps to Reproduce:
=================
1. Created an arbiter-based OCS deployment on VMware LSO. Due to pending bug fixes, there were no MDS or RGW pods and the StorageCluster was in Progressing state.
2. Performed add capacity; due to bug 1920981, OCS tried to add 8 OSDs instead of 4. Hence 4 OSD PVCs were stuck in Pending state and Rook kept reconciling on them.
3. Initiated uninstall by deleting the storagecluster at 2021-01-29T05:56:03Z.


Actual results:
===================
Storage cluster deletion is stuck indefinitely, waiting for cephcluster to be deleted.

Expected results:
======================
Deletion of storagecluster and cleanup of OCS should succeed.


Additional info:
=======================
PVC before storagecluster deletion

--------------
======= PVC ==========
NAME                                      STATUS    VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS                  AGE
db-noobaa-db-pg-0                         Bound     pvc-0e97d52a-403a-42cc-b40f-a244f0da8f0c   50Gi       RWO            ocs-storagecluster-ceph-rbd   43h
ocs-deviceset-localblock-0-data-0-mjss4   Bound     local-pv-e114baad                          150Gi      RWO            localblock                    43h
ocs-deviceset-localblock-0-data-1-lzrbq   Bound     local-pv-3afb8338                          150Gi      RWO            localblock                    43h
ocs-deviceset-localblock-0-data-2-zxvcz   Bound     local-pv-ccbbe0c3                          150Gi      RWO            localblock                    43h
ocs-deviceset-localblock-1-data-0-gbndv   Bound     local-pv-3cbd97e6                          150Gi      RWO            localblock                    43h
ocs-deviceset-localblock-1-data-1-fpk9v   Bound     local-pv-12330b5b                          150Gi      RWO            localblock                    43h
ocs-deviceset-localblock-1-data-2-tc6lk   Bound     local-pv-9cb2b93f                          150Gi      RWO            localblock                    43h
ocs-deviceset-localblock-2-data-0-gwzqz   Bound     local-pv-119c924                           150Gi      RWO            localblock                    43h
ocs-deviceset-localblock-2-data-1-rvbvf   Pending                                                                        localblock                    43h
ocs-deviceset-localblock-2-data-2-sjsfv   Pending                                                                        localblock                    43h
ocs-deviceset-localblock-3-data-0-2xcwn   Bound     local-pv-d5a8a2cb                          150Gi      RWO            localblock                    43h
ocs-deviceset-localblock-3-data-1-lnzqm   Pending                                                                        localblock                    43h
ocs-deviceset-localblock-3-data-2-4hq5n   Pending                                                                        localblock                    43h

Comment 3 Santosh Pillai 2021-01-29 10:36:25 UTC
The CephCluster received a delete request, but the reconcile for the delete never started. This could be because the previous reconcile was still pending and kept looping.

2021-01-29 07:04:04.743291 I | op-osd: 8/12 node(s) completed osd provisioning, resource version 2742170
2021-01-29 07:04:04.744237 I | op-osd: orchestration status config map result channel closed, will restart watch.
2021-01-29 07:04:09.748437 I | op-osd: osd orchestration status for node ocs-deviceset-localblock-2-data-1-rvbvf is starting
2021-01-29 07:04:09.748460 I | op-osd: osd orchestration status for node ocs-deviceset-localblock-2-data-2-sjsfv is starting
2021-01-29 07:04:09.748466 I | op-osd: osd orchestration status for node ocs-deviceset-localblock-3-data-1-lnzqm is starting
2021-01-29 07:04:09.748473 I | op-osd: osd orchestration status for node ocs-deviceset-localblock-3-data-2-4hq5n is starting
2021-01-29 07:04:09.748477 I | op-osd: 8/12 node(s) completed osd provisioning, resource version 2742170

Comment 4 Santosh Pillai 2021-01-29 16:19:09 UTC
The reconciler in the rook-ceph operator is stuck while waiting to provision all the pending OSDs:

https://github.com/rook/rook/blob/f86f8300f924b98e7df37207276ee40a4bd33870/pkg/operator/ceph/cluster/osd/status.go#L189-L203

Possible root cause: 

- The watcher is watching the `rook-ceph-osd` configMaps with resourceVersion `2742170`.
   
   Log snippet:
   `8/12 node(s) completed osd provisioning, resource version 2742170`

- None of the `rook-ceph-osd` configMaps has the resourceVersion `2742170`:
   
  ```
 $ oc get cm --selector=app=rook-ceph-osd --selector=status=provisioning -o yaml 
apiVersion: v1
items:
- apiVersion: v1
  data:
    status: '{"osds":null,"status":"starting","pvc-backed-osd":true,"message":""}'
  kind: ConfigMap
  metadata:
    creationTimestamp: "2021-01-27T13:33:34Z"
    labels:
      app: rook-ceph-osd
      node: ocs-deviceset-localblock-2-data-1-rvbvf
      status: provisioning
    managedFields:
    - apiVersion: v1
      fieldsType: FieldsV1
      fieldsV1:
        f:data:
          .: {}
          f:status: {}
        f:metadata:
          f:labels:
            .: {}
            f:app: {}
            f:node: {}
            f:status: {}
          f:ownerReferences:
            .: {}
            k:{"uid":"739c2dc2-0acb-4817-a37f-1b8a4f4cdfeb"}:
              .: {}
              f:apiVersion: {}
              f:blockOwnerDeletion: {}
              f:controller: {}
              f:kind: {}
              f:name: {}
              f:uid: {}
      manager: rook
      operation: Update
      time: "2021-01-27T13:33:34Z"
    name: rook-ceph-osd-ocs-deviceset-localblock-2-data-1-rvbvf-status
    namespace: openshift-storage
    ownerReferences:
    - apiVersion: ceph.rook.io/v1
      blockOwnerDeletion: true
      controller: true
      kind: CephCluster
      name: ocs-storagecluster-cephcluster
      uid: 739c2dc2-0acb-4817-a37f-1b8a4f4cdfeb
    resourceVersion: "2742146"
    selfLink: /api/v1/namespaces/openshift-storage/configmaps/rook-ceph-osd-ocs-deviceset-localblock-2-data-1-rvbvf-status
    uid: 25029d6d-5615-4c47-8cfc-18968d2cce1c
- apiVersion: v1
  data:
    status: '{"osds":null,"status":"starting","pvc-backed-osd":true,"message":""}'
  kind: ConfigMap
  metadata:
    creationTimestamp: "2021-01-27T13:33:35Z"
    labels:
      app: rook-ceph-osd
      node: ocs-deviceset-localblock-2-data-2-sjsfv
      status: provisioning
    managedFields:
    - apiVersion: v1
      fieldsType: FieldsV1
      fieldsV1:
        f:data:
          .: {}
          f:status: {}
        f:metadata:
          f:labels:
            .: {}
            f:app: {}
            f:node: {}
            f:status: {}
          f:ownerReferences:
            .: {}
            k:{"uid":"739c2dc2-0acb-4817-a37f-1b8a4f4cdfeb"}:
              .: {}
              f:apiVersion: {}
              f:blockOwnerDeletion: {}
              f:controller: {}
              f:kind: {}
              f:name: {}
              f:uid: {}
      manager: rook
      operation: Update
      time: "2021-01-27T13:33:35Z"
    name: rook-ceph-osd-ocs-deviceset-localblock-2-data-2-sjsfv-status
    namespace: openshift-storage
    ownerReferences:
    - apiVersion: ceph.rook.io/v1
      blockOwnerDeletion: true
      controller: true
      kind: CephCluster
      name: ocs-storagecluster-cephcluster
      uid: 739c2dc2-0acb-4817-a37f-1b8a4f4cdfeb
    resourceVersion: "2742151"
    selfLink: /api/v1/namespaces/openshift-storage/configmaps/rook-ceph-osd-ocs-deviceset-localblock-2-data-2-sjsfv-status
    uid: 0cb5f700-1514-4705-a8de-d29d3d2d62f3
- apiVersion: v1
  data:
    status: '{"osds":null,"status":"starting","pvc-backed-osd":true,"message":""}'
  kind: ConfigMap
  metadata:
    creationTimestamp: "2021-01-27T13:33:35Z"
    labels:
      app: rook-ceph-osd
      node: ocs-deviceset-localblock-3-data-1-lnzqm
      status: provisioning
    managedFields:
    - apiVersion: v1
      fieldsType: FieldsV1
      fieldsV1:
        f:data:
          .: {}
          f:status: {}
        f:metadata:
          f:labels:
            .: {}
            f:app: {}
            f:node: {}
            f:status: {}
          f:ownerReferences:
            .: {}
            k:{"uid":"739c2dc2-0acb-4817-a37f-1b8a4f4cdfeb"}:
              .: {}
              f:apiVersion: {}
              f:blockOwnerDeletion: {}
              f:controller: {}
              f:kind: {}
              f:name: {}
              f:uid: {}
      manager: rook
      operation: Update
      time: "2021-01-27T13:33:35Z"
    name: rook-ceph-osd-ocs-deviceset-localblock-3-data-1-lnzqm-status
    namespace: openshift-storage
    ownerReferences:
    - apiVersion: ceph.rook.io/v1
      blockOwnerDeletion: true
      controller: true
      kind: CephCluster
      name: ocs-storagecluster-cephcluster
      uid: 739c2dc2-0acb-4817-a37f-1b8a4f4cdfeb
    resourceVersion: "2742160"
    selfLink: /api/v1/namespaces/openshift-storage/configmaps/rook-ceph-osd-ocs-deviceset-localblock-3-data-1-lnzqm-status
    uid: 346636e6-ec94-4abf-ac59-a13a340e01dd
- apiVersion: v1
  data:
    status: '{"osds":null,"status":"starting","pvc-backed-osd":true,"message":""}'
  kind: ConfigMap
  metadata:
    creationTimestamp: "2021-01-27T13:33:36Z"
    labels:
      app: rook-ceph-osd
      node: ocs-deviceset-localblock-3-data-2-4hq5n
      status: provisioning
    managedFields:
    - apiVersion: v1
      fieldsType: FieldsV1
      fieldsV1:
        f:data:
          .: {}
          f:status: {}
        f:metadata:
          f:labels:
            .: {}
            f:app: {}
            f:node: {}
            f:status: {}
          f:ownerReferences:
            .: {}
            k:{"uid":"739c2dc2-0acb-4817-a37f-1b8a4f4cdfeb"}:
              .: {}
              f:apiVersion: {}
              f:blockOwnerDeletion: {}
              f:controller: {}
              f:kind: {}
              f:name: {}
              f:uid: {}
      manager: rook
      operation: Update
      time: "2021-01-27T13:33:36Z"
    name: rook-ceph-osd-ocs-deviceset-localblock-3-data-2-4hq5n-status
    namespace: openshift-storage
    ownerReferences:
    - apiVersion: ceph.rook.io/v1
      blockOwnerDeletion: true
      controller: true
      kind: CephCluster
      name: ocs-storagecluster-cephcluster
      uid: 739c2dc2-0acb-4817-a37f-1b8a4f4cdfeb
    resourceVersion: "2742169"
    selfLink: /api/v1/namespaces/openshift-storage/configmaps/rook-ceph-osd-ocs-deviceset-localblock-3-data-2-4hq5n-status
    uid: b4b83b62-0341-4f92-b424-43f54406ed9a
kind: List
metadata:
  resourceVersion: ""
  selfLink: ""

```

Possible solution (see the sketch below):
- Update the `resourceVersion` every time when checking for the completed nodes here: https://github.com/rook/rook/blob/f86f8300f924b98e7df37207276ee40a4bd33870/pkg/operator/ceph/cluster/osd/status.go#L193
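
A minimal sketch of that idea (illustrative only, not the actual rook code; function and variable names are made up): re-list the status ConfigMaps whenever the watch channel closes and restart the watch from the fresh list's resourceVersion, rather than reusing the value captured when orchestration started.

```go
// Illustrative sketch (not the rook source) of the suggested fix: re-list the
// rook-ceph-osd status ConfigMaps every time the watch has to be restarted,
// so the watch always starts from a fresh resourceVersion instead of the
// stale one captured at orchestration start.
package osdstatus

import (
	"context"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/watch"
	"k8s.io/client-go/kubernetes"
)

// watchUntil watches the OSD status ConfigMaps and returns once `done`
// reports that all OSDs have been provisioned.
func watchUntil(ctx context.Context, cs kubernetes.Interface, ns string, done func(watch.Event) bool) error {
	const selector = "app=rook-ceph-osd"
	for {
		// Re-list on every (re)start of the watch so the resourceVersion is current.
		list, err := cs.CoreV1().ConfigMaps(ns).List(ctx, metav1.ListOptions{LabelSelector: selector})
		if err != nil {
			return err
		}
		w, err := cs.CoreV1().ConfigMaps(ns).Watch(ctx, metav1.ListOptions{
			LabelSelector:   selector,
			ResourceVersion: list.ResourceVersion, // fresh, not the value from orchestration start
		})
		if err != nil {
			return err
		}
		for ev := range w.ResultChan() {
			if done(ev) {
				w.Stop()
				return nil
			}
		}
		// Result channel closed: loop around, re-list, and restart the watch.
	}
}
```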

Comment 5 Travis Nielsen 2021-01-29 19:36:22 UTC
The operator has a mechanism to cancel a reconcile so that a long-running reconcile does not get stuck and block other actions, such as updates to the cluster CR or deletion of the cluster CR, as we see in this scenario. 

The operator log shows this message a number of times, proving that the cluster CR has been requested for deletion:
2021-01-29 05:56:03.777128 I | ceph-cluster-controller: CR "ocs-storagecluster-cephcluster" is going be deleted, cancelling any ongoing orchestration

The FIX: The operator is certainly not cancelling the reconcile as expected. When the operator is in this loop of waiting for OSDs, it needs to call the method controller.CheckForCancelledOrchestration(c.context) to determine whether it should continue. If that returns an error, the reconcile should be cancelled. Currently it looks like the check for cancellation is only done while preparing the OSDs. 

The operator does time out after 20 minutes of waiting for the OSDs and is then expected to cancel the reconcile. The operator log appears to end about 20 minutes after the cluster was deleted, but I'm not clear whether the log was just cut off or whether the 20-minute timeout was not even sufficient to cancel the reconcile. 

I was able to repro this with a simple upstream Rook cluster using cluster-on-pvc.yaml where the storage class was invalid for the device set, so the OSD prepare pods stayed Pending forever. After the 20-minute timeout, I did see that the reconcile was cancelled and the cluster was deleted.
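
For illustration, a hedged sketch of the fix direction (simplified, not the merged diff in the PRs linked above; `controller.CheckForCancelledOrchestration` is the rook helper named in this comment, everything else here is a made-up stand-in): the wait loop polls the cancellation check on every pass so a delete or update of the CephCluster CR aborts the reconcile instead of spinning forever at "8/12 node(s) completed".

```go
// Illustrative sketch only: poll the cancellation check while waiting for
// pending OSDs, so a CephCluster delete request can abort the reconcile.
package osdwait

import (
	"fmt"
	"time"
)

// checkCancelled stands in for rook's controller.CheckForCancelledOrchestration:
// it returns a non-nil error once orchestration should be cancelled (e.g. the
// CephCluster CR was requested for deletion).
type checkCancelled func() error

// waitForAllOSDs polls the per-node provisioning status, but now also polls
// the cancellation check, so the wait can be aborted mid-loop.
func waitForAllOSDs(completedNodes func() (done, total int), cancelled checkCancelled) error {
	for {
		if err := cancelled(); err != nil {
			return fmt.Errorf("CANCELLING CURRENT ORCHESTRATION: %w", err)
		}
		done, total := completedNodes()
		if done >= total {
			return nil
		}
		fmt.Printf("%d/%d node(s) completed osd provisioning\n", done, total)
		time.Sleep(5 * time.Second)
	}
}
```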

Comment 6 Travis Nielsen 2021-02-01 16:42:30 UTC
https://github.com/rook/rook/pull/7112

Comment 9 Martin Bukatovic 2021-02-09 21:21:16 UTC
I see that there is also a problem with control plane cleanup: BZ 1927023 (I created a separate BZ as there seem to be a lot of details involved in each case, and a different reason for the failure).

Comment 12 Travis Nielsen 2021-02-23 18:33:08 UTC
This BZ fixed the issue with getting stuck waiting for OSDs to be created. In the rook operator log, you would see the following message if the reconcile was cancelled:
CANCELLING CURRENT ORCHESTRATION

Since the mons are crashing in the arbiter cluster, it must be a different issue than for the OSDs. But I don't think we need a BZ to track the uninstall issue since we expect the arbiter issues to be fixed soon, unless we can repro after the arbiter is fixed.

Comment 13 Prasad Desala 2021-03-16 15:03:41 UTC
(In reply to Travis Nielsen from comment #12)
> This BZ fixed the issue with getting stuck waiting for OSDs to be created.
> In the rook operator log, you would see the following message if the
> reconcile was cancelled:
> CANCELLING CURRENT ORCHESTRATION
> 
> Since the mons are crashing in the arbiter cluster, it must be a different
> issue than for the OSDs. But I don't think we need a BZ to track the
> uninstall issue since we expect the arbiter issues to be fixed soon, unless
> we can repro after the arbiter is fixed.

On a healthy arbiter cluster, started add capacity with insufficient disks so that the osd-prepare pods went to Pending state. With the cluster in this state, started the uninstall.
The "oc delete -n openshift-storage storagecluster --all --wait=true" command succeeded without any issues.

Below log messages were observed in the rook operator logs,

2021-03-16 14:22:46.918991 E | op-osd: CANCELLING CURRENT ORCHESTRATION
2021-03-16 14:22:46.939752 E | ceph-cluster-controller: failed to reconcile. failed to reconcile cluster "ocs-storagecluster-cephcluster": failed to configure local ceph cluster: failed to create cluster: failed to start ceph osds: 1 failures encountered while running osds on PVCs in namespace "openshift-storage". CANCELLING CURRENT ORCHESTRATION
2021-03-16 14:22:46.942150 I | ceph-cluster-controller: CR "ocs-storagecluster-cephcluster" is going be deleted, cancelling any ongoing orchestration

@Travis, can you please confirm whether I am seeing the right events in the logs?

Comment 14 Travis Nielsen 2021-03-16 17:40:02 UTC
Yes, those messages look expected. The cluster uninstall succeeded, right?

Comment 16 Prasad Desala 2021-03-23 13:47:32 UTC
(In reply to Travis Nielsen from comment #14)
> Yes, those messages look expected. The cluster uninstall succeeded, right?

Yes, the cluster uninstall succeeded.
This is verified on v4.7.0-294.ci.
Moving the BZ to Verified state based on comment 13 test results.

Comment 18 errata-xmlrpc 2021-05-19 09:18:35 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: Red Hat OpenShift Container Storage 4.7.0 security, bug fix, and enhancement update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:2041

