Description of problem (please be as detailed as possible and provide log snippets):

OCS storagecluster/storagesystem was deleted and ODF entered terminating mode, but all the PVCs still exist, which prevented the deletion. The cephcluster CR yaml shows that a deletion was initiated but did not complete. The deletion did not go through because there were dependent objects that could not be removed. Please see the condition 'CephCluster "openshift-storage/ocs-storagecluster-cephcluster" will not be deleted until all dependents are removed', which failed with reason: ObjectHasDependents.

oc get cephcluster -o yaml
apiVersion: v1
items:
- apiVersion: ceph.rook.io/v1
  kind: CephCluster
...
    conditions:
    - lastHeartbeatTime: "2024-05-17T02:11:20Z"
      lastTransitionTime: "2023-11-05T09:59:50Z"
      message: Cluster created successfully
      reason: ClusterCreated
      status: "True"
      type: Ready
    - lastHeartbeatTime: "2024-06-03T06:55:16Z"
      lastTransitionTime: "2024-05-17T02:12:01Z"
      message: 'CephCluster "openshift-storage/ocs-storagecluster-cephcluster" will not
        be deleted until all dependents are removed: CephBlockPool: [ocs-storagecluster-cephblockpool],
        CephFilesystem: [ocs-storagecluster-cephfilesystem], CephObjectStore: [ocs-storagecluster-cephobjectstore],
        CephObjectStoreUser: [noobaa-ceph-objectstore-user ocs-storagecluster-cephobjectstoreuser prometheus-user]'
      reason: ObjectHasDependents
      status: "True"
      type: DeletionIsBlocked
    - lastHeartbeatTime: "2024-06-03T06:55:15Z"
      lastTransitionTime: "2024-05-17T02:12:00Z"
      message: Deleting the CephCluster
      reason: ClusterDeleting
      status: "True"
      type: Deleting
    message: Deleting the CephCluster

2024-05-31 02:35:36.204878 E | ceph-cluster-controller: failed to reconcile CephCluster "openshift-storage/ocs-storagecluster-cephcluster". CephCluster "openshift-storage/ocs-storagecluster-cephcluster" will not be deleted until all dependents are removed: CephBlockPool: [ocs-storagecluster-cephblockpool], CephFilesystem: [ocs-storagecluster-cephfilesystem], CephObjectStore: [ocs-storagecluster-cephobjectstore], CephObjectStoreUser: [noobaa-ceph-objectstore-user ocs-storagecluster-cephobjectstoreuser prometheus-user]

Version of all relevant components (if applicable):
ODF 4.12
ocs-operator.v4.12.11-rhodf                    OpenShift Container Storage     4.12.11-rhodf   ocs-operator.v4.11.13                          Succeeded
odf-csi-addons-operator.v4.12.11-rhodf         CSI Addons                      4.12.11-rhodf   odf-csi-addons-operator.v4.11.13               Succeeded
odf-multicluster-orchestrator.v4.12.12-rhodf   ODF Multicluster Orchestrator   4.12.12-rhodf   odf-multicluster-orchestrator.v4.12.11-rhodf   Succeeded
odf-operator.v4.12.11-rhodf                    OpenShift Data Foundation       4.12.11-rhodf   odf-operator.v4.11.13                          Succeeded
odr-hub-operator.v4.12.12-rhodf                Openshift DR Hub Operator       4.12.12-rhodf   odr-hub-operator.v4.12.11-rhodf                Succeeded
openshift-gitops-operator.v1.11.2              Red Hat OpenShift GitOps        1.11.2          openshift-gitops-operator.v1.11.1              Succeeded

Does this issue impact your ability to continue to work with the product (please explain in detail what is the user impact)?
Yes

Is there any workaround available to the best of your knowledge?
There are two options, which we discussed with the customer:
a] Take a backup of their data and reinstall ODF.
b] Restore the cluster using the upstream procedure: https://www.rook.io/docs/rook/v1.14/Troubleshooting/disaster-recovery/#restoring-crds-after-deletion
The cluster is used extensively for Quay, and several applications are using ODF-based PVCs and OBCs, hence the customer is not okay with the first option.

Ask: Can we recover the cluster using the upstream procedure and attempt to restore the cephcluster?

Rate from 1 - 5 the complexity of the scenario you performed that caused this bug (1 - very simple, 5 - very complex)?
3

Can this issue be reproduced?
Yes

Can this issue reproduce from the UI?
N/A

If this is a regression, please provide more details to justify this:
N/A

Steps to Reproduce:
N/A

Actual results:
N/A

Expected results:
N/A

Additional info:
It's not very clear what the requirement is here. From what I understand, the customer deleted the cluster, it got stuck due to the dependent resources, and now the customer wants to restore the same cluster? Is that correct?
Santosh,

Yes, that's correct. Upstream has a disaster recovery doc [1] that we were hoping to get a +1 from engineering to use, to hopefully stop the deletion of the cephcluster resource.

[1] https://www.rook.io/docs/rook/v1.14/Troubleshooting/disaster-recovery/#restoring-crds-after-deletion
I ran through this process on a lab machine. One thing I noticed is that as soon as I remove the finalizer from the cephcluster CR, it gets recreated... might need to scale down the ocs-operator deployment as well? Here are my findings:

[system:admin/openshift-storage root ~]$ oc get cephcluster
NAME                             DATADIRHOSTPATH   MONCOUNT   AGE   PHASE   MESSAGE                        HEALTH      EXTERNAL   FSID
ocs-storagecluster-cephcluster   /var/lib/rook     3          53d   Ready   Cluster created successfully   HEALTH_OK              246db4a4-d3a0-4a6a-9d55-17a84bdc0274
[system:admin/openshift-storage root ~]$ oc delete cephcluster ocs-storagecluster-cephcluster
cephcluster.ceph.rook.io "ocs-storagecluster-cephcluster" deleted
^C[system:admin/openshift-storage root ~]$ oc get cephcluster -o yaml
apiVersion: v1
items:
- apiVersion: ceph.rook.io/v1
  kind: CephCluster
  metadata:
    creationTimestamp: "2024-04-11T21:10:53Z"
    deletionGracePeriodSeconds: 0
    deletionTimestamp: "2024-06-04T14:08:07Z"
    finalizers:
    - cephcluster.ceph.rook.io
    generation: 12
    labels:
      app: ocs-storagecluster
      replicationid.multicluster.openshift.io: eb5a7b12c32796fcbb2278bcc4a38bf945443a4
    name: ocs-storagecluster-cephcluster
    namespace: openshift-storage
    ownerReferences:
    - apiVersion: ocs.openshift.io/v1
      blockOwnerDeletion: true
      controller: true
      kind: StorageCluster
      name: ocs-storagecluster
      uid: 72ca44f0-33a0-4fe2-ba0d-39c5870550db
    resourceVersion: "60596460"
    uid: af973f13-f880-431f-b84b-e90e8a95e08b
  ...
    conditions:
    - lastHeartbeatTime: "2024-06-04T14:07:15Z"
      lastTransitionTime: "2024-04-11T21:19:32Z"
      message: Cluster created successfully
      reason: ClusterCreated
      status: "True"
      type: Ready
    - lastHeartbeatTime: "2024-06-04T14:08:13Z"
      lastTransitionTime: "2024-06-04T14:08:13Z"
      message: Deleting the CephCluster
      reason: ClusterDeleting
      status: "True"
      type: Deleting
    - lastHeartbeatTime: "2024-06-04T14:08:14Z"
      lastTransitionTime: "2024-06-04T14:08:14Z"
      message: 'CephCluster "openshift-storage/ocs-storagecluster-cephcluster" will not
        be deleted until all dependents are removed: CephBlockPool: [ocs-storagecluster-cephblockpool],
        CephFilesystem: [ocs-storagecluster-cephfilesystem ocs-storagecluster-cephfilesystem-new],
        CephFilesystemSubVolumeGroup: [ocs-storagecluster-cephfilesystem-csi],
        CephObjectStore: [ocs-storagecluster-cephobjectstore],
        CephObjectStoreUser: [noobaa-ceph-objectstore-user ocs-storagecluster-cephobjectstoreuser prometheus-user],
        CephRBDMirror: [ocs-storagecluster-cephrbdmirror]'
      reason: ObjectHasDependents
      status: "True"
      type: DeletionIsBlocked
    message: Deleting the CephCluster
    observedGeneration: 11
    phase: Deleting
    state: Deleting
...
[system:admin/openshift-storage root ~]$ oc scale deployment rook-ceph-operator --replicas 0
deployment.apps/rook-ceph-operator scaled
[system:admin/openshift-storage root ~]$ oc get cephcluster -o yaml > cluster.yaml
[system:admin/openshift-storage root ~]$ oc get secrets -o yaml > secrets.yaml
[system:admin/openshift-storage root ~]$ oc get cm -o yaml > configmaps.yaml
[system:admin/openshift-storage root ~]$ oc get cephcluster ocs-storagecluster-cephcluster -o 'jsonpath={.metadata.uid}'
af973f13-f880-431f-b84b-e90e8a95e08b[system:admin/openshift-storage root ~]$
[system:admin/openshift-storage root ~]$ ROOK_UID=$(oc get cephcluster ocs-storagecluster-cephcluster -o 'jsonpath={.metadata.uid}')
[system:admin/openshift-storage root ~]$ RESOURCES=$(oc get secret,configmap,service,deployment,pvc -o jsonpath='{range .items[?(@.metadata.ownerReferences[*].uid=="'"$ROOK_UID"'")]}{.kind}{"/"}{.metadata.name}{"\n"}{end}')
[system:admin/openshift-storage root ~]$ oc get $RESOURCES
NAME                                                       TYPE                 DATA   AGE
secret/cluster-peer-token-ocs-storagecluster-cephcluster   kubernetes.io/rook   2      53d
secret/rook-ceph-admin-keyring                             kubernetes.io/rook   1      53d
secret/rook-ceph-config                                    kubernetes.io/rook   2      53d
secret/rook-ceph-crash-collector-keyring                   kubernetes.io/rook   1      53d
secret/rook-ceph-exporter-keyring                          kubernetes.io/rook   1      53d
secret/rook-ceph-mgr-a-keyring                             kubernetes.io/rook   1      53d
secret/rook-ceph-mgr-b-keyring                             kubernetes.io/rook   1      53d
secret/rook-ceph-mon                                       kubernetes.io/rook   4      53d
secret/rook-ceph-mons-keyring                              kubernetes.io/rook   1      53d
secret/rook-csi-cephfs-node                                kubernetes.io/rook   2      53d
secret/rook-csi-cephfs-provisioner                         kubernetes.io/rook   2      53d
secret/rook-csi-rbd-node                                   kubernetes.io/rook   2      53d
secret/rook-csi-rbd-provisioner                            kubernetes.io/rook   2      53d

NAME                                DATA   AGE
configmap/rook-ceph-mon-endpoints   5      53d

NAME                         TYPE        CLUSTER-IP       EXTERNAL-IP   PORT(S)    AGE
service/rook-ceph-exporter   ClusterIP   172.30.27.200    <none>        9926/TCP   53d
service/rook-ceph-mgr        ClusterIP   172.30.107.23    <none>        9283/TCP   53d
service/rook-ceph-mon-d      ClusterIP   172.30.112.231   <none>        3300/TCP   49d
service/rook-ceph-mon-e      ClusterIP   172.30.114.57    <none>        3300/TCP   49d
service/rook-ceph-mon-f      ClusterIP   172.30.241.30    <none>        3300/TCP   49d
service/rook-ceph-osd-0      ClusterIP   172.30.22.209    <none>        6800/TCP   49d
service/rook-ceph-osd-1      ClusterIP   172.30.120.167   <none>        6800/TCP   49d
service/rook-ceph-osd-2      ClusterIP   172.30.12.188    <none>        6800/TCP   49d

NAME                                                                   READY   UP-TO-DATE   AVAILABLE   AGE
deployment.apps/rook-ceph-crashcollector-odf0.libvirt2.ocpcluster.cc   1/1     1            1           53d
deployment.apps/rook-ceph-crashcollector-odf1.libvirt2.ocpcluster.cc   1/1     1            1           53d
deployment.apps/rook-ceph-crashcollector-odf2.libvirt2.ocpcluster.cc   1/1     1            1           53d
deployment.apps/rook-ceph-exporter-odf0.libvirt2.ocpcluster.cc         1/1     1            1           53d
deployment.apps/rook-ceph-exporter-odf1.libvirt2.ocpcluster.cc         1/1     1            1           53d
deployment.apps/rook-ceph-exporter-odf2.libvirt2.ocpcluster.cc         1/1     1            1           53d
deployment.apps/rook-ceph-mgr-a                                        1/1     1            1           53d
deployment.apps/rook-ceph-mgr-b                                        1/1     1            1           53d
deployment.apps/rook-ceph-mon-d                                        1/1     1            1           49d
deployment.apps/rook-ceph-mon-e                                        1/1     1            1           49d
deployment.apps/rook-ceph-mon-f                                        1/1     1            1           49d
deployment.apps/rook-ceph-osd-0                                        1/1     1            1           49d
deployment.apps/rook-ceph-osd-1                                        1/1     1            1           49d
deployment.apps/rook-ceph-osd-2                                        1/1     1            1           49d

NAME                                                STATUS   VOLUME              CAPACITY   ACCESS MODES   STORAGECLASS   AGE
persistentvolumeclaim/ocs-deviceset-0-data-0r6qc8   Bound    local-pv-e059f7de   500Gi      RWO            localblock     53d
persistentvolumeclaim/ocs-deviceset-1-data-0rtkmk   Bound    local-pv-9ef1ecc5   500Gi      RWO            localblock     53d
persistentvolumeclaim/ocs-deviceset-2-data-0djbtg   Bound    local-pv-e0c8a328   500Gi      RWO            localblock     53d

[system:admin/openshift-storage root ~]$ for resource in $(oc -n openshift-storage get $RESOURCES -o name)
> do
>   oc -n openshift-storage patch $resource -p '{"metadata": {"ownerReferences":null}}'
> done
secret/cluster-peer-token-ocs-storagecluster-cephcluster patched
secret/rook-ceph-admin-keyring patched
secret/rook-ceph-config patched
secret/rook-ceph-crash-collector-keyring patched
secret/rook-ceph-exporter-keyring patched
secret/rook-ceph-mgr-a-keyring patched
secret/rook-ceph-mgr-b-keyring patched
secret/rook-ceph-mon patched
secret/rook-ceph-mons-keyring patched
secret/rook-csi-cephfs-node patched
secret/rook-csi-cephfs-provisioner patched
secret/rook-csi-rbd-node patched
secret/rook-csi-rbd-provisioner patched
configmap/rook-ceph-mon-endpoints patched
service/rook-ceph-exporter patched
service/rook-ceph-mgr patched
service/rook-ceph-mon-d patched
service/rook-ceph-mon-e patched
service/rook-ceph-mon-f patched
service/rook-ceph-osd-0 patched
service/rook-ceph-osd-1 patched
service/rook-ceph-osd-2 patched
deployment.apps/rook-ceph-crashcollector-odf0.libvirt2.ocpcluster.cc patched
deployment.apps/rook-ceph-crashcollector-odf1.libvirt2.ocpcluster.cc patched
deployment.apps/rook-ceph-crashcollector-odf2.libvirt2.ocpcluster.cc patched
deployment.apps/rook-ceph-exporter-odf0.libvirt2.ocpcluster.cc patched
deployment.apps/rook-ceph-exporter-odf1.libvirt2.ocpcluster.cc patched
deployment.apps/rook-ceph-exporter-odf2.libvirt2.ocpcluster.cc patched
deployment.apps/rook-ceph-mgr-a patched
deployment.apps/rook-ceph-mgr-b patched
deployment.apps/rook-ceph-mon-d patched
deployment.apps/rook-ceph-mon-e patched
deployment.apps/rook-ceph-mon-f patched
deployment.apps/rook-ceph-osd-0 patched
deployment.apps/rook-ceph-osd-1 patched
deployment.apps/rook-ceph-osd-2 patched
persistentvolumeclaim/ocs-deviceset-0-data-0r6qc8 patched
persistentvolumeclaim/ocs-deviceset-1-data-0rtkmk patched
persistentvolumeclaim/ocs-deviceset-2-data-0djbtg patched
[system:admin/openshift-storage root ~]$ for resource in $(oc -n openshift-storage get $RESOURCES -o name); do oc get $resource -o yaml >> file.txt; done
[system:admin/openshift-storage root ~]$ less file.txt
[system:admin/openshift-storage root ~]$ less cluster.yaml
[system:admin/openshift-storage root ~]$ oc patch cephcluster/ocs-storagecluster-cephcluster --type json --patch='[ { "op": "remove", "path": "/metadata/finalizers" } ]'
cephcluster.ceph.rook.io/ocs-storagecluster-cephcluster patched
[system:admin/openshift-storage root ~]$ oc get cephcluster
NAME                             DATADIRHOSTPATH   MONCOUNT   AGE   PHASE   MESSAGE   HEALTH   EXTERNAL   FSID
ocs-storagecluster-cephcluster   /var/lib/rook     3          7s
[system:admin/openshift-storage root ~]$ oc create -f cluster.yaml
Error from server (AlreadyExists): error when creating "cluster.yaml": cephclusters.ceph.rook.io "ocs-storagecluster-cephcluster" already exists
[system:admin/openshift-storage root ~]$ oc get pods
NAME                                                              READY   STATUS    RESTARTS      AGE
csi-addons-controller-manager-68f7c7d494-xp2kk                    2/2     Running   0             4d22h
csi-cephfsplugin-p6hfb                                            2/2     Running   0             42d
csi-cephfsplugin-provisioner-784cf69787-btcqz                     6/6     Running   0             27d
csi-cephfsplugin-provisioner-784cf69787-jx6hm                     6/6     Running   0             53d
csi-cephfsplugin-v7xff                                            2/2     Running   0             42d
csi-cephfsplugin-v9cpg                                            2/2     Running   0             42d
csi-cephfsplugin-vkkwp                                            2/2     Running   0             42d
csi-cephfsplugin-x5mjx                                            2/2     Running   0             42d
csi-rbdplugin-2cnf9                                               3/3     Running   4             53d
csi-rbdplugin-8drgj                                               3/3     Running   0             53d
csi-rbdplugin-hpg5z                                               3/3     Running   1             53d
csi-rbdplugin-jlflf                                               3/3     Running   4             53d
csi-rbdplugin-m82s4                                               3/3     Running   4             53d
csi-rbdplugin-provisioner-7845b8779f-55thz                        7/7     Running   0             27d
csi-rbdplugin-provisioner-7845b8779f-ljp7j                        7/7     Running   0             49d
maintenance-agent-548986c6d7-pk6xp                                1/1     Running   0             27d
noobaa-core-0                                                     1/1     Running   0             53d
noobaa-db-pg-0                                                    1/1     Running   0             53d
noobaa-endpoint-5d8b4755-6f2vz                                    1/1     Running   0             53d
noobaa-operator-67ffc7bdf5-mnbt5                                  1/1     Running   1 (29s ago)   49d
ocs-metrics-exporter-55788b6cdb-r2hjn                             1/1     Running   0             27d
ocs-operator-86f58456c4-rzp7q                                     1/1     Running   0             47d
odf-console-77b5f8c787-686gk                                      1/1     Running   0             53d
odf-operator-controller-manager-6754f68ccc-vrlll                  2/2     Running   0             53d
rook-ceph-crashcollector-odf0.libvirt2.ocpcluster.cc-66bcbt2r9m   1/1     Running   0             53d
rook-ceph-crashcollector-odf1.libvirt2.ocpcluster.cc-74c7b2hpmr   1/1     Running   0             53d
rook-ceph-crashcollector-odf2.libvirt2.ocpcluster.cc-669cdtgg59   1/1     Running   0             26d
rook-ceph-exporter-odf0.libvirt2.ocpcluster.cc-785556d5c8-wslwg   1/1     Running   0             53d
rook-ceph-exporter-odf1.libvirt2.ocpcluster.cc-5bbd756997-7cs75   1/1     Running   0             53d
rook-ceph-exporter-odf2.libvirt2.ocpcluster.cc-5448954bf6-j544k   1/1     Running   0             26d
rook-ceph-mds-ocs-storagecluster-cephfilesystem-a-99dc5fdbttk4t   2/2     Running   0             26d
rook-ceph-mds-ocs-storagecluster-cephfilesystem-b-595dbc7fs6z2s   2/2     Running   0             26d
rook-ceph-mds-ocs-storagecluster-cephfilesystem-new-a-85cd4t7td   2/2     Running   0             26d
rook-ceph-mds-ocs-storagecluster-cephfilesystem-new-b-574bc4sdq   2/2     Running   0             26d
rook-ceph-mgr-a-7dcbd99c68-8bvwj                                  3/3     Running   0             41d
rook-ceph-mgr-b-56f6dd854f-49t2c                                  3/3     Running   0             41d
rook-ceph-mon-d-7c857959bd-lg4ck                                  2/2     Running   0             32d
rook-ceph-mon-e-5bb6769b4b-fh9ns                                  2/2     Running   0             32d
rook-ceph-mon-f-5d5b879fd8-lrmlr                                  2/2     Running   0             32d
rook-ceph-osd-0-7f4cfd54b6-wsdsb                                  2/2     Running   0             40d
rook-ceph-osd-1-b84fd6754-vt8v4                                   2/2     Running   0             40d
rook-ceph-osd-2-7bd7d8568b-g4p8f                                  2/2     Running   0             40d
rook-ceph-rbd-mirror-a-6498df6bf6-nhw69                           2/2     Running   0             49d
rook-ceph-rgw-ocs-storagecluster-cephobjectstore-a-785d6dbrnb7w   2/2     Running   0             53d
rook-ceph-tools-675454f488-qzpc2                                  1/1     Running   0             49d
token-exchange-agent-5d9df8896c-zdg8q                             1/1     Running   0             27d
ux-backend-server-596cb57484-cfhqm                                2/2     Running   0             53d
[system:admin/openshift-storage root ~]$ oc get $RESOURCES
NAME                                                       TYPE                 DATA   AGE
secret/cluster-peer-token-ocs-storagecluster-cephcluster   kubernetes.io/rook   2      53d
secret/rook-ceph-admin-keyring                             kubernetes.io/rook   1      53d
secret/rook-ceph-config                                    kubernetes.io/rook   2      53d
secret/rook-ceph-crash-collector-keyring                   kubernetes.io/rook   1      53d
secret/rook-ceph-exporter-keyring                          kubernetes.io/rook   1      53d
secret/rook-ceph-mgr-a-keyring                             kubernetes.io/rook   1      53d
secret/rook-ceph-mgr-b-keyring                             kubernetes.io/rook   1      53d
secret/rook-ceph-mon                                       kubernetes.io/rook   4      53d
secret/rook-ceph-mons-keyring                              kubernetes.io/rook   1      53d
secret/rook-csi-cephfs-node                                kubernetes.io/rook   2      53d
secret/rook-csi-cephfs-provisioner                         kubernetes.io/rook   2      53d
secret/rook-csi-rbd-node                                   kubernetes.io/rook   2      53d
secret/rook-csi-rbd-provisioner                            kubernetes.io/rook   2      53d

NAME                                DATA   AGE
configmap/rook-ceph-mon-endpoints   5      53d

NAME                         TYPE        CLUSTER-IP       EXTERNAL-IP   PORT(S)    AGE
service/rook-ceph-exporter   ClusterIP   172.30.27.200    <none>        9926/TCP   53d
service/rook-ceph-mgr        ClusterIP   172.30.107.23    <none>        9283/TCP   53d
service/rook-ceph-mon-d      ClusterIP   172.30.112.231   <none>        3300/TCP   49d
service/rook-ceph-mon-e      ClusterIP   172.30.114.57    <none>        3300/TCP   49d
service/rook-ceph-mon-f      ClusterIP   172.30.241.30    <none>        3300/TCP   49d
service/rook-ceph-osd-0      ClusterIP   172.30.22.209    <none>        6800/TCP   49d
service/rook-ceph-osd-1      ClusterIP   172.30.120.167   <none>        6800/TCP   49d
service/rook-ceph-osd-2      ClusterIP   172.30.12.188    <none>        6800/TCP   49d

NAME                                                                   READY   UP-TO-DATE   AVAILABLE   AGE
deployment.apps/rook-ceph-crashcollector-odf0.libvirt2.ocpcluster.cc   1/1     1            1           53d
deployment.apps/rook-ceph-crashcollector-odf1.libvirt2.ocpcluster.cc   1/1     1            1           53d
deployment.apps/rook-ceph-crashcollector-odf2.libvirt2.ocpcluster.cc   1/1     1            1           53d
deployment.apps/rook-ceph-exporter-odf0.libvirt2.ocpcluster.cc         1/1     1            1           53d
deployment.apps/rook-ceph-exporter-odf1.libvirt2.ocpcluster.cc         1/1     1            1           53d
deployment.apps/rook-ceph-exporter-odf2.libvirt2.ocpcluster.cc         1/1     1            1           53d
deployment.apps/rook-ceph-mgr-a                                        1/1     1            1           53d
deployment.apps/rook-ceph-mgr-b                                        1/1     1            1           53d
deployment.apps/rook-ceph-mon-d                                        1/1     1            1           49d
deployment.apps/rook-ceph-mon-e                                        1/1     1            1           49d
deployment.apps/rook-ceph-mon-f                                        1/1     1            1           49d
deployment.apps/rook-ceph-osd-0                                        1/1     1            1           49d
deployment.apps/rook-ceph-osd-1                                        1/1     1            1           49d
deployment.apps/rook-ceph-osd-2                                        1/1     1            1           49d

NAME                                                STATUS   VOLUME              CAPACITY   ACCESS MODES   STORAGECLASS   AGE
persistentvolumeclaim/ocs-deviceset-0-data-0r6qc8   Bound    local-pv-e059f7de   500Gi      RWO            localblock     53d
persistentvolumeclaim/ocs-deviceset-1-data-0rtkmk   Bound    local-pv-9ef1ecc5   500Gi      RWO            localblock     53d
persistentvolumeclaim/ocs-deviceset-2-data-0djbtg   Bound    local-pv-e0c8a328   500Gi      RWO            localblock     53d

[system:admin/openshift-storage root ~]$ oc get deployments | grep rook-ceph-ope
rook-ceph-operator   0/0   0   0   53d
[system:admin/openshift-storage root ~]$ oc scale deployment rook-ceph-operator --replicas 1
deployment.apps/rook-ceph-operator scaled
[system:admin/openshift-storage root ~]$ oc get deploy | grep oper
noobaa-operator                   1/1   1   1   53d
ocs-operator                      1/1   1   1   53d
odf-operator-controller-manager   1/1   1   1   53d
rook-ceph-operator                1/1   1   1   53d

We hang when trying to run ceph -s from the tools pod... according to the rook-ceph-operator logs, we're stuck in a loop trying to reconcile:

2024-06-04 14:28:35.006722 I | op-mon: mons running: [e f d]
2024-06-04 14:28:40.580081 I | ceph-spec: ceph-rbd-mirror-controller: CephCluster "ocs-storagecluster-cephcluster" found but skipping reconcile since ceph health is &{Health:HEALTH_ERR Details:map[error:{Severity:Urgent Message:failed to get status. . timed out: exit status 1}] LastChecked:2024-06-04T14:27:46Z LastChanged: PreviousHealth: Capacity:{TotalBytes:0 UsedBytes:0 AvailableBytes:0 LastUpdated:} Versions:<nil> FSID:}
2024-06-04 14:28:40.879155 I | ceph-spec: ceph-block-pool-controller: CephCluster "ocs-storagecluster-cephcluster" found but skipping reconcile since ceph health is &{Health:HEALTH_ERR Details:map[error:{Severity:Urgent Message:failed to get status. . timed out: exit status 1}] LastChecked:2024-06-04T14:27:46Z LastChanged: PreviousHealth: Capacity:{TotalBytes:0 UsedBytes:0 AvailableBytes:0 LastUpdated:} Versions:<nil> FSID:}
2024-06-04 14:28:41.027479 I | ceph-spec: ceph-fs-subvolumegroup-controller: CephCluster "ocs-storagecluster-cephcluster" found but skipping reconcile since ceph health is &{Health:HEALTH_ERR Details:map[error:{Severity:Urgent Message:failed to get status. . timed out: exit status 1}] LastChecked:2024-06-04T14:27:46Z LastChanged: PreviousHealth: Capacity:{TotalBytes:0 UsedBytes:0 AvailableBytes:0 LastUpdated:} Versions:<nil> FSID:}
2024-06-04 14:28:41.029319 I | ceph-spec: ceph-object-controller: CephCluster "ocs-storagecluster-cephcluster" found but skipping reconcile since ceph health is &{Health:HEALTH_ERR Details:map[error:{Severity:Urgent Message:failed to get status. . timed out: exit status 1}] LastChecked:2024-06-04T14:27:46Z LastChanged: PreviousHealth: Capacity:{TotalBytes:0 UsedBytes:0 AvailableBytes:0 LastUpdated:} Versions:<nil> FSID:}
2024-06-04 14:28:41.380374 I | ceph-spec: ceph-file-controller: CephCluster "ocs-storagecluster-cephcluster" found but skipping reconcile since ceph health is &{Health:HEALTH_ERR Details:map[error:{Severity:Urgent Message:failed to get status. . timed out: exit status 1}] LastChecked:2024-06-04T14:27:46Z LastChanged: PreviousHealth: Capacity:{TotalBytes:0 UsedBytes:0 AvailableBytes:0 LastUpdated:} Versions:<nil> FSID:}
2024-06-04 14:28:41.579268 I | ceph-spec: ceph-file-controller: CephCluster "ocs-storagecluster-cephcluster" found but skipping reconcile since ceph health is &{Health:HEALTH_ERR Details:map[error:{Severity:Urgent Message:failed to get status. . timed out: exit status 1}] LastChecked:2024-06-04T14:27:46Z LastChanged: PreviousHealth: Capacity:{TotalBytes:0 UsedBytes:0 AvailableBytes:0 LastUpdated:} Versions:<nil> FSID:}
2024-06-04 14:28:41.691344 I | ceph-spec: ceph-object-store-user-controller: CephCluster "ocs-storagecluster-cephcluster" found but skipping reconcile since ceph health is &{Health:HEALTH_ERR Details:map[error:{Severity:Urgent Message:failed to get status. . timed out: exit status 1}] LastChecked:2024-06-04T14:27:46Z LastChanged: PreviousHealth: Capacity:{TotalBytes:0 UsedBytes:0 AvailableBytes:0 LastUpdated:} Versions:<nil> FSID:}
2024-06-04 14:28:41.691521 I | ceph-spec: ceph-object-store-user-controller: CephCluster "ocs-storagecluster-cephcluster" found but skipping reconcile since ceph health is &{Health:HEALTH_ERR Details:map[error:{Severity:Urgent Message:failed to get status. . timed out: exit status 1}] LastChecked:2024-06-04T14:27:46Z LastChanged: PreviousHealth: Capacity:{TotalBytes:0 UsedBytes:0 AvailableBytes:0 LastUpdated:} Versions:<nil> FSID:}
2024-06-04 14:28:41.691626 I | ceph-spec: ceph-object-store-user-controller: CephCluster "ocs-storagecluster-cephcluster" found but skipping reconcile since ceph health is &{Health:HEALTH_ERR Details:map[error:{Severity:Urgent Message:failed to get status. . timed out: exit status 1}] LastChecked:2024-06-04T14:27:46Z LastChanged: PreviousHealth: Capacity:{TotalBytes:0 UsedBytes:0 AvailableBytes:0 LastUpdated:} Versions:<nil> FSID:}
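To narrow down where the hang comes from, one option is to query a mon's admin socket directly, since that path does not depend on client keyrings, the toolbox, or the operator. A minimal sketch, assuming the mon container in the Rook mon pods is named "mon" (adjust the mon id to match your deployments, e.g. "d" above):

oc -n openshift-storage exec deploy/rook-ceph-mon-d -c mon -- ceph daemon mon.d quorum_status
oc -n openshift-storage exec deploy/rook-ceph-mon-d -c mon -- ceph daemon mon.d mon_status

If the admin socket answers but ceph -s does not, the mons likely have quorum and the problem is more likely on the client/auth side, e.g. the restored rook-ceph-mon secret or rook-ceph-mon-endpoints configmap not matching the running cluster.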
(In reply to kelwhite from comment #5)
> Santosh,
>
> Yes, that's correct. Upstream has a disaster recovery doc [1] that we were
> hoping to get a +1 from engineering to use, to hopefully stop the deletion
> of the cephcluster resource.
>
> [1]
> https://www.rook.io/docs/rook/v1.14/Troubleshooting/disaster-recovery/
> #restoring-crds-after-deletion

The upstream doc to restore the CRDs should work.
(In reply to kelwhite from comment #6)
> I ran through this process on a lab machine. One thing I noticed is that as
> soon as I remove the finalizer from the cephcluster CR, it gets recreated...
> might need to scale down the ocs-operator deployment as well? Here are my
> findings:

That's right. Need to stop the OCS operator deployment as the first step here.
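For reference, the overall sequence from the upstream doc, with the operator scale-down added as the first step, looks roughly like this (a sketch built from the commands already shown in comment #6; adjust resource names to the cluster at hand):

# 1. Stop the operators so nothing reconciles or recreates objects mid-procedure:
oc -n openshift-storage scale deployment ocs-operator rook-ceph-operator --replicas 0

# 2. Back up the CephCluster CR plus the secrets and configmaps before touching anything:
oc -n openshift-storage get cephcluster -o yaml > cluster.yaml
oc -n openshift-storage get secrets -o yaml > secrets.yaml
oc -n openshift-storage get cm -o yaml > configmaps.yaml

# 3. Strip the ownerReferences that point at the CephCluster UID, so the child
#    resources (mon secrets, services, deployments, OSD PVCs) survive the deletion
#    (see the ROOK_UID/RESOURCES patch loop in comment #6).

# 4. Remove the finalizer so the stuck CR can actually go away:
oc -n openshift-storage patch cephcluster/ocs-storagecluster-cephcluster --type json \
  --patch='[ { "op": "remove", "path": "/metadata/finalizers" } ]'

# 5. Recreate the CR from the cleaned-up backup, then scale the operators back up:
oc -n openshift-storage create -f cluster.yaml
oc -n openshift-storage scale deployment ocs-operator rook-ceph-operator --replicas 1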
Thanks, I will give that a go and report the outcome.
Hi,

I ran the steps on a 4.15 cluster with both the ocs and rook-ceph operators scaled down, and the same results as above happened. The rook-ceph operator is having a hard time reconciling:

[kube:admin/openshift-storage root ~]$ oc scale deployment rook-ceph-operator ocs-operator --replicas 0
deployment.apps/rook-ceph-operator scaled
deployment.apps/ocs-operator scaled
[kube:admin/openshift-storage root ~]$ oc get pods | grep oper
noobaa-operator-769b96d865-vvr7q                   1/1   Running   0   28d
odf-operator-controller-manager-5d5bbccf5f-tqrcf   2/2   Running   0   28d
[kube:admin/openshift-storage root ~]$ mkdir backups
[kube:admin/openshift-storage root ~]$ cd backups/
[kube:admin/openshift-storage root backups]$ oc get cephcluster -o yaml > cluster.yaml
[kube:admin/openshift-storage root backups]$ oc get secrets -o yaml > secrets.yaml
[kube:admin/openshift-storage root backups]$ oc get cm -o yaml > configmaps.yaml
[kube:admin/openshift-storage root backups]$ oc get cephcluster ocs-storagecluster-cephcluster -o 'jsonpath={.metadata.uid}'
830b951e-46b8-4ba8-9c65-a5e6fa631dd2
[kube:admin/openshift-storage root backups]$
[kube:admin/openshift-storage root backups]$ ROOK_UID=$(oc get cephcluster ocs-storagecluster-cephcluster -o 'jsonpath={.metadata.uid}')
[system:admin/openshift-storage root backups]$ oc get cephcluster ocs-storagecluster-cephcluster -o 'jsonpath={.metadata.uid}'
c937d8af-2550-40d6-9f3d-d8a97b1affec
[system:admin/openshift-storage root backups]$ ROOK_UID=$(oc get cephcluster ocs-storagecluster-cephcluster -o 'jsonpath={.metadata.uid}')
[system:admin/openshift-storage root backups]$ RESOURCES=$(kubectl get secret,configmap,service,deployment,pvc -o jsonpath='{range .items[?(@.metadata.ownerReferences[*].uid=="'"$ROOK_UID"'")]}{.kind}{"/"}{.metadata.name}{"\n"}{end}')
[system:admin/openshift-storage root backups]$ kubectl get $RESOURCES
NAME                                                       TYPE                 DATA   AGE
secret/cluster-peer-token-ocs-storagecluster-cephcluster   kubernetes.io/rook   2      55d
secret/rook-ceph-admin-keyring                             kubernetes.io/rook   1      55d
secret/rook-ceph-config                                    kubernetes.io/rook   2      55d
secret/rook-ceph-crash-collector-keyring                   kubernetes.io/rook   1      55d
secret/rook-ceph-exporter-keyring                          kubernetes.io/rook   1      55d
secret/rook-ceph-mgr-a-keyring                             kubernetes.io/rook   1      55d
secret/rook-ceph-mgr-b-keyring                             kubernetes.io/rook   1      55d
secret/rook-ceph-mon                                       kubernetes.io/rook   4      55d
secret/rook-ceph-mons-keyring                              kubernetes.io/rook   1      55d
secret/rook-csi-cephfs-node                                kubernetes.io/rook   2      55d
secret/rook-csi-cephfs-provisioner                         kubernetes.io/rook   2      55d
secret/rook-csi-rbd-node                                   kubernetes.io/rook   2      55d
secret/rook-csi-rbd-provisioner                            kubernetes.io/rook   2      55d

NAME                                DATA   AGE
configmap/rook-ceph-mon-endpoints   5      55d

NAME                         TYPE        CLUSTER-IP       EXTERNAL-IP   PORT(S)    AGE
service/rook-ceph-exporter   ClusterIP   172.30.11.29     <none>        9926/TCP   55d
service/rook-ceph-mgr        ClusterIP   172.30.132.136   <none>        9283/TCP   55d
service/rook-ceph-mon-d      ClusterIP   172.30.86.183    <none>        3300/TCP   50d
service/rook-ceph-mon-e      ClusterIP   172.30.113.197   <none>        3300/TCP   50d
service/rook-ceph-mon-f      ClusterIP   172.30.168.125   <none>        3300/TCP   50d
service/rook-ceph-osd-0      ClusterIP   172.30.45.34     <none>        6800/TCP   50d
service/rook-ceph-osd-1      ClusterIP   172.30.245.247   <none>        6800/TCP   50d
service/rook-ceph-osd-2      ClusterIP   172.30.208.229   <none>        6800/TCP   50d

NAME                                                                   READY   UP-TO-DATE   AVAILABLE   AGE
deployment.apps/rook-ceph-crashcollector-odf0.libvirt3.ocpcluster.cc   1/1     1            1           55d
deployment.apps/rook-ceph-crashcollector-odf1.libvirt3.ocpcluster.cc   1/1     1            1           55d
deployment.apps/rook-ceph-crashcollector-odf2.libvirt3.ocpcluster.cc   1/1     1            1           55d
deployment.apps/rook-ceph-exporter-odf0.libvirt3.ocpcluster.cc         1/1     1            1           55d
deployment.apps/rook-ceph-exporter-odf1.libvirt3.ocpcluster.cc         1/1     1            1           55d
deployment.apps/rook-ceph-exporter-odf2.libvirt3.ocpcluster.cc         1/1     1            1           55d
deployment.apps/rook-ceph-mgr-a                                        1/1     1            1           55d
deployment.apps/rook-ceph-mgr-b                                        1/1     1            1           55d
deployment.apps/rook-ceph-mon-d                                        1/1     1            1           50d
deployment.apps/rook-ceph-mon-e                                        1/1     1            1           50d
deployment.apps/rook-ceph-mon-f                                        1/1     1            1           50d
deployment.apps/rook-ceph-osd-0                                        1/1     1            1           50d
deployment.apps/rook-ceph-osd-1                                        1/1     1            1           50d
deployment.apps/rook-ceph-osd-2                                        1/1     1            1           50d

NAME                                                STATUS   VOLUME              CAPACITY   ACCESS MODES   STORAGECLASS   AGE
persistentvolumeclaim/ocs-deviceset-0-data-0mwxd4   Bound    local-pv-7678b776   500Gi      RWO            localblock     55d
persistentvolumeclaim/ocs-deviceset-1-data-0dwdbm   Bound    local-pv-2732353f   500Gi      RWO            localblock     55d
persistentvolumeclaim/ocs-deviceset-2-data-0x72z6   Bound    local-pv-b1185a6d   500Gi      RWO            localblock     55d

[system:admin/openshift-storage root backups]$ for resource in $(oc -n openshift-storage get $RESOURCES -o name)
> do
>   oc -n openshift-storage patch $resource -p '{"metadata": {"ownerReferences":null}}'
> done
secret/cluster-peer-token-ocs-storagecluster-cephcluster patched
secret/rook-ceph-admin-keyring patched
secret/rook-ceph-config patched
secret/rook-ceph-crash-collector-keyring patched
secret/rook-ceph-exporter-keyring patched
secret/rook-ceph-mgr-a-keyring patched
secret/rook-ceph-mgr-b-keyring patched
secret/rook-ceph-mon patched
secret/rook-ceph-mons-keyring patched
secret/rook-csi-cephfs-node patched
secret/rook-csi-cephfs-provisioner patched
secret/rook-csi-rbd-node patched
secret/rook-csi-rbd-provisioner patched
configmap/rook-ceph-mon-endpoints patched
service/rook-ceph-exporter patched
service/rook-ceph-mgr patched
service/rook-ceph-mon-d patched
service/rook-ceph-mon-e patched
service/rook-ceph-mon-f patched
service/rook-ceph-osd-0 patched
service/rook-ceph-osd-1 patched
service/rook-ceph-osd-2 patched
deployment.apps/rook-ceph-crashcollector-odf0.libvirt3.ocpcluster.cc patched
deployment.apps/rook-ceph-crashcollector-odf1.libvirt3.ocpcluster.cc patched
deployment.apps/rook-ceph-crashcollector-odf2.libvirt3.ocpcluster.cc patched
deployment.apps/rook-ceph-exporter-odf0.libvirt3.ocpcluster.cc patched
deployment.apps/rook-ceph-exporter-odf1.libvirt3.ocpcluster.cc patched
deployment.apps/rook-ceph-exporter-odf2.libvirt3.ocpcluster.cc patched
deployment.apps/rook-ceph-mgr-a patched
deployment.apps/rook-ceph-mgr-b patched
deployment.apps/rook-ceph-mon-d patched
deployment.apps/rook-ceph-mon-e patched
deployment.apps/rook-ceph-mon-f patched
deployment.apps/rook-ceph-osd-0 patched
deployment.apps/rook-ceph-osd-1 patched
deployment.apps/rook-ceph-osd-2 patched
persistentvolumeclaim/ocs-deviceset-0-data-0mwxd4 patched
persistentvolumeclaim/ocs-deviceset-1-data-0dwdbm patched
persistentvolumeclaim/ocs-deviceset-2-data-0x72z6 patched
[system:admin/openshift-storage root backups]$ oc delete cephcluster ocs-storagecluster-cephcluster
cephcluster.ceph.rook.io "ocs-storagecluster-cephcluster" deleted
[system:admin/openshift-storage root backups]$ oc get cephcluster
No resources found in openshift-storage namespace.
[system:admin/openshift-storage root backups]$ oc create -f cluster.yaml
cephcluster.ceph.rook.io/ocs-storagecluster-cephcluster created
[system:admin/openshift-storage root backups]$

// odf operator logs:
2024-06-05 00:01:08.160537 I | ceph-spec: ceph-file-controller: CephCluster "ocs-storagecluster-cephcluster" found but skipping reconcile since ceph health is &{Health:HEALTH_ERR Details:map[error:{Severity:Urgent Message:failed to get status. . timed out: exit status 1}] LastChecked:2024-06-05T00:00:32Z LastChanged: PreviousHealth: Capacity:{TotalBytes:0 UsedBytes:0 AvailableBytes:0 LastUpdated:} Versions:<nil> FSID:}
2024-06-05 00:01:08.238233 I | ceph-spec: ceph-object-controller: CephCluster "ocs-storagecluster-cephcluster" found but skipping reconcile since ceph health is &{Health:HEALTH_ERR Details:map[error:{Severity:Urgent Message:failed to get status. . timed out: exit status 1}] LastChecked:2024-06-05T00:00:32Z LastChanged: PreviousHealth: Capacity:{TotalBytes:0 UsedBytes:0 AvailableBytes:0 LastUpdated:} Versions:<nil> FSID:}
2024-06-05 00:01:08.712411 I | ceph-spec: ceph-rbd-mirror-controller: CephCluster "ocs-storagecluster-cephcluster" found but skipping reconcile since ceph health is &{Health:HEALTH_ERR Details:map[error:{Severity:Urgent Message:failed to get status. . timed out: exit status 1}] LastChecked:2024-06-05T00:00:32Z LastChanged: PreviousHealth: Capacity:{TotalBytes:0 UsedBytes:0 AvailableBytes:0 LastUpdated:} Versions:<nil> FSID:}
2024-06-05 00:01:08.809975 I | ceph-spec: ceph-object-store-user-controller: CephCluster "ocs-storagecluster-cephcluster" found but skipping reconcile since ceph health is &{Health:HEALTH_ERR Details:map[error:{Severity:Urgent Message:failed to get status. . timed out: exit status 1}] LastChecked:2024-06-05T00:00:32Z LastChanged: PreviousHealth: Capacity:{TotalBytes:0 UsedBytes:0 AvailableBytes:0 LastUpdated:} Versions:<nil> FSID:}
2024-06-05 00:01:08.912430 I | ceph-spec: ceph-block-pool-controller: CephCluster "ocs-storagecluster-cephcluster" found but skipping reconcile since ceph health is &{Health:HEALTH_ERR Details:map[error:{Severity:Urgent Message:failed to get status. . timed out: exit status 1}] LastChecked:2024-06-05T00:00:32Z LastChanged: PreviousHealth: Capacity:{TotalBytes:0 UsedBytes:0 AvailableBytes:0 LastUpdated:} Versions:<nil> FSID:}
2024-06-05 00:01:09.108841 I | ceph-spec: ceph-object-store-user-controller: CephCluster "ocs-storagecluster-cephcluster" found but skipping reconcile since ceph health is &{Health:HEALTH_ERR Details:map[error:{Severity:Urgent Message:failed to get status. . timed out: exit status 1}] LastChecked:2024-06-05T00:00:32Z LastChanged: PreviousHealth: Capacity:{TotalBytes:0 UsedBytes:0 AvailableBytes:0 LastUpdated:} Versions:<nil> FSID:}
2024-06-05 00:01:09.109023 I | ceph-spec: ceph-object-store-user-controller: CephCluster "ocs-storagecluster-cephcluster" found but skipping reconcile since ceph health is &{Health:HEALTH_ERR Details:map[error:{Severity:Urgent Message:failed to get status. . timed out: exit status 1}] LastChecked:2024-06-05T00:00:32Z LastChanged: PreviousHealth: Capacity:{TotalBytes:0 UsedBytes:0 AvailableBytes:0 LastUpdated:} Versions:<nil> FSID:}
2024-06-05 00:01:09.161389 I | ceph-spec: ceph-fs-subvolumegroup-controller: CephCluster "ocs-storagecluster-cephcluster" found but skipping reconcile since ceph health is &{Health:HEALTH_ERR Details:map[error:{Severity:Urgent Message:failed to get status. . timed out: exit status 1}] LastChecked:2024-06-05T00:00:32Z LastChanged: PreviousHealth: Capacity:{TotalBytes:0 UsedBytes:0 AvailableBytes:0 LastUpdated:} Versions:<nil> FSID:}

I'll do one final test on ODF 4.12 to triple-check this still doesn't work, but... based on these results so far, I don't feel very confident giving this to the customer.
Hi,

I still haven't had time to do any of the testing [1,2]... I might be able to get to it soon...

[1] https://github.com/rook/kubectl-rook-ceph/blob/master/docs/crd.md
[2] https://github.com/red-hat-storage/odf-cli?tab=readme-ov-file#odf-cli
Parth,

Following [1] to install krew as I don't have this, then following [2] to install the ...

###################################################################################################

kelson@quorra:~$ ( set -x; cd "$(mktemp -d)" && OS="$(uname | tr '[:upper:]' '[:lower:]')" && ARCH="$(uname -m | sed -e 's/x86_64/amd64/' -e 's/\(arm\)\(64\)\?.*/\1\2/' -e 's/aarch64$/arm64/')" && KREW="krew-${OS}_${ARCH}" && curl -fsSLO "https://github.com/kubernetes-sigs/krew/releases/latest/download/${KREW}.tar.gz" && tar zxvf "${KREW}.tar.gz" && ./"${KREW}" install krew )
++ mktemp -d
+ cd /tmp/tmp.hcY1I4FORZ
++ uname
++ tr '[:upper:]' '[:lower:]'
+ OS=linux
++ uname -m
++ sed -e s/x86_64/amd64/ -e 's/\(arm\)\(64\)\?.*/\1\2/' -e 's/aarch64$/arm64/'
+ ARCH=amd64
+ KREW=krew-linux_amd64
+ curl -fsSLO https://github.com/kubernetes-sigs/krew/releases/latest/download/krew-linux_amd64.tar.gz
+ tar zxvf krew-linux_amd64.tar.gz
./LICENSE
./krew-linux_amd64
+ ./krew-linux_amd64 install krew
Adding "default" plugin index from https://github.com/kubernetes-sigs/krew-index.git.
Updated the local copy of plugin index.
Installing plugin: krew
Installed plugin: krew
\
 | Use this plugin:
 |      kubectl krew
 | Documentation:
 |      https://krew.sigs.k8s.io/
 | Caveats:
 | \
 |  | krew is now installed! To start using kubectl plugins, you need to add
 |  | krew's installation directory to your PATH:
 |  |
 |  |   * macOS/Linux:
 |  |     - Add the following to your ~/.bashrc or ~/.zshrc:
 |  |         export PATH="${KREW_ROOT:-$HOME/.krew}/bin:$PATH"
 |  |     - Restart your shell.
 |  |
 |  |   * Windows: Add %USERPROFILE%\.krew\bin to your PATH environment variable
 |  |
 |  | To list krew commands and to get help, run:
 |  |   $ kubectl krew
 |  | For a full list of available plugins, run:
 |  |   $ kubectl krew search
 |  |
 |  | You can find documentation at
 |  |   https://krew.sigs.k8s.io/docs/user-guide/quickstart/.
 | /
/
kelson@quorra:~$ export PATH="${KREW_ROOT:-$HOME/.krew}/bin:$PATH"
kelson@quorra:~$ exit
kelson@quorra:~$ kubectl krew
krew is the kubectl plugin manager.
You can invoke krew through kubectl: "kubectl krew [command]..."

Usage:
  kubectl krew [command]

Available Commands:
  help        Help about any command
  index       Manage custom plugin indexes
  info        Show information about an available plugin
  install     Install kubectl plugins
  list        List installed kubectl plugins
  search      Discover kubectl plugins
  uninstall   Uninstall plugins
  update      Update the local copy of the plugin index
  upgrade     Upgrade installed plugins to newer versions
  version     Show krew version and diagnostics

Flags:
  -h, --help      help for krew
  -v, --v Level   number for the log level verbosity

Use "kubectl krew [command] --help" for more information about a command.

###################################################################################################

Moving on to [2] to install the rook-ceph krew plugin...

kelson@quorra:~$ kubectl krew install rook-ceph
Updated the local copy of plugin index.
Installing plugin: rook-ceph
Installed plugin: rook-ceph
\
 | Use this plugin:
 |      kubectl rook-ceph
 | Documentation:
 |      https://github.com/rook/kubectl-rook-ceph
/
WARNING: You installed plugin "rook-ceph" from the krew-index plugin repository.
   These plugins are not audited for security by the Krew maintainers.
   Run them at your own risk.

###################################################################################################

Now using [3]... to restore a deleted CR, in this case the cephcluster CR.
kelson@quorra:~$ oc get cephcluster
NAME                             DATADIRHOSTPATH   MONCOUNT   AGE    PHASE   MESSAGE                        HEALTH      EXTERNAL   FSID
ocs-storagecluster-cephcluster   /var/lib/rook     3          110m   Ready   Cluster created successfully   HEALTH_OK              a45b02f0-5d1c-4d4a-bbc9-80ca55b64e7d
kelson@quorra:~$ oc delete cephcluster ocs-storagecluster-cephcluster
cephcluster.ceph.rook.io "ocs-storagecluster-cephcluster" deleted
^Ckelson@quorra:~$ oc get cephcluster
NAME                             DATADIRHOSTPATH   MONCOUNT   AGE    PHASE      MESSAGE                    HEALTH      EXTERNAL   FSID
ocs-storagecluster-cephcluster   /var/lib/rook     3          111m   Deleting   Deleting the CephCluster   HEALTH_OK              a45b02f0-5d1c-4d4a-bbc9-80ca55b64e7d
kelson@quorra:~$ kubectl rook-ceph restore-deleted cephcluster ocs-storagecluster-cephcluster
Error: operator namespace 'rook-ceph' does not exist. namespaces "rook-ceph" not found
kelson@quorra:~$ kubectl rook-ceph restore-deleted cephcluster ocs-storagecluster-cephcluster -n openshift-storage
Info: Detecting which resources to restore for crd "cephcluster"
Error: Failed to list resources for crd the server could not find the requested resource

Sadly, this process won't work :(. I think we have some confusion here?... the CRD 'cephclusters.ceph.rook.io' isn't in a deleting phase... the resource 'cephcluster' and the object 'ocs-storagecluster-cephcluster' are:

kelson@quorra:~$ oc get crd cephclusters.ceph.rook.io -o yaml | less
apiVersion: apiextensions.k8s.io/v1
kind: CustomResourceDefinition
metadata:
  annotations:
    controller-gen.kubebuilder.io/version: v0.11.3
    operatorframework.io/installed-alongside-aa5a5720474eade5: openshift-storage/ocs-operator.v4.15.3-rhodf
  creationTimestamp: "2024-06-11T18:43:29Z"
  generation: 1
  labels:
    olm.managed: "true"
    operators.coreos.com/ocs-operator.openshift-storage: ""
  name: cephclusters.ceph.rook.io
  resourceVersion: "37524"
  uid: bdbf2fec-cd45-4e54-a976-516d12d3fb84
...
status:
  acceptedNames:
    kind: CephCluster
    listKind: CephClusterList
    plural: cephclusters
    singular: cephcluster
  conditions:
  - lastTransitionTime: "2024-06-11T18:43:29Z"
    message: no conflicts found
    reason: NoConflicts
    status: "True"
    type: NamesAccepted
  - lastTransitionTime: "2024-06-11T18:43:29Z"
    message: the initial names have been accepted
    reason: InitialNamesAccepted
    status: "True"
    type: Established
  storedVersions:
  - v1
^Ckelson@quorra:~$ oc get cephcluster
NAME                             DATADIRHOSTPATH   MONCOUNT   AGE    PHASE      MESSAGE                    HEALTH      EXTERNAL   FSID
ocs-storagecluster-cephcluster   /var/lib/rook     3          111m   Deleting   Deleting the CephCluster   HEALTH_OK              a45b02f0-5d1c-4d4a-bbc9-80ca55b64e7d

Can you clarify my ignorance? Does the CRD 'cephclusters.ceph.rook.io' get passed to the 'StorageCluster', and then the StorageCluster uses this to create the resource 'cephcluster' and the object 'ocs-storagecluster-cephcluster'?

[1] https://krew.sigs.k8s.io/docs/user-guide/setup/install/
[2] https://github.com/rook/kubectl-rook-ceph/tree/master
[3] https://github.com/rook/kubectl-rook-ceph/blob/master/docs/crd.md
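My own rough understanding, for what it's worth: the CRD 'cephclusters.ceph.rook.io' is just the cluster-scoped type definition, and it is not what is stuck here. The CephCluster CR 'ocs-storagecluster-cephcluster' is a namespaced instance of that type, which the ocs-operator creates from the StorageCluster; that is why StorageCluster appears in the CR's ownerReferences earlier in this bug. A quick way to see the two side by side:

oc get crd cephclusters.ceph.rook.io   # the type definition, cluster-scoped, not deleting
oc -n openshift-storage get cephcluster ocs-storagecluster-cephcluster -o jsonpath='{.metadata.ownerReferences[0].kind}{"\n"}'   # prints the owning kind, StorageCluster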
Parth,

This is the downstream testing [1]...

###################################################################################################

kelson@quorra:~$ cd git/
kelson@quorra:~/git$ git clone https://github.com/red-hat-storage/odf-cli.git
Cloning into 'odf-cli'...
remote: Enumerating objects: 408, done.
remote: Counting objects: 100% (165/165), done.
remote: Compressing objects: 100% (84/84), done.
remote: Total 408 (delta 98), reused 86 (delta 80), pack-reused 243
Receiving objects: 100% (408/408), 168.72 KiB | 1.17 MiB/s, done.
Resolving deltas: 100% (190/190), done.
kelson@quorra:~/git$ cd odf-cli/ && make build
gofmt -w ./pkg/rook/logs.go ./pkg/rook/osd/osd.go ./cmd/odf/subvolume/subvolume.go ./cmd/odf/restore/crds.go ./cmd/odf/restore/mon_quorum.go ./cmd/odf/restore/restore.go ./cmd/odf/maintenance/maintenance.go ./cmd/odf/maintenance/start.go ./cmd/odf/maintenance/stop.go ./cmd/odf/set/set.go ./cmd/odf/set/log_level.go ./cmd/odf/set/set_recovery_profile.go ./cmd/odf/set/backfillfull_ratio.go ./cmd/odf/set/full_ratio.go ./cmd/odf/set/nearfull_ratio.go ./cmd/odf/set/ceph.go ./cmd/odf/get/health.go ./cmd/odf/get/rook_status.go ./cmd/odf/get/mon_endpoints.go ./cmd/odf/get/get.go ./cmd/odf/get/dr_health.go ./cmd/odf/get/get_recovery_profile.go ./cmd/odf/operator/operator.go ./cmd/odf/operator/rook/set.go ./cmd/odf/operator/rook/restart.go ./cmd/odf/operator/rook/rook.go ./cmd/odf/main.go ./cmd/odf/root/root.go ./cmd/odf/purgeosd/purge_osd.go
env GOOS=linux GOARCH=amd64 go build -o bin/odf cmd/odf/main.go
go: downloading github.com/rook/kubectl-rook-ceph v0.9.0
go: downloading github.com/spf13/cobra v1.8.0
go: downloading github.com/pkg/errors v0.9.1
go: downloading github.com/rook/rook v1.14.5
go: downloading k8s.io/apimachinery v0.29.3
go: downloading k8s.io/client-go v0.29.3
go: downloading github.com/spf13/pflag v1.0.5
go: downloading github.com/fatih/color v1.16.0
go: downloading gopkg.in/yaml.v3 v3.0.1
go: downloading github.com/golang/mock v1.6.0
go: downloading k8s.io/api v0.29.3
go: downloading github.com/rook/rook/pkg/apis v0.0.0-20240327171914-dc534051324b
go: downloading github.com/imdario/mergo v0.3.16
go: downloading golang.org/x/term v0.18.0
go: downloading k8s.io/klog/v2 v2.120.1
go: downloading golang.org/x/net v0.23.0
go: downloading k8s.io/utils v0.0.0-20240310230437-4693a0247e57
go: downloading github.com/gogo/protobuf v1.3.2
go: downloading github.com/google/gofuzz v1.2.0
go: downloading sigs.k8s.io/yaml v1.4.0
go: downloading sigs.k8s.io/json v0.0.0-20221116044647-bc3834ca7abd
go: downloading github.com/golang/protobuf v1.5.4
go: downloading github.com/google/gnostic-models v0.6.8
go: downloading sigs.k8s.io/structured-merge-diff/v4 v4.4.1
go: downloading github.com/gorilla/websocket v1.5.1
go: downloading golang.org/x/time v0.5.0
go: downloading golang.org/x/oauth2 v0.18.0
go: downloading gopkg.in/inf.v0 v0.9.1
go: downloading github.com/mattn/go-colorable v0.1.13
go: downloading github.com/mattn/go-isatty v0.0.20
go: downloading golang.org/x/sys v0.18.0
go: downloading k8s.io/kube-openapi v0.0.0-20240322212309-b815d8309940
go: downloading github.com/hashicorp/vault/api v1.12.2
go: downloading github.com/k8snetworkplumbingwg/network-attachment-definition-client v1.6.0
go: downloading github.com/kube-object-storage/lib-bucket-provisioner v0.0.0-20221122204822-d1a8c34382f1
go: downloading github.com/libopenstorage/secrets v0.0.0-20231011182615-5f4b25ceede1
go: downloading github.com/openshift/api v0.0.0-20240328065759-f8aa75d189e1
go: downloading github.com/davecgh/go-spew v1.1.2-0.20180830191138-d8f796af33cc
go: downloading github.com/go-logr/logr v1.4.1
go: downloading google.golang.org/protobuf v1.33.0
go: downloading github.com/moby/spdystream v0.2.0
go: downloading golang.org/x/text v0.14.0
go: downloading github.com/json-iterator/go v1.1.12
go: downloading gopkg.in/yaml.v2 v2.4.0
go: downloading github.com/cenkalti/backoff/v3 v3.2.2
go: downloading github.com/hashicorp/errwrap v1.1.0
go: downloading github.com/go-jose/go-jose/v3 v3.0.3
go: downloading github.com/hashicorp/go-cleanhttp v0.5.2
go: downloading github.com/hashicorp/go-multierror v1.1.1
go: downloading github.com/hashicorp/go-retryablehttp v0.7.5
go: downloading github.com/hashicorp/go-rootcerts v1.0.2
go: downloading github.com/hashicorp/go-secure-stdlib/parseutil v0.1.8
go: downloading github.com/hashicorp/go-secure-stdlib/strutil v0.1.2
go: downloading github.com/hashicorp/hcl v1.0.1-vault-5
go: downloading github.com/mitchellh/mapstructure v1.5.0
go: downloading github.com/sirupsen/logrus v1.9.3
go: downloading github.com/google/uuid v1.6.0
go: downloading github.com/munnerz/goautoneg v0.0.0-20191010083416-a7dc8b61c822
go: downloading github.com/mxk/go-flowrate v0.0.0-20140419014527-cca7078d478f
go: downloading github.com/go-openapi/jsonreference v0.21.0
go: downloading github.com/go-openapi/swag v0.23.0
go: downloading github.com/containernetworking/cni v1.1.2
go: downloading github.com/modern-go/concurrent v0.0.0-20180306012644-bacd9c7ef1dd
go: downloading github.com/hashicorp/vault/api/auth/approle v0.6.0
go: downloading github.com/hashicorp/vault/api/auth/kubernetes v0.6.0
go: downloading github.com/ryanuber/go-glob v1.0.0
go: downloading github.com/emicklei/go-restful/v3 v3.12.0
go: downloading github.com/modern-go/reflect2 v1.0.2
go: downloading github.com/hashicorp/go-sockaddr v1.0.6
go: downloading github.com/go-openapi/jsonpointer v0.21.0
go: downloading github.com/mailru/easyjson v0.7.7
go: downloading golang.org/x/crypto v0.21.0
go: downloading github.com/josharian/intern v1.0.0
kelson@quorra:~/git/odf-cli$ ./bin/odf -h
Management and troubleshooting tools for ODF clusters.

Usage:
  odf [command]

Available Commands:
  get         Get ODF configuration
  help        Help about any command
  purge-osd   Permanently remove an OSD from the cluster.
  set         Set ODF configuration
  subvolume   Manages subvolumes

Flags:
      --context string              Openshift context to use
  -h, --help                        help for odf
      --kubeconfig string           Openshift config path
  -n, --namespace string            Openshift namespace where the StorageCluster CR is created (default "openshift-storage")
      --operator-namespace string   Openshift namespace where the ODF operator is running

Use "odf [command] --help" for more information about a command.

Reviewing the available commands and the docs section [2] of the GitHub repo, I was hoping to have an option like 'restore-deleted', but I don't see anything that would assist with this. Can you elaborate on what options would help?

[1] https://github.com/red-hat-storage/odf-cli?tab=readme-ov-file#odf-cli
Whoops, forgot [2].

[2] https://github.com/red-hat-storage/odf-cli/tree/main/docs
Parth,

Ah, 'restore' isn't part of the top-level help output, but 'deleted' is once you pass 'restore'. Can this be an informal request to add it?

###################################################################################################

kelson@quorra:~$ odf
Management and troubleshooting tools for ODF clusters.

Usage:
  odf [command]

Available Commands:
  get         Get ODF configuration
  help        Help about any command
  purge-osd   Permanently remove an OSD from the cluster.
  set         Set ODF configuration
  subvolume   Manages subvolumes

Flags:
      --context string              Openshift context to use
  -h, --help                        help for odf
      --kubeconfig string           Openshift config path
  -n, --namespace string            Openshift namespace where the StorageCluster CR is created (default "openshift-storage")
      --operator-namespace string   Openshift namespace where the ODF operator is running

Use "odf [command] --help" for more information about a command.

kelson@quorra:~$ odf restore
Usage:
  odf restore [command]

Available Commands:
  deleted      Restores a CR that was accidentally deleted and is still in terminating state.
  mon-quorum   When quorum is lost, restore quorum to the remaining healthy mon

Flags:
  -h, --help   help for restore

Global Flags:
      --context string              Openshift context to use
      --kubeconfig string           Openshift config path
  -n, --namespace string            Openshift namespace where the StorageCluster CR is created (default "openshift-storage")
      --operator-namespace string   Openshift namespace where the ODF operator is running

Use "odf restore [command] --help" for more information about a command.

###################################################################################################

Anyways, here are my results using [1]:

kelson@quorra:~$ oc get cephcluster
NAME                             DATADIRHOSTPATH   MONCOUNT   AGE   PHASE      MESSAGE                    HEALTH      EXTERNAL   FSID
ocs-storagecluster-cephcluster   /var/lib/rook     3          18h   Deleting   Deleting the CephCluster   HEALTH_OK              a45b02f0-5d1c-4d4a-bbc9-80ca55b64e7d
kelson@quorra:~$ odf restore deleted cephcluster
Info: Detecting which resources to restore for crd "cephcluster"
Error: Failed to list resources for crd the server could not find the requested resource
kelson@quorra:~$ odf restore deleted cephcluster ocs-storagecluster-cephcluster
Info: Detecting which resources to restore for crd "cephcluster"
Error: Failed to list resources for crd the server could not find the requested resource
kelson@quorra:~$ odf restore deleted cephcluster ocs-storagecluster-cephcluster -n openshift-storage
Error: accepts between 1 and 2 arg(s), received 4
Usage:
  odf restore deleted [flags]

Examples:
odf restore deleted <CRD> [CRNAME]

Flags:
  -h, --help   help for deleted

Global Flags:
      --context string              Openshift context to use
      --kubeconfig string           Openshift config path
  -n, --namespace string            Openshift namespace where the StorageCluster CR is created (default "openshift-storage")
      --operator-namespace string   Openshift namespace where the ODF operator is running

Error: accepts between 1 and 2 arg(s), received 4
kelson@quorra:~$ odf -n openshift-storage restore deleted cephcluster
Info: Detecting which resources to restore for crd "cephcluster"
Error: Failed to list resources for crd the server could not find the requested resource
kelson@quorra:~$ odf -n openshift-storage restore deleted cephcluster ocs-storagecluster-cephcluster
Info: Detecting which resources to restore for crd "cephcluster"
Error: Failed to list resources for crd the server could not find the requested resource

[1] https://github.com/red-hat-storage/odf-cli/blob/main/docs/restore.md#deleted
Parth,

I don't understand the ask. The cephcluster is there:

kelson@quorra:~$ oc get cephcluster
NAME                             DATADIRHOSTPATH   MONCOUNT   AGE   PHASE      MESSAGE                    HEALTH      EXTERNAL   FSID
ocs-storagecluster-cephcluster   /var/lib/rook     3          18h   Deleting   Deleting the CephCluster   HEALTH_OK
Parth,

I can upload the rook-ceph-operator log, but it's mainly just spam of these two lines and doesn't seem to be very helpful:

2024-06-14 12:44:29.905882 E | ceph-cluster-controller: failed to reconcile CephCluster "openshift-storage/ocs-storagecluster-cephcluster". CephCluster "openshift-storage/ocs-storagecluster-cephcluster" will not be deleted until all dependents are removed: CephBlockPool: [ocs-storagecluster-cephblockpool ocs-storagecluster-cephblockpool-us-east-2a ocs-storagecluster-cephblockpool-us-east-2b ocs-storagecluster-cephblockpool-us-east-2c], CephFilesystem: [ocs-storagecluster-cephfilesystem], CephFilesystemSubVolumeGroup: [ocs-storagecluster-cephfilesystem-csi]
2024-06-14 12:44:39.981115 I | ceph-cluster-controller: CephCluster "openshift-storage/ocs-storagecluster-cephcluster" will not be deleted until all dependents are removed: CephBlockPool: [ocs-storagecluster-cephblockpool ocs-storagecluster-cephblockpool-us-east-2a ocs-storagecluster-cephblockpool-us-east-2b ocs-storagecluster-cephblockpool-us-east-2c], CephFilesystem: [ocs-storagecluster-cephfilesystem], CephFilesystemSubVolumeGroup: [ocs-storagecluster-cephfilesystem-csi]
2024-06-14 12:51:14.094352 I | ceph-cluster-controller: CephCluster "openshift-storage/ocs-storagecluster-cephcluster" will not be deleted until all dependents are removed: CephBlockPool: [ocs-storagecluster-cephblockpool ocs-storagecluster-cephblockpool-us-east-2a ocs-storagecluster-cephblockpool-us-east-2b ocs-storagecluster-cephblockpool-us-east-2c], CephFilesystem: [ocs-storagecluster-cephfilesystem], CephFilesystemSubVolumeGroup: [ocs-storagecluster-cephfilesystem-csi]

Without those two lines, this is what is in there from the past 3 days:

2024-06-11 21:00:22.060166 I | cephclient: crush rule "ocs-storagecluster-cephfilesystem-data0_zone" will no longer be used by pool "ocs-storagecluster-cephfilesystem-data0"
2024-06-11 21:00:22.422982 I | ceph-block-pool-controller: successfully initialized pool "ocs-storagecluster-cephblockpool-us-east-2c" for RBD use
2024-06-11 21:00:22.751302 I | op-config: setting "mgr"="mgr/prometheus/rbd_stats_pools"="ocs-storagecluster-cephblockpool,ocs-storagecluster-cephblockpool-us-east-2a,ocs-storagecluster-cephblockpool-us-east-2b,ocs-storagecluster-cephblockpool-us-east-2c" option to the mon configuration database
2024-06-11 21:00:22.775796 I | cephclient: setting allow_standby_replay to true for filesystem "ocs-storagecluster-cephfilesystem"
2024-06-11 21:00:23.112025 I | op-config: successfully set "mgr"="mgr/prometheus/rbd_stats_pools"="ocs-storagecluster-cephblockpool,ocs-storagecluster-cephblockpool-us-east-2a,ocs-storagecluster-cephblockpool-us-east-2b,ocs-storagecluster-cephblockpool-us-east-2c" option to the mon configuration database
2024-06-11 21:00:23.129929 I | ceph-spec: parsing mon endpoints: c=172.30.194.31:3300,a=172.30.172.228:3300,b=172.30.177.124:3300
2024-06-11 21:00:23.445093 I | ceph-block-pool-controller: creating pool "ocs-storagecluster-cephblockpool" in namespace "openshift-storage"
2024-06-11 21:00:24.120569 I | cephclient: setting pool property "target_size_ratio" to "0.49" on pool "ocs-storagecluster-cephblockpool"
2024-06-11 21:00:24.128816 I | cephclient: creating cephfs "ocs-storagecluster-cephfilesystem" subvolume group "csi"
2024-06-11 21:00:24.497021 I | cephclient: successfully created cephfs "ocs-storagecluster-cephfilesystem" subvolume group "csi"
2024-06-11 21:00:25.462787 I | cephclient: application "rbd" is already set on pool "ocs-storagecluster-cephblockpool"
2024-06-11 21:00:25.462805 I | cephclient: reconciling replicated pool ocs-storagecluster-cephblockpool succeeded
2024-06-11 21:00:26.114878 I | cephclient: creating a new crush rule for changed deviceClass on crush rule "ocs-storagecluster-cephblockpool_zone"
2024-06-11 21:00:26.114900 I | cephclient: updating pool "ocs-storagecluster-cephblockpool" failure domain from "zone" to "zone" with new crush rule "ocs-storagecluster-cephblockpool_zone_replicated"
2024-06-11 21:00:26.114922 I | cephclient: crush rule "ocs-storagecluster-cephblockpool_zone" will no longer be used by pool "ocs-storagecluster-cephblockpool"
2024-06-11 21:00:26.433014 I | ceph-block-pool-controller: initializing pool "ocs-storagecluster-cephblockpool" for RBD use
2024-06-11 21:00:27.179280 I | ceph-block-pool-controller: successfully initialized pool "ocs-storagecluster-cephblockpool" for RBD use
2024-06-11 21:00:27.514821 I | op-config: setting "mgr"="mgr/prometheus/rbd_stats_pools"="ocs-storagecluster-cephblockpool,ocs-storagecluster-cephblockpool-us-east-2a,ocs-storagecluster-cephblockpool-us-east-2b,ocs-storagecluster-cephblockpool-us-east-2c" option to the mon configuration database
2024-06-11 21:00:27.837586 I | op-config: successfully set "mgr"="mgr/prometheus/rbd_stats_pools"="ocs-storagecluster-cephblockpool,ocs-storagecluster-cephblockpool-us-east-2a,ocs-storagecluster-cephblockpool-us-east-2b,ocs-storagecluster-cephblockpool-us-east-2c" option to the mon configuration database
2024-06-12 16:08:26.864969 I | operator: rook-ceph-operator-config-controller done reconciling
2024-06-12 18:47:50.714935 I | op-k8sutil: format and nodeName longer than 63 chars, nodeName ip-10-0-61-40.us-east-2.compute.internal will be 97867f1f29574478396efda2762f4874
2024-06-12 18:47:50.786739 I | op-k8sutil: format and nodeName longer than 63 chars, nodeName ip-10-0-27-19.us-east-2.compute.internal will be 9429fe593320279547d9f63557097d76
2024-06-12 18:47:50.857332 I | op-k8sutil: format and nodeName longer than 63 chars, nodeName ip-10-0-80-144.us-east-2.compute.internal will be 96400daacdccad86f567c6afdfb1d827
2024-06-13 01:42:36.692557 I | operator: rook-ceph-operator-config-controller done reconciling
2024-06-13 05:41:42.115458 I | op-k8sutil: format and nodeName longer than 63 chars, nodeName ip-10-0-80-144.us-east-2.compute.internal will be 96400daacdccad86f567c6afdfb1d827
2024-06-13 05:41:42.191647 I | op-k8sutil: format and nodeName longer than 63 chars, nodeName ip-10-0-61-40.us-east-2.compute.internal will be 97867f1f29574478396efda2762f4874
2024-06-13 05:41:42.245717 I | op-k8sutil: format and nodeName longer than 63 chars, nodeName ip-10-0-27-19.us-east-2.compute.internal will be 9429fe593320279547d9f63557097d76
2024-06-13 11:16:46.521944 I | operator: rook-ceph-operator-config-controller done reconciling
2024-06-13 16:35:33.514972 I | op-k8sutil: format and nodeName longer than 63 chars, nodeName ip-10-0-80-144.us-east-2.compute.internal will be 96400daacdccad86f567c6afdfb1d827
2024-06-13 16:35:33.582297 I | op-k8sutil: format and nodeName longer than 63 chars, nodeName ip-10-0-61-40.us-east-2.compute.internal will be 97867f1f29574478396efda2762f4874
2024-06-13 16:35:33.648855 I | op-k8sutil: format and nodeName longer than 63 chars, nodeName ip-10-0-27-19.us-east-2.compute.internal will be 9429fe593320279547d9f63557097d76
2024-06-13 20:50:56.352150 I | operator: rook-ceph-operator-config-controller done reconciling
2024-06-14 03:29:24.914869 I | op-k8sutil: format and nodeName longer than 63 chars, nodeName ip-10-0-61-40.us-east-2.compute.internal will be 97867f1f29574478396efda2762f4874
2024-06-14 03:29:24.992146 I | op-k8sutil: format and nodeName longer than 63 chars, nodeName ip-10-0-27-19.us-east-2.compute.internal will be 9429fe593320279547d9f63557097d76
2024-06-14 03:29:25.053954 I | op-k8sutil: format and nodeName longer than 63 chars, nodeName ip-10-0-80-144.us-east-2.compute.internal will be 96400daacdccad86f567c6afdfb1d827
2024-06-14 06:25:06.180562 I | operator: rook-ceph-operator-config-controller done reconciling

I've tried the manual method on two clusters and it failed. The steps for this testing are above in c#6 and c#10; c#6 was with the ocs-operator deployment scaled up, and c#10 is when it was scaled down.
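Rather than watching the reconcile spam, it may be quicker to enumerate the dependents blocking the deletion directly; a short sketch using the CR kinds named in the log above:

oc -n openshift-storage get cephblockpool,cephfilesystem,cephfilesystemsubvolumegroup,cephobjectstore,cephobjectstoreuser,cephrbdmirror

Per the controller message, each CR listed there has to be removed (or have its deletion unblocked) before the CephCluster deletion can complete.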
Travis,

Sorry, I've been on/off PTO the past few weeks... I'll test this again following the first bullet: remove the deletionTimestamp and other metadata in the backup CR.

I'm going on PTO again on Thursday, so hopefully I can knock this testing out before then.
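For the record, that metadata cleanup on the backup can be scripted; a minimal sketch, assuming yq v4 is available (otherwise edit cluster.yaml by hand) and that cluster.yaml holds the single CR rather than a List (if it is a List, prefix the paths with .items[0]):

yq eval 'del(.metadata.deletionTimestamp, .metadata.deletionGracePeriodSeconds, .metadata.resourceVersion, .metadata.uid, .metadata.creationTimestamp)' cluster.yaml > cluster-clean.yaml
oc -n openshift-storage create -f cluster-clean.yaml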
Hi Parth,

Thanks for looking into the bug and testing the steps.

I have done another round of testing in a VMware OCP+LSO+ODF 4.12 environment and shared the results with the customer. The Ceph cluster was created successfully following the steps in https://www.rook.io/docs/rook/v1.14/Troubleshooting/disaster-recovery/#restoring-crds-after-deletion

I have shared the commands and steps with the customer and am waiting for them to execute them in their setup. For now, there are no actions for Support or Engineering until the customer executes the steps. I will update you if there is anything from the customer, thanks.

Regards,
Soumi
ODF CLI failed on ODF 4.15. RH Case 03957493