Description of problem (please be as detailed as possible and provide log snippets):

We are trying the scenario below, where all the worker nodes on which ODF was running were replaced at the same time. I am trying to recover ODF. Is there any step I can try to recover ODF?

Version of all relevant components (if applicable): 4.11

Can this issue be reproduced? Replace all workers at the same time.

$ oc get pods
NAME                                                              READY   STATUS             RESTARTS        AGE
app-active-replace-deployment-d486cf8d5-wsdfs                     1/1     Running            0               69m
csi-addons-controller-manager-5c887d798f-j9sj5                    2/2     Running            1 (116m ago)    3h25m
csi-cephfsplugin-2z8b5                                            2/2     Running            0               42m
csi-cephfsplugin-8js8s                                            2/2     Running            0               3h25m
csi-cephfsplugin-h82q6                                            2/2     Running            0               3h25m
csi-cephfsplugin-jwjrm                                            2/2     Running            0               3h25m
csi-cephfsplugin-kt89t                                            2/2     Running            0               41m
csi-cephfsplugin-provisioner-6ccc575676-95ntk                     5/5     Running            0               59m
csi-cephfsplugin-provisioner-6ccc575676-z57ck                     5/5     Running            2 (116m ago)    3h25m
csi-cephfsplugin-ps8qg                                            2/2     Running            0               39m
csi-rbdplugin-2lrmr                                               2/2     Running            0               3h25m
csi-rbdplugin-7ldwc                                               2/2     Running            0               39m
csi-rbdplugin-bzr2r                                               2/2     Running            0               41m
csi-rbdplugin-lww5v                                               2/2     Running            0               3h25m
csi-rbdplugin-p624c                                               2/2     Running            0               42m
csi-rbdplugin-phgqr                                               2/2     Running            0               3h25m
csi-rbdplugin-provisioner-665b6c9699-mxj8s                        5/5     Running            2 (116m ago)    3h25m
csi-rbdplugin-provisioner-665b6c9699-wwt9p                        5/5     Running            0               3h25m
noobaa-core-0                                                     1/1     Running            0               59m
noobaa-db-pg-0                                                    0/1     Init:0/2           0               59m
noobaa-endpoint-7bc6569497-vmcqb                                  1/1     Running            0               59m
noobaa-operator-54bdff96bd-vp9mb                                  1/1     Running            1 (88m ago)     3h26m
ocs-metrics-exporter-fc6fd88fd-2fk7z                              1/1     Running            0               3h25m
ocs-operator-57b747b75b-htk4c                                     1/1     Running            1 (116m ago)    3h25m
ocs-osd-removal-job-c9f4q                                         0/1     Error              0               7m1s
ocs-osd-removal-job-c9hgw                                         0/1     Error              0               6m4s
ocs-osd-removal-job-hb2mb                                         0/1     Error              0               6m23s
ocs-osd-removal-job-hh2n9                                         0/1     Error              0               5m45s
ocs-osd-removal-job-hngqs                                         0/1     Error              0               7m39s
ocs-osd-removal-job-kpvxz                                         0/1     Error              0               6m42s
ocs-osd-removal-job-nqm5k                                         0/1     Error              0               7m20s
odf-console-db7549559-sj65l                                       1/1     Running            0               3h26m
odf-operator-controller-manager-675c4fd4d4-hfwrc                  2/2     Running            1 (116m ago)    3h26m
rook-ceph-crashcollector-bha-odf-mar9-p-2645-host-3-6dc94dx9z99   1/1     Running            0               22m
rook-ceph-crashcollector-bha-odf-mar9-p-2645-host-7-78bc49l4wgl   1/1     Running            0               22m
rook-ceph-mds-ocs-storagecluster-cephfilesystem-a-58f449d87zfq2   1/2     CrashLoopBackOff   9 (2m38s ago)   60m
rook-ceph-mds-ocs-storagecluster-cephfilesystem-b-5cf89698xcsf4   1/2     CrashLoopBackOff   9 (99s ago)     56m
rook-ceph-mgr-a-b6dcd9cbf-pd26j                                   1/2     CrashLoopBackOff   9 (109s ago)    59m
rook-ceph-mon-a-598bf9b4b4-wd24d                                  0/2     Pending            0               60m
rook-ceph-mon-b-7d7596fdb9-th7kc                                  0/2     Pending            0               56m
rook-ceph-mon-c-84bf487744-r7zrj                                  0/2     Pending            0               56m
rook-ceph-operator-59ff9c9666-q9rbr                               1/1     Running            0               3h25m
rook-ceph-osd-0-6d6588855b-cqx5l                                  0/2     Pending            0               60m
rook-ceph-osd-1-547fd7cb6-5khjb                                   0/2     Pending            0               56m
rook-ceph-osd-2-7db8cf96fc-p6htk                                  0/2     Pending            0               56m
rook-ceph-rgw-ocs-storagecluster-cephobjectstore-a-5c5c6d9kk4m9   1/2     Running            7 (88s ago)     59m

$ oc get pv
NAME                                       CAPACITY   ACCESS MODES   RECLAIM POLICY   STATUS      CLAIM                                           STORAGECLASS                  REASON   AGE
local-pv-3396bcce                          100Gi      RWO            Delete           Available                                                   localblock                             47m
local-pv-38285e9b                          100Gi      RWO            Delete           Bound       openshift-storage/ocs-deviceset-1-data-0wg6c2   localblock                             116m
local-pv-5420c01b                          100Gi      RWO            Delete           Available                                                   localblock                             46m
local-pv-561cef2e                          100Gi      RWO            Delete           Bound       openshift-storage/ocs-deviceset-0-data-0dxksv   localblock                             117m
local-pv-8f0ccd08                          100Gi      RWO            Delete           Bound       openshift-storage/ocs-deviceset-2-data-0v4t56   localblock                             116m
local-pv-974694cb                          100Gi      RWO            Delete           Available                                                   localblock                             47m
pvc-29b2075c-d43d-448f-b350-d6428ee529cb   50Gi       RWO            Delete           Bound       openshift-storage/db-noobaa-db-pg-0             ocs-storagecluster-ceph-rbd            114m
pvc-d65531c6-c011-4149-8d35-2caa9cf1a210   5Gi        RWO            Delete           Bound       openshift-storage/app-static-pvc-replace        ocs-storagecluster-ceph-rbd            99m
pvc-f0fc2799-fbc6-49ae-9894-46e873a35f37   5Gi        RWO            Delete           Bound       openshift-storage/app-active-pvc-replace        ocs-storagecluster-ceph-rbd

$ oc get nodes -o wide
NAME                          STATUS   ROLES           AGE     VERSION           INTERNAL-IP    EXTERNAL-IP   OS-IMAGE                               KERNEL-VERSION                 CONTAINER-RUNTIME
bha-odf-mar9-p-2645-host-10   Ready    master,worker   5h40m   v1.25.4+a34b9e9   10.240.64.8    <none>        Red Hat Enterprise Linux 8.7 (Ootpa)   4.18.0-425.13.1.el8_7.x86_64   cri-o://1.25.2-6.rhaos4.12.git3c4e50c.el8
bha-odf-mar9-p-2645-host-11   Ready    master,worker   5h34m   v1.25.4+a34b9e9   10.240.128.8   <none>        Red Hat Enterprise Linux 8.7 (Ootpa)   4.18.0-425.13.1.el8_7.x86_64   cri-o://1.25.2-6.rhaos4.12.git3c4e50c.el8
bha-odf-mar9-p-2645-host-3    Ready    master,worker   69m     v1.25.4+a34b9e9   10.240.0.6     <none>        Red Hat Enterprise Linux 8.7 (Ootpa)   4.18.0-425.13.1.el8_7.x86_64   cri-o://1.25.2-6.rhaos4.12.git3c4e50c.el8
bha-odf-mar9-p-2645-host-5    Ready    master,worker   68m     v1.25.4+a34b9e9   10.240.128.4   <none>        Red Hat Enterprise Linux 8.7 (Ootpa)   4.18.0-425.13.1.el8_7.x86_64   cri-o://1.25.2-6.rhaos4.12.git3c4e50c.el8
bha-odf-mar9-p-2645-host-7    Ready    master,worker   70m     v1.25.4+a34b9e9   10.240.64.5    <none>        Red Hat Enterprise Linux 8.7 (Ootpa)   4.18.0-425.13.1.el8_7.x86_64   cri-o://1.25.2-6.rhaos4.12.git3c4e50c.el8
bha-odf-mar9-p-2645-host-9    Ready    master,worker   5h37m   v1.25.4+a34b9e9   10.240.0.8     <none>        Red Hat Enterprise Linux 8.7 (Ootpa)   4.18.0-425.13.1.el8_7.x86_64   cri-o://1.25.2-6.rhaos4.12.git3c4e50c.el8
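For completeness, the reason the mon and OSD pods stay Pending can be confirmed from the scheduler side. This is only a sketch: the pod and deployment names are taken from the output above, and it assumes the standard Rook label app=rook-ceph-mon on the mon deployments.

# Scheduling events for one of the pending mons; with the original node gone,
# this typically shows FailedScheduling against an unsatisfiable node selector
$ oc -n openshift-storage describe pod rook-ceph-mon-a-598bf9b4b4-wd24d

# Which node each mon deployment is pinned to
$ oc -n openshift-storage get deployment -l app=rook-ceph-mon \
    -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.template.spec.nodeSelector}{"\n"}{end}'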
$ oc describe storagecluster ocs-storagecluster -n openshift-storage
Name:         ocs-storagecluster
Namespace:    openshift-storage
Labels:       <none>
Annotations:  uninstall.ocs.openshift.io/cleanup-policy: delete
              uninstall.ocs.openshift.io/mode: graceful
API Version:  ocs.openshift.io/v1
Kind:         StorageCluster
Metadata:
  Creation Timestamp:  2023-03-10T07:58:42Z
  Finalizers:
    storagecluster.ocs.openshift.io
  Generation:  2
  Managed Fields:
    API Version:  ocs.openshift.io/v1
    Fields Type:  FieldsV1
    fieldsV1:
      f:metadata:
        f:annotations:
          .:
          f:uninstall.ocs.openshift.io/cleanup-policy:
          f:uninstall.ocs.openshift.io/mode:
        f:finalizers:
          .:
          v:"storagecluster.ocs.openshift.io":
      f:spec:
        f:version:
    Manager:      ocs-operator
    Operation:    Update
    Time:         2023-03-10T07:58:43Z
    API Version:  ocs.openshift.io/v1
    Fields Type:  FieldsV1
    fieldsV1:
      f:metadata:
        f:ownerReferences:
          .:
          k:{"uid":"bf74285d-62f3-41c9-910d-c92784f62ca1"}:
      f:spec:
        .:
        f:arbiter:
        f:encryption:
          .:
          f:kms:
        f:externalStorage:
        f:managedResources:
          .:
          f:cephBlockPools:
          f:cephCluster:
          f:cephConfig:
          f:cephDashboard:
          f:cephFilesystems:
          f:cephObjectStoreUsers:
          f:cephObjectStores:
          f:cephToolbox:
        f:mirroring:
        f:monDataDirHostPath:
        f:storageDeviceSets:
    Manager:      manager
    Operation:    Update
    Time:         2023-03-10T07:58:54Z
    API Version:  ocs.openshift.io/v1
    Fields Type:  FieldsV1
    fieldsV1:
      f:status:
        .:
        f:conditions:
        f:externalStorage:
          .:
          f:grantedCapacity:
        f:failureDomain:
        f:failureDomainKey:
        f:failureDomainValues:
        f:images:
          .:
          f:ceph:
            .:
            f:actualImage:
            f:desiredImage:
          f:noobaaCore:
            .:
            f:actualImage:
            f:desiredImage:
          f:noobaaDB:
            .:
            f:actualImage:
            f:desiredImage:
        f:kmsServerConnection:
        f:nodeTopologies:
          .:
          f:labels:
            .:
            f:kubernetes.io/hostname:
            f:topology.kubernetes.io/region:
            f:topology.kubernetes.io/zone:
        f:phase:
        f:relatedObjects:
    Manager:      ocs-operator
    Operation:    Update
    Subresource:  status
    Time:         2023-03-10T11:52:08Z
  Owner References:
    API Version:     odf.openshift.io/v1alpha1
    Kind:            StorageSystem
    Name:            ocs-storagecluster-storagesystem
    UID:             bf74285d-62f3-41c9-910d-c92784f62ca1
  Resource Version:  305666
  UID:               a3d071fb-58a0-4dfe-b17e-709cc61f315a
Spec:
  Arbiter:
  Encryption:
    Kms:
  External Storage:
  Managed Resources:
    Ceph Block Pools:
    Ceph Cluster:
    Ceph Config:
    Ceph Dashboard:
    Ceph Filesystems:
    Ceph Object Store Users:
    Ceph Object Stores:
    Ceph Toolbox:
  Mirroring:
  Mon Data Dir Host Path:  /var/lib/rook
  Storage Device Sets:
    Config:
    Count:  1
    Data PVC Template:
      Metadata:
      Spec:
        Access Modes:
          ReadWriteOnce
        Resources:
          Requests:
            Storage:         1
        Storage Class Name:  localblock
        Volume Mode:         Block
      Status:
    Name:       ocs-deviceset
    Placement:
    Prepare Placement:
    Replica:    3
    Resources:
  Version:  4.11.0
Status:
  Conditions:
    Last Heartbeat Time:   2023-03-10T11:52:08Z
    Last Transition Time:  2023-03-10T11:01:47Z
    Message:               Reconcile completed successfully
    Reason:                ReconcileCompleted
    Status:                True
    Type:                  ReconcileComplete
    Last Heartbeat Time:   2023-03-10T11:52:08Z
    Last Transition Time:  2023-03-10T11:01:47Z
    Message:               CephCluster error: Failed to configure ceph cluster
    Reason:                ClusterStateError
    Status:                False
    Type:                  Available
    Last Heartbeat Time:   2023-03-10T11:47:49Z
    Last Transition Time:  2023-03-10T10:24:44Z
    Message:               CephCluster is creating: Configuring Ceph Mons
    Reason:                ClusterStateCreating
    Status:                True
    Type:                  Progressing
    Last Heartbeat Time:   2023-03-10T11:52:08Z
    Last Transition Time:  2023-03-10T11:01:47Z
    Message:               CephCluster error: Failed to configure ceph cluster
    Reason:                ClusterStateError
    Status:                True
    Type:                  Degraded
    Last Heartbeat Time:   2023-03-10T11:47:49Z
    Last Transition Time:  2023-03-10T11:03:22Z
    Message:               CephCluster is creating: Configuring Ceph Mons
    Reason:                ClusterStateCreating
    Status:                False
    Type:                  Upgradeable
  External Storage:
    Granted Capacity:  0
  Failure Domain:      zone
  Failure Domain Key:  topology.kubernetes.io/zone
  Failure Domain Values:
    us-south-2
    us-south-1
    us-south-3
  Images:
    Ceph:
      Actual Image:   registry.redhat.io/rhceph/rhceph-5-rhel8@sha256:957294824e1cbf89ca24a1a2aa2a8e8acd567cfb5a25535e2624989ad1046a60
      Desired Image:  registry.redhat.io/rhceph/rhceph-5-rhel8@sha256:957294824e1cbf89ca24a1a2aa2a8e8acd567cfb5a25535e2624989ad1046a60
    Noobaa Core:
      Actual Image:   registry.redhat.io/odf4/mcg-core-rhel8@sha256:f3470a4dc896b30d77bce7f1c887340ca48b4be9e4fc353e81aef8e1834cbcf5
      Desired Image:  registry.redhat.io/odf4/mcg-core-rhel8@sha256:f3470a4dc896b30d77bce7f1c887340ca48b4be9e4fc353e81aef8e1834cbcf5
    Noobaa DB:
      Actual Image:   registry.redhat.io/rhel8/postgresql-12@sha256:aa65868b9684f7715214f5f3fac3139245c212019cc17742f237965a7508222d
      Desired Image:  registry.redhat.io/rhel8/postgresql-12@sha256:aa65868b9684f7715214f5f3fac3139245c212019cc17742f237965a7508222d
  Kms Server Connection:
  Node Topologies:
    Labels:
      kubernetes.io/hostname:
        bha-odf-mar9-p-2645-host-4
        bha-odf-mar9-p-2645-host-6
        bha-odf-mar9-p-2645-host-8
        bha-odf-mar9-p-2645-host-7
        bha-odf-mar9-p-2645-host-3
        bha-odf-mar9-p-2645-host-5
      topology.kubernetes.io/region:
        us-south
      topology.kubernetes.io/zone:
        us-south-2
        us-south-1
        us-south-3
  Phase:  Progressing
  Related Objects:
    API Version:       ceph.rook.io/v1
    Kind:              CephCluster
    Name:              ocs-storagecluster-cephcluster
    Namespace:         openshift-storage
    Resource Version:  304625
    UID:               ae876acf-bbed-491f-b2a1-a9d3b7d955a1
    API Version:       noobaa.io/v1alpha1
    Kind:              NooBaa
    Name:              noobaa
    Namespace:         openshift-storage
    Resource Version:  229429
    UID:               a70e769e-d85f-4a02-b9e8-340dc50862bc
Events:  <none>
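The error surfaced in the StorageCluster conditions comes from the CephCluster it owns (listed under Related Objects). If it helps triage, the same state can be read from that resource directly; a small sketch using the names from the output above:

# Phase/health summary of the CephCluster owned by the StorageCluster
$ oc -n openshift-storage get cephcluster ocs-storagecluster-cephcluster

# Full conditions, mon health, and the last reconcile error
$ oc -n openshift-storage describe cephcluster ocs-storagecluster-cephcluster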
OSD removal job is in error state:

$ oc logs -l job-name=ocs-osd-removal-job -n openshift-storage
mon_data_avail_warn = 15
[osd]
osd_memory_target_cgroup_limit_ratio = 0.8
[client.admin]
keyring = /var/lib/rook/openshift-storage/client.admin.keyring
2023-03-10 11:33:19.520767 D | exec: Running command: ceph osd dump --connect-timeout=15 --cluster=openshift-storage --conf=/var/lib/rook/openshift-storage/openshift-storage.config --name=client.admin --keyring=/var/lib/rook/openshift-storage/client.admin.keyring --format json
2023-03-10 11:33:34.625564 C | rookcmd: failed to get osd dump: failed to get osd dump: exit status 1
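The removal jobs fail because ceph osd dump cannot reach a mon quorum, so any Ceph admin command will time out the same way. If it is useful to confirm that independently, the Ceph toolbox can be used. This is only a sketch: the toolbox has to be enabled first, and with quorum lost the status command is also expected to time out.

# Enable the rook-ceph toolbox deployment in ODF
$ oc patch ocsinitialization ocsinit -n openshift-storage --type json \
    --patch '[{ "op": "replace", "path": "/spec/enableCephTools", "value": true }]'

# Try a cluster status from the toolbox; without mon quorum this hangs or times out
$ TOOLS_POD=$(oc -n openshift-storage get pod -l app=rook-ceph-tools -o name)
$ oc -n openshift-storage rsh $TOOLS_POD ceph status --connect-timeout=15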
This cluster is running on local PVs, and the mons are using a host path to store their metadata. If you replace all the nodes at the same time, you immediately lose mon quorum and all OSDs, and the cluster cannot recover. Notice that all the mon pods are Pending: they are waiting for their original nodes, and mon quorum is lost until those same nodes come back online. Replacing all the nodes at the same time can only be done if the mons and OSDs are portable. Have you tested this on an AWS cluster?
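For context, both points above (host-path mon data and node pinning) are visible in the cluster itself; a small sketch, assuming the resource names from the earlier output:

# The mon stores live under a host path on the node (matches Mon Data Dir Host Path above)
$ oc -n openshift-storage get cephcluster ocs-storagecluster-cephcluster \
    -o jsonpath='{.spec.dataDirHostPath}{"\n"}'

# The pending mon pods show no node assignment because their original nodes no longer exist
$ oc -n openshift-storage get pods -l app=rook-ceph-mon -o wide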
Please reopen if there is still an issue to discuss.