Description of problem (please be as detailed as possible and provide log snippets):

Created an RDR environment with a hub cluster (perf1) and 2 managed clusters, perf2 and perf3. Then tested the replacement cluster steps using KCS https://access.redhat.com/articles/7049245 and added a new recovery cluster, perf-2. The last step, Relocating back to the Primary cluster, failed and shows RBD app pods stuck in creating state because their PVC/PV are in a bad state. This is because when "perf-2" was added as the new recovery cluster, its Ceph pool IDs changed compared to the original cluster "perf2" that it replaced.

perf3, where the RBD apps were Relocated from:

$ ceph df | grep -B 3 -A 1 cephblockpool
--- POOLS ---
POOL                              ID  PGS  STORED   OBJECTS  USED     %USED  MAX AVAIL
ocs-storagecluster-cephblockpool   1   32  837 MiB      407  2.3 GiB   0.94     83 GiB
.mgr                               2    1  705 KiB        2  2.1 MiB      0     83 GiB

new perf-2, where the RBD apps were Relocated to:

$ ceph df | grep -B 2 cephblockpool
POOL                              ID  PGS  STORED   OBJECTS  USED     %USED  MAX AVAIL
.mgr                               1    1  577 KiB        2  1.7 MiB      0     82 GiB
ocs-storagecluster-cephblockpool   2   32  817 MiB      378  2.3 GiB   0.91     82 GiB

Version of all relevant components (if applicable):
OCP 4.14.11
ODF 4.15 (build 146)
ACM 2.9.2

Does this issue impact your ability to continue to work with the product (please explain in detail what is the user impact)?
Yes, RBD apps are in a failed state.

Is there any workaround available to the best of your knowledge?
No

Rate from 1 - 5 the complexity of the scenario you performed that caused this bug (1 - very simple, 5 - very complex)?
5

Is this issue reproducible?
It is intermittent, because Ceph pool IDs do not always change when a new recovery cluster is created with ODF installed.

Steps to Reproduce:
0) Create an RDR environment with a hub cluster and 2 managed clusters named perf2 and perf3 in the ACM cluster view.
1) Fail the original perf2 cluster (power down all nodes).
2) Failover perf2 rbd and cephfs apps to perf3.
3) Validate the apps failed over correctly and are working as expected given perf2 is down (replication between clusters is down).
4) Delete DRCluster perf2 using the hub cluster.
5) Validate the s3Profile for perf2 is removed from all VRGs on perf3.
6) Disable DR for all rbd and cephfs apps from perf2.
7) Remove all DR config from perf3 and the hub cluster.
8) Remove submariner using the ACM UI.
9) Detach the perf2 cluster using the ACM UI.
10) Create a new cluster and add it using the ACM UI as perf-2.
11) Install ODF 4.15 build 146 on perf-2.
12) Add the submariner add-ons using the ACM UI.
13) Install MCO (ODF 4.15 build 146) using the hub cluster.
14) Create the first DRPolicy.
15) Apply the DR policy to the rbd and cephfs apps originally on perf2.
16) Relocate the rbd and cephfs apps back to perf-2.

Actual results:
RBD apps failed because of bad PVC/PV state.

Expected results:
RBD apps are created with healthy PVC/PV state.
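To confirm the mismatch described above, it can help to compare the CSI volume handles of the affected RBD PVs against the pools that actually exist on the relocation target. A minimal sketch (commands only, run with the same oc/toolbox access used elsewhere in this report; output will differ per cluster):

# On the cluster the apps were relocated to (perf-2): list PVs and their CSI volume handles
oc get pv -o custom-columns=PV:.metadata.name,STORAGECLASS:.spec.storageClassName,HANDLE:.spec.csi.volumeHandle

# In the ceph toolbox on the same cluster: list the pools and their current IDs
ceph osd lspools

If the pool ID embedded in a volume handle does not appear in the ceph osd lspools output and is not covered by the rook-ceph-csi-mapping-config ConfigMap discussed in the diagnosis below, RBD restore/provisioning for that PV can be expected to fail.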
Additional info:

Shyam's diagnosis: The issue is in the pool ID mapping ConfigMap for Ceph-CSI, as follows.

perf-2 (c1)
===========
Pool ID for the RBD pool is 2: pool 2 'ocs-storagecluster-cephblockpool' (from ceph osd pool ls detail)

The CSI mapping ConfigMap has this:

$ oc get cm -n openshift-storage rook-ceph-csi-mapping-config -o yaml
apiVersion: v1
data:
  csi-mapping-config-json: '[{"ClusterIDMapping":{"openshift-storage":"openshift-storage"},"RBDPoolIDMapping":[{"1":"2"}]}]'
kind: ConfigMap

perf3 (c2)
==========
Pool ID for the RBD pool is 1: pool 1 'ocs-storagecluster-cephblockpool'

The CSI mapping ConfigMap has this:

$ oc get cm -n openshift-storage rook-ceph-csi-mapping-config -o yaml
apiVersion: v1
data:
  csi-mapping-config-json: '[{"ClusterIDMapping":{"openshift-storage":"openshift-storage"},"RBDPoolIDMapping":[{"8":"1"}]}]'
kind: ConfigMap

The PVC was initially created on the cluster that was lost and hence has this as its CSI volume handle:

volumeHandle: 0001-0011-openshift-storage-0000000000000008-06e1ec21-887c-4734-baf4-8f12a319ae0a

Note the 0000000000000008: that is the pool ID, which is not the pool ID in either of the current clusters. When this was failed over to perf3, the existing CSI mapping mapped ID 8 to ID 1 in perf3, which is correct. When we added the new cluster perf-2, neither of the CSI mappings works: perf-2's mapping only covers pool ID 1, so the handle's pool ID 8 goes untranslated, which is why the error messages on perf-2 also point to the pool that happens to have ID 8 there: pool 8 'ocs-storagecluster-cephobjectstore.rgw.log'.

Anyway, this is an interesting issue: we need to map a non-existing pool ID to one of the existing pool IDs in the current clusters. Ceph-CSI would need to fix this.
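The pool ID Shyam points at is embedded in the CSI volume handle as the zero-padded 16-character segment (0000000000000008 above). A minimal sketch for pulling it out of a handle; this assumes the handle layout shown above and that the segment is hex-encoded (for pool 8 the distinction does not matter):

# Volume handle copied from the PV (example taken from this report)
HANDLE="0001-0011-openshift-storage-0000000000000008-06e1ec21-887c-4734-baf4-8f12a319ae0a"

# The pool segment is the 16-character, zero-padded field of the handle
POOL_HEX=$(grep -oE '[0-9a-f]{16}' <<< "$HANDLE" | head -n1)

# Convert to a decimal pool ID; compare with 'ceph osd lspools' on the target cluster
printf 'pool ID encoded in handle: %d\n' "0x${POOL_HEX}"    # -> 8

Comparing this ID with the pools listed on the target cluster, and with the source side of RBDPoolIDMapping in rook-ceph-csi-mapping-config, shows immediately whether a given handle can be resolved there.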
Does the workaround provided in comment #6 work? Moving this to 4.17 to check whether there is a permanent solution other than the workaround. Please move it back to 4.16 if the workaround does not work!
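For context, the workaround under discussion amounts to adding the missing pool-ID pair to the rook-ceph-csi-mapping-config ConfigMap shown in the diagnosis above (comment #6 itself is not reproduced here). A minimal, illustrative sketch of such an edit on perf-2, assuming the JSON layout from the diagnosis and the pool IDs from this report (8 = the lost cluster's RBD pool, 2 = perf-2's RBD pool); this is a sketch, not a confirmed fix:

# Add {"8":"2"} next to the existing {"1":"2"} entry in the mapping JSON
oc patch cm rook-ceph-csi-mapping-config -n openshift-storage --type merge -p \
  '{"data":{"csi-mapping-config-json":"[{\"ClusterIDMapping\":{\"openshift-storage\":\"openshift-storage\"},\"RBDPoolIDMapping\":[{\"1\":\"2\"},{\"8\":\"2\"}]}]"}}'

# Verify the stored value still parses as JSON
oc get cm rook-ceph-csi-mapping-config -n openshift-storage \
  -o jsonpath='{.data.csi-mapping-config-json}' | python3 -m json.tool

Note that the ConfigMap is owned by the rook-ceph-operator Deployment (see the ownerReferences in the dumps later in this bug), so a manual edit may be reconciled away by the operator, and the value must remain a single valid JSON string.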
@mrajanna During cluster replacement I observed this behaviour again and performed the suggested WA. This time, during relocate, PVCs on C2 (the surviving cluster) are stuck in Terminating state. Kindly help me confirm that the WA has fully worked, and also with the reason the PVCs are stuck.

C1 (recovery cluster)
---
$ ceph osd pool ls detail
pool 1 'ocs-storagecluster-cephblockpool' replicated size 3 min_size 2 crush_rule 9 object_hash rjenkins pg_num 32 pgp_num 32 autoscale_mode on last_change 210 lfor 0/0/26 flags hashpspool,selfmanaged_snaps stripe_width 0 target_size_ratio 0.49 application rbd read_balance_score 1.13
pool 2 'ocs-storagecluster-cephobjectstore.rgw.buckets.non-ec' replicated size 3 min_size 2 crush_rule 13 object_hash rjenkins pg_num 8 pgp_num 8 autoscale_mode on last_change 123 flags hashpspool stripe_width 0 pg_num_min 8 application rgw read_balance_score 1.88
pool 3 'ocs-storagecluster-cephobjectstore.rgw.meta' replicated size 3 min_size 2 crush_rule 12 object_hash rjenkins pg_num 8 pgp_num 8 autoscale_mode on last_change 123 flags hashpspool stripe_width 0 pg_num_min 8 application rgw read_balance_score 1.50

$ oc get cm -n openshift-storage rook-ceph-csi-mapping-config -o yaml
apiVersion: v1
data:
  csi-mapping-config-json: '[{"ClusterIDMapping":{"openshift-storage":"openshift-storage"},"RBDPoolIDMapping":[{"1":"1"}]}]'
kind: ConfigMap
metadata:
  creationTimestamp: "2024-05-22T06:02:15Z"
  name: rook-ceph-csi-mapping-config
  namespace: openshift-storage
  ownerReferences:
  - apiVersion: apps/v1
    blockOwnerDeletion: false
    controller: true
    kind: Deployment
    name: rook-ceph-operator
    uid: 2dd3eaf4-caaf-4b05-ab07-542e168f1887
  resourceVersion: "3906057"
  uid: d782c3b2-5715-4842-9127-815273c8d8fe

C2 (surviving cluster)
---
$ ceph osd pool ls detail
pool 1 'ocs-storagecluster-cephblockpool' replicated size 3 min_size 2 crush_rule 9 object_hash rjenkins pg_num 32 pgp_num 32 autoscale_mode on last_change 2408 lfor 0/0/38 flags hashpspool,selfmanaged_snaps stripe_width 0 target_size_ratio 0.49 application rbd read_balance_score 1.13
pool 2 'ocs-storagecluster-cephobjectstore.rgw.log' replicated size 3 min_size 2 crush_rule 11 object_hash rjenkins pg_num 8 pgp_num 8 autoscale_mode on last_change 2041 flags hashpspool stripe_width 0 pg_num_min 8 application rgw read_balance_score 1.88
pool 3 'ocs-storagecluster-cephobjectstore.rgw.control' replicated size 3 min_size 2 crush_rule 13 object_hash rjenkins pg_num 8 pgp_num 8 autoscale_mode on last_change 2041 flags hashpspool stripe_width 0 pg_num_min 8 application rgw read_balance_score 1.50

Edited the ConfigMap by adding a 1:1 mapping; I believe this is the right way:
$ oc get cm -n openshift-storage rook-ceph-csi-mapping-config -o yaml
apiVersion: v1
data:
  csi-mapping-config-json: '[{"ClusterIDMapping":{"openshift-storage":"openshift-storage"},"RBDPoolIDMapping":[{"3":"1"}]},{"ClusterIDMapping":{"openshift-storage":"openshift-storage"},"RBDPoolIDMapping”:[{“1”:”1”}]}]’   <---- Here I have added the 1:1 pool mapping
kind: ConfigMap
metadata:
  creationTimestamp: "2024-05-20T10:19:47Z"
  name: rook-ceph-csi-mapping-config
  namespace: openshift-storage
  ownerReferences:
  - apiVersion: apps/v1
    blockOwnerDeletion: false
    controller: true
    kind: Deployment
    name: rook-ceph-operator
    uid: d79ca804-3a2f-4f7e-96fb-76780814bd38
  resourceVersion: "5653111"
  uid: f235cd1e-24b7-4062-bd76-ebca186801b9

HUB
---
$ oc get drpc app-sub-busybox1-placement-1-drpc -o yaml
apiVersion: ramendr.openshift.io/v1alpha1
kind: DRPlacementControl
metadata:
  annotations:
    drplacementcontrol.ramendr.openshift.io/app-namespace: app-sub-busybox1
    drplacementcontrol.ramendr.openshift.io/last-app-deployment-cluster: kmanohar-c1
  creationTimestamp: "2024-05-22T08:20:37Z"
  finalizers:
  - drpc.ramendr.openshift.io/finalizer
  generation: 2
  labels:
    cluster.open-cluster-management.io/backup: ramen
  name: app-sub-busybox1-placement-1-drpc
  namespace: app-sub-busybox1
  ownerReferences:
  - apiVersion: cluster.open-cluster-management.io/v1beta1
    blockOwnerDeletion: true
    controller: true
    kind: Placement
    name: app-sub-busybox1-placement-1
    uid: 353a0fa5-2d42-4b02-b887-a1ecd5ed4e73
  resourceVersion: "5409642"
  uid: 05db7b13-1aac-4b6b-883b-54ea204fe8b5
spec:
  action: Relocate
  drPolicyRef:
    apiVersion: ramendr.openshift.io/v1alpha1
    kind: DRPolicy
    name: dr-policy-10mn
  failoverCluster: kmanohar-c2
  placementRef:
    apiVersion: cluster.open-cluster-management.io/v1beta1
    kind: Placement
    name: app-sub-busybox1-placement-1
    namespace: app-sub-busybox1
  preferredCluster: kmanohar-c1
  pvcSelector:
    matchLabels:
      appname: busybox_app1
status:
  actionStartTime: "2024-05-22T09:38:21Z"
  conditions:
  - lastTransitionTime: "2024-05-22T09:40:22Z"
    message: Completed
    observedGeneration: 2
    reason: Relocated
    status: "True"
    type: Available
  - lastTransitionTime: "2024-05-22T09:39:20Z"
    message: Relocation in progress to cluster "kmanohar-c1"
    observedGeneration: 2
    reason: NotStarted
    status: "False"
    type: PeerReady
  - lastTransitionTime: "2024-05-22T12:12:11Z"
    message: 'VolumeReplicationGroup (app-sub-busybox1/app-sub-busybox1-placement-1-drpc)
      on cluster kmanohar-c1 is reporting errors (Failed to restore PVs/PVCs: failed
      to restore PV/PVC for VolRep (failed to restore PVs and PVCs using profile list
      ([s3profile-kmanohar-c1-ocs-storagecluster s3profile-kmanohar-c2-ocs-storagecluster]):
      failed to restore all []v1.PersistentVolumeClaim. Total/Restored 20/0)) restoring
      workload resources, retrying till ClusterDataReady condition is met'
    observedGeneration: 2
    reason: Error
    status: "False"
    type: Protected
  lastGroupSyncBytes: 87576576
  lastGroupSyncDuration: 1s
  lastGroupSyncTime: "2024-05-22T09:30:01Z"
  lastUpdateTime: "2024-05-22T16:48:09Z"
  observedGeneration: 2
  phase: Relocated
  preferredDecision:
    clusterName: kmanohar-c2
    clusterNamespace: kmanohar-c2
  progression: Cleaning Up
  resourceConditions:
    conditions:
    - lastTransitionTime: "2024-05-22T12:11:40Z"
      message: Initializing VolumeReplicationGroup
      observedGeneration: 1
      reason: Initializing
      status: Unknown
      type: DataReady
    - lastTransitionTime: "2024-05-22T12:11:40Z"
      message: Initializing VolumeReplicationGroup
      observedGeneration: 1
      reason: Initializing
      status: Unknown
      type: DataProtected
    - lastTransitionTime: "2024-05-22T12:11:51Z"
      message: 'Failed to restore PVs/PVCs: failed to restore PV/PVC for VolRep (failed
        to restore PVs and PVCs using profile list ([s3profile-kmanohar-c1-ocs-storagecluster
        s3profile-kmanohar-c2-ocs-storagecluster]): failed to restore all []v1.PersistentVolumeClaim.
        Total/Restored 20/0)'
      observedGeneration: 1
      reason: Error
      status: "False"
      type: ClusterDataReady
    - lastTransitionTime: "2024-05-22T12:11:40Z"
      message: Initializing VolumeReplicationGroup
      observedGeneration: 1
      reason: Initializing
      status: Unknown
      type: ClusterDataProtected
    resourceMeta:
      generation: 1
      kind: VolumeReplicationGroup
      name: app-sub-busybox1-placement-1-drpc
      namespace: app-sub-busybox1

Additional info:
----------------
--> I restarted nodes compute-0 and compute-2, but that didn't help; respinning the rbdplugin-provisioner didn't help either.
--> CephFS-based application relocate was successful.
--> Submariner connectivity is intact.

Must gather - http://rhsqe-repo.lab.eng.blr.redhat.com/OCS/ocs-qe-bugs/keerthana/BZ-2267731/

Cluster details:
c1  - https://ocs4-jenkins-csb-odf-qe.apps.ocp-c1.prod.psi.redhat.com/job/qe-deploy-ocs-cluster/37350/
c2  - https://ocs4-jenkins-csb-odf-qe.apps.ocp-c1.prod.psi.redhat.com/job/qe-deploy-ocs-cluster/37348/
hub - https://ocs4-jenkins-csb-odf-qe.apps.ocp-c1.prod.psi.redhat.com/job/qe-deploy-ocs-cluster/37349/
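Since the failure surfaces through the VRG's ClusterDataReady condition, one way to watch whether a re-applied mapping takes effect is to query the VRG on the preferred cluster directly, alongside the stuck PVCs. A minimal sketch, assuming the resource and namespace names shown above and that the Ramen VolumeReplicationGroup resource is addressable by its kind name:

# On kmanohar-c1 (preferred cluster): inspect the VRG conditions the DRPC is surfacing
oc get volumereplicationgroup app-sub-busybox1-placement-1-drpc -n app-sub-busybox1 \
  -o jsonpath='{range .status.conditions[*]}{.type}{"\t"}{.status}{"\t"}{.message}{"\n"}{end}'

# On kmanohar-c2 (surviving cluster): list the stuck PVCs and any finalizers holding them
oc get pvc -n app-sub-busybox1
oc get pvc -n app-sub-busybox1 \
  -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.metadata.finalizers}{"\n"}{end}'

A PVC in Terminating state typically stays there until whatever controller owns its remaining finalizers removes them, which is a separate question from the pool-ID mapping itself.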