Bug 2267731
| Summary: | [RDR] RBD apps fail to Relocate when using stale Ceph pool IDs from replacement cluster | | |
| --- | --- | --- | --- |
| Product: | [Red Hat Storage] Red Hat OpenShift Data Foundation | Reporter: | Annette Clewett <aclewett> |
| Component: | rook | Assignee: | Santosh Pillai <sapillai> |
| Status: | NEW --- | QA Contact: | Neha Berry <nberry> |
| Severity: | high | Docs Contact: | |
| Priority: | unspecified | | |
| Version: | 4.15 | CC: | amagrawa, ebenahar, kbg, kmanohar, kseeger, mrajanna, muagarwa, ndevos, odf-bz-bot, prsurve, sapillai, srangana, tnielsen, uchapaga |
| Target Milestone: | --- | | |
| Target Release: | --- | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | Known Issue |
| Story Points: | --- | | |
| Clone Of: | | Environment: | |
| Last Closed: | | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |

Doc Text:

.RBD applications fail to Relocate when using stale Ceph pool IDs from replacement cluster

When a peer cluster is replaced, the CephBlockPoolID mapping in the CSI configmap cannot be updated automatically. As a result, the RBD PVCs of applications created before the replacement peer cluster existed cannot be mounted.

Workaround: Update the `rook-ceph-csi-mapping-config` configmap on the peer cluster that is not replaced with the CephBlockPoolID mapping for the new peer cluster. This enables mounting the RBD PVC for the application.
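A minimal sketch of applying the workaround, assuming `ocs-storagecluster-cephblockpool` has pool ID 1 on both the surviving and the replacement cluster (as in the environment described below); the real IDs must be taken from `ceph osd pool ls detail` on each cluster:

```
# Sketch only: view the current CSI pool-ID mapping on the surviving cluster.
oc -n openshift-storage get configmap rook-ceph-csi-mapping-config \
  -o jsonpath='{.data.csi-mapping-config-json}{"\n"}'

# Edit the configmap and append the missing CephBlockPoolID entry to the
# csi-mapping-config-json list, keeping any entries that are already there.
oc -n openshift-storage edit configmap rook-ceph-csi-mapping-config
# Example entry (the "1":"1" pool IDs are an assumption for this environment):
#   {"ClusterIDMapping":{"openshift-storage":"openshift-storage"},"RBDPoolIDMapping":[{"1":"1"}]}
```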
Description
Annette Clewett
2024-03-04 17:05:03 UTC
Does the workaround provided in comment #6 work? Moving this to 4.17 to check whether there is a permanent solution other than the workaround. Please move it back to 4.16 if the workaround does not work!

@mrajanna During cluster replacement I observed this behaviour again and performed the suggested workaround. This time, during Relocate, the PVCs on C2 (surviving cluster) are stuck in Terminating state. Kindly help me confirm that the workaround has fully worked, and also explain why the PVCs are stuck.

C1 (recovery cluster)
---

$ ceph osd pool ls detail
pool 1 'ocs-storagecluster-cephblockpool' replicated size 3 min_size 2 crush_rule 9 object_hash rjenkins pg_num 32 pgp_num 32 autoscale_mode on last_change 210 lfor 0/0/26 flags hashpspool,selfmanaged_snaps stripe_width 0 target_size_ratio 0.49 application rbd read_balance_score 1.13
pool 2 'ocs-storagecluster-cephobjectstore.rgw.buckets.non-ec' replicated size 3 min_size 2 crush_rule 13 object_hash rjenkins pg_num 8 pgp_num 8 autoscale_mode on last_change 123 flags hashpspool stripe_width 0 pg_num_min 8 application rgw read_balance_score 1.88
pool 3 'ocs-storagecluster-cephobjectstore.rgw.meta' replicated size 3 min_size 2 crush_rule 12 object_hash rjenkins pg_num 8 pgp_num 8 autoscale_mode on last_change 123 flags hashpspool stripe_width 0 pg_num_min 8 application rgw read_balance_score 1.50

oc get cm -n openshift-storage rook-ceph-csi-mapping-config -o yaml
apiVersion: v1
data:
  csi-mapping-config-json: '[{"ClusterIDMapping":{"openshift-storage":"openshift-storage"},"RBDPoolIDMapping":[{"1":"1"}]}]'
kind: ConfigMap
metadata:
  creationTimestamp: "2024-05-22T06:02:15Z"
  name: rook-ceph-csi-mapping-config
  namespace: openshift-storage
  ownerReferences:
  - apiVersion: apps/v1
    blockOwnerDeletion: false
    controller: true
    kind: Deployment
    name: rook-ceph-operator
    uid: 2dd3eaf4-caaf-4b05-ab07-542e168f1887
  resourceVersion: "3906057"
  uid: d782c3b2-5715-4842-9127-815273c8d8fe

C2 (surviving cluster)
---

$ ceph osd pool ls detail
pool 1 'ocs-storagecluster-cephblockpool' replicated size 3 min_size 2 crush_rule 9 object_hash rjenkins pg_num 32 pgp_num 32 autoscale_mode on last_change 2408 lfor 0/0/38 flags hashpspool,selfmanaged_snaps stripe_width 0 target_size_ratio 0.49 application rbd read_balance_score 1.13
pool 2 'ocs-storagecluster-cephobjectstore.rgw.log' replicated size 3 min_size 2 crush_rule 11 object_hash rjenkins pg_num 8 pgp_num 8 autoscale_mode on last_change 2041 flags hashpspool stripe_width 0 pg_num_min 8 application rgw read_balance_score 1.88
pool 3 'ocs-storagecluster-cephobjectstore.rgw.control' replicated size 3 min_size 2 crush_rule 13 object_hash rjenkins pg_num 8 pgp_num 8 autoscale_mode on last_change 2041 flags hashpspool stripe_width 0 pg_num_min 8 application rgw read_balance_score 1.50

Edited the config map by adding the 1:1 mapping (output shown below, after the sketch); I believe this is the right way.
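For reference, a minimal sketch of how the pool IDs used in that mapping can be read on each managed cluster; it assumes the rook-ceph-tools deployment is enabled in `openshift-storage` (the pod label below is the usual one for the toolbox, verify it in your environment):

```
# Find the toolbox pod and list the RBD pool with its ID (run on each cluster).
TOOLS_POD=$(oc -n openshift-storage get pod -l app=rook-ceph-tools \
  -o jsonpath='{.items[0].metadata.name}')
oc -n openshift-storage exec "$TOOLS_POD" -- \
  ceph osd pool ls detail | grep ocs-storagecluster-cephblockpool
# The number after "pool" is the pool ID used in RBDPoolIDMapping
# (here both clusters report pool 1 for ocs-storagecluster-cephblockpool).
```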
oc get cm -n openshift-storage rook-ceph-csi-mapping-config -o yaml
apiVersion: v1
data:
  csi-mapping-config-json: '[{"ClusterIDMapping":{"openshift-storage":"openshift-storage"},"RBDPoolIDMapping":[{"3":"1"}]},{"ClusterIDMapping":{"openshift-storage":"openshift-storage"},"RBDPoolIDMapping":[{"1":"1"}]}]'   <---- Here I have added the 1:1 pool mapping
kind: ConfigMap
metadata:
  creationTimestamp: "2024-05-20T10:19:47Z"
  name: rook-ceph-csi-mapping-config
  namespace: openshift-storage
  ownerReferences:
  - apiVersion: apps/v1
    blockOwnerDeletion: false
    controller: true
    kind: Deployment
    name: rook-ceph-operator
    uid: d79ca804-3a2f-4f7e-96fb-76780814bd38
  resourceVersion: "5653111"
  uid: f235cd1e-24b7-4062-bd76-ebca186801b9

HUB
---

oc get drpc app-sub-busybox1-placement-1-drpc -o yaml
apiVersion: ramendr.openshift.io/v1alpha1
kind: DRPlacementControl
metadata:
  annotations:
    drplacementcontrol.ramendr.openshift.io/app-namespace: app-sub-busybox1
    drplacementcontrol.ramendr.openshift.io/last-app-deployment-cluster: kmanohar-c1
  creationTimestamp: "2024-05-22T08:20:37Z"
  finalizers:
  - drpc.ramendr.openshift.io/finalizer
  generation: 2
  labels:
    cluster.open-cluster-management.io/backup: ramen
  name: app-sub-busybox1-placement-1-drpc
  namespace: app-sub-busybox1
  ownerReferences:
  - apiVersion: cluster.open-cluster-management.io/v1beta1
    blockOwnerDeletion: true
    controller: true
    kind: Placement
    name: app-sub-busybox1-placement-1
    uid: 353a0fa5-2d42-4b02-b887-a1ecd5ed4e73
  resourceVersion: "5409642"
  uid: 05db7b13-1aac-4b6b-883b-54ea204fe8b5
spec:
  action: Relocate
  drPolicyRef:
    apiVersion: ramendr.openshift.io/v1alpha1
    kind: DRPolicy
    name: dr-policy-10mn
  failoverCluster: kmanohar-c2
  placementRef:
    apiVersion: cluster.open-cluster-management.io/v1beta1
    kind: Placement
    name: app-sub-busybox1-placement-1
    namespace: app-sub-busybox1
  preferredCluster: kmanohar-c1
  pvcSelector:
    matchLabels:
      appname: busybox_app1
status:
  actionStartTime: "2024-05-22T09:38:21Z"
  conditions:
  - lastTransitionTime: "2024-05-22T09:40:22Z"
    message: Completed
    observedGeneration: 2
    reason: Relocated
    status: "True"
    type: Available
  - lastTransitionTime: "2024-05-22T09:39:20Z"
    message: Relocation in progress to cluster "kmanohar-c1"
    observedGeneration: 2
    reason: NotStarted
    status: "False"
    type: PeerReady
  - lastTransitionTime: "2024-05-22T12:12:11Z"
    message: 'VolumeReplicationGroup (app-sub-busybox1/app-sub-busybox1-placement-1-drpc) on cluster kmanohar-c1 is reporting errors (Failed to restore PVs/PVCs: failed to restore PV/PVC for VolRep (failed to restore PVs and PVCs using profile list ([s3profile-kmanohar-c1-ocs-storagecluster s3profile-kmanohar-c2-ocs-storagecluster]): failed to restore all []v1.PersistentVolumeClaim. Total/Restored 20/0)) restoring workload resources, retrying till ClusterDataReady condition is met'
    observedGeneration: 2
    reason: Error
    status: "False"
    type: Protected
  lastGroupSyncBytes: 87576576
  lastGroupSyncDuration: 1s
  lastGroupSyncTime: "2024-05-22T09:30:01Z"
  lastUpdateTime: "2024-05-22T16:48:09Z"
  observedGeneration: 2
  phase: Relocated
  preferredDecision:
    clusterName: kmanohar-c2
    clusterNamespace: kmanohar-c2
  progression: Cleaning Up
  resourceConditions:
    conditions:
    - lastTransitionTime: "2024-05-22T12:11:40Z"
      message: Initializing VolumeReplicationGroup
      observedGeneration: 1
      reason: Initializing
      status: Unknown
      type: DataReady
    - lastTransitionTime: "2024-05-22T12:11:40Z"
      message: Initializing VolumeReplicationGroup
      observedGeneration: 1
      reason: Initializing
      status: Unknown
      type: DataProtected
    - lastTransitionTime: "2024-05-22T12:11:51Z"
      message: 'Failed to restore PVs/PVCs: failed to restore PV/PVC for VolRep (failed to restore PVs and PVCs using profile list ([s3profile-kmanohar-c1-ocs-storagecluster s3profile-kmanohar-c2-ocs-storagecluster]): failed to restore all []v1.PersistentVolumeClaim. Total/Restored 20/0)'
      observedGeneration: 1
      reason: Error
      status: "False"
      type: ClusterDataReady
    - lastTransitionTime: "2024-05-22T12:11:40Z"
      message: Initializing VolumeReplicationGroup
      observedGeneration: 1
      reason: Initializing
      status: Unknown
      type: ClusterDataProtected
    resourceMeta:
      generation: 1
      kind: VolumeReplicationGroup
      name: app-sub-busybox1-placement-1-drpc
      namespace: app-sub-busybox1

Additional Info:
----------------
--> I did a node restart of compute-0 and compute-2, but that didn't help; a respin of the rbdplugin-provisioner didn't help either.
--> CephFS-based application relocate was successful.
--> Submariner connectivity is intact.

Must gather - http://rhsqe-repo.lab.eng.blr.redhat.com/OCS/ocs-qe-bugs/keerthana/BZ-2267731/

Cluster details
c1 - https://ocs4-jenkins-csb-odf-qe.apps.ocp-c1.prod.psi.redhat.com/job/qe-deploy-ocs-cluster/37350/
c2 - https://ocs4-jenkins-csb-odf-qe.apps.ocp-c1.prod.psi.redhat.com/job/qe-deploy-ocs-cluster/37348/
hub - https://ocs4-jenkins-csb-odf-qe.apps.ocp-c1.prod.psi.redhat.com/job/qe-deploy-ocs-cluster/37349/
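Since the PVCs on C2 are reported stuck in Terminating, a minimal sketch of how to see what is holding them, using the application namespace `app-sub-busybox1` from this report; the VolumeReplication check applies only if the csi-addons CRDs are installed, and `<pvc-name>` is a placeholder:

```
# List PVCs in the application namespace and spot any stuck in Terminating.
oc -n app-sub-busybox1 get pvc

# For a stuck PVC, show the deletion timestamp and the finalizers that block deletion.
oc -n app-sub-busybox1 get pvc <pvc-name> \
  -o jsonpath='{.metadata.deletionTimestamp}{"\n"}{.metadata.finalizers}{"\n"}'

# Check whether a VolumeReplication resource still references the PVC and what
# state it reports.
oc -n app-sub-busybox1 get volumereplication
```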