Bug 2319334
| Summary: | [RDR] Relocate of ceph fs is stuck in WaitForReadiness | ||
|---|---|---|---|
| Product: | [Red Hat Storage] Red Hat OpenShift Data Foundation | Reporter: | Pratik Surve <prsurve> |
| Component: | odf-dr | Assignee: | Benamar Mekhissi <bmekhiss> |
| odf-dr sub component: | ramen | QA Contact: | krishnaram Karthick <kramdoss> |
| Status: | ASSIGNED --- | Docs Contact: | |
| Severity: | urgent | ||
| Priority: | unspecified | CC: | bmekhiss, edonnell, kramdoss, kseeger, muagarwa, rtalur, sagrawal |
| Version: | 4.16 | Keywords: | Automation, Regression |
| Target Milestone: | --- | ||
| Target Release: | ODF 4.18.0 | ||
| Hardware: | Unspecified | ||
| OS: | Unspecified | ||
| Whiteboard: | |||
| Fixed In Version: | Doc Type: | Known Issue | |
| Doc Text: |
.Relocating of CephFS gets stuck in WaitForReadiness
There is a scenario where the DRPC progression gets stuck in WaitForReadiness. If it remains in this state for an extended period, it's possible that a known issue has occurred, preventing Ramen from updating the PlacementDecision with the new Primary.
As a result, the relocation process will not complete, leaving the workload undeployed on the new primary cluster. This can cause delays in recovery until the user intervenes.
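One way to confirm that the DRPC is still in this state is to read its progression directly. The following is a minimal sketch, assuming the `oc` CLI is logged in to the hub cluster and that the `Progression` field shown by `oc describe drpc` is stored at `.status.progression`. The helper names `drpc_progression` and `is_waiting_for_readiness` are hypothetical and not part of Ramen or `oc`.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the progression of a DRPC.
# Arguments: <namespace> <drpc name>
# Assumption: the field lives at .status.progression.
drpc_progression() {
  oc get drpc -n "$1" "$2" -o jsonpath='{.status.progression}'
}

# Hypothetical helper: succeed only when the given progression string
# is exactly WaitForReadiness.
is_waiting_for_readiness() {
  [ "$1" = "WaitForReadiness" ]
}

# Usage (names taken from the description of this bug):
#   p=$(drpc_progression openshift-gitops busybox-3-placement-cephfs-drpc)
#   is_waiting_for_readiness "$p" && echo "DRPC is still in WaitForReadiness"
```

The check only reports the current progression. How long to wait before treating it as stuck is left to the operator.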
Workaround: Manually update the PlacementDecision to point to the new Primary.
* For workloads using PlacementRule:
1. Edit the PlacementRule
oc edit placementrule --subresource=status -n [namespace] [name of the placementrule]
Example:
oc edit placementrule --subresource=status -n busybox-workloads-cephfs-2 busybox-placement
2. Add the following to the placementrule status.
```
status:
  decisions:
  - clusterName: [primary cluster name]
    reason: [primary cluster name]
```
* For workloads using Placement:
1. Edit the PlacementDecision
oc edit placementdecision --subresource=status -n [namespace] [name of the placementdecision]
Example:
oc edit placementdecision --subresource=status -n openshift-gitops busybox-3-placement-cephfs-decision-1
2. Add the following to the placementdecision status.
```
status:
  decisions:
  - clusterName: [primary cluster name]
    reason: [primary cluster name]
```
As a result, the PlacementDecision is updated and the workload is deployed on the Primary cluster.
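For the Placement case, the same status change could also be applied non-interactively. The following is a sketch only, assuming the installed `oc` version accepts `--subresource=status` for `oc patch` as it does for `oc edit`. The helper names `build_decision_patch` and `patch_placement_decision` are hypothetical. Because a JSON merge patch replaces a list as a whole, any existing entries under `status.decisions` are overwritten.

```shell
#!/usr/bin/env bash
# Hypothetical helper: build a JSON merge patch that sets a single decision.
# Argument: <primary cluster name>
build_decision_patch() {
  printf '{"status":{"decisions":[{"clusterName":"%s","reason":"%s"}]}}' "$1" "$1"
}

# Hypothetical helper: apply the patch to the status subresource of a
# PlacementDecision.
# Arguments: <namespace> <placementdecision name> <primary cluster name>
patch_placement_decision() {
  oc patch placementdecision "$2" -n "$1" --subresource=status --type=merge \
    -p "$(build_decision_patch "$3")"
}

# Usage (names taken from the example above and the description of this bug):
#   patch_placement_decision openshift-gitops \
#     busybox-3-placement-cephfs-decision-1 prsurve-5c2
```

Printing the output of `build_decision_patch` before applying it is a cheap way to review the exact payload.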
|
Story Points: | --- |
| Clone Of: | Environment: | ||
| Last Closed: | Type: | Bug | |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | Category: | --- | |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | |||
| Bug Depends On: | |||
| Bug Blocks: | 2281703, 2320289, 2321510 | ||
Description of problem (please be detailed as possible and provide log snippets):
[RDR] Relocate of ceph fs is stuck in WaitForReadiness

Version of all relevant components (if applicable):
OCS operator 4.16.3-2
Cluster Version 4.16.0-0.nightly-2024-10-12-102620
acm_version 2.11.3
gitops_version 1.14.0
submariner_version 0.18.0

Does this issue impact your ability to continue to work with the product (please explain in detail what is the user impact)?

Is there any workaround available to the best of your knowledge?

Rate from 1 - 5 the complexity of the scenario you performed that caused this bug (1 - very simple, 5 - very complex)?
2

Is this issue reproducible?
yes

Can this issue reproduce from the UI?

If this is a regression, please provide more details to justify this:

Steps to Reproduce:
1. Deploy 4.16.3 RDR cluster
2. Deploy ceph fs workloads
3. Relocate cephfs workloads

Actual results:
oc describe drpc busybox-3-placement-cephfs-drpc -n openshift-gitops
Name:         busybox-3-placement-cephfs-drpc
Namespace:    openshift-gitops
Labels:       cluster.open-cluster-management.io/backup=ramen
Annotations:  drplacementcontrol.ramendr.openshift.io/app-namespace: appset-busybox-3-cephfs
              drplacementcontrol.ramendr.openshift.io/last-app-deployment-cluster: prsurve-5c1
API Version:  ramendr.openshift.io/v1alpha1
Kind:         DRPlacementControl
Metadata:
  Creation Timestamp:  2024-10-16T13:21:33Z
  Finalizers:
    drpc.ramendr.openshift.io/finalizer
  Generation:  2
  Owner References:
    API Version:           cluster.open-cluster-management.io/v1beta1
    Block Owner Deletion:  true
    Controller:            true
    Kind:                  Placement
    Name:                  busybox-3-placement-cephfs
    UID:                   c08571cd-03c5-46f0-a1c5-4f77bea158fd
  Resource Version:        2853670
  UID:                     b63d684d-74b4-4c83-83e3-17a9829b5bc9
Spec:
  Action:  Relocate
  Dr Policy Ref:
    API Version:  ramendr.openshift.io/v1alpha1
    Kind:         DRPolicy
    Name:         odr-policy-5m
  Placement Ref:
    API Version:  cluster.open-cluster-management.io/v1beta1
    Kind:         Placement
    Name:         busybox-3-placement-cephfs
    Namespace:    openshift-gitops
  Preferred Cluster:  prsurve-5c2
  Pvc Selector:
    Match Labels:
      Appname:  busybox_app3_cephfs
Status:
  Action Start Time:  2024-10-16T13:30:33Z
  Conditions:
    Last Transition Time:  2024-10-16T13:30:43Z
    Message:               Waiting for App resources to be restored...)
    Observed Generation:   2
    Reason:                Relocating
    Status:                False
    Type:                  Available
    Last Transition Time:  2024-10-16T13:34:43Z
    Message:               Relocation in progress to cluster "prsurve-5c2"
    Observed Generation:   2
    Reason:                NotStarted
    Status:                False
    Type:                  PeerReady
    Last Transition Time:  2024-10-16T13:34:44Z
    Message:               VolumeReplicationGroup (appset-busybox-3-cephfs/busybox-3-placement-cephfs-drpc) on cluster prsurve-5c2 is progressing on readying workload data (Not all VolSync PVCs are ready), retrying till DataReady condition is met
    Observed Generation:   2
    Reason:                Progressing
    Status:                False
    Type:                  Protected
  Last Group Sync Duration:  36.74055203s
  Last Group Sync Time:      2024-10-16T13:34:34Z
  Last Update Time:          2024-10-16T14:15:48Z
  Observed Generation:       2
  Phase:                     Relocating
  Preferred Decision:
    Cluster Name:       prsurve-5c1
    Cluster Namespace:  prsurve-5c1
  Progression:          WaitForReadiness
  Resource Conditions:
    Conditions:
      Last Transition Time:  2024-10-16T13:34:44Z
      Message:               Not all VolSync PVCs are ready
      Observed Generation:   3
      Reason:                Progressing
      Status:                False
      Type:                  DataReady
      Last Transition Time:  2024-10-16T13:34:44Z
      Message:               Not all VolSync PVCs are protected
      Observed Generation:   3
      Reason:                Progressing
      Status:                False
      Type:                  DataProtected
      Last Transition Time:  2024-10-16T13:34:44Z
      Message:               Not all VolSync PVCs are protected
      Observed Generation:   3
      Reason:                Progressing
      Status:                False
      Type:                  ClusterDataProtected
      Last Transition Time:  2024-10-16T13:34:44Z
      Message:               Restored PVs and PVCs
      Observed Generation:   3
      Reason:                Restored
      Status:                True
      Type:                  ClusterDataReady
    Resource Meta:
      Generation:  3
      Kind:        VolumeReplicationGroup
      Name:        busybox-3-placement-cephfs-drpc
      Namespace:   appset-busybox-3-cephfs
      Protectedpvcs:
        busybox-pvc-7
        busybox-pvc-6
        busybox-pvc-10
        busybox-pvc-5
        busybox-pvc-4
        busybox-pvc-3
        busybox-pvc-1
        busybox-pvc-8
        busybox-pvc-2
        busybox-pvc-9
      Resource Version:  3633777
Events:
  Type     Reason             Age                 From                           Message
  ----     ------             ----                ----                           -------
  Normal   DRPCDeploying      54m (x8 over 54m)   controller_DRPlacementControl  Deploying the application and VRG
  Normal   DRPCDeploySuccess  54m (x8 over 54m)   controller_DRPlacementControl  Successfully deployed the application and VRG
  Warning  unknown state      45m (x14 over 54m)  controller_DRPlacementControl  next state not known

Expected results:
Relocation should happen successfully

Additional info: