Bug 2263488
Summary: | [RDR] [Hub recovery] [Co-situated] Cleanup remains stuck after failover when older primary cluster is recovered | ||
---|---|---|---|
Product: | [Red Hat Storage] Red Hat OpenShift Data Foundation | Reporter: | Aman Agrawal <amagrawa> |
Component: | odf-dr | Assignee: | Shyamsundar <srangana> |
odf-dr sub component: | ramen | QA Contact: | Aman Agrawal <amagrawa> |
Status: | CLOSED ERRATA | Docs Contact: | |
Severity: | high | ||
Priority: | unspecified | CC: | ebenahar, kramdoss, kseeger, muagarwa, srangana |
Version: | 4.14 | ||
Target Milestone: | --- | ||
Target Release: | ODF 4.16.0 | ||
Hardware: | Unspecified | ||
OS: | Unspecified | ||
Whiteboard: | verification-blocked | ||
Fixed In Version: | 4.15.0-149 | Doc Type: | No Doc Update |
Doc Text: | Story Points: | --- | |
Clone Of: | Environment: | ||
Last Closed: | 2024-07-17 13:13:48 UTC | Type: | Bug |
Regression: | --- | Mount Type: | --- |
Documentation: | --- | CRM: | |
Verified Versions: | Category: | --- | |
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
Cloudforms Team: | --- | Target Upstream Version: | |
Embargoed: |
Description
Aman Agrawal
2024-02-09 09:45:50 UTC
Removing blocker flag with reduced severity as co-situated is TP for ODF 4.15.

Root cause (see the hedged inspection sketch further below):
- On hub recovery and the subsequent failover of workloads to the surviving cluster (c2), the VRG on the currently unreachable cluster (c1) is left as Primary, with no ManifestWork on the hub to track this VRG.
- Post failover, when the failed cluster is recovered, the cleanup phase detects no VRG ManifestWork at the hub and assumes it has been deleted.
- The hub reconciler then keeps waiting for the ManagedClusterView to report that the VRG has been garbage collected at the failed cluster. This never happens, as the VRG is still Primary and its AppliedManifestWork on the managed cluster is still running the eviction timer.
- When the eviction timer expires, the VRG is garbage collected, but as Primary, which can lead to various consequences:
  - The VR is deleted as Primary, potentially causing image garbage collection.
  - A future relocate or failover to this cluster will fail, as the images were never demoted and marked as clean secondaries.

-------------------------

This call to cleanupSecondaries will keep failing until the AppliedManifestWork is evicted:
https://github.com/RamenDR/ramen/blob/34891bf43bad0ff27262a3fbf4db2356e3e189fd/controllers/drplacementcontrol.go#L1729

-------------------------

Log instance:

2024-02-09T12:20:00.063Z INFO controllers.DRPlacementControl controllers/drplacementcontrol.go:1693 ensuring cleanup on secondaries {"DRPC": "busybox-workloads-1/rbd-sub-busybox1-placement-1-drpc", "rid": "a93326a0-bb7e-4cfd-80f9-a812d12f8250"}
2024-02-09T12:20:00.063Z INFO controllers.DRPlacementControl controllers/drplacementcontrol.go:1714 PeerReady Condition &Condition{Type:PeerReady,Status:False,ObservedGeneration:2,LastTransitionTime:2024-02-08 15:12:30 +0000 UTC,Reason:Cleaning,Message:cleaning secondaries,} {"DRPC": "busybox-workloads-1/rbd-sub-busybox1-placement-1-drpc", "rid": "a93326a0-bb7e-4cfd-80f9-a812d12f8250"}
2024-02-09T12:20:00.063Z INFO controllers.DRPlacementControl controllers/drplacementcontrolvolsync.go:193 Checking if there are PVCs for VolSync replication... {"DRPC": "busybox-workloads-1/rbd-sub-busybox1-placement-1-drpc", "rid": "a93326a0-bb7e-4cfd-80f9-a812d12f8250", "cluster": "amagrawa-415-c2"}
2024-02-09T12:20:00.063Z INFO MCV util/mcv_util.go:231 Get managedClusterResource Returned the following MCV Conditions: [{Processing True 0 2024-02-09 07:49:26 +0000 UTC GetResourceProcessing Watching resources successfully}] {"resourceName": "rbd-sub-busybox1-placement-1-drpc", "cluster": "amagrawa-415-c1"}
2024-02-09T12:20:00.063Z INFO controllers.DRPlacementControl controllers/drplacementcontrol.go:1820 Ensuring MW for the VRG is deleted {"DRPC": "busybox-workloads-1/rbd-sub-busybox1-placement-1-drpc", "rid": "a93326a0-bb7e-4cfd-80f9-a812d12f8250", "cluster": "amagrawa-415-c1"}
2024-02-09T12:20:00.063Z INFO controllers.DRPlacementControl controllers/drplacementcontrol.go:1358 MW has been deleted. Check the VRG {"DRPC": "busybox-workloads-1/rbd-sub-busybox1-placement-1-drpc", "rid": "a93326a0-bb7e-4cfd-80f9-a812d12f8250"}
2024-02-09T12:20:00.063Z INFO MCV util/mcv_util.go:231 Get managedClusterResource Returned the following MCV Conditions: [{Processing True 0 2024-02-09 07:49:26 +0000 UTC GetResourceProcessing Watching resources successfully}] {"resourceName": "rbd-sub-busybox1-placement-1-drpc", "cluster": "amagrawa-415-c1"}
2024-02-09T12:20:00.064Z INFO controllers.DRPlacementControl controllers/drplacementcontrol.go:1361 VRG has not been deleted yet {"DRPC": "busybox-workloads-1/rbd-sub-busybox1-placement-1-drpc", "rid": "a93326a0-bb7e-4cfd-80f9-a812d12f8250", "cluster": "amagrawa-415-c1"}
2024-02-09T12:20:00.064Z INFO controllers.DRPlacementControl controllers/drplacementcontrol_controller.go:1856 Found ClusterDecision {"ClsDedicision": [{"clusterName":"amagrawa-415-c2","reason":"amagrawa-415-c2"}]}
2024-02-09T12:20:00.064Z INFO controllers.DRPlacementControl controllers/drplacementcontrol.go:85 Process placement {"DRPC": "busybox-workloads-1/rbd-sub-busybox1-placement-1-drpc", "rid": "a93326a0-bb7e-4cfd-80f9-a812d12f8250", "error": "waiting to clean secondaries"}
2024-02-09T12:20:00.064Z INFO controllers.DRPlacementControl controllers/drplacementcontrol_controller.go:950 Finished processing {"DRPC": "busybox-workloads-1/rbd-sub-busybox1-placement-1-drpc", "rid": "a93326a0-bb7e-4cfd-80f9-a812d12f8250", "Requeue?": true}
2024-02-09T12:20:00.064Z INFO controllers.DRPlacementControl controllers/drplacementcontrol_controller.go:965 Requeing... {"DRPC": "busybox-workloads-1/rbd-sub-busybox1-placement-1-drpc", "rid": "a93326a0-bb7e-4cfd-80f9-a812d12f8250"}
2024-02-09T12:20:00.064Z INFO controllers.DRPlacementControl controllers/drplacementcontrol_controller.go:763 Exiting reconcile loop {"DRPC": "busybox-workloads-1/rbd-sub-busybox1-placement-1-drpc", "rid": "a93326a0-bb7e-4cfd-80f9-a812d12f8250"}

Workaround (for now), with a hedged command-level sketch below this comment exchange:
- Edit the VRG to Secondary on the recovered cluster (with spec.action Failover so that a force resync is performed)
- Wait until the VRG reports status.state as Secondary
- Delete the VRG on the ManagedCluster once it reports Secondary
- Delete the workload AppliedManifestWork on the ManagedCluster
  - NOTE: This is optional, as after the 24h window the Subscription workloads are automatically garbage collected (this is for hub recovery cases)
- The DRPC at the hub, on a subsequent reconcile, will report PeerReady and clean status as required

The github link posted is an upstream issue and not a PR, setting this back to ASSIGNED.

Upstream PR under review and test: https://github.com/RamenDR/ramen/pull/1208

Shyam, cleanup happens after the 24h eviction period after switching to the appset pull model starting with ACM 2.10. Shall we migrate this bug similar to BZ2268594?

(In reply to Aman Agrawal from comment #19)
> Shyam, cleanup happens after 24hrs eviction period after switching to appset
> pull model starting ACM 2.10. Shall we migrate this bug similar to BZ2268594?

The fix here is more than the pull model, as stated in the root causing of the issue at comment #4.
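The stuck state described in the root cause above can be inspected roughly as follows. This is a minimal sketch and not taken from the bug itself; the namespace, DRPC name, and managed-cluster namespace (busybox-workloads-1, rbd-sub-busybox1-placement-1-drpc, amagrawa-415-c1) are placeholders reused from the log above, and the vrg-mw suffix follows the ManifestWork/AppliedManifestWork naming visible later in this bug:

# On the hub: the DRPC stays in cleanup with PeerReady=False
oc get drpc rbd-sub-busybox1-placement-1-drpc -n busybox-workloads-1 \
  -o jsonpath='{.status.conditions[?(@.type=="PeerReady")]}{"\n"}'

# On the hub: no VRG ManifestWork exists in the failed cluster's namespace,
# which is what makes the reconciler assume the VRG was already deleted
oc get manifestwork -n amagrawa-415-c1 | grep vrg-mw

# On the recovered managed cluster (c1): the VRG is still Primary and is only
# referenced by an AppliedManifestWork that is waiting out its eviction timer
oc get vrg -n busybox-workloads-1 -o wide
oc get appliedmanifestwork | grep vrg-mw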
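And a minimal, hedged sketch of the workaround steps listed above, run against the recovered managed cluster (c1). The namespace and VRG name are placeholders, the 15m timeout is arbitrary, and the spec fields (replicationState, action) and the status.state path match the VRG objects shown further down in this bug:

NS=busybox-workloads-1
VRG=rbd-sub-busybox1-placement-1-drpc

# 1. Mark the VRG Secondary, keeping spec.action Failover so a force resync is performed
oc patch vrg "$VRG" -n "$NS" --type merge \
  -p '{"spec":{"replicationState":"secondary","action":"Failover"}}'

# 2. Wait until the VRG reports status.state: Secondary
oc wait vrg "$VRG" -n "$NS" --for=jsonpath='{.status.state}'=Secondary --timeout=15m

# 3. Delete the VRG on the ManagedCluster once it is Secondary
oc delete vrg "$VRG" -n "$NS"

# 4. Optional: delete the workload AppliedManifestWork (otherwise it is
#    garbage collected automatically after the 24h eviction window)
oc get appliedmanifestwork | grep "$NS"
# oc delete appliedmanifestwork <name from the previous command>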
Here is a test case that would have had unwanted consequences before the fix and should be fine after the fix:
- Run workload at C1
- Hub and C1 are down and hub is recovered to Hub'
- Failover workload to C2
- Recover C1
- Wait for 24h (or less, if there are settings to lower this, so the test can happen faster)
- DRPC for the workload would show peer ready status, but volumes/images would potentially still be primary on C1, or mirroring would not proceed
  - Because DRPC originally assumed that the VRG is deleted, as its MW on the hub is missing

After the fix, the last step would still show the DRPC as peer ready, but the images would also be in the right state on C1.

Tested with following versions:
OCP 4.16.0-0.nightly-2024-04-26-145258
ODF 4.16.0-89.stable
ceph version 18.2.1-136.el9cp (e7edde2b655d0dd9f860dda675f9d7954f07e6e3) reef (stable)
ACM 2.10.2 GA'ed
MCE 2.5.2
Gitops 1.12.1
OADP 1.3.0
Submariner 0.17.0 GA'ed
Platform: VMware

After a site failure (bringing the active hub and the primary managed cluster C1 down), performed hub recovery and moved to the new hub (passive). Then failed over all the workloads from C1 to the C2 secondary managed cluster. After successful failover, when the C1 managed cluster was recovered, it led to the 2 observations below (a hedged sketch of the pre-DR-action checks follows after these outputs).

C1-

For RBD
====================================================================================================================================>

VRG DESIREDSTATE is marked as Secondary, as per the fix and comment #20, during the eviction period, which currently times out after 24hrs with ACM 2.10.2.

oc project busybox-workloads-10; oc get pvc,vr,vrg,pods -o wide
Now using project "busybox-workloads-10" on server "https://api.amagrawa-c1-29a.qe.rh-ocs.com:6443".

NAME                                   STATUS   VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS                  VOLUMEATTRIBUTESCLASS   AGE   VOLUMEMODE
persistentvolumeclaim/busybox-pvc-41   Bound    pvc-172643b0-9f9b-4e41-b3df-4ae35466ccc4   42Gi       RWO            ocs-storagecluster-ceph-rbd   <unset>                 28h   Filesystem

NAME                                                                AGE   VOLUMEREPLICATIONCLASS                 PVCNAME          DESIREDSTATE   CURRENTSTATE
volumereplication.replication.storage.openshift.io/busybox-pvc-41   28h   rbd-volumereplicationclass-473128587   busybox-pvc-41   primary        Primary

NAME                                                                             DESIREDSTATE   CURRENTSTATE
volumereplicationgroup.ramendr.openshift.io/rbd-sub-busybox10-placement-1-drpc   secondary      Primary

NAME                              READY   STATUS    RESTARTS   AGE   IP            NODE        NOMINATED NODE   READINESS GATES
pod/busybox-41-5c55b45d49-5wf4z   1/1     Running   1          28h   10.131.1.33   compute-0   <none>           <none>

oc project busybox-workloads-2; oc get pvc,vr,vrg,pods -o wide
Now using project "busybox-workloads-2" on server "https://api.amagrawa-c1-29a.qe.rh-ocs.com:6443".

NAME                                   STATUS   VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS                  VOLUMEATTRIBUTESCLASS   AGE   VOLUMEMODE
persistentvolumeclaim/busybox-pvc-41   Bound    pvc-6eda8199-76eb-4fbe-95f2-c3f81f64a97f   42Gi       RWO            ocs-storagecluster-ceph-rbd   <unset>                 16d   Filesystem

NAME                                                                AGE   VOLUMEREPLICATIONCLASS                 PVCNAME          DESIREDSTATE   CURRENTSTATE
volumereplication.replication.storage.openshift.io/busybox-pvc-41   16d   rbd-volumereplicationclass-473128587   busybox-pvc-41   primary        Primary

NAME                                                                             DESIREDSTATE   CURRENTSTATE
volumereplicationgroup.ramendr.openshift.io/rbd-appset-busybox2-placement-drpc   secondary      Primary

NAME                              READY   STATUS    RESTARTS   AGE   IP            NODE        NOMINATED NODE   READINESS GATES
pod/busybox-41-5c55b45d49-g478p   1/1     Running   1          2d    10.129.2.66   compute-2   <none>           <none>

oc project vm-appset-1; oc get pvc,vr,vrg,pods -o wide
Now using project "vm-appset-1" on server "https://api.amagrawa-c1-29a.qe.rh-ocs.com:6443".
NAME                             STATUS   VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS                                 VOLUMEATTRIBUTESCLASS   AGE   VOLUMEMODE
persistentvolumeclaim/vm-1-pvc   Bound    pvc-e4be6118-0c4a-4be6-9354-c98fa6f02878   512Mi      RWX            ocs-storagecluster-ceph-rbd-virtualization   <unset>                 27h   Block

NAME                                                           AGE   VOLUMEREPLICATIONCLASS                 PVCNAME    DESIREDSTATE   CURRENTSTATE
volumereplication.replication.storage.openshift.io/vm-1-pvc   27h   rbd-volumereplicationclass-473128587   vm-1-pvc   primary        Primary

NAME                                                                      DESIREDSTATE   CURRENTSTATE
volumereplicationgroup.ramendr.openshift.io/vm-appset-1-placement-drpc   secondary      Primary

NAME                                    READY   STATUS    RESTARTS   AGE    IP            NODE        NOMINATED NODE   READINESS GATES
pod/virt-launcher-vm-workload-1-vlwhh   2/2     Running   0          121m   10.131.0.26   compute-0   <none>           1/1

oc project vm-sub-1; oc get pvc,vr,vrg,pods -o wide
Now using project "vm-sub-1" on server "https://api.amagrawa-c1-29a.qe.rh-ocs.com:6443".

NAME                             STATUS   VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS                                 VOLUMEATTRIBUTESCLASS   AGE   VOLUMEMODE
persistentvolumeclaim/vm-1-pvc   Bound    pvc-fabc45eb-a5ed-4021-b8f5-74539791b5e0   512Mi      RWX            ocs-storagecluster-ceph-rbd-virtualization   <unset>                 28h   Block

NAME                                                           AGE   VOLUMEREPLICATIONCLASS                 PVCNAME    DESIREDSTATE   CURRENTSTATE
volumereplication.replication.storage.openshift.io/vm-1-pvc   27h   rbd-volumereplicationclass-473128587   vm-1-pvc   primary        Primary

NAME                                                                   DESIREDSTATE   CURRENTSTATE
volumereplicationgroup.ramendr.openshift.io/vm-sub-1-placement-drpc   secondary      Primary

NAME                                    READY   STATUS    RESTARTS   AGE    IP            NODE        NOMINATED NODE   READINESS GATES
pod/virt-launcher-vm-workload-1-wbtlt   2/2     Running   0          122m   10.131.0.25   compute-0   <none>           1/1

oc get vrg -o yaml

apiVersion: v1
items:
- apiVersion: ramendr.openshift.io/v1alpha1
  kind: VolumeReplicationGroup
  metadata:
    annotations:
      drplacementcontrol.ramendr.openshift.io/destination-cluster: amagrawa-c1-29a
      drplacementcontrol.ramendr.openshift.io/do-not-delete-pvc: ""
      drplacementcontrol.ramendr.openshift.io/drpc-uid: 3ff037e5-d2c8-41e1-b068-505aee5b4f65
    creationTimestamp: "2024-05-15T08:27:27Z"
    finalizers:
    - volumereplicationgroups.ramendr.openshift.io/vrg-protection
    generation: 2
    name: vm-sub-1-placement-drpc
    namespace: vm-sub-1
    ownerReferences:
    - apiVersion: work.open-cluster-management.io/v1
      kind: AppliedManifestWork
      name: c2a75a8d0ed3e51e58c9a6d4bd73c5f7d66b7929d77fc41e8c830d54778f0e7e-vm-sub-1-placement-drpc-vm-sub-1-vrg-mw
      uid: d7fc2aea-d88e-4ad0-bd5c-d9c8bd1ed3d6
    - apiVersion: work.open-cluster-management.io/v1
      kind: AppliedManifestWork
      name: 661184cbe6aabc283e2f4acb234afb291390b8b4b3dd10af342eca0c4e7e3f41-vm-sub-1-placement-drpc-vm-sub-1-vrg-mw
      uid: 59025d22-0e91-49a7-b02e-94e8acbaa805
    resourceVersion: "33086179"
    uid: 63b94b02-ea99-4a0e-a844-4c97604fd08b
  spec:
    action: Failover
    async:
      replicationClassSelector: {}
      schedulingInterval: 10m
      volumeSnapshotClassSelector: {}
    pvcSelector:
      matchLabels:
        appname: kubevirt
    replicationState: secondary
    s3Profiles:
    - s3profile-amagrawa-c1-29a-ocs-storagecluster
    - s3profile-amagrawa-c2-29a-ocs-storagecluster
    volSync: {}
  status:
    conditions:
    - lastTransitionTime: "2024-05-16T10:36:17Z"
      message: VolumeReplicationGroup is progressing
      observedGeneration: 2
      reason: Progressing
      status: "False"
      type: DataReady
    - lastTransitionTime: "2024-05-16T10:36:17Z"
      message: VolumeReplicationGroup is replicating
      observedGeneration: 2
      reason: Replicating
      status: "False"
      type: DataProtected
    - lastTransitionTime: "2024-05-15T08:27:27Z"
      message: Nothing to restore
      observedGeneration: 1
      reason: Restored
      status: "True"
      type: ClusterDataReady
    - lastTransitionTime: "2024-05-16T10:36:17Z"
      message: Cluster data of all PVs are protected
      observedGeneration: 2
      reason: Uploaded
      status: "True"
      type: ClusterDataProtected
    kubeObjectProtection: {}
    lastGroupSyncBytes: 1216512
    lastGroupSyncDuration: 0s
    lastGroupSyncTime: "2024-05-16T07:30:00Z"
    lastUpdateTime: "2024-05-16T10:36:17Z"
    observedGeneration: 2
    protectedPVCs:
    - accessModes:
      - ReadWriteMany
      conditions:
      - lastTransitionTime: "2024-05-16T10:36:17Z"
        message: Secondary transition failed as PVC is potentially in use by a pod
        observedGeneration: 2
        reason: Progressing
        status: "False"
        type: DataReady
      - lastTransitionTime: "2024-05-15T08:31:48Z"
        message: 'Done uploading PV/PVC cluster data to 2 of 2 S3 profile(s): [s3profile-amagrawa-c1-29a-ocs-storagecluster s3profile-amagrawa-c2-29a-ocs-storagecluster]'
        observedGeneration: 1
        reason: Uploaded
        status: "True"
        type: ClusterDataProtected
      - lastTransitionTime: "2024-05-15T08:31:58Z"
        message: PVC in the VolumeReplicationGroup is ready for use
        observedGeneration: 1
        reason: Replicating
        status: "False"
        type: DataProtected
      csiProvisioner: openshift-storage.rbd.csi.ceph.com
      labels:
        app: vm-sub-1
        app.kubernetes.io/part-of: vm-sub-1
        appname: kubevirt
        ramendr.openshift.io/owner-name: vm-sub-1-placement-drpc
        ramendr.openshift.io/owner-namespace-name: vm-sub-1
      lastSyncBytes: 1216512
      lastSyncDuration: 0s
      lastSyncTime: "2024-05-16T07:30:00Z"
      name: vm-1-pvc
      namespace: vm-sub-1
      replicationID:
        id: 700831d730ddf244ace1823c7f12d67486b8179
        modes:
        - Failover
      resources:
        requests:
          storage: 512Mi
      storageClassName: ocs-storagecluster-ceph-rbd-virtualization
      storageID:
        id: 73451a4c-d40b-431b-a9a4-4cf2c45e3025
    state: Primary
kind: List
metadata:
  resourceVersion: ""

For CephFS
====================================================================================================================================>

VRG DESIREDSTATE and CURRENTSTATE are marked as Secondary during the eviction period, which currently times out after 24hrs with ACM 2.10.2. However, the replicationdestinations will not be created on the recovered cluster C1 until the eviction period times out, hence data sync will not resume until then.

oc project busybox-workloads-15; oc get pvc,vr,vrg,pods -o wide
Already on project "busybox-workloads-15" on server "https://api.amagrawa-c1-29a.qe.rh-ocs.com:6443".

NAME                                  STATUS   VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS                VOLUMEATTRIBUTESCLASS   AGE   VOLUMEMODE
persistentvolumeclaim/busybox-pvc-1   Bound    pvc-4f5a1fe1-df46-45b7-83bf-65513f7ee9a9   94Gi       RWX            ocs-storagecluster-cephfs   <unset>                 15d   Filesystem

NAME                                                                                 DESIREDSTATE   CURRENTSTATE
volumereplicationgroup.ramendr.openshift.io/cephfs-sub-busybox15-placement-1-drpc   secondary      Secondary

NAME                             READY   STATUS    RESTARTS   AGE   IP            NODE        NOMINATED NODE   READINESS GATES
pod/busybox-1-7f9b67dc95-bwgbz   1/1     Running   1          2d    10.129.2.85   compute-2   <none>           <none>

oc project busybox-workloads-5; oc get pvc,vr,vrg,pods -o wide
Now using project "busybox-workloads-5" on server "https://api.amagrawa-c1-29a.qe.rh-ocs.com:6443".
NAME                                  STATUS   VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS                VOLUMEATTRIBUTESCLASS   AGE   VOLUMEMODE
persistentvolumeclaim/busybox-pvc-1   Bound    pvc-b6236a7c-9997-4f06-96fe-adc1dbde808c   94Gi       RWX            ocs-storagecluster-cephfs   <unset>                 15d   Filesystem

NAME                                                                                 DESIREDSTATE   CURRENTSTATE
volumereplicationgroup.ramendr.openshift.io/cephfs-appset-busybox5-placement-drpc   secondary      Secondary

NAME                             READY   STATUS    RESTARTS   AGE   IP            NODE        NOMINATED NODE   READINESS GATES
pod/busybox-1-7f9b67dc95-stj6m   1/1     Running   1          28h   10.131.1.49   compute-0   <none>           <none>

oc get vrg -o yaml

apiVersion: v1
items:
- apiVersion: ramendr.openshift.io/v1alpha1
  kind: VolumeReplicationGroup
  metadata:
    annotations:
      drplacementcontrol.ramendr.openshift.io/destination-cluster: amagrawa-c1-29a
      drplacementcontrol.ramendr.openshift.io/do-not-delete-pvc: ""
      drplacementcontrol.ramendr.openshift.io/drpc-uid: 6d40263e-98f3-4160-a38e-de00a9d09cf5
    creationTimestamp: "2024-04-30T12:59:13Z"
    finalizers:
    - volumereplicationgroups.ramendr.openshift.io/vrg-protection
    generation: 15
    name: cephfs-appset-busybox5-placement-drpc
    namespace: busybox-workloads-5
    ownerReferences:
    - apiVersion: work.open-cluster-management.io/v1
      kind: AppliedManifestWork
      name: c2a75a8d0ed3e51e58c9a6d4bd73c5f7d66b7929d77fc41e8c830d54778f0e7e-cephfs-appset-busybox5-placement-drpc-busybox-workloads-5-vrg-mw
      uid: a41842c2-1c03-4a9c-84f1-a09fe24c2fdd
    - apiVersion: work.open-cluster-management.io/v1
      kind: AppliedManifestWork
      name: 661184cbe6aabc283e2f4acb234afb291390b8b4b3dd10af342eca0c4e7e3f41-cephfs-appset-busybox5-placement-drpc-busybox-workloads-5-vrg-mw
      uid: 728a2979-3ad8-4995-896d-b37f6444f4d2
    resourceVersion: "33084889"
    uid: 559fdb14-58ef-41a5-bbe7-74388f5a370b
  spec:
    action: Failover
    async:
      replicationClassSelector: {}
      schedulingInterval: 5m
      volumeSnapshotClassSelector: {}
    pvcSelector:
      matchLabels:
        appname: busybox_app3_cephfs
    replicationState: secondary
    s3Profiles:
    - s3profile-amagrawa-c1-29a-ocs-storagecluster
    - s3profile-amagrawa-c2-29a-ocs-storagecluster
    volSync:
      rdSpec:
      - protectedPVC:
          accessModes:
          - ReadWriteMany
          conditions:
          - lastTransitionTime: "2024-05-16T08:00:46Z"
            message: Ready
            observedGeneration: 18
            reason: SourceInitialized
            status: "True"
            type: ReplicationSourceSetup
          - lastTransitionTime: "2024-05-16T08:00:10Z"
            message: PVC restored
            observedGeneration: 17
            reason: Restored
            status: "True"
            type: PVsRestored
          labels:
            app.kubernetes.io/instance: cephfs-appset-busybox5-amagrawa-c2-29a
            appname: busybox_app3_cephfs
          name: busybox-pvc-1
          namespace: busybox-workloads-5
          protectedByVolSync: true
          replicationID:
            id: ""
          resources:
            requests:
              storage: 94Gi
          storageClassName: ocs-storagecluster-cephfs
          storageID:
            id: ""
  status:
    conditions:
    - lastTransitionTime: "2024-05-15T07:51:19Z"
      message: All VolSync PVCs are ready
      observedGeneration: 13
      reason: Ready
      status: "True"
      type: DataReady
    - lastTransitionTime: "2024-05-15T07:56:36Z"
      message: All VolSync PVCs are protected
      observedGeneration: 13
      reason: DataProtected
      status: "True"
      type: DataProtected
    - lastTransitionTime: "2024-05-15T07:51:12Z"
      message: Nothing to restore
      observedGeneration: 13
      reason: Restored
      status: "True"
      type: ClusterDataReady
    - lastTransitionTime: "2024-05-16T10:31:02Z"
      message: Kube objects protected
      observedGeneration: 13
      reason: Uploaded
      status: "True"
      type: ClusterDataProtected
    kubeObjectProtection: {}
    lastUpdateTime: "2024-05-16T10:35:43Z"
    observedGeneration: 15
    state: Secondary
kind: List
metadata:
  resourceVersion: ""

oc get replicationdestinations.volsync.backube -A
No resources found

C2-

oc get replicationsources.volsync.backube -A
NAMESPACE              NAME            SOURCE          LAST SYNC              DURATION          NEXT SYNC
busybox-workloads-13   busybox-pvc-1   busybox-pvc-1   2024-05-16T12:51:07Z   1m7.482592009s    2024-05-16T12:55:00Z
busybox-workloads-14   busybox-pvc-1   busybox-pvc-1   2024-05-16T12:51:04Z   1m4.799389728s    2024-05-16T13:00:00Z
busybox-workloads-16   busybox-pvc-1   busybox-pvc-1   2024-05-16T12:50:50Z   50.644431785s     2024-05-16T13:00:00Z
busybox-workloads-18   busybox-pvc-1   busybox-pvc-1   2024-05-16T12:51:00Z   1m0.918656925s    2024-05-16T13:00:00Z
busybox-workloads-18   busybox-pvc-2   busybox-pvc-2   2024-05-16T12:50:53Z   53.833814943s     2024-05-16T13:00:00Z
busybox-workloads-18   busybox-pvc-3   busybox-pvc-3   2024-05-16T12:50:39Z   39.346943194s     2024-05-16T13:00:00Z
busybox-workloads-18   busybox-pvc-4   busybox-pvc-4   2024-05-16T12:50:39Z   39.470620309s     2024-05-16T13:00:00Z
busybox-workloads-20   busybox-pvc-1   busybox-pvc-1   2024-05-16T12:50:58Z   58.125638845s     2024-05-16T12:55:00Z
busybox-workloads-20   busybox-pvc-2   busybox-pvc-2   2024-05-16T12:51:01Z   1m1.408814228s    2024-05-16T12:55:00Z
busybox-workloads-20   busybox-pvc-3   busybox-pvc-3   2024-05-16T12:51:01Z   1m1.057903452s    2024-05-16T12:55:00Z
busybox-workloads-20   busybox-pvc-4   busybox-pvc-4   2024-05-16T12:50:55Z   55.935518256s     2024-05-16T12:55:00Z
busybox-workloads-6    busybox-pvc-1   busybox-pvc-1   2024-05-16T12:51:11Z   1m11.958088985s   2024-05-16T13:00:00Z
busybox-workloads-7    busybox-pvc-1   busybox-pvc-1   2024-05-16T12:51:09Z   1m9.618121552s    2024-05-16T12:55:00Z
busybox-workloads-8    busybox-pvc-1   busybox-pvc-1   2024-05-16T12:50:58Z   58.076276172s     2024-05-16T13:00:00Z

Because of this, PEER READY is marked as True and PROGRESSION as Completed for the CephFS workloads in the DRPC on the hub cluster, and hence the UI will allow further failover/relocate operations even when lastGroupSyncTime for these workloads is NULL.

From Hub-

NAMESPACE              NAME                                    AGE     PREFERREDCLUSTER   FAILOVERCLUSTER   DESIREDSTATE   CURRENTSTATE   PROGRESSION   START TIME             DURATION             PEER READY

drpc|grep Failover
busybox-workloads-10   rbd-sub-busybox10-placement-1-drpc      4h22m   amagrawa-c1-29a    amagrawa-c2-29a   Failover       FailedOver     Cleaning Up   2024-05-16T07:57:06Z                        False
busybox-workloads-15   cephfs-sub-busybox15-placement-1-drpc   4h22m   amagrawa-c1-29a    amagrawa-c2-29a   Failover       FailedOver     Completed     2024-05-16T07:57:29Z   2h37m47.703548649s   True
openshift-gitops       cephfs-appset-busybox5-placement-drpc   4h22m   amagrawa-c1-29a    amagrawa-c2-29a   Failover       FailedOver     Completed     2024-05-16T07:58:26Z   2h36m50.087675242s   True
openshift-gitops       rbd-appset-busybox2-placement-drpc      4h22m   amagrawa-c1-29a    amagrawa-c2-29a   Failover       FailedOver     Cleaning Up   2024-05-16T07:58:08Z                        False
openshift-gitops       vm-appset-1-placement-drpc              4h22m   amagrawa-c1-29a    amagrawa-c2-29a   Failover       FailedOver     Cleaning Up   2024-05-16T07:58:55Z                        False
vm-sub-1               vm-sub-1-placement-drpc                 4h22m   amagrawa-c1-29a    amagrawa-c2-29a   Failover       FailedOver     Cleaning Up   2024-05-16T07:58:43Z                        False

As discussed with srangana/bmekhiss in the RDR triage call today, this behaviour is expected, and it is the user's responsibility to ensure that lastGroupSyncTime is updated and close to the current time in UTC before performing another DR action on these workloads. However, we want to help the user re-think such a decision because of its consequences (relocate getting stuck waiting for the final sync, and failover probably restoring only the last synced data, as no new sync is available for data newly written on C2 by the workload pods), which would need BZ2219460 to be fixed (currently being pushed to be taken up in ODF 4.17).

As the cleanup will complete after the 24hrs eviction period, this BZ is being verified.
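The pre-DR-action checks referenced above, as a minimal hedged sketch (not taken from the bug): the namespace and resource names are placeholders reused from the outputs in this comment, the VRG field path matches the status.lastGroupSyncTime shown above, and the last command assumes lastGroupSyncTime is surfaced in the hub DRPC status as described in the preceding paragraph:

# On the recovered cluster C1, after the eviction window: confirm the VRG and
# VRs report Secondary before planning any further DR action
oc get vrg,vr -n busybox-workloads-10 -o wide

# On the current primary cluster C2: read lastGroupSyncTime from the VRG status
oc get vrg rbd-sub-busybox10-placement-1-drpc -n busybox-workloads-10 \
  -o jsonpath='{.status.lastGroupSyncTime}{"\n"}'

# On the hub: verify lastGroupSyncTime per DRPC is recent (UTC) and not NULL
# before triggering another failover/relocate from the UI or CLI
oc get drpc -A -o custom-columns=NAMESPACE:.metadata.namespace,NAME:.metadata.name,LASTGROUPSYNCTIME:.status.lastGroupSyncTime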
The eviction issue is being tracked separately: https://issues.redhat.com/browse/ACM-11239

Ack! on observations and intended fix as tested and elaborated in comment #21 (Thanks Aman!).

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (Important: Red Hat OpenShift Data Foundation 4.16.0 security, enhancement & bug fix update), and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2024:4591