Description of problem (please be detailed as possible and provide log snippets):

Version of all relevant components (if applicable):
ceph version 18.2.1-136.el9cp (e7edde2b655d0dd9f860dda675f9d7954f07e6e3) reef (stable)
OCP 4.16.0-0.nightly-2024-04-26-145258
ODF 4.16.0-89.stable
ACM 2.10.2
MCE 2.5.2

Does this issue impact your ability to continue to work with the product (please explain in detail what is the user impact)?

Is there any workaround available to the best of your knowledge?

Rate from 1 - 5 the complexity of the scenario you performed that caused this bug (1 - very simple, 5 - very complex)?

Is this issue reproducible?

Can this issue be reproduced from the UI?

If this is a regression, please provide more details to justify this:

Steps to Reproduce:

*****Active hub co-situated with primary managed cluster*****

1. On an RDR setup with both RBD and CephFS workloads of subscription and appset (pull model) types in distinct states such as Deployed, FailedOver, and Relocated, perform a site failure by bringing the active hub and the primary managed cluster down, then move to the passive hub by performing hub recovery.
2. Failover all the workloads that were running on the down managed cluster to the surviving managed cluster.
3. After successful failover, recover the down managed cluster. During cleanup, both VRG states are marked Secondary for the CephFS workloads on the recovered managed cluster, which eventually marks PeerReady as True in the drpc resource on the hub; however, the ReplicationDestination is not created on the recovered cluster until the eviction period timeout, which is currently 24 hours.
4. Now failover the CephFS workloads back to the recovered cluster, where PeerReady is marked True but the ReplicationDestination has not been created.

Actual results:
Since PeerReady is marked True for the CephFS workloads in this case, the UI allows failover even though the first sync has not completed due to the missing ReplicationDestination.
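The unsafe state described above can be expressed as a simple predicate over the drpc status shown later in this report: PeerReady alone should not gate failover when no ReplicationDestination exists on the target cluster. The helper below is purely illustrative (not Ramen or console code); the condition structure mirrors the `oc get drpc -o yaml` output in this report.

```python
# Illustrative check (not Ramen's implementation): decide whether a CephFS
# failover is safe, given the DRPC PeerReady condition and whether any VolSync
# ReplicationDestination exists on the failover target cluster.

def peer_ready(drpc_status: dict) -> bool:
    """True if the DRPC status carries a PeerReady condition with status True."""
    for cond in drpc_status.get("conditions", []):
        if cond.get("type") == "PeerReady":
            return cond.get("status") == "True"
    return False

def failover_is_safe(drpc_status: dict, replication_destinations: list) -> bool:
    """PeerReady alone is not enough: without a ReplicationDestination on the
    target, the first sync never ran and the failover will not complete."""
    return peer_ready(drpc_status) and len(replication_destinations) > 0

# The scenario in this bug: PeerReady is True on the hub, but no
# ReplicationDestination was created on the recovered cluster yet.
status = {"conditions": [{"type": "PeerReady", "status": "True"}]}
print(failover_is_safe(status, []))  # → False: failover should not be offered
```

The point of the sketch is that the second input (existence of a ReplicationDestination on the target) is exactly what the UI currently ignores when it allows the failover.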
Marking PeerReady is expected when both VRG states are marked Secondary on the recovered cluster (see comment https://bugzilla.redhat.com/show_bug.cgi?id=2263488#c21), but failover never completes when the ReplicationDestination is missing. The idea is to allow the failover back to the recovered cluster using the last restored PVC state.

New Hub-

busybox-workloads-15   cephfs-sub-busybox15-placement-1-drpc   7d9h   amagrawa-c2-29a   amagrawa-c1-29a   Failover   FailedOver   WaitForReadiness   2024-05-17T07:41:59Z   False

oc get drpc -o yaml -n busybox-workloads-15

apiVersion: v1
items:
- apiVersion: ramendr.openshift.io/v1alpha1
  kind: DRPlacementControl
  metadata:
    annotations:
      drplacementcontrol.ramendr.openshift.io/app-namespace: busybox-workloads-15
      drplacementcontrol.ramendr.openshift.io/last-app-deployment-cluster: amagrawa-c2-29a
    creationTimestamp: "2024-05-16T07:51:31Z"
    finalizers:
    - drpc.ramendr.openshift.io/finalizer
    generation: 3
    labels:
      cluster.open-cluster-management.io/backup: ramen
      velero.io/backup-name: acm-resources-generic-schedule-20240516070015
      velero.io/restore-name: restore-acm-acm-resources-generic-schedule-20240516070015
    name: cephfs-sub-busybox15-placement-1-drpc
    namespace: busybox-workloads-15
    ownerReferences:
    - apiVersion: cluster.open-cluster-management.io/v1beta1
      blockOwnerDeletion: true
      controller: true
      kind: Placement
      name: cephfs-sub-busybox15-placement-1
      uid: 31b90e55-e8e3-42b4-8f0a-ca8a71daa7ab
    resourceVersion: "36276430"
    uid: e0bb1638-5fa6-4b45-8a5e-2bc688c38101
  spec:
    action: Failover
    drPolicyRef:
      apiVersion: ramendr.openshift.io/v1alpha1
      kind: DRPolicy
      name: my-drpolicy-5
    failoverCluster: amagrawa-c1-29a
    placementRef:
      apiVersion: cluster.open-cluster-management.io/v1beta1
      kind: Placement
      name: cephfs-sub-busybox15-placement-1
      namespace: busybox-workloads-15
    preferredCluster: amagrawa-c2-29a
    pvcSelector:
      matchLabels:
        appname: busybox_app3_cephfs
  status:
    actionStartTime: "2024-05-17T07:41:59Z"
    conditions:
    - lastTransitionTime: "2024-05-17T07:42:28Z"
      message: Completed
      observedGeneration: 3
      reason: FailedOver
      status: "True"
      type: Available
    - lastTransitionTime: "2024-05-17T07:41:59Z"
      message: Started failover to cluster "amagrawa-c1-29a"
      observedGeneration: 3
      reason: NotStarted
      status: "False"
      type: PeerReady
    lastUpdateTime: "2024-05-23T16:40:50Z"
    phase: FailedOver
    preferredDecision:
      clusterName: amagrawa-c1-29a
      clusterNamespace: amagrawa-c1-29a
    progression: WaitForReadiness
    resourceConditions:
      conditions:
      - lastTransitionTime: "2024-05-16T08:00:28Z"
        message: All VolSync PVCs are ready
        observedGeneration: 6
        reason: Ready
        status: "True"
        type: DataReady
      - lastTransitionTime: "2024-05-16T08:00:28Z"
        message: Not all VolSync PVCs are protected
        observedGeneration: 6
        reason: DataProtected
        status: "False"
        type: DataProtected
      - lastTransitionTime: "2024-05-16T08:00:16Z"
        message: Nothing to restore
        observedGeneration: 6
        reason: Restored
        status: "True"
        type: ClusterDataReady
      - lastTransitionTime: "2024-05-16T08:00:28Z"
        message: Not all VolSync PVCs are protected
        observedGeneration: 6
        reason: DataProtected
        status: "False"
        type: ClusterDataProtected
      resourceMeta:
        generation: 6
        kind: VolumeReplicationGroup
        name: cephfs-sub-busybox15-placement-1-drpc
        namespace: busybox-workloads-15
        protectedpvcs:
        - busybox-pvc-1
kind: List
metadata:
  resourceVersion: ""

Recovered cluster C1-

oc project busybox-workloads-15; oc get pvc,vr,vrg,pods -o wide
Now using project "busybox-workloads-15" on server "https://api.amagrawa-c2-29a.qe.rh-ocs.com:6443".
NAME                                              STATUS   VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS                    VOLUMEATTRIBUTESCLASS   AGE    VOLUMEMODE
persistentvolumeclaim/busybox-pvc-1               Bound    pvc-a333d21e-3ab7-425d-8254-8fa62522dc3f   94Gi       RWX            ocs-storagecluster-cephfs       <unset>                 23d    Filesystem
persistentvolumeclaim/volsync-busybox-pvc-1-src   Bound    pvc-06823313-ed2d-49df-9773-55ef9a56f114   94Gi       ROX            ocs-storagecluster-cephfs-vrg   <unset>                 7d9h   Filesystem

NAME                                                                                DESIREDSTATE   CURRENTSTATE
volumereplicationgroup.ramendr.openshift.io/cephfs-sub-busybox15-placement-1-drpc   primary        Primary

NAME                                            READY   STATUS    RESTARTS   AGE    IP             NODE        NOMINATED NODE   READINESS GATES
pod/busybox-1-7f9b67dc95-6hjn2                  1/1     Running   0          7d9h   10.128.3.234   compute-2   <none>           <none>
pod/volsync-rsync-tls-src-busybox-pvc-1-676pm   0/1     Error     0          26m    10.128.2.70    compute-2   <none>           <none>
pod/volsync-rsync-tls-src-busybox-pvc-1-zxzl5   1/1     Running   0          4m7s   10.128.2.71    compute-2   <none>           <none>

oc describe vrg
Name:         cephfs-sub-busybox15-placement-1-drpc
Namespace:    busybox-workloads-15
Labels:       <none>
Annotations:  drplacementcontrol.ramendr.openshift.io/destination-cluster: amagrawa-c2-29a
              drplacementcontrol.ramendr.openshift.io/do-not-delete-pvc:
              drplacementcontrol.ramendr.openshift.io/drpc-uid: e0bb1638-5fa6-4b45-8a5e-2bc688c38101
API Version:  ramendr.openshift.io/v1alpha1
Kind:         VolumeReplicationGroup
Metadata:
  Creation Timestamp:  2024-04-30T13:36:03Z
  Finalizers:
    volumereplicationgroups.ramendr.openshift.io/vrg-protection
  Generation:  6
  Owner References:
    API Version:  work.open-cluster-management.io/v1
    Kind:         AppliedManifestWork
    Name:         661184cbe6aabc283e2f4acb234afb291390b8b4b3dd10af342eca0c4e7e3f41-cephfs-sub-busybox15-placement-1-drpc-busybox-workloads-15-vrg-mw
    UID:          79905b6c-78f9-414c-abc9-a6506a5cf852
  Resource Version:  47644387
  UID:               61c8fe31-6d15-4b42-876e-2f5d9f8d55af
Spec:
  Action:  Failover
  Async:
    Replication Class Selector:
    Scheduling Interval:  5m
    Volume Snapshot Class Selector:
  Pvc Selector:
    Match Labels:
      Appname:        busybox_app3_cephfs
  Replication State:  primary
  s3Profiles:
    s3profile-amagrawa-c1-29a-ocs-storagecluster
    s3profile-amagrawa-c2-29a-ocs-storagecluster
  Vol Sync:
Status:
  Conditions:
    Last Transition Time:  2024-05-16T08:00:28Z
    Message:               All VolSync PVCs are ready
    Observed Generation:   6
    Reason:                Ready
    Status:                True
    Type:                  DataReady
    Last Transition Time:  2024-05-16T08:00:28Z
    Message:               Not all VolSync PVCs are protected
    Observed Generation:   6
    Reason:                DataProtected
    Status:                False
    Type:                  DataProtected
    Last Transition Time:  2024-05-16T08:00:16Z
    Message:               Nothing to restore
    Observed Generation:   6
    Reason:                Restored
    Status:                True
    Type:                  ClusterDataReady
    Last Transition Time:  2024-05-16T08:00:28Z
    Message:               Not all VolSync PVCs are protected
    Observed Generation:   6
    Reason:                DataProtected
    Status:                False
    Type:                  ClusterDataProtected
  Kube Object Protection:
  Last Update Time:     2024-05-23T16:40:25Z
  Observed Generation:  6
  Protected PV Cs:
    Access Modes:
      ReadWriteMany
    Annotations:
      apps.open-cluster-management.io/hosting-subscription:  busybox-workloads-15/cephfs-sub-busybox15-subscription-1
      apps.open-cluster-management.io/reconcile-option:      merge
    Conditions:
      Last Transition Time:  2024-05-16T08:00:16Z
      Message:               Ready
      Observed Generation:   6
      Reason:                SourceInitialized
      Status:                True
      Type:                  ReplicationSourceSetup
      Last Transition Time:  2024-05-16T07:59:24Z
      Message:               PVC restored
      Observed Generation:   5
      Reason:                Restored
      Status:                True
      Type:                  PVsRestored
    Labels:
      App:                                             cephfs-sub-busybox15
      app.kubernetes.io/part-of:                       cephfs-sub-busybox15
      Appname:                                         busybox_app3_cephfs
      apps.open-cluster-management.io/reconcile-rate:  medium
      velero.io/backup-name:                           acm-resources-schedule-20240516070016
      velero.io/restore-name:                          restore-acm-acm-resources-schedule-20240516070016
    Name:                   busybox-pvc-1
    Namespace:              busybox-workloads-15
    Protected By Vol Sync:  true
    Replication ID:
      Id:
    Resources:
      Requests:
        Storage:  94Gi
    Storage Class Name:  ocs-storagecluster-cephfs
    Storage ID:
      Id:
    State:  Primary
Events:
  Type    Reason                    Age                   From                               Message
  ----    ------                    ----                  ----                               -------
  Normal  PrimaryVRGProcessSuccess  62m (x42 over 3h21m)  controller_VolumeReplicationGroup  Primary Success
  Normal  PrimaryVRGProcessSuccess  20m (x5 over 62m)     controller_VolumeReplicationGroup  Primary Success

C1 still has a ReplicationSource for the failed-over workload, but no ReplicationDestination:

oc get replicationsources.volsync.backube -A
NAMESPACE              NAME            SOURCE          LAST SYNC   DURATION   NEXT SYNC
busybox-workloads-15   busybox-pvc-1   busybox-pvc-1

Surviving cluster C2-

oc project busybox-workloads-15; oc get pvc,vr,vrg,pods -o wide
Already on project "busybox-workloads-15" on server "https://api.amagrawa-c2-29a.qe.rh-ocs.com:6443".

NAME                                              STATUS   VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS                    VOLUMEATTRIBUTESCLASS   AGE    VOLUMEMODE
persistentvolumeclaim/busybox-pvc-1               Bound    pvc-a333d21e-3ab7-425d-8254-8fa62522dc3f   94Gi       RWX            ocs-storagecluster-cephfs       <unset>                 23d    Filesystem
persistentvolumeclaim/volsync-busybox-pvc-1-src   Bound    pvc-06823313-ed2d-49df-9773-55ef9a56f114   94Gi       ROX            ocs-storagecluster-cephfs-vrg   <unset>                 7d9h   Filesystem

NAME                                                                                DESIREDSTATE   CURRENTSTATE
volumereplicationgroup.ramendr.openshift.io/cephfs-sub-busybox15-placement-1-drpc   primary        Primary

NAME                                            READY   STATUS    RESTARTS   AGE     IP             NODE        NOMINATED NODE   READINESS GATES
pod/busybox-1-7f9b67dc95-6hjn2                  1/1     Running   0          7d9h    10.128.3.234   compute-2   <none>           <none>
pod/volsync-rsync-tls-src-busybox-pvc-1-676pm   0/1     Error     0          28m     10.128.2.70    compute-2   <none>           <none>
pod/volsync-rsync-tls-src-busybox-pvc-1-zxzl5   1/1     Running   0          5m19s   10.128.2.71    compute-2   <none>           <none>

oc describe vrg
Name:         cephfs-sub-busybox15-placement-1-drpc
Namespace:    busybox-workloads-15
Labels:       <none>
Annotations:  drplacementcontrol.ramendr.openshift.io/destination-cluster: amagrawa-c2-29a
              drplacementcontrol.ramendr.openshift.io/do-not-delete-pvc:
              drplacementcontrol.ramendr.openshift.io/drpc-uid: e0bb1638-5fa6-4b45-8a5e-2bc688c38101
API Version:  ramendr.openshift.io/v1alpha1
Kind:         VolumeReplicationGroup
Metadata:
  Creation Timestamp:  2024-04-30T13:36:03Z
  Finalizers:
    volumereplicationgroups.ramendr.openshift.io/vrg-protection
  Generation:  6
  Owner References:
    API Version:  work.open-cluster-management.io/v1
    Kind:         AppliedManifestWork
    Name:         661184cbe6aabc283e2f4acb234afb291390b8b4b3dd10af342eca0c4e7e3f41-cephfs-sub-busybox15-placement-1-drpc-busybox-workloads-15-vrg-mw
    UID:          79905b6c-78f9-414c-abc9-a6506a5cf852
  Resource Version:  47644387
  UID:               61c8fe31-6d15-4b42-876e-2f5d9f8d55af
Spec:
  Action:  Failover
  Async:
    Replication Class Selector:
    Scheduling Interval:  5m
    Volume Snapshot Class Selector:
  Pvc Selector:
    Match Labels:
      Appname:        busybox_app3_cephfs
  Replication State:  primary
  s3Profiles:
    s3profile-amagrawa-c1-29a-ocs-storagecluster
    s3profile-amagrawa-c2-29a-ocs-storagecluster
  Vol Sync:
Status:
  Conditions:
    Last Transition Time:  2024-05-16T08:00:28Z
    Message:               All VolSync PVCs are ready
    Observed Generation:   6
    Reason:                Ready
    Status:                True
    Type:                  DataReady
    Last Transition Time:  2024-05-16T08:00:28Z
    Message:               Not all VolSync PVCs are protected
    Observed Generation:   6
    Reason:                DataProtected
    Status:                False
    Type:                  DataProtected
    Last Transition Time:  2024-05-16T08:00:16Z
    Message:               Nothing to restore
    Observed Generation:   6
    Reason:                Restored
    Status:                True
    Type:                  ClusterDataReady
    Last Transition Time:  2024-05-16T08:00:28Z
    Message:               Not all VolSync PVCs are protected
    Observed Generation:   6
    Reason:                DataProtected
    Status:                False
    Type:                  ClusterDataProtected
  Kube Object Protection:
  Last Update Time:     2024-05-23T16:40:25Z
  Observed Generation:  6
  Protected PV Cs:
    Access Modes:
      ReadWriteMany
    Annotations:
      apps.open-cluster-management.io/hosting-subscription:  busybox-workloads-15/cephfs-sub-busybox15-subscription-1
      apps.open-cluster-management.io/reconcile-option:      merge
    Conditions:
      Last Transition Time:  2024-05-16T08:00:16Z
      Message:               Ready
      Observed Generation:   6
      Reason:                SourceInitialized
      Status:                True
      Type:                  ReplicationSourceSetup
      Last Transition Time:  2024-05-16T07:59:24Z
      Message:               PVC restored
      Observed Generation:   5
      Reason:                Restored
      Status:                True
      Type:                  PVsRestored
    Labels:
      App:                                             cephfs-sub-busybox15
      app.kubernetes.io/part-of:                       cephfs-sub-busybox15
      Appname:                                         busybox_app3_cephfs
      apps.open-cluster-management.io/reconcile-rate:  medium
      velero.io/backup-name:                           acm-resources-schedule-20240516070016
      velero.io/restore-name:                          restore-acm-acm-resources-schedule-20240516070016
    Name:                   busybox-pvc-1
    Namespace:              busybox-workloads-15
    Protected By Vol Sync:  true
    Replication ID:
      Id:
    Resources:
      Requests:
        Storage:  94Gi
    Storage Class Name:  ocs-storagecluster-cephfs
    Storage ID:
      Id:
    State:  Primary
Events:
  Type    Reason                    Age                   From                               Message
  ----    ------                    ----                  ----                               -------
  Normal  PrimaryVRGProcessSuccess  63m (x42 over 3h22m)  controller_VolumeReplicationGroup  Primary Success
  Normal  PrimaryVRGProcessSuccess  20m (x5 over 62m)     controller_VolumeReplicationGroup  Primary Success

C2 has a ReplicationSource too, but that is expected (the failover succeeded and the workload is running on this cluster):

oc get replicationsources.volsync.backube -A
NAMESPACE              NAME            SOURCE          LAST SYNC   DURATION   NEXT SYNC
busybox-workloads-15   busybox-pvc-1   busybox-pvc-1

Expected results:
Failover should complete with the last restored PVC state when the ReplicationDestination is missing.

Additional info:
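As a sketch of the expected behavior (this is hypothetical, not Ramen's actual implementation): when the failover target has no ReplicationDestination, the failover should fall back to the last restored PVC state rather than wait on a first sync that never ran. The function and return labels below are illustrative names only.

```python
from typing import Optional

# Hypothetical decision logic for the expected result above (illustrative only).
def choose_failover_source(has_replication_destination: bool,
                           last_restored_pvc: Optional[str]) -> str:
    if has_replication_destination:
        # Normal path: a sync exists on the target, use the latest replicated data.
        return "latest-sync"
    if last_restored_pvc is not None:
        # Expected per this report: complete the failover using the last
        # restored PVC state (busybox-pvc-1 in the outputs above).
        return f"last-restored:{last_restored_pvc}"
    # Nothing to fail over from yet; keep waiting.
    return "wait"

print(choose_failover_source(False, "busybox-pvc-1"))  # → last-restored:busybox-pvc-1
```

Today the controller effectively stays in the "wait" branch (Progression: WaitForReadiness) even though a restored PVC state exists, which is the gap this report describes.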
Moving the non-blocker BZs out of ODF-4.17.0 as part of Development Freeze.