Description of problem (please be as detailed as possible and provide log snippets):

Version of all relevant components (if applicable):
OCP 4.14.0-0.nightly-2023-11-30-174049
ACM 2.9.0 GA'ed (from OperatorHub)
ODF 4.14.1-15
ceph version 17.2.6-161.el9cp (7037a43d8f7fa86659a0575b566ec10080df0d71) quincy (stable)
Submariner 0.16.2
VolSync 0.8.0

Does this issue impact your ability to continue to work with the product (please explain in detail what the user impact is)?

Is there any workaround available to the best of your knowledge?

Rate from 1 - 5 the complexity of the scenario you performed that caused this bug (1 - very simple, 5 - very complex)?

Is this issue reproducible?

Can this issue be reproduced from the UI?

If this is a regression, please provide more details to justify this:

Steps to Reproduce:
** Active hub is at the neutral site **
1. Deployed multiple RBD- and CephFS-backed workloads of both appset and subscription types.
2. Failed over and relocated them in such a way that they finally run on the primary managed cluster (which is expected to host all the workloads and may be hit by a disaster).
3. Ensure that the workloads are in distinct states such as Deployed, FailedOver, Relocated, etc.
4. Let at least 1 or 2 of the latest backups be taken (one every hour) covering all the different workload states (with progression completed and no action in progress on any of the workloads). Also ensure that sync for all the workloads works fine while on the active hub and that the cluster is healthy. Note drpc -o wide, lastGroupSyncTime, etc.
5. Bring the active hub completely down and move to the passive hub. Restore backups and ensure the Velero backup reports successful restoration. Make sure both managed clusters are successfully imported and the drpolicy gets validated.
6. Wait for the drpc resources to be restored and check whether all the workloads are in their last backed-up state. They seem to have retained the last state that was backed up, so everything is fine so far. Label cluster-monitoring on the hub cluster so that VolumeSync.DelayAlert is fired if data sync is affected for any workload (a sketch of this labeling is included under Additional info below).
7. Let IOs continue and check lastGroupSyncTime and the VolumeSync.DelayAlert alert. Sync for the RBD-based workloads was progressing just fine, as it was for the other CephFS-backed workloads, except for appset-cephfs-busybox9-placement-drpc in namespace busybox-workloads-9.
8. Upon further validation, it was found that the dst pods and PVCs were lost from the secondary managed cluster for this workload. (The older hub remains down forever and is completely unreachable.)

Actual results:
dst pods and PVCs are lost from the secondary managed cluster for appset-cephfs-busybox9-placement-drpc in namespace busybox-workloads-9.

From C2:
amagrawa:~$ oc get pods,vrg,vr,pvc -o wide -n busybox-workloads-9
NAME                                                                                 DESIREDSTATE   CURRENTSTATE
volumereplicationgroup.ramendr.openshift.io/appset-cephfs-busybox9-placement-drpc   secondary

From the passive hub (drpc):
openshift-gitops   appset-cephfs-busybox9-placement-drpc   23h   amagrawa-1-1d   Deployed   Completed   True
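For reference, the kind of checks that would confirm the above; this is a minimal sketch, not taken from the collected logs. It assumes the CephFS workload is protected through VolSync (so a ReplicationDestination plus its dst PVC/pod should exist in the workload namespace on the secondary cluster) and relies on the lastGroupSyncTime field in the DRPC status noted in step 4.

# On the secondary managed cluster (C2): for a VolSync-protected CephFS workload,
# a ReplicationDestination and its dst PVC/pod are expected in the workload
# namespace; their absence is what this bug reports.
oc get replicationdestination,pvc,pods -n busybox-workloads-9

# On the (passive) hub: lastGroupSyncTime should keep advancing roughly every
# sync interval; a stale value means sync is not progressing for that workload.
oc get drpc -A -o custom-columns='NAMESPACE:.metadata.namespace,NAME:.metadata.name,LASTGROUPSYNCTIME:.status.lastGroupSyncTime'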
We also found a bunch of NonCompliant policies post hub recovery; however, all the policies were Compliant on the older hub.

amagrawa:~$ oc get policy -A | grep NonCompliant
amagrawa-1-1d                    busybox-workloads-12.vs-secret-9b0006378bc4b3dde4a82c04dd7dd3f6              NonCompliant   23h
amagrawa-1-1d                    openshift-gitops.vs-secret-e00de88ac33afae138ce0bc1dc989ce1                  NonCompliant   23h
amagrawa-2-1d                    busybox-workloads-10.vs-secret-674962f86373128a3005a9a581929a8c              NonCompliant   23h
amagrawa-2-1d                    busybox-workloads-11.vs-secret-991de86951c29d86a7b4f21afcc222a0              NonCompliant   23h
amagrawa-2-1d                    busybox-workloads-16.vs-secret-875a04e3829a1b8e35315f7c4a6e0c66              NonCompliant   23h
amagrawa-2-1d                    openshift-gitops.vs-secret-1a2537d31f7cda5a558e40664f973bd4                  NonCompliant   23h
amagrawa-2-1d                    openshift-gitops.vs-secret-7e65a08e63102201af8f9ada3062686b                  NonCompliant   23h
amagrawa-2-1d                    openshift-gitops.vs-secret-e00de88ac33afae138ce0bc1dc989ce1                  NonCompliant   23h
amagrawa-2-1d                    openshift-gitops.vs-secret-e3b95572f8ad4080e8aec77f9b19d4e4                  NonCompliant   23h
busybox-workloads-10             vs-secret-674962f86373128a3005a9a581929a8c                                   NonCompliant   23h
busybox-workloads-11             vs-secret-991de86951c29d86a7b4f21afcc222a0                                   NonCompliant   23h
busybox-workloads-12             vs-secret-9b0006378bc4b3dde4a82c04dd7dd3f6                                   NonCompliant   23h
busybox-workloads-16             vs-secret-875a04e3829a1b8e35315f7c4a6e0c66                                   NonCompliant   23h
local-cluster                    open-cluster-management-backup.backup-restore-enabled              inform   NonCompliant   28h
open-cluster-management-backup   backup-restore-enabled                                              inform   NonCompliant   28h
openshift-gitops                 vs-secret-1a2537d31f7cda5a558e40664f973bd4                                   NonCompliant   23h
openshift-gitops                 vs-secret-7e65a08e63102201af8f9ada3062686b                                   NonCompliant   23h
openshift-gitops                 vs-secret-e00de88ac33afae138ce0bc1dc989ce1                                   NonCompliant   23h
openshift-gitops                 vs-secret-e3b95572f8ad4080e8aec77f9b19d4e4                                   NonCompliant   23h

Due to the missing volumes on the secondary site, data sync is not progressing for this workload. If the primary site goes down, or if a relocate needs to be performed on this workload (which is still possible), this can lead to complete loss of data (assuming the workload pods would not come up on the secondary cluster).

Logs collected some time after moving to the passive hub can be downloaded from:
https://drive.google.com/file/d/16aUyq1tbkKpumnE6Bmzx1PzBbvuBMwbI/view?usp=drive_link
Please note it cannot be unzipped as it is on Google Drive and not on the QE server (which is currently down).

Expected results:
Volumes should not be lost on the secondary site, and data sync should continue working fine for all DR-protected CephFS-backed workloads.

Additional info:
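For completeness, the hub-side labeling referred to in step 6; a minimal sketch, assuming the standard openshift.io/cluster-monitoring namespace label and that the DR hub components run in openshift-operators (the namespace choice is an assumption, not taken from this report).

# Label the namespace so the cluster monitoring stack scrapes the DR metrics that
# drive the sync-delay alerting (namespace is an assumption, not from this report).
oc label namespace openshift-operators openshift.io/cluster-monitoring='true'

# Verify the label was applied.
oc get namespace openshift-operators --show-labels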
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Important: Red Hat OpenShift Data Foundation 4.15.0 security, enhancement, & bug fix update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2024:1383
The needinfo request[s] on this closed bug have been removed, as they have been unresolved for 120 days.