+++ This bug was initially created as a clone of Bug #2267907 +++

Description of problem (please be as detailed as possible and provide log snippets):
On an RDR setup, after performing a failover operation and then deleting the DR workload (CephFS based), observed that a few subvolumes were not deleted from the secondary managed cluster.

Version of all relevant components (if applicable):
OCP: 4.15.0-0.nightly-2024-02-29-223316
ODF: 4.15.0-150
Ceph version: 17.2.6-196.el9cp (cbbf2cfb549196ca18c0c9caff9124d83ed681a4) quincy (stable)
ACM: 2.10.0-DOWNSTREAM-2024-02-28-06-06-55
Submariner: 0.17.0 (iib:680159)
VolSync: 0.8.0

Does this issue impact your ability to continue to work with the product (please explain in detail what is the user impact)?
No

Is there any workaround available to the best of your knowledge?

Rate from 1 - 5 the complexity of the scenario you performed that caused this bug (1 - very simple, 5 - very complex)?
2

Is this issue reproducible?
Not always. In the same run, test_failover[primary_up_cephfs] failed but test_failover[primary_down_cephfs] passed.

Can this issue be reproduced from the UI?

If this is a regression, please provide more details to justify this:

Steps to Reproduce:
1. Deploy a CephFS-based workload consisting of 20 pods and PVCs on C1 (sagrawal-nc1)
2. Wait for around (2 * scheduling_interval) to run IOs
3. Perform failover from C1 (sagrawal-nc1) to C2 (sagrawal-nc2)
4. Verify resources are created on the secondary cluster and cleaned up from the primary cluster
5. Delete the workload
6. Verify the backend subvolumes are deleted

Automated test: tests/functional/disaster-recovery/regional-dr/test_failover.py::TestFailover::test_failover[primary_up_cephfs]
Console logs from automated test run: https://url.corp.redhat.com/3333dd9

Actual results:
Subvolumes left behind on the managed cluster (sagrawal-nc2).

Actual error message when running "ceph fs subvolume getpath ocs-storagecluster-cephfilesystem csi-vol-ae98923a-fec6-42dd-aca5-52ef54768dfe csi --format json":
Error ENOENT: subvolume 'csi-vol-ae98923a-fec6-42dd-aca5-52ef54768dfe' is removed and has only snapshots retained

Expected results:
Subvolumes removed from both managed clusters.

Expected error message when running "ceph fs subvolume getpath ocs-storagecluster-cephfilesystem csi-vol-ae98923a-fec6-42dd-aca5-52ef54768dfe csi --format json":
Error ENOENT: subvolume 'csi-vol-ae98923a-fec6-42dd-aca5-52ef54768dfe' does not exist

Additional info:

> Workload deletion command from logs:
2024-03-04 20:39:21 15:09:21 - MainThread - ocs_ci.utility.utils - INFO - C[sagrawal-acm] - Executing command: oc delete -k ocs-workloads/rdr/busybox/cephfs/app-busybox-1/subscriptions/busybox

> Test failed after multiple retries waiting for the subvolume to be deleted:
2024-03-04 20:52:45 AssertionError: Error occurred while verifying volume is present in backend: Error during execution of command: oc -n openshift-storage rsh rook-ceph-tools-dbddf8896-qhn4j ceph fs subvolume getpath ocs-storagecluster-cephfilesystem csi-vol-ae98923a-fec6-42dd-aca5-52ef54768dfe csi --format json.
2024-03-04 20:52:45 Error is Error ENOENT: subvolume 'csi-vol-ae98923a-fec6-42dd-aca5-52ef54768dfe' is removed and has only snapshots retained
2024-03-04 20:52:45 command terminated with exit code 2
2024-03-04 20:52:45 ImageUUID: ae98923a-fec6-42dd-aca5-52ef54768dfe.
Interface type: CephFileSystem
2024-03-04 20:52:45 15:22:44 - MainThread - ocs_ci.helpers.helpers - ERROR - C[sagrawal-nc2] - Volume corresponding to uuid ae98923a-fec6-42dd-aca5-52ef54768dfe is not deleted in backend

Latest output, several hours later, from the toolbox pod (cluster sagrawal-nc2):

```
sh-5.1$ date
Tue Mar 5 13:29:02 UTC 2024
sh-5.1$ ceph fs subvolume ls ocs-storagecluster-cephfilesystem csi
[
    {
        "name": "csi-vol-ae98923a-fec6-42dd-aca5-52ef54768dfe"
    },
    {
        "name": "csi-vol-aeee95f3-b0d6-4e0f-8d11-da07ff482088"
    }
]
sh-5.1$ ceph fs subvolume getpath ocs-storagecluster-cephfilesystem csi-vol-ae98923a-fec6-42dd-aca5-52ef54768dfe csi --format json
Error ENOENT: subvolume 'csi-vol-ae98923a-fec6-42dd-aca5-52ef54768dfe' is removed and has only snapshots retained
sh-5.1$ ceph fs subvolume getpath ocs-storagecluster-cephfilesystem csi-vol-aeee95f3-b0d6-4e0f-8d11-da07ff482088 csi --format json
Error ENOENT: subvolume 'csi-vol-aeee95f3-b0d6-4e0f-8d11-da07ff482088' is removed and has only snapshots retained
```

--- Additional comment from RHEL Program Management on 2024-03-05 14:07:03 UTC ---

This bug, having no release flag set previously, now has the release flag 'odf-4.15.0' set to '?', and so is proposed to be fixed in the ODF 4.15.0 release. Note that the 3 Acks (pm_ack, devel_ack, qa_ack), if any were previously set while the release flag was missing, have now been reset since the Acks are to be set against a release flag.

--- Additional comment from Sidhant Agrawal on 2024-03-05 14:12:36 UTC ---

ACM and ODF must-gather logs: http://rhsqe-repo.lab.eng.blr.redhat.com/OCS/ocs-qe-bugs/bz-2267907/initial/

--- Additional comment from Rakshith on 2024-03-06 06:18:30 UTC ---

> Actual results:
> Subvolumes left behind on the managed cluster (sagrawal-nc2).
> Actual error message when running "ceph fs subvolume getpath ocs-storagecluster-cephfilesystem csi-vol-ae98923a-fec6-42dd-aca5-52ef54768dfe csi --format json":
> Error ENOENT: subvolume 'csi-vol-ae98923a-fec6-42dd-aca5-52ef54768dfe' is removed and has only snapshots retained

The error indicates that the CephFS snapshots/ROX clones have still not been deleted. Moving to the DR team to check whether the associated snapshot and ROX clone were deleted on the primary cluster, and for initial analysis.

--- Additional comment from Benamar Mekhissi on 2024-03-08 02:20:58 UTC ---

Two VolumeSnapshots were left behind. Unfortunately, the must-gather logs didn't contain the odf-dr logs, and the logs on the live system have already wrapped.
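(As an aside on the "removed and has only snapshots retained" state shown in the toolbox output above: a minimal sketch of how the snapshots still pinning such a subvolume could be listed. The toolbox pod, filesystem, subvolume, and group names are copied from the outputs earlier in this report and will differ on other clusters; the exact output depends on the deployed Ceph version.)

```
# List the snapshots retained on the leftover subvolume (subvolume group "csi");
# these retained snapshots are what keep the subvolume listed after its removal.
oc -n openshift-storage rsh rook-ceph-tools-dbddf8896-qhn4j \
  ceph fs subvolume snapshot ls ocs-storagecluster-cephfilesystem \
  csi-vol-ae98923a-fec6-42dd-aca5-52ef54768dfe csi --format json
```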
The only remaining information we still have is the two VolumeSnapshots that were left behind:

```
oc get volumesnapshots -A
NAMESPACE                    NAME                            READYTOUSE   SOURCEPVC        SOURCESNAPSHOTCONTENT   RESTORESIZE   SNAPSHOTCLASS                               SNAPSHOTCONTENT                                    CREATIONTIME   AGE
busybox-workloads-cephfs-1   busybox-pvc-20-20240304145737   true         busybox-pvc-20                           33Gi          ocs-storagecluster-cephfsplugin-snapclass   snapcontent-9e698308-c7cd-455a-8a55-0d262991923f   2d23h          2d23h
busybox-workloads-cephfs-1   busybox-pvc-9-20240304145725    true         busybox-pvc-9                            111Gi         ocs-storagecluster-cephfsplugin-snapclass   snapcontent-95abc223-0a03-4757-a449-ada608d21a3a   2d23h          2d23h
```

Looking at the YAML output for one of them:

```
oc get volumesnapshots -n busybox-workloads-cephfs-1 busybox-pvc-20-20240304145737 -o yaml
apiVersion: snapshot.storage.k8s.io/v1
kind: VolumeSnapshot
metadata:
  creationTimestamp: "2024-03-04T14:57:37Z"
  finalizers:
  - snapshot.storage.kubernetes.io/volumesnapshot-as-source-protection
  - snapshot.storage.kubernetes.io/volumesnapshot-bound-protection
  generation: 1
  labels:
    volsync.backube/do-not-delete: "true"
  name: busybox-pvc-20-20240304145737
  namespace: busybox-workloads-cephfs-1
  resourceVersion: "3839787"
  uid: 9e698308-c7cd-455a-8a55-0d262991923f
spec:
  source:
    persistentVolumeClaimName: busybox-pvc-20
  volumeSnapshotClassName: ocs-storagecluster-cephfsplugin-snapclass
status:
  boundVolumeSnapshotContentName: snapcontent-9e698308-c7cd-455a-8a55-0d262991923f
  creationTime: "2024-03-04T14:57:40Z"
  readyToUse: true
  restoreSize: 33Gi
```

The `do-not-delete` label is set, but there is no owner. Typically, when this label is set, the owner should be the VRG; in this case, however, the owner was never assigned. It's uncertain why this occurred; perhaps a faulty restore in which the PVC restore operation terminated prematurely, followed by the deletion of the workload (steps 4-5 above). At this stage, it's only speculation.

I recommend manually deleting those two VolumeSnapshots at this point. If possible, try to reproduce the issue and ensure that the odf-dr must-gather logs are collected for further investigation.

--- Additional comment from Sunil Kumar Acharya on 2024-03-12 12:56:37 UTC ---

Moving the BZ out of ODF 4.15.0 as it is not marked as a blocker. If this is a blocker, feel free to propose it back as a blocker with a justification note.

--- Additional comment from Sidhant Agrawal on 2024-03-12 17:45:40 UTC ---

Version details:
OCP: 4.15.0-0.nightly-2024-03-09-040926
ODF: 4.15.0-157
Ceph version: 17.2.6-196.el9cp (cbbf2cfb549196ca18c0c9caff9124d83ed681a4) quincy (stable)
ACM: 2.10.0-92 (2.10.0-DOWNSTREAM-2024-02-28-06-06-55)
Submariner: v0.17.0 (iib:680159)
VolSync: 0.8.0

The issue was reproduced on an RDR setup using the above-mentioned versions.
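(Related to the analysis above, where the VolSync `do-not-delete` label is present but no owner was assigned: a minimal sketch of how such orphaned VolumeSnapshots could be spotted when re-checking a reproduction. The label key is taken from the YAML above; the expectation that the VRG appears as the owner follows the earlier comment, not an independent verification.)

```
# Show all VolumeSnapshots that carry the VolSync "do-not-delete" label together
# with their ownerReferences; an orphaned snapshot shows <none> under OWNERS.
oc get volumesnapshots -A -l volsync.backube/do-not-delete=true \
  -o custom-columns=NAMESPACE:.metadata.namespace,NAME:.metadata.name,OWNERS:.metadata.ownerReferences
```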
ACM and ODF must-gather logs: http://rhsqe-repo.lab.eng.blr.redhat.com/OCS/ocs-qe-bugs/bz-2267907/comment-6/

Output from C2 cluster (sagrawal-c2):

```
$ oc get volumesnapshots -A | grep busybox
busybox-workloads-cephfs-1   busybox-pvc-6-20240312170040   true   busybox-pvc-6   123Gi   ocs-storagecluster-cephfsplugin-snapclass   snapcontent-af699638-7aa8-4e4e-864d-9151f69d92aa   29m   29m

$ oc get volumesnapshots -n busybox-workloads-cephfs-1 busybox-pvc-6-20240312170040 -o yaml
apiVersion: snapshot.storage.k8s.io/v1
kind: VolumeSnapshot
metadata:
  creationTimestamp: "2024-03-12T17:00:40Z"
  finalizers:
  - snapshot.storage.kubernetes.io/volumesnapshot-as-source-protection
  - snapshot.storage.kubernetes.io/volumesnapshot-bound-protection
  generation: 1
  labels:
    volsync.backube/do-not-delete: "true"
  name: busybox-pvc-6-20240312170040
  namespace: busybox-workloads-cephfs-1
  resourceVersion: "2496546"
  uid: af699638-7aa8-4e4e-864d-9151f69d92aa
spec:
  source:
    persistentVolumeClaimName: busybox-pvc-6
  volumeSnapshotClassName: ocs-storagecluster-cephfsplugin-snapclass
status:
  boundVolumeSnapshotContentName: snapcontent-af699638-7aa8-4e4e-864d-9151f69d92aa
  creationTime: "2024-03-12T17:00:42Z"
  readyToUse: true
  restoreSize: 123Gi
```

From the toolbox pod:

```
sh-5.1$ ceph fs subvolume getpath ocs-storagecluster-cephfilesystem csi-vol-eeabaca1-d455-46fa-afd4-759100e9cb6f csi --format json
Error ENOENT: subvolume 'csi-vol-eeabaca1-d455-46fa-afd4-759100e9cb6f' is removed and has only snapshots retained
```

--- Additional comment from Sidhant Agrawal on 2024-03-12 18:08:04 UTC ---

(In reply to Sidhant Agrawal from comment #6)
> Issue was reproduced on RDR setup using above mentioned versions.
> ACM and ODF must-gather logs:
> http://rhsqe-repo.lab.eng.blr.redhat.com/OCS/ocs-qe-bugs/bz-2267907/comment-6/

Cluster details:

sagrawal-jhub
- kubeconfig: http://magna002.ceph.redhat.com/ocsci-jenkins/openshift-clusters/sagrawal-jhub/sagrawal-jhub_20240311T062420/openshift-cluster-dir/auth/kubeconfig
- Web Console: https://console-openshift-console.apps.sagrawal-jhub.qe.rh-ocs.com
- Login: kubeadmin
- Password: Zn3Zj-iXAN7-AFFJJ-7nnaf

sagrawal-c1
- kubeconfig: http://magna002.ceph.redhat.com/ocsci-jenkins/openshift-clusters/sagrawal-c1/sagrawal-c1_20240311T062447/openshift-cluster-dir/auth/kubeconfig
- Web Console: https://console-openshift-console.apps.sagrawal-c1.qe.rh-ocs.com
- Login: kubeadmin
- Password: SpSqM-iTWCp-E5iZB-M6K5u

sagrawal-c2
- kubeconfig: http://magna002.ceph.redhat.com/ocsci-jenkins/openshift-clusters/sagrawal-c2/sagrawal-c2_20240311T062521/openshift-cluster-dir/auth/kubeconfig
- Web Console: https://console-openshift-console.apps.sagrawal-c2.qe.rh-ocs.com
- Login: kubeadmin
- Password: jAwqy-54od9-4h9fb-uHfDc

--- Additional comment from Benamar Mekhissi on 2024-03-13 03:08:42 UTC ---

Out of the 20 PVCs, the VolumeSnapshot for busybox-pvc-6 was not cleaned up by garbage collection. Preliminary investigation revealed that during the rollback process, an in-progress sync prevented the previous VolumeSnapshot from being properly assigned an owner, leading to its retention. I intend to replicate the scenario locally in order to pinpoint the exact issue. In the interim, the recommended workaround is to manually delete the orphaned VolumeSnapshot.

--- Additional comment from Benamar Mekhissi on 2024-03-21 13:38:26 UTC ---

PR: https://github.com/RamenDR/ramen/pull/1276

--- Additional comment from Shyamsundar on 2024-04-08 13:33:15 UTC ---

@bmekhiss request a backport of the changes to the release-4.16 downstream branch.
https://github.com/RamenDR/ramen/pull/1276

--- Additional comment from RHEL Program Management on 2024-04-08 13:33:31 UTC ---

The 'Target Release' is not to be set manually for the Red Hat OpenShift Data Foundation product. The 'Target Release' will be set automatically once the 3 Acks (pm, devel, qa) are set to "+" for a specific release flag and that release flag is auto-set to "+".
Moving the bug to 4.15.4. We need to understand why this bug needs to be backported.
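For reference, a minimal sketch of the manual cleanup workaround suggested in the comments above: delete the orphaned VolumeSnapshot, then confirm the backend subvolume is purged. Namespace, object names, and the toolbox pod name are copied from the examples in this report and will differ per cluster; automatic removal of the backing VolumeSnapshotContent and CephFS snapshot assumes the snapshot class uses the Delete deletion policy.

```
# Delete the orphaned VolumeSnapshot left behind after failover and workload deletion.
oc delete volumesnapshot -n busybox-workloads-cephfs-1 busybox-pvc-20-20240304145737

# Once the snapshotter removes the bound VolumeSnapshotContent and the underlying
# CephFS snapshot, the "snapshots retained" subvolume should be purged as well;
# confirm from the toolbox pod that it no longer appears in the listing.
oc -n openshift-storage rsh rook-ceph-tools-dbddf8896-qhn4j \
  ceph fs subvolume ls ocs-storagecluster-cephfilesystem csi
```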