Bug 2273997

Summary:	[RDR] [clone 4.15] CephFS subvolume left behind in managed cluster after deleting the application
Product:	[Red Hat Storage] Red Hat OpenShift Data Foundation	Reporter:	Elena Gershkovich <egershko>
Component:	odf-dr	Assignee:	Benamar Mekhissi <bmekhiss>
odf-dr sub component:	ramen	QA Contact:	krishnaram Karthick <kramdoss>
Status:	CLOSED WONTFIX	Docs Contact:
Severity:	high
Priority:	unspecified	CC:	amagrawa, bmekhiss, kramdoss, kseeger, muagarwa, odf-bz-bot, rar, sagrawal, sheggodu, srangana
Version:	4.15	Keywords:	Automation
Target Milestone:	---
Target Release:	---
Hardware:	Unspecified
OS:	Unspecified
Whiteboard:
Fixed In Version:		Doc Type:	If docs needed, set a value
Doc Text:		Story Points:	---
Clone Of:	2267907	Environment:
Last Closed:	2024-06-13 12:34:38 UTC	Type:	---
Regression:	---	Mount Type:	---
Documentation:	---	CRM:
Verified Versions:		Category:	---
oVirt Team:	---	RHEL 7.3 requirements from Atomic Host:
Cloudforms Team:	---	Target Upstream Version:
Embargoed:
Bug Depends On:	2267907
Bug Blocks:

Description Elena Gershkovich 2024-04-08 14:04:11 UTC

+++ This bug was initially created as a clone of Bug #2267907 +++

Description of problem (please be as detailed as possible and provide log snippets):
On an RDR setup, after performing a failover and then deleting the DR workload (CephFS based), a few subvolumes were not deleted from the secondary managed cluster.

Version of all relevant components (if applicable):
OCP: 4.15.0-0.nightly-2024-02-29-223316
ODF: 4.15.0-150
ceph version 17.2.6-196.el9cp (cbbf2cfb549196ca18c0c9caff9124d83ed681a4) quincy (stable)
ACM: 2.10.0-DOWNSTREAM-2024-02-28-06-06-55
Submariner: 0.17.0 (iib:680159)
VolSync: 0.8.0

Does this issue impact your ability to continue to work with the product
(please explain in detail what is the user impact)?
No

Is there any workaround available to the best of your knowledge?


Rate from 1 - 5 the complexity of the scenario you performed that caused this
bug (1 - very simple, 5 - very complex)?
2

Is this issue reproducible?
Not always.
In the same run, test_failover[primary_up_cephfs] failed but test_failover[primary_down_cephfs] passed.

Can this issue reproduce from the UI?


If this is a regression, please provide more details to justify this:


Steps to Reproduce:

1. Deploy CephFS based workload consisting of 20 pods, PVCs on C1 (sagrawal-nc1)
2. Wait for around (2 * scheduling_interval) to run IOs
3. Perform failover from C1 (sagrawal-nc1) to C2 (sagrawal-nc2)
4. Verify that resources are created on the secondary cluster and cleaned up from the primary cluster
5. Delete the workload
6. Verify backend subvolumes are deleted

Automated test: tests/functional/disaster-recovery/regional-dr/test_failover.py::TestFailover::test_failover[primary_up_cephfs]

Console logs from automated test run: https://url.corp.redhat.com/3333dd9


Actual results:
Subvolumes left behind in managed cluster (sagrawal-nc2)
Actual error message when running this command "ceph fs subvolume getpath ocs-storagecluster-cephfilesystem csi-vol-ae98923a-fec6-42dd-aca5-52ef54768dfe csi --format json" :
Error ENOENT: subvolume 'csi-vol-ae98923a-fec6-42dd-aca5-52ef54768dfe' is removed and has only snapshots retained


Expected results:
Subvolumes removed from both managed clusters.
Expected error message when running this command "ceph fs subvolume getpath ocs-storagecluster-cephfilesystem csi-vol-ae98923a-fec6-42dd-aca5-52ef54768dfe csi --format json" :
Error ENOENT: subvolume 'csi-vol-ae98923a-fec6-42dd-aca5-52ef54768dfe' does not exist
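
The actual and expected results differ only in the text of the ENOENT message, so the two states can be told apart mechanically. A minimal sketch in Python, assuming the two message fragments quoted in this report are stable across Ceph versions (not verified here); `classify_getpath_error` is a hypothetical helper and is not part of ocs-ci:

```python
# Message fragments copied from the command output quoted in this report.
RETAINED = "is removed and has only snapshots retained"
MISSING = "does not exist"


def classify_getpath_error(stderr: str) -> str:
    """Classify the error text of `ceph fs subvolume getpath`.

    Returns "retained" when the subvolume is removed but snapshots still
    keep it listed, "deleted" when it no longer exists, and "unknown"
    for anything else.
    """
    if "ENOENT" not in stderr:
        return "unknown"
    if RETAINED in stderr:
        return "retained"
    if MISSING in stderr:
        return "deleted"
    return "unknown"
```

Only the "deleted" result corresponds to the expected outcome; "retained" is the state reported in this bug.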

Additional info:

> Workload deletion command from logs:
2024-03-04 20:39:21  15:09:21 - MainThread - ocs_ci.utility.utils - INFO - C[sagrawal-acm] - Executing command: oc delete -k ocs-workloads/rdr/busybox/cephfs/app-busybox-1/subscriptions/busybox

> Test failed after multiple retries waiting for subvolume to be deleted
2024-03-04 20:52:45  AssertionError: Error occurred while verifying volume is present in backend: Error during execution of command: oc -n openshift-storage rsh rook-ceph-tools-dbddf8896-qhn4j ceph fs subvolume getpath ocs-storagecluster-cephfilesystem csi-vol-ae98923a-fec6-42dd-aca5-52ef54768dfe csi --format json.
2024-03-04 20:52:45  Error is Error ENOENT: subvolume 'csi-vol-ae98923a-fec6-42dd-aca5-52ef54768dfe' is removed and has only snapshots retained
2024-03-04 20:52:45  command terminated with exit code 2
2024-03-04 20:52:45   ImageUUID: ae98923a-fec6-42dd-aca5-52ef54768dfe. Interface type: CephFileSystem
2024-03-04 20:52:45  15:22:44 - MainThread - ocs_ci.helpers.helpers - ERROR - C[sagrawal-nc2] - Volume corresponding to uuid ae98923a-fec6-42dd-aca5-52ef54768dfe is not deleted in backend
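
For context, the retry behaviour described above can be sketched as a bounded polling loop. This is an illustration only, not the ocs-ci implementation: `wait_until_deleted`, its parameters, and the default retry count and delay are all hypothetical.

```python
import time
from typing import Callable


def wait_until_deleted(
    is_deleted: Callable[[], bool],
    retries: int = 5,
    delay_s: float = 30.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll `is_deleted` up to `retries` times, sleeping between attempts.

    Returns True as soon as the check succeeds, False if every attempt
    fails. `sleep` is injectable so the loop can be exercised without
    waiting.
    """
    for attempt in range(retries):
        if is_deleted():
            return True
        # Do not sleep after the final attempt.
        if attempt < retries - 1:
            sleep(delay_s)
    return False
```

In this bug the check would never succeed, because the retained snapshots keep the subvolume listed indefinitely; a longer timeout would not have changed the result.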


Latest output after several hours from toolbox pod (cluster - sagrawal-nc2):
sh-5.1$ date
Tue Mar  5 13:29:02 UTC 2024
sh-5.1$ ceph fs subvolume ls ocs-storagecluster-cephfilesystem csi
[
    {
        "name": "csi-vol-ae98923a-fec6-42dd-aca5-52ef54768dfe"
    },
    {
        "name": "csi-vol-aeee95f3-b0d6-4e0f-8d11-da07ff482088"
    }
]
sh-5.1$ ceph fs subvolume getpath ocs-storagecluster-cephfilesystem csi-vol-ae98923a-fec6-42dd-aca5-52ef54768dfe csi --format json

Error ENOENT: subvolume 'csi-vol-ae98923a-fec6-42dd-aca5-52ef54768dfe' is removed and has only snapshots retained
sh-5.1$ ceph fs subvolume getpath ocs-storagecluster-cephfilesystem csi-vol-aeee95f3-b0d6-4e0f-8d11-da07ff482088 csi --format json

Error ENOENT: subvolume 'csi-vol-aeee95f3-b0d6-4e0f-8d11-da07ff482088' is removed and has only snapshots retained

--- Additional comment from RHEL Program Management on 2024-03-05 14:07:03 UTC ---

This bug having no release flag set previously, is now set with release flag 'odf‑4.15.0' to '?', and so is being proposed to be fixed at the ODF 4.15.0 release. Note that the 3 Acks (pm_ack, devel_ack, qa_ack), if any previously set while release flag was missing, have now been reset since the Acks are to be set against a release flag.

--- Additional comment from Sidhant Agrawal on 2024-03-05 14:12:36 UTC ---

ACM and ODF must gather logs: http://rhsqe-repo.lab.eng.blr.redhat.com/OCS/ocs-qe-bugs/bz-2267907/initial/

--- Additional comment from Rakshith on 2024-03-06 06:18:30 UTC ---

>Actual results:
>Subvolumes left behind in managed cluster (sagrawal-nc2)
>Actual error message when running this command "ceph fs subvolume getpath ocs-storagecluster-cephfilesystem csi-vol-ae98923a-fec6-42dd-aca5-52ef54768dfe csi --format json" :
>Error ENOENT: subvolume 'csi-vol-ae98923a-fec6-42dd-aca5-52ef54768dfe' is removed and has only snapshots retained

The error indicates that the CephFS snapshots/ROX clones have not yet been deleted.
Moving to the DR team for initial analysis and to check whether the associated snapshot and ROX clone were deleted on the primary cluster.

--- Additional comment from Benamar Mekhissi on 2024-03-08 02:20:58 UTC ---

Two volumesnapshots were left behind. Unfortunately, the must-gather logs didn't contain the odf-dr logs, and the logs on the live system have already wrapped. The only remaining information is the two leftover volumesnapshots themselves:
```
oc get volumesnapshots -A                                                                   
NAMESPACE                    NAME                            READYTOUSE   SOURCEPVC        SOURCESNAPSHOTCONTENT   RESTORESIZE   SNAPSHOTCLASS                               SNAPSHOTCONTENT                                    CREATIONTIME   AGE
busybox-workloads-cephfs-1   busybox-pvc-20-20240304145737   true         busybox-pvc-20                           33Gi          ocs-storagecluster-cephfsplugin-snapclass   snapcontent-9e698308-c7cd-455a-8a55-0d262991923f   2d23h          2d23h
busybox-workloads-cephfs-1   busybox-pvc-9-20240304145725    true         busybox-pvc-9                            111Gi         ocs-storagecluster-cephfsplugin-snapclass   snapcontent-95abc223-0a03-4757-a449-ada608d21a3a   2d23h          2d23h

```

Looking at the YAML output for one of those:
```
oc get volumesnapshots -n busybox-workloads-cephfs-1   busybox-pvc-20-20240304145737 -o yaml
apiVersion: snapshot.storage.k8s.io/v1
kind: VolumeSnapshot
metadata:
  creationTimestamp: "2024-03-04T14:57:37Z"
  finalizers:
  - snapshot.storage.kubernetes.io/volumesnapshot-as-source-protection
  - snapshot.storage.kubernetes.io/volumesnapshot-bound-protection
  generation: 1
  labels:
    volsync.backube/do-not-delete: "true"
  name: busybox-pvc-20-20240304145737
  namespace: busybox-workloads-cephfs-1
  resourceVersion: "3839787"
  uid: 9e698308-c7cd-455a-8a55-0d262991923f
spec:
  source:
    persistentVolumeClaimName: busybox-pvc-20
  volumeSnapshotClassName: ocs-storagecluster-cephfsplugin-snapclass
status:
  boundVolumeSnapshotContentName: snapcontent-9e698308-c7cd-455a-8a55-0d262991923f
  creationTime: "2024-03-04T14:57:40Z"
  readyToUse: true
  restoreSize: 33Gi

```

We see that the `do-not-delete` label is already set, but the snapshot has no owner. Typically, when this label is set, the owner should be the VRG; in this case the owner does not appear to have been assigned. It is uncertain why this occurred; perhaps a faulty restore, in which the PVC restore operation terminated prematurely, was followed by the deletion of the workload (steps 4-5 above). At this stage, that is only speculation.

I recommend manually deleting those two volumesnapshots at this point. If possible, try to reproduce the issue and ensure that the odf-dr must-gather logs are collected for further investigation.
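
To locate candidates for manual deletion, the two symptoms seen in the YAML above (the `volsync.backube/do-not-delete: "true"` label and no `ownerReferences`) can be used as a filter. A minimal sketch, assuming the input is the parsed output of `oc get volumesnapshots -A -o json`; `find_orphaned_snapshots` is a hypothetical helper, not part of Ramen or VolSync, and a match is only a hint that should be reviewed before anything is deleted:

```python
DO_NOT_DELETE = "volsync.backube/do-not-delete"


def find_orphaned_snapshots(snapshot_list: dict) -> list:
    """Return (namespace, name) pairs for VolumeSnapshots that carry the
    do-not-delete label but have no ownerReferences.

    `snapshot_list` is the dict parsed from the JSON List object that
    `oc get volumesnapshots -A -o json` prints.
    """
    orphans = []
    for item in snapshot_list.get("items", []):
        meta = item.get("metadata", {})
        labels = meta.get("labels") or {}
        owners = meta.get("ownerReferences") or []
        if labels.get(DO_NOT_DELETE) == "true" and not owners:
            orphans.append((meta.get("namespace", ""), meta.get("name", "")))
    return orphans
```

For example, after `snapshots = json.load(sys.stdin)` on the piped `oc` output, `find_orphaned_snapshots(snapshots)` returns the namespace and name of each snapshot matching both conditions.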

--- Additional comment from Sunil Kumar Acharya on 2024-03-12 12:56:37 UTC ---

Moving the BZ out of ODF-4.15.0 as this BZ is not marked as Blocker. If this is a blocker, feel free to propose it back as a blocker with justification note.

--- Additional comment from Sidhant Agrawal on 2024-03-12 17:45:40 UTC ---

Version details:
OCP: 4.15.0-0.nightly-2024-03-09-040926
ODF: 4.15.0-157
Ceph version 17.2.6-196.el9cp (cbbf2cfb549196ca18c0c9caff9124d83ed681a4) quincy (stable)
ACM: 2.10.0-92 (2.10.0-DOWNSTREAM-2024-02-28-06-06-55)
Submariner: v0.17.0 (iib:680159)
VolSync: 0.8.0

Issue was reproduced on RDR setup using above mentioned versions.
ACM and ODF must-gather logs: http://rhsqe-repo.lab.eng.blr.redhat.com/OCS/ocs-qe-bugs/bz-2267907/comment-6/

Output from C2 cluster (sagrawal-c2):
$ oc get volumesnapshots -A | grep busybox
busybox-workloads-cephfs-1           busybox-pvc-6-20240312170040   true         busybox-pvc-6                                         123Gi         ocs-storagecluster-cephfsplugin-snapclass   snapcontent-af699638-7aa8-4e4e-864d-9151f69d92aa   29m            29m

$ oc get volumesnapshots -n busybox-workloads-cephfs-1 busybox-pvc-6-20240312170040 -o yaml
apiVersion: snapshot.storage.k8s.io/v1
kind: VolumeSnapshot
metadata:
  creationTimestamp: "2024-03-12T17:00:40Z"
  finalizers:
  - snapshot.storage.kubernetes.io/volumesnapshot-as-source-protection
  - snapshot.storage.kubernetes.io/volumesnapshot-bound-protection
  generation: 1
  labels:
    volsync.backube/do-not-delete: "true"
  name: busybox-pvc-6-20240312170040
  namespace: busybox-workloads-cephfs-1
  resourceVersion: "2496546"
  uid: af699638-7aa8-4e4e-864d-9151f69d92aa
spec:
  source:
    persistentVolumeClaimName: busybox-pvc-6
  volumeSnapshotClassName: ocs-storagecluster-cephfsplugin-snapclass
status:
  boundVolumeSnapshotContentName: snapcontent-af699638-7aa8-4e4e-864d-9151f69d92aa
  creationTime: "2024-03-12T17:00:42Z"
  readyToUse: true
  restoreSize: 123Gi

From toolbox pod:
sh-5.1$ ceph fs subvolume getpath ocs-storagecluster-cephfilesystem csi-vol-eeabaca1-d455-46fa-afd4-759100e9cb6f csi --format json

Error ENOENT: subvolume 'csi-vol-eeabaca1-d455-46fa-afd4-759100e9cb6f' is removed and has only snapshots retained

--- Additional comment from Sidhant Agrawal on 2024-03-12 18:08:04 UTC ---

(In reply to Sidhant Agrawal from comment #6)
> 
> Issue was reproduced on RDR setup using above mentioned versions.
> ACM and ODF must-gather logs:
> http://rhsqe-repo.lab.eng.blr.redhat.com/OCS/ocs-qe-bugs/bz-2267907/comment-6/
> 

Cluster details:

sagrawal-jhub - 
kubeconfig: http://magna002.ceph.redhat.com/ocsci-jenkins/openshift-clusters/sagrawal-jhub/sagrawal-jhub_20240311T062420/openshift-cluster-dir/auth/kubeconfig
Web Console: https://console-openshift-console.apps.sagrawal-jhub.qe.rh-ocs.com
Login: kubeadmin
Password: Zn3Zj-iXAN7-AFFJJ-7nnaf

sagrawal-c1 - 
kubeconfig: http://magna002.ceph.redhat.com/ocsci-jenkins/openshift-clusters/sagrawal-c1/sagrawal-c1_20240311T062447/openshift-cluster-dir/auth/kubeconfig
Web Console: https://console-openshift-console.apps.sagrawal-c1.qe.rh-ocs.com
Login: kubeadmin
Password: SpSqM-iTWCp-E5iZB-M6K5u

sagrawal-c2 - 
kubeconfig: http://magna002.ceph.redhat.com/ocsci-jenkins/openshift-clusters/sagrawal-c2/sagrawal-c2_20240311T062521/openshift-cluster-dir/auth/kubeconfig
Web Console: https://console-openshift-console.apps.sagrawal-c2.qe.rh-ocs.com
Login: kubeadmin
Password: jAwqy-54od9-4h9fb-uHfDc

--- Additional comment from Benamar Mekhissi on 2024-03-13 03:08:42 UTC ---

Out of the 20 PVCs, the volumesnapshot for busybox-pvc-6 didn't get cleaned up by the garbage collection.  Preliminary investigation revealed that during the rollback process, an in-progress sync prevented the previous volumesnapshot from being properly assigned an owner, leading to its retention. I intend to replicate the scenario locally in order to pinpoint the exact issue.

In the interim, the recommended workaround is to manually delete the orphaned volumesnapshot.

--- Additional comment from Benamar Mekhissi on 2024-03-21 13:38:26 UTC ---

PR: https://github.com/RamenDR/ramen/pull/1276

--- Additional comment from Shyamsundar on 2024-04-08 13:33:15 UTC ---

@bmekhiss, requesting a backport of the changes to the release-4.16 downstream branch.

https://github.com/RamenDR/ramen/pull/1276

--- Additional comment from RHEL Program Management on 2024-04-08 13:33:31 UTC ---

The 'Target Release' is not to be set manually at the Red Hat OpenShift Data Foundation product.

The 'Target Release' will be auto set appropriately, after the 3 Acks (pm,devel,qa) are set to "+" for a specific release flag and that release flag gets auto set to "+".

Comment 3 krishnaram Karthick 2024-05-02 11:41:49 UTC

Moving the bug to 4.15.4. We need to understand why this bug needs to be backported.