Description Elena Gershkovich
2024-04-08 14:04:11 UTC
+++ This bug was initially created as a clone of Bug #2267907 +++
Description of problem (please be as detailed as possible and provide log snippets):
On an RDR setup, after performing a failover operation and then deleting the DR workload (CephFS based), a few subvolumes were not deleted from the secondary managed cluster.
Version of all relevant components (if applicable):
OCP: 4.15.0-0.nightly-2024-02-29-223316
ODF: 4.15.0-150
ceph version 17.2.6-196.el9cp (cbbf2cfb549196ca18c0c9caff9124d83ed681a4) quincy (stable)
ACM: 2.10.0-DOWNSTREAM-2024-02-28-06-06-55
Submariner: 0.17.0 (iib:680159)
VolSync: 0.8.0
Does this issue impact your ability to continue to work with the product
(please explain in detail what is the user impact)?
No
Is there any workaround available to the best of your knowledge?
Rate from 1 - 5 the complexity of the scenario you performed that caused this
bug (1 - very simple, 5 - very complex)?
2
Is this issue reproducible?
Not always.
In the same run, test_failover[primary_up_cephfs] failed, but the other test, test_failover[primary_down_cephfs], passed.
Can this issue reproduce from the UI?
If this is a regression, please provide more details to justify this:
Steps to Reproduce:
1. Deploy CephFS based workload consisting of 20 pods, PVCs on C1 (sagrawal-nc1)
2. Wait for around (2 * scheduling_interval) to run IOs
3. Perform failover from C1 (sagrawal-nc1) to C2 (sagrawal-nc2)
4. Verify resources are created on the secondary cluster and cleaned up from the primary cluster
5. Delete the workload
6. Verify backend subvolumes are deleted
Automated test: tests/functional/disaster-recovery/regional-dr/test_failover.py::TestFailover::test_failover[primary_up_cephfs]
Console logs from automated test run: https://url.corp.redhat.com/3333dd9
Actual results:
Subvolumes left behind in managed cluster (sagrawal-nc2)
Actual error message when running this command "ceph fs subvolume getpath ocs-storagecluster-cephfilesystem csi-vol-ae98923a-fec6-42dd-aca5-52ef54768dfe csi --format json" :
Error ENOENT: subvolume 'csi-vol-ae98923a-fec6-42dd-aca5-52ef54768dfe' is removed and has only snapshots retained
Expected results:
Subvolumes removed from both managed clusters.
Expected error message when running this command "ceph fs subvolume getpath ocs-storagecluster-cephfilesystem csi-vol-ae98923a-fec6-42dd-aca5-52ef54768dfe csi --format json" :
Error ENOENT: subvolume 'csi-vol-ae98923a-fec6-42dd-aca5-52ef54768dfe' does not exist
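The difference between the actual and expected results comes down to the text of the ENOENT message. A minimal Python sketch of how a check could tell the two apart is shown here; the function name and the state labels are hypothetical and are not taken from ocs-ci or Ceph, and the matching relies only on the two message strings quoted in this report.

```python
# Hypothetical state labels for illustration; not part of Ceph or ocs-ci.
RETAINED = "snapshots-retained"
GONE = "deleted"
PRESENT = "present"
UNKNOWN = "unknown"


def classify_getpath(exit_code, output):
    """Classify a `ceph fs subvolume getpath` result from its exit code and text.

    Only the two ENOENT messages quoted in this report are recognized;
    anything else is reported as UNKNOWN rather than guessed at.
    """
    if exit_code == 0:
        # The command succeeded, so the subvolume still has a path.
        return PRESENT
    if "ENOENT" in output:
        if "has only snapshots retained" in output:
            # Subvolume data is removed, but a snapshot still pins it.
            return RETAINED
        if "does not exist" in output:
            # Fully deleted: this is the expected result.
            return GONE
    return UNKNOWN
```

With this split, a test could treat `GONE` as success and report `RETAINED` separately, since it points at a leftover snapshot rather than a failed subvolume delete.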
Additional info:
> Workload deletion command from logs:
2024-03-04 20:39:21 15:09:21 - MainThread - ocs_ci.utility.utils - INFO - C[sagrawal-acm] - Executing command: oc delete -k ocs-workloads/rdr/busybox/cephfs/app-busybox-1/subscriptions/busybox
> Test failed after multiple retries waiting for subvolume to be deleted
2024-03-04 20:52:45 AssertionError: Error occurred while verifying volume is present in backend: Error during execution of command: oc -n openshift-storage rsh rook-ceph-tools-dbddf8896-qhn4j ceph fs subvolume getpath ocs-storagecluster-cephfilesystem csi-vol-ae98923a-fec6-42dd-aca5-52ef54768dfe csi --format json.
2024-03-04 20:52:45 Error is Error ENOENT: subvolume 'csi-vol-ae98923a-fec6-42dd-aca5-52ef54768dfe' is removed and has only snapshots retained
2024-03-04 20:52:45 command terminated with exit code 2
2024-03-04 20:52:45 ImageUUID: ae98923a-fec6-42dd-aca5-52ef54768dfe. Interface type: CephFileSystem
2024-03-04 20:52:45 15:22:44 - MainThread - ocs_ci.helpers.helpers - ERROR - C[sagrawal-nc2] - Volume corresponding to uuid ae98923a-fec6-42dd-aca5-52ef54768dfe is not deleted in backend
Latest output after several hours from toolbox pod (cluster - sagrawal-nc2):
sh-5.1$ date
Tue Mar 5 13:29:02 UTC 2024
sh-5.1$ ceph fs subvolume ls ocs-storagecluster-cephfilesystem csi
[
{
"name": "csi-vol-ae98923a-fec6-42dd-aca5-52ef54768dfe"
},
{
"name": "csi-vol-aeee95f3-b0d6-4e0f-8d11-da07ff482088"
}
]
sh-5.1$ ceph fs subvolume getpath ocs-storagecluster-cephfilesystem csi-vol-ae98923a-fec6-42dd-aca5-52ef54768dfe csi --format json
Error ENOENT: subvolume 'csi-vol-ae98923a-fec6-42dd-aca5-52ef54768dfe' is removed and has only snapshots retained
sh-5.1$ ceph fs subvolume getpath ocs-storagecluster-cephfilesystem csi-vol-aeee95f3-b0d6-4e0f-8d11-da07ff482088 csi --format json
Error ENOENT: subvolume 'csi-vol-aeee95f3-b0d6-4e0f-8d11-da07ff482088' is removed and has only snapshots retained
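For comparing the leftover subvolumes with the ImageUUID values in the test logs, the `ceph fs subvolume ls` JSON output above can be reduced to bare UUIDs. This is a small sketch, assuming the `csi-vol-` prefix seen in this report; the function name is hypothetical.

```python
import json

# Prefix observed on the subvolume names in this report.
PREFIX = "csi-vol-"


def leftover_uuids(ls_output):
    """Return the UUID part of each subvolume name in `ceph fs subvolume ls` JSON output."""
    names = [entry["name"] for entry in json.loads(ls_output)]
    # Names without the expected prefix are skipped instead of being mangled.
    return [name[len(PREFIX):] for name in names if name.startswith(PREFIX)]


sample = """
[
    {"name": "csi-vol-ae98923a-fec6-42dd-aca5-52ef54768dfe"},
    {"name": "csi-vol-aeee95f3-b0d6-4e0f-8d11-da07ff482088"}
]
"""
print(leftover_uuids(sample))
```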
--- Additional comment from RHEL Program Management on 2024-03-05 14:07:03 UTC ---
This bug having no release flag set previously, is now set with release flag 'odf‑4.15.0' to '?', and so is being proposed to be fixed at the ODF 4.15.0 release. Note that the 3 Acks (pm_ack, devel_ack, qa_ack), if any previously set while release flag was missing, have now been reset since the Acks are to be set against a release flag.
--- Additional comment from Sidhant Agrawal on 2024-03-05 14:12:36 UTC ---
ACM and ODF must gather logs: http://rhsqe-repo.lab.eng.blr.redhat.com/OCS/ocs-qe-bugs/bz-2267907/initial/
--- Additional comment from Rakshith on 2024-03-06 06:18:30 UTC ---
>Actual results:
>Subvolumes left behind in managed cluster (sagrawal-nc2)
>Actual error message when running this command "ceph fs subvolume getpath ocs-storagecluster-cephfilesystem csi-vol-ae98923a-fec6-42dd-aca5-52ef54768dfe csi --format json" :
>Error ENOENT: subvolume 'csi-vol-ae98923a-fec6-42dd-aca5-52ef54768dfe' is removed and has only snapshots retained
The error indicates that cephfs snapshots/ROX clones are still not deleted.
Moving to DR team to check if associated Snapshot and ROX clone is deleted on the primary cluster and for initial analysis.
--- Additional comment from Benamar Mekhissi on 2024-03-08 02:20:58 UTC ---
Two volumesnapshots were left behind. Unfortunately, the must-gather logs didn't contain the odf-dr logs, and the logs on the live system have already wrapped. The only information we still have is the two volumesnapshots that were left behind:
```
oc get volumesnapshots -A
NAMESPACE NAME READYTOUSE SOURCEPVC SOURCESNAPSHOTCONTENT RESTORESIZE SNAPSHOTCLASS SNAPSHOTCONTENT CREATIONTIME AGE
busybox-workloads-cephfs-1 busybox-pvc-20-20240304145737 true busybox-pvc-20 33Gi ocs-storagecluster-cephfsplugin-snapclass snapcontent-9e698308-c7cd-455a-8a55-0d262991923f 2d23h 2d23h
busybox-workloads-cephfs-1 busybox-pvc-9-20240304145725 true busybox-pvc-9 111Gi ocs-storagecluster-cephfsplugin-snapclass snapcontent-95abc223-0a03-4757-a449-ada608d21a3a 2d23h 2d23h
```
Looking at the YAML output for one of those:
```
oc get volumesnapshots -n busybox-workloads-cephfs-1 busybox-pvc-20-20240304145737 -o yaml
apiVersion: snapshot.storage.k8s.io/v1
kind: VolumeSnapshot
metadata:
creationTimestamp: "2024-03-04T14:57:37Z"
finalizers:
- snapshot.storage.kubernetes.io/volumesnapshot-as-source-protection
- snapshot.storage.kubernetes.io/volumesnapshot-bound-protection
generation: 1
labels:
volsync.backube/do-not-delete: "true"
name: busybox-pvc-20-20240304145737
namespace: busybox-workloads-cephfs-1
resourceVersion: "3839787"
uid: 9e698308-c7cd-455a-8a55-0d262991923f
spec:
source:
persistentVolumeClaimName: busybox-pvc-20
volumeSnapshotClassName: ocs-storagecluster-cephfsplugin-snapclass
status:
boundVolumeSnapshotContentName: snapcontent-9e698308-c7cd-455a-8a55-0d262991923f
creationTime: "2024-03-04T14:57:40Z"
readyToUse: true
restoreSize: 33Gi
```
We see that the label `do-not-delete` is already set, but there is no owner. Typically, when this label is set, the owner should be the VRG. In this case, however, the owner does not appear to have been assigned. It's uncertain why this occurred; perhaps a faulty restore is to blame, where the PVC restore operation terminated prematurely and was followed by the deletion of the workload (steps 4-5 above). At this stage, it's only speculation.
I recommend manually deleting those two volumesnapshots at this point. If possible, try to reproduce the issue and ensure that the odf-dr must-gather logs are collected for further investigation.
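To find such snapshots before deleting them by hand, the condition described above (label set, no owner) can be checked mechanically. The following is a minimal Python sketch that works on a VolumeSnapshot object already loaded as a dict, for example one item from `oc get volumesnapshots -A -o json`; the function name is hypothetical, and the check mirrors only what is visible in the YAML in this comment, not Ramen's actual garbage collection logic.

```python
# Label seen on the leftover snapshots in this report.
DO_NOT_DELETE = "volsync.backube/do-not-delete"


def is_orphaned(snapshot):
    """Return True if a VolumeSnapshot carries the do-not-delete label but has no owner."""
    meta = snapshot.get("metadata", {})
    labeled = meta.get("labels", {}).get(DO_NOT_DELETE) == "true"
    # A missing or empty ownerReferences list both count as "no owner".
    owned = bool(meta.get("ownerReferences"))
    return labeled and not owned


# Trimmed-down version of the snapshot shown above.
leftover = {
    "metadata": {
        "name": "busybox-pvc-20-20240304145737",
        "namespace": "busybox-workloads-cephfs-1",
        "labels": {DO_NOT_DELETE: "true"},
    }
}
print(is_orphaned(leftover))
```

Snapshots flagged this way are candidates for the manual `oc delete volumesnapshot` cleanup; they should still be reviewed first, since the label on its own is expected during normal operation.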
--- Additional comment from Sunil Kumar Acharya on 2024-03-12 12:56:37 UTC ---
Moving the BZ out of ODF-4.15.0 as this BZ is not marked as Blocker. If this is a blocker, feel free to propose it back as a blocker with justification note.
--- Additional comment from Sidhant Agrawal on 2024-03-12 17:45:40 UTC ---
Version details:
OCP: 4.15.0-0.nightly-2024-03-09-040926
ODF: 4.15.0-157
Ceph version 17.2.6-196.el9cp (cbbf2cfb549196ca18c0c9caff9124d83ed681a4) quincy (stable)
ACM: 2.10.0-92 (2.10.0-DOWNSTREAM-2024-02-28-06-06-55)
Submariner: v0.17.0 (iib:680159)
VolSync: 0.8.0
Issue was reproduced on RDR setup using above mentioned versions.
ACM and ODF must-gather logs: http://rhsqe-repo.lab.eng.blr.redhat.com/OCS/ocs-qe-bugs/bz-2267907/comment-6/
Output from C2 cluster (sagrawal-c2):
$ oc get volumesnapshots -A | grep busybox
busybox-workloads-cephfs-1 busybox-pvc-6-20240312170040 true busybox-pvc-6 123Gi ocs-storagecluster-cephfsplugin-snapclass snapcontent-af699638-7aa8-4e4e-864d-9151f69d92aa 29m 29m
$ oc get volumesnapshots -n busybox-workloads-cephfs-1 busybox-pvc-6-20240312170040 -o yaml
apiVersion: snapshot.storage.k8s.io/v1
kind: VolumeSnapshot
metadata:
creationTimestamp: "2024-03-12T17:00:40Z"
finalizers:
- snapshot.storage.kubernetes.io/volumesnapshot-as-source-protection
- snapshot.storage.kubernetes.io/volumesnapshot-bound-protection
generation: 1
labels:
volsync.backube/do-not-delete: "true"
name: busybox-pvc-6-20240312170040
namespace: busybox-workloads-cephfs-1
resourceVersion: "2496546"
uid: af699638-7aa8-4e4e-864d-9151f69d92aa
spec:
source:
persistentVolumeClaimName: busybox-pvc-6
volumeSnapshotClassName: ocs-storagecluster-cephfsplugin-snapclass
status:
boundVolumeSnapshotContentName: snapcontent-af699638-7aa8-4e4e-864d-9151f69d92aa
creationTime: "2024-03-12T17:00:42Z"
readyToUse: true
restoreSize: 123Gi
From toolbox pod:
sh-5.1$ ceph fs subvolume getpath ocs-storagecluster-cephfilesystem csi-vol-eeabaca1-d455-46fa-afd4-759100e9cb6f csi --format json
Error ENOENT: subvolume 'csi-vol-eeabaca1-d455-46fa-afd4-759100e9cb6f' is removed and has only snapshots retained
--- Additional comment from Sidhant Agrawal on 2024-03-12 18:08:04 UTC ---
(In reply to Sidhant Agrawal from comment #6)
>
> Issue was reproduced on RDR setup using above mentioned versions.
> ACM and ODF must-gather logs:
> http://rhsqe-repo.lab.eng.blr.redhat.com/OCS/ocs-qe-bugs/bz-2267907/comment-6/
>
Cluster details:
sagrawal-jhub -
kubeconfig: http://magna002.ceph.redhat.com/ocsci-jenkins/openshift-clusters/sagrawal-jhub/sagrawal-jhub_20240311T062420/openshift-cluster-dir/auth/kubeconfig
Web Console: https://console-openshift-console.apps.sagrawal-jhub.qe.rh-ocs.com
Login: kubeadmin
Password: Zn3Zj-iXAN7-AFFJJ-7nnaf
sagrawal-c1 -
kubeconfig: http://magna002.ceph.redhat.com/ocsci-jenkins/openshift-clusters/sagrawal-c1/sagrawal-c1_20240311T062447/openshift-cluster-dir/auth/kubeconfig
Web Console: https://console-openshift-console.apps.sagrawal-c1.qe.rh-ocs.com
Login: kubeadmin
Password: SpSqM-iTWCp-E5iZB-M6K5u
sagrawal-c2 -
kubeconfig: http://magna002.ceph.redhat.com/ocsci-jenkins/openshift-clusters/sagrawal-c2/sagrawal-c2_20240311T062521/openshift-cluster-dir/auth/kubeconfig
Web Console: https://console-openshift-console.apps.sagrawal-c2.qe.rh-ocs.com
Login: kubeadmin
Password: jAwqy-54od9-4h9fb-uHfDc
--- Additional comment from Benamar Mekhissi on 2024-03-13 03:08:42 UTC ---
Out of the 20 PVCs, the volumesnapshot for busybox-pvc-6 didn't get cleaned up by the garbage collection. Preliminary investigation revealed that during the rollback process, an in-progress sync prevented the previous volumesnapshot from being properly assigned an owner, leading to its retention. I intend to replicate the scenario locally in order to pinpoint the exact issue.
In the interim, the recommended workaround is to manually delete the orphaned volumesnapshot.
--- Additional comment from Benamar Mekhissi on 2024-03-21 13:38:26 UTC ---
PR: https://github.com/RamenDR/ramen/pull/1276
--- Additional comment from Shyamsundar on 2024-04-08 13:33:15 UTC ---
@bmekhiss request a backport of the changes to release-4.16 downstream branch.
https://github.com/RamenDR/ramen/pull/1276
--- Additional comment from RHEL Program Management on 2024-04-08 13:33:31 UTC ---
The 'Target Release' is not to be set manually at the Red Hat OpenShift Data Foundation product.
The 'Target Release' will be auto set appropriately, after the 3 Acks (pm,devel,qa) are set to "+" for a specific release flag and that release flag gets auto set to "+".
Comment 3 krishnaram Karthick
2024-05-02 11:41:49 UTC
Moving the bug to 4.15.4. We need to understand why this bug needs to be backported.