Bug 1915706 - [Azure][RBD] PV taking longer time ~ 9 minutes to get deleted
Summary: [Azure][RBD] PV taking longer time ~ 9 minutes to get deleted
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat OpenShift Container Storage
Classification: Red Hat Storage
Component: csi-driver
Version: 4.6
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: medium
Target Milestone: ---
Target Release: OCS 4.7.0
Assignee: Mudit Agarwal
QA Contact: Jilju Joy
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2021-01-13 09:55 UTC by Jilju Joy
Modified: 2021-05-19 09:18 UTC
CC: 5 users

Fixed In Version:
Doc Type: No Doc Update
Doc Text:
Clone Of:
Environment:
Last Closed: 2021-05-19 09:18:01 UTC
Embargoed:


Attachments:


Links
Red Hat Product Errata RHSA-2021:2041 (last updated 2021-05-19 09:18:40 UTC)

Description Jilju Joy 2021-01-13 09:55:09 UTC
Description of problem (please be as detailed as possible and provide log
snippets):
Test cases are failing with a timeout error while waiting for the PV to be deleted after the PVC is deleted. A 30Gi PV is taking 8-9 minutes to be deleted, which usually takes only a few seconds. The PVC was expanded from 10Gi to 30Gi and filled with 14Gi of data before deletion.


E               TimeoutError: Timeout when waiting for pvc-636aff3a-ddfa-4ee7-8169-6f253447019e to delete. Describe output: Name:            pvc-636aff3a-ddfa-4ee7-8169-6f253447019e
E               Labels:          <none>
E               Annotations:     pv.kubernetes.io/provisioned-by: openshift-storage.rbd.csi.ceph.com
E               Finalizers:      [kubernetes.io/pv-protection]
E               StorageClass:    ocs-storagecluster-ceph-rbd
E               Status:          Released
E               Claim:           namespace-test-5ada153658b64b388f1712218/pvc-test-1a0bd70eb60a406dbbc82159584c2ba
E               Reclaim Policy:  Delete
E               Access Modes:    RWO
E               VolumeMode:      Filesystem
E               Capacity:        30Gi
E               Node Affinity:   <none>
E               Message:         
E               Source:
E                   Type:              CSI (a Container Storage Interface (CSI) volume source)
E                   Driver:            openshift-storage.rbd.csi.ceph.com
E                   FSType:            ext4
E                   VolumeHandle:      0001-0011-openshift-storage-0000000000000001-fa34e810-4fe7-11eb-a919-0a580a80020c
E                   ReadOnly:          false
E                   VolumeAttributes:      clusterID=openshift-storage
E                                          imageFeatures=layering
E                                          imageFormat=2
E                                          imageName=csi-vol-fa34e810-4fe7-11eb-a919-0a580a80020c
E                                          journalPool=ocs-storagecluster-cephblockpool
E                                          pool=ocs-storagecluster-cephblockpool
E                                          radosNamespace=
E                                          storage.kubernetes.io/csiProvisionerIdentity=1609854627321-8081-openshift-storage.rbd.csi.ceph.com
E               Events:
E                 Type     Reason              Age                 From                                                                                                              Message
E                 ----     ------              ----                ----                                                                                                              -------
E                 Warning  VolumeFailedDelete  61s (x9 over 3m9s)  openshift-storage.rbd.csi.ceph.com_csi-rbdplugin-provisioner-8c588c46-84xwx_0c301fac-7649-4b06-a682-ea0d0e97ab4f  persistentvolume pvc-636aff3a-ddfa-4ee7-8169-6f253447019e is still attached to node sgatfane-0501-8mstv-worker-eastus2-zhnhr

ocs_ci/ocs/ocp.py:694: TimeoutError



These messages are repeated in the csi-provisioner container log of the csi-rbdplugin-provisioner-8c588c46-84xwx pod.

2021-01-06T06:35:23.530485765Z I0106 06:35:23.530392       1 controller.go:1453] delete "pvc-636aff3a-ddfa-4ee7-8169-6f253447019e": started
2021-01-06T06:35:23.544346666Z E0106 06:35:23.544274       1 controller.go:1463] delete "pvc-636aff3a-ddfa-4ee7-8169-6f253447019e": volume deletion failed: persistentvolume pvc-636aff3a-ddfa-4ee7-8169-6f253447019e is still attached to node sgatfane-0501-8mstv-worker-eastus2-zhnhr
2021-01-06T06:35:23.544422867Z W0106 06:35:23.544357       1 controller.go:998] Retrying syncing volume "pvc-636aff3a-ddfa-4ee7-8169-6f253447019e", failure 8
2021-01-06T06:35:23.544422867Z E0106 06:35:23.544390       1 controller.go:1016] error syncing volume "pvc-636aff3a-ddfa-4ee7-8169-6f253447019e": persistentvolume pvc-636aff3a-ddfa-4ee7-8169-6f253447019e is still attached to node sgatfane-0501-8mstv-worker-eastus2-zhnhr
2021-01-06T06:35:23.544546067Z I0106 06:35:23.544505       1 event.go:282] Event(v1.ObjectReference{Kind:"PersistentVolume", Namespace:"", Name:"pvc-636aff3a-ddfa-4ee7-8169-6f253447019e", UID:"e8c8fb8c-697d-45d8-9e27-02e3cb9c51bc", APIVersion:"v1", ResourceVersion:"511891", FieldPath:""}): type: 'Warning' reason: 'VolumeFailedDelete' persistentvolume pvc-636aff3a-ddfa-4ee7-8169-6f253447019e is still attached to node sgatfane-0501-8mstv-worker-eastus2-zhnhr


must-gather logs - http://magna002.ceph.redhat.com/ocsci-jenkins/openshift-clusters/sgatfane-0501/sgatfane-0501_20210105T125402/logs/failed_testcase_ocs_logs_1609910642/test_resource_deletion_during_pvc_expansion%5bcephfsplugin_provisioner%5d_ocs_logs/

Names of PVs from different test cases that took longer than usual to be deleted: pvc-398610e1-e2cd-4657-acdf-d9059ea16c4e, pvc-9a616d9d-deb4-4dfd-8e91-f4be85f85a2e, pvc-d8b745f6-d2ad-4ed6-b015-a7eec6d1d700, pvc-455e08e4-ba4e-4172-ae37-836f2a744430, pvc-636aff3a-ddfa-4ee7-8169-6f253447019e

Test run - https://ocs4-jenkins.rhev-ci-vms.eng.rdu2.redhat.com/job/qe-deploy-ocs-cluster/16050/
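
For context, the tear-down check that hits this timeout boils down to polling until the PV object disappears. Below is a minimal sketch of such a wait loop; it is illustrative only and assumes the oc CLI, a hypothetical helper name, and a 180-second default, not the actual ocs_ci/ocs/ocp.py implementation.

import subprocess
import time

def wait_for_pv_deletion(pv_name, timeout=180, interval=10):
    """Poll until the PV disappears or the timeout expires.

    Hypothetical helper illustrating the tear-down wait; the real check
    lives in ocs_ci/ocs/ocp.py and may differ in detail.
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        result = subprocess.run(
            ["oc", "get", "pv", pv_name, "--no-headers"],
            capture_output=True, text=True,
        )
        if result.returncode != 0:  # "NotFound" -> the PV is gone
            return True
        time.sleep(interval)
    raise TimeoutError(f"Timeout when waiting for {pv_name} to delete")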
===========================================================================
Version of all relevant components (if applicable):
Cluster Version: 4.6.0-0.nightly-2021-01-05-062422
OCS operator: v4.6.1-206.ci
Ceph Version: 14.2.8-115.el8cp
rook_csi_ceph: cephcsi@sha256:c71e0a6d2acb0d5933b1dfcc8cca55de0fd05f8524b16172e687df69eab2fd1d
rook_csi_provisioner: ose-csi-external-provisioner@sha256:6fa9eb53e58a31ea1a3294c1f7c06c34b5315fb60f2cf066932937667bcc2399


Does this issue impact your ability to continue to work with the product
(please explain in detail what is the user impact)?
PVs are getting deleted eventually

Is there any workaround available to the best of your knowledge?
No

Rate from 1 - 5 the complexity of the scenario you performed that caused this
bug (1 - very simple, 5 - very complex)?
3


Is this issue reproducible?
4/5


Can this issue be reproduced from the UI?


If this is a regression, please provide more details to justify this:


Steps to Reproduce:
Run the ocs-ci test case:
tests/manage/pv_services/pvc_resize/test_resource_deletion_during_pvc_expansion.py::TestResourceDeletionDuringPvcExpansion::test_resource_deletion_during_pvc_expansion[cephfsplugin_provisioner]

The issue is also seen in a few other test cases.


Actual results:
The test case fails in the tear-down phase due to a timeout while waiting for the PV to be deleted.


Expected results:
PV should not take 8-9 minutes to get deleted.


Additional info:

Comment 2 Yaniv Kaul 2021-01-13 10:18:27 UTC
The logs hint that the PV was still attached. So is it a matter of slow detach? Perhaps something else?
It's not clear from the description if it's RBD or CephFS - can you re-verify?

Comment 3 Jilju Joy 2021-01-13 13:33:45 UTC
(In reply to Yaniv Kaul from comment #2)
> The logs hint that the PV was still attached. So is it a matter of slow
> detach? Perhaps something else?

Checked the logs for PV pvc-398610e1-e2cd-4657-acdf-d9059ea16c4e; delete error:
Warning  VolumeFailedDelete  61s (x9 over 3m8s)  openshift-storage.rbd.csi.ceph.com_csi-rbdplugin-provisioner-8c588c46-84xwx_0c301fac-7649-4b06-a682-ea0d0e97ab4f  persistentvolume pvc-398610e1-e2cd-4657-acdf-d9059ea16c4e is still attached to node sgatfane-0501-8mstv-worker-eastus3-qnfwb


From the csi-rbdplugin logs of the csi-rbdplugin-hb224 pod, the volume unmount completed at 05:36:54.

2021-01-06T05:31:16.122771371Z I0106 05:31:16.122723   28927 nodeserver.go:598] ID: 4740 Req-ID: 0001-0011-openshift-storage-0000000000000001-938ccc42-4fdf-11eb-a919-0a580a80020c rbd: successfully unbound volume 0001-0011-openshift-storage-0000000000000001-938ccc42-4fdf-11eb-a919-0a580a80020c from /var/lib/kubelet/pods/6750dce8-bbb6-4cfa-a7d4-0ba59a596954/volumes/kubernetes.io~csi/pvc-398610e1-e2cd-4657-acdf-d9059ea16c4e/mount


2021-01-06T05:36:54.269618019Z I0106 05:36:54.269577   28927 nodeserver.go:687] ID: 4742 Req-ID: 0001-0011-openshift-storage-0000000000000001-938ccc42-4fdf-11eb-a919-0a580a80020c successfully unmounted volume (0001-0011-openshift-storage-0000000000000001-938ccc42-4fdf-11eb-a919-0a580a80020c) from staging path (/var/lib/kubelet/plugins/kubernetes.io/csi/pv/pvc-398610e1-e2cd-4657-acdf-d9059ea16c4e/globalmount/0001-0011-openshift-storage-0000000000000001-938ccc42-4fdf-11eb-a919-0a580a80020c)


2021-01-06T05:37:26.298452895Z I0106 05:37:26.298428   28927 nodeserver.go:663] ID: 4796 Req-ID: 0001-0011-openshift-storage-0000000000000001-938ccc42-4fdf-11eb-a919-0a580a80020c failed to find image metadata: missing stash: open /var/lib/kubelet/plugins/kubernetes.io/csi/pv/pvc-398610e1-e2cd-4657-acdf-d9059ea16c4e/globalmount/image-meta.json: no such file or directory


The log below is from the csi-provisioner container of the csi-rbdplugin-provisioner-8c588c46-84xwx pod. The volume delete "succeeded" message was generated at 05:40:50. No sync happened between 05:36:54 (when the volume unmount completed) and 05:40:50.

2021-01-06T05:40:50.304797144Z I0106 05:40:50.304791       1 controller.go:1523] delete "pvc-398610e1-e2cd-4657-acdf-d9059ea16c4e": succeeded

> It's not clear from the description if it's RBD or CephFS - can you
> re-verify?
Yes, RBD. The test case creates both RBD and CephFS PVCs. The issue is seen only for RBD.

Comment 4 Mudit Agarwal 2021-01-13 13:39:25 UTC
It looks like there was some issue with NodeUnstage because of which the PVC was not detached:

>> Unstage request (ID 4742) was issued but didn't get a response back
2021-01-06T05:31:16.221621862Z I0106 05:31:16.221590   28927 utils.go:159] ID: 4742 Req-ID: 0001-0011-openshift-storage-0000000000000001-938ccc42-4fdf-11eb-a919-0a580a80020c GRPC call: /csi.v1.Node/NodeUnstageVolume
2021-01-06T05:31:16.223006573Z I0106 05:31:16.222968   28927 utils.go:160] ID: 4742 Req-ID: 0001-0011-openshift-storage-0000000000000001-938ccc42-4fdf-11eb-a919-0a580a80020c GRPC request: {"staging_target_path":"/var/lib/kubelet/plugins/kubernetes.io/csi/pv/pvc-398610e1-e2cd-4657-acdf-d9059ea16c4e/globalmount","volume_id":"0001-0011-openshift-storage-0000000000000001-938ccc42-4fdf-11eb-a919-0a580a80020c"}


>> When the above response didn't come back, another unstage request (ID 4758) was issued for the same volume, but it was aborted because the earlier request was still in flight:

2021-01-06T05:33:16.822896241Z I0106 05:33:16.822864   28927 utils.go:159] ID: 4758 Req-ID: 0001-0011-openshift-storage-0000000000000001-938ccc42-4fdf-11eb-a919-0a580a80020c GRPC call: /csi.v1.Node/NodeUnstageVolume
2021-01-06T05:33:16.824220722Z I0106 05:33:16.824187   28927 utils.go:160] ID: 4758 Req-ID: 0001-0011-openshift-storage-0000000000000001-938ccc42-4fdf-11eb-a919-0a580a80020c GRPC request: {"staging_target_path":"/var/lib/kubelet/plugins/kubernetes.io/csi/pv/pvc-398610e1-e2cd-4657-acdf-d9059ea16c4e/globalmount","volume_id":"0001-0011-openshift-storage-0000000000000001-938ccc42-4fdf-11eb-a919-0a580a80020c"}
2021-01-06T05:33:16.824244922Z E0106 05:33:16.824225   28927 nodeserver.go:626] ID: 4758 Req-ID: 0001-0011-openshift-storage-0000000000000001-938ccc42-4fdf-11eb-a919-0a580a80020c an operation with the given Volume ID 0001-0011-openshift-storage-0000000000000001-938ccc42-4fdf-11eb-a919-0a580a80020c already exists
2021-01-06T05:33:16.824254621Z E0106 05:33:16.824239   28927 utils.go:163] ID: 4758 Req-ID: 0001-0011-openshift-storage-0000000000000001-938ccc42-4fdf-11eb-a919-0a580a80020c GRPC error: rpc error: code = Aborted desc = an operation with the given Volume ID 0001-0011-openshift-storage-0000000000000001-938ccc42-4fdf-11eb-a919-0a580a80020c already exists

>> This kept repeating for some time because the request with ID 4742 had still not returned.

2021-01-06T05:33:17.930547809Z I0106 05:33:17.930513   28927 utils.go:159] ID: 4762 Req-ID: 0001-0011-openshift-storage-0000000000000001-938ccc42-4fdf-11eb-a919-0a580a80020c GRPC call: /csi.v1.Node/NodeUnstageVolume
2021-01-06T05:33:17.930946503Z I0106 05:33:17.930926   28927 utils.go:160] ID: 4762 Req-ID: 0001-0011-openshift-storage-0000000000000001-938ccc42-4fdf-11eb-a919-0a580a80020c GRPC request: {"staging_target_path":"/var/lib/kubelet/plugins/kubernetes.io/csi/pv/pvc-398610e1-e2cd-4657-acdf-d9059ea16c4e/globalmount","volume_id":"0001-0011-openshift-storage-0000000000000001-938ccc42-4fdf-11eb-a919-0a580a80020c"}
2021-01-06T05:33:17.930972703Z E0106 05:33:17.930955   28927 nodeserver.go:626] ID: 4762 Req-ID: 0001-0011-openshift-storage-0000000000000001-938ccc42-4fdf-11eb-a919-0a580a80020c an operation with the given Volume ID 0001-0011-openshift-storage-0000000000000001-938ccc42-4fdf-11eb-a919-0a580a80020c already exists
2021-01-06T05:33:17.930982503Z E0106 05:33:17.930971   28927 utils.go:163] ID: 4762 Req-ID: 0001-0011-openshift-storage-0000000000000001-938ccc42-4fdf-11eb-a919-0a580a80020c GRPC error: rpc error: code = Aborted desc = an operation with the given Volume ID 0001-0011-openshift-storage-0000000000000001-938ccc42-4fdf-11eb-a919-0a580a80020c already exists

>> The moment the response from the original request came back, the volume was unmounted and the provisioner was able to delete the volume:

2021-01-06T05:36:54.269578118Z I0106 05:36:54.269520   28927 cephcmds.go:59] ID: 4742 Req-ID: 0001-0011-openshift-storage-0000000000000001-938ccc42-4fdf-11eb-a919-0a580a80020c command succeeded: rbd [unmap --device-type krbd ocs-storagecluster-cephblockpool/csi-vol-938ccc42-4fdf-11eb-a919-0a580a80020c]
2021-01-06T05:36:54.269618019Z I0106 05:36:54.269577   28927 nodeserver.go:687] ID: 4742 Req-ID: 0001-0011-openshift-storage-0000000000000001-938ccc42-4fdf-11eb-a919-0a580a80020c successfully unmounted volume (0001-0011-openshift-storage-0000000000000001-938ccc42-4fdf-11eb-a919-0a580a80020c) from staging path (/var/lib/kubelet/plugins/kubernetes.io/csi/pv/pvc-398610e1-e2cd-4657-acdf-d9059ea16c4e/globalmount/0001-0011-openshift-storage-0000000000000001-938ccc42-4fdf-11eb-a919-0a580a80020c)
2021-01-06T05:36:54.270450325Z I0106 05:36:54.270421   28927 utils.go:165] ID: 4742 Req-ID: 0001-0011-openshift-storage-0000000000000001-938ccc42-4fdf-11eb-a919-0a580a80020c GRPC response: {}

It looks like an intermittent issue to me; is it reproducible every time? The gRPC response might also be delayed or stuck because of a network glitch.
Can you please try again?
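
For illustration, the Aborted error above is the expected behaviour when a second NodeUnstageVolume call arrives while one is already in flight for the same volume ID: the node server keeps per-volume operation tracking and rejects concurrent requests. A rough conceptual sketch of that pattern follows (cephcsi itself is written in Go; the Python below is only an assumption-labelled analogy, not the real code):

import threading

class VolumeLocks:
    """Per-volume in-flight operation tracking, analogous in spirit to the
    lock taken before NodeUnstageVolume. Conceptual sketch only, not the
    actual cephcsi implementation."""

    def __init__(self):
        self._lock = threading.Lock()
        self._in_flight = set()

    def try_acquire(self, volume_id):
        with self._lock:
            if volume_id in self._in_flight:
                return False  # caller maps this to a gRPC Aborted error
            self._in_flight.add(volume_id)
            return True

    def release(self, volume_id):
        with self._lock:
            self._in_flight.discard(volume_id)

locks = VolumeLocks()

def node_unstage_volume(volume_id):
    if not locks.try_acquire(volume_id):
        raise RuntimeError(
            f"an operation with the given Volume ID {volume_id} already exists"
        )
    try:
        pass  # unmap/unmount work would happen here
    finally:
        locks.release(volume_id)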

Comment 5 Mudit Agarwal 2021-01-13 13:45:16 UTC
We also don't have any provisioner logs between 2021-01-06T05:36:54 (when the unstage was successful) and 2021-01-06T05:40 (when the provisioner actually deleted the volume):

2021-01-06T05:36:33.737305160Z W0106 05:36:33.737281       1 controller.go:998] Retrying syncing volume "pvc-398610e1-e2cd-4657-acdf-d9059ea16c4e", failure 9
2021-01-06T05:36:33.737305160Z E0106 05:36:33.737298       1 controller.go:1016] error syncing volume "pvc-398610e1-e2cd-4657-acdf-d9059ea16c4e": persistentvolume pvc-398610e1-e2cd-4657-acdf-d9059ea16c4e is still attached to node sgatfane-0501-8mstv-worker-eastus3-qnfwb
2021-01-06T05:36:33.737357560Z I0106 05:36:33.737324       1 event.go:282] Event(v1.ObjectReference{Kind:"PersistentVolume", Namespace:"", Name:"pvc-398610e1-e2cd-4657-acdf-d9059ea16c4e", UID:"3b9835e5-e0b5-4620-8965-70eeb66abb9a", APIVersion:"v1", ResourceVersion:"479816", FieldPath:""}): type: 'Warning' reason: 'VolumeFailedDelete' persistentvolume pvc-398610e1-e2cd-4657-acdf-d9059ea16c4e is still attached to node sgatfane-0501-8mstv-worker-eastus3-qnfwb
2021-01-06T05:40:22.558773116Z I0106 05:40:22.558707       1 controller.go:1317] provision "namespace-test-844d52b4967d485f84907aebe/pvc-test-46575dae2e644885963f2dbc9cb3c08" class "ocs-storagecluster-ceph-rbd": started
2021-01-06T05:40:22.558921417Z I0106 05:40:22.558863       1 event.go:282] Event(v1.ObjectReference{Kind:"PersistentVolumeClaim", Namespace:"namespace-test-844d52b4967d485f84907aebe", Name:"pvc-test-46575dae2e644885963f2dbc9cb3c08", UID:"9a616d9d-deb4-4dfd-8e91-f4be85f85a2e", APIVersion:"v1", ResourceVersion:"483883", FieldPath:""}): type: 'Normal' reason: 'Provisioning' External provisioner is provisioning volume for claim "namespace-test-844d52b4967d485f84907aebe/pvc-test-46575dae2e644885963f2dbc9cb3c08"

So, again it looks like there were some disconnections in between.

Comment 6 Jilju Joy 2021-01-15 08:57:48 UTC
(In reply to Mudit Agarwal from comment #4)
> Looks like there was some issue with NodeUnstage because of which pvc was
> not detached:
> 
> >> Unstage request (ID 4742) was issued but didn't get a response back
> 2021-01-06T05:31:16.221621862Z I0106 05:31:16.221590   28927 utils.go:159]
> ID: 4742 Req-ID:
> 0001-0011-openshift-storage-0000000000000001-938ccc42-4fdf-11eb-a919-
> 0a580a80020c GRPC call: /csi.v1.Node/NodeUnstageVolume
> 2021-01-06T05:31:16.223006573Z I0106 05:31:16.222968   28927 utils.go:160]
> ID: 4742 Req-ID:
> 0001-0011-openshift-storage-0000000000000001-938ccc42-4fdf-11eb-a919-
> 0a580a80020c GRPC request:
> {"staging_target_path":"/var/lib/kubelet/plugins/kubernetes.io/csi/pv/pvc-
> 398610e1-e2cd-4657-acdf-d9059ea16c4e/globalmount","volume_id":"0001-0011-
> openshift-storage-0000000000000001-938ccc42-4fdf-11eb-a919-0a580a80020c"}
> 
> 
> >> when the above response didn't come back it issued another unstage request(ID 4758) for the same volume but that got aborted because earlier request was still hanging:
> 
> 2021-01-06T05:33:16.822896241Z I0106 05:33:16.822864   28927 utils.go:159]
> ID: 4758 Req-ID:
> 0001-0011-openshift-storage-0000000000000001-938ccc42-4fdf-11eb-a919-
> 0a580a80020c GRPC call: /csi.v1.Node/NodeUnstageVolume
> 2021-01-06T05:33:16.824220722Z I0106 05:33:16.824187   28927 utils.go:160]
> ID: 4758 Req-ID:
> 0001-0011-openshift-storage-0000000000000001-938ccc42-4fdf-11eb-a919-
> 0a580a80020c GRPC request:
> {"staging_target_path":"/var/lib/kubelet/plugins/kubernetes.io/csi/pv/pvc-
> 398610e1-e2cd-4657-acdf-d9059ea16c4e/globalmount","volume_id":"0001-0011-
> openshift-storage-0000000000000001-938ccc42-4fdf-11eb-a919-0a580a80020c"}
> 2021-01-06T05:33:16.824244922Z E0106 05:33:16.824225   28927
> nodeserver.go:626] ID: 4758 Req-ID:
> 0001-0011-openshift-storage-0000000000000001-938ccc42-4fdf-11eb-a919-
> 0a580a80020c an operation with the given Volume ID
> 0001-0011-openshift-storage-0000000000000001-938ccc42-4fdf-11eb-a919-
> 0a580a80020c already exists
> 2021-01-06T05:33:16.824254621Z E0106 05:33:16.824239   28927 utils.go:163]
> ID: 4758 Req-ID:
> 0001-0011-openshift-storage-0000000000000001-938ccc42-4fdf-11eb-a919-
> 0a580a80020c GRPC error: rpc error: code = Aborted desc = an operation with
> the given Volume ID
> 0001-0011-openshift-storage-0000000000000001-938ccc42-4fdf-11eb-a919-
> 0a580a80020c already exists
> 
> >> and this kept on repeating for some time because the request with ID 4742 was still not back.
> 
> 2021-01-06T05:33:17.930547809Z I0106 05:33:17.930513   28927 utils.go:159]
> ID: 4762 Req-ID:
> 0001-0011-openshift-storage-0000000000000001-938ccc42-4fdf-11eb-a919-
> 0a580a80020c GRPC call: /csi.v1.Node/NodeUnstageVolume
> 2021-01-06T05:33:17.930946503Z I0106 05:33:17.930926   28927 utils.go:160]
> ID: 4762 Req-ID:
> 0001-0011-openshift-storage-0000000000000001-938ccc42-4fdf-11eb-a919-
> 0a580a80020c GRPC request:
> {"staging_target_path":"/var/lib/kubelet/plugins/kubernetes.io/csi/pv/pvc-
> 398610e1-e2cd-4657-acdf-d9059ea16c4e/globalmount","volume_id":"0001-0011-
> openshift-storage-0000000000000001-938ccc42-4fdf-11eb-a919-0a580a80020c"}
> 2021-01-06T05:33:17.930972703Z E0106 05:33:17.930955   28927
> nodeserver.go:626] ID: 4762 Req-ID:
> 0001-0011-openshift-storage-0000000000000001-938ccc42-4fdf-11eb-a919-
> 0a580a80020c an operation with the given Volume ID
> 0001-0011-openshift-storage-0000000000000001-938ccc42-4fdf-11eb-a919-
> 0a580a80020c already exists
> 2021-01-06T05:33:17.930982503Z E0106 05:33:17.930971   28927 utils.go:163]
> ID: 4762 Req-ID:
> 0001-0011-openshift-storage-0000000000000001-938ccc42-4fdf-11eb-a919-
> 0a580a80020c GRPC error: rpc error: code = Aborted desc = an operation with
> the given Volume ID
> 0001-0011-openshift-storage-0000000000000001-938ccc42-4fdf-11eb-a919-
> 0a580a80020c already exists
> 
> >> the moment response came from the original request, it unmounted the volume and provisioner was able to delete the volume
> 
> 2021-01-06T05:36:54.269578118Z I0106 05:36:54.269520   28927 cephcmds.go:59]
> ID: 4742 Req-ID:
> 0001-0011-openshift-storage-0000000000000001-938ccc42-4fdf-11eb-a919-
> 0a580a80020c command succeeded: rbd [unmap --device-type krbd
> ocs-storagecluster-cephblockpool/csi-vol-938ccc42-4fdf-11eb-a919-
> 0a580a80020c]
> 2021-01-06T05:36:54.269618019Z I0106 05:36:54.269577   28927
> nodeserver.go:687] ID: 4742 Req-ID:
> 0001-0011-openshift-storage-0000000000000001-938ccc42-4fdf-11eb-a919-
> 0a580a80020c successfully unmounted volume
> (0001-0011-openshift-storage-0000000000000001-938ccc42-4fdf-11eb-a919-
> 0a580a80020c) from staging path
> (/var/lib/kubelet/plugins/kubernetes.io/csi/pv/pvc-398610e1-e2cd-4657-acdf-
> d9059ea16c4e/globalmount/0001-0011-openshift-storage-0000000000000001-
> 938ccc42-4fdf-11eb-a919-0a580a80020c)
> 2021-01-06T05:36:54.270450325Z I0106 05:36:54.270421   28927 utils.go:165]
> ID: 4742 Req-ID:
> 0001-0011-openshift-storage-0000000000000001-938ccc42-4fdf-11eb-a919-
> 0a580a80020c GRPC response: {}
> 
> It looks like an intermittent issue to me, is it reproducible every time.
> GRPC response might be delayed or stuck because of a network glitch also.
> Can you please try again?

We also assumed that this could be an intermittent issue. But after seeing multiple test case failures due to this error, we thought the issue was worth debugging. These test cases failed with this error in the last 3 runs on Azure:

tests/manage/pv_services/pvc_resize/test_resource_deletion_during_pvc_expansion.py::TestResourceDeletionDuringPvcExpansion::test_resource_deletion_during_pvc_expansion[mgr]
tests/manage/pv_services/pvc_resize/test_resource_deletion_during_pvc_expansion.py::TestResourceDeletionDuringPvcExpansion::test_resource_deletion_during_pvc_expansion[osd]
tests/manage/pv_services/pvc_resize/test_resource_deletion_during_pvc_expansion.py::TestResourceDeletionDuringPvcExpansion::test_resource_deletion_during_pvc_expansion[rbdplugin]
tests/manage/pv_services/pvc_resize/test_resource_deletion_during_pvc_expansion.py::TestResourceDeletionDuringPvcExpansion::test_resource_deletion_during_pvc_expansion[cephfsplugin]
tests/manage/pv_services/pvc_resize/test_resource_deletion_during_pvc_expansion.py::TestResourceDeletionDuringPvcExpansion::test_resource_deletion_during_pvc_expansion[rbdplugin_provisioner]
tests/manage/pv_services/pvc_resize/test_resource_deletion_during_pvc_expansion.py::TestResourceDeletionDuringPvcExpansion::test_resource_deletion_during_pvc_expansion[cephfsplugin_provisioner]

Comment 7 Mudit Agarwal 2021-01-15 12:28:07 UTC
Niels just pointed out that because we are writing a lot of data before deleting the PVC, unstage/unmount will take time: in order to unmount, all the data needs to be flushed first. That is why unmount is taking so long here.

Jilju, is this a new test case, or have we increased the amount of data written before deletion? Can we make sure the test case syncs the data before issuing the deletion command?

Comment 8 Jilju Joy 2021-01-15 13:43:54 UTC
(In reply to Mudit Agarwal from comment #7)
> Niels just pointed out that because we are writing a lot of data before
> deleting the PVC unstage/unmount will take time because in order to unmount
> all the data needs to be flushed first. That is the reason unmount is taking
> a lot of time here.
> 
> Jilju, is this a new test case or have we increased the amount of the data
> written before deletion?
These tests were automated during the OCS 4.5 test cycle. We haven't increased the amount of data since then. I just checked the execution history of these tests: over the last 10 runs, the issue is seen only on the Azure platform. The test cases passed on AWS, bare metal and VMware.

>Can we make sure in the test case to sync the data
> before issuing a deletion command?
If we do this explicitly, it will have to be applied to all test cases. Is there an automatic sync happening periodically?

Comment 9 Mudit Agarwal 2021-01-19 12:16:22 UTC
Automatic sync always happens; we cannot unmount or delete a PVC before the data syncs to the disk. So every time we issue a deletion command, the sync has to finish before the actual deletion.
Maybe on Azure it is taking more time to sync the data to the disks, but ceph-csi cannot do anything about that.

To verify this, we can call run_io() with end_fsync=1. I can see that the same is done in other test cases, and ideally we should always do it in test cases where cloning or deletion is performed right after writing the data, so that we are sure the data is synced to the disk.

Can you please rerun after tweaking it and share the results once?
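
As a rough illustration of the suggested tweak, the sketch below drives fio directly with end_fsync=1 (a standard fio option that fsyncs the written file when the job finishes); the helper name and invocation are assumptions and do not reflect the actual ocs-ci run_io() signature:

import subprocess

def run_fio_with_end_fsync(mount_path, size="3G"):
    """Write data and fsync it before the job exits, so a PVC delete issued
    immediately afterwards does not race a large page-cache flush.
    Illustrative only; the actual test uses the ocs-ci run_io() helper."""
    subprocess.run(
        [
            "fio",
            "--name=pvc-delete-prefill",
            f"--filename={mount_path}/fio-testfile",
            f"--size={size}",
            "--rw=write",
            "--bs=1M",
            "--end_fsync=1",   # flush dirty data at job end
            "--direct=0",
        ],
        check=True,
    )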

Comment 10 Jilju Joy 2021-01-20 11:11:20 UTC
(In reply to Mudit Agarwal from comment #9)
> Automatic sync always happens, we can not unmount or delete a pvc before the
> data syncs to the disk. So everytime we issue a deletion command sync should
> finish before the actual deletion.
> May be on Azure it is taking more time to sync the data to the disks, but
> ceph-csi cannot do anything in that case.
> 
> To verify the same we can call run_io() with end_fsync=1. I can see that the
> same is done in other test cases and ideally we should always do it in the
> test cases where cloning or deletion is performed right after writing the
> data so that we are sure that the data is synced to the disk.
> 
> Can you please rerun after tweaking it and share the results once?
Hi Mudit,
Test cases passed on Azure after adding end_fsync=1 in fio.
Result - https://ocs4-jenkins.rhev-ci-vms.eng.rdu2.redhat.com/job/qe-deploy-ocs-cluster/16532/
Logs - http://magna002.ceph.redhat.com/ocsci-jenkins/openshift-clusters/musoni-12/musoni-12_20210119T174315/logs/ocs-ci-logs-1611136506/

Executed test cases:
tests/manage/pv_services/pvc_resize/test_resource_deletion_during_pvc_expansion.py::TestResourceDeletionDuringPvcExpansion::test_resource_deletion_during_pvc_expansion[mgr]
tests/manage/pv_services/pvc_resize/test_resource_deletion_during_pvc_expansion.py::TestResourceDeletionDuringPvcExpansion::test_resource_deletion_during_pvc_expansion[osd]
tests/manage/pv_services/pvc_resize/test_resource_deletion_during_pvc_expansion.py::TestResourceDeletionDuringPvcExpansion::test_resource_deletion_during_pvc_expansion[rbdplugin]
tests/manage/pv_services/pvc_resize/test_resource_deletion_during_pvc_expansion.py::TestResourceDeletionDuringPvcExpansion::test_resource_deletion_during_pvc_expansion[cephfsplugin]
tests/manage/pv_services/pvc_resize/test_resource_deletion_during_pvc_expansion.py::TestResourceDeletionDuringPvcExpansion::test_resource_deletion_during_pvc_expansion[rbdplugin_provisioner]
tests/manage/pv_services/pvc_resize/test_resource_deletion_during_pvc_expansion.py::TestResourceDeletionDuringPvcExpansion::test_resource_deletion_during_pvc_expansion[cephfsplugin_provisioner]

Comment 11 Mudit Agarwal 2021-01-21 05:25:37 UTC
Thanks Jilju.
Adding a few points based on the observations:

Deletion is a three-step process:
1. Sync the newly written data.
2. Unpublish/unmount the PVC.
3. Do the actual deletion.

Now, the test case was not waiting for the first step to finish, and hence deletion was taking time. After enabling "wait for sync", it is confirmed that this is a slow sync issue rather than a ceph-csi issue.

I checked the test case: there is a timeout value for deletion but not for sync, which means the test will fail if deletion takes more time but won't fail if sync takes more time.
So to conclude:
1. Detach/deletion is not taking extra time.
2. Sync is what is taking more time, and we should measure/timeout that (a sketch of such a timed sync follows below).

Also, this issue will occur in cases where a user deletes the PVC just after writing the data.
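
A hedged sketch of point 2 above: timing an explicit sync inside the pod before the delete is issued, so that a slow flush shows up separately from a slow PV deletion. The helper name, the oc exec usage and the timeout value are assumptions, not part of the test framework.

import subprocess
import time

def timed_sync(namespace, pod_name, sync_timeout=300):
    """Flush dirty data inside the pod and fail if the flush itself is slow,
    so a slow sync is reported separately from a slow PV deletion."""
    start = time.monotonic()
    subprocess.run(
        ["oc", "-n", namespace, "exec", pod_name, "--", "sync"],
        check=True, timeout=sync_timeout,
    )
    elapsed = time.monotonic() - start
    print(f"sync in {pod_name} took {elapsed:.1f}s")
    return elapsed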

Comment 12 Michael Adam 2021-01-22 09:50:31 UTC
(In reply to Mudit Agarwal from comment #11)
> Thanks Jiliju.
> Adding few points based on the observations:
> 
> Deletion is a three step process:
> 1. Sync the newly written data
> 2. Unpublish/unmount the pvc
> 3. Do actual deletion
> 
> Now, the test case was not waiting to finish the first step and hence
> deletion is taking time. After enabling "wait for sync" its confirmed that
> it is a slow sync issue rather than a ceph-csi issue.
> 
> I checked the test case, there is a timeout value for deletion but not for
> sync which means test will fail if deletion is taking more time but won't
> fail if sync is taking more time.
> So to conclude:
> 1. detach/deletion is not taking extra time
> 2. sync should be taking more time and we should measure/timeout that
> 
> Also, this issue will occur in cases where an user is deleting the pvc just
> after writing the data.

Thanks for the summary, Mudit!

Conceptually, this seems to be the way it generally works, so it's working as designed, but the sync step seems particularly slow here.
Could it be that slow because we decided not to backport the disk tuning patches, and therefore the disks are not tuned for optimal performance on Azure?

@sabose

Comment 13 Sahina Bose 2021-01-25 14:36:45 UTC
(In reply to Michael Adam from comment #12)
> (In reply to Mudit Agarwal from comment #11)
> > Thanks Jiliju.
> > Adding few points based on the observations:
> > 
> > Deletion is a three step process:
> > 1. Sync the newly written data
> > 2. Unpublish/unmount the pvc
> > 3. Do actual deletion
> > 
> > Now, the test case was not waiting to finish the first step and hence
> > deletion is taking time. After enabling "wait for sync" its confirmed that
> > it is a slow sync issue rather than a ceph-csi issue.
> > 
> > I checked the test case, there is a timeout value for deletion but not for
> > sync which means test will fail if deletion is taking more time but won't
> > fail if sync is taking more time.
> > So to conclude:
> > 1. detach/deletion is not taking extra time
> > 2. sync should be taking more time and we should measure/timeout that
> > 
> > Also, this issue will occur in cases where an user is deleting the pvc just
> > after writing the data.
> 
> Thanks for the summary, Mudit!
> 
> Conceptually, this seems to be the way it generally works, so it's working
> as designend, but the sync step seems particulary slow here.
> Could it be that slow because we decided not to backport the disk tuning
> patches, and therefore the disks are not tuned for optimal performance on
> Azure?
> 
> @sabose

Yes, the sync time is due to the performance issues seen with Azure. On re-running the tests with the workaround proposed to have the disk recognized as an SSD (https://bugzilla.redhat.com/show_bug.cgi?id=1873161#c21), these tests passed. @jijoy to confirm

Comment 14 Jilju Joy 2021-01-25 18:09:09 UTC
(In reply to Sahina Bose from comment #13)
> (In reply to Michael Adam from comment #12)
> > (In reply to Mudit Agarwal from comment #11)
> > > Thanks Jiliju.
> > > Adding few points based on the observations:
> > > 
> > > Deletion is a three step process:
> > > 1. Sync the newly written data
> > > 2. Unpublish/unmount the pvc
> > > 3. Do actual deletion
> > > 
> > > Now, the test case was not waiting to finish the first step and hence
> > > deletion is taking time. After enabling "wait for sync" its confirmed that
> > > it is a slow sync issue rather than a ceph-csi issue.
> > > 
> > > I checked the test case, there is a timeout value for deletion but not for
> > > sync which means test will fail if deletion is taking more time but won't
> > > fail if sync is taking more time.
> > > So to conclude:
> > > 1. detach/deletion is not taking extra time
> > > 2. sync should be taking more time and we should measure/timeout that
> > > 
> > > Also, this issue will occur in cases where an user is deleting the pvc just
> > > after writing the data.
> > 
> > Thanks for the summary, Mudit!
> > 
> > Conceptually, this seems to be the way it generally works, so it's working
> > as designend, but the sync step seems particulary slow here.
> > Could it be that slow because we decided not to backport the disk tuning
> > patches, and therefore the disks are not tuned for optimal performance on
> > Azure?
> > 
> > @sabose
> 
> Yes, the sync time is because of the performance issues seen with Azure. On
> re-running the tests with workaround proposed to recognize the disk as SSD
> (https://bugzilla.redhat.com/show_bug.cgi?id=1873161#c21), these tests
> passed. @jijoy to confirm

Hi Sahina,
Are you referring to the tests executed on the cluster shared by Rajat?

If yes, the test cases mentioned in comment #12 were not executed as-is. An error occurred while checking the expanded size on the pod; the new size was not reflected on the pod, so the test cases failed before the IO stage. We therefore retried the test after removing the expansion and disruption parts from it.
The test steps became:

1. Create PVCs of size 30 Gi 
2. Run IO and verify (Fio without end_fsync)
3. Delete PVCs
4. Verify PVCs and PVs are deleted.

I executed this test 4 times. In the last run, one PV took 4 minutes to be deleted. In the other 3 runs, the PVs were deleted within the default wait time of 180 seconds. This is an improvement compared to the PV deletion time without the given fix.

Comment 15 Mudit Agarwal 2021-01-29 05:19:56 UTC
Elad, can we close this bug? There is nothing to do from the CSI side.

Comment 16 Mudit Agarwal 2021-02-02 10:29:25 UTC
Closing it based on the above comment, feel free to reopen if you think otherwise.

Comment 17 Elad 2021-02-10 07:50:54 UTC
Moving this bug to VERIFIED instead of CLOSED, as the testing of the flows was done by Jilju and reported in comment #14, after the fix for bug 1909793.

Comment 22 errata-xmlrpc 2021-05-19 09:18:01 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: Red Hat OpenShift Container Storage 4.7.0 security, bug fix, and enhancement update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:2041

