Note: This bug is displayed in read-only format because the product is no longer active in Red Hat Bugzilla.

Bug 2031083

Summary: EFS CSI driver cannot delete volumes under load
Product: OpenShift Container Platform
Reporter: Jan Safranek <jsafrane>
Component: Storage
Assignee: Tomas Smetana <tsmetana>
Storage sub component: Kubernetes External Components
QA Contact: Rohit Patil <ropatil>
Status: CLOSED DEFERRED
Severity: high
Priority: high
CC: aos-bugs, gcharot, lwan, ropatil, vfarias, wduan
Version: 4.10
Keywords: Reopened
Last Closed: 2023-01-23 12:01:37 UTC

Description Jan Safranek 2021-12-10 13:26:35 UTC
This bug was initially created as a copy of Bug #2029521. We fixed the original BZ with a workaround. This BZ is to fix the CSI driver properly: enable parallel volume deletion and expect it to delete volumes in a reasonable time (as the upstream CSI driver does).
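
For reference, the parallelism in question is controlled by the external-provisioner sidecar's --worker-threads flag. A minimal sketch for checking it on a live cluster, assuming the operator-managed controller runs as aws-efs-csi-driver-controller in openshift-cluster-csi-drivers (those names are my assumption, not taken from this BZ):

# Hedged: deployment/namespace names are assumptions; the operator may
# manage these args itself and revert manual edits.
oc -n openshift-cluster-csi-drivers get deployment aws-efs-csi-driver-controller \
  -o jsonpath='{.spec.template.spec.containers[?(@.name=="csi-provisioner")].args}'
# No --worker-threads means the external-provisioner default (parallel
# workers); --worker-threads=1 is the serializing workaround from the
# original BZ.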

How reproducible:
always

Steps to Reproduce:
0. Run external-provisioner *without* the --worker-threads argument.
1. Create 10-20 PVCs that are dynamically provisioned from a single EFS volume.
2. (Possibly not needed) Run Pods that use these PVCs, then delete the Pods.
3. Delete the PVCs (see the sketch below).
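
A minimal reproduction sketch of steps 1-3, assuming a StorageClass named efs-sc backed by the shared EFS filesystem (names and sizes are illustrative):

# Step 1: create 20 dynamically provisioned PVCs.
for i in $(seq 1 20); do
cat <<EOF | oc apply -f -
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: efs-pvc-$i
spec:
  accessModes:
  - ReadWriteMany
  storageClassName: efs-sc
  resources:
    requests:
      storage: 1Gi
EOF
done
# Step 3: delete them all at once and watch the PVs linger.
for i in $(seq 1 20); do oc delete pvc efs-pvc-$i --wait=false; done
oc get pv -w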

Actual results:
PVs stay in the cluster for a long time (10-30 minutes).
"oc describe pv" complains:

  Warning  VolumeFailedDelete  22s   efs.csi.aws.com_ip-10-0-174-71_88638ffc-4776-416a-bcb6-c0b50706ae01  rpc error: code = Internal desc = Could not get describe Access Point: fsap-0d5652238d563ab73 , error: Describe Access Point failed: ThrottlingException: 
           status code: 400, request id: 72fe83c8-be0b-40ba-bcb3-5019e9a14110
  Warning  VolumeFailedDelete  8s (x3 over 40s)  efs.csi.aws.com_ip-10-0-174-71_88638ffc-4776-416a-bcb6-c0b50706ae01  rpc error: code = DeadlineExceeded desc = context deadline exceeded
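
The ThrottlingException above comes from the EFS DescribeAccessPoints API rate limit: each dynamically provisioned PV appears to be backed by one access point, and the driver issues a Describe call per delete. A hedged way to watch the access points on the shared filesystem (the filesystem ID is a placeholder):

# Placeholder filesystem ID; requires AWS credentials with EFS read access.
aws efs describe-access-points --file-system-id fs-0123456789abcdef0 \
  --query 'length(AccessPoints)'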

Expected results:
PVs are deleted within a few minutes at most.

Comment 1 Wei Duan 2021-12-15 12:13:45 UTC
The original BZ is https://bugzilla.redhat.com/show_bug.cgi?id=2029521

Comment 2 Tomas Smetana 2022-01-25 14:17:34 UTC
I believe the aws-efs-utils rebase (https://github.com/openshift/aws-efs-utils/pull/6) should fix this issue. Moving manually to MODIFIED.

Comment 3 Tomas Smetana 2022-01-26 12:51:19 UTC
Sounds like the rebase didn't help in the end: the issue is still reproducible. Moving back to ASSIGNED.

Comment 6 Rohit Patil 2022-06-07 08:02:08 UTC
Same observation on an OSD cluster:


Payload loaded version="4.10.15" image="quay.io/openshift-release-dev/ocp-release@sha256:ddcb70ce04a01ce487c0f4ad769e9e36a10c8c832a34307c1b1eb8e03a5b7ddb"
EFS Version: 4.10.0-202205120735


./clean.sh 
Tue Jun  7 13:24:41 IST 2022
pod "mypod-1" deleted
pod "mypod-10" deleted
pod "mypod-100" deleted
pod "mypod-101" deleted
pod "mypod-102" deleted
pod "mypod-103" deleted
pod "mypod-104" deleted
pod "mypod-105" deleted
pod "mypod-106" deleted


oc get pod -n testropatil                  
NAME        READY   STATUS        RESTARTS   AGE
mypod-100   0/1     Terminating   0          40m
mypod-102   0/1     Terminating   0          40m
mypod-103   0/1     Terminating   0          40m
mypod-105   0/1     Terminating   0          40m
mypod-106   0/1     Terminating   0          40m
mypod-107   0/1     Terminating   0          40m
mypod-108   0/1     Terminating   0          40m
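
A few hedged checks for the stuck state shown above (the namespace comes from this comment; the rest is generic):

oc get pods -n testropatil          # pods stuck in Terminating
oc get pv | grep Released           # PVs the driver has not cleaned up yet
oc get events -A --field-selector reason=VolumeFailedDelete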

Comment 7 Tomas Smetana 2022-08-26 09:40:23 UTC
I believe this is related to https://github.com/kubernetes-sigs/aws-efs-csi-driver/issues/695, which doesn't seem to attract much attention upstream.

Comment 9 Tomas Smetana 2023-01-23 12:01:37 UTC
Moved to Jira: https://issues.redhat.com/browse/OCPBUGS-6491