Note: This bug is displayed in read-only format because the product is no longer active in Red Hat Bugzilla.

Bug 2031083

Summary: EFS CSI driver cannot delete volumes under load
Product: OpenShift Container Platform
Reporter: Jan Safranek <jsafrane>
Component: Storage
Assignee: Tomas Smetana <tsmetana>
Storage sub component: Kubernetes External Components
QA Contact: Rohit Patil <ropatil>
Status: CLOSED DEFERRED
Severity: high
Priority: high
CC: aos-bugs, gcharot, lwan, ropatil, vfarias, wduan
Version: 4.10
Keywords: Reopened
Last Closed: 2023-01-23 12:01:37 UTC

Description Jan Safranek 2021-12-10 13:26:35 UTC
This bug was initially created as a copy of Bug #2029521. We fixed the original BZ with a workaround. This BZ is to fix the CSI driver properly: enable parallel volume deletion and expect it to delete volumes in a reasonable time (as the upstream CSI driver does).
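
For reference, the parallelism in question is controlled by the external-provisioner sidecar's --worker-threads flag. A minimal sketch for checking it on a live cluster, assuming the operator-managed controller runs as aws-efs-csi-driver-controller in openshift-cluster-csi-drivers (those names are my assumption, not taken from this BZ):

# Hedged: deployment/namespace names are assumptions; the operator may
# manage these args itself and revert manual edits.
oc -n openshift-cluster-csi-drivers get deployment aws-efs-csi-driver-controller \
  -o jsonpath='{.spec.template.spec.containers[?(@.name=="csi-provisioner")].args}'
# No --worker-threads means the external-provisioner default (parallel
# workers); --worker-threads=1 is the serializing workaround from the
# original BZ.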

How reproducible:
always

Steps to Reproduce:
0. Run external-provisioner *without* the --worker-threads argument.
1. Create 10-20 PVCs that are dynamically provisioned from a single EFS volume.
2. (Possibly not needed) Run Pods that use these PVCs, then delete the Pods.
3. Delete the PVCs (see the sketch below).
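
A minimal reproduction sketch of steps 1-3, assuming a StorageClass named efs-sc backed by the shared EFS filesystem (names and sizes are illustrative):

# Step 1: create 20 dynamically provisioned PVCs.
for i in $(seq 1 20); do
cat <<EOF | oc apply -f -
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: efs-pvc-$i
spec:
  accessModes:
  - ReadWriteMany
  storageClassName: efs-sc
  resources:
    requests:
      storage: 1Gi
EOF
done
# Step 3: delete them all at once and watch the PVs linger.
for i in $(seq 1 20); do oc delete pvc efs-pvc-$i --wait=false; done
oc get pv -w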

Actual results:
PVs stay in the cluster for a long time (10-30 minutes).
"oc describe pv" complains:

  Warning  VolumeFailedDelete  22s   efs.csi.aws.com_ip-10-0-174-71_88638ffc-4776-416a-bcb6-c0b50706ae01  rpc error: code = Internal desc = Could not get describe Access Point: fsap-0d5652238d563ab73 , error: Describe Access Point failed: ThrottlingException: 
           status code: 400, request id: 72fe83c8-be0b-40ba-bcb3-5019e9a14110
  Warning  VolumeFailedDelete  8s (x3 over 40s)  efs.csi.aws.com_ip-10-0-174-71_88638ffc-4776-416a-bcb6-c0b50706ae01  rpc error: code = DeadlineExceeded desc = context deadline exceeded
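
The ThrottlingException above comes from the EFS DescribeAccessPoints API rate limit: each dynamically provisioned PV appears to be backed by one access point, and the driver issues a Describe call per delete. A hedged way to watch the access points on the shared filesystem (the filesystem ID is a placeholder):

# Placeholder filesystem ID; requires AWS credentials with EFS read access.
aws efs describe-access-points --file-system-id fs-0123456789abcdef0 \
  --query 'length(AccessPoints)'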

Expected results:
PVs are deleted within a few minutes at most.

Comment 1 Wei Duan 2021-12-15 12:13:45 UTC
The original BZ is https://bugzilla.redhat.com/show_bug.cgi?id=2029521

Comment 2 Tomas Smetana 2022-01-25 14:17:34 UTC
I believe the aws-efs-utils rebase (https://github.com/openshift/aws-efs-utils/pull/6) should fix this issue. Moving manually to MODIFIED.

Comment 3 Tomas Smetana 2022-01-26 12:51:19 UTC
Sounds like the rebase didn't help in the end: the issue is still reproducible. Moving back to ASSIGNED.

Comment 6 Rohit Patil 2022-06-07 08:02:08 UTC
Same observation on an OSD cluster:


Payload loaded version="4.10.15" image="quay.io/openshift-release-dev/ocp-release@sha256:ddcb70ce04a01ce487c0f4ad769e9e36a10c8c832a34307c1b1eb8e03a5b7ddb"
EFS Version: 4.10.0-202205120735


./clean.sh 
Tue Jun  7 13:24:41 IST 2022
pod "mypod-1" deleted
pod "mypod-10" deleted
pod "mypod-100" deleted
pod "mypod-101" deleted
pod "mypod-102" deleted
pod "mypod-103" deleted
pod "mypod-104" deleted
pod "mypod-105" deleted
pod "mypod-106" deleted


oc get pod -n testropatil                  
NAME        READY   STATUS        RESTARTS   AGE
mypod-100   0/1     Terminating   0          40m
mypod-102   0/1     Terminating   0          40m
mypod-103   0/1     Terminating   0          40m
mypod-105   0/1     Terminating   0          40m
mypod-106   0/1     Terminating   0          40m
mypod-107   0/1     Terminating   0          40m
mypod-108   0/1     Terminating   0          40m
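
A few hedged checks for the stuck state shown above (the namespace comes from this comment; the rest is generic):

oc get pods -n testropatil          # pods stuck in Terminating
oc get pv | grep Released           # PVs the driver has not cleaned up yet
oc get events -A --field-selector reason=VolumeFailedDelete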

Comment 7 Tomas Smetana 2022-08-26 09:40:23 UTC
I believe this is related to https://github.com/kubernetes-sigs/aws-efs-csi-driver/issues/695, which doesn't seem to attract much attention upstream.

Comment 9 Tomas Smetana 2023-01-23 12:01:37 UTC
Moved to Jira: https://issues.redhat.com/browse/OCPBUGS-6491