Note: This bug is displayed in read-only format because the product is no longer active in Red Hat Bugzilla.

Bug 2029521

Summary: EFS CSI driver cannot delete volumes under load
Product: OpenShift Container Platform
Component: Storage
Sub component: Operators
Version: 4.10
Target Release: 4.10.0
Reporter: Jan Safranek <jsafrane>
Assignee: Jan Safranek <jsafrane>
QA Contact: Rohit Patil <ropatil>
Status: CLOSED ERRATA
Severity: high
Priority: high
CC: aos-bugs, gcharot, ropatil, sahana, vfarias
Flags: jsafrane: needinfo-
Hardware: Unspecified
OS: Unspecified
Type: Bug
Bug Blocks: 2056638 (view as bug list)
Last Closed: 2022-03-10 16:32:18 UTC

Description Jan Safranek 2021-12-06 17:01:55 UTC
Description of problem:
When the AWS EFS CSI driver needs to delete a higher number of volumes from a single node, it sometimes gets stuck and reports "rpc error: code = DeadlineExceeded desc = context deadline exceeded" for ~30 minutes.

Version-Release number of selected component (if applicable):
4.9

How reproducible:
always

Steps to Reproduce:
1. Create 10-20 PVCs that are dynamically provisioned from a single EFS volume.
2. Not sure if this is needed: run Pods with these PVCs, then delete the Pods.
3. Delete the PVCs.
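The steps above can be sketched as a small script. The StorageClass name "efs-sc" and the use of `oc` to apply the manifests are assumptions for illustration; they are not taken from the bug report.

```shell
#!/usr/bin/env bash
# Hypothetical reproducer for steps 1 and 3. Assumes a StorageClass named
# "efs-sc" backed by the EFS CSI driver (the name is illustrative).

gen_pvc() {
  # Print a PVC manifest for the given claim name on stdout.
  cat <<EOF
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: $1
spec:
  accessModes: ["ReadWriteMany"]
  storageClassName: efs-sc
  resources:
    requests:
      storage: 1Gi
EOF
}

# Step 1: create 20 PVCs (uncomment the pipe on a real cluster).
for i in $(seq 1 20); do
  gen_pvc "efs-pvc-$i"            # | oc apply -f -
done

# Step 3: delete them all at once to trigger parallel DeleteVolume calls.
# oc delete pvc $(seq -f "efs-pvc-%g" 1 20)
```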

Actual results:
PVs stay in the cluster for a long time (10-30 minutes).
"oc describe pv" complains:

  Warning  VolumeFailedDelete  22s   efs.csi.aws.com_ip-10-0-174-71_88638ffc-4776-416a-bcb6-c0b50706ae01  rpc error: code = Internal desc = Could not get describe Access Point: fsap-0d5652238d563ab73 , error: Describe Access Point failed: ThrottlingException: 
           status code: 400, request id: 72fe83c8-be0b-40ba-bcb3-5019e9a14110
  Warning  VolumeFailedDelete  8s (x3 over 40s)  efs.csi.aws.com_ip-10-0-174-71_88638ffc-4776-416a-bcb6-c0b50706ae01  rpc error: code = DeadlineExceeded desc = context deadline exceeded

Expected results:
PVs are deleted within a few minutes at most.

Comment 1 Jan Safranek 2021-12-06 17:03:20 UTC
This seems to be related to our CSI driver image. I can't reproduce it with the upstream CSI driver. I checked that we ship the same efs-utils as upstream; it could be the stunnel version (4.x upstream, 5.x in OCP).

Comment 2 Jan Safranek 2021-12-07 16:58:52 UTC
My findings so far:

- Upstream CSI driver image works (amazon/aws-efs-csi-driver:v1.3.2)
  - It's Amazon Linux based (roughly CentOS), uses the same efs-utils as we do, and has a quite old stunnel (4.56)

- OpenShift CSI driver image does not work
  - RHEL8 based, stunnel 5.56
  - It uses the same /etc/amazon/efs/efs-utils.conf as the upstream image.

- Image based on CentOS:7 works well
  - But I had to disable TLS host checking; stunnel 4.56 does not support it (https://docs.aws.amazon.com/efs/latest/ug/upgrading-stunnel.html)

Everything was on the same OCP 4.10.nightly cluster: same StorageClass, same PVCs, same CSI driver sources. The only real difference is the driver base image!

And after some time (a few hours?), even the OCP image starts working, and I need to rebuild the cluster to debug further.

My current hypothesis is that there is something odd in stunnel, but it's hard to debug. I tried to use stunnel from CentOS 7 in an RHEL8 image, and the other way around; they do not compile, since stunnel needs libs that are either too old or too new. Stunnel does not log much, and NFS does not log anything at all.

Comment 3 Vagner Farias 2021-12-07 17:46:43 UTC
Hi Jan,

I'm just spitballing here, but with regard to this message:

```
  Warning  VolumeFailedDelete  22s   efs.csi.aws.com_ip-10-0-174-71_88638ffc-4776-416a-bcb6-c0b50706ae01  rpc error: code = Internal desc = Could not get describe Access Point: fsap-0d5652238d563ab73 , error: Describe Access Point failed: ThrottlingException: 
```

"ThrottlingException" grabbed my attention. Isn't this being returned by the EFS API? The EFS documentation[1] mentions the following:

```
If an API request exceeds the API request rate for its category, the request returns the ThrottlingException error code. To prevent this error, ensure that your application doesn't retry API requests at a high rate. You can do this by using care when polling and by using exponential backoff retries. 
```

This won't explain why it works with AWS' image and not with Red Hat's, unless maybe you're trying first with the Red Hat image and the operator (or something else) issues a lot of API calls, and by the time you test again with the AWS image the throttling is over. This could also explain why the Red Hat image starts working after some time.

I couldn't quickly find the API request limits for EFS.

[1] https://docs.aws.amazon.com/efs/latest/ug/api-reference.html

Comment 4 Jan Safranek 2021-12-08 11:16:46 UTC
We have exponential backoff after errors, so individual retries may start a bit aggressively, but they get further apart pretty quickly. ThrottlingException is expected when ~50 volumes should be deleted at the same time; I don't see it with 20 volumes. And IMO the API quota on the AWS side recovers relatively quickly (within a few minutes?).
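The exponential backoff mentioned here can be sketched in shell. The initial delay and cap below are illustrative numbers, not the provisioner's actual parameters:

```shell
# Sketch of exponential backoff: each failed retry waits twice as long as
# the previous one, capped at a maximum. Parameters are illustrative only.
backoff_delays() {
  # backoff_delays <initial_s> <cap_s> <retries> -> one delay per line
  local d=$1 cap=$2 n=$3 i=0
  while [ "$i" -lt "$n" ]; do
    echo "$d"
    d=$((d * 2))                   # double the wait after every failure
    [ "$d" -gt "$cap" ] && d=$cap  # but never wait longer than the cap
    i=$((i + 1))
  done
}

backoff_delays 1 300 10   # prints 1 2 4 8 16 32 64 128 256 300
```

Retries start one second apart but quickly reach the cap, which is why the AWS-side quota should recover even under a burst of deletions.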

It does not explain why *mount* gets stuck.

Comment 5 Jan Safranek 2021-12-08 12:11:06 UTC
I can reproduce the issue with vanilla CentOS 8. I created a number of access points in AWS for a single EFS volume (just like the driver would). Then this *fails*:

$ for i in <list of AccessPoint IDS>; do mkdir -p /mnt/$i; mount -t efs -o tls,accesspoint=$i <EFS id> /mnt/$i & done

With 5 access points, a few of them get stuck at /sbin/mount.nfs4 127.0.0.1:/ /mnt/fsap-xyz -o rw,nfsvers=4.1,rsize=1048576,wsize=1048576,hard,timeo=600,retrans=2,noresvport,port=20257
Unmount the volumes and clear the caches via "rm -rf /var/run/efs/*" to be able to reproduce the issue again (the mount tends to work when called a second time).

Notice the `&` in the for loop; it mounts the volumes in parallel. When run sequentially, everything works without issues.

I also verified that it's a stunnel problem: I compiled stunnel 4.56 (from CentOS 7), installed it on CentOS 8, and it works well there.

Comment 6 Jan Safranek 2021-12-09 18:46:49 UTC
Rephrasing the issue for people not familiar with AWS.

Amazon Elastic File System (AWS EFS) is NFS tunneled through stunnel. Don't worry about "access points" and the other stuff above; they just provide a better user experience. In the end it's really NFS through stunnel.
There is a specific mount.efs utility that allocates a random port, say 20257, starts a stunnel process that listens on 127.0.0.1:20257, and calls the regular "mount -t nfs4 localhost:/ -o port=20257 /mnt/whatever". The NFS client mounts 127.0.0.1:20257, where the stunnel listens; stunnel connects to a TLS endpoint somewhere in the cloud and proxies all NFS traffic through the tunnel.
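The flow just described can be sketched as a dry-run helper that prints the two pieces mount.efs glues together. The options are condensed for illustration; this is not the real mount.efs logic:

```shell
# Dry-run sketch of the mount.efs flow: print, instead of execute, the
# stunnel listener and the NFS mount it would set up.
efs_mount_plan() {
  # efs_mount_plan <efs-dns-name> <mountpoint> <local-port>
  local efs=$1 mnt=$2 port=$3
  # 1. stunnel: listen on a random localhost port, proxy TLS to EFS:2049
  echo "stunnel: accept=127.0.0.1:$port connect=$efs:2049"
  # 2. plain NFSv4.1 mount against the local end of the tunnel
  echo "mount -t nfs4 -o rw,nfsvers=4.1,hard,noresvport,port=$port 127.0.0.1:/ $mnt"
}

efs_mount_plan fs-09d5d2a70d200b5bd.efs.us-east-1.amazonaws.com /mnt/vol1 20257
```

Ten parallel mounts mean ten such pairs, each with its own local port, which is exactly what the for loop in comment 5 produces.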

The problem appears when OpenShift tries to mount multiple such volumes at the same time, say 10: 10 stunnels are created in parallel (each with a different port) and 10 "mount -t nfs4 ..." commands are called in parallel. With stunnel from RHEL7 (4.56), it works without any issue. With stunnel from RHEL8, some of the mounts get stuck and the NFS client needs to retry after 60 seconds. The kernel and NFS client are the same! Only the stunnel version is different.

Example stunnel config:

fips = no
foreground = yes
socket = l:SO_REUSEADDR=yes
socket = a:SO_BINDTODEVICE=lo
[efs]
client = yes
accept = 127.0.0.1:20221
connect = fs-09d5d2a70d200b5bd.efs.us-east-1.amazonaws.com:2049
sslVersion = TLSv1.2
renegotiation = no
TIMEOUTbusy = 20
TIMEOUTclose = 0
TIMEOUTidle = 70
delay = yes
verify = 2
CAfile = /etc/amazon/efs/efs-utils.crt
cert = /var/run/efs/fs-09d5d2a70d200b5bd.var.lib.csi.pv.fsap-01d2f9903db60b3d7.20221+/certificate.pem
key = /etc/amazon/efs/privateKey.pem
checkHost = fs-09d5d2a70d200b5bd.efs.us-east-1.amazonaws.com
 
Do you have any explanation for this behavior? For example, does stunnel have some "global lock file" that could make one stunnel process wait for another?
I can reproduce the issue quite reliably; I can provide you with a machine in Amazon and a script to reproduce the issue.

I tried to run stunnel with strace, but then I can't reproduce the issue; strace changes the timing of something :-(. I'll try to provide stunnel logs tomorrow.

Comment 10 Jan Safranek 2021-12-10 12:24:01 UTC
As a workaround, I am removing parallelism from the EFS CSI driver's external-provisioner: only one volume can be provisioned or deleted at a time. This slows down dynamic provisioning to 1 PV every ~1 second and deletion to 1 PV every ~1-3 seconds (when the volume is almost empty). The actual deletion speed depends heavily on the amount of data on the volume; the driver does an rm -rf of it during deletion, and with millions of files on the volume that can easily take minutes.
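For context, the upstream external-provisioner sidecar exposes a --worker-threads flag that controls this parallelism. Whether the OCP workaround was wired exactly this way is my assumption, and the container layout below is a hypothetical fragment, not taken from the bug:

```yaml
# Hypothetical fragment of a CSI controller Deployment: the csi-provisioner
# sidecar with parallelism reduced to a single worker. --worker-threads is a
# real upstream external-provisioner flag; the rest of this layout is assumed.
containers:
- name: csi-provisioner
  image: registry.example.com/csi-external-provisioner:v3.0
  args:
  - --csi-address=/var/lib/csi/sockets/pluginproxy/csi.sock
  - --worker-threads=1   # serialize CreateVolume/DeleteVolume calls
```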

Comment 11 Jan Safranek 2021-12-10 16:29:37 UTC
I created https://bugzilla.redhat.com/show_bug.cgi?id=2031083 to track a real fix instead of this workaround.

Comment 16 errata-xmlrpc 2022-03-10 16:32:18 UTC
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.10.3 security update), and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:0056