Bug 1537236 - [AWS]PV clean-up failed: Error deleting EBS volume
Summary: [AWS]PV clean-up failed: Error deleting EBS volume
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Storage
Version: 3.9.0
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: medium
Target Milestone: ---
Target Release: 3.10.0
Assignee: Tomas Smetana
QA Contact: Chao Yang
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2018-01-22 17:56 UTC by Hongkai Liu
Modified: 2018-07-30 19:09 UTC
CC List: 6 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Cause: When the API call quota in the AWS cloud was reached, certain AWS API calls returned errors. These errors were not handled correctly while detaching AWS persistent volumes. Consequence: Some AWS volumes remained attached to nodes even though no pod was using them, and they had to be detached manually or they stayed stuck forever. Fix: The AWS API call error handling was fixed in the code for dynamic volume detaching. Result: Even when the AWS API call quota is reached, the attach/detach controller retries detaching the volume until it succeeds, making sure that volumes which should be detached really are detached.
Clone Of:
Environment:
Last Closed: 2018-07-30 19:09:00 UTC
Target Upstream Version:
Embargoed:




Links
Red Hat Product Errata RHBA-2018:1816 (Last Updated: 2018-07-30 19:09:34 UTC)

Description Hongkai Liu 2018-01-22 17:56:58 UTC
Description of problem:
Deletion of some PVs failed when deleting test projects. On the EC2 web console, the volumes are still attached to the instance.

Version-Release number of selected component (if applicable):
root@ip-172-31-0-221: ~ # yum list installed | grep openshift
atomic-openshift.x86_64         3.9.0-0.22.0.git.0.d4658fb.el7

How reproducible:
Very likely. Encountered at least 3 times in a 3.9 cluster.

Steps to Reproduce:
1. Enable scheduling on only one compute node in the cluster. This step is not strictly necessary; it just makes the problem easier to show.
2. Create 39 projects, each with a PVC dynamically created and attached to a pod, then label those projects with purpose=test (oc label namespace <project_name> purpose=test). Wait until all the pods are running.
3. Delete all test projects: oc delete project -l purpose=test. Watch the results of "oc get pv --no-headers"; some PVs are going to stay there forever (in the test, I usually waited for 1 hour).

Actual results:
We can redo steps 2 and 3 and will see more and more PVs accumulate there.

Expected results:
The PVs are deleted and the volumes are detached from the instance.

Master Log:

Node Log (of failed PODs):

PV Dump:

PVC Dump:

StorageClass Dump (if StorageClass used by PV/PVC):

Additional info:
The master log and the node log will be provided.
This is the description of one of the PVs.

root@ip-172-31-0-221: ~ # oc describe pv 
Name:            pvc-6119afd9-ff91-11e7-ba2c-024c74e10c04
Labels:          failure-domain.beta.kubernetes.io/region=us-west-2
                 failure-domain.beta.kubernetes.io/zone=us-west-2b
Annotations:     kubernetes.io/createdby=aws-ebs-dynamic-provisioner
                 pv.kubernetes.io/bound-by-controller=yes
                 pv.kubernetes.io/provisioned-by=kubernetes.io/aws-ebs
StorageClass:    gp2
Status:          Failed
Claim:           fiotestia8/pvc4hdjm8m76c
Reclaim Policy:  Delete
Access Modes:    RWO
Capacity:        1Gi
Message:         Error deleting EBS volume "vol-053e38dfab9143549" since volume is currently attached to "i-0f2140c2b4b59222e"
Source:
    Type:       AWSElasticBlockStore (a Persistent Disk resource in AWS)
    VolumeID:   aws://us-west-2b/vol-053e38dfab9143549
    FSType:     ext4
    Partition:  0
    ReadOnly:   false
Events:
  Type     Reason              Age   From                         Message
  ----     ------              ----  ----                         -------
  Warning  VolumeFailedDelete  39m   persistentvolume-controller  Error deleting EBS volume "vol-053e38dfab9143549" since volume is currently attached to "i-0f2140c2b4b59222e"

Comment 2 Tomas Smetana 2018-01-23 15:09:14 UTC
The master log contains these warnings:

Got RequestLimitExceeded error on AWS request (ec2::DescribeInstances)

Meaning we hit the AWS API call quota...
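
For reference, a minimal Go sketch (not the cloudprovider code itself) of how this class of failure surfaces from aws-sdk-go; the "RequestLimitExceeded" code checked here is the same one seen in the warning above:

package main

import (
        "fmt"

        "github.com/aws/aws-sdk-go/aws/awserr"
)

// isThrottled reports whether err is an AWS API quota error.
func isThrottled(err error) bool {
        if aerr, ok := err.(awserr.Error); ok {
                return aerr.Code() == "RequestLimitExceeded"
        }
        return false
}

func main() {
        // Simulated error, standing in for what ec2.DescribeInstances returns
        // when the per-account API call quota is exhausted.
        err := awserr.New("RequestLimitExceeded", "Request limit exceeded.", nil)
        fmt.Println(isThrottled(err)) // true
}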

Comment 3 Hongkai Liu 2018-01-23 18:38:19 UTC
(In reply to Tomas Smetana from comment #2)
> The master log contains these warnings:
> 
> Got RequestLimitExceeded error on AWS request (ec2::DescribeInstances)
> 
> Meaning we hit the AWS API call quota...

From a user's point of view, what is the impact on the cluster?

Comment 4 Tomas Smetana 2018-01-24 09:31:52 UTC
I'll try to reproduce the problem to see whether this is simply a case of an overloaded system and why exactly the volumes were not detached: i.e., whether OpenShift tried to detach them but could not because of the API quota, or whether the API call storm triggered some bug.

Comment 5 Jan Safranek 2018-01-24 17:14:50 UTC
Did you try to wait "long enough"? The detacher periodically tries to detach the volume and the deleter periodically tries to delete it too. It should cure itself in a few minutes. Or dozens of minutes...

Comment 6 Jan Safranek 2018-01-24 18:12:22 UTC
Oh, I see, waited for 1 hour. That should be enough. Sorry for the noise.

Comment 7 Hongkai Liu 2018-01-24 18:16:21 UTC
(In reply to Jan Safranek from comment #6)
> Oh, I see, waited for 1 hour. That should be enough. Sorry for the noise.

Yes, I did. Actually, even longer.

I repeated steps 2 and 3 several times.
It lasted over 3 hours.
Even though the most recently created PVs were deleted, the first leftover was still there.

Comment 8 Tomas Smetana 2018-01-30 11:05:57 UTC
I've tried to reproduce this several times but failed (cleanup of the PVs took quite long -- ~20 minutes -- but eventually they were all detached). I also failed to get the "RequestLimitExceeded" warning, so I guess those two are related.

If anybody ever happens to encounter something similar: could you please also capture the controllers log? It might reveal more...

Comment 9 Hongkai Liu 2018-01-31 02:01:33 UTC
I am in a training this week.
Will try to reproduce this again early next week with controllers log.

Comment 12 Hongkai Liu 2018-02-06 14:22:04 UTC
@Tomas, thanks for the reply.

Good to know that we have a potential cause for this bz.

Let me know if anything needs to be done on my side.

Comment 14 Tomas Smetana 2018-02-07 13:56:33 UTC
What I think is going on here (I'll try to describe it so someone can correct me in case I missed something):

Once the volume is unmounted, the controller starts detaching it:
 -> OperationExecutor.DetachVolume()
    -> OperationGenerator.GenerateDetachVolumeFunc()
       (gets the AWS plugin, calls NewDetacher() to obtain AWS detacher)
       -> detacher.Detach() // this is the AWS plugin detacher
          -> aws.Volumes.DetachDisk() // calling cloudprovider
             -> Cloud.checkIfAttachedToNode()
                -> awsDisk.describeVolume()
                   (fails with RequestLimitExceeded on ec2::DescribeVolumes and
                    things start breaking here)

The checkIfAttachedToNode() method returns three values: the disk information obtained by describeVolume, whether the disk is attached to the specified node, and a possible error. In case the describeVolume call returns an error, it does this:

if err != nil {
        describeError := fmt.Errorf("Error describing volume %s with %v", diskName, err)
        glog.Warning(describeError)
        awsDiskInfo.volumeState = "unknown"
        return awsDiskInfo, false, describeError
}

So it sets diskInfo.volumeState to "unknown" but leaves the rest of diskInfo empty, and tells the caller the volume is not attached to the node... The caller (the AWS cloudprovider's DetachDisk) finds that there is no instance information in diskInfo and that the volume is not attached to any instance, so it just returns immediately with no error.

This looks like a successful detach in the AWS plugin, hence the "DetachVolume.Detach succeeded". The volume is marked as detached in the controller's ActualStateOfWorld (it will never try to detach it again) and the PV controller starts deleting it. This fails since the volume is still attached to the instance: DetachDisk silently returned without instructing AWS to really detach the volume.

The PV controller calls the AWS cloudprovider's DeleteDisk, which again tries to call describeInstance and hits the quota from time to time. Nevertheless, it also checks whether the volume can be deleted and of course finds out it can't... The PV controller's logic seems to be safe since it would re-try when hitting the API quota for describeVolume. We need to make sure the volume really gets detached, though.
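
To make the flow above concrete, here is a simplified Go sketch (toy types and stand-in functions, not the real cloudprovider code) of how a failed describeVolume turns into a silent "successful" detach:

package main

import (
        "errors"
        "fmt"
)

type diskInfo struct {
        volumeState string
        instanceID  string // empty when describeVolume failed
}

func checkIfAttachedToNode(describeErr error) (*diskInfo, bool, error) {
        if describeErr != nil {
                // Mirrors the snippet quoted above: volumeState is set to
                // "unknown", the rest stays empty, attached is reported false.
                return &diskInfo{volumeState: "unknown"}, false, describeErr
        }
        return &diskInfo{volumeState: "in-use", instanceID: "i-0f2140c2b4b59222e"}, true, nil
}

func detachDisk(describeErr error) error {
        info, attached, err := checkIfAttachedToNode(describeErr)
        if info == nil { // the fix in comment #15 changes this to: if err != nil
                return err
        }
        if !attached || info.instanceID == "" {
                // Looks like "nothing to do", so it is reported upward as a
                // successful detach even though the EC2 call failed.
                return nil
        }
        // ... the real code would issue the EC2 DetachVolume call here ...
        return nil
}

func main() {
        err := detachDisk(errors.New("RequestLimitExceeded: Request limit exceeded."))
        fmt.Println(err) // <nil>: a silent "success" although nothing was detached
}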

Comment 15 Tomas Smetana 2018-02-07 14:08:17 UTC
I forgot to add: I think this should be enough to fix the problem:

--- a/vendor/k8s.io/kubernetes/pkg/cloudprovider/providers/aws/aws.go
+++ b/vendor/k8s.io/kubernetes/pkg/cloudprovider/providers/aws/aws.go
@@ -2022,7 +2022,7 @@ func (c *Cloud) AttachDisk(diskName KubernetesVolumeID, nodeName types.NodeName,
 // DetachDisk implements Volumes.DetachDisk
 func (c *Cloud) DetachDisk(diskName KubernetesVolumeID, nodeName types.NodeName) (string, error) {
        diskInfo, attached, err := c.checkIfAttachedToNode(diskName, nodeName)
-       if diskInfo == nil {
+       if err != nil {
                return "", err
        }

Comment 16 Tomas Smetana 2018-02-07 16:39:23 UTC
I have the problem reproduced by faking describeVolume failures. The fix will probably be slightly more complicated than the one in comment #15 since, as Hemant pointed out, I should check for error 404, in which case it is safe to assume the volume is detached and not try again.
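
For illustration only, a hypothetical Go sketch of that kind of error-code check (this is not the merged fix; comment #17 explains why the one-line patch was used instead), assuming the EC2 "InvalidVolume.NotFound" code means the volume is already gone:

package main

import (
        "fmt"

        "github.com/aws/aws-sdk-go/aws/awserr"
)

// classifyDetachError assumes "InvalidVolume.NotFound" means the volume no
// longer exists (safe to treat the detach as done); anything else, e.g.
// RequestLimitExceeded, must be surfaced so the controller retries.
func classifyDetachError(err error) string {
        aerr, ok := err.(awserr.Error)
        if !ok {
                return "retry"
        }
        switch aerr.Code() {
        case "InvalidVolume.NotFound":
                return "treat-as-detached"
        default:
                return "retry"
        }
}

func main() {
        fmt.Println(classifyDetachError(awserr.New("InvalidVolume.NotFound", "The volume does not exist.", nil)))
        fmt.Println(classifyDetachError(awserr.New("RequestLimitExceeded", "Request limit exceeded.", nil)))
}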

Comment 17 Tomas Smetana 2018-02-08 13:13:18 UTC
Upstream PR:
https://github.com/kubernetes/kubernetes/pull/59569

Note: I have used the one-line patch in the end anyway... The cloudprovider re-formats the awsError, so parsing out the error code outside of the describeVolume method is not possible; the method is also called from several places, so changing its behaviour is much more dangerous. The proposed change aims at this very specific issue.

Comment 18 Hongkai Liu 2018-02-08 21:46:55 UTC
@Tomas, Thanks for the PR.

The PR gives more chances to k8s to detach volumes, right?

Comment 19 Tomas Smetana 2018-02-09 08:13:56 UTC
(In reply to Hongkai Liu from comment #18)

> The PR gives more chances to k8s to detach volumes, right?

Yes. In case describeVolumes fails (because we hit the API call quota limit in our case), the DetachDisk method will propagate the error, so the attach/detach controller will not mark the volume as detached and will try to detach it again.
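
A toy Go sketch of those retry semantics (not the actual attach/detach controller): as long as DetachDisk propagates an error, the volume stays marked as attached and the detach is attempted again on the next pass:

package main

import (
        "errors"
        "fmt"
)

var throttledCallsLeft = 2 // pretend the first two EC2 calls hit the quota

// detachDisk stands in for the fixed cloudprovider call: it returns an
// error whenever the underlying describe/detach call fails.
func detachDisk(volumeID string) error {
        if throttledCallsLeft > 0 {
                throttledCallsLeft--
                return errors.New("RequestLimitExceeded")
        }
        return nil // EC2 accepted the DetachVolume request
}

func main() {
        attached := true
        for attempt := 1; attached; attempt++ {
                if err := detachDisk("vol-053e38dfab9143549"); err != nil {
                        fmt.Printf("attempt %d: %v, will retry\n", attempt, err)
                        continue // the volume stays marked as attached
                }
                attached = false
                fmt.Printf("attempt %d: detach succeeded\n", attempt)
        }
}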

I suppose this was actually quite a common issue on more heavily loaded clusters, and we may have solved one of the mysterious AWS problems. We'll see.

The next step would be to create a more "sophisticated" (and complex) fix that would differentiate among the errors and be robust even in weird situations, like someone managing to manually delete a volume while the detach was in progress. That is very unlikely to happen but can't be ruled out... However, such a fix would touch more parts, take more time, and might bring regressions.

Comment 20 Tomas Smetana 2018-02-09 08:30:30 UTC
https://github.com/openshift/origin/pull/18544

Comment 21 Hongkai Liu 2018-02-09 13:48:44 UTC
Cool. Thanks for the reply. @Tomas.

Comment 22 Hongkai Liu 2018-03-14 21:26:26 UTC
Tested with atomic-openshift.x86_64         3.9.7-1.git.0.e1a30c3.el7

root@ip-172-31-14-145: ~/go/src/github.com/openshift/ose # git tag --contains 15c1c88b6bef9a437fed960822f203f0620b9074
...
v3.9.7-1
...

root@ip-172-31-14-145: ~/go/src/github.com/openshift/ose # git rev-list -n 1  v3.9.7-1
e1a30c3f321a9c6809f53bd4467d6442e92522c4


I can no longer see any PVs left there. I tried about 10 rounds of creating/deleting; all PVs got cleaned up in my test.

Comment 24 errata-xmlrpc 2018-07-30 19:09:00 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2018:1816

