Bug 1327384

Summary: AWS EBS volumes remain in "in-use" status after deleting pods which used them
Product: OpenShift Container Platform
Reporter: Chao Yang <chaoyang>
Component: Storage
Assignee: Jan Safranek <jsafrane>
Status: CLOSED ERRATA
QA Contact: Jianwei Hou <jhou>
Severity: medium
Docs Contact:
Priority: medium
Version: 3.2.0
CC: aos-bugs, bchilds, chaoyang, eparis, mmcgrath, tdawson, wmeng
Target Milestone: ---
Target Release: ---
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version: openshift-3.3.0.19
Doc Type: Bug Fix
Doc Text:
Cause: A race condition in OpenShift code could cause an AWS EBS volume not to be detached from a node when a pod that used the volume was terminated. The volume would stay attached to the node forever.
Consequence: The volume remains attached, consuming AWS resources, and must be detached manually.
Fix: The code that attaches/detaches volumes to/from nodes has been rewritten in OpenShift 3.3.
Result: AWS EBS volumes are detached from nodes when the last pod that uses the volume is terminated.
Last Closed: 2016-09-27 09:38:05 UTC
Type: Bug
Attachments:
volume_in-use
testlog for node

Description Chao Yang 2016-04-15 03:30:44 UTC
Created attachment 1147437 [details]
volume_in-use

Description of problem:
volume still in-use after deleting pod which was using it

Version-Release number of selected component (if applicable):
openshift v3.2.0.15
kubernetes v1.2.0-36-g4a3f9c5
etcd 2.2.5

How reproducible:
20%

Steps to Reproduce:
1. Create 50 PVs, PVCs and pods; sleep 300; delete all pods; sleep 30; delete all PVCs and PVs.
2. [root@ip-172-18-5-20 ~]# oc get pods
NAME                      READY     STATUS        RESTARTS   AGE
docker-registry-1-tyl0n   1/1       Running       0          3m
mypod46                   0/1       Terminating   0          8m
[root@ip-172-18-5-20 ~]# oc get events | grep mypod46
8m        8m        1         mypod46   Pod                 Normal    Scheduled     {default-scheduler }                    Successfully assigned mypod46 to ip-172-18-5-21.ec2.internal
7m        4m        4         mypod46   Pod                 Warning   FailedMount   {kubelet ip-172-18-5-21.ec2.internal}   Unable to mount volumes for pod "mypod46_default(8e5f81f9-02ad-11e6-9a1d-0ea1b1d3635d)": Could not attach EBS Disk "aws://us-east-1d/vol-63433dc1": Too many EBS volumes attached to node ip-172-18-5-21.ec2.internal.
7m        4m        4         mypod46   Pod                 Warning   FailedSync    {kubelet ip-172-18-5-21.ec2.internal}   Error syncing pod, skipping: Could not attach EBS Disk "aws://us-east-1d/vol-63433dc1": Too many EBS volumes attached to node ip-172-18-5-21.ec2.internal.
14s       14s       1         mypod46   Pod                 Warning   FailedMount   {kubelet ip-172-18-5-21.ec2.internal}   Unable to mount volumes for pod "mypod46_default(8e5f81f9-02ad-11e6-9a1d-0ea1b1d3635d)": Could not attach EBS Disk "aws://us-east-1d/vol-63433dc1": Timeout waiting for volume state
14s       14s       1         mypod46   Pod                 Warning   FailedSync    {kubelet ip-172-18-5-21.ec2.internal}   Error syncing pod, skipping: Could not attach EBS Disk "aws://us-east-1d/vol-63433dc1": Timeout waiting for volume state
[root@ip-172-18-5-20 ~]# oc get pods
NAME                      READY     STATUS    RESTARTS   AGE
docker-registry-1-tyl0n   1/1       Running   0          4m

3. Check the volume status in the AWS web console.
This volume, vol-63433dc1, is still in-use 10 minutes after deleting the pod which used it.

[root@ip-172-18-5-20 ~]# ec2dvol vol-63433dc1
VOLUME	vol-63433dc1	1		us-east-1d	in-use	2016-04-15T01:54:05+0000	standard		Not Encrypted	
ATTACHMENT	vol-63433dc1	i-9c0d5107	/dev/xvdbj	attached	2016-04-15T02:04:15+0000	false

Actual results:
volume still in-use after deleting pod using it

Expected results:
volume should become available after deleting pod using it


Additional info:

Comment 1 Jan Safranek 2016-04-19 07:44:19 UTC
`sleep 30` is really too short here. In the logs I can see "Too many EBS volumes attached"; depending on the OpenShift build, you have either 39 or 11 EBS volumes attached to the node, and detaching them is slow. Be patient, openshift-node should clean them up eventually.

I'd say after 5 minutes it should be fine. It depends on the current AWS load and also on their rate limiter, which OpenShift tends to hit quite a lot.

Let me know if some volumes are still attached after 5 minutes, and please attach the openshift-node log with --v=5 from node start until the end. It will be a long one.

Comment 2 Chao Yang 2016-04-20 05:48:25 UTC
Created attachment 1148866 [details]
testlog for node

Comment 3 Chao Yang 2016-04-20 05:50:33 UTC
"Too many EBS volumes" is because I used 50 volumes.

Today I tried 39 volumes and `sleep 300`; I did this twice and found no volumes still in use after deleting all the pods, PVCs and PVs.

Then I tried 50 volumes and `sleep 300`; some volumes were still in-use, please see the attached log.

Comment 4 Jan Safranek 2016-04-21 15:52:12 UTC
I think I can reproduce it reliably by running openshift-master with the environment variable KUBE_MAX_PD_VOLS=41 - this makes the scheduler over-commit the node with too many volumes, and if I delete all the pods, some volumes remain attached.

I'll debug it from there.

Comment 5 Jan Safranek 2016-04-22 14:36:37 UTC
The culprit is attachDiskAndVerify in aws_util.go:
https://github.com/kubernetes/kubernetes/blob/master/pkg/volume/aws_ebs/aws_util.go#L170

This function calls AttachDisk (several times, if needed) and waits for its result. If it times out *and* the pod is deleted from the kubelet in the meantime, AWS still tries to attach the volume and eventually succeeds, while the kubelet has no record of it and leaves the volume attached.

The question is what we can do about it. When attachDiskAndVerify is about to time out, we could check whether the volume is still being attached (a single AWS call which can fail or time out) and "fire and forget" a DetachDisk (another AWS call that can fail or time out).

When the pod that owns the volume still exists, it will retry attaching the volume soon. If the pod is already deleted, DetachDisk _should_ eventually succeed in AWS.
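
To make the idea concrete, here is a rough Go sketch of the "attach and poll" pattern and the mitigation proposed above. The helper names (cloudAttach, cloudIsAttached, cloudDetach) and the timeout/interval values are hypothetical placeholders, not the actual aws_util.go code:

package sketch

import (
    "errors"
    "time"
)

// Stand-ins for the cloud provider calls; the real signatures differ.
func cloudAttach(volumeID, nodeName string) error             { return nil }
func cloudIsAttached(volumeID, nodeName string) (bool, error) { return false, nil }
func cloudDetach(volumeID, nodeName string) error             { return nil }

// attachAndVerify mimics "call AttachDisk, then poll until attached or timeout".
func attachAndVerify(volumeID, nodeName string) error {
    if err := cloudAttach(volumeID, nodeName); err != nil {
        return err
    }
    deadline := time.Now().Add(3 * time.Minute)
    for time.Now().Before(deadline) {
        attached, err := cloudIsAttached(volumeID, nodeName)
        if err == nil && attached {
            return nil
        }
        time.Sleep(10 * time.Second)
    }

    // Proposed mitigation: the attach may still complete in AWS after we give
    // up here. Check the current state (best effort) and fire-and-forget a
    // DetachDisk; both extra calls can themselves fail or time out, and
    // errors are deliberately ignored. If the pod still exists it will simply
    // retry the attach soon; if it is already gone, the detach should
    // eventually succeed and the volume is not left attached forever.
    if attached, err := cloudIsAttached(volumeID, nodeName); err != nil || attached {
        _ = cloudDetach(volumeID, nodeName)
    }
    return errors.New("timed out waiting for volume " + volumeID + " to attach")
}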

Sami, do you have an opinion? I am afraid this can't wait for the attach controller to be implemented.

Comment 6 Sami Wagiaalla 2016-04-25 14:27:59 UTC
> AWS still tries to attach the volume and eventually succeeds

Is this happening in kubernetes code or is it behind the AWS API? If it is in kube code, then we can try to stop the retry when the pod is deleted (see comments below); if it is in AWS, then we won't know the right time to call DetachDisk.

> Question is what we can do about it. When attachDiskAndVerify is about to
> time out, we could check if the volume is still being attached (single AWS
> call which can fail or time out) and "fire and forget" DetachDisk (another
> AWS call that can fail or time out).
>

That might work, but the question is: do we end up in that state because we keep retrying even though the pod is deleted, or because the attach takes too long in AWS but eventually succeeds?

> Sami, do you have an opinion?

I have a couple of thoughts:

- Why do we do the retry in volume plugin code? Why not just let the next sync call handle it? That way, if the pod is deleted, there are no retries.

- If the call is slow but eventually succeeds, don't we go on to create the global mount, which can give the kubelet a clue to detach the disk when cleaning up orphaned volumes?

Comment 7 Jan Safranek 2016-04-25 14:59:22 UTC
(In reply to Sami Wagiaalla from comment #6)
> > AWS still tries to attach the volume and eventually succeeds
> 
> Is this happening in kubernetes code or is it behind the AWS API ? If it is
> in kube code then we can try to stop the retry when the pod is deleted (see
> comments below) if it is in AWS then we wont know the right time to call
> DetachDisk

It's how the AWS (and GCE and Cinder) cloud providers work. It starts a REST call to the AWS API and then periodically checks for the expected result (such as the volume being attached).

> 
> > Question is what we can do about it. When attachDiskAndVerify is about to
> > time out, we could check if the volume is still being attached (single AWS
> > call which can fail or time out) and "fire and forget" DetachDisk (another
> > AWS call that can fail or time out).
> >
> 
> That might work but the question is: do we end up in that state because we
> keep retrying even though the pod is deleted or because the attach takes too
> long in AWS but eventually succeeds ?

We ended up in this state because we gave up waiting for the volume to get attached. We may have retried the operation while waiting for the result, but IMO that's irrelevant here. AWS eventually succeeds *after* Kubernetes has given up retrying and waiting.

> 
> > Sami, do you have an opinion?
> 
> I have a couple of thoughts:
> 
> - Why do we do the retry in volume plugin code ? why not just let the next
> sync call handle it ? That way if the pod is deleted there are no retries.

I do not follow here. The volume still gets attached if we simply call cloudprovider.AttachDisk, do not wait for the result, and someone deletes the pod in the meantime. Someone should detach it.


> - If the call is slow but eventually succeeds don't we go on to create the
> global mount which can give the kubelet a clue to detach the disk when
> cleaning up orphaned volumes ?

Interesting, this could actually work...

Comment 8 Jan Safranek 2016-04-26 11:29:58 UTC
(In reply to Jan Safranek from comment #7)
> (In reply to Sami Wagiaalla from comment #6)
> > - If the call is slow but eventually succeeds don't we go on to create the
> > global mount which can give the kubelet a clue to detach the disk when
> > cleaning up orphaned volumes ?
> 
> Interesting, this could actually work...

This would involve quite big changes in the AWS volume plugin. As it works now, SetUp() does:

1. attach the volume
2. mount it to /var/lib/kubelet/plugins/kubernetes.io/aws-ebs/mounts/<volume id>
3. bind-mount /var/lib/kubelet/plugins/kubernetes.io/aws-ebs/mounts/<volume id> to /var/lib/kubelet/pods/<pod UID>/volumes/kubernetes.io~aws-ebs/<volume number>

When the kubelet cleans up volumes, it looks only at /var/lib/kubelet/pods/<pod UID>/volumes/. The directory name there does not contain the AWS volume ID, therefore TearDown() looks at what is bind-mounted there and finds /var/lib/kubelet/plugins/kubernetes.io/aws-ebs/mounts/<volume id>.

So, if we simply create /var/lib/kubelet/pods/<pod UID>/volumes/<something>, the kubelet will try to TearDown the directory, but the AWS volume plugin is not able to find the AWS volume ID from that. The whole SetUp/TearDown logic would need to be changed, and that seems like overkill here.
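
For illustration, a minimal Go sketch of the SetUp flow and path layout described above. The helper names (attachVolume, mountDevice, bindMount) and the device path are hypothetical placeholders, not the real plugin code:

package sketch

import "path/filepath"

// Stand-ins for the real attach and mount operations.
func attachVolume(volumeID string) (device string, err error) { return "/dev/xvdbj", nil }
func mountDevice(device, target string) error                 { return nil }
func bindMount(source, target string) error                   { return nil }

// setUp mirrors the three steps listed above.
func setUp(volumeID, podUID, volumeName string) error {
    // 1. attach the volume
    device, err := attachVolume(volumeID)
    if err != nil {
        return err
    }

    // 2. mount it to the plugin's global mount point, keyed by volume ID
    globalPath := filepath.Join("/var/lib/kubelet/plugins/kubernetes.io/aws-ebs/mounts", volumeID)
    if err := mountDevice(device, globalPath); err != nil {
        return err
    }

    // 3. bind-mount the global mount point into the pod's volumes directory.
    // This path contains the pod UID and volume name but not the AWS volume
    // ID, which is why TearDown() has to inspect the bind mount to find the
    // global path, and why a bare directory here would not be enough for the
    // plugin to know which volume to detach.
    podPath := filepath.Join("/var/lib/kubelet/pods", podUID,
        "volumes/kubernetes.io~aws-ebs", volumeName)
    return bindMount(globalPath, podPath)
}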

Back to the drawing board...

Comment 9 Jan Safranek 2016-04-27 13:51:57 UTC
Experimental PR: https://github.com/kubernetes/kubernetes/pull/24861

Comment 10 Bradley Childs 2016-05-24 03:39:06 UTC
*** Bug 1337103 has been marked as a duplicate of this bug. ***

Comment 11 Jan Safranek 2016-06-14 13:34:49 UTC
So, if we want this fixed in 3.3, the new attach/detach controller should handle it (please retest carefully!).

For a fix in 3.2.x, we can use my PR from comment #9; however, it won't be upstream.

I'm marking this bug as 'POST' and waiting for the Origin rebase for 3.3.

Comment 12 Chao Yang 2016-08-16 03:36:25 UTC
This has passed on OCP
openshift v3.3.0.19
kubernetes v1.3.0+507d3a7
etcd 2.3.0+git

Created 50 PVs, PVCs and pods; slept 300; deleted all pods; slept 30; deleted all PVCs and PVs. All EBS volumes were then detached from the node and became available.

Comment 14 errata-xmlrpc 2016-09-27 09:38:05 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2016:1933