Bug 1327384
Summary: AWS EBS volumes remain in "in-use" status after deleting pods which used them
Product: OpenShift Container Platform
Reporter: Chao Yang <chaoyang>
Component: Storage
Assignee: Jan Safranek <jsafrane>
Status: CLOSED ERRATA
QA Contact: Jianwei Hou <jhou>
Severity: medium
Docs Contact:
Priority: medium
Version: 3.2.0
CC: aos-bugs, bchilds, chaoyang, eparis, mmcgrath, tdawson, wmeng
Target Milestone: ---
Target Release: ---
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version: openshift-3.3.0.19
Doc Type: Bug Fix
Doc Text:
  Cause: A race condition in OpenShift code could cause an AWS EBS volume not to be detached from a node when a pod that used the volume was terminated. The volume would stay attached to the node indefinitely.
  Consequence: The volume remains attached, consuming AWS resources, and must be detached manually.
  Fix: The code that attaches and detaches volumes to and from nodes was rewritten in OpenShift 3.3.
  Result: AWS EBS volumes are detached from nodes when the last pod that uses the volume is terminated.
Story Points: ---
Clone Of:
Environment:
Last Closed: 2016-09-27 09:38:05 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Attachments:
`sleep 30` is really too short here. In the logs I can see "Too many EBS volumes attached"; depending on the OpenShift build, you have either 39 or 11 EBS volumes attached to the node, and detaching them is slow. Be patient, openshift-node should clean them up eventually; I'd say after 5 minutes it should be fine. It depends on the current AWS load and also on their rate limiter, which OpenShift tends to hit quite a lot. Let me know if some volumes are still attached after 5 minutes, and please attach the openshift-node log with --v=5 from node start until the end. It will be a long one.

Created attachment 1148866 [details]
testlog for node
"Too many EBS volumes" is because I used 50 volumes. Today I tried 39 volumes and `sleep 300`, did this 2 times, and found no volumes in use after deleting all the pods, PVCs and PVs. Then I tried 50 volumes and `sleep 300`; some volumes are in-use, please see the attached log.

I think I can reproduce it reliably by running openshift-master with the environment variable KUBE_MAX_PD_VOLS=41 - this makes the scheduler over-commit the node with too many volumes, and if I delete all the pods, some volumes remain attached. I'll debug it from there.

The culprit is attachDiskAndVerify in aws_util.go:
https://github.com/kubernetes/kubernetes/blob/master/pkg/volume/aws_ebs/aws_util.go#L170

This function calls AttachDisk (several times, if needed) and waits for its result. If it times out *and* the pod is deleted from kubelet in the meantime, AWS still tries to attach the volume and eventually succeeds, while kubelet has no record of it and leaves the volume attached.

The question is what we can do about it. When attachDiskAndVerify is about to time out, we could check whether the volume is still being attached (a single AWS call, which can fail or time out) and "fire and forget" DetachDisk (another AWS call that can fail or time out). If the pod that owns the volume still exists, it will soon retry attaching the volume. If the pod is already deleted, DetachDisk _should_ eventually succeed in AWS. Sami, do you have an opinion? I am afraid this can't wait for the attach controller to be implemented.

> AWS still tries to attach the volume and eventually succeeds

Is this happening in Kubernetes code, or is it behind the AWS API? If it is in kube code, then we can try to stop the retry when the pod is deleted (see comments below); if it is in AWS, then we won't know the right time to call DetachDisk.

> Question is what we can do about it. When attachDiskAndVerify is about to
> time out, we could check if the volume is still being attached (single AWS
> call which can fail or time out) and "fire and forget" DetachDisk (another
> AWS call that can fail or time out).

That might work, but the question is: do we end up in that state because we keep retrying even though the pod is deleted, or because the attach takes too long in AWS but eventually succeeds?

> Sami, do you have an opinion?

I have a couple of thoughts:
- Why do we do the retry in volume plugin code? Why not just let the next sync call handle it? That way, if the pod is deleted, there are no retries.
- If the call is slow but eventually succeeds, don't we go on to create the global mount, which can give the kubelet a clue to detach the disk when cleaning up orphaned volumes?

(In reply to Sami Wagiaalla from comment #6)
> Is this happening in kubernetes code or is it behind the AWS API?

It's how the AWS (and GCE and Cinder) cloud providers work: they start a REST call to the AWS API and then periodically check for the expected result (such as the volume being attached).

> That might work but the question is: do we end up in that state because we
> keep retrying even though the pod is deleted or because the attach takes too
> long in AWS but eventually succeeds?

We ended up in this state because we gave up waiting for the volume to get attached. We may have tried to retry the operation while waiting for the result, but IMO that's irrelevant here. AWS eventually succeeds *after* Kubernetes gave up retrying and waiting.

> - Why do we do the retry in volume plugin code? Why not just let the next
> sync call handle it? That way if the pod is deleted there are no retries.

I do not follow here. The volume gets attached if we simply call cloudprovider.AttachDisk, do not wait for the result, and someone deletes the pod. Someone should then detach it.

> - If the call is slow but eventually succeeds don't we go on to create the
> global mount which can give the kubelet a clue to detach the disk when
> cleaning up orphaned volumes?

Interesting, this could actually work...

(In reply to Jan Safranek from comment #7)
> > - If the call is slow but eventually succeeds don't we go on to create the
> > global mount which can give the kubelet a clue to detach the disk when
> > cleaning up orphaned volumes?
>
> Interesting, this could actually work...

This would involve quite big changes in the AWS volume plugin. As it works now, SetUp() does:

1. attach the volume
2. mount it to /var/lib/kubelet/plugins/kubernetes.io/aws-ebs/mounts/<volume id>
3. bind-mount /var/lib/kubelet/plugins/kubernetes.io/aws-ebs/mounts/<volume id> to /var/lib/kubelet/pods/<pod UID>/volumes/kubernetes.io~aws-ebs/<volume number>

When kubelet cleans up volumes, it looks only at /var/lib/kubelet/pods/<pod UID>/volumes/. A directory there does not contain the AWS volume ID, so TearDown() looks at what is bind-mounted there and finds /var/lib/kubelet/plugins/kubernetes.io/aws-ebs/mounts/<volume id>. So if we simply created /var/lib/kubelet/pods/<pod UID>/volumes/<something>, kubelet would try to TearDown that directory, but the AWS volume plugin would not be able to find the AWS volume ID from it. The whole SetUp/TearDown logic would need to be changed, and that seems like overkill here. Back to the drawing board...
Experimental PR: https://github.com/kubernetes/kubernetes/pull/24861

*** Bug 1337103 has been marked as a duplicate of this bug. ***

So, if we want this fixed in 3.3, the new attach/detach controller should handle it (please retest carefully!). For a fix in 3.2.x, we can use my PR from comment #9; however, it won't be upstream. I'm marking this bug as 'POST' and waiting for the Origin rebase for 3.3.

This has passed on OCP:
openshift v3.3.0.19
kubernetes v1.3.0+507d3a7
etcd 2.3.0+git

Created 50 PVs, PVCs and pods; sleep 300; deleted all pods; sleep 30; deleted all PVCs and PVs. All EBS volumes were then detached from the node and became available.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2016:1933
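At a high level, the attach/detach controller that resolved this class of bug works as a reconciliation loop: it compares the desired state (volumes that scheduled pods still need) with the actual state (volumes attached to nodes) and detaches anything no longer desired, so an attach that completes after a pod is deleted is cleaned up on the next pass. The following is a minimal illustrative sketch under assumed names and types; the real controller in kube-controller-manager is considerably more involved.

```go
package main

import "fmt"

// attachment identifies a volume attached (or to be attached) to a node.
// This is an illustrative model, not the real controller's types.
type attachment struct {
	Volume string
	Node   string
}

// reconcile compares desired attachments (derived from scheduled pods)
// with actual attachments (reported by the cloud provider) and returns
// the operations needed to converge. A volume that leaked because its
// pod was deleted mid-attach shows up as "actual but not desired" and
// gets detached.
func reconcile(desired, actual []attachment) (toAttach, toDetach []attachment) {
	want := make(map[attachment]bool, len(desired))
	for _, d := range desired {
		want[d] = true
	}
	have := make(map[attachment]bool, len(actual))
	for _, a := range actual {
		have[a] = true
	}
	for _, d := range desired {
		if !have[d] {
			toAttach = append(toAttach, d)
		}
	}
	for _, a := range actual {
		if !want[a] {
			toDetach = append(toDetach, a)
		}
	}
	return toAttach, toDetach
}

func main() {
	// The pod using vol-63433dc1 was deleted, but AWS finished the
	// attach anyway: the volume is in "actual" but not in "desired".
	desired := []attachment{{"vol-aaa", "node1"}}
	actual := []attachment{
		{"vol-aaa", "node1"},
		{"vol-63433dc1", "node1"}, // orphaned attach
	}
	toAttach, toDetach := reconcile(desired, actual)
	fmt.Println("attach:", toAttach) // prints: attach: []
	fmt.Println("detach:", toDetach) // prints: detach: [{vol-63433dc1 node1}]
}
```

Because the decision is recomputed from current state on every pass rather than tracked per in-flight operation, a timed-out attach cannot leave the system permanently inconsistent.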
Created attachment 1147437 [details]
volume_in-use

Description of problem:
The volume is still in-use after deleting the pod which was using it.

Version-Release number of selected component (if applicable):
openshift v3.2.0.15
kubernetes v1.2.0-36-g4a3f9c5
etcd 2.2.5

How reproducible:
20%

Steps to Reproduce:
1. Create 50 PVs, PVCs and pods; sleep 300; delete all pods; sleep 30; delete all PVCs and PVs.
2. [root@ip-172-18-5-20 ~]# oc get pods
NAME                      READY   STATUS        RESTARTS   AGE
docker-registry-1-tyl0n   1/1     Running       0          3m
mypod46                   0/1     Terminating   0          8m

[root@ip-172-18-5-20 ~]# oc get events | grep mypod46
8m   8m   1   mypod46   Pod   Normal    Scheduled     {default-scheduler }   Successfully assigned mypod46 to ip-172-18-5-21.ec2.internal
7m   4m   4   mypod46   Pod   Warning   FailedMount   {kubelet ip-172-18-5-21.ec2.internal}   Unable to mount volumes for pod "mypod46_default(8e5f81f9-02ad-11e6-9a1d-0ea1b1d3635d)": Could not attach EBS Disk "aws://us-east-1d/vol-63433dc1": Too many EBS volumes attached to node ip-172-18-5-21.ec2.internal.
7m   4m   4   mypod46   Pod   Warning   FailedSync    {kubelet ip-172-18-5-21.ec2.internal}   Error syncing pod, skipping: Could not attach EBS Disk "aws://us-east-1d/vol-63433dc1": Too many EBS volumes attached to node ip-172-18-5-21.ec2.internal.
14s  14s  1   mypod46   Pod   Warning   FailedMount   {kubelet ip-172-18-5-21.ec2.internal}   Unable to mount volumes for pod "mypod46_default(8e5f81f9-02ad-11e6-9a1d-0ea1b1d3635d)": Could not attach EBS Disk "aws://us-east-1d/vol-63433dc1": Timeout waiting for volume state
14s  14s  1   mypod46   Pod   Warning   FailedSync    {kubelet ip-172-18-5-21.ec2.internal}   Error syncing pod, skipping: Could not attach EBS Disk "aws://us-east-1d/vol-63433dc1": Timeout waiting for volume state

[root@ip-172-18-5-20 ~]# oc get pods
NAME                      READY   STATUS    RESTARTS   AGE
docker-registry-1-tyl0n   1/1     Running   0          4m

3. Check the volume status in the AWS web console. The volume vol-63433dc1 is still in-use 10 minutes after deleting the pod which used it:

[root@ip-172-18-5-20 ~]# ec2dvol vol-63433dc1
VOLUME      vol-63433dc1   1            us-east-1d   in-use     2016-04-15T01:54:05+0000   standard   Not Encrypted
ATTACHMENT  vol-63433dc1   i-9c0d5107   /dev/xvdbj   attached   2016-04-15T02:04:15+0000   false

Actual results:
The volume is still in-use after deleting the pod using it.

Expected results:
The volume should become available after deleting the pod using it.

Additional info: