Bug 1316095 - AWS volumes remain in "in-use" status after deleting OSE pods which used them - OSE v3.1.1.911
Summary: AWS volumes remain in "in-use" status after deleting OSE pods which used the...
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: OKD
Classification: Red Hat
Component: Storage
Version: 3.x
Hardware: Unspecified
OS: Unspecified
urgent
low
Target Milestone: ---
Assignee: Sami Wagiaalla
QA Contact: Jianwei Hou
URL:
Whiteboard:
Depends On:
Blocks: 1318975
 
Reported: 2016-03-09 12:11 UTC by Elvir Kuric
Modified: 2016-06-07 22:47 UTC
CC List: 10 users

Fixed In Version: atomic-openshift-3.2.0.5-1.git.0.f1bac72.el7.x86_64
Doc Type: Bug Fix
Doc Text:
Clone Of:
: 1318975 (view as bug list)
Environment:
Last Closed: 2016-05-12 17:14:08 UTC
Target Upstream Version:
Embargoed:


Attachments
volume status (36.22 KB, image/png), 2016-03-09 12:11 UTC, Elvir Kuric

Description Elvir Kuric 2016-03-09 12:11:45 UTC
Created attachment 1134490 [details]
volume status

Description of problem:
EBS volumes stay in "in-use" status after removing the pods associated with them

Version-Release number of selected component (if applicable):

OSE v3.1.1.911 with awsElasticBlockStore storage plugin

How reproducible:
Create 50+ pods, each pod using one EBS volume. Once the pods are created, start deleting the pods, PVs, PVCs, and EBS volumes. The create and delete operations were close in time, e.g. create_pods; sleep 30; delete_pods
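
For illustration, a minimal sketch of such a create/delete cycle, assuming pre-created EBS volumes and an inline awsElasticBlockStore pod volume for brevity (the reporter used PVs/PVCs; the helper names and volume IDs below are placeholders, not the reporter's actual scripts):
--
#!/bin/bash
# Sketch of the reproduce cycle: create pods backed by EBS volumes,
# then delete them shortly afterwards. Volume IDs are placeholders.

create_pods() {
  for i in $(seq 1 "$1"); do
    oc create -f - <<EOF
apiVersion: v1
kind: Pod
metadata:
  name: ebs-pod-$i
spec:
  containers:
  - name: sleeper
    image: busybox
    command: ["sleep", "3600"]
    volumeMounts:
    - name: data
      mountPath: /data
  volumes:
  - name: data
    awsElasticBlockStore:
      volumeID: vol-000000$i   # placeholder EBS volume ID
      fsType: ext4
EOF
  done
}

delete_pods() {
  for i in $(seq 1 "$1"); do
    oc delete pod "ebs-pod-$i"
  done
}

# Create and delete operations close in time, as described above:
create_pods 50
sleep 30
delete_pods 50
--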

Actual results:

EBS volumes remain in "in-use" status even after the pods which used them were deleted.

Expected results:

After deleting the pods, it is expected that the EBS volumes move to the "available" state, from which they can be removed/deleted.

Additional info:

In the AWS web interface, after deleting the pods the volumes remain in the state shown in the attached photo (in-use). This status does not allow the volumes to be removed; it is necessary to detach them first in order to remove them.

While the devices are "in-use" (after deleting the pods), they are not visible on the Amazon instance (= OSE node) in the output of:

fdisk
/proc/partitions
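
For reference, one hedged way to cross-check both sides, assuming the AWS CLI is configured (the volume ID is a placeholder):
--
# On the OSE node: a detached device should no longer appear here.
cat /proc/partitions
fdisk -l

# From a host with the AWS CLI configured (placeholder volume ID):
aws ec2 describe-volumes --volume-ids vol-0123456789abcdef0 \
    --query 'Volumes[0].{State:State,Attachments:Attachments}'
--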

Comment 1 Jan Safranek 2016-03-09 12:29:12 UTC
I'll look at it.

Comment 2 Jan Safranek 2016-03-16 09:05:22 UTC
The first part, raising the limit to 39, has been merged into Kubernetes 1.2. Admins can adjust the limit by setting the env. variable "KUBE_MAX_PD_VOLS" in the scheduler process (openshift-master); however, kubelet will refuse to attach more than 39 volumes anyway. 'oc describe pod' will show a clear message that too many volumes are attached and the pod can't be started.

https://github.com/kubernetes/kubernetes/pull/22942
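
For illustration, a sketch of how that variable could be set for the scheduler on an OSE master; the sysconfig path below is an assumption for this sketch, not something confirmed in this bug:
--
# Assumption: the master (which runs the scheduler) reads its environment
# from /etc/sysconfig/atomic-openshift-master.
echo 'KUBE_MAX_PD_VOLS=39' >> /etc/sysconfig/atomic-openshift-master
systemctl restart atomic-openshift-master
--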


The second part, allowing kubelet to attach more than 39 volumes, is still open and I'm working on it. Tracked here: https://github.com/kubernetes/kubernetes/issues/22994

Comment 3 Jeremy Eder 2016-03-16 11:31:55 UTC
I assume that update belonged in https://bugzilla.redhat.com/show_bug.cgi?id=1315995

Comment 4 Jan Safranek 2016-03-16 12:43:32 UTC
Oops, sorry, too many open windows... scratch comment #2.

Comment 6 Sami Wagiaalla 2016-03-18 20:21:07 UTC
> Create 50+ pods where each pod uses one EBS volume and once pods are created
> start to delete pods, pv,pvc and EBS volumes.Create and delete operation
> were timely close, eg. create_pods; sleep 30; delete_pods 
> 

OSE 3.1 currently supports only 11 EBS volumes. This may get raised to 39. Can you reproduce this issue with < 11 volumes? If not, then this should be marked as a duplicate of #1315995.

I am surprised you are not getting "Value (/dev/XYZ) for parameter device is invalid." errors. Are all 50 volumes getting attached and the pods running? Or are you just cycling through the pods n at a time, where n < 11?
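
One hedged way to check whether a node is already at that limit is to count the EBS volumes attached to its EC2 instance (the instance ID is a placeholder):
--
# Number of EBS volumes currently attached to one OSE node.
aws ec2 describe-volumes \
    --filters Name=attachment.instance-id,Values=i-0123456789abcdef0 \
    --query 'length(Volumes)'
--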

Comment 7 Elvir Kuric 2016-03-18 20:48:59 UTC
(In reply to Sami Wagiaalla from comment #6)
> > Create 50+ pods where each pod uses one EBS volume and once pods are created
> > start to delete pods, pv,pvc and EBS volumes.Create and delete operation
> > were timely close, eg. create_pods; sleep 30; delete_pods 
> > 
> 
> OSE 3.1 currently supports only 11 EBS volumes. This may get raised to 39.
> Can you reproduce this issue with < 11 volumes ? If not then this should be
> marked as a duplicate of #1315995
I did not try < 11 volumes; in my tests I tried to reach the max values.
> 
> I am surprised you are not getting "Value (/dev/XYZ) for parameter device is
> invalid." errors.
I am getting this error when I exceed the limit of 11 EBS volumes, but during the create phase. This issue is visible after the pods/PVs/PVCs are removed.

> Are all 50 volumes getting attached and the pods are
> running ? Or are you just cycling through the pods n at a time where n < 11 ?
If more pods are started, asking for more than 11 EBS volumes, then not all EBS volumes get attached; only up to the limit per OSE node (11 EBS volumes). Only the pods which start get an EBS device (one EBS volume per pod). The other EBS volumes are not visible on the OSE side; in the Amazon console they are marked as "available".

This BZ is for the case when the pods, PVs, and PVCs are deleted (oc get pv/pods,pvc is not showing anything), but even though the pods/PVCs/PVs are gone, the EBS volumes previously used by the pods are still "in-use" and attached to the EC2 instance used for the OSE node. The attachment to this BZ illustrates it: the EBS volumes show "in-use" even with no pods using them.
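
Until this is fixed, cleaning up such a leftover volume apparently requires a manual detach before it can be deleted; a hedged sketch with the AWS CLI (the volume ID is a placeholder):
--
# Manually detach and delete a volume left "in-use" after its pod/PV/PVC are gone.
VOL=vol-0123456789abcdef0   # placeholder
aws ec2 detach-volume --volume-id "$VOL"
aws ec2 wait volume-available --volume-ids "$VOL"
aws ec2 delete-volume --volume-id "$VOL"
--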

Comment 8 Jan Safranek 2016-03-20 13:26:31 UTC
Elvir, I tried hard to reproduce this bug, but OpenShift always worked as expected, detaching all volumes. Sometimes I ended up with a very confused AWS (it refused to attach any volumes to a node), but that's not what you see.

Can you try with the newest OpenShift build? atomic-openshift-3.2.0.5 does not have the buggy AWS attachment cache; it should help.

If you can still reproduce it, can you please share with us how you installed the systems (i.e. what AWS AMI you use, your Ansible inventory, plus any additional steps you did)?

Comment 9 Elvir Kuric 2016-03-21 12:19:41 UTC
I do not have atomic-openshift-3.2.0.5 installed; I will install it today/tomorrow, test with these bits, and follow up. Thank you.

Comment 10 Elvir Kuric 2016-03-21 20:36:25 UTC
I tested with the packages below (38 pods per OSE node):
-- 
atomic-openshift-master-3.2.0.5-1.git.0.f1bac72.el7.x86_64
tuned-profiles-atomic-openshift-node-3.2.0.5-1.git.0.f1bac72.el7.x86_64
atomic-openshift-3.2.0.5-1.git.0.f1bac72.el7.x86_64
atomic-openshift-sdn-ovs-3.2.0.5-1.git.0.f1bac72.el7.x86_64
atomic-openshift-clients-3.2.0.5-1.git.0.f1bac72.el7.x86_64
atomic-openshift-node-3.2.0.5-1.git.0.f1bac72.el7.x86_64
-- 
and cannot reproduce the issue with these packages.
I tried the same approach as before (which led to the issue):

create pods, sleep 300s, delete pods, sleep 30s; delete pv; delete pvc, sleep 120s, delete EBS volumes... worked fine, as expected.

Previously the last step, trying to remove the EBS volumes, was problematic and failing; now it is not.

Thanks 
Kind regards, 
Elvir

Comment 11 Jan Safranek 2016-03-22 11:12:13 UTC
Thanks a lot for the test! I'm marking it as fixed.

Comment 12 Wenqi He 2016-03-24 13:05:31 UTC
I have verified this on Origin; it works well with the max pods number (39), and all the volumes can be deleted after being released. Changing the status to VERIFIED.

