Bug 1481729

Summary:	EBS issues on starter-us-east-2
Product:	OpenShift Online	Reporter:	Paul Bergene <pbergene>
Component:	Storage	Assignee:	Hemant Kumar <hekumar>
Status:	CLOSED CURRENTRELEASE	QA Contact:	Chao Yang <chaoyang>
Severity:	urgent	Docs Contact:
Priority:	unspecified
Version:	3.x	CC:	abhgupta, aos-bugs, aos-storage-staff, bchilds, chaoyang, erich, hchen, hekumar, hongkliu, jfiala, jupierce, lxia, mifiedle, pbergene, rhowe, sspeiche, sten, tsmetana
Target Milestone:	---	Keywords:	OnlinePro, OnlineStarter
Target Release:	---
Hardware:	Unspecified
OS:	Unspecified
Whiteboard:
Fixed In Version:		Doc Type:	If docs needed, set a value
Doc Text:		Story Points:	---
Clone Of:		Environment:
Last Closed:	2018-03-05 18:15:57 UTC	Type:	Bug
Regression:	---	Mount Type:	---
Documentation:	---	CRM:
Verified Versions:		Category:	---
oVirt Team:	---	RHEL 7.3 requirements from Atomic Host:
Cloudforms Team:	---	Target Upstream Version:
Embargoed:

Description Paul Bergene 2017-08-15 14:20:06 UTC

Description of problem:

With the upgrade to 3.6, it looks like some of the EBS problems are
back. We are now seeing issues like this in the event logs again
(ref: project aslak-che on the starter-us-east-2 cluster):

21m        26m       3         che-2-jr9lf   Pod                 Warning   FailedMount   kubelet, ip-172-31-79-193.us-east-2.compute.internal
    Unable to mount volumes for pod "che-2-jr9lf_aslak-che(8834b53f-81a6-11e7-a1ae-0233cba325d9)": timeout expired waiting for volumes to attach/mount for pod "aslak-che"/"che-2-jr9lf". list of unattached/unmounted volumes=[che-data-volume]

21m        26m       3         che-2-jr9lf   Pod                 Warning   FailedSync    kubelet, ip-172-31-79-193.us-east-2.compute.internal
    Error syncing pod

25m        25m       1         che-2-jr9lf   Pod                 Warning   FailedMount   attachdetach
    Failed to attach volume "pvc-950f4b94-814e-11e7-ac45-0233cba325d9" on node "ip-172-31-79-193.us-east-2.compute.internal" with: Error attaching EBS volume "vol-0f06f75a93ad3a6a0" to instance "i-0ca452e5adc5d3e40": VolumeInUse: vol-0f06f75a93ad3a6a0 is already attached to an instance
           status code: 400, request id: cf794c0d-2580-4395-81a7-987f3766dce9. The volume is currently attached to instance "i-0e12b6108c6915c15"

23m        23m       1         che-2-jr9lf   Pod                 Warning   FailedMount   attachdetach
    (combined from similar events): Failed to attach volume "pvc-950f4b94-814e-11e7-ac45-0233cba325d9" on node "ip-172-31-79-193.us-east-2.compute.internal" with: Error attaching EBS volume "vol-0f06f75a93ad3a6a0" to instance "i-0ca452e5adc5d3e40": VolumeInUse: vol-0f06f75a93ad3a6a0 is already attached to an instance
           status code: 400, request id: cf87853e-6ddf-40fa-acbe-8d44e57e91e4. The volume is currently attached to instance "i-0e12b6108c6915c15
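When triaging events like the ones above, it helps to pull out which volume is stuck and which instance still holds it. The following is a minimal sketch, not part of the original report: the function name `parse_volume_in_use` is hypothetical, and it assumes the message text has the same "VolumeInUse: ... currently attached to instance" shape as the events quoted here.

```shell
#!/bin/bash
# parse_volume_in_use (hypothetical helper)
# Reads event text on stdin and prints "<volume-id> <holding-instance-id>"
# for the first VolumeInUse failure found. Returns 1 if none is found.
parse_volume_in_use() {
  local msg vol holder
  # Join wrapped lines so a phrase split across lines still matches.
  msg=$(tr '\n' ' ')
  # Volume ID as it appears right after "VolumeInUse: ".
  vol=$(printf '%s\n' "$msg" \
        | grep -o 'VolumeInUse: vol-[0-9a-f]*' \
        | head -n 1 | sed 's/^VolumeInUse: //')
  # Instance named after "currently attached to instance".
  holder=$(printf '%s\n' "$msg" \
           | grep -o 'attached *to instance "i-[0-9a-f]*' \
           | head -n 1 | sed 's/.*"//')
  [ -n "$vol" ] && [ -n "$holder" ] || return 1
  printf '%s %s\n' "$vol" "$holder"
}
```

Fed the events quoted in this report (for example via `oc get events -n aslak-che | parse_volume_in_use`), this would print the volume and the instance that is still holding it, which can then be cross-checked in AWS.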


Version-Release number of selected component (if applicable):

OpenShift Master:
    v3.6.173.0.5 (online version 3.5.0.20)
Kubernetes Master:
    v1.6.1+5115d708d7 


How reproducible:

Try attaching an EBS volume


Steps to Reproduce:
1.
2.
3.

Actual results:

Volume not attaching

Expected results:

Volume attaching

Additional info: 
Race condition, locks don't expire? Workaround was in place on 3.5.

Comment 3 Matthew Wong 2017-08-15 16:44:07 UTC

Please provide kubelet logs for ip-172-31-77-48.us-east-2.compute.internal & ip-172-31-79-193.us-east-2.compute.internal that capture 10:40~10:50
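If the node logs have already been exported to a file, the requested window can be cut out with a small filter. This is a sketch under assumptions: the function name `filter_window` is hypothetical, the lines are assumed to be in the default syslog-style journal format ("Aug 15 10:41:02 host unit[pid]: ..."), and the time zone of the 10:40~10:50 window is not stated in this comment, so it has to be matched to the node's clock. When working directly on the node, `journalctl` with `--since` and `--until` does the same job natively.

```shell
#!/bin/bash
# filter_window START END (hypothetical helper)
# Reads syslog-style lines on stdin and prints those whose time-of-day
# field (third column) lies between START and END, both HH:MM:SS and
# inclusive. Dates are ignored, so feed it a single day's log.
filter_window() {
  awk -v start="$1" -v end="$2" '$3 >= start && $3 <= end'
}
```

For example, `filter_window 10:40:00 10:50:59 < node.log` would keep only the lines from that window, where `node.log` is a placeholder file name.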

Comment 20 Hemant Kumar 2017-09-12 19:46:35 UTC

Opened smaller PR for fixing it - https://github.com/kubernetes/kubernetes/pull/52221 

We need to convince reviewers that it is correct.

Comment 21 Bradley Childs 2017-09-13 20:56:23 UTC

*** Bug 1472530 has been marked as a duplicate of this bug. ***

Comment 22 Bradley Childs 2017-09-15 19:24:47 UTC

Upstream doesn't want to merge this fix so late in the 1.8 release, but has reviewed and approved it for merge into 1.9 once that opens.

Per eparis, we will carry this patch: https://github.com/openshift/ose/pull/864

And Origin: 
https://github.com/openshift/origin/pull/16384

Comment 23 Abhishek Gupta 2018-01-18 17:33:11 UTC

Starter is already on OCP 3.7 - can this bug be tested/verified on Starter?

Comment 24 Hemant Kumar 2018-01-18 17:36:35 UTC

The fix was merged in upstream as well and is part of 3.7, 3.8 and 3.9. 

But just a reminder - this bug is about a narrow case of:

1. Create a pod with EBS volume.
2. Delete the pod after the volume has been attached to the new node but before the pod starts running there.
3. Before this fix, the detach itself took 6-8 minutes.
4. After the fix, the volume should be detached sooner.
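To compare the detach time before and after the fix, the wait in steps 3 and 4 can be measured rather than eyeballed. The following is only a sketch: `time_until` and `volume_available` are hypothetical helper names, and `volume_available` assumes the AWS CLI is installed and configured for the account that owns the volume.

```shell
#!/bin/bash
# time_until TIMEOUT INTERVAL CMD [ARGS...] (hypothetical helper)
# Re-runs CMD every INTERVAL seconds until it succeeds, then prints the
# elapsed whole seconds. Returns 1 once TIMEOUT seconds have passed
# without CMD succeeding.
time_until() {
  local timeout=$1 interval=$2 start=$SECONDS
  shift 2
  until "$@"; do
    if [ $((SECONDS - start)) -ge "$timeout" ]; then
      return 1
    fi
    sleep "$interval"
  done
  echo $((SECONDS - start))
}

# volume_available VOLUME_ID (hypothetical check)
# Succeeds once the EBS volume reports the "available" state, i.e. it is
# no longer attached or detaching. Assumes a configured AWS CLI.
volume_available() {
  [ "$(aws ec2 describe-volumes --volume-ids "$1" \
        --query 'Volumes[0].State' --output text)" = "available" ]
}
```

Run right after deleting the pod, `time_until 600 5 volume_available <volume-id>` would print the number of seconds until the volume was free again, or fail after ten minutes. The 600-second limit is chosen here only because it sits above the 6-8 minutes mentioned in step 3.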

Comment 25 Abhishek Gupta 2018-01-26 00:40:31 UTC

The fix for this issue should be in INT/STG for the Pro tier at this point.

Comment 26 Chao Yang 2018-01-26 08:13:11 UTC

Verified this bug with the script below.

---------------------
#!/bin/bash

oc create -f https://raw.githubusercontent.com/chao007/v3-testfiles/master/persistent-volumes/ebs/dynamic-provisioning-pvc.json
# make sure the PV and PVC are bound
sleep 5
oc create -f pod.yaml
sleep 6
oc describe pods mypod
oc get pod
oc delete pods mypod
------------------------

During the `sleep 6` step, the AWS web console shows the EBS volume in the attached state and the pod is in "ContainerCreating" status.
After `oc delete pods mypod`, the EBS volume goes to `detaching` status immediately and becomes available soon after.

QE could not verify the EBS volume status from the AWS web console for the Online cluster, so I tested it on the OCP product instead. The version is:
oc v3.9.0-0.24.0
kubernetes v1.9.1+a0ce1bc657
features: Basic-Auth GSSAPI Kerberos SPNEGO

Server https://ip-172-18-2-46.ec2.internal:443
openshift v3.9.0-0.24.0
kubernetes v1.9.1+a0ce1bc657