Bug 1371375

Summary: Race condition during AWS EBS/Cinder volume detach and delete
Product: OpenShift Container Platform
Reporter: Chao Yang <chaoyang>
Component: Storage
Assignee: Jan Safranek <jsafrane>
Status: CLOSED ERRATA
QA Contact: Jianwei Hou <jhou>
Severity: low
Priority: medium
Version: 3.3.0
CC: aos-bugs, bchilds, jsafrane, xtian
Target Milestone: ---
Target Release: 3.5.z
Hardware: Unspecified
OS: Unspecified
Fixed In Version: openshift v3.5.0.18+9a5d1aa
Doc Type: No Doc Update
Last Closed: 2017-04-26 05:35:55 UTC
Type: Bug

Description Chao Yang 2016-08-30 05:30:52 UTC
Description of problem:
Using a dynamically provisioned EBS volume: when the PVC is deleted, the PV enters the Failed state, and only after some time is the PV actually deleted.

Version-Release number of selected component (if applicable):
openshift v3.3.0.27
kubernetes v1.3.0+507d3a7
etcd 2.3.0+git

How reproducible:
Always

Steps to Reproduce:
1. oc create -f https://raw.githubusercontent.com/openshift-qe/v3-testfiles/master/persistent-volumes/ebs/dynamic-provisioning-pvc.json
2. Create a pod using the above PVC.
3. After the pod is running, delete the pod and the PVC.
4. Check the PV status:
[root@ip-172-18-4-224 ~]# oc get pv
NAME                                       CAPACITY   ACCESSMODES   STATUS    CLAIM                 REASON    AGE
pvc-b3da0e4c-6e60-11e6-b1d8-0e1d1ce05611   1Gi        RWO           Failed    default/ebsc                    4m
5. [root@ip-172-18-4-224 ~]# oc describe pv pvc-b3da0e4c-6e60-11e6-b1d8-0e1d1ce05611
Name:		pvc-b3da0e4c-6e60-11e6-b1d8-0e1d1ce05611
Labels:		failure-domain.beta.kubernetes.io/region=us-east-1
		failure-domain.beta.kubernetes.io/zone=us-east-1d
Status:		Failed
Claim:		default/ebsc
Reclaim Policy:	Delete
Access Modes:	RWO
Capacity:	1Gi
Message:	Delete of volume "pvc-b3da0e4c-6e60-11e6-b1d8-0e1d1ce05611" failed: error deleting EBS volumes: VolumeInUse: Volume vol-2658d981 is currently attached to i-09ba4911
		status code: 400, request id: 
Source:
    Type:	AWSElasticBlockStore (a Persistent Disk resource in AWS)
    VolumeID:	aws://us-east-1d/vol-2658d981
    FSType:	ext4
    Partition:	0
    ReadOnly:	false
Events:
  FirstSeen	LastSeen	Count	From				SubobjectPath	Type		Reason			Message
  ---------	--------	-----	----				-------------	--------	------			-------
  10s		10s		1	{persistentvolume-controller }			Warning		VolumeFailedDelete	Delete of volume "pvc-b3da0e4c-6e60-11e6-b1d8-0e1d1ce05611" failed: error deleting EBS volumes: VolumeInUse: Volume vol-2658d981 is currently attached to i-09ba4911
		status code: 400, request id:

Actual results:
PV entered "Failed" status

Expected results:
PV should be "Released" and deleted

Additional info:
Aug 29 23:25:04 ip-172-18-4-224 atomic-openshift-master: I0829 23:25:04.592697   11569 controller.go:398] volume "pvc-b3da0e4c-6e60-11e6-b1d8-0e1d1ce05611" is released and reclaim policy "Delete" will be executed
Aug 29 23:25:04 ip-172-18-4-224 atomic-openshift-master: I0829 23:25:04.608665   11569 controller.go:618] volume "pvc-b3da0e4c-6e60-11e6-b1d8-0e1d1ce05611" entered phase "Released"
Aug 29 23:25:04 ip-172-18-4-224 atomic-openshift-master: I0829 23:25:04.619587   11569 controller.go:1079] isVolumeReleased[pvc-b3da0e4c-6e60-11e6-b1d8-0e1d1ce05611]: volume is released
Aug 29 23:25:04 ip-172-18-4-224 atomic-openshift-master: I0829 23:25:04.772467   11569 aws_util.go:51] Error deleting EBS Disk volume aws://us-east-1d/vol-2658d981: error deleting EBS volumes: VolumeInUse: Volume vol-2658d981 is currently attached to i-09ba4911
Aug 29 23:25:04 ip-172-18-4-224 atomic-openshift-master: status code: 400, request id:
Aug 29 23:25:04 ip-172-18-4-224 atomic-openshift-master: I0829 23:25:04.777807   11569 controller.go:618] volume "pvc-b3da0e4c-6e60-11e6-b1d8-0e1d1ce05611" entered phase "Failed"
Aug 29 23:25:04 ip-172-18-4-224 atomic-openshift-master: I0829 23:25:04.790657   11569 controller.go:1079] isVolumeReleased[pvc-b3da0e4c-6e60-11e6-b1d8-0e1d1ce05611]: volume is released
Aug 29 23:25:05 ip-172-18-4-224 atomic-openshift-master: I0829 23:25:05.154037   11569 aws_util.go:51] Error deleting EBS Disk volume aws://us-east-1d/vol-2658d981: error deleting EBS volumes: VolumeInUse: Volume vol-2658d981 is currently attached to i-09ba4911

Comment 1 Jianwei Hou 2016-08-30 05:42:44 UTC
Cinder has the same issue: https://github.com/kubernetes/kubernetes/issues/31511

Comment 2 Jan Safranek 2016-08-30 12:43:35 UTC
I admit the PV enters the Failed state, but it should "self-heal" ~30 seconds after the volume is detached from all nodes.

Lowering the priority.

@xtian/jhou, please confirm and raise the priority if a volume stays stuck in the Failed state for longer than 1 minute after it has been detached from all nodes.
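The self-heal described above works because the PV controller retries the "Delete" reclaim on every sync period: while the volume is still attached, the cloud API rejects the deletion with VolumeInUse, and once detach completes the next retry succeeds. A minimal sketch of that retry pattern follows; this is a hypothetical shell simulation with made-up attempt counts, not the actual controller code:

```shell
# Hypothetical simulation of the PV controller's reclaim retry loop.
# While the EBS volume is still attached, DeleteVolume fails with
# VolumeInUse; the controller retries on each sync period until the
# detach has finished, at which point deletion succeeds.
attached=3          # pretend the detach completes after 3 sync periods
attempt=0
until [ "$attached" -eq 0 ]; do
  attempt=$((attempt + 1))
  echo "attempt $attempt: VolumeInUse: volume is currently attached"
  attached=$((attached - 1))
done
echo "attempt $((attempt + 1)): volume deleted, PV can be removed"
```

This is why a PV that shows Failed right after the PVC is deleted is expected to disappear within roughly one sync period of the volume being detached from all nodes.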

Comment 3 Jianwei Hou 2016-11-03 08:39:10 UTC
The issue is reproduced in
openshift v3.4.0.19+346a31d
kubernetes v1.4.0+776c994
etcd 3.1.0-rc.0

Comment 4 Jan Safranek 2016-11-03 10:33:41 UTC
sorry, I forgot to create Origin PR, now it's tracked as https://github.com/openshift/origin/pull/11746.

Comment 5 Chao Yang 2017-02-09 07:28:38 UTC
This passed on
openshift v3.5.0.18+9a5d1aa
kubernetes v1.5.2+43a9be4
etcd 3.1.0

Comment 6 Jan Safranek 2017-03-22 09:14:12 UTC
@chaoyang, I forgot to mark it as MODIFIED, I am doing it right now. Please move it to VERIFIED if you think it's fixed.

Comment 7 Chao Yang 2017-03-24 03:03:44 UTC
Already tested this bug on v3.5.0.18 and it passed.

Comment 9 errata-xmlrpc 2017-04-26 05:35:55 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2017:1129