Bug 1422531

Summary: Openshift generates invalid device names on AWS
Product: OKD
Reporter: Hemant Kumar <hekumar>
Component: Storage
Assignee: Hemant Kumar <hekumar>
Status: CLOSED CURRENTRELEASE
QA Contact: Chao Yang <chaoyang>
Severity: urgent
Docs Contact:
Priority: unspecified
Version: 3.x
CC: aos-bugs, aos-storage-staff, bingli, chaoyang, dakini, eparis, erich
Target Milestone: ---
Target Release: ---
Hardware: All
OS: All
Whiteboard:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2017-05-30 12:49:51 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions: 
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:

Description Hemant Kumar 2017-02-15 13:22:12 UTC
On AWS, OpenShift generates EBS device names that are invalid and are rejected by the AWS API.

Version-Release number of selected component (if applicable): 3.4.1.x


The following is the original bug report, copied from Stefanie's message in https://bugzilla.redhat.com/show_bug.cgi?id=1404811#c39



We have installed 3.4.1.2 in Preview prod, but we're still seeing the same behavior as before:

To reproduce:

1. Create a PV-backed app. In this case, we were installing metrics, and the Cassandra pod is backed by a PV.

2. Scale the app down to 0 replicas.

3. Scale back up to 1 replica.

4. Watch the EBS volume get stuck in 'attaching' state in the AWS web console. At this point, we get an error like this:
  timeout expired waiting for volumes to attach/mount for pod "hawkular-cassandra-2-dcq1r"/"openshift-infra"

5. Force-detach the volume in the web console and repeat steps 2,3,4. The app never succeeds in scaling up.

Node logs at log level 4 during the scale-up:
http://paste-platops.itos.redhat.com/p03x0rmms/wbv2qn/raw

The container is stuck creating. The node reports the PV as "in use" but not "attached":
http://paste-platops.itos.redhat.com/pcuvksblp/kxhdbk/raw

Events log showing the timeout, plus 'oc describe pv' output for the affected PV:
http://paste-platops.itos.redhat.com/pmpbmpon6/uqauex/raw

AWS shows the volume in 'available' state after repeatedly scaling up and down. Sometimes it gets stuck in 'attaching' state, but mostly only after deleting the PV and app and recreating it all from scratch.

In the controller logs, I'm now seeing messages like this for various other volumes:

Feb 14 22:58:01 ip-172-31-10-24.ec2.internal atomic-openshift-master-controllers[64870]: E0214 22:58:01.864781   64870 attacher.go:72] Error attaching volume "aws://us-east-1c/vol-0fbcf15804f98f8e9": Error attaching EBS volume: InvalidParameterValue: Value (/dev/xvdfh) for parameter device is invalid. /dev/xvdfh is not a valid EBS device name.

Feb 14 20:36:49 ip-172-31-10-24.ec2.internal atomic-openshift-master-controllers[64870]: E0214 20:36:49.972282   64870 attacher.go:72] Error attaching volume "aws://us-east-1c/vol-0ffea06983cd98900": Error attaching EBS volume: InvalidParameterValue: Value (/dev/xvddz) for parameter device is invalid. /dev/xvddz is not a valid EBS device name.

More controller logs are here:
http://paste-platops.itos.redhat.com/pjqz8puvn/fdnxt4/raw
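
For context on the error messages above: the EC2 AttachVolume API only accepts device names from a limited range. The standalone Go sketch below is not OpenShift code; it simply checks names against an assumed valid range of /dev/xvd[b-z] and /dev/xvd[b-c][a-z], which is consistent with /dev/xvdfh and /dev/xvddz being rejected while single-letter names attach fine.

package main

import (
	"fmt"
	"regexp"
)

// validEBSDevice matches /dev/xvd followed by either a single letter b-z or a
// two-letter suffix whose first letter is b or c (assumed valid range, see above).
var validEBSDevice = regexp.MustCompile(`^/dev/xvd([b-z]|[bc][a-z])$`)

func main() {
	for _, dev := range []string{"/dev/xvdf", "/dev/xvdba", "/dev/xvdcz", "/dev/xvddz", "/dev/xvdfh"} {
		fmt.Printf("%-12s valid=%v\n", dev, validEBSDevice.MatchString(dev))
	}
}

Running it prints valid=true for /dev/xvdf, /dev/xvdba, and /dev/xvdcz, and valid=false for /dev/xvddz and /dev/xvdfh, matching the rejections in the controller log.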

Comment 1 Hemant Kumar 2017-02-15 13:28:05 UTC
The upstream PR - https://github.com/kubernetes/kubernetes/pull/41455

I am waiting for it to be merged before I start cherry-picking this.
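
For anyone watching this bug, the general shape of the change is to stop deriving device suffixes open-endedly and instead hand out names only from the set EC2 accepts, skipping names already in use on the node. The Go sketch below illustrates that idea under the same assumed valid range as in comment 0; it is illustrative only, not the code from the PR.

package main

import "fmt"

// nextDevice returns the first free name from the assumed valid pool
// (/dev/xvdb[a-z] and /dev/xvdc[a-z]), or an error if all names are taken.
func nextDevice(inUse map[string]bool) (string, error) {
	for _, first := range []byte{'b', 'c'} {
		for second := byte('a'); second <= 'z'; second++ {
			name := fmt.Sprintf("/dev/xvd%c%c", first, second)
			if !inUse[name] {
				return name, nil
			}
		}
	}
	return "", fmt.Errorf("no free EBS device names left")
}

func main() {
	// Pretend /dev/xvdba and /dev/xvdbb are already attached on this node.
	inUse := map[string]bool{"/dev/xvdba": true, "/dev/xvdbb": true}
	name, err := nextDevice(inUse)
	if err != nil {
		fmt.Println("allocation failed:", err)
		return
	}
	fmt.Println("allocated", name) // prints "allocated /dev/xvdbc"
}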

Comment 5 Hemant Kumar 2017-02-15 21:10:33 UTC
*** Bug 1422457 has been marked as a duplicate of this bug. ***

Comment 6 Hemant Kumar 2017-02-21 15:44:39 UTC
The fix for 3.4 has been included in v3.4.1.8. I will hold off on putting this BZ in QA since the fix for 3.3 isn't merged yet.