Bug 1422531 - OpenShift generates invalid device names on AWS
Summary: OpenShift generates invalid device names on AWS
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: OKD
Classification: Red Hat
Component: Storage
Version: 3.x
Hardware: All
OS: All
Priority: unspecified
Severity: urgent
Target Milestone: ---
Target Release: ---
Assignee: Hemant Kumar
QA Contact: Chao Yang
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2017-02-15 13:22 UTC by Hemant Kumar
Modified: 2020-06-11 13:17 UTC
CC: 7 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2017-05-30 12:49:51 UTC
Target Upstream Version:
Embargoed:



Description Hemant Kumar 2017-02-15 13:22:12 UTC
On AWS, OpenShift generates device names that are invalid and rejected by the AWS API.

Version-Release number of selected component (if applicable): 3.4.1.x
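
For context on what makes a generated name invalid: AWS documents valid EBS device names on HVM Linux instances as falling roughly in /dev/sd[b-z] (or the /dev/xvd equivalents) plus the two-letter range /dev/xvd[b-c][a-z], so a two-letter name whose first letter is outside b-c, such as /dev/xvdfh, is rejected. A minimal Go sketch of that check; the pattern and function name are illustrative, not the Kubernetes implementation:

package main

import (
	"fmt"
	"regexp"
)

// ebsDeviceName approximates the device names the EC2 AttachVolume API
// accepts for EBS volumes on HVM Linux instances: single-letter
// /dev/sd[b-z] or /dev/xvd[b-z], and two-letter /dev/xvd[b-c][a-z].
var ebsDeviceName = regexp.MustCompile(`^/dev/(sd[b-z]|xvd[b-z]|xvd[b-c][a-z])$`)

func isValidEBSDeviceName(name string) bool {
	return ebsDeviceName.MatchString(name)
}

func main() {
	for _, name := range []string{"/dev/xvdf", "/dev/xvdba", "/dev/xvdfh", "/dev/xvddz"} {
		fmt.Printf("%s valid=%v\n", name, isValidEBSDeviceName(name))
	}
}

Running this marks /dev/xvdfh and /dev/xvddz, the two names from the controller logs below, as invalid.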


The following is the original bug report, copied from Stefanie's message in https://bugzilla.redhat.com/show_bug.cgi?id=1404811#c39:



We have installed 3.4.1.2 in Preview prod, but we're still seeing the same behavior as before:

To reproduce:

1. Create a PV-backed app. In this case, we were installing metrics, and the Cassandra pod is backed by a PV.

2. Scale the app down to 0 replicas.

3. Scale back up to 1 replica.

4. Watch the EBS volume get stuck in 'attaching' state in the AWS web console. At this point, we get an error like this:
  timeout expired waiting for volumes to attach/mount for pod "hawkular-cassandra-2-dcq1r"/"openshift-infra"

5. Force-detach the volume in the web console and repeat steps 2, 3, and 4. The app never succeeds in scaling up. (A scripted sketch of the scale steps follows this list.)
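
For reference, steps 2 and 3 as a minimal client-go sketch. The use of a Deployment scale subresource, the name 'hawkular-cassandra-2', and the kubeconfig handling are illustrative assumptions; the original cluster is OpenShift 3.4, where the Cassandra pod was managed by a different controller:

package main

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

// scaleTo sets the replica count of a Deployment via its scale
// subresource, mirroring the manual scale steps above.
func scaleTo(clientset *kubernetes.Clientset, namespace, name string, replicas int32) error {
	scale, err := clientset.AppsV1().Deployments(namespace).GetScale(context.TODO(), name, metav1.GetOptions{})
	if err != nil {
		return err
	}
	scale.Spec.Replicas = replicas
	_, err = clientset.AppsV1().Deployments(namespace).UpdateScale(context.TODO(), name, scale, metav1.UpdateOptions{})
	return err
}

func main() {
	config, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	clientset, err := kubernetes.NewForConfig(config)
	if err != nil {
		panic(err)
	}
	// Steps 2 and 3: scale the PV-backed app down to 0, then back up to 1,
	// then watch for the volume getting stuck in 'attaching' (step 4).
	for _, n := range []int32{0, 1} {
		if err := scaleTo(clientset, "openshift-infra", "hawkular-cassandra-2", n); err != nil {
			panic(err)
		}
		fmt.Printf("scaled to %d replicas\n", n)
	}
}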

Node logs at log level 4 during the scale-up:
http://paste-platops.itos.redhat.com/p03x0rmms/wbv2qn/raw

The container is stuck in ContainerCreating. The node says the PV is "in use" but not "attached":
http://paste-platops.itos.redhat.com/pcuvksblp/kxhdbk/raw

Events log showing the timeout, and 'oc describe pv' output for the affected PV:
http://paste-platops.itos.redhat.com/pmpbmpon6/uqauex/raw

AWS shows the volume in 'available' state after repeatedly scaling up and down. Sometimes it gets stuck in 'attaching' state, but mostly only after deleting the PV and the app and recreating it all from scratch.

In the controller logs, I'm now seeing messages like this for various other volumes:

Feb 14 22:58:01 ip-172-31-10-24.ec2.internal atomic-openshift-master-controllers[64870]: E0214 22:58:01.864781   64870 attacher.go:72] Error attaching volume "aws://us-east-1c/vol-0fbcf15804f98f8e9": Error attaching EBS volume: InvalidParameterValue: Value (/dev/xvdfh) for parameter device is invalid. /dev/xvdfh is not a valid EBS device name.

Feb 14 20:36:49 ip-172-31-10-24.ec2.internal atomic-openshift-master-controllers[64870]: E0214 20:36:49.972282   64870 attacher.go:72] Error attaching volume "aws://us-east-1c/vol-0ffea06983cd98900": Error attaching EBS volume: InvalidParameterValue: Value (/dev/xvddz) for parameter device is invalid. /dev/xvddz is not a valid EBS device name.

More controller logs are here:
http://paste-platops.itos.redhat.com/pjqz8puvn/fdnxt4/raw

Comment 1 Hemant Kumar 2017-02-15 13:28:05 UTC
The upstream PR - https://github.com/kubernetes/kubernetes/pull/41455

I am waiting for it to be merged before I start cherry-picking it.
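
For the record, the idea of the fix is to have the allocator hand out only device names AWS accepts instead of walking the whole two-letter space. A rough Go sketch of that approach, assuming the valid two-letter range is /dev/xvd[b-c][a-z]; this is an illustration of the idea, not the actual patch:

package main

import "fmt"

// validEBSNames enumerates attachment device names in the order an
// allocator could hand them out: the single-letter names /dev/xvd[b-z]
// first, then the two-letter names /dev/xvd[b-c][a-z]. Names outside
// this set (e.g. /dev/xvdfh) are rejected by the EC2 API.
func validEBSNames() []string {
	var names []string
	for c := 'b'; c <= 'z'; c++ {
		names = append(names, fmt.Sprintf("/dev/xvd%c", c))
	}
	for first := 'b'; first <= 'c'; first++ {
		for second := 'a'; second <= 'z'; second++ {
			names = append(names, fmt.Sprintf("/dev/xvd%c%c", first, second))
		}
	}
	return names
}

// nextDeviceName returns the first valid name not already attached,
// or an error when the instance has exhausted all valid names.
func nextDeviceName(inUse map[string]bool) (string, error) {
	for _, name := range validEBSNames() {
		if !inUse[name] {
			return name, nil
		}
	}
	return "", fmt.Errorf("no valid EBS device names left")
}

func main() {
	inUse := map[string]bool{"/dev/xvdb": true, "/dev/xvdc": true}
	name, err := nextDeviceName(inUse)
	if err != nil {
		panic(err)
	}
	fmt.Println("next device name:", name) // /dev/xvdd
}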

Comment 5 Hemant Kumar 2017-02-15 21:10:33 UTC
*** Bug 1422457 has been marked as a duplicate of this bug. ***

Comment 6 Hemant Kumar 2017-02-21 15:44:39 UTC
The fix for 3.4 has been included in v3.4.1.8. I will hold off on moving this BZ to QA since the fix for 3.3 isn't merged yet.

