Bug 1261886

Summary: [OpenStack Director] Deployment fails due to virtual media attached to host
Product: Red Hat OpenStack
Reporter: Joe Talerico <jtaleric>
Component: rhosp-director
Assignee: Lucas Alvares Gomes <lmartins>
Status: CLOSED WONTFIX
QA Contact: Shai Revivo <srevivo>
Severity: medium
Docs Contact:
Priority: high
Version: 7.0 (Kilo)
CC: bengland, dtantsur, jcoufal, jslagle, jtaleric, mburns, rhel-osp-director-maint
Target Milestone: ---
Keywords: Triaged
Target Release: 10.0 (Newton)
Hardware: All
OS: Linux
Doc Type: Bug Fix
Last Closed: 2016-10-14 16:28:19 UTC
Type: Bug

Description Joe Talerico 2015-09-10 11:52:04 UTC
Description of problem:
When attempting to deploy, we had 4 hosts drop into dracut for what looked to be storage issues. It seems that the deployment defaults to installing on whatever sda/vda is; in this case, that was virtual media that had been attached to the host at one point.

How reproducible:
100%


Steps to Reproduce:
1. Hosts have to have a ironic node-update <node-uuid> add properties/root_device='{"key": "value"}'virtual drive attached (we saw this with Dell)
2. Run through an OSPD deployment

Actual results:
Host drops into dracut.

Expected results:
1) Present the user information about the failure (dump the vendor/model of the storage device -- focusing on Storage, but there might be other useful information to present.)

2) Preferably determine a way to build some intelligence into which disk the installer chooses to install on. 

Additional info:
We attempted to provide hints on where to install using `ironic node-update <node-uuid> add properties/root_device='{"model": "ata"}'`; however, the deployment would only stall at that point.

Comment 4 Mike Burns 2016-04-07 20:50:54 UTC
This bug did not make the OSP 8.0 release.  It is being deferred to OSP 10.

Comment 6 Lucas Alvares Gomes 2016-05-19 10:20:32 UTC
Hi Joe,

(In reply to Joe Talerico from comment #0)
> Description of problem:
> When attempting to deploy we had 4 hosts drop into dracut for what looked to
> be storage issues. It seems that the deployment defaults to whatever sda/vda
> is to install on, in this case, it was virtual media that was attached to
> the host at one point. 
> 
> How reproducible:
> 100%
> 
> 
> Steps to Reproduce:
> 1. Hosts have to have a ironic node-update <node-uuid> add
> properties/root_device='{"key": "value"}'virtual drive attached (we saw this
> with Dell)
> 2. Run through a OSPD Deployment
> 

Not sure if I follow, so the root_device was pointing to the virtual media device? 

> Actual results:
> Host drops into dracut.
> 
> Expected results:
> 1) Present the user information about the failure (dump the vendor/model of
> the storage device -- focusing on Storage, but there might be other useful
> information to present.)
> 
> 2) Preferably determine a way to build some intelligence into which disk the
> installer chooses to install on. 
> 
> Additional info:
> We attempted to provide hints on where to install the media to using `ironic
> node-update <node-uuid> add properties/root_device='{"model": "ata"}'
> however the deployment would only stall at that point.

Comment 7 Joe Talerico 2016-05-19 23:25:50 UTC
Lucas, from what we saw, yes. It was pointing to the Dell virtual media. This was in a team member's lab; I will add him as the NEEDINFO so he can provide more information if needed.

Comment 8 Ben England 2016-05-30 22:58:29 UTC
We were using an older Dell DRAC, which has a virtual CD-ROM and a virtual flash device (not kidding). If you don't disable these, they get discovered before the real SCSI devices that you want to use. Someone had left them enabled from a previous test run, so the system disk we wanted to use was /dev/sdc instead of /dev/sda, for example. I think virtually all x86_64 servers have similar functionality in the BIOS.

Linux has been very clear about this for decades: you CANNOT DEPEND ON DEVICE NAMES TO BE STABLE or MEANINGFUL, full stop. Device names are determined only by order of discovery. And the OpenStack deployment YAML files specify device names only, with no other way to identify the correct devices, if I recall correctly.

To identify the correct target device for the OSP install, it would be more useful to search for a candidate by its stable attributes, such as size and whether it is removable or rotational, and to have some sane defaults for this. You could have a choice rule such as "choose the smallest device > 30 GB that is non-removable and rotational" and a fallback rule such as "choose the smallest device > 30 GB that is non-removable". Or the YAML file could let the user specify that introspection should filter out devices with certain strings like "DRAC" or "CDROM" in the vendor or model name. This would let you choose the right device more often with less intervention (not having to go into the BIOS on all the systems and change their configuration). For examples of fields that are stable regardless of order of discovery, see:

/sys/block/sd[a-z]*/{removable,size} 
/sys/block/sd[a-z]*/device/{model,vendor}
/sys/block/sd[a-z]*/queue/rotational

BTW, the rotational field is not always accurate: a MegaRAID controller may present a non-rotational SSD as a "Logical Drive" that the controller reports as rotational (uggh). But NVM SSDs always show up as non-rotational.
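
As a rough illustration of reading those attributes, here is a minimal Python sketch. It is not taken from Ironic or the director code; the function names and the returned dictionary layout are made up for this example. It assumes the rotational flag is read from queue/rotational and that the sysfs size file is counted in 512-byte sectors.

```python
from pathlib import Path


def read_attr(path):
    """Return the stripped contents of a sysfs attribute, or None if it is missing."""
    try:
        return Path(path).read_text().strip()
    except OSError:
        return None


def describe_block_devices(sys_block="/sys/block"):
    """Collect attributes that do not depend on discovery order for each sd* device.

    The dictionary keys are hypothetical names chosen for this sketch.
    """
    devices = []
    for dev in sorted(Path(sys_block).glob("sd*")):
        size_sectors = read_attr(dev / "size")
        devices.append({
            "name": dev.name,
            # sysfs reports the size in 512-byte sectors.
            "size_bytes": int(size_sectors) * 512 if size_sectors else None,
            "removable": read_attr(dev / "removable") == "1",
            "rotational": read_attr(dev / "queue" / "rotational") == "1",
            "vendor": read_attr(dev / "device" / "vendor"),
            "model": read_attr(dev / "device" / "model"),
        })
    return devices


if __name__ == "__main__":
    for device in describe_block_devices():
        print(device)
```

Because the sysfs root is a parameter, the same function can be pointed at a copied or mocked directory tree when debugging a node offline.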

Also, information about the discovered values of these attributes and the device selected by the above rules should be displayed/logged in summary form, so that an OpenStack sysadmin can see what's going wrong without debugging the install like we did. The user could then retry introspection with improved filters.
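
To make the proposed rule concrete, here is a small Python sketch of the selection logic and the summary output. It only illustrates the idea described above, not how the installer actually behaves. The device list, vendor/model strings, function names, and the reading of "30 GB" as 30 * 10^9 bytes are all assumptions made for the example.

```python
MIN_SIZE_BYTES = 30 * 10**9          # "> 30 GB", assumed to mean decimal gigabytes
EXCLUDED_STRINGS = ("DRAC", "CDROM")  # example filter strings from the comment

# Hypothetical inventory: two virtual devices discovered ahead of two real disks.
SAMPLE_DEVICES = [
    {"name": "sda", "size_bytes": 0, "removable": True, "rotational": True,
     "vendor": "iDRAC", "model": "Virtual CD"},
    {"name": "sdb", "size_bytes": 64 * 10**9, "removable": False, "rotational": True,
     "vendor": "iDRAC", "model": "Virtual Flash"},
    {"name": "sdc", "size_bytes": 500 * 10**9, "removable": False, "rotational": True,
     "vendor": "ATA", "model": "EXAMPLE-HDD-500"},
    {"name": "sdd", "size_bytes": 1000 * 10**9, "removable": False, "rotational": True,
     "vendor": "ATA", "model": "EXAMPLE-HDD-1000"},
]


def is_excluded(dev, excluded=EXCLUDED_STRINGS):
    """True if the vendor or model contains one of the filter strings (case-insensitive)."""
    label = "{} {}".format(dev.get("vendor") or "", dev.get("model") or "").upper()
    return any(s.upper() in label for s in excluded)


def choose_root_device(devices, min_size=MIN_SIZE_BYTES, excluded=EXCLUDED_STRINGS):
    """Pick the smallest non-removable device above min_size, preferring rotational ones."""
    candidates = [
        d for d in devices
        if (d["size_bytes"] or 0) > min_size
        and not d["removable"]
        and not is_excluded(d, excluded)
    ]
    rotational = [d for d in candidates if d["rotational"]]
    pool = rotational or candidates  # fallback rule: drop the rotational requirement
    return min(pool, key=lambda d: d["size_bytes"]) if pool else None


def summarize(devices, chosen):
    """One line per device; the chosen device is marked with '*'."""
    lines = []
    for d in devices:
        mark = "*" if chosen is not None and d["name"] == chosen["name"] else " "
        lines.append("{} {} size={} removable={} rotational={} vendor={} model={}".format(
            mark, d["name"], d["size_bytes"], d["removable"], d["rotational"],
            d["vendor"], d["model"]))
    return "\n".join(lines)


if __name__ == "__main__":
    picked = choose_root_device(SAMPLE_DEVICES)
    print(summarize(SAMPLE_DEVICES, picked))
```

With this sample inventory the rule skips both virtual devices and selects sdc, which matches the situation described earlier where the real system disk was /dev/sdc.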

Comment 9 Dmitry Tantsur 2016-10-14 16:28:19 UTC
Hi! Since the move to IPA (ironic-python-agent), we've changed the logic that detects the default device. However, one should not rely on it. You have to use root device hints if you have more than one disk device, regardless of their nature.
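
For illustration only, a hint can be assembled into the same `ironic node-update` form used in the original description. The Python sketch below just builds the command string. The helper name, the node UUID, and the serial value are hypothetical, and the use of "serial" as a hint key is an assumption to verify against the hints your Ironic release supports (the report itself only shows "model").

```python
import json
import shlex


def root_device_hint_command(node_uuid, hints):
    """Build an `ironic node-update` command line that sets properties/root_device.

    `hints` is a dict such as {"serial": "..."}; the keys are not validated here.
    """
    hint_json = json.dumps(hints, sort_keys=True)
    return "ironic node-update {} add properties/root_device={}".format(
        shlex.quote(node_uuid), shlex.quote(hint_json))


if __name__ == "__main__":
    # Hypothetical node UUID and serial number, for illustration only.
    print(root_device_hint_command(
        "1be26c0b-03f2-4d2e-ae87-c02d7f33c123",
        {"serial": "EXAMPLE-SERIAL-0001"}))
```

A hint that matches exactly one disk, such as a serial number, avoids the ambiguity of a broad value like {"model": "ata"}, which may match several devices or none.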

As for better diagnostics: that would be awesome, but to the best of my knowledge, IPMI does not expose virtual media information, and the DRAC driver does not support it either. So we can only know about it when we fail.

Now, it might be interesting to have something like "deployment summary" before the actual deployment, but it's going to be a separate and pretty big RFE.