Bug 1470449

Summary: Cinder volumes not always attached to instance in order presented
Product: Red Hat OpenStack
Component: openstack-nova
Version: 8.0 (Liberty)
Hardware: x86_64
OS: Linux
Status: CLOSED WONTFIX
Severity: high
Priority: unspecified
Reporter: Kevin Lambright <kevinl>
Assignee: Eoghan Glynn <eglynn>
QA Contact: Joe H. Rahme <jhakimra>
CC: akarlsso, awaugama, berrange, dasmith, eglynn, hariram, kchamart, kevinl, mbooth, rchincho, sbauza, sferdjao, sgordon, srevivo, vromanso
Doc Type: If docs needed, set a value
Type: Bug
Last Closed: 2017-07-17 10:02:10 UTC
Bug Blocks: 1335596

Description Kevin Lambright 2017-07-13 01:29:42 UTC
Description of problem:

Our application requires a number of Cinder volumes to be attached to the Nova instance. They always need to be attached in the same order, because the order matters to the software defined storage application: the position in which a Cinder volume is presented determines which "disk" it becomes in the application (our software defined storage VM has boot, root, coredump, data disks, etc.).

We use the OpenStack API to create the resources (Cinder, Neutron, Nova, etc.) and attach them with a Nova server create call.

Most of the time the volumes are attached in the correct order, but about 1 out of 10 times the order of the volumes as presented in the Nova API call (a Python list) is not preserved. This causes our SDS VM to fail to boot because it does not get the disks it expects in the correct order.

Most VMs do not care about the order in which the Cinder volumes are presented to the VM; in our case it is significant.



Version-Release number of selected component (if applicable):

This is OpenStack RDO Liberty

openstack-nova-compute-12.0.4-1.el7.noarch


How reproducible:

Roughly 1 out of 10 attempts.

Steps to Reproduce:

This has been done using the OpenStack API, which is the best way to programmatically reproduce the problem, but it could likely be done with the OpenStack CLI as well.

1. Create a number of Cinder volumes in a way that allows them to be uniquely identified in the VM instance (different sizes, etc.).
2. Attach the volumes to a Nova instance and boot.
3. Repeat steps 1 & 2 enough times, and the Cinder volumes will be attached to the Nova instance in a different order than what was specified. This can be verified by checking the libvirt XML that Nova generates (virsh dumpxml <domain name>). A scripted sketch of these steps follows below.
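
As a rough illustration, the reproduction can be scripted along these lines (a sketch only; it assumes already-authenticated python-cinderclient and python-novaclient handles named cinder and nova, and IMAGE_ID and FLAVOR_ID are placeholders):

 import time

 # Create several volumes with distinct sizes so they can be told apart in the guest.
 volumes = []
 for i, size in enumerate([1, 2, 3, 4]):
     volumes.append(cinder.volumes.create(size=size, name='ordering-test-%d' % i))

 # Wait until every volume is ready to be attached.
 for vol in volumes:
     while cinder.volumes.get(vol.id).status != 'available':
         time.sleep(2)

 # Build the block device mapping in the exact order expected in the guest.
 bdm = [{'source_type': 'volume',
         'destination_type': 'volume',
         'uuid': vol.id,
         'boot_index': -1,
         'delete_on_termination': False} for vol in volumes]

 server = nova.servers.create(name='ordering-test',
                              image=IMAGE_ID,
                              flavor=FLAVOR_ID,
                              block_device_mapping_v2=bdm)

 # Compare the <disk> order in `virsh dumpxml <domain>` on the compute node
 # against the order of `volumes` above.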


Actual results:

Most of the time the expected result holds; about 1 out of 10 times, the order in which the volumes are attached to the Nova instance is not what was specified.


Expected results:

Given an ordered list of Cinder volumes to be attached to a nova instance, the expected result is that they are attached in the specified order every time.


Additional info:

There is an upstream Launchpad bug filed for this as well:

https://bugs.launchpad.net/nova/+bug/1697580

Additional info from this bug is:

All volumes are attached using block device mapping.

I am not passing in any device names, just letting Nova do the auto-assignment. The software defined storage application does not expect specific device names; rather, it relies on the order in which the disks show up on first boot, which is one of the reasons we can't boot and then attach the volumes after the fact.

>> Are you waiting for each volume to show up as "in-use" before attaching another one?

The volumes are not explicitly attached to the instance beforehand: they are created, we wait until they reach the available state, and they are put, in order, into a Python list.

The call to boot the server looks like this:

 server = self.nova.servers.create(name=self.name,
                                   image=self.boot_image_base,
                                   flavor=self.flavor,
                                   config_drive=self.config_drive,
                                   userdata=self.userdata_fp,
                                   block_device_mapping_v2=self.block_device_list,
                                   nics=self.nic_device_list)

self.block_device_list is the list of volumes that we want mapped to the instance, in a very specific order. As stated in the original message, roughly 9 times out of 10 that order is preserved.

BTW, we do have a Heat template - we don't use it for our internal deployments, but it is something we could give out to customers. I'm doing a bunch of deployments with the Heat template to see if I can get it to fail in the same way.

Let me know what other information should be provided.

Comment 2 Matthew Booth 2017-07-14 10:34:59 UTC
A few things to note here. Most important is that Nova explicitly does not guarantee the order in which block devices are presented to an instance. We also explicitly don't guarantee that the order of block devices will remain consistent even within the lifetime of the instance: it can change across a reboot for a variety of reasons, including remapping after dynamically adding or removing disks, or simply the non-deterministic nature of the Linux kernel's device probing at boot time. In fact, that last point could be causing your problem even if Nova isn't changing the order in which the devices are presented.

There are some deterministic things which would cause the order presented to the instance to differ from the order you passed to create. If your BDMs contain mappings which are not volumes, for example snapshots, ephemeral disks, swap disks, or a custom root disk definition, they will be re-ordered. Within each type, though, the order may be preserved. If your flavor specifies ephemeral or swap disks, these will be automatically added before any volumes.

Because we do not guarantee that drive ordering will be preserved, there may also be non-deterministic reasons that they are re-ordered. I've just scanned the code for any obvious ones and didn't see any, but the BDM data structure is transformed several times before it hits the driver and I may have missed something. Non-deterministic reasons the order may change include iterating over a dict, or asynchronous processing of remote Cinder calls. As drive ordering is explicitly not guaranteed, it's not clear to me that these would be considered bugs upstream, which may make them hard to change in practice.

The good news is that the device role tagging feature added in Newton (OSP 10) was designed to do exactly this. The spec is here:

  https://specs.openstack.org/openstack/nova-specs/specs/mitaka/approved/virt-device-role-tagging.html

This is the only fully supported way for a guest OS to unambiguously map its devices to those passed through the API. I strongly recommend that you explore using this API for your application, although it would not be possible to backport to Liberty (OSP 8).
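
For illustration only, tagging block devices at boot looks roughly like this (a sketch, not taken from this bug; it assumes python-novaclient with compute API microversion 2.32 or later, an authenticated keystone session named sess, and placeholder volume, image and flavor IDs):

 from novaclient import client

 # Device role tagging for block devices needs compute API microversion >= 2.32 (Newton).
 nova = client.Client('2.32', session=sess)

 bdm = [
     {'source_type': 'volume', 'destination_type': 'volume',
      'uuid': COREDUMP_VOLUME_ID, 'boot_index': -1, 'tag': 'coredump'},
     {'source_type': 'volume', 'destination_type': 'volume',
      'uuid': DATA_VOLUME_ID, 'boot_index': -1, 'tag': 'data0'},
 ]

 server = nova.servers.create(name='sds-vm',
                              image=IMAGE_ID,
                              flavor=FLAVOR_ID,
                              block_device_mapping_v2=bdm)

 # The guest then identifies each disk by its tag via the metadata service or
 # config drive, regardless of the order in which the disks were attached.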

I can still try to see what is causing the re-ordering in your case, although for the reasons described above I can't guarantee that it would be fixable. I would need the following logs from both a case where the order was preserved, and one where it wasn't:

* DEBUG logs from the nova-api and nova-conductor that serviced the request.
    If you've got multiple controllers, providing all logs from all of them would be fine.
* DEBUG logs from the destination nova-compute
* Guest OS kernel boot logs

Comment 3 Matthew Booth 2017-07-17 10:02:10 UTC
We discussed this in the compute bug triage meeting, and unfortunately there really isn't anything we can fix in OSP8. As I mentioned above, there are many reasons this order may not be consistent, and not all of them are in Nova.

The recommended solution is device role tagging. You can find some better documentation about this than the spec I posted above here:

  https://access.redhat.com/documentation/en-us/red_hat_openstack_platform/10/html/networking_guide/use-tagging

This gives the guest OS an API it can use to determine which disk is intended for which purpose by tag.
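
From inside the guest, looking the tags up is roughly this (a sketch; the device metadata is available both on the config drive and from the metadata service, and the URL below is the standard metadata endpoint):

 import json
 import urllib.request

 # Standard Nova metadata service endpoint; the same document is available on
 # the config drive under openstack/latest/meta_data.json.
 URL = 'http://169.254.169.254/openstack/latest/meta_data.json'
 meta = json.load(urllib.request.urlopen(URL))

 # Each tagged disk is listed with its bus, address, serial (the Cinder volume
 # ID) and the tag that was supplied at boot time.
 for dev in meta.get('devices', []):
     if dev.get('type') == 'disk':
         print(dev.get('tags'), dev.get('bus'), dev.get('address'), dev.get('serial'))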

If you really can't upgrade to OSP 10, you can at least get stable device names for volumes. Nova exposes the volume ID of a Cinder volume to the guest OS by setting it as the disk's serial number. You can find more information about this here:

  https://access.redhat.com/documentation/en-US/Red_Hat_Enterprise_Linux/7/html/Storage_Administration_Guide/persistent_naming.html

This isn't ideal, as you still need to communicate the volume id to the guest OS, but if you have a way to do that then this is completely reliable.
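
Inside the guest that typically boils down to something like the following (a sketch assuming virtio-blk disks; udev exposes the serial under /dev/disk/by-id, truncated to 20 characters for virtio, hence the prefix match on the volume ID):

 import glob
 import os

 def device_for_volume(volume_id):
     """Resolve a Cinder volume ID to the guest block device that carries it.

     Relies on Nova setting the volume ID as the disk serial number; for
     virtio disks the serial visible under /dev/disk/by-id is truncated to
     20 characters, so a prefix match is used.
     """
     for link in glob.glob('/dev/disk/by-id/virtio-*'):
         serial = os.path.basename(link)[len('virtio-'):]
         if volume_id.startswith(serial):
             return os.path.realpath(link)
     return None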

I'm going to close this as I don't think there's anything we can do in OSP8, but I will send a note to the docs team that we could add an explicit disclaimer about BDM ordering on server create with pointers to the reliable naming docs.

Incidentally, thanks for the thorough bug report.