Bug 1020501

Summary: openstack-nova: No proper error is returned immediately if several instances attempt to configure the same bootable volume
Product: Red Hat OpenStack
Reporter: Yogev Rabl <yrabl>
Component: openstack-nova
Assignee: Nikola Dipanov <ndipanov>
Status: CLOSED UPSTREAM
QA Contact: Tzach Shefi <tshefi>
Severity: medium
Docs Contact:
Priority: medium
Version: 4.0
CC: ajeain, dron, eglynn, ndipanov, sgordon, tshefi, yeylon, yrabl
Target Milestone: z2
Keywords: Triaged, ZStream
Target Release: 6.0 (Juno)
Hardware: Unspecified
OS: Unspecified
Whiteboard: storage
Fixed In Version: openstack-nova-2013.2-6.el6ost
Doc Type: Bug Fix
Clone Of:
: 1085005 (view as bug list)
Environment:
Last Closed: 2015-03-03 18:49:30 UTC
Type: Bug
Bug Depends On: 1085005    
Attachments: the cinder & nova logs

Description Yogev Rabl 2013-10-17 19:19:20 UTC
Created attachment 813518 [details]
the cinder & nova logs

Description of problem:
When launching a large number of instances from the CLI in a loop, with the mistake of configuring the same bootable volume for all of the instances, more than one instance shows the volume as 'active'.

Version-Release number of selected component (if applicable):

The OpenStack components: 
python-cinderclient-1.0.6-1.el6ost.noarch
python-nova-2013.2-0.25.rc1.el6ost.noarch
openstack-nova-cert-2013.2-0.25.rc1.el6ost.noarch
python-cinder-2013.2-0.11.rc1.el6ost.noarch
openstack-nova-common-2013.2-0.25.rc1.el6ost.noarch
openstack-nova-console-2013.2-0.25.rc1.el6ost.noarch
openstack-nova-compute-2013.2-0.25.rc1.el6ost.noarch
openstack-nova-conductor-2013.2-0.25.rc1.el6ost.noarch
openstack-nova-novncproxy-2013.2-0.25.rc1.el6ost.noarch
openstack-nova-scheduler-2013.2-0.25.rc1.el6ost.noarch
python-novaclient-2.15.0-1.el6ost.noarch
openstack-cinder-2013.2-0.11.rc1.el6ost.noarch
openstack-nova-api-2013.2-0.25.rc1.el6ost.noarch
openstack-nova-network-2013.2-0.25.rc1.el6ost.noarch

The OS: 
Red Hat Enterprise Linux Server release 6.5 Beta (Santiago)

How reproducible:
Each time a different number of instances manages to configure the volume as active, but it is always more than one. A combined reproduce-and-verify sketch follows the steps below.

Steps to Reproduce:
1. Run the launch instance command in the CLI with a loop: 
# for i in 1 2 3 4 5 6 7 8 9 10  ; do nova boot --flavor 2 --boot-volume <volume id> <name>$i ; done
2. Check the instances status: 
# nova list
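
A minimal reproduce-and-verify sketch (the volume ID is a placeholder and the settle time is arbitrary):

# VOLUME_ID=<volume id>
# for i in $(seq 1 10) ; do nova boot --flavor 2 --boot-volume "$VOLUME_ID" vm$i ; done
# sleep 60      # give the scheduler and compute nodes time to settle
# nova list     # expected: one instance ACTIVE, the rest in ERROR
# cinder list   # expected: the volume in-use and attached to exactly one instance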

Actual results:
More than one instance was able to configure the bootable volume as active.

Expected results:
The first instance is able to configure the bootable volume; the rest return an error and are not created.

Additional info:
Attached: the Nova and Cinder logs.

Comment 2 Nikola Dipanov 2013-10-29 14:19:10 UTC
I have linked the two upstream bugs that have already been raised for this issue; they contain more detailed explanations of the race condition. It would be good to confirm that this is in fact the same bug, as that is not clear from the original description. Providing the command-line history would also help, as it is difficult to reconstruct things from the logs alone.

The results you should be seeing are (a quick check sketch follows the list):

* Only one instance boots successfully.
* All others fail (they get rescheduled, so in a setup with more than one compute node they may still succeed).
* The volume is shown as available - even though it should be shown as attached to the first instance. The first instance is still shown as active.
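
A quick way to check the last two points (the volume ID is a placeholder):

# cinder show <volume id>   # the status/attachments fields show whether cinder still reports the volume as available
# nova list                 # the first instance should still show as ACTIVE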

Comment 3 Yogev Rabl 2013-10-31 15:00:31 UTC
The bug should be fixed upstream as well as downstream.

Comment 4 Nikola Dipanov 2013-11-01 12:55:24 UTC
What Yogev forgot to mention in comment #3 is that we managed to confirm that this is indeed the bug I linked. I have proposed an upstream stable fix (linked above).

Once accepted upstream - we will apply it to RHOS tree.

Comment 7 Haim 2013-12-12 12:34:12 UTC
Moving back to ASSIGNED; I managed to reproduce the problem with the small script mentioned above and get the following errors:


[root@puma31 ~(keystone_admin)]# for i in 1 2 3 4 5 6 7 8 9 10  ; do nova boot --flavor 2 --boot-volume 7424b243-438b-4a42-b350-0feb0646818a  vm$i ; done
+--------------------------------------+----------------------------------------------------+
| Property                             | Value                                              |
+--------------------------------------+----------------------------------------------------+
| OS-EXT-STS:task_state                | scheduling                                         |
| image                                | Attempt to boot from volume - no image supplied    |
| OS-EXT-STS:vm_state                  | building                                           |
| OS-EXT-SRV-ATTR:instance_name        | instance-00000012                                  |
| OS-SRV-USG:launched_at               | None                                               |
| flavor                               | m1.small                                           |
| id                                   | b668226a-fbcc-4d20-94ff-91997fc0dc30               |
| security_groups                      | [{u'name': u'default'}]                            |
| user_id                              | bec60d0e35c44aac8c47a27069fbcb3a                   |
| OS-DCF:diskConfig                    | MANUAL                                             |
| accessIPv4                           |                                                    |
| accessIPv6                           |                                                    |
| progress                             | 0                                                  |
| OS-EXT-STS:power_state               | 0                                                  |
| OS-EXT-AZ:availability_zone          | nova                                               |
| config_drive                         |                                                    |
| status                               | BUILD                                              |
| updated                              | 2013-12-12T12:31:32Z                               |
| hostId                               |                                                    |
| OS-EXT-SRV-ATTR:host                 | None                                               |
| OS-SRV-USG:terminated_at             | None                                               |
| key_name                             | None                                               |
| OS-EXT-SRV-ATTR:hypervisor_hostname  | None                                               |
| name                                 | vm1                                                |
| adminPass                            | qkEBa5eKPLsU                                       |
| tenant_id                            | b1434b1c9e024196929bc1214040fffe                   |
| created                              | 2013-12-12T12:31:32Z                               |
| os-extended-volumes:volumes_attached | [{u'id': u'7424b243-438b-4a42-b350-0feb0646818a'}] |
| metadata                             | {}                                                 |
+--------------------------------------+----------------------------------------------------+
+--------------------------------------+----------------------------------------------------+
| Property                             | Value                                              |
+--------------------------------------+----------------------------------------------------+
| OS-EXT-STS:task_state                | scheduling                                         |
| image                                | Attempt to boot from volume - no image supplied    |
| OS-EXT-STS:vm_state                  | building                                           |
| OS-EXT-SRV-ATTR:instance_name        | instance-00000013                                  |
| OS-SRV-USG:launched_at               | None                                               |
| flavor                               | m1.small                                           |
| id                                   | 1d543670-dd0b-4321-9334-72f55d896aa4               |
| security_groups                      | [{u'name': u'default'}]                            |
| user_id                              | bec60d0e35c44aac8c47a27069fbcb3a                   |
| OS-DCF:diskConfig                    | MANUAL                                             |
| accessIPv4                           |                                                    |
| accessIPv6                           |                                                    |
| progress                             | 0                                                  |
| OS-EXT-STS:power_state               | 0                                                  |
| OS-EXT-AZ:availability_zone          | nova                                               |
| config_drive                         |                                                    |
| status                               | BUILD                                              |
| updated                              | 2013-12-12T12:31:34Z                               |
| hostId                               |                                                    |
| OS-EXT-SRV-ATTR:host                 | None                                               |
| OS-SRV-USG:terminated_at             | None                                               |
| key_name                             | None                                               |
| OS-EXT-SRV-ATTR:hypervisor_hostname  | None                                               |
| name                                 | vm2                                                |
| adminPass                            | cvtH88KKrSMP                                       |
| tenant_id                            | b1434b1c9e024196929bc1214040fffe                   |
| created                              | 2013-12-12T12:31:34Z                               |
| os-extended-volumes:volumes_attached | [{u'id': u'7424b243-438b-4a42-b350-0feb0646818a'}] |
| metadata                             | {}                                                 |
+--------------------------------------+----------------------------------------------------+
ERROR: Block Device Mapping is Invalid: failed to get volume 7424b243-438b-4a42-b350-0feb0646818a. (HTTP 400) (Request-ID: req-24cb440e-4476-446a-ba63-bae9099ebd5c)
ERROR: Block Device Mapping is Invalid: failed to get volume 7424b243-438b-4a42-b350-0feb0646818a. (HTTP 400) (Request-ID: req-685124e5-fd54-479d-8d2b-1d86fa248ab1)
ERROR: Block Device Mapping is Invalid: failed to get volume 7424b243-438b-4a42-b350-0feb0646818a. (HTTP 400) (Request-ID: req-a6017878-0902-40bb-9659-4c58272e78af)
ERROR: Block Device Mapping is Invalid: failed to get volume 7424b243-438b-4a42-b350-0feb0646818a. (HTTP 400) (Request-ID: req-5cd5141c-0847-4038-ad33-106d72eac19f)
ERROR: Block Device Mapping is Invalid: failed to get volume 7424b243-438b-4a42-b350-0feb0646818a. (HTTP 400) (Request-ID: req-b6468af6-be98-4df2-998d-12689f2ed8a3)
ERROR: Block Device Mapping is Invalid: failed to get volume 7424b243-438b-4a42-b350-0feb0646818a. (HTTP 400) (Request-ID: req-31103cf3-27ae-44ed-b68b-6f26df81b061)
ERROR: Block Device Mapping is Invalid: failed to get volume 7424b243-438b-4a42-b350-0feb0646818a. (HTTP 400) (Request-ID: req-d74ddc93-78e3-48ad-9ecf-e6b55d666070)
ERROR: Block Device Mapping is Invalid: failed to get volume 7424b243-438b-4a42-b350-0feb0646818a. (HTTP 400) (Request-ID: req-9801da1e-3bf5-4f3b-a6e3-a3fe5f02d045)

Comment 9 Nikola Dipanov 2013-12-17 12:18:45 UTC
There are two issues at hand here. The first one is what the upstream bug was about and what the bug fix addresses. The second issue is what comment #7 refers to. We might want to raise a separate bug for this second issue, as it is not that critical IMHO, and fix it in the next release.

1) Due to a race condition it was possible for an instance to 'steal' a volume from another instance. This manifested itself as one instance booting successfully while the volume remained marked as available in cinder.

2) The bug we are seeing in comment #7 is that the API will still accept all of the requests; however, once they hit the compute nodes, only one instance is able to actually attach the volume while the others go into ERROR, and in the end the volume is properly attached to only a single instance (see the illustrative sketch below).
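
As an illustration only (plain Python, not nova's actual code; the volume dict and the lock are toy stand-ins), the race in 1) is a check-then-act problem: serialising the check-and-claim, which is roughly what reserving the volume up front achieves, leaves exactly one winner:

import threading

volume = {"status": "available", "attached_to": None}   # toy stand-in for cinder state
state_lock = threading.Lock()
results = []

def claim(instance_id):
    # Atomic check-and-claim: without the lock two callers could both
    # observe "available" and both mark the volume as theirs.
    with state_lock:
        if volume["status"] != "available":
            results.append((instance_id, "ERROR"))
            return
        volume["status"] = "in-use"
        volume["attached_to"] = instance_id
    results.append((instance_id, "ACTIVE"))

threads = [threading.Thread(target=claim, args=("vm%d" % i,)) for i in range(1, 11)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(results)   # exactly one instance reports ACTIVE, the rest report ERROR
print(volume)    # {'status': 'in-use', 'attached_to': 'vmN'}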

Haim - could you please confirm that this is in fact the case? If so, I will mark this one as resolved and open another one to track the upstream work needed to prevent the API from even accepting the racing requests.

Comment 10 Xavier Queralt 2013-12-17 13:13:24 UTC
I thought I saw two instances going to ACTIVE but, now that I've rechecked it, I can confirm that even though two instances are scheduled, only the first one gets the volume attached and moves to the ACTIVE state. The second one fails when it cannot attach the volume and moves to the ERROR state.

$ for i in 1 2 3 4 5 6 7 8 9 10  ; do nova boot --flavor m1.nano --boot-volume 1c2363d1-4f1b-4331-ae94-8c2e8ecd3e89 vm$i ; done

$ nova list
+--------------------------------------+------+--------+------------+-------------+--------------------------+
| ID                                   | Name | Status | Task State | Power State | Networks                 |
+--------------------------------------+------+--------+------------+-------------+--------------------------+
| 327efced-ca4f-4344-9768-57527f22ac1a | vm1  | ACTIVE | None       | Running     | novanetwork=192.168.32.2 |
| 9571eeb1-7d13-4722-9323-c1a9cf887ff2 | vm2  | ERROR  | None       | NOSTATE     | novanetwork=192.168.32.3 |
+--------------------------------------+------+--------+------------+-------------+--------------------------+

$ cinder list
+--------------------------------------+--------+--------------+------+-------------+----------+--------------------------------------+
|                  ID                  | Status | Display Name | Size | Volume Type | Bootable |             Attached to              |
+--------------------------------------+--------+--------------+------+-------------+----------+--------------------------------------+
| 1c2363d1-4f1b-4331-ae94-8c2e8ecd3e89 | in-use |    cirros    |  1   |     None    |   true   | 327efced-ca4f-4344-9768-57527f22ac1a |
+--------------------------------------+--------+--------------+------+-------------+----------+--------------------------------------+

Comment 17 Nikola Dipanov 2014-04-10 11:55:38 UTC
We have created bug #1085005 so that we can verify that the "bad" part of this bug (see 1) in comment 9) is actually resolved, and we are leaving this one to track the issues described in comment 7 as well as the linked upstream bug (https://bugs.launchpad.net/nova/+bug/1302545).

Comment 19 Eoghan Glynn 2015-03-03 18:49:30 UTC
Since the bulk of this issue was cleaved off into BZ 1085005, the remaining portion, tracked in:

  https://bugs.launchpad.net/nova/+bug/1302545

can be CLOSED->UPSTREAM, as discussed on the triage call.