Description of problem:

OSP10 nova is affected by https://bugs.launchpad.net/nova/+bug/1648324: "When cross_az_attach option is set to False and you try to boot an instance with bootable volume created before, InvalidBDM error raised". Unfortunately the fix has never landed upstream. It would be great to have this addressed in OSP.
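For reference, the option involved is the one nova reads as CONF.cinder.cross_az_attach (see the cinder.py snippet quoted later in this bug). A minimal sketch of the nova.conf setting under which the failure is reported (treat the section name as an assumption for this release rather than verified documentation):

    [cinder]
    # Disallow attaching volumes that live in a different availability
    # zone than the instance being booted.
    cross_az_attach = False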
https://review.openstack.org/#/c/485257/ has been abandoned. What exactly should be backported?
(In reply to Sven Anderson from comment #1)
> https://review.openstack.org/#/c/485257/ has been abandoned. What exactly
> should be backported?

Hi Sven,

sorry, maybe "backport" is not the right wording. It would be great if we could carry that patch in OSP10, since the current behaviour prevents users from pre-provisioning boot volumes. Do you think that would be feasible?

Thanks,
Luca
As the nova core developer explains in the abandoned change, there is considerable doubt that this is the right fix for Newton/OSP10. They suggest first researching the actual problem. This change could help with debugging the problem in Newton: https://review.openstack.org/#/c/493206/. So if you can reproduce this problem, I would suggest cherry-picking that change and seeing what the real source of the problem is.

If we just blindly cherry-pick an abandoned fix from upstream to downstream, we ignore the reasons why it was abandoned. It might make the error disappear on the surface, but it could also have other negative effects at the same time.

I will close this for now. Please feel free to reopen this bug if you find further evidence that this is still the right patch, contrary to the assessment of the upstream developers.
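In case it helps, a minimal sketch of how that debug change could be pulled into a local Newton tree for testing (the patchset number in the ref is a placeholder; the exact ref should be taken from the Gerrit download box for https://review.openstack.org/#/c/493206/):

    git fetch https://review.openstack.org/openstack/nova refs/changes/06/493206/1
    git cherry-pick FETCH_HEAD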
(In reply to Sven Anderson from comment #3)
> As the nova core developer explains in the abandoned change, there is
> considerable doubt that this is the right fix for Newton/OSP10. They suggest
> first researching the actual problem. This change could help with debugging
> the problem in Newton: https://review.openstack.org/#/c/493206/. So if you
> can reproduce this problem, I would suggest cherry-picking that change and
> seeing what the real source of the problem is.
>
> If we just blindly cherry-pick an abandoned fix from upstream to downstream,
> we ignore the reasons why it was abandoned. It might make the error
> disappear on the surface, but it could also have other negative effects at
> the same time.
>
> I will close this for now. Please feel free to reopen this bug if you find
> further evidence that this is still the right patch, contrary to the
> assessment of the upstream developers.

Hi Sven,

I understand and share your concern. The reason I raised this BZ and pointed to that specific patch is that the customer experienced the same symptoms as in https://bugs.launchpad.net/nova/+bug/1648324, and https://review.openstack.org/#/c/485257/ fixed the issue for them.

I'll follow your suggestion and try to cherry-pick 493206; hopefully this will give us better visibility into whether this is indeed the right fix.

Cheers,
Luca
Looks like https://review.openstack.org/#/c/493206/ is already there in openstack-nova-api-14.0.8-5.el7ost.noarch:

1597         except Exception as ex:
1598             LOG.exception(_LE('Instance failed block device setup'),
1599                           instance=instance)
1600             # InvalidBDM will eventually result in a BuildAbortException when
1601             # booting from volume, and will be recorded as an instance fault.
1602             # Maintain the original exception message which most likely has
1603             # useful details which the standard InvalidBDM error message lacks.
1604             raise exception.InvalidBDM(six.text_type(ex))

[stack@undercloud-10 ~]$ openstack volume create --availability-zone dc1 --size 1 --image 3b17a727-623e-462d-aa37-2c1ab2e5faf3 cirvol_dc1
+---------------------+--------------------------------------+
| Field               | Value                                |
+---------------------+--------------------------------------+
| attachments         | []                                   |
| availability_zone   | dc1                                  |
| bootable            | false                                |
| consistencygroup_id | None                                 |
| created_at          | 2018-01-29T18:33:05.947236           |
| description         | None                                 |
| encrypted           | False                                |
| id                  | c65bc60a-08f0-472e-9878-065dc77a61df |
| migration_status    | None                                 |
| multiattach         | False                                |
| name                | cirvol_dc1                           |
| properties          |                                      |
| replication_status  | disabled                             |
| size                | 1                                    |
| snapshot_id         | None                                 |
| source_volid        | None                                 |
| status              | creating                             |
| type                | None                                 |
| updated_at          | None                                 |
| user_id             | ad21e604433140e7a7ec38cb2e6c3d2c     |
+---------------------+--------------------------------------+

[stack@undercloud-10 ~]$ cinder list
+--------------------------------------+-----------+------------+------+-------------+----------+-------------+
| ID                                   | Status    | Name       | Size | Volume Type | Bootable | Attached to |
+--------------------------------------+-----------+------------+------+-------------+----------+-------------+
| c65bc60a-08f0-472e-9878-065dc77a61df | available | cirvol_dc1 | 1    | -           | true     |             |
+--------------------------------------+-----------+------------+------+-------------+----------+-------------+

With the original (unpatched) code:

def get_instance_availability_zone(context, instance):
    """Return availability zone of specified instance."""
    host = instance.get('host')
    #host = instance.host if 'host' in instance else None
    if not host:
        # Likely hasn't reached a viable compute node yet so give back the
        # desired availability_zone in the instance record if the boot request
        # specified one.
        az = instance.get('availability_zone')
        return az

[stack@undercloud-10 ~]$ openstack server create --flavor c2aac3b3-b20e-4c1f-9599-d4a598a2e9db --availability-zone dc1 --volume c65bc60a-08f0-472e-9878-065dc77a61df --block-device source=volume,id=c65bc60a-08f0-472e-9878-065dc77a61df,dest=volume,size=1,shutdown=preserve,bootindex=0 instance_dc1
Block Device Mapping is Invalid: failed to get volume c65bc60a-08f0-472e-9878-065dc77a61df. (HTTP 400) (Request-ID: req-1fb4aa99-02df-4089-bc11-f3ba5567a78b)

With the code from https://review.openstack.org/#/c/485257/ applied:

def get_instance_availability_zone(context, instance):
    """Return availability zone of specified instance."""
    #host = instance.get('host')
    host = instance.host if 'host' in instance else None
    if not host:
        # Likely hasn't reached a viable compute node yet so give back the
        # desired availability_zone in the instance record if the boot request
        # specified one.
        az = instance.get('availability_zone')
        return az

$ openstack server create --flavor c2aac3b3-b20e-4c1f-9599-d4a598a2e9db --availability-zone dc1 --volume c65bc60a-08f0-472e-9878-065dc77a61df --block-device source=volume,id=c65bc60a-08f0-472e-9878-065dc77a61df,dest=volume,size=1,shutdown=preserve,bootindex=0 instance_dc1
+--------------------------------------+----------------------------------------------------+
| Field                                | Value                                              |
+--------------------------------------+----------------------------------------------------+
| OS-DCF:diskConfig                    | MANUAL                                             |
| OS-EXT-AZ:availability_zone          | dc1                                                |
| OS-EXT-SRV-ATTR:host                 | None                                               |
| OS-EXT-SRV-ATTR:hypervisor_hostname  | None                                               |
| OS-EXT-SRV-ATTR:instance_name        |                                                    |
| OS-EXT-STS:power_state               | NOSTATE                                            |
| OS-EXT-STS:task_state                | scheduling                                         |
| OS-EXT-STS:vm_state                  | building                                           |
| OS-SRV-USG:launched_at               | None                                               |
| OS-SRV-USG:terminated_at             | None                                               |
| accessIPv4                           |                                                    |
| accessIPv6                           |                                                    |
| addresses                            |                                                    |
| adminPass                            | mhLvGGrmMMx4                                       |
| config_drive                         |                                                    |
| created                              | 2018-01-29T18:38:25Z                               |
| flavor                               | m1.tiny (c2aac3b3-b20e-4c1f-9599-d4a598a2e9db)     |
| hostId                               |                                                    |
| id                                   | baa485f1-2de7-4451-91d4-ea20bcc47c3c               |
| image                                |                                                    |
| key_name                             | None                                               |
| name                                 | instance_dc1                                       |
| os-extended-volumes:volumes_attached | [{u'id': u'c65bc60a-08f0-472e-9878-065dc77a61df'}] |
| progress                             | 0                                                  |
| project_id                           | 3b8884bbb6424123869a17dc7293fde9                   |
| properties                           |                                                    |
| security_groups                      | [{u'name': u'default'}]                            |
| status                               | BUILD                                              |
| updated                              | 2018-01-29T18:38:25Z                               |
| user_id                              | ad21e604433140e7a7ec38cb2e6c3d2c                   |
+--------------------------------------+----------------------------------------------------+

[stack@undercloud-10 ~]$ nova list
+--------------------------------------+--------------+--------+----------------------+-------------+----------------------+
| ID                                   | Name         | Status | Task State           | Power State | Networks             |
+--------------------------------------+--------------+--------+----------------------+-------------+----------------------+
| baa485f1-2de7-4451-91d4-ea20bcc47c3c | instance_dc1 | BUILD  | block_device_mapping | NOSTATE     | internal=192.168.0.7 |
+--------------------------------------+--------------+--------+----------------------+-------------+----------------------+

[stack@undercloud-10 ~]$ nova list
+--------------------------------------+--------------+--------+------------+-------------+----------------------+
| ID                                   | Name         | Status | Task State | Power State | Networks             |
+--------------------------------------+--------------+--------+------------+-------------+----------------------+
| baa485f1-2de7-4451-91d4-ea20bcc47c3c | instance_dc1 | ACTIVE | -          | Running     | internal=192.168.0.7 |
+--------------------------------------+--------------+--------+------------+-------------+----------------------+

It looks like https://review.openstack.org/#/c/485257/ allows the volume to be retrieved and the instance to spawn. Do you want me to test anything else? There is not much in the logs, but I can attach them if you like.
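For readability, the effective change between the two variants tested above (which, as far as I can tell, is what the abandoned review proposes for nova/availability_zones.py; treat the exact context lines as approximate) boils down to:

     def get_instance_availability_zone(context, instance):
         """Return availability zone of specified instance."""
    -    host = instance.get('host')
    +    host = instance.host if 'host' in instance else None
         if not host:
             ...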
Do the logs show a backtrace of that exception? We can try to compare that with Matt Riedemann's explanation of how the code logic should work.

Actually, his explanations are not entirely consistent, because he claims this line would check that instance.host is set:

https://github.com/openstack/nova/blob/stable/newton/nova/compute/resource_tracker.py#L125

But that code actually makes sure it is _not_ set.
(In reply to Sven Anderson from comment #6)
> Do the logs show a backtrace of that exception? We can try to compare that
> with Matt Riedemann's explanation of how the code logic should work.
>
> Actually, his explanations are not entirely consistent, because he claims
> this line would check that instance.host is set:
> https://github.com/openstack/nova/blob/stable/newton/nova/compute/resource_tracker.py#L125
>
> But that code actually makes sure it is _not_ set.

I've attached the log in debug mode. Unfortunately there is nothing except a 404 as far as I can see (2018-01-29 18:42:13.788).
Created attachment 1388418 [details] nova api logs in debug
If I'm not mistaken, this is compute code. Can you provide debug compute logs?
(In reply to Sven Anderson from comment #11)
> If I'm not mistaken, this is compute code. Can you provide debug compute
> logs?

[stack@undercloud-10 ~]$ cinder list
+--------------------------------------+-----------+------------+------+-------------+----------+-------------+
| ID                                   | Status    | Name       | Size | Volume Type | Bootable | Attached to |
+--------------------------------------+-----------+------------+------+-------------+----------+-------------+
| c65bc60a-08f0-472e-9878-065dc77a61df | available | cirvol_dc1 | 1    | -           | true     |             |
+--------------------------------------+-----------+------------+------+-------------+----------+-------------+

[stack@undercloud-10 ~]$ openstack server create --flavor c2aac3b3-b20e-4c1f-9599-d4a598a2e9db --availability-zone dc1 --volume c65bc60a-08f0-472e-9878-065dc77a61df --block-device source=volume,id=c65bc60a-08f0-472e-9878-065dc77a61df,dest=volume,size=1,shutdown=preserve,bootindex=0 instance_dc1
Block Device Mapping is Invalid: failed to get volume c65bc60a-08f0-472e-9878-065dc77a61df. (HTTP 400) (Request-ID: req-8bcc1d00-dd7c-44aa-a349-69324def0677)

I am attaching logs from both controller and compute.
Created attachment 1388472 [details] new logs - controller
Created attachment 1388473 [details] new logs - compute
So this obviously never hits compute code, and therefore we don't see an exception. Could you please add this assert line at the beginning of the get_instance_availability_zone function (before the host = ... line), so that we get an actual traceback showing where this is coming from?

def get_instance_availability_zone(context, instance):
    """Return availability zone of specified instance."""
    assert 'host' in instance
    #host = instance.get('host')
    host = instance.host if 'host' in instance else None
    ...

Thanks!
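P.S.: if the assert should not show up as a traceback in the api log, a minimal fallback sketch (my assumption about where to hook in, using only the standard library; the stack goes to stderr, so it may land in the service journal rather than the api log) is to print the call stack from the same spot:

def get_instance_availability_zone(context, instance):
    """Return availability zone of specified instance."""
    # Illustrative only: dump the call stack whenever the 'host' field
    # has not been set on the instance object yet.
    import traceback
    if 'host' not in instance:
        traceback.print_stack()
    ...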
(In reply to Sven Anderson from comment #15)
> So this obviously never hits compute code, and therefore we don't see an
> exception. Could you please add this assert line at the beginning of the
> get_instance_availability_zone function (before the host = ... line), so
> that we get an actual traceback showing where this is coming from?
>
> def get_instance_availability_zone(context, instance):
>     """Return availability zone of specified instance."""
>     assert 'host' in instance
>     #host = instance.get('host')
>     host = instance.host if 'host' in instance else None
>     ...
>
> Thanks!

I've added the assert but it didn't generate any trace, so I ran nova-api under pdb:

(Pdb)
(Pdb) l
156
157     def get_instance_availability_zone(context, instance):
158         """Return availability zone of specified instance."""
159
160         import pdb; pdb.set_trace()
161  ->     assert 'host' in instance
162         host = instance.get('host')
163         #host = instance.host if 'host' in instance else None
164         if not host:
165             # Likely hasn't reached a viable compute node yet so give back the
166             # desired availability_zone in the instance record if the boot request
(Pdb) n
AssertionError: AssertionError()
> /usr/lib/python2.7/site-packages/nova/availability_zones.py(161)get_instance_availability_zone()
-> assert 'host' in instance
(Pdb) host
*** NameError: name 'host' is not defined
(Pdb) n
AssertionError: AssertionError()
> /usr/lib/python2.7/site-packages/nova/volume/cinder.py(296)check_availability_zone()
-> instance_az = az.get_instance_availability_zone(context, instance)
(Pdb) l
291         """Ensure that the availability zone is the same."""
292
293         # TODO(walter-boring): move this check to Cinder as part of
294         # the reserve call.
295         if instance and not CONF.cinder.cross_az_attach:
296  ->         instance_az = az.get_instance_availability_zone(context, instance)
297             if instance_az != volume['availability_zone']:
298                 msg = _("Instance %(instance)s and volume %(vol)s are not in "
299                         "the same availability_zone. Instance is in "
300                         "%(ins_zone)s. Volume is in %(vol_zone)s") % {
301                     "instance": instance['id'],
(Pdb) host
*** NameError: name 'host' is not defined
(Pdb) instance.host
*** ObjectActionError: Object action obj_load_attr failed because: attribute host not lazy-loadable
(Pdb) instance.get('host')
*** ObjectActionError: Object action obj_load_attr failed because: attribute host not lazy-loadable
(Pdb) n
--Return--
> /usr/lib/python2.7/site-packages/nova/volume/cinder.py(296)check_availability_zone()->None
-> instance_az = az.get_instance_availability_zone(context, instance)
(Pdb) l
291         """Ensure that the availability zone is the same."""
292
293         # TODO(walter-boring): move this check to Cinder as part of
294         # the reserve call.
295         if instance and not CONF.cinder.cross_az_attach:
296  ->         instance_az = az.get_instance_availability_zone(context, instance)
297             if instance_az != volume['availability_zone']:
298                 msg = _("Instance %(instance)s and volume %(vol)s are not in "
299                         "the same availability_zone. Instance is in "
300                         "%(ins_zone)s. Volume is in %(vol_zone)s") % {
301                     "instance": instance['id'],
(Pdb) n
AssertionError: AssertionError()
> /usr/lib/python2.7/site-packages/nova/volume/cinder.py(288)check_attach()
-> self.check_availability_zone(context, volume, instance)
(Pdb) l
283             raise exception.InvalidVolume(reason=msg)
284         if volume['attach_status'] == "attached":
285             msg = _("volume %s already attached") % volume['id']
286             raise exception.InvalidVolume(reason=msg)
287
288  ->     self.check_availability_zone(context, volume, instance)
289
290     def check_availability_zone(self, context, volume, instance=None):
291         """Ensure that the availability zone is the same."""
292
293         # TODO(walter-boring): move this check to Cinder as part of
(Pdb) instance_az
*** NameError: name 'instance_az' is not defined

Hope this helps; it looks to me like there might be more than one issue here?
(Pdb) instance
Instance(access_ip_v4=None,access_ip_v6=None,architecture=None,auto_disk_config=False,availability_zone='dc1',cell_name=<?>,cleaned=<?>,config_drive='',created_at=<?>,default_ephemeral_device=<?>,default_swap_device=<?>,deleted=<?>,deleted_at=<?>,device_metadata=<?>,disable_terminate=<?>,display_description='instance_dc1',display_name='instance_dc1',ec2_ids=<?>,ephemeral_gb=0,ephemeral_key_uuid=None,fault=<?>,flavor=Flavor(1),host=<?>,hostname='instance-dc1',id=<?>,image_ref='',info_cache=InstanceInfoCache,instance_type_id=1,kernel_id='',key_data=None,key_name=None,keypairs=KeyPairList,launch_index=0,launched_at=<?>,launched_on=<?>,locked=False,locked_by=<?>,memory_mb=512,metadata={},migration_context=<?>,new_flavor=None,node=<?>,numa_topology=None,old_flavor=None,os_type=None,pci_devices=<?>,pci_requests=InstancePCIRequests,power_state=0,progress=0,project_id='3b8884bbb6424123869a17dc7293fde9',ramdisk_id='',reservation_id='r-gvzmbotj',root_device_name=None,root_gb=1,security_groups=SecurityGroupList,services=<?>,shutdown_terminate=False,system_metadata={image_base_image_ref='',image_min_disk='1',image_min_ram='0'},tags=<?>,task_state='scheduling',terminated_at=<?>,updated_at=<?>,user_data=None,user_id='ad21e604433140e7a7ec38cb2e6c3d2c',uuid=bebc1084-d258-40c4-a325-67ca3513c550,vcpu_model=<?>,vcpus=1,vm_mode=None,vm_state='building')

(Pdb) volume
{'status': u'available', 'volume_type_id': None, 'volume_image_metadata': {u'container_format': u'bare', u'min_ram': u'0', u'disk_format': u'qcow2', u'image_name': u'cirros', u'image_id': u'3b17a727-623e-462d-aa37-2c1ab2e5faf3', u'checksum': u'f8ab98ff5e73ebab884d80c9dc9c7290', u'min_disk': u'0', u'size': u'13267968'}, 'display_name': u'cirvol_dc1', 'attach_time': '', 'availability_zone': u'dc1', 'bootable': True, 'created_at': u'2018-01-29T18:33:05.000000', 'multiattach': False, 'attach_status': 'detached', 'display_description': None, 'volume_metadata': {u'readonly': u'False'}, 'snapshot_id': None, 'mountpoint': '', 'id': u'c65bc60a-08f0-472e-9878-065dc77a61df', 'size': 1}
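The instance dump above shows host=<?> (the field was never set), which matches the ObjectActionError seen in the pdb session: instance.get('host') falls through to attribute access and tries to lazy-load the field, while 'host' in instance only reports whether the field is set. A minimal self-contained model of that behaviour (my own sketch, not the real oslo.versionedobjects/nova code):

class ObjectActionError(Exception):
    pass

class FakeInstance(object):
    """Stand-in for a NovaObject whose 'host' field was never set."""

    _set_fields = {'availability_zone': 'dc1'}  # 'host' deliberately missing

    def __contains__(self, name):
        # mirrors obj_attr_is_set(): True only for fields that were assigned
        return name in self._set_fields

    def get(self, name):
        # mirrors the dict-compat get(): an unset field falls through to a
        # lazy-load attempt, which fails for non-lazy-loadable fields
        if name in self._set_fields:
            return self._set_fields[name]
        raise ObjectActionError('attribute %s not lazy-loadable' % name)

instance = FakeInstance()

print('host' in instance)                              # False, the safe check
host = instance.host if 'host' in instance else None   # patched path: host is None
print(host)
try:
    instance.get('host')                               # original path
except ObjectActionError as exc:
    print(exc)                                         # attribute host not lazy-loadable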
We need the backtrace. Can you enter w (where) in pdb, so we can see how the code ended up there without host being set?
Created attachment 1393789 [details] pdb output
(In reply to Luca Miccini from comment #19)
> Created attachment 1393789 [details]
> pdb output

Thank you, that helped. So the outcome is that this check is indeed necessary, but for a different reason than in Ocata and later. I will backport it downstream.
(In reply to Sven Anderson from comment #20)
> (In reply to Luca Miccini from comment #19)
> > Created attachment 1393789 [details]
> > pdb output
>
> Thank you, that helped. So the outcome is that this check is indeed
> necessary, but for a different reason than in Ocata and later. I will
> backport it downstream.

Thanks!
Luca
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2018:1595