Bug 1538092

Summary: Backport "Avoid lazy-load error when getting instance AZ" (https://review.openstack.org/#/c/485257/)
Product: Red Hat OpenStack
Reporter: Luca Miccini <lmiccini>
Component: openstack-nova
Assignee: Lee Yarwood <lyarwood>
Status: CLOSED ERRATA
QA Contact: Joe H. Rahme <jhakimra>
Severity: medium
Priority: medium
Version: 10.0 (Newton)
CC: awaugama, berrange, dasmith, dcadzow, eglynn, kchamart, lmiccini, lyarwood, sbauza, sferdjao, sgordon, srevivo, svanders, vromanso
Target Milestone: z8
Keywords: FeatureBackport, Reopened, Triaged, ZStream
Target Release: 10.0 (Newton)
Hardware: All
OS: Linux
Fixed In Version: openstack-nova-14.1.0-14.el7ost
Doc Type: If docs needed, set a value
Last Closed: 2018-05-17 15:33:13 UTC
Type: Bug
Attachments:
- nova api logs in debug
- new logs - controller
- new logs - compute
- pdb output

Description Luca Miccini 2018-01-24 13:03:14 UTC
Description of problem:

OSP10 nova is affected by https://bugs.launchpad.net/nova/+bug/1648324:

"When cross_az_attach option is set to False and you try to boot an instance with bootable volume created before, InvalidBDM error raised".

Unfortunately, the fix never landed upstream.

It would be great to have this addressed in OSP.
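
For context, the error comes out of Nova's cross-AZ attach check. A self-contained sketch of that logic (simplified from check_availability_zone() in Newton's nova/volume/cinder.py; everything here is illustrative except the cross_az_attach option and the availability-zone comparison):

    # Simplified model of the check behind the InvalidBDM error; the real
    # code is check_availability_zone() in nova/volume/cinder.py. This is
    # illustrative, not the verbatim implementation.
    class InvalidVolume(Exception):
        pass

    def check_availability_zone(cross_az_attach, instance_az, volume_az):
        # With [cinder]/cross_az_attach = False, Nova refuses to attach a
        # volume whose AZ differs from the instance's AZ. The bug is that
        # looking up the AZ of a not-yet-scheduled instance can fail
        # before this comparison even runs.
        if not cross_az_attach and instance_az != volume_az:
            raise InvalidVolume('Instance is in %s, volume is in %s'
                                % (instance_az, volume_az))

    check_availability_zone(cross_az_attach=False,
                            instance_az='dc1', volume_az='dc1')   # OK
    try:
        check_availability_zone(cross_az_attach=False,
                                instance_az='dc1', volume_az='dc2')
    except InvalidVolume as e:
        print(e)   # Instance is in dc1, volume is in dc2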

Comment 1 Sven Anderson 2018-01-26 14:53:19 UTC
https://review.openstack.org/#/c/485257/ has been abandoned. What exactly should be backported?

Comment 2 Luca Miccini 2018-01-26 15:49:14 UTC
(In reply to Sven Anderson from comment #1)
> https://review.openstack.org/#/c/485257/ has been abandoned. What exactly
> should be backported?


Hi Sven,

Sorry, maybe "backport" is not the right wording. It would be great if we could carry that patch in OSP10, since the bug prevents users from pre-provisioning boot volumes.

Do you think that would be feasible?

thanks
Luca

Comment 3 Sven Anderson 2018-01-29 15:55:38 UTC
As the nova core developer describes in the abandoned change, there is considerable doubt that this is the right fix for Newton/OSP10. They suggest first researching the actual problem. This change could help debug the problem in Newton: https://review.openstack.org/#/c/493206/. So if you can reproduce the problem, I would suggest cherry-picking this change to see what the real source of the problem is.

If we just blindly cherry-pick an abandoned fix from upstream to downstream, we ignore the reasons why it was abandoned. It might make the error disappear on the surface, but it could also have other negative side effects.

I will close this for now. Please feel free to reopen this bug if you find further evidence that this, against the assessment of the upstream developers, is still the right patch.

Comment 4 Luca Miccini 2018-01-29 16:10:19 UTC
(In reply to Sven Anderson from comment #3)
> As the nova core developer describes in the abandoned change, there is
> considerable doubt that this is the right fix for Newton/OSP10. They
> suggest first researching the actual problem. This change could help debug
> the problem in Newton: https://review.openstack.org/#/c/493206/. So if you
> can reproduce the problem, I would suggest cherry-picking this change to
> see what the real source of the problem is.
> 
> If we just blindly cherry-pick an abandoned fix from upstream to
> downstream, we ignore the reasons why it was abandoned. It might make the
> error disappear on the surface, but it could also have other negative side
> effects.
> 
> I will close this for now. Please feel free to reopen this bug if you find
> further evidence that this, against the assessment of the upstream
> developers, is still the right patch.

Hi Sven,

I understand and share your concern. The reason I raised this BZ and pointed to that specific patch is that the customer experienced the same symptoms as in https://bugs.launchpad.net/nova/+bug/1648324, and https://review.openstack.org/#/c/485257/ fixed the issue for them.

I'll follow your suggestion and try to cherry-pick 493206; hopefully this will provide better visibility into whether this is indeed the right fix.

Cheers
Luca

Comment 5 Luca Miccini 2018-01-29 18:44:38 UTC
Looks like https://review.openstack.org/#/c/493206/ is already included in openstack-nova-api-14.0.8-5.el7ost.noarch:


        except Exception as ex:
            LOG.exception(_LE('Instance failed block device setup'),
                          instance=instance)
            # InvalidBDM will eventually result in a BuildAbortException when
            # booting from volume, and will be recorded as an instance fault.
            # Maintain the original exception message which most likely has
            # useful details which the standard InvalidBDM error message lacks.
            raise exception.InvalidBDM(six.text_type(ex))



[stack@undercloud-10 ~]$ openstack volume create --availability-zone dc1 --size 1 --image 3b17a727-623e-462d-aa37-2c1ab2e5faf3 cirvol_dc1
+---------------------+--------------------------------------+
| Field               | Value                                |
+---------------------+--------------------------------------+
| attachments         | []                                   |
| availability_zone   | dc1                                  |
| bootable            | false                                |
| consistencygroup_id | None                                 |
| created_at          | 2018-01-29T18:33:05.947236           |
| description         | None                                 |
| encrypted           | False                                |
| id                  | c65bc60a-08f0-472e-9878-065dc77a61df |
| migration_status    | None                                 |
| multiattach         | False                                |
| name                | cirvol_dc1                           |
| properties          |                                      |
| replication_status  | disabled                             |
| size                | 1                                    |
| snapshot_id         | None                                 |
| source_volid        | None                                 |
| status              | creating                             |
| type                | None                                 |
| updated_at          | None                                 |
| user_id             | ad21e604433140e7a7ec38cb2e6c3d2c     |
+---------------------+--------------------------------------+

[stack@undercloud-10 ~]$ cinder list
+--------------------------------------+-----------+------------+------+-------------+----------+-------------+
| ID                                   | Status    | Name       | Size | Volume Type | Bootable | Attached to |
+--------------------------------------+-----------+------------+------+-------------+----------+-------------+
| c65bc60a-08f0-472e-9878-065dc77a61df | available | cirvol_dc1 | 1    | -           | true     |             |
+--------------------------------------+-----------+------------+------+-------------+----------+-------------+

With the original code in place (host = instance.get('host')):

def get_instance_availability_zone(context, instance):
    """Return availability zone of specified instance."""
    host = instance.get('host')
    #host = instance.host if 'host' in instance else None
    if not host:
        # Likely hasn't reached a viable compute node yet so give back the
        # desired availability_zone in the instance record if the boot request
        # specified one.
        az = instance.get('availability_zone')
        return az

[stack@undercloud-10 ~]$ openstack server create --flavor c2aac3b3-b20e-4c1f-9599-d4a598a2e9db --availability-zone dc1 --volume c65bc60a-08f0-472e-9878-065dc77a61df --block-device source=volume,id=c65bc60a-08f0-472e-9878-065dc77a61df,dest=volume,size=1,shutdown=preserve,bootindex=0 instance_dc1
Block Device Mapping is Invalid: failed to get volume c65bc60a-08f0-472e-9878-065dc77a61df. (HTTP 400) (Request-ID: req-1fb4aa99-02df-4089-bc11-f3ba5567a78b)


With the patch from https://review.openstack.org/#/c/485257/ applied (host = instance.host if 'host' in instance else None):

def get_instance_availability_zone(context, instance):
    """Return availability zone of specified instance."""
    #host = instance.get('host')
    host = instance.host if 'host' in instance else None
    if not host:
        # Likely hasn't reached a viable compute node yet so give back the
        # desired availability_zone in the instance record if the boot request
        # specified one.
        az = instance.get('availability_zone')
        return az

$ openstack server create --flavor c2aac3b3-b20e-4c1f-9599-d4a598a2e9db --availability-zone dc1 --volume c65bc60a-08f0-472e-9878-065dc77a61df --block-device source=volume,id=c65bc60a-08f0-472e-9878-065dc77a61df,dest=volume,size=1,shutdown=preserve,bootindex=0 instance_dc1


+--------------------------------------+----------------------------------------------------+
| Field                                | Value                                              |
+--------------------------------------+----------------------------------------------------+
| OS-DCF:diskConfig                    | MANUAL                                             |
| OS-EXT-AZ:availability_zone          | dc1                                                |
| OS-EXT-SRV-ATTR:host                 | None                                               |
| OS-EXT-SRV-ATTR:hypervisor_hostname  | None                                               |
| OS-EXT-SRV-ATTR:instance_name        |                                                    |
| OS-EXT-STS:power_state               | NOSTATE                                            |
| OS-EXT-STS:task_state                | scheduling                                         |
| OS-EXT-STS:vm_state                  | building                                           |
| OS-SRV-USG:launched_at               | None                                               |
| OS-SRV-USG:terminated_at             | None                                               |
| accessIPv4                           |                                                    |
| accessIPv6                           |                                                    |
| addresses                            |                                                    |
| adminPass                            | mhLvGGrmMMx4                                       |
| config_drive                         |                                                    |
| created                              | 2018-01-29T18:38:25Z                               |
| flavor                               | m1.tiny (c2aac3b3-b20e-4c1f-9599-d4a598a2e9db)     |
| hostId                               |                                                    |
| id                                   | baa485f1-2de7-4451-91d4-ea20bcc47c3c               |
| image                                |                                                    |
| key_name                             | None                                               |
| name                                 | instance_dc1                                       |
| os-extended-volumes:volumes_attached | [{u'id': u'c65bc60a-08f0-472e-9878-065dc77a61df'}] |
| progress                             | 0                                                  |
| project_id                           | 3b8884bbb6424123869a17dc7293fde9                   |
| properties                           |                                                    |
| security_groups                      | [{u'name': u'default'}]                            |
| status                               | BUILD                                              |
| updated                              | 2018-01-29T18:38:25Z                               |
| user_id                              | ad21e604433140e7a7ec38cb2e6c3d2c                   |
+--------------------------------------+----------------------------------------------------+

[stack@undercloud-10 ~]$ nova list
+--------------------------------------+--------------+--------+----------------------+-------------+----------------------+
| ID                                   | Name         | Status | Task State           | Power State | Networks             |
+--------------------------------------+--------------+--------+----------------------+-------------+----------------------+
| baa485f1-2de7-4451-91d4-ea20bcc47c3c | instance_dc1 | BUILD  | block_device_mapping | NOSTATE     | internal=192.168.0.7 |
+--------------------------------------+--------------+--------+----------------------+-------------+----------------------+

[stack@undercloud-10 ~]$ nova list
+--------------------------------------+--------------+--------+------------+-------------+----------------------+
| ID                                   | Name         | Status | Task State | Power State | Networks             |
+--------------------------------------+--------------+--------+------------+-------------+----------------------+
| baa485f1-2de7-4451-91d4-ea20bcc47c3c | instance_dc1 | ACTIVE | -          | Running     | internal=192.168.0.7 |
+--------------------------------------+--------------+--------+------------+-------------+----------------------+


It looks like https://review.openstack.org/#/c/485257/ allows the volume to be retrieved and the instance to spawn.

Do you want me to test anything else? There is not much in the logs, but I can attach them if you like.

Comment 6 Sven Anderson 2018-01-30 13:52:52 UTC
Do the logs show a backtrace of that exception? We can try to compare it with Matt Riedemann's explanation of how the code logic should work.

Actually, his explanation is not consistent: he claims this line checks that instance.host is set:
https://github.com/openstack/nova/blob/stable/newton/nova/compute/resource_tracker.py#L125

But that code actually makes sure it is _not_ set.

Comment 8 Luca Miccini 2018-01-30 14:15:30 UTC
(In reply to Sven Anderson from comment #6)
> Do the logs show a backtrace of that exception? We can try to compare it
> with Matt Riedemann's explanation of how the code logic should work.
> 
> Actually, his explanation is not consistent: he claims this line checks
> that instance.host is set:
> https://github.com/openstack/nova/blob/stable/newton/nova/compute/resource_tracker.py#L125
> 
> But that code actually makes sure it is _not_ set.

I've attached the log in debug mode. Unfortunately there is nothing except a 404 as far as I can see (2018-01-29 18:42:13.788).

Comment 9 Luca Miccini 2018-01-30 14:16:09 UTC
Created attachment 1388418 [details]
nova api logs in debug

Comment 11 Sven Anderson 2018-01-30 14:56:19 UTC
If I'm not mistaken, this is compute code. Can you provide debug compute logs?

Comment 12 Luca Miccini 2018-01-30 15:04:21 UTC
(In reply to Sven Anderson from comment #11)
> If I'm not mistaken, this is compute code. Can you provide debug compute
> logs?

[stack@undercloud-10 ~]$ cinder list
+--------------------------------------+-----------+------------+------+-------------+----------+-------------+
| ID                                   | Status    | Name       | Size | Volume Type | Bootable | Attached to |
+--------------------------------------+-----------+------------+------+-------------+----------+-------------+
| c65bc60a-08f0-472e-9878-065dc77a61df | available | cirvol_dc1 | 1    | -           | true     |             |
+--------------------------------------+-----------+------------+------+-------------+----------+-------------+
                                                                                                                                     
[stack@undercloud-10 ~]$ openstack server create --flavor c2aac3b3-b20e-4c1f-9599-d4a598a2e9db --availability-zone dc1 --volume c65bc60a-08f0-472e-9878-065dc77a61df --block-device source=volume,id=c65bc60a-08f0-472e-9878-065dc77a61df,dest=volume,size=1,shutdown=preserve,bootindex=0 instance_dc1
Block Device Mapping is Invalid: failed to get volume c65bc60a-08f0-472e-9878-065dc77a61df. (HTTP 400) (Request-ID: req-8bcc1d00-dd7c-44aa-a349-69324def0677)

I am attaching logs from both controller and compute.

Comment 13 Luca Miccini 2018-01-30 15:05:33 UTC
Created attachment 1388472 [details]
new logs - controller

Comment 14 Luca Miccini 2018-01-30 15:06:01 UTC
Created attachment 1388473 [details]
new logs  - compute

Comment 15 Sven Anderson 2018-01-30 18:41:03 UTC
So that obviously never hits compute code, and therefore we don't see an exception. Could you please add this assert line at the beginning of the get_instance_availability_zone function (before the host = ... line), so we get an actual traceback showing where this is coming from?

def get_instance_availability_zone(context, instance):
    """Return availability zone of specified instance."""
    assert 'host' in instance
    #host = instance.get('host')
    host = instance.host if 'host' in instance else None
    ...
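
If the assert gets swallowed before anything reaches the logs, an alternative is to log the stack instead of raising, so the request still completes. A sketch (stdlib only; in the real module you would reuse its existing oslo.log LOG rather than this local logger, and log_call_path is a hypothetical helper name):

import logging
import traceback

LOG = logging.getLogger(__name__)

def log_call_path():
    # Drop this at the top of get_instance_availability_zone to record
    # where the call came from without changing the request's outcome.
    LOG.debug('get_instance_availability_zone called from:\n%s',
              ''.join(traceback.format_stack()))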

Thanks!

Comment 16 Luca Miccini 2018-01-31 07:10:37 UTC
(In reply to Sven Anderson from comment #15)
> So that obviously never hits compute code, and therefore we don't see an
> exception. Could you please add this assert line at the beginning of the
> get_instance_availability_zone function (before the host = ... line), so we
> get an actual traceback showing where this is coming from?
> 
> def get_instance_availability_zone(context, instance):
>     """Return availability zone of specified instance."""
>     assert 'host' in instance
>     #host = instance.get('host')
>     host = instance.host if 'host' in instance else None
>     ...
> 
> Thanks!

I've added the assert but it didn't generate any trace, so I ran nova-api under pdb:

(Pdb) 
(Pdb) l
156  
157     def get_instance_availability_zone(context, instance):
158         """Return availability zone of specified instance."""
159  
160         import pdb; pdb.set_trace()
161  ->     assert 'host' in instance
162         host = instance.get('host')
163         #host = instance.host if 'host' in instance else None
164         if not host:
165             # Likely hasn't reached a viable compute node yet so give back the
166             # desired availability_zone in the instance record if the boot request
(Pdb) n
AssertionError: AssertionError()
> /usr/lib/python2.7/site-packages/nova/availability_zones.py(161)get_instance_availability_zone()
-> assert 'host' in instance
(Pdb) host
*** NameError: name 'host' is not defined

(Pdb) n
AssertionError: AssertionError()
> /usr/lib/python2.7/site-packages/nova/volume/cinder.py(296)check_availability_zone()
-> instance_az = az.get_instance_availability_zone(context, instance)
(Pdb) l
291             """Ensure that the availability zone is the same."""
292  
293             # TODO(walter-boring): move this check to Cinder as part of
294             # the reserve call.
295             if instance and not CONF.cinder.cross_az_attach:
296  ->             instance_az = az.get_instance_availability_zone(context, instance)
297                 if instance_az != volume['availability_zone']:
298                     msg = _("Instance %(instance)s and volume %(vol)s are not in "
299                             "the same availability_zone. Instance is in "
300                             "%(ins_zone)s. Volume is in %(vol_zone)s") % {
301                                 "instance": instance['id'],
(Pdb) host
*** NameError: name 'host' is not defined
(Pdb) instance.host
*** ObjectActionError: Object action obj_load_attr failed because: attribute host not lazy-loadable
(Pdb) instance.get('host')
*** ObjectActionError: Object action obj_load_attr failed because: attribute host not lazy-loadable
(Pdb) n
--Return--
> /usr/lib/python2.7/site-packages/nova/volume/cinder.py(296)check_availability_zone()->None
-> instance_az = az.get_instance_availability_zone(context, instance)
(Pdb) l
291             """Ensure that the availability zone is the same."""
292  
293             # TODO(walter-boring): move this check to Cinder as part of
294             # the reserve call.
295             if instance and not CONF.cinder.cross_az_attach:
296  ->             instance_az = az.get_instance_availability_zone(context, instance)
297                 if instance_az != volume['availability_zone']:
298                     msg = _("Instance %(instance)s and volume %(vol)s are not in "
299                             "the same availability_zone. Instance is in "
300                             "%(ins_zone)s. Volume is in %(vol_zone)s") % {
301                                 "instance": instance['id'],
(Pdb) n
AssertionError: AssertionError()
> /usr/lib/python2.7/site-packages/nova/volume/cinder.py(288)check_attach()
-> self.check_availability_zone(context, volume, instance)
(Pdb) l
283                 raise exception.InvalidVolume(reason=msg)
284             if volume['attach_status'] == "attached":
285                 msg = _("volume %s already attached") % volume['id']
286                 raise exception.InvalidVolume(reason=msg)
287  
288  ->         self.check_availability_zone(context, volume, instance)
289  
290         def check_availability_zone(self, context, volume, instance=None):
291             """Ensure that the availability zone is the same."""
292  
293             # TODO(walter-boring): move this check to Cinder as part of
(Pdb) instance_az
*** NameError: name 'instance_az' is not defined

Hope this helps. It looks to me like there might be more than one issue here.

Comment 17 Luca Miccini 2018-01-31 09:53:29 UTC
(Pdb) instance
Instance(access_ip_v4=None,access_ip_v6=None,architecture=None,auto_disk_config=False,availability_zone='dc1',cell_name=<?>,cleaned=<?>,config_drive='',created_at=<?>,default_ephemeral_device=<?>,default_swap_device=<?>,deleted=<?>,deleted_at=<?>,device_metadata=<?>,disable_terminate=<?>,display_description='instance_dc1',display_name='instance_dc1',ec2_ids=<?>,ephemeral_gb=0,ephemeral_key_uuid=None,fault=<?>,flavor=Flavor(1),host=<?>,hostname='instance-dc1',id=<?>,image_ref='',info_cache=InstanceInfoCache,instance_type_id=1,kernel_id='',key_data=None,key_name=None,keypairs=KeyPairList,launch_index=0,launched_at=<?>,launched_on=<?>,locked=False,locked_by=<?>,memory_mb=512,metadata={},migration_context=<?>,new_flavor=None,node=<?>,numa_topology=None,old_flavor=None,os_type=None,pci_devices=<?>,pci_requests=InstancePCIRequests,power_state=0,progress=0,project_id='3b8884bbb6424123869a17dc7293fde9',ramdisk_id='',reservation_id='r-gvzmbotj',root_device_name=None,root_gb=1,security_groups=SecurityGroupList,services=<?>,shutdown_terminate=False,system_metadata={image_base_image_ref='',image_min_disk='1',image_min_ram='0'},tags=<?>,task_state='scheduling',terminated_at=<?>,updated_at=<?>,user_data=None,user_id='ad21e604433140e7a7ec38cb2e6c3d2c',uuid=bebc1084-d258-40c4-a325-67ca3513c550,vcpu_model=<?>,vcpus=1,vm_mode=None,vm_state='building')

(Pdb) volume
{'status': u'available', 'volume_type_id': None, 'volume_image_metadata': {u'container_format': u'bare', u'min_ram': u'0', u'disk_format': u'qcow2', u'image_name': u'cirros', u'image_id': u'3b17a727-623e-462d-aa37-2c1ab2e5faf3', u'checksum': u'f8ab98ff5e73ebab884d80c9dc9c7290', u'min_disk': u'0', u'size': u'13267968'}, 'display_name': u'cirvol_dc1', 'attach_time': '', 'availability_zone': u'dc1', 'bootable': True, 'created_at': u'2018-01-29T18:33:05.000000', 'multiattach': False, 'attach_status': 'detached', 'display_description': None, 'volume_metadata': {u'readonly': u'False'}, 'snapshot_id': None, 'mountpoint': '', 'id': u'c65bc60a-08f0-472e-9878-065dc77a61df', 'size': 1}
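
This lines up: the dump shows host=<?> (unset), and in the pdb session both instance.host and instance.get('host') die with "attribute host not lazy-loadable". On Nova's versioned objects, get() with no default falls through to getattr() and triggers a lazy load, while 'host' in instance only asks whether the attribute is already set, which is why the 485257 variant survives. A self-contained model of those semantics (simplified; FakeInstance and _sentinel are purely illustrative, the real behavior lives in the dict-compat layer of oslo.versionedobjects and nova/objects/instance.py):

_sentinel = object()

class ObjectActionError(Exception):
    pass

class FakeInstance(object):
    """Toy stand-in for nova.objects.Instance; illustrative only."""

    def __init__(self, **set_fields):
        self._set = dict(set_fields)  # only attributes already populated

    def obj_attr_is_set(self, name):
        return name in self._set

    def __contains__(self, name):  # backs: 'host' in instance
        return self.obj_attr_is_set(name)

    def __getattr__(self, name):   # backs: instance.host
        if name.startswith('_'):
            raise AttributeError(name)
        if name in self._set:
            return self._set[name]
        # Unset field -> attempted lazy load; 'host' is not
        # lazy-loadable, so Nova raises instead.
        raise ObjectActionError('attribute %s not lazy-loadable' % name)

    def get(self, name, default=_sentinel):
        # Mirrors the dict-compat get(): only an explicit default
        # short-circuits, a bare get() falls through to getattr() and
        # hits the lazy-load path.
        if default is not _sentinel and not self.obj_attr_is_set(name):
            return default
        return getattr(self, name)

inst = FakeInstance(availability_zone='dc1')   # host=<?>, like the dump above
print('host' in inst)                          # False
print(inst.host if 'host' in inst else None)   # patched lookup -> None
try:
    inst.get('host')                           # original lookup
except ObjectActionError as e:
    print(e)                                   # attribute host not lazy-loadable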

Comment 18 Sven Anderson 2018-02-09 14:55:38 UTC
We need the backtrace. Can you enter w (where) in pdb, so we can see how the code ended up there without host being set?

Comment 19 Luca Miccini 2018-02-09 15:36:25 UTC
Created attachment 1393789 [details]
pdb output

Comment 20 Sven Anderson 2018-02-09 16:45:04 UTC
(In reply to Luca Miccini from comment #19)
> Created attachment 1393789 [details]
> pdb output

Thank you, that helped. So the outcome is that this check is indeed necessary, but for a different reason than in Ocata and later. I will backport it downstream.
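
For the record, the change being backported is the one-liner exercised in comment 5, sketched here in diff form against nova/availability_zones.py (patchset details may differ):

--- a/nova/availability_zones.py
+++ b/nova/availability_zones.py
 def get_instance_availability_zone(context, instance):
     """Return availability zone of specified instance."""
-    host = instance.get('host')
+    host = instance.host if 'host' in instance else None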

Comment 21 Luca Miccini 2018-02-09 16:56:18 UTC
(In reply to Sven Anderson from comment #20)
> (In reply to Luca Miccini from comment #19)
> > Created attachment 1393789 [details]
> > pdb output
> 
> Thank you, that helped. So the outcome is that this check is indeed
> necessary, but for a different reason than in Ocata and later. I will
> backport it downstream.

thanks!
Luca

Comment 30 errata-xmlrpc 2018-05-17 15:33:13 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2018:1595