Bug 1516952

Summary: Cannot boot vm with sriov port after upgrade OSP11 to OSP12
Product: Red Hat OpenStack
Reporter: Eran Kuris <ekuris>
Component: openstack-nova
Assignee: Stephen Finucane <stephenfin>
Status: CLOSED ERRATA
QA Contact: Eran Kuris <ekuris>
Severity: urgent
Docs Contact:
Priority: urgent
Version: 12.0 (Pike)
CC: berrange, dasmith, eglynn, ekuris, jlibosva, jschluet, kchamart, lyarwood, mcornea, mriedem, oblaut, sbauza, sferdjao, sgordon, srevivo, stephenfin, vromanso
Target Milestone: rc
Keywords: Triaged
Target Release: 12.0 (Pike)
Hardware: x86_64
OS: Linux
Whiteboard:
Fixed In Version: openstack-nova-16.0.2-3.el7ost
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
: 1518879 (view as bug list)
Environment:
Last Closed: 2017-12-13 22:23:28 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On: 1507225, 1516634
Bug Blocks: 1518879
Attachments:
Description Flags
comput_sos none

Description Eran Kuris 2017-11-23 16:29:22 UTC
Created attachment 1358311
comput_sos

Description of problem:
After upgrading from OSP11 to OSP12 (with SR-IOV and composable roles), booting a VM with an SR-IOV port fails with an error.
The nova logs show this trace:
2017-11-23 12:09:36.028 1 INFO nova.service [req-af2ce51c-73fc-4ea4-9b67-0c71c80f031a - - - - -] Updating service version for nova-compute on compute-0.localdomain from 16 to 22
2017-11-23 12:09:36.284 1 WARNING nova.compute.monitors [req-af2ce51c-73fc-4ea4-9b67-0c71c80f031a - - - - -] Excluding nova.compute.monitors.cpu monitor virt_driver. Not in the list of enabled monitors (CONF.compute_monitors).
2017-11-23 12:09:36.942 1 WARNING nova.pci.utils [req-af2ce51c-73fc-4ea4-9b67-0c71c80f031a - - - - -] No net device was found for VF 0000:05:11.0: PciDeviceNotFoundById: PCI device 0000:05:11.0 not found
2017-11-23 12:09:37.479 1 ERROR nova.compute.manager [req-af2ce51c-73fc-4ea4-9b67-0c71c80f031a - - - - -] Error updating resources for node compute-0.localdomain.: ValueError: Field `uuid' cannot be None
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/nova/conductor/manager.py", line 123, in _object_dispatch
    return getattr(target, method)(*args, **kwargs)
  File "/usr/lib/python2.7/site-packages/oslo_versionedobjects/base.py", line 184, in wrapper
    result = fn(cls, context, *args, **kwargs)
  File "/usr/lib/python2.7/site-packages/nova/objects/pci_device.py", line 458, in get_by_compute_node
    db_dev_list)
  File "/usr/lib/python2.7/site-packages/oslo_versionedobjects/base.py", line 1121, in obj_make_list
    **extra_args)
  File "/usr/lib/python2.7/site-packages/nova/objects/pci_device.py", line 194, in _from_db_object
    setattr(pci_device, key, db_dev[key])
  File "/usr/lib/python2.7/site-packages/oslo_versionedobjects/base.py", line 72, in setter
    field_value = field.coerce(self, name, value)
  File "/usr/lib/python2.7/site-packages/oslo_versionedobjects/fields.py", line 193, in coerce
    return self._null(obj, attr)

  File "/usr/lib/python2.7/site-packages/oslo_versionedobjects/fields.py", line 171, in _null
    raise ValueError(_("Field `%s' cannot be None") % attr)

ValueError: Field `uuid' cannot be None


Version-Release number of selected component (if applicable):
OSP12
rpm -qa |grep nova 
python-nova-16.0.2-2.el7ost.noarch
python-novaclient-9.1.1-1.el7ost.noarch
openstack-nova-placement-api-16.0.2-2.el7ost.noarch
openstack-nova-console-16.0.2-2.el7ost.noarch
openstack-nova-scheduler-16.0.2-2.el7ost.noarch
puppet-nova-11.4.0-2.el7ost.noarch
openstack-nova-novncproxy-16.0.2-2.el7ost.noarch
openstack-nova-common-16.0.2-2.el7ost.noarch
openstack-nova-api-16.0.2-2.el7ost.noarch
openstack-nova-conductor-16.0.2-2.el7ost.noarch
[root@compute-0 ~]# rpm -qa |grep sriov
openstack-neutron-sriov-nic-agent-11.0.1-5.el7ost.noarch
[root@compute-0 ~]# rpm -qa |grep openvs
openstack-neutron-openvswitch-11.0.1-5.el7ost.noarch
openvswitch-ovn-host-2.7.2-4.git20170719.el7fdp.x86_64
openvswitch-2.7.2-4.git20170719.el7fdp.x86_64
openvswitch-ovn-common-2.7.2-4.git20170719.el7fdp.x86_64
openvswitch-ovn-central-2.7.2-4.git20170719.el7fdp.x86_64
python-openvswitch-2.7.2-4.git20170719.el7fdp.noarch

How reproducible:
100%

Steps to Reproduce:
1. Deploy OSP11 with SR-IOV and composable roles.
2. Upgrade to OSP12 following this guide: https://gitlab.cee.redhat.com/mcornea/OSP11-OSP12-Upgrade/blob/master/README.md
3. After the upgrade completes, try to boot a VM with an SR-IOV port.

Actual results:
Booting the VM fails with an error; the nova logs show the trace above.

Expected results:
The VM boots successfully with the SR-IOV port attached.


Additional info:
A VM with a normal port can be booted and works well.
The old instances from OSP11 are still working with full connectivity.

Comment 1 Eran Kuris 2017-11-23 16:38:14 UTC
According to the logs and debugging with Dev, a constraint is being violated in the communication between the nova-compute manager and the nova conductor: "Field `uuid' cannot be None".
It may turn out that neutron isn't returning some payload for an existing port that is supposed to match something in the database, but the stack trace is specific to nova's handling of PCI resource management.

Thanks to Brent Eagles & Marius Cornea for their help.

Comment 3 Stephen Finucane 2017-11-29 13:51:38 UTC
This looks like an issue with commit 15ac5b688bf6d91ac42ca33860d187d80289d82d in upstream nova, which added the UUID field to the PciDevice model (pci_devices table). This change contained an online migration to populate the field with a UUID but that clearly isn't being applied here. This could be an issue with upgrades or with the change itself. My money's on the latter.
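
For illustration, the kind of online migration described above can be sketched as a batched backfill. This is a minimal sketch and not nova's actual migration: the table is reduced to three columns, it runs against an in-memory SQLite database, the function name `backfill_pci_device_uuids` is hypothetical, and the second and third PCI addresses are made up for the demo. Until such a backfill has processed a row, that row still carries a NULL uuid, which is the value the traceback trips over.

```python
import sqlite3
import uuid


def backfill_pci_device_uuids(conn, max_count):
    """Give up to max_count rows with a NULL uuid a freshly generated one.

    Hypothetical helper for illustration; returns the number of rows updated.
    """
    rows = conn.execute(
        "SELECT id FROM pci_devices WHERE uuid IS NULL LIMIT ?", (max_count,)
    ).fetchall()
    for (row_id,) in rows:
        conn.execute(
            "UPDATE pci_devices SET uuid = ? WHERE id = ?",
            (str(uuid.uuid4()), row_id),
        )
    conn.commit()
    return len(rows)


# Assumed, heavily reduced stand-in for the pci_devices table.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE pci_devices (id INTEGER PRIMARY KEY, address TEXT, uuid TEXT)"
)
conn.executemany(
    "INSERT INTO pci_devices (address, uuid) VALUES (?, ?)",
    [("0000:05:11.0", None), ("0000:05:11.1", None), ("0000:05:11.2", None)],
)

# Run the backfill in batches of two until nothing is left to update.
first_batch = backfill_pci_device_uuids(conn, max_count=2)
second_batch = backfill_pci_device_uuids(conn, max_count=2)
remaining = conn.execute(
    "SELECT COUNT(*) FROM pci_devices WHERE uuid IS NULL"
).fetchone()[0]
```

Because the backfill is batched and runs while services are live, any code that loads rows in the meantime has to tolerate a NULL uuid.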

Comment 4 Lee Yarwood 2017-11-29 14:07:42 UTC
(In reply to Stephen Finucane from comment #3)
> This looks like an issue with commit
> 15ac5b688bf6d91ac42ca33860d187d80289d82d in upstream nova, which added the
> UUID field to the PciDevice model (pci_devices table). This change contained
> an online migration to populate the field with a UUID but that clearly isn't
> being applied here. This could be an issue with upgrades or with the change
> itself. My money's on the latter.

Well, either way we need controller logs ASAP from the upgraded node to confirm if the migrations were run for n-api.

In addition, I'd like more details on the roles used here. We've seen issues with the use of roles shipped within infrared, so I wouldn't be surprised if that's causing an issue here.

Comment 5 Eran Kuris 2017-11-29 14:32:14 UTC
(In reply to Lee Yarwood from comment #4)
> (In reply to Stephen Finucane from comment #3)
> > This looks like an issue with commit
> > 15ac5b688bf6d91ac42ca33860d187d80289d82d in upstream nova, which added the
> > UUID field to the PciDevice model (pci_devices table). This change contained
> > an online migration to populate the field with a UUID but that clearly isn't
> > being applied here. This could be an issue with upgrades or with the change
> > itself. My money's on the latter.
> 
> Well, either way we need controller logs ASAP from the upgraded node to
> confirm if the migrations were run for n-api.

I am deploying a new setup to reproduce the issue.

> In addition I'd like more details on the roles used here, we've seen issues
> with the use of roles shipped within infrared so I wouldn't be surprised if
> that's causing an issue here.

These are the templates that I am using; the roles are "Compute" & "Controller":

https://code.engineering.redhat.com/gerrit/gitweb?p=Neutron-QE.git;a=tree;f=BM_heat_template/ospd-11-multiple-nic-vlans-sriov-hybrid-ha;h=085c2382ab582545c193d3829b07dbcb207f196a;hb=refs/heads/master

I will let you know when I have a setup that reproduces the issue.

Comment 6 Matt Riedemann 2017-11-29 14:46:39 UTC
(8:45:18 AM) mriedem: i see the problem
(8:45:26 AM) mriedem: _from_db_object isn't handling the uuid column properly
(8:45:40 AM) mriedem: https://review.openstack.org/#/c/469147/2/nova/objects/pci_device.py@194
(8:45:45 AM) mriedem: there should be a skip in there
(8:46:13 AM) mriedem: if key not in ('extra_info', 'uuid'):
(8:46:21 AM) mriedem: stephenfin: do you have a launchpad bug yet?
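
The skip suggested above can be sketched with simplified stand-ins. This is not nova's code: `StrictField` and `FakePciDevice` are hypothetical classes that only mimic a non-nullable field raising on None, and generating a fresh UUID for rows that lack one is an assumption about what the caller does after the skip, not something stated in this bug.

```python
import uuid


class StrictField:
    """Hypothetical stand-in for a non-nullable versioned-object field."""

    def __init__(self, name):
        self.name = name

    def coerce(self, value):
        # Mirrors the failure in the traceback: None is rejected outright.
        if value is None:
            raise ValueError("Field `%s' cannot be None" % self.name)
        return value


class FakePciDevice:
    """Simplified stand-in for the PciDevice object (not the real class)."""

    fields = {
        "address": StrictField("address"),
        "uuid": StrictField("uuid"),
        "extra_info": StrictField("extra_info"),
    }

    def __init__(self):
        self.values = {}

    def set(self, key, value):
        self.values[key] = self.fields[key].coerce(value)


def from_db_object_unpatched(dev, db_dev):
    # Copies every column except extra_info, so a NULL uuid reaches coerce().
    for key in dev.fields:
        if key != "extra_info":
            dev.set(key, db_dev[key])
    return dev


def from_db_object_patched(dev, db_dev):
    # Skip uuid in the generic loop, as suggested in the chat above.
    for key in dev.fields:
        if key not in ("extra_info", "uuid"):
            dev.set(key, db_dev[key])
    # Assumption: reuse the stored uuid if present, otherwise generate one.
    dev.set("uuid", db_dev["uuid"] or str(uuid.uuid4()))
    return dev
```

With a pre-upgrade row such as `{"address": "0000:05:11.0", "uuid": None, "extra_info": "{}"}`, the unpatched loader raises the same ValueError seen in the logs, while the patched one returns a device with a populated uuid.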

Comment 7 Matt Riedemann 2017-11-29 15:53:58 UTC
https://review.openstack.org/#/c/523914/

Comment 10 Eran Kuris 2017-11-30 16:27:45 UTC
Fix verified: the upgrade from OSP11 to OSP12 (puddle 2017-11-29.2) passes.
Old instances work as expected.
I successfully booted new instances with a normal port and with SR-IOV ports (PF and VF).

rpm -qa | grep nova 
python-novaclient-9.1.1-1.el7ost.noarch
openstack-nova-compute-16.0.2-3.el7ost.noarch
openstack-nova-scheduler-16.0.2-3.el7ost.noarch
openstack-nova-conductor-16.0.2-3.el7ost.noarch
openstack-nova-common-16.0.2-3.el7ost.noarch
python-nova-16.0.2-3.el7ost.noarch
openstack-nova-placement-api-16.0.2-3.el7ost.noarch
openstack-nova-novncproxy-16.0.2-3.el7ost.noarch
openstack-nova-migration-16.0.2-3.el7ost.noarch
openstack-nova-console-16.0.2-3.el7ost.noarch
puppet-nova-11.4.0-2.el7ost.noarch
openstack-nova-api-16.0.2-3.el7ost.noarch

Comment 13 errata-xmlrpc 2017-12-13 22:23:28 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2017:3462