Bug 1828834 - Unable to start SRIOV VM because "Net device not found"
Summary: Unable to start SRIOV VM because "Net device not found"
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat OpenStack
Classification: Red Hat
Component: openstack-neutron
Version: 16.1 (Train)
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: urgent
Target Milestone: beta
Target Release: 16.1 (Train on RHEL 8.2)
Assignee: Rodolfo Alonso
QA Contact: Eduardo Olivares
URL:
Whiteboard:
Duplicates: 1834592
Depends On:
Blocks: 1666684
 
Reported: 2020-04-28 13:14 UTC by Eduardo Olivares
Modified: 2020-07-29 07:52 UTC (History)
CC List: 23 users

Fixed In Version: openstack-neutron-15.0.3-0.20200518110604.5c7c832.el8ost
Doc Type: No Doc Update
Doc Text:
Clone Of:
Clones: 1983792
Environment:
Last Closed: 2020-07-29 07:52:15 UTC
Target Upstream Version:
Embargoed:


Attachments
nova-compute logs (27.36 KB, text/plain), 2020-04-28 13:14 UTC, Eduardo Olivares
nova.conf (217.64 KB, text/plain), 2020-04-29 07:03 UTC, Eduardo Olivares


Links
Launchpad bug 1878042, last updated 2020-05-11 16:49:29 UTC
OpenStack gerrit 726918 (MERGED): Use pyroute2 for SRIOV VF commands, last updated 2021-02-12 21:54:34 UTC
Red Hat Product Errata RHBA-2020:3148, last updated 2020-07-29 07:52:43 UTC

Description Eduardo Olivares 2020-04-28 13:14:51 UTC
Created attachment 1682459 [details]
nova-compute logs

Description of problem:
An SRIOV port is created:
openstack port create --vnic-type direct --network nova sriov-port-prov-3

A VM is created using that port:
openstack server create --image rhel8-pass --flavor rhel_flavor_1ram_1vpu_10disk --port sriov-port-prov-3 --security-group sec_group vm-sriov-prov-3

vm-sriov-prov-3 remains in BUILD status for 5 minutes until it changes to ERROR. During those 5 minutes, 'virsh list' showed its state as "paused".

Find logs attached.

Comments on some logs:
2020-04-28 09:13:55.838 7 WARNING nova.pci.utils [req-8ee4cc33-142d-4d99-879e-24953135d3ed - - - - -] No net device was found for VF 0000:07:0e.4: nova.exception.PciDeviceNotFoundById: PCI device 0000:07:0e.4 not found
2020-04-28 09:13:55.855 7 ERROR nova.compute.manager [req-8ee4cc33-142d-4d99-879e-24953135d3ed - - - - -] Error updating resources for node computesriov-1.localdomain.: libvirt.libvirtError: Node device not found: no node device with matching name 'net_enp7s0f3v2_06_88_8f_32_26_bf'

What looks incoherent in these logs is that 0000:07:0e.4 corresponds to enp7s0f3v4, not to enp7s0f3v2. The 'ip link' command showed that enp7s0f3v4 was in use.

The 'virsh dumpxml' command also showed that the actual interface used function '0x4':
 <address type='pci' domain='0x0000' bus='0x07' slot='0x0e' function='0x4'/>
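The PCI-address-to-netdev mapping that looks incoherent above can be checked directly in sysfs, which is the same information nova.pci.utils relies on. A minimal sketch; the helper name and the overridable SYSFS_ROOT variable are illustrative additions, not part of the bug:

```shell
#!/bin/sh
# Map a PCI address (e.g. 0000:07:0e.4) to its VF network device name by
# reading the kernel's sysfs tree. SYSFS_ROOT defaults to the real sysfs
# path but can be overridden so the function can be exercised without
# SRIOV hardware.
SYSFS_ROOT="${SYSFS_ROOT:-/sys/bus/pci/devices}"

vf_netdev_name() {
    pci_addr="$1"
    # Each PCI device directory has a net/ subdirectory containing exactly
    # one entry: the netdev name (e.g. enp7s0f3v4).
    ls "${SYSFS_ROOT}/${pci_addr}/net" 2>/dev/null
}
```

On the affected compute node, per the log comparison above, `vf_netdev_name 0000:07:0e.4` would print enp7s0f3v4 (the name 'ip link' agreed with), not the stale enp7s0f3v2 name libvirt complained about.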


Version-Release number of selected component (if applicable):
RHOS-16.1-RHEL-8-20200424.n.0
python3-nova-20.2.1-0.20200424133447.118ee68.el8ost.noarch

How reproducible:
100%

Steps to Reproduce:
1. OSP hybrid setup installed with OVN SRIOV configuration
2. Create SRIOV VF port
3. Create VM using that port

Actual results:
VM not created successfully

Expected results:
Successful VM creation

Additional info:

Comment 2 Eduardo Olivares 2020-04-29 07:03:58 UTC
Created attachment 1682732 [details]
nova.conf

Comment 6 smooney 2020-05-08 13:47:04 UTC
Can you provide the full nova-compute logs or, better yet, a set of sosreports?
There is not enough info to triage this with what you have provided.
I need to see the output of the periodic task that reports the available devices to the PCI manager in order to determine what nova considers to be the
available set of VFs. Adding the output of

'virsh nodedev-list' would also help.

The fact that you have set trusted in the PCI whitelist is not relevant to this:
passthrough_whitelist={"devname":"enp7s0f3","physical_network":"datacentre", "trusted":"true"}
Marking an interface as trusted in the whitelist just means it can be used for trusted VFs, not that you must request a trusted VF.
It is only really important when using PCI passthrough via the flavor alias, which you are not doing, so we can ignore it.
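For reference, the whitelist line quoted above lives in nova.conf on the compute node under the [pci] section. The address-based variant in the fragment below is an alternative syntax from the nova configuration reference, included for illustration only; it is not taken from this environment:

```ini
[pci]
# Form used in this environment (quoted in the comment above): match the
# physical function by device name.
passthrough_whitelist = {"devname": "enp7s0f3", "physical_network": "datacentre", "trusted": "true"}
# Illustrative alternative: match the PF and its VFs by PCI address pattern
# instead of the interface name.
# passthrough_whitelist = {"address": "0000:07:00.*", "physical_network": "datacentre"}
```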

Comment 7 Eduardo Olivares 2020-05-08 15:42:42 UTC
Please find the commands I am using during this test here: http://pastebin.test.redhat.com/863296

Please find here the sosreports, the output of virsh nodedev-list and nova logs from both controller and compute nodes: http://file.mad.redhat.com/eolivare/BZ1828834/
You can search for the server ID "d2235b5d-0d00-4652-8b63-463366d41293" within the logs.

I executed this test on RHOS-16.1-RHEL-8-20200505.n.0

Comment 8 smooney 2020-05-08 18:37:55 UTC
[instance: d2235b5d-0d00-4652-8b63-463366d41293] Failed to allocate network(s): nova.exception.VirtualInterfaceCreateException: Virtual Interface creation failed
2020-05-08 15:31:35.831 7 ERROR nova.compute.manager [instance: d2235b5d-0d00-4652-8b63-463366d41293] Traceback (most recent call last):
2020-05-08 15:31:35.831 7 ERROR nova.compute.manager [instance: d2235b5d-0d00-4652-8b63-463366d41293]   File "/usr/lib/python3.6/site-packages/nova/virt/libvirt/driver.py", line 6351, in _create_domain_and_network
2020-05-08 15:31:35.831 7 ERROR nova.compute.manager [instance: d2235b5d-0d00-4652-8b63-463366d41293]     network_info)
2020-05-08 15:31:35.831 7 ERROR nova.compute.manager [instance: d2235b5d-0d00-4652-8b63-463366d41293]   File "/usr/lib64/python3.6/contextlib.py", line 88, in __exit__
2020-05-08 15:31:35.831 7 ERROR nova.compute.manager [instance: d2235b5d-0d00-4652-8b63-463366d41293]     next(self.gen)
2020-05-08 15:31:35.831 7 ERROR nova.compute.manager [instance: d2235b5d-0d00-4652-8b63-463366d41293]   File "/usr/lib/python3.6/site-packages/nova/compute/manager.py", line 478, in wait_for_instance_event
2020-05-08 15:31:35.831 7 ERROR nova.compute.manager [instance: d2235b5d-0d00-4652-8b63-463366d41293]     actual_event = event.wait()
2020-05-08 15:31:35.831 7 ERROR nova.compute.manager [instance: d2235b5d-0d00-4652-8b63-463366d41293]   File "/usr/lib/python3.6/site-packages/eventlet/event.py", line 125, in wait
2020-05-08 15:31:35.831 7 ERROR nova.compute.manager [instance: d2235b5d-0d00-4652-8b63-463366d41293]     result = hub.switch()
2020-05-08 15:31:35.831 7 ERROR nova.compute.manager [instance: d2235b5d-0d00-4652-8b63-463366d41293]   File "/usr/lib/python3.6/site-packages/eventlet/hubs/hub.py", line 298, in switch
2020-05-08 15:31:35.831 7 ERROR nova.compute.manager [instance: d2235b5d-0d00-4652-8b63-463366d41293]     return self.greenlet.switch()
2020-05-08 15:31:35.831 7 ERROR nova.compute.manager [instance: d2235b5d-0d00-4652-8b63-463366d41293] eventlet.timeout.Timeout: 300 seconds

So this looks like the issue is in neutron, not in nova.

The wait_for_instance_event function is waiting for the network-vif-plugged event that the neutron SRIOV NIC agent sends when it finishes configuring the port.
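The 300-second Timeout in the traceback above is this wait-for-event pattern firing. A simplified stdlib sketch; nova actually uses eventlet rather than threading, and the function and exception names here are illustrative:

```python
import threading

class VirtualInterfaceCreateError(Exception):
    """Raised when the network-vif-plugged event never arrives."""

def wait_for_vif_plugged(event: threading.Event, timeout: float = 300.0) -> None:
    # Nova's wait_for_instance_event blocks until neutron delivers the
    # network-vif-plugged external event. If the SRIOV NIC agent never
    # finishes configuring the port, the wait times out (300 s in the
    # traceback above) and the instance build fails.
    if not event.wait(timeout=timeout):
        raise VirtualInterfaceCreateError(
            "Virtual Interface creation failed: timed out waiting for "
            "network-vif-plugged")
```

In the failing runs neutron never sent the event, so the timeout fired and nova raised VirtualInterfaceCreateException, putting the VM in ERROR.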


I see the libvirt error "libvirt.libvirtError: Node device not found: no node device with matching name 'net_enp6s0f3v1_4a_cc_00_c6_c2_ed'"
earlier in the log, but that seems to be resolved when the compute agent is restarted, and I can see it correctly reported the inventory:


[{
	count = 1,
	numa_node = 0,
	product_id = '1572',
	tags = {
		dev_type = 'type-PF',
		physical_network = 'datacentre',
		trusted = 'true'
	},
	vendor_id = '8086'
}, {
	count = 5,
	numa_node = 0,
	product_id = '154c',
	tags = {
		dev_type = 'type-VF',
		parent_ifname = 'enp6s0f3',
		physical_network = 'datacentre',
		trusted = 'true'
	},
	vendor_id = '8086'
}]

For some reason I can't access the sosreports; the permissions seem to be wrong, so I can only open the nova-compute log.
Can you update the permissions on the sosreports, or provide the neutron server log and the neutron SRIOV NIC agent log?


It looks like the issue that prevented d2235b5d-0d00-4652-8b63-463366d41293 from spawning is caused by neutron, not nova.

Comment 9 Eduardo Olivares 2020-05-11 07:15:22 UTC
Sean, please try again to download sosreport files from http://file.mad.redhat.com/eolivare/BZ1828834/
I have fixed the permissions. Sorry about that.

Lucas, please take a look at Sean's previous comment. You can download the logs from the link above. If any logs are missing, I will have to reproduce it again. The environment is not available at this moment, but it will be later today or during this week.

Comment 10 Lucas Alvares Gomes 2020-05-11 09:57:41 UTC
(In reply to eolivare from comment #9)
> Sean, please try again to download sosreport files from
> http://file.mad.redhat.com/eolivare/BZ1828834/
> I have fixed the permissions. Sorry about that.
> 
> Lucas, please take a look at Sean's previous comment. You can download logs
> the link above. If any logs are missing, I will have to reproduce it again.
> The environment is not available at this moment, but will be later today or
> during this week.

Thanks Eduardo,

I will probably need to look into the environment to try to understand what's going on. From the networking-ovn perspective, I don't think we are creating any provisioning block for the SRIOV port (VNIC_DIRECT) [0]. Maybe it's something that the SRIOV agent does? (Do you happen to know, Rodolfo?)

I'll need to experiment with the environment and see what I can find. The weird part is that the VIFs were apparently working when we tested this out before, and I don't recall any changes in these parts of the code that could have affected it.

[0] https://github.com/openstack/networking-ovn/blob/fd1c0c3cffc3c827028b04e4288240355150a987/networking_ovn/ml2/mech_driver.py#L443-L445
[1] https://github.com/openstack/networking-ovn/blob/fd1c0c3cffc3c827028b04e4288240355150a987/networking_ovn/ml2/mech_driver.py#L459-L464

Comment 14 Rodolfo Alonso 2020-05-12 10:26:10 UTC
*** Bug 1834592 has been marked as a duplicate of this bug. ***

Comment 23 Eduardo Olivares 2020-06-04 09:19:43 UTC
Verified on:
RHOS-16.1-RHEL-8-20200603.n.0
openstack-neutron-15.1.1-0.20200528113513.4157550.el8ost.noarch


VMs with VF ports are created successfully.

All test_sriov_provider_network tests executed in the OVN SRIOV CI job passed:
https://rhos-qe-jenkins.rhev-ci-vms.eng.rdu2.redhat.com/job/DFG-network-networking-ovn-16.1_director-rhel-virthost-3cont_2comp-ipv4-vlan-sriov/26/artifact/tempest-results/tempest-results-neutron_qe_sriov.1.html

Comment 24 Alex McLeod 2020-06-16 12:30:03 UTC
If this bug requires doc text for errata release, please set the 'Doc Type' and provide draft text according to the template in the 'Doc Text' field. The documentation team will review, edit, and approve the text.

If this bug does not require doc text, please set the 'requires_doc_text' flag to '-'.

Comment 26 errata-xmlrpc 2020-07-29 07:52:15 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:3148

