Created attachment 1682459 [details]
nova-compute logs

Description of problem:

An SRIOV port is created:
openstack port create --vnic-type direct --network nova sriov-port-prov-3

A VM is created using that port:
openstack server create --image rhel8-pass --flavor rhel_flavor_1ram_1vpu_10disk --port sriov-port-prov-3 --security-group sec_group vm-sriov-prov-3

vm-sriov-prov-3 remains in status BUILD for 5 minutes until it changes to status ERROR. During those 5 minutes, virsh list showed its status as "paused".

Find logs attached. Comments on some logs:

2020-04-28 09:13:55.838 7 WARNING nova.pci.utils [req-8ee4cc33-142d-4d99-879e-24953135d3ed - - - - -] No net device was found for VF 0000:07:0e.4: nova.exception.PciDeviceNotFoundById: PCI device 0000:07:0e.4 not found
2020-04-28 09:13:55.855 7 ERROR nova.compute.manager [req-8ee4cc33-142d-4d99-879e-24953135d3ed - - - - -] Error updating resources for node computesriov-1.localdomain.: libvirt.libvirtError: Node device not found: no node device with matching name 'net_enp7s0f3v2_06_88_8f_32_26_bf'

What looks incoherent in these logs is that 0000:07:0e.4 corresponds to enp7s0f3v4, not to enp7s0f3v2. Command 'ip link' showed that enp7s0f3v4 was in use. Command 'virsh dumpxml' also showed that the actual interface was '04':

<address type='pci' domain='0x0000' bus='0x07' slot='0x0e' function='0x4'/>

Version-Release number of selected component (if applicable):
RHOS-16.1-RHEL-8-20200424.n.0
python3-nova-20.2.1-0.20200424133447.118ee68.el8ost.noarch

How reproducible:
100%

Steps to Reproduce:
1. OSP hybrid setup installed with OVN SRIOV configuration
2. Create SRIOV VF port
3. Create VM using that port

Actual results:
VM not created successfully

Expected results:
Successful VM creation

Additional info:
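A quick way to see the mismatch described above is to compare the function number in the PCI address against the VF suffix of the netdev name. The helpers below are illustrative only (not nova code); on this i40e NIC the VF index happens to line up with the PCI function number, though in general the VF-index-to-PCI-function mapping depends on the PF's SR-IOV offset and stride:

```python
import re

def pci_function(addr: str) -> int:
    """Parse a PCI address like '0000:07:0e.4' and return the function number."""
    m = re.fullmatch(r"([0-9a-f]{4}):([0-9a-f]{2}):([0-9a-f]{2})\.([0-7])", addr)
    if m is None:
        raise ValueError("not a PCI address: %s" % addr)
    return int(m.group(4), 16)

def vf_index(netdev: str) -> int:
    """Extract the VF index from a VF netdev name like 'enp7s0f3v4'."""
    m = re.fullmatch(r".*v(\d+)", netdev)
    if m is None:
        raise ValueError("not a VF netdev name: %s" % netdev)
    return int(m.group(1))

# The log names enp7s0f3v2 for 0000:07:0e.4, but the function number is 4,
# matching enp7s0f3v4 (the interface 'ip link' and 'virsh dumpxml' showed):
print(pci_function("0000:07:0e.4"))   # 4
print(vf_index("enp7s0f3v2"))         # 2
print(vf_index("enp7s0f3v4"))         # 4
```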
Created attachment 1682732 [details] nova.conf
Can you provide the full nova compute logs or, better yet, a set of sosreports? There is not enough info to triage this in what you have provided. I need to see the output of the periodic task that reports the available devices to the PCI manager to determine what nova considers to be the available set of VFs. Adding the output of virsh nodedev-list would also help.

The fact that you have set trusted in the PCI whitelist is not relevant to this:

passthrough_whitelist={"devname":"enp7s0f3","physical_network":"datacentre", "trusted":"true"}

Marking an interface as trusted in the whitelist just means it can be used for trusted VFs, not that you must request a trusted VF. It is only really important for using PCI passthrough via the flavor alias, which you are not doing, so we can ignore it.
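For reference, the distinction drawn above can be sketched as a config fragment (hedged: the whitelist line is from the attached nova.conf; the section placement assumes the standard [pci] group, and requesting a trusted VF is done on the port, not in the whitelist):

```ini
# nova.conf on the compute node.
# "trusted":"true" only *allows* trusted VFs to be allocated from this PF;
# a trusted VF must still be requested explicitly, e.g. via the port's
# binding profile: openstack port set --binding-profile trusted=true <port>
[pci]
passthrough_whitelist = {"devname": "enp7s0f3", "physical_network": "datacentre", "trusted": "true"}
```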
Please find the commands I am using during this test here: http://pastebin.test.redhat.com/863296

Please find here the sosreports, the output of virsh nodedev-list and the nova logs from both controller and compute nodes: http://file.mad.redhat.com/eolivare/BZ1828834/

You can search for the server ID "d2235b5d-0d00-4652-8b63-463366d41293" within the logs.

I executed this test on RHOS-16.1-RHEL-8-20200505.n.0
[instance: d2235b5d-0d00-4652-8b63-463366d41293] Failed to allocate network(s): nova.exception.VirtualInterfaceCreateException: Virtual Interface creation failed
2020-05-08 15:31:35.831 7 ERROR nova.compute.manager [instance: d2235b5d-0d00-4652-8b63-463366d41293] Traceback (most recent call last):
2020-05-08 15:31:35.831 7 ERROR nova.compute.manager [instance: d2235b5d-0d00-4652-8b63-463366d41293]   File "/usr/lib/python3.6/site-packages/nova/virt/libvirt/driver.py", line 6351, in _create_domain_and_network
2020-05-08 15:31:35.831 7 ERROR nova.compute.manager [instance: d2235b5d-0d00-4652-8b63-463366d41293]     network_info)
2020-05-08 15:31:35.831 7 ERROR nova.compute.manager [instance: d2235b5d-0d00-4652-8b63-463366d41293]   File "/usr/lib64/python3.6/contextlib.py", line 88, in __exit__
2020-05-08 15:31:35.831 7 ERROR nova.compute.manager [instance: d2235b5d-0d00-4652-8b63-463366d41293]     next(self.gen)
2020-05-08 15:31:35.831 7 ERROR nova.compute.manager [instance: d2235b5d-0d00-4652-8b63-463366d41293]   File "/usr/lib/python3.6/site-packages/nova/compute/manager.py", line 478, in wait_for_instance_event
2020-05-08 15:31:35.831 7 ERROR nova.compute.manager [instance: d2235b5d-0d00-4652-8b63-463366d41293]     actual_event = event.wait()
2020-05-08 15:31:35.831 7 ERROR nova.compute.manager [instance: d2235b5d-0d00-4652-8b63-463366d41293]   File "/usr/lib/python3.6/site-packages/eventlet/event.py", line 125, in wait
2020-05-08 15:31:35.831 7 ERROR nova.compute.manager [instance: d2235b5d-0d00-4652-8b63-463366d41293]     result = hub.switch()
2020-05-08 15:31:35.831 7 ERROR nova.compute.manager [instance: d2235b5d-0d00-4652-8b63-463366d41293]   File "/usr/lib/python3.6/site-packages/eventlet/hubs/hub.py", line 298, in switch
2020-05-08 15:31:35.831 7 ERROR nova.compute.manager [instance: d2235b5d-0d00-4652-8b63-463366d41293]     return self.greenlet.switch()
2020-05-08 15:31:35.831 7 ERROR nova.compute.manager [instance: d2235b5d-0d00-4652-8b63-463366d41293] eventlet.timeout.Timeout: 300 seconds

So this looks like the issue is in neutron, not in nova: the wait_for_instance_event function is waiting for the network-vif-plugged event that the neutron SRIOV NIC agent sends when it finishes configuring the port.

I see the libvirt error "libvirt.libvirtError: Node device not found: no node device with matching name 'net_enp6s0f3v1_4a_cc_00_c6_c2_ed'" earlier in the log, but that seems to be resolved when the compute agent is restarted, and I can see it correctly reported the inventory:

[{count = 1, numa_node = 0, product_id = '1572', tags = {dev_type = 'type-PF', physical_network = 'datacentre', trusted = 'true'}, vendor_id = '8086'},
 {count = 5, numa_node = 0, product_id = '154c', tags = {dev_type = 'type-VF', parent_ifname = 'enp6s0f3', physical_network = 'datacentre', trusted = 'true'}, vendor_id = '8086'}]

For some reason I can't access the sosreports (the permissions seem to be wrong), so I can only open the nova compute log. Can you update the permissions on the sosreports, or provide the neutron server log and the neutron SRIOV NIC agent log? It looks like the issue that prevented d2235b5d-0d00-4652-8b63-463366d41293 from spawning is caused by neutron, not nova.
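The failure mode is simply this wait pattern timing out: nova arms a waiter for the network-vif-plugged external event and blocks (300 seconds by default, [DEFAULT]/vif_plugging_timeout) until Neutron sends it. A minimal sketch of the pattern using threading.Event instead of eventlet (names are illustrative, not nova's actual code):

```python
import threading

class InstanceEventTimeout(Exception):
    """Stand-in for the eventlet.timeout.Timeout seen in the traceback."""

def wait_for_instance_event(event: threading.Event, timeout: float) -> None:
    """Block until the external event arrives, or raise after `timeout` seconds.

    In nova the event is delivered by Neutron via the external-events API;
    here it is just set by another thread.
    """
    if not event.wait(timeout):
        raise InstanceEventTimeout("timed out waiting for network-vif-plugged")

# Happy path: "neutron" plugs the VIF before the deadline.
vif_plugged = threading.Event()
threading.Timer(0.05, vif_plugged.set).start()
wait_for_instance_event(vif_plugged, timeout=1.0)
print("vif plugged")

# Failure path seen in this bug: the event never arrives.
try:
    wait_for_instance_event(threading.Event(), timeout=0.1)
except InstanceEventTimeout as exc:
    print(exc)   # timed out waiting for network-vif-plugged
```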
Sean, please try again to download the sosreport files from http://file.mad.redhat.com/eolivare/BZ1828834/
I have fixed the permissions. Sorry about that.

Lucas, please take a look at Sean's previous comment. You can download the logs from the link above. If any logs are missing, I will have to reproduce it again. The environment is not available at this moment, but will be later today or during this week.
(In reply to eolivare from comment #9)
> Sean, please try again to download sosreport files from
> http://file.mad.redhat.com/eolivare/BZ1828834/
> I have fixed the permissions. Sorry about that.
>
> Lucas, please take a look at Sean's previous comment. You can download logs
> the link above. If any logs are missing, I will have to reproduce it again.
> The environment is not available at this moment, but will be later today or
> during this week.

Thanks Eduardo, I will probably need to look into the environment to try to understand what's going on. From the networking-ovn perspective, I don't think we are creating any provisioning block for the SRIOV port (VNIC_DIRECT) [0]. Maybe it's something that the SRIOV agent does? (Do you happen to know, Rodolfo?) I'll need to experiment with the environment and see what I can find.

The weird part is that apparently VIFs were working before when we tested it out; I don't recall any changes in these parts of the code that could have affected it.

[0] https://github.com/openstack/networking-ovn/blob/fd1c0c3cffc3c827028b04e4288240355150a987/networking_ovn/ml2/mech_driver.py#L443-L445
[1] https://github.com/openstack/networking-ovn/blob/fd1c0c3cffc3c827028b04e4288240355150a987/networking_ovn/ml2/mech_driver.py#L459-L464
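For anyone unfamiliar with the mechanism discussed above: ML2 provisioning blocks keep a port out of ACTIVE until every interested component (DHCP agent, L2 agent, etc.) clears its block, and the transition to ACTIVE is what triggers the network-vif-plugged notification to nova. A toy model of that gating logic (illustrative only, not Neutron's implementation):

```python
# Toy model of ML2 provisioning blocks (not Neutron's real code): a port
# only goes ACTIVE, and only then would network-vif-plugged be sent to
# nova, once every registered block has been cleared.
class Port:
    def __init__(self, port_id):
        self.id = port_id
        self.status = "DOWN"
        self.blocks = set()
        self.events = []

    def add_provisioning_block(self, entity):
        self.blocks.add(entity)

    def provisioning_complete(self, entity):
        self.blocks.discard(entity)
        if not self.blocks:
            self.status = "ACTIVE"
            self.events.append("network-vif-plugged")

port = Port("d2235b5d-0d00-4652-8b63-463366d41293")
port.add_provisioning_block("L2")          # e.g. the SR-IOV NIC agent
port.provisioning_complete("L2")
print(port.status, port.events)            # ACTIVE ['network-vif-plugged']

# The symptom in this bug: if nothing ever drives the port to ACTIVE,
# the event is never sent and nova times out after 300 seconds.
```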
*** Bug 1834592 has been marked as a duplicate of this bug. ***
Verified on:
RHOS-16.1-RHEL-8-20200603.n.0
openstack-neutron-15.1.1-0.20200528113513.4157550.el8ost.noarch

VMs with VF ports are created successfully. All test_sriov_provider_network tests executed on the OVN SRIOV CI job passed:
https://rhos-qe-jenkins.rhev-ci-vms.eng.rdu2.redhat.com/job/DFG-network-networking-ovn-16.1_director-rhel-virthost-3cont_2comp-ipv4-vlan-sriov/26/artifact/tempest-results/tempest-results-neutron_qe_sriov.1.html
If this bug requires doc text for errata release, please set the 'Doc Type' and provide draft text according to the template in the 'Doc Text' field. The documentation team will review, edit, and approve the text. If this bug does not require doc text, please set the 'requires_doc_text' flag to '-'.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2020:3148