Bug 1390576

Summary: Couldn't boot several PFs from separate physical interfaces
Product: Red Hat OpenStack
Component: openstack-nova
Version: 10.0 (Newton)
Target Release: 10.0 (Newton)
Target Milestone: rc
Hardware: Unspecified
OS: Unspecified
Status: CLOSED ERRATA
Severity: high
Priority: high
Keywords: Triaged
Reporter: Eyal Dannon <edannon>
Assignee: Vladik Romanovsky <vromanso>
QA Contact: Yariv <yrachman>
CC: atelang, atragler, beagles, berrange, dasmith, edannon, eglynn, fbaudin, jjung, jschluet, kchamart, mbabushk, oblaut, panbalag, sbauza, sferdjao, sgordon, srevivo, vromanso, yrachman, zgreenbe
Fixed In Version: openstack-nova-14.0.1-6.el7ost
Type: Bug
Last Closed: 2016-12-14 16:27:18 UTC

Description Eyal Dannon 2016-11-01 12:42:26 UTC
Description of problem:

As part of OSPd10 + SR-IOV verification, I tried to boot two VMs with a PF as the interface, each attached to a separate physical network.
The instances failed to boot because of a nova-scheduler PciPassthroughFilter issue. Here are the "pci_passthrough_whitelist" combinations I tried:

#Only-vf-works#
pci_passthrough_whitelist=[{"devname": "p6p1", "physical_network": "tenant"}, {"devname": "p6p2", "physical_network": "tenant2"}]

#only-vf-works#
pci_passthrough_whitelist=[{"vendor_id": "8086", "product_id": "154d", "physical_network": "tenant", "address": "0000:06:00.0"},{"vendor_id": "8086", "product_id":"10ed", "physical_network": "tenant", "address": "0000:06:00.0"}]

#PF+VF works while using 1 physical network#
pci_passthrough_whitelist=[{"vendor_id": "8086", "product_id": "154d", "physical_network": "tenant"},{"vendor_id": "8086", "product_id":"10ed", "physical_network": "tenant"}]

#Only 1 PF boots(from the first physical_network)#
pci_passthrough_whitelist=[{"vendor_id": "8086", "product_id": "154d", "physical_network": "tenant"},{"vendor_id": "8086", "product_id":"10ed", "physical_network": "tenant"},{"vendor_id": "8086", "product_id": "154d", "physical_network": "tenant2"},{"vendor_id": "8086", "product_id":"10ed", "physical_network": "tenant2"}]

#Only 1 PF boots(from tenant2)#
pci_passthrough_whitelist=[{"vendor_id": "8086", "product_id": "154d", "physical_network": "tenant2"},{"vendor_id": "8086", "product_id":"10ed", "physical_network": "tenant2"},{"vendor_id": "8086", "product_id": "154d", "physical_network": "tenant"},{"vendor_id": "8086", "product_id":"10ed", "physical_network": "tenant"}]

#Couldn't boot any VM with PF#
pci_passthrough_whitelist=[{"vendor_id": "8086", "product_id": "154d"},{"vendor_id": "8086", "product_id":"10ed"}]

#PF-Wont-boot#
pci_passthrough_whitelist=[{"vendor_id": "8086", "product_id": "154d", "physical_network": "tenant"},{"vendor_id": "8086", "product_id":"154d", "physical_network": "tenant2"}]

Version-Release number of selected component (if applicable):

openstack-nova-novncproxy-14.0.0-0.20160929203854.59653c6.el7ost.noarch
openstack-nova-api-14.0.0-0.20160929203854.59653c6.el7ost.noarch
openstack-nova-console-14.0.0-0.20160929203854.59653c6.el7ost.noarch
openstack-nova-common-14.0.0-0.20160929203854.59653c6.el7ost.noarch
openstack-nova-scheduler-14.0.0-0.20160929203854.59653c6.el7ost.noarch
openstack-nova-conductor-14.0.0-0.20160929203854.59653c6.el7ost.noarch
openstack-nova-cert-14.0.0-0.20160929203854.59653c6.el7ost.noarch
openstack-nova-compute-14.0.0-0.20160929203854.59653c6.el7ost.noarch


How reproducible:
Always

Steps to Reproduce:
1. Set up an OSPd10 + SR-IOV environment
2. Set "pci_passthrough_whitelist" to one of the combinations above
3. Boot an instance with a direct-physical port as its interface
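Step 3 amounts to creating a port with the direct-physical VNIC type and booting on it. A rough sketch, where the network, flavor, and image names are placeholders (not taken from this report), and Newton-era deployments may need the legacy neutron client instead of the unified CLI:

```shell
# Create a port that claims an entire PF (placeholder network name)
openstack port create --network tenant-net --vnic-type direct-physical pf-port

# Boot a VM attached to that port (placeholder flavor/image)
openstack server create --flavor m1.medium --image rhel7 \
    --nic port-id=pf-port pf-vm
```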

Actual results:
The VM goes into ERROR state

Expected results:
The VM reaches ACTIVE, RUNNING state

Additional info:
Error example from /var/log/nova/nova-scheduler.log:

2016-11-01 09:50:23.161 2804 INFO nova.filters [req-893f7cfb-5c79-413b-b70d-a20442fe411f 7cdae8e58bd240868fea4cf8f10f4ed7 16d7d8002a954229bee64f24e9ec25bd - - -] Filtering removed all hosts for the request with instance ID 'a736782b-cb40-4ecd-bad5-93cf8f80fbab'. Filter results: ['AvailabilityZoneFilter: (start: 1, end: 1)', 'RamFilter: (start: 1, end: 1)', 'ComputeFilter: (start: 1, end: 1)', 'ComputeCapabilitiesFilter: (start: 1, end: 1)', 'ImagePropertiesFilter: (start: 1, end: 1)', 'ServerGroupAntiAffinityFilter: (start: 1, end: 1)', 'ServerGroupAffinityFilter: (start: 1, end: 1)', 'PciPassthroughFilter: (start: 1, end: 0)']
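The filter result above ('PciPassthroughFilter: (start: 1, end: 0)') and the "only 1 PF boots (from the first physical_network)" behavior can be illustrated with a toy model of first-match whitelist tagging. This is a simplified sketch, not nova's actual PCI pool code, and the second PCI address is invented for illustration:

```python
# Toy model -- NOT nova's actual code -- of why PFs that share a
# vendor/product ID cannot be told apart by the whitelist.

def tag_devices(devices, whitelist):
    """Tag each device with the physical_network of the FIRST whitelist
    entry whose vendor/product IDs match; the PCI address is ignored,
    which is the limitation this bug is about."""
    tagged = []
    for dev in devices:
        for entry in whitelist:
            if (entry["vendor_id"] == dev["vendor_id"]
                    and entry["product_id"] == dev["product_id"]):
                tagged.append(dict(dev, physical_network=entry["physical_network"]))
                break
    return tagged

def schedulable(tagged, physnet):
    """Devices the scheduler could hand out for a given physical network."""
    return [d for d in tagged if d["physical_network"] == physnet]

# Two PFs with identical vendor/product IDs on different physical NICs
# (second address is hypothetical).
pfs = [
    {"address": "0000:06:00.0", "vendor_id": "8086", "product_id": "154d"},
    {"address": "0000:06:00.1", "vendor_id": "8086", "product_id": "154d"},
]
# Whitelist shape from the report: same IDs, two physical networks.
whitelist = [
    {"vendor_id": "8086", "product_id": "154d", "physical_network": "tenant"},
    {"vendor_id": "8086", "product_id": "154d", "physical_network": "tenant2"},
]

tagged = tag_devices(pfs, whitelist)
print(len(schedulable(tagged, "tenant")))   # 2 -- both PFs land on "tenant"
print(len(schedulable(tagged, "tenant2")))  # 0 -- filter: (start: 1, end: 0)
```

In this model, reordering the whitelist entries flips which network claims both PFs, which matches the observation above that reversing the entry order made only the tenant2 PF bootable.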

Comment 1 Vladik Romanovsky 2016-11-01 14:33:12 UTC
Hi Eyal,

Yes, currently it's not possible to distinguish between Physical Functions that share the same vendor/product id on a host.

We will need to wait until the following patch is merged:

https://review.openstack.org/#/c/363884

then we will be able to whitelist these according to their PCI addresses.
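Once address-based whitelisting is available, a configuration along these lines should let each PF carry its own physical network. This is a sketch, not a verified working config, and the second PCI address is hypothetical:

```
#Hypothetical: pin each PF to its network by PCI address#
pci_passthrough_whitelist=[{"vendor_id": "8086", "product_id": "154d", "address": "0000:06:00.0", "physical_network": "tenant"},{"vendor_id": "8086", "product_id": "154d", "address": "0000:08:00.0", "physical_network": "tenant2"}]
```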

For even more fine-grained control, we will need the following spec to be implemented in the Ocata release:

https://review.openstack.org/#/c/350211

Thanks,
Vladik

Comment 3 Brent Eagles 2016-11-02 13:47:11 UTC
Another reason we'll need PCI-address-based control: if a system has only SR-IOV cards of the same make/model, but just some of them are used for guests while the others are used for control plane/management/storage/etc., it is very possible that OpenStack will allocate one of the in-use devices to a guest - potentially disconnecting the compute node from the cloud. I think we need some kind of release note strongly warning users NOT to use PFs if they are using similar NIC hardware for multiple purposes on their hypervisor nodes.

Comment 4 Sahid Ferdjaoui 2016-11-04 12:41:16 UTC
An attempt to backport this to Newton is here:

   https://review.openstack.org/#/c/393752/

Comment 10 Stephen Gordon 2016-11-09 00:35:30 UTC
*** Bug 1391680 has been marked as a duplicate of this bug. ***

Comment 13 Ziv Greenberg 2016-11-16 15:47:20 UTC
Hi,

I have verified this bug.
I was able to boot two VMs using PFs that share the same vendor/product ID on the host.


Thank you,
Ziv

Comment 15 errata-xmlrpc 2016-12-14 16:27:18 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHEA-2016-2948.html