Bug 1366208

Summary: [RFE] Reserve NUMA nodes with PCI devices attached
Product: Red Hat OpenStack Reporter: Pratik Pravin Bandarkar <pbandark>
Component: openstack-nova    Assignee: Stephen Finucane <stephenfin>
Status: CLOSED EOL QA Contact: awaugama
Severity: high Docs Contact:
Priority: high    
Version: 12.0 (Pike)    CC: acanan, atelang, awaugama, berrange, broose, dasmith, dhill, djuran, eglynn, fbaudin, jamsmith, jjoyce, jniu, jraju, jschluet, kchamart, lmarsh, lruzicka, lyarwood, markmc, mburns, mdeng, mschuppe, oblaut, pchavva, rhos-maint, sbauza, sclewis, sgordon, slinaber, sputhenp, srevivo, stephenfin, tvignaud, vromanso, yohmura, yrachman
Target Milestone: z4    Keywords: FutureFeature, TestOnly, Triaged, ZStream
Target Release: 12.0 (Pike)   
Hardware: All   
OS: Linux   
URL: https://blueprints.launchpad.net/nova/+spec/reserve-numa-with-pci
Whiteboard: upstream_milestone_none upstream_definition_approved upstream_status_needs-code-review
Fixed In Version: openstack-nova-16.0.0-0.20170624031428.3863eca.el7ost Doc Type: Enhancement
Doc Text:
This release adds the ‘PCIWeigher’ weigher. The weigher is enabled by default, but has no effect until it is configured. Configure ‘PCIWeigher’ using the '[filter_scheduler] pci_weight_multiplier' option to prevent non-PCI instances from occupying resources on hosts with PCI devices. When the weigher is enabled, the nova-scheduler service prefers hosts without whitelisted PCI devices when booting instances that do not request PCI passthrough or SR-IOV networking.
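
As an illustrative sketch only (paths, service name, and the use of crudini are assumptions for a typical OSP 12 scheduler node, not taken from this report), tuning the multiplier amounts to something like:

  # Tune the PCIWeigher on the nodes running nova-scheduler. Equivalent to
  # setting the following in /etc/nova/nova.conf:
  #   [filter_scheduler]
  #   pci_weight_multiplier = 1.0
  crudini --set /etc/nova/nova.conf filter_scheduler pci_weight_multiplier 1.0
  systemctl restart openstack-nova-scheduler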
Story Points: ---
Clone Of:
: 1446311 (view as bug list) Environment:
Last Closed: 2019-01-11 16:26:38 UTC Type: Feature Request
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 1188000, 1389435, 1389441, 1399897, 1419231, 1442136, 1446311, 1469476, 1561961, 1647536, 1650606, 1732816, 1757886, 1775575, 1775576, 1783354, 1791991    

Description Pratik Pravin Bandarkar 2016-08-11 10:02:24 UTC
1. Proposed title of this feature request  
   Enable sharing of SR-IOV PCI devices between NUMA nodes
  
3. What is the nature and description of the request?  
On a two-socket server (NUMA + SR-IOV environment), each NUMA cell has its own PCI devices. If a NUMA cell has no SR-IOV PCI device, the user can't boot a VM with SR-IOV support in that NUMA cell. Nova forbids sharing PCI devices between NUMA cells for better performance, but this behaviour leaves two-socket machines half-populated. The user should have a choice, and Nova's behaviour in this case should be configurable.

Blueprint link: https://blueprints.launchpad.net/nova/+spec/share-pci-device-between-numa-nodes
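
For context, a typical SR-IOV boot that runs into this limitation looks roughly like the following (network, flavor, image, and port names here are placeholders, not taken from this report):

  # Create an SR-IOV (direct) port and boot an instance with it. With the
  # current behaviour the instance can only land on a NUMA cell that has an
  # SR-IOV-capable device attached.
  openstack port create --network sriov-net --vnic-type direct sriov-port
  openstack server create --flavor m1.large --image rhel7 \
      --nic port-id=sriov-port sriov-vm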

Comment 3 Stephen Gordon 2016-08-18 16:42:16 UTC
Clearing rhos-flags; it's too early to ask for acks as there is not yet a design.

Comment 4 Stephen Gordon 2016-08-19 14:50:19 UTC
(In reply to Pratik Pravin Bandarkar from comment #0)
> On two-sockets server (NUMA + SR-IOV environment) each NUMA cell has its own
> PCI devices. If some NUMA cell has no SR-IOV PCI device user can't boot a VM
> with SR-IOV support in this NUMA cell. Nova forbids to share PCI devices
> between NUMA cells for better performance. But this behavior leaves
> 2-sockets machines half-populated. User should have a choice and Nova
> behavior in this case should be configurable.

The first thing I will note is that you can work around this using hw:numa_nodes=2 (or more) to allow the guest to split over more than one node, but this is not ideal: splitting the guest in this fashion impacts not just device allocation but also vCPU and RAM allocation.
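
For example, the workaround amounts to something like this (the flavor name is a placeholder; a sketch, not a recommendation):

  # Allow the guest to span two host NUMA nodes. Note this splits vCPUs and
  # RAM across the nodes as well, not just the PCI device placement.
  openstack flavor set m1.large --property hw:numa_nodes=2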

I think there are really a couple of aspects to resolving this ask, which comes up in a couple of different ways:

1) Weighting scheduling requests such that if a given request does *not* require an SR-IOV/PCI device, it is weighted towards landing on NUMA nodes that do not have them exposed.

2) Allowing users requesting a VM with an SR-IOV/PCI device a mechanism for specifying whether they have a "hard" requirement for it to be co-located on the same NUMA node(s) as the guest, or a "soft" requirement. In the latter case the scheduler would still attempt to place the guest on the same NUMA node(s) as the available devices, but if this is not possible, it would just place them on the same host and still attach the device.

We need to be mindful that for some workloads the current behaviour is the expectation, that for others the behaviour requested here is the expectation, and that both may even manifest in the same cloud.

Comment 5 Sadique Puthen 2016-08-22 03:59:29 UTC
If I use hw:numa_nodes=2, will it strictly try to align memory, CPUs, and PCI devices to the same NUMA node when possible/available, and split only if it's not possible to get the resources from the same NUMA node? They don't want it to be split if resources are available in the same NUMA node.

Comment 7 Stephen Gordon 2016-08-22 13:44:13 UTC
(In reply to Sadique Puthen from comment #5)
> If I use hw:numa_nodes=2, will it strictly try to align memory, cpu and pci
> devices to the same numa node when possible/available and split it only if
> it's not possible to get resource from same numa node? They don't want it to
> be split if resource are available in the same numa node.

No, the guest topology will be split evenly (or as evenly as possible) across the two nodes unless other parameters implemented by the same spec are used to create an asymmetric guest CPU/RAM layout.
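
For reference, the asymmetric layout mentioned above is expressed with the hw:numa_cpus.N / hw:numa_mem.N flavor extra specs; a rough sketch (flavor name and sizing are placeholders):

  # 8 vCPU / 8192 MB flavor split unevenly across two guest NUMA nodes:
  # 6 vCPUs and 6144 MB on node 0, 2 vCPUs and 2048 MB on node 1.
  openstack flavor set m1.numa \
      --property hw:numa_nodes=2 \
      --property hw:numa_cpus.0=0,1,2,3,4,5 \
      --property hw:numa_cpus.1=6,7 \
      --property hw:numa_mem.0=6144 \
      --property hw:numa_mem.1=2048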

Comment 11 Stephen Gordon 2016-09-01 19:11:57 UTC
(In reply to Stephen Gordon from comment #4)
> (In reply to Pratik Pravin Bandarkar from comment #0)
> > On two-sockets server (NUMA + SR-IOV environment) each NUMA cell has its own
> > PCI devices. If some NUMA cell has no SR-IOV PCI device user can't boot a VM
> > with SR-IOV support in this NUMA cell. Nova forbids to share PCI devices
> > between NUMA cells for better performance. But this behavior leaves
> > 2-sockets machines half-populated. User should have a choice and Nova
> > behavior in this case should be configurable.
> 
> First thing I will note is that you can work around this using
> hw:numa_nodes=2 (or more) to allow the guest to split over more than one
> node but this is not ideal as the splitting of the guest in this fashion
> impacts not just device allocation but also vCPU and RAM allocation. 
> 
> There are really I think a couple of aspects to resolving this ask which
> comes up in a couple of different ways:
> 
> 1) Weighting scheduling requests such that if a given request does *not*
> require and SR-IOV/PCI device it is weighted towards landing on NUMA nodes
> that do not have them exposed.

This proposal endeavours to address this part:

https://blueprints.launchpad.net/nova/+spec/reserve-numa-with-pci
https://review.openstack.org/#/c/364468/

> 2) Allowing users requesting a VM with a SR-IOV/PCI device a mechanism for
> specifying whether they have a "hard" requirement for it to be co-located on
> the same NUMA node(s) as the guest or a "soft" requirement. In the latter
> case the scheduler would still attempt to place the guest on the same NUMA
> node(s) as the available devices but if this is not possible, just place
> them on the same host and still attach the device.

This proposal endeavours to address this part:

https://blueprints.launchpad.net/nova/+spec/share-pci-between-numa-nodes
https://review.openstack.org/#/c/361140/

> We need to be mindful of the fact that for some workloads the current
> behaviour is the expectation, and that for others the behaviour requested
> here is the expectation and that both may even manifest in the same cloud.

Comment 13 Stephen Gordon 2016-11-23 16:13:14 UTC
(In reply to Stephen Gordon from comment #11)
> (In reply to Stephen Gordon from comment #4)
> > (In reply to Pratik Pravin Bandarkar from comment #0)
> > > On two-sockets server (NUMA + SR-IOV environment) each NUMA cell has its own
> > > PCI devices. If some NUMA cell has no SR-IOV PCI device user can't boot a VM
> > > with SR-IOV support in this NUMA cell. Nova forbids to share PCI devices
> > > between NUMA cells for better performance. But this behavior leaves
> > > 2-sockets machines half-populated. User should have a choice and Nova
> > > behavior in this case should be configurable.
> > 
> > First thing I will note is that you can work around this using
> > hw:numa_nodes=2 (or more) to allow the guest to split over more than one
> > node but this is not ideal as the splitting of the guest in this fashion
> > impacts not just device allocation but also vCPU and RAM allocation. 
> > 
> > There are really I think a couple of aspects to resolving this ask which
> > comes up in a couple of different ways:
> > 
> > 1) Weighting scheduling requests such that if a given request does *not*
> > require and SR-IOV/PCI device it is weighted towards landing on NUMA nodes
> > that do not have them exposed.
> 
> This proposal endeavours to address this part:
> 
> https://blueprints.launchpad.net/nova/+spec/reserve-numa-with-pci
> https://review.openstack.org/#/c/364468/
> 
> > 2) Allowing users requesting a VM with a SR-IOV/PCI device a mechanism for
> > specifying whether they have a "hard" requirement for it to be co-located on
> > the same NUMA node(s) as the guest or a "soft" requirement. In the latter
> > case the scheduler would still attempt to place the guest on the same NUMA
> > node(s) as the available devices but if this is not possible, just place
> > them on the same host and still attach the device.
> 
> This proposal endeavours to address this part:
> 
> https://blueprints.launchpad.net/nova/+spec/share-pci-between-numa-nodes
> https://review.openstack.org/#/c/361140/
> 
> > We need to be mindful of the fact that for some workloads the current
> > behaviour is the expectation, and that for others the behaviour requested
> > here is the expectation and that both may even manifest in the same cloud.

Neither specification was approved for Ocata/11; as a result, I am moving this RFE to Pike/12.

Comment 16 Stephen Gordon 2017-04-12 12:19:21 UTC
Moving to Upstream M2; progress is good but pending further upstream reviews.

Comment 35 Lon Hohberger 2018-03-07 13:51:10 UTC
According to our records, this should be resolved by openstack-nova-16.0.2-9.el7ost.  This build is available now.
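
To confirm which build is actually installed on a given node, something along these lines should do (a sketch; package naming as shipped in OSP):

  # List the installed nova packages and their versions.
  rpm -qa | grep openstack-nova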

Comment 38 David Juran 2018-03-14 09:22:06 UTC
I'm not sure if I'm misreading things here, but does this Bz cover _both_ the sharing of PCI devices between NUMA nodes [1] mentioned in comment #0 and the reservation of NUMA nodes with PCI devices [2], which the title implies?
Which of those (or both?) is in the coming OSP12 errata?

[1]
https://blueprints.launchpad.net/nova/+spec/share-pci-device-between-numa-nodes

[2]
https://blueprints.launchpad.net/nova/+spec/reserve-numa-with-pci