Bug 1366208 - [RFE] Reserve NUMA nodes with PCI devices attached
Summary: [RFE] Reserve NUMA nodes with PCI devices attached
Keywords:
Status: CLOSED EOL
Alias: None
Product: Red Hat OpenStack
Classification: Red Hat
Component: openstack-nova
Version: 12.0 (Pike)
Hardware: All
OS: Linux
Priority: high
Severity: high
Target Milestone: z4
Target Release: 12.0 (Pike)
Assignee: Stephen Finucane
QA Contact: awaugama
URL: https://blueprints.launchpad.net/nova...
Whiteboard: upstream_milestone_none upstream_defi...
Depends On:
Blocks: 1188000 1389435 1399897 1442136 1647536 1732816 1791991 1389441 1419231 1446311 1469476 1561961 1650606 1757886 1775575 1775576 1783354
 
Reported: 2016-08-11 10:02 UTC by Pratik Pravin Bandarkar
Modified: 2020-02-14 17:52 UTC (History)
37 users

Fixed In Version: openstack-nova-16.0.0-0.20170624031428.3863eca.el7ost
Doc Type: Enhancement
Doc Text:
This release adds the 'PCIWeigher' weigher. The weigher is enabled by default, but has no effect until it is configured. Configure 'PCIWeigher' using the '[filter_scheduler] pci_weight_multiplier' option to prevent non-PCI instances from occupying resources on hosts with PCI devices. When the weigher is enabled, the nova-scheduler service prefers hosts without whitelisted PCI devices when booting instances that do not request PCI passthrough or SR-IOV networking.
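As a sketch of the configuration the Doc Text describes: the option lives in the `[filter_scheduler]` section of nova.conf on the scheduler host. The value below is illustrative only, not a recommendation (1.0 is the upstream default).

```ini
[filter_scheduler]
# Weight multiplier for the PCIWeigher. With a positive value, hosts
# with available whitelisted PCI devices are preferred for instances
# that request PCI passthrough or SR-IOV, and instances that do not
# request PCI devices are steered away from those hosts.
pci_weight_multiplier = 2.0
```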
Clone Of:
: 1446311
Environment:
Last Closed: 2019-01-11 16:26:38 UTC
Target Upstream Version:




Links
System ID Priority Status Summary Last Updated
Launchpad 1614882 None None None 2016-08-19 15:07:19 UTC
OpenStack gerrit 364468 None MERGED Reserve NUMA nodes with PCI devices attached 2020-04-06 19:47:50 UTC
OpenStack gerrit 379524 None MERGED Add PCIWeigher 2020-04-06 19:47:50 UTC
OpenStack gerrit 379625 None MERGED Prefer non-PCI host nodes for non-PCI instances 2020-04-06 19:47:50 UTC
OpenStack gerrit 454117 None MERGED Update "Reserve NUMA nodes with PCI devices attached" 2020-04-06 19:47:50 UTC
Red Hat Knowledge Base (Solution) 4723611 None None None 2020-01-08 15:04:08 UTC

Description Pratik Pravin Bandarkar 2016-08-11 10:02:24 UTC
1. Proposed title of this feature request  
   Enable sharing of SR-IOV PCI devices between NUMA nodes
  
3. What is the nature and description of the request?  
On a two-socket server (NUMA + SR-IOV environment), each NUMA cell has its own PCI devices. If a NUMA cell has no SR-IOV PCI device, a user can't boot a VM with SR-IOV support in that NUMA cell. Nova forbids sharing PCI devices between NUMA cells for better performance, but this behavior leaves two-socket machines half-populated. Users should have a choice, and Nova's behavior in this case should be configurable.

Blueprint link: https://blueprints.launchpad.net/nova/+spec/share-pci-device-between-numa-nodes

Comment 3 Stephen Gordon 2016-08-18 16:42:16 UTC
Clearing rhos-flags, it's too early to ask for acks as there is not yet a design.

Comment 4 Stephen Gordon 2016-08-19 14:50:19 UTC
(In reply to Pratik Pravin Bandarkar from comment #0)
> On two-sockets server (NUMA + SR-IOV environment) each NUMA cell has its own
> PCI devices. If some NUMA cell has no SR-IOV PCI device user can't boot a VM
> with SR-IOV support in this NUMA cell. Nova forbids to share PCI devices
> between NUMA cells for better performance. But this behavior leaves
> 2-sockets machines half-populated. User should have a choice and Nova
> behavior in this case should be configurable.

First thing I will note is that you can work around this using hw:numa_nodes=2 (or more) to allow the guest to split over more than one node but this is not ideal as the splitting of the guest in this fashion impacts not just device allocation but also vCPU and RAM allocation. 

There are, I think, a couple of aspects to resolving this ask, which comes up in a couple of different ways:

1) Weighting scheduling requests such that if a given request does *not* require an SR-IOV/PCI device, it is weighted towards landing on NUMA nodes that do not have them exposed.

2) Allowing users requesting a VM with a SR-IOV/PCI device a mechanism for specifying whether they have a "hard" requirement for it to be co-located on the same NUMA node(s) as the guest or a "soft" requirement. In the latter case the scheduler would still attempt to place the guest on the same NUMA node(s) as the available devices but if this is not possible, just place them on the same host and still attach the device.

We need to be mindful of the fact that for some workloads the current behaviour is the expectation, and that for others the behaviour requested here is the expectation and that both may even manifest in the same cloud.
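The weighting idea in (1) can be sketched as follows. This is a hypothetical illustration only; the names (`Host`, `weigh`, `rank_hosts`) and the scoring are not nova's actual PCIWeigher implementation, just the ranking behaviour being described.

```python
from dataclasses import dataclass


@dataclass
class Host:
    name: str
    free_pci_devices: int  # whitelisted PCI devices still available


def weigh(host: Host, instance_wants_pci: bool) -> int:
    """Return a weight for the host; higher means more preferred."""
    if instance_wants_pci:
        # PCI instances should land where devices exist.
        return host.free_pci_devices
    # Non-PCI instances: penalize hosts that still hold PCI devices,
    # keeping those devices free for workloads that need them.
    return -host.free_pci_devices


def rank_hosts(hosts: list[Host], instance_wants_pci: bool) -> list[Host]:
    return sorted(hosts, key=lambda h: weigh(h, instance_wants_pci),
                  reverse=True)
```

With this ranking, a non-PCI instance prefers a PCI-free host, while a PCI-requesting instance prefers the host with the most free devices.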

Comment 5 Sadique Puthen 2016-08-22 03:59:29 UTC
If I use hw:numa_nodes=2, will it strictly try to align memory, CPU, and PCI devices to the same NUMA node when possible/available, and split only if it's not possible to get resources from the same NUMA node? They don't want it split if resources are available in the same NUMA node.

Comment 7 Stephen Gordon 2016-08-22 13:44:13 UTC
(In reply to Sadique Puthen from comment #5)
> If I use hw:numa_nodes=2, will it strictly try to align memory, cpu and pci
> devices to the same numa node when possible/available and split it only if
> it's not possible to get resource from same numa node? They don't want it to
> be split if resource are available in the same numa node.

No, the guest topology will be split evenly (or as evenly as possible) across the two nodes unless other parameters implemented by the same spec are used to create an asymmetric guest CPU/RAM layout.
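The "as evenly as possible" split described above can be illustrated with a small sketch (hypothetical helper, not nova code); odd counts simply leave one virtual NUMA node slightly larger. Asymmetric layouts instead use the per-node extra specs (hw:numa_cpus.N, hw:numa_mem.N) mentioned in the same spec.

```python
def split_evenly(total: int, nodes: int) -> list[int]:
    """Divide `total` units (vCPUs, MB of RAM) across `nodes` virtual
    NUMA nodes as evenly as possible, earlier nodes taking the remainder."""
    base, rem = divmod(total, nodes)
    return [base + (1 if i < rem else 0) for i in range(nodes)]
```

For example, 6 vCPUs over hw:numa_nodes=2 gives a 3/3 split, while 5 vCPUs gives 3/2.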

Comment 11 Stephen Gordon 2016-09-01 19:11:57 UTC
(In reply to Stephen Gordon from comment #4)
> (In reply to Pratik Pravin Bandarkar from comment #0)
> > On two-sockets server (NUMA + SR-IOV environment) each NUMA cell has its own
> > PCI devices. If some NUMA cell has no SR-IOV PCI device user can't boot a VM
> > with SR-IOV support in this NUMA cell. Nova forbids to share PCI devices
> > between NUMA cells for better performance. But this behavior leaves
> > 2-sockets machines half-populated. User should have a choice and Nova
> > behavior in this case should be configurable.
> 
> First thing I will note is that you can work around this using
> hw:numa_nodes=2 (or more) to allow the guest to split over more than one
> node but this is not ideal as the splitting of the guest in this fashion
> impacts not just device allocation but also vCPU and RAM allocation. 
> 
> There are really I think a couple of aspects to resolving this ask which
> comes up in a couple of different ways:
> 
> 1) Weighting scheduling requests such that if a given request does *not*
> require and SR-IOV/PCI device it is weighted towards landing on NUMA nodes
> that do not have them exposed.

This proposal endeavours to address this part:

https://blueprints.launchpad.net/nova/+spec/reserve-numa-with-pci
https://review.openstack.org/#/c/364468/

> 2) Allowing users requesting a VM with a SR-IOV/PCI device a mechanism for
> specifying whether they have a "hard" requirement for it to be co-located on
> the same NUMA node(s) as the guest or a "soft" requirement. In the latter
> case the scheduler would still attempt to place the guest on the same NUMA
> node(s) as the available devices but if this is not possible, just place
> them on the same host and still attach the device.

This proposal endeavours to address this part:

https://blueprints.launchpad.net/nova/+spec/share-pci-between-numa-nodes
https://review.openstack.org/#/c/361140/

> We need to be mindful of the fact that for some workloads the current
> behaviour is the expectation, and that for others the behaviour requested
> here is the expectation and that both may even manifest in the same cloud.

Comment 13 Stephen Gordon 2016-11-23 16:13:14 UTC
(In reply to Stephen Gordon from comment #11)
> [full quote of comments #0, #4, and #11 trimmed; identical to comment #11 above]

Neither specification was approved for Ocata/11, as a result I am moving this RFE to Pike/12.

Comment 16 Stephen Gordon 2017-04-12 12:19:21 UTC
Moving to Upstream M2, progress is good but pending further upstream reviews.

Comment 35 Lon Hohberger 2018-03-07 13:51:10 UTC
According to our records, this should be resolved by openstack-nova-16.0.2-9.el7ost.  This build is available now.

Comment 38 David Juran 2018-03-14 09:22:06 UTC
I'm not sure if I'm misreading things here, but does this BZ cover _both_ the sharing of PCI devices between NUMA nodes[1] mentioned in #0 and the reservation of NUMA nodes with PCI devices[2] which the title implies?
Which of those (or both?) is in the coming OSP12 errata?

[1]
https://blueprints.launchpad.net/nova/+spec/share-pci-device-between-numa-nodes

[2]
https://blueprints.launchpad.net/nova/+spec/reserve-numa-with-pci

