Bug 1366208
| Summary: | [RFE] Reserve NUMA nodes with PCI devices attached | | |
|---|---|---|---|
| Product: | Red Hat OpenStack | Reporter: | Pratik Pravin Bandarkar <pbandark> |
| Component: | openstack-nova | Assignee: | Stephen Finucane <stephenfin> |
| Status: | CLOSED EOL | QA Contact: | awaugama |
| Severity: | high | Docs Contact: | |
| Priority: | high | | |
| Version: | 12.0 (Pike) | CC: | acanan, atelang, awaugama, berrange, broose, dasmith, dhill, djuran, eglynn, fbaudin, jamsmith, jjoyce, jniu, jraju, jschluet, kchamart, lmarsh, lruzicka, lyarwood, markmc, mburns, mdeng, mschuppe, oblaut, pchavva, rhos-maint, sbauza, sclewis, sgordon, slinaber, sputhenp, srevivo, stephenfin, tvignaud, vromanso, yohmura, yrachman |
| Target Milestone: | z4 | Keywords: | FutureFeature, TestOnly, Triaged, ZStream |
| Target Release: | 12.0 (Pike) | | |
| Hardware: | All | | |
| OS: | Linux | | |
| URL: | https://blueprints.launchpad.net/nova/+spec/reserve-numa-with-pci | | |
| Whiteboard: | upstream_milestone_none upstream_definition_approved upstream_status_needs-code-review | | |
| Fixed In Version: | openstack-nova-16.0.0-0.20170624031428.3863eca.el7ost | Doc Type: | Enhancement |
| Doc Text: | This release adds the 'PCIWeigher' weigher. The weigher is enabled by default, but has no effect until it is configured. Configure 'PCIWeigher' using the '[filter_scheduler] pci_weight_multiplier' option to prevent non-PCI instances from occupying resources on hosts with PCI devices. When the weigher is enabled, the nova-scheduler service prefers hosts without whitelisted PCI devices when booting instances that do not request PCI passthrough or SR-IOV networking. | | |
| Story Points: | --- | | |
| Clone Of: | | | |
| : | 1446311 (view as bug list) | Environment: | |
| Last Closed: | 2019-01-11 16:26:38 UTC | Type: | Feature Request |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Bug Depends On: | | | |
| Bug Blocks: | 1188000, 1389435, 1389441, 1399897, 1419231, 1442136, 1446311, 1469476, 1561961, 1647536, 1650606, 1732816, 1757886, 1775575, 1775576, 1783354, 1791991 | | |
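The option named in the Doc Text above lives in nova.conf on the scheduler hosts. A minimal sketch of the relevant fragment, with an illustrative value (1.0 is also the upstream default):

```ini
# nova.conf on the scheduler hosts -- illustrative fragment only
[filter_scheduler]
# With a positive multiplier, instances that request PCI devices are
# steered toward hosts with free whitelisted PCI devices, while
# instances that do not request any are steered away from such hosts.
pci_weight_multiplier = 1.0
```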
Description
Pratik Pravin Bandarkar
2016-08-11 10:02:24 UTC
Clearing rhos-flags; it's too early to ask for acks as there is not yet a design.

(In reply to Pratik Pravin Bandarkar from comment #0)
> On two-sockets server (NUMA + SR-IOV environment) each NUMA cell has its
> own PCI devices. If some NUMA cell has no SR-IOV PCI device user can't
> boot a VM with SR-IOV support in this NUMA cell. Nova forbids to share
> PCI devices between NUMA cells for better performance. But this behavior
> leaves 2-sockets machines half-populated. User should have a choice and
> Nova behavior in this case should be configurable.

The first thing I will note is that you can work around this using hw:numa_nodes=2 (or more) to allow the guest to split over more than one node, but this is not ideal, as splitting the guest in this fashion impacts not just device allocation but also vCPU and RAM allocation.

There are, I think, a couple of aspects to resolving this ask, which comes up in a couple of different ways:

1) Weighting scheduling requests such that if a given request does *not* require an SR-IOV/PCI device, it is weighted towards landing on NUMA nodes that do not have them exposed.

2) Allowing users requesting a VM with an SR-IOV/PCI device a mechanism for specifying whether they have a "hard" requirement for it to be co-located on the same NUMA node(s) as the guest or a "soft" requirement. In the latter case the scheduler would still attempt to place the guest on the same NUMA node(s) as the available devices, but if this is not possible, just place them on the same host and still attach the device.

We need to be mindful of the fact that for some workloads the current behaviour is the expectation, that for others the behaviour requested here is the expectation, and that both may even manifest in the same cloud.

If I use hw:numa_nodes=2, will it strictly try to align memory, CPU, and PCI devices to the same NUMA node when possible/available, and split them only if it's not possible to get resources from the same NUMA node?
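For reference, the hw:numa_nodes workaround mentioned above is applied as a flavor extra spec. A minimal CLI sketch, where 'm1.numa' is a hypothetical flavor name:

```shell
# Illustrative only: allow the guest topology to be split across two
# host NUMA nodes ('m1.numa' is a hypothetical flavor name).
openstack flavor set m1.numa --property hw:numa_nodes=2
```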
They don't want it to be split if resources are available in the same NUMA node.

(In reply to Sadique Puthen from comment #5)
> If I use hw:numa_nodes=2, will it strictly try to align memory, CPU, and
> PCI devices to the same NUMA node when possible/available, and split
> them only if it's not possible to get resources from the same NUMA node?
> They don't want it to be split if resources are available in the same
> NUMA node.

No, the guest topology will be split evenly (or as evenly as possible) across the two nodes unless other parameters implemented by the same spec are used to create an asymmetric guest CPU/RAM layout.

(In reply to Stephen Gordon from comment #4)
> (In reply to Pratik Pravin Bandarkar from comment #0)
> > On two-sockets server (NUMA + SR-IOV environment) each NUMA cell has
> > its own PCI devices. If some NUMA cell has no SR-IOV PCI device user
> > can't boot a VM with SR-IOV support in this NUMA cell. Nova forbids to
> > share PCI devices between NUMA cells for better performance. But this
> > behavior leaves 2-sockets machines half-populated. User should have a
> > choice and Nova behavior in this case should be configurable.
>
> The first thing I will note is that you can work around this using
> hw:numa_nodes=2 (or more) to allow the guest to split over more than one
> node, but this is not ideal, as splitting the guest in this fashion
> impacts not just device allocation but also vCPU and RAM allocation.
>
> There are, I think, a couple of aspects to resolving this ask, which
> comes up in a couple of different ways:
>
> 1) Weighting scheduling requests such that if a given request does *not*
> require an SR-IOV/PCI device, it is weighted towards landing on NUMA
> nodes that do not have them exposed.
This proposal endeavours to address this part:

https://blueprints.launchpad.net/nova/+spec/reserve-numa-with-pci
https://review.openstack.org/#/c/364468/

> 2) Allowing users requesting a VM with an SR-IOV/PCI device a mechanism
> for specifying whether they have a "hard" requirement for it to be
> co-located on the same NUMA node(s) as the guest or a "soft"
> requirement. In the latter case the scheduler would still attempt to
> place the guest on the same NUMA node(s) as the available devices, but
> if this is not possible, just place them on the same host and still
> attach the device.

This proposal endeavours to address this part:

https://blueprints.launchpad.net/nova/+spec/share-pci-between-numa-nodes
https://review.openstack.org/#/c/361140/

> We need to be mindful of the fact that for some workloads the current
> behaviour is the expectation, that for others the behaviour requested
> here is the expectation, and that both may even manifest in the same
> cloud.

(In reply to Stephen Gordon from comment #11)
> (In reply to Stephen Gordon from comment #4)
> > (In reply to Pratik Pravin Bandarkar from comment #0)
> > > On two-sockets server (NUMA + SR-IOV environment) each NUMA cell has
> > > its own PCI devices. If some NUMA cell has no SR-IOV PCI device user
> > > can't boot a VM with SR-IOV support in this NUMA cell. Nova forbids
> > > to share PCI devices between NUMA cells for better performance. But
> > > this behavior leaves 2-sockets machines half-populated. User should
> > > have a choice and Nova behavior in this case should be configurable.
> >
> > The first thing I will note is that you can work around this using
> > hw:numa_nodes=2 (or more) to allow the guest to split over more than
> > one node, but this is not ideal, as splitting the guest in this
> > fashion impacts not just device allocation but also vCPU and RAM
> > allocation.
> >
> > There are, I think, a couple of aspects to resolving this ask, which
> > comes up in a couple of different ways:
> >
> > 1) Weighting scheduling requests such that if a given request does
> > *not* require an SR-IOV/PCI device, it is weighted towards landing on
> > NUMA nodes that do not have them exposed.
>
> This proposal endeavours to address this part:
>
> https://blueprints.launchpad.net/nova/+spec/reserve-numa-with-pci
> https://review.openstack.org/#/c/364468/
>
> > 2) Allowing users requesting a VM with an SR-IOV/PCI device a
> > mechanism for specifying whether they have a "hard" requirement for
> > it to be co-located on the same NUMA node(s) as the guest or a "soft"
> > requirement. In the latter case the scheduler would still attempt to
> > place the guest on the same NUMA node(s) as the available devices,
> > but if this is not possible, just place them on the same host and
> > still attach the device.
>
> This proposal endeavours to address this part:
>
> https://blueprints.launchpad.net/nova/+spec/share-pci-between-numa-nodes
> https://review.openstack.org/#/c/361140/
>
> > We need to be mindful of the fact that for some workloads the current
> > behaviour is the expectation, that for others the behaviour requested
> > here is the expectation, and that both may even manifest in the same
> > cloud.

Neither specification was approved for Ocata/11; as a result, I am moving this RFE to Pike/12.

Moving to Upstream M2; progress is good but pending further upstream reviews.

According to our records, this should be resolved by openstack-nova-16.0.2-9.el7ost. This build is available now.

I'm not sure if I'm misreading things here, but does this BZ cover _both_ the sharing of PCI devices between NUMA nodes[1] mentioned in comment #0 and the reservation of NUMA nodes with PCI devices[2] which the title implies? Which of those (or both?) are in the coming OSP12 errata?
[1] https://blueprints.launchpad.net/nova/+spec/share-pci-device-between-numa-nodes
[2] https://blueprints.launchpad.net/nova/+spec/reserve-numa-with-pci
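A footnote on the hw:numa_nodes discussion above: the "split evenly (or as evenly as possible)" behaviour can be illustrated with a short sketch. This is not nova's source code; `split_evenly` is a hypothetical helper showing only the placement arithmetic.

```python
def split_evenly(total, nodes):
    """Divide `total` units (vCPUs, MiB of RAM, ...) across `nodes`
    virtual NUMA nodes as evenly as possible, larger shares first."""
    base, rem = divmod(total, nodes)
    return [base + 1 if i < rem else base for i in range(nodes)]

# With hw:numa_nodes=2, an 8-vCPU guest splits 4+4; an odd count such
# as 5 splits 3+2 rather than staying on a single node.
print(split_evenly(8, 2))  # [4, 4]
print(split_evenly(5, 2))  # [3, 2]
```

This is why the workaround is lossy: the split applies to vCPUs and RAM as well, not only to the PCI device placement.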
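Since the change that shipped (per the Fixed In Version field) is the weigher from item 1, here is a minimal sketch of the idea behind PCIWeigher. This is not the nova implementation; the function name and host fields are illustrative, but the sign convention matches the Doc Text: instances that do not request PCI devices prefer PCI-free hosts.

```python
def pci_weight(free_pci_devices, instance_requests_pci, multiplier=1.0):
    """Return a host weight; higher means more preferred (sketch only)."""
    if instance_requests_pci:
        # PCI-requesting instances gravitate toward hosts that can
        # actually supply a whitelisted device.
        return multiplier * free_pci_devices
    # Non-PCI instances are pushed away from hosts with whitelisted
    # devices, keeping those devices free for workloads that need them.
    return -multiplier * free_pci_devices

hosts = {"pci-host": 4, "plain-host": 0}

# A non-PCI instance lands on the PCI-free host...
no_pci = max(hosts, key=lambda h: pci_weight(hosts[h], False))
# ...while a PCI-requesting instance lands on the host with devices.
wants_pci = max(hosts, key=lambda h: pci_weight(hosts[h], True))
print(no_pci, wants_pci)  # plain-host pci-host
```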