Description of problem:

In a guest which has multiple NUMA nodes bound to specific host nodes, and which is also given host PCI devices, it is desirable to specify which guest NUMA node a PCI device is associated with. This is important to ensure that the guest does not use the PCI device in a way which causes it to do I/O across host NUMA nodes. This appears to involve ACPI SLIT setup in some manner, either against the PCI bus or the device; cf. http://www.osronline.com/showthread.cfm?link=241954

This feature is important for OpenStack's NFV use cases.
We finally have a POC for i440fx which models multiple host bridge extenders, each one in a different NUMA node, that works under both Windows and Linux. QEMU 2.2 is in soft freeze, so it will be part of QEMU 2.3.
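For readers unfamiliar with the device, a minimal sketch of how a PXB (PCI expander bridge) is placed on a guest NUMA node is shown below; the IDs, memory size, and host device address are illustrative only, and the exact command line used during verification appears later in this report.

    /usr/libexec/qemu-kvm -m 4G -smp 2 \
        -numa node,nodeid=0 -numa node,nodeid=1 \
        -device pxb,id=pxb1,bus_nr=4,numa_node=1 \
        -device vfio-pci,host=01:10.4,bus=pxb1,addr=0x1

Devices placed behind the pxb bridge then inherit its NUMA node, which QEMU exposes to the guest through the ACPI tables.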
The feature was accepted upstream, see https://lists.gnu.org/archive/html/qemu-devel/2015-05/msg04927.html

Thanks,
Marcel
Fix included in qemu-kvm-rhev-2.3.0-9.el7
Tested this bug with qemu-kvm-rhev-2.3.0-14.el7.x86_64.

Steps:

1. Bind a VF to the vfio driver:

   # ls /sys/bus/pci/drivers/vfio-pci/
   0000:01:10.4  bind  module  new_id  remove_id  uevent  unbind

2. Boot a RHEL 7.2 guest with the VF. Key command line:

   /usr/libexec/qemu-kvm -m 4096M -smp 2 \
     -object memory-backend-ram,host-nodes=0,policy=bind,id=mem-0,size=2048M,prealloc=yes -numa node,memdev=mem-0 \
     -object memory-backend-ram,host-nodes=1,policy=bind,id=mem-1,size=2048M,prealloc=yes -numa node,memdev=mem-1 \
     -device pxb,id=pxb,bus_nr=4,numa_node=1 \
     -device vfio-pci,host=01:10.4,id=vf1,bus=pxb,addr=0x1 -vnc :1

3. Check the PCI device's NUMA node info inside the guest:

   # lspci | grep Eth
   00:04.0 Ethernet controller: Red Hat, Inc Virtio network device
   05:01.0 Ethernet controller: Intel Corporation X540 Ethernet Controller Virtual Function (rev 01)

   # lspci -t
   -+-[0000:04]---00.0-[05]----01.0
    \-[0000:00]-+-00.0
                +-01.0
                +-01.1
                +-01.3
                +-02.0
                +-03.0
                +-04.0
                \-05.0

   # cat /sys/devices/pci0000\:04/0000\:04\:00.0/numa_node
   1

Alex, the result after step 3 indicates that the PCI device (05:01.0) is bound to the expected NUMA node. Based on this result, is this bug fixed?

Another question: how should QE understand and verify the scenario from comment 0?
"This is important to ensure that the guest does not use the PCI device in a way which causes it to do I/O across host NUMA nodes."
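As an additional sanity check (illustrative, not part of the log above), the assigned VF's own sysfs entry inside the guest can be read as well; if locality is propagated correctly it should report the same node as its parent expander bridge, and numactl confirms the guest actually sees the two NUMA nodes defined on the command line:

   # cat /sys/bus/pci/devices/0000\:05\:01.0/numa_node
   1
   # numactl --hardware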
It appears correct to me, Marcel can answer whether there's more to expect.
(In reply to Alex Williamson from comment #26)
> It appears correct to me, Marcel can answer whether there's more to expect.

Hi,

The VM configuration shows that, indeed, the device is attached to the intended NUMA node; however, the "real" test will require more.

Let me explain:

1. You need a host machine with 2 NUMA nodes and a PCI device attached to one of them. (Be sure that the physical device is connected to a host NUMA node.)

2. Create the VM with 2 NUMA nodes and map the memory and vcpus exactly as on the host. Guest NUMA node 0 will have memory from host NUMA node 0, and its vcpus will be taken from host NUMA node 0 threads (play with affinity).

3. Now create a PXB, attach the device to NUMA node 0, and run a performance test. Do the same with the PXB attached to NUMA node 1.

4. The real test is showing that performance is better when the PXB is connected to the same NUMA node as the physical device.

Please let me know if you need further info.
Thanks,
Marcel
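For concreteness, a sketch of such a pinned configuration might look like the following; the memory sizes, CPU lists, and host device address are assumptions for illustration, not the exact setup used in the later test:

    /usr/libexec/qemu-kvm -m 8G -smp 4 \
        -object memory-backend-ram,id=mem0,size=4G,host-nodes=0,policy=bind \
        -numa node,nodeid=0,cpus=0-1,memdev=mem0 \
        -object memory-backend-ram,id=mem1,size=4G,host-nodes=1,policy=bind \
        -numa node,nodeid=1,cpus=2-3,memdev=mem1 \
        -device pxb,id=pxb0,bus_nr=4,numa_node=0 \
        -device vfio-pci,host=01:10.0,bus=pxb0,addr=0x1

The vcpu threads would additionally be pinned to the matching host cores (for example with taskset or libvirt's vcpupin) so that guest node 0 really runs on host node 0, and the test repeated with numa_node=1 on the pxb device for comparison.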
(In reply to Marcel Apfelbaum from comment #27)
> (In reply to Alex Williamson from comment #26)
> > It appears correct to me, Marcel can answer whether there's more to expect.
>
> Hi,
>
> The VM configuration shows that, indeed, the device is attached to the intended NUMA node; however, the "real" test will require more.
>
> Let me explain:
> 1. You need a host machine with 2 NUMA nodes and a PCI device attached to one of them. (Be sure that the physical device is connected to a host NUMA node.)
> 2. Create the VM with 2 NUMA nodes and map the memory and vcpus exactly as on the host. Guest NUMA node 0 will have memory from host NUMA node 0, and its vcpus will be taken from host NUMA node 0 threads (play with affinity).
> 3. Now create a PXB, attach the device to NUMA node 0, and run a performance test. Do the same with the PXB attached to NUMA node 1.
> 4. The real test is showing that performance is better when the PXB is connected to the same NUMA node as the physical device.
>
> Please let me know if you need further info.
> Thanks,
> Marcel

Hi Marcel,

Thanks for the feedback. Could you please be more specific about what "performance" means here? Pass-through NIC speed in the guest, or something else?

Best Regards,
Junyi
(In reply to juzhang from comment #28)
> (In reply to Marcel Apfelbaum from comment #27)
> > (In reply to Alex Williamson from comment #26)
> > > It appears correct to me, Marcel can answer whether there's more to expect.
> >
> > Hi,
> >
> > The VM configuration shows that, indeed, the device is attached to the intended NUMA node; however, the "real" test will require more.
> >
> > Let me explain:
> > 1. You need a host machine with 2 NUMA nodes and a PCI device attached to one of them. (Be sure that the physical device is connected to a host NUMA node.)
> > 2. Create the VM with 2 NUMA nodes and map the memory and vcpus exactly as on the host. Guest NUMA node 0 will have memory from host NUMA node 0, and its vcpus will be taken from host NUMA node 0 threads (play with affinity).
> > 3. Now create a PXB, attach the device to NUMA node 0, and run a performance test. Do the same with the PXB attached to NUMA node 1.
> > 4. The real test is showing that performance is better when the PXB is connected to the same NUMA node as the physical device.
> >
> > Please let me know if you need further info.
> > Thanks,
> > Marcel
>
> Hi Marcel,
>
> Thanks for the feedback. Could you please be more specific about what "performance" means here? Pass-through NIC speed in the guest, or something else?

Exactly. Sorry for the misunderstanding. The pass-through device in the guest should perform better when it is attached to its NUMA node.

Thanks,
Marcel

> Best Regards,
> Junyi
But how do you define performance? Is there a noticeable difference in throughput of an assigned 1G NIC? Definitely not. 10G? Maybe. 40G? Maybe++. What about latency, do I get more pps with NUMA locality? Is the benefit more distinct on 4-node systems? 8-node systems? How important is other traffic between the nodes during testing? This performance analysis is a really complex issue to hand-wave as "guest should perform better" and AFAIK it hasn't been conclusively proven to this extent upstream. I don't think QE is signed up for performance analysis to this extent, so if we want QE to do a performance test we need to provide a concrete and specific example to evaluate. Do we have one?
(In reply to Alex Williamson from comment #30)
> But how do you define performance? Is there a noticeable difference in throughput of an assigned 1G NIC? Definitely not. 10G? Maybe. 40G? Maybe++. What about latency, do I get more pps with NUMA locality? Is the benefit more distinct on 4-node systems? 8-node systems? How important is other traffic between the nodes during testing? This performance analysis is a really complex issue to hand-wave as "guest should perform better" and AFAIK it hasn't been conclusively proven to this extent upstream.

From what I understood, the OpenStack community was complaining about this, so there *must* be a way to figure out whether we solved the issue. I deliberately kept the loose "guest should perform better" wording to say: we don't really want a full performance analysis, we only want to be sure that the physical device is accessed only with its native NUMA node memory/threads.

> I don't think QE is signed up for performance analysis to this extent, so if we want QE to do a performance test we need to provide a concrete and specific example to evaluate. Do we have one?

I was hoping QE is already handling this kind of performance test for physical machines. I was thinking that we could take *one of them* and adapt it to our needs. If they are not handling these, how can we check that we really solved the issue? Maybe ask the ones that opened the BZ?

Daniel, can you please comment?

Thanks,
Marcel
(In reply to Marcel Apfelbaum from comment #31)
> (In reply to Alex Williamson from comment #30)
> > But how do you define performance? Is there a noticeable difference in throughput of an assigned 1G NIC? Definitely not. 10G? Maybe. 40G? Maybe++. What about latency, do I get more pps with NUMA locality? Is the benefit more distinct on 4-node systems? 8-node systems? How important is other traffic between the nodes during testing? This performance analysis is a really complex issue to hand-wave as "guest should perform better" and AFAIK it hasn't been conclusively proven to this extent upstream.
>
> From what I understood, the OpenStack community was complaining about this, so there *must* be a way to figure out whether we solved the issue. I deliberately kept the loose "guest should perform better" wording to say: we don't really want a full performance analysis, we only want to be sure that the physical device is accessed only with its native NUMA node memory/threads.
>
> > I don't think QE is signed up for performance analysis to this extent, so if we want QE to do a performance test we need to provide a concrete and specific example to evaluate. Do we have one?
>
> I was hoping QE is already handling this kind of performance test for physical machines. I was thinking that we could take *one of them* and adapt it to our needs. If they are not handling these, how can we check that we really solved the issue? Maybe ask the ones that opened the BZ?
>
> Daniel, can you please comment?

For real bare metal deployments, having NUMA locality for PCI devices results in improved performance. For this virtualized scenario, though, performance is really not a big deal - the feature is merely intended to provide a functional method for setting up NUMA locality of PCI devices. The overhead of virtualization in general will mask most of the perf benefit of NUMA, so I don't think you need to test performance for this BZ.

For testing, it suffices to validate that the guest OS kernel sysfs and libvirt "virsh nodedev-dumpxml" report the NUMA locality for devices.
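As a rough illustration of that check (the device name and values below are hypothetical and the output is abridged), the node device XML for a PCI device includes a numa element alongside the usual address fields:

    # virsh nodedev-dumpxml pci_0000_05_01_0
    <device>
      <name>pci_0000_05_01_0</name>
      <capability type='pci'>
        <domain>0</domain>
        <bus>5</bus>
        <slot>1</slot>
        <function>0</function>
        <numa node='1'/>
      </capability>
    </device>

This complements the guest-side sysfs check already shown in the test steps above (cat .../numa_node).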
I think we (perf dept) could probably conduct a test to show the impact of not getting this "right". We have a pair of test systems with 12 x 40Gb, 6 per NUMA node (one system directly connected to another). A streaming network test in bare-metal, with all NUMA locality preserved, results in ~420Gbps. We could reproduce this with a 2-node VM, with and without the proper NUMA locality information (assuming we can do device assignment of 12 physical functions). Let me know if you want a result like this.
(In reply to Andrew Theurer from comment #33)
> I think we (perf dept) could probably conduct a test to show the impact of not getting this "right". We have a pair of test systems with 12 x 40Gb, 6 per NUMA node (one system directly connected to another). A streaming network test in bare-metal, with all NUMA locality preserved, results in ~420Gbps. We could reproduce this with a 2-node VM, with and without the proper NUMA locality information (assuming we can do device assignment of 12 physical functions). Let me know if you want a result like this.

Hi Andrew,

Thank you for stepping in. Yes, we would really like to reproduce that with a 2-node VM, locality preserved versus not preserved. I will help with the command line for the new PXB device.

In the meantime QA can classify this as 'verified'.

Thanks,
Marcel
With Marcel's help, a VM has been configured on a 2-node system with 6 network interfaces per node. I have tested the two configurations:

1) VM with mem/cpu NUMA topology, but no PCI NUMA topology
2) VM with mem/cpu/PCI NUMA topology

A streaming network test was run with all network interfaces (to another system with similar interfaces). When the VM has a PCI NUMA topology, the network throughput was 45% higher than without PCI NUMA topology.
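The exact benchmark tool is not stated in this report; a streaming test of this kind is typically something like one netperf TCP_STREAM (or iperf) instance per interface run in parallel, for example (hypothetical peer addresses):

    for ip in 192.0.2.1 192.0.2.2 192.0.2.3; do
        netperf -H "$ip" -t TCP_STREAM -l 60 &
    done
    wait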
Thanks Andrew and Marcel, great to know.

Best Regards,
Junyi
(In reply to Andrew Theurer from comment #39)
> With Marcel's help, a VM has been configured on a 2-node system with 6 network interfaces per node. I have tested the two configurations:
>
> 1) VM with mem/cpu NUMA topology, but no PCI NUMA topology
> 2) VM with mem/cpu/PCI NUMA topology
>
> A streaming network test was run with all network interfaces (to another system with similar interfaces). When the VM has a PCI NUMA topology, the network throughput was 45% higher than without PCI NUMA topology.

Wow, thanks a lot Andrew!

Marcel
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHBA-2015-2546.html