Bug 1103313 (guestNUMALocalityPCIdev)

Summary: RFE: configure guest NUMA node locality for guest PCI devices
Product: Red Hat Enterprise Linux 7
Reporter: Daniel Berrangé <berrange>
Component: qemu-kvm-rhev
Assignee: Marcel Apfelbaum <marcel>
Status: CLOSED ERRATA
QA Contact: Virtualization Bugs <virt-bugs>
Severity: unspecified
Docs Contact: Robert Krátký <rkratky>
Priority: unspecified
Version: 7.0
CC: ailan, alex.williamson, atheurer, berrange, ehabkost, hhuang, huding, juzhang, knoel, marcel, michen, mst, mtosatti, rbalakri, rkratky, virt-maint, xfu, yanghliu, ypu
Target Milestone: rc
Keywords: FutureFeature
Target Release: ---
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version: qemu-kvm-rhev-2.3.0-9.el7
Doc Type: Enhancement
Doc Text:
Extra PCI root buses now supported using PCI expander bridge devices
Unlike PCI-PCI bridges, a bus on a PCI expander bridge can be associated with a NUMA node, allowing the guest operating system to recognize the proximity of a device to RAM and CPUs. With this update, assigned devices can be associated with the corresponding NUMA node, resulting in optimal performance.
Story Points: ---
Clone Of:
: 1103314 1235381 (view as bug list)
Environment:
Last Closed: 2015-12-04 16:15:36 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On: 1076724
Bug Blocks: 1093069, 1082754, 1103314, 1154205, 1205796, 1235381, 1242479, 1249580

Description Daniel Berrangé 2014-05-30 17:38:44 UTC
Description of problem:
In a guest that has multiple NUMA nodes bound to specific host nodes and is also given host PCI devices, it is desirable to specify which guest NUMA node each PCI device in the guest is associated with. This is important to ensure that the guest does not use the PCI device in a way which causes it to do I/O across host NUMA nodes.

This appears to involve ACPI SLIT setup in some manner, either against the PCI bus or the device.

cf  http://www.osronline.com/showthread.cfm?link=241954

This feature is important for OpenStack's NFV use cases.

Comment 19 Marcel Apfelbaum 2014-11-02 12:41:14 UTC
We finally have a POC for i440fx that models multiple Host Bridge extenders, each one in a different NUMA node, and it works under both Windows and Linux.
QEMU 2.2 is in soft freeze, so it will be part of QEMU 2.3.

Comment 20 Marcel Apfelbaum 2015-06-24 11:08:24 UTC
The feature was accepted upstream, see
https://lists.gnu.org/archive/html/qemu-devel/2015-05/msg04927.html

Thanks,
Marcel

Comment 23 Miroslav Rezanina 2015-07-09 07:15:23 UTC
Fix included in qemu-kvm-rhev-2.3.0-9.el7

Comment 25 FuXiangChun 2015-08-05 02:35:11 UTC
Tested this bug with qemu-kvm-rhev-2.3.0-14.el7.x86_64.

Steps:

1. Bind the VF to the vfio-pci driver.
# ls /sys/bus/pci/drivers/vfio-pci/
0000:01:10.4  bind  module  new_id  remove_id  uevent  unbind
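
(For reference, a typical sequence to reach this state is sketched below; the 8086:1515 vendor/device ID pair for the X540 VF is an assumption.)

# lspci -n -s 0000:01:10.4                                   # confirm the VF's vendor:device ID (assumed 8086:1515)
# modprobe vfio-pci
# echo 0000:01:10.4 > /sys/bus/pci/devices/0000:01:10.4/driver/unbind
# echo 8086 1515 > /sys/bus/pci/drivers/vfio-pci/new_id      # vfio-pci then picks up the now-unbound VF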

2. Boot a RHEL 7.2 guest with the VF assigned.

Key command line:
/usr/libexec/qemu-kvm -m 4096M -smp 2 \
    -object memory-backend-ram,host-nodes=0,policy=bind,id=mem-0,size=2048M,prealloc=yes -numa node,memdev=mem-0 \
    -object memory-backend-ram,host-nodes=1,policy=bind,id=mem-1,size=2048M,prealloc=yes -numa node,memdev=mem-1 \
    -device pxb,id=pxb,bus_nr=4,numa_node=1 \
    -device vfio-pci,host=01:10.4,id=vf1,bus=pxb,addr=0x1 -vnc :1
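
(Note: for locality to actually be preserved, the pxb's numa_node= should match the host NUMA node that the physical VF sits on; a quick host-side check, as a sketch:)

# cat /sys/bus/pci/devices/0000:01:10.4/numa_node    # host node of the VF; the pxb should point at the guest node
                                                     # whose memory (and vCPUs) are bound to that same host node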

3. Check the PCI device's NUMA node information inside the guest.

# lspci|grep Eth
00:04.0 Ethernet controller: Red Hat, Inc Virtio network device
05:01.0 Ethernet controller: Intel Corporation X540 Ethernet Controller Virtual Function (rev 01)

# lspci -t
-+-[0000:04]---00.0-[05]----01.0
 \-[0000:00]-+-00.0
             +-01.0
             +-01.1
             +-01.3
             +-02.0
             +-03.0
             +-04.0
             \-05.0

# cat /sys/devices/pci0000\:04/0000\:04\:00.0/numa_node 
1
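
(Optionally, assuming numactl is installed in the guest, locality can be cross-checked from a couple of other angles; the interface name below is a guess:)

# numactl --hardware                                 # guest should show two NUMA nodes of 2048 MB each
# cat /sys/bus/pci/devices/0000:05:01.0/numa_node    # the VF itself should also report 1
# cat /sys/class/net/ens5/device/numa_node           # same check via the VF's network interface (name is a guess)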

Alex,
After step 3, the result indicates that the PCI device (05:01.0) is bound to the specified NUMA node. Based on this result, is this bug fixed?

Another question:
How should QE understand and verify the following scenario from comment 0?

"This is important to ensure that the guest does not use the PCI device in a way which causes it to do I/O across host NUMA nodes."

Comment 26 Alex Williamson 2015-08-05 02:58:05 UTC
It appears correct to me, Marcel can answer whether there's more to expect.

Comment 27 Marcel Apfelbaum 2015-08-09 14:33:05 UTC
(In reply to Alex Williamson from comment #26)
> It appears correct to me, Marcel can answer whether there's more to expect.

Hi,

The VM configuration shows that, indeed, the device is attached to the intended NUMA node, however the "real" test will require more.

Let me explain:
1. You need a host machine with 2 NUMA nodes and a PCI device attached to one of them. (Be sure that the phys device is connected to a Host NUMA node).

2. Create the VM with 2 NUMA nodes and map the memory and vcpus exactly as in host. Guest NUMA node 0 will have memory from host NUMA 0 and the vcpus will be taken from host NUMA 0 threads (play with affinity).

3. Now create a PXB and attach the device to NUMA 0 and run a performance test.
Do the same with the PXB attached to NUMA 1.

4. The real test is showing that performance is better when the pxb is connected to the same NUMA node as the phys device.
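
A minimal sketch of what steps 1-3 could look like (the device address, memory sizes, CPU assignments, and choice of benchmark are illustrative only):

# cat /sys/bus/pci/devices/0000:01:10.4/numa_node    # say this prints 0 -> the phys device sits on host node 0
# lscpu | grep 'NUMA node'                           # note which host CPUs belong to node 0 and node 1
# /usr/libexec/qemu-kvm -m 4096M -smp 2 \
    -object memory-backend-ram,host-nodes=0,policy=bind,id=mem-0,size=2048M -numa node,cpus=0,memdev=mem-0 \
    -object memory-backend-ram,host-nodes=1,policy=bind,id=mem-1,size=2048M -numa node,cpus=1,memdev=mem-1 \
    -device pxb,id=pxb,bus_nr=4,numa_node=0 \
    -device vfio-pci,host=01:10.4,bus=pxb,addr=0x1
(pin each vCPU thread to the matching host node, e.g. with taskset -pc, run a netperf/iperf pass in the guest,
then repeat with numa_node=1 on the pxb and compare the results)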

Please let me know if you need further info.
Thanks,
Marcel

Comment 28 juzhang 2015-08-10 03:56:33 UTC
(In reply to Marcel Apfelbaum from comment #27)
> (In reply to Alex Williamson from comment #26)
> > It appears correct to me, Marcel can answer whether there's more to expect.
> 
> Hi,
> 
> The VM configuration shows that, indeed, the device is attached to the
> intended NUMA node, however the "real" test will require more.
> 
> Let me explain:
> 1. You need a host machine with 2 NUMA nodes and a PCI device attached to
> one of them. (Be sure that the phys device is connected to a Host NUMA node).
> 
> 2. Create the VM with 2 NUMA nodes and map the memory and vcpus exactly as
> in host. Guest NUMA node 0 will have memory from host NUMA 0 and the vcpus
> will be taken from host NUMA 0 threads (play with affinity).
> 
> 3. Now create a PXB and attach the device to NUMA 0 and run a performance
> test.
> Do the same with the PXB attached to NUMA 1.
> 
> 4. The real test is showing that performance is better when the pxb is
> connected to the same NUMA node as the phys device.
> 
> Please let me know if you need further info.
> Thanks,
> Marcel

HI Marcel,

Thanks for the feedback. Could you please be more specific about what "performance" means here? Pass-through NIC speed in the guest, or something else?

Best Regards,
Junyi

Comment 29 Marcel Apfelbaum 2015-08-10 14:50:19 UTC
(In reply to juzhang from comment #28)
> (In reply to Marcel Apfelbaum from comment #27)
> > (In reply to Alex Williamson from comment #26)
> > > It appears correct to me, Marcel can answer whether there's more to expect.
> > 
> > Hi,
> > 
> > The VM configuration shows that, indeed, the device is attached to the
> > intended NUMA node, however the "real" test will require more.
> > 
> > Let me explain:
> > 1. You need a host machine with 2 NUMA nodes and a PCI device attached to
> > one of them. (Be sure that the phys device is connected to a Host NUMA node).
> > 
> > 2. Create the VM with 2 NUMA nodes and map the memory and vcpus exactly as
> > in host. Guest NUMA node 0 will have memory from host NUMA 0 and the vcpus
> > will be taken from host NUMA 0 threads (play with affinity).
> > 
> > 3. Now create a PXB and attach the device to NUMA 0 and run a performance
> > test.
> > Do the same with the PXB attached to NUMA 1.
> > 
> > 4. The real test is showing that performance is better when the pxb is
> > connected to the same NUMA node as the phys device.
> > 
> > Please let me know if you need further info.
> > Thanks,
> > Marcel
> 
> HI Marcel,
> 
> Thanks for the feedback. Could you please be more specific about what
> "performance" means here? Pass-through NIC speed in the guest, or something else?
Exactly, sorry for the misunderstanding.
The pass-through device in the guest should perform better if it is attached to its NUMA node.

Thanks,
Marcel

> 
> Best Regards,
> Junyi

Comment 30 Alex Williamson 2015-08-10 15:13:00 UTC
But how do you define performance?  Is there a noticeable difference in throughput of an assigned 1G NIC?  Definitely not.  10G?  Maybe.  40G?  Maybe++.  What about latency, do I get more pps with NUMA locality?  Is the benefit more distinct on 4-node systems?  8-node systems?  How important is other traffic between the nodes during testing?  This performance analysis is a really complex issue to hand-wave as "guest should perform better" and AFAIK it hasn't been conclusively proven to this extent upstream.

I don't think QE is signed up for performance analysis to this extent, so if we want QE to do a performance test we need to provide a concrete and specific example to evaluate.  Do we have one?

Comment 31 Marcel Apfelbaum 2015-08-10 16:15:45 UTC
(In reply to Alex Williamson from comment #30)
> But how do you define performance?  Is there a noticeable difference in
> throughput of an assigned 1G NIC?  Definitely not.  10G?  Maybe.  40G? 
> Maybe++.  What about latency, do I get more pps with NUMA locality?  Is the
> benefit more distinct on 4-node systems?  8-node systems?  How important is
> other traffic between the nodes during testing?  This performance analysis
> is a really complex issue to hand-wave as "guest should perform better" and
> AFAIK it hasn't been conclusively proven to this extent upstream.
From what I understood, the OpenStack community was complaining about this,
so there *must* be a way to figure out whether we solved the issue. I deliberately kept the loose "guest should perform better" wording to say: we don't really want a full performance analysis, we only want to be sure that the phys device is accessed only from its native NUMA node's memory/threads.
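
(A cheap host-side check along those lines, assuming a single qemu-kvm instance and the numactl package, would be something like:)

# numastat -p $(pidof qemu-kvm)       # per-node breakdown of the QEMU process's memory, to confirm guest RAM
                                      # stayed on the intended host nodes while the device was being exercised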


> 
> I don't think QE is signed up for performance analysis to this extent, so if
> we want QE to do a performance test we need to provide a concrete and
> specific example to evaluate.  Do we have one?
I was hoping that QE is already handling this kind of performance test for physical machines. I was thinking that we could take *one of them* and adapt it for our needs. If they are not handling it, how can we check that we really solved the issue? Maybe ask the people who opened the BZ?

Daniel, can you please comment?

Thanks,
Marcel

Comment 32 Daniel Berrangé 2015-08-10 16:38:21 UTC
(In reply to Marcel Apfelbaum from comment #31)
> (In reply to Alex Williamson from comment #30)
> > But how do you define performance?  Is there a noticeable difference in
> > throughput of an assigned 1G NIC?  Definitely not.  10G?  Maybe.  40G? 
> > Maybe++.  What about latency, do I get more pps with NUMA locality?  Is the
> > benefit more distinct on 4-node systems?  8-node systems?  How important is
> > other traffic between the nodes during testing?  This performance analysis
> > is a really complex issue to hand-wave as "guest should perform better" and
> > AFAIK it hasn't been conclusively proven to this extent upstream.
> From what I understood, the OpenStack community was complaining about this,
> so there *must* be a way to figure out whether we solved the issue. I deliberately
> kept the loose "guest should perform better" wording to say: we don't really
> want a full performance analysis, we only want to be sure that the phys
> device is accessed only from its native NUMA node's memory/threads.
> 
> 
> > 
> > I don't think QE is signed up for performance analysis to this extent, so if
> > we want QE to do a performance test we need to provide a concrete and
> > specific example to evaluate.  Do we have one?
> I was hoping that QE is already handling this kind of performance test for
> physical machines. I was thinking that we could take *one of them* and
> adapt it for our needs. If they are not handling it, how can we check
> that we really solved the issue? Maybe ask the people who opened the BZ?
> 
> Daniel, can you please comment?

For real bare metal deployments, having NUMA locality for PCI devices results in improved performance.  For this nested-virt scenario though, performance is really not a big deal - this feature is merely intended to provide a functional method for setting up NUMA locality of PCI devices. The actual performance of nested-virt in general will mask any perf benefit of NUMA, so I don't think you need to test performance for this BZ.

For testing, it suffices to validate that the guest OS kernel's sysfs and libvirt's "virsh nodedev-dumpxml" report the NUMA locality for devices.
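
(For illustration only: libvirt's nodedev XML for a PCI device carries a <numa node='N'/> element under <capability type='pci'>, so a check could look roughly like this, with a made-up device name:)

# virsh nodedev-dumpxml pci_0000_05_01_0 | grep numa
    <numa node='1'/>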

Comment 33 Andrew Theurer 2015-08-10 17:16:46 UTC
I think we (perf dept) could probably conduct a test to show the impact of not getting this "right".  We have a pair of test systems with 12 x 40Gb, 6 per NUMA node (one system directly connected to another).  A streaming network test in bare-metal, with all NUMA locality preserved, results in ~420Gbps.  We could reproduce this with a 2-node VM, with and without the proper NUMA locality information (assuming we can do device assignment of 12 physical functions).  Let me know if you want a result like this.

Comment 34 Marcel Apfelbaum 2015-08-11 04:59:42 UTC
(In reply to Andrew Theurer from comment #33)
> I think we (perf dept) could probably conduct a test to show the impact of
> not getting this "right".  We have a pair of test systems with 12 x 40Gb, 6
> per NUMA node (one system directly connected to another).  A streaming
> network test in bare-metal, with all NUMA locality preserved, results in
> ~420Gbps.  We could reproduce this with a 2-node VM, with and without the
> proper NUMA locality information (assuming we can do device assignment of 12
> physical functions).  Let me know if you want a result like this.

Hi Andrew,
Thank you for stepping in.
Yes, we would really like to reproduce that with a 2-node VM, locality preserved versus not preserved. I will help with the command line for the new PXB device.

In the meantime, QA can classify this as 'verified'.

Thanks,
Marcel

Comment 39 Andrew Theurer 2015-08-31 19:43:04 UTC
With Marcel's help, a VM has been configured on a 2-node system with 6 network interfaces per node.  I have tested the two configurations:

1) VM with mem/cpu NUMA topology, but no pci NUMA topology
2) VM with mem/cpu/pci NUMA topology

A streaming network test was run with all network interfaces (to another system with similar interfaces).  When the VM has a PCI NUMA topology, the network throughput was 45% higher than without PCI NUMA topology.
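
(The comment does not say which tool was used; a rough sketch of a streaming test of this shape, with assumed addressing and iperf3 as a stand-in, might be:)

(receiver side: one server per interface)
# for i in $(seq 1 12); do iperf3 -s -p $((5200 + i)) -D; done
(VM side: one stream per assigned interface, each aimed at the address reachable over that interface)
# for i in $(seq 1 12); do iperf3 -c 192.168.$i.1 -p $((5200 + i)) -t 60 -P 4 & done; wait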

Comment 40 juzhang 2015-09-01 02:25:07 UTC
Thanks Andrew and Marcel, great to know.

Best Regards,
Junyi

Comment 41 Marcel Apfelbaum 2015-09-01 11:55:25 UTC
(In reply to Andrew Theurer from comment #39)
> With Marcel's help, a VM has been configured on a 2-node system with 6
> network interfaces per node.  I have tested the two configurations:
> 
> 1) VM with mem/cpu NUMA topology, but no pci NUMA topology
> 2) VM with mem/cpu/pci NUMA topology
> 
> A streaming network test was run with all network interfaces (to another
> system with similar interfaces).  When the VM has a PCI NUMA topology, the
> network throughput was 45% higher than without PCI NUMA topology.
WOW

Thanks a lot Andrew!
Marcel

Comment 43 errata-xmlrpc 2015-12-04 16:15:36 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHBA-2015-2546.html