Red Hat Bugzilla – Bug 1093127
RFE: report NUMA node locality for PCI devices
Last modified: 2016-04-26 11:20:43 EDT
Description of problem:

PCI devices are associated with specific NUMA nodes, so if a guest is pinned to one NUMA node and a device assigned to it is on another NUMA node, DMA operations will travel across nodes. This is sub-optimal for overall performance and utilization. Libvirt's node device APIs need to report the NUMA node locality for all PCI devices. This is visible in sysfs via the files:

/sys/devices/pci*/*/numa_node

This feature is needed by OpenStack in order to do intelligent guest placement with SR-IOV devices for NFV use cases.
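To illustrate the sysfs interface described above, here is a minimal sketch (my own illustration, not from this bug report) that walks all PCI devices and prints their reported locality. It assumes only the standard Linux sysfs layout, where /sys/bus/pci/devices/* are symlinks to the same device directories matched by the glob above:

#!/usr/bin/env python
# Minimal sketch: list the NUMA locality of every PCI device via sysfs.
# A value of -1 means the firmware did not report a locality.
import glob
import os

for dev in sorted(glob.glob('/sys/bus/pci/devices/*')):
    numa_file = os.path.join(dev, 'numa_node')
    try:
        with open(numa_file) as f:
            node = f.read().strip()
    except IOError:
        continue  # attribute missing on very old kernels
    print('%s -> NUMA node %s' % (os.path.basename(dev), node))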
I've just proposed patches upstream: https://www.redhat.com/archives/libvir-list/2014-May/msg00991.html

The requested info is to be seen in the device XML (virsh nodedev-dumpxml):

<device>
  <name>pci_0000_00_1c_1</name>
  <path>/sys/devices/pci0000:00/0000:00:1c.1</path>
  <parent>computer</parent>
  ...
  <capability type='pci'>
    ...
    <numa node='2'/>
    ...
  </capability>
</device>
This time as a standalone patch: https://www.redhat.com/archives/libvir-list/2014-June/msg00352.html
I've just pushed patches upstream:

commit 1c7027788678c3ce0e41eb937d71ede33418b6b9
Author:     Michal Privoznik <mprivozn@redhat.com>
AuthorDate: Wed May 7 18:07:12 2014 +0200
Commit:     Michal Privoznik <mprivozn@redhat.com>
CommitDate: Fri Jun 6 15:10:57 2014 +0200

    nodedev: Export NUMA node locality for PCI devices

    A PCI device can be associated with a specific NUMA node. Later, when
    a guest is pinned to one NUMA node, the PCI device it is assigned may
    live on a different NUMA node. This makes DMA transfers travel across
    nodes and thus results in suboptimal performance. We should expose the
    NUMA node locality for PCI devices so management applications can make
    better decisions.

    Signed-off-by: Michal Privoznik <mprivozn@redhat.com>

v1.2.5-69-g1c70277

The info can be seen in nodedev-dumpxml output:

  <device>
    <name>pci_1002_71c4</name>
    <parent>pci_8086_27a1</parent>
    <capability type='pci'>
      <domain>0</domain>
      <bus>1</bus>
      <slot>0</slot>
      <function>0</function>
      <product id='0x71c4'>M56GL [Mobility FireGL V5200]</product>
      <vendor id='0x1002'>ATI Technologies Inc</vendor>
+     <numa node='1'/>
    </capability>
  </device>

If there's no NUMA node associated with the device, no <numa/> element is reported.
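For completeness, a usage sketch (my own illustration, not part of the patch) of how a management application could read this element through the libvirt Python bindings; the connection URI and the device name pci_0000_04_00_1 are placeholders:

#!/usr/bin/env python
# Sketch: read the <numa node=.../> locality of a PCI node device
# through the libvirt node device API.
import xml.etree.ElementTree as ET
import libvirt

conn = libvirt.open('qemu:///system')        # placeholder URI
dev = conn.nodeDeviceLookupByName('pci_0000_04_00_1')  # placeholder name
numa = ET.fromstring(dev.XMLDesc(0)).find("./capability[@type='pci']/numa")
if numa is not None:
    print('device is local to NUMA node %s' % numa.get('node'))
else:
    print('no NUMA locality reported for this device')
conn.close()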
Hi Michal,

Why is the sysfs file 'numa_node' always '-1', even for a passed-through NIC?

# uname -r
3.10.0-160.el7.x86_64
# ll /sys/devices/pci0000:40/0000:40:0b.0/0000:44:00.1/driver
lrwxrwxrwx. 1 root root 0 Sep 17 06:36 /sys/devices/pci0000:40/0000:40:0b.0/0000:44:00.1/driver -> ../../../../bus/pci/drivers/vfio-pci
# cat /sys/devices/pci0000:40/0000:40:0b.0/0000:44:00.1/numa_node
-1

Is that a kernel problem?
(In reply to Jincheng Miao from comment #7)
> Hi Michal,
>
> Why is the sysfs file 'numa_node' always '-1', even for a passed-through NIC?

The data that this file reports comes from the BIOS. The majority of hardware that exists today has a BIOS that does not report the data. If you are looking at this inside a QEMU/KVM guest, it will definitely be missing, since QEMU/KVM don't report it (see bug 1103313). So you need to test this using real hardware, and find a machine which actually supports it. Unfortunately I don't know which specific hardware to recommend using.
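One rough way to check whether the firmware describes NUMA topology at all (a heuristic of my own, not from this bug report): on x86 the kernel derives PCI locality from ACPI, and a machine whose firmware exports no SRAT table will typically report -1 everywhere:

#!/usr/bin/env python
# Heuristic sketch (assumption, not from this bug report): if the
# firmware exports no ACPI SRAT table, there is no NUMA topology
# description and numa_node is expected to stay -1.
import os

if os.path.exists('/sys/firmware/acpi/tables/SRAT'):
    print('firmware exports an SRAT table; NUMA locality may be available')
else:
    print('no SRAT table exported; expect numa_node == -1 on this machine')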
Did some testing for the bug on both NUMA and UMA machines.

<1> On NUMA machine:

[root@ibm-x3850x5-06 ~]# rpm -q libvirt kernel
libvirt-1.2.8-5.el7.x86_64
kernel-3.10.0-187.el7.x86_64

[root@ibm-x3850x5-06 ~]# numactl --hardware
available: 2 nodes (0-1)
node 0 cpus: 0 1 2 3 4 5 6 7 16 17 18 19 20 21 22 23
node 0 size: 16362 MB
node 0 free: 15517 MB
node 1 cpus: 8 9 10 11 12 13 14 15 24 25 26 27 28 29 30 31
node 1 size: 16384 MB
node 1 free: 15696 MB
node distances:
node   0   1
  0:  10  11
  1:  11  10

[root@ibm-x3850x5-06 ~]# lspci -s 04:00.1 -v
04:00.1 Ethernet controller: Broadcom Corporation NetXtreme II BCM5709 Gigabit Ethernet (rev 20)
        Subsystem: IBM Device 03b5
        Flags: bus master, fast devsel, latency 0, IRQ 40
        Memory at 92000000 (64-bit, non-prefetchable) [size=32M]
        Capabilities: [48] Power Management version 3
        Capabilities: [50] Vital Product Data
        Capabilities: [58] MSI: Enable- Count=1/16 Maskable- 64bit+
        Capabilities: [a0] MSI-X: Enable+ Count=9 Masked-
        Capabilities: [ac] Express Endpoint, MSI 00
        Capabilities: [100] Device Serial Number 5c-f3-fc-ff-fe-dc-10-be
        Capabilities: [110] Advanced Error Reporting
        Capabilities: [150] Power Budgeting <?>
        Capabilities: [160] Virtual Channel
        Kernel driver in use: bnx2

[root@ibm-x3850x5-06 ~]# virsh nodedev-dumpxml pci_0000_04_00_1
<device>
  <name>pci_0000_04_00_1</name>
  <path>/sys/devices/pci0000:00/0000:00:01.0/0000:04:00.1</path>
  <parent>pci_0000_00_01_0</parent>
  <driver>
    <name>bnx2</name>
  </driver>
  <capability type='pci'>
    <domain>0</domain>
    <bus>4</bus>
    <slot>0</slot>
    <function>1</function>
    <product id='0x1639'>NetXtreme II BCM5709 Gigabit Ethernet</product>
    <vendor id='0x14e4'>Broadcom Corporation</vendor>
    <iommuGroup number='15'>
      <address domain='0x0000' bus='0x04' slot='0x00' function='0x0'/>
      <address domain='0x0000' bus='0x04' slot='0x00' function='0x1'/>
    </iommuGroup>
    <numa node='0'/>
    <pci-express>
      <link validity='cap' port='0' speed='5' width='4'/>
      <link validity='sta' speed='5' width='4'/>
    </pci-express>
  </capability>
</device>

[root@ibm-x3850x5-06 ~]# cat /sys/devices/pci0000\:00/0000\:00\:01.0/0000\:04\:00.1/numa_node
0
[root@ibm-x3850x5-06 ~]# ll /sys/devices/pci0000\:00/0000\:00\:01.0/0000\:04\:00.1/driver
lrwxrwxrwx. 1 root root 0 Oct 13 15:50 /sys/devices/pci0000:00/0000:00:01.0/0000:04:00.1/driver -> ../../../../bus/pci/drivers/bnx2

[root@ibm-x3850x5-06 ~]# virsh nodedev-detach pci_0000_04_00_1
Device pci_0000_04_00_1 detached

[root@ibm-x3850x5-06 ~]# virsh nodedev-dumpxml pci_0000_04_00_1
<device>
  <name>pci_0000_04_00_1</name>
  <path>/sys/devices/pci0000:00/0000:00:01.0/0000:04:00.1</path>
  <parent>pci_0000_00_01_0</parent>
  <driver>
    <name>vfio-pci</name>
  </driver>
  <capability type='pci'>
    <domain>0</domain>
    <bus>4</bus>
    <slot>0</slot>
    <function>1</function>
    <product id='0x1639'>NetXtreme II BCM5709 Gigabit Ethernet</product>
    <vendor id='0x14e4'>Broadcom Corporation</vendor>
    <iommuGroup number='15'>
      <address domain='0x0000' bus='0x04' slot='0x00' function='0x0'/>
      <address domain='0x0000' bus='0x04' slot='0x00' function='0x1'/>
    </iommuGroup>
    <numa node='0'/>
    <pci-express>
      <link validity='cap' port='0' speed='5' width='4'/>
      <link validity='sta' speed='5' width='4'/>
    </pci-express>
  </capability>
</device>

[root@ibm-x3850x5-06 ~]# ll /sys/devices/pci0000\:00/0000\:00\:01.0/0000\:04\:00.1/driver
lrwxrwxrwx. 1 root root 0 Oct 14 10:37 /sys/devices/pci0000:00/0000:00:01.0/0000:04:00.1/driver -> ../../../../bus/pci/drivers/vfio-pci
[root@ibm-x3850x5-06 ~]# cat /sys/devices/pci0000\:00/0000\:00\:01.0/0000\:04\:00.1/numa_node
0

Checking another device on node 1:

[root@ibm-x3850x5-06 ~]# virsh nodedev-dumpxml pci_0000_80_16_7
<device>
  <name>pci_0000_80_16_7</name>
  <path>/sys/devices/pci0000:80/0000:80:16.7</path>
  <parent>computer</parent>
  <driver>
    <name>ioatdma</name>
  </driver>
  <capability type='pci'>
    <domain>0</domain>
    <bus>128</bus>
    <slot>22</slot>
    <function>7</function>
    <product id='0x342c'>5520/5500/X58 Chipset QuickData Technology Device</product>
    <vendor id='0x8086'>Intel Corporation</vendor>
    <iommuGroup number='25'>
      <address domain='0x0000' bus='0x80' slot='0x16' function='0x0'/>
      <address domain='0x0000' bus='0x80' slot='0x16' function='0x1'/>
      <address domain='0x0000' bus='0x80' slot='0x16' function='0x2'/>
      <address domain='0x0000' bus='0x80' slot='0x16' function='0x3'/>
      <address domain='0x0000' bus='0x80' slot='0x16' function='0x4'/>
      <address domain='0x0000' bus='0x80' slot='0x16' function='0x5'/>
      <address domain='0x0000' bus='0x80' slot='0x16' function='0x6'/>
      <address domain='0x0000' bus='0x80' slot='0x16' function='0x7'/>
    </iommuGroup>
    <numa node='1'/>
    <pci-express/>
  </capability>
</device>

<2> On UMA machine:

[root@localhost ~]# rpm -q libvirt kernel
libvirt-1.2.8-5.el7.x86_64
kernel-3.10.0-138.el7.x86_64
kernel-3.10.0-121.el7.x86_64

[root@localhost ~]# numactl --hardware
available: 1 nodes (0)
node 0 cpus: 0 1 2 3 4 5 6 7
node 0 size: 8066 MB
node 0 free: 7209 MB
node distances:
node   0
  0:  10

[root@localhost ~]# virsh nodedev-dumpxml pci_0000_02_00_0
<device>
  <name>pci_0000_02_00_0</name>
  <path>/sys/devices/pci0000:00/0000:00:1e.0/0000:02:00.0</path>
  <parent>pci_0000_00_1e_0</parent>
  <driver>
    <name>e1000</name>
  </driver>
  <capability type='pci'>
    <domain>0</domain>
    <bus>2</bus>
    <slot>0</slot>
    <function>0</function>
    <product id='0x107c'>82541PI Gigabit Ethernet Controller</product>
    <vendor id='0x8086'>Intel Corporation</vendor>
    <iommuGroup number='9'>
      <address domain='0x0000' bus='0x00' slot='0x1e' function='0x0'/>
      <address domain='0x0000' bus='0x02' slot='0x00' function='0x0'/>
    </iommuGroup>
  </capability>
</device>

[root@localhost ~]# cat /sys/devices/pci0000\:00/0000\:00\:1e.0/0000\:02\:00.0/numa_node
-1

[root@localhost ~]# virsh nodedev-detach pci_0000_02_00_0
Device pci_0000_02_00_0 detached

[root@localhost ~]# virsh nodedev-dumpxml pci_0000_02_00_0
<device>
  <name>pci_0000_02_00_0</name>
  <path>/sys/devices/pci0000:00/0000:00:1e.0/0000:02:00.0</path>
  <parent>pci_0000_00_1e_0</parent>
  <driver>
    <name>vfio-pci</name>
  </driver>
  <capability type='pci'>
    <domain>0</domain>
    <bus>2</bus>
    <slot>0</slot>
    <function>0</function>
    <product id='0x107c'>82541PI Gigabit Ethernet Controller</product>
    <vendor id='0x8086'>Intel Corporation</vendor>
    <iommuGroup number='9'>
      <address domain='0x0000' bus='0x00' slot='0x1e' function='0x0'/>
      <address domain='0x0000' bus='0x02' slot='0x00' function='0x0'/>
    </iommuGroup>
  </capability>
</device>

[root@localhost ~]# cat /sys/devices/pci0000\:00/0000\:00\:1e.0/0000\:02\:00.0/numa_node
-1

Questions:
1. For my UMA machine, the numa_node is "-1"; is that an unsupported BIOS? If yes, maybe I hit the issue mentioned in comment 7 and comment 8.
2. According to the above testing results, is this enough for verifying the bug?

Thanks.
(In reply to Hu Jianwei from comment #9)
> Did some testing for the bug on both NUMA and UMA machines.
>
> <1> On NUMA machine:
>
> Checking another device on node 1:
>
> [root@ibm-x3850x5-06 ~]# virsh nodedev-dumpxml pci_0000_80_16_7
> <device>
>   <name>pci_0000_80_16_7</name>
>   <path>/sys/devices/pci0000:80/0000:80:16.7</path>
>   <parent>computer</parent>
>   <driver>
>     <name>ioatdma</name>
>   </driver>
>   <capability type='pci'>
>     <domain>0</domain>
>     <bus>128</bus>
>     <slot>22</slot>
>     <function>7</function>
>     <product id='0x342c'>5520/5500/X58 Chipset QuickData Technology Device</product>
>     <vendor id='0x8086'>Intel Corporation</vendor>
>     <iommuGroup number='25'>
>       <address domain='0x0000' bus='0x80' slot='0x16' function='0x0'/>
>       <address domain='0x0000' bus='0x80' slot='0x16' function='0x1'/>
>       <address domain='0x0000' bus='0x80' slot='0x16' function='0x2'/>
>       <address domain='0x0000' bus='0x80' slot='0x16' function='0x3'/>
>       <address domain='0x0000' bus='0x80' slot='0x16' function='0x4'/>
>       <address domain='0x0000' bus='0x80' slot='0x16' function='0x5'/>
>       <address domain='0x0000' bus='0x80' slot='0x16' function='0x6'/>
>       <address domain='0x0000' bus='0x80' slot='0x16' function='0x7'/>
>     </iommuGroup>
>     <numa node='1'/>
>     <pci-express/>
>   </capability>
> </device>

Hey, that's awesome, you've got a machine that is true NUMA. And that shows the code works.

> <2> On UMA machine:
> [root@localhost ~]# rpm -q libvirt kernel
> libvirt-1.2.8-5.el7.x86_64
> kernel-3.10.0-138.el7.x86_64
> kernel-3.10.0-121.el7.x86_64
>
> [root@localhost ~]# numactl --hardware
> available: 1 nodes (0)
> node 0 cpus: 0 1 2 3 4 5 6 7
> node 0 size: 8066 MB
> node 0 free: 7209 MB
> node distances:
> node   0
>   0:  10
>
> [root@localhost ~]# virsh nodedev-dumpxml pci_0000_02_00_0
> <device>
>   <name>pci_0000_02_00_0</name>
>   <path>/sys/devices/pci0000:00/0000:00:1e.0/0000:02:00.0</path>
>   <parent>pci_0000_00_1e_0</parent>
>   <driver>
>     <name>e1000</name>
>   </driver>
>   <capability type='pci'>
>     <domain>0</domain>
>     <bus>2</bus>
>     <slot>0</slot>
>     <function>0</function>
>     <product id='0x107c'>82541PI Gigabit Ethernet Controller</product>
>     <vendor id='0x8086'>Intel Corporation</vendor>
>     <iommuGroup number='9'>
>       <address domain='0x0000' bus='0x00' slot='0x1e' function='0x0'/>
>       <address domain='0x0000' bus='0x02' slot='0x00' function='0x0'/>
>     </iommuGroup>
>   </capability>
> </device>

The <numa node=''/> element is missing here. Correct.

> [root@localhost ~]# cat /sys/devices/pci0000\:00/0000\:00\:1e.0/0000\:02\:00.0/numa_node
> -1
>
> Questions:
> 1. For my UMA machine, the numa_node is "-1"; is that an unsupported BIOS? If
> yes, maybe I hit the issue mentioned in comment 7 and comment 8.

Yeah, that's exactly the case.

> 2. According to the above testing results, is this enough for verifying the bug?

Yes, it's exactly what we need.
According to comments 9 and 10, moving to Verified.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHSA-2015-0323.html