Bug 1093127 - (libvirt-numa-locality-for-pci) RFE: report NUMA node locality for PCI devices
Status: CLOSED ERRATA
Product: Red Hat Enterprise Linux 7
Classification: Red Hat
Component: libvirt
Version: 7.0
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: unspecified
Target Milestone: rc
Target Release: ---
Assigned To: Michal Privoznik
QA Contact: Virtualization Bugs
Keywords: FutureFeature, Upstream
Depends On:
Blocks: 1113520 1078542 1134746
Reported: 2014-04-30 12:47 EDT by Daniel Berrange
Modified: 2016-04-26 11:20 EDT
CC: 10 users

See Also:
Fixed In Version: libvirt-1.2.7-1.el7
Doc Type: Enhancement
Doc Text:
Feature: Report NUMA node locality for PCI devices. Reason: When starting a new domain, it is crucial to know the host NUMA topology and the NUMA node affiliation of PCI devices, so that when PCI passthrough is requested the guest is pinned onto the correct NUMA nodes. It is suboptimal if the guest is pinned onto, say, nodes 0-1 while the PCI device is affiliated with node 2, since data transfers between nodes take additional time. Result: The node device XML was enhanced to export each PCI device's affiliation with a NUMA node.
Story Points: ---
Clone Of:
Clones: 1134746
Environment:
Last Closed: 2015-03-05 02:35:00 EST
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---


External Trackers
Tracker: Red Hat Product Errata RHSA-2015:0323
Priority: normal
Status: SHIPPED_LIVE
Summary: Low: libvirt security, bug fix, and enhancement update
Last Updated: 2015-03-05 07:10:54 EST

Description Daniel Berrange 2014-04-30 12:47:40 EDT
Description of problem:
PCI devices are associated with specific NUMA nodes, so if a guest is pinned to one NUMA node while a device assigned to it is on another NUMA node, DMA operations are going to travel across nodes. This is sub-optimal for overall performance and utilization.

Libvirt's node device APIs need to report the NUMA node locality for all PCI devices. This is visible in sysfs via the files:

  /sys/devices/pci*/*/numa_node


This feature is needed by OpenStack in order to do intelligent guest placement with SR-IOV devices for NFV use cases.
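
For illustration, a minimal sketch that walks the standard /sys/bus/pci/devices layout and prints the firmware-reported NUMA node for each PCI device (a value of -1 means no locality was reported):

  # List the reported NUMA node for every PCI device via sysfs.
  for f in /sys/bus/pci/devices/*/numa_node; do
      dev=${f%/numa_node}
      printf '%s -> node %s\n' "${dev##*/}" "$(cat "$f")"
  done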
Comment 3 Michal Privoznik 2014-05-29 05:13:13 EDT
I've just proposed patches upstream:

https://www.redhat.com/archives/libvir-list/2014-May/msg00991.html

The requested info can be seen in the device XML (virsh nodedev-dumpxml):

<device>
  <name>pci_0000_00_1c_1</name>
  <path>/sys/devices/pci0000:00/0000:00:1c.1</path>
  <parent>computer</parent>
  ...
  <capability type='pci'>
    ...
    <numa node='2'/>
    ...
  </capability>
</device>
Comment 4 Michal Privoznik 2014-06-06 07:10:24 EDT
This time as a standalone patch:

https://www.redhat.com/archives/libvir-list/2014-June/msg00352.html
Comment 5 Michal Privoznik 2014-06-06 09:31:46 EDT
I've just pushed patches upstream:

commit 1c7027788678c3ce0e41eb937d71ede33418b6b9
Author:     Michal Privoznik <mprivozn@redhat.com>
AuthorDate: Wed May 7 18:07:12 2014 +0200
Commit:     Michal Privoznik <mprivozn@redhat.com>
CommitDate: Fri Jun 6 15:10:57 2014 +0200

    nodedev: Export NUMA node locality for PCI devices
    
    A PCI device can be associated with a specific NUMA node. Later, when
    a guest is pinned to one NUMA node the PCI device can be assigned on
    different NUMA node. This makes DMA transfers travel across nodes and
    thus results in suboptimal performance. We should expose the NUMA node
    locality for PCI devices so management applications can make better
    decisions.
    
    Signed-off-by: Michal Privoznik <mprivozn@redhat.com>

v1.2.5-69-g1c70277


The info can be seen in nodedev-dumpxml output:
 <device>
   <name>pci_1002_71c4</name>
   <parent>pci_8086_27a1</parent>
   <capability type='pci'>
     <domain>0</domain>
     <bus>1</bus>
     <slot>0</slot>
     <function>0</function>
     <product id='0x71c4'>M56GL [Mobility FireGL V5200]</product>
     <vendor id='0x1002'>ATI Technologies Inc</vendor>
+    <numa node='1'/>
   </capability>
 </device>

If there's no NUMA node associated with the device, no <numa/> element is reported.
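
As a rough sketch of how a management application (or a quick shell check) can consume the new element, the node attribute can be pulled out of the nodedev XML with xmllint; the device name below is just the example from the XML above:

  # Print the NUMA node attribute from the node device XML; empty output
  # means the device has no <numa/> element, i.e. no locality was reported.
  virsh nodedev-dumpxml pci_1002_71c4 \
      | xmllint --xpath 'string(//capability/numa/@node)' -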
Comment 7 Jincheng Miao 2014-09-17 06:38:17 EDT
Hi Michal

Why is the sysfs file 'numa_node' always '-1', even for a passed-through NIC?

# uname -r
3.10.0-160.el7.x86_64

# ll /sys/devices/pci0000:40/0000:40:0b.0/0000:44:00.1/driver
lrwxrwxrwx. 1 root root 0 Sep 17 06:36 /sys/devices/pci0000:40/0000:40:0b.0/0000:44:00.1/driver -> ../../../../bus/pci/drivers/vfio-pci

# cat /sys/devices/pci0000:40/0000:40:0b.0/0000:44:00.1/numa_node
-1

Is that a kernel problem?
Comment 8 Daniel Berrange 2014-09-17 06:46:01 EDT
(In reply to Jincheng Miao from comment #7)
> Hi Michal
> 
> Why is the sysfs file 'numa_node' always '-1', even for a passed-through NIC?

The data that this file reports comes from the BIOS. The majority of hardware that exists today has a BIOS that does not report the data.  If you are looking at this inside a QEMU/KVM guest, it will definitely be missing since QEMU/KVM don't report it (see bug 1103313).

So you need to test this using real hardware, and find a machine which actually supports it. Unfortunately I don't know which specific hardware to recommend using.
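
A quick way to check whether a given host's firmware reports PCI locality at all is to scan the sysfs values and count the devices whose numa_node is something other than -1 (a rough sketch, assuming the standard sysfs layout):

  # Count PCI devices for which the firmware reports a NUMA node (!= -1);
  # zero means this machine is not suitable for testing the feature.
  grep -L '^-1$' /sys/bus/pci/devices/*/numa_node | wc -l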
Comment 9 Hu Jianwei 2014-10-14 00:17:00 EDT
Did some testing for the bug on both NUMA and UMA machines.

<1> On NUMA machine:
[root@ibm-x3850x5-06 ~]# rpm -q libvirt kernel
libvirt-1.2.8-5.el7.x86_64
kernel-3.10.0-187.el7.x86_64

[root@ibm-x3850x5-06 ~]# numactl --hardware
available: 2 nodes (0-1)
node 0 cpus: 0 1 2 3 4 5 6 7 16 17 18 19 20 21 22 23
node 0 size: 16362 MB
node 0 free: 15517 MB
node 1 cpus: 8 9 10 11 12 13 14 15 24 25 26 27 28 29 30 31
node 1 size: 16384 MB
node 1 free: 15696 MB
node distances:
node   0   1 
  0:  10  11 
  1:  11  10

[root@ibm-x3850x5-06 ~]# lspci -s 04:00.1 -v
04:00.1 Ethernet controller: Broadcom Corporation NetXtreme II BCM5709 Gigabit Ethernet (rev 20)
	Subsystem: IBM Device 03b5
	Flags: bus master, fast devsel, latency 0, IRQ 40
	Memory at 92000000 (64-bit, non-prefetchable) [size=32M]
	Capabilities: [48] Power Management version 3
	Capabilities: [50] Vital Product Data
	Capabilities: [58] MSI: Enable- Count=1/16 Maskable- 64bit+
	Capabilities: [a0] MSI-X: Enable+ Count=9 Masked-
	Capabilities: [ac] Express Endpoint, MSI 00
	Capabilities: [100] Device Serial Number 5c-f3-fc-ff-fe-dc-10-be
	Capabilities: [110] Advanced Error Reporting
	Capabilities: [150] Power Budgeting <?>
	Capabilities: [160] Virtual Channel
	Kernel driver in use: bnx2

[root@ibm-x3850x5-06 ~]# virsh nodedev-dumpxml pci_0000_04_00_1
<device>
  <name>pci_0000_04_00_1</name>
  <path>/sys/devices/pci0000:00/0000:00:01.0/0000:04:00.1</path>
  <parent>pci_0000_00_01_0</parent>
  <driver>
    <name>bnx2</name>
  </driver>
  <capability type='pci'>
    <domain>0</domain>
    <bus>4</bus>
    <slot>0</slot>
    <function>1</function>
    <product id='0x1639'>NetXtreme II BCM5709 Gigabit Ethernet</product>
    <vendor id='0x14e4'>Broadcom Corporation</vendor>
    <iommuGroup number='15'>
      <address domain='0x0000' bus='0x04' slot='0x00' function='0x0'/>
      <address domain='0x0000' bus='0x04' slot='0x00' function='0x1'/>
    </iommuGroup>
    <numa node='0'/>
    <pci-express>
      <link validity='cap' port='0' speed='5' width='4'/>
      <link validity='sta' speed='5' width='4'/>
    </pci-express>
  </capability>
</device>


[root@ibm-x3850x5-06 ~]# cat /sys/devices/pci0000\:00/0000\:00\:01.0/0000\:04\:00.1/numa_node 
0
[root@ibm-x3850x5-06 ~]# ll /sys/devices/pci0000\:00/0000\:00\:01.0/0000\:04\:00.1/driver
lrwxrwxrwx. 1 root root 0 Oct 13 15:50 /sys/devices/pci0000:00/0000:00:01.0/0000:04:00.1/driver -> ../../../../bus/pci/drivers/bnx2

[root@ibm-x3850x5-06 ~]# virsh nodedev-detach pci_0000_04_00_1
Device pci_0000_04_00_1 detached

[root@ibm-x3850x5-06 ~]# virsh nodedev-dumpxml pci_0000_04_00_1
<device>
  <name>pci_0000_04_00_1</name>
  <path>/sys/devices/pci0000:00/0000:00:01.0/0000:04:00.1</path>
  <parent>pci_0000_00_01_0</parent>
  <driver>
    <name>vfio-pci</name>
  </driver>
  <capability type='pci'>
    <domain>0</domain>
    <bus>4</bus>
    <slot>0</slot>
    <function>1</function>
    <product id='0x1639'>NetXtreme II BCM5709 Gigabit Ethernet</product>
    <vendor id='0x14e4'>Broadcom Corporation</vendor>
    <iommuGroup number='15'>
      <address domain='0x0000' bus='0x04' slot='0x00' function='0x0'/>
      <address domain='0x0000' bus='0x04' slot='0x00' function='0x1'/>
    </iommuGroup>
    <numa node='0'/>
    <pci-express>
      <link validity='cap' port='0' speed='5' width='4'/>
      <link validity='sta' speed='5' width='4'/>
    </pci-express>
  </capability>
</device>

[root@ibm-x3850x5-06 ~]# ll /sys/devices/pci0000\:00/0000\:00\:01.0/0000\:04\:00.1/driver
lrwxrwxrwx. 1 root root 0 Oct 14 10:37 /sys/devices/pci0000:00/0000:00:01.0/0000:04:00.1/driver -> ../../../../bus/pci/drivers/vfio-pci
[root@ibm-x3850x5-06 ~]# cat /sys/devices/pci0000\:00/0000\:00\:01.0/0000\:04\:00.1/numa_node 
0

Checking another device on node 1
[root@ibm-x3850x5-06 ~]# virsh nodedev-dumpxml pci_0000_80_16_7
<device>
  <name>pci_0000_80_16_7</name>
  <path>/sys/devices/pci0000:80/0000:80:16.7</path>
  <parent>computer</parent>
  <driver>
    <name>ioatdma</name>
  </driver>
  <capability type='pci'>
    <domain>0</domain>
    <bus>128</bus>
    <slot>22</slot>
    <function>7</function>
    <product id='0x342c'>5520/5500/X58 Chipset QuickData Technology Device</product>
    <vendor id='0x8086'>Intel Corporation</vendor>
    <iommuGroup number='25'>
      <address domain='0x0000' bus='0x80' slot='0x16' function='0x0'/>
      <address domain='0x0000' bus='0x80' slot='0x16' function='0x1'/>
      <address domain='0x0000' bus='0x80' slot='0x16' function='0x2'/>
      <address domain='0x0000' bus='0x80' slot='0x16' function='0x3'/>
      <address domain='0x0000' bus='0x80' slot='0x16' function='0x4'/>
      <address domain='0x0000' bus='0x80' slot='0x16' function='0x5'/>
      <address domain='0x0000' bus='0x80' slot='0x16' function='0x6'/>
      <address domain='0x0000' bus='0x80' slot='0x16' function='0x7'/>
    </iommuGroup>
    <numa node='1'/>
    <pci-express/>
  </capability>
</device>

<2> On UMA machine:
[root@localhost ~]# rpm -q libvirt kernel
libvirt-1.2.8-5.el7.x86_64
kernel-3.10.0-138.el7.x86_64
kernel-3.10.0-121.el7.x86_64

[root@localhost ~]# numactl --hardware
available: 1 nodes (0)
node 0 cpus: 0 1 2 3 4 5 6 7
node 0 size: 8066 MB
node 0 free: 7209 MB
node distances:
node   0 
  0:  10 
[root@localhost ~]# virsh nodedev-dumpxml pci_0000_02_00_0
<device>
  <name>pci_0000_02_00_0</name>
  <path>/sys/devices/pci0000:00/0000:00:1e.0/0000:02:00.0</path>
  <parent>pci_0000_00_1e_0</parent>
  <driver>
    <name>e1000</name>
  </driver>
  <capability type='pci'>
    <domain>0</domain>
    <bus>2</bus>
    <slot>0</slot>
    <function>0</function>
    <product id='0x107c'>82541PI Gigabit Ethernet Controller</product>
    <vendor id='0x8086'>Intel Corporation</vendor>
    <iommuGroup number='9'>
      <address domain='0x0000' bus='0x00' slot='0x1e' function='0x0'/>
      <address domain='0x0000' bus='0x02' slot='0x00' function='0x0'/>
    </iommuGroup>
  </capability>
</device>


[root@localhost ~]# cat /sys/devices/pci0000\:00/0000\:00\:1e.0/0000\:02\:00.0/numa_node
-1
[root@localhost ~]# virsh nodedev-detach pci_0000_02_00_0
Device pci_0000_02_00_0 detached

[root@localhost ~]# virsh nodedev-dumpxml pci_0000_02_00_0
<device>
  <name>pci_0000_02_00_0</name>
  <path>/sys/devices/pci0000:00/0000:00:1e.0/0000:02:00.0</path>
  <parent>pci_0000_00_1e_0</parent>
  <driver>
    <name>vfio-pci</name>
  </driver>
  <capability type='pci'>
    <domain>0</domain>
    <bus>2</bus>
    <slot>0</slot>
    <function>0</function>
    <product id='0x107c'>82541PI Gigabit Ethernet Controller</product>
    <vendor id='0x8086'>Intel Corporation</vendor>
    <iommuGroup number='9'>
      <address domain='0x0000' bus='0x00' slot='0x1e' function='0x0'/>
      <address domain='0x0000' bus='0x02' slot='0x00' function='0x0'/>
    </iommuGroup>
  </capability>
</device>


[root@localhost ~]# cat /sys/devices/pci0000\:00/0000\:00\:1e.0/0000\:02\:00.0/numa_node
-1

Questions:
1. For my UMA machine, numa_node is "-1"; is that due to an unsupported BIOS? If yes, I may have hit the issue mentioned in comment 7 and comment 8.
2. According to the above testing results, is this enough to verify the bug?

Thanks.
Comment 10 Michal Privoznik 2014-10-15 07:01:21 EDT
(In reply to Hu Jianwei from comment #9)
> Did some testing for the bug on both NUMA and UMA machines.
> 
> <1> On NUMA machine:

> Checking another device on node 1
> [root@ibm-x3850x5-06 ~]# virsh nodedev-dumpxml pci_0000_80_16_7
> <device>
>   <name>pci_0000_80_16_7</name>
>   <path>/sys/devices/pci0000:80/0000:80:16.7</path>
>   <parent>computer</parent>
>   <driver>
>     <name>ioatdma</name>
>   </driver>
>   <capability type='pci'>
>     <domain>0</domain>
>     <bus>128</bus>
>     <slot>22</slot>
>     <function>7</function>
>     <product id='0x342c'>5520/5500/X58 Chipset QuickData Technology
> Device</product>
>     <vendor id='0x8086'>Intel Corporation</vendor>
>     <iommuGroup number='25'>
>       <address domain='0x0000' bus='0x80' slot='0x16' function='0x0'/>
>       <address domain='0x0000' bus='0x80' slot='0x16' function='0x1'/>
>       <address domain='0x0000' bus='0x80' slot='0x16' function='0x2'/>
>       <address domain='0x0000' bus='0x80' slot='0x16' function='0x3'/>
>       <address domain='0x0000' bus='0x80' slot='0x16' function='0x4'/>
>       <address domain='0x0000' bus='0x80' slot='0x16' function='0x5'/>
>       <address domain='0x0000' bus='0x80' slot='0x16' function='0x6'/>
>       <address domain='0x0000' bus='0x80' slot='0x16' function='0x7'/>
>     </iommuGroup>
>     <numa node='1'/>
>     <pci-express/>
>   </capability>
> </device>

Hey, that's awesome, you've got a machine that is true NUMA, and it shows that the code works.

> 
> <2> On UMA machine:
> [root@localhost ~]# rpm -q libvirt kernel
> libvirt-1.2.8-5.el7.x86_64
> kernel-3.10.0-138.el7.x86_64
> kernel-3.10.0-121.el7.x86_64
> 
> [root@localhost ~]# numactl --hardware
> available: 1 nodes (0)
> node 0 cpus: 0 1 2 3 4 5 6 7
> node 0 size: 8066 MB
> node 0 free: 7209 MB
> node distances:
> node   0 
>   0:  10 
> [root@localhost ~]# virsh nodedev-dumpxml pci_0000_02_00_0
> <device>
>   <name>pci_0000_02_00_0</name>
>   <path>/sys/devices/pci0000:00/0000:00:1e.0/0000:02:00.0</path>
>   <parent>pci_0000_00_1e_0</parent>
>   <driver>
>     <name>e1000</name>
>   </driver>
>   <capability type='pci'>
>     <domain>0</domain>
>     <bus>2</bus>
>     <slot>0</slot>
>     <function>0</function>
>     <product id='0x107c'>82541PI Gigabit Ethernet Controller</product>
>     <vendor id='0x8086'>Intel Corporation</vendor>
>     <iommuGroup number='9'>
>       <address domain='0x0000' bus='0x00' slot='0x1e' function='0x0'/>
>       <address domain='0x0000' bus='0x02' slot='0x00' function='0x0'/>
>     </iommuGroup>
>   </capability>
> </device>

The <numa node=/> is missing here. Correct.

> 
> 
> [root@localhost ~]# cat
> /sys/devices/pci0000\:00/0000\:00\:1e.0/0000\:02\:00.0/numa_node
> -1

> Questions:
> 1. For my UMA machine, numa_node is "-1"; is that due to an unsupported
> BIOS? If yes, I may have hit the issue mentioned in comment 7 and comment 8.

Yeah. That's exactly the case.

> 2. According to the above testing results, is this enough to verify the bug?

Yes, it's exactly what we need.
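
For completeness, a rough cross-check along these lines can confirm that the node libvirt reports matches what sysfs exposes for a given device (the device name is just the example from comment 9):

  # Compare libvirt's reported NUMA node with the sysfs value for one device.
  dev=pci_0000_04_00_1
  path=$(virsh nodedev-dumpxml "$dev" | xmllint --xpath 'string(//path)' -)
  echo "libvirt: $(virsh nodedev-dumpxml "$dev" | xmllint --xpath 'string(//capability/numa/@node)' -)"
  echo "sysfs:   $(cat "$path/numa_node")"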
Comment 11 Hu Jianwei 2014-12-09 03:36:35 EST
According to comments 9 and 10, moving to Verified.
Comment 13 errata-xmlrpc 2015-03-05 02:35:00 EST
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHSA-2015-0323.html
