Bug 1093127 (libvirt-numa-locality-for-pci)
| Summary: | RFE: report NUMA node locality for PCI devices | |||
|---|---|---|---|---|
| Product: | Red Hat Enterprise Linux 7 | Reporter: | Daniel Berrangé <berrange> | |
| Component: | libvirt | Assignee: | Michal Privoznik <mprivozn> | |
| Status: | CLOSED ERRATA | QA Contact: | Virtualization Bugs <virt-bugs> | |
| Severity: | unspecified | Docs Contact: | ||
| Priority: | unspecified | |||
| Version: | 7.0 | CC: | dyuan, honzhang, jdenemar, jiahu, jmiao, mprivozn, mzhan, rbalakri, sgordon, xuzhang | |
| Target Milestone: | rc | Keywords: | FutureFeature, Upstream | |
| Target Release: | --- | |||
| Hardware: | Unspecified | |||
| OS: | Unspecified | |||
| Whiteboard: | ||||
| Fixed In Version: | libvirt-1.2.7-1.el7 | Doc Type: | Enhancement | |
| Doc Text: |
Feature:
Report NUMA node locality for PCI devices.
Reason:
When starting a new domain, it is crucial to know the host NUMA topology and the NUMA node each PCI device is affiliated with, so that when PCI passthrough is requested the guest is pinned onto the correct NUMA nodes. It is suboptimal if the guest is pinned onto, say, nodes 0-1 while the PCI device is affiliated with node 2, since data transfers between nodes take additional time.
Result:
The device XML was enhanced to export a PCI device's affiliation with a NUMA node.
| Story Points: | --- | |
| Clone Of: | ||||
| : | 1134746 (view as bug list) | Environment: | ||
| Last Closed: | 2015-03-05 07:35:00 UTC | Type: | Bug | |
| Regression: | --- | Mount Type: | --- | |
| Documentation: | --- | CRM: | ||
| Verified Versions: | Category: | --- | ||
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | ||
| Cloudforms Team: | --- | Target Upstream Version: | ||
| Embargoed: | ||||
| Bug Depends On: | ||||
| Bug Blocks: | 1078542, 1113520, 1134746 | |||
Description
Daniel Berrangé
2014-04-30 16:47:40 UTC
I've just proposed patches upstream:
https://www.redhat.com/archives/libvir-list/2014-May/msg00991.html
The requested info can be seen in the device XML (virsh nodedev-dumpxml):
<device>
<name>pci_0000_00_1c_1</name>
<path>/sys/devices/pci0000:00/0000:00:1c.1</path>
<parent>computer</parent>
...
<capability type='pci'>
...
<numa node='2'/>
...
</capability>
</device>

This time as a standalone patch:
https://www.redhat.com/archives/libvir-list/2014-June/msg00352.html

I've just pushed patches upstream:
commit 1c7027788678c3ce0e41eb937d71ede33418b6b9
Author: Michal Privoznik <mprivozn>
AuthorDate: Wed May 7 18:07:12 2014 +0200
Commit: Michal Privoznik <mprivozn>
CommitDate: Fri Jun 6 15:10:57 2014 +0200
nodedev: Export NUMA node locality for PCI devices
A PCI device can be associated with a specific NUMA node. Later, when
a guest is pinned to one NUMA node the PCI device can be assigned on
different NUMA node. This makes DMA transfers travel across nodes and
thus results in suboptimal performance. We should expose the NUMA node
locality for PCI devices so management applications can make better
decisions.
Signed-off-by: Michal Privoznik <mprivozn>
v1.2.5-69-g1c70277
The info can be seen in nodedev-dumpxml output:
<device>
<name>pci_1002_71c4</name>
<parent>pci_8086_27a1</parent>
<capability type='pci'>
<domain>0</domain>
<bus>1</bus>
<slot>0</slot>
<function>0</function>
<product id='0x71c4'>M56GL [Mobility FireGL V5200]</product>
<vendor id='0x1002'>ATI Technologies Inc</vendor>
+ <numa node='1'/>
</capability>
</device>
If there's no numa node associated with the device, no <numa/> is reported.
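For a management application, extracting the new element boils down to a small XML lookup. A minimal Python sketch, assuming the XML comes from `virsh nodedev-dumpxml` or virNodeDeviceGetXMLDesc(); the helper name `pci_numa_node` is ours, and the sample document is the snippet from the commit message above:

```python
import xml.etree.ElementTree as ET

def pci_numa_node(nodedev_xml):
    """Return the NUMA node of a PCI node device, or None when the
    <numa/> element is absent (i.e. the BIOS reported no locality)."""
    root = ET.fromstring(nodedev_xml)
    numa = root.find("./capability[@type='pci']/numa")
    return int(numa.get("node")) if numa is not None else None

# Sample nodedev XML, as shown in the commit message:
example = """
<device>
  <name>pci_1002_71c4</name>
  <parent>pci_8086_27a1</parent>
  <capability type='pci'>
    <domain>0</domain>
    <bus>1</bus>
    <slot>0</slot>
    <function>0</function>
    <product id='0x71c4'>M56GL [Mobility FireGL V5200]</product>
    <vendor id='0x1002'>ATI Technologies Inc</vendor>
    <numa node='1'/>
  </capability>
</device>
"""

print(pci_numa_node(example))  # -> 1
```

Returning None for a missing element mirrors the behaviour stated above: no `<numa/>` means no locality information, not node 0.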
Hi Michal,
Why is the sysfs file 'numa_node' always '-1', even for a passed-through NIC?
# uname -r
3.10.0-160.el7.x86_64
# ll /sys/devices/pci0000:40/0000:40:0b.0/0000:44:00.1/driver
lrwxrwxrwx. 1 root root 0 Sep 17 06:36 /sys/devices/pci0000:40/0000:40:0b.0/0000:44:00.1/driver -> ../../../../bus/pci/drivers/vfio-pci
# cat /sys/devices/pci0000:40/0000:40:0b.0/0000:44:00.1/numa_node
-1
Is that a kernel problem?

(In reply to Jincheng Miao from comment #7)
> Why is the sysfs file 'numa_node' always '-1', even for a passed-through NIC?

The data that this file reports comes from the BIOS, and the majority of hardware that exists today has a BIOS that does not report it. If you are looking at this inside a QEMU/KVM guest, it will definitely be missing, since QEMU/KVM don't report it (see bug 1103313). So you need to test this on real hardware and find a machine which actually supports it. Unfortunately, I don't know which specific hardware to recommend.

Do some testing for the bug on both NUMA and UMA machines.
<1> On NUMA machine:
[root@ibm-x3850x5-06 ~]# rpm -q libvirt kernel
libvirt-1.2.8-5.el7.x86_64
kernel-3.10.0-187.el7.x86_64
[root@ibm-x3850x5-06 ~]# numactl --hardware
available: 2 nodes (0-1)
node 0 cpus: 0 1 2 3 4 5 6 7 16 17 18 19 20 21 22 23
node 0 size: 16362 MB
node 0 free: 15517 MB
node 1 cpus: 8 9 10 11 12 13 14 15 24 25 26 27 28 29 30 31
node 1 size: 16384 MB
node 1 free: 15696 MB
node distances:
node 0 1
0: 10 11
1: 11 10
[root@ibm-x3850x5-06 ~]# lspci -s 04:00.1 -v
04:00.1 Ethernet controller: Broadcom Corporation NetXtreme II BCM5709 Gigabit Ethernet (rev 20)
Subsystem: IBM Device 03b5
Flags: bus master, fast devsel, latency 0, IRQ 40
Memory at 92000000 (64-bit, non-prefetchable) [size=32M]
Capabilities: [48] Power Management version 3
Capabilities: [50] Vital Product Data
Capabilities: [58] MSI: Enable- Count=1/16 Maskable- 64bit+
Capabilities: [a0] MSI-X: Enable+ Count=9 Masked-
Capabilities: [ac] Express Endpoint, MSI 00
Capabilities: [100] Device Serial Number 5c-f3-fc-ff-fe-dc-10-be
Capabilities: [110] Advanced Error Reporting
Capabilities: [150] Power Budgeting <?>
Capabilities: [160] Virtual Channel
Kernel driver in use: bnx2
[root@ibm-x3850x5-06 ~]# virsh nodedev-dumpxml pci_0000_04_00_1
<device>
<name>pci_0000_04_00_1</name>
<path>/sys/devices/pci0000:00/0000:00:01.0/0000:04:00.1</path>
<parent>pci_0000_00_01_0</parent>
<driver>
<name>bnx2</name>
</driver>
<capability type='pci'>
<domain>0</domain>
<bus>4</bus>
<slot>0</slot>
<function>1</function>
<product id='0x1639'>NetXtreme II BCM5709 Gigabit Ethernet</product>
<vendor id='0x14e4'>Broadcom Corporation</vendor>
<iommuGroup number='15'>
<address domain='0x0000' bus='0x04' slot='0x00' function='0x0'/>
<address domain='0x0000' bus='0x04' slot='0x00' function='0x1'/>
</iommuGroup>
<numa node='0'/>
<pci-express>
<link validity='cap' port='0' speed='5' width='4'/>
<link validity='sta' speed='5' width='4'/>
</pci-express>
</capability>
</device>
[root@ibm-x3850x5-06 ~]# cat /sys/devices/pci0000\:00/0000\:00\:01.0/0000\:04\:00.1/numa_node
0
[root@ibm-x3850x5-06 ~]# ll /sys/devices/pci0000\:00/0000\:00\:01.0/0000\:04\:00.1/driver
lrwxrwxrwx. 1 root root 0 Oct 13 15:50 /sys/devices/pci0000:00/0000:00:01.0/0000:04:00.1/driver -> ../../../../bus/pci/drivers/bnx2
[root@ibm-x3850x5-06 ~]# virsh nodedev-detach pci_0000_04_00_1
Device pci_0000_04_00_1 detached
[root@ibm-x3850x5-06 ~]# virsh nodedev-dumpxml pci_0000_04_00_1
<device>
<name>pci_0000_04_00_1</name>
<path>/sys/devices/pci0000:00/0000:00:01.0/0000:04:00.1</path>
<parent>pci_0000_00_01_0</parent>
<driver>
<name>vfio-pci</name>
</driver>
<capability type='pci'>
<domain>0</domain>
<bus>4</bus>
<slot>0</slot>
<function>1</function>
<product id='0x1639'>NetXtreme II BCM5709 Gigabit Ethernet</product>
<vendor id='0x14e4'>Broadcom Corporation</vendor>
<iommuGroup number='15'>
<address domain='0x0000' bus='0x04' slot='0x00' function='0x0'/>
<address domain='0x0000' bus='0x04' slot='0x00' function='0x1'/>
</iommuGroup>
<numa node='0'/>
<pci-express>
<link validity='cap' port='0' speed='5' width='4'/>
<link validity='sta' speed='5' width='4'/>
</pci-express>
</capability>
</device>
[root@ibm-x3850x5-06 ~]# ll /sys/devices/pci0000\:00/0000\:00\:01.0/0000\:04\:00.1/driver
lrwxrwxrwx. 1 root root 0 Oct 14 10:37 /sys/devices/pci0000:00/0000:00:01.0/0000:04:00.1/driver -> ../../../../bus/pci/drivers/vfio-pci
[root@ibm-x3850x5-06 ~]# cat /sys/devices/pci0000\:00/0000\:00\:01.0/0000\:04\:00.1/numa_node
0
Checking another device on node 1
[root@ibm-x3850x5-06 ~]# virsh nodedev-dumpxml pci_0000_80_16_7
<device>
<name>pci_0000_80_16_7</name>
<path>/sys/devices/pci0000:80/0000:80:16.7</path>
<parent>computer</parent>
<driver>
<name>ioatdma</name>
</driver>
<capability type='pci'>
<domain>0</domain>
<bus>128</bus>
<slot>22</slot>
<function>7</function>
<product id='0x342c'>5520/5500/X58 Chipset QuickData Technology Device</product>
<vendor id='0x8086'>Intel Corporation</vendor>
<iommuGroup number='25'>
<address domain='0x0000' bus='0x80' slot='0x16' function='0x0'/>
<address domain='0x0000' bus='0x80' slot='0x16' function='0x1'/>
<address domain='0x0000' bus='0x80' slot='0x16' function='0x2'/>
<address domain='0x0000' bus='0x80' slot='0x16' function='0x3'/>
<address domain='0x0000' bus='0x80' slot='0x16' function='0x4'/>
<address domain='0x0000' bus='0x80' slot='0x16' function='0x5'/>
<address domain='0x0000' bus='0x80' slot='0x16' function='0x6'/>
<address domain='0x0000' bus='0x80' slot='0x16' function='0x7'/>
</iommuGroup>
<numa node='1'/>
<pci-express/>
</capability>
</device>
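The `cat .../numa_node` checks in the transcript above can also be scripted. A hedged sketch (the function name `sysfs_numa_node` is ours; it encodes the kernel convention, confirmed in comment 8, that -1 means the firmware reported no locality):

```python
import tempfile
from pathlib import Path

def sysfs_numa_node(dev_path):
    """Read <dev_path>/numa_node; the kernel writes -1 when the
    BIOS/firmware does not report a locality, so map that to None."""
    try:
        value = int((Path(dev_path) / "numa_node").read_text().strip())
    except (FileNotFoundError, ValueError):
        return None
    return value if value >= 0 else None

# Demo against a mocked sysfs directory, since real paths such as
# /sys/devices/pci0000:00/0000:00:01.0/0000:04:00.1 depend on the host:
demo = tempfile.mkdtemp()
(Path(demo) / "numa_node").write_text("-1\n")
print(sysfs_numa_node(demo))  # -> None (a "BIOS without locality" device)
```

On the NUMA machine above this would return 0 for the BCM5709 NIC and 1 for the QuickData device, matching libvirt's `<numa node=.../>` output.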
<2> On UMA machine:
[root@localhost ~]# rpm -q libvirt kernel
libvirt-1.2.8-5.el7.x86_64
kernel-3.10.0-138.el7.x86_64
kernel-3.10.0-121.el7.x86_64
[root@localhost ~]# numactl --hardware
available: 1 nodes (0)
node 0 cpus: 0 1 2 3 4 5 6 7
node 0 size: 8066 MB
node 0 free: 7209 MB
node distances:
node 0
0: 10
[root@localhost ~]# virsh nodedev-dumpxml pci_0000_02_00_0
<device>
<name>pci_0000_02_00_0</name>
<path>/sys/devices/pci0000:00/0000:00:1e.0/0000:02:00.0</path>
<parent>pci_0000_00_1e_0</parent>
<driver>
<name>e1000</name>
</driver>
<capability type='pci'>
<domain>0</domain>
<bus>2</bus>
<slot>0</slot>
<function>0</function>
<product id='0x107c'>82541PI Gigabit Ethernet Controller</product>
<vendor id='0x8086'>Intel Corporation</vendor>
<iommuGroup number='9'>
<address domain='0x0000' bus='0x00' slot='0x1e' function='0x0'/>
<address domain='0x0000' bus='0x02' slot='0x00' function='0x0'/>
</iommuGroup>
</capability>
</device>
[root@localhost ~]# cat /sys/devices/pci0000\:00/0000\:00\:1e.0/0000\:02\:00.0/numa_node
-1
[root@localhost ~]# virsh nodedev-detach pci_0000_02_00_0
Device pci_0000_02_00_0 detached
[root@localhost ~]# virsh nodedev-dumpxml pci_0000_02_00_0
<device>
<name>pci_0000_02_00_0</name>
<path>/sys/devices/pci0000:00/0000:00:1e.0/0000:02:00.0</path>
<parent>pci_0000_00_1e_0</parent>
<driver>
<name>vfio-pci</name>
</driver>
<capability type='pci'>
<domain>0</domain>
<bus>2</bus>
<slot>0</slot>
<function>0</function>
<product id='0x107c'>82541PI Gigabit Ethernet Controller</product>
<vendor id='0x8086'>Intel Corporation</vendor>
<iommuGroup number='9'>
<address domain='0x0000' bus='0x00' slot='0x1e' function='0x0'/>
<address domain='0x0000' bus='0x02' slot='0x00' function='0x0'/>
</iommuGroup>
</capability>
</device>
[root@localhost ~]# cat /sys/devices/pci0000\:00/0000\:00\:1e.0/0000\:02\:00.0/numa_node
-1
Questions:
1. For my UMA machine, numa_node is "-1"; is this a BIOS that does not report the data? If so, I may have hit the issue mentioned in comment 7 and comment 8.
2. Are the above testing results sufficient to verify the bug?
Thanks.
(In reply to Hu Jianwei from comment #9)
> Do some testing for the bug on both NUMA and UMA machines.
>
> <1> On NUMA machine:
> Checking another device on node 1
> [root@ibm-x3850x5-06 ~]# virsh nodedev-dumpxml pci_0000_80_16_7
> <device>
> <name>pci_0000_80_16_7</name>
> <path>/sys/devices/pci0000:80/0000:80:16.7</path>
> <parent>computer</parent>
> <driver>
> <name>ioatdma</name>
> </driver>
> <capability type='pci'>
> <domain>0</domain>
> <bus>128</bus>
> <slot>22</slot>
> <function>7</function>
> <product id='0x342c'>5520/5500/X58 Chipset QuickData Technology Device</product>
> <vendor id='0x8086'>Intel Corporation</vendor>
> <iommuGroup number='25'>
> <address domain='0x0000' bus='0x80' slot='0x16' function='0x0'/>
> <address domain='0x0000' bus='0x80' slot='0x16' function='0x1'/>
> <address domain='0x0000' bus='0x80' slot='0x16' function='0x2'/>
> <address domain='0x0000' bus='0x80' slot='0x16' function='0x3'/>
> <address domain='0x0000' bus='0x80' slot='0x16' function='0x4'/>
> <address domain='0x0000' bus='0x80' slot='0x16' function='0x5'/>
> <address domain='0x0000' bus='0x80' slot='0x16' function='0x6'/>
> <address domain='0x0000' bus='0x80' slot='0x16' function='0x7'/>
> </iommuGroup>
> <numa node='1'/>
> <pci-express/>
> </capability>
> </device>

Hey, that's awesome: you've got a machine that is truly NUMA, and it shows that the code works.
(In reply to Hu Jianwei from comment #9)
> <2> On UMA machine:
> [root@localhost ~]# rpm -q libvirt kernel
> libvirt-1.2.8-5.el7.x86_64
> kernel-3.10.0-138.el7.x86_64
> kernel-3.10.0-121.el7.x86_64
>
> [root@localhost ~]# numactl --hardware
> available: 1 nodes (0)
> node 0 cpus: 0 1 2 3 4 5 6 7
> node 0 size: 8066 MB
> node 0 free: 7209 MB
> node distances:
> node 0
> 0: 10
> [root@localhost ~]# virsh nodedev-dumpxml pci_0000_02_00_0
> <device>
> <name>pci_0000_02_00_0</name>
> <path>/sys/devices/pci0000:00/0000:00:1e.0/0000:02:00.0</path>
> <parent>pci_0000_00_1e_0</parent>
> <driver>
> <name>e1000</name>
> </driver>
> <capability type='pci'>
> <domain>0</domain>
> <bus>2</bus>
> <slot>0</slot>
> <function>0</function>
> <product id='0x107c'>82541PI Gigabit Ethernet Controller</product>
> <vendor id='0x8086'>Intel Corporation</vendor>
> <iommuGroup number='9'>
> <address domain='0x0000' bus='0x00' slot='0x1e' function='0x0'/>
> <address domain='0x0000' bus='0x02' slot='0x00' function='0x0'/>
> </iommuGroup>
> </capability>
> </device>

The <numa/> element is missing here. Correct.

> [root@localhost ~]# cat /sys/devices/pci0000\:00/0000\:00\:1e.0/0000\:02\:00.0/numa_node
> -1
> Questions:
> 1. For my UMA machine, numa_node is "-1"; is this a BIOS that does not report the data? If so, I may have hit the issue mentioned in comment 7 and comment 8.

Yeah, that's exactly the case.

> 2. Are the above testing results sufficient to verify the bug?

Yes, it's exactly what we need.

According to comment 9 and comment 10, moving to Verified.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.
https://rhn.redhat.com/errata/RHSA-2015-0323.html