Bug 1076962 (libvirt-api-pcie-info)

Summary: Expose PCIe BW and lane information through API
Product: Red Hat Enterprise Linux 7 Reporter: Stephen Gordon <sgordon>
Component: libvirt Assignee: Michal Privoznik <mprivozn>
Status: CLOSED ERRATA QA Contact: Virtualization Bugs <virt-bugs>
Severity: medium Docs Contact:
Priority: high    
Version: 7.0 CC: dyuan, honzhang, jdenemar, jiahu, jmiao, mprivozn, mzhan, rbalakri, weizhan
Target Milestone: rc Keywords: FutureFeature, Upstream
Target Release: ---   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: libvirt-1.2.7-1.el7 Doc Type: Enhancement
Doc Text:
Feature: Expose PCIe information through the API.
Reason: When a management application decides which host a guest should run on, it has to make sure the chosen host meets the requirements laid out by the guest configuration. One of those requirements can be that the guest demands a certain PCIe device. Moreover, the device has to be able to communicate at a certain speed (for instance, because of specialized software running within the guest). For the management application to make that sort of decision, the PCIe information must be exposed somehow.
Result: Libvirt already exposes some information about host devices in the so-called node device XML (one document per device). The XML was enhanced so that if the underlying device is a PCIe device, it exposes PCIe information too.
Story Points: ---
Clone Of: Environment:
Last Closed: 2015-03-05 07:37:08 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 1078542    

Description Stephen Gordon 2014-03-16 19:25:53 UTC
Description of problem:

To allow a management application to proactively recognize potential I/O bottlenecks when placing virtual machines, it needs insight into the PCIe bandwidth and lanes used by a given device. Ideally it should also be easy to confirm whether two given devices share the same PCIe lanes or an Ethernet port.

Additional info:

libvirt currently offers no way to obtain the PCIe bandwidth and lanes used by a PCI device. An orchestrator needs this information to foresee potential I/O bottlenecks in a network interface caused by PCIe bandwidth limits.
On Linux the information is available from lspci:
$ sudo lspci -vvv
…
04:00.0 Ethernet controller: Intel Corporation 82599EB 10-Gigabit SFI/SFP+ Network Connection (rev 01)
[…]
        Capabilities: [a0] Express (v2) Endpoint, MSI 00
                LnkCap: Port #2, Speed 5GT/s, Width x8, ASPM L0s, Latency L0 <1us, L1 <8us
                LnkSta: Speed 5GT/s, Width x8, TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
[…]

04:00.1 Ethernet controller: Intel Corporation 82599EB 10-Gigabit SFI/SFP+ Network Connection (rev 01)
[…]
        Capabilities: [a0] Express (v2) Endpoint, MSI 00
                LnkCap: Port #0, Speed 5GT/s, Width x8, ASPM L0s, Latency L0 <1us, L1 <8us
                LnkSta: Speed 5GT/s, Width x8, TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
[…]
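
The LnkCap and LnkSta lines above can be turned into a rough bandwidth figure. The following is a minimal sketch, not part of libvirt: the function names are hypothetical, and the efficiency table assumes the standard line encodings (8b/10b at 2.5 and 5 GT/s, 128b/130b at 8 GT/s). It gives the theoretical per-direction rate before packet and protocol overhead.

```python
import re

# Encoding efficiency per signalling rate (GT/s). This table is an assumption
# of the sketch: 2.5 and 5 GT/s links use 8b/10b, 8 GT/s links use 128b/130b.
ENCODING_EFFICIENCY = {2.5: 8 / 10, 5.0: 8 / 10, 8.0: 128 / 130}

# Matches e.g. "LnkCap: Port #2, Speed 5GT/s, Width x8"
# and "LnkSta: Speed 5GT/s, Width x8".
LINK_RE = re.compile(r"(LnkCap|LnkSta):.*?Speed\s+([\d.]+)GT/s,\s+Width\s+x(\d+)")


def parse_link_lines(lspci_text):
    """Return {'LnkCap': (speed, width), 'LnkSta': (speed, width)} found in the text."""
    links = {}
    for kind, speed, width in LINK_RE.findall(lspci_text):
        links[kind] = (float(speed), int(width))
    return links


def usable_gbit_per_s(speed_gts, width):
    """Theoretical one-direction data rate in Gbit/s, before protocol overhead."""
    return speed_gts * width * ENCODING_EFFICIENCY[speed_gts]


sample = """\
        Capabilities: [a0] Express (v2) Endpoint, MSI 00
                LnkCap: Port #2, Speed 5GT/s, Width x8, ASPM L0s, Latency L0 <1us, L1 <8us
                LnkSta: Speed 5GT/s, Width x8, TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
"""
links = parse_link_lines(sample)
print(links)                                 # both entries are (5.0, 8)
print(usable_gbit_per_s(*links["LnkSta"]))   # about 32 Gbit/s, i.e. 4 GB/s
```

A line reporting "Speed unknown" does not match the pattern and is simply left out of the result.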

It is not easy to determine from the PCI device tree which PCIe devices (including SR-IOV Virtual Functions) share a physical Ethernet port and which share PCIe lanes. An orchestrator needs this to identify potential I/O bottlenecks and to understand how one PCI device might affect another.
Note that libvirt already offers utilities to get the hierarchy of PCIe devices from the PCI device tree, as well as specific PCI information for each device. However, additional libvirt routines could make this information much simpler to obtain, since it currently takes multiple interactions.
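
As a rough illustration of the grouping problem described above, the sketch below groups PCI addresses by everything before the function number. It rests on one assumption: functions of the same device (such as 04:00.0 and 04:00.1 in the lspci output) sit behind the same upstream link. The helper name is hypothetical, and the sketch deliberately ignores SR-IOV Virtual Functions that appear under other slot or bus numbers and devices that only share a link further up the tree.

```python
from collections import defaultdict


def group_by_slot(addresses):
    """Group PCI addresses of the form [dddd:]bb:ss.f by their [dddd:]bb:ss part."""
    groups = defaultdict(list)
    for addr in addresses:
        # rpartition splits on the last '.', which separates slot and function.
        slot, _, function = addr.rpartition(".")
        groups[slot].append(function)
    return dict(groups)


shared = group_by_slot(["04:00.0", "04:00.1", "02:00.0"])
print(shared)  # {'04:00': ['0', '1'], '02:00': ['0']}
```

Any slot with more than one function listed is a candidate for shared PCIe lanes under this simplified assumption.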

Comment 4 Michal Privoznik 2014-05-29 09:13:32 UTC
I've just proposed patches upstream:

https://www.redhat.com/archives/libvir-list/2014-May/msg00991.html

Yet another bit that is exposed in device XML description:
# virsh nodedev-dumpxml pci_0000_00_1c_1
<device>
  <name>pci_0000_00_1c_1</name>
  <path>/sys/devices/pci0000:00/0000:00:1c.1</path>
  <capability type='pci'>
    <domain>0</domain>
    <bus>0</bus>
    <slot>28</slot>
    <function>1</function>
    <product id='0x3b44'>5 Series/3400 Series Chipset PCI Express Root Port 2</product>
    <vendor id='0x8086'>Intel Corporation</vendor>
    <iommuGroup number='8'>
      <address domain='0x0000' bus='0x00' slot='0x1c' function='0x0'/>
      <address domain='0x0000' bus='0x00' slot='0x1c' function='0x1'/>
      <address domain='0x0000' bus='0x00' slot='0x1c' function='0x3'/>
      <address domain='0x0000' bus='0x00' slot='0x1c' function='0x4'/>
      <address domain='0x0000' bus='0x03' slot='0x00' function='0x0'/>
      <address domain='0x0000' bus='0x0d' slot='0x00' function='0x0'/>
      <address domain='0x0000' bus='0x0d' slot='0x00' function='0x1'/>
      <address domain='0x0000' bus='0x0d' slot='0x00' function='0x3'/>
    </iommuGroup>
    <numa_node>-1</numa_node>
    <pci-express>
      <link validity='cap' port='2' speed='2.5' width='1'/>
      <link validity='sta' speed='2.5' width='1'/>
    </pci-express>
  </capability>
</device>

The @validity attribute tells whether <link/> refers to the device capabilities (='cap') or to the values negotiated during device initialization (='sta'). Since the PCI port can't be negotiated, it's never present in ./link/[@validity='sta'].
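
A management application could read the two <link/> elements along these lines. This is a minimal sketch using only the Python standard library; the function name is hypothetical and the XML is a trimmed copy of the output above, not a complete node device document.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the nodedev-dumpxml output shown above.
NODEDEV_XML = """
<device>
  <name>pci_0000_00_1c_1</name>
  <capability type='pci'>
    <pci-express>
      <link validity='cap' port='2' speed='2.5' width='1'/>
      <link validity='sta' speed='2.5' width='1'/>
    </pci-express>
  </capability>
</device>
"""


def pcie_links(xml_text):
    """Map each @validity value ('cap' or 'sta') to that <link/>'s attributes."""
    root = ET.fromstring(xml_text)
    return {
        link.get("validity"): dict(link.attrib)
        for link in root.findall("./capability[@type='pci']/pci-express/link")
    }


links = pcie_links(NODEDEV_XML)
print(links["cap"]["port"], links["cap"]["speed"], links["cap"]["width"])  # 2 2.5 1
print("port" in links["sta"])  # False: the port is only reported for 'cap'
```

Attribute values come back as strings, so a caller comparing speeds or widths would convert them with float() or int() first.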

Comment 5 Michal Privoznik 2014-06-06 10:55:26 UTC
I'm proposing the standalone series for better review:

https://www.redhat.com/archives/libvir-list/2014-June/msg00346.html

Comment 6 Michal Privoznik 2014-06-12 15:30:24 UTC
Another attempt:

https://www.redhat.com/archives/libvir-list/2014-June/msg00615.html

Comment 7 Michal Privoznik 2014-06-16 15:43:50 UTC
So I've just pushed patches upstream:

commit 16ebf10f34ad21f0626903db3307490365543dc9
Author:     Michal Privoznik <mprivozn>
AuthorDate: Thu May 15 10:13:45 2014 +0200
Commit:     Michal Privoznik <mprivozn>
CommitDate: Mon Jun 16 17:40:49 2014 +0200

    nodedev: Introduce <pci-express/> to PCI devices
    
    This new element is there to represent PCI-Express capabilities
    of a PCI devices, like link speed, number of lanes, etc.
    
    Signed-off-by: Michal Privoznik <mprivozn>

commit a22a7a5ef3b4a375015016ac833e9992be0babd7
Author:     Michal Privoznik <mprivozn>
AuthorDate: Thu May 15 10:04:28 2014 +0200
Commit:     Michal Privoznik <mprivozn>
CommitDate: Mon Jun 16 17:40:49 2014 +0200

    virpci: Introduce virPCIDeviceIsPCIExpress and friends
    
    These functions will handle PCIe devices and their link capabilities
    to query some info about it.
    
    Signed-off-by: Michal Privoznik <mprivozn>


v1.2.5-135-g16ebf10

Comment 9 Hu Jianwei 2014-10-13 03:21:25 UTC
I did some testing for this bug; the bandwidth and lane info of a NIC is printed by nodedev-dumpxml.

[root@localhost ~]# rpm -q libvirt
libvirt-1.2.8-5.el7.x86_64

<1> For a PCIe device
[root@ibm-x3850x5-06 ~]# lspci -s 4:00.0 -v
04:00.0 Ethernet controller: Broadcom Corporation NetXtreme II BCM5709 Gigabit Ethernet (rev 20)
        Subsystem: IBM Device 03b5
        Flags: bus master, fast devsel, latency 0, IRQ 28
        Memory at 94000000 (64-bit, non-prefetchable) [size=32M]
        Capabilities: [48] Power Management version 3
        Capabilities: [50] Vital Product Data
        Capabilities: [58] MSI: Enable- Count=1/16 Maskable- 64bit+
        Capabilities: [a0] MSI-X: Enable+ Count=9 Masked-
        Capabilities: [ac] Express Endpoint, MSI 00
        Capabilities: [100] Device Serial Number 5c-f3-fc-ff-fe-dc-10-bc
        Capabilities: [110] Advanced Error Reporting
        Capabilities: [150] Power Budgeting
        Capabilities: [160] Virtual Channel
        Kernel driver in use: bnx2

[root@ibm-x3850x5-06 ~]# virsh nodedev-dumpxml pci_0000_04_00_0
<device>
  <name>pci_0000_04_00_0</name>
  <path>/sys/devices/pci0000:00/0000:00:01.0/0000:04:00.0</path>
  <parent>pci_0000_00_01_0</parent>
  <driver>
    <name>bnx2</name>
  </driver>
  <capability type='pci'>
    <domain>0</domain>
    <bus>4</bus>
    <slot>0</slot>
    <function>0</function>
    <product id='0x1639'>NetXtreme II BCM5709 Gigabit Ethernet</product>
    <vendor id='0x14e4'>Broadcom Corporation</vendor>
    <numa node='0'/>
    <pci-express>
      <link validity='cap' port='0' speed='5' width='4'/>
      <link validity='sta' speed='5' width='4'/>
    </pci-express>
  </capability>
</device>

<2> For a non-PCIe device
[root@localhost ~]# lspci -s 2:00.0 -v
02:00.0 Ethernet controller: Intel Corporation 82541PI Gigabit Ethernet Controller (rev 05)
        Subsystem: Intel Corporation PRO/1000 GT Desktop Adapter
        Flags: bus master, 66MHz, medium devsel, latency 32, IRQ 20
        Memory at f7c40000 (32-bit, non-prefetchable) [size=128K]
        Memory at f7c20000 (32-bit, non-prefetchable) [size=128K]
        I/O ports at d000 [size=64]
        Expansion ROM at f7c00000 [disabled] [size=128K]
        Capabilities: [dc] Power Management version 2
        Capabilities: [e4] PCI-X non-bridge device
        Kernel driver in use: e1000

[root@localhost ~]# virsh nodedev-dumpxml pci_0000_02_00_0
<device>
  <name>pci_0000_02_00_0</name>
  <path>/sys/devices/pci0000:00/0000:00:1e.0/0000:02:00.0</path>
  <parent>pci_0000_00_1e_0</parent>
  <driver>
    <name>e1000</name>
  </driver>
  <capability type='pci'>
    <domain>0</domain>
    <bus>2</bus>
    <slot>0</slot>
    <function>0</function>
    <product id='0x107c'>82541PI Gigabit Ethernet Controller</product>
    <vendor id='0x8086'>Intel Corporation</vendor>
  </capability>
</device>

[root@localhost ~]# lspci -s 19.0 -v
00:19.0 Ethernet controller: Intel Corporation 82579LM Gigabit Network Connection (rev 04)
        Subsystem: Hewlett-Packard Company Device 3397
        Flags: bus master, fast devsel, latency 0, IRQ 50
        Memory at f7e00000 (32-bit, non-prefetchable) [size=128K]
        Memory at f7e39000 (32-bit, non-prefetchable) [size=4K]
        I/O ports at f080 [size=32]
        Capabilities: [c8] Power Management version 2
        Capabilities: [d0] MSI: Enable+ Count=1/1 Maskable- 64bit+
        Capabilities: [e0] PCI Advanced Features
        Kernel driver in use: e1000e

[root@localhost ~]# virsh nodedev-dumpxml pci_0000_00_19_0
<device>
  <name>pci_0000_00_19_0</name>
  <path>/sys/devices/pci0000:00/0000:00:19.0</path>
  <parent>computer</parent>
  <driver>
    <name>e1000e</name>
  </driver>
  <capability type='pci'>
    <domain>0</domain>
    <bus>0</bus>
    <slot>25</slot>
    <function>0</function>
    <product id='0x1502'>82579LM Gigabit Network Connection</product>
    <vendor id='0x8086'>Intel Corporation</vendor>
  </capability>
</device>

Question:
I could not find hardware for the case where a PCI-Express device does not necessarily export link info. Could you help me pick one out, or do we need to purchase a device of that type?

Thanks.

Comment 10 Hu Jianwei 2014-10-13 04:09:57 UTC
> Question:
> I could not find hardware for the case where a PCI-Express device does not
> necessarily export link info. Could you help me pick one out, or do we need
> to purchase a device of that type?
> 
Oh, I found such a device. Is this the right kind?
[root@ibm-x3850x5-06 ~]# virsh nodedev-dumpxml pci_0000_80_16_6
<device>
  <name>pci_0000_80_16_6</name>
  <path>/sys/devices/pci0000:80/0000:80:16.6</path>
  <parent>computer</parent>
  <driver>
    <name>ioatdma</name>
  </driver>
  <capability type='pci'>
    <domain>0</domain>
    <bus>128</bus>
    <slot>22</slot>
    <function>6</function>
    <product id='0x342b'>5520/5500/X58 Chipset QuickData Technology Device</product>
    <vendor id='0x8086'>Intel Corporation</vendor>
    <numa node='1'/>
    <pci-express/>
  </capability>
</device>

Comment 11 Michal Privoznik 2014-10-15 10:52:47 UTC
(In reply to Hu Jianwei from comment #10)
> > Question:
> > I could not find hardware for the case where a PCI-Express device does not
> > necessarily export link info. Could you help me pick one out, or do we need
> > to purchase a device of that type?
> > 
> Oh, I found such a device. Is this the right kind?
> [root@ibm-x3850x5-06 ~]# virsh nodedev-dumpxml pci_0000_80_16_6
> <device>
>   <name>pci_0000_80_16_6</name>
>   <path>/sys/devices/pci0000:80/0000:80:16.6</path>
>   <parent>computer</parent>
>   <driver>
>     <name>ioatdma</name>
>   </driver>
>   <capability type='pci'>
>     <domain>0</domain>
>     <bus>128</bus>
>     <slot>22</slot>
>     <function>6</function>
>     <product id='0x342b'>5520/5500/X58 Chipset QuickData Technology Device</product>
>     <vendor id='0x8086'>Intel Corporation</vendor>
>     <numa node='1'/>
>     <pci-express/>
>   </capability>
> </device>

Hey, that's brilliant! So even though I didn't have access to such a device, my code works :-) I guess that means VERIFIED. To give you a bit more insight: some devices in the PCIe world don't have a dedicated link. For instance, I found such a device on one of my systems:

00:03.0 Audio device: Intel Corporation Xeon E3-1200 v3/4th Gen Core Processor HD Audio Controller (rev 06)
        Subsystem: Lenovo Device 2210
        Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx+
        Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
        Latency: 0, Cache Line Size: 64 bytes
        Interrupt: pin A routed to IRQ 47
        Region 0: Memory at f1630000 (64-bit, non-prefetchable) [size=16K]
        Capabilities: [50] Power Management version 2
                Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA PME(D0-,D1-,D2-,D3hot-,D3cold-)
                Status: D0 NoSoftRst- PME-Enable- DSel=0 DScale=0 PME-
        Capabilities: [60] MSI: Enable+ Count=1/1 Maskable- 64bit-
                Address: fee003d8  Data: 0000
        Capabilities: [70] Express (v1) Root Complex Integrated Endpoint, MSI 00
                DevCap: MaxPayload 128 bytes, PhantFunc 0, Latency L0s unlimited, L1 unlimited
                        ExtTag- RBE- FLReset+
                DevCtl: Report errors: Correctable- Non-Fatal- Fatal- Unsupported-
                        RlxdOrd- ExtTag- PhantFunc- AuxPwr- NoSnoop-
                        MaxPayload 128 bytes, MaxReadReq 128 bytes
                DevSta: CorrErr- UncorrErr- FatalErr- UnsuppReq- AuxPwr- TransPend-
                LnkCap: Port #0, Speed unknown, Width x0, ASPM unknown, Latency L0 <64ns, L1 <1us
                        ClockPM- Surprise- LLActRep- BwNot-
                LnkCtl: ASPM Disabled; Disabled- Retrain- CommClk-
                        ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
                LnkSta: Speed unknown, Width x0, TrErr- Train- SlotClk- DLActive- BWMgmt- ABWMgmt-
        Kernel driver in use: snd_hda_intel
        Kernel modules: snd_hda_intel

(this is lspci -vvv output btw)

You can see 'Speed unknown' and 'Width x0'. This is because a 'Root Complex Integrated Endpoint' belongs to one of the classes of devices that don't have any link. The other would be the 'Root Complex Event Collector'.
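
The three shapes of node device XML seen in this report (no <pci-express> element, an empty <pci-express/>, and <pci-express> with <link/> children) can be told apart as sketched below. The function name and the returned labels are hypothetical, and the XML snippets are trimmed versions of the outputs quoted in the comments above.

```python
import xml.etree.ElementTree as ET


def classify_pci_device(xml_text):
    """Classify a node device XML by the PCIe information it carries."""
    root = ET.fromstring(xml_text)
    express = root.find("./capability[@type='pci']/pci-express")
    # Compare against None explicitly: an element without children is falsy,
    # so "if not express" would misclassify an empty <pci-express/>.
    if express is None:
        return "no-pcie-info"
    if express.find("link") is None:
        return "pcie-without-link"
    return "pcie-with-link"


PLAIN_PCI = "<device><capability type='pci'><slot>25</slot></capability></device>"
NO_LINK = "<device><capability type='pci'><pci-express/></capability></device>"
WITH_LINK = """
<device>
  <capability type='pci'>
    <pci-express>
      <link validity='cap' port='0' speed='5' width='4'/>
      <link validity='sta' speed='5' width='4'/>
    </pci-express>
  </capability>
</device>
"""

for xml_text in (PLAIN_PCI, NO_LINK, WITH_LINK):
    print(classify_pci_device(xml_text))
```

The "pcie-without-link" case corresponds to devices like the Root Complex Integrated Endpoint shown above, for which no speed or width is reported.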

Comment 12 Jincheng Miao 2014-12-08 07:48:23 UTC
According to comments 9, 10, and 11, this bug should be VERIFIED.

Comment 14 errata-xmlrpc 2015-03-05 07:37:08 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHSA-2015-0323.html