Bug 1410287 - [RFE] Support for PCIe devices on PAPR (POWER) guests
Summary: [RFE] Support for PCIe devices on PAPR (POWER) guests
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: Red Hat Enterprise Linux 7
Classification: Red Hat
Component: libvirt
Version: 7.4
Hardware: ppc64le
OS: Linux
Priority: unspecified
Severity: medium
Target Milestone: rc
Target Release: 7.4
Assignee: Andrea Bolognani
QA Contact: Virtualization Bugs
URL:
Whiteboard:
Depends On: 1410284 1429760 1568219
Blocks: 1264935 RHV4.1PPC
Reported: 2017-01-05 01:34 UTC by David Gibson
Modified: 2018-04-17 01:36 UTC

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2017-08-02 07:44:59 UTC
Target Upstream Version:




Links
System | ID | Private | Priority | Status | Summary | Last Updated
IBM Linux Technology Center | 152939 | 0 | None | None | None | 2017-03-28 10:10:17 UTC

Description David Gibson 2017-01-05 01:34:40 UTC
Description of problem:

Currently qemu, and therefore libvirt as well, doesn't support PCIe devices on POWER (PAPR) guests.  This is a limitation within qemu - the paravirtualized PCI interface means that guests don't see a great distinction between vanilla PCI and PCIe.

Adding support for this is non-trivial, because libvirt's normal PCIe device placement strategy is unsuitable for POWER, due to that paravirtualized guest interface.  PAPR has its own hotplug protocol, different from the standard PCIe one, so root ports / downstream ports are generally unnecessary (or even troublesome) for PAPR guests.

How to solve this without breaking compatibility is an open question.  Fixing bug 1410284 is certainly necessary first, and it will probably be easier if bug 1280542 is also addressed first.

Comment 1 Andrea Bolognani 2017-03-10 14:58:33 UTC
It was agreed that resolving Bug 1280542 is not necessary
for moving forward with this, hence dropping the dependency.

Comment 2 Andrea Bolognani 2017-03-15 15:48:38 UTC
I've tested this both with an emulated PCIe Ethernet adapter
(e1000e) device and with an assigned host PCIe Ethernet
adapter, and in both cases the guest was able to access the
extended config space (capabilities >=100).

Note that this requires QEMU 2.9 and the use of the
pseries-2.9 machine type. In fact, I've also verified that,
as expected, guests running on older machine types can't
access the extended config space.
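
The check described above can be scripted. What follows is only a minimal sketch, assuming the text output of `lspci -vv` is piped in; the function name `ext_caps` is hypothetical and not part of any tool mentioned in this bug. It prints the offsets of capabilities located at 0x100 or above, which is where the PCIe extended config space begins.

```shell
# Hypothetical helper: read `lspci -vv` output on stdin and print the
# offset of every capability at or above 0x100 (extended config space).
ext_caps() {
    sed -n 's/^[[:space:]]*Capabilities: \[\([0-9a-fA-F]*\)[] ].*/\1/p' |
    while read -r off; do
        # Offsets are hexadecimal; compare them numerically against 0x100.
        if [ "$((0x$off))" -ge 256 ]; then
            echo "$off"
        fi
    done
}
```

Inside the guest it could be used as `lspci -vvs 00:01.0 | ext_caps`, run as root so that lspci can read the capability list. Empty output would suggest that the extended config space is not exposed.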

Comment 3 Dan Zheng 2017-05-10 07:50:18 UTC

Case 1: Cold plug multiple host PCIe devices to the guest.

# lspci -vv|grep PCIe
...
0003:09:00.0 Ethernet controller: Broadcom Limited NetXtreme BCM5719 Gigabit Ethernet PCIe (rev 01)
		Product Name: PCIe2 4-port 1GbE Adapter
0003:09:00.1 Ethernet controller: Broadcom Limited NetXtreme BCM5719 Gigabit Ethernet PCIe (rev 01)
		Product Name: PCIe2 4-port 1GbE Adapter
0003:09:00.2 Ethernet controller: Broadcom Limited NetXtreme BCM5719 Gigabit Ethernet PCIe (rev 01)
		Product Name: PCIe2 4-port 1GbE Adapter
0003:09:00.3 Ethernet controller: Broadcom Limited NetXtreme BCM5719 Gigabit Ethernet PCIe (rev 01)
		Product Name: PCIe2 4-port 1GbE Adapter

Configure the guest with 4 <hostdev> elements, one per function:
 <hostdev mode='subsystem' type='pci' managed='yes'>
      <source>
        <address domain='0x0003' bus='0x09' slot='0x00' function='0x0'/>
      </source>
    </hostdev>
...
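
Since the four elements differ only in the function number, they could be generated rather than typed by hand. This is a rough sketch under that assumption; `hostdev_xml` is a hypothetical helper name, and the output still has to be pasted into the <devices> section of the domain XML (for example with `virsh edit`).

```shell
# Hypothetical helper: print one <hostdev> element per PCI function.
# Arguments: domain bus slot, followed by the function numbers (hex, no 0x).
hostdev_xml() {
    domain=$1 bus=$2 slot=$3
    shift 3
    for fn in "$@"; do
        cat <<EOF
<hostdev mode='subsystem' type='pci' managed='yes'>
  <source>
    <address domain='0x$domain' bus='0x$bus' slot='0x$slot' function='0x$fn'/>
  </source>
</hostdev>
EOF
    done
}

# The four functions of the adapter at 0003:09:00 from the lspci listing.
hostdev_xml 0003 09 00 0 1 2 3
```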
The guest starts successfully.
Dumpxml of the guest:
    <hostdev mode='subsystem' type='pci' managed='yes'>
      <driver name='vfio'/>
      <source>
        <address domain='0x0003' bus='0x09' slot='0x00' function='0x0'/>
      </source>
      <alias name='hostdev0'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x01' function='0x0'/>
    </hostdev>
Also see 
<address domain='0x0003' bus='0x09' slot='0x00' function='0x08'/>
<address domain='0x0003' bus='0x09' slot='0x00' function='0x09'/>
<address domain='0x0003' bus='0x09' slot='0x00' function='0x0a'/>

Check within the guest

[root@localhost ~]# lspci
00:01.0 Ethernet controller: Broadcom Limited NetXtreme BCM5719 Gigabit Ethernet PCIe (rev 01)
...
00:08.0 Ethernet controller: Broadcom Limited NetXtreme BCM5719 Gigabit Ethernet PCIe (rev 01)
00:09.0 Ethernet controller: Broadcom Limited NetXtreme BCM5719 Gigabit Ethernet PCIe (rev 01)
00:0a.0 Ethernet controller: Broadcom Limited NetXtreme BCM5719 Gigabit Ethernet PCIe (rev 01)

Reboot the VM and devices can be seen in lspci.
After destroying the VM, the devices are returned to the host. The output below is for the first device; the other three look the same.
# virsh nodedev-dumpxml pci_0003_09_00_0
<device>
  <name>pci_0003_09_00_0</name>
  <path>/sys/devices/pci0003:00/0003:00:00.0/0003:01:00.0/0003:02:09.0/0003:09:00.0</path>
  <parent>pci_0003_02_09_0</parent>
  <driver>
    <name>tg3</name>
  </driver>
Case 2: Hotplug a host PCIe device to the guest.
1. Unbind other devices in same iommu group from the host.
# virsh nodedev-detach  pci_0003_09_00_1
Device pci_0003_09_00_1 detached

# virsh nodedev-reset  pci_0003_09_00_1
Device pci_0003_09_00_1 reset

Do the same for pci_0003_09_00_0 and pci_0003_09_00_2.


2. Hot plug pci_0003_09_00_3 to the guest; lspci in the guest lists the device.
# virsh attach-device vm1  device_hostdev.xml
Device attached successfully

3. The device cannot be detached because of bug 1272300.
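
Step 1 requires every other device in the same IOMMU group to be detached from the host. A rough sketch of how those devices could be enumerated, assuming the usual sysfs layout (`/sys/bus/pci/devices/<address>/iommu_group/devices/`); both function names are hypothetical helpers and not part of virsh.

```shell
# Hypothetical helper: convert a PCI address such as 0003:09:00.1 into
# the node device name libvirt uses (pci_0003_09_00_1).
nodedev_name() {
    printf 'pci_%s\n' "$(printf '%s' "$1" | tr ':.' '__')"
}

# Hypothetical helper: print the node device names of all devices that
# share an IOMMU group with the given PCI address, except the device itself.
iommu_siblings() {
    for dev in /sys/bus/pci/devices/"$1"/iommu_group/devices/*; do
        [ -e "$dev" ] || continue
        addr=${dev##*/}
        [ "$addr" = "$1" ] || nodedev_name "$addr"
    done
}

# Possible usage on the host (not executed here):
#   for d in $(iommu_siblings 0003:09:00.3); do virsh nodedev-detach "$d"; done
```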

Comment 4 Dan Zheng 2017-05-10 08:04:11 UTC
Test packages:

libvirt-3.2.0-4.el7.ppc64le
qemu-kvm-rhev-2.9.0-2.el7.ppc64le
kernel-3.10.0-657.el7.ppc64le

Both cases above passed.

Comment 5 Andrea Bolognani 2017-05-10 08:24:37 UTC
(In reply to Dan Zheng from comment #3)
[...]
> Check within the guest
> 
> [root@localhost ~]# lspci
> 00:01.0 Ethernet controller: Broadcom Limited NetXtreme BCM5719 Gigabit
> Ethernet PCIe (rev 01)
> ...
> 00:08.0 Ethernet controller: Broadcom Limited NetXtreme BCM5719 Gigabit
> Ethernet PCIe (rev 01)
> 00:09.0 Ethernet controller: Broadcom Limited NetXtreme BCM5719 Gigabit
> Ethernet PCIe (rev 01)
> 00:0a.0 Ethernet controller: Broadcom Limited NetXtreme BCM5719 Gigabit
> Ethernet PCIe (rev 01)

It's not enough to check whether devices are visible in the
guest: for the purpose of this bug, it's critical that the
PCIe config space is also exposed. You can make sure it is
by looking for PCI capabilities >=100, e.g.

  $ sudo lspci -vvs 0003:09:00.0
  0003:09:00.0 Ethernet controller: Broadcom Limited NetXtreme BCM5719 Gigabit Ethernet PCIe (rev 01)
          Subsystem: IBM Device 0420
          Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr+ Stepping- SERR+ FastB2B- DisINTx+
          Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
          Latency: 0
          Interrupt: pin A routed to IRQ 508
          NUMA node: 1
          Region 0: Memory at 250100000000 (64-bit, prefetchable) [size=64K]
          Region 2: Memory at 250100010000 (64-bit, prefetchable) [size=64K]
          Region 4: Memory at 250100020000 (64-bit, prefetchable) [size=64K]
          [virtual] Expansion ROM at 3fe281000000 [disabled] [size=512K]
          Capabilities: [48] Power Management version 3
                  Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA PME(D0+,D1-,D2-,D3hot+,D3cold+)
                  Status: D0 NoSoftRst+ PME-Enable- DSel=0 DScale=1 PME-
          [...]
          Capabilities: [100 v1] Advanced Error Reporting
                  UESta:  DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
                  UEMsk:  DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
                  UESvrt: DLP+ SDES+ TLP- FCP+ CmpltTO- CmpltAbrt- UnxCmplt- RxOF+ MalfTLP+ ECRC- UnsupReq- ACSViol-
                  CESta:  RxErr- BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr-
                  CEMsk:  RxErr- BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr+
                  AERCap: First Error Pointer: 00, GenCap+ CGenEn+ ChkCap+ ChkEn+
          Capabilities: [13c v1] Device Serial Number 00-00-98-be-94-04-14-04
          Capabilities: [150 v1] Power Budgeting <?>
          Capabilities: [160 v1] Virtual Channel
                  Caps:   LPEVC=0 RefClk=100ns PATEntryBits=1
                  Arb:    Fixed- WRR32- WRR64- WRR128-
                  Ctrl:   ArbSelect=Fixed
                  Status: InProgress-
                  VC0:    Caps:   PATOffset=00 MaxTimeSlots=1 RejSnoopTrans-
                          Arb:    Fixed- WRR32- WRR64- WRR128- TWRR128- WRR256-
                          Ctrl:   Enable+ ID=0 ArbSelect=Fixed TC/VC=ff
                          Status: NegoPending- InProgress-
          Capabilities: [230 v1] Transaction Processing Hints
                  Interrupt vector mode supported
                  Steering table in MSI-X table
          Kernel driver in use: tg3
          Kernel modules: tg3

> Case 2: Hotplug a host PCIe device to the guest.
> 1. Unbind other devices in same iommu group from the host.
> # virsh nodedev-detach  pci_0003_09_00_1
> Device pci_0003_09_00_1 detached
> 
> # virsh nodedev-reset  pci_0003_09_00_1
> Device pci_0003_09_00_1 reset

Not sure the reset is really needed, but I don't see how it
could hurt either :)

> Same to pci_0003_09_00_0, pci_0003_09_00_2
> 
> 
> 2. Hot plug pci_0003_09_00_3 to the guest and lspci can list the device in
> guest.
> # virsh attach-device vm1  device_hostdev.xml
> Device attached successfully
> 
> 3. Can not detach the device as bug 1272300.

Since all devices in the IOMMU group have been detached
from the host, you should be able to detach the device from
the guest safely despite bug 1272300.


For completeness' sake, it would be useful to make sure the
extended config space is not exposed to guests when using
pseries-rhel7.3.0 or older.

Comment 6 Dan Zheng 2017-05-11 10:03:29 UTC
Thanks Andrea for pointing it out.

Retest Case 2.

1. Detach pci_0003_09_00_0 ~ pci_0003_09_00_2 from the host.
2. Start guest and attach pci_0003_09_00_3 to the guest. It is ok.
3. Check the configuration space: capabilities >=100 are visible in the guest.

# lspci
00:01.0 Ethernet controller: Broadcom Limited NetXtreme BCM5719 Gigabit Ethernet PCIe (rev 01)

# lspci -vvs 00:01.0
00:01.0 Ethernet controller: Broadcom Limited NetXtreme BCM5719 Gigabit Ethernet PCIe (rev 01)
	Subsystem: IBM Device 0420
	Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx+
	Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
	Latency: 0
	Interrupt: pin B routed to IRQ 17
	Region 0: Memory at 210000020000 (64-bit, prefetchable) [size=64K]
	Region 2: Memory at 210000030000 (64-bit, prefetchable) [size=64K]
	Region 4: Memory at 210000040000 (64-bit, prefetchable) [size=64K]
	[virtual] Expansion ROM at 200081000000 [disabled] [size=512K]
	Capabilities: [48] Power Management version 3
		Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA PME(D0+,D1-,D2-,D3hot+,D3cold+)
		Status: D0 NoSoftRst+ PME-Enable- DSel=0 DScale=1 PME-
	Capabilities: [50] Vital Product Data
		Product Name: PCIe2 4-port 1GbE Adapter
		Read-only fields:
			[FN] Unknown: 30 30 45 32 38 37 33
			[EC] Engineering changes: D77470
			[CC] Unknown: 35 37 36 46
			[PN] Part number: 00E2872
			[FC] Unknown: 35 38 39 39
			[SN] Serial number: YL50203CD12T
			[MN] Manufacture ID: 36 43 41 45 38 42 36 41 38 32 30 34
			[RV] Reserved: checksum good, 83 byte(s) reserved
		Read/write fields:
			[YB] System specific: OFMENA\x02\x04\x00\x00\x00\x00\x00\x00\x00\x01\x00\x01\x00\x01\x00\x01\x00\x01\x00\x01\x00\x02\x00\x03\x00\x01\x00\x01\x00\x04\x00\x03\x00\x01\x00\x01\x00\x08
			[RW] Read-write area: 137 byte(s) free
		End
	Capabilities: [58] MSI: Enable- Count=1/8 Maskable- 64bit+
		Address: 0000000000000000  Data: 0000
	Capabilities: [a0] MSI-X: Enable+ Count=17 Masked-
		Vector table: BAR=4 offset=00000000
		PBA: BAR=4 offset=00001000
	Capabilities: [ac] Express (v2) Endpoint, MSI 00
		DevCap:	MaxPayload 256 bytes, PhantFunc 0, Latency L0s <4us, L1 <64us
			ExtTag- AttnBtn- AttnInd- PwrInd- RBE+ FLReset+ SlotPowerLimit 25.000W
		DevCtl:	Report errors: Correctable- Non-Fatal+ Fatal+ Unsupported+
			RlxdOrd- ExtTag- PhantFunc- AuxPwr+ NoSnoop- FLReset-
			MaxPayload 256 bytes, MaxReadReq 512 bytes
		DevSta:	CorrErr- UncorrErr- FatalErr- UnsuppReq- AuxPwr+ TransPend-
		LnkCap:	Port #0, Speed 5GT/s, Width x4, ASPM L0s L1, Exit Latency L0s <2us, L1 <4us
			ClockPM+ Surprise- LLActRep- BwNot- ASPMOptComp-
		LnkCtl:	ASPM Disabled; RCB 64 bytes Disabled- CommClk-
			ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
		LnkSta:	Speed 5GT/s, Width x4, TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
		DevCap2: Completion Timeout: Range ABCD, TimeoutDis+, LTR-, OBFF Not Supported
		DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis+, LTR-, OBFF Disabled
		LnkSta2: Current De-emphasis Level: -6dB, EqualizationComplete-, EqualizationPhase1-
			 EqualizationPhase2-, EqualizationPhase3-, LinkEqualizationRequest-
**************** Extended Configuration Space *************************
	Capabilities: [100 v1] Advanced Error Reporting
		UESta:	DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
		UEMsk:	DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
		UESvrt:	DLP+ SDES+ TLP- FCP+ CmpltTO- CmpltAbrt- UnxCmplt- RxOF+ MalfTLP+ ECRC- UnsupReq- ACSViol-
		CESta:	RxErr- BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr-
		CEMsk:	RxErr- BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr+
		AERCap:	First Error Pointer: 00, GenCap+ CGenEn+ ChkCap+ ChkEn+
	Capabilities: [13c v1] Device Serial Number 00-00-6c-ae-8b-6a-82-07
	Capabilities: [150 v1] Power Budgeting <?>
	Capabilities: [160 v1] Virtual Channel
		Caps:	LPEVC=0 RefClk=100ns PATEntryBits=1
		Arb:	Fixed- WRR32- WRR64- WRR128-
		Ctrl:	ArbSelect=Fixed
		Status:	InProgress-
		VC0:	Caps:	PATOffset=00 MaxTimeSlots=1 RejSnoopTrans-
			Arb:	Fixed- WRR32- WRR64- WRR128- TWRR128- WRR256-
			Ctrl:	Enable+ ID=0 ArbSelect=Fixed TC/VC=ff
			Status:	NegoPending- InProgress-
	Kernel driver in use: tg3
	Kernel modules: tg3


4. Yesterday I ran into some abnormal problems, such as 'host no response', when detaching the PCIe device. I tried attach/detach several more times today and it now works as expected.

5. Configure the guest with machine type pseries-rhel7.3.0 and start the guest.
6. Repeat the attach and check the capabilities in the guest. Capabilities: [ac] is found, but the guest cannot access the Extended Configuration Space (>=100).

# lspci -vvs 00:01.0
00:01.0 Ethernet controller: Broadcom Limited NetXtreme BCM5719 Gigabit Ethernet PCIe (rev 01)
	Subsystem: IBM Device 0420
	Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx+
	Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
	Latency: 0
	Interrupt: pin B routed to IRQ 17
	Region 0: Memory at 10120020000 (64-bit, prefetchable) [size=64K]
	Region 2: Memory at 10120030000 (64-bit, prefetchable) [size=64K]
	Region 4: Memory at 10120040000 (64-bit, prefetchable) [size=64K]
	[virtual] Expansion ROM at 100a1000000 [disabled] [size=512K]
	Capabilities: [48] Power Management version 3
		Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA PME(D0+,D1-,D2-,D3hot+,D3cold+)
		Status: D0 NoSoftRst+ PME-Enable- DSel=0 DScale=1 PME-
	Capabilities: [50] Vital Product Data
		Product Name: PCIe2 4-port 1GbE Adapter
		Read-only fields:
			[FN] Unknown: 30 30 45 32 38 37 33
			[EC] Engineering changes: D77470
			[CC] Unknown: 35 37 36 46
			[PN] Part number: 00E2872
			[FC] Unknown: 35 38 39 39
			[SN] Serial number: YL50203CD12T
			[MN] Manufacture ID: 36 43 41 45 38 42 36 41 38 32 30 34
			[RV] Reserved: checksum good, 83 byte(s) reserved
		Read/write fields:
			[YB] System specific: OFMENA\x02\x04\x00\x00\x00\x00\x00\x00\x00\x01\x00\x01\x00\x01\x00\x01\x00\x01\x00\x01\x00\x02\x00\x03\x00\x01\x00\x01\x00\x04\x00\x03\x00\x01\x00\x01\x00\x08
			[RW] Read-write area: 137 byte(s) free
		End
	Capabilities: [58] MSI: Enable- Count=1/8 Maskable- 64bit+
		Address: 0000000000000000  Data: 0000
	Capabilities: [a0] MSI-X: Enable+ Count=17 Masked-
		Vector table: BAR=4 offset=00000000
		PBA: BAR=4 offset=00001000
	Capabilities: [ac] Express (v2) Endpoint, MSI 00
		DevCap:	MaxPayload 256 bytes, PhantFunc 0, Latency L0s <4us, L1 <64us
			ExtTag- AttnBtn- AttnInd- PwrInd- RBE+ FLReset+ SlotPowerLimit 25.000W
		DevCtl:	Report errors: Correctable- Non-Fatal+ Fatal+ Unsupported+
			RlxdOrd- ExtTag- PhantFunc- AuxPwr+ NoSnoop- FLReset-
			MaxPayload 256 bytes, MaxReadReq 512 bytes
		DevSta:	CorrErr- UncorrErr- FatalErr- UnsuppReq- AuxPwr+ TransPend-
		LnkCap:	Port #0, Speed 5GT/s, Width x4, ASPM L0s L1, Exit Latency L0s <2us, L1 <4us
			ClockPM+ Surprise- LLActRep- BwNot- ASPMOptComp-
		LnkCtl:	ASPM Disabled; RCB 64 bytes Disabled- CommClk-
			ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
		LnkSta:	Speed 5GT/s, Width x4, TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
		DevCap2: Completion Timeout: Range ABCD, TimeoutDis+, LTR-, OBFF Not Supported
		DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis+, LTR-, OBFF Disabled
		LnkSta2: Current De-emphasis Level: -6dB, EqualizationComplete-, EqualizationPhase1-
			 EqualizationPhase2-, EqualizationPhase3-, LinkEqualizationRequest-
	Kernel driver in use: tg3
	Kernel modules: tg3
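
Because the result depends on the machine type, it can help to record which one the guest is actually running with. A minimal sketch, assuming the domain XML is piped in from `virsh dumpxml` and that attributes are single-quoted as in the dumps above; `machine_type` is a hypothetical helper name.

```shell
# Hypothetical helper: read domain XML on stdin and print the value of
# the machine attribute of the <type> element.
machine_type() {
    sed -n "s/.*<type[^>]*machine='\([^']*\)'.*/\1/p"
}
```

For example, `virsh dumpxml vm1 | machine_type` would be expected to print pseries-rhel7.3.0 for the guest configured in step 5.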

Comment 7 Dan Zheng 2017-05-17 06:57:16 UTC
Based on the tests above, I am marking this bug as verified.

