Bug 1346688 - [Q35] vfio read-only SR-IOV capability confuses OVMF
Summary: [Q35] vfio read-only SR-IOV capability confuses OVMF
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Linux 7
Classification: Red Hat
Component: qemu-kvm-rhev
Version: 7.3
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: rc
Assignee: Alex Williamson
QA Contact: jingzhao
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2016-06-15 08:05 UTC by jingzhao
Modified: 2017-06-28 07:41 UTC (History)
8 users

Fixed In Version: qemu-kvm-rhev-2.6.0-12.el7
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2016-11-07 21:17:36 UTC
Target Upstream Version:


Attachments (Terms of Use)
ovmf log (87.46 KB, text/plain)
2016-06-15 08:06 UTC, jingzhao
ovmf debug log (65.05 KB, text/plain)
2016-06-15 22:05 UTC, Alex Williamson
updated ovmf log (65.05 KB, text/plain)
2016-06-16 01:11 UTC, jingzhao
the ovmf log of add rombar parameter (65.05 KB, text/plain)
2016-06-20 03:25 UTC, jingzhao


Links
System ID Priority Status Summary Last Updated
Red Hat Product Errata RHBA-2016:2673 normal SHIPPED_LIVE qemu-kvm-rhev bug fix and enhancement update 2016-11-08 01:06:13 UTC

Description jingzhao 2016-06-15 08:05:32 UTC
Description of problem:
Guest didn't boot up when passing a device through with vfio-pci under OVMF

Version-Release number of selected component (if applicable):
kernel-3.10.0-433.el7.x86_64
qemu-kvm-rhev-2.6.0-5.el7.x86_64
OVMF-20160608-1.git988715a.el7.noarch.rpm

How reproducible:
3/3

Steps to Reproduce:
1. Boot guest with following cli
[root@localhost home]# cat q35-ovmf.sh 
/usr/libexec/qemu-kvm \
-M q35 \
-cpu Nehalem \
-monitor stdio \
-m 4G \
-vga qxl \
-drive file=/usr/share/OVMF/OVMF_CODE.secboot.fd,if=pflash,format=raw,unit=0,readonly=on \
-drive file=/usr/share/OVMF/OVMF_VARS.fd,if=pflash,format=raw,unit=1 \
-debugcon file:/home/q35.ovmf.log \
-global isa-debugcon.iobase=0x402 \
-spice port=5932,disable-ticketing \
-smp 4,sockets=4,cores=1,threads=1 \
-device ioh3420,bus=pcie.0,id=root1.0,slot=1 \
-device x3130-upstream,bus=root1.0,id=upstream1.1 \
-device xio3130-downstream,bus=upstream1.1,id=downstream1.1,chassis=2 \
-device virtio-net-pci,bus=downstream1.1,netdev=tap10,mac=9a:6a:6b:6c:6d:6e -netdev tap,id=tap10 \
-device ioh3420,bus=pcie.0,id=root1.1,slot=2 \
-device x3130-upstream,bus=root1.1,id=upstream1.2 \
-device xio3130-downstream,bus=upstream1.2,id=downstream1.2,chassis=3 \
-device xio3130-downstream,bus=upstream1.2,id=downstream1.3,chassis=4 \
-drive if=none,id=drive0,file=/home/pxb-ovmf.qcow2 \
-device virtio-blk-pci,drive=drive0,scsi=off,bus=downstream1.2,disable-legacy=on,disable-modern=off  \
-device ioh3420,bus=pcie.0,id=root1.2,slot=3 \
-device vfio-pci,host=0000:03:00.0,id=vfio,bus=downstream1.3


Actual results:
Guest didn't boot up successfully; please check the attached OVMF log

Expected results:
Guest can boot up successfully

Additional info:

Comment 1 jingzhao 2016-06-15 08:06:46 UTC
Created attachment 1168238 [details]
ovmf log

Comment 3 Laszlo Ersek 2016-06-15 11:05:13 UTC
Jing Zhao,

two remarks / questions:

- Your use of OVMF_VARS.fd is not correct. Please refer to bug 1308678 comment 23 bullet (1) for details. (This is independent from the functionality being tested -- it's a general remark, but I think it's worth pointing out.)

- I checked the attached OVMF log file, from comment 1. It says

PciHostBridgeGetRootBridges: 2 extra root buses reported by QEMU
InitRootBridge: populated root bus 0, with room for 7 subordinate bus(es)
InitRootBridge: populated root bus 8, with room for 11 subordinate bus(es)
InitRootBridge: populated root bus 20, with room for 235 subordinate bus(es)

However, this doesn't seem to match the command line given in comment 0 -- on that command line, you do not have any pxb-pcie devices.

At the moment it appears to me that you tried VFIO device assignment in combination with pxb-pcie, and you ran into the bus_nr problem that we've been discussing in bug 1345738. Looking at the OVMF debug log, this is the impression I'm getting. And I think you ended up pasting a different QEMU command line (without pxb-pcie devices) into comment 0.

Can you please clarify? Thanks.

Comment 4 Laszlo Ersek 2016-06-15 11:23:28 UTC
Jing Zhao,

another remark: *assuming* that you intend to place (a) the assigned device, and/or (b) the *modern* virtio-blk device, behind a pxb-pcie extra root bridge, please be aware of bug 1323976.

Namely, unlike SeaBIOS, the edk2 PCI infrastructure built into OVMF prefers to allocate 64-bit MMIO BARs of PCI devices outside of the 32-bit address space. This works fine if you place such PCI device directly on the "main" root bridge (bus_nr=0), but it can break if the PCI device with 64-bit MMIO BARs is elsewhere. The possible breakage is due to QEMU's ACPI generator producing incorrect resource descriptors --> see bug 1323976.

This may affect both modern virtio devices, and assigned physical devices. There are two work-arounds:

- Plug these devices directly into "pcie.0".
- Alternatively, pass the following switch to QEMU, disabling the 64-bit MMIO
  aperture for OVMF:

  -fw_cfg name=opt/ovmf/X-PciMmio64Mb,string=0

Thanks.

Comment 5 Laszlo Ersek 2016-06-15 11:25:39 UTC
Sigh, I CC'd Marcel while writing my previous comment, but of course Bugzilla had to apply some automatic changes meanwhile, so my metadata changes got lost. Adding the CC now.

Comment 6 Alex Williamson 2016-06-15 12:56:07 UTC
(In reply to jingzhao from comment #0)
> -device vfio-pci,host=0000:03:00.0,id=vfio,bus=downstream1.3

What is this device?  Please always identify the device being assigned.

Laszlo also has additional questions in comment 3, with a needinfo regarding the consistency of the original report.

Comment 7 Alex Williamson 2016-06-15 22:05:10 UTC
Created attachment 1168522 [details]
ovmf debug log

Reproduced with 82576 PF:

# lspci -vs 7:00.0
07:00.0 Ethernet controller: Intel Corporation 82576 Gigabit Network Connection (rev 01)
	Subsystem: Intel Corporation Gigabit ET Dual Port Server Adapter
	Physical Slot: 5
	Flags: fast devsel, IRQ 16
	Memory at ef420000 (32-bit, non-prefetchable) [disabled] [size=128K]
	Memory at ef000000 (32-bit, non-prefetchable) [disabled] [size=4M]
	I/O ports at 8020 [disabled] [size=32]
	Memory at ef4c4000 (32-bit, non-prefetchable) [disabled] [size=16K]
	Expansion ROM at eec00000 [disabled] [size=4M]

OVMF log ends with:

PciHostBridge: SubmitResources for PciRoot(0x0)
 I/O: Granularity/SpecificFlag = 0 / 01
      Length/Alignment = 0x3000 / 0xFFF
 Mem: Granularity/SpecificFlag = 32 / 00
      Length/Alignment = 0x1E8200000 / 0xEF49FFFF
PciBus: HostBridge->SubmitResources() - Invalid Parameter

ASSERT_EFI_ERROR (Status = Invalid Parameter)
ASSERT /builddir/build/BUILD/ovmf-988715a/MdeModulePkg/Bus/Pci/PciBusDxe/PciLib.c(561): !EFI_ERROR (Status)

VM boots with SeaBIOS.

OVMF-20160608-1.git988715a.el7.noarch
qemu-kvm-rhev-2.6.0-6.el7.x86_64

Comment 8 Alex Williamson 2016-06-15 22:06:47 UTC
Seems like an OVMF bug, reassigning

Comment 9 Alex Williamson 2016-06-15 22:23:39 UTC
The command works with either a non-SR-IOV-capable device (82579LM) or a VF (82576), so is it the SR-IOV enumeration that kills OVMF?  Did I get lucky picking an SR-IOV PF on my first try?

Comment 10 Alex Williamson 2016-06-15 22:28:47 UTC
NB, vfio exposes the SR-IOV capability as read-only, in case that's confusing OVMF, but it would seem unusual for OVMF to blindly attempt to enable SR-IOV.

Comment 11 Alex Williamson 2016-06-15 22:53:38 UTC
Yes, if vfio hides the SR-IOV capability on the device, OVMF boots.

Comment 12 jingzhao 2016-06-16 01:10:23 UTC
Thanks Laszlo

Corrected information for the bug:

1. The NIC which is passed through to the guest

[root@hp-z800-01 home]# lspci -vvv -s 03:00.0
03:00.0 Ethernet controller: Intel Corporation 82576 Gigabit Network Connection (rev 01)
	Subsystem: Intel Corporation Gigabit ET Dual Port Server Adapter
	Physical Slot: 1
	Control: I/O- Mem- BusMaster- SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx+
	Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
	Interrupt: pin A routed to IRQ 29
	Region 0: Memory at e4800000 (32-bit, non-prefetchable) [disabled] [size=128K]
	Region 1: Memory at e4000000 (32-bit, non-prefetchable) [disabled] [size=4M]
	Region 2: I/O ports at c000 [disabled] [size=32]
	Region 3: Memory at e4840000 (32-bit, non-prefetchable) [disabled] [size=16K]
	Capabilities: [40] Power Management version 3
		Flags: PMEClk- DSI+ D1- D2- AuxCurrent=0mA PME(D0+,D1-,D2-,D3hot+,D3cold+)
		Status: D0 NoSoftRst- PME-Enable- DSel=0 DScale=1 PME-
	Capabilities: [50] MSI: Enable- Count=1/1 Maskable+ 64bit+
		Address: 0000000000000000  Data: 0000
		Masking: 00000000  Pending: 00000000
	Capabilities: [70] MSI-X: Enable- Count=10 Masked-
		Vector table: BAR=3 offset=00000000
		PBA: BAR=3 offset=00002000
	Capabilities: [a0] Express (v2) Endpoint, MSI 00
		DevCap:	MaxPayload 512 bytes, PhantFunc 0, Latency L0s <512ns, L1 <64us
			ExtTag- AttnBtn- AttnInd- PwrInd- RBE+ FLReset+
		DevCtl:	Report errors: Correctable- Non-Fatal- Fatal- Unsupported-
			RlxdOrd+ ExtTag- PhantFunc- AuxPwr- NoSnoop+ FLReset-
			MaxPayload 256 bytes, MaxReadReq 512 bytes
		DevSta:	CorrErr- UncorrErr- FatalErr- UnsuppReq- AuxPwr+ TransPend-
		LnkCap:	Port #0, Speed 2.5GT/s, Width x4, ASPM L0s L1, Exit Latency L0s <4us, L1 <64us
			ClockPM- Surprise- LLActRep- BwNot-
		LnkCtl:	ASPM Disabled; RCB 64 bytes Disabled- CommClk+
			ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
		LnkSta:	Speed 2.5GT/s, Width x4, TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
		DevCap2: Completion Timeout: Range ABCD, TimeoutDis+, LTR-, OBFF Not Supported
		DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis-, LTR-, OBFF Disabled
		LnkCtl2: Target Link Speed: 2.5GT/s, EnterCompliance- SpeedDis-
			 Transmit Margin: Normal Operating Range, EnterModifiedCompliance- ComplianceSOS-
			 Compliance De-emphasis: -6dB
		LnkSta2: Current De-emphasis Level: -6dB, EqualizationComplete-, EqualizationPhase1-
			 EqualizationPhase2-, EqualizationPhase3-, LinkEqualizationRequest-
	Capabilities: [100 v1] Advanced Error Reporting
		UESta:	DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
		UEMsk:	DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
		UESvrt:	DLP+ SDES- TLP- FCP+ CmpltTO- CmpltAbrt- UnxCmplt- RxOF+ MalfTLP+ ECRC- UnsupReq- ACSViol-
		CESta:	RxErr- BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr+
		CEMsk:	RxErr- BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr+
		AERCap:	First Error Pointer: 00, GenCap- CGenEn- ChkCap- ChkEn-
	Capabilities: [140 v1] Device Serial Number 00-1b-21-ff-ff-42-33-84
	Capabilities: [150 v1] Alternative Routing-ID Interpretation (ARI)
		ARICap:	MFVC- ACS-, Next Function: 1
		ARICtl:	MFVC- ACS-, Function Group: 0
	Capabilities: [160 v1] Single Root I/O Virtualization (SR-IOV)
		IOVCap:	Migration-, Interrupt Message Number: 000
		IOVCtl:	Enable- Migration- Interrupt- MSE- ARIHierarchy+
		IOVSta:	Migration-
		Initial VFs: 8, Total VFs: 8, Number of VFs: 0, Function Dependency Link: 00
		VF offset: 128, stride: 2, Device ID: 10ca
		Supported Page Size: 00000553, System Page Size: 00000001
		Region 0: Memory at 00000000e4848000 (64-bit, non-prefetchable)
		Region 3: Memory at 00000000e4868000 (64-bit, non-prefetchable)
		VF Migration: offset: 00000000, BIR: 0
	Kernel driver in use: vfio-pci


2. The test command line

/usr/libexec/qemu-kvm \
-M q35 \
-cpu Nehalem \
-monitor stdio \
-m 4G \
-vga qxl \
-drive file=/usr/share/OVMF/OVMF_CODE.secboot.fd,if=pflash,format=raw,unit=0,readonly=on \
-drive file=/home/my_varstore.fd,if=pflash,format=raw,unit=1 \
-debugcon file:/home/q35.ovmf.log \
-global isa-debugcon.iobase=0x402 \
-spice port=5932,disable-ticketing \
-smp 4,sockets=4,cores=1,threads=1 \
-device ioh3420,bus=pcie.0,id=root1.0,slot=1 \
-device x3130-upstream,bus=root1.0,id=upstream1.1 \
-device xio3130-downstream,bus=upstream1.1,id=downstream1.1,chassis=2 \
-device virtio-net-pci,bus=downstream1.1,netdev=tap10,mac=9a:6a:6b:6c:6d:6e -netdev tap,id=tap10 \
-device ioh3420,bus=pcie.0,id=root1.1,slot=2 \
-device x3130-upstream,bus=root1.1,id=upstream1.2 \
-device xio3130-downstream,bus=upstream1.2,id=downstream1.2,chassis=3 \
-device xio3130-downstream,bus=upstream1.2,id=downstream1.3,chassis=4 \
-drive if=none,id=drive0,file=/home/pxb-ovmf.qcow2 \
-device virtio-blk-pci,drive=drive0,scsi=off,bus=downstream1.2,disable-legacy=on,disable-modern=off  \
-device ioh3420,bus=pcie.0,id=root1.2,slot=3 \
-device vfio-pci,host=0000:03:00.0,id=vfio,bus=downstream1.3

3. Updated the OVMF log from my testing


Thanks
Jing Zhao

Comment 13 jingzhao 2016-06-16 01:11:33 UTC
Created attachment 1168549 [details]
updated ovmf log

Comment 14 Laszlo Ersek 2016-06-17 18:52:13 UTC
(In reply to Alex Williamson from comment #9)

> Did I get lucky picking an SR-IOV PF on my first try?

Yes. In both OVMF log files (comment 7, comment 13), I can see messages such
as:

> PciBus: Discovered PCI @ [07|00|00]
>  ARI: CapOffset = 0x150
>  SR-IOV: SupportedPageSize = 0x553; SystemPageSize = 0x1; FirstVFOffset =
>          0x180; InitialVFs = 0x8; ReservedBusNum = 0x2; CapOffset = 0x160
>    BAR[0]: Type =  Mem32; Alignment = 0x1FFFF;	Length =
>    0x20000;	Offset = 0x10
>    BAR[1]: Type =  Mem32; Alignment = 0x3FFFFF;	Length =
>    0x400000;	Offset = 0x14
>    BAR[2]: Type =   Io32; Alignment = 0x1F;	Length = 0x20;	Offset =
>    0x18
>    BAR[3]: Type =  Mem32; Alignment = 0x3FFF;	Length = 0x4000;
>    Offset = 0x1C
>  VFBAR[0]: Type =  Mem64; Alignment = 0xEF49FFFF;	Length =
>  0xEF4A0000;	Offset = 0x184
>  VFBAR[2]: Type =  Mem64; Alignment = 0xEF47FFFF;	Length =
>  0xEF480000;	Offset = 0x190

(which is the very first time I see VFBARs in this section of the log).
Then, when the collected resources are submitted to PciHostBridgeDxe, it
blows up:

> PciHostBridge: SubmitResources for PciRoot(0x0)
>  I/O: Granularity/SpecificFlag = 0 / 01
>       Length/Alignment = 0x3000 / 0xFFF
>  Mem: Granularity/SpecificFlag = 32 / 00
>       Length/Alignment = 0x1E8200000 / 0xEF49FFFF
> PciBus: HostBridge->SubmitResources() - Invalid Parameter
>
> ASSERT_EFI_ERROR (Status = Invalid Parameter)
> ASSERT
> /builddir/build/BUILD/ovmf-988715a/MdeModulePkg/Bus/Pci/PciBusDxe/PciLib.c(561):
> !EFI_ERROR (Status)

The direct reason is that the Length field (the sum of MMIO resources for
the bridge, 0x1E8200000) is greater than 4GB, while the resource type is
32-bit MMIO (Granularity=32).
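The check that trips can be reduced to a few lines (a simplified illustration of the symptom only; this is not the actual PciHostBridgeDxe validation code):

```python
# A 32-bit MMIO window must fit below 4GB: a submitted Length above
# 0xFFFFFFFF with Granularity=32 cannot be satisfied and is rejected
# as an invalid parameter. Simplified sketch, not PciHostBridgeDxe.

def submit_mem_resource(granularity, length):
    if granularity == 32 and length > 0xFFFFFFFF:
        return "Invalid Parameter"
    return "Success"

# The value from the log: bogus VFBAR sizes inflate the 32-bit MMIO
# window to 0x1E8200000 (~7.6 GiB), which can never be a 32-bit window.
assert submit_mem_resource(32, 0x1E8200000) == "Invalid Parameter"
# A sane 82576 footprint (a few MB) would have been accepted.
assert submit_mem_resource(32, 0x424000) == "Success"
```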

It seems that the Length and Alignment fields have special meanings for VF
BARs (I skimmed the SR-IOV spec very superficially).

... Hm, I think I might even suspect what causes this. I believe it is
<https://github.com/tianocore/edk2/commit/05070c1b471b0>. The 64-bit MMIO
BAR is degraded to 32-bit if it is a VFBAR and the device has an option ROM.
(See the DegradeResource() function in the linked patch.)

I don't understand the reasoning behind this. I'll take the discussion to
the upstream list.

Meanwhile, Alex, Jing Zhao, can you please repeat your tests, with the small
modification that the ROM BAR for the assigned device be turned off?

  -device vfio-pci,...,rombar=0
                       ^^^^^^^^

Thanks!

Comment 15 Alex Williamson 2016-06-17 19:23:04 UTC
Laszlo, note that the 82576 has very modest MMIO requirements; this is typically the "works anywhere" SR-IOV device because it requires <= 2MB of MMIO space, which is the minimum bridge granularity.  If we're coming up with needing more than 4G, it's probably because the read-only SR-IOV capability is being misinterpreted, i.e. there are no sanity checks on the sizing.
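The suspected misinterpretation can be sketched with the classic BAR sizing probe (a simulated config space with hypothetical register values; this is not OVMF's or vfio's actual code):

```python
# Why sizing a read-only BAR goes wrong: the standard probe writes
# all-ones and re-reads, relying on the device hard-wiring the size
# bits to zero. Simulated config space; not OVMF's actual code.

BAR_MEM_MASK = 0xFFFFFFF0  # strip the memory-type bits in the low nibble

def size_bar(read, write, offset):
    """Classic PCI BAR sizing: save, write all-1s, read back, restore."""
    saved = read(offset)
    write(offset, 0xFFFFFFFF)
    readback = read(offset)
    write(offset, saved)
    return (~(readback & BAR_MEM_MASK) + 1) & 0xFFFFFFFF

class WritableBar:
    """Normal BAR: hardware masks writes to the size granularity."""
    def __init__(self, size):
        self.size, self.value = size, 0
    def read(self, offset):
        return self.value
    def write(self, offset, v):
        self.value = v & ~(self.size - 1) & 0xFFFFFFFF

class ReadOnlyBar:
    """VF BAR as vfio exposed it: writes dropped, reads return the base."""
    def __init__(self, base):
        self.base = base
    def read(self, offset):
        return self.base
    def write(self, offset, v):
        pass

wb = WritableBar(0x20000)           # e.g. a 128K BAR
assert size_bar(wb.read, wb.write, 0x10) == 0x20000

ro = ReadOnlyBar(0xE4848000)        # a host-programmed VF BAR base
bogus = size_bar(ro.read, ro.write, 0x24)
assert bogus == 0x1B7B8000          # two's complement of the base: garbage
```

With the register read-only, the probe's read-back is the stale base address, so the computed "size" is just the two's complement of an address -- exactly the kind of huge bogus length that can push the aggregate MMIO window past 4GB.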

Comment 16 Laszlo Ersek 2016-06-17 19:25:09 UTC
Upstream thread: http://thread.gmane.org/gmane.comp.bios.edk2.devel/13381

Comment 18 Alex Williamson 2016-06-18 00:07:33 UTC
FWIW, I'm also open to the idea that QEMU hide the SR-IOV capability from the VM.  We have no support for the VM enabling SR-IOV, so there's really no dependency on exposing this capability.  Kernel-level vfio exposes the capability read-only to prevent users from creating VFs, but it's still up to the hypervisor whether to further expose such capabilities to the VM.

Comment 19 Laszlo Ersek 2016-06-18 00:18:53 UTC
I'll try to follow whatever you deem best -- I guess first we should hear back from the maintainers of PciBusDxe in edk2, about their goals with VF BARs to begin with.

In edk2 there is a feature flag, defined in "MdeModulePkg/MdeModulePkg.dec":

  ## Indicates if the Single Root I/O virtualization is supported.<BR><BR>
  #   TRUE  - Single Root I/O virtualization is supported.<BR>
  #   FALSE - Single Root I/O virtualization is not supported.<BR>
  # @Prompt Enable SRIOV support.
  gEfiMdeModulePkgTokenSpaceGuid.PcdSrIovSupport|TRUE|BOOLEAN|0x10000044

OVMF inherits the default value (TRUE) without overriding it, for the time being. (It can override it if we want it to.) I don't know precisely what this feature flag controls.

But, if it controls the "creation of VFs", then I guess we should disable it? Based on your comment 18.

Comment 20 jingzhao 2016-06-20 03:24:21 UTC
(In reply to Laszlo Ersek from comment #14)
> [...]
> Meanwhile, Alex, Jing Zhao, can you please repeat your tests, with the small
> modification that the ROM BAR for the assigned device be turned off?
> 
>   -device vfio-pci,...,rombar=0
>                        ^^^^^^^^
> 
> Thanks!

Repeated the test with the above change, and it failed. Please check the attached OVMF log with the rombar parameter added.

Comment 21 jingzhao 2016-06-20 03:25:02 UTC
Created attachment 1169645 [details]
the ovmf log of add rombar parameter

Comment 22 Laszlo Ersek 2016-06-20 09:51:06 UTC
Thank you for checking; the symptoms in the log file are identical. So it looks like rombar=0 makes no difference, and I should instrument the edk2 code with debug messages and experiment a little with it.

Comment 24 Laszlo Ersek 2016-06-20 14:04:14 UTC
Upstream thread #2, from a different angle:
http://thread.gmane.org/gmane.comp.bios.edk2.devel/13437

Comment 25 Alex Williamson 2016-06-20 14:31:15 UTC
(In reply to Laszlo Ersek from comment #24)
> Upstream thread #2, from a different angle:
> http://thread.gmane.org/gmane.comp.bios.edk2.devel/13437

Laszlo, some folks think that allowing the guest to enable SR-IOV on an assigned device is not a completely insane thing to do, see http://www.spinics.net/lists/kvm/msg134370.html

The more I think about it, the more I think vfio is asking OVMF to detect something non-standard per the spec.  Sure, it might be robust to detect that the VFBARs aren't getting sized correctly, but is that something we can reasonably expect of guest software?  We can have QEMU hide the SR-IOV capability, though this also gets a little ugly because extended capabilities always start at 0x100 in PCI config space and it's not feasible to relocate capabilities, which means we need to support stubbing that first entry to something a guest will traverse, but not recognize.  If we hope that the guest follows the spec to the letter, we could use capability ID 0x0, except QEMU-pci uses this internally.  ID 0xFFFF also has special meaning for root complex register block based capabilities, which might mean a guest would assume no capabilities at all.  That leaves unlikely-to-be-assigned values, like 0xFFFE.  It's all generally unappealing, but we do some hiding of capabilities in the kernel too, so I'll look to see whether we have a better algorithm there.

Another possibility is that we virtualize the VFBARs to allow them to be sized, but leave the rest of the capability read-only.  I'll need to look through the spec to see if we have any leeway to do this.

Comment 26 Laszlo Ersek 2016-06-20 17:40:57 UTC
Sounds great to me, thanks for looking into this! (And yes, the blurb on the
referenced patch set seems reasonable as well.)

Regarding any possible stubbing out for the SR-IOV capability: the
CreatePciIoDevice() function in
[MdeModulePkg/Bus/Pci/PciBusDxe/PciEnumeratorSupport.c] has a section like
this:

>   //
>   // Initialization for SR-IOV
>   //
>
>   if (PcdGetBool (PcdSrIovSupport)) {
>     Status = LocatePciExpressCapabilityRegBlock (
>                PciIoDevice,
>                EFI_PCIE_CAPABILITY_ID_SRIOV,
>                &PciIoDevice->SrIovCapabilityOffset,
>                NULL
>                );
>     if (!EFI_ERROR (Status)) {

If you can make that LocatePciExpressCapabilityRegBlock() function call to
fail, then SR-IOV will not be used, I think. (Similarly to the effect of
setting PcdSrIovSupport to FALSE in the OVMF platform description files.)

The LocatePciExpressCapabilityRegBlock() function is at the end of
"MdeModulePkg/Bus/Pci/PciBusDxe/PciCommand.c", and it seems to perform a
"fairly standard" traversal of the PCI Express config space.

AFAICT, LocatePciExpressCapabilityRegBlock() is called in three places in total, looking for:
- EFI_PCIE_CAPABILITY_ID_ARI (0x0E)
- EFI_PCIE_CAPABILITY_ID_SRIOV (0x10)
- EFI_PCIE_CAPABILITY_ID_MRIOV (0x11)

These macro definitions are in
"MdePkg/Include/IndustryStandard/PciExpress21.h". Grepping header files for
"EFI_PCIE_CAPABILITY_ID_", I find no other header files with definitions.
All the definitions I find in this header are:

> #define EFI_PCIE_CAPABILITY_ID_SRIOV_CONTROL_ARI_HIERARCHY          0x10
> #define EFI_PCIE_CAPABILITY_ID_ARI        0x0E
> #define EFI_PCIE_CAPABILITY_ID_ATS        0x0F
> #define EFI_PCIE_CAPABILITY_ID_SRIOV      0x10
> #define EFI_PCIE_CAPABILITY_ID_MRIOV      0x11
> #define EFI_PCIE_CAPABILITY_ID_SRIOV_CAPABILITIES               0x04
> #define EFI_PCIE_CAPABILITY_ID_SRIOV_CONTROL                    0x08
> #define EFI_PCIE_CAPABILITY_ID_SRIOV_STATUS                     0x0A
> #define EFI_PCIE_CAPABILITY_ID_SRIOV_INITIALVFS                 0x0C
> #define EFI_PCIE_CAPABILITY_ID_SRIOV_TOTALVFS                   0x0E
> #define EFI_PCIE_CAPABILITY_ID_SRIOV_NUMVFS                     0x10
> #define EFI_PCIE_CAPABILITY_ID_SRIOV_FUNCTION_DEPENDENCY_LINK   0x12
> #define EFI_PCIE_CAPABILITY_ID_SRIOV_FIRSTVF                    0x14
> #define EFI_PCIE_CAPABILITY_ID_SRIOV_VFSTRIDE                   0x16
> #define EFI_PCIE_CAPABILITY_ID_SRIOV_VFDEVICEID                 0x1A
> #define EFI_PCIE_CAPABILITY_ID_SRIOV_SUPPORTED_PAGE_SIZE        0x1C
> #define EFI_PCIE_CAPABILITY_ID_SRIOV_SYSTEM_PAGE_SIZE           0x20
> #define EFI_PCIE_CAPABILITY_ID_SRIOV_BAR0                       0x24
> #define EFI_PCIE_CAPABILITY_ID_SRIOV_BAR1                       0x28
> #define EFI_PCIE_CAPABILITY_ID_SRIOV_BAR2                       0x2C
> #define EFI_PCIE_CAPABILITY_ID_SRIOV_BAR3                       0x30
> #define EFI_PCIE_CAPABILITY_ID_SRIOV_BAR4                       0x34
> #define EFI_PCIE_CAPABILITY_ID_SRIOV_BAR5                       0x38
> #define EFI_PCIE_CAPABILITY_ID_SRIOV_VF_MIGRATION_STATE         0x3C

I think it might be worth a shot to try 0xFFFE or similar.

... But, given how the effect of that would be practically identical to
setting PcdSrIovSupport to FALSE in OvmfPkg/*.dsc (for the time being
anyway!), I don't see a problem with that either. I'm happy to try out both,
assuming I get access to a machine with a suitable NIC. (Obviously for the
stubbing, you would have to provide the QEMU patch :))

Thanks!

Comment 27 Laszlo Ersek 2016-06-20 17:46:38 UTC
... Hm, sorry I didn't see your list posting at <http://thread.gmane.org/gmane.comp.bios.edk2.devel/13437/focus=13439>. If you prefer to research QEMU / kernel changes for this, I'm happy to follow your lead, and/or assist with it as much as I can. If you'd like to take this BZ even, I won't object, obviously :)

Comment 28 Alex Williamson 2016-06-20 22:20:44 UTC
Let's fix this in QEMU; directly exposing a read-only SR-IOV capability to the guest doesn't seem to have much merit or spec compliance.  QEMU patch posted:

http://lists.nongnu.org/archive/html/qemu-devel/2016-06/msg05813.html

(I wonder if we should default to hiding all extended capabilities and add them as we go, but I'll start here)

Comment 29 Laszlo Ersek 2016-06-28 14:46:49 UTC
I managed to reproduce this error, in the following environment:

- assigned device (I350-T2V2 (8086:1521) PF):
  03:00.0 Ethernet controller:
  Intel Corporation I350 Gigabit Network Connection (rev 01)

- domain XML snippet (Q35):

  <controller type='pci' index='3' model='pcie-root-port'>
    <model name='ioh3420'/>
    <target chassis='3' port='0xe8'/>
    <address type='pci' domain='0x0000' bus='0x00' slot='0x1d' function='0x0'/>
  </controller>
  <controller type='pci' index='4' model='pcie-switch-upstream-port'>
    <model name='x3130-upstream'/>
    <address type='pci' domain='0x0000' bus='0x03' slot='0x00' function='0x0'/>
  </controller>
  <controller type='pci' index='5' model='pcie-switch-downstream-port'>
    <model name='xio3130-downstream'/>
    <target chassis='5' port='0x0'/>
    <address type='pci' domain='0x0000' bus='0x04' slot='0x00' function='0x0'/>
  </controller>

  <hostdev mode='subsystem' type='pci' managed='yes'>
    <source>
      <address domain='0x0000' bus='0x03' slot='0x00' function='0x0'/>
    </source>
    <address type='pci' domain='0x0000' bus='0x05' slot='0x00' function='0x0'/>
  </hostdev>

  This generates a QEMU command line like

  -device ioh3420,port=0xe8,chassis=3,id=pci.3,bus=pcie.0,addr=0x1d \
  -device x3130-upstream,id=pci.4,bus=pci.3,addr=0x0 \
  -device xio3130-downstream,port=0x0,chassis=5,id=pci.5,bus=pci.4,addr=0x0 \
  -device vfio-pci,host=03:00.0,id=hostdev3,bus=pci.5,addr=0x0 \

- QEMU: upstream v2.6.0-1395-g40428fe

- host kernel: 4.4.5-200.fc22.x86_64

I confirm that toggling the ROM BAR on/off makes no difference, so the error
is not due to any 64->32 bit resource degradation like I initially
suspected.

I'll go ahead and test Alex's upstream patch (comment 28).

Comment 30 Laszlo Ersek 2016-06-28 16:29:57 UTC
Yes, the patch works:

> PciBus: Discovered PCI @ [03|00|00]
>  ARI: forwarding enabled for PPB[02:00:00]
>  ARI: CapOffset = 0x150
>    BAR[0]: Type =  Mem32; Alignment = 0xFFFFF;  Length = 0x100000;      Offset = 0x10
>    BAR[3]: Type =  Mem32; Alignment = 0x3FFF;   Length = 0x4000;        Offset = 0x1C
>
> [...]
>
> PciBus: Resource Map for Bridge [02|00|00]
> Type =  Mem32; Base = 0x99200000;       Length = 0x200000;      Alignment = 0xFFFFF
>    Base = 0x99200000;   Length = 0x100000;      Alignment = 0xFFFFF;    Owner = PCI [03|00|00:10]
>    Base = 0x99300000;   Length = 0x4000;        Alignment = 0x3FFF;     Owner = PCI [03|00|00:1C]

The Windows Server 2012 R2 guest OS was also launched. The driver for the
NIC was installed automatically. The device looks fine in Device Manager.

I'm still having network connectivity problems though -- I'm debugging them.
(Almost certainly an issue with the dnsmasq / iptables setup on the
gateway machine.) I'd like to have an all-positive result before replying on
the list.

Comment 31 Miroslav Rezanina 2016-07-08 08:39:48 UTC
Fix included in qemu-kvm-rhev-2.6.0-12.el7

Comment 33 jingzhao 2016-08-10 09:15:52 UTC
Verified it with:
kernel-3.10.0-489.el7.x86_64
qemu-img-rhev-2.6.0-19.el7.x86_64
OVMF-20160608-3.git988715a.el7.noarch

Following are the verification steps:

1. Boot guest with following cmd:
/usr/libexec/qemu-kvm \
-M q35 \
-cpu Nehalem \
-monitor stdio \
-m 4G \
-vga qxl \
-drive file=/usr/share/OVMF/OVMF_CODE.secboot.fd,if=pflash,format=raw,unit=0,readonly=on \
-drive file=/home/OVMF_VARS.fd,if=pflash,format=raw,unit=1 \
-debugcon file:/home/q35.ovmf.log \
-global isa-debugcon.iobase=0x402 \
-spice port=5932,disable-ticketing \
-smp 4,sockets=4,cores=1,threads=1 \
-device ioh3420,bus=pcie.0,id=root1.0,slot=1 \
-device x3130-upstream,bus=root1.0,id=upstream1.1 \
-device xio3130-downstream,bus=upstream1.1,id=downstream1.1,chassis=2 \
-device virtio-net-pci,bus=downstream1.1,netdev=tap10,mac=9a:6a:6b:6c:6d:6e -netdev tap,id=tap10 \
-device ioh3420,bus=pcie.0,id=root1.1,slot=2 \
-device x3130-upstream,bus=root1.1,id=upstream1.2 \
-device xio3130-downstream,bus=upstream1.2,id=downstream1.2,chassis=3 \
-device xio3130-downstream,bus=upstream1.2,id=downstream1.3,chassis=4 \
-drive if=none,id=drive0,file=/home/pxb-ovmf.qcow2 \
-device virtio-blk-pci,drive=drive0,scsi=off,bus=downstream1.2,disable-legacy=on,disable-modern=off  \
-device ioh3420,bus=pcie.0,id=root1.2,slot=3 \
-device vfio-pci,host=03:00.0,id=vf-00.0,bus=root1.2 \


2. guest can boot up successfully

3. Check the NIC passed through from the host in the guest
[root@dhcp-66-145-44 ~]# lspci 
00:00.0 Host bridge: Intel Corporation 82G33/G31/P35/P31 Express DRAM Controller
00:01.0 VGA compatible controller: Red Hat, Inc. QXL paravirtual graphic card (rev 04)
00:02.0 PCI bridge: Intel Corporation 7500/5520/5500/X58 I/O Hub PCI Express Root Port 0 (rev 02)
00:03.0 PCI bridge: Intel Corporation 7500/5520/5500/X58 I/O Hub PCI Express Root Port 0 (rev 02)
00:04.0 PCI bridge: Intel Corporation 7500/5520/5500/X58 I/O Hub PCI Express Root Port 0 (rev 02)
00:1f.0 ISA bridge: Intel Corporation 82801IB (ICH9) LPC Interface Controller (rev 02)
00:1f.2 SATA controller: Intel Corporation 82801IR/IO/IH (ICH9R/DO/DH) 6 port SATA Controller [AHCI mode] (rev 02)
00:1f.3 SMBus: Intel Corporation 82801I (ICH9 Family) SMBus Controller (rev 02)
01:00.0 PCI bridge: Texas Instruments XIO3130 PCI Express Switch (Upstream) (rev 02)
02:00.0 PCI bridge: Texas Instruments XIO3130 PCI Express Switch (Downstream) (rev 01)
03:00.0 Ethernet controller: Red Hat, Inc Virtio network device (rev 01)
04:00.0 PCI bridge: Texas Instruments XIO3130 PCI Express Switch (Upstream) (rev 02)
05:00.0 PCI bridge: Texas Instruments XIO3130 PCI Express Switch (Downstream) (rev 01)
05:01.0 PCI bridge: Texas Instruments XIO3130 PCI Express Switch (Downstream) (rev 01)
06:00.0 SCSI storage controller: Red Hat, Inc Virtio block device (rev 01)
08:00.0 Ethernet controller: Intel Corporation 82576 Gigabit Network Connection (rev 01)
[root@dhcp-66-145-44 ~]# lspci -vvv -t
-[0000:00]-+-00.0  Intel Corporation 82G33/G31/P35/P31 Express DRAM Controller
           +-01.0  Red Hat, Inc. QXL paravirtual graphic card
           +-02.0-[01-03]----00.0-[02-03]----00.0-[03]----00.0  Red Hat, Inc Virtio network device
           +-03.0-[04-07]----00.0-[05-07]--+-00.0-[06]----00.0  Red Hat, Inc Virtio block device
           |                               \-01.0-[07]--
           +-04.0-[08]----00.0  Intel Corporation 82576 Gigabit Network Connection
           +-1f.0  Intel Corporation 82801IB (ICH9) LPC Interface Controller
           +-1f.2  Intel Corporation 82801IR/IO/IH (ICH9R/DO/DH) 6 port SATA Controller [AHCI mode]
           \-1f.3  Intel Corporation 82801I (ICH9 Family) SMBus Controller

[root@dhcp-66-145-44 ~]# ifconfig
ens3: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1500
        ether 00:1b:21:42:33:84  txqueuelen 1000  (Ethernet)
        RX packets 0  bytes 0 (0.0 B)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 0  bytes 0 (0.0 B)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0
        device memory 0x98400000-9841ffff  

eth0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1500
        inet 10.66.145.44  netmask 255.255.252.0  broadcast 10.66.147.255
        inet6 2620:52:0:4292:986a:6bff:fe6c:6d6e  prefixlen 64  scopeid 0x0<global>
        inet6 fe80::986a:6bff:fe6c:6d6e  prefixlen 64  scopeid 0x20<link>
        ether 9a:6a:6b:6c:6d:6e  txqueuelen 1000  (Ethernet)
        RX packets 788  bytes 57561 (56.2 KiB)
        RX errors 0  dropped 6  overruns 0  frame 0
        TX packets 150  bytes 20636 (20.1 KiB)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0


Thanks
Jing Zhao

Comment 36 errata-xmlrpc 2016-11-07 21:17:36 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHBA-2016-2673.html

