Bug 1541839 - Guest cannot be rebooted after binding pcie-expander-bus controller to a specified NUMA node
Summary: Guest cannot be rebooted after binding pcie-expander-bus controller to a specified NUMA node
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Linux 7
Classification: Red Hat
Component: qemu-kvm-rhev
Version: 7.6
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: unspecified
Target Milestone: rc
Target Release: ---
Assignee: Laszlo Ersek
QA Contact: jingzhao
URL:
Whiteboard:
Depends On: 1559542
Blocks:
 
Reported: 2018-02-05 03:32 UTC by Lili Zhu
Modified: 2018-11-01 11:06 UTC (History)
16 users

Fixed In Version: qemu-kvm-rhev-2.12.0-1.el7
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2018-11-01 11:04:15 UTC
Target Upstream Version:


Attachments (Terms of Use)


Links
System ID Priority Status Summary Last Updated
Red Hat Product Errata RHBA-2018:3443 None None None 2018-11-01 11:06:44 UTC

Description Lili Zhu 2018-02-05 03:32:46 UTC
Description of problem:
Guest cannot be rebooted after binding a pcie-expander-bus controller to a specified NUMA node

Version-Release number of selected component (if applicable):
kernel-3.10.0-768.el7.x86_64
qemu-kvm-rhev-2.10.0-19.el7.x86_64
libvirt-3.9.0-11.el7.x86_64

How reproducible:
100%

Steps to Reproduce:
1. Prepare a q35+UEFI guest with NUMA nodes
 # virsh list --all
 Id    Name                           State
----------------------------------------------------
 13    rhel7.5-q35                        running

# virsh dumpxml rhel7.5-q35
......
  <os>
    <type arch='x86_64' machine='pc-q35-rhel7.5.0'>hvm</type>
    <loader readonly='yes' secure='yes' type='pflash'>/usr/share/OVMF/OVMF_CODE.secboot.fd</loader>
    <nvram>/var/lib/libvirt/qemu/nvram/rhel7.5_VARS.fd</nvram>
    <boot dev='hd'/>
  </os>
......
  <cpu mode='custom' match='exact' check='full'>
    <model fallback='forbid'>Haswell-noTSX</model>
    <numa>
      <cell id='0' cpus='0' memory='512000' unit='KiB'/>
      <cell id='1' cpus='1' memory='512000' unit='KiB'/>
    </numa>
  </cpu>
......
  <devices>
    <controller type='pci' index='8' model='pcie-expander-bus'>
      <model name='pxb-pcie'/>
      <target busNr='100'>
        <node>1</node>
      </target>
      <alias name='pci.8'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x03' function='0x0'/>
    </controller>
......
  </devices>
......

2. reboot the guest 

Actual results:
Guest cannot be rebooted

Additional info:
I tried similar NUMA and PCI controller settings with a pc machine type
guest; that guest works well.

Comment 2 Laszlo Ersek 2018-02-05 12:41:36 UTC
Hi,

this is strange; I have tested OVMF with pxb-pcie several times, and it has always worked as expected. Yet, I can reproduce the symptom locally.

The QEMU command line option that libvirt generates looks correct:

  -device pxb-pcie,bus_nr=100,id=pci.4,numa_node=1,bus=pcie.0,addr=0x3 \

From the OVMF debug log, from the first (cold) boot, it looks like the device is not behaving correctly.


(1) In order to find the extra root buses, OVMF uses two pieces of information:

(a) it queries the fw_cfg file "etc/extra-pci-roots", which exposes the number of extra root bridges, and

(b) it scans function 0 of every device (0..31) of every bus (1..255). If the VendorId register of said function reads different from all-bits-one (0xFFFF), then we know that we found an extra root bus (that the device is on). Note that this scanning happens before any other PCI hierarchy programming, so only extra root buses are expected to respond.

Hint (a) is used as an upper limit, to terminate search (b) as soon as all extra root buses have been found (so that we don't scan up to & including bus number 255).


(2) From the first boot log, I can tell that the extra root bus (the only one we have) is not found:

> PciHostBridgeGetRootBridges: 1 extra root buses reported by QEMU
> InitRootBridge: populated root bus 0, with room for 255 subordinate bus(es)

The first line quoted is correct (that comes simply from hint (a)). However, the second line is wrong: the bus number range for the default root bus should end at 99. And then, a third message is missing, about the first extra root bus.

In other words, even at the first boot, the extra root bridge does not respond to OVMF's scanning like it should.


(3) At the second boot (= the reboot), things fail even worse:

> PciHostBridgeGetRootBridges: 1 extra root buses reported by QEMU
> InitRootBridge: populated root bus 0, with room for 1 subordinate bus(es)
> InitRootBridge: populated root bus 2, with room for 253 subordinate bus(es)

The first baffling question is: why does QEMU's PCI hierarchy respond differently from the first boot? We now "find" the extra root bus, however the bus number is entirely wrong. Does a different PCI device (on a different bridge) survive the platform reset, remain in configured state, and  mislead the scanning by responding?

(

For example, I see that the pcie-root-port device "disables forwarding", by calling pci_bridge_disable_base_limit():

rp_reset()
  pci_bridge_reset()
  pci_bridge_disable_base_limit()

Shouldn't we do the same in a reset handler in "hw/pci-bridge/pci_expander_bridge.c"?

)

Later, the PCI bus driver's enumeration process just falls off a cliff; most of the devices are not found, including the display device (hence no splash screen, and the boot failure too -- no virtio disks or SCSI controllers are found, etc.). This most likely follows from the earlier, incorrectly determined, subordinate bus number 1 for root bus 0.


(4) If I add a PCI Express root port to the extra root bus, then pxb-pcie works as expected -- but at the first boot only:

> PciHostBridgeGetRootBridges: 1 extra root buses reported by QEMU
> InitRootBridge: populated root bus 0, with room for 99 subordinate bus(es)
> InitRootBridge: populated root bus 100, with room for 155 subordinate bus(es)

After reboot though, the same issue occurs:

> PciHostBridgeGetRootBridges: 1 extra root buses reported by QEMU
> InitRootBridge: populated root bus 0, with room for 1 subordinate bus(es)
> InitRootBridge: populated root bus 2, with room for 253 subordinate bus(es)

Leading to the same boot failure.

I don't understand this; on a platform reset, QEMU should return the PCI hierarchy to its default un-configured state.


(5) OK, summary of questions:

(Q1) Why does the pxb-pcie device's secondary bus (i.e., the extra root bus) not respond to the scanning, without another device sitting on it? I seem to remember that device 0 function 0 used to respond.

(Q2) Why is the PCI hierarchy not de-configured upon platform reset? Are we missing a call to pci_bridge_disable_base_limit()?

Marcel, can you help with this please? Thank you.

Comment 3 Marcel Apfelbaum 2018-02-06 13:57:18 UTC
(In reply to Laszlo Ersek from comment #2)
> Hi,
> 
> this is strange; I have tested OVMF with pxb-pcie several times, and it has
> always worked as expected. Yet, I can reproduce the symptom locally.
> 
> The QEMU command line option that libvirt generates looks correct:
> 
>   -device pxb-pcie,bus_nr=100,id=pci.4,numa_node=1,bus=pcie.0,addr=0x3 \
> 
> From the OVMF debug log, from the first (cold) boot, it looks like the
> device is not behaving correctly.
> 
> 
> (1) In order to find the extra root buses, OVMF uses two pieces of
> information:
> 
> (a) it queries the fw_cfg file "etc/extra-pci-roots", which exposes the
> number of extra root bridges, and
> 

Right

> (b) it scans function 0 of every device (0..31) of every bus (1..255). If
> the VendorId register of said function reads different from all-bits-one
> (0xFFFF), then we know that we found an extra root bus (that the device is
> on).

The scan you describe is not for host bridges only, but for all devices.
The pxb host-bridges do not respond on dev=0,func=0 on their respective bus at all;
you can actually have a PCIe Root Port plugged in at function 0.

> Note that this scanning happens before any other PCI hierarchy
> programming, so only extra root buses are expected to respond.
>

Since we are doing a wide scan, all devices should respond, not only
the host bridges.
 
> Hint (a) is used as an upper limit, to terminate search (b) as soon as all
> extra root buses have been found (so that we don't scan up to & including
> bus number 255).
> 

Right, it just looks like we discovered all devices; the hint is used to know
whether there is any reason to continue scanning. We can actually have a pxb
with no devices attached to it, but in this case it will not be usable
at all (!!) since no MEM/IO range will be assigned to it.

An empty pxb-pcie is useless. We can improve the resource allocation
mechanism in the future if we need the feature.

> 
> (2) From the first boot log, I can tell that the extra root bus (the only
> one we have) is not found:
> 
> > PciHostBridgeGetRootBridges: 1 extra root buses reported by QEMU
> > InitRootBridge: populated root bus 0, with room for 255 subordinate bus(es)
> 
> The first line quoted is correct (that comes simply from hint (a)). However,
> the second line is wrong: the bus number range for the default root bus
> should end at 99. And then, a third message is missing, about the first
> extra root bus.
> 
> In other words, even at the first boot, the extra root bridge does not
> respond to OVMF's scanning like it should.
> 

The extra root bridge is transparent; the host bridge is actually a PCI
bridge on bus 0.
The information about the last possible bus number for a host bridge
is in the ACPI code only.

> 
> (3) At the second boot (= the reboot), things fail even worse:
> 
> > PciHostBridgeGetRootBridges: 1 extra root buses reported by QEMU
> > InitRootBridge: populated root bus 0, with room for 1 subordinate bus(es)
> > InitRootBridge: populated root bus 2, with room for 253 subordinate bus(es)
> 

Did we have a hot-plug operation between the first and second boot?
How did we find bus 2 if no device is behind it? Really strange indeed.

> The first baffling question is: why does QEMU's PCI hierarchy respond
> differently from the first boot? We now "find" the extra root bus, however
> the bus number is entirely wrong. Does a different PCI device (on a
> different bridge) survive the platform reset, remain in configured state,
> and  mislead the scanning by responding?
> 

The really interesting thing is that if no device is attached to the pxb-pcie,
the firmware has nothing to configure.

> (
> 
> For example, I see that the pcie-root-port device "disables forwarding", by
> calling pci_bridge_disable_base_limit():
> 
> rp_reset()
>   pci_bridge_reset()
>   pci_bridge_disable_base_limit()
> 
> Shouldn't we do the same in a reset handler in
> "hw/pci-bridge/pci_expander_bridge.c"?
> 
> )
> 

I'll double-check it; however, a pxb is nothing but an opaque
PCI device on bus 0 plus an ACPI description of the bus range and resources
we need.

> Later, the PCI bus driver's enumeration process just fall off a cliff; most
> of the devices are not found, including the display device (hence no splash
> screen, and the boot failure too -- no virtio disks or scsi controllers are
> found etc). This follows most likely from the earlier, incorrectly
> determined, subordinate bus number 1 for root bus 0.
> 
> 
> (4) If I add a PCI Express root port to the extra root bus, then pxb-pcie
> works as expected -- but at the first boot only:
>

This is a "good" configuration (at least one device behind the pxb)
 
> > PciHostBridgeGetRootBridges: 1 extra root buses reported by QEMU
> > InitRootBridge: populated root bus 0, with room for 99 subordinate bus(es)
> > InitRootBridge: populated root bus 100, with room for 155 subordinate bus(es)
>

How does OVMF get this info?
The way SeaBIOS works, it blindly checks for devices as you describe above
on each bus; once it finds one, if it is a pci-bridge it looks for subordinate
buses, and so on. But the firmware does not have a clear picture
of the boundaries of a pxb.

 
> After reboot though, the same issue occurs:
> 
> > PciHostBridgeGetRootBridges: 1 extra root buses reported by QEMU
> > InitRootBridge: populated root bus 0, with room for 1 subordinate bus(es)
> > InitRootBridge: populated root bus 2, with room for 253 subordinate bus(es)
> 
> Leading to the same boot failure.
> 
> I don't understand this; on a platform reset, QEMU should return the PCI
> hierarchy to its default un-configured state.
> 

Agreed, but which state?

> 
> (5) OK, summary of questions:
> 
> (Q1) Why does the pxb-pcie device's secondary bus (i.e., the extra root bus)
> not respond to the scanning, without another device sitting on it? I seem to
> remember that device 0 function 0 used to respond.
> 

No, it shouldn't respond. As mentioned above, the pxb host-bridge is a PCI
device attached to bus 0, and has no appearance on its own bus.

> (Q2) Why is the PCI hierarchy not de-configured upon platform reset? Are we
> missing a call to pci_bridge_disable_base_limit()?
> 

I need to understand what exactly remains configured.
The pxb itself has no configuration bits; the bus numbers are
re-negotiated from scratch at reboot, and the PCIe Root Ports
should indeed reset during reboot.

> Marcel, can you help with this please? Thank you.

Sure; first of all, a pxb without devices behind it is not
a good configuration.

Then I'll try to reproduce (already trying) with SeaBIOS, by rebooting a few times.
If I cannot reproduce it, we will need to discuss again the way OVMF
handles the pxb devices.

The most important issue is how (and why) OVMF knows the bus ranges
of each host bridge (I think we talked about it but I forgot).

The way it should work:
The firmware scans the whole 0-255 range until it finishes finding all the
*devices* behind all the host-bridges; the "etc/extra-pci-roots" fw_cfg file
gives only a hint on when to stop and nothing more.
The QEMU PCI buses of the pxbs will respond on the bus numbers configured
on the command line (as you can see, no configuration here).
The firmware will find, for each root bus, the whole hierarchy of PCI bridges.

Thanks,
Marcel

Comment 4 Laszlo Ersek 2018-02-06 15:33:11 UTC
(In reply to Marcel Apfelbaum from comment #3)
> (In reply to Laszlo Ersek from comment #2)

>> Hint (a) is used as an upper limit, to terminate search (b) as soon as
>> all extra root buses have been found (so that we don't scan up to &
>> including bus number 255).
>>
>
> Right, just we look we discovered all devices, the hint is used to know is
> there any reason to continue scanning. We can actually have a pxb with no
> devices attached to it, but in this case it will not be usable at all (!!)
> since no MEM/IO range will be assigned to it.

This sounds reasonable. While it will prevent OVMF from noticing the extra
root bus at all, that's likely not a problem per se -- there are no devices
behind the extra root bridge after all that could become unreachable.

However, the problem occurs at reboot, when the scanning loop is
actively misled by some device that should not respond (because it doesn't
directly sit on a root bridge).


> An empty pxb-pcie is useless.

Thanks, this sounds good! It seems to support my idea that empty pxb-pcie
devices should not be added to the domain config. Which in turn invalidates
the original report here (in part).

... Only in part, because the reboot behavior is broken with a nonempty
pxb-pcie too.


>> (2) From the first boot log, I can tell that the extra root bus (the only
>> one we have) is not found:
>>
>>> PciHostBridgeGetRootBridges: 1 extra root buses reported by QEMU
>>> InitRootBridge: populated root bus 0, with room for 255 subordinate
>>> bus(es)
>>
>> The first line quoted is correct (that comes simply from hint (a)).
>> However, the second line is wrong: the bus number range for the default
>> root bus should end at 99. And then, a third message is missing, about
>> the first extra root bus.
>>
>> In other words, even at the first boot, the extra root bridge does not
>> respond to OVMF's scanning like it should.
>>
>
> The extra root bridge is transparent, the host bridge is actually a PCI
> bridge on bus 0.
> The information on what is the latest possible bus number for a host
> bridge is in the ACPI code only.

Yes, based on your explanation higher up as well, if OVMF misses an empty
pxb-pcie, that should be fine. (But then it should continue looking empty at
reboot too, which is not what happens right now.)


>> (3) At the second boot (= the reboot), things fail even worse:
>>
>>> PciHostBridgeGetRootBridges: 1 extra root buses reported by QEMU
>>> InitRootBridge: populated root bus 0, with room for 1 subordinate
>>> bus(es)
>>> InitRootBridge: populated root bus 2, with room for 253 subordinate
>>> bus(es)
>>
>
> Did we have a hot-plug operation between the first and second boot?

No, we didn't.


> How did we find bus 2 if no device is behind it? Really strange indeed.

After the cold boot, this is the hierarchy:

  -[0000:00]-+-00.0  Intel Corporation 82G33/G31/P35/P31 Express DRAM
                     Controller
             +-01.0  Device 1234:1111
             +-02.0-[01]----00.0  NEC Corporation uPD720200 USB 3.0 Host
                                  Controller
             +-03.0  Red Hat, Inc. QEMU PCIe Expander bridge
             +-1e.0-[02-03]----01.0-[03]--+-01.0  Red Hat, Inc. Virtio
             |                            |       network device
             |                            +-03.0  Red Hat, Inc. Virtio
             |                            |       network device
             |                            +-04.0  Red Hat, Inc. Virtio
             |                            |       memory balloon
             |                            +-05.0  Red Hat, Inc. Virtio SCSI
             |                            \-06.0  Red Hat, Inc. Virtio
             |                                    console
             +-1f.0  Intel Corporation 82801IB (ICH9) LPC Interface
                     Controller
             +-1f.2  Intel Corporation 82801IR/IO/IH (ICH9R/DO/DH) 6 port
                     SATA Controller [AHCI mode]
             \-1f.3  Intel Corporation 82801I (ICH9 Family) SMBus Controller

  00:00.0 Host bridge: Intel Corporation 82G33/G31/P35/P31 Express DRAM
          Controller
  00:01.0 VGA compatible controller: Device 1234:1111 (rev 02)
  00:02.0 PCI bridge: Red Hat, Inc. QEMU PCIe Root port
  00:03.0 Host bridge: Red Hat, Inc. QEMU PCIe Expander bridge
  00:1e.0 PCI bridge: Intel Corporation 82801 PCI Bridge (rev 92)
  00:1f.0 ISA bridge: Intel Corporation 82801IB (ICH9) LPC Interface
          Controller (rev 02)
  00:1f.2 SATA controller: Intel Corporation 82801IR/IO/IH (ICH9R/DO/DH) 6
          port SATA Controller [AHCI mode] (rev 02)
  00:1f.3 SMBus: Intel Corporation 82801I (ICH9 Family) SMBus Controller
          (rev 02)
  01:00.0 USB controller: NEC Corporation uPD720200 USB 3.0 Host Controller
          (rev 03)
  02:01.0 PCI bridge: Red Hat, Inc. QEMU PCI-PCI bridge
  03:01.0 Ethernet controller: Red Hat, Inc. Virtio network device
  03:03.0 Ethernet controller: Red Hat, Inc. Virtio network device
  03:04.0 Unclassified device [00ff]: Red Hat, Inc. Virtio memory balloon
  03:05.0 SCSI storage controller: Red Hat, Inc. Virtio SCSI
  03:06.0 Communication controller: Red Hat, Inc. Virtio console

Bus number "2" is the secondary bus of the "dmi-to-pci-bridge" in 00:1e.0.
On that bus, there is only one device, a "pci-bridge", in 02:01.0. I assume
that the SHPC of this "pci-bridge" device responds when the VendorId
register is read by the scanning loop.

I don't understand why the "pci-bridge" responds, given that between it and
the pcie.0 root complex, there is a "dmi-to-pci-bridge" that should have
been de-configured by the platform reset.


>> The first baffling question is: why does QEMU's PCI hierarchy respond
>> differently from the first boot? We now "find" the extra root bus,
>> however the bus number is entirely wrong. Does a different PCI device (on
>> a different bridge) survive the platform reset, remain in configured
>> state, and  mislead the scanning by responding?
>>
>
> The really interesting thing is if no device is attached to the pxb-pcie
> the firmware has nothing to configure.

I'm glad we agree on that!


>> (
>>
>> For example, I see that the pcie-root-port device "disables forwarding",
>> by calling pci_bridge_disable_base_limit():
>>
>> rp_reset()
>>   pci_bridge_reset()
>>   pci_bridge_disable_base_limit()
>>
>> Shouldn't we do the same in a reset handler in
>> "hw/pci-bridge/pci_expander_bridge.c"?
>>
>> )
>>
>
> I'll double check it, however a pxb is nothing but an opaque
> pci device on bus 0 and an ACPI description of the bus range and resources
> we need.

Yes. I think if that "other" device stops responding at times when it's not
supposed to respond, we'll be fine.


>> (4) If I add a PCI Express root port to the extra root bus, then pxb-pcie
>> works as expected -- but at the first boot only:
>>
>
> This is a "good" configuration (at least one device behind the pxb)
>
>>> PciHostBridgeGetRootBridges: 1 extra root buses reported by QEMU
>>> InitRootBridge: populated root bus 0, with room for 99 subordinate
>>> bus(es)
>>> InitRootBridge: populated root bus 100, with room for 155 subordinate
>>> bus(es)
>>
>
> How does the OVMF gets this info?

OVMF's PciHostBridgeLib implementation has to provide the edk2
PciHostBridgeDxe driver with the list of root bridges first thing. The edk2
PciBusDxe driver starts only later (it has a dependency on
PciHostBridgeDxe), and the recursive traversal and resource assignments
(such as bus number assignments) occur only in PciBusDxe.

Thus, OVMF's PciHostBridgeLib needs a way to identify the extra root
bridges. The method we settled on in 2015 (edk2 commit bdb1c98f0361,
"OvmfPkg: PciHostBridgeDxe: look for all root buses", 2015-07-14) was the
scanning loop that I described earlier. It is invoked ahead of any other
runtime configuration that occurs for the PCI hierarchy. Thus, it was
reasonable to expect that only the extra root buses should have working bus
number assignments at that point (from QEMU itself, not from firmware
actions).

Basically OVMF starts iterating upward from bus number 1, to 255 (or more
precisely, up to the fw-cfg hint). For each bus, it tries each device. For
each device, it tries to read the VendorId register of function 0. If the
value read is all-bits-one, the iteration simply continues. If the value
read is *not* all-bits-one, then we have found a *reachable* device on the
current bus number. This means that the bus is "live" without any preceding
PCI configuration. In other words, the bus is a root bus.

Thus, we make a note that "bus XYZ is another root bus". We also remember
the last root bus -- the one we found just before the current one (when we
start the loop at bus nr 1, that "last root bus" is 0, because bus #0 is
always there). The difference between the currently found bus number and the
last found bus number is the range that can be assigned to the *previous*
root bus, as bus number aperture.


> The way SeaBIOS works it blindly checks for devices as you describe above
> on each bus, once it finds it, if is a pci-bridge it looks for subordinate
> buses and so on. But The firmware does not have a clear picture
> of the boundaries of a pxb.

OVMF works similarly. It also blindly scans the devices first, but at this
stage, it performs no recursion at all; it just covers the {1..255} x
{0..31} busnr x device space, exhaustively (well, up to the fw-cfg hint).

Once all the root bus numbers are found, the bus number space is partitioned
between them (as busnr apertures for the root buses), and after this is when
PciBusDxe can perform the recursive traversal (assigning secondary bus
numbers, and calculating the subordinate bus numbers, from within the busnr
apertures that were assigned at the root bus level).


>> After reboot though, the same issue occurs:
>>
>>> PciHostBridgeGetRootBridges: 1 extra root buses reported by QEMU
>>> InitRootBridge: populated root bus 0, with room for 1 subordinate
>>> bus(es)
>>> InitRootBridge: populated root bus 2, with room for 253 subordinate
>>> bus(es)
>>
>> Leading to the same boot failure.
>>
>> I don't understand this; on a platform reset, QEMU should return the PCI
>> hierarchy to its default un-configured state.
>>
>
> Agreed, but which state?

To the state where reading the VendorId register of the "pci-bridge" device
behind the "dmi-to-pci-bridge" device returns all-bits-one :)

At this point, bus number 2 is undefined (because it neither belongs to a
root bus, nor has it been programmatically assigned to any bridge as
secondary bus number), so *any* 02:dd.f config space read should fail.


>> (5) OK, summary of questions:
>>
>> (Q1) Why does the pxb-pcie device's secondary bus (i.e., the extra root
>> bus) not respond to the scanning, without another device sitting on it? I
>> seem to remember that device 0 function 0 used to respond.
>>
>
> No it shouldn't respond. As mentioned above the pxb host-bridge is a PCI
> device attached to bus 0, and has no appearance on its own bus.

OK. This should work fine as discussed above.


>> (Q2) Why is the PCI hierarchy not de-configured upon platform reset? Are
>> we missing a call to pci_bridge_disable_base_limit()?
>>
>
> I need to understand what exactly remains configured. The pxb itself has
> no configuration bits, the bus numbers are re-negociated from start at
> reboot and the PCIe Root Ports should indeed reset during reboot.

Thank you, this makes a lot of sense.

I think the issue is that the "dmi-to-pci-bridge" device model does not
clear its secondary / subordinate bus numbers at reset. Thus the SHPC of the
"pci-bridge" device behind "dmi-to-pci-bridge" remains addressable in PCI
config space.


>> Marcel, can you help with this please? Thank you.
>
> Sure, first of all a pxb without devices behind it is not a good
> configuration.

OK, great, thanks!


> Then I'll try to reproduce (already trying) with SeaBIOS by rebooting a
> few times. If I cannot reproduce we will need to discuss again the way
> OVMF handles the pxb devices.

Please see the loop construct above.


> The most important issue is how (and why) the OVMF knows the bus ranges of
> each host bridge (I think we talked about it but I forgot).

Again, it basically checks for each bus number in the (1..255) range whether
there is any "live" device on that bus. Such buses are then considered root
buses (because, if they weren't, no device behind them could be responding
without prior configuration, which is *not* done until this point, after
boot). Given this set of bus numbers, each "gap" between two adjacent ones
is assigned as bus number aperture to the lower member of the pair.


> The way it should work:
> The firmware scans all the range 0-255 until finishes to find all the
> *devices* behind all the host-bridges, the extra-host-bridges fw_cfg file
> gives only a hint on when to stop and nothing more.

Right.

> QEMU pci buses of the pxbs will respond to bus numbers configured
> at command line. (as you can see, no configuration here).

Correct.

> The firmware will find for each root bus the whole hierarchy of pci
> bridges.

Also correct.

I think we have an understanding here. The issue is that post-reset, some
*non-root* buses *also* respond, and OVMF mistakes them for PXB-provided
buses.

Perhaps this goes away immediately once we stop using "dmi-to-pci-bridge".
The original report does not contain a complete domain XML, so I couldn't
tell if such a device was in use. The domain I personally used for testing
did contain a "dmi-to-pci-bridge".

Thank you, Marcel!

Comment 5 Laszlo Ersek 2018-02-07 11:01:00 UTC
I believe I've found the error. Consider the following code in QEMU:

[hw/pci/pcie_host.c]

> static uint64_t pcie_mmcfg_data_read(void *opaque,
>                                      hwaddr mmcfg_addr,
>                                      unsigned len)
> {
>     PCIExpressHost *e = opaque;
>     PCIBus *s = e->pci.bus;
>     PCIDevice *pci_dev = pcie_dev_find_by_mmcfg_addr(s, mmcfg_addr);

This code is exercised when OVMF scans the {1..255}x{0..31}x{0} config space
in PciHostBridgeLib, reading the VendorId register for each device
interrogated, in order to find extra root buses. On Q35 the config space
access goes through the MMCONFIG area, so this QEMU function is relevant.

I booted up the guest first, then set a breakpoint on this function with:

(gdb) break pcie_mmcfg_data_read if (((mmcfg_addr >> 20) & 0x1ff) == 3)

In other words, break if a config space read targets any device/function on
bus 3. (Because, in my testing, the "pci-bridge" device with address 03:01.0
was what mislead OVMF's scanning, and this device sits on the secondary bus
of the dmi-to-pci (i82801b11) bridge.)

I let the guest continue execution, and then I gracefully rebooted the guest
from within. Sure enough, the breakpoint was hit while OVMF was rebooting
and scanning for extra root buses in PciHostBridgeLib.

Then, in QEMU we have:

  pcie_dev_find_by_mmcfg_addr() [hw/pci/pcie_host.c]
    pci_find_device()           [hw/pci/pci.c]
      pci_find_bus_nr()         [hw/pci/pci.c]

Here pci_find_bus_nr() should return NULL, because "bus_num" is 3, and we
are just after reboot, i.e. no nonzero bus numbers (except the pxbs' extra
root buses) should be live. Unfortunately, the bus *is* found, because here:

>     /* try child bus */
>     for (; bus; bus = sec) {
>         QLIST_FOREACH(sec, &bus->child, sibling) {
>             if (pci_bus_num(sec) == bus_num) {
>                 return sec;
>             }

we end up calling:

[hw/pci/pci.c]

> static int pcibus_num(PCIBus *bus)
> {
>     if (pcibus_is_root(bus)) {
>         return 0; /* pci host bridge */
>     }
>     return bus->parent_dev->config[PCI_SECONDARY_BUS];
> }

and the PCI_SECONDARY_BUS register of the "i82801b11-bridge" device (the
dmi-to-pci bridge) is *not* zero after reset; it has retained its previously
configured value 3.

And that happens because the i82801b11_bridge_class_init() function
[hw/pci-bridge/i82801b11.c] does not set the reset callback to
pci_bridge_reset().

The pci_bridge_reset() function clears the PCI_SECONDARY_BUS register (and
other bridge specific registers), and it is widely used by other bridge-like
device models that are derived from TYPE_PCI_BRIDGE:

- the "pci-bridge" device model (TYPE_PCI_BRIDGE_DEV) has
  qdev_pci_bridge_dev_reset() as reset callback, which does

>     pci_bridge_reset(qdev);
>     if (shpc_present(dev)) {
>         shpc_reset(dev);
>     }

- the "pcie-pci-bridge" device model (TYPE_PCIE_PCI_BRIDGE_DEV) has
  pcie_pci_bridge_reset(), which does:

>     pci_bridge_reset(qdev);
>     if (msi_present(d)) {
>         msi_reset(d);
>     }
>     shpc_reset(d);

- the PCIE root port device has rp_reset():

>     rp_aer_vector_update(d);
>     pcie_cap_root_reset(d);
>     pcie_cap_deverr_reset(d);
>     pcie_cap_slot_reset(d);
>     pcie_cap_arifwd_reset(d);
>     pcie_aer_root_reset(d);
>     pci_bridge_reset(qdev);        <--------- here
>     pci_bridge_disable_base_limit(d);

- the xio3130 switch's upstream and downstream ports both call
  pci_bridge_reset() too, in xio3130_upstream_reset() and
  xio3130_downstream_reset() respectively.

- The following three device models:

  - "dec-21154-p2p-bridge",
  - "pbm-bridge" (TYPE_PBM_PCI_BRIDGE),
  - "xilinx-pcie-root" (TYPE_XILINX_PCIE_ROOT),

  simply *set* pci_bridge_reset() as their own reset callbacks.

  This is what "i82801b11-bridge" should do as well, in
  i82801b11_bridge_class_init(). "dc->reset" is currently left NULL there (I
  checked that in gdb too).


I have a one-liner QEMU patch for this, I'll submit it soon.

SeaBIOS does not trigger this problem in QEMU because the
pci_bios_init_bus_rec() function [src/fw/pciinit.c] preventively
overwrites, as a first step, the secondary and subordinate busnr registers of
all bridge devices that are on the currently processed bus:

>     /* prevent accidental access to unintended devices */
>     foreachbdf(bdf, bus) {
>         class = pci_config_readw(bdf, PCI_CLASS_DEVICE);
>         if (class == PCI_CLASS_BRIDGE_PCI) {
>             pci_config_writeb(bdf, PCI_SECONDARY_BUS, 255);
>             pci_config_writeb(bdf, PCI_SUBORDINATE_BUS, 0);
>         }
>     }

While that's not a bad idea, I do think this should be fixed in QEMU.

Comment 6 Laszlo Ersek 2018-02-07 12:12:43 UTC
Posted upstream patch:
[qemu-devel] [PATCH] pci-bridge/i82801b11: clear bridge registers on platform
                     reset
http://mid.mail-archive.com/20180207121027.32466-1-lersek@redhat.com

(URL with message-id might only go live in a few minutes)

Comment 7 Laszlo Ersek 2018-02-07 12:21:22 UTC
(In reply to Laszlo Ersek from comment #6)
> Posted upstream patch:
> [qemu-devel] [PATCH] pci-bridge/i82801b11: clear bridge registers on platform
>                      reset
> http://mid.mail-archive.com/20180207121027.32466-1-lersek@redhat.com

Alternative URL:
https://lists.gnu.org/archive/html/qemu-devel/2018-02/msg01679.html

Comment 8 Laszlo Ersek 2018-02-13 19:03:01 UTC
Upstream commit ed247f40db84 ("pci-bridge/i82801b11: clear bridge registers on platform reset", 2018-02-08).

Comment 14 errata-xmlrpc 2018-11-01 11:04:15 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2018:3443

