Bug 1536147

Summary: [Q35] SeaBIOS computes wrongly the IO hints vendor capability for bus numbers reservation
Product: Red Hat Enterprise Linux 7 Reporter: Marcel Apfelbaum <marcel>
Component: seabiosAssignee: Michael S. Tsirkin <mst>
Status: CLOSED WONTFIX QA Contact: Michael <choma>
Severity: medium Docs Contact:
Priority: medium    
Version: 7.5CC: ailan, chayang, jinzhao, jusual, juzhang, lersek, mst, rkhan, virt-maint
Target Milestone: rc   
Target Release: ---   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2019-09-09 08:09:44 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:

Description Marcel Apfelbaum 2018-01-18 17:17:11 UTC
From Laszlo on virtual-devel list:

The debug messages in the pci_find_resource_reserve_capability()
function would benefit from a clean-up. Post-patch (i.e., matching
current upstream), the message

  PCI: QEMU resource reserve cap not found

is printed if the device under investigation does not have the VID/DID
pair that we're interested in.

(a) So that's not precise, because for those devices we don't even care
for, nor look for, the capability.

Second, when the VID/DID pair *is* right, but the capability is not
there (we hit the terminator element in the capabilities list), we don't
print anything.

(b) That's when we *should* print the above message.

Third -- and this is a logic bug --, when we find the capability, but
its length is too small, we print an appropriate debug message, but
don't re-set the capability offset to zero. We fall through to "return
cap", and return the nonzero offset of the mal-sized (truncated)
capability to the caller.

I mean, it's fine if we trust QEMU to provide a well-formed capability
(whenever the capability exists), but then we should remove the sanity
check. If we keep the sanity check, then it should do the right thing.

(c) So I think we should either remove the sanity check, or keep it, but
then, if it fails, we should zero out "cap" (and return zero).

Side comment ends; on to my main point:

> @@ -619,13 +617,11 @@ pci_bios_init_bus_rec(int bus, u8 *pci_bus)
>                                  res_bus);
>                          res_bus = 0;
>                      }
> -                }
> -                if (secbus + res_bus > *pci_bus) {
> -                    dprintf(1, "PCI: QEMU resource reserve cap: bus = %u\n",
> -                            res_bus);
> -                    res_bus = secbus + res_bus;
> -                } else {
> -                    res_bus = *pci_bus;
> +                    if (secbus + res_bus > *pci_bus) {
> +                        dprintf(1, "PCI: QEMU resource reserve cap: bus = %u\n",
> +                                res_bus);
> +                        res_bus = secbus + res_bus;
> +                    }
>                  }
>              }
>              dprintf(1, "PCI: subordinate bus = 0x%x -> 0x%x\n",
>

In order to make heads or tails of this hunk of the patch, we have to
review the original (pre-ec6cb17f) version of this function:

$ git show ec6cb17f^:src/fw/pciinit.c | cat -n

>    531  pci_bios_init_bus_rec(int bus, u8 *pci_bus)
>    532  {
>    533      int bdf;
>    534      u16 class;
>    535
>    536      dprintf(1, "PCI: %s bus = 0x%x\n", __func__, bus);
>    537
>    538      /* prevent accidental access to unintended devices */
>    539      foreachbdf(bdf, bus) {
>    540          class = pci_config_readw(bdf, PCI_CLASS_DEVICE);
>    541          if (class == PCI_CLASS_BRIDGE_PCI) {
>    542              pci_config_writeb(bdf, PCI_SECONDARY_BUS, 255);
>    543              pci_config_writeb(bdf, PCI_SUBORDINATE_BUS, 0);
>    544          }
>    545      }
>    546
>    547      foreachbdf(bdf, bus) {
>    548          class = pci_config_readw(bdf, PCI_CLASS_DEVICE);
>    549          if (class != PCI_CLASS_BRIDGE_PCI) {
>    550              continue;
>    551          }
>    552          dprintf(1, "PCI: %s bdf = 0x%x\n", __func__, bdf);
>    553
>    554          u8 pribus = pci_config_readb(bdf, PCI_PRIMARY_BUS);
>    555          if (pribus != bus) {
>    556              dprintf(1, "PCI: primary bus = 0x%x -> 0x%x\n", pribus, bus);
>    557              pci_config_writeb(bdf, PCI_PRIMARY_BUS, bus);
>    558          } else {
>    559              dprintf(1, "PCI: primary bus = 0x%x\n", pribus);
>    560          }
>    561
>    562          u8 secbus = pci_config_readb(bdf, PCI_SECONDARY_BUS);
>    563          (*pci_bus)++;
>    564          if (*pci_bus != secbus) {
>    565              dprintf(1, "PCI: secondary bus = 0x%x -> 0x%x\n",
>    566                      secbus, *pci_bus);
>    567              secbus = *pci_bus;
>    568              pci_config_writeb(bdf, PCI_SECONDARY_BUS, secbus);
>    569          } else {
>    570              dprintf(1, "PCI: secondary bus = 0x%x\n", secbus);
>    571          }
>    572
>    573          /* set to max for access to all subordinate buses.
>    574             later set it to accurate value */
>    575          u8 subbus = pci_config_readb(bdf, PCI_SUBORDINATE_BUS);
>    576          pci_config_writeb(bdf, PCI_SUBORDINATE_BUS, 255);
>    577
>    578          pci_bios_init_bus_rec(secbus, pci_bus);
>    579
>    580          if (subbus != *pci_bus) {
>    581              dprintf(1, "PCI: subordinate bus = 0x%x -> 0x%x\n",
>    582                      subbus, *pci_bus);
>    583              subbus = *pci_bus;
>    584          } else {
>    585              dprintf(1, "PCI: subordinate bus = 0x%x\n", subbus);
>    586          }
>    587          pci_config_writeb(bdf, PCI_SUBORDINATE_BUS, subbus);
>    588      }
>    589  }

This is a recursive function. The "bus" input parameter identifies the
PCI bus to scan. The "pci_bus" parameter is input/output. On input, it
tells the function the bus number up to which (inclusively) PCI bus
numbers have been assigned already. In other words, it is an *exclusive*
starting point for the function itself to hand out bus numbers from. On
output, the "pci_bus" parameter is updated so that the invariant is
upheld -- it identifies the last bus number that the function itself has
assigned.

(It wouldn't be bad to document this contract.)

For each bridge that we find on "bus", we first ensure that the primary
bus number of the bridge (i.e., the bus that the bridge thinks it is
*on*) actually matches "bus". Lines 554-560.

On lines 562-571, we set the secondary bus number of the bridge. This is
the bus number that devices *directly* behind the bridge consider as
their own bus number (i.e. the number of the bus that they are on). For
this, we increment "pci_bus" *before* we put it into the bridge's
appropriate register.

Here's where things get interesting. The third register in question is
the subordinate bus number of the bridge, which stands for the highest
(inclusive) bus number in the sub-hierarchy (recursively) behind the
bridge. We set it to 255 temporarily (lines 575-576), then recursively
traverse the sub-hierarchy behind the bridge (line 578). When the
recursive  pci_bios_init_bus_rec() call returns, "pci_bus" -- according
to its contract, see above -- is exactly what we need as subordinate bus
number. We store this value to the corresponding bridge register, on
lines 580-587.

As we continue the loop (or return from the function), "pci_bus"
conforms to its invariant -- namely, the bus number that can be
assigned, as secondary bus number, to the bridge that is going to be
found next, is (pci_bus+1).

Now let's look at the function at commit ec6cb17f:

$ git show ec6cb17f:src/fw/pciinit.c | cat -n

>    600          /* set to max for access to all subordinate buses.
>    601             later set it to accurate value */
>    602          u8 subbus = pci_config_readb(bdf, PCI_SUBORDINATE_BUS);
>    603          pci_config_writeb(bdf, PCI_SUBORDINATE_BUS, 255);
>    604
>    605          pci_bios_init_bus_rec(secbus, pci_bus);
>    606
>    607          if (subbus != *pci_bus) {
>    608              u8 res_bus = *pci_bus;
>    609              u8 cap = pci_find_resource_reserve_capability(bdf);
>    610
>    611              if (cap) {
>    612                  u32 tmp_res_bus = pci_config_readl(bdf,
>    613                          cap + RES_RESERVE_BUS_RES);
>    614                  if (tmp_res_bus != (u32)-1) {
>    615                      res_bus = tmp_res_bus & 0xFF;
>    616                      if ((u8)(res_bus + secbus) < secbus ||
>    617                              (u8)(res_bus + secbus) < res_bus) {
>    618                          dprintf(1, "PCI: bus_reserve value %d is invalid\n",
>    619                                  res_bus);
>    620                          res_bus = 0;
>    621                      }
>    622                  }
>    623                  if (secbus + res_bus > *pci_bus) {
>    624                      dprintf(1, "PCI: QEMU resource reserve cap: bus = %u\n",
>    625                              res_bus);
>    626                      res_bus = secbus + res_bus;
>    627                  } else {
>    628                      res_bus = *pci_bus;
>    629                  }
>    630              }
>    631              dprintf(1, "PCI: subordinate bus = 0x%x -> 0x%x\n",
>    632                      subbus, res_bus);
>    633              subbus = res_bus;
>    634              *pci_bus = res_bus;
>    635          } else {
>    636              dprintf(1, "PCI: subordinate bus = 0x%x\n", subbus);
>    637          }
>    638          pci_config_writeb(bdf, PCI_SUBORDINATE_BUS, subbus);

Here, after we determine the "candidate" subordinate bus number, from
the recursive traversal, we check the reservation hint.

That's where we have the first (small) bug: if the bridge's subordinate
bus nr register already matches the candidate (from the recursion), we
don't check the reservation capability at all.

(d) This is wrong, the reservation hint should always be checked, and
the original register value should be compared with the calculated
maximum.

Okay, we reach the definition of the new "res_bus" variable. Looking at
the initialization at line 608, and at the assignments at lines 633-634,
it is clear that "res_bus" stands for:

  subordinate bus number in the *absolute sense* that accommodates both
  the recursive traversal (the sub-hierarchy) and the bus number
  reservation

(The value is clearly supposed to be the maximum of the two individual
calculations.) Line 633 ensures that (unchanged) line 638 will store the
right register value, and line 634 restores the "pci_bus" invariant.
Fine.

Now, if the capability is not present (lines 609-611), then "res_bus"
simply preserves its original value (*pci_bus, from line 608), and the
logic correctly decays to the pre-ec6cb17f variant -- we store the
candidate from the recursive traversal to the register. Fine.

Let's see what happens if the capability *is* found. We use the
temporary variable "tmp_res_bus" (lines 612-615) to check if the
capability contains a bus number reservation hint. And this is where the
grave programming sin is committed, because we *totally* change the
meaing of the "res_bus" variable to:

  the *relative* bus number increment that should be reserved for this
  bridge

The maximum calculation at lines 623-629 indeed outputs the maximum in
"res_bus" as an absolute quantity, however, it *assumes* that on input,
"res_bus" is this *relative* increment. This assumption is false when
the capability does not contain a bus nr reservation hint -- in that
case, "res_bus" is still set to the *absolute* candidate from the
recursive traversal, and we'll always take the nonsensical branch (at
lines 623-626)

    if (secbus + *pci_bus > *pci_bus) {
        /* ... */
        res_bus = secbus + *pci_bus;

In other words, we end up with the subordinate bus number by
incrementing the recursive candidate with our own secondary bus number
-- nonsense.

Now let's see how the current backport (upstream 14d91c353) updates the
code:

$ git show 14d91c353:src/fw/pciinit.c | cat -n

>    598          /* set to max for access to all subordinate buses.
>    599             later set it to accurate value */
>    600          u8 subbus = pci_config_readb(bdf, PCI_SUBORDINATE_BUS);
>    601          pci_config_writeb(bdf, PCI_SUBORDINATE_BUS, 255);
>    602
>    603          pci_bios_init_bus_rec(secbus, pci_bus);
>    604
>    605          if (subbus != *pci_bus) {
>    606              u8 res_bus = *pci_bus;
>    607              u8 cap = pci_find_resource_reserve_capability(bdf);
>    608
>    609              if (cap) {
>    610                  u32 tmp_res_bus = pci_config_readl(bdf,
>    611                          cap + RES_RESERVE_BUS_RES);
>    612                  if (tmp_res_bus != (u32)-1) {
>    613                      res_bus = tmp_res_bus & 0xFF;
>    614                      if ((u8)(res_bus + secbus) < secbus ||
>    615                              (u8)(res_bus + secbus) < res_bus) {
>    616                          dprintf(1, "PCI: bus_reserve value %d is invalid\n",
>    617                                  res_bus);
>    618                          res_bus = 0;
>    619                      }
>    620                      if (secbus + res_bus > *pci_bus) {
>    621                          dprintf(1, "PCI: QEMU resource reserve cap: bus = %u\n",
>    622                                  res_bus);
>    623                          res_bus = secbus + res_bus;
>    624                      }
>    625                  }
>    626              }
>    627              dprintf(1, "PCI: subordinate bus = 0x%x -> 0x%x\n",
>    628                      subbus, res_bus);
>    629              subbus = res_bus;
>    630              *pci_bus = res_bus;
>    631          } else {
>    632              dprintf(1, "PCI: subordinate bus = 0x%x\n", subbus);
>    633          }
>    634          pci_config_writeb(bdf, PCI_SUBORDINATE_BUS, subbus);

It restricts the maximum search to the case when we know for certain
that the capability contains a bus number reservation (lines 620-624).
If the bus number reservation hint is missing, then "res_bus" remains
initialized to the recursive candidate, and everything is fine (same as
when the entire capability is missing).

However, another bug remains, and it is due to the exact same
"programming sin":

If the bridge in question has *both* a bus nr reservation hint *and* a
sub-hierarchy, *and* the sub-hierarchy produces a candidate that is
*already* more than, or equal to, what the reservation would imply,
then:

- the "maximum update" condition on line 620 evaluates to "false",

- and we not only abstain from updating the maximum in "res_bus", on
  line 623, but we also fail to restore "res_bus" to its *absolute*
  meaning (i.e., value *pci_bus)! Because, on line 613, we have set
  "res_bus" to the *relative* increment.

As a result, we'll store the relative increment to the bridge register,
which is *guaranteed* to break the sub-hierarchy behind the bridge.
(This is because the increment (= the reservation) did not exceed the
recursive candidate even when added to the bridge's own secondary bus
number; thus, it will be even more insufficient when it is
re-interpreted as an absolute value -- i.e., when it is added to zero.)

Worse, on line 630, *pci_bus will be re-wound to this value, and then
the rest of the PCI/PCIE hierarchy will get overlapping bus numbers.

(e) The only sensible fix for this issue is to keep the two meanings of
"res_bus" in separate variables.

I think that issue (e) is realistic even for downstream. If a user
configures a root port with a bus number reservation of 1 (expecting to
hotplug a PCIE-PCI bridge later), but then they change their mind and
*cold*-plug the PCIE-PCI bridge, without removing the reservation
property, then line 620 will face an equality, and things will break.

So... IMO, this patch does not solve RHBZ#1523166 as broadly as it
should.

Marcel, if you agree, can you please submit some more patches to
upstream SeaBIOS, in order to solve issues (a) through (e)?

Thanks,
Laszlo

Comment 4 Laszlo Ersek 2018-09-13 10:36:17 UTC
The "side comments" from comment 0 have been (partly?) fixed by the following upstream commits, from Jing2 Liu <jing2.liu.com>:

2c455ccc0cd6 pci: fix the return value for truncated capability
478bc3e99394 pci: clean up the debug message for pci capability found