Bug 1469338 - RFE: expose Q35 extended TSEG size in domain XML element or attribute
Status: NEW
Product: Red Hat Enterprise Linux 7
Classification: Red Hat
Component: libvirt
Version: 7.4
Assigned To: Libvirt Maintainers
QA Contact: lijuan men
Keywords: FutureFeature
Type: Bug
Doc Type: Enhancement
Target Milestone: rc
Reported: 2017-07-10 22:11 EDT by Laszlo Ersek
Modified: 2017-07-17 04:14 EDT
CC: 5 users

Description Laszlo Ersek 2017-07-10 22:11:50 EDT
The SMRAM (TSEG) needs of OVMF on the Q35 board grow with VCPU count (see bug 1447027) and guest RAM size (see bug 1468526). In upstream QEMU 2.10, commit 2f295167e0c4 ("q35/mch: implement extended TSEG sizes", 2017-06-08) makes the TSEG size configurable, via the following property:

  -global mch.extended-tseg-mbytes=N

On i440fx machine types, the property does not exist (the mch device itself does not exist to begin with).

On Q35 machine types, extended TSEG is disabled up to and including pc-q35-2.9 (QEMU automatically sets N:=0). On pc-q35-2.10, the default value in QEMU is N:=16. For some use cases -- gigantic guest RAM sizes -- this is insufficient (and the RHELx supported limits might change over time anyway). The necessary size is technically predictable (see bug 1468526 comment 8 point (2a), for example), but the formula is neither exact nor easy to describe, so as a first step, libvirt should please expose this value in an optional element or attribute.

It is a motherboard/chipset property (the mch device always exists, implicitly, and there's always one of it), if that helps with finding the right place for the new element/attribute in the domain XML schema.
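
Until such an element or attribute exists, the property can already be passed through with libvirt's QEMU command line namespace; a minimal sketch (the value 48 is only an illustrative choice, not a recommendation):

  <domain type='kvm' xmlns:qemu='http://libvirt.org/schemas/domain/qemu/1.0'>
    ...
    <qemu:commandline>
      <qemu:arg value='-global'/>
      <qemu:arg value='mch.extended-tseg-mbytes=48'/>
    </qemu:commandline>
  </domain>

(libvirt marks such domains as tainted, so this is a stop-gap, not a replacement for a first-class knob.)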

Thanks.
Comment 1 Peter Krempa 2017-07-11 06:42:35 EDT
If there's any possibility to calculate that on our behalf and not have users set it, we should use it, since otherwise it would basically force them to do the calculation themselves.
Comment 2 Laszlo Ersek 2017-07-11 09:32:02 EDT
I'll try to come up with the formula, but ultimately the result depends on arbitrary constants from edk2 project. So this will always remain a moving target -- at least for last-resort override purposes, a public knob would be really helpful. I'm setting needinfo on myself so I don't forget this.
Comment 3 Daniel Berrange 2017-07-11 10:15:43 EDT
(In reply to Laszlo Ersek from comment #2)
> I'll try to come up with the formula, but ultimately the result depends on
> arbitrary constants from edk2 project. So this will always remain a moving
> target -- at least for last-resort override purposes, a public knob would be
> really helpful. I'm setting needinfo on myself so I don't forget this.

If it is going to be an arbitrary moving target, that is an even stronger reason to not expose this to the user in the XML - they've no idea what edk2 version is being used, so can't know which formula to use to calculate it correctly.
Comment 4 Laszlo Ersek 2017-07-11 11:05:21 EDT
They can use trial and error though. (Which the guest can't -- this limit is not guest-configurable, only guest-discoverable, and if there isn't enough SMRAM, all the guest can do is abort booting and hang.)
Comment 5 Laszlo Ersek 2017-07-11 11:18:15 EDT
Argh, didn't mean to clear needinfo.

Back to the question of the exact amount of SMRAM needed -- it's not really different from the questions
- "what is the minimum guest RAM requirement for this virtual hardware and guest OS combination",
- "what is the minimum virtual disk size for installing this guest OS".

The guest can't change these facts from the inside, it can only fail to boot and/or to install. Users can configure these values in the domain XML, perhaps based on trial and error (for example, recent x86_64 Fedora won't install nicely with 1GB of guest RAM only), or they can consult the "minimum requirements" chapters of the various guest OS docs.

If that is acceptable, we can do the exact same, in our product documentation. RHV specifies the maximum supported virtualization limits in <https://access.redhat.com/articles/906543>. Given the maximum supported VCPU count and maximum supported guest RAM size, I can provide SMRAM (TSEG) sizes that will make things work. If invariably specifying the "largest" TSEG size necessary is deemed wasteful, I can also work out the numbers (tested in practice too) for a few "distinguished" configurations. This is only a question of having access to beefy enough hardware in Beaker.

For example, yesterday I wrote in bug 1468526 comment 8 point (4a),

> For the currently published RHV4 limits (see link above, 240 VCPUs and
> 4TB guest RAM), 16MB SMRAM for the VCPUs and 4*8MB=32MB SMRAM for 4TB
> guest RAM should suffice (48MB SMRAM total).

Now, if you combine this with the minimum guest RAM requirements on the same page (512 MB), I might as well simplify it to: "always specify 48MB SMRAM and be done with it" -- that will always make things work. However, some people will dislike wasting 48MB from the 512MB RAM on a uselessly huge SMRAM (TSEG), and will either ask for automated calculation (which is pretty hard to do) or want to tweak the TSEG size by trial and error.

Also, for the upstream virt stack, I have no clue if anybody maintains any min/max limits.
Comment 7 Laszlo Ersek 2017-07-11 11:49:39 EDT
As I keep unintentionally clearing the needinfo on myself, I might as well
write up a back-of-the-envelope calculation for the SMRAM size (hopefully
erring on the safe side):

(1) Start with 16MB SMRAM.

  This is the default for the pc-q35-2.10 machine type, and it will suffice
  for up to 272 VCPUs, 5GB guest RAM in total, no hotplug memory range, and
  32GB of 64-bit PCI MMIO aperture.

  For significantly higher VCPU counts, this starting value might have to be
  raised. Thus far 272 VCPUs have been the highest I could test on real
  hardware. And, I don't yet have a ratio for converting VCPU count to SMRAM
  footprint. (For that I'd have to re-provision one of the few Beaker boxes
  with such huge logical processor counts.)

  The 16MB starting point for SMRAM is also sufficient for 48 VCPUs, with
  1TB of guest RAM, no hotplug DIMM range, and 32GB of 64-bit PCI MMIO
  aperture.

(2) For each terabyte (== 2^40 bytes) of *address space* added, add 8MB of
    TSEG.

    Note that I wrote "address space", not guest RAM. The following factors
    increase the address space maximum for OVMF (in this order, going from
    low addresses to high addresses):

    * guest RAM

    * Hotplug RAM (DIMM) size. This is controlled on the QEMU command line
      (not sure how exactly) and it is exposed to OVMF via a canonical
      fw_cfg file. It defaults to 0.

    * 64-bit PCI MMIO aperture size (OVMF provides an experimental fw_cfg
      knob for this, and without the knob, it defaults to 32GB). There is
      currently no non-experimental knob in QEMU to control this.

      Also note that both the base and the size of the 64-bit PCI MMIO
      aperture are rounded up to 1GB, and then the aperture base is also
      aligned up to the largest power of two (= BAR size) that the aperture
      size can contain.

    OVMF determines the size of the address space in the following function
    (it is heavily commented):

    https://github.com/lersek/edk2/blob/highram1tb/OvmfPkg/PlatformPei/MemDetect.c#L292

    Once the address space size is determined, the SMRAM footprint of paging
    structures can be calculated. However, this is not a simple linear
    function, because as you grow the RAM size, you will need internal nodes
    in the "tree of page tables" as well (so I guess we could call it an
    n*log(n) style formula). The above number (8MB extra SMRAM per 1TB
    address space added) is pretty accurate in the TB range, and likely a
    bit wasteful in sub-TB address space ranges.

    Furthermore, if 1GB paging is exposed in the VCPU / CPUID flags to the
    guest, then the SMRAM requirement goes down as well (fewer page tables
    are necessary). I haven't worked out the ratio for this, but sticking
    with the above 8MB/1TB ratio should be safe (albeit technically wasteful
    a little bit).

Of course once we automate the calculation, the code should be tested in
practice with a number of scenarios.
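
As a rough illustration only -- the constants come straight from the two steps above, the 8MB/TiB ratio is deliberately generous, the helper name and parameters are made up for this sketch, and the extra scaling needed for very high VCPU counts is left out because I don't have a ratio for it yet -- the heuristic could be written up in Python like this:

  # Back-of-the-envelope mch.extended-tseg-mbytes estimate; NOT the exact
  # OVMF/edk2 formula, and it ignores the aperture base/size rounding
  # described in step (2).
  GIB = 1 << 30
  TIB = 1 << 40

  # Address space already covered by the 16MB baseline of step (1):
  # up to 5GB of guest RAM plus the default 32GB 64-bit PCI MMIO aperture.
  BASELINE_COVERED = 5 * GIB + 32 * GIB

  def estimate_tseg_mbytes(guest_ram_bytes,
                           hotplug_dimm_bytes=0,
                           pci64_aperture_bytes=32 * GIB):
      tseg_mb = 16                                   # step (1): baseline
      address_space = (guest_ram_bytes
                       + hotplug_dimm_bytes
                       + pci64_aperture_bytes)
      added = max(0, address_space - BASELINE_COVERED)
      tseg_mb += 8 * -(-added // TIB)                # step (2): 8MB per started TiB
      return tseg_mb

  # For example, 4TB of guest RAM, no hotplug DIMM range, default aperture:
  # estimate_tseg_mbytes(4 * TIB) == 16 + 8*4 == 48, matching comment 5.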
Comment 8 Laszlo Ersek 2017-07-11 12:58:23 EDT
(
Side point:

> However, this is not a simple linear function, because as you grow the RAM
> size, you will need internal nodes in the "tree of page tables" as well (so I
> guess we could call it an n*log(n) style formula).

Without affecting my main point, the logarithm reference was bogus here. In a fully populated tree with arity "a", the number of internal nodes is

(number of leaf nodes - 1) / (a - 1)

I.e., it results from a division by a constant.

See e.g. <https://math.stackexchange.com/questions/260809/show-1-a-a2-a3-ldots-an-fracan1-1a-1-by-induction>.

The total number of nodes is then

  (number of leaf nodes) * a - 1
  ------------------------------
              a - 1

So the function (= SMRAM needed for page tables) is slightly sub-linear in the number of leaf nodes (= RAM mapped by the page tables).

Of course the above is idealized (with page tables, different levels have different arities, and the tree is almost never fully populated, some internal nodes are "wasted"). I just wanted to correct my bogus reference to "log". For practical purposes, the SMRAM footprint looks mostly linear.
)
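
To put a number on "mostly linear": in a fully populated tree with arity a=512 (one 4KB page-table page holds 512 eight-byte entries), the internal nodes add only about 1/(a-1), i.e. roughly 0.2%, on top of the leaves. A quick, purely illustrative check of the formulas above:

  # Full a-ary tree: internal = (leaves - 1)/(a - 1), total = (leaves*a - 1)/(a - 1)
  a = 512                                 # entries per 4KB page-table page
  for depth in (2, 3, 4):
      leaves = a ** depth
      internal = (leaves - 1) // (a - 1)
      total = (leaves * a - 1) // (a - 1)
      assert total == leaves + internal
      print(depth, internal / leaves)     # ~0.00196, i.e. about 1/(a - 1)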
