1469338 – RFE: expose Q35 extended TSEG size in domain XML element or attribute

RHEL Engineering is moving the tracking of its product development work on RHEL 6 through RHEL 9 to Red Hat Jira (issues.redhat.com). If you're a Red Hat customer, please continue to file support cases via the Red Hat customer portal. If you're not, please head to the "RHEL project" in Red Hat Jira and file new tickets here. Individual Bugzilla bugs in the statuses "NEW", "ASSIGNED", and "POST" are being migrated throughout September 2023. Bugs of Red Hat partners with an assigned Engineering Partner Manager (EPM) are migrated in late September as per pre-agreed dates. Bugs against components "kernel", "kernel-rt", and "kpatch" are only migrated if still in "NEW" or "ASSIGNED". If you cannot log in to RH Jira, please consult article #7032570. That failing, please send an e-mail to the RH Jira admins at rh-issues@redhat.com to troubleshoot your issue as a user management inquiry. The email creates a ServiceNow ticket with Red Hat. Individual Bugzilla bugs that are migrated will be moved to status "CLOSED", resolution "MIGRATED", and set with "MigratedToJIRA" in "Keywords". The link to the successor Jira issue will be found under "Links", have a little "two-footprint" icon next to it, and direct you to the "RHEL project" in Red Hat Jira (issue links are of type "https://issues.redhat.com/browse/RHEL-XXXX", where "X" is a digit). This same link will be available in a blue banner at the top of the page informing you that that bug has been migrated.

Bug 1469338 - RFE: expose Q35 extended TSEG size in domain XML element or attribute

Summary: RFE: expose Q35 extended TSEG size in domain XML element or attribute

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	Red Hat Enterprise Linux 7
Classification:	Red Hat
Component:	libvirt
Sub Component:
Version:	7.4
Hardware:	Unspecified
OS:	Unspecified
Priority:	medium
Severity:	unspecified
Target Milestone:	rc
Target Release:	---
Assignee:	Martin Kletzander
QA Contact:	Meina Li
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:	1558125
TreeView+	depends on / blocked

Reported:	2017-07-11 02:11 UTC by Laszlo Ersek
Modified:	2018-10-30 09:52 UTC (History)
CC List:	10 users (show)
Fixed In Version:	libvirt-4.5.0-1.el7
Doc Type:	Enhancement
Doc Text:
Clone Of:
Environment:
Last Closed:	2018-10-30 09:49:58 UTC
Target Upstream Version:
Embargoed:

Attachments	(Terms of Use)

Links
System	ID	Priority	Status	Summary	Last Updated
Red Hat Bugzilla	1447027	unspecified	CLOSED	Guest cannot boot with 240 or above vcpus when using ovmf	2021-02-22 00:41:40 UTC
Red Hat Bugzilla	1468526	medium	CLOSED	>1TB RAM support	2021-02-22 00:41:40 UTC
Red Hat Product Errata	RHSA-2018:3113	None	None	None	2018-10-30 09:52:27 UTC

Internal Links: 1447027 1468526 1866110

Description Laszlo Ersek 2017-07-11 02:11:50 UTC

The SMRAM (TSEG) needs of OVMF on the Q35 board grow with VCPU count (see bug 1447027) and guest RAM size (see bug 1468526). In upstream QEMU 2.10, commit 2f295167e0c4 ("q35/mch: implement extended TSEG sizes", 2017-06-08) makes the TSEG size configurable, via the following property:

  -global mch.extended-tseg-mbytes=N

On i440fx machine types, the property (the mch device to begin with) does not exist.

On Q35 machine types, extended TSEG is disabled up to and including pc-q35-2.9 (by QEMU automatically setting N:=0). On pc-q35-2.10, the default value in QEMU is N:=16. For some use cases -- gigantic guest RAM sizes --, this is insufficient (and the RHELx supported limits might change with time anyway). The necessary size is technically predictable (see bug 1468526 comment 8 point (2a) e.g.), but the formula is neither exact nor easy to describe, so as a first step, libvirt should please expose this value in an optional element or attribute.

It is a motherboard/chipset property (the mch device always exists, implicitly, and there's always one of it), if that helps with finding the right place for the new element/attribute in the domain XML schema.

Thanks.

Comment 1 Peter Krempa 2017-07-11 10:42:35 UTC

If there's any possibility to calculate that on our behalf and not to have users set it we should use it, since it would basically force to do them the calculation themselves.

Comment 2 Laszlo Ersek 2017-07-11 13:32:02 UTC

I'll try to come up with the formula, but ultimately the result depends on arbitrary constants from edk2 project. So this will always remain a moving target -- at least for last-resort override purposes, a public knob would be really helpful. I'm setting needinfo on myself so I don't forget this.

Comment 3 Daniel Berrangé 2017-07-11 14:15:43 UTC

(In reply to Laszlo Ersek from comment #2)
> I'll try to come up with the formula, but ultimately the result depends on
> arbitrary constants from edk2 project. So this will always remain a moving
> target -- at least for last-resort override purposes, a public knob would be
> really helpful. I'm setting needinfo on myself so I don't forget this.

If it is going to be an arbitrary moving target, that is an even stronger reason to not expose this to the user in the XML - they've no idea what edk2 version is being used, so can't know which formula to use to calculate it correctly.

Comment 4 Laszlo Ersek 2017-07-11 15:05:21 UTC

They can use trial and error though. (Which the guest can't -- this limit is not guest-configurable, only guest-discoverable, and if there isn't enough SMRAM, all the guest can do is abort booting and hang.)

Comment 5 Laszlo Ersek 2017-07-11 15:18:15 UTC

Argh, didn't mean to clear needinfo.

Back to the question of the exact amount of SMRAM neeed -- it's not really different from the questions
- "what is the minimum guest RAM requirement for this virtual hardware and guest OS combination",
- "what is the minimum virtual disk size for installing this guest OS".

The guest can't change these facts from the inside, it can only fail to boot and/or to install. Users can configure these values in the domain XML, perhaps based on trial and error (for example, recent x86_64 Fedora won't install nicely with 1GB of guest RAM only), or they can consult the "minimum requirements" chapters of the various guest OS docs.

If that is acceptable, we can do the exact same, in our product documentation. RHV specifies the maximum supported virtualization limits in <https://access.redhat.com/articles/906543>. Given the maximum supported VCPU count and maximum supported guest RAM size, I can provide SMRAM (TSEG) sizes that will make things work. If invariably specifying the "largest" TSEG size necessary is deemed wasteful, I can also work out the numbers (tested in practice too) for a few "distinguished" configurations. This is only a question of having access to beefy enough hardware in Beaker.

For example, yesterday I wrote in bug 1468526 comment 8 point (4a),

> For the currently published RHV4 limits (see link above, 240 VCPUs and
> 4TB guest RAM), 16MB SMRAM for the VCPUs and 4*8MB=32MB SMRAM for 4TB
> guest RAM should suffice (48MB SMRAM total).

Now, if you combine this with the minimum guest RAM requirements on the same page (512 MB), I might as well simplify it to: "always specify 48MB SMRAM and be done with it" -- that will always make things work. However, some people will dislike wasting 48MB from the 512MB RAM on a uselessly huge SMRAM (TSEG), and will either ask for automated calculation (which is pretty hard to do) or want to tweak the TSEG size by trial and error.

Also, for the upstream virt stack, I have no clue if anybody maintains any min/max limits.

Comment 7 Laszlo Ersek 2017-07-11 15:49:39 UTC

As I keep unintentionally clearing the needinfo on myself, I might as well
write up a back-of-the-envelope calculation for the SMRAM size (hopefully
erring on the safe side):

(1) Start with 16MB SMRAM.

  This is the default for the pc-q35-2.10 machine type, and it will suffice
  for up to 272 VCPUs, 5GB guest RAM in total, no hotplug memory range, and
  32GB of 64-bit PCI MMIO aperture.

  For significantly higher VCPU counts, this starting value might have to be
  raised. Thus far 272 VCPUs have been the highest I could test on real
  hardware. And, I don't yet have a ratio for converting VCPU count to SMRAM
  footprint. (For that I'd have to re-provision one of the few Beaker boxes
  with such huge logical processor counts.)

  The 16MB starting point for SMRAM is also sufficient for 48 VCPUs, with
  1TB of guest RAM, no hotplug DIMM range, and 32GB of 64-bit PCI MMIO
  aperture.

(2) For each terabyte (== 2^40 bytes) of *address space* added, add 8MB of
    TSEG.

    Note that I wrote "address space", not guest RAM. The following factors
    increase the address space maximum for OVMF (in this order, going from
    low addresses to high addresses):

    * guest RAM

    * Hotplug RAM (DIMM) size. This is controlled on the QEMU command line
      (not sure how exactly) and it is exposed to OVMF via a canonical
      fw_cfg file. It defaults to 0.

    * 64-bit PCI MMIO aperture size (OVMF provides an experimental fw_cfg
      knob for this, and without the knob, it defaults to 32GB). There is
      currently no non-experimental knob in QEMU to control this.

      Also note that both the base and the size of the 64-bit PCI MMIO
      aperture are rounded up to 1GB, and then the aperture base is also
      aligned up to the largest power of two (= BAR size) that the aperture
      size can contain.

    OVMF determines the size of the address space in the following function
    (it is heavily commented):

    https://github.com/lersek/edk2/blob/highram1tb/OvmfPkg/PlatformPei/MemDetect.c#L292

    Once the address space size is determined, the SMRAM footprint of paging
    structures can be calculated. However, this is not a simple linear
    function, because as you grow the RAM size, you will need internal nodes
    in the "tree of page tables" as well (so I guess we could call it an
    n*log(n) style formula). The above number (8MB extra SMRAM per 1TB
    address space added) is pretty accurate in the TB range, and likely a
    bit wasteful in sub-TB address space ranges.

    Furthermore, if 1GB paging is exposed in the VCPU / CPUID flags to the
    guest, then the SMRAM requirement goes down as well (fewer page tables
    are necessary). I haven't worked out the ratio for this, but sticking
    with the above 8MB/1TB ratio should be safe (albeit technically wasteful
    a little bit).

Of course once we automate the calculation, the code should be tested in
practice with a number of scenarios.

Comment 8 Laszlo Ersek 2017-07-11 16:58:23 UTC

(
Side point:

> However, this is not a simple linear function, because as you grow the RAM
> size, you will need internal nodes in the "tree of page tables" as well (so I
> guess we could call it an n*log(n) style formula).

Without affecting my main point, the logarithm reference was bogus here. In a fully populated tree with arity "a", the number of internal nodes is

(number of leaf nodes - 1) / (a - 1)

I.e., it results from a division by a constant.

See e.g. <https://math.stackexchange.com/questions/260809/show-1-a-a2-a3-ldots-an-fracan1-1a-1-by-induction>.

The total number of nodes is then

  (number of leaf nodes) * a - 1
  ------------------------------
              a - 1

So the function (= SMRAM needed for page tables) is slightly sub-linear in the number of leaf nodes (= RAM mapped by the page tables).

Of course the above is idealized (with page tables, different levels have different arities, and the tree is almost never fully populated, some internal nodes are "wasted"). I just wanted to correct my bogus reference to "log". For practical purposes, the SMRAM footprint looks mostly linear.
)

Comment 11 Martin Kletzander 2018-04-19 12:03:53 UTC

Few questions about this, just to make sure we expose it properly.

1) From what I read, there seems to be a difference between SMRAM and TSEG (TSEG being one of the size extensions of SMRAM, the other one being "High SMRAM"), is that the case in QEMU?

2) Is this just a part of RAM dedicated to SM (or SMI handling or SMM, not sure how you call it) or is that something that's allocated alongside it?  I just want to know whether this cuts down the RAM available for the guest.

3) This is kind of a follow-up to question 2. Does this need to be accounted for in the `maxmem=` option of the `-m` parameter?

4) Is it possible that on some other hypervisors or in the future this will make sense not only per-VM, but for example per-NUMA node?

Comment 13 Martin Kletzander 2018-04-19 12:32:52 UTC

Continuing from comment #11

5) I assume the answers to questions 1, 2 and 3 are 'yes' 'yes' 'no', simplifying the implementation.  This makes sense *only* with `-M q35,smm=on`, right?

Comment 14 Martin Kletzander 2018-04-19 14:02:34 UTC

I'll just... put this... back... here...

Comment 15 Laszlo Ersek 2018-04-20 13:35:47 UTC

(In reply to Martin Kletzander from comment #11)
> Few questions about this, just to make sure we expose it properly.
> 
> 1) From what I read, there seems to be a difference between SMRAM and TSEG
> (TSEG being one of the size extensions of SMRAM, the other one being "High
> SMRAM"), is that the case in QEMU?

Valid question. The naming is a mess.

The Intel datasheet for the Q35 "Graphics and Memory Controller Hub" (document number 316966-002) calls the guest-phys address range A_0000h – B_FFFFh "Compatible SMRAM Address Range", and it calls TSEG "extended SMRAM space". In 3.8 "System Management Mode (SMM)", it writes:

> System Management Mode uses main memory for System Management RAM (SMM
> RAM). The (G)MCH supports: Compatible SMRAM (C_SMRAM), High Segment (HSEG),
> and Top of Memory Segment (TSEG). [...] The (G)MCH provides three SMRAM
> options:
>
> * Below 1 MB option that supports compatible SMI handlers.
> * Above 1 MB option that allows new SMI handlers to execute with
>   write-back cacheable SMRAM.
> * Optional TSEG area of 1 MB, 2 MB, or 8 MB in size. [...]

So, first, for OVMF, all but TSEG are irrelevant (way too small), therefore everywhere I refer to SMRAM, I refer to TSEG. Second, when I refer to *extended* TSEG, that's a QEMU- and OVMF-specific extension to the Q35 board, where we can use any TSEG size we like (up to 65534 MB, in 1MB increments), not just 1MB, 2MB, or 8MB.

TL;DR:
- "SMRAM" always means "TSEG" for OVMF,
- "extended TSEG" means a non-standard feature of Q35 that lets us use any
  size for TSEG.

> 2) Is this just a part of RAM dedicated to SM (or SMI handling or SMM, not
> sure how you call it) or is that something that's allocated alongside it?  I
> just want to know whether this cuts down the RAM available for the guest.

Another valid question :) Yes, TSEG comes out of the RAM that is otherwise available to the guest. In "OvmfPkg/PlatformPei/MemDetect.c", we have:

    //
    // TSEG is chipped from the end of low RAM
    //

("low RAM" means the RAM that falls into 32-bit address space).

However, I don't think that this changes the meaning of "RAM made available to the guest". Considering the virtual hardware, TSEG is just the same RAM alright, except it is hidden from code that runs outside of SMM.

> 3) This is kind of a follow-up to question 2. Does this need to be accounted
> for in the `maxmem=` option of the `-m` parameter?

I don't think so. "maxmem" is about memory hotplug, and the hotplug DIMM range is entirely separate. TSEG is part of the cold-plugged memory.

To be precise, TSEG covers ("hides") the end of *that* portion of the cold-plugged RAM which falls in the 32-bit address space. In practice, with the usual multi-GB RAM guests, this means that TSEG sits just below the address 2GB, because that's where cold-plugged RAM ends in 32-bit space on Q35.

(You can find a related discussion here: <https://bugzilla.redhat.com/show_bug.cgi?id=1353591#c8>.)

> 4) Is it possible that on some other hypervisors or in the future this will
> make sense not only per-VM, but for example per-NUMA node?

I don't believe so. The 32-bit address space is already special in that it contains a bunch of MMIO interleaved with RAM (MMCONFIG, APIC, 32-bit PCI aperture, ...), so whichever NUMA node owns the RAM under 32-bit "stands unmatched" by other NUMA nodes anyway.

I also don't think that carving SMRAM out of any particular NUMA node would be helpful performance-wise (on real hardware, that is) -- usually a performance benefit comes from routing an interrupt to a given CPU and then keeping data relevant for handling that interrupt close to that CPU. However, the OS cannot change the routing of SMIs (System Management Interrupts) -- the point of an SMI is that it entirely pre-empts the OS. Furthermore, SMI handling in the firmware starts with dragging all other CPUs into SMM as well (for security reasons) and putting them to sleep or keeping them busy otherwise. Realtime kernel folks have published results that show that SMIs incur terrible latency spikes. That's purely due to the interrupt pattern and likely cannot be helped with memory proximity.

In order to know for certain, you'd have to talk to designers of physical boards. Personally I don't believe that NUMA will become relevant for SMI handling.

Comment 16 Laszlo Ersek 2018-04-20 13:40:14 UTC

(In reply to Martin Kletzander from comment #13)
> [...] This makes sense *only* with `-M q35,smm=on`, right?

Yes. TSEG is exclusive to Q35 (even the original, 1MB / 2MB / 8MB-only TSEG, let alone our "extended" Frankenstein TSEG). And TSEG is unusable (invisible) to the guest unless it runs code in SMM. With "-machine smm=off", no code can be run in SMM, hence TSEG would be a permanent black hole MMIO area.

Comment 17 Martin Kletzander 2018-05-30 14:19:09 UTC

I forgot to mention that I sent a series upstream for this more than a week ago, sorry, fixing now:

https://www.redhat.com/archives/libvir-list/2018-May/msg01515.html

Comment 18 Laszlo Ersek 2018-05-30 15:39:28 UTC

Hi Martin,

(In reply to Martin Kletzander from comment #17)
> I forgot to mention that I sent a series upstream for this more than a week
> ago, sorry, fixing now:
> 
> https://www.redhat.com/archives/libvir-list/2018-May/msg01515.html

thanks for the patches (also *blush* regarding your blurb) -- I'm not subscribed to libvir-list and you didn't CC me, so I've just learned of the patches and I can only respond here.

(1) In

  [libvirt] [PATCH 3/5] conf, schema, docs: Add support for TSEG size setting

you write,

  For QEMU TSEG was disabled up to and including <code>pc-q35-2.9</code>

This is not correct. TSEG is a lot older feature than that; what's new is the *extended* TSEG, i.e. the ability that the user can specify a non-standard TSEG size on the command line. It's always up to the guest to configure (select) one of the TSEG sizes that QEMU offers. The standard sizes offered are 1MB, 2MB, 8MB. With the extended TSEG feature, QEMU offers a fourth ("extended") option for the guest firmware to pick, and the actual size of that option is specified by the user. If OVMF sees that the "extended" option is being offered, it will go for it. So, I suggest something like,

  In QEMU, the user-configurable extended TSEG feature was unavailable up to
  and including <code>pc-q35-2.9</code>. [...] From <code>pc-q35-2.10</code>
  the feature is available, with default size 16 MiB.


(2) Less importantly, in the statement

  it does not make sense fo any other machine type than q35)

there's a typo ("fo").


(3) Regarding the max value 65534 MB that you mention in the commit message, that originates from my similar, *bogus*, statement in comment 15. A TSEG of ~64 GB makes absolutely no sense. I got confused by the uint16_t representation of the number of megabytes. In the QEMU commit, I actually enforce a limit of 4095 MB, but even that is only for technical reasons (so that the number of *bytes* can be represented internally in a uint32_t). Obviously ~4GB TSEG is also totally bogus, in practice. So, I suggest updating the commit message like this:

  ... it can have any size from 1 MiB up to (technically) 4095 MiB in 1 MiB
  increments (hints for practical extended TSEG sizes are given in the
  documentation, in the patch below).

Feel free to reword, of course; I'm just trying to point out the concept. And, my apologies about the glitch in comment 15 above.


(4) Can you please add the RHBZ link to the individual commit messages too?

Thanks!

Comment 19 Martin Kletzander 2018-05-30 21:42:48 UTC

(In reply to Laszlo Ersek from comment #18)
Thanks for the info.  No problem about the wrong size info.  There's a glitch on my part as well, I should've checked the QMEU documentation (and by that I mean the sources, of course).

I added you to Bcc on the cover letter, maybe I made a typo or screwed up somewhere else.  I'm always unaware who doesn't like to be Cc'd on patches, so I figured Bcc would be the easiest option that wouldn't bite back if you don't want to get all the replies.  You can still reply to the individual patches, but I understand it would've been easier if you were Cc'd.  I'll fix that next time.

We usually add "Resolves" trailers to the patch that actually resolves the BZ, I can add the BZ with some other keyword than "Resolves" to the other ones of course.

Just so I fix that size defaults for TSEG with previous versions, let me have another question or rather clarification request. If you have pc-q35-2.7 VM, guest can pick 1, 2 or 8 MiB.  That is the same as real Intel's Q35 would offer.  With pc-q35-2.10 the guest will see 4 options, that is the first three with the additional 16 MiB one.  With -global mch.extended-tseg-mbytes=32 the guest will see 1, 2, 8 and 32 MiB.  Did I get that correctly?

Comment 20 Laszlo Ersek 2018-05-31 07:44:48 UTC

* It's better to CC me explicitly on all the patches (including the blurb)
  -- I don't mind being CC'd on the entire thread. If you don't CC me,
  there's a good chance I won't notice the series, or even if I do, I'll
  have difficulty replying. :)

* I must have missed that the BZ was already referenced in one of the commit
  messages. If that's already covered, then I'm fine, thanks. I just wanted
  the git history of the series to refer to the BZ somewhere :)

* For answering the last paragraph, I must quote it:

(In reply to Martin Kletzander from comment #19)

> If you have pc-q35-2.7 VM, guest can pick 1, 2 or 8 MiB.  That is the same
> as real Intel's Q35 would offer.

Correct.

> With pc-q35-2.10 the guest will see 4 options, that is the first three
> with the additional 16 MiB one.

Correct.

OVMF queries the extended TSEG feature. If it finds it, it will request the
extended TSEG size advertised by QEMU. If the feature is absent, OVMF will
pick the one standard TSEG size -- of 1 / 2 / 8 MB -- that is configured as
an OVMF build time constant. (8MB generally.)

> With -global mch.extended-tseg-mbytes=32 the guest will see 1, 2, 8 and 32
> MiB.  Did I get that correctly?

You did. Thanks!

Comment 21 Martin Kletzander 2018-06-07 12:47:26 UTC

New version posted:

https://www.redhat.com/archives/libvir-list/2018-June/msg00533.html

Comment 22 Martin Kletzander 2018-06-08 10:27:43 UTC

Fixed upstream with v4.4.0-168-g343894b74b98..v4.4.0-172-gd60b730b48c5:
commit 343894b74b981c416b533eae80279cacc65530b9
Author: Martin Kletzander <mkletzan>
Date:   Thu May 10 23:28:24 2018 +0200

    qemu: Move checks for SMM from command-line creation into validation phase
    
commit 3f2499d7d703c83667ff1df6603e4c9d9791e9a4
Author: Martin Kletzander <mkletzan>
Date:   Thu Jun 7 23:24:45 2018 +0200

    qemu: Relax check for SMM feature
    
commit 1bd5a08d3810aa7fd99f2a0c433c1d15d274deb9
Author: Martin Kletzander <mkletzan>
Date:   Thu May 10 21:32:26 2018 +0200

    conf, schema, docs: Add support for TSEG size setting
    
commit 3ea32b635dee46dfcd93d19778bf789d858860f6
Author: Martin Kletzander <mkletzan>
Date:   Thu May 10 23:27:57 2018 +0200

    qemu: Add capability flag for setting the extended tseg size
    
commit d60b730b48c58425b879751fdb468c4e27b8442d
Author: Martin Kletzander <mkletzan>
Date:   Thu May 10 23:37:18 2018 +0200

    qemu: Add support for setting the TSEG size

Comment 23 Meina Li 2018-06-25 08:41:35 UTC

Hi Martin,

From the previous comments I know that the TSEG needs of OVMF on the Q35 board grow with VCPU count and guest RAM size, and this property doesn't exist on i440fx machine type. But the above patch said i440fx can also be configurable on smm element(including TSEG). So if it means this element just use for presentation in XML, not as effective as q35 machine type?

Thank you.

Comment 24 Martin Kletzander 2018-06-28 15:33:13 UTC

There are two things that are pretty much unrelated:
- OVMF needs TSEG size to grow with increasing vCPU count, addressable memory area, etc.
- You can enable/disable SMM on i440fx

The second point has pretty much nothing to do with this BZ, it's just a thing that we did not allow although it is possible.  Probably nobody wants to change that (otherwise they would already be heard).

Does that answer your question?

Comment 25 Meina Li 2018-06-29 08:22:40 UTC

Hi Martin,

Thanks for your reply. I have already got it.

Comment 27 Meina Li 2018-07-19 06:23:16 UTC

Verified on libvirt-4.5.0-3.el7.x86_64.

Test steps:
1. Positive scenario 

a. test on OVMF guest
1)Define and start a guest with the flowing tseg sub-element by MiB.
# vi ovmf.xml
...
<smm state='on'> 
  <tseg unit='MiB'>48</tseg> 
</smm>
...
# virsh define ovmf.xml
Domain ovmf defined from ovmf.xml
# virsh start ovmf
Domain ovmf started

2)Check the status of tseg in guest.
# virsh dumpxml ovmf | grep /smm -B3
<vmport state='off'/>
    <smm state='on'>
      <tseg unit='MiB'>48</tseg>
    </smm>   
# ps aux | grep qemu | grep tseg
...
-global mch.extended-tseg-mbytes=48 
...
3) Define and start a guest with the tseg sub-element by 2048000 KiB, it works well.

b. test on seabios guest
1)Edit a q35-seabios guest with smm='on' element and start guest.
# virsh edit guest
...
<smm state='on'> 
  <tseg unit='MiB'>48</tseg> 
</smm>
...
# virsh start guest 
Domain guest started
2)Edit a q35-seabios guest with smm='off' element and start guest.
# virsh edit guest
...
<smm state='off'/>
...
# virsh start guest 
Domain guest started
3)Edit a i440fx guest with smm='on' element and start guest.
# virsh edit guest
...
<smm state='on'/> 
...
# virsh start guest 
Domain guest started
4)Edit a i440fx guest with smm='off' element and start guest.
# virsh edit guest
...
<smm state='off'/>
...
# virsh start guest 
Domain guest started

2. Negative scenario
a. with invalid machine type.
1)Edit a i440fx guest with tseg sub-element.
# virsh edit guest
...
<smm state='on'>
<tseg unit='MiB'>48</tseg> 
</smm> 
...
error: unsupported configuration: SMM TSEG is only supported with q35 machine type
Failed. Try again? [y,n,i,f,?]:

b. with invalid parameter.
1)Edit a q35 guest with tseg sub-element.
...
<smm state='on'>
<tseg unit='a'>48</tseg> 
</smm> 
...
error: invalid argument: unknown suffix 'a'
Failed. Try again? [y,n,f,?]:

c. with invalid values.
1)# virsh edit guest
...
<smm state='on'>
<tseg unit='MiB'>asd</tseg> 
</smm> 
...
error: XML error: Invalid value 'asd' for element or attribute 'string(./features/smm/tseg)'
Failed. Try again? [y,n,f,?]:
2)# virsh edit guest
...
<smm state='on'>
<tseg unit='KiB'>12345</tseg> 
</smm> 
...
error: unsupported configuration: SMM TSEG size must be divisible by 1 MiB
Failed. Try again? [y,n,i,f,?]:

Comment 28 Meina Li 2018-07-19 06:49:48 UTC

Hi Martin,

I have met two another problems when I verified this bugs, can you help me to check with them? Thank you in advance.

(1) In patch 'conf, schema, docs: Add support for TSEG size setting', it said: Maximum may vary based on QEMU and is way too big, so we don't need to check for the maximum here. 
But there's two different results.

In my test:
1)Start a guest with 1024MiB tseg, start success but the status of guest is paused.
# virsh edit guest
...
<smm state='on'>
<tseg unit='MiB'>1024</tseg>      ---or <tseg unit='GiB'>1</tseg>
</smm> 
...
# virsh start guest 
Domain guest started
# virsh list --all
 Id    Name                           State
----------------------------------------------------
 42    ovmf                           paused
 -     440                            shut off
 -     lmn                            shut off
 -     win7                           shut off

------I think this is a bug.

2) Start a guest with 1024MiB tseg, start success but the status of guest is paused.
# virsh edit guest
...
<smm state='on'>
<tseg unit='MiB'>10240</tseg>     
</smm> 
...
# virsh start guest 
error: Failed to start domain guest
error: internal error: qemu unexpectedly closed the monitor: 2018-07-19T06:34:40.203840Z qemu-kvm: Initialization of device mch failed: invalid extended-tseg-mbytes value: 10240

---------I think 10240MiB has exceeded the maximum, but maybe we can have a clear error info.

(2)When the value of tseg is large enough, for example, 1000MiB, the count of vcpus is still limited to 384(https://bugzilla.redhat.com/show_bug.cgi?id=1447027#c34).

a. edit vcpu in guest xml:
...<vcpu placement='static'>385</vcpu>...
# virsh start guest
error: Failed to start domain guest
error: unsupported configuration: Maximum CPUs greater than specified machine type limit

b. Boot a uefi guest with cli:
/usr/libexec/qemu-kvm -enable-kvm -M q35 -smp 385 -m 8192 -name vm1 \
...\
-global mch.extended-tseg-mbytes=1000
qemu-kvm: Invalid SMP CPUs 385. The max CPUs supported by machine 'pc-q35-rhel7.6.0' is 384

Comment 29 Meina Li 2018-07-19 07:02:34 UTC

Additional info:

/var/log/libvirt/qemu/guest.log for part (1).1) 'guest paused' in comment 28:
2018-07-19T06:50:51.802937Z qemu-kvm: -chardev pty,id=charserial0: char device redirected to /dev/pts/1 (label charserial0)
KVM internal error. Suberror: 1
emulation failure
EAX=00000000 EBX=00818000 ECX=fffd06dc EDX=0081f9d0
ESI=0081f9c2 EDI=00008000 EBP=00000000 ESP=fbfdd988
EIP=000a0000 EFL=00000046 [---Z-P-] CPL=0 II=0 A20=1 SMM=0 HLT=0
ES =0008 00000000 ffffffff 00c09300 DPL=0 DS   [-WA]
CS =0010 00000000 ffffffff 00c09b00 DPL=0 CS32 [-RA]
SS =0008 00000000 ffffffff 00c09300 DPL=0 DS   [-WA]
DS =0008 00000000 ffffffff 00c09300 DPL=0 DS   [-WA]
FS =0008 00000000 ffffffff 00c09300 DPL=0 DS   [-WA]
GS =0008 00000000 ffffffff 00c09300 DPL=0 DS   [-WA]
LDT=0000 00000000 0000ffff 00008200 DPL=0 LDT
TR =0000 00000000 0000ffff 00008b00 DPL=0 TSS32-busy
GDT=     ffffffa0 00000017
IDT=     fbfddec8 0000010f
CR0=00000033 CR2=00000000 CR3=00000000 CR4=00000640
DR0=0000000000000000 DR1=0000000000000000 DR2=0000000000000000 DR3=0000000000000000
DR6=00000000ffff0ff0 DR7=0000000000000400
EFER=0000000000000000
Code=00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 <00> 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00

Comment 30 Laszlo Ersek 2018-07-19 09:00:58 UTC

Hi Meina Li,

I'm sorry if my answer below doesn't answer your questions fully -- I'm
having difficulty understanding the questions.

I've now re-checked the documentation on the extended tseg size at
<https://libvirt.org/formatdomain.html>:

> Optional sub-element tseg can be used to specify the amount of memory
> dedicated to SMM's extended TSEG. That offers a fourth option size apart
> from the existing ones (1 MiB, 2 MiB and 8 MiB) that the guest OS (or
> rather loader) can choose from. The size can be specified as a value of
> that element, optional attribute unit can be used to specify the unit of
> the aforementioned value (defaults to 'MiB'). If set to 0 the extended
> size is not advertised and only the default ones (see above) are
> available.
>
> If the VM is booting you should leave this option alone, unless you are
> very certain you know what you are doing.
>
> This value is configurable due to the fact that the calculation cannot be
> done right with the guarantee that it will work correctly. In QEMU, the
> user-configurable extended TSEG feature was unavailable up to and
> including pc-q35-2.9. Starting with pc-q35-2.10 the feature is available,
> with default size 16 MiB. That should suffice for up to roughly 272 vCPUs,
> 5 GiB guest RAM in total, no hotplug memory range, and 32 GiB of 64-bit
> PCI MMIO aperture. Or for 48 vCPUs, with 1TB of guest RAM, no hotplug DIMM
> range, and 32GB of 64-bit PCI MMIO aperture. The values may also vary
> based on the loader the VM is using.
>
> Additional size might be needed for significantly higher vCPU counts or
> increased address space (that can be memory, maxMemory, 64-bit PCI MMIO
> aperture size; roughly 8 MiB of TSEG per 1 TiB of address space) which can
> also be rounded up.
>
> Due to the nature of this setting being similar to "how much RAM should
> the guest have" users are advised to either consult the documentation of
> the guest OS or loader (if there is any), or test this by trial-and-error
> changing the value until the VM boots successfully. Yet another guiding
> value for users might be the fact that 48 MiB should be enough for pretty
> large guests (240 vCPUs and 4TB guest RAM), but it is on purpose not set
> as default as 48 MiB of unavailable RAM might be too much for small guests
> (e.g. with 512 MiB of RAM).

It means that, if you put random values in there, the guest is entirely
expected to behave badly -- it is expected to fail to boot, or to crash. The
default (16 MB) shouldn't be touched unless the guest fails to boot with the
default. And in that case, gradually increase it up from the default 16 MB,
maybe up to 48 MB. For most practical purposes, 48 MB should suffice.

An extended TSEG of 1024 MB is totally unreasonable.

In addition, a large enough TSEG is a *necessary* condition for high VCPU
counts with OVMF, not a *sufficient* condition. If your TSEG is too small,
it may prevent you from booting with very many VCPUs. However, just because
your TSEG is large, it does not guarantee that you can boot with a huge VCPU
count -- there are many other limiting factors.

Comment 31 Meina Li 2018-07-20 02:31:21 UTC

Thanks a lot for the detailed reply, and I have already knew its usages in libvirt. According to comment 30, move this bug to be verified.

Comment 33 errata-xmlrpc 2018-10-30 09:49:58 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2018:3113

Note You need to log in before you can comment on or make changes to this bug.