Bug 1866110 - automated TSEG size calculation
Summary: automated TSEG size calculation
Keywords:
Status: CLOSED MIGRATED
Alias: None
Product: Red Hat Enterprise Linux 9
Classification: Red Hat
Component: qemu-kvm
Version: unspecified
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: unspecified
Target Milestone: rc
Target Release: ---
Assignee: Igor Mammedov
QA Contact: Xueqiang Wei
URL:
Whiteboard:
Depends On:
Blocks: 1788991
Reported: 2020-08-04 21:32 UTC by Laszlo Ersek
Modified: 2023-09-22 17:32 UTC
CC List: 10 users

Fixed In Version:
Doc Type: Enhancement
Doc Text:
Clone Of:
Environment:
Last Closed: 2023-09-22 17:32:31 UTC
Type: Story
Target Upstream Version:
Embargoed:
xuwei: needinfo-


Attachments

minimum sufficient TSEG sizes (whole powers of two) per VCPU count and RAM size (8.20 KB, text/plain)
2020-08-04 21:32 UTC, Laszlo Ersek

Illustration of PDP1GB results. (64.01 KB, image/png)
2020-08-05 17:29 UTC, Daniel Berrangé

Illustration of no-PDP1GB results. (66.96 KB, image/png)
2020-08-05 17:30 UTC, Daniel Berrangé


Links
System ID Private Priority Status Summary Last Updated
Red Hat Bugzilla 1447027 0 unspecified CLOSED Guest cannot boot with 240 or above vcpus when using ovmf 2021-02-22 00:41:40 UTC
Red Hat Bugzilla 1468526 0 medium CLOSED >1TB RAM support 2021-02-22 00:41:40 UTC
Red Hat Bugzilla 1469338 0 medium CLOSED RFE: expose Q35 extended TSEG size in domain XML element or attribute 2021-02-22 00:41:40 UTC
Red Hat Bugzilla 1819292 1 None None None 2021-09-18 08:07:01 UTC
Red Hat Issue Tracker   RHEL-7480 0 None Migrated None 2023-09-22 17:32:15 UTC

Description Laszlo Ersek 2020-08-04 21:32:59 UTC
Created attachment 1710428 [details]
minimum sufficient TSEG sizes (whole powers of two) per VCPU count and RAM size

* Description of problem:

The edk2 SMM infrastructure's SMRAM footprint is not constant: it grows
with (at minimum) the VCPU count (bug 1447027, bug 1819292) and the
guest physical address space size (mainly RAM, but also the 64-bit PCI
MMIO aperture) (bug 1468526). As of QEMU v5.1.0-rc2, the default TSEG
size for the Q35 machine type is 16MB, which is sufficient for most
setups, but not all. When TSEG runs out, the SMRAM exhaustion is
reported by a firmware assertion, and OVMF hangs.

This can be prevented by sizing TSEG correctly in advance, via "-global
mch.extended-tseg-mbytes=...". Libvirt exposes this property (bug
1469338), but Dan made the point that it is not comfortable to use; the
calculation should be automated (in QEMU or, less likely, in libvirt).
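
(For context, this is roughly what the manual workaround looks like today; the
48MB value and the pflash file names below are purely illustrative. Libvirt's
counterpart, per bug 1469338, is -- if I recall correctly -- the <tseg>
subelement of <smm> under <features>.)

> # Sketch only (all values illustrative): boot a Q35/OVMF guest with an
> # explicitly enlarged TSEG instead of relying on the 16MB default.
> qemu-system-x86_64 \
>     -machine q35,accel=kvm,smm=on \
>     -global mch.extended-tseg-mbytes=48 \
>     -global driver=cfi.pflash01,property=secure,value=on \
>     -drive if=pflash,unit=0,format=raw,readonly=on,file=OVMF_CODE.fd \
>     -drive if=pflash,unit=1,format=raw,file=OVMF_VARS.fd \
>     -smp 256 \
>     -m 1024G \
>     -nographic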

A simple formula remains elusive. I've made some measurements and will
attach a table.

* Version-Release number of selected component (if applicable):
- current upstream QEMU (v5.1.0-rc2-33-gfd3cd581f9dc)
- current upstream edk2 (e557442e3f7e)

* How reproducible:
- Always

* Steps to Reproduce:
- See any one of bug 1447027, bug 1468526, bug 1819292

* Actual results:
- OVMF boot hangs with various ASSERTs reporting SMRAM exhaustion

* Expected results:
- OVMF boot succeeds without manually changing the TSEG size on the QEMU
  cmdline or in the libvirtd domain XML.

* Additional info:


(1) The attached table has six columns (all values are decimal, despite
being zero-padded on the left):

- column #1:

Sum of the X and Y coordinates in the test matrix.

The X coordinate is log2(VCPU count).

The Y coordinate is (log2(RAM size in bytes)-30).

Keeping the sum constant (that is, keeping the sum of the two exponents
constant) identifies a diagonal in the matrix (bottom left to top right,
with the top left corner being (0, 0)); incrementing the sum by 1
identifies the next diagonal. This allows for a diagonal-wise traversal
of the matrix, where each diagonal is considered more "demanding" than
the previous one. For example, 8 VCPUs (X=3) with 4 GB of RAM (Y=2) lie
on diagonal 5.

This column is not relevant to the results; it is just how the tests
were run / organized.

- column #2: VCPU count
- column #3: guest RAM size in MB
- column #4: whether the "pdpe1gb" CPU feature flag was enabled or
             disabled

- column #5:

The first whole power-of-two TSEG size (in MB) that enabled the guest to
boot; the smallest size tried was 4MB, as 2MB is not sufficient even for
launching the bare-bones edk2 SMM infrastructure.

- column #6:

Result (always "good" in the table; while running the tests, different
values could be there).
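
(As a convenience for whoever analyses the data: assuming a local copy of the
attachment named tseg-table.txt and the six-column layout above, the rows that
exceed the current 16MB default can be pulled out with a one-liner like this:)

> # Print only the cases whose minimum sufficient TSEG (column #5) exceeds the
> # current 16MB default; "tseg-table.txt" is an illustrative local file name.
> awk '$5 + 0 > 16' tseg-table.txt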


(2) Test methodology:

A horrible script was written and used for generating the test cases in
the first place (basically columns #1 through #4 above). The VCPU count
would double from 1 up to 512 (2^9) and the RAM size would double from
1G (2^30) up to 16TB (2^44), both inclusive. I'm including the script
here.

> for ((VCPU_POW=0; VCPU_POW<10; VCPU_POW++)); do
>   VCPUS=$((2**VCPU_POW))
>   for ((MEM_POW=30; MEM_POW<45; MEM_POW++)); do
>     MEM_MB=$((2**(MEM_POW-20)))
>     for PDPE1GB in "-pdpe1gb" "+pdpe1gb"; do
>       printf '%02u %03lu %08lu %s\n' \
>         $((VCPU_POW+(MEM_POW-30))) $VCPUS $MEM_MB "$PDPE1GB"
>     done
>   done
> done \
> | sort -n
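
(For completeness: the generator's output is what the second script below
reads, i.e. something like the following, where "gen-cases.sh" is just an
illustrative name for the script above:)

> bash gen-cases.sh > cases.txt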

Another horrible (and I mean *horrible*) script was used to read back
the cases (one per line), and execute them one by one. I'm including it
here too. You've been warned.

> #!/bin/bash
> set -e -u -C
>
> QEMU=/opt/qemu-installed-optimized/bin/qemu-system-x86_64
> KERNEL=/boot/vmlinuz-3.10.0-1127.18.2.el7.x86_64
>
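> # cases.txt holds one "diagonal vcpus mem_mb +/-pdpe1gb" case per line (the
> # output of the generator script above); for each case, power-of-two TSEG
> # sizes from 4 MB to 1024 MB are tried until the guest boots.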
> while read DIST VCPUS MEM_MB PDPE1GB; do
>   for ((TSEG_POW=2; TSEG_POW<11; TSEG_POW++)); do
>     TSEG_MB=$(printf '%04u' $((2**TSEG_POW)))
>     FWLOG=fwlog.$DIST.$VCPUS.$MEM_MB.$PDPE1GB.$TSEG_MB
>     SERIAL=serial.$DIST.$VCPUS.$MEM_MB.$PDPE1GB.$TSEG_MB
>     ERR=err.$DIST.$VCPUS.$MEM_MB.$PDPE1GB.$TSEG_MB
>
>     rm -f -- "$FWLOG" "$SERIAL" "$ERR"
>     if ! $QEMU \
>              -nodefaults \
>              -nographic \
>              -machine q35,accel=kvm,smm=on,kernel-irqchip=split \
>              -smp $((10#$VCPUS)) \
>              -m $((10#$MEM_MB)) \
>              -cpu host,"$PDPE1GB" \
>              -global driver=cfi.pflash01,property=secure,value=on \
>              -drive if=pflash,unit=0,format=raw,readonly=on,file=/root/tseg-table/OVMF_CODE.4m.3264.fd \
>              -drive if=pflash,unit=1,format=raw,snapshot=on,file=/root/tseg-table/OVMF_VARS.4m.fd \
>              -chardev file,id=debugfile,path=$FWLOG \
>              -device isa-debugcon,iobase=0x402,chardev=debugfile \
>              -global mch.extended-tseg-mbytes=$((10#$TSEG_MB)) \
>              -chardev file,id=serial,path=$SERIAL \
>              -serial chardev:serial \
>              -device intel-iommu,intremap=on,eim=on \
>              -kernel $KERNEL \
>              -append "ignore_loglevel earlyprintk=ttyS0,115200n8 console=ttyS0,115200n8 efi=debug initcall_blacklist=fw_cfg_sysfs_init" \
>              -pidfile qemu.pid \
>              -daemonize \
>              >"$ERR" 2>&1; then
>       RESULT=startup-failed
>       break
>     fi
>
>     QEMU_PID=$(< qemu.pid)
>
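>     # Poll the logs: an ASSERT in the OVMF debug log means TSEG was too
>     # small; a guest kernel panic on the serial console (there is no rootfs)
>     # means the boot got far enough, so this TSEG size counts as "good".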
>     RESULT=
>     while [ -z "$RESULT" ]; do
>       if egrep \
>              -q -w \
>              'ASSERT|ASSERT_EFI_ERROR|ASSERT_RETURN_ERROR' \
>              $FWLOG \
>              2>/dev/null; then
>         RESULT=tseg-too-small
>       elif grep -q "Kernel panic" $SERIAL 2>/dev/null; then
>         RESULT=good
>       else
>         sleep 1
>       fi
>     done
>
>     kill $QEMU_PID
>     while kill -0 $QEMU_PID 2>/dev/null; do
>       sleep 1
>     done
>
>     if [ good = "$RESULT" ]; then
>       break;
>     fi
>   done
>   echo "$DIST $VCPUS $MEM_MB $PDPE1GB $TSEG_MB $RESULT"
> done < cases.txt >> cases.out.txt

The result for a test case is "good" when the firmware boots and the
guest kernel (loaded via fw_cfg) gets far enough to panic due to the
lack of a rootfs (initrd). In this case the smallest TSEG size found is
recorded.

The result is "tseg-too-small" if even 1GB of TSEG is not sufficient for
reaching a good result. (Never seen, nor expected, but the loop had to
be terminated somehow in that case too.)

The result is "startup-failed" if QEMU refused to start up.


(3) Test execution

Running this script (in Beaker) was an exercise in frustration and
tedium.

(3a) When reaching VCPU count 256, QEMU wouldn't start up without
"-device intel-iommu,intremap=on,eim=on".

(3b) With large (~1TB) guest RAM sizes, the guest kernel would
consistently crash in the fw_cfg guest kernel driver (which is a
built-in driver, not a module). Hence the
"initcall_blacklist=fw_cfg_sysfs_init" kernel cmdline option. Initially
I filtered not for just any panic but specifically for the failure to
mount the rootfs, so other panics would simply hang the script (in the
"sleep 1" loop). Yay.

(3c) When using large VCPU counts (>= 256 or thereabouts), the guest
kernel would occasionally hang before emitting anything at all to the
serial console (despite "earlyprintk"). This would again hang the script
(in the "sleep 1" loop).

This was completely random. Getting into "VCPU overcommit" domain
(>=160, which was the PCPU count on machine [2], see below) seemed to
contribute to the issue, statistically speaking.

The above test script hangs were the reason why I generated the test
cases separately, and processed / executed them in a different script.
This way I could stop the executor script at any point, trim the list of
remaining test cases, and restart execution. An unwelcome janitorial
activity for 1-2 nights. The particular fun bit was when this occurred
with a VCPU count of 384, where (due to heavy VCPU overcommit) OVMF
itself took 30+ minutes to start up (*with* the fix for bug 1861718
incorporated).

(3d) Beaker woes.

I searched Beaker for machines available to me (both permission-wise and
for immediate reservation) that had 48+ PCPUs. I'd then review the hits
for combined RAM + disk size (expecting to use most of the disk as extra
swap).

For starters, machine [1], used by Eduardo for bug 1819292, is not
available to me in Beaker (no permission). Worse than that, its combined
RAM + disk size is smaller than that of machine [2], which *was*
available to me. So I used [2]. No other candidate machines even entered
the game.

Alas, machine [2] in turn does not permit installing RHEL8. You read
that right. I had to use RHEL7 because of this.

Using RHEL7 forced me to limit the max VCPU count of the test corpus to
384, from 512. (I edited the test case list manually -- so that's why
you see 384 and not 512 as the highest VCPU count in column #2 of the
output too, in the attachment.)

Furthermore, although machine [2] *had* more RAM + disk combined than
machine [1], it still only sufficed for 1.5 TB of guest RAM. So that's
why you don't see >= 2TB values in column #3 of the table, only 1.5 TB
after 1 TB.

Whoever analyses the table will have to guesstrapolate from the
parameter space that was covered, to larger VCPU counts and RAM sizes.

(Because TSEG defaults to 16MB, only the lines with 32MB indicate an
immediate practical problem. My intent with starting the TSEG scaling at
the artificially lowered 4MB was to support extrapolation -- the "high
end" is constrained by host hardware limits, so I wanted to make the
"low end" finer-grained than it would otherwise appear in practice. I
hope that seeing actual 4MB and 8MB values, which in practice would be
masked by the 16MB default, will also help in determining a trend.)


Note: I'm not volunteering to implement the QEMU (let alone libvirtd)
feature; I'm merely providing the data I managed to collect.

Comment 2 Laszlo Ersek 2020-08-04 21:52:42 UTC
Forgot to mention, regarding "use most of the disk as extra swap":

the "fallocate" example in the mkswap(8) manual does not work on xfs. When "swapon" is invoked subsequently, the kernel complains (in dmesg) that the file has "holes". The point of "fallocate" is *exactly* to prevent holes in the file, without having to write every single byte of it. So, it doesn't work. The internet knows this; I was led to multiple CentOS/RHEL themed blog posts that recommended "dd" instead.

So then I spent the next few tens of minutes waiting on "dd" to write a 700 GB swap file, on machine [2], so I'd have enough RAM+swap combined for 1.5TB of guest RAM. As I said, an exercise in frustration and tedium.
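
For the record, the dd-based setup is roughly the following (path, block size
and count are illustrative, not the exact commands I ran):

> # Create and enable a 700 GB swap file with dd (no holes); per the above,
> # swapon on xfs complains about fallocate-created files.
> dd if=/dev/zero of=/swapfile700g bs=1M count=$((700 * 1024))
> chmod 600 /swapfile700g
> mkswap /swapfile700g
> swapon /swapfile700g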

Comment 3 Daniel Berrangé 2020-08-05 17:29:57 UTC
Created attachment 1710546 [details]
Illustration of PDP1GB results.

Comment 4 Daniel Berrangé 2020-08-05 17:30:21 UTC
Created attachment 1710547 [details]
Illustration of no-PDP1GB results.

Comment 5 Daniel Berrangé 2020-08-05 17:37:36 UTC
I used gnuplot to attempt to provide some illustration of the data, to aid in coming up with a viable rule. Bear in mind the grid lines and heights are interpolated, since gnuplot draws grid points at fixed intervals, but our data is at power-of-2 intervals. I scaled RAM to GB instead of MB.

The data is kind of noisy, so I'm wondering how stable the results obtained are. 

If we assume they are stable though, considering both the PDP1GB and non-PDP1GB results together, the data suggests a possibly viable rule:

 - Use a 32 MB TSEG if either more than 350 CPUs are present, or more than 768 GB of RAM is present.
 - Otherwise the default 16 MB TSEG is sufficient.
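
Expressed as a quick shell sketch (not an implemented default anywhere; the
thresholds are just the ones from the bullets above, with RAM taken in MB):

> # Proposed heuristic: pick_tseg_mb <vcpus> <ram_mb> prints the suggested
> # TSEG size in MB.
> pick_tseg_mb() {
>     local vcpus=$1 ram_mb=$2
>     if [ "$vcpus" -gt 350 ] || [ "$ram_mb" -gt $((768 * 1024)) ]; then
>         echo 32
>     else
>         echo 16
>     fi
> }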

Comment 6 Laszlo Ersek 2020-08-06 00:23:49 UTC
Thanks Dan for the plot! I had thought of it, but really only as "it would be nice if someone plotted this".

I consider the results stable. And indeed the plots convey a cryptic message :/

Your rule sounds good to me, but it should be tested in the environment where bug 1788991 is also going to be tested. (The existence of bug 1788991 suggests such an environment *exists*.) I don't know if we'll need another bump at higher boundaries. Thank you!

Comment 7 John Ferlan 2020-08-13 19:57:51 UTC
Amnon - passing this your way since, as I read it, the "global mch.extended-tseg-mbytes" for q35 needs some sort of adjustment.

Comment 9 John Ferlan 2021-09-09 13:27:43 UTC
Bulk update: Move RHEL-AV bugs to RHEL9. If necessary to resolve in RHEL8, then clone to the current RHEL8 release.

Comment 11 RHEL Program Management 2022-02-04 07:27:14 UTC
After evaluating this issue, there are no plans to address it further or fix it in an upcoming release.  Therefore, it is being closed.  If plans change such that this issue will be fixed in an upcoming release, then the bug can be reopened.

Comment 12 Nitesh Narayan Lal 2022-02-04 12:55:47 UTC
Re-opening as we still need to fix the issue reported in this BZ.

Comment 16 RHEL Program Management 2022-08-04 07:28:03 UTC
After evaluating this issue, there are no plans to address it further or fix it in an upcoming release.  Therefore, it is being closed.  If plans change such that this issue will be fixed in an upcoming release, then the bug can be reopened.

Comment 26 John Ferlan 2022-10-16 12:56:51 UTC
Setting ITR=9.2.0 - we need to come to a resolution/decision on this one way or another - fix it or close as won't/can't fix.

Comment 32 RHEL Program Management 2023-02-04 07:27:46 UTC
After evaluating this issue, there are no plans to address it further or fix it in an upcoming release.  Therefore, it is being closed.  If plans change such that this issue will be fixed in an upcoming release, then the bug can be reopened.

Comment 37 RHEL Program Management 2023-08-06 07:28:16 UTC
After evaluating this issue, there are no plans to address it further or fix it in an upcoming release.  Therefore, it is being closed.  If plans change such that this issue will be fixed in an upcoming release, then the bug can be reopened.

Comment 41 RHEL Program Management 2023-09-22 17:30:55 UTC
Issue migration from Bugzilla to Jira is in process at this time. This will be the last message in Jira copied from the Bugzilla bug.

Comment 42 RHEL Program Management 2023-09-22 17:32:31 UTC
This BZ has been automatically migrated to the issues.redhat.com Red Hat Issue Tracker. All future work related to this report will be managed there.

Due to differences in account names between systems, some fields were not replicated.  Be sure to add yourself to Jira issue's "Watchers" field to continue receiving updates and add others to the "Need Info From" field to continue requesting information.

To find the migrated issue, look in the "Links" section for a direct link to the new issue location. The issue key will have an icon of 2 footprints next to it, and begin with "RHEL-" followed by an integer.  You can also find this issue by visiting https://issues.redhat.com/issues/?jql= and searching the "Bugzilla Bug" field for this BZ's number, e.g. a search like:

"Bugzilla Bug" = 1234567

In the event you have trouble locating or viewing this issue, you can file an issue by sending mail to rh-issues. You can also visit https://access.redhat.com/articles/7032570 for general account information.

