Bug 1468526 - >1TB RAM support
Summary: >1TB RAM support
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Linux 7
Classification: Red Hat
Component: ovmf
Version: 7.5
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: medium
Target Milestone: rc
Target Release: ---
Assignee: Laszlo Ersek
QA Contact: FuXiangChun
URL:
Whiteboard:
Depends On: 1447027 ovmf-rebase-rhel-7.5
Blocks:
 
Reported: 2017-07-07 10:26 UTC by Dr. David Alan Gilbert
Modified: 2018-04-10 16:30 UTC (History)
CC List: 9 users

Fixed In Version: ovmf-20171011-1.git92d07e48907f.el7
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2018-04-10 16:28:00 UTC
Target Upstream Version:
Embargoed:


Attachments (Terms of Use)
Domain XML for comment 9 (1.50 KB, application/x-xz) -- 2017-07-11 02:53 UTC, Laszlo Ersek
OVMF boot log for comment 9 (20.45 KB, application/x-xz) -- 2017-07-11 02:53 UTC, Laszlo Ersek
Guest dmesg for comment 9 (14.36 KB, application/x-xz) -- 2017-07-11 02:54 UTC, Laszlo Ersek
OVMF S3 resume log from comment 9 (2.96 KB, application/x-xz) -- 2017-07-11 02:55 UTC, Laszlo Ersek
guest kernel S3 log (2.73 KB, application/x-xz) -- 2017-07-11 02:55 UTC, Laszlo Ersek
2T memory ovmf log (65.61 KB, text/plain) -- 2017-12-05 15:47 UTC, FuXiangChun
ovmf-4T log (11.49 MB, application/x-gzip) -- 2017-12-06 17:39 UTC, FuXiangChun


Links
System ID Private Priority Status Summary Last Updated
Red Hat Bugzilla 1469338 0 medium CLOSED RFE: expose Q35 extended TSEG size in domain XML element or attribute 2021-02-22 00:41:40 UTC
Red Hat Bugzilla 1488247 1 None None None 2021-01-20 06:05:38 UTC
Red Hat Product Errata RHBA-2018:0902 0 None None None 2018-04-10 16:30:13 UTC

Internal Links: 1469338 1488247 1866110

Description Dr. David Alan Gilbert 2017-07-07 10:26:21 UTC
Description of problem:
We've got downstream patches for supporting >1TB RAM in our qemu and our SeaBIOS; Laszlo reckons we haven't got the equivalent in OVMF.

Version-Release number of selected component (if applicable):


How reproducible:


Steps to Reproduce:
1.
2.
3.

Actual results:


Expected results:


Additional info:

Comment 8 Laszlo Ersek 2017-07-11 01:59:37 UTC
(1) Reproducing the problem:

- qemu-kvm-rhev-2.9.0-16.el7.x86_64
- machine type: pc-q35-rhel7.4.0
- OVMF-20170228-5.gitc325e41585e3.el7.noarch

(1a) Specifying 1026 GB guest RAM, the problem is triggered. Q35 puts 2GB
     RAM in the 32-bit address space, and 1024 GB above it. This means that
     the CMOS would have to express 0x100_0000 64KB chunks for the high
     1024GB, which cannot be represented in the 24-bit (= 6-nibble) CMOS
     register that OVMF reads (and upstream QEMU sets BTW).

     The MEMMAP command of the UEFI shell reports the total RAM size as 2GB,
     because all six nibbles read from the CMOS are 0.
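
     (As an illustrative aside -- a small Python sketch of the arithmetic
     above, not firmware code -- the 64KB-chunk counts can be checked like
     this; the 2GB low / 1024GB high split for a 1026 GB guest is taken from
     (1a):)

     GB = 1024 ** 3
     MB = 1024 ** 2
     CHUNK = 64 * 1024                 # the CMOS field counts 64KB units
     CMOS_MAX = (1 << 24) - 1          # 24 bits = 6 nibbles

     high_ram = 1024 * GB              # RAM above 4GB for a 1026 GB Q35 guest
     print(hex(high_ram // CHUNK))           # 0x1000000 -- needs 25 bits
     print(high_ram // CHUNK <= CMOS_MAX)    # False: not representable in CMOS
     # dropping 1MB, as in (1b) below, brings the count back into range:
     print(hex((high_ram - MB) // CHUNK))    # 0xfffff0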

(1b) If we decrease the guest RAM size by 1MB, to 1026*1024-1 MB ==
     1,050,623 MB, the number of 64KB chunks becomes 0xFF_FFF0. OVMF reads
     this correctly from the CMOS.

     However, this triggers SMRAM exhaustion in PiSmmCpuDxeSmm.efi (using
     the above package versions, we have 8MB of SMRAM):

> 1GPageTableSupport - 0x0
> PcdCpuSmmStaticPageTable - 0x1
> PhysicalAddressBits - 0x29
> ASSERT
> /builddir/build/BUILD/ovmf-c325e41585e3/UefiCpuPkg/PiSmmCpuDxeSmm/X64/PageTbl.c(210):
> PageDirectoryEntry != ((void *) 0)

(2) Excursion: while the subject of this BZ is problem (1a), we have to
    mitigate (1b) first, so that we can go above 1026 GB guest RAM and
    verify the fix for problem (1a).

(2a) We can determine the SMRAM footprint needed for such large memory
     amounts from two sources:

     - The commit message on
       <https://github.com/tianocore/edk2/commit/28b020b5de1e>, in which
       Jiewen provided some SMRAM footprint examples back then, at my
       request.

     - The code fingered by the failed ASSERT itself.

     With 1026 GB RAM (and by default 32 GB of 64-bit PCI MMIO aperture),
     we're looking at an address width of 41 bits. The SetStaticPageTable()
     function -- which runs out of SMRAM above -- maps the entire address
     space using 2MB pages (if the guest supports 1GB pages, then those are
     used and less SMRAM is needed, but we should make a pessimistic
     estimate). A 2MB page covers 21 bits, and the remaining (41-21)=20 bits
     are subdivided (from least to most significant) 9+9+2:

     - On the lowest level, a 4KB page is needed for a page directory
       covering 9 bits (512 PDEs).

     - On the middle level, a 4KB page is needed for a page directory pointer
       table, covering 9 bits (512 PDPTEs). Meaning, up to and including the
       middle level, we need 4KB (for the PDPT) plus 512 * 4KB (for the
       pointed-to PDs).

     - On the top level, a 4KB page is needed for the single PML4 table, from
       which we use 4 entries (of the 512 possible) for covering the
       remaining 2 bits. This means that up to and including the top level,
       we need 4KB + 4 * (4KB + 512 * 4KB) == 8,409,088 bytes (a bit more
       than 8MB).

     - The Customer Portal article "Virtualization limits for Red Hat
       Enterprise Virtualization" at
       <https://access.redhat.com/articles/906543> (last modified: April 10
       2017 at 9:08 AM) states that under RHV4, "Maximum memory in
       virtualized guest" is 4TB.

       For every other 1TB beyond the initial 1TB that we used above for the
       calculation, we need 4 more entries in the PML4 table, meaning 4 *
       (4KB + 512 * 4KB) = 8,404,992 additional bytes of SMRAM, for paging
       structures.

       We can roughly say, adding 1TB of guest RAM requires 8MB more SMRAM.
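
     (Aside: the estimate above can be reproduced with a short Python sketch.
     It only illustrates the arithmetic -- pessimistically assuming 2MB pages
     throughout, i.e. no 1GB page support -- and is not code from edk2:)

     KB = 1024

     def smm_page_table_bytes(address_bits):
         # A 2MB page covers 21 bits; the rest splits 9 (PD) + 9 (PDPT) + N (PML4).
         pml4_entries = 1 << max(address_bits - 21 - 18, 0)
         # one PML4 page, plus one PDPT page and 512 PD pages per used PML4 entry
         return 4 * KB + pml4_entries * (4 * KB + 512 * 4 * KB)

     print(smm_page_table_bytes(41))   # 8409088  -- the "a bit more than 8MB" above
     print(smm_page_table_bytes(42))   # 16814080 -- one more address bit means 4 more
                                       # PML4 entries, i.e. roughly 8MB of extra SMRAM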

(2b) Given that bug 1447027 is now fixed in upstream QEMU and in upstream
     edk2 as well, we can experiment with the SMRAM sizes in practice. For
     this, the following snippet is needed in the domain XML:

     <domain type='kvm'
      xmlns:qemu='http://libvirt.org/schemas/domain/qemu/1.0'>
       <qemu:commandline>
         <qemu:arg value='-global'/>
         <qemu:arg value='mch.extended-tseg-mbytes=N'/>
       </qemu:commandline>
     </domain>

     For the required upstream QEMU and OVMF commits, refer to bug 1447027.
     The required machine type is "pc-q35-2.10". Using said components, the
     extended TSEG defaults to 16MB (double the earlier maximum, which is
     8MB).

     - Starting the domain with such a TSEG and 1026GB of RAM, we get an
       iPXE splat (note "1af41000.efidrv"):

> !!!! X64 Exception Type - 0E(#PF - Page-Fault)  CPU Apic ID - 00000000
> !!!!
> ExceptionData - 000000000000000B  I:0 R:1 U:0 W:1 P:1 PK:0 S:0
> RIP  - 000000007D0B1C74, CS  - 0000000000000038, RFLAGS - 0000000000010206
> RAX  - 0000000000000000, RCX - 0000000000000014, RDX - 000001080000C014
> RBX  - 000000007D0C1670, RSP - 000000007EEC66E8, RBP - 000000007D0C1680
> RSI  - 000000007D0C1680, RDI - 000000007D0C1670
> R8   - 0000000000000000, R9  - 0000000000000000, R10 - 000000007D0BE680
> R11  - 000000007D0BE940, R12 - 000000007D0C1660, R13 - 0000000000000060
> R14  - 0000000000000084, R15 - 0000000000000070
> DS   - 0000000000000030, ES  - 0000000000000030, FS  - 0000000000000030
> GS   - 0000000000000030, SS  - 0000000000000030
> CR0  - 0000000080010033, CR2 - 000001080000C014, CR3 - 000000007E6A2000
> CR4  - 0000000000000668, CR8 - 0000000000000000
> DR0  - 0000000000000000, DR1 - 0000000000000000, DR2 - 0000000000000000
> DR3  - 0000000000000000, DR6 - 00000000FFFF0FF0, DR7 - 0000000000000400
> GDTR - 000000007E68FA98 0000000000000047, LDTR - 0000000000000000
> IDTR - 000000007DEC5018 0000000000000FFF,   TR - 0000000000000000
> FXSAVE_STATE - 000000007EEC6340
> !!!! Find image 1af41000.efidrv (ImageBase=000000007D0A0000,
> EntryPoint=000000007D0A7005) !!!!

       Debugging this issue is left as an exercise for the reader; for now
       I've disabled the iPXE oprom with <rom bar='off'/> under <interface>,
       and let the built-in VirtioNetDxe bind the virtio-net NIC.

       That way, the UEFI shell is reached fine, and the MEMMAP shell
       command reports approx. 1025.9 GB free memory. I checked 8MB, 9MB,
       10MB, 11MB and 12MB extended TSEG sizes individually, and 12MB is the
       first that succeeds; the smaller ones all prevent the firmware from
       booting due to SMRAM exhaustion at different places (the exhaustion
       progresses to later and later points as the SMRAM size grows).

       Therefore problem (1b) has been mitigated.

(3) Testing the candidate patch for problem (1a) -- which uses the
    "etc/e820" fw_cfg as a preference to the CMOS -- , in parallel with
    growing SMRAM footprint:

(3a) Specifying 2TB of guest RAM, 16MB of SMRAM is insufficient. I
     "bisected" the 16..32 MB range, and the first SMRAM size that allowed
     the firmware to boot the UEFI shell was 20MB.

     This confirms the calculation in (2a) -- we went from 1TB to 2TB guest
     RAM, and had to use 20MB of TSEG rather than 12MB.

     Regarding the subject of this BZ, problem (1a), the UEFI shell reports
     approx. 2047.97 GB total memory. The candidate patch seems to work.

(3b) I couldn't test 4TB of guest RAM. When I tried that, still using Dave's
     trick from comment 5, the host (having 24GB phys RAM) seemed to lock
     up. (Dave said in comment 7 that he couldn't get more than 2TB to
     work.) After a while I started seeing  "task XXXX:PID blocked for more
     than 120 seconds" messages from the host kernel, with stack traces
     indicating swap activity. I forcefully rebooted the host.

(3c) On this host, I cannot actually install a guest OS with 1026GB or more
     RAM. This host only has 40 physical address bits, and that's not enough
     for more than 1TB of address space -- EPT just stops working, and guest
     Linux immediately hits that issue. (The UEFI shell is reached fine
     because it doesn't try to massage the missing phys address bits.)

     Disabling EPT (and going with the less performant shadow paging in KVM)
     can work around this, speaking from past (SMM-less) experience.
     However, SMM emulation doesn't work without EPT at the moment, see bug
     1348092.

     So, for installing an actual guest OS with >=1TB address space, I'd
     need a box with at least 41 phys address bits.
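
     (A quick numeric sanity check, as a sketch -- the exact memory layout is
     simplified here:)

     # 40 physical address bits cover exactly 1 TiB of address space.
     print(2 ** 40 == 1024 ** 4)            # True
     # Rough top of this guest's address space: high RAM ends at 4GB + 1024GB,
     # and the 32GB 64-bit PCI aperture sits above that.
     top = (4 + 1024 + 32) * 1024 ** 3
     print((top - 1).bit_length())          # 41 -> at least 41 phys address bits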

(4) Further SMRAM size considerations

(4a) SMRAM footprint grows with both VCPU
     count (see bug 1447027) and guest RAM size (see this bug).

     (These needs add up rather than multiply -- my 2TB testing in
     (3a) was indifferent to using 4 vs 16 VCPUs.)

     Providing sane defaults is a hard question here, especially if we
     consider 1GB paging as well. I think we'll need a libvirt bug for
     exposing "-global mch.extended-tseg-mbytes=N", and then separate
     documentation for tweaking the value as necessary. For everyday
     purposes, the default 16MB extended TSEG (with pc-q35-2.10) should be
     plenty; it accommodates 272 VCPUs (tested with 5GB of guest RAM, in bug
     1447027).

     For the currently published RHV4 limits (see link above, 240 VCPUs and
     4TB guest RAM), 16MB SMRAM for the VCPUs and 4*8MB=32MB SMRAM for 4TB
     guest RAM should suffice (48MB SMRAM total).

(4b) Edk2 has a knob called "PcdCpuSmmStaticPageTable". From
     "UefiCpuPkg/UefiCpuPkg.dec":

>   ## Indicates if SMM uses static page table.
>   #  If enabled, SMM will not use on-demand paging. SMM will build static
>   #  page table for all memory.<BR><BR>
>   #  This flag only impacts X64 build, because SMM alway builds static
>   #  page table for IA32.
>   #   TRUE  - SMM uses static page table for all memory.<BR>
>   #   FALSE - SMM uses static page table for below 4G memory and use
>   #           on-demand paging for above 4G memory.<BR>
>   # @Prompt Use static page table for all memory in SMM.
>   gUefiCpuPkgTokenSpaceGuid.PcdCpuSmmStaticPageTable|TRUE|BOOLEAN|0x3213210D

     We should not disable this PCD (i.e., we shouldn't opt for on-demand
     paging):

     - The savings are negligible (again, without 1GB paging support -- the
       worst case -- the impact of static paging is ~8MB of TSEG needed per
       1TB of guest RAM. The TSEG is chipped away from guest RAM.)

     - When Jiewen was working on SMM memory protection and about to add
       this knob, I asked him to describe its effects. He wrote in
       <http://mid.mail-archive.com/74D8A39837DF1E4DA445A8C0B3885C50386BD98A@shsmsx102.ccr.corp.intel.com>,

> If static page is supported, page table is RO. [...] If we use dynamic
> paging, we can still provide *partial* protection. And hope page table is
> not modified by other component.

       I don't think we should weaken any such protection for relatively
       negligible memory savings.

Comment 9 Laszlo Ersek 2017-07-11 02:44:22 UTC
Update to point (3c) from comment 8: I got access to the host mentioned in
comment 4. That box has 46 physical bits (amazing!) and enough disk space
for a ~4.3TB swap file. (So I did create a real, non-sparse swap file.)

I tried installing a domain with 4TB RAM, but as soon as the guest OS was
booted, it actually hit the swap file. Not wanting to wait for weeks, I
power-cycled the machine.

I lowered the RAM size to 1026 GB (see point (1a) in comment 8). This way
the guest installed fine. At the end of the installation, there was 11GB
swap space in use, with QEMU having 1.010TB for VIRT and 0.021TB for RES.

Host:                see comment 4
Host kernel:         3.10.0-693.el7.x86_64
libvirt:             3.2.0-14.el7.x86_64
QEMU:                upstream v2.9.0-1880-g94c5665
OVMF:                upstream edk2 built at commit 60e85a39fe49 *plus*
                     candidate patch for this BZ
Domain XML:          see attached (ovmf.rhel7.q35.xml.xz)
OVMF boot log:       see attached (ovmf.rhel7.q35.boot.log.xz)
Guest OS:            "Minimal install" from
                     "RHEL-7.4-20170630.1-Server-x86_64-dvd1.iso" (via
                     symlink called "RHEL-7-Server-x86_64-dvd1.iso")
Guest dmesg:         see attached (guest-dmesg.txt.xz)
OVMF S3 resume log:  see attached (ovmf.rhel7.q35.s3.log.xz)
guest kernel S3 log: see attached (guest-s3-dmesg.txt.xz)

(NOTE: S3 is not supported on RHEL7 hosts; this was just for upstream
testing.)

Relevant OVMF boot log entries (visually compressed here a bit):

Comment 10 Laszlo Ersek 2017-07-11 02:48:37 UTC
(In reply to Laszlo Ersek from comment #9)

> Relevant OVMF boot log entries (visually compressed here a bit):

> E820HighRamIterate: Base=0xFEFFC000 Length=0x4000 Type=2
> E820HighRamIterate: Base=0x0 Length=0x80000000 Type=1
> E820HighRamIterate: Base=0x100000000 Length=0x10000000000 Type=1
> E820HighRamFindHighestExclusiveAddress: MaxAddress=0x10100000000
> GetFirstNonAddress: Pci64Base=0x10800000000 Pci64Size=0x800000000
> MaxCpuCountInitialization: QEMU reports 48 processor(s)
> Q35TsegMbytesInitialization: QEMU offers an extended TSEG (16 MB)
> PublishPeiMemory: mPhysMemAddressWidth=41 PeiMemoryCap=73748 KB
> PeiInstallPeiMemory MemoryBegin 0x7A5FB000, MemoryLength 0x4805000
> E820HighRamIterate: Base=0xFEFFC000 Length=0x4000 Type=2
> E820HighRamIterate: Base=0x0 Length=0x80000000 Type=1
> E820HighRamIterate: Base=0x100000000 Length=0x10000000000 Type=1
> E820HighRamAddMemoryHob: [0x100000000, 0x10100000000)

And, for this test, the default 16MB extended TSEG size was used.
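
(Aside: the mPhysMemAddressWidth=41 value can be reproduced from the logged
Pci64Base/Pci64Size values with a trivial sketch; this approximates, rather
than reproduces, OVMF's actual code path:)

  pci64_base = 0x10800000000               # "GetFirstNonAddress" line above
  pci64_size = 0x800000000
  end = pci64_base + pci64_size
  print(hex(end))                          # 0x11000000000
  print((end - 1).bit_length())            # 41 == mPhysMemAddressWidth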

Comment 11 Laszlo Ersek 2017-07-11 02:53:15 UTC
Created attachment 1296032 [details]
Domain XML for comment 9

Comment 12 Laszlo Ersek 2017-07-11 02:53:52 UTC
Created attachment 1296033 [details]
OVMF boot log for comment 9

Comment 13 Laszlo Ersek 2017-07-11 02:54:26 UTC
Created attachment 1296034 [details]
Guest dmesg for comment 9

Comment 14 Laszlo Ersek 2017-07-11 02:55:02 UTC
Created attachment 1296035 [details]
OVMF S3 resume log from comment 9

Comment 15 Laszlo Ersek 2017-07-11 02:55:40 UTC
Created attachment 1296036 [details]
guest kernel S3 log

Comment 16 Laszlo Ersek 2017-07-11 02:56:17 UTC
(In reply to Laszlo Ersek from comment #15)
> Created attachment 1296036 [details]
> guest kernel S3 log

... for comment 9

Comment 17 Laszlo Ersek 2017-07-11 03:23:54 UTC
Posted upstream patch (called "candidate patch" in comment 8 and comment 9):
[edk2] [PATCH 0/1] OvmfPkg/PlatformPei: support >=1TB high RAM, and
                   discontiguous high RAM
Message-Id: <20170711032231.29280-1-lersek>
https://lists.01.org/pipermail/edk2-devel/2017-July/012304.html

Comment 18 Laszlo Ersek 2017-08-04 23:02:56 UTC
Posted upstream v2:
[edk2] [PATCH v2 0/1] OvmfPkg/PlatformPei: support >=1TB high RAM, and
                      discontiguous high RAM
Message-Id: <20170804230043.12977-1-lersek>
https://lists.01.org/pipermail/edk2-devel/2017-August/012942.html

Comment 19 Laszlo Ersek 2017-08-05 01:58:16 UTC
Upstream commit 1fceaddb12b5 ("OvmfPkg/PlatformPei: support >=1TB high RAM, and discontiguous high RAM", 2017-07-08).

Comment 21 FuXiangChun 2017-12-05 15:46:07 UTC
QE tested 1T memory; the RHEL7.5 guest works well. But the guest fails to boot when using 2T memory. The OVMF log is uploaded as an attachment.

In addition, host memory is sufficient, as shown below.
# free -g
              total        used        free      shared  buff/cache   available
Mem:          12094          84       12004           0           5       12006
Swap:             3           0           3


# rpm -qa|grep qemu
qemu-kvm-rhev-2.10.0-10.el7.x86_64
# rpm -qa|grep OVMF
OVMF-20171011-3.git92d07e48907f.el7.noarch

qemu command:
/usr/libexec/qemu-kvm -enable-kvm -M q35 -nodefaults -smp 16,cores=2,threads=8,sockets=1 -m 2T -name vm1 -drive file=/usr/share/OVMF/OVMF_CODE.secboot.fd,if=pflash,format=raw,unit=0,readonly=on -drive file=/usr/share/OVMF/OVMF_VARS.fd,if=pflash,format=raw,unit=1 -debugcon file:/home/test/ovmf.log -drive file=/usr/share/OVMF/UefiShell.iso,if=none,cache=none,snapshot=off,aio=native,media=cdrom,id=cdrom1

Comment 22 FuXiangChun 2017-12-05 15:47:04 UTC
Created attachment 1363251 [details]
2T memory ovmf log

Comment 23 Laszlo Ersek 2017-12-05 21:17:30 UTC
Hello FuXiangChun,

thanks for the log.

While building the static SMM page tables (for the whole guest RAM), the SetStaticPageTable() function in "UefiCpuPkg/PiSmmCpuDxeSmm/X64/PageTbl.c" runs out of SMRAM:

   211            PageDirectoryEntry = AllocatePageTableMemory (1);
   212            ASSERT(PageDirectoryEntry != NULL);

* This is discussed at length in (2a), in comment 8 -- the summary is, "We can roughly say, adding 1TB of guest RAM requires 8MB more SMRAM."

* In comment 8 bullet (3a), I specifically tested 2TB and stated that the default 16MB SMRAM size is insufficient.

* Furthermore, please refer to the following test case in the RHEL-7.5 OVMF test plan (bug 1505265): RHEL7-110151. It carries the following Note:

> If boot guest with a very large guest RAM size(>=4T) and a high VCPU
> count(>272), then need add this option to qemu command
>
> -global mch.extended-tseg-mbytes=48

Based on the above references, please *either* append

  -global mch.extended-tseg-mbytes=24

to the QEMU command line (because you added 1TB of RAM, and 16 + 8 = 24); *or else* append

  -global mch.extended-tseg-mbytes=48

(which is the value given under the RHEL7-110151 test case that should be sufficient up to 4TB -- namely, 16 + 4*8 = 16 + 32 = 48.)
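
(To spell out the sizing rule being applied -- an approximation based on
comment 8, not an exact formula:)

  # 16MB default extended TSEG, plus roughly 8MB of SMRAM per additional TB
  # of guest RAM (comment 8); the test plan's 48MB is the generous
  # "16 + 4*8" variant that covers up to 4TB.
  default_mb, per_tb_mb = 16, 8
  print(default_mb + 1 * per_tb_mb)   # 24 -> -global mch.extended-tseg-mbytes=24
  print(default_mb + 4 * per_tb_mb)   # 48 -> -global mch.extended-tseg-mbytes=48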

Thank you!

Comment 24 FuXiangChun 2017-12-06 09:34:36 UTC
Thanks Laszlo.

Re-tested this bug with 3.10.0-693.5.2.el7.x86_64 & OVMF-20171011-3.git92d07e48907f.el7.noarch & qemu-kvm-rhev-2.10.0-10.el7.x86_64.

Key qemu command:

/usr/libexec/qemu-kvm -enable-kvm -M q35 -nodefaults -smp 384,cores=8,threads=24,sockets=2 -m 4T -name vm1 -drive file=/usr/share/OVMF/OVMF_CODE.secboot.fd,if=pflash,format=raw,unit=0,readonly=on -drive file=/usr/share/OVMF/OVMF_VARS.fd,if=pflash,format=raw,unit=1 -debugcon file:/home/test/ovmf.log -drive file=/usr/share/OVMF/UefiShell.iso,if=none,cache=none,snapshot=off,aio=native,media=cdrom,id=cdrom1 -device ahci,id=ahci0 -device ide-cd,drive=cdrom1,id=ide-cd1,bus=ahci0.1 -global isa-debugcon.iobase=0x402 -drive file=/home/rhel7.5-secureboot.qcow2,if=none,id=guest-img,format=qcow2,werror=stop,rerror=stop -device ide-hd,drive=guest-img,bus=ide.0,unit=0,id=os-disk,bootindex=1 -spice port=5931,disable-ticketing -vga qxl -monitor stdio -qmp tcp:0:6666,server,nowait -boot menu=on,reboot-timeout=8,strict=on -device pcie-root-port,bus=pcie.0,id=root.0,slot=0,io-reserve=0 -device e1000,netdev=tap0,mac=9a:6a:6b:6c:6d:50,bus=root.0 -netdev tap,id=tap0 -machine kernel_irqchip=split -device intel-iommu,intremap=on,eim=on -global mch.extended-tseg-mbytes=48 -serial unix:/tmp/console,server,nowait -vnc :1

Result:
4T size memory can be found inside guest. and guest works well.

I need to confirm 2 small problems with you.

Q1). If I reboot 7.5 guest with 4T. It will take ~50 minutes from the send reboot command to load ovmf UI. Then it will take ~40 minutes during booting.  Is it normal?

Q2) As this big machine can not install RHEL7.5 host(always kernel panic). I use RHEL7.4 host to test this bug. But qemu-kvm-rhev and OVMF are the latest version. and Guest is RHEL7.5 guest.  Can it be used as valid results to verify this bug? Thanks.

Comment 25 Laszlo Ersek 2017-12-06 12:23:08 UTC
(In reply to FuXiangChun from comment #24)

> Result:
> 4T size memory can be found inside guest. and guest works well.

Thanks!


> I need to confirm 2 small problems with you.
>
> Q1). If I reboot 7.5 guest with 4T. It will take ~50 minutes from the send
> reboot command to load ovmf UI. Then it will take ~40 minutes during
> booting.  Is it normal?

I have absolutely no idea. I've never used a 4TB guest.

* Can the slowness be related to swap space usage on the host, perhaps?

* How does a 4TB SeaBIOS/RHEL-7.5 guest behave across a reboot?

* How does the physical host -- with the multi-terabyte RAM -- behave across
  a RHEL-7.5 reboot?


> Q2) As this big machine can not install RHEL7.5 host(always kernel panic).

Oh, wow. That sort of answers my last question above -- "it does not boot at
all", namely. So:

* Do we have an RHBZ about this, for the host kernel?


> I use RHEL7.4 host to test this bug. But qemu-kvm-rhev and OVMF are the
> latest version. and Guest is RHEL7.5 guest.  Can it be used as valid
> results to verify this bug? Thanks.

I have two thoughts -- Dave, please comment if you can:

(1) Apparently, the RHEL-7.5 kernel does not boot at all on a similarly
    large physical machine. That makes me wonder if we can at all use the
    RHEL-7.5 kernel for testing guest functionality!

    - What is the reboot behavior of a 4TB OVMF/RHEL-7.4 guest (using OVMF
      from RHEL-7.5 on a RHEL-7.4 host)?

(2) I *think* it could be OK to use a RHEL-7.4 host, with only OVMF upgraded
    to 7.5, but I'm not entirely sure. The only practical scenario where I
    can imagine such a setup is the following:

    (a) You start the guest on a RHEL-7.5 host (including OVMF from the 7.5
        host).

    (b) You use the pc-q35-rhel7.4.0 machine type.

    (c) You migrate the guest *down* to a RHEL-7.4 host.

    (d) You reboot the migrated guest on the target host.

    Because the firmware is migrated (in memory / flash) together with the
    guest, the reboot will effectively execute OVMF from RHEL-7.5 on the
    RHEL-7.4 host.

    However, I'm unsure if we support backwards migration from RHEL-7.5 to
    RHEL-7.4 hosts. (I think backward migration is supported on a
    case-by-case basis only. I could be wrong.)


Summary:

- I think your host setup is fine.

- Please use a kernel in the guest that is known to work well on large hosts
  too (that is, RHEL-7.4.z).

- If the RHEL-7.4.z guest takes very long to reboot as well, with OVMF from
  RHEL-7.5, then please repeat the test with SeaBIOS as well (preserving all
  other details of the OVMF test).

Thanks!

Comment 26 Dr. David Alan Gilbert 2017-12-06 12:36:57 UTC
I'm not sure - I've not used anything this big either; I agree with Laszlo's suggestions; let's find the BZ for the reason the 7.5 kernel crashes on the host, and let's see if SeaBIOS takes that long as well. A ~50-minute reboot sounds like a bug somewhere.

Comment 27 FuXiangChun 2017-12-06 17:33:12 UTC
(In reply to Laszlo Ersek from comment #25)
> (In reply to FuXiangChun from comment #24)
> 
> > Result:
> > 4T size memory can be found inside guest. and guest works well.
> 
> Thanks!
> 
> 
> > I need to confirm 2 small problems with you.
> >
> > Q1). If I reboot 7.5 guest with 4T. It will take ~50 minutes from the send
> > reboot command to load ovmf UI. Then it will take ~40 minutes during
> > booting.  Is it normal?
> 
> I have absolutely no idea. I've never used a 4TB guest.
> 
> * Can the slowness be related to swap space usage on the host, perhaps?

# free -g
              total        used        free      shared  buff/cache   available
Mem:          12094         182       11897           0          14       11907
Swap:             3           0           3
 
So, the host doesn't use swap.
> 
> * How does a 4TB SeaBIOS/RHEL-7.5 guest behave across a reboot?

I installed a RHEL7.5 guest with SeaBIOS and 4T of memory. Rebooting the guest only needs ~2 minutes.

> 
> * How does the physical host -- with the multi-terabyte RAM -- behave across
>   a RHEL-7.5 reboot?
> 

Rebooting the host takes ~20-30 minutes.

> 
> > Q2) As this big machine can not install RHEL7.5 host(always kernel panic).
> 
> Oh, wow. That sort of answers my last question above -- "it does not boot at
> all", namely. So:
> 
> * Do we have an RHBZ about this, for the host kernel?

https://bugzilla.redhat.com/show_bug.cgi?id=1446771

> 
> 
> > I use RHEL7.4 host to test this bug. But qemu-kvm-rhev and OVMF are the
> > latest version. and Guest is RHEL7.5 guest.  Can it be used as valid
> > results to verify this bug? Thanks.
> 
> I have two thoughts -- Dave, please comment if you can:
> 
> (1) Apparently, the RHEL-7.5 kernel does not boot at all on a similarly
>     large physical machine. That makes me wonder if we can at all use the
>     RHEL-7.5 kernel for testing guest functionality!
> 
>     - What is the reboot behavior of a 4TB OVMF/RHEL-7.4 guest (using OVMF
>       from RHEL-7.5 on a RHEL-7.4 host)?
> 
Sorry, I need to correct the host and guest versions.

Host is RHEL7.4.z(3.10.0-693.5.2.el7.x86_64)
Guest is RHEL7.4(3.10.0-693.el7.x86_64)



> (2) I *think* it could be OK to use a RHEL-7.4 host, with only OVMF upgraded
>     to 7.5, but I'm not entirely sure. The only practical scenario where I
>     can imagine such a setup is the following:
> 
>     (a) You start the guest on a RHEL-7.5 host (including OVMF from the 7.5
>         host).
> 
As per bug 1446771, I cannot install a fresh RHEL-7.5 host; it always fails.

>     (b) You use the pc-q35-rhel7.4.0 machine type.

I tested pc-q35-rhel7.4.0 and pc-q35-rhel7.5.0; both give the same result.

> 
>     (c) You migrate the guest *down* to a RHEL-7.4 host.
Sorry, I only found one big-memory machine in Beaker, so I cannot do the migration.
> 
>     (d) You reboot the migrated guest on the target host.
> 
>     Because the firmware is migrated (in memory / flash) together with the
>     guest, the reboot will effectively execute OVMF from RHEL-7.5 on the
>     RHEL-7.4 host.
> 
>     However, I'm unsure if we support backwards migration from RHEL-7.5 to
>     RHEL-7.4 hosts. (I think backward migration is supported on a
>     case-by-case basis only. I could be wrong.)
> 
I'm sorry, I reported an inaccurate guest reboot time earlier. I re-tested with 4T memory and fewer VCPUs (32). Booting the RHEL7.4 guest takes ~17 minutes on the RHEL7.4.z host. (Before, with 384 VCPUs and 4T memory, it took more time.)

> 
> Summary:
> 
> - I think your host setup is fine.
> 
> - Please use a kernel in the guest that is known to work well on large hosts
>   too (that is, RHEL-7.4.z).
> 
> - If the RHEL-7.4.z guest takes very long to reboot as well, with OVMF from
>   RHEL-7.5, then please repeat the test with SeaBIOS as well (preserving all
>   other details of the OVMF test).
> 
> Thanks!


Summary of my testing for OVMF and SeaBIOS:

1. For SeaBIOS:
It takes ~2 minutes to boot and ~3 minutes to reboot.

version:
host is RHEL7.4.z(3.10.0-693.5.2.el7.x86_64) 
guest is RHEL7.5(3.10.0-799.el7.x86_64)
qemu-kvm-rhev-2.10.0-10.el7.x86_64

qemu command:
/usr/libexec/qemu-kvm -enable-kvm -M q35 -nodefaults -smp 32,cores=4,threads=4,sockets=2 -m 4T -name vm1  -global isa-debugcon.iobase=0x402 -drive file=rhel7.5-seabios.qcow2,if=none,id=guest-img,format=qcow2,werror=stop,rerror=stop -device ide-hd,drive=guest-img,bus=ide.0,unit=0,id=os-disk -spice port=5931,disable-ticketing -vga qxl -monitor stdio -qmp tcp:0:6666,server,nowait -boot menu=on,reboot-timeout=8,strict=on -device pcie-root-port,bus=pcie.0,id=root.0,slot=0,io-reserve=0 -device e1000,netdev=tap0,mac=9a:6a:6b:6c:6d:50,bus=root.0 -netdev tap,id=tap0  -vnc :1

2. For OVMF:
It takes ~17 minutes to boot the guest and ~18 minutes to reboot it.

Version:
host is RHEL7.4.z(3.10.0-693.5.2.el7.x86_64) 
guest is RHEL7.4(3.10.0-693.el7.x86_64)
qemu-kvm-rhev-2.10.0-10.el7.x86_64
OVMF-20171011-3.git92d07e48907f.el7.noarch

qemu command:

/usr/libexec/qemu-kvm -enable-kvm -M pc-q35-rhel7.5.0 -nodefaults -smp 32,cores=4,threads=4,sockets=2 -m 4T -name vm1 -drive file=/usr/share/OVMF/OVMF_CODE.secboot.fd,if=pflash,format=raw,unit=0,readonly=on -drive file=/usr/share/OVMF/OVMF_VARS.fd,if=pflash,format=raw,unit=1 -debugcon file:/home/test/ovmf.log -drive file=/usr/share/OVMF/UefiShell.iso,if=none,cache=none,snapshot=off,aio=native,media=cdrom,id=cdrom1 -device ahci,id=ahci0 -device ide-cd,drive=cdrom1,id=ide-cd1,bus=ahci0.1 -global isa-debugcon.iobase=0x402 -drive file=/home/rhel7.5-secureboot.qcow2,if=none,id=guest-img,format=qcow2,werror=stop,rerror=stop -device ide-hd,drive=guest-img,bus=ide.0,unit=0,id=os-disk,bootindex=1 -spice port=5931,disable-ticketing -vga qxl -monitor stdio -qmp tcp:0:6666,server,nowait -boot menu=on,reboot-timeout=8,strict=on -device pcie-root-port,bus=pcie.0,id=root.0,slot=0,io-reserve=0 -device e1000,netdev=tap0,mac=9a:6a:6b:6c:6d:50,bus=root.0 -netdev tap,id=tap0 -machine kernel_irqchip=split -device intel-iommu,intremap=on,eim=on -global mch.extended-tseg-mbytes=48 -serial unix:/tmp/console,server,nowait -vnc :1


In addition: this host will be returned to Beaker tomorrow. Because of the existing bug, I cannot install a RHEL7.5 host. Do I need to do any other tests?

Comment 28 FuXiangChun 2017-12-06 17:39:21 UTC
Created attachment 1363791 [details]
ovmf-4T log

Comment 29 Laszlo Ersek 2017-12-06 23:01:28 UTC
FuXiangChun,

your feedback is extremely helpful, thank you for that.

Yes, I would like to ask you for one more test with OVMF. Let me analyze the
newest information below; there are two important facts:

(1) boot time of OVMF is consistent with reboot time of OVMF (17 mins vs. 18
    mins -- this is from your summary in comment 27). That's great.

(2) Your command line for the OVMF testing, from comment 27, uses:

  OVMF-20171011-3.git92d07e48907f.el7.noarch

and specifies:

  -debugcon file:/home/test/ovmf.log \
  -global isa-debugcon.iobase=0x402

In turn, the boot produces the *absolutely hugest* OVMF debug log (comment
28) I've ever seen. Its uncompressed size is 133M, containing 2,112,575
lines.


Why is this relevant? Because:

- in ovmf-20171011-2.git92d07e48907f.el7, Paolo fixed bug 1488247, such that
  the debug log is written to the QEMU debug port *if and only if* the debug
  console is actually enabled with the "-debugcon" switch;

- producing large amounts of debug log impacts performance;

- the log from comment 28 consists overwhelmingly of lines that say:

> ConvertPageEntryAttribute 0x800000007F0FB067->0x800000007F0FB065

  and the number of such lines is linearly proportional to guest RAM size.

  (If you remove these lines from the log file, only 224,043 bytes are left;
  or put differently, 3685 lines. That's ~0.17% of the full line count.)
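
  (To reproduce that count, a minimal sketch -- the local log file name below
  is hypothetical:)

    total = kept = 0
    with open("ovmf-4T.log") as log:          # hypothetical local file name
        for line in log:
            total += 1
            if "ConvertPageEntryAttribute" not in line:
                kept += 1
    print(total, kept, f"{100.0 * kept / total:.2f}%")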


So, the one test that I would like to request in addition is just this:

please repeat your last OVMF test (from the end of comment 27), but *remove*
the following options:

  -debugcon file:/home/test/ovmf.log \
  -global isa-debugcon.iobase=0x402

and measure the boot time like this.

(Functionally, you already confirmed in comment 24, "4T size memory can be
found inside guest. and guest works well". So this is now only about the
performance.)

I expect that the boot (and reboot) will be sped up quite a bit. If it does
not catch up with SeaBIOS, that's not a problem though; a similarly sized
host reboot takes ~20~30 minutes as well, according to comment 27.

Thank you!

Comment 30 FuXiangChun 2017-12-07 02:40:26 UTC
Thanks Laszlo,

I re-tested this problem without '-debugcon file:/home/test/ovmf.log' and '-global isa-debugcon.iobase=0x402'. The guest now takes just ~3 minutes to boot or reboot, and it works well; all memory and VCPUs can be found inside the guest.

Based on this test result, can I set this bug as verified?

Comment 31 Laszlo Ersek 2017-12-07 11:11:12 UTC
FuXiangChun, those are awesome results; many thanks for your continued thorough work!

Yes, please set this BZ to VERIFIED status. Cheers!

Comment 34 errata-xmlrpc 2018-04-10 16:28:00 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2018:0902

