Bug 1861718
| Summary: | Very slow boot when overcommitting CPU | | |
|---|---|---|---|
| Product: | Red Hat Enterprise Linux 8 | Reporter: | Eduardo Habkost <ehabkost> |
| Component: | edk2 | Assignee: | Laszlo Ersek <lersek> |
| Status: | CLOSED ERRATA | QA Contact: | leidwang <leidwang> |
| Severity: | unspecified | Docs Contact: | |
| Priority: | unspecified | | |
| Version: | 8.3 | CC: | berrange, coli, jinzhao, juzhang, kraxel, leidwang, lersek, mtessun, pbonzini, philmd, virt-maint, xuwei, yuhuang |
| Target Milestone: | rc | Keywords: | Triaged |
| Target Release: | 8.3 | Flags: | pm-rhel: mirror+ |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | edk2-20200602gitca407c7246bf-3.el8 | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2020-11-04 04:01:20 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Bug Depends On: | | | |
| Bug Blocks: | 1788991 | | |
Posted upstream patch:

[edk2-devel] [PATCH] UefiCpuPkg/PiSmmCpuDxeSmm: pause in WaitForSemaphore() before re-fetch
http://mid.mail-archive.com/20200729185217.10084-1-lersek@redhat.com
https://edk2.groups.io/g/devel/message/63454

Testing feedback from Eduardo, using the patch:

(In reply to Eduardo Habkost from comment #0)
> In a host with 448 CPUs running a 512 VCPU VM:
> * VM takes 20-30 minutes to boot
> * 0.72 seconds between each ConvertPageEntryAttribute message, on average

with the patch: ~4 minutes to grub

> In a host with 48 CPUs running a 384 VCPU VM:
> * VM doesn't boot after 3 hours
> * 30 seconds between each ConvertPageEntryAttribute message

with the patch: 14 minutes to grub

(In reply to Laszlo Ersek from comment #2)
> Posted upstream patch:
>
> [edk2-devel] [PATCH] UefiCpuPkg/PiSmmCpuDxeSmm: pause in WaitForSemaphore()
> before re-fetch
> http://mid.mail-archive.com/20200729185217.10084-1-lersek@redhat.com
> https://edk2.groups.io/g/devel/message/63454

Merged upstream as commit 9001b750df64, via <https://github.com/tianocore/edk2/pull/843>.

Hi Laszlo,

I have a question about this bz, does edk2 support vcpu overcommit?

Many thanks!

(In reply to leidwang from comment #6)
> Hi Laszlo,
>
> I have a question about this bz, does edk2 support vcpu overcommit?
>
> Many thanks!

Sorry, let me clarify.
For cpu overcommit, we usually support 1:1 cpu overcommit, does edk2 support such high overcommit in one vm (48:384)?

(In reply to CongLi from comment #7)
> (In reply to leidwang from comment #6)
> > Hi Laszlo,
> >
> > I have a question about this bz, does edk2 support vcpu overcommit?
> >
> > Many thanks!
>
> Sorry, let me clarify.
> For cpu overcommit, we usually support 1:1 cpu overcommit, does edk2 support
> such high overcommit in one vm (48:384)?

I don't understand what you mean by "1:1 cpu overcommit". If you mean 1 VCPU per 1 PCPU, that's not "over"commit.
Overcommit is when you have more VCPUs (summed over all VMs running on a host) than the host has PCPUs.

Anyway, I would advise against using OVMF in any overcommit scenario; the numbers seen in this BZ come from Eduardo's work towards higher VCPU counts. 448->512 seems like a reasonable use case (that was affected by the OVMF issue). 48->384 intentionally amplifies the issue for illustration; it is not reasonable for production.

So I'd say stick with whatever overcommit standards you've been using thus far, unless Eduardo has particular overcommit requests that are required for testing his work. Thanks.

For testing this particular bugfix, I'd suggest a slight overcommit scenario, with a ratio similar to Eduardo's 512/448 (~1.14). Maybe up to 1.5 if you have a low PCPU count on the host, such as 4 or 8.

Can we get a qa_ack+ please?

(In reply to Laszlo Ersek from comment #9)
> For testing this particular bugfix, I'd suggest a slight overcommit
> scenario, a ratio similar to Eduardo's 512/448 (~1.14). Maybe up to 1.5 if
> you have a low PCPU count on the host, such as 4 or 8.

Tested this bz on a host with 40 PCPUs.

Results are as below:

* 40 vcpus VM works well
* 40 vcpus + 40 vcpus VMs all work well
* 80 vcpus VM works well
* 160 vcpus VM takes 7 minutes to boot
* 200 vcpus VM takes 15 minutes to boot
* 240 vcpus VM takes 28 minutes to boot
* 384 vcpus VM doesn't boot after 4 hours

A question about cpu overcommit: do we support single-VM overcommit (one VM with a greater number of VCPUs than the host's PCPU count)? Even slight overcommit. Thanks!

Hello Leidong Wang,

(In reply to leidwang from comment #11)
> (In reply to Laszlo Ersek from comment #9)
> > For testing this particular bugfix, I'd suggest a slight overcommit
> > scenario, a ratio similar to Eduardo's 512/448 (~1.14). Maybe up to 1.5 if
> > you have a low PCPU count on the host, such as 4 or 8.
>
> Tested this bz on a host with 40 PCPUs.
> Results are as below:
>
> * 40 vcpus VM works well
> * 40 vcpus + 40 vcpus VMs all work well
> * 80 vcpus VM works well

First question: what is the difference between "40+40" and "80"?

Does "40+40" mean two VMs, each with 40 VCPUs?

And does "80" mean one VM with 80 VCPUs?

> * 160 vcpus VM takes 7 minutes to boot
> * 200 vcpus VM takes 15 minutes to boot
> * 240 vcpus VM takes 28 minutes to boot
> * 384 vcpus VM doesn't boot after 4 hours

Because this patch is a performance optimization, what we should really be doing is *compare* the boot times between:
- edk2-ovmf-20200602gitca407c7246bf-2.el8.noarch
- edk2-ovmf-20200602gitca407c7246bf-3.el8.noarch [upcoming build, containing the patch]

So I understand your above measurements to be the baseline (i.e., boot times without the patch). Is that correct?

> A question about cpu overcommit: do we support single-VM overcommit (one VM
> with a greater number of VCPUs than the host's PCPU count)? Even slight
> overcommit.

I'm curious too. I don't know what our official recommendations are about CPU overcommit.

Thanks. Anyway, I'd like to confirm that the before-after boot times should be compared using a single VM. Thanks!

(In reply to Laszlo Ersek from comment #12)
> Hello Leidong Wang,
>
> (In reply to leidwang from comment #11)
> > (In reply to Laszlo Ersek from comment #9)
> > > For testing this particular bugfix, I'd suggest a slight overcommit
> > > scenario, a ratio similar to Eduardo's 512/448 (~1.14). Maybe up to 1.5 if
> > > you have a low PCPU count on the host, such as 4 or 8.
> >
> > Tested this bz on a host with 40 PCPUs.
> >
> > Results are as below:
> >
> > * 40 vcpus VM works well
> > * 40 vcpus + 40 vcpus VMs all work well
> > * 80 vcpus VM works well
>
> First question: what is the difference between "40+40" and "80"?
>
> Does "40+40" mean two VMs, each with 40 VCPUs?
>
> And does "80" mean one VM with 80 VCPUs?

Yes, you are right.
> > * 160 vcpus VM takes 7 minutes to boot
> > * 200 vcpus VM takes 15 minutes to boot
> > * 240 vcpus VM takes 28 minutes to boot
> > * 384 vcpus VM doesn't boot after 4 hours
>
> Because this patch is a performance optimization, what we should really be
> doing is *compare* the boot times between:
> - edk2-ovmf-20200602gitca407c7246bf-2.el8.noarch
> - edk2-ovmf-20200602gitca407c7246bf-3.el8.noarch [upcoming build, containing
> the patch]
>
> So I understand your above measurements to be the baseline (i.e., boot times
> without the patch). Is that correct?

Yes, this result is based on edk2-ovmf-20200602gitca407c7246bf-2.el8.noarch, so I need to test it with edk2-ovmf-20200602gitca407c7246bf-3.el8.noarch?

> > A question about cpu overcommit: do we support single-VM overcommit (one VM
> > with a greater number of VCPUs than the host's PCPU count)? Even slight
> > overcommit.
>
> I'm curious too. I don't know what our official recommendations are about
> CPU overcommit.
>
> Thanks.

OK, thanks!

(In reply to leidwang from comment #14)
> (In reply to Laszlo Ersek from comment #12)
> > (In reply to leidwang from comment #11)
> > > * 160 vcpus VM takes 7 minutes to boot
> > > * 200 vcpus VM takes 15 minutes to boot
> > > * 240 vcpus VM takes 28 minutes to boot
> > > * 384 vcpus VM doesn't boot after 4 hours
> >
> > Because this patch is a performance optimization, what we should really be
> > doing is *compare* the boot times between:
> > - edk2-ovmf-20200602gitca407c7246bf-2.el8.noarch
> > - edk2-ovmf-20200602gitca407c7246bf-3.el8.noarch [upcoming build, containing
> > the patch]
> >
> > So I understand your above measurements to be the baseline (i.e., boot times
> > without the patch). Is that correct?
>
> Yes, this result is based on edk2-ovmf-20200602gitca407c7246bf-2.el8.noarch,
> so I need to test it with edk2-ovmf-20200602gitca407c7246bf-3.el8.noarch?

That's right.
Please run the same tests, on the same host machine, with "edk2-ovmf-20200602gitca407c7246bf-2.el8.noarch" and "edk2-ovmf-20200602gitca407c7246bf-3.el8.noarch", and compare the boot times. Multi-VM tests are not relevant for now.

Thanks!

(In reply to Laszlo Ersek from comment #16)
> (In reply to leidwang from comment #14)
> > (In reply to Laszlo Ersek from comment #12)
> > > (In reply to leidwang from comment #11)
> > > > * 160 vcpus VM takes 7 minutes to boot
> > > > * 200 vcpus VM takes 15 minutes to boot
> > > > * 240 vcpus VM takes 28 minutes to boot
> > > > * 384 vcpus VM doesn't boot after 4 hours
> > >
> > > Because this patch is a performance optimization, what we should really be
> > > doing is *compare* the boot times between:
> > > - edk2-ovmf-20200602gitca407c7246bf-2.el8.noarch
> > > - edk2-ovmf-20200602gitca407c7246bf-3.el8.noarch [upcoming build, containing
> > > the patch]
> > >
> > > So I understand your above measurements to be the baseline (i.e., boot times
> > > without the patch). Is that correct?
> >
> > Yes, this result is based on edk2-ovmf-20200602gitca407c7246bf-2.el8.noarch,
> > so I need to test it with edk2-ovmf-20200602gitca407c7246bf-3.el8.noarch?
>
> That's right.
>
> Please run the same tests, on the same host machine, with
> "edk2-ovmf-20200602gitca407c7246bf-2.el8.noarch" and
> "edk2-ovmf-20200602gitca407c7246bf-3.el8.noarch", and compare the boot
> times. Multi-VM tests are not relevant for now.
>
> Thanks!

Retested on the same host with "edk2-ovmf-20200602gitca407c7246bf-3.el8.noarch".

Results are as below:

* 40 vcpus VM works well
* 80 vcpus VM works well
* 160 vcpus VM takes 1.5 minutes to boot
* 200 vcpus VM takes 1.5 minutes to boot
* 240 vcpus VM takes 2 minutes to boot
* 280 vcpus VM takes 2.5 minutes to boot
* 320 vcpus VM takes 3.5 minutes to boot
* 384 vcpus VM takes 5 minutes to boot

Thank you. So we have
| VCPU count | -2.el8 | -3.el8 |
|---|---|---|
| 40 | ~immediate | ~immediate |
| 80 | ~immediate | ~immediate |
| 160 | 7 mins | 1.5 mins |
| 200 | 15 mins | 1.5 mins |
| 240 | 28 mins | 2.0 mins |
| 384 | >240 mins | 5.0 mins |
Please set the BZ status to VERIFIED. Thanks!
(In reply to leidwang from comment #11)
> A question about cpu overcommit: do we support single-VM overcommit (one VM
> with a greater number of VCPUs than the host's PCPU count)? Even slight
> overcommit.
>
> Thanks!

Pasting reply sent by email last week:

I don't know the answer to those questions, I hope Martin can help you. That BZ has overcommit involved because it was the best way to reproduce the OVMF performance issue, not because it's an important use case.

(In reply to Eduardo Habkost from comment #22)
> (In reply to leidwang from comment #11)
> > A question about cpu overcommit: do we support single-VM overcommit (one VM
> > with a greater number of VCPUs than the host's PCPU count)? Even slight
> > overcommit.
> >
> > Thanks!
>
> Pasting reply sent by email last week:
>
> I don't know the answer to those questions, I hope Martin can
> help you. That BZ has overcommit involved because it was the
> best way to reproduce the OVMF performance issue, not because
> it's an important use case.

I think I answered in that thread, but as well here:

No, we do not support CPU overcommit on a per-VM basis. In detail: #vCPUs <= #pCPUs (including threads!)

Of course you can do overcommit with multiple VMs, so sum(#vCPUs) may be greater than #pCPUs.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: edk2 security, bug fix, and enhancement update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2020:4805
Description of problem:

When testing VCPU limits in qemu-kvm, I found that booting gets very slow if the number of VCPUs is higher than the host CPU count. When booting with isa-debugcon, I see thousands of messages like:

ConvertPageEntryAttribute 0x8000000050E000E7->0x8000000050E000E6

appearing at a very slow rate.

Version-Release number of selected component (if applicable):

qemu-kvm-5.0.0-2.module+el8.3.0+7379+0505d6ca.x86_64
edk2-ovmf-20200602gitca407c7246bf-2.el8.noarch

How reproducible:

Always.

Steps to Reproduce:

On a host with fewer than 384 CPUs, run:

/usr/libexec/qemu-kvm -machine q35,accel=kvm,kernel-irqchip=split \
  -smp 384 \
  -drive if=pflash,format=raw,readonly,file=./OVMF_CODE.secboot.fd \
  -drive if=pflash,format=raw,file=./OVMF_VARS.fd \
  -device intel-iommu,intremap=on,eim=on \
  -m 4096 \
  -drive if=virtio,file=/root/rhel-guest-image-8.3-266.x86_64.qcow2,format=qcow2 \
  -vnc :0 -cdrom /root/seed.iso -display none -serial stdio -boot menu=on \
  -chardev file,id=fw-debug,path=/tmp/DOMAIN_NAME.fw.log \
  -device isa-debugcon,iobase=0x402,chardev=fw-debug

Actual results:

VM takes a long time to boot, using 100% of all host CPUs most of the time.

In a host with 448 CPUs running a 512 VCPU VM:
* VM takes 20-30 minutes to boot
* 0.72 seconds between each ConvertPageEntryAttribute message, on average

In a host with 48 CPUs running a 384 VCPU VM:
* VM doesn't boot after 3 hours
* 30 seconds between each ConvertPageEntryAttribute message

Expected results:

VM boots in a more reasonable time (preferably less than 5 minutes to reach grub).

Additional info:

Laszlo's analysis from the email thread:

>> At a point during the boot process, Platform BDS signals "SMM ready to
>> lock". This is kind of a marker for the firmware before which only
>> firmware modules from the platform vendor run, but after which 3rd party
>> UEFI modules (such as boot loaders, PCI card option ROMs) will run. So
>> the platform firmware performs various lock-down operations.
>>
>> One of those is that, at the first SMI following the above signal, the
>> SMI handler that runs on the BSP will unmap (remove the present bit in)
>> the SMM page table entries on most pages (except those UEFI memory
>> types that either are acceptable for SMM communication buffers, or map
>> MMIO). So the above bunch of messages report clearing the present bit,
>> from the SetUefiMemMapAttributes() function
>> [UefiCpuPkg/PiSmmCpuDxeSmm/SmmCpuMemoryManagement.c].
>>
>> Meanwhile the APs are busy-waiting in SMM for the BSP to finish.
>>
>> With VCPU overcommit, the APs could hinder the BSP's progress. And in
>> this particular AP loop, I don't see a CpuPause() call.
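The missing pause hint is what the upstream patch (commit 9001b750df64) adds to WaitForSemaphore(). The following is a minimal standalone sketch of the idea, not the verbatim edk2 code: it uses C11 atomics and `_mm_pause()` in place of edk2's InterlockedCompareExchange32() and CpuPause(), and `wait_for_semaphore` is a stand-in name.

```c
#include <stdint.h>
#include <stdatomic.h>

#if defined(__x86_64__) || defined(__i386__)
#include <immintrin.h>
#define cpu_pause() _mm_pause()   /* PAUSE: tells the CPU (and the host) we're spinning */
#else
#define cpu_pause() ((void)0)
#endif

/* Decrement *sem, spinning until it is non-zero, and return the
 * decremented value.  The fix is the cpu_pause() between re-fetches:
 * without it, overcommitted APs spin at full speed and starve the
 * BSP of host CPU time. */
static uint32_t wait_for_semaphore(_Atomic uint32_t *sem)
{
    uint32_t value;

    for (;;) {
        value = atomic_load_explicit(sem, memory_order_relaxed);
        if (value != 0 &&
            atomic_compare_exchange_weak_explicit(sem, &value, value - 1,
                                                  memory_order_acquire,
                                                  memory_order_relaxed)) {
            return value - 1;
        }
        cpu_pause();   /* back off before re-fetching the semaphore */
    }
}
```

On bare metal the pause mainly releases pipeline resources to a hyper-thread sibling; under KVM it also gives the host a chance (pause-loop exiting) to schedule the VCPU that actually holds the work, which is why the effect is so dramatic in the overcommit measurements above.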