Bug 2150267 - ovmf must consider max cpu count not boot cpu count for apic mode [rhel-8]
Summary: ovmf must consider max cpu count not boot cpu count for apic mode [rhel-8]
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Linux 8
Classification: Red Hat
Component: edk2
Version: 8.8
Hardware: x86_64
OS: Linux
Priority: medium
Severity: medium
Target Milestone: rc
Target Release: ---
Assignee: Gerd Hoffmann
QA Contact: Xueqiang Wei
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2022-12-02 10:39 UTC by liunana
Modified: 2023-11-14 16:12 UTC
CC List: 14 users

Fixed In Version: edk2-20220126gitbb1bba3d77-6.el8
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2023-11-14 15:26:06 UTC
Type: ---
Target Upstream Version:
Embargoed:
pm-rhel: mirror+


Attachments:


Links:
Gitlab redhat/centos-stream/src edk2 merge request 43 (opened): UefiCpuPkg/MpInitLib: fix apic mode for cpu hotplug, last updated 2023-07-04 10:25:09 UTC
Gitlab redhat/rhel/src edk2 merge request 29, last updated 2023-07-31 09:48:06 UTC
Red Hat Issue Tracker RHELPLAN-141155, last updated 2022-12-02 11:17:30 UTC
Red Hat Product Errata RHSA-2023:6919, last updated 2023-11-14 15:26:15 UTC

Internal Links: 2124143

Description liunana 2022-12-02 10:39:44 UTC
Description of problem:
Can't online all vcpus inside the guest when hotplugging more than 256 sockets/cores; the guest shows these error logs:

2022-12-02 05:10:56: [   34.221467] ACPI: Unable to map lapic to logical cpu number
2022-12-02 05:10:56: [   34.222188] acpi ACPI0007:7d: Enumeration failure
2022-12-02 05:10:56: [   34.223018] APIC: NR_CPUS/possible_cpus limit of 255 reached. Processor 635/0x17d ignored.


Version-Release number of selected component (if applicable): 
Host:
    4.18.0-441.el8.x86_64
    qemu-kvm-6.2.0-26.module+el8.8.0+17341+68372c23.x86_64
    edk2-ovmf-20220126gitbb1bba3d77-3.el8.noarch
    libvirt-client-8.0.0-11.module+el8.8.0+16835+1d966b61.x86_64
    intel-eaglestream-spr-07.khw1.lab.eng.bos.redhat.com
    On-line CPU(s) list: 0-383
Guest: 
    4.18.0-441.el8.x86_64


How reproducible:
100%


Steps to Reproduce:
1. Boot the VM with -smp 1,maxcpus=384,cores=1,threads=1,dies=1,sockets=384 (the full qemu command line is in Comment 4 below).
2. Hotplug all vcpus.
3. Check the vcpus inside the guest with 'grep -c "^processor\b" /proc/cpuinfo'.

Actual results:
Only 255 vcpus are online. 
Checking the guest dmesg log, we can see the error logs:
2022-12-02 05:10:56: [   34.221467] ACPI: Unable to map lapic to logical cpu number
2022-12-02 05:10:56: [   34.222188] acpi ACPI0007:7d: Enumeration failure
2022-12-02 05:10:56: [   34.223018] APIC: NR_CPUS/possible_cpus limit of 255 reached. Processor 635/0x17d ignored.


Expected results:
All vcpus can be online.


Additional info:
Didn't hit this issue in the combination test of 'seabios + q35 + RHEL 9.2 guest' with 384 vcpus.

Comment 4 liunana 2022-12-02 10:54:46 UTC
Full qemu commandline:

/usr/libexec/qemu-kvm \
    -S  \
    -name 'avocado-vt-vm1'  \
    -sandbox on  \
    -blockdev node-name=file_ovmf_code,driver=file,filename=/usr/share/OVMF/OVMF_CODE.secboot.fd,auto-read-only=on,discard=unmap \
    -blockdev node-name=drive_ovmf_code,driver=raw,read-only=on,file=file_ovmf_code \
    -blockdev node-name=file_ovmf_vars,driver=file,filename=/home/auto/kar/workspace/root/avocado/data/avocado-vt/avocado-vt-vm1_rhel880-64-virtio-scsi_qcow2_filesystem_VARS.fd,auto-read-only=on,discard=unmap \
    -blockdev node-name=drive_ovmf_vars,driver=raw,read-only=off,file=file_ovmf_vars \
    -machine q35,kernel-irqchip=split,memory-backend=mem-machine_mem,pflash0=drive_ovmf_code,pflash1=drive_ovmf_vars \
    -device pcie-root-port,id=pcie-root-port-0,multifunction=on,bus=pcie.0,addr=0x1,chassis=1 \
    -device pcie-pci-bridge,id=pcie-pci-bridge-0,addr=0x0,bus=pcie-root-port-0  \
    -nodefaults \
    -device intel-iommu,intremap=on,device-iotlb=on \
    -device VGA,bus=pcie.0,addr=0x2 \
    -m 125952 \
    -object memory-backend-ram,size=125952M,id=mem-machine_mem  \
    -smp 1,maxcpus=384,cores=1,threads=1,dies=1,sockets=384  \
    -cpu 'Icelake-Server',ss=on,vmx=on,pdcm=on,hypervisor=on,tsc-adjust=on,avx512ifma=on,sha-ni=on,waitpkg=on,rdpid=on,cldemote=on,movdiri=on,movdir64b=on,fsrm=on,md-clear=on,stibp=on,arch-capabilities=on,avx-vnni=on,avx512-bf16=on,xsaves=on,ibpb=on,ibrs=on,amd-stibp=on,amd-ssbd=on,rdctl-no=on,ibrs-all=on,skip-l1dfl-vmentry=on,mds-no=on,pschange-mc-no=on,tsx-ctrl=on,hle=off,rtm=off,mpx=off,intel-pt=off,kvm_pv_unh

    -chardev socket,server=on,wait=off,path=/var/tmp/avocado_0v9tgdc_/monitor-qmpmonitor1-20221202-051005-KcGEYV5U,id=qmp_id_qmpmonitor1  \
    -mon chardev=qmp_id_qmpmonitor1,mode=control \
    -chardev socket,server=on,wait=off,path=/var/tmp/avocado_0v9tgdc_/monitor-catch_monitor-20221202-051005-KcGEYV5U,id=qmp_id_catch_monitor  \
    -mon chardev=qmp_id_catch_monitor,mode=control \
    -device pvpanic,ioport=0x505,id=idJsXaSI \
    -chardev socket,server=on,wait=off,path=/var/tmp/avocado_0v9tgdc_/serial-serial0-20221202-051005-KcGEYV5U,id=chardev_serial0 \
    -device isa-serial,id=serial0,chardev=chardev_serial0  \
    -chardev socket,id=seabioslog_id_20221202-051005-KcGEYV5U,path=/var/tmp/avocado_0v9tgdc_/seabios-20221202-051005-KcGEYV5U,server=on,wait=off \
    -device isa-debugcon,chardev=seabioslog_id_20221202-051005-KcGEYV5U,iobase=0x402 \
    -device pcie-root-port,id=pcie-root-port-1,port=0x1,addr=0x1.0x1,bus=pcie.0,chassis=2 \
    -device qemu-xhci,id=usb1,bus=pcie-root-port-1,addr=0x0 \
    -device usb-tablet,id=usb-tablet1,bus=usb1.0,port=1 \
    -device pcie-root-port,id=pcie-root-port-2,port=0x2,addr=0x1.0x2,bus=pcie.0,chassis=3 \
    -device virtio-scsi-pci,id=virtio_scsi_pci0,bus=pcie-root-port-2,addr=0x0,iommu_platform=on \
    -blockdev '{"node-name": "file_image1", "driver": "file", "auto-read-only": true, "discard": "unmap", "aio": "threads", "filename": "/home/kvm_autotest_root/images/rhel880-64-virtio-scsi.qcow2", "cache": {"direct": true, "no-flush": false}}' \
    -blockdev '{"node-name": "drive_image1", "driver": "qcow2", "read-only": false, "cache": {"direct": true, "no-flush": false}, "file": "file_image1"}' \
    -device scsi-hd,id=image1,drive=drive_image1,write-cache=on \
    -device pcie-root-port,id=pcie-root-port-3,port=0x3,addr=0x1.0x3,bus=pcie.0,chassis=4 \
    -device virtio-net-pci,mac=9a:4e:4d:eb:a6:ec,id=idGDSRKP,netdev=idgmmUaD,bus=pcie-root-port-3,addr=0x0,iommu_platform=on  \
    -netdev tap,id=idgmmUaD,vhost=on,vhostfd=18,fd=15  \
    -vnc :0  \
    -rtc base=utc,clock=host,driftfix=slew  \
    -boot menu=off,order=cdn,once=c,strict=off  \
    -global mch.extended-tseg-mbytes=48 \
    -enable-kvm \



Guest dmesg log (for the full log, please check the attachments, thanks):

2022-12-02 05:10:48: [   26.074068] CPU252 has been hot-added
2022-12-02 05:10:48: [   26.074519] CPU253 has been hot-added
2022-12-02 05:10:48: [   26.087472] smpboot: Booting Node 0 Processor 169 APIC 0xa9
2022-12-02 05:10:48: [   26.089529] smpboot: CPU 169 Converting physical 169 to logical package 164
2022-12-02 05:10:48: [   26.090055] smpboot: CPU 169 Converting physical 0 to logical die 164
2022-12-02 05:10:48: [   26.111107] Will online and init hotplugged CPU: 169
2022-12-02 05:10:48: [   26.163084] CPU254 has been hot-added
2022-12-02 05:10:48: [   26.201248] smpboot: Booting Node 0 Processor 170 APIC 0xaa
2022-12-02 05:10:48: [   26.203118] smpboot: CPU 170 Converting physical 170 to logical package 165
2022-12-02 05:10:48: [   26.203809] smpboot: CPU 170 Converting physical 0 to logical die 165
2022-12-02 05:10:48: [   26.224723] Will online and init hotplugged CPU: 170
2022-12-02 05:10:48: [   26.272011] APIC: NR_CPUS/possible_cpus limit of 255 reached. Processor 509/0xff ignored.
2022-12-02 05:10:48: [   26.272666] ACPI: Unable to map lapic to logical cpu number
2022-12-02 05:10:48: [   26.273219] acpi ACPI0007:00: Enumeration failure
2022-12-02 05:10:48: [   26.273817] APIC: NR_CPUS/possible_cpus limit of 255 reached. Processor 510/0x100 ignored.
2022-12-02 05:10:48: [   26.274502] ACPI: Unable to map lapic to logical cpu number
2022-12-02 05:10:48: [   26.274983] acpi ACPI0007:01: Enumeration failure
2022-12-02 05:10:48: [   26.293566] smpboot: Booting Node 0 Processor 172 APIC 0xac

Comment 5 liunana 2022-12-05 12:38:42 UTC
Hi Igor,


Would you please take a look and check whether this is a vcpu hotplug issue?
I can only reproduce this issue with an OVMF VM, but I didn't see any error in the firmware log. Is this an OVMF limitation?
If anything is unclear, please let me know. Thanks a lot!


Best regards
Nana

Comment 6 Igor Mammedov 2022-12-07 10:18:52 UTC
If I recall correctly it should work with UEFI too.
From the logs it looks like the guest caps maxcpus at 255 for some reason.

CCing Pawel,
since he is working on >1024 vcpu support in OVMF and
may have more up-to-date info on its limitations.

Comment 7 John Ferlan 2022-12-07 16:51:50 UTC
Perhaps related to bug 1983086

Comment 8 Igor Mammedov 2022-12-13 16:24:32 UTC
(In reply to John Ferlan from comment #7)
> Perhaps related to bug 1983086

no that's not it.

So the difference between SeaBIOS and OVMF is in this log line:
 [    0.000000] x2apic: enabled by BIOS, switching to x2apic ops
Debugging the guest kernel shows that

  MSR_IA32_APICBASE & X2APIC_ENABLE

is not set by edk2 when the number of possible vcpus is more than 255
but the number of vcpus present at boot is less than that,

hence the guest kernel does not enable x2apic ops and caps nr_cpu_ids at 255.

Looks like the bug is present in current upstream as well.
Moving to the edk2 component for further fixing/triaging.
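
For reference, a minimal standalone sketch of the guest-visible condition
described above. The macro names are local to this sketch (the MSR index
0x1B and the EXTD bit, bit 10, come from the Intel SDM); it is not kernel
or edk2 code, just an illustration of the check the guest kernel
effectively performs before switching to x2apic ops:

-------[cut here]-------
/* Illustrative only: decode an IA32_APIC_BASE value as read via rdmsr. */
#include <stdint.h>
#include <stdbool.h>

#define IA32_APIC_BASE_MSR   0x1BU           /* MSR index, for reference */
#define APIC_BASE_EXTD       (1ULL << 10)    /* x2APIC enable bit */
#define APIC_BASE_GLOBAL_EN  (1ULL << 11)    /* xAPIC global enable bit */

static bool x2apic_enabled_by_firmware (uint64_t apic_base)
{
  /* With the bug described here, edk2 leaves EXTD clear whenever fewer
     than 256 vcpus are present at boot, even if more than 255 are
     possible via hotplug, so this returns false and the guest caps
     nr_cpu_ids at 255. */
  return (apic_base & APIC_BASE_EXTD) != 0;
}
-------[cut here]-------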

PS:
tracing QEMU with
   --trace "cpuhp_*"
shows that the firmware enumerates all possible CPUs as expected
and negotiates proper cpu hotplug with QEMU:
(monitor) info qtree
...
      dev: ICH9-LPC, id ""
        gpio-out "gsi" 24
        noreboot = true
        smm-compat = false
        x-smi-broadcast = true
        x-smi-cpu-hotplug = true
        x-smi-cpu-hotunplug = true
...

PS2:
Extra config I've added is enabling 'smm'
   -M q35,kernel-irqchip=split,smm=on
I don't recall if it's really necessary,
but I do remember that it was required for cpu hotplug + secureboot to work.

PS3:
simplified upstream command line I used to debug the issue:
./qemu-system-x86_64 -enable-kvm -m 4096 -smp 1,maxcpus=280,cores=1,threads=1,dies=1,sockets=280 -cpu host -device intel-iommu,intremap=on,device-iotlb=on -M q35,kernel-irqchip=split,smm=on -global mch.extended-tseg-mbytes=48  -nodefaults -drive if=pflash,format=raw,unit=0,readonly=on,file=./edk2-x86_64-code.fd

Comment 10 Xueqiang Wei 2022-12-14 08:02:16 UTC
It should be the same issue as Bug 2124143 - Failed to hot-plug 448 cpus to rhel9.1.0 guest with q35 + OVMF

Comment 11 Laszlo Ersek 2022-12-14 10:11:38 UTC
(In reply to Igor Mammedov from comment #8)

> So the difference between SeaBIOS and OVMF is in this log line:
>  [    0.000000] x2apic: enabled by BIOS, switching to x2apic ops
> Debugging the guest kernel shows that
>
>   MSR_IA32_APICBASE & X2APIC_ENABLE
>
> is not set by edk2 when the number of possible vcpus is more than 255
> but the number of vcpus present at boot is less than that,
>
> hence the guest kernel does not enable x2apic ops and caps nr_cpu_ids at
> 255.
>
> Looks like the bug is present in current upstream as well.
> Moving to the edk2 component for further fixing/triaging.

This analysis is correct.

The bug is in edk2's CollectProcessorCount() function, in the
"UefiCpuPkg/Library/MpInitLib/MpLib.c" source file.

This internal library function is activated when CpuMpPei starts up. The
corresponding log line is (from comment#2; timestamp stripped):

> APIC MODE is 1

Here "1" stands for LOCAL_APIC_MODE_XAPIC. The value we should be seeing
here is "2" (LOCAL_APIC_MODE_X2APIC).

CollectProcessorCount() selects the LAPIC mode based on two factors:

>   //
>   // Enable x2APIC mode if
>   //  1. Number of CPU is greater than 255; or
>   //  2. There are any logical processors reporting an Initial APIC ID of 255 or greater.
>   //

It does not consider the following cases:
(a) more than 255 CPUs might appear in the system *later*, due to
    hotplug
(b) a CPU might appear in the system *later*, due to hotplug, that has a
    "wide" APIC ID.

I think the simplest approach is to extend the logic to cover missing
case (a), and to keep ignoring missing case (b). That's because APIC IDs
are "compressed"; that is, it's really unlikely (I think) to see a
"wide" APIC ID unless the number of possible CPUs actually *requires* a
wide APIC ID. To put differently: I have not proved it, but I *think*
that it's not possible to create such a topology that does not have at
least 256 *possible* CPUs but requires APIC IDs wider than 8 bits.

So my suggestion is the following simple patch (for current upstream
edk2, to be backported):

-------[cut here]-------
diff --git a/UefiCpuPkg/Library/MpInitLib/MpLib.c b/UefiCpuPkg/Library/MpInitLib/MpLib.c
index e5dc852ed95f..3d2d41710d7c 100644
--- a/UefiCpuPkg/Library/MpInitLib/MpLib.c
+++ b/UefiCpuPkg/Library/MpInitLib/MpLib.c
@@ -526,7 +526,9 @@ CollectProcessorCount (
   //
   // Enable x2APIC mode if
   //  1. Number of CPU is greater than 255; or
-  //  2. There are any logical processors reporting an Initial APIC ID of 255 or greater.
+  //  2. The platform exposed the exact *boot* CPU count to us in advance, and
+  //     more than 255 logical processors are possible later, with hotplug
+  //  3. There are any logical processors reporting an Initial APIC ID of 255 or greater.
   //
   X2Apic = FALSE;
   if (CpuMpData->CpuCount > 255) {
@@ -534,6 +536,10 @@ CollectProcessorCount (
     // If there are more than 255 processor found, force to enable X2APIC
     //
     X2Apic = TRUE;
+  } else if ((PcdGet32 (PcdCpuBootLogicalProcessorNumber) > 0) &&
+             (PcdGet32 (PcdCpuMaxLogicalProcessorNumber) > 255))
+  {
+    X2Apic = TRUE;
   } else {
     CpuInfoInHob = (CPU_INFO_IN_HOB *)(UINTN)CpuMpData->CpuInfoInHob;
     for (Index = 0; Index < CpuMpData->CpuCount; Index++) {
-------[cut here]-------

The reason I'm *not only* checking "PcdCpuMaxLogicalProcessorNumber" in
the new branch is that in the platform description (DSC) files for
various physical platforms, "PcdCpuMaxLogicalProcessorNumber" may be set
to 256 or more, *even though* the actual shipped platforms will never
see such a high CPU count. This is perfectly valid for a physical
platform configuration, and we shouldn't change their behavior by
forcing X2APIC mode for them. The restriction with
"PcdCpuBootLogicalProcessorNumber" is appropriate because (effectively)
only virtual platforms are capable of advertising their exact boot CPU
count, before CPU enumeration (via IPI broadcast) is actually performed.

For more background on this, refer to
<https://bugzilla.tianocore.org/show_bug.cgi?id=1515>, and the following
edk2 commit ranges:
- a7e2d20193e8..778832bcad33
- c8b8157e126a..83357313dd67

The gist is that CPU enumeration via [S]IPI broadcast is much more
deterministic on physical platforms than on virtual ones; however,
virtual platforms can expose both the boot-time and the possible (via
hotplug) CPU count in advance. Therefore the CPU enumeration code needed
a separate path for virtual platforms, and that was what
"PcdCpuBootLogicalProcessorNumber" was introduced for. In fact, it is
not a stretch to state that back then, I had missed the X2APIC mode
setting in CollectProcessorCount(), and we can now consider this patch
as a feature completion patch for CPU hotplug.

Now... should the upstream UefiCpuPkg reviewers prefer relaxing the
proposed condition from

  (PcdGet32 (PcdCpuBootLogicalProcessorNumber) > 0) &&
  (PcdGet32 (PcdCpuMaxLogicalProcessorNumber) > 255)

to just

  (PcdGet32 (PcdCpuMaxLogicalProcessorNumber) > 255)

then that would be fine for us too, of course. (But note that the
leading comment needs to be synched up then!)

Comment 12 Laszlo Ersek 2022-12-14 10:19:36 UTC
(In reply to Laszlo Ersek from comment #11)

> It does not consider the following cases:
> (a) more than 255 CPUs might appear in the system *later*, due to
>     hotplug
> (b) a CPU might appear in the system *later*, due to hotplug, that has a
>     "wide" APIC ID.
> 
> I think the simplest approach is to extend the logic to cover missing
> case (a), and to keep ignoring missing case (b). That's because APIC IDs
> are "compressed"; that is, it's really unlikely (I think) to see a
> "wide" APIC ID unless the number of possible CPUs actually *requires* a
> wide APIC ID. To put it differently: I have not proved it, but I *think*
> that it's not possible to create such a topology that does not have at
> least 256 *possible* CPUs but requires APIC IDs wider than 8 bits.

Well I can actually immediately refute (disprove) my above hypothesis; consider the following topology:

5 sockets * 17 (cores/socket) * 3 (threads/core) = 255 possible logical processors

Bit widths in the APIC ID:
- 5 sockets --> 3 bits to represent
- 17 cores --> 5 bits to represent
- 3 threads --> 2 bits to represent

In total: 3+5+2 = 10 bits, so my hypothesis does not hold.

However, this is a totally pathological CPU topology. It's not worth covering in my opinion. The patch I'm proposing above is simple and it is a pure improvement; it should cover the most common cases.

If we really wanted to catch case (b), then we'd have to add actual APIC ID width calculation logic to edk2, based on the topology (CPUID massaging I assume).
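
As a side note, the width calculation above can be written down
mechanically; this is a small self-contained sketch (helper names are
local to the sketch, not edk2 or QEMU identifiers) that reproduces the
3+5+2 = 10 bit result for the 5 x 17 x 3 topology:

-------[cut here]-------
#include <stdio.h>

/* ceil(log2(count)): APIC ID bits needed for one topology level */
static unsigned bits_for (unsigned count)
{
  unsigned bits = 0;

  while ((1u << bits) < count) {
    bits++;
  }
  return bits;
}

int main (void)
{
  unsigned sockets = 5, cores = 17, threads = 3;
  unsigned width   = bits_for (sockets) + bits_for (cores) + bits_for (threads);

  printf ("possible CPUs: %u, APIC ID width: %u bits\n",
          sockets * cores * threads, width);   /* prints 255 and 10 */
  return 0;
}
-------[cut here]-------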

Comment 13 Laszlo Ersek 2022-12-14 10:31:06 UTC
BTW I think we could interrogate QEMU's CPU hotplug interface for the specific APIC IDs of those CPUs that have *not* been hot-plugged yet, using command 3 (macro "QEMU_CPUHP_CMD_GET_ARCH_ID" in edk2, macro "CPHP_GET_CPU_ID_CMD" in QEMU, documentation in "docs/specs/acpi_cpu_hotplug.rst" in QEMU).

But that would be horribly complicated to wire up to "UefiCpuPkg/Library/MpInitLib" in edk2. In that case, calculating the APIC ID width based on CPU topology, using CPUID instructions, would be better (it's still simpler and it would cover physical platforms as well).
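
For completeness, a minimal user-space sketch (using GCC/Clang's <cpuid.h>;
not edk2 code) of the kind of data CPUID leaf 0Bh exposes: the per-level
shift widths and the running CPU's x2APIC ID. Deriving the width needed for
CPUs that are not plugged in yet would still require combining this with the
possible topology counts.

-------[cut here]-------
#include <stdio.h>
#include <cpuid.h>

int main (void)
{
  unsigned int eax, ebx, ecx, edx;
  unsigned int level;

  for (level = 0; ; level++) {
    if (!__get_cpuid_count (0x0B, level, &eax, &ebx, &ecx, &edx)) {
      break;                                 /* leaf 0Bh not supported */
    }
    unsigned int type = (ecx >> 8) & 0xFF;   /* 1 = SMT, 2 = core, 0 = end */
    if (type == 0) {
      break;
    }
    printf ("level %u: type %u, shift %u bits, x2APIC ID 0x%x\n",
            level, type, eax & 0x1F, edx);
  }
  return 0;
}
-------[cut here]-------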

Comment 14 Igor Mammedov 2022-12-14 10:51:17 UTC
case (b) is actually real for AMD CPUs, which can have an odd number of threads,
and that leads to a 'sparse' APIC ID in the AMD case (i.e. the APIC ID might use
more than 8 bits even if the number of logical CPUs is less than 255).

Comment 16 Igor Mammedov 2023-02-06 09:58:43 UTC
(In reply to Laszlo Ersek from comment #13)
> BTW I think we could interrogate QEMU's CPU hotplug interface for the
> specific APIC IDs of those CPUs that have *not* been hot-plugged yet, using
> command 3 (macro "QEMU_CPUHP_CMD_GET_ARCH_ID" in edk2, macro
> "CPHP_GET_CPU_ID_CMD" in QEMU, documentation in
> "docs/specs/acpi_cpu_hotplug.rst" in QEMU).
> 
> But that would be horribly complicated to wire up to
> "UefiCpuPkg/Library/MpInitLib" in edk2. In that case, calculating the APIC
> ID width based on CPU topology, using CPUID instructions, would be better
> (it's still simpler and it would cover physical platforms as well).

For SeaBIOS, QEMU passes the max APIC ID via fw_cfg, but I'd rather avoid that
in OVMF if it can be inferred by other means (either via hotplug enumeration
or via CPUID). The downside of the CPUID route is that it is CPU vendor dependent
(and far from trivial; just look at the topology mess we have in QEMU); however, as
Laszlo pointed out, it will work equally well on both virt and bare metal
(and that alone makes it worth the trouble).

Comment 17 Gerd Hoffmann 2023-03-07 16:24:49 UTC
https://edk2.groups.io/g/devel/message/100801

Comment 30 Xueqiang Wei 2023-08-11 15:55:26 UTC
QE bot(pre verify): Set 'Verified:Tested,SanityOnly' as gating/tier1 test pass.

Comment 34 liunana 2023-08-17 07:45:35 UTC
Test PASS after extending the time interval to 30s before checking hotpluggable CPUs via the qmp command line.

(1/1) Host_RHEL.m8.u9.product_rhel.ovmf.qcow2.virtio_scsi.up.virtio_net.Guest.RHEL.8.9.0.x86_64.io-github-autotest-qemu.cpu_device_hotplug_maximum.with_hugepages.max_socket.q35: PASS (15041.39 s)

Test Env
4.18.0-508.el8.x86_64
qemu-kvm-6.2.0-38.module+el8.9.0+19636+489b90af.x86_64
libvirt-client-8.0.0-22.module+el8.9.0+19544+b3045133.x86_64
lenovo-sr950-02.lab.eng.pek2.redhat.com
# rpm -qa | grep edk2
edk2-ovmf-20220126gitbb1bba3d77-6.el8.noarch
edk2-tools-20220126gitbb1bba3d77-6.el8.x86_64
edk2-tools-doc-20220126gitbb1bba3d77-6.el8.noarch


Guest dmesg log -- hotplug:
2023-08-16 22:25:16: [  144.585032] smpboot: Booting Node 0 Processor 447 APIC 0x1bf
2023-08-16 22:25:16: [  144.594251] smpboot: CPU 447 Converting physical 447 to logical package 445
2023-08-16 22:25:16: [  144.596243] smpboot: CPU 447 Converting physical 0 to logical die 445
2023-08-16 22:25:17: [  144.618943] Will online and init hotplugged CPU: 447
......
2023-08-16 22:26:32: [    2.511655]  #447
2023-08-16 22:26:32: [    0.001000] smpboot: CPU 447 Converting physical 0 to logical die 447
2023-08-16 22:26:32: [    2.516412] smp: Brought up 1 node, 448 CPUs
2023-08-16 22:26:32: [    2.517008] smpboot: Max logical packages: 448
2023-08-16 22:26:32: [    2.517581] smpboot: Total of 448 processors activated (1966578.43 BogoMIPS)
2023-08-16 22:26:41: [    4.465386] node 0 deferred pages initialised in 1747ms



Guest dmesg log -- hot-unplug
2023-08-16 22:28:13: [   95.082952] smpboot: CPU 447 is now offline
2023-08-16 22:28:44: [  126.172122] IRQ 53: no longer affine to CPU446
2023-08-16 22:28:44: [  126.175128] smpboot: CPU 446 is now offline
......
2023-08-17 02:19:12: [13954.564103] IRQ 33: no longer affine to CPU1
2023-08-17 02:19:12: [13954.564901] IRQ 36: no longer affine to CPU1
2023-08-17 02:19:12: [13954.565701] IRQ 37: no longer affine to CPU1
2023-08-17 02:19:12: [13954.570790] smpboot: CPU 1 is now offline
2023-08-17 02:19:23: [13965.053242] sda2: Can't mount, would change RO state


Hi Gerd,

May I know what time interval is acceptable for hot-unplugging one vcpu successfully? Thanks.



Best regards
Nana

Comment 35 Xueqiang Wei 2023-08-17 08:27:44 UTC
Thank you Nana for the confirmation.

I also ran an edk2 test loop with edk2-20220126gitbb1bba3d77-6.el8; no new bug was found.
Job link: http://fileshare.hosts.qa.psi.pek2.redhat.com/pub/logs/rhel890_edk2-20220126gitbb1bba3d77-6.el8/results.html


According to Comment 32, Comment 33 and Comment 34, setting status to VERIFIED.
For the time interval issue, let's wait for Gerd's feedback. If it's a bug, we'll create another bug to track it, thanks.

Comment 36 Gerd Hoffmann 2023-08-22 09:19:14 UTC
> May I know what time interval is acceptable for hot-unplugging one vcpu
> successfully? Thanks.

When bringing a new vCPU online, some code must run on *all* CPUs, I think in
both the firmware (SMM setup) and the kernel.  So it is expected behavior that
the time needed to bring a hotplugged vCPU online goes up with the number of
vCPUs already present in the guest.

I don't have much experience with how much time this actually takes in practice.
30s looks relatively long to me.  One factor which could play a role here
is overcommit: if the number of guest vCPUs is higher than the number
of physical CPUs on the host, it is very likely that operations like this
are quite slow.

Comment 38 errata-xmlrpc 2023-11-14 15:26:06 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: edk2 security and bug fix update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2023:6919

