Bug 2172167

Summary: [Windows][QEMU7.2] Windows 2022 vm gets BSOD 'PROCESSOR_START_TIMEOUT (1df)' after hotplugging max cores with with_hugepages sometimes
Product: Red Hat Enterprise Linux 9 Reporter: liunana <nanliu>
Component: qemu-kvmAssignee: Yvugenfi <yvugenfi>
qemu-kvm sub component: Devices QA Contact: liunana <nanliu>
Status: CLOSED MIGRATED Docs Contact:
Severity: high    
Priority: unspecified CC: aodaki, chayang, coli, jinzhao, mdean, virt-maint, yama, yvugenfi
Version: 9.2Keywords: CustomerScenariosInitiative, MigratedToJIRA
Target Milestone: rcFlags: aodaki: needinfo? (nanliu)
Target Release: ---   
Hardware: x86_64   
OS: Windows   
Whiteboard:
Fixed In Version: Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2023-08-15 11:31:53 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:

Description liunana 2023-02-21 15:34:16 UTC
Description of problem:
Windows 2022 gets BSOD 'PROCESSOR_START_TIMEOUT (1df)' after hotplugging max cores with with_hugepages sometimes


Version-Release number of selected component (if applicable):
Host:
    5.14.0-268.el9.x86_64
    qemu-kvm-7.2.0-9.el9.x86_64
    edk2-ovmf-20221207gitfff6d81270b5-6.el9.noarch
    libvirt-9.0.0-6.el9.x86_64
    dell-per750-25.lab.eng.pek2.redhat.com (Icelake)
    swtpm-0.8.0-1.el9.x86_64
Guest: Win2022


How reproducible:
1/6


Steps to Reproduce:
1.  Set hugepages on host:
# mount -t hugetlbfs -o pagesize=2048K none /mnt/kvm_hugepage

2. Boot windows 2022 vm with ' -smp 1,maxcpus=48,cores=48,threads=1,dies=1,sockets=1', Full qemu command line [1]

3. hotplug max cores:
{'execute': 'device_add', 'arguments': OrderedDict([('id', 'vcpu47'), ('driver', 'Icelake-Server-x86_64-cpu'), ('socket-id', 0), ('die-id', 0), ('core-id', 47), ('thread-id', 0)]), 'id': 'kKNeBTZi'}

4. Check guest status.



Actual results:
Guest gets BSOD 'PROCESSOR_START_TIMEOUT (1df)'


Expected results:
Guest works well without any error.


Additional info:
[1] - full qemu command line.
/usr/libexec/qemu-kvm \
     -S  \
     -name 'avocado-vt-vm1'  \
     -sandbox on  \
     -blockdev '{"node-name": "file_ovmf_code", "driver": "file", "filename": "/usr/share/OVMF/OVMF_CODE.secboot.fd", "auto-read-only": true, "discard": "unmap"}' \
     -blockdev '{"node-name": "drive_ovmf_code", "driver": "raw", "read-only": true, "file": "file_ovmf_code"}' \
     -blockdev '{"node-name": "file_ovmf_vars", "driver": "file", "filename": "/root/avocado/data/avocado-vt/avocado-vt-vm1_win2022-64-virtio-scsi_qcow2_filesystem_VARS.fd", "auto-read-only": true, "discard": "unmap"}' \
     -blockdev '{"node-name": "drive_ovmf_vars", "driver": "raw", "read-only": false, "file": "file_ovmf_vars"}' \
     -machine q35,kernel-irqchip=split,memory-backend=mem-machine_mem,pflash0=drive_ovmf_code,pflash1=drive_ovmf_vars \
     -device '{"id": "pcie-root-port-0", "driver": "pcie-root-port", "multifunction": true, "bus": "pcie.0", "addr": "0x1", "chassis": 1}' \
     -device '{"id": "pcie-pci-bridge-0", "driver": "pcie-pci-bridge", "addr": "0x0", "bus": "pcie-root-port-0"}'  \
     -nodefaults \
     -device '{"intremap": "on", "device-iotlb": true, "driver": "intel-iommu"}' \
     -device '{"driver": "VGA", "bus": "pcie.0", "addr": "0x2"}' \
     -m 62464 \
     -object '{"size": 65498251264, "mem-path": "/mnt/kvm_hugepage", "id": "mem-machine_mem", "qom-type": "memory-backend-file"}'  \
     -smp 1,maxcpus=48,cores=48,threads=1,dies=1,sockets=1  \
     -cpu 'Icelake-Server',ds=on,ss=on,dtes64=on,vmx=on,pdcm=on,hypervisor=on,tsc-adjust=on,avx512ifma=on,sha-ni=on,rdpid=on,fsrm=on,md-clear=on,stibp=on,arch-capabilities=on,xsaves=on,ibpb=on,ibrs=on,amd-stibp=on,amd-ssbd=on,rdctl-no=on,ibrs-all=on,skip-l1dfl-vmentry=on,mds-no=on,pschange-mc-no=on,tsx-ctrl=on,hle=off,rtm=off,mpx=off,intel-pt=off,hv_stimer,hv_synic,hv_vpindex,hv_relaxed,hv_spinlocks=0x1fff,hv_vapic,hv_time,hv_frequencies,hv_runtime,hv_tlbflush,hv_reenlightenment,hv_stimer_direct,hv_ipi,kvm_pv_unhalt=on \
     -chardev socket,id=qmp_id_qmpmonitor1,server=on,path=/var/tmp/avocado_stf8vw0k/monitor-qmpmonitor1-20230220-130049-6JtlC9mT,wait=off  \
     -mon chardev=qmp_id_qmpmonitor1,mode=control \
     -chardev socket,id=qmp_id_catch_monitor,server=on,path=/var/tmp/avocado_stf8vw0k/monitor-catch_monitor-20230220-130049-6JtlC9mT,wait=off  \
     -mon chardev=qmp_id_catch_monitor,mode=control \
     -device '{"ioport": 1285, "driver": "pvpanic", "id": "idDCbx2r"}' \
     -chardev socket,id=chardev_serial0,server=on,path=/var/tmp/avocado_stf8vw0k/serial-serial0-20230220-130049-6JtlC9mT,wait=off \
     -device '{"id": "serial0", "driver": "isa-serial", "chardev": "chardev_serial0"}'  \
     -chardev socket,id=seabioslog_id_20230220-130049-6JtlC9mT,path=/var/tmp/avocado_stf8vw0k/seabios-20230220-130049-6JtlC9mT,server=on,wait=off \
     -device isa-debugcon,chardev=seabioslog_id_20230220-130049-6JtlC9mT,iobase=0x402 \
     -device '{"id": "pcie-root-port-1", "port": 1, "driver": "pcie-root-port", "addr": "0x1.0x1", "bus": "pcie.0", "chassis": 2}' \
     -device '{"driver": "qemu-xhci", "id": "usb1", "bus": "pcie-root-port-1", "addr": "0x0"}' \
     -device '{"driver": "usb-tablet", "id": "usb-tablet1", "bus": "usb1.0", "port": "1"}' \
     -device '{"id": "pcie-root-port-2", "port": 2, "driver": "pcie-root-port", "addr": "0x1.0x2", "bus": "pcie.0", "chassis": 3}' \
     -device '{"id": "virtio_scsi_pci0", "driver": "virtio-scsi-pci", "bus": "pcie-root-port-2", "addr": "0x0", "iommu_platform": true}' \
     -blockdev '{"node-name": "file_image1", "driver": "file", "auto-read-only": true, "discard": "unmap", "aio": "threads", "filename": "/home/kvm_autotest_root/images/win2022-64-virtio-scsi.qcow2", "cache": {"direct": true, "no-flush": false}}' \
     -blockdev '{"node-name": "drive_image1", "driver": "qcow2", "read-only": false, "cache": {"direct": true, "no-flush": false}, "file": "file_image1"}' \
     -device '{"driver": "scsi-hd", "id": "image1", "drive": "drive_image1", "write-cache": "on"}' \
     -device '{"id": "pcie-root-port-3", "port": 3, "driver": "pcie-root-port", "addr": "0x1.0x3", "bus": "pcie.0", "chassis": 4}' \
     -device '{"driver": "virtio-net-pci", "mac": "9a:c1:2d:9f:d9:b4", "id": "idU2n9WB", "netdev": "id01VFAQ", "bus": "pcie-root-port-3", "addr": "0x0", "iommu_platform": true}'  \
     -netdev tap,id=id01VFAQ,vhost=on \
     -blockdev '{"node-name": "file_cd1", "driver": "file", "auto-read-only": true, "discard": "unmap", "aio": "threads", "filename": "/home/kvm_autotest_root/iso/windows/winutils.iso", "cache": {"direct": true, "no-flush": false}}' \
     -blockdev '{"node-name": "drive_cd1", "driver": "raw", "read-only": true, "cache": {"direct": true, "no-flush": false}, "file": "file_cd1"}' \
     -device '{"driver": "scsi-cd", "id": "cd1", "drive": "drive_cd1", "write-cache": "on"}'  \
     -vnc :0  \
     -rtc base=localtime,clock=host,driftfix=slew  \
     -boot menu=off,order=cdn,once=c,strict=off \
     -chardev socket,id=char_vtpm_avocado-vt-vm1_tpm0,path=/root/avocado/data/avocado-vt/swtpm/avocado-vt-vm1_tpm0_swtpm.sock \
     -tpmdev emulator,chardev=char_vtpm_avocado-vt-vm1_tpm0,id=emulator_vtpm_avocado-vt-vm1_tpm0 \
     -device '{"id": "tpm-crb_vtpm_avocado-vt-vm1_tpm0", "tpmdev": "emulator_vtpm_avocado-vt-vm1_tpm0", "driver": "tpm-crb"}' \
     -enable-kvm \
     -monitor stdio \

Comment 1 liunana 2023-02-21 15:40:22 UTC
Memory.dmp analysis logs:


Microsoft (R) Windows Debugger Version 10.0.22621.755 X86
Copyright (c) Microsoft Corporation. All rights reserved.


Loading Dump File [C:\Windows\Memory.dmp]
Kernel Bitmap Dump File: Kernel address space is available, User address space may not be available.


************* Path validation summary **************
Response                         Time (ms)     Location
Deferred                                       SRV*c:\symbols*http://msdl.microsoft.com/download/symbols
Symbol search path is: SRV*c:\symbols*http://msdl.microsoft.com/download/symbols
Executable search path is:
Unable to load image \SystemRoot\system32\ntoskrnl.exe, Win32 error 0n2
Windows 10 Kernel Version 20348 MP (3 procs) Free x64
Product: Server, suite: TerminalServer DataCenter SingleUserTS
Edition build lab: 20348.1.amd64fre.fe_release.210507-1500
Machine Name:
Kernel base = 0xfffff804`29419000 PsLoadedModuleList = 0xfffff804`2a04c940
Debug session time: Mon Feb 20 22:44:15.269 2023 (UTC + 8:00)
System Uptime: 0 days 0:01:28.053
Loading Kernel Symbols
...............................................................
..Page 1f73 not present in the dump file. Type ".hh dbgerr004" for details
..............................................................
.............
Loading User Symbols

Loading unloaded module list
..............
For analysis of this file, run !analyze -v
1: kd> !analyze -v
*******************************************************************************
*                                                                             *
*                        Bugcheck Analysis                                    *
*                                                                             *
*******************************************************************************

PROCESSOR_START_TIMEOUT (1df)
A processor failed to start in the allowed time.
Arguments:
Arg1: ffffad00334ee390, Virtual address of the processor state.
Arg2: 0000000000000003, Processor start state.
Arg3: 0000000000000003, NT processor number.
Arg4: 0000000000000003, Local unit ID for the processor.

Debugging Details:
------------------


KEY_VALUES_STRING: 1

    Key  : Analysis.CPU.mSec
    Value: 13046

    Key  : Analysis.DebugAnalysisManager
    Value: Create

    Key  : Analysis.Elapsed.mSec
    Value: 30734

    Key  : Analysis.Init.CPU.mSec
    Value: 5468

    Key  : Analysis.Init.Elapsed.mSec
    Value: 225216

    Key  : Analysis.Memory.CommitPeak.Mb
    Value: 95

    Key  : WER.OS.Branch
    Value: fe_release

    Key  : WER.OS.Timestamp
    Value: 2021-05-07T15:00:00Z

    Key  : WER.OS.Version
    Value: 10.0.20348.1


FILE_IN_CAB:  Memory.dmp

BUGCHECK_CODE:  1df

BUGCHECK_P1: ffffad00334ee390

BUGCHECK_P2: 3

BUGCHECK_P3: 3

BUGCHECK_P4: 3

BLACKBOXBSD: 1 (!blackboxbsd)


BLACKBOXNTFS: 1 (!blackboxntfs)


BLACKBOXPNP: 1 (!blackboxpnp)


BLACKBOXWINLOGON: 1

PROCESS_NAME:  System

LOCK_ADDRESS:  fffff8042a069340 -- (!locks fffff8042a069340)

Resource @ nt!PiEngineLock (0xfffff8042a069340)    Exclusively owned
    Contention Count = 5
    NumberOfExclusiveWaiters = 1
     Threads: ffffd68109fb5040-01<*>

     Threads Waiting On Exclusive Access:
              ffffd6810c1760c0
1 total locks

PNP_TRIAGE_DATA:
        Lock address  : 0xfffff8042a069340
        Thread Count  : 1
        Thread address: 0xffffd68109fb5040
        Thread wait   : 0x1603

STACK_TEXT:
ffffad00`334ee138 fffff804`29d5091c     : 00000000`000001df ffffad00`334ee390 00000000`00000003 00000000`00000003 : nt!KeBugCheckEx
ffffad00`334ee140 fffff804`29d515e9     : ffffad00`334eea70 00000000`00000000 00000000`00000000 00000000`00000000 : nt!KiStartDynamicProcessor+0x6ec
ffffad00`334ee9a0 fffff804`2ad8827a     : fffff804`2adbfdc8 ffffd681`0c3b0670 ffffd681`0c3b0670 ffffd681`09109400 : nt!KeStartDynamicProcessor+0x69
ffffad00`334ee9d0 fffff804`2ad5d473     : ffffd681`0e390000 00000000`00000003 ffffad00`00000003 ffffd681`0e39ae10 : ACPI!ACPIProcessorStartDeviceWorker+0x2adea
ffffad00`334eea60 fffff804`2ad51263     : ffffd681`0e39af28 ffffd681`0e39ae10 00000000`00000000 00000000`00000000 : ACPI!ACPIProcessorStartDevice+0x33
ffffad00`334eea90 fffff804`297ba207     : 00000000`00000007 ffffd681`0e39ae10 00000000`00000000 ffffd681`0c3b0670 : ACPI!ACPIDispatchIrp+0x253
ffffad00`334eeb10 fffff804`29e6c458     : ffffd681`0e39ae10 ffffd681`0c3b0670 ffffd681`0ed24968 00000000`00000000 : nt!IopfCallDriver+0x53
ffffad00`334eeb50 fffff804`298552cb     : ffffd681`0ed24830 ffffd681`0e39ae10 ffff9a81`00000001 ffffd681`0ebe1e60 : nt!IovCallDriver+0x264
ffffad00`334eeb90 fffff804`2abbe648     : ffffd681`0ed24830 00000000`00000000 00000000`0000000e ffffad00`334eecf0 : nt!IofCallDriver+0xe77bb
ffffad00`334eebd0 fffff804`2abc2d11     : ffffd681`0e39ae10 00000000`00000106 ffffad00`334eecf0 00000000`00000106 : Wdf01000!FxPkgFdo::PnpSendStartDeviceDownTheStackOverload+0x198 [minkernel\wdf\framework\shared\irphandlers\pnp\fxpkgfdo.cpp @ 1066]
ffffad00`334eec50 fffff804`2abb8c13     : 00000000`00000106 00000000`00000105 ffffd681`0ed24830 ffffd681`0dcc35c0 : Wdf01000!FxPkgPnp::PnpEventInitStarting+0x11 [minkernel\wdf\framework\shared\irphandlers\pnp\pnpstatemachine.cpp @ 1328]
ffffad00`334eec80 fffff804`2abb8d21     : 00000000`000000a0 ffffd681`0ed24830 ffffd681`0ed24830 00000000`00000000 : Wdf01000!FxPkgPnp::PnpEnterNewState+0x103 [minkernel\wdf\framework\shared\irphandlers\pnp\pnpstatemachine.cpp @ 1234]
ffffad00`334eed10 fffff804`2abb8a8e     : ffffd681`0ed24830 ffffad00`334eee00 ffffd681`0ed24990 ffffd681`0ed24968 : Wdf01000!FxPkgPnp::PnpProcessEventInner+0xc5 [minkernel\wdf\framework\shared\irphandlers\pnp\pnpstatemachine.cpp @ 1152]
ffffad00`334eed80 fffff804`2abb887e     : 00000000`00000000 ffffd681`0ed24830 ffffad00`334eeea0 00000000`00000000 : Wdf01000!FxPkgPnp::PnpProcessEvent+0x11e [minkernel\wdf\framework\shared\irphandlers\pnp\pnpstatemachine.cpp @ 933]
ffffad00`334eee10 fffff804`2ab84ce9     : ffffd681`0ed24830 ffffad00`334eeea0 00000000`0000001b ffffd681`0e39ae10 : Wdf01000!FxPkgPnp::_PnpStartDevice+0x1e [minkernel\wdf\framework\shared\irphandlers\pnp\fxpkgpnp.cpp @ 2031]
ffffad00`334eee40 fffff804`2ab85d89     : ffffd681`0e39ae10 ffffd681`0ed24020 ffffd681`0ed25420 fffff804`29e7891e : Wdf01000!FxPkgPnp::Dispatch+0xd9 [minkernel\wdf\framework\shared\irphandlers\pnp\fxpkgpnp.cpp @ 771]
ffffad00`334eeeb0 fffff804`297ba207     : ffffd681`0e170120 ffffd681`0e873618 fffff804`2978d6f0 ffffd681`0e39ae10 : Wdf01000!FxDevice::DispatchWithLock+0x1f9 [minkernel\wdf\framework\shared\core\fxdevice.cpp @ 1447]
ffffad00`334eef10 fffff804`29e6c458     : ffffd681`0e39ae10 ffffd681`0ed25420 ffffd681`0c3b0670 fffff804`29e787c1 : nt!IopfCallDriver+0x53
ffffad00`334eef50 fffff804`298552cb     : ffffd681`0e39ae10 ffffad00`334ef0b0 ffffd681`0c3b0670 ffffd681`0e873560 : nt!IovCallDriver+0x264
ffffad00`334eef90 fffff804`29ac1242     : ffffd681`0c3b0670 ffffd681`0ec43bd0 00000000`00000007 00000000`00000000 : nt!IofCallDriver+0xe77bb
ffffad00`334eefd0 fffff804`296359b6     : ffffd681`0c3b0670 00000000`00000001 ffffd681`0ec43bd0 fffff804`296346a0 : nt!PnpAsynchronousCall+0xe6
ffffad00`334ef010 fffff804`29634854     : 00000000`00000000 ffffd681`0c3b0670 fffff804`2978d6f0 fffff804`2978d6f0 : nt!PnpSendIrp+0x9e
ffffad00`334ef080 fffff804`29bbd52b     : ffffd681`0ed3a340 00000000`00000000 ffffd681`0ec43bd0 ffffd681`0e170120 : nt!PnpStartDevice+0x88
ffffad00`334ef110 fffff804`29bbd199     : ffffd681`0ed3a340 ffffad00`334ef248 00000000`00000000 ffffd681`0ed3a340 : nt!PnpStartDeviceNode+0xef
ffffad00`334ef1a0 fffff804`29bc12b0     : ffffd681`0ed3a340 ffffd681`090ba9a0 ffffd681`00000000 00000000`00000001 : nt!PipProcessStartPhase1+0x5d
ffffad00`334ef1e0 fffff804`29bb7102     : ffffd681`0e170101 ffffad00`334ef301 ffffad00`334ef2e0 fffff804`00000002 : nt!PipProcessDevNodeTree+0x394
ffffad00`334ef290 fffff804`2978c25d     : 00000001`00000003 ffffd681`090ba9a0 ffffd681`0e170120 ffffd681`0e170120 : nt!PiProcessReenumeration+0x92
ffffad00`334ef2e0 fffff804`296808e1     : ffffd681`09fb5040 fffff804`2a156640 ffffd681`090aec60 ffffd681`00000000 : nt!PnpDeviceActionWorker+0x47d
ffffad00`334ef3a0 fffff804`29701345     : ffffd681`09fb5040 fffff804`28cf8000 ffffd681`09fb5040 72006500`64006c00 : nt!ExpWorkerThread+0x161
ffffad00`334ef5b0 fffff804`29832ef8     : fffff804`28c0b180 ffffd681`09fb5040 fffff804`297012f0 20006500`68007400 : nt!PspSystemThreadStartup+0x55
ffffad00`334ef600 00000000`00000000     : ffffad00`334f0000 ffffad00`334e9000 00000000`00000000 00000000`00000000 : nt!KiStartSystemThread+0x28


SYMBOL_NAME:  ACPI!ACPIProcessorStartDeviceWorker+2adea

MODULE_NAME: ACPI

IMAGE_NAME:  ACPI.sys

STACK_COMMAND:  .cxr; .ecxr ; kb

BUCKET_ID_FUNC_OFFSET:  2adea

FAILURE_BUCKET_ID:  0x1DF_VRF_ACPI!ACPIProcessorStartDeviceWorker

OS_VERSION:  10.0.20348.1

BUILDLAB_STR:  fe_release

OSPLATFORM_TYPE:  x64

OSNAME:  Windows 10

FAILURE_ID_HASH:  {8aaee8cf-9dcb-f04e-4e62-6eed422c50e8}

Followup:     MachineOwner
---------

1: kd>
Implicit thread is now ffffd681`09fb5040
1: kd>
Implicit thread is now ffffd681`09fb5040
~

Comment 2 liunana 2023-02-21 15:41:49 UTC
Will check if RHEL vm has a chance to reproduce this issue later.

Comment 3 liunana 2023-02-22 06:18:44 UTC
Didn't hit any error with ovmf rhel9.2 vm after repeat test 12 times.

And I can't be sure if this is a regression issue since the recurrence rate is very low.

Comment 4 John Ferlan 2023-02-22 12:43:27 UTC
Meirav - can someone from the Windows team look at this since it seems non reproducible using a Linux guest.  May need to alter the component/subcomponent too.

Comment 6 Yvugenfi@redhat.com 2023-03-27 11:26:24 UTC
1. Can you please upload the crash dump file?
2. What is the amount of physical CPUs on the host?

Thank you.

Comment 8 Akihiko Odaki 2023-06-30 04:41:22 UTC
Hi,

Thank you for memory dump. I had a look at it, and found apparently it's starting the 19th core, but only 18 cores are listed for WinDbg command: !running -i -t

Can you provide the memory dump *not* converted with elf2dmp? I'd like to know if QEMU also says the core being enabled is offline or not.

Also, I'm having a difficulty to reproduce this issue and have some questions.
- It seems you had to add 18 extra cores to trigger the bug for the memory dump. How many processors do you have to add to reproduce the issue reliably?
- Can you minimize the test case? In particular, remove all options for CPU (try just with -cpu Icelake-Server), extra disks, extra NICs, and TPM, and see if the issue can be reproduced.
- Can you update your Windows Server installation with Windows Update?

Comment 9 RHEL Program Management 2023-08-14 07:27:43 UTC
Issue migration from Bugzilla to Jira is in process at this time. This will be the last message in Jira copied from the Bugzilla bug.