Description of problem:
Hit "watchdog: BUG: soft lockup - CPU#9 stuck for 44s! [swapper/9:0]" while hotplugging vcpus.

Version-Release number of selected component (if applicable):
host version:
kernel-5.14.0-340.el9.x86_64
qemu-kvm-8.0.0-7.el9.x86_64
edk2-ovmf-20230524-2.el9.noarch
guest: rhel9.3.0

How reproducible:
2/2

Steps to Reproduce:
1. Boot the guest with the "cpus_hotplug.sh" script
# sh cpus_hotplug.sh ovmf
2. After the cpus_hotplug.sh script completes, check the dmesg output in the guest.
# dmesg

Actual results:
[ 1150.754057] watchdog: BUG: soft lockup - CPU#9 stuck for 44s! [swapper/9:0]
[ 1151.178612] Modules linked in: nft_fib_inet nft_fib_ipv4 nft_fib_ipv6 nft_fib nft_reject_inet nf_reject_ipv4 nf_reject_ipv6 nft_reject nft_ct nft_chain_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 ip_set rfkill nf_tables nfnetlink qrtr vfat fat intel_rapl_msr intel_rapl_common rapl iTCO_wdt iTCO_vendor_support i2c_i801 pcspkr lpc_ich i2c_smbus joydev fuse xfs libcrc32c bochs drm_vram_helper drm_kms_helper syscopyarea sysfillrect sysimgblt drm_ttm_helper ttm ahci libahci drm sd_mod sg nvme_tcp nvme_fabrics nvme crct10dif_pclmul crc32_pclmul nvme_core crc32c_intel nvme_common libata t10_pi ghash_clmulni_intel e1000e virtio_scsi serio_raw dm_multipath dm_mirror dm_region_hash dm_log dm_mod be2iscsi bnx2i cnic uio cxgb4i cxgb4 tls libcxgbi libcxgb qla4xxx iscsi_boot_sysfs iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi
[ 1151.234293] CPU: 9 PID: 0 Comm: swapper/9 Not tainted 5.14.0-340.el9.x86_64 #1
[ 1151.234762] Hardware name: Red Hat KVM/RHEL, BIOS edk2-20230524-2.el9 05/24/2023
[ 1151.235369] RIP: 0010:default_idle+0x10/0x20
[ 1151.235771] Code: 8b 04 25 40 ef 01 00 f0 80 60 02 df c3 cc cc cc cc 0f ae 38 eb bb 0f 1f 40 00 0f 1f 44 00 00 eb 07 0f 00 2d 3e 3c 47 00 fb f4 <c3> cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc 0f 1f 44 00 00 65
[ 1151.236764] RSP: 0018:ffffa81140113ed0 EFLAGS: 00000212
[ 1151.237040] RAX: ffffffff95197340 RBX: ffff92c200bb8000 RCX: ffff92c28c440e00
[ 1151.237402] RDX: 4000000000000000 RSI: 0000000000000009 RDI: 00000000025f488c
[ 1151.237833] RBP: 0000000000000000 R08: 00000100d51e74e7 R09: 00000000000486fd
[ 1151.238212] R10: 00000000000486fd R11: 0000000000000000 R12: 0000000000000000
[ 1151.259021] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
[ 1151.259440] FS: 0000000000000000(0000) GS:ffff92d139840000(0000) knlGS:0000000000000000
[ 1151.266655] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 1151.266971] CR2: 00007fe00c7d24f4 CR3: 000000022e610003 CR4: 00000000003706e0
[ 1151.267348] Call Trace:
[ 1151.267532] <IRQ>
[ 1151.267675] ? show_trace_log_lvl+0x1c4/0x2df
[ 1151.267918] ? show_trace_log_lvl+0x1c4/0x2df
[ 1151.268156] ? default_idle_call+0x33/0xe0
[ 1151.268386] ? watchdog_timer_fn+0x1b2/0x210
[ 1151.268690] ? lockup_detector_update_enable+0x50/0x50
[ 1151.269035] ? __hrtimer_run_queues+0x12a/0x2c0
[ 1151.269358] ? hrtimer_interrupt+0xfc/0x210
[ 1151.269605] ? __sysvec_apic_timer_interrupt+0x5f/0x110
[ 1151.269880] ? sysvec_apic_timer_interrupt+0x6d/0x90
[ 1151.270141] </IRQ>
[ 1151.270278] <TASK>
[ 1151.270432] ? asm_sysvec_apic_timer_interrupt+0x16/0x20
[ 1151.270784] ? mwait_idle+0x70/0x70
[ 1151.271050] ? default_idle+0x10/0x20
[ 1151.271258] default_idle_call+0x33/0xe0
[ 1151.271505] cpuidle_idle_call+0x125/0x160
[ 1151.271797] ? kvm_sched_clock_read+0x14/0x30
[ 1151.272102] do_idle+0x78/0xe0
[ 1151.272335] cpu_startup_entry+0x19/0x20
[ 1151.272575] start_secondary+0x10d/0x130
[ 1151.272797] secondary_startup_64_no_verify+0xe5/0xeb
[ 1151.273071] </TASK>

Expected results:
The guest does not hit "watchdog: BUG: soft lockup - CPU#9 stuck for 44s! [swapper/9:0]".

Additional info:
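When the reproducer is run repeatedly, it can help to extract the soft lockup lines from the guest's dmesg output mechanically instead of reading the log by eye. The following is a minimal sketch, assuming the watchdog message format shown in the actual results above; the names `SoftLockup` and `find_soft_lockups` are hypothetical helpers, not part of any existing tool.

```python
import re
from typing import List, NamedTuple

# Matches lines such as:
# [ 1150.754057] watchdog: BUG: soft lockup - CPU#9 stuck for 44s! [swapper/9:0]
SOFT_LOCKUP_RE = re.compile(
    r"\[\s*(?P<ts>\d+\.\d+)\]\s+watchdog: BUG: soft lockup - "
    r"CPU#(?P<cpu>\d+) stuck for (?P<secs>\d+)s! \[(?P<comm>[^\]]+)\]"
)


class SoftLockup(NamedTuple):
    """One soft lockup report (hypothetical record type for this sketch)."""
    timestamp: float      # seconds since guest boot, from the dmesg prefix
    cpu: int              # CPU number reported as stuck
    stuck_seconds: int    # duration reported by the watchdog
    comm: str             # task name and pid, e.g. "swapper/9:0"


def find_soft_lockups(dmesg_text: str) -> List[SoftLockup]:
    """Return every soft lockup report found in the given dmesg text."""
    return [
        SoftLockup(float(m["ts"]), int(m["cpu"]), int(m["secs"]), m["comm"])
        for m in SOFT_LOCKUP_RE.finditer(dmesg_text)
    ]
```

In the guest, the input could be the captured output of `dmesg`; an empty result list would mean no soft lockup was reported in that run.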
Can reproduce this bug with qemu-kvm-8.0.0-10.el9.x86_64

Reproduce steps:
Boot a q35 + ovmf guest with 448 vcpus

host version:
qemu-kvm-8.0.0-10.el9.x86_64
kernel-5.14.0-349.el9.x86_64
edk2-ovmf-20230524-2.el9.noarch
guest: rhel9.3.0
Yiqian Wei, what host are you running this test on? (I'm interested in the lscpu command output.) Please also provide lscpu from the guest: I see mwait_idle in the trace, but I don't see an option in the script that enables the mwait feature.
Hi Yiqian,

I see nothing suspicious about the host.
Let's try to narrow down the possible places to look at.

Can you try the following variants (only one thing at a time):
1. try to find the maxcpus at which the problem starts to appear
2. try with 'idle=nomwait' in the guest's kernel command line
3. try older (9.2) host/guest kernels
4. try with older QEMU (from RHEL.9.2)
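Variant 1 can be automated as a binary search over the vcpu count instead of trying values by hand. The following is a minimal sketch, assuming the problem is monotonic (if it reproduces at N vcpus it also reproduces at every larger count), which has not been confirmed for this bug. The `reproduces` callback is hypothetical: it stands in for whatever boots the guest, hotplugs up to the given count, and checks dmesg.

```python
from typing import Callable, Optional


def smallest_failing_vcpus(
    reproduces: Callable[[int], bool], low: int, high: int
) -> Optional[int]:
    """Return the smallest count in [low, high] for which reproduces(count)
    is True, or None if even `high` does not reproduce.

    Assumes monotonic behaviour: once a count reproduces the problem,
    every larger count does too.
    """
    if not reproduces(high):
        return None
    while low < high:
        mid = (low + high) // 2
        if reproduces(mid):
            high = mid          # problem seen: threshold is at mid or below
        else:
            low = mid + 1       # problem not seen: threshold is above mid
    return low


# Usage with a stand-in predicate; a real one would run the reproducer.
# The threshold 300 is made up purely for illustration.
if __name__ == "__main__":
    print(smallest_failing_vcpus(lambda n: n >= 300, 1, 448))
```

Because the reproduction rate may be below 100% at some counts, a real `reproduces` callback would probably need to run each count more than once before reporting that the problem was not seen.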
Hi Yiqian,

Like Vitaly said, bisection is the next logical step to find the offending patch.
Can you lend me the machine you reproduce this issue on?
(In reply to Igor Mammedov from comment #14)
> Hi Yiqian,
>
> I see nothing suspicious about the host.
> Let's try to narrow down the possible places to look at.
>
> Can you try the following variants (only one thing at a time):
> 1. try to find the maxcpus at which the problem starts to appear

It is not possible to determine the specific cpu; the problem only shows up during the process of hotplugging 416 vcpus.

> 2. try with 'idle=nomwait' in the guest's kernel command line

With 'idle=nowait' in the guest's kernel command line, I can still reproduce this bug when hotplugging 416 vcpus.

> 3. try older (9.2) host/guest kernels

Cannot reproduce this bug when hotplugging 416 vcpus with kernel-5.14.0-284.28.1.el9_2.x86_64, but hit Bug 2118240 on reboot after hotplugging 416 vcpus.

> 4. try with older QEMU (from RHEL.9.2)

Can reproduce this bug with qemu-kvm-7.2.0-14.el9_2.4.x86_64
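
The request spells the option 'idle=nomwait' while the reply spells it 'idle=nowait', so it may be worth confirming what the guest actually booted with. The following is a minimal sketch for checking a kernel command line string; `idle_param` is a hypothetical helper, and the set of accepted values is taken from the x86 `idle=` entry in the kernel parameters documentation (`poll`, `halt`, `nomwait`).

```python
from typing import Optional

# Values documented for the x86 "idle=" kernel parameter.
KNOWN_IDLE_VALUES = {"poll", "halt", "nomwait"}


def idle_param(cmdline: str) -> Optional[str]:
    """Return the value of the last idle= option on the command line,
    or None if the option is not present."""
    value = None
    for token in cmdline.split():
        if token.startswith("idle="):
            value = token[len("idle="):]
    return value


# Usage in the guest (hypothetical): read /proc/cmdline and check the value.
if __name__ == "__main__":
    with open("/proc/cmdline") as f:
        value = idle_param(f.read())
    if value is None:
        print("idle= is not set")
    elif value in KNOWN_IDLE_VALUES:
        print(f"idle={value}")
    else:
        print(f"idle={value} is not a documented value")
```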