Bug 2224509

Summary: During hotplug 416 vcpus hit "watchdog: BUG: soft lockup - CPU#9 stuck for 44s! [swapper/9:0]" issues
Product: Red Hat Enterprise Linux 9
Component: kernel
kernel sub component: KVM
Version: 9.3
Hardware: Unspecified
OS: Unspecified
Status: CLOSED MIGRATED
Severity: unspecified
Priority: unspecified
Reporter: Yiqian Wei <yiwei>
Assignee: Igor Mammedov <imammedo>
QA Contact: Yiqian Wei <yiwei>
CC: akoutsou, atodorov, coli, imammedo, jen, jinzhao, nanliu, nilal, virt-maint, vkuznets, xuwei
Keywords: MigratedToJIRA
Target Milestone: rc
Flags: pm-rhel: mirror+
Target Release: ---
Last Closed: 2024-06-03 13:17:35 UTC
Type: Bug
Bug Depends On: 2118240

Description Yiqian Wei 2023-07-21 09:25:41 UTC
Description of problem:
The guest hits "watchdog: BUG: soft lockup - CPU#9 stuck for 44s! [swapper/9:0]" while hotplugging vCPUs.

Version-Release number of selected component (if applicable):
host version:
kernel-5.14.0-340.el9.x86_64
qemu-kvm-8.0.0-7.el9.x86_64
edk2-ovmf-20230524-2.el9.noarch
guest: rhel9.3.0

How reproducible:
2/2

Steps to Reproduce:
1. Boot the guest with the "cpus_hotplug.sh" script:
# sh cpus_hotplug.sh ovmf

2. After the cpus_hotplug.sh script completes, check the dmesg output in the guest:
# dmesg
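
A minimal sketch for picking soft-lockup reports out of the guest's dmesg output. The helper name find_soft_lockups is hypothetical, and the regular expression is modelled only on the message format shown in this report, so other kernel versions may need a different pattern.

```python
import re

# Pattern modelled on the line seen in this report, e.g.
# "[ 1150.754057] watchdog: BUG: soft lockup - CPU#9 stuck for 44s! [swapper/9:0]"
SOFT_LOCKUP_RE = re.compile(
    r"\[\s*(?P<ts>\d+\.\d+)\]\s+watchdog: BUG: soft lockup - "
    r"CPU#(?P<cpu>\d+) stuck for (?P<secs>\d+)s! \[(?P<task>[^\]]+)\]"
)


def find_soft_lockups(dmesg_text):
    """Return (timestamp, cpu, stuck_seconds, task) for each soft-lockup line."""
    hits = []
    for line in dmesg_text.splitlines():
        match = SOFT_LOCKUP_RE.search(line)
        if match:
            hits.append(
                (
                    float(match["ts"]),
                    int(match["cpu"]),
                    int(match["secs"]),
                    match["task"],
                )
            )
    return hits
```

The function takes the dmesg text as a string, for example the captured stdout of running dmesg in the guest; an empty result means no line matched the pattern, not necessarily that the guest is healthy.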

Actual results:
[ 1150.754057] watchdog: BUG: soft lockup - CPU#9 stuck for 44s! [swapper/9:0]
[ 1151.178612] Modules linked in: nft_fib_inet nft_fib_ipv4 nft_fib_ipv6 nft_fib nft_reject_inet nf_reject_ipv4 nf_reject_ipv6 nft_reject nft_ct nft_chain_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 ip_set rfkill nf_tables nfnetlink qrtr vfat fat intel_rapl_msr intel_rapl_common rapl iTCO_wdt iTCO_vendor_support i2c_i801 pcspkr lpc_ich i2c_smbus joydev fuse xfs libcrc32c bochs drm_vram_helper drm_kms_helper syscopyarea sysfillrect sysimgblt drm_ttm_helper ttm ahci libahci drm sd_mod sg nvme_tcp nvme_fabrics nvme crct10dif_pclmul crc32_pclmul nvme_core crc32c_intel nvme_common libata t10_pi ghash_clmulni_intel e1000e virtio_scsi serio_raw dm_multipath dm_mirror dm_region_hash dm_log dm_mod be2iscsi bnx2i cnic uio cxgb4i cxgb4 tls libcxgbi libcxgb qla4xxx iscsi_boot_sysfs iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi
[ 1151.234293] CPU: 9 PID: 0 Comm: swapper/9 Not tainted 5.14.0-340.el9.x86_64 #1
[ 1151.234762] Hardware name: Red Hat KVM/RHEL, BIOS edk2-20230524-2.el9 05/24/2023
[ 1151.235369] RIP: 0010:default_idle+0x10/0x20
[ 1151.235771] Code: 8b 04 25 40 ef 01 00 f0 80 60 02 df c3 cc cc cc cc 0f ae 38 eb bb 0f 1f 40 00 0f 1f 44 00 00 eb 07 0f 00 2d 3e 3c 47 00 fb f4 <c3> cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc 0f 1f 44 00 00 65
[ 1151.236764] RSP: 0018:ffffa81140113ed0 EFLAGS: 00000212
[ 1151.237040] RAX: ffffffff95197340 RBX: ffff92c200bb8000 RCX: ffff92c28c440e00
[ 1151.237402] RDX: 4000000000000000 RSI: 0000000000000009 RDI: 00000000025f488c
[ 1151.237833] RBP: 0000000000000000 R08: 00000100d51e74e7 R09: 00000000000486fd
[ 1151.238212] R10: 00000000000486fd R11: 0000000000000000 R12: 0000000000000000
[ 1151.259021] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
[ 1151.259440] FS:  0000000000000000(0000) GS:ffff92d139840000(0000) knlGS:0000000000000000
[ 1151.266655] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 1151.266971] CR2: 00007fe00c7d24f4 CR3: 000000022e610003 CR4: 00000000003706e0
[ 1151.267348] Call Trace:
[ 1151.267532]  <IRQ>
[ 1151.267675]  ? show_trace_log_lvl+0x1c4/0x2df
[ 1151.267918]  ? show_trace_log_lvl+0x1c4/0x2df
[ 1151.268156]  ? default_idle_call+0x33/0xe0
[ 1151.268386]  ? watchdog_timer_fn+0x1b2/0x210
[ 1151.268690]  ? lockup_detector_update_enable+0x50/0x50
[ 1151.269035]  ? __hrtimer_run_queues+0x12a/0x2c0
[ 1151.269358]  ? hrtimer_interrupt+0xfc/0x210
[ 1151.269605]  ? __sysvec_apic_timer_interrupt+0x5f/0x110
[ 1151.269880]  ? sysvec_apic_timer_interrupt+0x6d/0x90
[ 1151.270141]  </IRQ>
[ 1151.270278]  <TASK>
[ 1151.270432]  ? asm_sysvec_apic_timer_interrupt+0x16/0x20
[ 1151.270784]  ? mwait_idle+0x70/0x70
[ 1151.271050]  ? default_idle+0x10/0x20
[ 1151.271258]  default_idle_call+0x33/0xe0
[ 1151.271505]  cpuidle_idle_call+0x125/0x160
[ 1151.271797]  ? kvm_sched_clock_read+0x14/0x30
[ 1151.272102]  do_idle+0x78/0xe0
[ 1151.272335]  cpu_startup_entry+0x19/0x20
[ 1151.272575]  start_secondary+0x10d/0x130
[ 1151.272797]  secondary_startup_64_no_verify+0xe5/0xeb
[ 1151.273071]  </TASK>

Expected results: 
The guest does not hit "watchdog: BUG: soft lockup - CPU#9 stuck for 44s! [swapper/9:0]" during vCPU hotplug.

Additional info:

Comment 3 Yiqian Wei 2023-08-07 05:58:43 UTC
Can reproduce this bug with qemu-kvm-8.0.0-10.el9.x86_64

Steps to reproduce: Boot a q35 + ovmf guest with 448 vcpus
host version:
qemu-kvm-8.0.0-10.el9.x86_64
kernel-5.14.0-349.el9.x86_64
edk2-ovmf-20230524-2.el9.noarch
guest: rhel9.3.0

Comment 9 Igor Mammedov 2023-08-08 12:16:43 UTC
Yiqian Wei,

What host are you running this test on? (I'm interested in the lscpu command output.)

Please also provide lscpu from the guest. (I see mwait_idle in the trace, but I don't see an option in the script that enables the mwait feature.)

Comment 14 Igor Mammedov 2023-08-09 11:07:30 UTC
Hi Yiqian,

I see nothing suspicious about the host.
Let's try to narrow down the possible places to look at.

Can you try the following variants (only one thing at a time):
 1. try to find the maxcpus value at which the problem starts to appear
 2. try with 'idle=nomwait' on the guest's kernel command line
 3. try older (9.2) host/guest kernels
 4. try with older QEMU (from RHEL 9.2)
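
For variant 1, a threshold could be located with a binary search, provided the lockup reproduces at every vCPU count above some value. That monotonic behaviour is an assumption this report does not confirm. In the sketch below, smallest_failing_count and the reproduces callback are hypothetical names; the callback stands in for a manual or scripted test run at a given vCPU count.

```python
def smallest_failing_count(lo, hi, reproduces):
    """Return the smallest n in [lo, hi] for which reproduces(n) is True.

    Assumes reproduces() is monotonic: once it is True for some n, it is
    True for every larger n. Returns None if even hi does not reproduce.
    """
    if not reproduces(hi):
        return None
    while lo < hi:
        mid = (lo + hi) // 2
        if reproduces(mid):
            hi = mid  # the threshold is at mid or below
        else:
            lo = mid + 1  # the threshold is above mid
    return lo
```

With a range of 1 to 448 vCPUs this needs at most ten test runs after the initial check at the upper bound. If the lockup is intermittent, each run would have to be repeated before trusting a negative result.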

Comment 20 Igor Mammedov 2023-08-15 07:32:55 UTC
Hi Yiqian,


As Vitaly said, bisection is the next logical step to find the offending patch.
Can you lend me the machine you reproduce this issue on?

Comment 21 Yiqian Wei 2023-08-15 09:25:11 UTC
(In reply to Igor Mammedov from comment #14)
> Hi Yiqian,
> 
> I see nothing suspicious about the host.
> Let's try to narrow down the possible places to look at.
> 
> Can you try the following variants (only one thing at a time):
>  1. try to find the maxcpus value at which the problem starts to appear

It is not possible to pin down a specific CPU; the issue just occurs during the process of hotplugging 416 vCPUs.

>  2. try with 'idle=nomwait' in guest's kernel command line

With 'idle=nowait' on the guest's kernel command line, I can still reproduce this bug when hotplugging 416 vCPUs.

>  3. try older (9.2) host/guest kernels

Cannot reproduce this bug when hotplugging 416 vCPUs with kernel-5.14.0-284.28.1.el9_2.x86_64, but hit Bug 2118240 when rebooting after hotplugging 416 vCPUs.

>  4. try with older QEMU (from RHEL.9.2)

Can reproduce this bug with qemu-kvm-7.2.0-14.el9_2.4.x86_64

Comment 27 Igor Mammedov 2023-08-21 08:47:11 UTC
Due to Bug 2118240, it's hard to say definitively whether this is a separate issue.
Let's retest once the above dependency is fixed.

Comment 28 John Ferlan 2023-10-04 15:35:45 UTC
Igor -

This bug is in a strange state - if ON_QA I'd expect there to be an ITR or FixVersion set or the TestOnly keyword being set. My concern / fear especially because of the Jira transition is that this will be lost.  Can you elaborate a bit more or perhaps connect to https://issues.redhat.com/browse/RHEL-7173 (new Jira issue from the previous bug 2118240).  Maybe we just need to migrate this bug manually (hint, use the "MigratedToJira" keyword - a bot will do the rest, then link that to the other Jira issue).