Bug 2224509 - During hotplug 416 vcpus hit "watchdog: BUG: soft lockup - CPU#9 stuck for 44s! [swapper/9:0]" issues [NEEDINFO]
Keywords:
Status: ON_QA
Alias: None
Product: Red Hat Enterprise Linux 9
Classification: Red Hat
Component: kernel
Version: 9.3
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: unspecified
Target Milestone: rc
Target Release: ---
Assignee: Igor Mammedov
QA Contact: Yiqian Wei
URL:
Whiteboard:
Depends On: 2118240
Blocks:
 
Reported: 2023-07-21 09:25 UTC by Yiqian Wei
Modified: 2023-11-30 15:34 UTC (History)

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed:
Type: Bug
Target Upstream Version:
Embargoed:
jferlan: needinfo? (imammedo)


Links
System                 ID               Private  Priority  Status  Summary  Last Updated
Red Hat Issue Tracker  RHEL-17714       0        None      None    None     2023-11-30 15:34:14 UTC
Red Hat Issue Tracker  RHELPLAN-162945  0        None      None    None     2023-07-21 09:27:38 UTC

Description Yiqian Wei 2023-07-21 09:25:41 UTC
Description of problem:
The guest hits "watchdog: BUG: soft lockup - CPU#9 stuck for 44s! [swapper/9:0]" while vCPUs are being hotplugged.

Version-Release number of selected component (if applicable):
host version:
kernel-5.14.0-340.el9.x86_64
qemu-kvm-8.0.0-7.el9.x86_64
edk2-ovmf-20230524-2.el9.noarch
guest: rhel9.3.0

How reproducible:
2/2

Steps to Reproduce:
1. Boot the guest with the "cpus_hotplug.sh" script:
# sh cpus_hotplug.sh ovmf

2. After the cpus_hotplug.sh script completes, check the dmesg output in the guest:
# dmesg

Actual results:
[ 1150.754057] watchdog: BUG: soft lockup - CPU#9 stuck for 44s! [swapper/9:0]
[ 1151.178612] Modules linked in: nft_fib_inet nft_fib_ipv4 nft_fib_ipv6 nft_fib nft_reject_inet nf_reject_ipv4 nf_reject_ipv6 nft_reject nft_ct nft_chain_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 ip_set rfkill nf_tables nfnetlink qrtr vfat fat intel_rapl_msr intel_rapl_common rapl iTCO_wdt iTCO_vendor_support i2c_i801 pcspkr lpc_ich i2c_smbus joydev fuse xfs libcrc32c bochs drm_vram_helper drm_kms_helper syscopyarea sysfillrect sysimgblt drm_ttm_helper ttm ahci libahci drm sd_mod sg nvme_tcp nvme_fabrics nvme crct10dif_pclmul crc32_pclmul nvme_core crc32c_intel nvme_common libata t10_pi ghash_clmulni_intel e1000e virtio_scsi serio_raw dm_multipath dm_mirror dm_region_hash dm_log dm_mod be2iscsi bnx2i cnic uio cxgb4i cxgb4 tls libcxgbi libcxgb qla4xxx iscsi_boot_sysfs iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi
[ 1151.234293] CPU: 9 PID: 0 Comm: swapper/9 Not tainted 5.14.0-340.el9.x86_64 #1
[ 1151.234762] Hardware name: Red Hat KVM/RHEL, BIOS edk2-20230524-2.el9 05/24/2023
[ 1151.235369] RIP: 0010:default_idle+0x10/0x20
[ 1151.235771] Code: 8b 04 25 40 ef 01 00 f0 80 60 02 df c3 cc cc cc cc 0f ae 38 eb bb 0f 1f 40 00 0f 1f 44 00 00 eb 07 0f 00 2d 3e 3c 47 00 fb f4 <c3> cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc 0f 1f 44 00 00 65
[ 1151.236764] RSP: 0018:ffffa81140113ed0 EFLAGS: 00000212
[ 1151.237040] RAX: ffffffff95197340 RBX: ffff92c200bb8000 RCX: ffff92c28c440e00
[ 1151.237402] RDX: 4000000000000000 RSI: 0000000000000009 RDI: 00000000025f488c
[ 1151.237833] RBP: 0000000000000000 R08: 00000100d51e74e7 R09: 00000000000486fd
[ 1151.238212] R10: 00000000000486fd R11: 0000000000000000 R12: 0000000000000000
[ 1151.259021] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
[ 1151.259440] FS:  0000000000000000(0000) GS:ffff92d139840000(0000) knlGS:0000000000000000
[ 1151.266655] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 1151.266971] CR2: 00007fe00c7d24f4 CR3: 000000022e610003 CR4: 00000000003706e0
[ 1151.267348] Call Trace:
[ 1151.267532]  <IRQ>
[ 1151.267675]  ? show_trace_log_lvl+0x1c4/0x2df
[ 1151.267918]  ? show_trace_log_lvl+0x1c4/0x2df
[ 1151.268156]  ? default_idle_call+0x33/0xe0
[ 1151.268386]  ? watchdog_timer_fn+0x1b2/0x210
[ 1151.268690]  ? lockup_detector_update_enable+0x50/0x50
[ 1151.269035]  ? __hrtimer_run_queues+0x12a/0x2c0
[ 1151.269358]  ? hrtimer_interrupt+0xfc/0x210
[ 1151.269605]  ? __sysvec_apic_timer_interrupt+0x5f/0x110
[ 1151.269880]  ? sysvec_apic_timer_interrupt+0x6d/0x90
[ 1151.270141]  </IRQ>
[ 1151.270278]  <TASK>
[ 1151.270432]  ? asm_sysvec_apic_timer_interrupt+0x16/0x20
[ 1151.270784]  ? mwait_idle+0x70/0x70
[ 1151.271050]  ? default_idle+0x10/0x20
[ 1151.271258]  default_idle_call+0x33/0xe0
[ 1151.271505]  cpuidle_idle_call+0x125/0x160
[ 1151.271797]  ? kvm_sched_clock_read+0x14/0x30
[ 1151.272102]  do_idle+0x78/0xe0
[ 1151.272335]  cpu_startup_entry+0x19/0x20
[ 1151.272575]  start_secondary+0x10d/0x130
[ 1151.272797]  secondary_startup_64_no_verify+0xe5/0xeb
[ 1151.273071]  </TASK>
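
For anyone re-running this test, the soft-lockup report can be picked out of the guest's dmesg output mechanically. The following is only a minimal sketch: the regular expression is modelled on the single watchdog line shown in the trace above, and the function name `find_soft_lockups` is hypothetical, not part of any existing tool.

```python
import re

# Pattern modelled on the watchdog line in the trace above, e.g.
# "[ 1150.754057] watchdog: BUG: soft lockup - CPU#9 stuck for 44s! [swapper/9:0]"
SOFT_LOCKUP = re.compile(
    r"\[\s*(?P<ts>\d+\.\d+)\]\s+watchdog: BUG: soft lockup - "
    r"CPU#(?P<cpu>\d+) stuck for (?P<secs>\d+)s! "
    r"\[(?P<task>.+):(?P<pid>\d+)\]"
)


def find_soft_lockups(dmesg_text):
    """Return one dict per soft-lockup report found in the given dmesg text."""
    hits = []
    for line in dmesg_text.splitlines():
        match = SOFT_LOCKUP.search(line)
        if match:
            hits.append({
                "timestamp": float(match.group("ts")),
                "cpu": int(match.group("cpu")),
                "stuck_seconds": int(match.group("secs")),
                "task": match.group("task"),
                "pid": int(match.group("pid")),
            })
    return hits


# The line reported in this bug, used as sample input.
sample = "[ 1150.754057] watchdog: BUG: soft lockup - CPU#9 stuck for 44s! [swapper/9:0]"
print(find_soft_lockups(sample))
```

In a live guest the input text could come from `subprocess.run(["dmesg"], capture_output=True, text=True).stdout`; an empty result list would correspond to the expected outcome of this test.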

Expected results: 
The guest does not hit "watchdog: BUG: soft lockup - CPU#9 stuck for 44s! [swapper/9:0]" during vCPU hotplug.

Additional info:

Comment 3 Yiqian Wei 2023-08-07 05:58:43 UTC
Can reproduce this bug with qemu-kvm-8.0.0-10.el9.x86_64

Steps to reproduce: boot a q35 + OVMF guest with 448 vCPUs.
host version:
qemu-kvm-8.0.0-10.el9.x86_64
kernel-5.14.0-349.el9.x86_64
edk2-ovmf-20230524-2.el9.noarch
guest: rhel9.3.0

Comment 9 Igor Mammedov 2023-08-08 12:16:43 UTC
Yiqian Wei,

What host are you running this test on? (I'm interested in the lscpu command output.)

Please also provide lscpu output from the guest. (I see mwait_idle in the trace, but I don't see an option in the script that enables the mwait feature.)

Comment 14 Igor Mammedov 2023-08-09 11:07:30 UTC
Hi Yiqian,

I see nothing suspicious about the host.
Let's try to narrow down the possible places to look at.

Can you try the following variants (only one thing at a time):
 1. try to find the maxcpus value at which the problem starts to appear
 2. try with 'idle=nomwait' in guest's kernel command line
 3. try older (9.2) host/guest kernels
 4. try with older QEMU (from RHEL.9.2)
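
When testing variant 2, it can help to confirm that the parameter actually reached the guest kernel before drawing conclusions. The sketch below is an illustration only: the helper name `cmdline_has` and the sample command line are hypothetical, and in a real guest the string would be read from /proc/cmdline.

```python
def cmdline_has(cmdline, key, value=None):
    """Return True if the kernel command line contains `key`, or `key=value`
    when a value is given. Quoted values containing spaces are not handled."""
    for token in cmdline.split():
        name, sep, val = token.partition("=")
        if name != key:
            continue
        if value is None or (sep == "=" and val == value):
            return True
    return False


# Hypothetical sample; in a guest use: cmdline = open("/proc/cmdline").read()
sample = "BOOT_IMAGE=/vmlinuz-5.14.0-340.el9.x86_64 root=/dev/mapper/rhel-root ro idle=nomwait"

print(cmdline_has(sample, "idle", "nomwait"))
```

A misspelled value does not match, so a typo in the boot entry shows up as False rather than as a silently ignored option.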

Comment 20 Igor Mammedov 2023-08-15 07:32:55 UTC
Hi Yiqian,


As Vitaly said, bisection is the next logical step toward finding the offending patch.
Can you lend me the machine you reproduce this issue on?
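
For reference, the bisection idea mentioned above can be sketched as a plain binary search. This is an illustration under assumptions, not the procedure actually used here: it presumes an ordered list of builds whose first entry is known good and whose last entry is known bad, and that every build after the first bad one is also bad. The build names and the `is_bad` predicate are hypothetical stand-ins for a manual reproduction run.

```python
def first_bad(candidates, is_bad):
    """Binary-search an ordered list for the first entry where is_bad() is true.

    Assumes candidates[0] is known good, candidates[-1] is known bad, and
    that once a build is bad every later build is bad as well.
    """
    lo, hi = 0, len(candidates) - 1  # lo: known good, hi: known bad
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if is_bad(candidates[mid]):
            hi = mid  # regression is at mid or earlier
        else:
            lo = mid  # regression is after mid
    return candidates[hi]


# Hypothetical build labels; a real run would boot each build and repeat the
# hotplug test instead of evaluating this made-up predicate.
builds = ["build-%d" % i for i in range(10)]
culprit = first_bad(builds, lambda b: int(b.split("-")[1]) >= 6)
print(culprit)
```

With N candidate builds this needs about log2(N) reproduction runs, which matters when each run involves hotplugging several hundred vCPUs.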

Comment 21 Yiqian Wei 2023-08-15 09:25:11 UTC
(In reply to Igor Mammedov from comment #14)
> Hi Yiqian,
> 
> I see nothing suspicious about the host.
> Let's try to narrow down the possible places to look at.
> 
> Can you try the following variants (only one thing at a time):
>  1. try to find the maxcpus value at which the problem starts to appear

It is not possible to determine the specific CPU; the problem just shows up during the process of hotplugging 416 vCPUs.

>  2. try with 'idle=nomwait' in guest's kernel command line

With 'idle=nomwait' in the guest's kernel command line, I can still reproduce this bug when hotplugging 416 vCPUs.

>  3. try older (9.2) host/guest kernels

Cannot reproduce this bug when hotplugging 416 vCPUs with kernel-5.14.0-284.28.1.el9_2.x86_64, but hit Bug 2118240 when rebooting after hotplugging 416 vCPUs.

>  4. try with older QEMU (from RHEL.9.2)

Can reproduce this bug with qemu-kvm-7.2.0-14.el9_2.4.x86_64

Comment 27 Igor Mammedov 2023-08-21 08:47:11 UTC
Due to Bug 2118240, it's hard to say definitively whether this is a separate issue.
Let's retest once the above dependency is fixed.

Comment 28 John Ferlan 2023-10-04 15:35:45 UTC
Igor -

This bug is in a strange state: if it is ON_QA, I'd expect there to be an ITR or FixVersion set, or the TestOnly keyword set. My concern, especially because of the Jira transition, is that this will be lost. Can you elaborate a bit more, or perhaps connect this to https://issues.redhat.com/browse/RHEL-7173 (the new Jira issue for the previous bug 2118240)? Maybe we just need to migrate this bug manually (hint: use the "MigratedToJira" keyword and a bot will do the rest, then link the result to the other Jira issue).

