Bug 2117326

Summary: 5.19.0: dnf install hangs when system is under load
Product: Fedora
Reporter: Chris Murphy <bugzilla>
Component: kernel
Assignee: fedora-kernel-btrfs
Status: CLOSED INSUFFICIENT_DATA
QA Contact: Fedora Extras Quality Assurance <extras-qa>
Severity: unspecified
Priority: unspecified
Version: 37
CC: acaringi, adscvr, airlied, alciregi, bskeggs, hdegoede, hpa, jarodwilson, jglisse, jonathan, josef, kernel-maint, lgoncalv, linville, masami256, mchehab, ptalbert, steved
Target Milestone: ---   
Target Release: ---   
Hardware: Unspecified   
OS: Unspecified   
Last Closed: 2022-08-11 17:39:39 UTC
Type: Bug
Attachments:
  dnf stuck, 5.19 sysrq+w dmesg

Description Chris Murphy 2022-08-10 16:04:30 UTC
Created attachment 1904786 [details]
dnf stuck, 5.19 sysrq+w dmesg

Linux openqa-x86-worker05.iad2.fedoraproject.org 5.19.0-65.fc37.x86_64 #1 SMP PREEMPT_DYNAMIC Mon Aug 1 13:18:35 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux

Setup
btrfs raid10 on 8x plain partitions
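
The exact mkfs invocation isn't in the report; with eight plain partitions, a btrfs raid10 filesystem like this would typically be created along these lines (device names are placeholders):

  mkfs.btrfs -d raid10 -m raid10 /dev/sda3 /dev/sdb3 /dev/sdc3 /dev/sdd3 \
      /dev/sde3 /dev/sdf3 /dev/sdg3 /dev/sdh3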

Command
sudo dnf install pciutils

Reproducible:
About 1 in 3

It gets stuck at:
Running scriptlet: sg3_utils-1.46-3.fc36.x86_64   2/2 

ps aux shows the dnf process in D+ state, kill -9 does nothing, and strace shows nothing. The hang lasts at least 10 minutes; I didn't test beyond that.

sysrq+w attached
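
The report doesn't list the exact commands used to collect this; a typical sequence for confirming the uninterruptible-sleep state and dumping blocked tasks would be (sketch, not from the report):

  ps -o pid,stat,wchan:32,cmd -C dnf        # STAT column shows D+ (uninterruptible sleep)
  echo w | sudo tee /proc/sysrq-trigger     # equivalent of sysrq+w: dump blocked tasks to dmesg
  sudo dmesg | grep -A 40 'Show Blocked State'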


First call trace from the very long list:

[ 2268.057017] sysrq: Show Blocked State
[ 2268.057866] task:kworker/u97:11  state:D stack:    0 pid:  340 ppid:     2 flags:0x00004000
[ 2268.058361] Workqueue: writeback wb_workfn (flush-btrfs-1)
[ 2268.058825] Call Trace:
[ 2268.059261]  <TASK>
[ 2268.059692]  __schedule+0x335/0x1240
[ 2268.060145]  ? __blk_mq_sched_dispatch_requests+0xe0/0x130
[ 2268.060611]  schedule+0x4e/0xb0
[ 2268.061059]  io_schedule+0x42/0x70
[ 2268.061473]  blk_mq_get_tag+0x10c/0x290
[ 2268.061910]  ? dequeue_task_stop+0x70/0x70
[ 2268.062359]  __blk_mq_alloc_requests+0x16e/0x2a0
[ 2268.062797]  blk_mq_submit_bio+0x2a2/0x590
[ 2268.063226]  __submit_bio+0xf5/0x180
[ 2268.063660]  submit_bio_noacct_nocheck+0x1f9/0x2b0
[ 2268.064055]  btrfs_map_bio+0x170/0x410
[ 2268.064451]  btrfs_submit_data_bio+0x134/0x220
[ 2268.064859]  ? __mod_memcg_lruvec_state+0x93/0x110
[ 2268.065246]  submit_extent_page+0x17a/0x4b0
[ 2268.065637]  ? page_vma_mkclean_one.constprop.0+0x1b0/0x1b0
[ 2268.066018]  __extent_writepage_io.constprop.0+0x271/0x550
[ 2268.066363]  ? end_extent_writepage+0x100/0x100
[ 2268.066720]  ? writepage_delalloc+0x8a/0x180
[ 2268.067094]  __extent_writepage+0x115/0x490
[ 2268.067472]  extent_write_cache_pages+0x178/0x500
[ 2268.067889]  extent_writepages+0x60/0x140
[ 2268.068274]  do_writepages+0xac/0x1a0
[ 2268.068643]  __writeback_single_inode+0x3d/0x350
[ 2268.069022]  ? _raw_spin_lock+0x13/0x40
[ 2268.069419]  writeback_sb_inodes+0x1c5/0x460
[ 2268.069824]  __writeback_inodes_wb+0x4c/0xe0
[ 2268.070230]  wb_writeback+0x1c9/0x2a0
[ 2268.070622]  wb_workfn+0x298/0x490
[ 2268.070988]  process_one_work+0x1c7/0x380
[ 2268.071366]  worker_thread+0x4d/0x380
[ 2268.071775]  ? process_one_work+0x380/0x380
[ 2268.072179]  kthread+0xe9/0x110
[ 2268.072588]  ? kthread_complete_and_exit+0x20/0x20
[ 2268.073002]  ret_from_fork+0x22/0x30
[ 2268.073408]  </TASK>

Comment 1 Chris Murphy 2022-08-10 16:08:41 UTC
The busier the 30 qemu processes are, the more likely it is to happen; right now, with only 12 qemu workers running, the problem doesn't happen at all.

Comment 3 Chris Murphy 2022-08-11 17:39:39 UTC
Based on Qu's comment on the call traces, I'm pretty sure this is the same as, or a tangent of, bug 2009585. The difference in this case is that it triggers only when the system is already under load and dnf is then added, and if the main workload is killed, the problem resolves. I'm going to close this as insufficient data for now, and reopen it if the fix for bug 2009585 doesn't also fix this.