Bug 2117326 - 5.19.0: dnf install hangs when system is under load
Summary: 5.19.0: dnf install hangs when system is under load
Keywords:
Status: CLOSED INSUFFICIENT_DATA
Alias: None
Product: Fedora
Classification: Fedora
Component: kernel
Version: 37
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: unspecified
Target Milestone: ---
Assignee: fedora-kernel-btrfs
QA Contact: Fedora Extras Quality Assurance
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2022-08-10 16:04 UTC by Chris Murphy
Modified: 2022-08-11 17:39 UTC
CC: 18 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2022-08-11 17:39:39 UTC
Type: Bug
Embargoed:


Attachments
dnf stuck, 5.19 sysrq+w dmesg (461.49 KB, text/plain)
2022-08-10 16:04 UTC, Chris Murphy

Description Chris Murphy 2022-08-10 16:04:30 UTC
Created attachment 1904786
dnf stuck, 5.19 sysrq+w dmesg

Linux openqa-x86-worker05.iad2.fedoraproject.org 5.19.0-65.fc37.x86_64 #1 SMP PREEMPT_DYNAMIC Mon Aug 1 13:18:35 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux

Setup
btrfs raid10 on 8x plain partitions
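
For context, an array like the one above could be created roughly as follows; the device names are placeholders, since the actual partitions are not listed in this report:

# hypothetical partition names; raid10 for both data and metadata across 8 devices
mkfs.btrfs -d raid10 -m raid10 /dev/sd{a,b,c,d,e,f,g,h}1
mount /dev/sda1 /mnt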

Command
sudo dnf install pciutils

Reproducible:
About 1 in 3
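
A rough sketch of re-running the failing command to hit the ~1-in-3 rate; only the dnf install command itself is from the report, the loop is illustrative:

# repeat install/remove while the qemu workload keeps the disks busy
for i in $(seq 1 10); do
    sudo dnf -y install pciutils
    sudo dnf -y remove pciutils
done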

Gets stuck at:
Running scriptlet: sg3_utils-1.46-3.fc36.x86_64   2/2 

ps aux shows dnf in state D+; kill -9 does nothing, and strace shows nothing. The hang lasts at least 10 minutes; I didn't test beyond that.

sysrq+w attached
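
For reference, a sketch of how the D+ state and the attached sysrq+w dump can be collected (standard procfs/sysrq interfaces; the exact commands are an assumption, not quoted from the report):

ps -o pid,stat,wchan:32,cmd -C dnf      # confirms dnf is in uninterruptible sleep (D+)
echo w | sudo tee /proc/sysrq-trigger   # ask the kernel to dump all blocked tasks
sudo dmesg > sysrq-w-dmesg.txt          # capture the resulting call traces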


First call trace from a very long list:

[ 2268.057017] sysrq: Show Blocked State
[ 2268.057866] task:kworker/u97:11  state:D stack:    0 pid:  340 ppid:     2 flags:0x00004000
[ 2268.058361] Workqueue: writeback wb_workfn (flush-btrfs-1)
[ 2268.058825] Call Trace:
[ 2268.059261]  <TASK>
[ 2268.059692]  __schedule+0x335/0x1240
[ 2268.060145]  ? __blk_mq_sched_dispatch_requests+0xe0/0x130
[ 2268.060611]  schedule+0x4e/0xb0
[ 2268.061059]  io_schedule+0x42/0x70
[ 2268.061473]  blk_mq_get_tag+0x10c/0x290
[ 2268.061910]  ? dequeue_task_stop+0x70/0x70
[ 2268.062359]  __blk_mq_alloc_requests+0x16e/0x2a0
[ 2268.062797]  blk_mq_submit_bio+0x2a2/0x590
[ 2268.063226]  __submit_bio+0xf5/0x180
[ 2268.063660]  submit_bio_noacct_nocheck+0x1f9/0x2b0
[ 2268.064055]  btrfs_map_bio+0x170/0x410
[ 2268.064451]  btrfs_submit_data_bio+0x134/0x220
[ 2268.064859]  ? __mod_memcg_lruvec_state+0x93/0x110
[ 2268.065246]  submit_extent_page+0x17a/0x4b0
[ 2268.065637]  ? page_vma_mkclean_one.constprop.0+0x1b0/0x1b0
[ 2268.066018]  __extent_writepage_io.constprop.0+0x271/0x550
[ 2268.066363]  ? end_extent_writepage+0x100/0x100
[ 2268.066720]  ? writepage_delalloc+0x8a/0x180
[ 2268.067094]  __extent_writepage+0x115/0x490
[ 2268.067472]  extent_write_cache_pages+0x178/0x500
[ 2268.067889]  extent_writepages+0x60/0x140
[ 2268.068274]  do_writepages+0xac/0x1a0
[ 2268.068643]  __writeback_single_inode+0x3d/0x350
[ 2268.069022]  ? _raw_spin_lock+0x13/0x40
[ 2268.069419]  writeback_sb_inodes+0x1c5/0x460
[ 2268.069824]  __writeback_inodes_wb+0x4c/0xe0
[ 2268.070230]  wb_writeback+0x1c9/0x2a0
[ 2268.070622]  wb_workfn+0x298/0x490
[ 2268.070988]  process_one_work+0x1c7/0x380
[ 2268.071366]  worker_thread+0x4d/0x380
[ 2268.071775]  ? process_one_work+0x380/0x380
[ 2268.072179]  kthread+0xe9/0x110
[ 2268.072588]  ? kthread_complete_and_exit+0x20/0x20
[ 2268.073002]  ret_from_fork+0x22/0x30
[ 2268.073408]  </TASK>

Comment 1 Chris Murphy 2022-08-10 16:08:41 UTC
It's more likely to happen the busier the 30 qemu processes are; right now with only 12 qemu workers running the problem doesn't happen at all.

Comment 3 Chris Murphy 2022-08-11 17:39:39 UTC
Based on Qu's comment on the call traces, I'm pretty sure this is the same as, or a tangent of, bug 2009585. The difference in this case is that it triggers only when the system is under load and dnf is added on top, and if the main workload is killed, the problem resolves. I'm going to close this as insufficient data for now and reopen if the fix for bug 2009585 doesn't also fix this.

