Created attachment 1904786
dnf stuck, 5.19 sysrq+w dmesg

Linux openqa-x86-worker05.iad2.fedoraproject.org 5.19.0-65.fc37.x86_64 #1 SMP PREEMPT_DYNAMIC Mon Aug 1 13:18:35 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux

Setup: btrfs raid10 on 8x plain partitions

Command: sudo dnf install pciutils

Reproducible: About 1 in 3

Gets stuck at:
  Running scriptlet: sg3_utils-1.46-3.fc36.x86_64 2/2

In ps aux the status for dnf is D+; kill -9 does nothing and strace shows nothing. The hang lasts at least 10 minutes; I didn't test beyond that.

sysrq+w attached.

First call trace of the very long list:

[ 2268.057017] sysrq: Show Blocked State
[ 2268.057866] task:kworker/u97:11 state:D stack: 0 pid: 340 ppid: 2 flags:0x00004000
[ 2268.058361] Workqueue: writeback wb_workfn (flush-btrfs-1)
[ 2268.058825] Call Trace:
[ 2268.059261] <TASK>
[ 2268.059692] __schedule+0x335/0x1240
[ 2268.060145] ? __blk_mq_sched_dispatch_requests+0xe0/0x130
[ 2268.060611] schedule+0x4e/0xb0
[ 2268.061059] io_schedule+0x42/0x70
[ 2268.061473] blk_mq_get_tag+0x10c/0x290
[ 2268.061910] ? dequeue_task_stop+0x70/0x70
[ 2268.062359] __blk_mq_alloc_requests+0x16e/0x2a0
[ 2268.062797] blk_mq_submit_bio+0x2a2/0x590
[ 2268.063226] __submit_bio+0xf5/0x180
[ 2268.063660] submit_bio_noacct_nocheck+0x1f9/0x2b0
[ 2268.064055] btrfs_map_bio+0x170/0x410
[ 2268.064451] btrfs_submit_data_bio+0x134/0x220
[ 2268.064859] ? __mod_memcg_lruvec_state+0x93/0x110
[ 2268.065246] submit_extent_page+0x17a/0x4b0
[ 2268.065637] ? page_vma_mkclean_one.constprop.0+0x1b0/0x1b0
[ 2268.066018] __extent_writepage_io.constprop.0+0x271/0x550
[ 2268.066363] ? end_extent_writepage+0x100/0x100
[ 2268.066720] ? writepage_delalloc+0x8a/0x180
[ 2268.067094] __extent_writepage+0x115/0x490
[ 2268.067472] extent_write_cache_pages+0x178/0x500
[ 2268.067889] extent_writepages+0x60/0x140
[ 2268.068274] do_writepages+0xac/0x1a0
[ 2268.068643] __writeback_single_inode+0x3d/0x350
[ 2268.069022] ? _raw_spin_lock+0x13/0x40
[ 2268.069419] writeback_sb_inodes+0x1c5/0x460
[ 2268.069824] __writeback_inodes_wb+0x4c/0xe0
[ 2268.070230] wb_writeback+0x1c9/0x2a0
[ 2268.070622] wb_workfn+0x298/0x490
[ 2268.070988] process_one_work+0x1c7/0x380
[ 2268.071366] worker_thread+0x4d/0x380
[ 2268.071775] ? process_one_work+0x380/0x380
[ 2268.072179] kthread+0xe9/0x110
[ 2268.072588] ? kthread_complete_and_exit+0x20/0x20
[ 2268.073002] ret_from_fork+0x22/0x30
[ 2268.073408] </TASK>
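Since the attached sysrq+w dump is very long, a small script can help summarize which tasks are blocked. The following is only a rough sketch and not part of the original report: the function name blocked_tasks and the regular expression are my own, and the sketch assumes task header lines look like the "task:... state:D ... pid: ... ppid: ..." line quoted above (task names containing spaces would not match).

```python
import re

# Matches sysrq+w task header lines such as:
#   task:kworker/u97:11 state:D stack: 0 pid: 340 ppid: 2 flags:0x00004000
# Assumption: the task name contains no whitespace.
TASK_RE = re.compile(
    r"task:(?P<comm>\S+)\s+state:(?P<state>\S+)\s+stack:\s*\d+\s+"
    r"pid:\s*(?P<pid>\d+)\s+ppid:\s*(?P<ppid>\d+)"
)


def blocked_tasks(dmesg_text):
    """Return a list of (task name, pid) for every task header in state D."""
    return [
        (m["comm"], int(m["pid"]))
        for m in TASK_RE.finditer(dmesg_text)
        if m["state"] == "D"
    ]


# Header line taken from the first call trace in this report.
SAMPLE = (
    "[ 2268.057866] task:kworker/u97:11 state:D stack: 0 "
    "pid: 340 ppid: 2 flags:0x00004000"
)

if __name__ == "__main__":
    for comm, pid in blocked_tasks(SAMPLE):
        print(comm, pid)
```

To run it against the full attachment, read the saved dmesg file into a string and pass it to blocked_tasks() in place of SAMPLE.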
It's more likely to happen the busier the 30 qemu processes are; right now with only 12 qemu workers running the problem doesn't happen at all.
Started an upstream thread https://lore.kernel.org/linux-btrfs/f7c14f0f-56e5-4748-a3f7-d44bc635b020@www.fastmail.com/T/#u
Based on Qu's comment on the call traces I'm pretty sure it's the same as, or closely related to, bug 2009585. The difference in this case is that it triggers only when the system is already under load and dnf is then run on top of that. If the main workload is killed, the problem resolves. I'm going to close this as insufficient data for now, and reopen if the fix for bug 2009585 doesn't also fix this.