2117326 – 5.19.0: dnf install hangs when system is under load

Keywords:
Status:	CLOSED INSUFFICIENT_DATA
Alias:	None
Product:	Fedora
Classification:	Fedora
Component:	kernel
Sub Component:
Version:	37
Hardware:	Unspecified
OS:	Unspecified
Priority:	unspecified
Severity:	unspecified
Target Milestone:	---
Assignee:	fedora-kernel-btrfs
QA Contact:	Fedora Extras Quality Assurance
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:

Reported:	2022-08-10 16:04 UTC by Chris Murphy
Modified:	2022-08-11 17:39 UTC
CC List:	18 users
Fixed In Version:
Clone Of:
Environment:
Last Closed:	2022-08-11 17:39:39 UTC
Type:	Bug
Embargoed:
Dependent Products:

Attachments
dnf stuck, 5.19 sysrq+w dmesg (461.49 KB, text/plain) 2022-08-10 16:04 UTC, Chris Murphy

Description Chris Murphy 2022-08-10 16:04:30 UTC

Created attachment 1904786
dnf stuck, 5.19 sysrq+w dmesg

Linux openqa-x86-worker05.iad2.fedoraproject.org 5.19.0-65.fc37.x86_64 #1 SMP PREEMPT_DYNAMIC Mon Aug 1 13:18:35 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux

Setup
btrfs raid10 on 8x plain partitions

Command
sudo dnf install pciutils

Reproducible:
About 1 in 3

Gets stuck at:
Running scriptlet: sg3_utils-1.46-3.fc36.x86_64   2/2 

ps aux shows dnf in state D+; kill -9 has no effect, and strace shows no output. The hang lasts at least 10 minutes; I didn't test beyond that.
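
For anyone trying to confirm the same symptom: D in the ps output is the uninterruptible sleep state, which is typically why kill -9 appears to do nothing until the task wakes up. Below is a minimal Python sketch, not part of the original report, that lists tasks currently in state D by reading /proc/<pid>/stat. The function names parse_stat_line and uninterruptible_tasks are made up for illustration.

```python
import os


def parse_stat_line(stat_line):
    """Return (pid, comm, state) from the contents of /proc/<pid>/stat."""
    # comm is wrapped in parentheses and may itself contain spaces or ')',
    # so split on the first '(' and the last ')'.
    open_paren = stat_line.index("(")
    close_paren = stat_line.rindex(")")
    pid = int(stat_line[:open_paren])
    comm = stat_line[open_paren + 1:close_paren]
    state = stat_line[close_paren + 1:].split()[0]
    return pid, comm, state


def uninterruptible_tasks(proc_root="/proc"):
    """Return [(pid, comm), ...] for every task whose state is 'D'."""
    tasks = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # not a process directory
        try:
            with open(os.path.join(proc_root, entry, "stat")) as f:
                pid, comm, state = parse_stat_line(f.read())
        except (OSError, ValueError, IndexError):
            continue  # process exited or the file was unreadable
        if state == "D":
            tasks.append((pid, comm))
    return sorted(tasks)
```

Calling uninterruptible_tasks() a few times during the hang should keep returning dnf (and likely some kworker threads) if the system is in the state described above.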

sysrq+w attached


First call trace of the very long list:

[ 2268.057017] sysrq: Show Blocked State
[ 2268.057866] task:kworker/u97:11  state:D stack:    0 pid:  340 ppid:     2 flags:0x00004000
[ 2268.058361] Workqueue: writeback wb_workfn (flush-btrfs-1)
[ 2268.058825] Call Trace:
[ 2268.059261]  <TASK>
[ 2268.059692]  __schedule+0x335/0x1240
[ 2268.060145]  ? __blk_mq_sched_dispatch_requests+0xe0/0x130
[ 2268.060611]  schedule+0x4e/0xb0
[ 2268.061059]  io_schedule+0x42/0x70
[ 2268.061473]  blk_mq_get_tag+0x10c/0x290
[ 2268.061910]  ? dequeue_task_stop+0x70/0x70
[ 2268.062359]  __blk_mq_alloc_requests+0x16e/0x2a0
[ 2268.062797]  blk_mq_submit_bio+0x2a2/0x590
[ 2268.063226]  __submit_bio+0xf5/0x180
[ 2268.063660]  submit_bio_noacct_nocheck+0x1f9/0x2b0
[ 2268.064055]  btrfs_map_bio+0x170/0x410
[ 2268.064451]  btrfs_submit_data_bio+0x134/0x220
[ 2268.064859]  ? __mod_memcg_lruvec_state+0x93/0x110
[ 2268.065246]  submit_extent_page+0x17a/0x4b0
[ 2268.065637]  ? page_vma_mkclean_one.constprop.0+0x1b0/0x1b0
[ 2268.066018]  __extent_writepage_io.constprop.0+0x271/0x550
[ 2268.066363]  ? end_extent_writepage+0x100/0x100
[ 2268.066720]  ? writepage_delalloc+0x8a/0x180
[ 2268.067094]  __extent_writepage+0x115/0x490
[ 2268.067472]  extent_write_cache_pages+0x178/0x500
[ 2268.067889]  extent_writepages+0x60/0x140
[ 2268.068274]  do_writepages+0xac/0x1a0
[ 2268.068643]  __writeback_single_inode+0x3d/0x350
[ 2268.069022]  ? _raw_spin_lock+0x13/0x40
[ 2268.069419]  writeback_sb_inodes+0x1c5/0x460
[ 2268.069824]  __writeback_inodes_wb+0x4c/0xe0
[ 2268.070230]  wb_writeback+0x1c9/0x2a0
[ 2268.070622]  wb_workfn+0x298/0x490
[ 2268.070988]  process_one_work+0x1c7/0x380
[ 2268.071366]  worker_thread+0x4d/0x380
[ 2268.071775]  ? process_one_work+0x380/0x380
[ 2268.072179]  kthread+0xe9/0x110
[ 2268.072588]  ? kthread_complete_and_exit+0x20/0x20
[ 2268.073002]  ret_from_fork+0x22/0x30
[ 2268.073408]  </TASK>
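
Since the attached dump is very long, a small script can help summarize it. The following is a rough Python sketch, not part of the original report, that walks a sysrq+w dmesg dump in the format shown above and records, for each blocked task, the first reliable frame that is not a generic scheduler function. The name blocked_tasks and the contents of SCHED_FRAMES are assumptions chosen for illustration; frames prefixed with "?" are skipped because the kernel marks them as unreliable.

```python
import re
from collections import Counter

# "task:<comm> state:<S> stack: <n> pid: <n> ..." header printed per task.
TASK_RE = re.compile(
    r"task:(?P<comm>.+?)\s+state:(?P<state>\S)\s.*?pid:\s*(?P<pid>\d+)"
)
# "[ <timestamp>]  [? ]symbol+0xoff/0xsize" stack frame line.
FRAME_RE = re.compile(
    r"\[\s*\d+\.\d+\]\s+(?P<unreliable>\? )?"
    r"(?P<sym>[\w.]+)\+0x[0-9a-f]+/0x[0-9a-f]+"
)
# Assumed list of frames that only say "the task is sleeping".
SCHED_FRAMES = {"__schedule", "schedule", "io_schedule", "schedule_timeout"}


def blocked_tasks(dmesg_text):
    """Return one dict per blocked task with its first informative frame."""
    tasks = []
    current = None
    for line in dmesg_text.splitlines():
        task = TASK_RE.search(line)
        if task:
            current = {
                "comm": task["comm"],
                "pid": int(task["pid"]),
                "state": task["state"],
                "wait_site": None,
            }
            tasks.append(current)
            continue
        frame = FRAME_RE.match(line.lstrip())
        if frame is None or current is None or current["wait_site"]:
            continue
        if frame["unreliable"] or frame["sym"] in SCHED_FRAMES:
            continue
        current["wait_site"] = frame["sym"]
    return tasks


def wait_site_histogram(tasks):
    """Count how many blocked tasks are waiting at each function."""
    return Counter(t["wait_site"] for t in tasks)
```

Run against the trace above, this reports the writeback kworker (pid 340) as waiting in blk_mq_get_tag, that is, sleeping while trying to obtain a block-layer request tag. A histogram over the full attachment would show whether most blocked tasks share that wait site.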

Comment 1 Chris Murphy 2022-08-10 16:08:41 UTC

The hang is more likely the busier the 30 qemu processes are; right now, with only 12 qemu workers running, it doesn't happen at all.

Comment 2 Chris Murphy 2022-08-10 16:26:37 UTC

Started an upstream thread:
https://lore.kernel.org/linux-btrfs/f7c14f0f-56e5-4748-a3f7-d44bc635b020@www.fastmail.com/T/#u

Comment 3 Chris Murphy 2022-08-11 17:39:39 UTC

Based on Qu's comment on the call traces, I'm pretty sure this is the same as, or closely related to, bug 2009585. The difference is that this case triggers only when the system is already under load and dnf is then run on top of that load; if the main workload is killed, the problem resolves. I'm going to close this as insufficient data for now, and will reopen it if the fix for bug 2009585 doesn't also fix this.
