Description of problem: Under heavy I/O load, occasionally (once every few days) a deadlock occurs in raid5_get_active_stripe function and a reboot is required. Jul 24 11:36:11 daemon kernel: INFO: task xfsaild/md1:794 blocked for more than 122 seconds. Jul 24 11:36:11 daemon kernel: Tainted: G I -------- --- 5.14.0-284.18.1.el9_2.x86_64 #1 Jul 24 11:36:11 daemon kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. Jul 24 11:36:11 daemon kernel: task:xfsaild/md1 state:D stack: 0 pid: 794 ppid: 2 flags:0x00004000 Jul 24 11:36:11 daemon kernel: Call Trace: Jul 24 11:36:11 daemon kernel: <TASK> Jul 24 11:36:11 daemon kernel: __schedule+0x248/0x620 Jul 24 11:36:11 daemon kernel: ? __blk_flush_plug+0xdb/0x160 Jul 24 11:36:11 daemon kernel: schedule+0x5a/0xc0 Jul 24 11:36:11 daemon kernel: raid5_get_active_stripe+0x25a/0x2f0 [raid456] Jul 24 11:36:11 daemon kernel: ? cpuacct_percpu_seq_show+0x10/0x10 Jul 24 11:36:11 daemon kernel: make_stripe_request+0x9b/0x490 [raid456] Jul 24 11:36:11 daemon kernel: raid5_make_request+0x16f/0x3e0 [raid456] Jul 24 11:36:11 daemon kernel: ? sched_show_numa+0xf0/0xf0 Jul 24 11:36:11 daemon kernel: md_handle_request+0x135/0x1e0 Jul 24 11:36:11 daemon kernel: ? bio_split_to_limits+0x51/0x90 Jul 24 11:36:11 daemon kernel: __submit_bio+0x89/0x130 Jul 24 11:36:11 daemon kernel: __submit_bio_noacct+0x81/0x1f0 Jul 24 11:36:11 daemon kernel: xfs_buf_ioapply_map+0x1cb/0x280 [xfs] Jul 24 11:36:11 daemon kernel: _xfs_buf_ioapply+0xcf/0x1b0 [xfs] Jul 24 11:36:11 daemon kernel: ? wake_up_q+0x90/0x90 Jul 24 11:36:11 daemon kernel: __xfs_buf_submit+0x6e/0x1e0 [xfs] Jul 24 11:36:11 daemon kernel: xfs_buf_delwri_submit_buffers+0xe9/0x230 [xfs] Jul 24 11:36:11 daemon kernel: xfsaild_push+0x176/0x6f0 [xfs] Jul 24 11:36:11 daemon kernel: ? del_timer_sync+0x67/0xb0 Jul 24 11:36:11 daemon kernel: xfsaild+0xa4/0x1e0 [xfs] Jul 24 11:36:11 daemon kernel: ? xfsaild_push+0x6f0/0x6f0 [xfs] Jul 24 11:36:11 daemon kernel: kthread+0xd9/0x100 Jul 24 11:36:11 daemon kernel: ? kthread_complete_and_exit+0x20/0x20 Jul 24 11:36:11 daemon kernel: ret_from_fork+0x22/0x30 Jul 24 11:36:11 daemon kernel: </TASK> Jul 24 11:36:11 daemon kernel: INFO: task rsync_bpc:5165 blocked for more than 122 seconds. Jul 24 11:36:11 daemon kernel: Tainted: G I -------- --- 5.14.0-284.18.1.el9_2.x86_64 #1 Jul 24 11:36:11 daemon kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. Jul 24 11:36:11 daemon kernel: task:rsync_bpc state:D stack: 0 pid: 5165 ppid: 5076 flags:0x00004002 Jul 24 11:36:11 daemon kernel: Call Trace: Jul 24 11:36:11 daemon kernel: <TASK> Jul 24 11:36:11 daemon kernel: __schedule+0x248/0x620 Jul 24 11:36:11 daemon kernel: schedule+0x5a/0xc0 Jul 24 11:36:11 daemon kernel: raid5_get_active_stripe+0x25a/0x2f0 [raid456] Jul 24 11:36:11 daemon kernel: ? cpuacct_percpu_seq_show+0x10/0x10 Jul 24 11:36:11 daemon kernel: make_stripe_request+0x9b/0x490 [raid456] Jul 24 11:36:11 daemon kernel: raid5_make_request+0x16f/0x3e0 [raid456] Jul 24 11:36:11 daemon kernel: ? sched_show_numa+0xf0/0xf0 Jul 24 11:36:11 daemon kernel: md_handle_request+0x135/0x1e0 Jul 24 11:36:11 daemon kernel: ? bio_split_to_limits+0x51/0x90 Jul 24 11:36:11 daemon kernel: __submit_bio+0x89/0x130 Jul 24 11:36:11 daemon kernel: __submit_bio_noacct+0x81/0x1f0 Jul 24 11:36:11 daemon kernel: iomap_add_to_ioend+0x1f5/0x270 Jul 24 11:36:11 daemon kernel: iomap_writepage_map+0xe8/0x530 Jul 24 11:36:11 daemon kernel: write_cache_pages+0x17c/0x4c0 Jul 24 11:36:11 daemon kernel: ? iomap_writepage_map+0x530/0x530 Jul 24 11:36:11 daemon kernel: iomap_writepages+0x1c/0x40 Jul 24 11:36:11 daemon kernel: xfs_vm_writepages+0x7a/0xb0 [xfs] Jul 24 11:36:11 daemon kernel: do_writepages+0xcf/0x1d0 Jul 24 11:36:11 daemon kernel: ? wbc_detach_inode+0x13f/0x220 Jul 24 11:36:11 daemon kernel: filemap_fdatawrite_wbc+0x66/0x90 Jul 24 11:36:11 daemon kernel: filemap_write_and_wait_range+0x6f/0xf0 Jul 24 11:36:11 daemon kernel: xfs_setattr_size+0x267/0x390 [xfs] Jul 24 11:36:11 daemon kernel: xfs_vn_setattr+0x78/0x180 [xfs] Jul 24 11:36:11 daemon kernel: notify_change+0x34d/0x4e0 Jul 24 11:36:11 daemon kernel: ? do_truncate+0x7d/0xd0 Jul 24 11:36:11 daemon kernel: do_truncate+0x7d/0xd0 Jul 24 11:36:11 daemon kernel: do_sys_ftruncate+0x17d/0x1b0 Jul 24 11:36:11 daemon kernel: do_syscall_64+0x5c/0x90 Jul 24 11:36:11 daemon kernel: ? vfs_write+0x1ef/0x280 Jul 24 11:36:11 daemon kernel: ? syscall_exit_work+0x11a/0x150 Jul 24 11:36:11 daemon kernel: ? syscall_exit_to_user_mode+0x12/0x30 Jul 24 11:36:11 daemon kernel: ? do_syscall_64+0x69/0x90 Jul 24 11:36:11 daemon kernel: ? syscall_exit_work+0x11a/0x150 Jul 24 11:36:11 daemon kernel: ? syscall_exit_to_user_mode+0x12/0x30 Jul 24 11:36:11 daemon kernel: ? do_syscall_64+0x69/0x90 Jul 24 11:36:11 daemon kernel: ? sysvec_apic_timer_interrupt+0x3c/0x90 Jul 24 11:36:11 daemon kernel: entry_SYSCALL_64_after_hwframe+0x63/0xcd Jul 24 11:36:11 daemon kernel: RIP: 0033:0x7ff10554673b Jul 24 11:36:11 daemon kernel: RSP: 002b:00007ffc11d2e068 EFLAGS: 00000206 ORIG_RAX: 000000000000004d Jul 24 11:36:11 daemon kernel: RAX: ffffffffffffffda RBX: 0000000015dfd000 RCX: 00007ff10554673b Jul 24 11:36:11 daemon kernel: RDX: 0000000000000003 RSI: 0000000015dfd000 RDI: 0000000000000003 Jul 24 11:36:11 daemon kernel: RBP: 000055f649c02920 R08: 0000000000000000 R09: 00000000e124df27 Jul 24 11:36:11 daemon kernel: R10: 00000000c29e658e R11: 0000000000000206 R12: 0000000000000000 Jul 24 11:36:12 daemon kernel: R13: 0000000000000034 R14: 0000000000000004 R15: 000055f649c02ac0 Jul 24 11:36:12 daemon kernel: </TASK> Jul 24 11:36:12 daemon kernel: INFO: task kworker/u16:1:36563 blocked for more than 123 seconds. Jul 24 11:36:12 daemon kernel: Tainted: G I -------- --- 5.14.0-284.18.1.el9_2.x86_64 #1 Jul 24 11:36:12 daemon kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. Jul 24 11:36:12 daemon kernel: task:kworker/u16:1 state:D stack: 0 pid:36563 ppid: 2 flags:0x00004000 Jul 24 11:36:12 daemon kernel: Workqueue: writeback wb_workfn (flush-9:1) Jul 24 11:36:12 daemon kernel: Call Trace: Jul 24 11:36:12 daemon kernel: <TASK> Jul 24 11:36:12 daemon kernel: __schedule+0x248/0x620 Jul 24 11:36:12 daemon kernel: schedule+0x5a/0xc0 Jul 24 11:36:12 daemon kernel: raid5_get_active_stripe+0x25a/0x2f0 [raid456] Jul 24 11:36:12 daemon kernel: ? cpuacct_percpu_seq_show+0x10/0x10 Jul 24 11:36:12 daemon kernel: make_stripe_request+0x9b/0x490 [raid456] Jul 24 11:36:12 daemon kernel: ? bdev_start_io_acct+0x47/0x100 Jul 24 11:36:12 daemon kernel: raid5_make_request+0x16f/0x3e0 [raid456] Jul 24 11:36:12 daemon kernel: ? sched_show_numa+0xf0/0xf0 Jul 24 11:36:12 daemon kernel: md_handle_request+0x135/0x1e0 Jul 24 11:36:12 daemon kernel: ? bio_split_to_limits+0x51/0x90 Jul 24 11:36:12 daemon kernel: __submit_bio+0x89/0x130 Jul 24 11:36:12 daemon kernel: __submit_bio_noacct+0x81/0x1f0 Jul 24 11:36:12 daemon kernel: iomap_submit_ioend+0x4e/0x80 Jul 24 11:36:12 daemon kernel: xfs_vm_writepages+0x7a/0xb0 [xfs] Jul 24 11:36:12 daemon kernel: do_writepages+0xcf/0x1d0 Jul 24 11:36:12 daemon kernel: __writeback_single_inode+0x41/0x270 Jul 24 11:36:12 daemon kernel: writeback_sb_inodes+0x209/0x4a0 Jul 24 11:36:12 daemon kernel: __writeback_inodes_wb+0x4c/0xe0 Jul 24 11:36:12 daemon kernel: wb_writeback+0x1d7/0x2d0 Jul 24 11:36:12 daemon kernel: wb_do_writeback+0x22a/0x2b0 Jul 24 11:36:12 daemon kernel: wb_workfn+0x5e/0x290 Jul 24 11:36:12 daemon kernel: ? __update_idle_core+0x1b/0xb0 Jul 24 11:36:12 daemon kernel: ? __switch_to_asm+0x42/0x80 Jul 24 11:36:12 daemon kernel: ? finish_task_switch.isra.0+0x8c/0x2a0 Jul 24 11:36:12 daemon kernel: ? __schedule+0x250/0x620 Jul 24 11:36:12 daemon kernel: process_one_work+0x1e5/0x3b0 Jul 24 11:36:12 daemon kernel: worker_thread+0x50/0x3a0 Jul 24 11:36:12 daemon kernel: ? rescuer_thread+0x390/0x390 Jul 24 11:36:12 daemon kernel: kthread+0xd9/0x100 Jul 24 11:36:12 daemon kernel: ? kthread_complete_and_exit+0x20/0x20 Jul 24 11:36:12 daemon kernel: ret_from_fork+0x22/0x30 Jul 24 11:36:12 daemon kernel: </TASK> Version-Release number of selected component (if applicable): 5.14.0-284.18.1.el9_2.x86_64
If you are able to upgrade your kernel to a newer version. kernel-5.14.0-315.el9 has many new upstream commits.
Is it possible for you do to a 'mdadm -D /dev/mdXXX' and post the output? and/or a 'cat proc/mdstat' and post the outout?
# mdadm -D /dev/md1 /dev/md1: Version : 1.2 Creation Time : Tue Jul 18 10:57:59 2023 Raid Level : raid5 Array Size : 46877236224 (43.66 TiB 48.00 TB) Used Dev Size : 15625745408 (14.55 TiB 16.00 TB) Raid Devices : 4 Total Devices : 4 Persistence : Superblock is persistent Intent Bitmap : Internal Update Time : Sat Aug 5 23:11:46 2023 State : clean Active Devices : 4 Working Devices : 4 Failed Devices : 0 Spare Devices : 0 Layout : left-symmetric Chunk Size : 512K Consistency Policy : bitmap Name : daemon:1 (local to host daemon) UUID : 1ae59cb2:489e30b9:0e758c2c:2893b268 Events : 27300 Number Major Minor RaidDevice State 0 8 1 0 active sync /dev/sda1 1 8 33 1 active sync /dev/sdc1 2 8 49 2 active sync /dev/sdd1 4 8 17 3 active sync /dev/sdb1 # cat /proc/mdstat Personalities : [raid0] [raid6] [raid5] [raid4] md1 : active raid5 sdd1[2] sdc1[1] sdb1[4] sda1[0] 46877236224 blocks super 1.2 level 5, 512k chunk, algorithm 2 [4/4] [UUUU] bitmap: 0/117 pages [0KB], 65536KB chunk md0 : active raid0 sdg1[1] sdh1[2] sdf1[0] 11720653824 blocks super 1.2 512k chunks unused devices: <none>
I will try with the new kernel.