Bug 2225162 - deadlock in raid5_get_active_stripe function [NEEDINFO]
Summary: deadlock in raid5_get_active_stripe function
Keywords:
Status: ASSIGNED
Alias: None
Product: Red Hat Enterprise Linux 9
Classification: Red Hat
Component: kernel
Version: 9.2
Hardware: x86_64
OS: Linux
unspecified
high
Target Milestone: rc
: ---
Assignee: Nigel Croxon
QA Contact: Storage QE
URL:
Whiteboard:
Depends On:
Blocks:
TreeView+ depends on / blocked
 
Reported: 2023-07-24 12:40 UTC by Daniel Koukola
Modified: 2023-08-17 10:51 UTC (History)
4 users (show)

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed:
Type: Bug
Target Upstream Version:
Embargoed:
ncroxon: needinfo? (dkoukola)


Attachments (Terms of Use)


Links
System ID Private Priority Status Summary Last Updated
Red Hat Issue Tracker RHELPLAN-163137 0 None None None 2023-07-24 12:42:11 UTC

Description Daniel Koukola 2023-07-24 12:40:08 UTC
Description of problem:

Under heavy I/O load, occasionally (once every few days) a deadlock occurs in raid5_get_active_stripe function and a reboot is required.

Jul 24 11:36:11 daemon kernel: INFO: task xfsaild/md1:794 blocked for more than 122 seconds.
Jul 24 11:36:11 daemon kernel:      Tainted: G          I      --------  ---  5.14.0-284.18.1.el9_2.x86_64 #1
Jul 24 11:36:11 daemon kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Jul 24 11:36:11 daemon kernel: task:xfsaild/md1     state:D stack:    0 pid:  794 ppid:     2 flags:0x00004000
Jul 24 11:36:11 daemon kernel: Call Trace:
Jul 24 11:36:11 daemon kernel: <TASK>
Jul 24 11:36:11 daemon kernel: __schedule+0x248/0x620
Jul 24 11:36:11 daemon kernel: ? __blk_flush_plug+0xdb/0x160
Jul 24 11:36:11 daemon kernel: schedule+0x5a/0xc0
Jul 24 11:36:11 daemon kernel: raid5_get_active_stripe+0x25a/0x2f0 [raid456]
Jul 24 11:36:11 daemon kernel: ? cpuacct_percpu_seq_show+0x10/0x10
Jul 24 11:36:11 daemon kernel: make_stripe_request+0x9b/0x490 [raid456]
Jul 24 11:36:11 daemon kernel: raid5_make_request+0x16f/0x3e0 [raid456]
Jul 24 11:36:11 daemon kernel: ? sched_show_numa+0xf0/0xf0
Jul 24 11:36:11 daemon kernel: md_handle_request+0x135/0x1e0
Jul 24 11:36:11 daemon kernel: ? bio_split_to_limits+0x51/0x90
Jul 24 11:36:11 daemon kernel: __submit_bio+0x89/0x130
Jul 24 11:36:11 daemon kernel: __submit_bio_noacct+0x81/0x1f0
Jul 24 11:36:11 daemon kernel: xfs_buf_ioapply_map+0x1cb/0x280 [xfs]
Jul 24 11:36:11 daemon kernel: _xfs_buf_ioapply+0xcf/0x1b0 [xfs]
Jul 24 11:36:11 daemon kernel: ? wake_up_q+0x90/0x90
Jul 24 11:36:11 daemon kernel: __xfs_buf_submit+0x6e/0x1e0 [xfs]
Jul 24 11:36:11 daemon kernel: xfs_buf_delwri_submit_buffers+0xe9/0x230 [xfs]
Jul 24 11:36:11 daemon kernel: xfsaild_push+0x176/0x6f0 [xfs]
Jul 24 11:36:11 daemon kernel: ? del_timer_sync+0x67/0xb0
Jul 24 11:36:11 daemon kernel: xfsaild+0xa4/0x1e0 [xfs]
Jul 24 11:36:11 daemon kernel: ? xfsaild_push+0x6f0/0x6f0 [xfs]
Jul 24 11:36:11 daemon kernel: kthread+0xd9/0x100
Jul 24 11:36:11 daemon kernel: ? kthread_complete_and_exit+0x20/0x20
Jul 24 11:36:11 daemon kernel: ret_from_fork+0x22/0x30
Jul 24 11:36:11 daemon kernel: </TASK>

Jul 24 11:36:11 daemon kernel: INFO: task rsync_bpc:5165 blocked for more than 122 seconds.
Jul 24 11:36:11 daemon kernel:      Tainted: G          I      --------  ---  5.14.0-284.18.1.el9_2.x86_64 #1
Jul 24 11:36:11 daemon kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Jul 24 11:36:11 daemon kernel: task:rsync_bpc       state:D stack:    0 pid: 5165 ppid:  5076 flags:0x00004002
Jul 24 11:36:11 daemon kernel: Call Trace:
Jul 24 11:36:11 daemon kernel: <TASK>
Jul 24 11:36:11 daemon kernel: __schedule+0x248/0x620
Jul 24 11:36:11 daemon kernel: schedule+0x5a/0xc0
Jul 24 11:36:11 daemon kernel: raid5_get_active_stripe+0x25a/0x2f0 [raid456]
Jul 24 11:36:11 daemon kernel: ? cpuacct_percpu_seq_show+0x10/0x10
Jul 24 11:36:11 daemon kernel: make_stripe_request+0x9b/0x490 [raid456]
Jul 24 11:36:11 daemon kernel: raid5_make_request+0x16f/0x3e0 [raid456]
Jul 24 11:36:11 daemon kernel: ? sched_show_numa+0xf0/0xf0
Jul 24 11:36:11 daemon kernel: md_handle_request+0x135/0x1e0
Jul 24 11:36:11 daemon kernel: ? bio_split_to_limits+0x51/0x90
Jul 24 11:36:11 daemon kernel: __submit_bio+0x89/0x130
Jul 24 11:36:11 daemon kernel: __submit_bio_noacct+0x81/0x1f0
Jul 24 11:36:11 daemon kernel: iomap_add_to_ioend+0x1f5/0x270
Jul 24 11:36:11 daemon kernel: iomap_writepage_map+0xe8/0x530
Jul 24 11:36:11 daemon kernel: write_cache_pages+0x17c/0x4c0
Jul 24 11:36:11 daemon kernel: ? iomap_writepage_map+0x530/0x530
Jul 24 11:36:11 daemon kernel: iomap_writepages+0x1c/0x40
Jul 24 11:36:11 daemon kernel: xfs_vm_writepages+0x7a/0xb0 [xfs]
Jul 24 11:36:11 daemon kernel: do_writepages+0xcf/0x1d0
Jul 24 11:36:11 daemon kernel: ? wbc_detach_inode+0x13f/0x220
Jul 24 11:36:11 daemon kernel: filemap_fdatawrite_wbc+0x66/0x90
Jul 24 11:36:11 daemon kernel: filemap_write_and_wait_range+0x6f/0xf0
Jul 24 11:36:11 daemon kernel: xfs_setattr_size+0x267/0x390 [xfs]
Jul 24 11:36:11 daemon kernel: xfs_vn_setattr+0x78/0x180 [xfs]
Jul 24 11:36:11 daemon kernel: notify_change+0x34d/0x4e0
Jul 24 11:36:11 daemon kernel: ? do_truncate+0x7d/0xd0
Jul 24 11:36:11 daemon kernel: do_truncate+0x7d/0xd0
Jul 24 11:36:11 daemon kernel: do_sys_ftruncate+0x17d/0x1b0
Jul 24 11:36:11 daemon kernel: do_syscall_64+0x5c/0x90
Jul 24 11:36:11 daemon kernel: ? vfs_write+0x1ef/0x280
Jul 24 11:36:11 daemon kernel: ? syscall_exit_work+0x11a/0x150
Jul 24 11:36:11 daemon kernel: ? syscall_exit_to_user_mode+0x12/0x30
Jul 24 11:36:11 daemon kernel: ? do_syscall_64+0x69/0x90
Jul 24 11:36:11 daemon kernel: ? syscall_exit_work+0x11a/0x150
Jul 24 11:36:11 daemon kernel: ? syscall_exit_to_user_mode+0x12/0x30
Jul 24 11:36:11 daemon kernel: ? do_syscall_64+0x69/0x90
Jul 24 11:36:11 daemon kernel: ? sysvec_apic_timer_interrupt+0x3c/0x90
Jul 24 11:36:11 daemon kernel: entry_SYSCALL_64_after_hwframe+0x63/0xcd
Jul 24 11:36:11 daemon kernel: RIP: 0033:0x7ff10554673b
Jul 24 11:36:11 daemon kernel: RSP: 002b:00007ffc11d2e068 EFLAGS: 00000206 ORIG_RAX: 000000000000004d
Jul 24 11:36:11 daemon kernel: RAX: ffffffffffffffda RBX: 0000000015dfd000 RCX: 00007ff10554673b
Jul 24 11:36:11 daemon kernel: RDX: 0000000000000003 RSI: 0000000015dfd000 RDI: 0000000000000003
Jul 24 11:36:11 daemon kernel: RBP: 000055f649c02920 R08: 0000000000000000 R09: 00000000e124df27
Jul 24 11:36:11 daemon kernel: R10: 00000000c29e658e R11: 0000000000000206 R12: 0000000000000000
Jul 24 11:36:12 daemon kernel: R13: 0000000000000034 R14: 0000000000000004 R15: 000055f649c02ac0
Jul 24 11:36:12 daemon kernel: </TASK>

Jul 24 11:36:12 daemon kernel: INFO: task kworker/u16:1:36563 blocked for more than 123 seconds.
Jul 24 11:36:12 daemon kernel:      Tainted: G          I      --------  ---  5.14.0-284.18.1.el9_2.x86_64 #1
Jul 24 11:36:12 daemon kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Jul 24 11:36:12 daemon kernel: task:kworker/u16:1   state:D stack:    0 pid:36563 ppid:     2 flags:0x00004000
Jul 24 11:36:12 daemon kernel: Workqueue: writeback wb_workfn (flush-9:1)
Jul 24 11:36:12 daemon kernel: Call Trace:
Jul 24 11:36:12 daemon kernel: <TASK>
Jul 24 11:36:12 daemon kernel: __schedule+0x248/0x620
Jul 24 11:36:12 daemon kernel: schedule+0x5a/0xc0
Jul 24 11:36:12 daemon kernel: raid5_get_active_stripe+0x25a/0x2f0 [raid456]
Jul 24 11:36:12 daemon kernel: ? cpuacct_percpu_seq_show+0x10/0x10
Jul 24 11:36:12 daemon kernel: make_stripe_request+0x9b/0x490 [raid456]
Jul 24 11:36:12 daemon kernel: ? bdev_start_io_acct+0x47/0x100
Jul 24 11:36:12 daemon kernel: raid5_make_request+0x16f/0x3e0 [raid456]
Jul 24 11:36:12 daemon kernel: ? sched_show_numa+0xf0/0xf0
Jul 24 11:36:12 daemon kernel: md_handle_request+0x135/0x1e0
Jul 24 11:36:12 daemon kernel: ? bio_split_to_limits+0x51/0x90
Jul 24 11:36:12 daemon kernel: __submit_bio+0x89/0x130
Jul 24 11:36:12 daemon kernel: __submit_bio_noacct+0x81/0x1f0
Jul 24 11:36:12 daemon kernel: iomap_submit_ioend+0x4e/0x80
Jul 24 11:36:12 daemon kernel: xfs_vm_writepages+0x7a/0xb0 [xfs]
Jul 24 11:36:12 daemon kernel: do_writepages+0xcf/0x1d0
Jul 24 11:36:12 daemon kernel: __writeback_single_inode+0x41/0x270
Jul 24 11:36:12 daemon kernel: writeback_sb_inodes+0x209/0x4a0
Jul 24 11:36:12 daemon kernel: __writeback_inodes_wb+0x4c/0xe0
Jul 24 11:36:12 daemon kernel: wb_writeback+0x1d7/0x2d0
Jul 24 11:36:12 daemon kernel: wb_do_writeback+0x22a/0x2b0
Jul 24 11:36:12 daemon kernel: wb_workfn+0x5e/0x290
Jul 24 11:36:12 daemon kernel: ? __update_idle_core+0x1b/0xb0
Jul 24 11:36:12 daemon kernel: ? __switch_to_asm+0x42/0x80
Jul 24 11:36:12 daemon kernel: ? finish_task_switch.isra.0+0x8c/0x2a0
Jul 24 11:36:12 daemon kernel: ? __schedule+0x250/0x620
Jul 24 11:36:12 daemon kernel: process_one_work+0x1e5/0x3b0
Jul 24 11:36:12 daemon kernel: worker_thread+0x50/0x3a0
Jul 24 11:36:12 daemon kernel: ? rescuer_thread+0x390/0x390
Jul 24 11:36:12 daemon kernel: kthread+0xd9/0x100
Jul 24 11:36:12 daemon kernel: ? kthread_complete_and_exit+0x20/0x20
Jul 24 11:36:12 daemon kernel: ret_from_fork+0x22/0x30
Jul 24 11:36:12 daemon kernel: </TASK>

Version-Release number of selected component (if applicable):
5.14.0-284.18.1.el9_2.x86_64

Comment 1 Nigel Croxon 2023-08-02 14:18:13 UTC
If you are able to upgrade your kernel to a newer version.
kernel-5.14.0-315.el9 has many new upstream commits.

Comment 2 Nigel Croxon 2023-08-02 15:54:17 UTC
Is it possible for you do to a 'mdadm -D /dev/mdXXX' and post the output?

and/or a 'cat proc/mdstat' and post the outout?

Comment 3 Daniel Koukola 2023-08-05 21:20:27 UTC
# mdadm -D /dev/md1
/dev/md1:
           Version : 1.2
     Creation Time : Tue Jul 18 10:57:59 2023
        Raid Level : raid5
        Array Size : 46877236224 (43.66 TiB 48.00 TB)
     Used Dev Size : 15625745408 (14.55 TiB 16.00 TB)
      Raid Devices : 4
     Total Devices : 4
       Persistence : Superblock is persistent

     Intent Bitmap : Internal

       Update Time : Sat Aug  5 23:11:46 2023
             State : clean
    Active Devices : 4
   Working Devices : 4
    Failed Devices : 0
     Spare Devices : 0

            Layout : left-symmetric
        Chunk Size : 512K

Consistency Policy : bitmap

              Name : daemon:1  (local to host daemon)
              UUID : 1ae59cb2:489e30b9:0e758c2c:2893b268
            Events : 27300

    Number   Major   Minor   RaidDevice State
       0       8        1        0      active sync   /dev/sda1
       1       8       33        1      active sync   /dev/sdc1
       2       8       49        2      active sync   /dev/sdd1
       4       8       17        3      active sync   /dev/sdb1

# cat /proc/mdstat
Personalities : [raid0] [raid6] [raid5] [raid4]
md1 : active raid5 sdd1[2] sdc1[1] sdb1[4] sda1[0]
      46877236224 blocks super 1.2 level 5, 512k chunk, algorithm 2 [4/4] [UUUU]
      bitmap: 0/117 pages [0KB], 65536KB chunk

md0 : active raid0 sdg1[1] sdh1[2] sdf1[0]
      11720653824 blocks super 1.2 512k chunks

unused devices: <none>

Comment 4 Daniel Koukola 2023-08-05 21:22:26 UTC
I will try with the new kernel.


Note You need to log in before you can comment on or make changes to this bug.