
Bug 2225162

Summary: deadlock in raid5_get_active_stripe function
Product: Red Hat Enterprise Linux 9
Reporter: Daniel Koukola <dkoukola>
Component: kernel
Sub component: Multiple Devices (MD)
Assignee: Nigel Croxon <ncroxon>
QA Contact: Storage QE <storage-qe>
Status: CLOSED CURRENTRELEASE
Docs Contact:
Severity: high
Priority: unspecified
CC: mgandhi, ncroxon, rmadhuso, xni
Version: 9.2
Keywords: Triaged
Target Milestone: rc
Target Release: ---
Flags: pm-rhel: mirror+
Hardware: x86_64
OS: Linux
Whiteboard:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2023-08-23 14:29:51 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:

Description Daniel Koukola 2023-07-24 12:40:08 UTC
Description of problem:

Under heavy I/O load, a deadlock occasionally occurs (roughly once every few days) in the raid5_get_active_stripe function, and a reboot is required to recover.
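For reference, the hung-task watchdog that emits the traces below is controlled by two sysctls. A minimal config sketch (the file name is hypothetical; values shown are the stock defaults, not a fix for the deadlock):

```
# /etc/sysctl.d/90-hung-task.conf  -- hypothetical file name
# Report threshold in seconds; 0 disables the "blocked for more than N seconds" reports
kernel.hung_task_timeout_secs = 120
# Set to 1 to panic (and capture a crash dump) when a hung task is detected
kernel.hung_task_panic = 0
```

Setting kernel.hung_task_panic = 1 together with kdump can turn a recurring hang like this into an analyzable vmcore.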

Jul 24 11:36:11 daemon kernel: INFO: task xfsaild/md1:794 blocked for more than 122 seconds.
Jul 24 11:36:11 daemon kernel:      Tainted: G          I      --------  ---  5.14.0-284.18.1.el9_2.x86_64 #1
Jul 24 11:36:11 daemon kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Jul 24 11:36:11 daemon kernel: task:xfsaild/md1     state:D stack:    0 pid:  794 ppid:     2 flags:0x00004000
Jul 24 11:36:11 daemon kernel: Call Trace:
Jul 24 11:36:11 daemon kernel: <TASK>
Jul 24 11:36:11 daemon kernel: __schedule+0x248/0x620
Jul 24 11:36:11 daemon kernel: ? __blk_flush_plug+0xdb/0x160
Jul 24 11:36:11 daemon kernel: schedule+0x5a/0xc0
Jul 24 11:36:11 daemon kernel: raid5_get_active_stripe+0x25a/0x2f0 [raid456]
Jul 24 11:36:11 daemon kernel: ? cpuacct_percpu_seq_show+0x10/0x10
Jul 24 11:36:11 daemon kernel: make_stripe_request+0x9b/0x490 [raid456]
Jul 24 11:36:11 daemon kernel: raid5_make_request+0x16f/0x3e0 [raid456]
Jul 24 11:36:11 daemon kernel: ? sched_show_numa+0xf0/0xf0
Jul 24 11:36:11 daemon kernel: md_handle_request+0x135/0x1e0
Jul 24 11:36:11 daemon kernel: ? bio_split_to_limits+0x51/0x90
Jul 24 11:36:11 daemon kernel: __submit_bio+0x89/0x130
Jul 24 11:36:11 daemon kernel: __submit_bio_noacct+0x81/0x1f0
Jul 24 11:36:11 daemon kernel: xfs_buf_ioapply_map+0x1cb/0x280 [xfs]
Jul 24 11:36:11 daemon kernel: _xfs_buf_ioapply+0xcf/0x1b0 [xfs]
Jul 24 11:36:11 daemon kernel: ? wake_up_q+0x90/0x90
Jul 24 11:36:11 daemon kernel: __xfs_buf_submit+0x6e/0x1e0 [xfs]
Jul 24 11:36:11 daemon kernel: xfs_buf_delwri_submit_buffers+0xe9/0x230 [xfs]
Jul 24 11:36:11 daemon kernel: xfsaild_push+0x176/0x6f0 [xfs]
Jul 24 11:36:11 daemon kernel: ? del_timer_sync+0x67/0xb0
Jul 24 11:36:11 daemon kernel: xfsaild+0xa4/0x1e0 [xfs]
Jul 24 11:36:11 daemon kernel: ? xfsaild_push+0x6f0/0x6f0 [xfs]
Jul 24 11:36:11 daemon kernel: kthread+0xd9/0x100
Jul 24 11:36:11 daemon kernel: ? kthread_complete_and_exit+0x20/0x20
Jul 24 11:36:11 daemon kernel: ret_from_fork+0x22/0x30
Jul 24 11:36:11 daemon kernel: </TASK>

Jul 24 11:36:11 daemon kernel: INFO: task rsync_bpc:5165 blocked for more than 122 seconds.
Jul 24 11:36:11 daemon kernel:      Tainted: G          I      --------  ---  5.14.0-284.18.1.el9_2.x86_64 #1
Jul 24 11:36:11 daemon kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Jul 24 11:36:11 daemon kernel: task:rsync_bpc       state:D stack:    0 pid: 5165 ppid:  5076 flags:0x00004002
Jul 24 11:36:11 daemon kernel: Call Trace:
Jul 24 11:36:11 daemon kernel: <TASK>
Jul 24 11:36:11 daemon kernel: __schedule+0x248/0x620
Jul 24 11:36:11 daemon kernel: schedule+0x5a/0xc0
Jul 24 11:36:11 daemon kernel: raid5_get_active_stripe+0x25a/0x2f0 [raid456]
Jul 24 11:36:11 daemon kernel: ? cpuacct_percpu_seq_show+0x10/0x10
Jul 24 11:36:11 daemon kernel: make_stripe_request+0x9b/0x490 [raid456]
Jul 24 11:36:11 daemon kernel: raid5_make_request+0x16f/0x3e0 [raid456]
Jul 24 11:36:11 daemon kernel: ? sched_show_numa+0xf0/0xf0
Jul 24 11:36:11 daemon kernel: md_handle_request+0x135/0x1e0
Jul 24 11:36:11 daemon kernel: ? bio_split_to_limits+0x51/0x90
Jul 24 11:36:11 daemon kernel: __submit_bio+0x89/0x130
Jul 24 11:36:11 daemon kernel: __submit_bio_noacct+0x81/0x1f0
Jul 24 11:36:11 daemon kernel: iomap_add_to_ioend+0x1f5/0x270
Jul 24 11:36:11 daemon kernel: iomap_writepage_map+0xe8/0x530
Jul 24 11:36:11 daemon kernel: write_cache_pages+0x17c/0x4c0
Jul 24 11:36:11 daemon kernel: ? iomap_writepage_map+0x530/0x530
Jul 24 11:36:11 daemon kernel: iomap_writepages+0x1c/0x40
Jul 24 11:36:11 daemon kernel: xfs_vm_writepages+0x7a/0xb0 [xfs]
Jul 24 11:36:11 daemon kernel: do_writepages+0xcf/0x1d0
Jul 24 11:36:11 daemon kernel: ? wbc_detach_inode+0x13f/0x220
Jul 24 11:36:11 daemon kernel: filemap_fdatawrite_wbc+0x66/0x90
Jul 24 11:36:11 daemon kernel: filemap_write_and_wait_range+0x6f/0xf0
Jul 24 11:36:11 daemon kernel: xfs_setattr_size+0x267/0x390 [xfs]
Jul 24 11:36:11 daemon kernel: xfs_vn_setattr+0x78/0x180 [xfs]
Jul 24 11:36:11 daemon kernel: notify_change+0x34d/0x4e0
Jul 24 11:36:11 daemon kernel: ? do_truncate+0x7d/0xd0
Jul 24 11:36:11 daemon kernel: do_truncate+0x7d/0xd0
Jul 24 11:36:11 daemon kernel: do_sys_ftruncate+0x17d/0x1b0
Jul 24 11:36:11 daemon kernel: do_syscall_64+0x5c/0x90
Jul 24 11:36:11 daemon kernel: ? vfs_write+0x1ef/0x280
Jul 24 11:36:11 daemon kernel: ? syscall_exit_work+0x11a/0x150
Jul 24 11:36:11 daemon kernel: ? syscall_exit_to_user_mode+0x12/0x30
Jul 24 11:36:11 daemon kernel: ? do_syscall_64+0x69/0x90
Jul 24 11:36:11 daemon kernel: ? syscall_exit_work+0x11a/0x150
Jul 24 11:36:11 daemon kernel: ? syscall_exit_to_user_mode+0x12/0x30
Jul 24 11:36:11 daemon kernel: ? do_syscall_64+0x69/0x90
Jul 24 11:36:11 daemon kernel: ? sysvec_apic_timer_interrupt+0x3c/0x90
Jul 24 11:36:11 daemon kernel: entry_SYSCALL_64_after_hwframe+0x63/0xcd
Jul 24 11:36:11 daemon kernel: RIP: 0033:0x7ff10554673b
Jul 24 11:36:11 daemon kernel: RSP: 002b:00007ffc11d2e068 EFLAGS: 00000206 ORIG_RAX: 000000000000004d
Jul 24 11:36:11 daemon kernel: RAX: ffffffffffffffda RBX: 0000000015dfd000 RCX: 00007ff10554673b
Jul 24 11:36:11 daemon kernel: RDX: 0000000000000003 RSI: 0000000015dfd000 RDI: 0000000000000003
Jul 24 11:36:11 daemon kernel: RBP: 000055f649c02920 R08: 0000000000000000 R09: 00000000e124df27
Jul 24 11:36:11 daemon kernel: R10: 00000000c29e658e R11: 0000000000000206 R12: 0000000000000000
Jul 24 11:36:12 daemon kernel: R13: 0000000000000034 R14: 0000000000000004 R15: 000055f649c02ac0
Jul 24 11:36:12 daemon kernel: </TASK>

Jul 24 11:36:12 daemon kernel: INFO: task kworker/u16:1:36563 blocked for more than 123 seconds.
Jul 24 11:36:12 daemon kernel:      Tainted: G          I      --------  ---  5.14.0-284.18.1.el9_2.x86_64 #1
Jul 24 11:36:12 daemon kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Jul 24 11:36:12 daemon kernel: task:kworker/u16:1   state:D stack:    0 pid:36563 ppid:     2 flags:0x00004000
Jul 24 11:36:12 daemon kernel: Workqueue: writeback wb_workfn (flush-9:1)
Jul 24 11:36:12 daemon kernel: Call Trace:
Jul 24 11:36:12 daemon kernel: <TASK>
Jul 24 11:36:12 daemon kernel: __schedule+0x248/0x620
Jul 24 11:36:12 daemon kernel: schedule+0x5a/0xc0
Jul 24 11:36:12 daemon kernel: raid5_get_active_stripe+0x25a/0x2f0 [raid456]
Jul 24 11:36:12 daemon kernel: ? cpuacct_percpu_seq_show+0x10/0x10
Jul 24 11:36:12 daemon kernel: make_stripe_request+0x9b/0x490 [raid456]
Jul 24 11:36:12 daemon kernel: ? bdev_start_io_acct+0x47/0x100
Jul 24 11:36:12 daemon kernel: raid5_make_request+0x16f/0x3e0 [raid456]
Jul 24 11:36:12 daemon kernel: ? sched_show_numa+0xf0/0xf0
Jul 24 11:36:12 daemon kernel: md_handle_request+0x135/0x1e0
Jul 24 11:36:12 daemon kernel: ? bio_split_to_limits+0x51/0x90
Jul 24 11:36:12 daemon kernel: __submit_bio+0x89/0x130
Jul 24 11:36:12 daemon kernel: __submit_bio_noacct+0x81/0x1f0
Jul 24 11:36:12 daemon kernel: iomap_submit_ioend+0x4e/0x80
Jul 24 11:36:12 daemon kernel: xfs_vm_writepages+0x7a/0xb0 [xfs]
Jul 24 11:36:12 daemon kernel: do_writepages+0xcf/0x1d0
Jul 24 11:36:12 daemon kernel: __writeback_single_inode+0x41/0x270
Jul 24 11:36:12 daemon kernel: writeback_sb_inodes+0x209/0x4a0
Jul 24 11:36:12 daemon kernel: __writeback_inodes_wb+0x4c/0xe0
Jul 24 11:36:12 daemon kernel: wb_writeback+0x1d7/0x2d0
Jul 24 11:36:12 daemon kernel: wb_do_writeback+0x22a/0x2b0
Jul 24 11:36:12 daemon kernel: wb_workfn+0x5e/0x290
Jul 24 11:36:12 daemon kernel: ? __update_idle_core+0x1b/0xb0
Jul 24 11:36:12 daemon kernel: ? __switch_to_asm+0x42/0x80
Jul 24 11:36:12 daemon kernel: ? finish_task_switch.isra.0+0x8c/0x2a0
Jul 24 11:36:12 daemon kernel: ? __schedule+0x250/0x620
Jul 24 11:36:12 daemon kernel: process_one_work+0x1e5/0x3b0
Jul 24 11:36:12 daemon kernel: worker_thread+0x50/0x3a0
Jul 24 11:36:12 daemon kernel: ? rescuer_thread+0x390/0x390
Jul 24 11:36:12 daemon kernel: kthread+0xd9/0x100
Jul 24 11:36:12 daemon kernel: ? kthread_complete_and_exit+0x20/0x20
Jul 24 11:36:12 daemon kernel: ret_from_fork+0x22/0x30
Jul 24 11:36:12 daemon kernel: </TASK>

Version-Release number of selected component (if applicable):
5.14.0-284.18.1.el9_2.x86_64

Comment 1 Nigel Croxon 2023-08-02 14:18:13 UTC
If you are able to, please upgrade your kernel to a newer version.
kernel-5.14.0-315.el9 includes many new upstream commits.

Comment 2 Nigel Croxon 2023-08-02 15:54:17 UTC
Is it possible for you to do a 'mdadm -D /dev/mdXXX' and post the output?

and/or a 'cat /proc/mdstat' and post the output?

Comment 3 Daniel Koukola 2023-08-05 21:20:27 UTC
# mdadm -D /dev/md1
/dev/md1:
           Version : 1.2
     Creation Time : Tue Jul 18 10:57:59 2023
        Raid Level : raid5
        Array Size : 46877236224 (43.66 TiB 48.00 TB)
     Used Dev Size : 15625745408 (14.55 TiB 16.00 TB)
      Raid Devices : 4
     Total Devices : 4
       Persistence : Superblock is persistent

     Intent Bitmap : Internal

       Update Time : Sat Aug  5 23:11:46 2023
             State : clean
    Active Devices : 4
   Working Devices : 4
    Failed Devices : 0
     Spare Devices : 0

            Layout : left-symmetric
        Chunk Size : 512K

Consistency Policy : bitmap

              Name : daemon:1  (local to host daemon)
              UUID : 1ae59cb2:489e30b9:0e758c2c:2893b268
            Events : 27300

    Number   Major   Minor   RaidDevice State
       0       8        1        0      active sync   /dev/sda1
       1       8       33        1      active sync   /dev/sdc1
       2       8       49        2      active sync   /dev/sdd1
       4       8       17        3      active sync   /dev/sdb1

# cat /proc/mdstat
Personalities : [raid0] [raid6] [raid5] [raid4]
md1 : active raid5 sdd1[2] sdc1[1] sdb1[4] sda1[0]
      46877236224 blocks super 1.2 level 5, 512k chunk, algorithm 2 [4/4] [UUUU]
      bitmap: 0/117 pages [0KB], 65536KB chunk

md0 : active raid0 sdg1[1] sdh1[2] sdf1[0]
      11720653824 blocks super 1.2 512k chunks

unused devices: <none>
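The `[4/4] [UUUU]` status above shows all four members active; a degraded array would replace a 'U' with an underscore (e.g. `[4/3] [UUU_]`). A minimal sketch of a health check built on that convention (the helper name and sample file are hypothetical):

```shell
# Hypothetical helper: flag degraded md arrays by parsing mdstat-format output.
# A degraded array's status brackets contain '_' (e.g. [UUU_]); a healthy
# array's contain only 'U' (e.g. [UUUU]).
check_mdstat() {
    # $1: path to an mdstat-format file (normally /proc/mdstat)
    if grep -Eq '\[[U_]*_[U_]*\]' "$1"; then
        echo degraded
    else
        echo healthy
    fi
}

# Exercise the helper with the md1 lines posted above.
cat > /tmp/mdstat.sample <<'EOF'
md1 : active raid5 sdd1[2] sdc1[1] sdb1[4] sda1[0]
      46877236224 blocks super 1.2 level 5, 512k chunk, algorithm 2 [4/4] [UUUU]
EOF
check_mdstat /tmp/mdstat.sample   # prints "healthy"
```

Note the device suffixes like `sdd1[2]` and the `[4/4]` counter contain digits, so the regex only matches the U/_ status brackets.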

Comment 4 Daniel Koukola 2023-08-05 21:22:26 UTC
I will try with the new kernel.

Comment 5 Daniel Koukola 2023-08-22 10:23:18 UTC
I'm unable to reproduce the problem with 5.14.0-350.el9, so the bug seems to have been fixed. Thank you!