Created attachment 1701975 [details]
Trace from lvm2 test case

While testing the lvm2 test case lvconvert-raid-reshape-stripes-load-reload.sh, we are now hitting this OOPS on some runs (not all, so it is a race):

mdX: bitmap file is out of date, doing full recovery
...
Buffer I/O error on dev dm-40, logical block 15298, async page read
...
kernel BUG at drivers/md/raid5.c:7279!
invalid opcode: 0000 [#1] SMP PTI
CPU: 0 PID: 3372 Comm: dmsetup Not tainted 5.8.0-0.rc5.20200717git07a56bb875af.1.fc33.x86_64 #1
Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2007
RIP: 0010:raid5_run+0x40e/0x4a0 [raid456]
Code: 00 00 8b 83 3c 01 00 00 39 83 bc 00 00 00 0f 85 a5 00 00 00 8b b3 30 01 00 00 48 c7 04 24 00 00 00 00 85 f6 0f 84 78 fd ff ff <0f> 0b 48 8b 43 48 48 c7 c6 08 76 30 c0 48 c7 c7 e0 0f 31 c0 48 85
RSP: 0018:ffffa54980477b10 EFLAGS: 00010206
RAX: 0000000000000080 RBX: ffff8b6f32a54058 RCX: ffffffffffffffff
RDX: fffffffffffff800 RSI: 0000000000000005 RDI: 0000000000000296
RBP: ffff8b6f32a54058 R08: 0000000000000001 R09: 0000000000000000
R10: 000000000000000f R11: 0000000000000000 R12: ffff8b6f32a54070
R13: 0000000000000000 R14: 0000000000000000 R15: ffff8b6f32a54070
FS: 00007f8d210fdc80(0000) GS:ffff8b6f38000000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00005596c222dd88 CR3: 000000005ed9a004 CR4: 00000000000206f0
Call Trace:
 md_run+0x489/0xad0
 ? rcu_read_lock_sched_held+0x3f/0x80
 ? module_assert_mutex_or_preempt+0x14/0x40
 ? __module_address.part.0+0xe/0xe0
 ? is_module_address+0x25/0x40
 raid_ctr+0x13bb/0x288c [dm_raid]
 dm_table_add_target+0x167/0x330
 table_load+0x103/0x350
 ctl_ioctl+0x1b3/0x460
 ? dev_suspend+0x2c0/0x2c0
 dm_ctl_ioctl+0xa/0x10
 ksys_ioctl+0x82/0xc0
 __x64_sys_ioctl+0x16/0x20
 do_syscall_64+0x52/0xb0
 entry_SYSCALL_64_after_hwframe+0x44/0xa9
RIP: 0033:0x7f8d2157dd1b
Code: Bad RIP value.
RSP: 002b:00007fff802a6958 EFLAGS: 00000202 ORIG_RAX: 0000000000000010
RAX: ffffffffffffffda RBX: 00007f8d21679ba6 RCX: 00007f8d2157dd1b
RDX: 00005596c2229d80 RSI: 00000000c138fd09 RDI: 0000000000000003
RBP: 00007fff802a6a20 R08: 000000000000ffff R09: 0000000000000000
R10: 00007f8d215df5e0 R11: 0000000000000202 R12: 00007f8d216efd32
R13: 0000000000000000 R14: 00007f8d216efd32 R15: 00007f8d216efd32
Modules linked in: brd dm_delay dm_raid raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx cirrus drm_kms_helper rfkill joydev virtio_net virtio_balloon net_failover cec failover i2c_piix4 drm ip_tables crct10dif_pclmul crc32_pclmul crc32c_intel ghash_clmulni_intel ata_generic virtio_blk pata_acpi serio_raw [last unloaded: brd]
---[ end trace e22e4c3fd2a1470e ]---
RIP: 0010:raid5_run+0x40e/0x4a0 [raid456]
Code: 00 00 8b 83 3c 01 00 00 39 83 bc 00 00 00 0f 85 a5 00 00 00 8b b3 30 01 00 00 48 c7 04 24 00 00 00 00 85 f6 0f 84 78 fd ff ff <0f> 0b 48 8b 43 48 48 c7 c6 08 76 30 c0 48 c7 c7 e0 0f 31 c0 48 85
RSP: 0018:ffffa54980477b10 EFLAGS: 00010206
RAX: 0000000000000080 RBX: ffff8b6f32a54058 RCX: ffffffffffffffff
RDX: fffffffffffff800 RSI: 0000000000000005 RDI: 0000000000000296
RBP: ffff8b6f32a54058 R08: 0000000000000001 R09: 0000000000000000
R10: 000000000000000f R11: 0000000000000000 R12: ffff8b6f32a54070
R13: 0000000000000000 R14: 0000000000000000 R15: ffff8b6f32a54070
FS: 00007f8d210fdc80(0000) GS:ffff8b6f38000000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00005596c222dd88 CR3: 000000005ed9a004 CR4: 00000000000206f0

Started with 5.8-rc kernels.
This seems to be triggered by an 'extra' dmsetup table load while a raid table is already present - so it looks like a locking issue between the running raid target and the newly loaded table?
This BUG is being triggered by "mddev->reshape_position != MaxSector" and "mddev->delta_disks != 0". The hypothesis is a superblock update race showing up on the very small RaidLV sizes used by the lvm test suite. As I can't trigger it locally with the test suite: is this reproducible with 'normal', larger RaidLV sizes?
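For orientation only, the sketch below paraphrases the geometry sanity checks in raid5_run() around drivers/md/raid5.c:7279 as they look in the 5.8-era sources (simplified, not a verbatim quote). Once raid5_run() decides there is no reshape to continue, it expects the "old" and "new" geometry fields in struct mddev to already agree; a racy superblock update that leaves e.g. delta_disks non-zero at that point trips one of the BUG_ON()s:

/* Simplified paraphrase of the raid5_run() consistency checks (5.8-era). */
BUG_ON(mddev->level != mddev->new_level);
BUG_ON(mddev->delta_disks != 0);              /* delta_disks != 0, as named above */
BUG_ON(mddev->new_layout != mddev->layout);
BUG_ON(mddev->new_chunk_sectors != mddev->chunk_sectors);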
Created attachment 1702061 [details]
Trace with increased leg size

When the leg size is increased to 128M, we can observe some other sets of problems.
Rawhide kernel: no evidence of the bug as per the description on stable kernels, nor of the incarnations from comment 2.
This bug appears to have been reported against 'rawhide' during the Fedora 33 development cycle. Changing version to 33.
This has actually been reopened as bug 1916891.