Created attachment 1701975 [details]
Trace from lvm2 test case

While testing the lvm2 test case lvconvert-raid-reshape-stripes-load-reload.sh, we are now hitting this OOPS on some runs (not all, so it is a race):

mdX: bitmap file is out of date, doing full recovery
...
Buffer I/O error on dev dm-40, logical block 15298, async page read
...
kernel BUG at drivers/md/raid5.c:7279!
invalid opcode: 0000 [#1] SMP PTI
CPU: 0 PID: 3372 Comm: dmsetup Not tainted 5.8.0-0.rc5.20200717git07a56bb875af.1.fc33.x86_64 #1
Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2007
RIP: 0010:raid5_run+0x40e/0x4a0 [raid456]
Code: 00 00 8b 83 3c 01 00 00 39 83 bc 00 00 00 0f 85 a5 00 00 00 8b b3 30 01 00 00 48 c7 04 24 00 00 00 00 85 f6 0f 84 78 fd ff ff <0f> 0b 48 8b 43 48 48 c7 c6 08 76 30 c0 48 c7 c7 e0 0f 31 c0 48 85
RSP: 0018:ffffa54980477b10 EFLAGS: 00010206
RAX: 0000000000000080 RBX: ffff8b6f32a54058 RCX: ffffffffffffffff
RDX: fffffffffffff800 RSI: 0000000000000005 RDI: 0000000000000296
RBP: ffff8b6f32a54058 R08: 0000000000000001 R09: 0000000000000000
R10: 000000000000000f R11: 0000000000000000 R12: ffff8b6f32a54070
R13: 0000000000000000 R14: 0000000000000000 R15: ffff8b6f32a54070
FS: 00007f8d210fdc80(0000) GS:ffff8b6f38000000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00005596c222dd88 CR3: 000000005ed9a004 CR4: 00000000000206f0
Call Trace:
 md_run+0x489/0xad0
 ? rcu_read_lock_sched_held+0x3f/0x80
 ? module_assert_mutex_or_preempt+0x14/0x40
 ? __module_address.part.0+0xe/0xe0
 ? is_module_address+0x25/0x40
 raid_ctr+0x13bb/0x288c [dm_raid]
 dm_table_add_target+0x167/0x330
 table_load+0x103/0x350
 ctl_ioctl+0x1b3/0x460
 ? dev_suspend+0x2c0/0x2c0
 dm_ctl_ioctl+0xa/0x10
 ksys_ioctl+0x82/0xc0
 __x64_sys_ioctl+0x16/0x20
 do_syscall_64+0x52/0xb0
 entry_SYSCALL_64_after_hwframe+0x44/0xa9
RIP: 0033:0x7f8d2157dd1b
Code: Bad RIP value.
RSP: 002b:00007fff802a6958 EFLAGS: 00000202 ORIG_RAX: 0000000000000010
RAX: ffffffffffffffda RBX: 00007f8d21679ba6 RCX: 00007f8d2157dd1b
RDX: 00005596c2229d80 RSI: 00000000c138fd09 RDI: 0000000000000003
RBP: 00007fff802a6a20 R08: 000000000000ffff R09: 0000000000000000
R10: 00007f8d215df5e0 R11: 0000000000000202 R12: 00007f8d216efd32
R13: 0000000000000000 R14: 00007f8d216efd32 R15: 00007f8d216efd32
Modules linked in: brd dm_delay dm_raid raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx cirrus drm_kms_helper rfkill joydev virtio_net virtio_balloon net_failover cec failover i2c_piix4 drm ip_tables crct10dif_pclmul crc32_pclmul crc32c_intel ghash_clmulni_intel ata_generic virtio_blk pata_acpi serio_raw [last unloaded: brd]
---[ end trace e22e4c3fd2a1470e ]---
RIP: 0010:raid5_run+0x40e/0x4a0 [raid456]
Code: 00 00 8b 83 3c 01 00 00 39 83 bc 00 00 00 0f 85 a5 00 00 00 8b b3 30 01 00 00 48 c7 04 24 00 00 00 00 85 f6 0f 84 78 fd ff ff <0f> 0b 48 8b 43 48 48 c7 c6 08 76 30 c0 48 c7 c7 e0 0f 31 c0 48 85
RSP: 0018:ffffa54980477b10 EFLAGS: 00010206
RAX: 0000000000000080 RBX: ffff8b6f32a54058 RCX: ffffffffffffffff
RDX: fffffffffffff800 RSI: 0000000000000005 RDI: 0000000000000296
RBP: ffff8b6f32a54058 R08: 0000000000000001 R09: 0000000000000000
R10: 000000000000000f R11: 0000000000000000 R12: ffff8b6f32a54070
R13: 0000000000000000 R14: 0000000000000000 R15: ffff8b6f32a54070
FS: 00007f8d210fdc80(0000) GS:ffff8b6f38000000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00005596c222dd88 CR3: 000000005ed9a004 CR4: 00000000000206f0

Started with 5.8-rc kernels.
This seems to be triggered by an 'extra' dmsetup table load while a raid table is already present - so it looks like a locking issue between the running raid target and the newly loaded table?
This BUG is being triggered by "mddev->reshape_position != MaxSector" and "mddev->delta_disks != 0". The hypothesis is a superblock update race showing up on the very small RaidLV sizes used by the lvm test suite. As I can't trigger it locally with the test suite: is this reproducible with 'normal', larger RaidLV sizes?
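For orientation only, the sketch below paraphrases the geometry sanity checks in raid5_run() around drivers/md/raid5.c:7279 as they look in the 5.8-era sources (simplified, not a verbatim quote). Once raid5_run() decides there is no reshape to continue, it expects the "old" and "new" geometry fields in struct mddev to already agree; a racy superblock update that leaves e.g. delta_disks non-zero at that point trips one of the BUG_ON()s:

/* Simplified paraphrase of the raid5_run() consistency checks (5.8-era). */
BUG_ON(mddev->level != mddev->new_level);
BUG_ON(mddev->delta_disks != 0);              /* delta_disks != 0, as named above */
BUG_ON(mddev->new_layout != mddev->layout);
BUG_ON(mddev->new_chunk_sectors != mddev->chunk_sectors);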
Created attachment 1702061 [details]
Trace with increased leg size

When the leg size is increased to 128M, we can observe some other sets of problems.
Rawhide kernel: no evidence of the bug as per the description on stable kernels, nor of the incarnations from comment 2.
This bug appears to have been reported against 'rawhide' during the Fedora 33 development cycle. Changing version to 33.
This has actually been reopened as bug 1916891.