Bug 2058496

Summary: mdadm-4.2-2.el8 regression panic during reshape

Product: Red Hat Enterprise Linux 8
Version: 8.6
Component: mdadm
Hardware: x86_64
OS: Linux
Status: VERIFIED
Priority: high
Severity: unspecified
Reporter: Fine Fan <ffan>
Assignee: Nigel Croxon <ncroxon>
QA Contact: Fine Fan <ffan>
CC: dledford, heinzm, lmiksik, ncroxon, xni, yizhan
Keywords: Regression, Triaged
Target Milestone: rc
Fixed In Version: kernel-4.18.0-404.el8
Type: Bug

Description Fine Fan 2022-02-25 06:37:05 UTC
Description of problem:


Version-Release number of selected component (if applicable):
RHEL-8.6.0-20220223.0
kernel-4.18.0-367.el8
mdadm-4.2-2.el8


How reproducible:


Steps to Reproduce:
mdadm --create --run /dev/md0 --level 1 --metadata 1.2 --raid-devices 6 /dev/loop0 /dev/loop1 /dev/loop2 /dev/loop3 /dev/loop4 /dev/loop5 --spare-devices 1 /dev/loop6

mkfs -t ext4 /dev/md0

mount -t ext4 /dev/md0 /mnt/md_test

dd if=/dev/urandom of=/mnt/md_test/testfile bs=1M count=100

mdadm --grow -l0 /dev/md0 --backup-file=tmp0
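
Note: the loop devices and the /mnt/md_test mount point used above are assumed to exist beforehand. One possible setup (backing-file paths and sizes here are illustrative, not part of the original report) is:

mkdir -p /mnt/md_test
for i in 0 1 2 3 4 5 6; do
    dd if=/dev/zero of=/var/tmp/md_loop$i.img bs=1M count=512
    losetup /dev/loop$i /var/tmp/md_loop$i.img
done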


Actual results:
The server panics.

Expected results:
The server does not panic.

Additional info:

Comment 1 Fine Fan 2022-02-25 08:18:31 UTC
[  374.148492] BUG: unable to handle kernel NULL pointer dereference at 0000000000000000
[  374.156397] PGD 0 P4D 0 
[  374.158933] Oops: 0010 [#1] SMP NOPTI
[  374.162592] CPU: 26 PID: 2201 Comm: jbd2/md0-8 Kdump: loaded Not tainted 4.18.0-367.el8.x86_64 #1
[  374.171457] Hardware name: Dell Inc. PowerEdge R6515/035YY8, BIOS 2.5.5 10/07/2021
[  374.179014] RIP: 0010:0x0
[  374.181641] Code: Unable to access opcode bytes at RIP 0xffffffffffffffd6.
[  374.188513] RSP: 0018:ffffa9f60358ba08 EFLAGS: 00010206
[  374.193739] RAX: 0000000000000000 RBX: 0000000000411200 RCX: ffff9c4b097af7d8
[  374.200871] RDX: ffff9c4b097af7c8 RSI: 0000000000000000 RDI: 0000000000411200
[  374.207996] RBP: 0000000000611200 R08: 0000000000000001 R09: 0000000000000001
[  374.215126] R10: 0000000000000002 R11: 0000000000000400 R12: ffff9c4b097af808
[  374.222251] R13: ffffa9f60358ba40 R14: ffffffffb993d840 R15: ffff9c4b097af7d8
[  374.229376] FS:  0000000000000000(0000) GS:ffff9c521f280000(0000) knlGS:0000000000000000
[  374.237460] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  374.243197] CR2: ffffffffffffffd6 CR3: 0000000673810005 CR4: 0000000000770ee0
[  374.250320] PKRU: 55555554
[  374.253026] Call Trace:
[  374.255504]  mempool_alloc+0x67/0x180
[  374.259214]  bio_alloc_bioset+0x14a/0x220
[  374.263225]  bio_clone_fast+0x19/0x60
[  374.266892]  md_account_bio+0x39/0x80
[  374.270559]  raid0_make_request+0xa0/0x550 [raid0]
[  374.275351]  ? blk_throtl_bio+0x252/0xb80
[  374.279363]  ? finish_wait+0x80/0x80
[  374.282942]  md_handle_request+0x119/0x190
[  374.287040]  md_make_request+0x5b/0xb0
[  374.290785]  generic_make_request+0x25b/0x350
[  374.295144]  submit_bio+0x3c/0x160
[  374.298541]  ? bio_add_page+0x42/0x50
[  374.302207]  submit_bh_wbc+0x16a/0x190
[  374.305960]  jbd2_journal_commit_transaction+0x6b6/0x1a00 [jbd2]
[  374.311967]  ? __switch_to_asm+0x41/0x70
[  374.315892]  ? sk_filter_is_valid_access+0x50/0x60
[  374.320684]  ? __switch_to+0x10c/0x450
[  374.324436]  kjournald2+0xbd/0x270 [jbd2]
[  374.328448]  ? finish_wait+0x80/0x80
[  374.332020]  ? commit_timeout+0x10/0x10 [jbd2]
[  374.336466]  kthread+0x10a/0x120
[  374.339697]  ? set_kthread_struct+0x40/0x40
[  374.343876]  ret_from_fork+0x22/0x40
[  374.347456] Modules linked in: raid0 ext4 mbcache jbd2 raid1 loop sunrpc dm_multipath dell_smbios intel_rapl_msr dell_wmi_descriptor wmi_bmof dcdbas intel_rapl_common amd64_edac_mod edac_mce_amd amd_energy kvm_amd kvm irqbypass crct10dif_pclmul crc32_pclmul ghash_clmulni_intel rapl pcspkr ipmi_ssif ccp sp5100_tco k10temp i2c_piix4 wmi acpi_ipmi ipmi_si ipmi_devintf ipmi_msghandler acpi_power_meter xfs libcrc32c sd_mod t10_pi sg mgag200 drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops drm ahci libahci crc32c_intel libata tg3 i2c_algo_bit dm_mirror dm_region_hash dm_log dm_mod
[  374.399497] CR2: 0000000000000000

Comment 2 XiaoNi 2022-02-28 02:03:38 UTC
This should be fixed by this upstream patch:

commit 0c031fd37f69deb0cd8c43bbfcfccd62ebd7e952
Author: Xiao Ni <xni>
Date:   Fri Dec 10 17:31:15 2021 +0800

    md: Move alloc/free acct bioset in to personality
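
One quick way to check whether an installed RHEL kernel build already carries this change (assuming the backport keeps the upstream subject line in the package changelog) is:

rpm -q --changelog kernel | grep -i "move alloc/free acct bioset"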

Comment 3 Nigel Croxon 2022-03-08 13:09:34 UTC
The --backup-file= option should include a directory in the path, not just a bare file name.

Bad: mdadm --grow -l0 /dev/md0 --backup-file=tmp0

Good: mdadm --grow -l0 /dev/md0 --backup-file=/tmp/tmp0

Comment 10 Heinz Mauelshagen 2022-03-30 12:20:16 UTC
Coming across this bz, I was wondering why the summary talks about a reshape, since the bug description above is actually about a takeover from a 6-legged raid1 with one spare device to a 1-legged raid0.

I.e., six out of the seven devices are dropped and one is kept as the single raid0 leg (i.e. a linear layout) in the takeover:

# uname -r
4.18.0-372.3.1.el8.x86_64

# Running on virtio-scsi devices, the kernel above doesn't oops (a loop-device issue? we've seen those before)

# mdadm -C /dev/md0 -e 1.2 -l1 -n6 /dev/sd[a-f] -x 1 /dev/sdg  
mdadm: array /dev/md0 started.

# cat /proc/mdstat
Personalities : [raid1] [raid0] [raid6] [raid5] [raid4] 
md0 : active raid1 sdg[6](S) sdf[5] sde[4] sdd[3] sdc[2] sdb[1] sda[0]
      523264 blocks super 1.2 [6/6] [UUUUUU]
      [=================>...]  resync = 85.0% (445888/523264) finish=0.0min speed=222944K/sec
      
unused devices: <none>

# mkfs -t ext4 /dev/md0
# mount /dev/md0 /mnt
# dd if=/dev/urandom of=/mnt/testfile bs=1M count=100
100+0 records in
100+0 records out
104857600 bytes (105 MB, 100 MiB) copied, 0.287026 s, 365 MB/s

# df -h /mnt
Filesystem      Size  Used Avail Use% Mounted on
/dev/md0        487M  103M  356M  23% /mnt

# mdadm -G /dev/md0 -l0 --backup-file=tmp0
mdadm: level of /dev/md0 changed to raid0

# cat /proc/mdstat
Personalities : [raid1] [raid0] [raid6] [raid5] [raid4] 
md0 : active raid0 sde[4]
      523264 blocks super 1.2 64k chunks
      
unused devices: <none>

# ll /mnt
total 102414
drwx------. 2 root root     12288 Mar 30 08:15 lost+found
-rw-r--r--. 1 root root 104857600 Mar 30 08:15 testfile

# mdadm -G /dev/md0 -l0
mdadm: level of /dev/md0 changed to raid0

# cat /proc/mdstat
Personalities : [raid1] [raid0] [raid6] [raid5] [raid4] 
md0 : active raid0 sde[4]
      523264 blocks super 1.2 64k chunks
      
unused devices: <none>

Comment 14 Nigel Croxon 2022-05-02 16:08:25 UTC
A fix has been proposed upstream:
https://www.spinics.net/lists/raid/msg70201.html

Comment 16 Nigel Croxon 2022-05-24 18:48:04 UTC
Test kernel
https://brewweb.engineering.redhat.com/brew/taskinfo?taskID=45518198

Based on my list of patchset #23