Bug 2058496

Summary: mdadm-4.2-2.el8 regression panic during reshape
Product: Red Hat Enterprise Linux 8
Reporter: Fine Fan <ffan>
Component: mdadm
Assignee: Nigel Croxon <ncroxon>
Status: CLOSED WONTFIX
QA Contact: Fine Fan <ffan>
Severity: unspecified
Priority: high
Version: 8.6
CC: dledford, heinzm, lmiksik, ncroxon, xni, yizhan
Target Milestone: rc
Keywords: Regression, Triaged
Hardware: x86_64
OS: Linux
Fixed In Version: kernel-4.18.0-404.el8
Type: Bug
Last Closed: 2023-08-25 07:28:24 UTC

Description Fine Fan 2022-02-25 06:37:05 UTC
Description of problem:


Version-Release number of selected component (if applicable):
RHEL-8.6.0-20220223.0
kernel-4.18.0-367.el8
mdadm-4.2-2.el8


How reproducible:


Steps to Reproduce:
mdadm --create --run /dev/md0 --level 1 --metadata 1.2 --raid-devices 6 /dev/loop0 /dev/loop1 /dev/loop2 /dev/loop3 /dev/loop4 /dev/loop5 --spare-devices 1 /dev/loop6

mkfs -t ext4 /dev/md0

mount -t ext4 /dev/md0 /mnt/md_test

dd if=/dev/urandom of=/mnt/md_test/testfile bs=1M count=100

mdadm --grow -l0 /dev/md0 --backup-file=tmp0
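
The steps above assume /dev/loop0 through /dev/loop6 already exist. A minimal setup sketch to run first; the backing-file location and size are assumptions, not part of the original report:

# create seven file-backed loop devices and the mount point used above
for i in $(seq 0 6); do
    dd if=/dev/zero of=/var/tmp/loopfile$i bs=1M count=512
    losetup /dev/loop$i /var/tmp/loopfile$i
done
mkdir -p /mnt/md_test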


Actual results:
The server panics.

Expected results:
The server does not panic.

Additional info:

Comment 1 Fine Fan 2022-02-25 08:18:31 UTC
[  374.148492] BUG: unable to handle kernel NULL pointer dereference at 0000000000000000
[  374.156397] PGD 0 P4D 0 
[  374.158933] Oops: 0010 [#1] SMP NOPTI
[  374.162592] CPU: 26 PID: 2201 Comm: jbd2/md0-8 Kdump: loaded Not tainted 4.18.0-367.el8.x86_64 #1
[  374.171457] Hardware name: Dell Inc. PowerEdge R6515/035YY8, BIOS 2.5.5 10/07/2021
[  374.179014] RIP: 0010:0x0
[  374.181641] Code: Unable to access opcode bytes at RIP 0xffffffffffffffd6.
[  374.188513] RSP: 0018:ffffa9f60358ba08 EFLAGS: 00010206
[  374.193739] RAX: 0000000000000000 RBX: 0000000000411200 RCX: ffff9c4b097af7d8
[  374.200871] RDX: ffff9c4b097af7c8 RSI: 0000000000000000 RDI: 0000000000411200
[  374.207996] RBP: 0000000000611200 R08: 0000000000000001 R09: 0000000000000001
[  374.215126] R10: 0000000000000002 R11: 0000000000000400 R12: ffff9c4b097af808
[  374.222251] R13: ffffa9f60358ba40 R14: ffffffffb993d840 R15: ffff9c4b097af7d8
[  374.229376] FS:  0000000000000000(0000) GS:ffff9c521f280000(0000) knlGS:0000000000000000
[  374.237460] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  374.243197] CR2: ffffffffffffffd6 CR3: 0000000673810005 CR4: 0000000000770ee0
[  374.250320] PKRU: 55555554
[  374.253026] Call Trace:
[  374.255504]  mempool_alloc+0x67/0x180
[  374.259214]  bio_alloc_bioset+0x14a/0x220
[  374.263225]  bio_clone_fast+0x19/0x60
[  374.266892]  md_account_bio+0x39/0x80
[  374.270559]  raid0_make_request+0xa0/0x550 [raid0]
[  374.275351]  ? blk_throtl_bio+0x252/0xb80
[  374.279363]  ? finish_wait+0x80/0x80
[  374.282942]  md_handle_request+0x119/0x190
[  374.287040]  md_make_request+0x5b/0xb0
[  374.290785]  generic_make_request+0x25b/0x350
[  374.295144]  submit_bio+0x3c/0x160
[  374.298541]  ? bio_add_page+0x42/0x50
[  374.302207]  submit_bh_wbc+0x16a/0x190
[  374.305960]  jbd2_journal_commit_transaction+0x6b6/0x1a00 [jbd2]
[  374.311967]  ? __switch_to_asm+0x41/0x70
[  374.315892]  ? sk_filter_is_valid_access+0x50/0x60
[  374.320684]  ? __switch_to+0x10c/0x450
[  374.324436]  kjournald2+0xbd/0x270 [jbd2]
[  374.328448]  ? finish_wait+0x80/0x80
[  374.332020]  ? commit_timeout+0x10/0x10 [jbd2]
[  374.336466]  kthread+0x10a/0x120
[  374.339697]  ? set_kthread_struct+0x40/0x40
[  374.343876]  ret_from_fork+0x22/0x40
[  374.347456] Modules linked in: raid0 ext4 mbcache jbd2 raid1 loop sunrpc dm_multipath dell_smbios intel_rapl_msr dell_wmi_descriptor wmi_bmof dcdbas intel_rapl_common amd64_edac_mod edac_mce_amd amd_energy kvm_amd kvm irqbypass crct10dif_pclmul crc32_pclmul ghash_clmulni_intel rapl pcspkr ipmi_ssif ccp sp5100_tco k10temp i2c_piix4 wmi acpi_ipmi ipmi_si ipmi_devintf ipmi_msghandler acpi_power_meter xfs libcrc32c sd_mod t10_pi sg mgag200 drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops drm ahci libahci crc32c_intel libata tg3 i2c_algo_bit dm_mirror dm_region_hash dm_log dm_mod
[  374.399497] CR2: 0000000000000000

Comment 2 XiaoNi 2022-02-28 02:03:38 UTC
The trace points to md_account_bio() cloning from the I/O accounting bioset, which is not allocated when raid0 is reached via takeover instead of a fresh md_run(). This should be fixed by this patch:

commit 0c031fd37f69deb0cd8c43bbfcfccd62ebd7e952
Author: Xiao Ni <xni>
Date:   Fri Dec 10 17:31:15 2021 +0800

    md: Move alloc/free acct bioset in to personality
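
For reference, a quick way to confirm where this commit landed, assuming an upstream linux.git clone:

git log -1 --format='%h %s' 0c031fd37f69
git describe --contains 0c031fd37f69   # first upstream tag containing the fix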

Comment 3 Nigel Croxon 2022-03-08 13:09:34 UTC
The --backup-file= option should include a directory in its path.

Bad: mdadm --grow -l0 /dev/md0 --backup-file=tmp0

Good: mdadm --grow -l0 /dev/md0 --backup-file=/tmp/tmp0
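
Applied to the reproducer, the corrected step plus a status check would look like this (a sketch, not verified output):

mdadm --grow -l0 /dev/md0 --backup-file=/tmp/tmp0
cat /proc/mdstat    # md0 should come back as raid0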

Comment 10 Heinz Mauelshagen 2022-03-30 12:20:16 UTC
Coming across this BZ, I was wondering why the summary talks about a reshape, since the bug description above shows a takeover from a 6-legged raid1 with one spare device to a 1-legged raid0.

I.e., six out of seven devices are dropped, with one kept as the single raid0 leg (i.e. a linear layout) in the takeover:

# uname -r
4.18.0-372.3.1.el8.x86_64

# Running on virtio-scsi devices, the kernel above doesn't oops (a loop-device issue? we've seen those before)

# mdadm -C /dev/md0 -e 1.2 -l1 -n6 /dev/sd[a-f] -x 1 /dev/sdg  
mdadm: array /dev/md0 started.

# cat /proc/mdstat
Personalities : [raid1] [raid0] [raid6] [raid5] [raid4] 
md0 : active raid1 sdg[6](S) sdf[5] sde[4] sdd[3] sdc[2] sdb[1] sda[0]
      523264 blocks super 1.2 [6/6] [UUUUUU]
      [=================>...]  resync = 85.0% (445888/523264) finish=0.0min speed=222944K/sec
      
unused devices: <none>

# mkfs -t ext4 /dev/md0
# mount /dev/md0 /mnt
# dd if=/dev/urandom of=/mnt/testfile bs=1M count=100
100+0 records in
100+0 records out
104857600 bytes (105 MB, 100 MiB) copied, 0.287026 s, 365 MB/s

# df -h /mnt
Filesystem      Size  Used Avail Use% Mounted on
/dev/md0        487M  103M  356M  23% /mnt

# mdadm -G /dev/md0 -l0 --backup-file=tmp0
mdadm: level of /dev/md0 changed to raid0

# cat /proc/mdstat
Personalities : [raid1] [raid0] [raid6] [raid5] [raid4] 
md0 : active raid0 sde[4]
      523264 blocks super 1.2 64k chunks
      
unused devices: <none>

# ll /mnt
total 102414
drwx------. 2 root root     12288 Mar 30 08:15 lost+found
-rw-r--r--. 1 root root 104857600 Mar 30 08:15 testfile

# mdadm -G /dev/md0 -l0
mdadm: level of /dev/md0 changed to raid0

# cat /proc/mdstat
Personalities : [raid1] [raid0] [raid6] [raid5] [raid4] 
md0 : active raid0 sde[4]
      523264 blocks super 1.2 64k chunks
      
unused devices: <none>

Comment 14 Nigel Croxon 2022-05-02 16:08:25 UTC
A fix has been proposed upstream:
https://www.spinics.net/lists/raid/msg70201.html

Comment 16 Nigel Croxon 2022-05-24 18:48:04 UTC
Test kernel
https://brewweb.engineering.redhat.com/brew/taskinfo?taskID=45518198

Based on patchset #23 from my list.

Comment 21 RHEL Program Management 2023-08-25 07:28:24 UTC
After evaluating this issue, we have no plans to address it further or to fix it in an upcoming release; therefore, it is being closed. If plans change and this issue will be fixed in an upcoming release, the bug can be reopened.