Description of problem:

Version-Release number of selected component (if applicable):
RHEL-8.6.0-20220223.0
kernel-4.18.0-367.el8
mdadm-4.2-2.el8

How reproducible:

Steps to Reproduce:
mdadm --create --run /dev/md0 --level 1 --metadata 1.2 --raid-devices 6 /dev/loop0 /dev/loop1 /dev/loop2 /dev/loop3 /dev/loop4 /dev/loop5 --spare-devices 1 /dev/loop6
mkfs -t ext4 /dev/md0
mount -t ext4 /dev/md0 /mnt/md_test
dd if=/dev/urandom of=/mnt/md_test/testfile bs=1M count=100
mdadm --grow -l0 /dev/md0 --backup-file=tmp0

Actual results:
The server panics.

Expected results:
The server does not panic.

Additional info:
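The reproducer assumes /dev/loop0 through /dev/loop6 already exist. A minimal, hedged sketch of how they might be prepared (backing-file paths and the 512M size are illustrative, not from this report); it only prints the setup commands so they can be reviewed and then run as root:

```shell
# Emit the commands that would create seven sparse backing files and
# attach them to loop devices. Nothing is executed here; pipe the
# output to a root shell to actually run it.
make_loop_cmds() {
    for i in 0 1 2 3 4 5 6; do
        echo "truncate -s 512M /var/tmp/md_disk_$i"
        echo "losetup /dev/loop$i /var/tmp/md_disk_$i"
    done
}
make_loop_cmds
```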
[  374.148492] BUG: unable to handle kernel NULL pointer dereference at 0000000000000000
[  374.156397] PGD 0 P4D 0
[  374.158933] Oops: 0010 [#1] SMP NOPTI
[  374.162592] CPU: 26 PID: 2201 Comm: jbd2/md0-8 Kdump: loaded Not tainted 4.18.0-367.el8.x86_64 #1
[  374.171457] Hardware name: Dell Inc. PowerEdge R6515/035YY8, BIOS 2.5.5 10/07/2021
[  374.179014] RIP: 0010:0x0
[  374.181641] Code: Unable to access opcode bytes at RIP 0xffffffffffffffd6.
[  374.188513] RSP: 0018:ffffa9f60358ba08 EFLAGS: 00010206
[  374.193739] RAX: 0000000000000000 RBX: 0000000000411200 RCX: ffff9c4b097af7d8
[  374.200871] RDX: ffff9c4b097af7c8 RSI: 0000000000000000 RDI: 0000000000411200
[  374.207996] RBP: 0000000000611200 R08: 0000000000000001 R09: 0000000000000001
[  374.215126] R10: 0000000000000002 R11: 0000000000000400 R12: ffff9c4b097af808
[  374.222251] R13: ffffa9f60358ba40 R14: ffffffffb993d840 R15: ffff9c4b097af7d8
[  374.229376] FS:  0000000000000000(0000) GS:ffff9c521f280000(0000) knlGS:0000000000000000
[  374.237460] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  374.243197] CR2: ffffffffffffffd6 CR3: 0000000673810005 CR4: 0000000000770ee0
[  374.250320] PKRU: 55555554
[  374.253026] Call Trace:
[  374.255504]  mempool_alloc+0x67/0x180
[  374.259214]  bio_alloc_bioset+0x14a/0x220
[  374.263225]  bio_clone_fast+0x19/0x60
[  374.266892]  md_account_bio+0x39/0x80
[  374.270559]  raid0_make_request+0xa0/0x550 [raid0]
[  374.275351]  ? blk_throtl_bio+0x252/0xb80
[  374.279363]  ? finish_wait+0x80/0x80
[  374.282942]  md_handle_request+0x119/0x190
[  374.287040]  md_make_request+0x5b/0xb0
[  374.290785]  generic_make_request+0x25b/0x350
[  374.295144]  submit_bio+0x3c/0x160
[  374.298541]  ? bio_add_page+0x42/0x50
[  374.302207]  submit_bh_wbc+0x16a/0x190
[  374.305960]  jbd2_journal_commit_transaction+0x6b6/0x1a00 [jbd2]
[  374.311967]  ? __switch_to_asm+0x41/0x70
[  374.315892]  ? sk_filter_is_valid_access+0x50/0x60
[  374.320684]  ? __switch_to+0x10c/0x450
[  374.324436]  kjournald2+0xbd/0x270 [jbd2]
[  374.328448]  ? finish_wait+0x80/0x80
[  374.332020]  ? commit_timeout+0x10/0x10 [jbd2]
[  374.336466]  kthread+0x10a/0x120
[  374.339697]  ? set_kthread_struct+0x40/0x40
[  374.343876]  ret_from_fork+0x22/0x40
[  374.347456] Modules linked in: raid0 ext4 mbcache jbd2 raid1 loop sunrpc dm_multipath dell_smbios intel_rapl_msr dell_wmi_descriptor wmi_bmof dcdbas intel_rapl_common amd64_edac_mod edac_mce_amd amd_energy kvm_amd kvm irqbypass crct10dif_pclmul crc32_pclmul ghash_clmulni_intel rapl pcspkr ipmi_ssif ccp sp5100_tco k10temp i2c_piix4 wmi acpi_ipmi ipmi_si ipmi_devintf ipmi_msghandler acpi_power_meter xfs libcrc32c sd_mod t10_pi sg mgag200 drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops drm ahci libahci crc32c_intel libata tg3 i2c_algo_bit dm_mirror dm_region_hash dm_log dm_mod
[  374.399497] CR2: 0000000000000000
This should be fixed by patch:

commit 0c031fd37f69deb0cd8c43bbfcfccd62ebd7e952
Author: Xiao Ni <xni>
Date:   Fri Dec 10 17:31:15 2021 +0800

    md: Move alloc/free acct bioset in to personality
The --backup-file= argument should include a directory path.

Bad:  mdadm --grow -l0 /dev/md0 --backup-file=tmp0
Good: mdadm --grow -l0 /dev/md0 --backup-file=/tmp/tmp0
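For illustration, the directory-component requirement can be checked mechanically before invoking mdadm; `check_backup_file` is a hypothetical helper, not part of mdadm:

```shell
# Print "ok" when the backup-file argument contains a directory
# component (has a slash), otherwise print a warning. The failing
# case in this bug used a bare "tmp0".
check_backup_file() {
    case "$1" in
        */*) echo ok ;;
        *)   echo "backup-file should include a directory, e.g. /tmp/$1" ;;
    esac
}
check_backup_file /tmp/tmp0
check_backup_file tmp0
```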
Coming across this bz, I was wondering why the summary talks about a reshape, as the bug description above is about a takeover from a 6-legged raid1 with one spare device to a 1-legged raid0. I.e. six out of seven devices are being dropped, with one kept as a raid0 leg (i.e. a linear layout) in the takeover:

# uname -r
4.18.0-372.3.1.el8.x86_64

# Running on virtio-scsi devices, the above kernel doesn't oops (loop issue, we've seen those before?)

# mdadm -C /dev/md0 -e 1.2 -l1 -n6 /dev/sd[a-f] -x 1 /dev/sdg
mdadm: array /dev/md0 started.
# cat /proc/mdstat
Personalities : [raid1] [raid0] [raid6] [raid5] [raid4]
md0 : active raid1 sdg[6](S) sdf[5] sde[4] sdd[3] sdc[2] sdb[1] sda[0]
      523264 blocks super 1.2 [6/6] [UUUUUU]
      [=================>...]  resync = 85.0% (445888/523264) finish=0.0min speed=222944K/sec
unused devices: <none>
# mkfs -t ext4 /dev/md0
# mount /dev/md0 /mnt
# dd if=/dev/urandom of=/mnt/testfile bs=1M count=100
100+0 records in
100+0 records out
104857600 bytes (105 MB, 100 MiB) copied, 0.287026 s, 365 MB/s
# df -h /mnt
Filesystem      Size  Used Avail Use% Mounted on
/dev/md0        487M  103M  356M  23% /mnt
# mdadm -G /dev/md0 -l0 --backup-file=tmp0
mdadm: level of /dev/md0 changed to raid0
# cat /proc/mdstat
Personalities : [raid1] [raid0] [raid6] [raid5] [raid4]
md0 : active raid0 sde[4]
      523264 blocks super 1.2 64k chunks
unused devices: <none>
# ll /mnt
total 102414
drwx------. 2 root root     12288 Mar 30 08:15 lost+found
-rw-r--r--. 1 root root 104857600 Mar 30 08:15 testfile
# mdadm -G /dev/md0 -l0
mdadm: level of /dev/md0 changed to raid0
# cat /proc/mdstat
Personalities : [raid1] [raid0] [raid6] [raid5] [raid4]
md0 : active raid0 sde[4]
      523264 blocks super 1.2 64k chunks
unused devices: <none>
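The takeover result in the transcript above can be verified by extracting the array's personality from /proc/mdstat. A small sketch: `mdstat_level` is a hypothetical helper, here fed a sample line copied from the transcript rather than the live file:

```shell
# Read mdstat-formatted lines on stdin and print the personality
# (fourth field) of the md0 status line, e.g. "raid0" after takeover.
mdstat_level() {
    awk '$1 == "md0" { print $4 }'
}
printf '%s\n' 'md0 : active raid0 sde[4]' | mdstat_level
```

On a live system the same function would be fed `cat /proc/mdstat` instead of the sample line.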
Fix has been proposed upstream:
https://www.spinics.net/lists/raid/msg70201.html
Test kernel:
https://brewweb.engineering.redhat.com/brew/taskinfo?taskID=45518198
Based on my list of patchset #23.