Description of problem:
Reshaping a 3-disk RAID5 to a 4-disk RAID6 hangs, and restoring from the critical section is impossible.

Version-Release number of selected component (if applicable):
mdadm-4.1-13.el8.x86_64
kernel-4.18.0-190.3.el8.x86_64

How reproducible:
always

Steps to Reproduce:
truncate -s 1G disk1
truncate -s 1G disk2
truncate -s 1G disk3
truncate -s 1G disk4
DEVS=($(losetup --find --show disk1))
DEVS+=($(losetup --find --show disk2))
DEVS+=($(losetup --find --show disk3))
ADD=$(losetup --find --show disk4)
mdadm --create /dev/md0 --level=5 --raid-devices=3 "${DEVS[@]}"
mdadm --wait /dev/md0
mdadm /dev/md0 --add "$ADD"
mdadm --grow /dev/md0 --level=6 --raid-devices=4 --backup-file=mdadm.backup

Actual results:
The array hangs at the very beginning of the migration:

# cat /proc/mdstat
Personalities : [raid6] [raid5] [raid4]
md0 : active raid6 loop3[4] loop2[3] loop1[1] loop0[0]
      2093056 blocks super 1.2 level 6, 512k chunk, algorithm 18 [4/3] [UUU_]
      [>....................]  reshape =  0.0% (1/1046528) finish=2.0min speed=8305K/sec

unused devices: <none>

Expected results:
a RAID6 array with the previously existing data

Additional info:
mdadm --stop /dev/md0
mdadm --assemble /dev/md0 "${DEVS[@]}" $ADD --backup-file=mdadm.backup
mdadm: Failed to restore critical section for reshape, sorry.
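For reference, when the reshape is stuck like this, the kernel side can be inspected before stopping the array. A minimal read-only sketch, assuming the array is /dev/md0 (the attribute names are the standard md sysfs files, nothing specific to this bug):

cat /sys/block/md0/md/sync_action       # should report "reshape"
cat /sys/block/md0/md/sync_completed    # sectors done / total; stays constant while hung
cat /sys/block/md0/md/reshape_position  # current reshape position in sectors
cat /sys/block/md0/md/sync_max          # upper limit mdadm sets while it handles the critical section

If sync_completed and reshape_position never move, the kernel is waiting on userspace rather than making slow progress.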
This just ran fine for me on Fedora 31 (kernel 5.5.11-200, mdadm 'v4.1 - 2018-10-01'), but on real disks, not on loop devices. Can you reproduce on a newer Fedora?
Same behaviour on Fedora 31:
kernel-5.5.10-200.fc31.x86_64
mdadm-4.1-4.fc31.x86_64

# cat /proc/mdstat
Personalities : [raid6] [raid5] [raid4]
md0 : active raid6 loop3[4] loop2[3] loop1[1] loop0[0]
      2093056 blocks super 1.2 level 6, 512k chunk, algorithm 18 [4/3] [UUU_]
      [>....................]  reshape =  0.0% (1/1046528) finish=0.0min speed=174421K/sec

unused devices: <none>
And on Fedora 32:
kernel-5.6.0-0.rc7.git0.2.fc32.x86_64
mdadm-4.1-4.fc32.x86_64

# cat /proc/mdstat
Personalities : [raid6] [raid5] [raid4]
md0 : active raid6 loop3[4] loop2[3] loop1[1] loop0[0]
      2093056 blocks super 1.2 level 6, 512k chunk, algorithm 18 [4/3] [UUU_]
      [>....................]  reshape =  0.0% (1/1046528) finish=0.0min speed=174421K/sec

unused devices: <none>
Test version of mdadm: https://brewweb.engineering.redhat.com/brew/taskinfo?taskID=27903664
That's a rhel-7 package, this is a rhel-8 bug...
Yes, but it will work on RHEL 8. It's a test version of mdadm; it's not going outside RH.
Worked with kernel-4.18.0-193.5.el8.x86_64 and mdadm-4.1-njc.el7_8.x86_64:

Personalities : [raid6] [raid5] [raid4]
md0 : active raid6 loop3[4] loop2[3] loop1[1] loop0[0]
      2093056 blocks super 1.2 level 6, 512k chunk, algorithm 2 [4/4] [UUUU]

unused devices: <none>
Moving our conversation to the RHEL8 bz for tracking... Hubert, could you try this test version on a RHEL8 machine and give feedback?

https://brewweb.engineering.redhat.com/brew/taskinfo?taskID=28291821

Thanks,
Nigel
I'm afraid the scratch build has been cleaned up; there are no RPMs there any more.
https://brewweb.engineering.redhat.com/brew/taskinfo?taskID=28359019
Seems to work fine with:
mdadm-4.1-njc2.el8.x86_64
kernel-4.18.0-193.13.el8.x86_64

# cat /proc/mdstat
Personalities : [raid6] [raid5] [raid4]
md0 : active raid6 loop3[4] loop2[3] loop1[1] loop0[0]
      2093056 blocks super 1.2 level 6, 512k chunk, algorithm 2 [4/4] [UUUU]

unused devices: <none>
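Beyond /proc/mdstat, it's worth confirming the level change and array state once the reshape completes. A small sketch, assuming the array is /dev/md0 (the expected values are what a successful 3->4 disk RAID5->RAID6 migration should produce):

mdadm --wait /dev/md0                                       # block until any pending reshape/resync finishes
mdadm --detail /dev/md0 | grep -E 'Raid Level|Raid Devices|State'
# expect: Raid Level : raid6, Raid Devices : 4, State : clean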
https://www.spinics.net/lists/raid/msg64347.html
https://marc.info/?l=linux-raid&m=159195299630680&w=2

Verified that the above patch fixes the hang and allows the grow to proceed.
Hubert, if you want to give this mdadm a test: https://brewweb.engineering.redhat.com/brew/taskinfo?taskID=29810872
Worked on my test machine, but I just tried 1minutetip and it failed.
[root@ci-vm-10-0-139-241 ~]# uname -a
Linux ci-vm-10-0-139-241.hosted.upshift.rdu2.redhat.com 4.18.0-234.el8.x86_64 #1 SMP Thu Aug 20 10:25:32 EDT 2020 x86_64 x86_64 x86_64 GNU/Linux
[root@ci-vm-10-0-139-241 ~]# cat /etc/redhat-release
Red Hat Enterprise Linux release 8.4 Beta (Ootpa)

Sep 14 14:50:13 ci-vm-10-0-139-241 kernel: md/raid:md0: device loop1 operational as raid disk 1
Sep 14 14:50:13 ci-vm-10-0-139-241 kernel: md/raid:md0: device loop0 operational as raid disk 0
Sep 14 14:50:13 ci-vm-10-0-139-241 kernel: md/raid:md0: raid level 5 active with 2 out of 3 devices, algorithm 2
Sep 14 14:50:13 ci-vm-10-0-139-241 kernel: md0: detected capacity change from 0 to 2143289344
Sep 14 14:50:13 ci-vm-10-0-139-241 kernel: md: recovery of RAID array md0
Sep 14 14:50:24 ci-vm-10-0-139-241 kernel: md: md0: recovery done.
Sep 14 14:50:52 ci-vm-10-0-139-241 kernel: md/raid:md0: device loop2 operational as raid disk 2
Sep 14 14:50:52 ci-vm-10-0-139-241 kernel: md/raid:md0: device loop1 operational as raid disk 1
Sep 14 14:50:52 ci-vm-10-0-139-241 kernel: md/raid:md0: device loop0 operational as raid disk 0
Sep 14 14:50:52 ci-vm-10-0-139-241 kernel: md/raid:md0: raid level 6 active with 3 out of 4 devices, algorithm 18
Sep 14 14:50:53 ci-vm-10-0-139-241 kernel: md: reshape of RAID array md0
Sep 14 14:50:53 ci-vm-10-0-139-241 systemd[1]: Started Manage MD Reshape on /dev/md0.
Sep 14 14:50:53 ci-vm-10-0-139-241 mdadm[1500]: mdadm: array: Cannot grow - need backup-file
Sep 14 14:50:53 ci-vm-10-0-139-241 mdadm[1500]: mdadm: Please provide one with "--backup=..."
Sep 14 14:50:53 ci-vm-10-0-139-241 systemd[1]: mdadm-grow-continue: Main process exited, code=exited, status=1/FAILURE
Sep 14 14:50:53 ci-vm-10-0-139-241 systemd[1]: mdadm-grow-continue: Failed with result 'exit-code'.
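For the record, when mdadm-grow-continue bails out like this, the reshape can usually be nudged forward by hand with the continue form of --grow, pointing it at the same backup file that was given to the original --grow. A sketch only; the path below is a placeholder, not taken from the log above:

# resume a reshape whose grow-continue helper exited; use the original backup file (placeholder path)
mdadm --grow --continue /dev/md0 --backup-file=/root/mdadm.backup

Whether that works here obviously depends on why the unit lost track of the backup file in the first place.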
This happens on 8.2 (OK, I tested it on CentOS). It's not just reshaping RAID5 to RAID6 that hangs; I also hit it when adding a disk to a RAID5 group.

[root@lg02vmcentos82 ~]# mdadm --create /dev/md0 --level 5 -n 4 /dev/sd[bcde]
mdadm: partition table exists on /dev/sdb
mdadm: partition table exists on /dev/sdb but will be lost or meaningless after creating array
mdadm: partition table exists on /dev/sdc
mdadm: partition table exists on /dev/sdc but will be lost or meaningless after creating array
mdadm: partition table exists on /dev/sdd
mdadm: partition table exists on /dev/sdd but will be lost or meaningless after creating array
mdadm: partition table exists on /dev/sde
mdadm: partition table exists on /dev/sde but will be lost or meaningless after creating array
Continue creating array? y
mdadm: Defaulting to version 1.2 metadata
mdadm: array /dev/md0 started.
[root@lg02vmcentos82 ~]# mdadm /dev/md0 --add /dev/sd[fg]
mdadm: added /dev/sdf
mdadm: added /dev/sdg
[root@lg02vmcentos82 ~]# cat /proc/mdstat
Personalities : [raid0] [raid10] [raid6] [raid5] [raid4]
md0 : active raid5 sdg[6](S) sdf[5](S) sde[4] sdd[2] sdc[1] sdb[0]
      6282240 blocks super 1.2 level 5, 512k chunk, algorithm 2 [4/4] [UUUU]

unused devices: <none>
[root@lg02vmcentos82 ~]# mdadm --grow /dev/md0 --backup-file=/tmp/md0-backup-raid5 --raid-devices=5
mdadm: Need to backup 6144K of critical section..
[root@lg02vmcentos82 ~]# cat /proc/mdstat
Personalities : [raid0] [raid10] [raid6] [raid5] [raid4]
md0 : active raid5 sdg[6] sdf[5](S) sde[4] sdd[2] sdc[1] sdb[0]
      6282240 blocks super 1.2 level 5, 512k chunk, algorithm 2 [5/5] [UUUUU]
      [>....................]  reshape =  0.0% (1/2094080) finish=0.3min speed=104394K/sec

unused devices: <none>
[root@lg02vmcentos82 ~]# cat /proc/mdstat
Personalities : [raid0] [raid10] [raid6] [raid5] [raid4]
md0 : active raid5 sdg[6] sdf[5](S) sde[4] sdd[2] sdc[1] sdb[0]
      6282240 blocks super 1.2 level 5, 512k chunk, algorithm 2 [5/5] [UUUUU]
      [>....................]  reshape =  0.0% (1/2094080) finish=0.9min speed=37283K/sec

unused devices: <none>

Then wander off for a few hours:

[root@lg02vmcentos82 ~]# cat /proc/mdstat
Personalities : [raid0] [raid10] [raid6] [raid5] [raid4]
md0 : active raid5 sdg[6] sdf[5](S) sde[4] sdd[2] sdc[1] sdb[0]
      6282240 blocks super 1.2 level 5, 512k chunk, algorithm 2 [5/5] [UUUUU]
      [>....................]  reshape =  0.0% (1/2094080) finish=402.1min speed=86K/sec

unused devices: <none>
[root@lg02vmcentos82 ~]# date
Sun Oct 18 13:27:57 EDT 2020
[root@lg02vmcentos82 ~]# ls -l /run/mdadm/
total 4
lrwxrwxrwx. 1 root root 21 Oct 18 10:00 backup_file-md0 -> /tmp/md0-backup-raid5
-rw-------. 1 root root 53 Oct 18 09:59 map
[root@lg02vmcentos82 ~]# ls -l /tmp/md0-backup-raid5
-rw-------. 1 root root 6295552 Oct 18 10:00 /tmp/md0-backup-raid5
[root@lg02vmcentos82 ~]# cat /proc/mdstat
Personalities : [raid0] [raid10] [raid6] [raid5] [raid4]
md0 : active raid5 sdg[6] sdf[5](S) sde[4] sdd[2] sdc[1] sdb[0]
      6282240 blocks super 1.2 level 5, 512k chunk, algorithm 2 [5/5] [UUUUU]
      [>....................]  reshape =  0.0% (1/2094080) finish=415.9min speed=83K/sec

unused devices: <none>

I get just the same behaviour when converting the RAID5 to RAID6 as noted above.
My setup is a KVM-based virtual machine.
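When a reshape crawls like this, it can help to check whether the per-array grow-continue helper is still alive and whether the kernel is pinned at a tiny sync_max (i.e. waiting on userspace) rather than being genuinely slow. A sketch, assuming the array is md0 and that mdadm started its usual mdadm-grow-continue@md0.service instance:

systemctl status mdadm-grow-continue@md0.service   # helper that advances the reshape past the critical section
cat /sys/block/md0/md/sync_max                     # stuck at a small value => the helper is not raising it
cat /proc/sys/dev/raid/speed_limit_min             # rules out simple rebuild-speed throttling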
https://www.spinics.net/lists/raid/msg67053.html