Bug 1818914 - [rhel 8.0] mdadm reshape from RAID5 to RAID6 hangs [rhel-8]
Summary: [rhel 8.0] mdadm reshape from RAID5 to RAID6 hangs [rhel-8]
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: Red Hat Enterprise Linux 8
Classification: Red Hat
Component: mdadm
Version: 8.3
Hardware: x86_64
OS: Unspecified
Priority: medium
Severity: medium
Target Milestone: rc
Target Release: 8.0
Assignee: Nigel Croxon
QA Contact: Storage QE
URL:
Whiteboard:
Depends On: 1818912
Blocks: 1818931
 
Reported: 2020-03-30 17:08 UTC by Hubert Kario
Modified: 2021-09-06 15:22 UTC (History)
CC List: 9 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of: 1818912
: 1818931 (view as bug list)
Environment:
Last Closed: 2021-08-31 20:11:33 UTC
Type: Bug
Target Upstream Version:
Embargoed:



Description Hubert Kario 2020-03-30 17:08:19 UTC
Description of problem:
Reshaping a 3-disk RAID5 array into a 4-disk RAID6 array hangs, and restoring from the critical section is impossible.

Version-Release number of selected component (if applicable):
mdadm-4.1-13.el8.x86_64
kernel-4.18.0-190.3.el8.x86_64

How reproducible:
always

Steps to Reproduce:
truncate -s 1G disk1
truncate -s 1G disk2
truncate -s 1G disk3
truncate -s 1G disk4
DEVS=($(losetup --find --show disk1))
DEVS+=($(losetup --find --show disk2))
DEVS+=($(losetup --find --show disk3))
ADD=$(losetup --find --show disk4)
mdadm --create /dev/md0 --level=5 --raid-devices=3 "${DEVS[@]}"
mdadm --wait /dev/md0
mdadm /dev/md0 --add "$ADD"
mdadm --grow /dev/md0 --level=6 --raid-devices=4 --backup-file=mdadm.backup
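
To confirm the reshape is genuinely stuck rather than just slow, the kernel-side progress counters can be polled directly. A small sketch (paths assume the array is /dev/md0; sync_action, sync_completed and reshape_position are standard md sysfs attributes):

# If these values do not change between iterations, the reshape has hung.
for i in $(seq 5); do
    cat /sys/block/md0/md/sync_action \
        /sys/block/md0/md/sync_completed \
        /sys/block/md0/md/reshape_position
    sleep 10
done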

Actual results:
The reshape hangs at the beginning of the migration:

# cat /proc/mdstat 
Personalities : [raid6] [raid5] [raid4] 
md0 : active raid6 loop3[4] loop2[3] loop1[1] loop0[0]
      2093056 blocks super 1.2 level 6, 512k chunk, algorithm 18 [4/3] [UUU_]
      [>....................]  reshape =  0.0% (1/1046528) finish=2.0min speed=8305K/sec
      
unused devices: <none>


Expected results:
a RAID6 array with previously existing data

Additional info:
mdadm --stop /dev/md0
mdadm --assemble /dev/md0 "${DEVS[@]}" $ADD --backup-file=mdadm.backup

mdadm: Failed to restore critical section for reshape, sorry.
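
For reference, the usual last-resort assembly options when the backup file cannot be used (not a fix for this bug, just a sketch of what one would normally try) are --invalid-backup, which lets assembly proceed even though data in the critical section may be lost, and --update=revert-reshape, which abandons the reshape:

mdadm --assemble /dev/md0 "${DEVS[@]}" $ADD --backup-file=mdadm.backup --invalid-backup
mdadm --assemble /dev/md0 "${DEVS[@]}" $ADD --backup-file=mdadm.backup --update=revert-reshape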

Comment 1 Heinz Mauelshagen 2020-03-30 17:18:07 UTC
This just ran fine on Fedora 31 (kernel 5.5.11-200, mdadm 'v4.1 - 2018-10-01'), but on disks, not on loop devices.
Can you reproduce it on a newer Fedora?

Comment 2 Hubert Kario 2020-03-30 17:52:10 UTC
Same behaviour on Fedora 31:
kernel-5.5.10-200.fc31.x86_64
mdadm-4.1-4.fc31.x86_64

# cat /proc/mdstat
Personalities : [raid6] [raid5] [raid4] 
md0 : active raid6 loop3[4] loop2[3] loop1[1] loop0[0]
      2093056 blocks super 1.2 level 6, 512k chunk, algorithm 18 [4/3] [UUU_]
      [>....................]  reshape =  0.0% (1/1046528) finish=0.0min speed=174421K/sec
      
unused devices: <none>

Comment 3 Hubert Kario 2020-03-30 17:58:17 UTC
And on Fedora 32:
kernel-5.6.0-0.rc7.git0.2.fc32.x86_64
mdadm-4.1-4.fc32.x86_64

# cat /proc/mdstat 
Personalities : [raid6] [raid5] [raid4] 
md0 : active raid6 loop3[4] loop2[3] loop1[1] loop0[0]
      2093056 blocks super 1.2 level 6, 512k chunk, algorithm 18 [4/3] [UUU_]
      [>....................]  reshape =  0.0% (1/1046528) finish=0.0min speed=174421K/sec
      
unused devices: <none>

Comment 4 Nigel Croxon 2020-04-13 18:51:23 UTC
Test version of mdadm:
https://brewweb.engineering.redhat.com/brew/taskinfo?taskID=27903664

Comment 5 Hubert Kario 2020-04-14 11:06:01 UTC
That's a rhel-7 package, this is a rhel-8 bug...

Comment 6 Nigel Croxon 2020-04-14 11:38:33 UTC
Yes, but it will work on RHEL 8.
It's a test version of mdadm; it's not going outside RH.

Comment 7 Hubert Kario 2020-04-14 11:46:42 UTC
worked with kernel-4.18.0-193.5.el8.x86_64 and mdadm-4.1-njc.el7_8.x86_64:

Personalities : [raid6] [raid5] [raid4] 
md0 : active raid6 loop3[4] loop2[3] loop1[1] loop0[0]
      2093056 blocks super 1.2 level 6, 512k chunk, algorithm 2 [4/4] [UUUU]
      
unused devices: <none>

Comment 8 Nigel Croxon 2020-04-30 19:43:31 UTC
Moving our conversation to the RHEL 8 BZ for tracking...

Hubert,
could you try this test version on a RHEL 8 machine and give feedback?

https://brewweb.engineering.redhat.com/brew/taskinfo?taskID=28291821

Thanks Nigel

Comment 9 Hubert Kario 2020-05-04 14:25:15 UTC
I'm afraid the scratch build was cleaned up; there are no RPMs there any more.

Comment 11 Hubert Kario 2020-05-04 17:18:56 UTC
seems to work fine

mdadm-4.1-njc2.el8.x86_64
kernel-4.18.0-193.13.el8.x86_64

# cat /proc/mdstat 
Personalities : [raid6] [raid5] [raid4] 
md0 : active raid6 loop3[4] loop2[3] loop1[1] loop0[0]
      2093056 blocks super 1.2 level 6, 512k chunk, algorithm 2 [4/4] [UUUU]
      
unused devices: <none>

Comment 12 Nigel Croxon 2020-05-05 19:43:28 UTC
https://www.spinics.net/lists/raid/msg64347.html

Comment 13 Nigel Croxon 2020-07-01 13:23:39 UTC
https://marc.info/?l=linux-raid&m=159195299630680&w=2

Verified the above patch fixes the hang and allows the grow to proceed.

Comment 14 Nigel Croxon 2020-07-01 13:25:18 UTC
Hubert, if you want to give this mdadm a test:

https://brewweb.engineering.redhat.com/brew/taskinfo?taskID=29810872

Comment 15 Nigel Croxon 2020-07-01 14:00:21 UTC
Worked on my test machine, but I just tried 1minutetip and it failed.

Comment 16 Nigel Croxon 2020-09-14 19:21:57 UTC
[root@ci-vm-10-0-139-241 ~]# uname -a
Linux ci-vm-10-0-139-241.hosted.upshift.rdu2.redhat.com 4.18.0-234.el8.x86_64 #1 SMP Thu Aug 20 10:25:32 EDT 2020 x86_64 x86_64 x86_64 GNU/Linux
[root@ci-vm-10-0-139-241 ~]# cat /etc/redhat-release 
Red Hat Enterprise Linux release 8.4 Beta (Ootpa)

Sep 14 14:50:13 ci-vm-10-0-139-241 kernel: md/raid:md0: device loop1 operational as raid disk 1
Sep 14 14:50:13 ci-vm-10-0-139-241 kernel: md/raid:md0: device loop0 operational as raid disk 0
Sep 14 14:50:13 ci-vm-10-0-139-241 kernel: md/raid:md0: raid level 5 active with 2 out of 3 devices, algorithm 2
Sep 14 14:50:13 ci-vm-10-0-139-241 kernel: md0: detected capacity change from 0 to 2143289344
Sep 14 14:50:13 ci-vm-10-0-139-241 kernel: md: recovery of RAID array md0
Sep 14 14:50:24 ci-vm-10-0-139-241 kernel: md: md0: recovery done.
Sep 14 14:50:52 ci-vm-10-0-139-241 kernel: md/raid:md0: device loop2 operational as raid disk 2
Sep 14 14:50:52 ci-vm-10-0-139-241 kernel: md/raid:md0: device loop1 operational as raid disk 1
Sep 14 14:50:52 ci-vm-10-0-139-241 kernel: md/raid:md0: device loop0 operational as raid disk 0
Sep 14 14:50:52 ci-vm-10-0-139-241 kernel: md/raid:md0: raid level 6 active with 3 out of 4 devices, algorithm 18
Sep 14 14:50:53 ci-vm-10-0-139-241 kernel: md: reshape of RAID array md0
Sep 14 14:50:53 ci-vm-10-0-139-241 systemd[1]: Started Manage MD Reshape on /dev/md0.
Sep 14 14:50:53 ci-vm-10-0-139-241 mdadm[1500]: mdadm: array: Cannot grow - need backup-file
Sep 14 14:50:53 ci-vm-10-0-139-241 mdadm[1500]: mdadm:  Please provide one with "--backup=..."
Sep 14 14:50:53 ci-vm-10-0-139-241 systemd[1]: mdadm-grow-continue: Main process exited, code=exited, status=1/FAILURE
Sep 14 14:50:53 ci-vm-10-0-139-241 systemd[1]: mdadm-grow-continue: Failed with result 'exit-code'.
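
When mdadm-grow-continue fails like this, the reshape can normally be resumed by hand by re-supplying the backup file. A sketch, assuming the backup file from the reproduction steps above:

mdadm --grow --continue /dev/md0 --backup-file=mdadm.backup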

Comment 18 Ken Green 2020-10-18 17:33:27 UTC
This happens on 8.2 (OK, I tested it on CentOS). It's not just reshaping RAID5 to RAID6 that hangs; I also hit it when adding a disk to a RAID5 array.

[root@lg02vmcentos82 ~]#
[root@lg02vmcentos82 ~]#
[root@lg02vmcentos82 ~]# mdadm --create /dev/md0 --level 5 -n 4 /dev/sd[bcde]
mdadm: partition table exists on /dev/sdb
mdadm: partition table exists on /dev/sdb but will be lost or
       meaningless after creating array
mdadm: partition table exists on /dev/sdc
mdadm: partition table exists on /dev/sdc but will be lost or
       meaningless after creating array
mdadm: partition table exists on /dev/sdd
mdadm: partition table exists on /dev/sdd but will be lost or
       meaningless after creating array
mdadm: partition table exists on /dev/sde
mdadm: partition table exists on /dev/sde but will be lost or
       meaningless after creating array
Continue creating array? y
mdadm: Defaulting to version 1.2 metadata
mdadm: array /dev/md0 started.
[root@lg02vmcentos82 ~]#
[root@lg02vmcentos82 ~]#
[root@lg02vmcentos82 ~]# mdadm /dev/md0 --add /dev/sd[fg]
mdadm: added /dev/sdf
mdadm: added /dev/sdg
[root@lg02vmcentos82 ~]#
[root@lg02vmcentos82 ~]#
[root@lg02vmcentos82 ~]# cat /proc/mdstat
Personalities : [raid0] [raid10] [raid6] [raid5] [raid4]
md0 : active raid5 sdg[6](S) sdf[5](S) sde[4] sdd[2] sdc[1] sdb[0]
      6282240 blocks super 1.2 level 5, 512k chunk, algorithm 2 [4/4] [UUUU]

unused devices: <none>
[root@lg02vmcentos82 ~]#
[root@lg02vmcentos82 ~]#
[root@lg02vmcentos82 ~]# mdadm --grow /dev/md0  --backup-file=/tmp/md0-backup-raid5 --raid-devices=5
mdadm: Need to backup 6144K of critical section..
[root@lg02vmcentos82 ~]#
[root@lg02vmcentos82 ~]# cat /proc/mdstat
Personalities : [raid0] [raid10] [raid6] [raid5] [raid4]
md0 : active raid5 sdg[6] sdf[5](S) sde[4] sdd[2] sdc[1] sdb[0]
      6282240 blocks super 1.2 level 5, 512k chunk, algorithm 2 [5/5] [UUUUU]
      [>....................]  reshape =  0.0% (1/2094080) finish=0.3min speed=104394K/sec

unused devices: <none>
[root@lg02vmcentos82 ~]#
[root@lg02vmcentos82 ~]# cat /proc/mdstat
Personalities : [raid0] [raid10] [raid6] [raid5] [raid4]
md0 : active raid5 sdg[6] sdf[5](S) sde[4] sdd[2] sdc[1] sdb[0]
      6282240 blocks super 1.2 level 5, 512k chunk, algorithm 2 [5/5] [UUUUU]
      [>....................]  reshape =  0.0% (1/2094080) finish=0.9min speed=37283K/sec

unused devices: <none>
[root@lg02vmcentos82 ~]#

Then wander off for a few hours:

[root@lg02vmcentos82 ~]# cat /proc/mdstat
Personalities : [raid0] [raid10] [raid6] [raid5] [raid4]
md0 : active raid5 sdg[6] sdf[5](S) sde[4] sdd[2] sdc[1] sdb[0]
      6282240 blocks super 1.2 level 5, 512k chunk, algorithm 2 [5/5] [UUUUU]
      [>....................]  reshape =  0.0% (1/2094080) finish=402.1min speed=86K/sec

unused devices: <none>
[root@lg02vmcentos82 ~]# date
Sun Oct 18 13:27:57 EDT 2020
[root@lg02vmcentos82 ~]# ls -l /run/mdadm/
total 4
lrwxrwxrwx. 1 root root 21 Oct 18 10:00 backup_file-md0 -> /tmp/md0-backup-raid5
-rw-------. 1 root root 53 Oct 18 09:59 map
[root@lg02vmcentos82 ~]#
[root@lg02vmcentos82 ~]# ls -l /tmp/md0-backup-raid5
-rw-------. 1 root root 6295552 Oct 18 10:00 /tmp/md0-backup-raid5
[root@lg02vmcentos82 ~]# cat /proc/mdstat
Personalities : [raid0] [raid10] [raid6] [raid5] [raid4]
md0 : active raid5 sdg[6] sdf[5](S) sde[4] sdd[2] sdc[1] sdb[0]
      6282240 blocks super 1.2 level 5, 512k chunk, algorithm 2 [5/5] [UUUUU]
      [>....................]  reshape =  0.0% (1/2094080) finish=415.9min speed=83K/sec

unused devices: <none>
[root@lg02vmcentos82 ~]#

I get exactly the same behaviour when converting the RAID5 to RAID6, as noted above.

My setup is running in a KVM-based virtual machine.
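
When it is wedged like this, a couple of quick checks show whether the user-space side (the mdadm process feeding the backup file) or the kernel side is the one waiting. A sketch, assuming md0 as above:

# Is the background mdadm that should be driving the reshape still running?
ps axf | grep '[m]dadm'

# What the kernel thinks it is doing, how far it has got, and how far it is
# currently allowed to go (the reshape pauses when it reaches sync_max):
cat /sys/block/md0/md/sync_action
cat /sys/block/md0/md/sync_completed
cat /sys/block/md0/md/sync_max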

Comment 19 Nigel Croxon 2021-01-27 17:44:27 UTC
https://www.spinics.net/lists/raid/msg67053.html

