
Bug 1818914

Summary: [rhel 8.0] mdadm reshape from RAID5 to RAID6 hangs [rhel-8]
Product: Red Hat Enterprise Linux 8
Reporter: Alicja Kario <hkario>
Component: mdadm
Assignee: Nigel Croxon <ncroxon>
Status: CLOSED CURRENTRELEASE
QA Contact: Storage QE <storage-qe>
Severity: medium
Docs Contact:
Priority: medium
Version: 8.3
CC: agk, dledford, extras-qa, heinzm, jbrassow, jes.sorensen, Ken.Green, ncroxon, xni
Target Milestone: rc
Flags: pm-rhel: mirror+
Target Release: 8.0
Hardware: x86_64
OS: Unspecified
Whiteboard:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: 1818912
Clones: 1818931 (view as bug list)
Environment:
Last Closed: 2021-08-31 20:11:33 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions: 
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On: 1818912
Bug Blocks: 1818931

Description Alicja Kario 2020-03-30 17:08:19 UTC
Description of problem:
Reshaping a 3-disk RAID5 to a 4-disk RAID6 hangs, and restoring from the critical section is impossible.

Version-Release number of selected component (if applicable):
mdadm-4.1-13.el8.x86_64
kernel-4.18.0-190.3.el8.x86_64

How reproducible:
always

Steps to Reproduce:
truncate -s 1G disk1
truncate -s 1G disk2
truncate -s 1G disk3
truncate -s 1G disk4
DEVS=($(losetup --find --show disk1))
DEVS+=($(losetup --find --show disk2))
DEVS+=($(losetup --find --show disk3))
ADD=$(losetup --find --show disk4)
mdadm --create /dev/md0 --level=5 --raid-devices=3 "${DEVS[@]}"
mdadm --wait /dev/md0
mdadm /dev/md0 --add "$ADD"
mdadm --grow /dev/md0 --level=6 --raid-devices=4 --backup-file=mdadm.backup
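
To tell a genuine hang from a merely slow reshape, the counters can be watched for a while (a minimal sketch, assuming the array from the steps above; in the failing case the position never moves past the first sector):

watch -n 5 cat /proc/mdstat    # position stays at 0.0% (1/1046528)
mdadm --detail /dev/md0        # "Reshape Status" likewise never advances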

Actual results:
Hangs at the beginning of the migration:

# cat /proc/mdstat 
Personalities : [raid6] [raid5] [raid4] 
md0 : active raid6 loop3[4] loop2[3] loop1[1] loop0[0]
      2093056 blocks super 1.2 level 6, 512k chunk, algorithm 18 [4/3] [UUU_]
      [>....................]  reshape =  0.0% (1/1046528) finish=2.0min speed=8305K/sec
      
unused devices: <none>


Expected results:
a RAID6 array with previously existing data

Additional info:
mdadm --stop /dev/md0
mdadm --assemble /dev/md0 "${DEVS[@]}" $ADD --backup-file=mdadm.backup

mdadm: Failed to restore critical section for reshape, sorry.
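
mdadm --examine can also show the reshape state recorded in the member superblocks before the assemble is retried (a sketch, assuming the shell variables from the reproducer; exact field names vary between mdadm versions):

for d in "${DEVS[@]}" "$ADD"; do
    mdadm --examine "$d" | grep -iE 'reshape|new'    # reshape position, pending level/layout change
done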

Comment 1 Heinz Mauelshagen 2020-03-30 17:18:07 UTC
Just ran fine on Fedora 31, kernel 5.5.11-200 and mdadm 'v4.1 - 2018-10-01', but on disks, not on loop devices.
Can you reproduce on a newer Fedora?

Comment 2 Alicja Kario 2020-03-30 17:52:10 UTC
Same behaviour on Fedora 31:
kernel-5.5.10-200.fc31.x86_64
mdadm-4.1-4.fc31.x86_64

# cat /proc/mdstat
Personalities : [raid6] [raid5] [raid4] 
md0 : active raid6 loop3[4] loop2[3] loop1[1] loop0[0]
      2093056 blocks super 1.2 level 6, 512k chunk, algorithm 18 [4/3] [UUU_]
      [>....................]  reshape =  0.0% (1/1046528) finish=0.0min speed=174421K/sec
      
unused devices: <none>

Comment 3 Alicja Kario 2020-03-30 17:58:17 UTC
And on Fedora 32:
kernel-5.6.0-0.rc7.git0.2.fc32.x86_64
mdadm-4.1-4.fc32.x86_64

# cat /proc/mdstat 
Personalities : [raid6] [raid5] [raid4] 
md0 : active raid6 loop3[4] loop2[3] loop1[1] loop0[0]
      2093056 blocks super 1.2 level 6, 512k chunk, algorithm 18 [4/3] [UUU_]
      [>....................]  reshape =  0.0% (1/1046528) finish=0.0min speed=174421K/sec
      
unused devices: <none>

Comment 4 Nigel Croxon 2020-04-13 18:51:23 UTC
Test version of mdadm:
https://brewweb.engineering.redhat.com/brew/taskinfo?taskID=27903664

Comment 5 Alicja Kario 2020-04-14 11:06:01 UTC
That's a rhel-7 package, but this is a rhel-8 bug...

Comment 6 Nigel Croxon 2020-04-14 11:38:33 UTC
Yes, but it will work on RHEL 8.
It's a test version of mdadm; it's not going outside RH.

Comment 7 Alicja Kario 2020-04-14 11:46:42 UTC
worked with kernel-4.18.0-193.5.el8.x86_64 and mdadm-4.1-njc.el7_8.x86_64:

Personalities : [raid6] [raid5] [raid4] 
md0 : active raid6 loop3[4] loop2[3] loop1[1] loop0[0]
      2093056 blocks super 1.2 level 6, 512k chunk, algorithm 2 [4/4] [UUUU]
      
unused devices: <none>

Comment 8 Nigel Croxon 2020-04-30 19:43:31 UTC
Moving our conversation to the RHEL 8 BZ for tracking...

Hubert,
could you try this test version on a RHEL 8 machine and give feedback?

https://brewweb.engineering.redhat.com/brew/taskinfo?taskID=28291821

Thanks Nigel

Comment 9 Alicja Kario 2020-05-04 14:25:15 UTC
I'm afraid the scratch build was cleaned up; there are no RPMs there any more.

Comment 11 Alicja Kario 2020-05-04 17:18:56 UTC
Seems to work fine:

mdadm-4.1-njc2.el8.x86_64
kernel-4.18.0-193.13.el8.x86_64

# cat /proc/mdstat 
Personalities : [raid6] [raid5] [raid4] 
md0 : active raid6 loop3[4] loop2[3] loop1[1] loop0[0]
      2093056 blocks super 1.2 level 6, 512k chunk, algorithm 2 [4/4] [UUUU]
      
unused devices: <none>

Comment 12 Nigel Croxon 2020-05-05 19:43:28 UTC
https://www.spinics.net/lists/raid/msg64347.html

Comment 13 Nigel Croxon 2020-07-01 13:23:39 UTC
https://marc.info/?l=linux-raid&m=159195299630680&w=2

Verified the above patch fixes the hang and allows the grow to proceed.

Comment 14 Nigel Croxon 2020-07-01 13:25:18 UTC
Hubert, if you want to give this mdadm a test:

https://brewweb.engineering.redhat.com/brew/taskinfo?taskID=29810872

Comment 15 Nigel Croxon 2020-07-01 14:00:21 UTC
Worked on my test machine, but I just tried 1minutetip and it failed.

Comment 16 Nigel Croxon 2020-09-14 19:21:57 UTC
[root@ci-vm-10-0-139-241 ~]# uname -a
Linux ci-vm-10-0-139-241.hosted.upshift.rdu2.redhat.com 4.18.0-234.el8.x86_64 #1 SMP Thu Aug 20 10:25:32 EDT 2020 x86_64 x86_64 x86_64 GNU/Linux
[root@ci-vm-10-0-139-241 ~]# cat /etc/redhat-release 
Red Hat Enterprise Linux release 8.4 Beta (Ootpa)

Sep 14 14:50:13 ci-vm-10-0-139-241 kernel: md/raid:md0: device loop1 operational as raid disk 1
Sep 14 14:50:13 ci-vm-10-0-139-241 kernel: md/raid:md0: device loop0 operational as raid disk 0
Sep 14 14:50:13 ci-vm-10-0-139-241 kernel: md/raid:md0: raid level 5 active with 2 out of 3 devices, algorithm 2
Sep 14 14:50:13 ci-vm-10-0-139-241 kernel: md0: detected capacity change from 0 to 2143289344
Sep 14 14:50:13 ci-vm-10-0-139-241 kernel: md: recovery of RAID array md0
Sep 14 14:50:24 ci-vm-10-0-139-241 kernel: md: md0: recovery done.
Sep 14 14:50:52 ci-vm-10-0-139-241 kernel: md/raid:md0: device loop2 operational as raid disk 2
Sep 14 14:50:52 ci-vm-10-0-139-241 kernel: md/raid:md0: device loop1 operational as raid disk 1
Sep 14 14:50:52 ci-vm-10-0-139-241 kernel: md/raid:md0: device loop0 operational as raid disk 0
Sep 14 14:50:52 ci-vm-10-0-139-241 kernel: md/raid:md0: raid level 6 active with 3 out of 4 devices, algorithm 18
Sep 14 14:50:53 ci-vm-10-0-139-241 kernel: md: reshape of RAID array md0
Sep 14 14:50:53 ci-vm-10-0-139-241 systemd[1]: Started Manage MD Reshape on /dev/md0.
Sep 14 14:50:53 ci-vm-10-0-139-241 mdadm[1500]: mdadm: array: Cannot grow - need backup-file
Sep 14 14:50:53 ci-vm-10-0-139-241 mdadm[1500]: mdadm:  Please provide one with "--backup=..."
Sep 14 14:50:53 ci-vm-10-0-139-241 systemd[1]: mdadm-grow-continue: Main process exited, code=exited, status=1/FAILURE
Sep 14 14:50:53 ci-vm-10-0-139-241 systemd[1]: mdadm-grow-continue: Failed with result 'exit-code'.
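
For reference, mdadm-grow-continue is the unit that normally resumes a reshape after assembly; the failure above means it was started without knowing where the backup file is. Doing that step by hand would look roughly like this (a sketch, assuming the backup file path from the reproducer):

mdadm --grow --continue /dev/md0 --backup-file=mdadm.backup    # hand mdadm the backup file the unit could not find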

Comment 18 Ken Green 2020-10-18 17:33:27 UTC
This happens on 8.2 (OK, I tested it on CentOS). It's not just reshaping RAID5 to RAID6 that hangs; I also hit it when adding a disk to a RAID5 group.

[root@lg02vmcentos82 ~]#
[root@lg02vmcentos82 ~]#
[root@lg02vmcentos82 ~]# mdadm --create /dev/md0 --level 5 -n 4 /dev/sd[bcde]
mdadm: partition table exists on /dev/sdb
mdadm: partition table exists on /dev/sdb but will be lost or
       meaningless after creating array
mdadm: partition table exists on /dev/sdc
mdadm: partition table exists on /dev/sdc but will be lost or
       meaningless after creating array
mdadm: partition table exists on /dev/sdd
mdadm: partition table exists on /dev/sdd but will be lost or
       meaningless after creating array
mdadm: partition table exists on /dev/sde
mdadm: partition table exists on /dev/sde but will be lost or
       meaningless after creating array
Continue creating array? y
mdadm: Defaulting to version 1.2 metadata
mdadm: array /dev/md0 started.
[root@lg02vmcentos82 ~]#
[root@lg02vmcentos82 ~]#
[root@lg02vmcentos82 ~]# mdadm /dev/md0 --add /dev/sd[fg]
mdadm: added /dev/sdf
mdadm: added /dev/sdg
[root@lg02vmcentos82 ~]#
[root@lg02vmcentos82 ~]#
[root@lg02vmcentos82 ~]# cat /proc/mdstat
Personalities : [raid0] [raid10] [raid6] [raid5] [raid4]
md0 : active raid5 sdg[6](S) sdf[5](S) sde[4] sdd[2] sdc[1] sdb[0]
      6282240 blocks super 1.2 level 5, 512k chunk, algorithm 2 [4/4] [UUUU]

unused devices: <none>
[root@lg02vmcentos82 ~]#
[root@lg02vmcentos82 ~]#
[root@lg02vmcentos82 ~]# mdadm --grow /dev/md0  --backup-file=/tmp/md0-backup-raid5 --raid-devices=5
mdadm: Need to backup 6144K of critical section..
[root@lg02vmcentos82 ~]#
[root@lg02vmcentos82 ~]# cat /proc/mdstat
Personalities : [raid0] [raid10] [raid6] [raid5] [raid4]
md0 : active raid5 sdg[6] sdf[5](S) sde[4] sdd[2] sdc[1] sdb[0]
      6282240 blocks super 1.2 level 5, 512k chunk, algorithm 2 [5/5] [UUUUU]
      [>....................]  reshape =  0.0% (1/2094080) finish=0.3min speed=104394K/sec

unused devices: <none>
[root@lg02vmcentos82 ~]#
[root@lg02vmcentos82 ~]# cat /proc/mdstat
Personalities : [raid0] [raid10] [raid6] [raid5] [raid4]
md0 : active raid5 sdg[6] sdf[5](S) sde[4] sdd[2] sdc[1] sdb[0]
      6282240 blocks super 1.2 level 5, 512k chunk, algorithm 2 [5/5] [UUUUU]
      [>....................]  reshape =  0.0% (1/2094080) finish=0.9min speed=37283K/sec

unused devices: <none>
[root@lg02vmcentos82 ~]#

Then wander off for a few hours.

[root@lg02vmcentos82 ~]# cat /proc/mdstat
Personalities : [raid0] [raid10] [raid6] [raid5] [raid4]
md0 : active raid5 sdg[6] sdf[5](S) sde[4] sdd[2] sdc[1] sdb[0]
      6282240 blocks super 1.2 level 5, 512k chunk, algorithm 2 [5/5] [UUUUU]
      [>....................]  reshape =  0.0% (1/2094080) finish=402.1min speed=86K/sec

unused devices: <none>
[root@lg02vmcentos82 ~]# date
Sun Oct 18 13:27:57 EDT 2020
[root@lg02vmcentos82 ~]# ls -l /run/mdadm/
total 4
lrwxrwxrwx. 1 root root 21 Oct 18 10:00 backup_file-md0 -> /tmp/md0-backup-raid5
-rw-------. 1 root root 53 Oct 18 09:59 map
[root@lg02vmcentos82 ~]#
[root@lg02vmcentos82 ~]# ls -l /tmp/md0-backup-raid5
-rw-------. 1 root root 6295552 Oct 18 10:00 /tmp/md0-backup-raid5
[root@lg02vmcentos82 ~]# cat /proc/mdstat
Personalities : [raid0] [raid10] [raid6] [raid5] [raid4]
md0 : active raid5 sdg[6] sdf[5](S) sde[4] sdd[2] sdc[1] sdb[0]
      6282240 blocks super 1.2 level 5, 512k chunk, algorithm 2 [5/5] [UUUUU]
      [>....................]  reshape =  0.0% (1/2094080) finish=415.9min speed=83K/sec

unused devices: <none>
[root@lg02vmcentos82 ~]#

I get just the same behaviour when converting the RAID5 to RAID6, as noted above.

My setup is running in a KVM-based virtual machine.
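
One thing worth ruling out when the speed drops this low is ordinary sync throttling; the md speed limits and the current reshape speed can be read from the standard tunables (a sketch; the values shown are kernel defaults/examples, not a fix for this bug):

cat /proc/sys/dev/raid/speed_limit_min    # default 1000 KB/s per device
cat /proc/sys/dev/raid/speed_limit_max    # default 200000 KB/s per device
cat /sys/block/md0/md/sync_speed          # current reshape speed in KB/s
echo 50000 > /proc/sys/dev/raid/speed_limit_min    # example: raise the floor to rule throttling out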

Comment 19 Nigel Croxon 2021-01-27 17:44:27 UTC
https://www.spinics.net/lists/raid/msg67053.html