+++ This bug was initially created as a clone of Bug #1014102 +++

Description of problem:
If we stop an ongoing recovery of a RAID1 or RAID10 volume, it will not
continue after reassembling. The volume will be in normal/active state as if
the recovery had finished correctly. The same defect occurs if we reboot the
OS during an ongoing recovery: recovery will not continue when the OS boots.
It causes data corruption.

Version-Release number of selected component (if applicable):
mdadm-3.2.6-3.el6.x86_64 and mdadm-3.2.6-4.el6.x86_64
with kernels: kernel-2.6.32-407.el6.x86_64 or kernel-2.6.32-415.el6.x86_64 or
kernel-2.6.32-419.el6.x86_64

How reproducible:
Always

Steps to reproduce:
1) Create a RAID1 volume with 2 disks:
   # mdadm -C /dev/md/imsm -e imsm -n 2 /dev/sd[ab]
   # mdadm -C /dev/md/raid1 -l 1 -n 2 /dev/sd[ab]
2) Wait for the resync to finish
3) Add a new disk to the container:
   # mdadm --add /dev/md127 /dev/sdc
4) Turn off or fail one of the disks of the volume (sda or sdb):
   # mdadm -f /dev/md126 /dev/sda
5) Wait for the recovery to start
6) Stop the volume:
   # mdadm -Ss
7) Reassemble the volume:
   # mdadm -As

   OR

6) Reboot the OS

Actual result:
Recovery does not continue after reassembling/reboot. The volume is in
normal/active state in mdstat, but a rebuild state is recorded in the
metadata:

Personalities : [raid1]
md126 : active raid1 sdb[1] sda[0]
      47185920 blocks super external:/md127/0 [2/2] [UU]

md127 : inactive sdb[1](S) sda[0](S)
      6306 blocks super external:imsm

unused devices: <none>

# mdadm -E /dev/sda | grep -A 17 raid1
[raid1]:
           UUID : 99ddea0a:a99b6376:ab0ce862:6cb93b3a
     RAID Level : 1 <-- 1
        Members : 2 <-- 2
          Slots : [UU] <-- [_U]
    Failed disk : 0
      This Slot : 1
     Array Size : 94371840 (45.00 GiB 48.32 GB)
   Per Dev Size : 94371840 (45.00 GiB 48.32 GB)
  Sector Offset : 0
    Num Stripes : 368640
     Chunk Size : 64 KiB <-- 64 KiB
       Reserved : 0
  Migrate State : rebuild
      Map State : normal <-- degraded
     Checkpoint : 23040 (512)
    Dirty State : clean

Expected result:
Recovery continues after reassembling/reboot.

Additional info:
The defect also reproduces with upstream kernel 3.11.1 and upstream mdadm.
The defect does not reproduce with kernel-2.6.32-358.el6.x86_64.

--- Additional comment from Lukasz Dorau on 2013-10-07 10:40:48 EDT ---

We have found a fix for this bug. The attached patch has just been sent
upstream: http://marc.info/?l=linux-raid&m=138115595902520&w=4
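For convenience, the reproduction steps above can be scripted roughly as
follows. This is only a sketch, not part of the original report: the device
names (/dev/sda, /dev/sdb, /dev/sdc) are placeholders, /dev/md/imsm is assumed
to be the container (md127) and /dev/md/raid1 the volume (md126), and the
fixed sleep is a crude stand-in for properly polling sync_action in sysfs.

  #!/bin/bash
  # Rough sketch of the reproduction steps; device names are placeholders.
  set -e

  # 1) Create the IMSM container and a RAID1 volume on two disks
  mdadm -C /dev/md/imsm -e imsm -n 2 /dev/sda /dev/sdb --run
  mdadm -C /dev/md/raid1 -l 1 -n 2 /dev/sda /dev/sdb --run

  # 2) Wait for the initial resync to finish
  mdadm --wait /dev/md/raid1 || true

  # 3) Add a spare disk to the container
  mdadm --add /dev/md/imsm /dev/sdc

  # 4) Fail one member of the volume so recovery onto the spare starts
  mdadm -f /dev/md/raid1 /dev/sda

  # 5) Give the recovery a moment to start, then show its progress
  sleep 5
  cat /proc/mdstat

  # 6) Stop all arrays mid-recovery
  mdadm -Ss

  # 7) Reassemble and check whether the recovery resumes
  mdadm -As
  cat /proc/mdstat
  mdadm -E /dev/sdb | grep -E 'Migrate State|Map State|Checkpoint'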
This fix is already upstream, but it is not present in the 3.11.7-200.fc19 kernel:
https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=61e4947c99c4494336254ec540c50186d186150b
(In reply to pawel.baldysiak from comment #1)
> This fix is already upstream, but it is not present in the 3.11.7-200.fc19 kernel:
> https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=61e4947c99c4494336254ec540c50186d186150b

Thank you for the pointer. That patch was just queued for 3.11.8 this morning.
We'll look at getting it into Fedora soon.
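Once a kernel containing that commit is installed, the expected behaviour can
be spot-checked with something like the following. Again, this is only a
sketch added for illustration, reusing the placeholder device and array names
from the reproduction steps:

  # After reassembling on a fixed kernel, the interrupted rebuild should
  # resume instead of being silently dropped:
  mdadm -As
  grep -A 3 '^md126' /proc/mdstat            # recovery should be in progress
  mdadm -E /dev/sdb | grep 'Migrate State'   # should still report "rebuild"
  mdadm --wait /dev/md126                    # wait for the rebuild to complete
  mdadm -E /dev/sdb | grep 'Migrate State'   # should no longer report "rebuild"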
kernel 3.11.9-200.fc19.x86_64.rpm on F19 and kernel-devel-3.11.9-300.fc20.x86_64
for F20 were released. I guess we can close this one now?

Thanks,
Michele
Yep, thanks for the reminder.