+++ This bug was initially created as a clone of Bug #1014102 +++

Description of problem:
If we stop an ongoing recovery of a RAID1 or RAID10 volume, it will not
continue after reassembling. The volume will be in normal/active state as if
the recovery had finished correctly. The same defect occurs if we reboot the
OS during an ongoing recovery: recovery will not continue when the OS boots.
It causes data corruption.

Version-Release number of selected component (if applicable):
mdadm-3.2.6-3.el6.x86_64 and mdadm-3.2.6-4.el6.x86_64
with kernels: kernel-2.6.32-407.el6.x86_64 or kernel-2.6.32-415.el6.x86_64 or
kernel-2.6.32-419.el6.x86_64

How reproducible:
Always

Steps to reproduce:
1) Create a RAID1 volume with 2 disks:
   # mdadm -C /dev/md/imsm -e imsm -n 2 /dev/sd[ab]
   # mdadm -C /dev/md/raid1 -l 1 -n 2 /dev/sd[ab]
2) Wait for the resync to finish
3) Add a new disk to the container:
   # mdadm --add /dev/md127 /dev/sdc
4) Turn off or fail one of the disks of the volume (sda or sdb):
   # mdadm -f /dev/md126 /dev/sda
5) Wait for the recovery to start
6) Stop the volume:
   # mdadm -Ss
7) Reassemble the volume:
   # mdadm -As

   OR

6) Reboot the OS

Actual result:
Recovery does not continue after reassembling/reboot. The volume is in
normal/active state in mdstat, but a rebuild state is recorded in the
metadata:

Personalities : [raid1]
md126 : active raid1 sdb[1] sda[0]
      47185920 blocks super external:/md127/0 [2/2] [UU]

md127 : inactive sdb[1](S) sda[0](S)
      6306 blocks super external:imsm

unused devices: <none>

# mdadm -E /dev/sda | grep -A 17 raid1
[raid1]:
           UUID : 99ddea0a:a99b6376:ab0ce862:6cb93b3a
     RAID Level : 1 <-- 1
        Members : 2 <-- 2
          Slots : [UU] <-- [_U]
    Failed disk : 0
      This Slot : 1
     Array Size : 94371840 (45.00 GiB 48.32 GB)
   Per Dev Size : 94371840 (45.00 GiB 48.32 GB)
  Sector Offset : 0
    Num Stripes : 368640
     Chunk Size : 64 KiB <-- 64 KiB
       Reserved : 0
  Migrate State : rebuild
      Map State : normal <-- degraded
     Checkpoint : 23040 (512)
    Dirty State : clean

Expected result:
Recovery continues after reassembling/reboot.

Additional info:
The defect also reproduces with upstream kernel 3.11.1 and upstream mdadm.
The defect does not reproduce with kernel-2.6.32-358.el6.x86_64.

--- Additional comment from Lukasz Dorau on 2013-10-07 10:40:48 EDT ---

We have found a fix for this bug. The attached patch has just been sent
upstream: http://marc.info/?l=linux-raid&m=138115595902520&w=4
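For convenience, the reproduction steps above can be scripted roughly as
follows. This is only a sketch, not part of the original report: the device
names (/dev/sda, /dev/sdb, /dev/sdc) are placeholders, /dev/md/imsm is assumed
to be the container (md127) and /dev/md/raid1 the volume (md126), and the
fixed sleep is a crude stand-in for properly polling sync_action in sysfs.

  #!/bin/bash
  # Rough sketch of the reproduction steps; device names are placeholders.
  set -e

  # 1) Create the IMSM container and a RAID1 volume on two disks
  mdadm -C /dev/md/imsm -e imsm -n 2 /dev/sda /dev/sdb --run
  mdadm -C /dev/md/raid1 -l 1 -n 2 /dev/sda /dev/sdb --run

  # 2) Wait for the initial resync to finish
  mdadm --wait /dev/md/raid1 || true

  # 3) Add a spare disk to the container
  mdadm --add /dev/md/imsm /dev/sdc

  # 4) Fail one member of the volume so recovery onto the spare starts
  mdadm -f /dev/md/raid1 /dev/sda

  # 5) Give the recovery a moment to start, then show its progress
  sleep 5
  cat /proc/mdstat

  # 6) Stop all arrays mid-recovery
  mdadm -Ss

  # 7) Reassemble and check whether the recovery resumes
  mdadm -As
  cat /proc/mdstat
  mdadm -E /dev/sdb | grep -E 'Migrate State|Map State|Checkpoint'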
This fix is already upstream, but it is not present in the 3.11.7-200.fc19 kernel:
https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=61e4947c99c4494336254ec540c50186d186150b
(In reply to pawel.baldysiak from comment #1)
> This fix is already upstream, but it is not present in the 3.11.7-200.fc19 kernel:
> https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=61e4947c99c4494336254ec540c50186d186150b

Thank you for the pointer. That patch was just queued for 3.11.8 this morning.
We'll look at getting it into Fedora soon.
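Once a kernel containing that commit is installed, the expected behaviour can
be spot-checked with something like the following. Again, this is only a
sketch added for illustration, reusing the placeholder device and array names
from the reproduction steps:

  # After reassembling on a fixed kernel, the interrupted rebuild should
  # resume instead of being silently dropped:
  mdadm -As
  grep -A 3 '^md126' /proc/mdstat            # recovery should be in progress
  mdadm -E /dev/sdb | grep 'Migrate State'   # should still report "rebuild"
  mdadm --wait /dev/md126                    # wait for the rebuild to complete
  mdadm -E /dev/sdb | grep 'Migrate State'   # should no longer report "rebuild"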
kernel 3.11.9-200.fc19.x86_64.rpm on F19 and kernel-devel-3.11.9-300.fc20.x86_64
for F20 were released. I guess we can close this one now?

Thanks,
Michele
Yep, thanks for the reminder.