849243 – Systemd + udev failure with assembling non-perfect md raid array


Keywords:
Status:	CLOSED INSUFFICIENT_DATA
Alias:	None
Product:	Fedora
Classification:	Fedora
Component:	mdadm
Sub Component:
Version:	17
Hardware:	x86_64
OS:	Linux
Priority:	unspecified
Severity:	high
Target Milestone:	---
Assignee:	Jes Sorensen
QA Contact:	Fedora Extras Quality Assurance
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:

Reported:	2012-08-17 19:19 UTC by John 'Warthog9' Hawley
Modified:	2013-05-03 08:04 UTC
CC List:	12 users
Fixed In Version:
Clone Of:
Environment:
Last Closed:	2013-05-03 08:04:44 UTC
Type:	Bug
Embargoed:
Dependent Products:


Description John 'Warthog9' Hawley 2012-08-17 19:19:36 UTC

Description of problem:

Software RAID arrays are fairly common, particularly on cheaper servers or when someone wants the fastest possible I/O (software RAID is dramatically faster than hardware RAID; it's just not as convenient).

With any type of RAID, failure scenarios exist, particularly ones that require a rebuild of the array:

- Drive going bad
- Growing array
- Drive getting erroneously ejected and re-added
- Bad intermediate connection
- etc

In these cases an array rebuild occurs, and rebuilds can take a *VERY* long time. Case in point:

Growing a 3-disk RAID 5 array of 3 TB drives on SATA III (6 Gb/s) interfaces to a 4-disk array was estimated to take 1200 minutes (20 hours, or about a day) to rebuild.
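
An estimate like this is normally read from the progress line in /proc/mdstat. The following is a minimal Python sketch, not part of the original report, that extracts the operation, the progress, and the remaining time from such a line. The name rebuild_eta and the SAMPLE line are illustrative assumptions; the sample is not output captured from the affected machine.

```python
import re

# Matches an md progress line such as:
#   [>....]  reshape =  0.5% (123/456) finish=1200.0min speed=40000K/sec
PROGRESS_RE = re.compile(
    r"(?P<kind>resync|recovery|reshape|check)\s*=\s*(?P<pct>[\d.]+)%"
    r".*?finish=(?P<minutes>[\d.]+)min"
)

# Hypothetical sample, made up for illustration only.
SAMPLE = (
    "      [>....................]  reshape =  0.5% "
    "(14650670/2930134016) finish=1200.0min speed=40000K/sec"
)


def rebuild_eta(mdstat_text):
    """Return (operation, percent_done, hours_remaining), or None if idle."""
    match = PROGRESS_RE.search(mdstat_text)
    if match is None:
        return None
    hours = float(match.group("minutes")) / 60.0
    return match.group("kind"), float(match.group("pct")), hours


if __name__ == "__main__":
    print(rebuild_eta(SAMPLE))
```

On a live system the text would come from reading /proc/mdstat rather than from a constant.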

If the machine reboots during a rebuild, whether intentionally or because of a power failure, an unrelated panic, etc., it should come back up cleanly, assemble the now "degraded" array, and continue the rebuild. (This is the old, and obviously expected, behavior.)

This does not occur. Instead, when the system now boots, systemd + udev find the array and attempt to assemble it, apparently notice it's degraded, and fail to assemble it. It then loops: re-discovers the array, attempts to assemble it, notices it's degraded, assembly fails, re-discovers the array... ad infinitum.

Version-Release number of selected component (if applicable):


How reproducible:

Seems to be always

Steps to Reproduce:
1. Take 4 disks of any size (large enough that an array rebuild will take a little while).
2. Assemble a 3-disk RAID 5 array. Let it rebuild completely.
3. Reboot.
4. Add the 4th disk and grow the array.
5. While the array rebuild is ongoing, reboot the machine (`reboot`).
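
The steps above can be sketched as a dry run that only prints the commands involved. This is a hedged illustration, not the reporter's exact procedure: the report does not give the commands used, the device names (/dev/md0, /dev/sdb1 and so on) are hypothetical, and the function name repro_commands is made up for this example. The mdadm options shown (--create, --level, --raid-devices, --wait, --add, --grow) are standard ones.

```python
import shlex


def repro_commands(md="/dev/md0",
                   disks=("/dev/sdb1", "/dev/sdc1", "/dev/sdd1", "/dev/sde1")):
    """Return the command lines for the reproduction steps (nothing is run)."""
    first_three, fourth = list(disks[:3]), disks[3]
    return [
        # Step 2: build a 3-disk RAID 5 array and wait for the initial sync.
        ["mdadm", "--create", md, "--level=5", "--raid-devices=3", *first_three],
        ["mdadm", "--wait", md],
        # Step 3: first reboot, with the array clean.
        ["reboot"],
        # Step 4: add the 4th disk and grow the array onto it.
        ["mdadm", "--add", md, fourth],
        ["mdadm", "--grow", md, "--raid-devices=4"],
        # Step 5: reboot while the reshape is still running.
        ["reboot"],
    ]


if __name__ == "__main__":
    for argv in repro_commands():
        print(shlex.join(argv))
```

Printing instead of executing keeps the sketch safe to run anywhere; actually running these commands requires root and destroys data on the listed devices.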
  
Actual results:

udevd[<pid>]: timeout: killing '/sbin/mdadm --detail --export /dev/md#' [397]

udevd[<pid>]: timeout: killing '/sbin/mdadm --offroot -I /dev/sda1' [344]

udevd[<pid>]: timeout: killing '/sbin/mdadm --detail --export /dev/md#' [397]

udevd[<pid>]: timeout: killing '/sbin/mdadm --offroot -I /dev/sda1' [344]

udevd[<pid>]: timeout: killing '/sbin/mdadm --detail --export /dev/md#' [397]

udevd[<pid>]: timeout: killing '/sbin/mdadm --offroot -I /dev/sda1' [344]

... onto infinity
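
One way to confirm the loop from a saved console or journal capture is to count how often udevd kills each command. The following is a minimal Python sketch under that assumption; the names count_timeouts and looping_commands and the threshold of 3 are illustrative choices, and LOG simply repeats the lines quoted above.

```python
import re
from collections import Counter

# Matches lines like:
#   udevd[<pid>]: timeout: killing '/sbin/mdadm --offroot -I /dev/sda1' [344]
TIMEOUT_RE = re.compile(
    r"udevd\[[^\]]*\]: timeout: killing '(?P<cmd>[^']+)' \[(?P<pid>\d+)\]"
)

# The messages quoted in this report, used as sample input.
LOG = [
    "udevd[<pid>]: timeout: killing '/sbin/mdadm --detail --export /dev/md#' [397]",
    "udevd[<pid>]: timeout: killing '/sbin/mdadm --offroot -I /dev/sda1' [344]",
] * 3


def count_timeouts(lines):
    """Count udevd timeout kills per command line."""
    counts = Counter()
    for line in lines:
        match = TIMEOUT_RE.search(line)
        if match:
            counts[match.group("cmd")] += 1
    return counts


def looping_commands(lines, threshold=3):
    """Return commands killed at least `threshold` times, sorted by name."""
    counts = count_timeouts(lines)
    return sorted(cmd for cmd, n in counts.items() if n >= threshold)


if __name__ == "__main__":
    for cmd in looping_commands(LOG):
        print(cmd)
```

The same worker PIDs (397 and 344) recur in every message, which the regular expression also captures in case that detail is useful when comparing boots.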

Expected results:

The system recognizes that the array is degraded, assembles it accordingly, and lets the array rebuild continue.

Additional info:

Comment 1 John 'Warthog9' Hawley 2012-08-17 19:39:37 UTC

OK, additional detail: the system I'm running uses encryption on top of the RAID array (which took some doing).

However, while it is asking for the encryption password, the RAID information seems to indicate that the array has already been reassembled properly and that the rebuild will continue. But when I just let things sit (no attempt to input the password), messages like the following show up:

udevd[<pid>]: timeout: killing '/sbin/mdadm --detail --export /dev/md#' [397]

Comment 2 Michal Schmidt 2012-08-20 09:27:38 UTC

systemd does not assemble RAID arrays. It's done from mdadm's udev rules. Reassigning to mdadm.

Comment 3 Jes Sorensen 2012-09-03 12:30:25 UTC

OK, a couple of questions here:

1) Please provide the mdadm/systemd/dracut/kernel versions you see this with.
2) Does this happen using your described scenario, i.e. without encryption on
   top of the array?
3) Can you reproduce the problem _without_ the rebuild from growing the array,
   i.e. if you just force a rebuild of the array over the 3 disks?

Thanks,
Jes

Comment 4 Jes Sorensen 2012-10-08 14:10:04 UTC

J.H.

Ping?

Comment 5 Jes Sorensen 2013-05-03 08:04:44 UTC

No reply to request for info for 6 months - closing
