Bug 453314 - constant raid resyncs on reboot; degraded assemble; udev
Keywords:
Status: CLOSED NEXTRELEASE
Alias: None
Product: Fedora
Classification: Fedora
Component: mdadm
Version: 9
Hardware: All
OS: Linux
Priority: low
Severity: low
Target Milestone: ---
Assignee: Doug Ledford
QA Contact: Fedora Extras Quality Assurance
URL:
Whiteboard:
Duplicates: 467587
Depends On:
Blocks:
 
Reported: 2008-06-29 15:23 UTC by Jason Farrell
Modified: 2008-11-19 14:47 UTC
CC List: 5 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2008-11-19 14:47:40 UTC
Type: ---



Description Jason Farrell 2008-06-29 15:23:54 UTC
Description of problem:
After updating to mdadm-2.6.7-1.fc9.x86_64 in updates-testing (from
mdadm-0:2.6.4-4.fc9.x86_64 in fedora), two of my four md arrays are now
continually assembled degraded after EACH reboot and must be resynced. I note
quite a few changes in the rpm changelog, but am not sure what's to blame (udev?).

Will attempt to reproduce in a VM.


Additional info:
The only two arrays constantly being degraded & rebuilt are my 4-disk RAID1 and
RAID5. Which devices get dropped & re-added varies from boot to boot. No spares:
---- /proc/mdstat ---------------------------------------------------
Personalities : [raid10] [raid6] [raid5] [raid4] [raid1] [raid0]
md3 : active raid5 sda5[4] sdb5[1] sdd5[3] sdc5[2]
      1585976064 blocks level 5, 256k chunk, algorithm 2 [4/3] [_UUU]
      [=======>.............]  recovery = 38.7% (204739964/528658688) finish=61.0min speed=88436K/sec

md0 : active raid1 sdc1[4] sda1[1] sdd1[0]
      2096384 blocks [4/2] [UU__]
        resync=DELAYED

md2 : active raid0 sda3[0] sdd3[3] sdc3[2] sdb3[1]
      209712128 blocks 256k chunks

md1 : active raid10 sda2[0] sdd2[3] sdc2[2] sdb2[1]
      83891200 blocks 256K chunks 2 near-copies [4/4] [UUUU]

---- cat /etc/mdadm.conf ---------------------------------------------------
# mdadm.conf written out by anaconda
DEVICE partitions
MAILADDR root

ARRAY /dev/md0 level=raid1 num-devices=4 UUID=53f9fef2:6c86b573:6e4d8dc5:5f17557a
ARRAY /dev/md3 level=raid5 num-devices=4 UUID=ed922e99:8c1c49bb:06c75cec:bd7b7a53
ARRAY /dev/md1 level=raid10 num-devices=4 UUID=db962e4c:3eea0be4:f2551a68:5227bb7b
ARRAY /dev/md2 level=raid0 num-devices=4 UUID=3646f3df:080b1adc:a9da9d8b:3167acae


---- blkid | grep md | sort ---------------------------------------------------
/dev/md0: UUID="8335b600-3ca7-4086-86b4-0ccc15e4fc10" TYPE="ext3" SEC_TYPE="ext2"
/dev/md1: UUID="LpEJGW-E8Hx-al18-cLH0-dQ0e-6FfE-BpaAcg" TYPE="lvm2pv"
/dev/md2: UUID="XTcMLh-pxvM-DeFc-gBMu-FCHV-cCY2-gXx16t" TYPE="lvm2pv"
/dev/md3: UUID="hQjFjp-cefd-2H63-6lkT-Td4k-ZdbX-17vGii" TYPE="lvm2pv"
/dev/sda1: UUID="f2fef953-73b5-866c-c58d-4d6e7a55175f" TYPE="mdraid"
/dev/sda2: UUID="4c2e96db-e40b-ea3e-681a-55f27bbb2752" TYPE="mdraid"
/dev/sda3: UUID="dff34636-dc1a-0b08-8b9d-daa9aeac6731" TYPE="mdraid"
/dev/sda5: UUID="992e92ed-bb49-1c8c-ec5c-c706537a7bbd" TYPE="mdraid"
/dev/sdb1: UUID="f2fef953-73b5-866c-c58d-4d6e7a55175f" TYPE="mdraid"
/dev/sdb2: UUID="4c2e96db-e40b-ea3e-681a-55f27bbb2752" TYPE="mdraid"
/dev/sdb3: UUID="dff34636-dc1a-0b08-8b9d-daa9aeac6731" TYPE="mdraid"
/dev/sdb5: UUID="992e92ed-bb49-1c8c-ec5c-c706537a7bbd" TYPE="mdraid"
/dev/sdc1: UUID="f2fef953-73b5-866c-c58d-4d6e7a55175f" TYPE="mdraid"
/dev/sdc2: UUID="4c2e96db-e40b-ea3e-681a-55f27bbb2752" TYPE="mdraid"
/dev/sdc3: UUID="dff34636-dc1a-0b08-8b9d-daa9aeac6731" TYPE="mdraid"
/dev/sdc5: UUID="992e92ed-bb49-1c8c-ec5c-c706537a7bbd" TYPE="mdraid"
/dev/sdd1: UUID="f2fef953-73b5-866c-c58d-4d6e7a55175f" TYPE="mdraid"
/dev/sdd2: UUID="4c2e96db-e40b-ea3e-681a-55f27bbb2752" TYPE="mdraid"
/dev/sdd3: UUID="dff34636-dc1a-0b08-8b9d-daa9aeac6731" TYPE="mdraid"
/dev/sdd5: UUID="992e92ed-bb49-1c8c-ec5c-c706537a7bbd" TYPE="mdraid"

Comment 1 Doug Ledford 2008-06-30 21:54:20 UTC
Didn't need to add pjones/katzj

Comment 2 Doug Ledford 2008-06-30 22:17:16 UTC
Yeah, this looks like more --incremental breakage.  The problem is that udev
can't accurately tell mdadm that it has found all the disks it's going to find,
so if you don't pass the --run option and they aren't all there, the array never
gets started.  On the other hand, if you do pass --run and they are all there,
it gets started before they are all added to the array.  And once the array
comes up, it triggers another udev event for the array block dev, which can
cause the array device to be accessed before the final disk devices are added to
the array.  That throws the devices yet to be added into an unclean
state, and when they are finally added, they have to be resynced.  If you had a
bitmap on the arrays, this wouldn't be nearly such an issue, as it would
only resync the few blocks that were written to between the time the array was
started in a degraded state and when the final disks were added.
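
(A write-intent bitmap can be added to an existing array with mdadm --grow; a
minimal sketch, using the reporter's device names from above:)

  # Add an internal write-intent bitmap so that a member re-added after a
  # degraded start only needs a partial resync of the blocks written while it
  # was missing, instead of a full rebuild.
  mdadm --grow --bitmap=internal /dev/md0
  mdadm --grow --bitmap=internal /dev/md3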

I'm guessing, but I think the reason the two arrays you mentioned are always the
ones with the problems is that they are the only two not started by the initrd.
The initrd doesn't use incremental mode for the arrays it starts; it uses
the more normal assemble mode.

Are we sure we really need this "feature", notting?  Or, at a minimum, how about
we change up the theory of operation behind incremental/hotplug support:

1)  At bootup, we know precisely what device(s) we want to start, and we
have control over driver module loading, so it's natural and easy for us to
continue to use mdadm assemble mode with the --run option, but only call mdadm
after all the devices are supposedly found (aka, after emit "mkblkdevs" in the
initrd).  This will run our devices either in degraded or normal mode, but we
won't ever have a device that *should* be in normal mode start in degraded mode
and get stuck there due to a write to the md array beating addition of the final
device elements.

2)  During rc.sysinit, we also know when we are adding device modules, and we
know what more or less permanent raid arrays we expect to find.  It would seem
reasonable to keep the call to mdadm assemble mode post driver module load in
there as well.

3)  In fact, we really only care about hot plugged arrays that *aren't* in the
mdadm.conf file (if they were in mdadm.conf and wouldn't start, previous Fedora
releases would have kicked us out to filesystem maintenance mode).  This raises
the thought that we could change the mdadm rule
(or the mdadm --incremental behavior) to scan mdadm.conf and ignore any device
with a UUID found in an array line on the assumption that such a device will get
started by rc.sysinit.  In addition, it would be best to drop the --run option
from the udev rule since at this point the udev rule will only be used to
incrementally assemble an unknown, hot plugged array.  I think it's fair to say
that, as a rule, we don't want to auto assemble a degraded array, but instead
require the array to be clean and whole before we will auto assemble it, and
that a degraded array with missing members simply requires manual attention to
assemble.
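
(A hypothetical sketch of (3) as a wrapper script that the udev rule would call
instead of invoking mdadm directly. Nothing here is the shipped rule; the script
name is made up, and the UUID parsing assumes mdadm --examine prints a
"UUID : ..." line for the member's superblock:)

  #!/bin/sh
  # Hypothetical /sbin/md-hotplug, called from the udev rule as
  #   RUN+="/sbin/md-hotplug $root/%k"
  dev="$1"

  # Pull the array UUID out of the component's superblock.
  uuid=$(mdadm --examine "$dev" 2>/dev/null | \
         sed -n 's/.*UUID : \([^ ]*\).*/\1/p' | head -n1)

  # Components of arrays already listed in mdadm.conf are left alone; the
  # initrd / rc.sysinit assemble pass is responsible for those.
  if [ -n "$uuid" ] && grep -q "UUID=$uuid" /etc/mdadm.conf 2>/dev/null; then
      exit 0
  fi

  # Unknown, hot plugged arrays get incremental assembly without --run, so a
  # degraded unknown array is never auto-started.
  exec /sbin/mdadm --incremental "$dev"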

Comment 3 Bill Nottingham 2008-07-01 19:45:09 UTC
(In reply to comment #2)
> Yeah, this looks like more --incremental breakage.

How is it --incremental breakage? It appears to be caused by the adding of --run
and --scan.


> The problem is that udev
> can't accurately tell mdadm that it has found all the disks it's going to find,
> so if you don't pass the --run option and they aren't all there, the array never
> gets started.

I'm not seeing the logic here. Why would you *want* to pass --run to start it
before they're all there?

Comment 4 Doug Ledford 2008-07-01 19:53:13 UTC
(In reply to comment #3)
> (In reply to comment #2)
> > Yeah, this looks like more --incremental breakage.
> 
> How is it --incremental breakage? It appears to be caused by the adding of --run
> and --scan.

Actually, the --scan option is irrelevant to this issue.  Just --run is what's
causing this.

> 
> > The problem is that udev
> > can't accurately tell mdadm that it has found all the disks it's going to find,
> > so if you don't pass the --run option and they aren't all there, the array never
> > gets started.
> 
> I'm not seeing the logic here. Why would you *want* to pass --run to start it
> before they're all there?

Because otherwise your raid array is useless and you might as well not have one.
The whole point of any of the redundant arrays is that should you lose 1 (or
more depending on type) drive, the array still works.  Of course, that's of
little comfort if the array doesn't get started at boot because it's degraded
and not all drives are there.  So, in order to make machines that have degraded
arrays still run, you *must* pass the --run option.  Now, --run and
--incremental don't play well because there is no easy way to tell mdadm "wait,
I've got more drives coming online", so it starts the array as soon as there are
enough devices to enter degraded mode.  Again, if it didn't do this, and only
started when all devices were available, then you can kiss using the machine
goodbye in the event you lose a drive, and that just blows the whole reason for
having a redundant array out of the water.  Hence the plan I outlined above.

Comment 5 Jason Farrell 2008-07-02 13:25:07 UTC
FWIW - I *can* reproduce this in a VM.

To reproduce:
1) default F9 install using LVM over 4x RAID1.
2) yum --enablerepo=updates-testing update mdadm
3) note the state of /proc/mdstat immediately after each reboot

Actual results:
About 30% of the time the array will randomly be assembled degraded. (On my
actual hardware, it's 100% of the time (perhaps due to differences in device
detection timing?))

Expected results:
Array should come up clean 100% of the time. Booting up to a degraded array when
all devices are actually present (and not failed) is NOT GOOD.

I've since reverted to mdadm-2.6.4-4.fc9.x86_64 on my hardware - I don't want to
punish my drives with any more resyncs and risk a failure.
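
(For anyone else rolling back before a fix lands, something along these lines
works; the rpm filename below is illustrative:)

  # Reinstall the older build over the newer one; --oldpackage is needed
  # because rpm otherwise refuses to "upgrade" to a lower version.
  rpm -Uvh --oldpackage mdadm-2.6.4-4.fc9.x86_64.rpm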

looking at the old->new diff of /etc/udev/rules.d/70-mdadm.rules I see:
 SUBSYSTEM=="block", ACTION=="add|change", ENV{ID_FS_TYPE}=="linux_raid*", \
-	RUN+="/sbin/mdadm --incremental $root/%k"
+	RUN+="/sbin/mdadm --incremental --run --scan $root/%k"

Comment 6 Bill Nottingham 2008-07-02 15:14:35 UTC
In the past, we didn't automatically kick to a prompt if assembly failed for
something in mdadm.conf - only if it failed completely and it was in fstab
(because it would then get dinged by the filesystem check).

Stepping back for a second, you mentioned:
...
And once the array
comes up, it triggers another udev event for the array block dev, which can
cause the array device to be accessed before the final disk devices are added to
the array
...

According to the udev man page, it says:

       They are started in "read-auto" mode in
       which  they  are  read-only  until the first write request.  This
       means that no metadata updates are made and no attempt at  resync
       or  recovery  happens.

Why isn't this working? What's causing the write event?

Comment 7 Jason Farrell 2008-07-02 15:49:33 UTC
If it helps any, I told rc.local to reboot my VM 100 times, saving /proc/mdstat
each time. Here's the summary (degraded 41% of the time):

[root@f9vm3 ~]# grep 104320 mdstat.out |sort|uniq -c|sort -rn
     59       104320 blocks [4/4] [UUUU]
     36       104320 blocks [4/4] [UUU_]
      5       104320 blocks [4/4] [UU__]

Comment 8 Jason Farrell 2008-07-02 15:50:36 UTC
correction:
[root@f9vm3 ~]# grep 104320 mdstat.out |sort|uniq -c|sort -rn
     59       104320 blocks [4/4] [UUUU]
     36       104320 blocks [4/3] [UUU_]
      5       104320 blocks [4/2] [UU__]
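
(A minimal sketch of the kind of rc.local loop used for this test; the file
names and the boot counter below are assumptions, not the reporter's actual
script:)

  # Append the current md state on every boot, then keep rebooting until 100
  # samples have been collected.
  cat /proc/mdstat >> /root/mdstat.out
  n=$(cat /root/bootcount 2>/dev/null || echo 0)
  if [ "$n" -lt 100 ]; then
      echo $((n + 1)) > /root/bootcount
      reboot
  fi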

Comment 9 Doug Ledford 2008-07-02 16:09:22 UTC
(In reply to comment #6)
> In the past, we didn't automatically kick to a prompt if assembly failed for
> something in mdadm.conf - only if it failed completely and it was in fstab
> (becuase it would then get dinged by the filesystem check.)

True enough.

> Stepping back for a second, you mentioned:
> ...
> And once the array
> comes up, it triggers another udev event for the array block dev, which can
> cause the array device to be accessed before the final disk devices are added to
> the array
> ...
> 
> According to the udev man page, it says:
> 
>        They are started in "read-auto" mode in
>        which  they  are  read-only  until the first write request.  This
>        means that no metadata updates are made and no attempt at  resync
>        or  recovery  happens.
> 
> Why isn't this working? What's causing the write event?

I believe you mean the mdadm man page.

Now, as to why this isn't working: once the array exists, udev, the kernel,
e2fsck, mount, and everything else are free to touch it, and not a single one of
them cares whether the device is in read-auto mode, nor whether the remaining
devices have shown up yet; they just do their thing.  It's a race condition,
plain and simple.  And it's one that's impossible to solve when using
--incremental --run, and if you *don't* use --run with incremental, then you are
wasting your time making a raid array in the first place.

/me really doesn't think that the idea of making raid arrays hot plug was a good
idea...

Comment 10 Bryan Nielsen 2008-10-24 05:37:54 UTC
I've been bitten by this as well. Incorporating udev and --incremental into software raid startup seems to have created some catch-22s. Having a RAID1 array to boot from provides some redundancy against drive failure, but only if the system can boot with the RAID1 degraded.

I'm not sure what the best fix is, but one possible solution could be to add some configuration options somewhere in /etc/sysconfig that can be used by rc.sysinit, after udev and before fsck, to attempt an assemble of any arrays that did not come up under udev.

Say, a config file that lists md devices that should exist before the fsck in rc.sysinit; for each one, rc.sysinit would run mdadm --assemble $mddev to see if it can come up in a degraded condition, thereby allowing the needed fsck to run and the system to boot.

Anyhow, I hope that helps and we can get a solution to this issue soon. In the meantime I can work with my custom rc.sysinit to ensure my bootable level 1 software raid system will continue to boot even if a drive in the mirror array fails.

Comment 11 Bryan Nielsen 2008-10-24 16:09:09 UTC
I have implemented the proposed change on my Fedora 9 box and it seems to solve the problem. If a drive fails in the level 1 raid the system boots from, the box will still come up with the raid array in a degraded condition. From that point I can diagnose the box while it is in use and coordinate a repair.

Here is what I have, in rc.sysinit just before /sbin/start_udev is executed:

# assemble raid array devices specified in mdassemble
# this is only necessary for raid arrays that may not assemble using
# udev's --incremental rule but are needed for system operation even if degraded
if [ -f /etc/sysconfig/mdassemble ]; then
        . /etc/sysconfig/mdassemble
        LOOPVAR=${MDASSEMBLE},

        while echo $LOOPVAR | grep \, &> /dev/null
        do
                LOOPTEMP=${LOOPVAR%%\,*}
                LOOPVAR=${LOOPVAR#*\,}

                echo "Assembling /dev/$LOOPTEMP"
                mdadm --assemble /dev/$LOOPTEMP
        done
fi


And then in /etc/sysconfig/mdassemble I have the following:

# comma delimited list of raid devices to assemble during rc.sysinit before fsck
# MDASSEMBLE=
MDASSEMBLE=md0


I don't have a lot of bash programming experience so I'm not sure if that is the best way to solve this problem but it works. :)

Bryan

Comment 12 Doug Ledford 2008-10-24 16:17:01 UTC
I've asked Bill to build a new initscripts for f9 and for devel that includes this line in rc.sysinit:

[ -f /etc/mdadm.conf -a -x /sbin/mdadm ] && mdadm -As --auto=yes --run

That will cause any degraded arrays listed in mdadm.conf to be started.  Once I have confirmation that this change is in place, I'll remove the --run from the udev rule, and that should eliminate this issue.
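
(For reference, with --run dropped the udev rule would look roughly like this;
whether --scan also stays is not stated here, so treat that as an assumption:)

  SUBSYSTEM=="block", ACTION=="add|change", ENV{ID_FS_TYPE}=="linux_raid*", \
  	RUN+="/sbin/mdadm --incremental --scan $root/%k"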

Comment 13 Bill Nottingham 2008-10-24 16:19:42 UTC
Sorry, it should be there now. Forgot to respond earlier.

Comment 14 Bill Nottingham 2008-10-24 16:20:15 UTC
BTW, rather than removing --run, wouldn't it be better to add '--no-degraded' (or similar)?

Comment 15 Doug Ledford 2008-10-24 16:44:22 UTC
mdadm -A without run is the same as -A --run --no-degraded.  The only effect --run has is to enable starting degraded arrays.  Without --run, assemble will assemble, but not start, a degraded array, and will both assemble and start complete arrays.
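
(An illustrative recap of those equivalences, not verified output:)

  mdadm -As                        # assemble; start only complete arrays
  mdadm -As --run                  # assemble; also start degraded arrays
  mdadm -As --run --no-degraded    # per the above, equivalent to plain "mdadm -As"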

BTW, did you build both f9 and devel initscripts packages?  I assume devel is done, but didn't know if you had built and pushed the initscripts change to bodhi yet (or how you wanted to coordinate doing so to resolve this).

Comment 16 Bill Nottingham 2008-10-24 16:51:59 UTC
It's in rawhide now, and currently in testing for Fedora 9; it will be pushed to stable in the next push.

https://admin.fedoraproject.org/updates/F9/FEDORA-2008-8981

Comment 17 Doug Ledford 2008-10-24 17:05:54 UTC
Thanks, I'll build the new mdadm.

Comment 18 Doug Ledford 2008-10-24 18:09:13 UTC
*** Bug 467587 has been marked as a duplicate of this bug. ***

Comment 19 Fedora Update System 2008-10-30 13:54:53 UTC
mdadm-2.6.7.1-1.fc9 has been submitted as an update for Fedora 9.
http://admin.fedoraproject.org/updates/mdadm-2.6.7.1-1.fc9

Comment 20 Fedora Update System 2008-10-31 10:25:55 UTC
mdadm-2.6.7.1-1.fc9 has been pushed to the Fedora 9 testing repository.  If problems still persist, please make note of it in this bug report.
If you want to test the update, you can install it with
 su -c 'yum --enablerepo=updates-testing update mdadm'
You can provide feedback for this update here: http://admin.fedoraproject.org/updates/F9/FEDORA-2008-9325

Comment 21 Fedora Update System 2008-11-19 14:47:17 UTC
mdadm-2.6.7.1-1.fc9 has been pushed to the Fedora 9 stable repository.  If problems still persist, please make note of it in this bug report.

