Description of problem:

After updating to mdadm-2.6.7-1.fc9.x86_64 in updates-testing (from
mdadm-0:2.6.4-4.fc9.x86_64 in fedora), two of my four md arrays are now
continually assembled degraded after EACH reboot and must be resynced. I note
quite a few changes in the rpm changelog, but am not sure what's to blame
(udev?). Will attempt to reproduce in a VM.

Additional info:

The only two arrays constantly being degraded & rebuilt are my 4x RAID1 and
RAID5. The devices dropped & readded change as well. No spares.

---- /proc/mdstat ---------------------------------------------------

Personalities : [raid10] [raid6] [raid5] [raid4] [raid1] [raid0]
md3 : active raid5 sda5[4] sdb5[1] sdd5[3] sdc5[2]
      1585976064 blocks level 5, 256k chunk, algorithm 2 [4/3] [_UUU]
      [=======>.............]  recovery = 38.7% (204739964/528658688)
      finish=61.0min speed=88436K/sec

md0 : active raid1 sdc1[4] sda1[1] sdd1[0]
      2096384 blocks [4/2] [UU__]
        resync=DELAYED

md2 : active raid0 sda3[0] sdd3[3] sdc3[2] sdb3[1]
      209712128 blocks 256k chunks

md1 : active raid10 sda2[0] sdd2[3] sdc2[2] sdb2[1]
      83891200 blocks 256K chunks 2 near-copies [4/4] [UUUU]

---- cat /etc/mdadm.conf ---------------------------------------------------

# mdadm.conf written out by anaconda
DEVICE partitions
MAILADDR root
ARRAY /dev/md0 level=raid1 num-devices=4 UUID=53f9fef2:6c86b573:6e4d8dc5:5f17557a
ARRAY /dev/md3 level=raid5 num-devices=4 UUID=ed922e99:8c1c49bb:06c75cec:bd7b7a53
ARRAY /dev/md1 level=raid10 num-devices=4 UUID=db962e4c:3eea0be4:f2551a68:5227bb7b
ARRAY /dev/md2 level=raid0 num-devices=4 UUID=3646f3df:080b1adc:a9da9d8b:3167acae

---- blkid | grep md | sort ---------------------------------------------------

/dev/md0: UUID="8335b600-3ca7-4086-86b4-0ccc15e4fc10" TYPE="ext3" SEC_TYPE="ext2"
/dev/md1: UUID="LpEJGW-E8Hx-al18-cLH0-dQ0e-6FfE-BpaAcg" TYPE="lvm2pv"
/dev/md2: UUID="XTcMLh-pxvM-DeFc-gBMu-FCHV-cCY2-gXx16t" TYPE="lvm2pv"
/dev/md3: UUID="hQjFjp-cefd-2H63-6lkT-Td4k-ZdbX-17vGii" TYPE="lvm2pv"
/dev/sda1: UUID="f2fef953-73b5-866c-c58d-4d6e7a55175f" TYPE="mdraid"
/dev/sda2: UUID="4c2e96db-e40b-ea3e-681a-55f27bbb2752" TYPE="mdraid"
/dev/sda3: UUID="dff34636-dc1a-0b08-8b9d-daa9aeac6731" TYPE="mdraid"
/dev/sda5: UUID="992e92ed-bb49-1c8c-ec5c-c706537a7bbd" TYPE="mdraid"
/dev/sdb1: UUID="f2fef953-73b5-866c-c58d-4d6e7a55175f" TYPE="mdraid"
/dev/sdb2: UUID="4c2e96db-e40b-ea3e-681a-55f27bbb2752" TYPE="mdraid"
/dev/sdb3: UUID="dff34636-dc1a-0b08-8b9d-daa9aeac6731" TYPE="mdraid"
/dev/sdb5: UUID="992e92ed-bb49-1c8c-ec5c-c706537a7bbd" TYPE="mdraid"
/dev/sdc1: UUID="f2fef953-73b5-866c-c58d-4d6e7a55175f" TYPE="mdraid"
/dev/sdc2: UUID="4c2e96db-e40b-ea3e-681a-55f27bbb2752" TYPE="mdraid"
/dev/sdc3: UUID="dff34636-dc1a-0b08-8b9d-daa9aeac6731" TYPE="mdraid"
/dev/sdc5: UUID="992e92ed-bb49-1c8c-ec5c-c706537a7bbd" TYPE="mdraid"
/dev/sdd1: UUID="f2fef953-73b5-866c-c58d-4d6e7a55175f" TYPE="mdraid"
/dev/sdd2: UUID="4c2e96db-e40b-ea3e-681a-55f27bbb2752" TYPE="mdraid"
/dev/sdd3: UUID="dff34636-dc1a-0b08-8b9d-daa9aeac6731" TYPE="mdraid"
/dev/sdd5: UUID="992e92ed-bb49-1c8c-ec5c-c706537a7bbd" TYPE="mdraid"
Didn't need to add pjones/katzj
Yeah, this looks like more --incremental breakage. The problem is that udev
can't accurately tell mdadm that it has found all the disks it's going to
find, so if you don't pass the --run option and they aren't all there, the
array never gets started. On the other hand, if you do pass --run and they
are all there, the array can get started before they are all added to it. And
once the array comes up, it triggers another udev event for the array block
dev, which can cause the array device to be accessed before the final disk
devices are added to the array. That throws the devices yet to be added into
an unclean state, and when they are finally added, they have to be resynced.
If you had a bitmap on the arrays, this wouldn't be nearly such an issue, as
it would only resync the few blocks written between the time the array was
started in a degraded state and when the final disks were added.

I'm guessing, but I think the reason the two arrays you mentioned are always
the ones with the problems is that they are the only two not started by the
initrd. The initrd doesn't use incremental mode to start its arrays; it uses
the more normal assemble mode.

Are we sure we really need this "feature", notting? Or at a minimum, how
about we change the theory of operation behind incremental/hotplug support:

1) At bootup, we know precisely what device(s) we want to start and we have
control over driver module loading, so it's natural and easy for us to
continue using mdadm assemble mode with the --run option, but only call mdadm
after all the devices are supposedly found (aka, after emit "mkblkdevs" in
the initrd). This will run our devices either in degraded or normal mode, but
we won't ever have a device that *should* be in normal mode start degraded
and get stuck there because a write to the md array beat the addition of the
final device elements.

2) During rc.sysinit, we also know when we are adding device modules, and we
know what more or less permanent raid arrays we expect to find. It would seem
reasonable to keep the call to mdadm assemble mode after driver module load
in there as well.

3) In fact, we really only care about hot-plugged arrays that *aren't* in the
mdadm.conf file (if they were in the mdadm.conf file and wouldn't start, they
would have kicked us out to filesystem maintenance mode in previous Fedora
releases). This raises the thought that we could change the mdadm rule (or
the mdadm --incremental behavior) to scan mdadm.conf and ignore any device
with a UUID found in an ARRAY line, on the assumption that such a device will
get started by rc.sysinit (a sketch follows this comment). In addition, we
would do best to drop the --run option from the udev rule, since at that
point the udev rule will only be used to incrementally assemble an unknown,
hot-plugged array. I think it's fair to say that, as a rule, we don't want to
auto-assemble a degraded array, but instead require the array to be clean and
whole before we will auto-assemble it; a degraded array with missing members
simply requires manual attention to assemble.
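[Editor's note: a minimal sketch of what point 3 could look like, as a helper
script the udev rule would invoke instead of calling mdadm directly. This is
illustrative only, not the actual proposed patch; in particular, the use of
"mdadm --examine --export" to recover the member's array UUID is an assumption
about this mdadm version.]

    #!/bin/sh
    # Hypothetical udev helper: skip incremental assembly of any member whose
    # array UUID already appears in mdadm.conf, on the assumption that
    # rc.sysinit will assemble those arrays itself.
    dev="$1"

    # --examine --export is assumed to print MD_UUID=<uuid> for this version
    uuid=$(mdadm --examine --export "$dev" 2>/dev/null | sed -n 's/^MD_UUID=//p')

    if [ -n "$uuid" ] && grep -qi "UUID=$uuid" /etc/mdadm.conf; then
        exit 0    # known, permanent array: leave it for assemble mode at boot
    fi

    # unknown hot-plugged member: incremental assembly, with no --run, so the
    # array is only started once it is complete
    exec mdadm --incremental "$dev"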
(In reply to comment #2)
> Yeah, this looks like more --incremental breakage.

How is it --incremental breakage? It appears to be caused by the addition of
--run and --scan.

> The problem is that udev can't accurately tell mdadm that it has found all
> the disks it's going to find, so if you don't pass the --run option and
> they aren't all there, the array never gets started.

I'm not seeing the logic here. Why would you *want* to pass --run to start it
before they're all there?
(In reply to comment #3)
> (In reply to comment #2)
> > Yeah, this looks like more --incremental breakage.
>
> How is it --incremental breakage? It appears to be caused by the addition
> of --run and --scan.

Actually, the --scan option is irrelevant to this issue. Just --run is what's
causing this.

> > The problem is that udev can't accurately tell mdadm that it has found
> > all the disks it's going to find, so if you don't pass the --run option
> > and they aren't all there, the array never gets started.
>
> I'm not seeing the logic here. Why would you *want* to pass --run to start
> it before they're all there?

Because otherwise your raid array is useless and you might as well not have
one. The whole point of any of the redundant arrays is that should you lose
one (or more, depending on type) drive, the array still works. Of course,
that's of little comfort if the array doesn't get started at boot because
it's degraded and not all drives are there. So, in order to make machines
that have degraded arrays still run, you *must* pass the --run option.

Now, --run and --incremental don't play well together because there is no
easy way to tell mdadm "wait, I've got more drives coming online", so it
starts the array as soon as there are enough devices to enter degraded mode.
Again, if it didn't do this, and only started when all devices were
available, then you can kiss using the machine goodbye in the event you lose
a drive, and that just blows the whole reason for having a redundant array
out of the water. Hence the plan I outlined above.
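[Editor's note: comment #2 mentions that a write-intent bitmap would shrink
these resyncs to the handful of blocks written while the array was degraded.
A minimal sketch of adding one to the two affected arrays from this report
(device names taken from the report; whether an internal bitmap is supported
depends on the superblock format and mdadm version):]

    # add an internal write-intent bitmap to the raid1 and raid5 arrays
    mdadm --grow --bitmap=internal /dev/md0
    mdadm --grow --bitmap=internal /dev/md3

    # verify: a "bitmap:" line should now appear for each array
    cat /proc/mdstat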
FWIW - I *can* reproduce this in a VM.

To reproduce:
1) default F9 install using LVM over 4x RAID1
2) yum --enablerepo=updates-testing update mdadm
3) note the state of /proc/mdstat immediately after each reboot

Actual results:
About 30% of the time the array will randomly be assembled degraded. (On my
actual hardware, it's 100% of the time - perhaps due to differences in device
detection timing?)

Expected results:
Array should come up clean 100% of the time. Booting up to a degraded array
when all devices are actually present (and not failed) is NOT GOOD.

I've since reverted to mdadm-2.6.4-4.fc9.x86_64 on my hardware - I don't want
to punish my drives with any more resyncs and risk a failure.

Looking at the old->new diff of /etc/udev/rules.d/70-mdadm.rules I see:

 SUBSYSTEM=="block", ACTION=="add|change", ENV{ID_FS_TYPE}=="linux_raid*", \
-        RUN+="/sbin/mdadm --incremental $root/%k"
+        RUN+="/sbin/mdadm --incremental --run --scan $root/%k"
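[Editor's note: for anyone hitting this before a fix lands, a possible local
workaround besides downgrading is to restore the pre-update rule in
/etc/udev/rules.d/70-mdadm.rules - this is exactly the old line from the diff
above, though a later mdadm update may overwrite the file:]

    SUBSYSTEM=="block", ACTION=="add|change", ENV{ID_FS_TYPE}=="linux_raid*", \
            RUN+="/sbin/mdadm --incremental $root/%k"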
In the past, we didn't automatically kick to a prompt if assembly failed for
something in mdadm.conf - only if it failed completely and it was in fstab
(because it would then get dinged by the filesystem check.)

Stepping back for a second, you mentioned:

  ... And once the array comes up, it triggers another udev event for the
  array block dev, which can cause the array device to be accessed before
  the final disk devices are added to the array ...

According to the udev man page, it says:

  They are started in "read-auto" mode in which they are read-only until the
  first write request. This means that no metadata updates are made and no
  attempt at resync or recovery happens.

Why isn't this working? What's causing the write event?
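[Editor's note: for context, an array started in read-auto mode is visible in
/proc/mdstat; it looks roughly like this (illustrative output, not captured
from this system):]

    md0 : active (auto-read-only) raid1 sdb1[1] sda1[0]
          2096384 blocks [2/2] [UU]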
If it helps any, I told rc.local to reboot my VM 100 times, and save
/proc/mdstat. Here's the summary (degraded 41% of the time):

[root@f9vm3 ~]# grep 104320 mdstat.out |sort|uniq -c|sort -rn
     59       104320 blocks [4/4] [UUUU]
     36       104320 blocks [4/4] [UUU_]
      5       104320 blocks [4/4] [UU__]
Correction:

[root@f9vm3 ~]# grep 104320 mdstat.out |sort|uniq -c|sort -rn
     59       104320 blocks [4/4] [UUUU]
     36       104320 blocks [4/3] [UUU_]
      5       104320 blocks [4/2] [UU__]
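[Editor's note: a hedged reconstruction of the test loop behind these numbers,
in case it is useful for reproducing the bug. The file name and boot counting
are assumptions, not the reporter's actual rc.local:]

    # appended to /etc/rc.d/rc.local in the VM (hypothetical reconstruction)
    cat /proc/mdstat >> /root/mdstat.out
    # one md0 stanza is appended per boot, so counting them counts boots
    boots=$(grep -c '^md0' /root/mdstat.out)
    [ "$boots" -lt 100 ] && reboot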
(In reply to comment #6)
> In the past, we didn't automatically kick to a prompt if assembly failed
> for something in mdadm.conf - only if it failed completely and it was in
> fstab (because it would then get dinged by the filesystem check.)

True enough.

> Stepping back for a second, you mentioned:
>
>   ... And once the array comes up, it triggers another udev event for the
>   array block dev, which can cause the array device to be accessed before
>   the final disk devices are added to the array ...
>
> According to the udev man page, it says:
>
>   They are started in "read-auto" mode in which they are read-only until
>   the first write request. This means that no metadata updates are made
>   and no attempt at resync or recovery happens.
>
> Why isn't this working? What's causing the write event?

I believe you mean the mdadm man page. Now, as to why this isn't working:
once the array exists, udev, the kernel, e2fsck, mount - all these things are
free to touch it, and not a single one of them cares whether the device is in
read-auto mode, nor do they care whether the remaining devices show up before
they start to touch it; they just do their thing. It's a race condition,
plain and simple. And it's one that's impossible to solve when using
--incremental --run, and if you *don't* use --run with --incremental, then
you are wasting your time making a raid array in the first place.

/me really doesn't think that the idea of making raid arrays hot-pluggable
was a good idea...
I've been bit by this as well. Incorporating udev and --incremental into
software raid startup seems to have created some catch-22s. Having a RAID1
array to boot from provides some redundancy against drive failure, but only
if the system can boot with the RAID1 degraded.

I'm not sure what the best fix is, but one possible solution could be to add
some configuration options somewhere in /etc/sysconfig that can be used by
rc.sysinit, after udev and before fsck, to attempt an assemble of any arrays
that did not come up under udev. Say, a config file that lists md devices
that should exist before the fsck in rc.sysinit; for each one, perform an
"mdadm --assemble $mddev" to see if it can come up in a degraded condition,
thereby completing the needed fsck and booting the system.

Anyhow, I hope that helps and we can get a solution to this issue soon. In
the meantime I can work with my custom rc.sysinit to ensure my bootable
level 1 software raid system will continue to boot even if a drive in the
mirror array fails.
I have implemented the proposed change on my Fedora 9 box and it seems to
solve the problem. I can have a drive failure on my level 1 raid from which
the system boots, and the box will come up with the raid array in a degraded
condition. From that point I can diagnose the box while it is in use and
coordinate a repair.

Here is what I have, in rc.sysinit just before /sbin/start_udev is executed:

# assemble raid array devices specified in mdassemble
# this is only necessary for raid arrays that may not assemble using
# udev's --incremental rule but are needed for system operation even if degraded
if [ -f /etc/sysconfig/mdassemble ]; then
        . /etc/sysconfig/mdassemble
        # append a trailing comma so the loop below consumes every entry
        LOOPVAR=${MDASSEMBLE},
        while echo $LOOPVAR | grep \, &> /dev/null
        do
                # peel off the first comma-delimited entry, keep the rest
                LOOPTEMP=${LOOPVAR%%\,*}
                LOOPVAR=${LOOPVAR#*\,}
                echo "Assembling /dev/$LOOPTEMP"
                mdadm --assemble /dev/$LOOPTEMP
        done
fi

And then in /etc/sysconfig/mdassemble I have the following:

# comma delimited list of raid devices to assemble during rc.sysinit before fsck
# MDASSEMBLE=
MDASSEMBLE=md0

I don't have a lot of bash programming experience, so I'm not sure if that is
the best way to solve this problem, but it works. :)

Bryan
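[Editor's note: a slightly more idiomatic version of the same loop, for
comparison - same /etc/sysconfig/mdassemble format assumed, splitting on
commas with IFS instead of the grep/strip loop:]

    if [ -f /etc/sysconfig/mdassemble ]; then
            . /etc/sysconfig/mdassemble
            # split the comma-delimited list via shell word splitting
            oldIFS=$IFS; IFS=','
            for md in $MDASSEMBLE; do
                    echo "Assembling /dev/$md"
                    mdadm --assemble "/dev/$md"
            done
            IFS=$oldIFS
    fi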
I've asked Bill to build a new initscripts for f9 and for devel that includes
this line in rc.sysinit:

[ -f /etc/mdadm.conf -a -x /sbin/mdadm ] && mdadm -As --auto=yes --run

That will cause any degraded arrays listed in mdadm.conf to be started. Once
I have confirmation that this change is in place, I'll remove the --run from
the udev rule, and that should eliminate this issue.
Sorry, it should be there now. Forgot to respond earlier.
BTW, rather than removing --run, wouldn't it be better to add '--no-degraded' (or similar)?
mdadm -A without --run is the same as -A --run --no-degraded. The only effect
--run has is to enable starting degraded arrays. Without --run, assemble will
assemble, but not start, a degraded array, and will both assemble and start
complete arrays.

BTW, did you build both f9 and devel initscripts packages? I assume devel is
done, but didn't know if you had built and pushed the initscripts change to
bodhi yet (or how you wanted to coordinate doing so to resolve this).
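[Editor's note: a quick illustration of the equivalence described above; the
array name is assumed:]

    mdadm -A /dev/md0                       # assemble; start only when complete
    mdadm -A --run /dev/md0                 # assemble and start, even degraded
    mdadm -A --run --no-degraded /dev/md0   # equivalent to the first form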
It's in rawhide now, and currently in testing for Fedora 9; it will be pushed to stable in the next push. https://admin.fedoraproject.org/updates/F9/FEDORA-2008-8981
Thanks, I'll build the new mdadm.
*** Bug 467587 has been marked as a duplicate of this bug. ***
mdadm-2.6.7.1-1.fc9 has been submitted as an update for Fedora 9. http://admin.fedoraproject.org/updates/mdadm-2.6.7.1-1.fc9
mdadm-2.6.7.1-1.fc9 has been pushed to the Fedora 9 testing repository. If problems still persist, please make note of it in this bug report. If you want to test the update, you can install it with su -c 'yum --enablerepo=updates-testing update mdadm'. You can provide feedback for this update here: http://admin.fedoraproject.org/updates/F9/FEDORA-2008-9325
mdadm-2.6.7.1-1.fc9 has been pushed to the Fedora 9 stable repository. If problems still persist, please make note of it in this bug report.