+++ This bug was initially created as a clone of Bug #488038 +++

Description of problem:
System fails to boot because it couldn't assemble RAID0 array with root / filesystem. Looks like kernel with older mdadm in initrd successfully assembles all arrays (so I have root accessible at least in ro mode) but then mdadm spawned from initscripts breaks everything.

Version-Release number of selected component (if applicable):
mdadm-3.0-0.devel2.2.fc11
older version mdadm-3.0-0.devel2.1.fc11 is fine

How reproducible:
always after boot

Steps to Reproduce:
1. Boot the system (root=/dev/md1)
2.
3.

Actual results:
....kernel messages:
mdadm: /dev/md1 has been started with 2 drives.
....initscripts messages:
Setting hostname localhost.localdomain: [ OK ]
mdadm: /dev/md0 is already in use
mdadm: /dev/md3 is already in use
mdadm: failed to RUN_ARRAY /dev/md/3_0: Cannot allocate memory
mdadm: Not enough devices to start the array.
mdadm: /dev/md/0_0 has been started with 1 drive (out of 2)
...
boot then fails on fsck unable to access the filesystem.

Additional info:
/dev/mdadm.conf:
ARRAY /dev/md0 level=raid1 num-devices=2 metadata=0.90 UUID=0c2a49fd:ba124f6d:e0634eb5:9e0e1855
ARRAY /dev/md1 level=raid0 num-devices=2 metadata=0.90 UUID=2d64fe1d:e87e3bfe:18f25720:de7af605
ARRAY /dev/md3 level=raid0 num-devices=2 metadata=0.90 UUID=f8b1e8e6:7a83767a:00b7a568:dbd45bb9

dmesg:
md: raid6 personality registered for level 6
md: raid5 personality registered for level 5
md: raid4 personality registered for level 4
md: md1 stopped.
md: bind<sdb2>
md: bind<sda2>
md1: setting max_sectors to 128, segment boundary to 32767
raid0: looking at sda2
raid0: comparing sda2(7678976) with sda2(7678976)
raid0: END
raid0: ==> UNIQUE
raid0: 1 zones
raid0: looking at sdb2
raid0: comparing sdb2(7678976) with sda2(7678976)
raid0: EQUAL
raid0: FINAL 1 zones
raid0: done.
raid0 : md_size is 15357952 blocks.
raid0 : conf->hash_spacing is 15357952 blocks.
raid0 : nb_zone is 1.
raid0 : Allocating 8 bytes for hash.
md1: unknown partition table

--- Additional comment from dledford on 2009-03-18 13:09:30 EDT ---

This bug has been identified and needs a change to initscripts in order to be solved properly. Specifically, in rc.sysinit, there is this line:

    # Start any MD RAID arrays that haven't been started yet
    [ -f /etc/mdadm.conf -a -x /sbin/mdadm ] && /sbin/mdadm -As --auto=yes --run

This needs to be changed as follows:

    # Wait for local RAID arrays to finish incremental assembly before continuing
    udevsettle

It turns out that the original line races with udev's attempts to perform incremental assembly on the array. In the end, udev ends up grabbing some devices and sticking them in a partially assembled array, and the call to mdadm grabs some other devices and sticks them in a *different* array, and neither array gets started properly. With this change, the udev incremental assembly rules work as expected. Changing to initscripts package.

--- Additional comment from dledford on 2009-03-18 13:10:43 EDT ---

*** Bug 487965 has been marked as a duplicate of this bug. ***

--- Additional comment from jwilson on 2009-03-18 13:37:14 EDT ---

Hrm. So things are mildly better w/the change prescribed in comment #1 on one of my affected systems. Instead of getting at least two different arrays created for what is supposed to be my /boot volume, I get only /dev/md0, but it contains only a single member.

--- Additional comment from jwilson on 2009-03-18 13:49:31 EDT ---

Also, this change results in the following spew:

the program '/bin/bash' called 'udevsettle', it should use udevadm settle <options>', this will stop working in a future release
udevadm[2036]: the program '/bin/bash' called 'udevsettle', it should use 'udevadm settle <options>', this will stop working in a future release

Even after changing over to 'udevadm settle --timeout=30' and adding a 'sleep 5' after that, I'm still only getting a single drive added to /dev/md0 every time.
--- Additional comment from dledford on 2009-03-18 13:54:04 EDT ---

What version of mdadm are you using? I tested this with mdadm-3.0-0.devel3.1.fc11, which is not yet in rawhide, only locally built, and with that version it worked fine. As for udevsettle versus udevadm settle, that is because I'm testing this on an F9 machine with older udev, so it would need to be changed for the later versions of udev in rawhide. However, neither a timeout nor any sleeps are necessary for me with the current mdadm (which also includes an updated mdadm rules file that could certainly play a role in what you are seeing). My impression is that to fully solve the problem, you really need both updates, but a bug can only be against one component at a time. I'll clone for the mdadm half of the issue.
mdadm-3.0-0.devel3.1.fc11 has been built to address this issue. Note it still needs the initscript update to be expected to work.
So with the initscript update hand-made and mdadm updated to 3.0-0.devel3.1.fc11, I'm still only getting one of four disks added to md0 (my /boot array).

# cat /proc/mdstat
Personalities : [raid6] [raid5] [raid4]
md0 : inactive sdc1[2](S)
      200704 blocks
md1 : active raid6 sda3[0] sdd3[3] sdc3[2] sdb3[1]
      307981824 blocks level 6, 256k chunk, algorithm 2 [4/4] [UUUU]
unused devices: <none>

# cat /etc/mdadm.conf
# mdadm.conf written out by anaconda
DEVICE partitions
MAILADDR root
ARRAY /dev/md1 level=raid6 num-devices=4 metadata=0.90 UUID=368714fb:5469cef4:f60fd542:027945d8
ARRAY /dev/md0 level=raid1 num-devices=4 metadata=0.90 UUID=04aacae3:99941fc2:d486ae8d:01f4d665

# mdadm -Eb /dev/sdc1:
ARRAY /dev/md0 level=raid1 num-devices=4 UUID=04aacae3:99941fc2:d486ae8d:01f4d665

# mdadm -Eb /dev/sda1
ARRAY /dev/md0 level=raid1 num-devices=4 UUID=04aacae3:99941fc2:d486ae8d:01f4d665
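For anyone wanting to check for this state without eyeballing /proc/mdstat, here is a rough Python sketch. It is not part of mdadm or initscripts; the helper names (parse_mdstat, incomplete_arrays) are made up for illustration, and the expected member counts are assumed to be taken by hand from the num-devices values in mdadm.conf. The sample text is the mdstat output from this comment.

```python
import re

# /proc/mdstat contents as reported in this comment.
SAMPLE_MDSTAT = """\
Personalities : [raid6] [raid5] [raid4]
md0 : inactive sdc1[2](S)
      200704 blocks
md1 : active raid6 sda3[0] sdd3[3] sdc3[2] sdb3[1]
      307981824 blocks level 6, 256k chunk, algorithm 2 [4/4] [UUUU]
unused devices: <none>
"""

# Matches array header lines such as "md0 : inactive sdc1[2](S)".
ARRAY_LINE = re.compile(r"^(md\S+) : (\S+)\s*(.*)$")
# Matches one member entry such as "sdc1[2](S)" and captures the device name.
MEMBER = re.compile(r"(\S+?)\[\d+\](?:\([A-Z]\))?")


def parse_mdstat(text):
    """Return {array: {"state": str, "members": [device, ...]}} (hypothetical helper)."""
    arrays = {}
    for line in text.splitlines():
        match = ARRAY_LINE.match(line)
        if match is None:
            continue  # skip Personalities, block-count and "unused devices" lines
        name, state, rest = match.groups()
        arrays[name] = {"state": state, "members": MEMBER.findall(rest)}
    return arrays


def incomplete_arrays(text, expected):
    """List arrays that are missing, not active, or short of members.

    `expected` maps array name to the member count you expect,
    e.g. the num-devices value from mdadm.conf (an assumption of this sketch).
    """
    arrays = parse_mdstat(text)
    problems = []
    for name, want in sorted(expected.items()):
        info = arrays.get(name)
        if info is None or info["state"] != "active" or len(info["members"]) < want:
            problems.append(name)
    return problems


if __name__ == "__main__":
    print(incomplete_arrays(SAMPLE_MDSTAT, {"md0": 4, "md1": 4}))
```

On a live system the text would come from reading /proc/mdstat instead of SAMPLE_MDSTAT. The sketch only looks at the array header lines, so it does not interpret the [4/4] [UUUU] status field.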
# ll /*/udev/rules.d/
/etc/udev/rules.d/:
total 80
-rw-r--r--. 1 root root 397 2009-03-06 08:09 40-multipath.rules
-rw-r--r--. 1 root root 19994 2009-02-25 11:32 60-libmtp.rules
-rw-r--r--. 1 root root 1060 2009-03-06 03:33 60-pcmcia.rules
-rw-r--r--. 1 root root 6824 2009-03-09 00:13 60-wacom.rules
-rw-r--r--. 1 root root 595 2009-03-06 17:13 70-persistent-cd.rules
-rw-r--r--. 1 root root 845 2009-03-06 17:11 70-persistent-net.rules
-rw-r--r--. 1 root root 1914 2009-03-03 17:55 85-pcscd_ccid.rules
-rw-r--r--. 1 root root 244 2009-02-25 01:54 85-pcscd_egate.rules
-rw-r--r--. 1 root root 320 2008-09-18 03:54 90-alsa.rules
-rw-r--r--. 1 root root 83 2009-03-05 20:26 90-hal.rules
-rw-r--r--. 1 root root 53 2009-02-24 11:41 91-drm-modeset.rules
-rw-r--r--. 1 root root 4216 2009-03-10 14:56 95-devkit-disks.rules
-rw-r--r--. 1 root root 2283 2009-03-09 15:37 97-bluetooth-serial.rules
-rw-r--r--. 1 root root 85 2009-03-02 16:42 98-devkit.rules

/lib/udev/rules.d/:
total 112
-rw-r--r--. 1 root root 421 2009-03-09 22:05 10-console.rules
-rw-r--r--. 1 root root 348 2009-03-03 08:17 40-alsa.rules
-rw-r--r--. 1 root root 1431 2009-03-03 08:17 40-redhat.rules
-rw-r--r--. 1 root root 172 2009-03-03 08:17 50-firmware.rules
-rw-r--r--. 1 root root 4562 2009-03-03 08:17 50-udev-default.rules
-rw-r--r--. 1 root root 141 2009-03-03 08:17 60-cdrom_id.rules
-rw-r--r--. 1 root root 283 2009-03-09 22:05 60-net.rules
-rw-r--r--. 1 root root 1538 2009-03-03 08:17 60-persistent-input.rules
-rw-r--r--. 1 root root 718 2009-03-03 08:17 60-persistent-serial.rules
-rw-r--r--. 1 root root 4441 2009-03-03 08:17 60-persistent-storage.rules
-rw-r--r--. 1 root root 1514 2009-03-03 08:17 60-persistent-storage-tape.rules
-rw-r--r--. 1 root root 711 2009-03-03 08:17 60-persistent-v4l.rules
-rw-r--r--. 1 root root 3914 2009-03-02 10:33 61-option-modem-modeswitch.rules
-rw-r--r--. 1 root root 525 2009-03-03 08:17 61-persistent-storage-edd.rules
-rw-r--r--. 1 root root 107 2009-03-03 08:17 64-device-mapper.rules
-rw-r--r--. 1 root root 1701 2009-03-18 14:29 64-md-raid.rules
-rw-r--r--. 1 root root 1218 2009-03-02 10:33 70-acl.rules
-rw-r--r--. 1 root root 390 2009-03-03 08:17 75-cd-aliases-generator.rules
-rw-r--r--. 1 root root 2403 2009-03-03 08:17 75-persistent-net-generator.rules
-rw-r--r--. 1 root root 336 2009-03-09 23:40 77-nm-probe-modem-capabilities.rules
-rw-r--r--. 1 root root 2283 2009-03-02 10:33 78-sound-card.rules
-rw-r--r--. 1 root root 137 2009-03-03 08:17 79-fstab_import.rules
-rw-r--r--. 1 root root 779 2009-03-03 08:17 80-drivers.rules
-rw-r--r--. 1 root root 221 2009-02-24 04:44 85-regulatory.rules
-rw-r--r--. 1 root root 175 2009-03-09 22:05 88-clock.rules
-rw-r--r--. 1 root root 234 2009-03-03 08:17 95-udev-late.rules
A new version of mdadm that solves the file conflict is in rawhide.
Got the even newer mdadm and the original 64-md-raid.rules file back in place, and restarted... Now the machine is hanging at 'Starting udev: _' for a good couple of minutes before finally continuing boot. I was hoping all that delay might have meant the array got built correctly, but alas, /dev/md0 is still getting created with only a single drive in it.
...the heck? Second try, similar issue, but I noticed a ton of spew printed to the console after the lengthy hang starting udev:

/sys/devices/virtual/block/md0 (2417)
/sys/devices/virtual/block/md1 (2415)
/sys/devices/virtual/block/md0 (2413)
/sys/devices/virtual/block/md1 (2411)
/sys/devices/virtual/block/md0 (2409)
/sys/devices/virtual/block/md0 (2407)
/sys/devices/virtual/block/md1 (2405)
/sys/devices/virtual/block/md0 (2403)
/sys/devices/virtual/block/md0 (2401)
Made some progress on this. If I change rc.sysinit to have a udevadm trigger call before the udevadm settle call, all works. However, the call to start_udev a few lines above *also* has a udevadm trigger call in it, and that *should* be sufficient. So there is a question as to why calling udevadm trigger *again* only a few lines later works. I'm going to clone this bug against udev in rawhide to see if we can address that part of the issue.
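To make the ordering in the workaround explicit (trigger first, then settle), here is a small Python sketch. It is only an illustration, not what rc.sysinit actually does (rc.sysinit is a shell script); the function names are invented, and the 30 second default simply reuses the 'udevadm settle --timeout=30' value tried earlier in this bug.

```python
import subprocess


def settle_commands(timeout=30, retrigger=True):
    """Build the udevadm calls, in order, as argv lists (hypothetical helper)."""
    commands = []
    if retrigger:
        # Replay device events so the incremental-assembly rules run again.
        commands.append(["udevadm", "trigger"])
    # Block until the udev event queue is empty or the timeout expires.
    commands.append(["udevadm", "settle", f"--timeout={timeout}"])
    return commands


def wait_for_udev(timeout=30, retrigger=True, runner=subprocess.call):
    """Run each command in order and return the list of exit codes.

    `runner` is injectable so the sequence can be exercised without udev.
    """
    return [runner(cmd) for cmd in settle_commands(timeout, retrigger)]


if __name__ == "__main__":
    # Dry run: print the commands instead of executing them.
    for cmd in settle_commands():
        print(" ".join(cmd))
```

Passing retrigger=False gives the settle-only variant from the earlier comments, which makes it easy to compare the two behaviours on an affected machine.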
Problem has been isolated. There are two possible fixes; I'll implement the one that doesn't require init script changes for now.
mdadm-3.0-0.devel3.5.fc11 has been built and solves the problem according to my testing.
mdadm-3.0-0.devel3.5.fc11 was built a week ago. Was the delay in testing? The wording of the comment seems odd referencing a build from a week ago, so I wanted to double check before trying it out.
No, the delay is that I was out of town for a week and I failed to update the bug before I left. The testing was done prior to the build.
I tried it out and it worked better than when I had to remove the rules file (though it's just gone now). I did see a message about an array already being in use after I had a couple of array elements failed out (due to a kernel problem). But it didn't try to start another array like it had in the past, so things worked. Once I put stuff back into the arrays and rebooted, things looked normal during the boot process.
This bug appears to have been reported against 'rawhide' during the Fedora 11 development cycle. Changing version to '11'. More information and reason for this action is here: http://fedoraproject.org/wiki/BugZappers/HouseKeeping