Bug 437231

Summary: Software raid broken in recent kernels
Product: [Fedora] Fedora
Reporter: Dennis Jacobfeuerborn <dennisml>
Component: mkinitrd
Assignee: Peter Jones <pjones>
Status: CLOSED CURRENTRELEASE
QA Contact: Fedora Extras Quality Assurance <extras-qa>
Severity: high
Priority: low
Version: 9
CC: amlau, bruno, cweyl, dcantrell, gnomeuser, oliva, tomek, wtogami
Hardware: All   
OS: Linux   
Doc Type: Bug Fix
Last Closed: 2009-04-29 02:26:25 UTC
Attachments:
  dmesg output of kernel-2.6.25-0.87.rc3.git4.fc9 (flags: none)
  Diff of initrd contents from 6.0.33 to 6.0.37 (flags: none)

Description Dennis Jacobfeuerborn 2008-03-13 02:02:02 UTC
I can't get recent kernels to boot because they don't seem to find necessary
devices ("no devices found for /dev/mdX"). The status is as follows:

works -> kernel-2.6.25-0.87.rc3.git4.fc9
fails -> kernel-2.6.25-0.101.rc4.git3.fc9
fails -> kernel-2.6.25-0.105.rc5.fc9
fails -> kernel-2.6.25-0.113.rc5.git2.fc9

(mkinitrd was at version 6.0.35-1.fc9 when installing the 0.113 version of the
kernel)

these are simple unencrypted raid-1 devices:

[root@nexus t]# uname -a
Linux nexus 2.6.25-0.78.rc3.git1.fc9 #1 SMP Fri Feb 29 02:19:15 EST 2008 i686
athlon i386 GNU/Linux

[root@nexus t]# rpm -q mkinitrd
mkinitrd-6.0.34-1.fc9.i386

[root@nexus t]# cat /proc/mdstat
Personalities : [raid1] [raid6] [raid5] [raid4]
md0 : active raid1 sda2[0] sdb2[1]
      104320 blocks [2/2] [UU]

md1 : active raid1 sda3[0] sdb3[1]
      1052160 blocks [2/2] [UU]

md2 : active raid1 sda5[0] sdb5[1]
      307628544 blocks [2/2] [UU]

unused devices: <none>

Relevant entries from fstab:
/dev/md2            /                       ext3    defaults,noatime 1 1
/dev/md0            /boot                   ext3    defaults,noatime 1 2
/dev/md1            swap                    swap    defaults        0 0

Relevant entries from grub.conf:
title Fedora (2.6.25-0.105.rc5.fc9)
    root (hd0,1)
    kernel /vmlinuz-2.6.25-0.105.rc5.fc9 ro root=/dev/md2
    initrd /initrd-2.6.25-0.105.rc5.fc9.img
title Fedora (2.6.25-0.78.rc3.git1.fc9)
    root (hd0,1)
    kernel /vmlinuz-2.6.25-0.78.rc3.git1.fc9 ro root=/dev/md2
    initrd /initrd-2.6.25-0.78.rc3.git1.fc9.img 

[dennis@nexus ~]$ cat /etc/mdadm.conf 

# mdadm.conf written out by anaconda
DEVICE partitions
MAILADDR root
ARRAY /dev/md2 level=raid1 num-devices=2 uuid=e76e2ddd:0704b19b:5f2c9cac:8880bf5c
ARRAY /dev/md0 level=raid1 num-devices=2 uuid=ec127fd1:2f891ce6:3bdd9733:c16732b0
ARRAY /dev/md1 level=raid1 num-devices=2 uuid=a5cd1ca6:d38b9418:949ca280:86ec588e

Comment 1 Dennis Jacobfeuerborn 2008-03-13 02:12:19 UTC
Created attachment 297883 [details]
dmesg output of kernel-2.6.25-0.87.rc3.git4.fc9

Comment 2 Bruno Wolff III 2008-03-13 10:18:48 UTC
I may be seeing a similar problem, though I have a more complicated setup in
that I am running encryption over raid. I can get things to work using a rescue
disk (with the .105 kernel), but the normal boot fails with a similar message.
What I would like to try next is figuring out if mkinitrd loads the proper
modules for my sata disk drives, though I am not sure which ones those should be.

Comment 3 Bruno Wolff III 2008-03-13 12:42:03 UTC
My home machine seems like it might also be having this issue and it is just
using raid 1 (no encryption). It failed to boot with a .105 kernel, but did boot
with a .95 kernel. However my old work machine is also running raid 1, but is
working fine with a .105 kernel. One potential difference is that the old work
machine has one file system that isn't on top of raid. Both of the other
machines only have filesystems that run on top of raid devices (in some cases
with an intervening encryption layer).

Comment 4 Bruno Wolff III 2008-03-13 13:10:13 UTC
I just tested reinstalling the .95 kernel on my new work machine (using a recent
mkinitrd) and it failed to find devices when booting up. So in my case it looks
like the problem is more likely tied to mkinitrd than the kernel.

Comment 5 Bruno Wolff III 2008-03-13 20:54:21 UTC
I tried adding in all of the modules that were loaded by the rescue image. This
resulted in a more normal prompt for the swap and root luks keys. There wasn't
an error with the root pivot. However right afterwards there was a problem
finding /bin/sh . Then there was a kernel panic. I expect loading a bunch of
modules in a random order could cause problems, but it does suggest there is an
issue with some needed modules not getting added by mkinitrd.

Comment 6 Rob Riggs 2008-03-14 14:50:20 UTC
This seems almost certainly a mkinitrd problem.

Installing kernel-2.6.25-0.113.rc5.git2.fc9 with mkinitrd-6.0.34 installed
results in a kernel installation that fails to boot.  I downgraded to
mkinitrd-6.0.33 and reinstalled the kernel-2.6.25-0.113.rc5.git2.fc9 and it
boots fine.

Comment 7 Bruno Wolff III 2008-03-14 21:21:20 UTC
There may be a problem with properly walking back through the device list (since
to use raid, you need to be able to use the underlying devices and similarly for
luks). I am using encryption over raid and some modules are not getting
included. I have tried several recent versions of mkinitrd, including 6.0.33 and
am consistently seeing that problem.
My next-to-latest test was munging /etc/fstab to see how that affected things.
Changing the line for swap to indicate it was on /dev/sdb2 instead of a luks
device resulted in the init file including at least some of the missing modules.
I then tried saying both / and swap were directly on raid devices and the
modules went missing again. (This last one was with an unmodified 6.0.34.)
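
The "walking back" idea above can be sketched against sysfs, where each stacked block device lists its components under a "slaves" directory. This is only an illustration under that assumption: the helper name list_underlying and its optional sysfs-root argument are hypothetical, and mkinitrd's own traversal code is not shown in this report.

```shell
#!/bin/bash
# list_underlying DEV [SYSROOT]
# Hypothetical helper (not part of mkinitrd): print DEV and, recursively,
# every device named in its sysfs "slaves" directory.
# SYSROOT defaults to /sys/block; partitions such as sda5 have no entry
# directly under it, so the walk simply stops there.
list_underlying() {
    local dev="$1" sysroot="${2:-/sys/block}" slave
    echo "$dev"
    for slave in "$sysroot/$dev"/slaves/*; do
        # An unmatched glob stays literal; skip it.
        [ -e "$slave" ] || continue
        list_underlying "${slave##*/}" "$sysroot"
    done
}

# Example (not run here): on a machine with encryption over raid,
#   list_underlying dm-0
# would be expected to print the dm device, the md array and its members.
```

Every device printed this way needs its driver in the initrd, which is why a break anywhere in the chain can leave the disk modules out.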

Comment 8 Warren Togami 2008-03-14 21:52:16 UTC
I would really like to ssh into an affected machine and attempt to figure out
what's going on here.  I would need root access and the ability to reboot. 
Willing to work with you over IRC while I do it.

http://togami.com/~warren/id_dsa.pub.asc
Here is my ssh public key, GPG signed so you can verify that it is really me. 
If you want to give me ssh access please send me an e-mail.

Comment 9 Warren Togami 2008-03-14 22:21:53 UTC
Ah, never mind, pjones said he might have fixed this.

http://koji.fedoraproject.org/packages/mkinitrd/
please test the latest builds here and report back

Comment 10 Dennis Jacobfeuerborn 2008-03-15 01:06:37 UTC
Looks like there is a bug in that version:

Running Transaction
  Installing: kernel-devel                 ######################### [1/4] 
  Updating  : kernel-headers               ######################### [2/4] 
  Installing: kernel                       ######################### [3/4] 
/sbin/mkinitrd: line 1559: [: too many arguments
  Cleanup   : kernel-headers               ######################### [4/4] 


Comment 11 Dennis Jacobfeuerborn 2008-03-15 01:13:34 UTC
trivial patch:

diff -Naur old/sbin/mkinitrd new/sbin/mkinitrd
--- old/sbin/mkinitrd	2008-03-15 02:11:55.000000000 +0100
+++ new/sbin/mkinitrd	2008-03-15 02:11:40.000000000 +0100
@@ -69,6 +69,7 @@
 PREMODS=""
 DMDEVS=""
 ncryptodevs=0
+nlatecryptodevs=0
 
 NET_LIST=""
 LD_SO_CONF=/etc/ld.so.conf


Comment 12 Dennis Jacobfeuerborn 2008-03-15 01:26:09 UTC
Unfortunately the resulting installed kernel still doesn't boot. :(
(I can confirm though that kernels installed with mkinitrd 6.0.33 do boot
correctly. Not sure why I haven't thought about *downgrading* mkinitrd myself. Doh!)

Comment 13 Bruno Wolff III 2008-03-15 04:08:25 UTC
The machine I am playing with has an encrypted root and swap so you can't reboot
it without someone being present to enter the keys and if things go south boot
again with the rescue disk. I won't be able to provide that ability until Monday.
In principle though I don't have a problem letting you try stuff on it, since
it is a fresh install with yum updates, there is nothing confidential on it yet.
The only other tricky thing is that I have been hanging it off my old desktop
machine and it doesn't currently have a direct connection to the internet. But I
can do something if we go there.
In the meantime I'll take a look at the latest mkinitrd and see if the initrd
images that are created look reasonable.

Comment 14 Bruno Wolff III 2008-03-15 05:33:38 UTC
I tried out 6.0.36 and 6.0.36 with the above fix and neither generated an initrd
that loaded the modules needed to read my disk drives. So this needs more work, at
least for the encryption over raid case.

Comment 15 Bruno Wolff III 2008-03-15 20:06:04 UTC
I tried out 6.0.36 with the above fix on a different machine that has both / and
swap on software raid devices and booting failed similarly to above. However the
kernel modules for accessing the disk drives looked to be included in the initrd
image, so there might be two separate problems going on.

Comment 16 Warren Togami 2008-03-18 03:03:17 UTC
Created attachment 298326 [details]
Diff of initrd contents from 6.0.33 to 6.0.37

mkinitrd-6.0.33 you reported above as working.
mkinitrd-6.0.37 you reported above as broken.

However running both versions of mkinitrd on your box, I don't see how 6.0.33
could work as it is missing everything encryption related.  Am I missing
something?

Comment 17 Warren Togami 2008-03-18 03:21:14 UTC
Bruno, look in /tmp/mkinitrd-rpms.  That is where I put together the above
comparison between 6.0.33 and 6.0.37.  How did you get 6.0.33 to produce a
working initrd earlier?

Comment 18 Warren Togami 2008-03-18 03:35:45 UTC
*** From 6.0.37 ***
echo Setting up disk encryption: /dev/md2
cryptsetup luksOpen /dev/md2 luks-md2
echo Setting up disk encryption: /dev/md1
cryptsetup luksOpen /dev/md1 luks-md1
resume mapper/luks-md1
echo Creating root device.
mkrootdev -t ext3 -o defaults,ro /dev/mapper/luks-md2

resume mapper/luks-md1 <--- Is this line supposed to be missing the preceding
"/dev/"?

Comment 19 Warren Togami 2008-03-18 03:57:57 UTC
    # find the first swap dev which would get used for swsusp
    swsuspdev=$(awk '/^[ \t]*[^#]/ { if ($3 == "swap") { print $1; exit }}' $fstab)
    if [[ "$swsuspdev" =~ ^(UUID=|LABEL=) ]]; then
        swsuspdev=$(resolve_device_name "$swsuspdev")
    fi

@   suspdev=$(findblockdevinsys "$swsuspdev")
@   suspdev=${suspdev##*/dev/}
    if [ -n "$suspdev" ]; then
         swsuspdev="$suspdev"
    fi
    unset suspdev
    if [ -n "$swsuspdev" ]; then
        handlelvordev "$swsuspdev"
    fi
fi

The second line beginning with @ sets suspdev to mapper/luks-md1. handlelvordev
does nothing with the resulting $swsuspdev because it is plain RAID (not LVM or
raw devices), so an invalid name without "/dev/" is later emitted as the
resume device. This is likely a different bug though?
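
The effect of the second @ line can be reproduced in isolation. In this sketch the input path is an assumed value (what findblockdevinsys really returns on the affected machine is not shown above), and add_dev_prefix is a hypothetical helper showing one way to restore the prefix should it turn out to be required. It is not the change that went into any later mkinitrd release.

```shell
#!/bin/bash
# Assumed input; the real value comes from findblockdevinsys.
suspdev="/dev/mapper/luks-md1"

# Same expansion as in the excerpt: remove the longest prefix matching
# "*/dev/", i.e. everything up to and including the last "/dev/".
suspdev=${suspdev##*/dev/}
echo "$suspdev"                  # prints mapper/luks-md1

# Hypothetical helper: make sure a device name carries the /dev/ prefix.
add_dev_prefix() {
    case "$1" in
        /dev/*) echo "$1" ;;
        *)      echo "/dev/$1" ;;
    esac
}

add_dev_prefix "$suspdev"        # prints /dev/mapper/luks-md1
add_dev_prefix /dev/md1          # prints /dev/md1
```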

Comment 20 Warren Togami 2008-03-18 04:19:06 UTC
Found what might be the actual problem.  This initrd is devoid of any disk
controller.  Too tired to think of a fix now and I can't reboot your machine to
test it anyway.

Please test the following workarounds:
mkinitrd --preload=mptsas /tmp/initrd-test.img 2.6.25-0.121.rc5.git4.fc9
mkinitrd --with-avail=block /tmp/initrd-test.img 2.6.25-0.121.rc5.git4.fc9

The resulting initrd of the first command seems to have mptsas and many required
modules. It is lacking sd_mod; not sure if that's needed. The second command
pulls all possible block device drivers into the initrd and will attempt to load
the drivers that the system's devices need after it finishes loading all the RAID
drivers. Does either of these initrds work any better?
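
One way to check whether a generated initrd contains a given driver is to list the image and look for the module file. The sketch below assumes the image is a gzip-compressed cpio archive, as mkinitrd images of this era were. missing_modules is a hypothetical helper, and the module names are simply ones mentioned in this thread.

```shell
#!/bin/bash
# missing_modules MODULE...
# Hypothetical helper: read an initrd file listing on stdin and print
# every named module that has no matching .ko file in the listing.
missing_modules() {
    local listing mod
    listing=$(cat)
    for mod in "$@"; do
        printf '%s\n' "$listing" | grep -Eq "(^|/)${mod}\.ko\$" || echo "$mod"
    done
}

# Typical use against a real image (not run here):
#   zcat /tmp/initrd-test.img | cpio -it --quiet | missing_modules mptsas sd_mod raid1

# Self-contained demonstration with a made-up listing:
printf '%s\n' bin/nash lib/raid1.ko lib/mptsas.ko |
    missing_modules mptsas sd_mod raid1      # prints sd_mod
```

An empty result means every named module file is present in the image. It does not show that the init script actually loads them.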


Comment 21 Warren Togami 2008-03-18 21:05:38 UTC
pjones might have fixed it in 6.0.39 which I installed on your box.  Please try
running it and reboot to see how it goes.

Comment 22 Bruno Wolff III 2008-03-18 21:29:52 UTC
My first try with the 121 kernel failed, but I am not sure that this was
initrd's fault. I saw this failure on another machine. I forgot to run mkinitrd
for both kernels, but I should have time to test the 113 kernel before I leave.
The error message was:
/bin/sh: ro: No such file or directory
Kernel panic - not syncing: Attempted to kill init!
Sorry about not getting to this earlier today, but I had some other stuff
keeping me busy. If this next test fails, I'll have more time tomorrow to do
test reboots.

Comment 23 Bruno Wolff III 2008-03-18 21:47:37 UTC
A second test with the 113 kernel failed. The buses don't run very late this
week because of spring break here, so that is about all I can do before I have
to leave for reboot testing. I'll get the machine back up in rescue mode so that
you can try other stuff.
I'll try 6.0.39 at home, since I am not sure the problem is encryption now, and
I can see what it does on a raid system without it.

Comment 24 Warren Togami 2008-03-18 22:03:57 UTC
No, the problem is specifically with the combination of encryption and RAID.


Comment 25 Dennis Jacobfeuerborn 2008-03-18 22:26:47 UTC
After installing 6.0.39 new kernels boot properly on my machine.

Comment 26 Bruno Wolff III 2008-03-19 07:44:28 UTC
I tried 6.0.39 at home on a system with raid, no encryption and no lvm, and it worked OK.
I'll look at the init file on the encryption over raid machine to see if I see
anything odd, before going back to sleep.

Comment 27 Bruno Wolff III 2008-03-19 08:14:49 UTC
I didn't see anything obviously odd. I should have some time to try things later
today, but could use some suggestions as to what.

Comment 28 Bruno Wolff III 2008-03-19 14:31:29 UTC
I am running rpm -Va to check to see if anything has gotten messed up in that
maybe only part of bash is installed. This seems unlikely since it works in
rescue mode, but seems easy enough to check.

Comment 29 Peter Jones 2008-03-19 15:28:56 UTC
Bruno, you didn't state if 6.0.39 works for you with raid and encryption.  Does it?

Comment 30 Bruno Wolff III 2008-03-19 16:31:19 UTC
No it didn't. That was what I was trying to say in comments 22 and 23.
But things were better in that I got asked for swap and root's keys and then
switchroot seemed to happen. But it looks like it was unable to run /bin/sh for
some reason.

Comment 31 Bruno Wolff III 2008-03-19 18:51:16 UTC
I ran rpm -Va and saw some diffs. While none particularly looked like a problem,
I am going to reinstall the affected packages and then see if that makes a
difference. If that doesn't fix things, I'll try a reinstall with the latest
boot.iso file and see how things work after that.

Comment 32 Bruno Wolff III 2008-03-20 00:37:26 UTC
Yeah!
After reinstalling all packages flagged by rpm -Va and the kernel (for good
measure), rebooting worked.
I suspect I got a corrupted copy of something or an update didn't work correctly.
So I think this one is really fixed now.
Thanks!
I am off to do a fresh install and to unroot Warren.

Comment 33 Bruno Wolff III 2008-03-20 21:21:11 UTC
I completed a fresh install from this morning's boot iso and encryption over
raid is working.
Thanks for making this feature work.

Comment 34 David Nielsen 2008-03-25 20:33:30 UTC
I can confirm this works with the Beta release of Fedora 9.

Comment 35 Rob Riggs 2008-03-25 20:46:03 UTC
6.0.40 works for me with two RAID-1 devices and LVM.

Comment 36 Rob Riggs 2008-03-25 21:09:50 UTC
6.0.40 is emitting a warning for me every time a new kernel is installed:
"resolveDevice: device spec expected"

Here's how I can reproduce it:
sudo /sbin/mkinitrd /tmp/test 2.6.25-0.150.rc6.git7.fc9
resolveDevice: device spec expected


Comment 37 Bug Zapper 2008-05-14 06:00:30 UTC
Changing version to '9' as part of upcoming Fedora 9 GA.
More information and reason for this action is here:
http://fedoraproject.org/wiki/BugZappers/HouseKeeping

Comment 38 Peter Jones 2009-02-17 20:35:17 UTC
Rob, I think your problem with 6.0.40 is unrelated and also fixed in a later release.  Is it still a problem for you?

Comment 39 Dennis Jacobfeuerborn 2009-04-29 02:26:25 UTC
I think this bug can be counted as fixed => closing