Bug 726905 - Will not boot on some systems using software raid (possibly just version 1.2 arrays)
Keywords:
Status: CLOSED NOTABUG
Alias: None
Product: Fedora
Classification: Fedora
Component: kernel
Version: rawhide
Hardware: Unspecified
OS: Unspecified
Target Milestone: ---
Assignee: Kernel Maintainer List
QA Contact: Fedora Extras Quality Assurance
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2011-07-30 17:38 UTC by Bruno Wolff III
Modified: 2011-08-11 01:31 UTC

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2011-08-11 01:31:35 UTC
Type: ---
Embargoed:


Attachments
dmesg output from successful boot (72.09 KB, text/plain)
2011-07-30 17:41 UTC, Bruno Wolff III
Output on screen when boot failed (1.07 MB, image/jpeg)
2011-08-09 20:14 UTC, Bruno Wolff III

Description Bruno Wolff III 2011-07-30 17:38:54 UTC
Description of problem:
I have not been able to boot on one of my machines since the 3.0 kernel release. The kernel-PAE-3.0-0.rc7.git10.1.fc16.i686 works, but so far every later kernel (currently through kernel-PAE-3.1.0-0.rc0.git11.2.fc17.i686) fails to boot because no raid devices are detected and it is unable to mount the root file system.

I have another machine where this doesn't happen. Both machines have an encrypted root device on top of software raid 1. The machine that works has version 0.90 arrays and the one that doesn't has version 1.2 arrays, except for /boot which has a version 1.0 array.

While booting the problem system, no md arrays are noted prior to trying to use the file system specified in the root= parameter, and the boot fails when trying to mount that file system. No password for the LUKS device is asked for, but given that the array for the root device wasn't detected, this isn't surprising.

I haven't tried rerunning dracut on the older kernel entry, as if it breaks things I am fairly hosed. But it is possible that an update to dracut, mdadm or some other tool triggered the problem, rather than the kernel update.

Comment 1 Bruno Wolff III 2011-07-30 17:41:48 UTC
Created attachment 515982
dmesg output from successful boot

Comment 2 Bruno Wolff III 2011-07-30 18:02:01 UTC
I should be able to test dracut by copying over the 3.0-0.rc7.git10.1.fc16.i686 files in /boot and making a new grub entry. After confirming I can boot off the copies, I'll run the kernel update script that rebuilds initramfs and see if that makes things break. I'm in the middle of processing today's rawhide update, so it will be a bit before I test this.

Comment 3 Bruno Wolff III 2011-07-31 04:45:06 UTC
I ran /sbin/new-kernel-pkg --package kernel-PAE --mkinitrd --dracut --depmod --update 3.0-0.rc7.git10.1.fc16.i686.PAE and the 3.0-0.rc7.git10.1.fc16.i686.PAE kernel still booted. So it's looking more like something that changed between 3.0-0.rc7.git10.1 and 3.0.0-1 that triggers the problem.

Comment 4 Bruno Wolff III 2011-07-31 14:54:51 UTC
I am still seeing this with kernel-PAE-3.1.0-0.rc0.git12.1.fc17.i686. After tonight I won't have physical access to the machine for a week and won't be able to test rebooting it during that time.

Comment 5 Josh Boyer 2011-08-01 12:47:50 UTC
The only non-merge changes in the upstream kernel between 3.0-rc7-git10 and 3.0 are 33d8881af5584fb7994f6b3d17fc11dcaf07b3b2 and 2cebaa58b7de775386732bbd6cd11c3f5b73faf0, neither of which has anything to do with the md area of the kernel, so that is odd.

In the kernel package itself, we simply switched the source to the release tarball.

About the only thing I can see that changed that might be relevant between -rc7-git10 and 3.0.0-1 is that uname went from 3.0-0.rc7 to 3.0.0, so a two digit to three digit change.  I would have expected a problem the reverse way though.

Comment 6 Josh Boyer 2011-08-01 12:53:57 UTC
Do you have a log of the boot failing?

Comment 7 Bruno Wolff III 2011-08-02 12:59:10 UTC
It doesn't get far enough to log. When I get back from my trip I can take a picture of the screen.

Comment 8 Bruno Wolff III 2011-08-09 20:14:35 UTC
Created attachment 517478
Output on screen when boot failed

I tested this again with kernel-PAE-3.1.0-0.rc1.git1.1.fc17.i686 and snapped a picture when it failed.

Comment 9 Josh Boyer 2011-08-10 14:36:14 UTC
Out of curiosity, does the grub entry for the failing kernel(s) have an initrd line and do you see something that looks like this during boot:

[    0.770386] Unpacking initramfs...
[    2.545097] Freeing initrd memory: 15340k freed

and then later:

[    3.571223] Freeing unused kernel memory: 1908k freed
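
The check described above can be sketched in Python. This is a minimal sketch, not a tool used in this bug: the function name `initramfs_markers` is hypothetical, and it only looks for the two message prefixes quoted above in text that has already been captured from dmesg or a console log.

```python
def initramfs_markers(dmesg_text):
    """Report whether each initramfs-related message appears in the log text.

    The two marker strings are the message prefixes quoted in this comment;
    the function name and return shape are illustrative assumptions.
    """
    markers = ("Unpacking initramfs", "Freeing initrd memory")
    found = {marker: False for marker in markers}
    for line in dmesg_text.splitlines():
        for marker in markers:
            if marker in line:
                found[marker] = True
    return found
```

If both values come back False for a failing boot, that suggests the kernel never received an initramfs, which points at the boot loader entry rather than at the kernel.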

Comment 10 Bruno Wolff III 2011-08-10 18:02:46 UTC
I am seeing a possibly related problem with 2.6.40 kernels on F15. I filed bug 729743 for this and, because the symptoms were somewhat different, was able to record dmesg output.

No, there isn't an initrd line for the broken kernels.

The first entry looks like this:
title Fedora (3.1.0-0.rc1.git1.1.fc17.i686.PAE)
root (hd0,0)
kernel /vmlinuz-3.1.0-0.rc1.git1.1.fc17.i686.PAE ro root=/dev/mapper/luks-9a976b86-8aaa-40d9-8039-89d710eac5c9 SYSFONT=latarcyrheb-sun16 LANG=en_US.UTF-8 KEYTABLE=us radeon.agpmode=-1

I can add one and do a test reboot when I get home from work.

Comment 11 Josh Boyer 2011-08-10 18:11:12 UTC
(In reply to comment #10)
> No there isn't an initrd line for the broken kernels.
> 
> The first entry looks like this:
> title Fedora (3.1.0-0.rc1.git1.1.fc17.i686.PAE)
> root (hd0,0)
> kernel /vmlinuz-3.1.0-0.rc1.git1.1.fc17.i686.PAE ro root=/dev/mapper/luks-9a976b86-8aaa-40d9-8039-89d710eac5c9 SYSFONT=latarcyrheb-sun16 LANG=en_US.UTF-8 KEYTABLE=us radeon.agpmode=-1
> 
> I can add one and do a test reboot when I get home from work.

Yeah.  Without the initrd line, the initramfs doesn't get loaded.  Then the kernel decides it's going to try and be helpful and look for RAID arrays to assemble, doesn't get it right, and then gives up.

The problem here is there is no initramfs being loaded, not really anything with the kernel.
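
A quick way to spot this condition is to scan the GRUB legacy configuration for entries that have a kernel line but no initrd line. The following is a rough sketch under stated assumptions: the function name `entries_missing_initrd` is hypothetical, the parser only understands the simple title/kernel/initrd layout shown in comment 10, and it takes the configuration text as a string rather than assuming a particular file location.

```python
import re


def entries_missing_initrd(grub_conf_text):
    """Return titles of GRUB legacy entries with a kernel line but no initrd line.

    Simplified, illustrative parser: an entry starts at a "title" line and
    runs until the next "title" line or the end of the text.
    """
    missing = []
    title = None
    has_kernel = False
    has_initrd = False

    for raw in grub_conf_text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        # Split the command keyword from its argument (space or "=" separated).
        parts = re.split(r"[\s=]+", line, maxsplit=1)
        keyword = parts[0]
        if keyword == "title":
            # Close out the previous entry before starting a new one.
            if title is not None and has_kernel and not has_initrd:
                missing.append(title)
            title = parts[1] if len(parts) > 1 else ""
            has_kernel = False
            has_initrd = False
        elif keyword == "kernel":
            has_kernel = True
        elif keyword == "initrd":
            has_initrd = True

    # Close out the final entry.
    if title is not None and has_kernel and not has_initrd:
        missing.append(title)
    return missing
```

Any title this returns would boot without an initramfs, which on a system with root on LUKS over md RAID leaves nothing to assemble the arrays or prompt for the passphrase.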

Comment 12 Bruno Wolff III 2011-08-11 01:31:35 UTC
Thanks. Adding the initrd line fixed things. I am not sure how I managed to lose the one that was there.

