Bug 726905

Summary: Will not boot on some systems using software raid (possibly just version 1.2 arrays)
Product: [Fedora] Fedora
Reporter: Bruno Wolff III <bruno>
Component: kernel
Assignee: Kernel Maintainer List <kernel-maint>
Status: CLOSED NOTABUG
QA Contact: Fedora Extras Quality Assurance <extras-qa>
Severity: unspecified
Docs Contact:
Priority: unspecified
Version: rawhide
CC: aquini, bruno, gansalmon, itamar, jonathan, jwboyer, kernel-maint, madhu.chinakonda
Target Milestone: ---   
Target Release: ---   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2011-08-11 01:31:35 UTC
Type: ---
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Attachments:
Description                          Flags
dmesg output from successful boot    none
Output on screen when boot failed    none

Description Bruno Wolff III 2011-07-30 17:38:54 UTC
Description of problem:
I have not been able to boot on one of my machines since the 3.0 kernel release. The kernel-PAE-3.0-0.rc7.git10.1.fc16.i686 works, but so far every later kernel (currently through kernel-PAE-3.1.0-0.rc0.git11.2.fc17.i686) fails to boot because no raid devices are detected and it is unable to mount the root file system.

I have another machine where this doesn't happen. Both machines have an encrypted root device on top of software raid 1. The machine that works has version 0.90 arrays and the one that doesn't has version 1.2 arrays, except for /boot which has a version 1.0 array.

While booting the problem system, no md arrays are detected before the kernel tries to use the file system specified by the root= parameter, and the boot fails when trying to mount that file system. No password for the LUKS device is asked for, but given that the array for the root device wasn't detected, this isn't surprising.

I haven't tried rerunning dracut on the older kernel entry, as if it breaks things I am fairly hosed. But it is possible that an update to dracut, mdadm or some other tool triggered the problem, rather than the kernel update.

Comment 1 Bruno Wolff III 2011-07-30 17:41:48 UTC
Created attachment 515982
dmesg output from successful boot

Comment 2 Bruno Wolff III 2011-07-30 18:02:01 UTC
I should be able to test dracut by copying over the 3.0-0.rc7.git10.1.fc16.i686 files in /boot and making a new grub entry. After confirming I can boot off the copies, I'll run the kernel update script that rebuilds initramfs and see if that makes things break. I'm in the middle of processing today's rawhide update, so it will be a bit before I test this.

Comment 3 Bruno Wolff III 2011-07-31 04:45:06 UTC
I ran /sbin/new-kernel-pkg --package kernel-PAE --mkinitrd --dracut --depmod --update 3.0-0.rc7.git10.1.fc16.i686.PAE and the 3.0-0.rc7.git10.1.fc16.i686.PAE kernel still booted. So it's looking more like something that changed between 3.0-0.rc7.git10.1 and 3.0.0-1 that triggers the problem.

Comment 4 Bruno Wolff III 2011-07-31 14:54:51 UTC
I am still seeing this with kernel-PAE-3.1.0-0.rc0.git12.1.fc17.i686. After tonight I won't have physical access to the machine for a week and won't be able to test rebooting it during that time.

Comment 5 Josh Boyer 2011-08-01 12:47:50 UTC
The only non-merge changes in the upstream kernel between 3.0-rc7-git10 and 3.0 are 33d8881af5584fb7994f6b3d17fc11dcaf07b3b2 and 2cebaa58b7de775386732bbd6cd11c3f5b73faf0, neither of which has anything to do with the md area of the kernel, so that is odd.

In the kernel package itself, we simply switched the source to the release tarball.

About the only thing I can see that changed that might be relevant between -rc7-git10 and 3.0.0-1 is that uname went from 3.0-0.rc7 to 3.0.0, so a two digit to three digit change.  I would have expected a problem the reverse way though.

Comment 6 Josh Boyer 2011-08-01 12:53:57 UTC
Do you have a log of the boot failing?

Comment 7 Bruno Wolff III 2011-08-02 12:59:10 UTC
It doesn't get far enough to log. When I get back from my trip I can take a picture of the screen.

Comment 8 Bruno Wolff III 2011-08-09 20:14:35 UTC
Created attachment 517478
Output on screen when boot failed

I tested this again with kernel-PAE-3.1.0-0.rc1.git1.1.fc17.i686 and snapped a picture when it failed.

Comment 9 Josh Boyer 2011-08-10 14:36:14 UTC
Out of curiosity, does the grub entry for the failing kernel(s) have an initrd line and do you see something that looks like this during boot:

[    0.770386] Unpacking initramfs...
[    2.545097] Freeing initrd memory: 15340k freed

and then later:

[    3.571223] Freeing unused kernel memory: 1908k freed
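
If a log of the boot is available, the check for those messages can be scripted. The following is only a minimal sketch: the helper name `initramfs_was_loaded` and the choice to pass the log in as a string are assumptions made for illustration, not part of any Fedora tool.

```python
import re

# Matches the kernel message printed when an initramfs image is unpacked,
# with or without a leading "[ timestamp]" prefix on the line.
UNPACK_RE = re.compile(
    r"^(\[\s*\d+\.\d+\]\s*)?Unpacking initramfs", re.MULTILINE
)


def initramfs_was_loaded(dmesg_text):
    """Return True if the log text contains an 'Unpacking initramfs' line.

    Hypothetical helper: it only looks for the message, it does not
    verify that the image contents were usable.
    """
    return UNPACK_RE.search(dmesg_text) is not None
```

It could be fed the text of a saved dmesg file; a result of False suggests the boot loader never handed an initramfs to the kernel.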

Comment 10 Bruno Wolff III 2011-08-10 18:02:46 UTC
I am seeing a possibly related problem with 2.6.40 kernels on F15. I filed bug 729743 for this and, because the symptoms were somewhat different, was able to record dmesg output.

No, there isn't an initrd line for the broken kernels.

The first entry looks like this:
title Fedora (3.1.0-0.rc1.git1.1.fc17.i686.PAE)
root (hd0,0)
kernel /vmlinuz-3.1.0-0.rc1.git1.1.fc17.i686.PAE ro root=/dev/mapper/luks-9a976b86-8aaa-40d9-8039-89d710eac5c9 SYSFONT=latarcyrheb-sun16 LANG=en_US.UTF-8 KEYTABLE=us radeon.agpmode=-1

I can add one and do a test reboot when I get home from work.

Comment 11 Josh Boyer 2011-08-10 18:11:12 UTC
(In reply to comment #10)
> No there isn't an initrd line for the broken kernels.
> 
> The first entry looks like this:
> title Fedora (3.1.0-0.rc1.git1.1.fc17.i686.PAE)
> root (hd0,0)
> kernel /vmlinuz-3.1.0-0.rc1.git1.1.fc17.i686.PAE ro root=/dev/mapper/luks-9a976b86-8aaa-40d9-8039-89d710eac5c9 SYSFONT=latarcyrheb-sun16 LANG=en_US.UTF-8 KEYTABLE=us radeon.agpmode=-1
> 
> I can add one and do a test reboot when I get home from work.

Yeah.  Without the initrd line, the initramfs doesn't get loaded.  Then the kernel decides it's going to try and be helpful and look for RAID arrays to assemble, doesn't get it right, and then gives up.

The problem here is there is no initramfs being loaded, not really anything with the kernel.
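
A missing initrd line like this can be spotted by scanning the GRUB legacy config for stanzas that have a kernel line but no initrd line. The sketch below is an illustration only: the function name `entries_missing_initrd` is hypothetical, the parser handles just the plain `keyword value` form, and the initramfs file name in the sample is an assumption based on the usual Fedora naming, not something taken from the affected machine.

```python
def entries_missing_initrd(grub_conf):
    """Return the titles of stanzas that have a kernel line but no initrd line.

    Simplified sketch: only 'title', 'kernel' and 'initrd' keywords are
    inspected; comments and blank lines are skipped.
    """
    stanzas = []  # one dict per "title" stanza, in file order
    for raw in grub_conf.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        parts = line.split(None, 1)
        keyword = parts[0].lower()
        if keyword == "title":
            name = parts[1] if len(parts) > 1 else ""
            stanzas.append({"title": name, "kernel": False, "initrd": False})
        elif stanzas and keyword in ("kernel", "initrd"):
            stanzas[-1][keyword] = True
    return [s["title"] for s in stanzas if s["kernel"] and not s["initrd"]]


# Sample input: the first stanza mirrors the broken entry (kernel arguments
# shortened); the second shows a stanza with an initrd line. The initramfs
# path is assumed, not copied from the reporter's system.
SAMPLE = """\
title Fedora (3.1.0-0.rc1.git1.1.fc17.i686.PAE)
    root (hd0,0)
    kernel /vmlinuz-3.1.0-0.rc1.git1.1.fc17.i686.PAE ro
title Fedora (3.0-0.rc7.git10.1.fc16.i686.PAE)
    root (hd0,0)
    kernel /vmlinuz-3.0-0.rc7.git10.1.fc16.i686.PAE ro
    initrd /initramfs-3.0-0.rc7.git10.1.fc16.i686.PAE.img
"""

print(entries_missing_initrd(SAMPLE))
```

Run against the sample, only the 3.1.0-0.rc1 stanza is reported, which matches the entry that failed to boot here.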

Comment 12 Bruno Wolff III 2011-08-11 01:31:35 UTC
Thanks. Adding the initrd line fixed things. I am not sure how I managed to lose the one that was there.