Bug 441765 - [mdraid] [patch] Boot hang in all recent Fedora kernels
Summary: [mdraid] [patch] Boot hang in all recent Fedora kernels
Keywords:
Status: CLOSED RAWHIDE
Alias: None
Product: Fedora
Classification: Fedora
Component: kernel
Version: rawhide
Hardware: All
OS: Linux
Priority: low
Severity: low
Target Milestone: ---
Assignee: Kernel Maintainer List
QA Contact: Fedora Extras Quality Assurance
URL:
Whiteboard:
Depends On:
Blocks: F10Blocker, F10FinalBlocker, F9KernelBlocker
Reported: 2008-04-09 21:24 UTC by Nicolas Mailhot
Modified: 2008-05-02 10:58 UTC (History)

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2008-05-02 10:58:10 UTC
Type: ---
Embargoed:


Attachments
Screen capture (235.48 KB, image/jpeg)
2008-04-09 21:28 UTC, Nicolas Mailhot
successful dmesg (37.18 KB, application/octet-stream)
2008-04-09 21:32 UTC, Nicolas Mailhot
Screen capture (125.30 KB, image/jpeg)
2008-04-11 20:00 UTC, Nicolas Mailhot
Blurry but complete screen capture (102.04 KB, image/jpeg)
2008-04-18 22:17 UTC, Nicolas Mailhot
Oops complete screen capture (127.50 KB, image/jpeg)
2008-04-19 12:58 UTC, Nicolas Mailhot
patch (765 bytes, text/plain)
2008-04-29 19:17 UTC, Chuck Ebbert


Links
System ID Private Priority Status Summary Last Updated
Linux Kernel 10484 0 None None None Never

Description Nicolas Mailhot 2008-04-09 21:24:53 UTC
Description of problem:

I normally do not reboot my always-on rawhide system very often, unless I'm
building my own testing mm kernels. This has not been the case for quite a
while. Recently, however, following the death of the home set-top dvd player,
and a rainy winter day, I remembered my old gaming windows partition and
rebooted on it (also changed the system gfx card).

Getting back into linux however proved a challenge. The system would oops on
every recent rawhide kernel 9 times out of 10. Strangely enough my old mm kernel
with the associated old initrd would always boot.

I've now captured a partial oops in a picture (very difficult, as it scrolls off
the screen fast). I hope it's sufficient to point investigations in some
direction. I don't know if it's a new bug or something triggered by recent
unrelated rawhide changes. The problem always occurs at udev start time, then
the system quickly gets stuck and needs a reset.

Version-Release number of selected component (if applicable):

Couldn't find a recent Fedora kernel without the problem.

How reproducible:

Almost always, from cold or hot boot. Sometimes the boot sequence succeeds, but I
haven't found a reliable way to boot so far. The old mm kernel always boots fine.


Comment 1 Nicolas Mailhot 2008-04-09 21:28:10 UTC
Created attachment 301898 [details]
Screen capture

The best screen capture I could produce so far. It's blurry because the screen
is scrolling and the camera captured some ghosting.

Comment 2 Nicolas Mailhot 2008-04-09 21:32:33 UTC
Created attachment 301899 [details]
successful dmesg

The same kernel booted successfully on the next attempt. Here is the associated
dmesg. Since getting a recent Fedora kernel to boot can easily take ~1h of
trials, I counted myself lucky and stopped trying to capture the oops on
camera.

Comment 3 Chuck Ebbert 2008-04-10 23:03:10 UTC
Can you use the boot_delay parameter to slow down the scrolling and get a clear
picture of the oops?

Try boot_delay=100 to start with.

Comment 4 Nicolas Mailhot 2008-04-11 20:00:47 UTC
Created attachment 302165 [details]
Screen capture

I'm afraid the boot option only results in a blank screen.

While I was at it, however, I retried a picture series, and this one is a bit
better, I think.

Comment 5 Nicolas Mailhot 2008-04-11 20:13:21 UTC
clearing NEEDINFO

Comment 6 Chuck Ebbert 2008-04-11 22:35:56 UTC
That was good enough but too much had scrolled off the screen.

Comment 7 Nicolas Mailhot 2008-04-12 08:24:02 UTC
I'm afraid that without a reliable way to slow scrolling I can't do any better.
The previous lines just scroll too fast: they always show up as a lot of
superimposed lines in pictures (much worse than my first shot). The scrolling
slows down a little at that point, which is why I could take the picture.

Comment 8 Dave Jones 2008-04-14 16:51:47 UTC
Even after trying higher values for boot_delay?


Comment 9 Nicolas Mailhot 2008-04-14 17:57:56 UTC
boot_delay didn't result in a slower boot; it resulted in a blank screen and no boot.

Comment 10 Nicolas Mailhot 2008-04-18 22:15:32 UTC
Did a new run of tests with 2.6.25-1.fc9.x86_64. Turns out:
1. Pressing shift+page-up like mad is a somewhat reliable way to avoid the hang.
2. It's an "unable to handle null pointer dereference" bug, and I managed to get a
somewhat blurry but readable picture of the start of the error message.

Comment 11 Nicolas Mailhot 2008-04-18 22:17:13 UTC
Created attachment 302953 [details]
Blurry but complete screen capture

Comment 12 Nicolas Mailhot 2008-04-19 12:58:19 UTC
Created attachment 302996 [details]
Oops complete screen capture

This one should be as complete and clear as it could be

Comment 13 Chuck Ebbert 2008-04-22 15:24:12 UTC
I added my analysis of the failure to the upstream bug -- thank you for filing that.

Comment 14 Nicolas Mailhot 2008-04-29 17:44:39 UTC
A fix was posted in upstream's bugzilla. Please integrate it into the Fedora
kernel before the F9 release.

Comment 15 Chuck Ebbert 2008-04-29 19:17:56 UTC
Created attachment 304150 [details]
patch

Comment 16 Chuck Ebbert 2008-05-01 02:03:06 UTC
Patch in 2.6.25-13

Comment 17 Nicolas Mailhot 2008-05-01 13:21:52 UTC
I confirm 2.6.25-13 fixes the issue. I hope it is not restricted to F9 updates.
I'd hate to have a boot crasher in the initial F9 kernel.

Comment 18 Nicolas Mailhot 2008-05-01 13:31:41 UTC
Thank you for working on it

Comment 19 Chuck Ebbert 2008-05-02 10:58:10 UTC
2.6.25-14 tagged for F9-final
