Bug 441765

Summary: [mdraid] [patch] Boot hang in all recent Fedora kernels
Product: [Fedora] Fedora
Reporter: Nicolas Mailhot <nicolas.mailhot>
Component: kernel
Assignee: Kernel Maintainer List <kernel-maint>
Status: CLOSED RAWHIDE
QA Contact: Fedora Extras Quality Assurance <extras-qa>
Severity: low
Docs Contact:
Priority: low    
Version: rawhide   
Target Milestone: ---   
Target Release: ---   
Hardware: All   
OS: Linux   
Whiteboard:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2008-05-02 10:58:10 UTC
Type: ---
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 438943, 439966    
Attachments:
Description                          Flags
Screen capture                       none
successful dmesg                     none
Screen capture                       none
Blurry but complete screen capture   none
Oops complete screen capture         none
patch                                none

Description Nicolas Mailhot 2008-04-09 21:24:53 UTC
Description of problem:

I normally do not reboot my always-on rawhide system very often, unless I'm
building my own testing mm kernels. This has not been the case for quite a
while. Recently, however, following the death of the home set-top DVD player
and a rainy winter day, I remembered my old Windows gaming partition and
rebooted into it (I also changed the system's graphics card).

Getting back into Linux, however, proved a challenge. The system would oops on
every recent rawhide kernel 9 times out of 10. Strangely enough my old mm kernel
with the associated old initrd would always boot.

I've now captured a partial oops in a picture (very difficult, since it scrolls
off the screen fast). I hope it's sufficient to point investigations in some
direction. I don't know if it's a new bug or something triggered by recent
unrelated rawhide changes. The problem always occurs at udev start time, then
the system quickly gets stuck and needs a reset.

Version-Release number of selected component (if applicable):

I couldn't find a recent Fedora kernel without the problem.

How reproducible:

Almost always, from cold or hot boot. Sometimes the boot sequence succeeds, but
I haven't found a reliable way to boot so far. The old mm kernel always boots
fine.


Comment 1 Nicolas Mailhot 2008-04-09 21:28:10 UTC
Created attachment 301898 [details]
Screen capture

The best screen capture I could produce so far. It's blurry because the screen
is scrolling and the camera captured some afterimages.

Comment 2 Nicolas Mailhot 2008-04-09 21:32:33 UTC
Created attachment 301899 [details]
successful dmesg

The same kernel booted successfully on the next attempt. Here is the associated
dmesg. Since getting a recent Fedora kernel to boot can easily take about an
hour of trials, I counted myself lucky and stopped trying to capture the oops on
camera.

Comment 3 Chuck Ebbert 2008-04-10 23:03:10 UTC
Can you use the boot_delay parameter to slow down the scrolling and get a clear
picture of the oops?

Try boot_delay=100 to start with.

Comment 4 Nicolas Mailhot 2008-04-11 20:00:47 UTC
Created attachment 302165 [details]
Screen capture

I'm afraid the boot option only results in a blank screen.

While I was at it, however, I retried a picture series, and this one is a bit
better, I think.

Comment 5 Nicolas Mailhot 2008-04-11 20:13:21 UTC
clearing NEEDINFO

Comment 6 Chuck Ebbert 2008-04-11 22:35:56 UTC
That was good enough but too much had scrolled off the screen.

Comment 7 Nicolas Mailhot 2008-04-12 08:24:02 UTC
I'm afraid that without a reliable way to slow the scrolling I can't do any
better. The previous lines just scroll too fast: they always show up as a lot of
superimposed lines in pictures (much worse than my first shot). The scrolling
slows down a little at that point, which is why I could take the picture.

Comment 8 Dave Jones 2008-04-14 16:51:47 UTC
Even after trying higher values for boot_delay?


Comment 9 Nicolas Mailhot 2008-04-14 17:57:56 UTC
boot_delay didn't result in a slower boot; it resulted in a blank screen and no boot.

Comment 10 Nicolas Mailhot 2008-04-18 22:15:32 UTC
Did a new run of tests with 2.6.25-1.fc9.x86_64. It turns out that:
1. Pressing shift+page-up like mad is a somewhat reliable way to avoid the hang
2. It's an "unable to handle null pointer dereference" bug, and I managed to get
a somewhat blurry but readable picture of the start of the error message

Comment 11 Nicolas Mailhot 2008-04-18 22:17:13 UTC
Created attachment 302953 [details]
Blurry but complete screen capture

Comment 12 Nicolas Mailhot 2008-04-19 12:58:19 UTC
Created attachment 302996 [details]
Oops complete screen capture

This one should be as complete and clear as it can be.

Comment 13 Chuck Ebbert 2008-04-22 15:24:12 UTC
I added my analysis of the failure to the upstream bug -- thank you for filing that.

Comment 14 Nicolas Mailhot 2008-04-29 17:44:39 UTC
A fix was posted in upstream's bugzilla. Please integrate it into the Fedora
kernel before the F9 release.

Comment 15 Chuck Ebbert 2008-04-29 19:17:56 UTC
Created attachment 304150 [details]
patch

Comment 16 Chuck Ebbert 2008-05-01 02:03:06 UTC
Patch in 2.6.25-13

Comment 17 Nicolas Mailhot 2008-05-01 13:21:52 UTC
I confirm 2.6.25-13 fixes the issue. I hope it is not restricted to F9 updates.
I'd hate to have a boot crasher in the initial F9 kernel.

Comment 18 Nicolas Mailhot 2008-05-01 13:31:41 UTC
Thank you for working on it

Comment 19 Chuck Ebbert 2008-05-02 10:58:10 UTC
2.6.25-14 tagged for F9-final