Bug 625077

Summary: kernel-2.6.36-0.[1234].rc1.git[01].fc15 fails to boot
Product: [Fedora] Fedora
Reporter: Tom London <selinux>
Component: kernel
Assignee: Roland McGrath <roland>
Status: CLOSED RAWHIDE
QA Contact: Fedora Extras Quality Assurance <extras-qa>
Severity: medium
Priority: low
Version: rawhide
CC: anton, aquini, dougsland, eblake, frankly3d, gansalmon, itamar, jlayton, jonathan, kernel-maint, kurtdriver, madhu.chinakonda, masao-takahashi, michal, pebolle, rjones, robatino, spoffley, tomek, yaneti, zing
Hardware: All   
OS: Linux   
Doc Type: Bug Fix
Last Closed: 2010-08-24 14:07:28 UTC
Attachments:
Description                                                     Flags
Screenshot showing boot hang                                    none
log of failed boot with kernel-2.6.36-0.1.rc1.git0.fc15.i686    none
boot messages from guest serial console                         none
traces displayed on a screen after a failed boot                none
dmesg output from booting 2.6.36-0.0.rc0.git1.fc15.x86_64       none

Description Tom London 2010-08-18 14:32:12 UTC
Created attachment 439401 [details]
Screenshot showing boot hang

Description of problem:
Boot hangs very early in the boot cycle. The attached screenshot shows the last messages:

SELinux:  Registering netfilter hooks
cryptomgr_test used greatest stack depth: 5968 bytes left
cryptomgr_test used greatest stack depth: 5944 bytes left

The system is a ThinkPad X200:

[root@tlondon ~]# lspci
00:00.0 Host bridge: Intel Corporation Mobile 4 Series Chipset Memory Controller Hub (rev 07)
00:02.0 VGA compatible controller: Intel Corporation Mobile 4 Series Chipset Integrated Graphics Controller (rev 07)
00:02.1 Display controller: Intel Corporation Mobile 4 Series Chipset Integrated Graphics Controller (rev 07)
00:03.0 Communication controller: Intel Corporation Mobile 4 Series Chipset MEI Controller (rev 07)
00:19.0 Ethernet controller: Intel Corporation 82567LM Gigabit Network Connection (rev 03)
00:1a.0 USB Controller: Intel Corporation 82801I (ICH9 Family) USB UHCI Controller #4 (rev 03)
00:1a.1 USB Controller: Intel Corporation 82801I (ICH9 Family) USB UHCI Controller #5 (rev 03)
00:1a.2 USB Controller: Intel Corporation 82801I (ICH9 Family) USB UHCI Controller #6 (rev 03)
00:1a.7 USB Controller: Intel Corporation 82801I (ICH9 Family) USB2 EHCI Controller #2 (rev 03)
00:1b.0 Audio device: Intel Corporation 82801I (ICH9 Family) HD Audio Controller (rev 03)
00:1c.0 PCI bridge: Intel Corporation 82801I (ICH9 Family) PCI Express Port 1 (rev 03)
00:1c.1 PCI bridge: Intel Corporation 82801I (ICH9 Family) PCI Express Port 2 (rev 03)
00:1c.3 PCI bridge: Intel Corporation 82801I (ICH9 Family) PCI Express Port 4 (rev 03)
00:1d.0 USB Controller: Intel Corporation 82801I (ICH9 Family) USB UHCI Controller #1 (rev 03)
00:1d.1 USB Controller: Intel Corporation 82801I (ICH9 Family) USB UHCI Controller #2 (rev 03)
00:1d.2 USB Controller: Intel Corporation 82801I (ICH9 Family) USB UHCI Controller #3 (rev 03)
00:1d.7 USB Controller: Intel Corporation 82801I (ICH9 Family) USB2 EHCI Controller #1 (rev 03)
00:1e.0 PCI bridge: Intel Corporation 82801 Mobile PCI Bridge (rev 93)
00:1f.0 ISA bridge: Intel Corporation ICH9M-E LPC Interface Controller (rev 03)
00:1f.2 SATA controller: Intel Corporation ICH9M/M-E SATA AHCI Controller (rev 03)
00:1f.3 SMBus: Intel Corporation 82801I (ICH9 Family) SMBus Controller (rev 03)
03:00.0 Network controller: Intel Corporation PRO/Wireless 5100 AGN [Shiloh] Network Connection
[root@tlondon ~]# 


Version-Release number of selected component (if applicable):
kernel-2.6.36-0.1.rc1.git0.fc15.x86_64

How reproducible:
Every boot attempt

Comment 1 Tom London 2010-08-19 04:21:00 UTC
kernel-2.6.36-0.3.rc1.git1.fc15.x86_64 fails to boot in exactly the same fashion as kernel-2.6.36-0.1.rc1.git0.fc15.x86_64.

Sorry, I have no better debug info than the screenshot already posted.

Is there some other way I can help?

Comment 2 Masao Takahashi 2010-08-19 06:19:36 UTC
I have the same problem with kernel-2.6.36-0.2.rc1.git1.fc15.i686.

Comment 3 Paul Bolle 2010-08-19 09:44:29 UTC
Created attachment 439634 [details]
log of failed boot with kernel-2.6.36-0.1.rc1.git0.fc15.i686

0) similar situation with kernel-2.6.36-0.1.rc1.git0.fc15.i686 for me, on an IBM ThinkPad T41.

1) Note from the log that the cryptomgr_test message is just the last thing printed to the screen: there's a lot more happening after that. One just never gets to see a (graphical or text) login prompt.

Comment 4 Tomasz Torcz 2010-08-19 09:52:12 UTC
Booting with initcall_debug, the output ends with "calling sha1_generic_mod_init+0x0/0x12 @ 1".
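
For anyone else collecting initcall_debug output, here is a minimal sketch of how to spot the initcall that never returned. It assumes the usual message pair printed by initcall_debug ("calling <fn> @ <pid>" followed by "initcall <fn> returned <rc> after <n> usecs"). The function name unfinished_initcalls and the first two sample lines (example_init) are made up for illustration; only the last sample line comes from this report.

```python
import re

# Patterns for the two messages initcall_debug prints around each initcall.
# search() is used so an optional "[   0.123456] " timestamp prefix is ignored.
CALLING = re.compile(r"calling\s+(\S+)\s+@\s+\d+")
RETURNED = re.compile(r"initcall\s+(\S+)\s+returned\s+-?\d+")


def unfinished_initcalls(log_lines):
    """Return initcalls that were started but have no matching 'returned' line."""
    pending = []
    for line in log_lines:
        started = CALLING.search(line)
        if started:
            pending.append(started.group(1))
            continue
        finished = RETURNED.search(line)
        if finished and finished.group(1) in pending:
            pending.remove(finished.group(1))
    return pending


# Hypothetical sample: example_init is invented, the last line is from comment 4.
sample = [
    "calling  example_init+0x0/0x10 @ 1",
    "initcall example_init+0x0/0x10 returned 0 after 0 usecs",
    "calling  sha1_generic_mod_init+0x0/0x12 @ 1",
]

print(unfinished_initcalls(sample))
```

Whatever is left in the returned list is the initcall that was running when the console went quiet, which is a hint about where the hang is, not proof of the cause.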

Comment 5 Paul Bolle 2010-08-19 10:32:53 UTC
(In reply to comment #3)
> One just never gets to see a (graphical or text) login prompt.

For what it's worth: ditto for logging in over serial line. These are the last lines printed over the serial line to the other machine:
[...]
type=2000 audit(1282213485.561:1): initialized
highmem bounce pool size: 64 pages
HugeTLB registered 4 MB page size, pre-allocated 0 pages
VFS: Disk quotas dquot_6.5.2
Dquot-cache hash table entries: 1024 (order 0, 4096 bytes)
msgmni has been set to 1680
SELinux:  Registering netfilter hooks
cryptomgr_test used greatest stack depth: 7096 bytes left

Comment 6 Paul Bolle 2010-08-19 11:42:05 UTC
(In reply to comment #3)
> Created attachment 439634 [details]
> log of failed boot with kernel-2.6.36-0.1.rc1.git0.fc15.i686
> 
> 1) Note from the log that the cryptomgr_test message is just the last thing
> printed to the screen: there's a lot more happening after that. 

Correction: the attached log is from 2.6.36-0.0.rc0.git1.fc15.i686, not kernel-2.6.36-0.1.rc1.git0.fc15.i686, so it wasn't from the latest kernel (which fails to boot) but from the last one that still works for me.

It seems the cryptomgr_test line really is the last thing the failing kernel prints. (And since the system needs to be shut down by removing power, nothing shows up in the log.)

Comment 7 atswartz 2010-08-19 15:12:59 UTC
* Wed Aug 18 2010 Chuck Ebbert <cebbert> - 2.6.36-0.3.rc1.git1
- Fix hangs on boot with some AMD processors (x86-cpu-fix-regression-in-amd-errata-checking-code.patch)
- Drop unused ssb_check_for_sprom.patch

kernel-2.6.36-0.3.rc1.git1.fc15.x86_64 & kernel-2.6.36-0.1.rc1.git0.fc15.x86_64 hang at boot for me.  kernel-2.6.36-0.0.rc0.git1.fc15.x86_64 works.  The above fix either didn't work or was not all that was wrong.

Comment 8 Jeff Layton 2010-08-19 17:47:48 UTC
Created attachment 439746 [details]
boot messages from guest serial console

The machine not booting in my case is a rawhide KVM guest. I have it set up for serial console and here are the boot messages. It freezes after the cryptomgr_test line.

Comment 9 Tom London 2010-08-20 00:10:20 UTC
kernel-2.6.36-0.4.rc1.git1.fc15.x86_64 fails to boot, seemingly in exactly the same way.....   It freezes after the cryptomgr_test line.

Comment 10 Michal Jaegermann 2010-08-20 01:04:21 UTC
Created attachment 439831 [details]
traces displayed on a screen after a failed boot

2.6.36-0.1.rc1.git0.fc15.x86_64 fails to boot for me, getting stuck at what looks like a somewhat different place. The lines printed on my screen look like this:
.....
Trying to unpack rootfs image as initramfs...
Freeing initrd memory: 4440k freed
DMA-API: preallocated 32768 debug entries
DMA-API: debugging enabled by kernel config
agpgart-amd64 0000:00:00.0: AGP bridge [1106/3188]
agpgart-amd64 0000:00:00.0: AGP aperture is 256M @ 0xe0000000
work_for_cpu used greatest stack depth: 6008 bytes left

At this point everything is stuck for a number of minutes, but not forever. If you wait long enough, the screen suddenly changes to what the attached picture shows and then freezes, apparently permanently. Obviously something scrolled off the top, but this happens in one swoop; there is no visible scrolling of any sort.

For reference, I am also attaching dmesg from a boot with 2.6.36-0.0.rc0.git1.fc15.x86_64. The next line, which never shows up with rc1, should read:

audit: initializing netlink socket (disabled)
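
The comparison above (take the last line seen on the failing boot and look up what follows it in a dmesg from a working boot) can be sketched roughly as below. This is only an illustration: next_expected and strip_timestamp are hypothetical helper names, the sample list is assembled from the lines quoted in this comment rather than read from the attachment, and matching is by prefix because the reported stack depth can differ between boots.

```python
import re

# Optional "[    1.234567] " timestamp prefix that dmesg may add.
TIMESTAMP = re.compile(r"^\[\s*\d+\.\d+\]\s*")


def strip_timestamp(line):
    """Drop a leading dmesg timestamp and trailing whitespace, if any."""
    return TIMESTAMP.sub("", line).rstrip()


def next_expected(good_dmesg_lines, last_seen_prefix):
    """Return the message that follows the first line starting with
    last_seen_prefix in the dmesg of a working boot, or None."""
    lines = [strip_timestamp(line) for line in good_dmesg_lines]
    for index, line in enumerate(lines):
        if line.startswith(last_seen_prefix) and index + 1 < len(lines):
            return lines[index + 1]
    return None


# Sample assembled from the lines quoted in this comment (assumption: the
# working boot prints them in this order).
good_boot = [
    "agpgart-amd64 0000:00:00.0: AGP aperture is 256M @ 0xe0000000",
    "work_for_cpu used greatest stack depth: 6008 bytes left",
    "audit: initializing netlink socket (disabled)",
]

print(next_expected(good_boot, "work_for_cpu used greatest stack depth"))
```

The message found this way only narrows down where the failing boot stops; it does not by itself identify the code responsible.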

Comment 11 Michal Jaegermann 2010-08-20 01:05:22 UTC
Created attachment 439832 [details]
dmesg output from booting 2.6.36-0.0.rc0.git1.fc15.x86_64

Comment 12 Yanko Kaneti 2010-08-20 05:21:07 UTC
Michal, I have the exact same symptoms and I have isolated them to the Fedora utrace patch. Perhaps someone here can test whether it also causes the cryptomgr_test hang.

Comment 13 Andre Robatino 2010-08-20 13:57:31 UTC
Same problem in a VirtualBox x86_64 guest.  The last 3 kernels hang at the cryptomgr_test line.  The last one which is bootable is 2.6.36-0.0.rc0.git1.fc15.x86_64.

Smolt URL: http://www.smolts.org/client/show/pub_7ab0c5a8-dcfb-41fc-8c85-fcd0ec5fc674

Comment 14 Kurt Driver 2010-08-21 02:28:38 UTC
This continues on my laptop even after commenting out the latest kernel. I just today upgraded from F13 to rawhide, this time stopping for a few hours at F14.

Comment 15 Michal Jaegermann 2010-08-21 04:48:49 UTC
I wonder if http://lkml.org/lkml/2010/8/17/439 and followups with a thread subject "2.6.36-rc1: Doesn't boot on an AMD-based machine" is not relevant at least to some reported troubles?

Comment 16 Chuck Ebbert 2010-08-21 07:50:16 UTC
(In reply to comment #15)
> I wonder if http://lkml.org/lkml/2010/8/17/439 and followups with a thread
> subject "2.6.36-rc1: Doesn't boot on an AMD-based machine" is not relevant at
> least to some reported troubles?

That was fixed in 2.6.36-0.3.rc1.git1

Comment 17 Chuck Ebbert 2010-08-21 08:52:30 UTC
Still hangs with the latest patches from http://people.redhat.com/roland/utrace/2.6-current/

Comment 18 Chuck Ebbert 2010-08-21 09:26:38 UTC
Confirmed, removing the utrace patches makes it boot.

Comment 19 Chuck Ebbert 2010-08-21 09:35:47 UTC
Dropped the utrace patch in 2.6.36-0.6.rc1.git3

Comment 20 Andre Robatino 2010-08-21 12:17:52 UTC
Confirmed that kernel-2.6.36-0.5.rc1.git3.fc15 from today's Rawhide push doesn't boot, but kernel-2.6.36-0.6.rc1.git3.fc15 from Koji does.
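
To summarize which builds are affected, here is a small sketch that pulls the build counter out of the package names reported in this bug and prints the failing range. The reports table is transcribed from the comments up to this point; build_number and failing_range are hypothetical helper names, and the pattern only covers the 2.6.36-0.N.rcX.gitY.fcZ naming seen here.

```python
import re

# Matches the "-0.N.rcX.gitY.fcZ" part of names such as
# kernel-2.6.36-0.6.rc1.git3.fc15.x86_64 and captures N.
RELEASE = re.compile(r"-0\.(\d+)\.rc\d+\.git\d+\.fc\d+")


def build_number(package_name):
    """Extract the build counter N from a 2.6.36-0.N.rcX.gitY.fcZ name."""
    match = RELEASE.search(package_name)
    if match is None:
        raise ValueError("unrecognized kernel package name: " + package_name)
    return int(match.group(1))


# True means "boots", False means "hangs", as reported in the comments above.
reports = {
    "kernel-2.6.36-0.0.rc0.git1.fc15.x86_64": True,
    "kernel-2.6.36-0.1.rc1.git0.fc15.x86_64": False,
    "kernel-2.6.36-0.2.rc1.git1.fc15.i686": False,
    "kernel-2.6.36-0.3.rc1.git1.fc15.x86_64": False,
    "kernel-2.6.36-0.4.rc1.git1.fc15.x86_64": False,
    "kernel-2.6.36-0.5.rc1.git3.fc15": False,
    "kernel-2.6.36-0.6.rc1.git3.fc15": True,
}


def failing_range(results):
    """Return (first, last) failing build counters, or None if all boot."""
    bad = sorted(build_number(name) for name, boots in results.items() if not boots)
    return (bad[0], bad[-1]) if bad else None


print(failing_range(reports))
```

With the reports so far this gives builds 0.1 through 0.5 as failing, which matches the window in which the utrace patch was applied according to comments 18 and 19.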

Comment 21 Tom London 2010-08-21 14:21:15 UTC
kernel-2.6.36-0.6.rc1.git3.fc15.x86_64 boots for me as well....

Comment 22 atswartz 2010-08-21 21:09:58 UTC
kernel-2.6.36-0.6.rc1.git3.fc15 boots, yet I am getting : 
kernel: CPU0: AMD Phenom(tm) 9500 Quad-Core Processor stepping 02
kernel: NMI watchdog enabled, takes one hw-pmu counter.
kernel: lockdep: fixing up alternatives.
kernel:
kernel: ===================================================
kernel: [ INFO: suspicious rcu_dereference_check() usage. ]
kernel: ---------------------------------------------------
kernel: kernel/sched.c:618 invoked rcu_dereference_check() without protection!
kernel:
kernel: other info that might help us debug this:
kernel:
kernel:
kernel: rcu_scheduler_active = 1, debug_locks = 0
kernel: 3 locks held by kworker/0:0/4:
kernel: #0:  (events){+.+.+.}, at: [<ffffffff81067e11>] process_one_work+0x160/0x2ec
kernel: #1:  ((&c_idle.work)){+.+.+.}, at: [<ffffffff81067e11>] process_one_work+0x160/0x2ec
kernel: #2:  (&rq->lock){-.....}, at: [<ffffffff81490279>] init_idle+0x30/0x131
kernel:
kernel: stack backtrace:
kernel: Pid: 4, comm: kworker/0:0 Not tainted 2.6.36-0.6.rc1.git3.fc15.x86_64 #1
kernel: Call Trace:
kernel: [<ffffffff8107d944>] lockdep_rcu_dereference+0xaa/0xb3
kernel: [<ffffffff810401ce>] task_group+0x80/0x8f
kernel: [<ffffffff810401f4>] set_task_rq+0x17/0x73
kernel: [<ffffffff81490333>] init_idle+0xea/0x131
kernel: [<ffffffff81490703>] fork_idle+0x92/0xa3
kernel: [<ffffffff8101050d>] ? sched_clock+0x9/0xd
kernel: [<ffffffff8148e17c>] do_fork_idle+0x1c/0x2d
kernel: [<ffffffff81067e6d>] process_one_work+0x1bc/0x2ec
kernel: [<ffffffff81067e11>] ? process_one_work+0x160/0x2ec
kernel: [<ffffffff8148e160>] ? do_fork_idle+0x0/0x2d
kernel: [<ffffffff81068d6c>] ? manage_workers.clone.9+0xe0/0x173
kernel: [<ffffffff81068f03>] worker_thread+0x104/0x19b
kernel: [<ffffffff81068dff>] ? worker_thread+0x0/0x19b
kernel: [<ffffffff8106c84c>] kthread+0x9d/0xa5
kernel: [<ffffffff8100aae4>] kernel_thread_helper+0x4/0x10
kernel: [<ffffffff81498150>] ? restore_args+0x0/0x30
kernel: [<ffffffff8106c7af>] ? kthread+0x0/0xa5
kernel: [<ffffffff8100aae0>] ? kernel_thread_helper+0x0/0x10
kernel: Booting Node   0, Processors  #1
kernel: NMI watchdog enabled, takes one hw-pmu counter.
kernel: lockdep: fixing up alternatives.
kernel: #2
kernel: NMI watchdog enabled, takes one hw-pmu counter.
kernel: lockdep: fixing up alternatives.
kernel: #3
kernel: NMI watchdog enabled, takes one hw-pmu counter.

Comment 23 atswartz 2010-08-21 21:31:04 UTC
(In reply to comment #22)
I guess the above information is irrelevant: looking back at my logs, I have been getting something of this sort as far back as 2.6.34-0.38.rc5.git0.fc14.x86_64.

Comment 24 Chuck Ebbert 2010-08-21 23:05:20 UTC
*** Bug 624854 has been marked as a duplicate of this bug. ***

Comment 25 Michal Jaegermann 2010-08-22 03:11:51 UTC
(In reply to comment #22)
> kernel-2.6.36-0.6.rc1.git3.fc15 boots, yet I am getting : 
...
> kernel: kernel/sched.c:618 invoked rcu_dereference_check() without protection!

That one will be bug 572520 and bug 610967 AFAICS.

Comment 26 Stephen 2010-08-22 14:05:31 UTC
I am getting a similar problem on an Acer Aspire 5810T-8952 laptop with an Intel Core 2 Solo SU3500 processor running 64-bit rawhide. When I attempt to boot any kernel after 2.6.36-0.0.rc0.git1.fc15.x86_64 I just get a black screen, the disks show no activity, and I cannot even boot into runlevel 1. No messages are written to any log files.

Comment 27 Frank Murphy 2010-08-22 14:13:59 UTC
This appears to fix it for me:
http://koji.fedoraproject.org/koji/buildinfo?buildID=191339

Anyone else?

Comment 28 Andre Robatino 2010-08-22 14:19:03 UTC
The fixed kernel-2.6.36-0.6.rc1.git3.fc15 is in today's Rawhide push as well. From "rawhide report: 20100822 changes":

kernel-2.6.36-0.6.rc1.git3.fc15
-------------------------------
* Sat Aug 21 2010 Chuck Ebbert <cebbert at redhat.com> - 2.6.36-0.6.rc1.git3
- Drop utrace patch that causes hang on boot.

Comment 29 Andre Robatino 2010-08-24 13:44:54 UTC
The last two kernels (2.6.36-0.7.rc2.git0.fc15 and 2.6.36-0.8.rc2.git0.fc15) boot as well, even though today's rawhide update says

kernel-2.6.36-0.8.rc2.git0.fc15
-------------------------------
* Mon Aug 23 2010 Roland McGrath <roland at redhat.com> - 2.6.36-0.8.rc2.git0
- utrace update

Does this mean that the cause of the hang is now understood, and this bug can be closed?