318261 – [RHEL5 RT] kernel BUG at kernel/rtmutex.c:659!

Summary: [RHEL5 RT] kernel BUG at kernel/rtmutex.c:659!

Keywords:
Status:	CLOSED CURRENTRELEASE
Alias:	None
Product:	Red Hat Enterprise MRG
Classification:	Red Hat
Component:	realtime-kernel
Sub Component:
Version:	1.0
Hardware:	x86_64
OS:	Linux
Priority:	high
Severity:	low
Target Milestone:	---
Target Release:	---
Assignee:	Steven Rostedt
QA Contact:
Docs Contact:
URL:	http://rhts.lab.boston.redhat.com/cgi...
Whiteboard:
Depends On:
Blocks:

Reported:	2007-10-04 13:30 UTC by Jeff Burke
Modified:	2008-02-27 19:58 UTC
CC List:	0 users
Fixed In Version:	2.6.21-53
Doc Type:	Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed:	2007-11-30 16:06:16 UTC
Target Upstream Version:
Embargoed:

Attachments
fix alternate_node_alloc this_cpu (1.88 KB, patch) 2007-11-26 20:20 UTC, Steven Rostedt, no flags

Description Jeff Burke 2007-10-04 13:30:47 UTC

Description of problem:
 While testing the rttracer kernel the system
ibm-wildhorse-01.rhts.boston.redhat.com had a kernel panic.

Version-Release number of selected component (if applicable):
 2.6.21-39.el5rttrace

How reproducible:
 Often

Steps to Reproduce:
1. Install RHEL5.1 tree RHEL5.1-Server-20070920.1 x86_64. Then install the
current rttrace kernel.
2. Reboot several times.
  
Actual results:
Unable to handle kernel NULL pointer dereference at 0000000000000040 RIP: 
 [<ffffffff802abdd1>] wakeup_next_waiter+0x35/0x19c
PGD 0 
Oops: 0000 [1] PREEMPT SMP 
CPU 2 
Modules linked in:
Pid: 1, comm: swapper Not tainted 2.6.21-39.el5rttrace #1
RIP: 0010:[<ffffffff802abdd1>]  [<ffffffff802abdd1>] wakeup_next_waiter+0x35/0x19c
RSP: 0000:ffff8100067c7d20  EFLAGS: 00010097
RAX: 0000000000000002 RBX: ffff810003d7d000 RCX: ffffffff8026667a
RDX: 0000000000000000 RSI: 0000000000000001 RDI: ffffffff802abdc8
RBP: ffff8100067c7d50 R08: 0000000000000000 R09: ffffffff80a5f380
R10: ffff81013fc92040 R11: ffff81013fcf9940 R12: ffff810003d7d000
R13: ffffffffffffffe8 R14: ffff81007ff57e00 R15: 0000000000000001
FS:  0000000000000000(0000) GS:ffff81013fcf9940(0000) knlGS:0000000000000000
CS:  0010 DS: 0018 ES: 0018 CR0: 000000008005003b
CR2: 0000000000000040 CR3: 0000000000201000 CR4: 00000000000006e0
Process swapper (pid: 1, threadinfo ffff8100067c6000, task ffff81013fc92040)
Stack:  ffff81007ff57e00 ffff810003d7d000 0000000000000207 00000000000000d0
 ffff81007ff57e00 0000000000000001 ffff8100067c7d70 ffffffff8026580f
 ffff81013eda90c0 ffff81000679b080 ffff8100067c7d80 ffffffff802662dd
Call Trace:
 [<ffffffff8026580f>] rt_spin_lock_slowunlock+0x3e/0x5c
 [<ffffffff802662dd>] rt_spin_unlock+0x28/0x2a
 [<ffffffff8020a81d>] kmem_cache_alloc+0xd1/0xe2
 [<ffffffff803a877d>] con_insert_unipair+0x40/0xda
 [<ffffffff803a8b27>] con_set_default_unimap+0xbc/0x131
 [<ffffffff809abab0>] console_map_init+0x31/0x43
 [<ffffffff809abc58>] vty_init+0xf3/0xf7
 [<ffffffff809ab6bb>] tty_init+0x1c1/0x1c5
 [<ffffffff80990a03>] init+0x1c3/0x425
 [<ffffffff802601d8>] child_rip+0xa/0x12


Code: 4d 39 65 58 74 04 0f 0b eb fe 49 8d 74 24 08 4c 89 ef e8 a2 
RIP  [<ffffffff802abdd1>] wakeup_next_waiter+0x35/0x19c
 RSP <ffff8100067c7d20>
CR2: 0000000000000040
Kernel panic - not syncing: Attempted to kill init!

Call Trace:
 [<ffffffff8026dad8>] dump_trace+0xaa/0x32a
 [<ffffffff8026dd99>] show_trace+0x41/0x64
 [<ffffffff8026ddd1>] dump_stack+0x15/0x17
 [<ffffffff80292b69>] panic+0xaf/0x16e

Expected results:
 The system should not panic on a normal boot.

Additional info:
 The URL is a link to a test kernel that this was originally seen on but it was
reproduced with the standard rttrace kernel.

Comment 1 Steven Rostedt 2007-11-26 20:12:24 UTC

This is a nasty bug, and happens to be fixed upstream.

The cause of this bug is that alternate_node_alloc used its own this_cpu
variable. kmem_cache_alloc grabs the per_cpu slab lock with its own
this_cpu and then calls alternate_node_alloc, which passes its own local
this_cpu to cache_grow. cache_grow unlocks and relocks the per_cpu slab lock
through that variable, so if the task happens to change CPUs while this
happens, the wrong locks get locked and unlocked.
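
The mismatch described above can be illustrated with a small userspace model.
This is only a sketch under stated assumptions, not the kernel code or the
attached patch: the lock array, current_cpu, the migrate_to parameter, the
_buggy/_fixed suffixes and alloc_path are hypothetical names made up for the
illustration, and cache_grow is reduced to "unlock, possibly migrate, relock".

```c
#include <assert.h>

#define NR_CPUS 4

/* Hypothetical model of the per-CPU slab locks: hold count per CPU. */
static int slab_lock[NR_CPUS];

/* Hypothetical stand-in for the CPU the task is currently running on. */
static int current_cpu;

static void lock_cpu(int cpu)   { slab_lock[cpu]++; }
static void unlock_cpu(int cpu) { slab_lock[cpu]--; }

/*
 * Simplified cache_grow: drops the lock named by *this_cpu, may migrate
 * (modelled by migrate_to), then locks the new CPU and reports it back
 * through the pointer.
 */
static void cache_grow(int *this_cpu, int migrate_to)
{
    unlock_cpu(*this_cpu);
    current_cpu = migrate_to;
    *this_cpu = current_cpu;
    lock_cpu(*this_cpu);
}

/* Buggy shape: a private this_cpu, so the caller never sees the update. */
static void alternate_node_alloc_buggy(int migrate_to)
{
    int this_cpu = current_cpu;
    cache_grow(&this_cpu, migrate_to);
}

/* Fixed shape: the caller's this_cpu is passed in by pointer. */
static void alternate_node_alloc_fixed(int *this_cpu, int migrate_to)
{
    cache_grow(this_cpu, migrate_to);
}

/* Returns 1 if every per-CPU lock ends up released, 0 otherwise. */
static int alloc_path(int use_fix, int migrate_to)
{
    int cpu, this_cpu;

    assert(migrate_to >= 0 && migrate_to < NR_CPUS);
    for (cpu = 0; cpu < NR_CPUS; cpu++)
        slab_lock[cpu] = 0;
    current_cpu = 0;

    this_cpu = current_cpu;     /* the caller's own this_cpu */
    lock_cpu(this_cpu);
    if (use_fix)
        alternate_node_alloc_fixed(&this_cpu, migrate_to);
    else
        alternate_node_alloc_buggy(migrate_to);
    unlock_cpu(this_cpu);       /* unlocks whatever the caller believes */

    for (cpu = 0; cpu < NR_CPUS; cpu++)
        if (slab_lock[cpu] != 0)
            return 0;
    return 1;
}
```

In this model alloc_path(0, 1) ends with CPU 0's lock released twice and CPU
1's lock still held, while alloc_path(1, 1) leaves every lock released. When
no migration happens (alloc_path(0, 0)) the buggy shape is harmless, which
would be consistent with the panic not showing up on every boot.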

I'll attach a patch to fix this.

Comment 2 Steven Rostedt 2007-11-26 20:20:49 UTC

Created attachment 269261
fix alternate_node_alloc this_cpu

Patch to replace the local this_cpu in alternate_node_alloc with a cpu pointer
that is passed in. This allows the proper slab locks to be locked and
unlocked.

Comment 3 Clark Williams 2007-11-30 16:06:16 UTC

Fixed in 2.6.21-53