Description of problem:
While testing the rttracer kernel the system ibm-wildhorse-01.rhts.boston.redhat.com had a kernel panic.

Version-Release number of selected component (if applicable):
2.6.21-39.el5rttrace

How reproducible:
Often

Steps to Reproduce:
1. Install RHEL5.1 tree RHEL5.1-Server-20070920.1 x86_64. Then install the current rttrace kernel.
2. Reboot several times.

Actual results:
Unable to handle kernel NULL pointer dereference at 0000000000000040 RIP:
 [<ffffffff802abdd1>] wakeup_next_waiter+0x35/0x19c
PGD 0
Oops: 0000 [1] PREEMPT SMP
CPU 2
Modules linked in:
Pid: 1, comm: swapper Not tainted 2.6.21-39.el5rttrace #1
RIP: 0010:[<ffffffff802abdd1>] [<ffffffff802abdd1>] wakeup_next_waiter+0x35/0x19c
RSP: 0000:ffff8100067c7d20 EFLAGS: 00010097
RAX: 0000000000000002 RBX: ffff810003d7d000 RCX: ffffffff8026667a
RDX: 0000000000000000 RSI: 0000000000000001 RDI: ffffffff802abdc8
RBP: ffff8100067c7d50 R08: 0000000000000000 R09: ffffffff80a5f380
R10: ffff81013fc92040 R11: ffff81013fcf9940 R12: ffff810003d7d000
R13: ffffffffffffffe8 R14: ffff81007ff57e00 R15: 0000000000000001
FS: 0000000000000000(0000) GS:ffff81013fcf9940(0000) knlGS:0000000000000000
CS: 0010 DS: 0018 ES: 0018 CR0: 000000008005003b
CR2: 0000000000000040 CR3: 0000000000201000 CR4: 00000000000006e0
Process swapper (pid: 1, threadinfo ffff8100067c6000, task ffff81013fc92040)
Stack: ffff81007ff57e00 ffff810003d7d000 0000000000000207 00000000000000d0
 ffff81007ff57e00 0000000000000001 ffff8100067c7d70 ffffffff8026580f
 ffff81013eda90c0 ffff81000679b080 ffff8100067c7d80 ffffffff802662dd
Call Trace:
 [<ffffffff8026580f>] rt_spin_lock_slowunlock+0x3e/0x5c
 [<ffffffff802662dd>] rt_spin_unlock+0x28/0x2a
 [<ffffffff8020a81d>] kmem_cache_alloc+0xd1/0xe2
 [<ffffffff803a877d>] con_insert_unipair+0x40/0xda
 [<ffffffff803a8b27>] con_set_default_unimap+0xbc/0x131
 [<ffffffff809abab0>] console_map_init+0x31/0x43
 [<ffffffff809abc58>] vty_init+0xf3/0xf7
 [<ffffffff809ab6bb>] tty_init+0x1c1/0x1c5
 [<ffffffff80990a03>] init+0x1c3/0x425
 [<ffffffff802601d8>] child_rip+0xa/0x12

Code: 4d 39 65 58 74 04 0f 0b eb fe 49 8d 74 24 08 4c 89 ef e8 a2
RIP [<ffffffff802abdd1>] wakeup_next_waiter+0x35/0x19c
 RSP <ffff8100067c7d20>
CR2: 0000000000000040
Kernel panic - not syncing: Attempted to kill init!

Call Trace:
 [<ffffffff8026dad8>] dump_trace+0xaa/0x32a
 [<ffffffff8026dd99>] show_trace+0x41/0x64
 [<ffffffff8026ddd1>] dump_stack+0x15/0x17
 [<ffffffff80292b69>] panic+0xaf/0x16e

Expected results:
System should not panic on normal boot.

Additional info:
The URL is a link to a test kernel that this was originally seen on, but it was reproduced with the standard rttrace kernel.
This is a nasty bug, and it happens to be fixed upstream. The cause was that alternate_node_alloc used its own this_cpu variable. kmem_cache_alloc would grab the per_cpu slab lock using its own this_cpu and then call alternate_node_alloc. That function would then pass its own this_cpu to cache_grow, which unlocks and re-locks the per_cpu slab lock. If we happened to change CPUs while this was going on, we would be locking and unlocking the wrong locks. I'll attach a patch to fix this.
Created attachment 269261 [details]
fix alternate_node_alloc this_cpu

Patch to replace the local this_cpu in alternate_node_alloc with a cpu pointer that is passed in. This allows the proper slab locks to be locked and unlocked.
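For anyone following along, here is a rough userspace model of the failure and of the approach the patch takes. It is only a sketch, not the kernel code or the attached patch: every name ending in _sim, the lock bookkeeping, and run_alloc are made up for illustration, and it assumes cache_grow re-takes the lock of whatever CPU it ends up on and writes that CPU back through the pointer it was given.

```c
#include <stdbool.h>

#define NR_CPUS_SIM 4

static bool slab_locked[NR_CPUS_SIM]; /* true while that CPU's slab lock is held */
static int lock_errors;               /* unlocks of a lock that is not held, or double locks */
static int current_cpu;               /* CPU the simulated task is running on */

static void slab_lock_sim(int cpu)
{
    if (slab_locked[cpu])
        lock_errors++;
    slab_locked[cpu] = true;
}

static void slab_unlock_sim(int cpu)
{
    if (!slab_locked[cpu])
        lock_errors++;
    slab_locked[cpu] = false;
}

/* Assumed model of cache_grow: drop the lock, possibly migrate, then take
 * the lock of the CPU we are on now and report it through *this_cpu. */
static void cache_grow_sim(int *this_cpu, int migrate_to)
{
    slab_unlock_sim(*this_cpu);
    current_cpu = migrate_to;          /* task moved while the lock was dropped */
    *this_cpu = current_cpu;
    slab_lock_sim(*this_cpu);
}

/* Buggy shape: a private this_cpu, so the caller never sees the new CPU. */
static void alternate_node_alloc_buggy_sim(int migrate_to)
{
    int this_cpu = current_cpu;
    cache_grow_sim(&this_cpu, migrate_to);
}

/* Fixed shape: operate on the caller's this_cpu through a pointer. */
static void alternate_node_alloc_fixed_sim(int *this_cpu, int migrate_to)
{
    cache_grow_sim(this_cpu, migrate_to);
}

/* Model of the kmem_cache_alloc path. Returns the number of lock errors
 * plus the number of locks left held; 0 means locking stayed balanced. */
static int run_alloc(bool fixed, int start_cpu, int migrate_to)
{
    int this_cpu, cpu, problems;

    for (cpu = 0; cpu < NR_CPUS_SIM; cpu++)
        slab_locked[cpu] = false;
    lock_errors = 0;
    current_cpu = start_cpu;

    this_cpu = current_cpu;
    slab_lock_sim(this_cpu);
    if (fixed)
        alternate_node_alloc_fixed_sim(&this_cpu, migrate_to);
    else
        alternate_node_alloc_buggy_sim(migrate_to);
    slab_unlock_sim(this_cpu);         /* stale CPU number in the buggy case */

    problems = lock_errors;
    for (cpu = 0; cpu < NR_CPUS_SIM; cpu++)
        if (slab_locked[cpu])
            problems++;
    return problems;
}
```

In this model run_alloc(false, 0, 2) ends up unlocking CPU 0's lock, which is no longer held, and leaves CPU 2's lock held, while run_alloc(true, 0, 2) stays balanced. Without a migration both variants behave the same, which fits the bug only showing up on some boots.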
Fixed in 2.6.21-53