LTC Owner is: jstultz.com
LTC Originator is: sudhanshusingh.com

Problem description:
The llm49 machine is hanging, with oops messages, on the RHEL5-RT kernel.

Describe any custom patches installed:
RT patches to RHEL5, glibc patches

Provide output from "uname -a", if possible:
$ uname -a
Linux llm49.in.ibm.com 2.6.21-14ibm #1 SMP PREEMPT RT Thu May 31 21:18:32 CDT 2007 x86_64 x86_64 x86_64 GNU/Linux

Hardware Environment:
LS20 machine

Please provide access information for the machine if it is available:
llm49.in.ibm.com

Did the system produce an OOPS message on the console? If so, copy it here:
===========================================
Code: 48 8b 56 08 0f 18 0a 48 8d 46 08 4c 39 e0 75 dc 48 8d 46 08
RIP  [<ffffffff81154bb0>] plist_add+0x5b/0xa6
 RSP <ffff81003059fc78>
CR2: 0000000000000000
note: ps[3136] exited with preempt_count 2
BUG: scheduling with irqs disabled: ps/0x00000002/3136
caller is rt_spin_lock_slowlock+0xfe/0x1a1

Call Trace:
 [<ffffffff8106d5da>] dump_trace+0xaa/0x32a
 [<ffffffff8106d89b>] show_trace+0x41/0x5c
 [<ffffffff8106d8cb>] dump_stack+0x15/0x17
 [<ffffffff8106468c>] schedule+0x82/0x102
 [<ffffffff8106566a>] rt_spin_lock_slowlock+0xfe/0x1a1
 [<ffffffff81065f46>] rt_spin_lock+0x1f/0x21
 [<ffffffff8103af6d>] exit_mmap+0x4f/0x13f
 [<ffffffff8103d4a1>] mmput+0x2d/0x9e
 [<ffffffff81042b36>] exit_mm+0x11a/0x122
 [<ffffffff8101570b>] do_exit+0x234/0x894
 [<ffffffff81068c21>] do_page_fault+0x785/0x813
 [<ffffffff81066add>] error_exit+0x0/0x84
 [<ffffffff81154bb0>] plist_add+0x5b/0xa6
 [<ffffffff810ad05d>] task_blocks_on_rt_mutex+0x153/0x1bf
 [<ffffffff81065896>] rt_mutex_slowlock+0x189/0x2a2
 [<ffffffff8106556a>] rt_mutex_lock+0x28/0x2a
 [<ffffffff810ad2d4>] __rt_down_read+0x47/0x4b
 [<ffffffff810ad2ee>] rt_down_read+0xb/0xd
 [<ffffffff810d0268>] access_process_vm+0x46/0x174
 [<ffffffff81109561>] proc_pid_cmdline+0x6e/0xfb
 [<ffffffff8110a570>] proc_info_read+0x62/0xca
 [<ffffffff8100b1fc>] vfs_read+0xcc/0x155
 [<ffffffff81011ab0>] sys_read+0x47/0x6f
 [<ffffffff8105f29e>] tracesys+0xdc/0xe1
 [<000000326bebfa10>]

BUG: scheduling while atomic: ps/0x00000002/3136, CPU#1

Call Trace:
 [<ffffffff8106d5da>] dump_trace+0xaa/0x32a
 [<ffffffff8106d89b>] show_trace+0x41/0x5c
 [<ffffffff8106d8cb>] dump_stack+0x15/0x17
 [<ffffffff810636b8>] __sched_text_start+0x98/0xd20
 [<ffffffff810646ec>] schedule+0xe2/0x102
 [<ffffffff8106566a>] rt_spin_lock_slowlock+0xfe/0x1a1
 [<ffffffff81065f46>] rt_spin_lock+0x1f/0x21
 [<ffffffff8103af6d>] exit_mmap+0x4f/0x13f
 [<ffffffff8103d4a1>] mmput+0x2d/0x9e
 [<ffffffff81042b36>] exit_mm+0x11a/0x122
 [<ffffffff8101570b>] do_exit+0x234/0x894
 [<ffffffff81068c21>] do_page_fault+0x785/0x813
 [<ffffffff81066add>] error_exit+0x0/0x84
 [<ffffffff81154bb0>] plist_add+0x5b/0xa6
 [<ffffffff810ad05d>] task_blocks_on_rt_mutex+0x153/0x1bf
 [<ffffffff81065896>] rt_mutex_slowlock+0x189/0x2a2
 [<ffffffff8106556a>] rt_mutex_lock+0x28/0x2a
 [<ffffffff810ad2d4>] __rt_down_read+0x47/0x4b
 [<ffffffff810ad2ee>] rt_down_read+0xb/0xd
 [<ffffffff810d0268>] access_process_vm+0x46/0x174
 [<ffffffff81109561>] proc_pid_cmdline+0x6e/0xfb
 [<ffffffff8110a570>] proc_info_read+0x62/0xca
 [<ffffffff8100b1fc>] vfs_read+0xcc/0x155
 [<ffffffff81011ab0>] sys_read+0x47/0x6f
 [<ffffffff8105f29e>] tracesys+0xdc/0xe1
 [<000000326bebfa10>]
====================================================

Additional information:
Any idea what tests were running when this occurred?
glibc was the default RHEL5 one: glibc-2.5-12.
Sudhanshu was running release-testing.sh on this machine. At the time of failure, I think it was running kernbench.
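As a side note on where the RIP above lands: plist_add() inserts a waiter into a priority-ordered list. The sketch below is an illustration only, not the kernel's lib/plist.c (the real plist is built on circular doubly-linked list_heads with no NULL sentinels, so a zeroed or corrupted link in its insertion walk gets dereferenced directly, which would match the fault at address 0 / CR2: 0000000000000000 inside plist_add). It just shows the general shape of such a sorted insert; all names in it are made up for the example.

    #include <stdio.h>

    /* Minimal priority-ordered singly linked list, for illustration only. */
    struct pnode {
            int prio;               /* lower value == higher priority */
            struct pnode *next;
    };

    /* Insert 'n' so the list stays sorted with the highest priority first. */
    static void prio_insert(struct pnode **head, struct pnode *n)
    {
            struct pnode **pos = head;

            /* Walk past nodes of equal or higher priority. */
            while (*pos && (*pos)->prio <= n->prio)
                    pos = &(*pos)->next;
            n->next = *pos;
            *pos = n;
    }

    int main(void)
    {
            struct pnode a = { 10, NULL }, b = { 5, NULL }, c = { 20, NULL };
            struct pnode *head = NULL, *p;

            prio_insert(&head, &a);
            prio_insert(&head, &b);
            prio_insert(&head, &c);
            for (p = head; p; p = p->next)
                    printf("prio %d\n", p->prio);   /* prints 5, 10, 20 */
            return 0;
    }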
----- Additional Comments From ankigarg.com (prefers email at ankita.com) 2007-06-07 08:43 EDT -------
I am not able to reproduce this BUG. I ran most of our tests on the system.

While going through the code, I came across this small data point: rt_mutex_slowlock => task_blocks_on_rt_mutex does the following:

    spin_lock(&current->pi_lock);                            (1)
    __rt_mutex_adjust_prio(current);
    waiter->task = current;
    waiter->lock = lock;
    plist_node_init(&waiter->list_entry, current->prio);
    plist_node_init(&waiter->pi_list_entry, current->prio);

    /* Get the top priority waiter on the lock */
    if (rt_mutex_has_waiters(lock))
            top_waiter = rt_mutex_top_waiter(lock);
    plist_add(&waiter->list_entry, &lock->wait_list);

    current->pi_blocked_on = waiter;                         (2)
    spin_unlock(&current->pi_lock);                          (3)

Is there a reason why we continue to hold current->pi_lock [taken at (1)] after __rt_mutex_adjust_prio() has run [the unlock only happens at (3)]? Instead, we could call rt_mutex_adjust_prio(), which takes the lock around just that operation, and re-take pi_lock only for statement (2). That would ensure we hold no extra locks while calling plist_add, which shows up in a lot of these BUG messages. This might not be a solution to this problem, but I thought I would note it as a brain dump.
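For reference, here is a minimal sketch of the reordering proposed above. This is an illustration only, not a tested patch: it reuses the names from the quoted snippet and assumes, as the comment says, that rt_mutex_adjust_prio() takes and drops the task's pi_lock internally.

    /*
     * Hypothetical restructuring -- illustration only.
     * rt_mutex_adjust_prio() is assumed to acquire and release
     * current->pi_lock around the priority adjustment itself.
     */
    rt_mutex_adjust_prio(current);

    waiter->task = current;
    waiter->lock = lock;
    plist_node_init(&waiter->list_entry, current->prio);
    plist_node_init(&waiter->pi_list_entry, current->prio);

    /* Get the top priority waiter on the lock */
    if (rt_mutex_has_waiters(lock))
            top_waiter = rt_mutex_top_waiter(lock);
    plist_add(&waiter->list_entry, &lock->wait_list);

    /* Re-take pi_lock just long enough to publish pi_blocked_on */
    spin_lock(&current->pi_lock);
    current->pi_blocked_on = waiter;
    spin_unlock(&current->pi_lock);

The reply below explains why this reordering is not safe: the priority sampled for the plist nodes has to remain the task's real priority until the waiter is actually enqueued.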
No, we need to keep pi_lock across the whole section: after adjust_prio() we initialize the waiter with the current priority. pi_lock makes sure that nothing changes this before we add ourselves to the wait list of the lock. The wait list is priority ordered, and we do not want the real priority to change while we enqueue the waiter.
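To spell out the window that holding pi_lock across the whole section closes, here is a hedged annotation of the snippet quoted earlier; the interleaving described in the comments is hypothetical.

    spin_lock(&current->pi_lock);
    __rt_mutex_adjust_prio(current);        /* current->prio now reflects any PI boost */

    plist_node_init(&waiter->list_entry, current->prio);    /* priority sampled here */
    plist_node_init(&waiter->pi_list_entry, current->prio);

    /*
     * Hypothetical hazard if pi_lock were dropped at this point: another
     * CPU boosting (or deboosting) this task could change current->prio
     * before the next line runs, so the waiter would be enqueued into the
     * priority-ordered wait_list at a stale priority.
     */
    plist_add(&waiter->list_entry, &lock->wait_list);

    current->pi_blocked_on = waiter;
    spin_unlock(&current->pi_lock);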
changed:
           What       |Removed                  |Added
----------------------------------------------------------------------------
           Status     |ASSIGNED                 |NEEDINFO

------- Additional Comments From skannery.com 2007-06-18 01:05 EDT -------
The problem did not reproduce during release testing on the 2.6.21-14ibm2 kernel. Since the Java team is carrying out some more tests, I am moving this to NEEDINFO while we wait to hear from them.
----- Additional Comments From jstultz.com (prefers email at johnstul.com) 2007-06-25 20:04 EDT -------
As also reported in bug #35201, I ran multiple overnight runs of kernbench and recalibrate on an LS20 with the -23ibm3 and -31 kernels and have seen no such problem.
changed:
           What       |Removed                  |Added
----------------------------------------------------------------------------
           Status     |OPEN                     |REJECTED
           Resolution |                         |UNREPRODUCIBLE

------- Additional Comments From jstultz.com (prefers email at johnstul.com) 2007-07-02 14:36 EDT -------
This still has not been triggered since -23ibm3. The showmem/softirq fixes that landed at that time likely resolved this. Please reopen if it recurs.