Bug 242865 - scheduling with irqs disabled (related to plist_add?)
Status: CLOSED NOTABUG
Product: Red Hat Enterprise MRG
Classification: Red Hat
Component: realtime-kernel
Version: 1.0
Hardware: All
OS: Linux
Priority: low
Severity: medium
Assigned To: Steven Rostedt
Reported: 2007-06-06 01:32 EDT by IBM Bug Proxy
Modified: 2008-02-27 14:57 EST

Last Closed: 2007-07-19 15:47:42 EDT




External Trackers: IBM Linux Technology Center 35202

Description IBM Bug Proxy 2007-06-06 01:32:24 EDT
LTC Owner is: jstultz@us.ibm.com
LTC Originator is: sudhanshusingh@in.ibm.com


Problem description:
The llm49 machine is hanging with oops messages on the RHEL5-RT kernel.

Describe any custom patches installed.
RT patches to RHEL5
glibc patches


Provide output from "uname -a", if possible:
$uname -a
Linux llm49.in.ibm.com 2.6.21-14ibm #1 SMP PREEMPT RT Thu May 31 21:18:32 CDT
2007 x86_64 x86_64 x86_64 GNU/Linux


Hardware Environment
LS 20 machine


Please provide access information for the machine if it is available.
llm49.in.ibm.com

Did the system produce an OOPS message on the console?
    If so, copy it here:

===========================================
Code: 48 8b 56 08 0f 18 0a 48 8d 46 08 4c 39 e0 75 dc 48 8d 46 08
RIP  [<ffffffff81154bb0>] plist_add+0x5b/0xa6
 RSP <ffff81003059fc78>
CR2: 0000000000000000
note: ps[3136] exited with preempt_count 2
BUG: scheduling with irqs disabled: ps/0x00000002/3136
caller is rt_spin_lock_slowlock+0xfe/0x1a1

Call Trace:
 [<ffffffff8106d5da>] dump_trace+0xaa/0x32a
 [<ffffffff8106d89b>] show_trace+0x41/0x5c
 [<ffffffff8106d8cb>] dump_stack+0x15/0x17
 [<ffffffff8106468c>] schedule+0x82/0x102
 [<ffffffff8106566a>] rt_spin_lock_slowlock+0xfe/0x1a1
 [<ffffffff81065f46>] rt_spin_lock+0x1f/0x21
 [<ffffffff8103af6d>] exit_mmap+0x4f/0x13f
 [<ffffffff8103d4a1>] mmput+0x2d/0x9e
 [<ffffffff81042b36>] exit_mm+0x11a/0x122
 [<ffffffff8101570b>] do_exit+0x234/0x894
 [<ffffffff81068c21>] do_page_fault+0x785/0x813
 [<ffffffff81066add>] error_exit+0x0/0x84
 [<ffffffff81154bb0>] plist_add+0x5b/0xa6
 [<ffffffff810ad05d>] task_blocks_on_rt_mutex+0x153/0x1bf
 [<ffffffff81065896>] rt_mutex_slowlock+0x189/0x2a2
 [<ffffffff8106556a>] rt_mutex_lock+0x28/0x2a
 [<ffffffff810ad2d4>] __rt_down_read+0x47/0x4b
 [<ffffffff810ad2ee>] rt_down_read+0xb/0xd
 [<ffffffff810d0268>] access_process_vm+0x46/0x174
 [<ffffffff81109561>] proc_pid_cmdline+0x6e/0xfb
 [<ffffffff8110a570>] proc_info_read+0x62/0xca
 [<ffffffff8100b1fc>] vfs_read+0xcc/0x155
 [<ffffffff81011ab0>] sys_read+0x47/0x6f
 [<ffffffff8105f29e>] tracesys+0xdc/0xe1
 [<000000326bebfa10>]

BUG: scheduling while atomic: ps/0x00000002/3136, CPU#1
Call Trace:
 [<ffffffff8106d5da>] dump_trace+0xaa/0x32a
 [<ffffffff8106d89b>] show_trace+0x41/0x5c
 [<ffffffff8106d8cb>] dump_stack+0x15/0x17
 [<ffffffff810636b8>] __sched_text_start+0x98/0xd20
 [<ffffffff810646ec>] schedule+0xe2/0x102
 [<ffffffff8106566a>] rt_spin_lock_slowlock+0xfe/0x1a1
 [<ffffffff81065f46>] rt_spin_lock+0x1f/0x21
 [<ffffffff8103af6d>] exit_mmap+0x4f/0x13f
 [<ffffffff8103d4a1>] mmput+0x2d/0x9e
 [<ffffffff81042b36>] exit_mm+0x11a/0x122
 [<ffffffff8101570b>] do_exit+0x234/0x894
 [<ffffffff81068c21>] do_page_fault+0x785/0x813
 [<ffffffff81066add>] error_exit+0x0/0x84
 [<ffffffff81154bb0>] plist_add+0x5b/0xa6
 [<ffffffff810ad05d>] task_blocks_on_rt_mutex+0x153/0x1bf
 [<ffffffff81065896>] rt_mutex_slowlock+0x189/0x2a2
 [<ffffffff8106556a>] rt_mutex_lock+0x28/0x2a
 [<ffffffff810ad2d4>] __rt_down_read+0x47/0x4b
 [<ffffffff810ad2ee>] rt_down_read+0xb/0xd
 [<ffffffff810d0268>] access_process_vm+0x46/0x174
 [<ffffffff81109561>] proc_pid_cmdline+0x6e/0xfb
 [<ffffffff8110a570>] proc_info_read+0x62/0xca
 [<ffffffff8100b1fc>] vfs_read+0xcc/0x155
 [<ffffffff81011ab0>] sys_read+0x47/0x6f
 [<ffffffff8105f29e>] tracesys+0xdc/0xe1
 [<000000326bebfa10>]
====================================================



Additional information:

Any idea what tests were running when this occurred?

glibc was the default RHEL5 version: glibc-2.5-12.
Sudhanshu was running release-testing.sh on this machine. At the time of the
failure, I think it was running kernbench.
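
For context, a minimal userspace sketch (illustration only, not the kernel's
plist.h implementation) of what plist_add() does: it walks the priority-ordered
wait list to find the insertion point, so the oops above (RIP in plist_add with
CR2: 0000000000000000) suggests a corrupted or NULL list pointer was followed
during that walk.

#include <stdio.h>

/* Simplified stand-in for the kernel's priority-sorted list ("plist").
 * Lower prio value means higher priority, matching the kernel convention. */
struct pnode {
        int prio;
        struct pnode *next;
};

/* Insert in ascending prio order by walking the existing nodes; the real
 * plist_add() performs a similar walk over a doubly-linked prio list. */
static void pnode_add(struct pnode **head, struct pnode *node)
{
        while (*head && (*head)->prio <= node->prio)
                head = &(*head)->next;
        node->next = *head;
        *head = node;
}

int main(void)
{
        struct pnode a = { .prio = 90 }, b = { .prio = 10 }, c = { .prio = 50 };
        struct pnode *head = NULL;

        pnode_add(&head, &a);
        pnode_add(&head, &b);
        pnode_add(&head, &c);

        for (struct pnode *p = head; p; p = p->next)
                printf("prio %d\n", p->prio);   /* prints 10, 50, 90 */
        return 0;
}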
Comment 1 IBM Bug Proxy 2007-06-07 08:46:14 EDT
----- Additional Comments From ankigarg@in.ibm.com (prefers email at ankita@in.ibm.com)  2007-06-07 08:43 EDT -------
I am not able to reproduce this BUG; I ran most of our tests on the system. While
going through the code, I came across this small data point -

rt_mutex_slowlock => task_blocks_on_rt_mutex does the following:

        spin_lock(&current->pi_lock);                        (1)
        __rt_mutex_adjust_prio(current);
        waiter->task = current;
        waiter->lock = lock;
        plist_node_init(&waiter->list_entry, current->prio);
        plist_node_init(&waiter->pi_list_entry, current->prio);

        /* Get the top priority waiter on the lock */
        if (rt_mutex_has_waiters(lock))
                top_waiter = rt_mutex_top_waiter(lock);
        plist_add(&waiter->list_entry, &lock->wait_list);

        current->pi_blocked_on = waiter;                      (2)

        spin_unlock(&current->pi_lock);                       (3)

Is there a reason why we continue to hold current->pi_lock [taken at stmt (1)] after
__rt_mutex_adjust_prio() [the unlock is done at stmt (3)]? Instead, we could call
rt_mutex_adjust_prio(), which does the locking for only that operation, and re-take
the lock just around stmt (2). That would ensure we hold no extra locks while calling
plist_add, which has shown up in a lot of these BUG messages. This might not be a
solution to this problem, but I thought I would share it as a brain dump.
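
For illustration, a minimal sketch of the restructuring proposed above, mirroring
the excerpt rather than the actual kernel source; rt_mutex_adjust_prio() is the
wrapper that takes and drops current->pi_lock internally. As comment 2 below
explains, this ordering would be racy:

        rt_mutex_adjust_prio(current);   /* takes/releases current->pi_lock itself */

        waiter->task = current;
        waiter->lock = lock;
        plist_node_init(&waiter->list_entry, current->prio);
        plist_node_init(&waiter->pi_list_entry, current->prio);

        /* Get the top priority waiter on the lock */
        if (rt_mutex_has_waiters(lock))
                top_waiter = rt_mutex_top_waiter(lock);
        plist_add(&waiter->list_entry, &lock->wait_list);   /* no pi_lock held here */

        spin_lock(&current->pi_lock);
        current->pi_blocked_on = waiter;
        spin_unlock(&current->pi_lock);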
Comment 2 Thomas Gleixner 2007-06-07 16:28:55 EDT
No, we need to keep pi_lock across the whole section:

After adjust_prio() we initialize the waiter with the current priority. pi_lock makes
sure that nothing changes this before we add ourselves to the wait list of the lock.
The wait list is priority ordered, and we do not want the real priority to change
while we enqueue the waiter.
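
A hypothetical interleaving (illustration only) of the race this prevents: if
current->pi_lock were dropped between the priority snapshot and plist_add(), a
boost from another CPU could change the task's priority in between, and the waiter
would be enqueued at a stale position in the priority-ordered wait list:

/*
 *   CPU0 (blocking task, pi_lock dropped)    CPU1 (priority booster)
 *   -------------------------------------    -----------------------
 *   plist_node_init(&waiter->list_entry,
 *                   current->prio);          // snapshot: prio == 50
 *                                            spin_lock(&task->pi_lock);
 *                                            __rt_mutex_adjust_prio(task);
 *                                            // task->prio is now 10
 *                                            spin_unlock(&task->pi_lock);
 *   plist_add(&waiter->list_entry,
 *             &lock->wait_list);             // enqueued at the stale prio 50,
 *                                            // so the wait list ordering no
 *                                            // longer matches the real priority
 */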

Comment 3 IBM Bug Proxy 2007-06-18 01:10:32 EDT
changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|ASSIGNED                    |NEEDINFO




------- Additional Comments From skannery@in.ibm.com  2007-06-18 01:05 EDT -------
The problem did not reproduce during the release testing on the 2.6.21-14ibm2
kernel. Since the Java Team is carrying out some more tests, I am moving this into
the NEEDINFO state and waiting to hear back from them.
Comment 4 IBM Bug Proxy 2007-06-25 20:10:17 EDT
----- Additional Comments From jstultz@us.ibm.com (prefers email at johnstul@us.ibm.com)  2007-06-25 20:04 EDT -------
As also reported in bug #35201, I ran multiple overnight runs of kernbench and
recalibrate on an LS20 with the -23ibm3 and -31 kernels and have seen no such problem.
Comment 5 IBM Bug Proxy 2007-07-02 14:42:08 EDT
changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|OPEN                        |REJECTED
         Resolution|                            |UNREPRODUCIBLE




------- Additional Comments From jstultz@us.ibm.com (prefers email at johnstul@us.ibm.com)  2007-07-02 14:36 EDT -------
This still has not been triggered since -23ibm3. The showmem/softirq fixes that
landed at that time likely resolved it. Please reopen if it recurs.
