|Summary:||BUG: scheduling with irqs disabled: strace/0x00000000/2011|
|Product:||Red Hat Enterprise MRG||Reporter:||IBM Bug Proxy <bugproxy>|
|Component:||realtime-kernel||Assignee:||Arnaldo Carvalho de Melo <acme>|
|Status:||CLOSED NEXTRELEASE||QA Contact:|
|Fixed In Version:||Doc Type:||Bug Fix|
|Doc Text:||Story Points:||---|
|Last Closed:||2007-05-22 01:38:15 UTC||Type:||---|
Description IBM Bug Proxy 2007-04-23 12:14:46 UTC
LTC Owner is: email@example.com
LTC Originator is: firstname.lastname@example.org

I was running strace on pthread_cond_many testcase when I saw a number of the following BUG messages in dmesg. I have not yet tested this with default (non-RT) RHEL5.

BUG: scheduling with irqs disabled: strace/0x00000000/2011
caller is rt_spin_lock_slowlock+0x102/0x1af

Call Trace:
 [<ffffffff8026d828>] dump_trace+0xbd/0x3d8
 [<ffffffff8026db87>] show_trace+0x44/0x6d
 [<ffffffff8026ddc8>] dump_stack+0x13/0x15
 [<ffffffff80264dc6>] schedule+0x87/0x10b
 [<ffffffff80265b06>] rt_spin_lock_slowlock+0x102/0x1af
 [<ffffffff802661af>] rt_spin_lock+0x1f/0x21
 [<ffffffff8029af0c>] force_sig_info+0x26/0xb5
 [<ffffffff8029b018>] force_sig_specific+0x11/0x13
 [<ffffffff80298659>] ptrace_attach+0xdf/0x10b
 [<ffffffff802986d7>] sys_ptrace+0x52/0xb8
 [<ffffffff8025f42c>] tracesys+0x151/0x1be
 [<00000034ecec71c9>]
---------------------------
| preempt count: 00000000 ]
| 0-level deep critical section nesting:
----------------------------------------

Kernel: 2.6.20-0119.rt8
glibc: glibc-2.5-12
Hardware: LS21
Kernel cmdline: ro root=LABEL=/1 rhgb quiet acpi=noirq

Recreation steps:
Start pthread_cond_many testcase. From another terminal, attach strace to pthread_cond_many process. Very soon, BUGs appear in dmesg.

I think I know the root cause of this problem. I'll post details soon.

In ptrace_attach, this is what happens:

    task_lock
    local_irq_disable
    write_lock(tasklist_lock)    Using trylocks.
    Some work
    __ptrace_link
    Send SIGSTOP to target thread
    write_unlock_irq(tasklist_lock)
    task_unlock

local_irq_disable + write_lock will work as write_lock_irq and write_unlock_irq will re-enable interrupts. However, on -rt, write_unlock_irq doesn't do local_irq_enable. But since we have explicitly called local_irq_disable, interrupts remain blocked!

To fix the problem, I think we should call write_unlock(tasklist_lock) and local_irq_enable() instead of write_unlock_irq. Also, we should call them BEFORE sending SIGSTOP to the target thread. I think there is no need to hold the tasklist lock during sending of SIGSTOP. For the vanilla kernel too, I think we should do write_unlock_irq(tasklist_lock) before sending SIGSTOP.

The following patch solves the problem on 2.6.20-rt8. I want to send this to LKML/Ingo soon. Does anyone have comments?

--- linux-2.6.20.x86_64_org/kernel/ptrace.c 2007-04-19 18:19:37.000000000 +0530
+++ linux-2.6.20.x86_64/kernel/ptrace.c 2007-04-19 16:43:32.000000000 +0530
@@ -205,10 +205,16 @@ repeat:
 
 	__ptrace_link(task, current);
 
+	write_unlock(&tasklist_lock);
+	local_irq_enable();
+
 	force_sig_specific(SIGSTOP, task);
+	goto out2;
 
 bad:
-	write_unlock_irq(&tasklist_lock);
+	write_unlock(&tasklist_lock);
+	local_irq_enable();
+out2:
 	task_unlock(task);
 out:
 	return retval;

What if some other process is reading the task_list at the time you are sending it the stop signal? Will the code in force_sig_specific take care of that by its own locking?

Sripathi, thanks for clarifying it offline!

I have posted this to LKML/Ingo: http://lkml.org/lkml/2007/04/20/41
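The interrupt-state leak described above can be illustrated with a small userspace model. This is only a sketch and not kernel code: a single boolean stands in for the CPU interrupt flag, and every model_* and attach_* name is hypothetical. The one behaviour taken from the report is that write_unlock_irq re-enables interrupts on mainline but not on -rt.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model: one flag stands in for the CPU interrupt-enable state. */
static bool irqs_enabled = true;

static void model_local_irq_disable(void) { irqs_enabled = false; }
static void model_local_irq_enable(void)  { irqs_enabled = true; }

/* Plain unlock: never touches the interrupt flag in this model. */
static void model_write_unlock(void) { }

/*
 * Unlock with "irq enable". Per the report, mainline re-enables
 * interrupts here, while on -rt the flag is left untouched.
 */
static void model_write_unlock_irq(bool rt)
{
    model_write_unlock();
    if (!rt)
        model_local_irq_enable();
}

/* Original exit path. Returns true if interrupts end up enabled. */
static bool attach_original(bool rt)
{
    irqs_enabled = true;
    model_local_irq_disable();   /* explicit disable before the trylock */
    assert(!irqs_enabled);       /* locked section runs with irqs off */
    /* ... write_trylock, __ptrace_link, send SIGSTOP ... */
    model_write_unlock_irq(rt);
    return irqs_enabled;
}

/* Proposed exit path: plain unlock followed by an explicit enable. */
static bool attach_patched(bool rt)
{
    (void)rt;                    /* outcome is the same on both flavours */
    irqs_enabled = true;
    model_local_irq_disable();
    assert(!irqs_enabled);
    /* ... write_trylock, __ptrace_link ... */
    model_write_unlock();
    model_local_irq_enable();
    /* ... SIGSTOP is sent here, with interrupts enabled again ... */
    return irqs_enabled;
}
```

In this model attach_original(true) returns false, which corresponds to the state in which the later rt_spin_lock in force_sig_info may schedule with interrupts disabled, while attach_patched() returns true for both flavours.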
Comment 1 IBM Bug Proxy 2007-05-11 14:55:35 UTC
----- Additional Comments From email@example.com (prefers email at firstname.lastname@example.org) 2007-05-11 10:49 EDT ------- I got no reply from Ingo/anyone else about my earlier mail (Apr 20). Hence I tried to fix it in another way by introducing write_trylock_irqsave API in mainline and -rt. Mainline patches are at http://lkml.org/lkml/2007/05/09/76 and http://lkml.org/lkml/2007/05/09/79 . -rt patches are at http://lkml.org/lkml/2007/05/10/47 and http://lkml.org/lkml/2007/05/10/48. The mainline patches have been accepted into -mm. I am awaiting response for -rt patches.
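The comment does not show how write_trylock_irqsave is implemented, so the following is only a guess at its semantics, inferred from the name and from the kernel's existing spin_trylock_irqsave pattern: save and disable interrupts, try the lock, and restore the saved state if the lock was not obtained. It is a userspace toy model; the struct and all model_* names are hypothetical and are not the kernel API.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy stand-ins for the interrupt flag and a writer lock (not kernel types). */
static bool irqs_enabled = true;

struct model_rwlock {
    bool write_held;
};

/* Non-blocking writer acquire. Returns true on success. */
static bool model_write_trylock(struct model_rwlock *lock)
{
    if (lock->write_held)
        return false;
    lock->write_held = true;
    return true;
}

/*
 * Save the interrupt state, disable interrupts, then try the lock.
 * On failure the saved state is restored, so the caller never ends
 * up with interrupts disabled and no lock held.
 */
static bool model_write_trylock_irqsave(struct model_rwlock *lock, bool *flags)
{
    *flags = irqs_enabled;
    irqs_enabled = false;
    if (model_write_trylock(lock))
        return true;
    irqs_enabled = *flags;
    return false;
}

/* Release the lock and put the interrupt flag back to the saved state. */
static void model_write_unlock_irqrestore(struct model_rwlock *lock, bool flags)
{
    assert(lock->write_held);    /* only a held lock may be released */
    lock->write_held = false;
    irqs_enabled = flags;
}
```

Pairing the acquire with an irqrestore-style release means the caller no longer needs a separate local_irq_disable, so the interrupt state on exit does not depend on what write_unlock_irq does on a given kernel flavour.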
Comment 2 Arnaldo Carvalho de Melo 2007-05-15 15:11:55 UTC
Unable to reproduce this with 2.6.21-4.el5rtdebug and 2.6.21-3.el5rt. Checked kernel/ptrace.c, it doesn't have your patches. I'm using http://www.kernel.org/pub/linux/kernel/people/dvhart/realtime/tests/tests.tar.bz2 with './run.sh all', wait for the "./pthread_cond_many --broadcast 400 5000" processes to start, ran strace on them, no BUG messages. Machine is a Dell PowerEdge 1950 with two dual-core Xeon processors. Will try now with the same kernel as you used (2.6.20-0119.rt8).
Comment 3 Arnaldo Carvalho de Melo 2007-05-15 15:49:11 UTC
Tried with http://people.redhat.com/mingo/realtime-preempt/yum/x86_64/kernel-rt-2.6.20-0119.rt8.x86_64.rpm:

[root@mica ~]# uname -r
2.6.20-0119.rt8

And couldn't reproduce with it either. I'm running it now with this patch:

[root@mica latency]# diff -u pthread_cond_many.sh.orig pthread_cond_many.sh
--- pthread_cond_many.sh.orig 2007-05-15 12:47:11.000000000 -0300
+++ pthread_cond_many.sh 2007-05-15 12:48:27.000000000 -0300
@@ -9,11 +9,11 @@
 nproc=5
 
 i=0
-./pthread_cond_many $1 --broadcast $iter $nthread > 2100.$i.out &
+strace -f ./pthread_cond_many $1 --broadcast $iter $nthread > 2100.$i.out 2> /dev/null &
 i=1
 while test $i -lt $nproc
 do
-	./pthread_cond_many --broadcast $iter $nthread > 2100.$i.out &
+	strace -f ./pthread_cond_many --broadcast $iter $nthread > 2100.$i.out 2> /dev/null &
 	i=`expr $i + 1`
 done
 wait
[root@mica latency]# pwd
/home/acme/rt/IBM/rtlinux-tests/perf/latency
[root@mica latency]#

and running it like this:

[root@mica latency]# pwd
/home/acme/rt/IBM/rtlinux-tests/perf/latency
[root@mica latency]# ./pthread_cond_many.sh --realtime
Comment 4 IBM Bug Proxy 2007-05-16 05:50:33 UTC
------- Additional Comments From email@example.com (prefers email at firstname.lastname@example.org) 2007-05-16 01:44 EDT -------
(In reply to comment #13)
> ----- Additional Comments From email@example.com 2007-05-15 11:11 EST -------
> Unable to reproduce this with 2.6.21-4.el5rtdebug and 2.6.21-3.el5rt. Checked
> kernel/ptrace.c, it doesn't have your patches. I'm using
> http://www.kernel.org/pub/linux/kernel/people/dvhart/realtime/tests/tests.tar.bz2
> with './run.sh all', wait for the "./pthread_cond_many --broadcast 400 5000"
> processes to start, ran strace on them, no BUG messages. Machine is a Dell
> PowerEdge 1950 with to dual core Xeon processors. Will try now with the same
> kernel as you used (2.6.20-0119.rt8).

I tried exactly the same just now and reproduced the problem on 2.6.21-2.el5rt. I pulled down the tests from kernel.org, started the tests by hand using "./pthread_cond_many --broadcast 400 5000" and ran "strace -f -v -o strace.out <pid of first pthread_cond_many process>" and immediately I see a bunch of BUGs in dmesg. My hardware is LS20 blade, but I don't think the problem is hardware dependent.
Comment 5 Arnaldo Carvalho de Melo 2007-05-16 12:11:48 UTC
Tried now with 2.6.21-4.el5rt using exactly the same sequence described in your latest entry in this ticket: got the BUGs. Will now apply your patches to the rt kernel rpm and retest. Strange: the only difference from my test is running ./pthread_cond_many directly instead of through the shell script. Anyway, reproduced; rebuilding the rpm with your patches, thanks.
Comment 6 Arnaldo Carvalho de Melo 2007-05-16 19:40:24 UTC
Did it, the BUGs are over and from my perspective the patches are OK. Will talk with Steven Rostedt for a second opinion, then ask Clark to put those patches in our 2.6.21-rt kernel rpms and ask Ingo to consider them for upstream rt-preempt acceptance, thanks!
Comment 7 Arnaldo Carvalho de Melo 2007-05-22 01:38:15 UTC
Patches were merged, at least the 2.6.21-rt6 patch has it, thanks a lot for submitting them! It is already merged in the internal repo for kernel-rt and should be included in the 2.6.21-11.el5rt kernel-rt rpm release.
Comment 8 IBM Bug Proxy 2007-05-24 16:40:23 UTC
----- Additional Comments From firstname.lastname@example.org (prefers email at email@example.com) 2007-05-24 12:37 EDT ------- I have tested this with 2.6.21-14.el5rt kernel (which I believe contains Ingo's patch-2.6.21-rt7) and the problem is no longer seen. strace does not produce any BUGs now. Thanks! -Sripathi.