Bug 452974

Summary: [24][FOCUS] plist_add/del crash with 2.6.24.7-65ibmrt2.4 kernel
Product: Red Hat Enterprise MRG
Reporter: IBM Bug Proxy <bugproxy>
Component: realtime-kernel
Assignee: Red Hat Real Time Maintenance <rt-maint>
Status: CLOSED ERRATA
Severity: urgent
Priority: low
Version: beta
CC: bhu, lgoncalv, srostedt, williams
Target Milestone: 1.0.1   
Target Release: ---   
Hardware: x86_64   
OS: All   
Doc Type: Bug Fix
Last Closed: 2008-08-26 19:57:28 UTC
Attachments:
Description: rwlock: be more conservative in locking reader_lock_count
Flags: none

Description IBM Bug Proxy 2008-06-26 12:08:22 UTC
=Comment: #0=================================================
Sripathi Kodi <sripathi.com> - 2008-06-25 06:19 EDT
We have seen the following crashes on a JTC machine:

<4>krcupreemptd new prio is 115??
<1>Unable to handle kernel NULL pointer dereference at 0000000000000006 RIP: 
<1> [<ffffffff8113eeda>] plist_del+0x26/0x70
<4>PGD 158cc1067 PUD 158cc0067 PMD 158c7c067 PTE 0
<0>Oops: 0002 [1] PREEMPT SMP 
<4>CPU 2 
<4>Modules linked in: autofs4 hidp nfs lockd nfs_acl rfcomm l2cap bluetooth
sunrpc ipv6 dm_multipath video output sbs sbshc battery ac parport_pc lp parport
sr_mod cdrom sg k8_edac edac_core tg3 button pata_serverworks k8temp hwmon
pata_acpi serio_raw ata_generic shpchp pcspkr dm_snapshot dm_zero dm_mirror
dm_mod sata_svw libata mptspi mptscsih scsi_transport_spi mptbase sd_mod
scsi_mod ext3 jbd mbcache ehci_hcd ohci_hcd uhci_hcd
<4>Pid: 5276, comm: java Not tainted 2.6.24.7-65ibmrt2.4 #1
<4>RIP: 0010:[<ffffffff8113eeda>]  [<ffffffff8113eeda>] plist_del+0x26/0x70
<4>RSP: 0018:ffff81009f91bd98  EFLAGS: 00210086
<4>RAX: 0000000000000006 RBX: ffff81015c01a9d0 RCX: ffff81009eebbe50
<4>RDX: ffff81009eebbe58 RSI: ffff81009eeb4080 RDI: ffff810158c25be0
<4>RBP: ffff81009f91bd98 R08: ffff810158c25be8 R09: 00000000bbdf380b
<4>R10: ffff810152c575e0 R11: 000000039f91bbc8 R12: ffff81015c01a9d0
<4>R13: ffff810158c25bb8 R14: 0000000000000000 R15: ffff81015c01a9d0
<4>FS:  00002b932b3ec880(0000) GS:ffff81015faaa7c0(0063) knlGS:00000000c907eb90
<4>CS:  0010 DS: 002b ES: 002b CR0: 0000000080050033
<4>CR2: 0000000000000006 CR3: 0000000158cab000 CR4: 00000000000006e0
<4>DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
<4>DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
<4>Process java (pid: 5276, threadinfo ffff81009f91a000, task ffff81009f914040)
<4>Stack:  ffff81009f91bdd8 ffffffff8105dd9b ffff81009f91be58 ffff81015c01a9d0
<4> 0000000000200202 0000000000000000 000000008000149c ffff81015c01a9d0
<4> ffff81009f91bdf8 ffffffff81288357 ffff81015c01a9c0 ffff81015fb63cd8
<4>Call Trace:
<4> [<ffffffff8105dd9b>] wakeup_next_waiter+0x65/0x1b2
<4> [<ffffffff81288357>] rt_mutex_slowunlock+0x3b/0x59
<4> [<ffffffff81288136>] rt_mutex_unlock+0x28/0x2a
<4> [<ffffffff8105ccea>] do_futex+0x9d5/0xb42
<4> [<ffffffff8105f398>] ? rt_mutex_up_read+0x22d/0x232
<4> [<ffffffff8128ba32>] ? do_page_fault+0x3f6/0x76d
<4> [<ffffffff810317c7>] ? post_schedule_rt+0x31/0x35
<4> [<ffffffff810368a2>] ? finish_task_switch+0x4c/0xdc
<4> [<ffffffff8105d3e9>] compat_sys_futex+0xed/0x10b
<4> [<ffffffff8100f895>] ? syscall_trace_enter+0xb7/0xbb
<4> [<ffffffff81027a94>] cstar_do_call+0x1b/0x65
<4>
<4>
<0>Code: 5f c9 c3 90 90 4c 8d 47 08 4c 39 47 08 55 48 89 e5 74 45 48 8b 4f 18 48
83 e9 18 48 8d 51 08 48 8b 71 08 48 8b 42 08 48 89 46 08 <48> 89 30 49 8b 40 08
4c 89 41 08 49 89 50 08 48 89 10 48 89 42 
<1>RIP  [<ffffffff8113eeda>] plist_del+0x26/0x70

and

Unable to handle kernel paging request at 0000000000002625 RIP: 
 [<ffffffff8113efa3>] plist_add+0x7f/0xa6
PGD 14f899067 PUD 14f8e7067 PMD 13edfa067 PTE 0
Oops: 0002 [1] PREEMPT SMP 
CPU 2 
Modules linked in: autofs4 hidp nfs lockd nfs_acl rfcomm l2cap bluetooth sunrpc
ipv6 dm_multipath video output sbs sbshc battery ac parport_pc lp parport sg
sr_mod cdrom tg3 k8_edac pata_serverworks shpchp edac_core pata_acpi button
k8temp hwmon serio_raw ata_generic pcspkr dm_snapshot dm_zero dm_mirror dm_mod
sata_svw libata mptspi mptscsih scsi_transport_spi mptbase sd_mod scsi_mod ext3
jbd mbcache ehci_hcd ohci_hcd uhci_hcd
Pid: 9640, comm: java Not tainted 2.6.24.7-65ibmrt2.4 #1
RIP: 0010:[<ffffffff8113efa3>]  [<ffffffff8113efa3>] plist_add+0x7f/0xa6
RSP: 0018:ffff8100bbd97e18  EFLAGS: 00010083
RAX: ffff81015e0c75a0 RBX: ffff81009e52dba0 RCX: 0000000000002625
RDX: ffff81009e52dba8 RSI: ffff81015e0c7598 RDI: ffff81009e52dba0
RBP: ffff8100bbd97e28 R08: ffff81009acc1ba8 R09: 0000000000000000
R10: ffff8100bdfc24d8 R11: ffff8100bbd97db8 R12: ffff8100bbd92820
R13: ffff81009acc1b78 R14: ffff81009e52db78 R15: ffff8100bbd92818
FS:  00002ae7ce277480(0000) GS:ffff81015faaa7c0(0063) knlGS:00000000c4638b90
CS:  0010 DS: 002b ES: 002b CR0: 000000008005003b
CR2: 0000000000002625 CR3: 000000014f89a000 CR4: 00000000000006e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Process java (pid: 9640, threadinfo ffff8100bbd96000, task ffff8100bbd92a60)
Stack:  ffff8100bbd92040 ffff81015e0c75a8 ffff8100bbd97e68 ffffffff8105ded4
 ffff8100bbd97e48 ffff81009acc1b78 ffff81015e0c75a8 0000000000000001
 ffff81015e0c75a0 ffff81015e0c75c0 ffff8100bbd97ec8 ffffffff81288995
Call Trace:
 [<ffffffff8105ded4>] wakeup_next_waiter+0x19e/0x1b2
 [<ffffffff81288995>] rt_write_slowunlock+0x82/0x1c6
 [<ffffffff8105f166>] rt_mutex_up_write+0x69/0x6e
 [<ffffffff8105fcb7>] rt_up_write+0x9/0xb
 [<ffffffff8109a243>] sys_mprotect+0x210/0x22d
 [<ffffffff8100f895>] ? syscall_trace_enter+0xb7/0xbb
 [<ffffffff81028244>] sys32_mprotect+0x9/0xb
 [<ffffffff81027a94>] cstar_do_call+0x1b/0x65


Code: 89 c6 eb 2e 48 89 c6 48 8b 56 08 48 8d 46 08 4c 39 e0 0f 18 0a 75 dc 48 8d
46 08 48 8d 53 08 48 8b 48 08
 48 89 43 08 48 89 50 08 <48> 89 11 48 89 4a 08 48 8d 46 18 48 8d 53 18 48 8b 48
08 48 89 
RIP  [<ffffffff8113efa3>] plist_add+0x7f/0xa6
 RSP <ffff8100bbd97e18>

Hardware: e326m (rtj-opt6)
OS: RHEL5.1
RT version: Alpha18
Kernel: 2.6.24.7-65ibmrt2.4 (this is just a rebuild of RH's 2.6.24.7-65.el5rt
kernel)

One of the dumps seems to be corrupted in some way, so the crash utility is
unable to open it. I will move the other one to some other machine.

Though the problem has been reported on e326m, it does not seem to be
hardware-specific, so it can block JTC testing. This bug was
uncovered while testing for bug 44627.
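
For triage, the two oopses above can be reduced to a few fields: the fault type, the faulting address (the value also shown in CR2), and the RIP symbol with its offset. The following is a minimal Python sketch, assuming the message layout quoted in this report; the name `summarize_oops` and the regular expressions are illustrative and not part of any kernel tooling.

```python
import re

# Patterns assume the 2.6.24-era x86_64 oops layout quoted in this report.
FAULT_RE = re.compile(
    r"Unable to handle kernel (NULL pointer dereference|paging request)"
    r" at ([0-9a-f]+)"
)
SYMBOL_RE = re.compile(r"\[<[0-9a-f]+>\]\s+(\w+)\+0x([0-9a-f]+)/0x([0-9a-f]+)")


def summarize_oops(text):
    """Return fault kind, fault address and RIP symbol, or None if not an oops."""
    fault = FAULT_RE.search(text)
    if fault is None:
        return None
    # The first symbol after the fault line is the RIP location.
    rip = SYMBOL_RE.search(text, fault.end())
    if rip is None:
        return None
    return {
        "kind": fault.group(1),
        "fault_addr": int(fault.group(2), 16),
        "symbol": rip.group(1),
        "offset": int(rip.group(2), 16),
        "size": int(rip.group(3), 16),
    }


# Header lines copied from the two oopses in this report.
OOPS_DEL = """<1>Unable to handle kernel NULL pointer dereference at 0000000000000006 RIP:
<1> [<ffffffff8113eeda>] plist_del+0x26/0x70"""
OOPS_ADD = """Unable to handle kernel paging request at 0000000000002625 RIP:
 [<ffffffff8113efa3>] plist_add+0x7f/0xa6"""

for oops in (OOPS_DEL, OOPS_ADD):
    info = summarize_oops(oops)
    print(f"{info['symbol']}+{info['offset']:#x}: {info['kind']} at {info['fault_addr']:#x}")
```

Both faulting addresses (0x6 and 0x2625) are small values rather than kernel pointers, and both RIPs are in plist list manipulation. That is consistent with a corrupted plist node, although the oops text alone does not prove it.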
=Comment: #3=================================================
P. N. Stanton <pstanton.com> - 2008-06-25 07:11 EDT
rtj-opt22 is re-running the test. rtj-opt6 is reserved for Sripathi's use. This
test last ran on rtj-opt6 on June 16th on kernel 2.6.24.7-62.el5rt and it ran ok
and passed.
=Comment: #5=================================================
Sripathi Kodi <sripathi.com> - 2008-06-25 09:03 EDT
Peter, there is one patch from Steve Rostedt that is not part of -65 kernel. I
have compiled a kernel with that patch and put the kernel rpms under 
http://kernel.beaverton.ibm.com/~sripathi/45869/

Could you please use this kernel and restart tests on rtj-opt6? 
=Comment: #6=================================================
P. N. Stanton <pstanton.com> - 2008-06-25 09:55 EDT
Installed the new kernel on rtj-opt22, since Sripathi was still running stuff on
rtj-opt6 and seems to have left for the day.
Good news - the test ran and passed this time.
Bad news - on the 4th iteration the machine locked up. No messages (console or
syslog), no dump and the SysRq key didn't work.
Re-running to see if it is reproducible.
=Comment: #7=================================================
P. N. Stanton <pstanton.com> - 2008-06-25 10:39 EDT
Re-ran the tests on rtj-opt22; the machine locked up as before, on the 4th run of the test.
This time there were messages on the console:

WARNING: at lib/plist.c:104 plist_add()
Pid: 12128, comm: java Not tainted 2.6.24.7-65ibmrt2.4rwfix #1

Call Trace:
 [<ffffffff8113edb1>] plist_add+0x3d/0xa6
 [<ffffffff8105dd32>] wakeup_next_waiter+0x19e/0x1b2
 [<ffffffff8128804a>] rt_mutex_slowunlock+0x3b/0x59
 [<ffffffff81287e29>] rt_mutex_unlock+0x28/0x2a
 [<ffffffff8105cb4e>] do_futex+0x9d5/0xb42
 [<ffffffff810546f1>] ? hrtimer_nanosleep+0x6b/0xf2
 [<ffffffff81056930>] ? getnstimeofday+0x31/0x88
 [<ffffffff8105d24d>] compat_sys_futex+0xed/0x10b
 [<ffffffff8100f895>] ? syscall_trace_enter+0xb7/0xbb
 [<ffffffff81027a94>] cstar_do_call+0x1b/0x65


=Comment: #10=================================================
Darren V. Hart <dvhltc.com> - 2008-06-25 16:07 EDT
My chat with rostedt on the #Linux-rt channel.  I'll prepare a kernel, but I
don't know how to reproduce the error.  I'm happy to use rtj-opt6, but I don't
know what to run.  Can someone from the JTC fill me in?

<dvhart> hey folks, have any of you experienced a hang with 
[<ffffffff8113edb1>] plist_add+0x3d/0xa6 in the call trace?
<rostedt> dvhart: with -rt14?
<rostedt> or 25-rt7?
<dvhart> rostedt, this is on mrg -65 (plus an rwlock patch from you)
<dwalker> rostedt, yeah, there's still lots ..
<rostedt> dwalker: yeah, I see that (with a quick grep through the kernel)
<rostedt> dvhart: there's a lot of rwlock patches missing from that
<dvhart> rostedt, what would be the best way for me to pick those up and test?
<rostedt> dvhart: and the patch you have was also broken
<dvhart> heh
<dvhart> great news all around actually :-)
<rostedt> -rt14?
<dvhart> rostedt, ok I can grab the broken out patches and try and pull them in
<dvhart> rostedt, should be obvious which ones I need - i.e. they all match
*rwlock* or something?
<rostedt> dvhart: that would probably be the easiest
<dvhart> ack
<dvhart> thanks rostedt 
<rostedt> just look at the end of the queue and get all rwlock*
<rostedt> I try to append "rwlock" to all the patches that dealt with them
<rostedt> although you may need a few of the "rtmutex*" patches there too
<dvhart> ack
=Comment: #11=================================================
Darren V. Hart <dvhltc.com> - 2008-06-25 17:27 EDT
MRG test kernel -68 has been released and is available on mrg-ibm-extras repo. 
It contains all the rtmutex and rwlock fixes in question.  Can someone provide
info on how to kick off a test?
=Comment: #12=================================================
John G. Stultz <jstultz.com> - 2008-06-25 22:08 EDT
R2-RC1-pre is available here:
http://kernel.beaverton.ibm.com/jtcltc/drops/R2/RC1-pre/linux-rt-R2-RC1-pre-1109.tar.gz

This contains the -68 kernel and may fix this issue.

Comment 1 IBM Bug Proxy 2008-06-26 14:08:46 UTC
------- Comment From pstanton.com 2008-06-26 10:05 EDT-------
Machine has been running stress tests for three hours now on the -68 kernel. It
is still running ok, but there are some kernel messages in the logs:

Jun 26 12:19:53 rtj-opt22 kernel: WARNING: at kernel/rtmutex.c:1732
rt_read_slowunlock()
Jun 26 12:19:53 rtj-opt22 kernel: Pid: 6916, comm: java Not tainted
2.6.24.7-68ibmrt2.4 #1
Jun 26 12:19:53 rtj-opt22 kernel:
Jun 26 12:19:53 rtj-opt22 kernel: Call Trace:
Jun 26 12:19:53 rtj-opt22 kernel:  [<ffffffff812888ab>]
rt_read_slowunlock+0x14c/0x43e
Jun 26 12:19:53 rtj-opt22 kernel:  [<ffffffff8105f4f3>] rt_mutex_up_read+0x25b/0x260
Jun 26 12:19:53 rtj-opt22 kernel:  [<ffffffff8105fe38>] rt_up_read+0x9/0xb
Jun 26 12:19:53 rtj-opt22 kernel:  [<ffffffff8105c2d1>] futex_lock_pi+0x877/0x963
Jun 26 12:19:53 rtj-opt22 kernel:  [<ffffffff8105dffe>] ?
__rt_mutex_adjust_prio+0x11/0x24
Jun 26 12:19:53 rtj-opt22 kernel:  [<ffffffff8105e898>] ?
rt_mutex_adjust_prio+0x35/0x3e
Jun 26 12:19:53 rtj-opt22 kernel:  [<ffffffff8105cedf>] do_futex+0xb22/0xb42
Jun 26 12:19:53 rtj-opt22 kernel:  [<ffffffff8105d491>] compat_sys_futex+0xed/0x10b
Jun 26 12:19:53 rtj-opt22 kernel:  [<ffffffff8100f8c4>] ?
syscall_trace_enter+0xb7/0xbb
Jun 26 12:19:53 rtj-opt22 kernel:  [<ffffffff81027a94>] cstar_do_call+0x1b/0x65
Jun 26 12:19:53 rtj-opt22 kernel:
Jun 26 12:19:53 rtj-opt22 kernel: WARNING: at kernel/rtmutex.c:1732
rt_read_slowunlock()
Jun 26 12:19:53 rtj-opt22 kernel: Pid: 5816, comm: java Not tainted
2.6.24.7-68ibmrt2.4 #1
Jun 26 12:19:53 rtj-opt22 kernel:
Jun 26 12:19:53 rtj-opt22 kernel: Call Trace:
Jun 26 12:19:53 rtj-opt22 kernel:  [<ffffffff812888ab>]
rt_read_slowunlock+0x14c/0x43e
Jun 26 12:19:53 rtj-opt22 kernel:  [<ffffffff8105f4f3>] rt_mutex_up_read+0x25b/0x260
Jun 26 12:19:53 rtj-opt22 kernel:  [<ffffffff8105fe38>] rt_up_read+0x9/0xb
Jun 26 12:19:53 rtj-opt22 kernel:  [<ffffffff8105c2d1>] futex_lock_pi+0x877/0x963
Jun 26 12:19:53 rtj-opt22 kernel:  [<ffffffff8101e279>] ?
smp_send_reschedule+0x1d/0x1f
Jun 26 12:19:53 rtj-opt22 kernel:  [<ffffffff8105dffe>] ?
__rt_mutex_adjust_prio+0x11/0x24
Jun 26 12:19:53 rtj-opt22 kernel:  [<ffffffff8105e898>] ?
rt_mutex_adjust_prio+0x35/0x3e
Jun 26 12:19:53 rtj-opt22 kernel:  [<ffffffff8105cedf>] do_futex+0xb22/0xb42
Jun 26 12:19:53 rtj-opt22 kernel:  [<ffffffff8105e898>] ?
rt_mutex_adjust_prio+0x35/0x3e
Jun 26 12:19:53 rtj-opt22 kernel:  [<ffffffff81288da4>] ?
rt_write_slowunlock+0x207/0x216
Jun 26 12:19:53 rtj-opt22 kernel:  [<ffffffff8105d491>] compat_sys_futex+0xed/0x10b
Jun 26 12:19:53 rtj-opt22 kernel:  [<ffffffff8100f8c4>] ?
syscall_trace_enter+0xb7/0xbb
Jun 26 12:19:53 rtj-opt22 kernel:  [<ffffffff81027a94>] cstar_do_call+0x1b/0x65

Comment 2 IBM Bug Proxy 2008-06-30 09:08:35 UTC
------- Comment From pstanton.com 2008-06-30 05:01 EDT-------
We have had two machines on the -68 kernel running stress tests over the
weekend with no interruptions, so I'd say this is no longer a blocker for us.
(The -65 kernel with the rwlock patch locked up after about 15 minutes.)
However, the kernel messages in comment 18 have appeared several times on both
machines.

Comment 3 IBM Bug Proxy 2008-07-01 01:49:05 UTC
------- Comment From jstultz.com 2008-06-30 21:42 EDT-------
I managed to trigger two of these warning messages with our release testing
runs. So we've got a second reproducer (although not yet sure where in the 8-16
hours it happened).

Comment 4 Steven Rostedt 2008-07-02 15:17:24 UTC
Release 69 has several rwlock fixes that may address this issue. Can you see if
this is solved in that release?

Comment 5 IBM Bug Proxy 2008-07-02 21:24:31 UTC
------- Comment From dvhltc.com 2008-07-02 17:22 EDT-------
Currently running our release-testing on -69.  Results in < 48 hours.

Comment 6 IBM Bug Proxy 2008-07-03 01:56:30 UTC
------- Comment From jstultz.com 2008-07-02 21:49 EDT-------
So I ran calibrate tests overnight with my lock_count fix and didn't see any
warnings. I've sent the patch on to lkml. I'm not sure if it actually affects
the warning the JTC is seeing, but if Darren's -69 testing still shows it, then
I'll try to merge it in to R2.

Comment 7 IBM Bug Proxy 2008-07-03 16:00:47 UTC
------- Comment From dvhltc.com 2008-07-03 11:57 EDT-------
No messages to the console during a complete release-testing.sh run on -69.

Comment 8 IBM Bug Proxy 2008-07-08 02:24:32 UTC
------- Comment From jstultz.com 2008-07-07 22:23 EDT-------
Peter: Have you picked up the -69 kernel for light testing to see if this issue
is resolved?

Comment 9 IBM Bug Proxy 2008-07-08 06:16:26 UTC
------- Comment From paul_thwaite.com 2008-07-08 02:14 EDT-------
Hi John, No we have not moved to -69 yet.  On the LTC/JTC call I understood a
few more checks were required before the LTC would recommend the JTC move up to -69?

Based on comment #31 I am assuming -69 is good and we will upgrade machines to
confirm the messages have disappeared.

Comment 10 IBM Bug Proxy 2008-07-08 16:08:40 UTC
------- Comment From jstultz.com 2008-07-08 12:06 EDT-------
(In reply to comment #32)
> Hi John, No we have not moved to -69 yet.  On the LTC/JTC call I understood a
> few more checks were required before the LTC would recommend the JTC move up
> to -69?
>
> Based on comment #31 I am assuming -69 is good and we will upgrade machines to
> confirm the messages have disappeared.

Yea, please upgrade just the machines seeing the issues.

We'll have a rc2 release at the end of the month that will contain -69+other
fixes that will be ready for all the JTC R2 test machines.

The RPM can be found here:
http://kernel.beaverton.ibm.com/jtcltc/yum/mrg-ibm-extras/x86_64/kernel-rt-2.6.24.7-69.el5rt.x86_64.rpm

(although this kernel is missing the SAN bits, but I don't think it affects you)

Comment 11 IBM Bug Proxy 2008-07-10 15:50:40 UTC
------- Comment From pstanton.com 2008-07-10 11:44 EDT-------
We have five machines on -69 running one-hour realtime Java stress tests.
All have displayed the message below several times:
WARNING: at kernel/rtmutex.c:1732 rt_read_slowunlock()
Pid: 27140, comm: java Not tainted 2.6.24.7-69.el5rt #1

Call Trace:
[<ffffffff81288823>] rt_read_slowunlock+0x14c/0x43e
[<ffffffff8105f4f7>] rt_mutex_up_read+0x25b/0x260
[<ffffffff8105fe3c>] rt_up_read+0x9/0xb
[<ffffffff8105c2d5>] futex_lock_pi+0x877/0x963
[<ffffffff8105f6f5>] ? rt_read_slowlock+0x7b/0x341
[<ffffffff8105e89c>] ? rt_mutex_adjust_prio+0x35/0x3e
[<ffffffff81288aeb>] ? rt_read_slowunlock+0x414/0x43e
[<ffffffff8105cee3>] do_futex+0xb22/0xb42
[<ffffffff8100a8e5>] ? __switch_to+0x291/0x2a0
[<ffffffff810368ba>] ? finish_task_switch+0x4c/0xdc
[<ffffffff8105d495>] compat_sys_futex+0xed/0x10b
[<ffffffff8100f8c4>] ? syscall_trace_enter+0xb7/0xbb
[<ffffffff81027a94>] cstar_do_call+0x1b/0x65

I've seen this once on rtj-opt22.hursley.ibm.com and on rtj-opt42.hursley.ibm.com:

WARNING: at kernel/rtmutex.c:1732 rt_read_slowunlock()
Pid: 2459, comm: java Not tainted 2.6.24.7-69.el5rt #1

Call Trace:
[<ffffffff8105ad2d>] ? cmpxchg_futex_value_locked+0x52/0x5e
[<ffffffff81288823>] rt_read_slowunlock+0x14c/0x43e
[<ffffffff8105ba0b>] ? fixup_pi_state_owner+0x174/0x1c7
[<ffffffff8105f4f7>] rt_mutex_up_read+0x25b/0x260
[<ffffffff8105fe3c>] rt_up_read+0x9/0xb
[<ffffffff8105c2d5>] futex_lock_pi+0x877/0x963
[<ffffffff8105e89c>] ? rt_mutex_adjust_prio+0x35/0x3e
[<ffffffff81083577>] ? cpupri_find+0x39/0x8a
[<ffffffff81031120>] ? pick_next_highest_task_rt+0xd4/0x157
[<ffffffff810313f3>] ? find_lowest_rq+0x74/0x129
[<ffffffff8105cee3>] do_futex+0xb22/0xb42
[<ffffffff810316ce>] ? push_rt_tasks+0x14/0x1c
[<ffffffff8100a8e5>] ? __switch_to+0x291/0x2a0
[<ffffffff810368ba>] ? finish_task_switch+0x4c/0xdc
[<ffffffff8105d495>] compat_sys_futex+0xed/0x10b
[<ffffffff8100f8c4>] ? syscall_trace_enter+0xb7/0xbb
[<ffffffff81027a94>] cstar_do_call+0x1b/0x65

This one is also from rtj-opt42:
WARNING: at kernel/rtmutex.c:1732 rt_read_slowunlock()
Pid: 28385, comm: java Not tainted 2.6.24.7-69.el5rt #1

Call Trace:
[<ffffffff81288823>] rt_read_slowunlock+0x14c/0x43e
[<ffffffff8105f4f7>] rt_mutex_up_read+0x25b/0x260
[<ffffffff8105fe3c>] rt_up_read+0x9/0xb
[<ffffffff8105b2d9>] futex_wait+0x339/0x34d
[<ffffffff8105fe3c>] ? rt_up_read+0x9/0xb
[<ffffffff8113bbb6>] ? vsnprintf+0x55f/0x5a5
[<ffffffff8113bd48>] ? snprintf+0x59/0x5b
[<ffffffff81033428>] ? default_wake_function+0x0/0x14
[<ffffffff8105c446>] do_futex+0x85/0xb42
[<ffffffff8102f648>] ? update_curr_rt+0x64/0x66
[<ffffffff8102f6b2>] ? put_prev_task_rt+0xd/0x1b
[<ffffffff8105d495>] compat_sys_futex+0xed/0x10b
[<ffffffff8100f8c4>] ? syscall_trace_enter+0xb7/0xbb
[<ffffffff81027910>] sysenter_do_call+0x1b/0x67
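
The traces above differ in their lower frames, so it can help to compare only the frames the kernel considers reliable. In these dumps a `?` before a symbol marks an address that was found on the stack but could not be confirmed as part of the call chain. The following is a rough Python sketch under that assumption; `reliable_frames` and `common_prefix` are hypothetical helper names, and the sample data is an excerpt of two of the traces in this comment.

```python
import re

# Matches "[<addr>] sym+0xoff/0xsize", with an optional "?" before the symbol.
FRAME_RE = re.compile(r"\[<[0-9a-f]+>\]\s+(\?\s+)?(\w+)\+0x[0-9a-f]+/0x[0-9a-f]+")


def reliable_frames(trace):
    """Return the symbols of frames not marked '?', in call-trace order."""
    return [m.group(2) for m in FRAME_RE.finditer(trace) if m.group(1) is None]


def common_prefix(frame_lists):
    """Return the leading frames shared by every trace."""
    prefix = []
    for frames in zip(*frame_lists):
        if len(set(frames)) != 1:
            break
        prefix.append(frames[0])
    return prefix


# Excerpts of two traces from this comment (futex_lock_pi and futex_wait paths).
TRACE_LOCK_PI = """
[<ffffffff8105ad2d>] ? cmpxchg_futex_value_locked+0x52/0x5e
[<ffffffff81288823>] rt_read_slowunlock+0x14c/0x43e
[<ffffffff8105ba0b>] ? fixup_pi_state_owner+0x174/0x1c7
[<ffffffff8105f4f7>] rt_mutex_up_read+0x25b/0x260
[<ffffffff8105fe3c>] rt_up_read+0x9/0xb
[<ffffffff8105c2d5>] futex_lock_pi+0x877/0x963
"""
TRACE_WAIT = """
[<ffffffff81288823>] rt_read_slowunlock+0x14c/0x43e
[<ffffffff8105f4f7>] rt_mutex_up_read+0x25b/0x260
[<ffffffff8105fe3c>] rt_up_read+0x9/0xb
[<ffffffff8105b2d9>] futex_wait+0x339/0x34d
"""

shared = common_prefix([reliable_frames(TRACE_LOCK_PI), reliable_frames(TRACE_WAIT)])
print(" <- ".join(shared))
```

For these two excerpts the shared frames are rt_read_slowunlock, rt_mutex_up_read and rt_up_read, which suggests the warning belongs to the rwlock read-unlock path rather than to one particular futex operation.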

Comment 12 IBM Bug Proxy 2008-07-10 16:40:52 UTC
------- Comment From jstultz.com 2008-07-10 12:31 EDT-------
I'm going to talk with Will today and try to get the Oak test running on a box
so I can reproduce this issue more easily. Then I'll see if the atomic_t change
I have pending helps this or not.

Comment 13 IBM Bug Proxy 2008-07-10 23:30:39 UTC
------- Comment From jstultz.com 2008-07-10 19:20 EDT-------
So I've managed to reproduce this with the internal oak test. I tested with the
recently released -72 kernel, and I still see the issue. I'll dig down a bit
further on this tomorrow.

Also changed the summary to reflect that this bug has gone from a crash to just
a warning.

Comment 14 IBM Bug Proxy 2008-07-12 01:40:34 UTC
------- Comment From jstultz.com 2008-07-11 21:30 EDT-------
So I dug in a bit on this one, and the warning is tripping due to the
"list_empty(&rls->list)" portion of the WARN_ON. Looking at how rls is pulled from
owned_read_locks, I scanned through the file to see how it was protected.

It seems the owned_read_locks[] values (count,list) are protected by the
current->pi_lock, however there are a number of places where they are
manipulated without that lock clearly being held.

I've added locks around most of the manipulations I found, but I'm still seeing
the warnings. I need to dig more on this next week, and I'll also ping Steven about it.
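
The locking rule described in this comment can be illustrated with a user-space toy model. This is a sketch in Python, not the kernel code: only `pi_lock`, `owned_read_locks`, the `count` and `list` fields and the `list_empty(&rls->list)` check come from the comment; the classes, the `on_list` flag, `stress` and the lock id are hypothetical.

```python
import threading


class ReaderLockRecord:
    """Toy stand-in for one owned_read_locks[] entry (count, list)."""

    def __init__(self):
        self.count = 0
        self.on_list = False  # models !list_empty(&rls->list)


class Task:
    """Toy task: every access to owned_read_locks happens under pi_lock."""

    def __init__(self):
        self.pi_lock = threading.Lock()
        self.owned_read_locks = {}  # lock id -> ReaderLockRecord
        self.warnings = 0  # how often the WARN_ON-like check fired

    def read_lock(self, lock_id):
        with self.pi_lock:
            rls = self.owned_read_locks.setdefault(lock_id, ReaderLockRecord())
            rls.count += 1
            rls.on_list = True

    def read_unlock(self, lock_id):
        with self.pi_lock:
            rls = self.owned_read_locks.get(lock_id)
            if rls is None or not rls.on_list:
                self.warnings += 1  # record missing or already unlinked
                return
            rls.count -= 1
            if rls.count == 0:
                rls.on_list = False
                del self.owned_read_locks[lock_id]


def stress(task, lock_id, threads=4, rounds=1000):
    """Hammer one task's records from several threads; return warning count."""

    def worker():
        for _ in range(rounds):
            task.read_lock(lock_id)
            task.read_unlock(lock_id)

    pool = [threading.Thread(target=worker) for _ in range(threads)]
    for t in pool:
        t.start()
    for t in pool:
        t.join()
    return task.warnings
```

Because every update happens under `pi_lock`, `stress(Task(), "lock_a")` returns 0 in this model. The model only shows why an unprotected manipulation could leave a record in the state the WARN_ON checks for; it says nothing about which kernel code path was actually at fault.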

Comment 15 IBM Bug Proxy 2008-07-16 00:30:39 UTC
Created attachment 311900 [details]
rwlock: be more conservative in locking reader_lock_count

Talked with Steven about this issue and ran some logdev patches for him. He
then sent this patch which appears to fix the warning issue!

Comment 16 IBM Bug Proxy 2008-07-30 03:10:54 UTC
------- Comment From jstultz.com 2008-07-29 23:01 EDT-------
This has been included in the -74 errata.
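
To check whether a given test machine already runs a release with the fix, the release number can be pulled out of the kernel version string. The following is a small Python sketch; `mrg_release` and `has_rwlock_fix` are hypothetical helpers, and the comparison assumes that releases at or after -74 keep the patch, which may not hold for custom rebuilds.

```python
import re

# Release number that carries the fix, per this comment ("-74 errata").
FIXED_RELEASE = 74


def mrg_release(version):
    """Extract the release number from strings such as '2.6.24.7-69.el5rt'."""
    match = re.match(r"\d+(?:\.\d+)*-(\d+)", version)
    if match is None:
        raise ValueError(f"unrecognized kernel version: {version!r}")
    return int(match.group(1))


def has_rwlock_fix(version):
    """Assumption: releases at or after FIXED_RELEASE keep the patch."""
    return mrg_release(version) >= FIXED_RELEASE


# Version strings that appear earlier in this report.
for version in ("2.6.24.7-65ibmrt2.4", "2.6.24.7-69.el5rt"):
    print(version, has_rwlock_fix(version))
```

Both versions listed here predate -74, so the check reports False for each. On a live machine the string could come from `platform.release()` or `uname -r`.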

Comment 21 errata-xmlrpc 2008-08-26 19:57:28 UTC
An advisory has been issued which should help resolve the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2008-0585.html