Red Hat Bugzilla – Bug 452974
[24][FOCUS] plist_add/del crash with 2.6.24.7-65ibmrt2.4 kernel
Last modified: 2008-08-26 15:57:28 EDT
=Comment: #0================================================= Sripathi Kodi <sripathi@in.ibm.com> - 2008-06-25 06:19 EDT
We have seen the following crashes on a JTC machine:

<4>krcupreemptd new prio is 115??
<1>Unable to handle kernel NULL pointer dereference at 0000000000000006 RIP:
<1> [<ffffffff8113eeda>] plist_del+0x26/0x70
<4>PGD 158cc1067 PUD 158cc0067 PMD 158c7c067 PTE 0
<0>Oops: 0002 [1] PREEMPT SMP
<4>CPU 2
<4>Modules linked in: autofs4 hidp nfs lockd nfs_acl rfcomm l2cap bluetooth sunrpc ipv6 dm_multipath video output sbs sbshc battery ac parport_pc lp parport sr_mod cdrom sg k8_edac edac_core tg3 button pata_serverworks k8temp hwmon pata_acpi serio_raw ata_generic shpchp pcspkr dm_snapshot dm_zero dm_mirror dm_mod sata_svw libata mptspi mptscsih scsi_transport_spi mptbase sd_mod scsi_mod ext3 jbd mbcache ehci_hcd ohci_hcd uhci_hcd
<4>Pid: 5276, comm: java Not tainted 2.6.24.7-65ibmrt2.4 #1
<4>RIP: 0010:[<ffffffff8113eeda>] [<ffffffff8113eeda>] plist_del+0x26/0x70
<4>RSP: 0018:ffff81009f91bd98 EFLAGS: 00210086
<4>RAX: 0000000000000006 RBX: ffff81015c01a9d0 RCX: ffff81009eebbe50
<4>RDX: ffff81009eebbe58 RSI: ffff81009eeb4080 RDI: ffff810158c25be0
<4>RBP: ffff81009f91bd98 R08: ffff810158c25be8 R09: 00000000bbdf380b
<4>R10: ffff810152c575e0 R11: 000000039f91bbc8 R12: ffff81015c01a9d0
<4>R13: ffff810158c25bb8 R14: 0000000000000000 R15: ffff81015c01a9d0
<4>FS: 00002b932b3ec880(0000) GS:ffff81015faaa7c0(0063) knlGS:00000000c907eb90
<4>CS: 0010 DS: 002b ES: 002b CR0: 0000000080050033
<4>CR2: 0000000000000006 CR3: 0000000158cab000 CR4: 00000000000006e0
<4>DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
<4>DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
<4>Process java (pid: 5276, threadinfo ffff81009f91a000, task ffff81009f914040)
<4>Stack: ffff81009f91bdd8 ffffffff8105dd9b ffff81009f91be58 ffff81015c01a9d0
<4> 0000000000200202 0000000000000000 000000008000149c ffff81015c01a9d0
<4> ffff81009f91bdf8 ffffffff81288357 ffff81015c01a9c0 ffff81015fb63cd8
<4>Call Trace:
<4> [<ffffffff8105dd9b>] wakeup_next_waiter+0x65/0x1b2
<4> [<ffffffff81288357>] rt_mutex_slowunlock+0x3b/0x59
<4> [<ffffffff81288136>] rt_mutex_unlock+0x28/0x2a
<4> [<ffffffff8105ccea>] do_futex+0x9d5/0xb42
<4> [<ffffffff8105f398>] ? rt_mutex_up_read+0x22d/0x232
<4> [<ffffffff8128ba32>] ? do_page_fault+0x3f6/0x76d
<4> [<ffffffff810317c7>] ? post_schedule_rt+0x31/0x35
<4> [<ffffffff810368a2>] ? finish_task_switch+0x4c/0xdc
<4> [<ffffffff8105d3e9>] compat_sys_futex+0xed/0x10b
<4> [<ffffffff8100f895>] ? syscall_trace_enter+0xb7/0xbb
<4> [<ffffffff81027a94>] cstar_do_call+0x1b/0x65
<4>
<4>
<0>Code: 5f c9 c3 90 90 4c 8d 47 08 4c 39 47 08 55 48 89 e5 74 45 48 8b 4f 18 48 83 e9 18 48 8d 51 08 48 8b 71 08 48 8b 42 08 48 89 46 08 <48> 89 30 49 8b 40 08 4c 89 41 08 49 89 50 08 48 89 10 48 89 42
<1>RIP [<ffffffff8113eeda>] plist_del+0x26/0x70

and

Unable to handle kernel paging request at 0000000000002625 RIP:
[<ffffffff8113efa3>] plist_add+0x7f/0xa6
PGD 14f899067 PUD 14f8e7067 PMD 13edfa067 PTE 0
Oops: 0002 [1] PREEMPT SMP
CPU 2
Modules linked in: autofs4 hidp nfs lockd nfs_acl rfcomm l2cap bluetooth sunrpc ipv6 dm_multipath video output sbs sbshc battery ac parport_pc lp parport sg sr_mod cdrom tg3 k8_edac pata_serverworks shpchp edac_core pata_acpi button k8temp hwmon serio_raw ata_generic pcspkr dm_snapshot dm_zero dm_mirror dm_mod sata_svw libata mptspi mptscsih scsi_transport_spi mptbase sd_mod scsi_mod ext3 jbd mbcache ehci_hcd ohci_hcd uhci_hcd
Pid: 9640, comm: java Not tainted 2.6.24.7-65ibmrt2.4 #1
RIP: 0010:[<ffffffff8113efa3>] [<ffffffff8113efa3>] plist_add+0x7f/0xa6
RSP: 0018:ffff8100bbd97e18 EFLAGS: 00010083
RAX: ffff81015e0c75a0 RBX: ffff81009e52dba0 RCX: 0000000000002625
RDX: ffff81009e52dba8 RSI: ffff81015e0c7598 RDI: ffff81009e52dba0
RBP: ffff8100bbd97e28 R08: ffff81009acc1ba8 R09: 0000000000000000
R10: ffff8100bdfc24d8 R11: ffff8100bbd97db8 R12: ffff8100bbd92820
R13: ffff81009acc1b78 R14: ffff81009e52db78 R15: ffff8100bbd92818
FS: 00002ae7ce277480(0000) GS:ffff81015faaa7c0(0063) knlGS:00000000c4638b90
CS: 0010 DS: 002b ES: 002b CR0: 000000008005003b
CR2: 0000000000002625 CR3: 000000014f89a000 CR4: 00000000000006e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Process java (pid: 9640, threadinfo ffff8100bbd96000, task ffff8100bbd92a60)
Stack: ffff8100bbd92040 ffff81015e0c75a8 ffff8100bbd97e68 ffffffff8105ded4
 ffff8100bbd97e48 ffff81009acc1b78 ffff81015e0c75a8 0000000000000001
 ffff81015e0c75a0 ffff81015e0c75c0 ffff8100bbd97ec8 ffffffff81288995
Call Trace:
 [<ffffffff8105ded4>] wakeup_next_waiter+0x19e/0x1b2
 [<ffffffff81288995>] rt_write_slowunlock+0x82/0x1c6
 [<ffffffff8105f166>] rt_mutex_up_write+0x69/0x6e
 [<ffffffff8105fcb7>] rt_up_write+0x9/0xb
 [<ffffffff8109a243>] sys_mprotect+0x210/0x22d
 [<ffffffff8100f895>] ? syscall_trace_enter+0xb7/0xbb
 [<ffffffff81028244>] sys32_mprotect+0x9/0xb
 [<ffffffff81027a94>] cstar_do_call+0x1b/0x65
Code: 89 c6 eb 2e 48 89 c6 48 8b 56 08 48 8d 46 08 4c 39 e0 0f 18 0a 75 dc 48 8d 46 08 48 8d 53 08 48 8b 48 08 48 89 43 08 48 89 50 08 <48> 89 11 48 89 4a 08 48 8d 46 18 48 8d 53 18 48 8b 48 08 48 89
RIP [<ffffffff8113efa3>] plist_add+0x7f/0xa6
RSP <ffff8100bbd97e18>

Hardware: e326m (rtj-opt6)
OS: RHEL5.1
RT version: Alpha18
Kernel: 2.6.24.7-65ibmrt2.4 (this is just a rebuild of RH's 2.6.24.7-65.el5rt kernel)

One of the dumps seems to be corrupted in some way, so crash is unable to open it. I will move the other one to some other machine.

Though the problem has been reported on e326m, the problem does not seem to be hardware specific. Hence this problem can block JTC testing.

This bug was uncovered while testing for bug 44627.

=Comment: #3================================================= P. N. Stanton <pstanton@uk.ibm.com> - 2008-06-25 07:11 EDT
rtj-opt22 is re-running the test. rtj-opt6 is reserved for Sripathi's use.
This test last ran on rtj-opt6 on June 16th on kernel 2.6.24.7-62.el5rt, and it ran OK and passed.

=Comment: #5================================================= Sripathi Kodi <sripathi@in.ibm.com> - 2008-06-25 09:03 EDT
Peter, there is one patch from Steve Rostedt that is not part of the -65 kernel. I have compiled a kernel with that patch and put the kernel rpms under http://kernel.beaverton.ibm.com/~sripathi/45869/
Could you please use this kernel and restart tests on rtj-opt6?

=Comment: #6================================================= P. N. Stanton <pstanton@uk.ibm.com> - 2008-06-25 09:55 EDT
Installed the new kernel on rtj-opt22, since Sripathi was still running stuff on rtj-opt6 and seems to have left for the day.
Good news - the test ran and passed this time.
Bad news - on the 4th iteration the machine locked up. No messages (console or syslog), no dump and the SysRq key didn't work. Re-running to see if it is reproducible.

=Comment: #7================================================= P. N. Stanton <pstanton@uk.ibm.com> - 2008-06-25 10:39 EDT
Re-ran tests on rtj-opt22; the machine locked up as before on the 4th run of the test. This time there were messages on the console:

WARNING: at lib/plist.c:104 plist_add()
Pid: 12128, comm: java Not tainted 2.6.24.7-65ibmrt2.4rwfix #1
Call Trace:
 [<ffffffff8113edb1>] plist_add+0x3d/0xa6
 [<ffffffff8105dd32>] wakeup_next_waiter+0x19e/0x1b2
 [<ffffffff8128804a>] rt_mutex_slowunlock+0x3b/0x59
 [<ffffffff81287e29>] rt_mutex_unlock+0x28/0x2a
 [<ffffffff8105cb4e>] do_futex+0x9d5/0xb42
 [<ffffffff810546f1>] ? hrtimer_nanosleep+0x6b/0xf2
 [<ffffffff81056930>] ? getnstimeofday+0x31/0x88
 [<ffffffff8105d24d>] compat_sys_futex+0xed/0x10b
 [<ffffffff8100f895>] ? syscall_trace_enter+0xb7/0xbb
 [<ffffffff81027a94>] cstar_do_call+0x1b/0x65

=Comment: #10================================================= Darren V. Hart <dvhltc@us.ibm.com> - 2008-06-25 16:07 EDT
My chat with rostedt on the #Linux-rt channel is below. I'll prepare a kernel, but I don't know how to reproduce the error. I'm happy to use rtj-opt6, but I don't know what to run. Can someone from the JTC fill me in?

<dvhart> hey folks, have any of you experienced a hang with [<ffffffff8113edb1>] plist_add+0x3d/0xa6 in the call trace?
<rostedt> dvhart: with -rt14?
<rostedt> or 25-rt7?
<-- dejones has quit (Read error: 110 (Connection timed out))
<dvhart> rostedt, this is on mrg -65 (plus an rwlock patch from you)
<dwalker> rostedt, yeah, there's still lots ..
<rostedt> dwalker: yeah, I see that (with a quick grep through the kernel)
<rostedt> dvhart: there's a lot of rwlock patches missing from that
<dvhart> rostedt, what would be the best way for me to pick those up and test?
<rostedt> dvhart: and the patch you have was also broken
<dvhart> heh
<dvhart> great news all around actually :-)
<rostedt> -rt14?
<dvhart> rostedt, ok I can grab the broken out patches and try and pull them in
<dvhart> rostedt, should be obvious which ones I need - i.e. they all match *rwlock* or something?
<rostedt> dvhart: that would probably be the easiest
<dvhart> ack
<dvhart> thanks rostedt
<rostedt> just look at the end of the queue and get all rwlock*
<rostedt> I try to append "rwlock" to all the patches that dealt with them
<rostedt> although you may need a few of the "rtmutex*" patches there too
<dvhart> ack

=Comment: #11================================================= Darren V. Hart <dvhltc@us.ibm.com> - 2008-06-25 17:27 EDT
MRG test kernel -68 has been released and is available on the mrg-ibm-extras repo. It contains all the rtmutex and rwlock fixes in question. Can someone provide info on how to kick off a test?

=Comment: #12================================================= John G. Stultz <jstultz@us.ibm.com> - 2008-06-25 22:08 EDT
R2-RC1-pre is available here: http://kernel.beaverton.ibm.com/jtcltc/drops/R2/RC1-pre/linux-rt-R2-RC1-pre-1109.tar.gz
This contains the -68 kernel and may fix this issue.
------- Comment From pstanton@uk.ibm.com 2008-06-26 10:05 EDT------- Machine has been running stress tests for three hours now on the -68 kernel. It is still running ok, but there are some kernel messages in the logs:

Jun 26 12:19:53 rtj-opt22 kernel: WARNING: at kernel/rtmutex.c:1732 rt_read_slowunlock()
Jun 26 12:19:53 rtj-opt22 kernel: Pid: 6916, comm: java Not tainted 2.6.24.7-68ibmrt2.4 #1
Jun 26 12:19:53 rtj-opt22 kernel:
Jun 26 12:19:53 rtj-opt22 kernel: Call Trace:
Jun 26 12:19:53 rtj-opt22 kernel: [<ffffffff812888ab>] rt_read_slowunlock+0x14c/0x43e
Jun 26 12:19:53 rtj-opt22 kernel: [<ffffffff8105f4f3>] rt_mutex_up_read+0x25b/0x260
Jun 26 12:19:53 rtj-opt22 kernel: [<ffffffff8105fe38>] rt_up_read+0x9/0xb
Jun 26 12:19:53 rtj-opt22 kernel: [<ffffffff8105c2d1>] futex_lock_pi+0x877/0x963
Jun 26 12:19:53 rtj-opt22 kernel: [<ffffffff8105dffe>] ? __rt_mutex_adjust_prio+0x11/0x24
Jun 26 12:19:53 rtj-opt22 kernel: [<ffffffff8105e898>] ? rt_mutex_adjust_prio+0x35/0x3e
Jun 26 12:19:53 rtj-opt22 kernel: [<ffffffff8105cedf>] do_futex+0xb22/0xb42
Jun 26 12:19:53 rtj-opt22 kernel: [<ffffffff8105d491>] compat_sys_futex+0xed/0x10b
Jun 26 12:19:53 rtj-opt22 kernel: [<ffffffff8100f8c4>] ? syscall_trace_enter+0xb7/0xbb
Jun 26 12:19:53 rtj-opt22 kernel: [<ffffffff81027a94>] cstar_do_call+0x1b/0x65
Jun 26 12:19:53 rtj-opt22 kernel:
Jun 26 12:19:53 rtj-opt22 kernel: WARNING: at kernel/rtmutex.c:1732 rt_read_slowunlock()
Jun 26 12:19:53 rtj-opt22 kernel: Pid: 5816, comm: java Not tainted 2.6.24.7-68ibmrt2.4 #1
Jun 26 12:19:53 rtj-opt22 kernel:
Jun 26 12:19:53 rtj-opt22 kernel: Call Trace:
Jun 26 12:19:53 rtj-opt22 kernel: [<ffffffff812888ab>] rt_read_slowunlock+0x14c/0x43e
Jun 26 12:19:53 rtj-opt22 kernel: [<ffffffff8105f4f3>] rt_mutex_up_read+0x25b/0x260
Jun 26 12:19:53 rtj-opt22 kernel: [<ffffffff8105fe38>] rt_up_read+0x9/0xb
Jun 26 12:19:53 rtj-opt22 kernel: [<ffffffff8105c2d1>] futex_lock_pi+0x877/0x963
Jun 26 12:19:53 rtj-opt22 kernel: [<ffffffff8101e279>] ? smp_send_reschedule+0x1d/0x1f
Jun 26 12:19:53 rtj-opt22 kernel: [<ffffffff8105dffe>] ? __rt_mutex_adjust_prio+0x11/0x24
Jun 26 12:19:53 rtj-opt22 kernel: [<ffffffff8105e898>] ? rt_mutex_adjust_prio+0x35/0x3e
Jun 26 12:19:53 rtj-opt22 kernel: [<ffffffff8105cedf>] do_futex+0xb22/0xb42
Jun 26 12:19:53 rtj-opt22 kernel: [<ffffffff8105e898>] ? rt_mutex_adjust_prio+0x35/0x3e
Jun 26 12:19:53 rtj-opt22 kernel: [<ffffffff81288da4>] ? rt_write_slowunlock+0x207/0x216
Jun 26 12:19:53 rtj-opt22 kernel: [<ffffffff8105d491>] compat_sys_futex+0xed/0x10b
Jun 26 12:19:53 rtj-opt22 kernel: [<ffffffff8100f8c4>] ? syscall_trace_enter+0xb7/0xbb
Jun 26 12:19:53 rtj-opt22 kernel: [<ffffffff81027a94>] cstar_do_call+0x1b/0x65
------- Comment From pstanton@uk.ibm.com 2008-06-30 05:01 EDT------- We have had two machines running the -68 kernel running stress tests over the weekend with no interruptions, so I'd say this is no longer a blocker for us. (The -65 kernel with rwlock patch locked up after about 15 minutes). However, the kernel messages in comment 18 have appeared several times on both machines.
------- Comment From jstultz@us.ibm.com 2008-06-30 21:42 EDT------- I managed to trigger two of these warning messages with our release testing runs. So we've got a second reproducer (although not yet sure where in the 8-16 hours it happened).
Release 69 has several rwlock fixes that may address this issue. Can you see if this is solved in that release?
------- Comment From dvhltc@us.ibm.com 2008-07-02 17:22 EDT------- Currently running our release-testing on -69. Results in < 48 hours.
------- Comment From jstultz@us.ibm.com 2008-07-02 21:49 EDT------- So I ran calibrate tests overnight with my lock_count fix and didn't see any warnings. I've sent the patch on to lkml. I'm not sure if it actually affects the warning the JTC is seeing, but if Darren's -69 testing still shows it, then I'll try to merge it into R2.
------- Comment From dvhltc@us.ibm.com 2008-07-03 11:57 EDT------- No messages to the console during a complete release-testing.sh run on -69.
------- Comment From jstultz@us.ibm.com 2008-07-07 22:23 EDT------- Peter: Have you picked up the -69 kernel for light testing to see if this issue is resolved?
------- Comment From paul_thwaite@uk.ibm.com 2008-07-08 02:14 EDT------- Hi John, No we have not moved to -69 yet. On the LTC/JTC call I understood a few more checks were required before the LTC would recommend the JTC move up to -69? Based on comment #31 I am assuming -69 is good and we will upgrade machines to confirm the messages have disappeared.
------- Comment From jstultz@us.ibm.com 2008-07-08 12:06 EDT------- (In reply to comment #32)
> Hi John, No we have not moved to -69 yet. On the LTC/JTC call I understood a
> few more checks were required before the LTC would recommend the JTC move up to -69?
>
> Based on comment #31 I am assuming -69 is good and we will upgrade machines to
> confirm the messages have disappeared.

Yea, please upgrade just the machines seeing the issues. We'll have an rc2 release at the end of the month that will contain -69 plus other fixes and will be ready for all the JTC R2 test machines. The RPM can be found here:
http://kernel.beaverton.ibm.com/jtcltc/yum/mrg-ibm-extras/x86_64/kernel-rt-2.6.24.7-69.el5rt.x86_64.rpm
(This kernel is missing the SAN bits, but I don't think that affects you.)
------- Comment From pstanton@uk.ibm.com 2008-07-10 11:44 EDT------- We have five machines running on -69 running one hour realtime Java stress tests. All have displayed the message below several times:

WARNING: at kernel/rtmutex.c:1732 rt_read_slowunlock()
Pid: 27140, comm: java Not tainted 2.6.24.7-69.el5rt #1
Call Trace:
 [<ffffffff81288823>] rt_read_slowunlock+0x14c/0x43e
 [<ffffffff8105f4f7>] rt_mutex_up_read+0x25b/0x260
 [<ffffffff8105fe3c>] rt_up_read+0x9/0xb
 [<ffffffff8105c2d5>] futex_lock_pi+0x877/0x963
 [<ffffffff8105f6f5>] ? rt_read_slowlock+0x7b/0x341
 [<ffffffff8105e89c>] ? rt_mutex_adjust_prio+0x35/0x3e
 [<ffffffff81288aeb>] ? rt_read_slowunlock+0x414/0x43e
 [<ffffffff8105cee3>] do_futex+0xb22/0xb42
 [<ffffffff8100a8e5>] ? __switch_to+0x291/0x2a0
 [<ffffffff810368ba>] ? finish_task_switch+0x4c/0xdc
 [<ffffffff8105d495>] compat_sys_futex+0xed/0x10b
 [<ffffffff8100f8c4>] ? syscall_trace_enter+0xb7/0xbb
 [<ffffffff81027a94>] cstar_do_call+0x1b/0x65

I've seen this once on rtj-opt22.hursley.ibm.com and on rtj-opt42.hursley.ibm.com:

WARNING: at kernel/rtmutex.c:1732 rt_read_slowunlock()
Pid: 2459, comm: java Not tainted 2.6.24.7-69.el5rt #1
Call Trace:
 [<ffffffff8105ad2d>] ? cmpxchg_futex_value_locked+0x52/0x5e
 [<ffffffff81288823>] rt_read_slowunlock+0x14c/0x43e
 [<ffffffff8105ba0b>] ? fixup_pi_state_owner+0x174/0x1c7
 [<ffffffff8105f4f7>] rt_mutex_up_read+0x25b/0x260
 [<ffffffff8105fe3c>] rt_up_read+0x9/0xb
 [<ffffffff8105c2d5>] futex_lock_pi+0x877/0x963
 [<ffffffff8105e89c>] ? rt_mutex_adjust_prio+0x35/0x3e
 [<ffffffff81083577>] ? cpupri_find+0x39/0x8a
 [<ffffffff81031120>] ? pick_next_highest_task_rt+0xd4/0x157
 [<ffffffff810313f3>] ? find_lowest_rq+0x74/0x129
 [<ffffffff8105cee3>] do_futex+0xb22/0xb42
 [<ffffffff810316ce>] ? push_rt_tasks+0x14/0x1c
 [<ffffffff8100a8e5>] ? __switch_to+0x291/0x2a0
 [<ffffffff810368ba>] ? finish_task_switch+0x4c/0xdc
 [<ffffffff8105d495>] compat_sys_futex+0xed/0x10b
 [<ffffffff8100f8c4>] ? syscall_trace_enter+0xb7/0xbb
 [<ffffffff81027a94>] cstar_do_call+0x1b/0x65

This one is also from rtj-opt42:

WARNING: at kernel/rtmutex.c:1732 rt_read_slowunlock()
Pid: 28385, comm: java Not tainted 2.6.24.7-69.el5rt #1
Call Trace:
 [<ffffffff81288823>] rt_read_slowunlock+0x14c/0x43e
 [<ffffffff8105f4f7>] rt_mutex_up_read+0x25b/0x260
 [<ffffffff8105fe3c>] rt_up_read+0x9/0xb
 [<ffffffff8105b2d9>] futex_wait+0x339/0x34d
 [<ffffffff8105fe3c>] ? rt_up_read+0x9/0xb
 [<ffffffff8113bbb6>] ? vsnprintf+0x55f/0x5a5
 [<ffffffff8113bd48>] ? snprintf+0x59/0x5b
 [<ffffffff81033428>] ? default_wake_function+0x0/0x14
 [<ffffffff8105c446>] do_futex+0x85/0xb42
 [<ffffffff8102f648>] ? update_curr_rt+0x64/0x66
 [<ffffffff8102f6b2>] ? put_prev_task_rt+0xd/0x1b
 [<ffffffff8105d495>] compat_sys_futex+0xed/0x10b
 [<ffffffff8100f8c4>] ? syscall_trace_enter+0xb7/0xbb
 [<ffffffff81027910>] sysenter_do_call+0x1b/0x67
------- Comment From jstultz@us.ibm.com 2008-07-10 12:31 EDT------- I'm going to talk with Will today and try to get the Oak test running on a box so I can reproduce this issue more easily. Then I'll see if the atomic_t change I have pending helps this or not.
------- Comment From jstultz@us.ibm.com 2008-07-10 19:20 EDT------- So I've managed to reproduce this with the internal oak test. I tested with the recently released -72 kernel, and I still see the issue. I'll dig down a bit further on this tomorrow. Also changed the summary to reflect that this bug has moved to just a warning from a crash.
------- Comment From jstultz@us.ibm.com 2008-07-11 21:30 EDT------- So I dug in a bit on this one, and the warning is tripping due to the "list_empty(&rls->list)" portion of the WARN_ON. Looking at how rls is pulled from owned_read_locks, I scanned through the file to see how it was protected. It seems the owned_read_locks[] values (count, list) are protected by current->pi_lock; however, there are a number of places where they are manipulated without that lock clearly being held. I've added locks around most of the manipulations I found, but I'm still seeing the warnings. I need to dig more on this next week, and I'll also ping Steven about it.
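For readers following the locking analysis above, here is a minimal userspace sketch in Python of the discipline being described: every touch of a per-task (count, list) entry happens with the task's pi_lock held. This is not the kernel code. The names ReaderLockEntry, Task, read_acquire, read_release and the on_list flag are hypothetical stand-ins; only owned_read_locks, pi_lock and the list_empty(&rls->list) condition come from the comment above. It exercises a single task only and says nothing about where the real race was.

```python
import threading


class ReaderLockEntry:
    """Stand-in for one owned_read_locks[] slot: a (count, list) pair."""

    def __init__(self, lock_name):
        self.lock_name = lock_name
        self.count = 0        # nested read acquisitions by this task
        self.on_list = False  # True models !list_empty(&rls->list)


class Task:
    """Hypothetical model of a task and its reader-lock bookkeeping."""

    def __init__(self):
        self.pi_lock = threading.Lock()  # plays the role of current->pi_lock
        self.owned_read_locks = {}       # lock name -> ReaderLockEntry

    def read_acquire(self, lock_name, readers):
        # Both the count and the list membership change under pi_lock.
        with self.pi_lock:
            rls = self.owned_read_locks.get(lock_name)
            if rls is None:
                rls = ReaderLockEntry(lock_name)
                self.owned_read_locks[lock_name] = rls
            rls.count += 1
            if not rls.on_list:
                readers.append(rls)
                rls.on_list = True

    def read_release(self, lock_name, readers):
        """Drop one read reference; return True if the warning condition held."""
        with self.pi_lock:
            rls = self.owned_read_locks[lock_name]
            warned = not rls.on_list  # analogue of list_empty(&rls->list)
            rls.count -= 1
            if rls.count == 0:
                if rls.on_list:
                    readers.remove(rls)
                    rls.on_list = False
                del self.owned_read_locks[lock_name]
            return warned
```

In this model, a path that cleared on_list or changed count without taking pi_lock could let a later read_release observe an entry that is off the list while its count is still nonzero, which is the shape of the condition the WARN_ON reports.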
Created attachment 311900
rwlock: be more conservative in locking reader_lock_count

Talked with Steven about this issue and ran some logdev patches for him. He then sent this patch which appears to fix the warning issue!
------- Comment From jstultz@us.ibm.com 2008-07-29 23:01 EDT------- This has been included in the -74 errata.
An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you.
http://rhn.redhat.com/errata/RHSA-2008-0585.html