Bug 452974
Summary: | [24][FOCUS] plist_add/del crash with 2.6.24.7-65ibmrt2.4 kernel | ||||||
---|---|---|---|---|---|---|---|
Product: | Red Hat Enterprise MRG | Reporter: | IBM Bug Proxy <bugproxy> | ||||
Component: | realtime-kernel | Assignee: | Red Hat Real Time Maintenance <rt-maint> | ||||
Status: | CLOSED ERRATA | QA Contact: | |||||
Severity: | urgent | Docs Contact: | |||||
Priority: | low | ||||||
Version: | beta | CC: | bhu, lgoncalv, srostedt, williams | ||||
Target Milestone: | 1.0.1 | ||||||
Target Release: | --- | ||||||
Hardware: | x86_64 | ||||||
OS: | All | ||||||
Whiteboard: | |||||||
Fixed In Version: | Doc Type: | Bug Fix | |||||
Doc Text: | Story Points: | --- | |||||
Clone Of: | Environment: | ||||||
Last Closed: | 2008-08-26 19:57:28 UTC | Type: | --- | ||||
Regression: | --- | Mount Type: | --- | ||||
Documentation: | --- | CRM: | |||||
Verified Versions: | Category: | --- | |||||
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |||||
Cloudforms Team: | --- | Target Upstream Version: | |||||
Embargoed: | |||||||
Attachments: |
|
Description
IBM Bug Proxy
2008-06-26 12:08:22 UTC
------- Comment From pstanton.com 2008-06-26 10:05 EDT------- Machine has been running stress tests for three hours now on the -68 kernel. It is still running ok, but there are some kernel messages in the logs: Jun 26 12:19:53 rtj-opt22 kernel: WARNING: at kernel/rtmutex.c:1732 rt_read_slowunlock() Jun 26 12:19:53 rtj-opt22 kernel: Pid: 6916, comm: java Not tainted 2.6.24.7-68ibmrt2.4 #1 Jun 26 12:19:53 rtj-opt22 kernel: Jun 26 12:19:53 rtj-opt22 kernel: Call Trace: Jun 26 12:19:53 rtj-opt22 kernel: [<ffffffff812888ab>] rt_read_slowunlock+0x14c/0x43e Jun 26 12:19:53 rtj-opt22 kernel: [<ffffffff8105f4f3>] rt_mutex_up_read+0x25b/0x260 Jun 26 12:19:53 rtj-opt22 kernel: [<ffffffff8105fe38>] rt_up_read+0x9/0xb Jun 26 12:19:53 rtj-opt22 kernel: [<ffffffff8105c2d1>] futex_lock_pi+0x877/0x963 Jun 26 12:19:53 rtj-opt22 kernel: [<ffffffff8105dffe>] ? __rt_mutex_adjust_prio+0x11/0x24 Jun 26 12:19:53 rtj-opt22 kernel: [<ffffffff8105e898>] ? rt_mutex_adjust_prio+0x35/0x3e Jun 26 12:19:53 rtj-opt22 kernel: [<ffffffff8105cedf>] do_futex+0xb22/0xb42 Jun 26 12:19:53 rtj-opt22 kernel: [<ffffffff8105d491>] compat_sys_futex+0xed/0x10b Jun 26 12:19:53 rtj-opt22 kernel: [<ffffffff8100f8c4>] ? syscall_trace_enter+0xb7/0xbb Jun 26 12:19:53 rtj-opt22 kernel: [<ffffffff81027a94>] cstar_do_call+0x1b/0x65 Jun 26 12:19:53 rtj-opt22 kernel: Jun 26 12:19:53 rtj-opt22 kernel: WARNING: at kernel/rtmutex.c:1732 rt_read_slowunlock() Jun 26 12:19:53 rtj-opt22 kernel: Pid: 5816, comm: java Not tainted 2.6.24.7-68ibmrt2.4 #1 Jun 26 12:19:53 rtj-opt22 kernel: Jun 26 12:19:53 rtj-opt22 kernel: Call Trace: Jun 26 12:19:53 rtj-opt22 kernel: [<ffffffff812888ab>] rt_read_slowunlock+0x14c/0x43e Jun 26 12:19:53 rtj-opt22 kernel: [<ffffffff8105f4f3>] rt_mutex_up_read+0x25b/0x260 Jun 26 12:19:53 rtj-opt22 kernel: [<ffffffff8105fe38>] rt_up_read+0x9/0xb Jun 26 12:19:53 rtj-opt22 kernel: [<ffffffff8105c2d1>] futex_lock_pi+0x877/0x963 Jun 26 12:19:53 rtj-opt22 kernel: [<ffffffff8101e279>] ? smp_send_reschedule+0x1d/0x1f Jun 26 12:19:53 rtj-opt22 kernel: [<ffffffff8105dffe>] ? __rt_mutex_adjust_prio+0x11/0x24 Jun 26 12:19:53 rtj-opt22 kernel: [<ffffffff8105e898>] ? rt_mutex_adjust_prio+0x35/0x3e Jun 26 12:19:53 rtj-opt22 kernel: [<ffffffff8105cedf>] do_futex+0xb22/0xb42 Jun 26 12:19:53 rtj-opt22 kernel: [<ffffffff8105e898>] ? rt_mutex_adjust_prio+0x35/0x3e Jun 26 12:19:53 rtj-opt22 kernel: [<ffffffff81288da4>] ? rt_write_slowunlock+0x207/0x216 Jun 26 12:19:53 rtj-opt22 kernel: [<ffffffff8105d491>] compat_sys_futex+0xed/0x10b Jun 26 12:19:53 rtj-opt22 kernel: [<ffffffff8100f8c4>] ? syscall_trace_enter+0xb7/0xbb Jun 26 12:19:53 rtj-opt22 kernel: [<ffffffff81027a94>] cstar_do_call+0x1b/0x65 ------- Comment From pstanton.com 2008-06-30 05:01 EDT------- We have had two machines running the -68 kernel running stress tests over the weekend with no interruptions, so I'd say this is no longer a blocker for us. (The -65 kernel with rwlock patch locked up after about 15 minutes). However, the kernel messages in comment 18 have appeared several times on both machines. ------- Comment From jstultz.com 2008-06-30 21:42 EDT------- I managed to trigger two of these warning messages with our release testing runs. So we've got a second reproducer (although not yet sure where in the 8-16 hours it happened). Release 69 has several rwlock fixes that may address this issue. Can you see if this is solved in that release? ------- Comment From dvhltc.com 2008-07-02 17:22 EDT------- Currently running our release-testing on -69. Results in < 48 hours. ------- Comment From jstultz.com 2008-07-02 21:49 EDT------- So I ran calibrate tests overnight with my lock_count fix and didn't see any warnings. I've sent the patch on to lkml. I'm not sure if it actually effects the warning the JTC is seeing, but if Darren's -69 testing still shows it, then i'll try to merge it in to R2. ------- Comment From dvhltc.com 2008-07-03 11:57 EDT------- No messages to the console during a complete release-testing.sh run on -69. ------- Comment From jstultz.com 2008-07-07 22:23 EDT------- Peter: Have you picked up the -69 kernel for light testing to see if this issue is resolved? ------- Comment From paul_thwaite.com 2008-07-08 02:14 EDT------- Hi John, No we have not moved to -69 yet. On the LTC/JTC call I understood a few more checks were required before the LTC would recommend the JTC move up to -69? Based on comment #31 I am assuming -69 is good and we will upgrade machines to confirm the messages have disappeared. ------- Comment From jstultz.com 2008-07-08 12:06 EDT------- (In reply to comment #32) > Hi John, No we have not moved to -69 yet. On the LTC/JTC call I understood a > few more checks were required before the LTC would recommend the JTC move up to -69? > > Based on comment #31 I am assuming -69 is good and we will upgrade machines to > confirm the messages have disappeared. Yea, please upgrade just the machines seeing the issues. We'll have a rc2 release at the end of the month that will contain -69+other fixes that will be ready for all the JTC R2 test machines. The RPM can be found here: http://kernel.beaverton.ibm.com/jtcltc/yum/mrg-ibm-extras/x86_64/kernel-rt-2.6.24.7-69.el5rt.x86_64.rpm (although this kernel is missing the SAN bits, but I don't think it affects you) ------- Comment From pstanton.com 2008-07-10 11:44 EDT------- We have five machines running on -69 running one hour realtime Java stress tests. All have displayed the message below several times: WARNING: at kernel/rtmutex.c:1732 rt_read_slowunlock() Pid: 27140, comm: java Not tainted 2.6.24.7-69.el5rt #1 Call Trace: [<ffffffff81288823>] rt_read_slowunlock+0x14c/0x43e [<ffffffff8105f4f7>] rt_mutex_up_read+0x25b/0x260 [<ffffffff8105fe3c>] rt_up_read+0x9/0xb [<ffffffff8105c2d5>] futex_lock_pi+0x877/0x963 [<ffffffff8105f6f5>] ? rt_read_slowlock+0x7b/0x341 [<ffffffff8105e89c>] ? rt_mutex_adjust_prio+0x35/0x3e [<ffffffff81288aeb>] ? rt_read_slowunlock+0x414/0x43e [<ffffffff8105cee3>] do_futex+0xb22/0xb42 [<ffffffff8100a8e5>] ? __switch_to+0x291/0x2a0 [<ffffffff810368ba>] ? finish_task_switch+0x4c/0xdc [<ffffffff8105d495>] compat_sys_futex+0xed/0x10b [<ffffffff8100f8c4>] ? syscall_trace_enter+0xb7/0xbb [<ffffffff81027a94>] cstar_do_call+0x1b/0x65 I've seen this once on rtj-opt22.hursley.ibm.com and on rtj-opt42.hursley.ibm.com: WARNING: at kernel/rtmutex.c:1732 rt_read_slowunlock() Pid: 2459, comm: java Not tainted 2.6.24.7-69.el5rt #1 Call Trace: [<ffffffff8105ad2d>] ? cmpxchg_futex_value_locked+0x52/0x5e [<ffffffff81288823>] rt_read_slowunlock+0x14c/0x43e [<ffffffff8105ba0b>] ? fixup_pi_state_owner+0x174/0x1c7 [<ffffffff8105f4f7>] rt_mutex_up_read+0x25b/0x260 [<ffffffff8105fe3c>] rt_up_read+0x9/0xb [<ffffffff8105c2d5>] futex_lock_pi+0x877/0x963 [<ffffffff8105e89c>] ? rt_mutex_adjust_prio+0x35/0x3e [<ffffffff81083577>] ? cpupri_find+0x39/0x8a [<ffffffff81031120>] ? pick_next_highest_task_rt+0xd4/0x157 [<ffffffff810313f3>] ? find_lowest_rq+0x74/0x129 [<ffffffff8105cee3>] do_futex+0xb22/0xb42 [<ffffffff810316ce>] ? push_rt_tasks+0x14/0x1c [<ffffffff8100a8e5>] ? __switch_to+0x291/0x2a0 [<ffffffff810368ba>] ? finish_task_switch+0x4c/0xdc [<ffffffff8105d495>] compat_sys_futex+0xed/0x10b [<ffffffff8100f8c4>] ? syscall_trace_enter+0xb7/0xbb [<ffffffff81027a94>] cstar_do_call+0x1b/0x65 This one is also from rtj-opt42: WARNING: at kernel/rtmutex.c:1732 rt_read_slowunlock() Pid: 28385, comm: java Not tainted 2.6.24.7-69.el5rt #1 Call Trace: [<ffffffff81288823>] rt_read_slowunlock+0x14c/0x43e [<ffffffff8105f4f7>] rt_mutex_up_read+0x25b/0x260 [<ffffffff8105fe3c>] rt_up_read+0x9/0xb [<ffffffff8105b2d9>] futex_wait+0x339/0x34d [<ffffffff8105fe3c>] ? rt_up_read+0x9/0xb [<ffffffff8113bbb6>] ? vsnprintf+0x55f/0x5a5 [<ffffffff8113bd48>] ? snprintf+0x59/0x5b [<ffffffff81033428>] ? default_wake_function+0x0/0x14 [<ffffffff8105c446>] do_futex+0x85/0xb42 [<ffffffff8102f648>] ? update_curr_rt+0x64/0x66 [<ffffffff8102f6b2>] ? put_prev_task_rt+0xd/0x1b [<ffffffff8105d495>] compat_sys_futex+0xed/0x10b [<ffffffff8100f8c4>] ? syscall_trace_enter+0xb7/0xbb [<ffffffff81027910>] sysenter_do_call+0x1b/0x67 ------- Comment From jstultz.com 2008-07-10 12:31 EDT------- I'm going to talk with Will today and try to get the Oak test running on a box so I can reproduce this issue more easily. Then I'll see if the atomic_t change I have pending helps this or not. ------- Comment From jstultz.com 2008-07-10 19:20 EDT------- So I've managed to reproduce this with the internal oak test. I tested with the recently released -72 kernel, and I still see the issue. I'll dig down a bit further on this tomorrow. Also changed the summary to reflect that this bug has moved to just a warning from a crash. ------- Comment From jstultz.com 2008-07-11 21:30 EDT------- So I dug in a bit on this one, and the warning is tripping due to the the "list_empty(&rls->list)" portion of the WARN_ON. Looking how rls is pulled from owned_read_locks, I scanned through the file to see how it was protected. It seems the owned_read_locks[] values (count,list) are protected by the current->pi_lock, however there are a number of places where they are manipulated without that lock clearly being held. I've added locks around most of the manipulations i found, but I'm still seeing the warnings. Need to dig more on this next week and I'll also ping Steven about it. Created attachment 311900 [details]
rwlock: be more conservative in locking reader_lock_count
Talked with Steven about this issue and ran some logdev patches for him. He
then sent this patch which appears to fix the warning issue!
------- Comment From jstultz.com 2008-07-29 23:01 EDT------- This has been included in the -74 errata. An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on therefore solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you. http://rhn.redhat.com/errata/RHSA-2008-0585.html |