Bug 668775
| Field | Value |
| --- | --- |
| Summary | BKL (lock_kernel) in soft lockup during parallel IO discovery |
| Product | Red Hat Enterprise Linux 6 |
| Component | kernel |
| Version | 6.1 |
| Hardware | Unspecified |
| OS | Linux |
| Status | CLOSED ERRATA |
| Severity | urgent |
| Priority | high |
| Target Milestone | rc |
| Reporter | Tim Wilkinson <twilkins> |
| Assignee | Jeff Moyer <jmoyer> |
| QA Contact | Gris Ge <fge> |
| CC | bdonahue, buchino, chellwig, coughlan, czhang, dshaks, dzickus, eddie.williams, fge, james.leddy, jmoyer, kzhang, msnitzer, perfbz, prarit, pzijlstr, rwheeler, sgandhar, stbechto, woodard |
| Fixed In Version | kernel-2.6.32-151.el6 |
| Doc Type | Bug Fix |
| Cloned to | 690523, 1225598 (view as bug list) |
| Bug Blocks | 658636, 690523, 699556, 1225598 |
| Last Closed | 2011-12-06 12:36:48 UTC |
| Attachments | sysrq-t output during high load (attachment 488155) |
**Description** (Tim Wilkinson, 2011-01-11 15:34:08 UTC)
This request was evaluated by Red Hat Product Management for inclusion in the current release of Red Hat Enterprise Linux. Because the affected component is not scheduled to be updated in the current release, Red Hat is unfortunately unable to address this request at this time. Red Hat invites you to ask your support representative to propose this request, if appropriate and relevant, in the next release of Red Hat Enterprise Linux. If you would like it considered as an exception in the current release, please ask your support representative.

This request was erroneously denied for the current release of Red Hat Enterprise Linux. The error has been fixed and this request has been re-proposed for the current release.

Seeing the following traces on a system trying to modprobe lpfc.

```
crash> bt 1774
PID: 1774   TASK: ffff8804286ef4a0  CPU: 5   COMMAND: "async/15"
 #0 [ffff880028347e80] crash_nmi_callback at ffffffff8102e036
 #1 [ffff880028347e90] notifier_call_chain at ffffffff814ce4d5
 #2 [ffff880028347ed0] atomic_notifier_call_chain at ffffffff814ce53a
 #3 [ffff880028347ee0] notify_die at ffffffff81096fce
 #4 [ffff880028347f10] do_nmi at ffffffff814cc60c
 #5 [ffff880028347f50] nmi at ffffffff814cbf00
    [exception RIP: smp_call_function_many+434]
    RIP: ffffffff810a7372  RSP: ffff88042b391c90  RFLAGS: 00000202
    RAX: 0000000000000002  RBX: ffff8800283520a0  RCX: 0000000000000008
    RDX: 0000000000000001  RSI: 0000000000000000  RDI: 0000000000000286
    RBP: ffff88042b391cd0  R8:  0000000000000000  R9:  ffff88041fedde00
    R10: 0000000000000000  R11: 0000000000000001  R12: 0000000000000005
    R13: ffffffff818a3ba0  R14: ffffffff818a3ba0  R15: ffffffff8119dcb0
    ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
--- <NMI exception stack> ---
 #6 [ffff88042b391c90] smp_call_function_many at ffffffff810a7372
 #7 [ffff88042b391cd8] smp_call_function at ffffffff810a73f2
 #8 [ffff88042b391ce8] on_each_cpu at ffffffff81073914
 #9 [ffff88042b391d18] invalidate_bh_lrus at ffffffff8119d92c
#10 [ffff88042b391d28] kill_bdev at ffffffff811a3f58
#11 [ffff88042b391d48] __blkdev_put at ffffffff811a4c60
#12 [ffff88042b391d98] blkdev_put at ffffffff811a4d50
#13 [ffff88042b391da8] register_disk at ffffffff811db82a
#14 [ffff88042b391df8] add_disk at ffffffff81249acc
#15 [ffff88042b391e28] sd_probe_async at ffffffffa017a34b
#16 [ffff88042b391e68] async_thread at ffffffff81099192
#17 [ffff88042b391ee8] kthread at ffffffff81091a86
#18 [ffff88042b391f48] kernel_thread at ffffffff810141ca
crash>
crash> bt -a
PID: 2317   TASK: ffff88041d0ef4a0  CPU: 0   COMMAND: "blkid"
 #0 [ffff880028207e80] crash_nmi_callback at ffffffff8102e036
 #1 [ffff880028207e90] notifier_call_chain at ffffffff814ce4d5
 #2 [ffff880028207ed0] atomic_notifier_call_chain at ffffffff814ce53a
 #3 [ffff880028207ee0] notify_die at ffffffff81096fce
 #4 [ffff880028207f10] do_nmi at ffffffff814cc60c
 #5 [ffff880028207f50] nmi at ffffffff814cbf00
    [exception RIP: lock_kernel+46]
    RIP: ffffffff814cb9de  RSP: ffff88041d0f1e28  RFLAGS: 00000297
    RAX: 0000000000000000  RBX: ffff880420e67500  RCX: 0000000000001e6b
    RDX: 0000000000001e6f  RSI: 000000000000101d  RDI: ffff880420e67520
    RBP: ffff88041d0f1e28  R8:  0000000000000000  R9:  0000000000000000
    R10: 0000000000000000  R11: 0000000000000246  R12: 0000000000000000
    R13: 000000000000101d  R14: ffff880420e67520  R15: ffff880429dd0c00
    ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
--- <NMI exception stack> ---
 #6 [ffff88041d0f1e28] lock_kernel at ffffffff814cb9de
 #7 [ffff88041d0f1e30] __blkdev_put at ffffffff811a4bf2
 #8 [ffff88041d0f1e80] blkdev_put at ffffffff811a4d50
 #9 [ffff88041d0f1e90] blkdev_close at ffffffff811a4d93
#10 [ffff88041d0f1ec0] __fput at ffffffff8116eb05
#11 [ffff88041d0f1f10] fput at ffffffff8116ec45
#12 [ffff88041d0f1f20] filp_close at ffffffff8116a19d
#13 [ffff88041d0f1f50] sys_close at ffffffff8116a275
#14 [ffff88041d0f1f80] system_call_fastpath at ffffffff81013172
    RIP: 0000003dfeed4150  RSP: 00007fff3cc56dd8  RFLAGS: 00010202
    RAX: 0000000000000003  RBX: ffffffff81013172  RCX: 0000003e01020ba0
    RDX: 0000000000000000  RSI: 0000000000000000  RDI: 0000000000000003
    RBP: 0000000000000003  R8:  00000000013487e8  R9:  0000000000300000
    R10: 0000000000000000  R11: 0000000000000246  R12: 00007fff3cc57c4d
    R13: 00007fff3cc572c0  R14: 0000000000000000  R15: 00007fff3cc572c0
    ORIG_RAX: 0000000000000003  CS: 0033  SS: 002b

PID: 28     TASK: ffff88042e2f8b30  CPU: 1   COMMAND: "events/1"
 #0 [ffff880028247e80] crash_nmi_callback at ffffffff8102e036
 #1 [ffff880028247e90] notifier_call_chain at ffffffff814ce4d5
 #2 [ffff880028247ed0] atomic_notifier_call_chain at ffffffff814ce53a
 #3 [ffff880028247ee0] notify_die at ffffffff81096fce
 #4 [ffff880028247f10] do_nmi at ffffffff814cc60c
 #5 [ffff880028247f50] nmi at ffffffff814cbf00
    [exception RIP: cfb_imageblit+1213]
    RIP: ffffffff812adf8d  RSP: ffff88042e31b950  RFLAGS: 00000046
    RAX: ffffc90012d77134  RBX: ffff880429104978  RCX: 0000000000000000
    RDX: 0000000000000000  RSI: 000000000000000f  RDI: 0000000000000000
    RBP: ffff88042e31b9d0  R8:  0000000007070707  R9:  0000000000000004
    R10: ffffffff81513520  R11: ffff880429104974  R12: 0000000000000003
    R13: ffffc90012d77168  R14: 0000000000000400  R15: 000000000000000b
    ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
--- <NMI exception stack> ---
 #6 [ffff88042e31b950] cfb_imageblit at ffffffff812adf8d
 #7 [ffff88042e31b9d8] bit_putcs at ffffffff812a6e9e
 #8 [ffff88042e31bb28] fbcon_putcs at ffffffff812a3646
 #9 [ffff88042e31bba8] fbcon_redraw at ffffffff812a3a2c
#10 [ffff88042e31bc18] fbcon_scroll at ffffffff812a5543
#11 [ffff88042e31bc98] scrup at ffffffff81307280
#12 [ffff88042e31bcd8] lf at ffffffff8130741d
#13 [ffff88042e31bcf8] vt_console_print at ffffffff813091e2
#14 [ffff88042e31bd58] __call_console_drivers at ffffffff8106baa5
#15 [ffff88042e31bd88] _call_console_drivers at ffffffff8106bb0a
#16 [ffff88042e31bda8] release_console_sem at ffffffff8106bfe8
#17 [ffff88042e31bde8] fb_flashcursor at ffffffff812a08ea
#18 [ffff88042e31be38] worker_thread at ffffffff8108c720
#19 [ffff88042e31bee8] kthread at ffffffff81091a86
#20 [ffff88042e31bf48] kernel_thread at ffffffff810141ca

PID: 1908   TASK: ffff88042a39eb30  CPU: 2   COMMAND: "async/19"
 #0 [ffff8800282836b0] sysrq_handle_crash at ffffffff8130dfe6
 #1 [ffff8800282836d0] pointer at ffffffff812628c3
 #2 [ffff880028283868] sysrq_handle_crash at ffffffff8130dfe6
 #3 [ffff8800282838f0] machine_kexec at ffffffff8103697b
 #4 [ffff880028283950] crash_kexec at ffffffff810b9078
 #5 [ffff880028283a20] oops_end at ffffffff814cc900
 #6 [ffff880028283a50] no_context at ffffffff8104652b
 #7 [ffff880028283aa0] __bad_area_nosemaphore at ffffffff810467b5
 #8 [ffff880028283af0] bad_area_nosemaphore at ffffffff81046883
 #9 [ffff880028283b00] do_page_fault at ffffffff814ce388
#10 [ffff880028283b50] page_fault at ffffffff814cbc75
    [exception RIP: sysrq_handle_crash+22]
    RIP: ffffffff8130dfe6  RSP: ffff880028283c08  RFLAGS: 00010092
    RAX: 0000000000000010  RBX: 0000000000000063  RCX: 00000000000008bc
    RDX: 0000000000000000  RSI: ffff880429a15000  RDI: 0000000000000063
    RBP: ffff880028283c08  R8:  0000000000000073  R9:  0000000000000000
    R10: 00000000000000fa  R11: 0000000000000000  R12: ffff880429a15000
    R13: ffffffff817a0700  R14: 0000000000000086  R15: 0000000000000007
    ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
#11 [ffff880028283c10] __handle_sysrq at ffffffff8130e2a2
#12 [ffff880028283c60] handle_sysrq at ffffffff8130e38b
#13 [ffff880028283c70] kbd_event at ffffffff81305251
#14 [ffff880028283ce0] input_pass_event at ffffffff813b11c6
#15 [ffff880028283d20] input_handle_event at ffffffff813b2dc3
#16 [ffff880028283d60] input_event at ffffffff813b36c4
#17 [ffff880028283db0] atkbd_interrupt at ffffffff813ba370
#18 [ffff880028283e20] serio_interrupt at ffffffff813ae3f2
#19 [ffff880028283e60] i8042_interrupt at ffffffff813af1b2
#20 [ffff880028283ed0] handle_IRQ_event at ffffffff810d8960
#21 [ffff880028283f20] handle_edge_irq at ffffffff810db046
#22 [ffff880028283f60] handle_irq at ffffffff81015fb9
#23 [ffff880028283f80] do_IRQ at ffffffff814d063c
--- <IRQ stack> ---
#24 [ffff88042aca7c88] ret_from_intr at ffffffff81013ad3
    [exception RIP: lock_kernel+53]
    RIP: ffffffff814cb9e5  RSP: ffff88042aca7d30  RFLAGS: 00000293
    RAX: 0000000000000000  RBX: ffff88042aca7d30  RCX: 0000000000001e6b
    RDX: 0000000000001e6e  RSI: 0000000000000004  RDI: ffff88042e83fc00
    RBP: ffffffff81013ace  R8:  0000000000000004  R9:  ffff88042e83fc10
    R10: ffff88042e83fc60  R11: 0000000000000000  R12: 0000000000000001
    R13: ffffffff81201230  R14: ffff88042aca7cc0  R15: ffff88042e098c00
    ORIG_RAX: ffffffffffffffce  CS: 0010  SS: 0018
#25 [ffff88042aca7d38] __blkdev_get at ffffffff811a4e50
#26 [ffff88042aca7d98] blkdev_get at ffffffff811a51d0
#27 [ffff88042aca7da8] register_disk at ffffffff811db815
#28 [ffff88042aca7df8] add_disk at ffffffff81249acc
#29 [ffff88042aca7e28] sd_probe_async at ffffffffa017a34b
#30 [ffff88042aca7e68] async_thread at ffffffff81099192
#31 [ffff88042aca7ee8] kthread at ffffffff81091a86
#32 [ffff88042aca7f48] kernel_thread at ffffffff810141ca

PID: 1964   TASK: ffff88042c268af0  CPU: 3   COMMAND: "async/31"
 #0 [ffff8800282c7e80] crash_nmi_callback at ffffffff8102e036
 #1 [ffff8800282c7e90] notifier_call_chain at ffffffff814ce4d5
 #2 [ffff8800282c7ed0] atomic_notifier_call_chain at ffffffff814ce53a
 #3 [ffff8800282c7ee0] notify_die at ffffffff81096fce
 #4 [ffff8800282c7f10] do_nmi at ffffffff814cc60c
 #5 [ffff8800282c7f50] nmi at ffffffff814cbf00
    [exception RIP: lock_kernel+46]
    RIP: ffffffff814cb9de  RSP: ffff88042c295d30  RFLAGS: 00000293
    RAX: 0000000000000000  RBX: ffff880427036a80  RCX: 0000000000001e6b
    RDX: 0000000000001e6d  RSI: 0000000000000004  RDI: ffff88042e83fc00
    RBP: ffff88042c295d30  R8:  0000000000000004  R9:  ffff88042e83fc10
    R10: ffff88042e83fc60  R11: 0000000000000000  R12: ffff880427036a80
    R13: 0000000000000000  R14: ffff88042b94c138  R15: 0000000000000001
    ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
--- <NMI exception stack> ---
 #6 [ffff88042c295d30] lock_kernel at ffffffff814cb9de
 #7 [ffff88042c295d38] __blkdev_get at ffffffff811a4e50
 #8 [ffff88042c295d98] blkdev_get at ffffffff811a51d0
 #9 [ffff88042c295da8] register_disk at ffffffff811db815
#10 [ffff88042c295df8] add_disk at ffffffff81249acc
#11 [ffff88042c295e28] sd_probe_async at ffffffffa017a34b
#12 [ffff88042c295e68] async_thread at ffffffff81099192
#13 [ffff88042c295ee8] kthread at ffffffff81091a86
#14 [ffff88042c295f48] kernel_thread at ffffffff810141ca

PID: 1771   TASK: ffff880428baea70  CPU: 4   COMMAND: "async/13"
 #0 [ffff880028307e80] crash_nmi_callback at ffffffff8102e036
 #1 [ffff880028307e90] notifier_call_chain at ffffffff814ce4d5
 #2 [ffff880028307ed0] atomic_notifier_call_chain at ffffffff814ce53a
 #3 [ffff880028307ee0] notify_die at ffffffff81096fce
 #4 [ffff880028307f10] do_nmi at ffffffff814cc60c
 #5 [ffff880028307f50] nmi at ffffffff814cbf00
    [exception RIP: lock_kernel+46]
    RIP: ffffffff814cb9de  RSP: ffff88042b289d40  RFLAGS: 00000297
    RAX: 0000000000000000  RBX: ffff880420c830c0  RCX: 0000000000001e6b
    RDX: 0000000000001e6c  RSI: 0000000000000001  RDI: ffff880420c830e0
    RBP: ffff88042b289d40  R8:  0000000000000000  R9:  0000000000000000
    R10: 0000000000000002  R11: 0000000000000000  R12: 0000000000000000
    R13: 0000000000000001  R14: ffff880420c830e0  R15: ffff88042b52e000
    ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
--- <NMI exception stack> ---
 #6 [ffff88042b289d40] lock_kernel at ffffffff814cb9de
 #7 [ffff88042b289d48] __blkdev_put at ffffffff811a4bf2
 #8 [ffff88042b289d98] blkdev_put at ffffffff811a4d50
 #9 [ffff88042b289da8] register_disk at ffffffff811db82a
#10 [ffff88042b289df8] add_disk at ffffffff81249acc
#11 [ffff88042b289e28] sd_probe_async at ffffffffa017a34b
#12 [ffff88042b289e68] async_thread at ffffffff81099192
#13 [ffff88042b289ee8] kthread at ffffffff81091a86
#14 [ffff88042b289f48] kernel_thread at ffffffff810141ca

PID: 1774   TASK: ffff8804286ef4a0  CPU: 5   COMMAND: "async/15"
 #0 [ffff880028347e80] crash_nmi_callback at ffffffff8102e036
 #1 [ffff880028347e90] notifier_call_chain at ffffffff814ce4d5
 #2 [ffff880028347ed0] atomic_notifier_call_chain at ffffffff814ce53a
 #3 [ffff880028347ee0] notify_die at ffffffff81096fce
 #4 [ffff880028347f10] do_nmi at ffffffff814cc60c
 #5 [ffff880028347f50] nmi at ffffffff814cbf00
    [exception RIP: smp_call_function_many+434]
    RIP: ffffffff810a7372  RSP: ffff88042b391c90  RFLAGS: 00000202
    RAX: 0000000000000002  RBX: ffff8800283520a0  RCX: 0000000000000008
    RDX: 0000000000000001  RSI: 0000000000000000  RDI: 0000000000000286
    RBP: ffff88042b391cd0  R8:  0000000000000000  R9:  ffff88041fedde00
    R10: 0000000000000000  R11: 0000000000000001  R12: 0000000000000005
    R13: ffffffff818a3ba0  R14: ffffffff818a3ba0  R15: ffffffff8119dcb0
    ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
--- <NMI exception stack> ---
 #6 [ffff88042b391c90] smp_call_function_many at ffffffff810a7372
 #7 [ffff88042b391cd8] smp_call_function at ffffffff810a73f2
 #8 [ffff88042b391ce8] on_each_cpu at ffffffff81073914
 #9 [ffff88042b391d18] invalidate_bh_lrus at ffffffff8119d92c
#10 [ffff88042b391d28] kill_bdev at ffffffff811a3f58
#11 [ffff88042b391d48] __blkdev_put at ffffffff811a4c60
#12 [ffff88042b391d98] blkdev_put at ffffffff811a4d50
#13 [ffff88042b391da8] register_disk at ffffffff811db82a
#14 [ffff88042b391df8] add_disk at ffffffff81249acc
#15 [ffff88042b391e28] sd_probe_async at ffffffffa017a34b
#16 [ffff88042b391e68] async_thread at ffffffff81099192
#17 [ffff88042b391ee8] kthread at ffffffff81091a86
#18 [ffff88042b391f48] kernel_thread at ffffffff810141ca

PID: 2321   TASK: ffff88041d074080  CPU: 6   COMMAND: "udevd"
 #0 [ffff880028387e80] crash_nmi_callback at ffffffff8102e036
 #1 [ffff880028387e90] notifier_call_chain at ffffffff814ce4d5
 #2 [ffff880028387ed0] atomic_notifier_call_chain at ffffffff814ce53a
 #3 [ffff880028387ee0] notify_die at ffffffff81096fce
 #4 [ffff880028387f10] do_nmi at ffffffff814cc60c
 #5 [ffff880028387f50] nmi at ffffffff814cbf00
    [exception RIP: lock_kernel+46]
    RIP: ffffffff814cb9de  RSP: ffff88041d15bcc8  RFLAGS: 00000287
    RAX: 0000000000000000  RBX: ffff88042dad5758  RCX: 0000000000001e6b
    RDX: 0000000000001e71  RSI: ffff88041fceae00  RDI: ffff88042dad5758
    RBP: ffff88041d15bcc8  R8:  ffff88041d1ceb40  R9:  ffff88041e30fd80
    R10: ffff88042df24900  R11: 0000000000000002  R12: ffff88042dad5758
    R13: ffff88041fceae00  R14: ffff88041fceae00  R15: 0000000000000000
    ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
--- <NMI exception stack> ---
 #6 [ffff88041d15bcc8] lock_kernel at ffffffff814cb9de
 #7 [ffff88041d15bcd0] memory_open at ffffffff812f15d4
 #8 [ffff88041d15bd00] chrdev_open at ffffffff81171315
 #9 [ffff88041d15bd60] __dentry_open at ffffffff8116a590
#10 [ffff88041d15bdc0] nameidata_to_filp at ffffffff8116a907
#11 [ffff88041d15bde0] do_filp_open at ffffffff8117dafa
#12 [ffff88041d15bf20] do_sys_open at ffffffff8116a339
#13 [ffff88041d15bf70] sys_open at ffffffff8116a450
#14 [ffff88041d15bf80] system_call_fastpath at ffffffff81013172
    RIP: 0000003dfeed3fc0  RSP: 00007fffb93153a8  RFLAGS: 00010246
    RAX: 0000000000000002  RBX: ffffffff81013172  RCX: 0000003dfeed4150
    RDX: 0000000000000000  RSI: 0000000000000002  RDI: 000000000041d4a6
    RBP: 0000000000000000  R8:  000000000000070d  R9:  0000000000000000
    R10: 00007f926de14a70  R11: 0000000000000246  R12: ffffffff8116a450
    R13: ffff88041d15bf78  R14: 000000000164ddb0  R15: 0000000000000000
    ORIG_RAX: 0000000000000002  CS: 0033  SS: 002b

PID: 2337   TASK: ffff88041d20ab30  CPU: 7   COMMAND: "multipath"
 #0 [ffff8800283c7e80] crash_nmi_callback at ffffffff8102e036
 #1 [ffff8800283c7e90] notifier_call_chain at ffffffff814ce4d5
 #2 [ffff8800283c7ed0] atomic_notifier_call_chain at ffffffff814ce53a
 #3 [ffff8800283c7ee0] notify_die at ffffffff81096fce
 #4 [ffff8800283c7f10] do_nmi at ffffffff814cc60c
 #5 [ffff8800283c7f50] nmi at ffffffff814cbf00
    [exception RIP: lock_kernel+46]
    RIP: ffffffff814cb9de  RSP: ffff88041d141c98  RFLAGS: 00000283
    RAX: 0000000000000000  RBX: 0000000000a0003a  RCX: 0000000000001e6b
    RDX: 0000000000001e70  RSI: ffff88041d1ceb40  RDI: ffff880429b17758
    RBP: ffff88041d141c98  R8:  ffff88042ba79440  R9:  ffff88042b76a900
    R10: ffff88042df3c180  R11: 0000000000000002  R12: ffff88041d1ceb40
    R13: ffff880429b17758  R14: ffff88041d1ceb40  R15: 0000000000000000
    ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
--- <NMI exception stack> ---
 #6 [ffff88041d141c98] lock_kernel at ffffffff814cb9de
 #7 [ffff88041d141ca0] misc_open at ffffffff812ff424
 #8 [ffff88041d141d00] chrdev_open at ffffffff81171315
 #9 [ffff88041d141d60] __dentry_open at ffffffff8116a590
#10 [ffff88041d141dc0] nameidata_to_filp at ffffffff8116a907
#11 [ffff88041d141de0] do_filp_open at ffffffff8117dafa
#12 [ffff88041d141f20] do_sys_open at ffffffff8116a339
#13 [ffff88041d141f70] sys_open at ffffffff8116a450
#14 [ffff88041d141f80] system_call_fastpath at ffffffff81013172
    RIP: 0000003dff20ed30  RSP: 00007fffd15727c8  RFLAGS: 00010202
    RAX: 0000000000000002  RBX: ffffffff81013172  RCX: 0000003dfeed3ae5
    RDX: 0000000000000a3a  RSI: 0000000000000002  RDI: 00007fffd1572820
    RBP: 00007fffd1572820  R8:  00007f9ff8fc67a0  R9:  0000000000000004
    R10: fffffffffffff6c4  R11: 0000000000000246  R12: ffffffff8116a450
    R13: ffff88041d141f78  R14: 0000000000000000  R15: 0000000000000000
    ORIG_RAX: 0000000000000002  CS: 0033  SS: 002b
crash>
```
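Reading the traces together: the async/15 thread on CPU 5 took the BKL in __blkdev_put() and is stuck in the cross-CPU rendezvous underneath invalidate_bh_lrus(), while every other CPU spins in lock_kernel() behind it. A minimal sketch of that chain follows; the call chain itself is confirmed by the backtraces, but the bodies (and the sketch_* names) are illustrative, not the actual RHEL 6 source:

```c
/*
 * Sketch of the stuck path from the async/15 backtrace.  The chain
 * (__blkdev_put -> kill_bdev -> invalidate_bh_lrus -> on_each_cpu ->
 * smp_call_function_many) matches the trace; bodies are simplified.
 */
#include <linux/smp.h>       /* on_each_cpu() */
#include <linux/smp_lock.h>  /* lock_kernel()/unlock_kernel(), 2.6.32-era */

struct block_device;

static void invalidate_bh_lru_cb(void *arg)
{
	/* invalidate this CPU's buffer-head LRU; runs via IPI on each CPU */
}

static void sketch_invalidate_bh_lrus(void)
{
	/*
	 * Queue the callback on every online CPU, send IPIs, and spin
	 * until all of them acknowledge.  This is the
	 * smp_call_function_many() spin that never completes on CPU 5.
	 */
	on_each_cpu(invalidate_bh_lru_cb, NULL, 1);
}

static void sketch_blkdev_put(struct block_device *bdev)
{
	lock_kernel();                 /* the BKL: one lock for the whole box */
	sketch_invalidate_bh_lrus();   /* reached via kill_bdev(); hangs here */
	unlock_kernel();               /* never reached, so blkid, udevd,
	                                * multipath, and the other async scan
	                                * threads all spin in lock_kernel() */
}
```

In other words, the primary failure is the rendezvous that never finishes; the BKL merely amplifies it into a whole-system hang, which is why the box looks globally wedged instead of just losing one scan thread.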
(In reply to comment #4)
> Seeing the following traces on a system trying to modprobe lpfc.

Hi, Sumeet,

OK, so you tried to reproduce this on an 8-processor box, it seems. I looked at the vmcore, but the log was filled with what appears to be sysrq output (not entirely sure about that). Did you actually experience a soft lockup? It's not apparent from what you posted. Also, there are a lot of 'Device not ready' errors in the logs:

```
sd 3:0:0:14: [sdq] CDB: Read(10): 28 00 00 00 00 00 00 00 08 00
end_request: I/O error, dev sdq, sector 0
sd 3:0:0:14: [sdq] Device not ready
sd 3:0:0:14: [sdq] Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
sd 3:0:0:14: [sdq] Sense Key : Not Ready [current]
sd 3:0:0:14: [sdq] Add. Sense: Logical unit not ready, manual intervention required
```

They are mostly from sdq, though the following disks also show up in the logs (albeit with much lower frequency): sdh sdau sdbn sdbr sdcg sdcl sdcs sddo sddp sdcn sdbm

Are you sure this is a symptom of the same problem experienced by the reporter? How many HBAs were you testing with, and to how many logical volumes?

Could the reporter please attach full dmesg/console output, if available? I'm not sure why the problem is more readily reproducible with HT enabled. As far as I can tell, the mutex implementation should be immune to starvation. We could certainly backport the BKL removal patches in this area, but I'd like a better understanding of what's going on first.

So, I didn't realize Sumeet was not trying to reproduce this issue, and I managed to overlook some key aspects here. It's clear that the smp_call_function didn't finish, and that that is the code path holding the BKL. I'm not sure why the smp_call_function is hung. The running cpu is cpu 2. Examining its call_function_data shows that all cpus have cleared their bits, and the lock is not held.

```
crash> p per_cpu__cfd_data
PER-CPU DATA TYPE:
  struct call_function_data per_cpu__cfd_data;
PER-CPU ADDRESSES:
  [0]: ffff8800282120a0
  [1]: ffff8800282520a0
  [2]: ffff8800282920a0    <---- cpu2
  [3]: ffff8800282d20a0
  [4]: ffff8800283120a0
  [5]: ffff8800283520a0
  [6]: ffff8800283920a0
  [7]: ffff8800283d20a0
crash> call_function_data ffff8800282920a0
struct call_function_data {
  csd = {
    list = {
      next = 0xffffffff817311c0,
      prev = 0xdead000000200200
    },
    func = 0xffffffff8119dcb0 <invalidate_bh_lru>,
    info = 0x0,
    flags = 0,                 <--- CSD_LOCK not set
    priv = 0
  },
  refs = {
    counter = 0
  },
  cpumask = 0xffff88042e1fda00
}
crash> ptype cpu_online_mask
type = const struct cpumask {
    long unsigned int bits[64];
} * const
crash> p cpu_online_mask
cpu_online_mask = $4 = (const struct cpumask * const) 0xffffffff818a3ba0
crash> rd -x ffffffff818a3ba0
ffffffff818a3ba0:  00000000000000ff
crash> rd -x 0xffff88042e1fda00    <--- cpumask embedded in the percpu data
ffff88042e1fda00:  0000000000000000
```

Don, any ideas how this might happen? Just to be certain I hadn't mismapped the cpu into the per-cpu array, I also checked all of the other per-cpu objects, and they had the exact same state.

Sumeet, how long was the system stuck like this? Is it responsive otherwise?

Jeff, the system hangs immediately after invoking modprobe lpfc. After this command, the shell never returns the prompt. The system never returns to the normal state; even after waiting for several hours it remains unresponsive.

Sumeet, is this something that's easily reproducible? Would the customer be willing to try a kernel with a couple of patches to the smp_call_function path, to see if it resolves the problem?

Jeff, the problem is easily reproducible on the customer setup. I am sure the customer will be ready to try a test kernel; could you please provide one? Thanks, Sumeet

Sumeet, please bring those device errors to the customer's attention. Also, I'm still waiting for an answer as to how many HBAs and how many LUNs are in this configuration.

Hi, Peter,

We have a customer system with a lot of storage attached. During module load, the system load average climbs to over 130 and the system becomes very sluggish (see comment #34). Most of the activity is related to setting up the sd devices using udev (it will call blkid for every partition, for example). I'm not sure what to expect from the system when this happens. It sounds like a case of "don't do that." However, I'd like an official story from a scheduler person on what the expected behavior is.

I'll attach the sysrq-t output.

Created attachment 488155 [details]
sysrq-t output during high load
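For reference, since nearly every CPU in the traces is spinning in lock_kernel(): the 2.6.32-era BKL is essentially one global spinlock with a per-task recursion depth, so the blkdev get/put, chrdev_open, memory_open, and misc_open paths above all contend on the same lock. A simplified sketch, modeled on that era's lib/kernel_lock.c (approximate, not verbatim; the sketch_* names are ours):

```c
#include <linux/sched.h>     /* current, task_struct.lock_depth */
#include <linux/spinlock.h>

/* One lock for the whole kernel -- every lock_kernel() caller in the
 * bt/sysrq-t output serializes on this single cacheline. */
static DEFINE_SPINLOCK(kernel_flag);

void sketch_lock_kernel(void)
{
	int depth = current->lock_depth + 1;

	if (likely(!depth))
		spin_lock(&kernel_flag);   /* the spin seen at lock_kernel+46 */
	current->lock_depth = depth;       /* the BKL recurses per task */
}

void sketch_unlock_kernel(void)
{
	if (likely(--current->lock_depth < 0))
		spin_unlock(&kernel_flag);
}
```

With on the order of 120 udev/blkid workers each doing open()/close() on block and character devices, that one lock becomes the whole system's bottleneck, which is what the comment below about BKL-heavy blkid paths is getting at.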
There were a number of fixes to kernel/smp.c recently, fixing a number of races in that code; one, I think, looks very much like the problem reported here. I would suggest backporting all patches touching this file. In particular, see commit 723aae2, but the other fixes there are very much needed too.

We already did backport a number of fixes (5, I think) for smp_call_function and friends. That did fix the hung system and got rid of most of the soft lockups (there's still one left that I saw). The question I had for you was: in the presence of 120 runnable processes performing blkid and the like, would you expect the system to be responsive to interactive processes? I'm sure it's not an easy question to answer.

OK, I missed the smp_call_function fixup there. Right, hard question; it very much depends on what those processes do and what the state of the BKL mess is in that kernel. From what I could see of the backtraces in this BZ, the blkid stuff is BKL-heavy and will thus serialize all 120 processes. If, say, the TTY subsystem is also still a BKL user (it used to be for a long while; I'm not quite sure of the RHEL 6 status, but a git grep shows lots of lock_kernel in drivers/char/), then anything tty-related will also grind to a halt.

Since RHEL 6.1 External Beta has begun, and this bug remains unresolved, it has been rejected as it is not proposed as an exception or blocker. Red Hat invites you to ask your support representative to propose this request, if appropriate and relevant, in the next release of Red Hat Enterprise Linux.

(In reply to comment #37)
> Hi, Peter,
>
> We have a customer system with a lot of storage attached. During module load,
> the system load average climbs to over 130 and becomes very sluggish (see
> comment #34). Most of the activity is related to setting up the sd devices
> using udev (it will call blkid for every partition, for example). I'm not sure
> what to expect from the system when this happens. It sounds like a case of,
> "don't do that." However, I'd like an official story from a scheduler person
> on what the expected behavior is.
>
> I'll attach the sysrq-t output.

I looked into this today. It appears that the lpfc driver is responsible for the sluggishness of loading. In a host with 100 LUNs configured, in a direct comparison between rhel5.6 and rhel6.1-snap4, the rhel6 load took on the order of 20 seconds, where the rhel5.6 load took on the order of 2 seconds. Repeating the experiment with a qlogic adapter, the qlogic adapter loaded in a few seconds under both scenarios. I have opened a new bug on this specific issue and will get Emulex involved directly: https://bugzilla.redhat.com/show_bug.cgi?id=698799

This regressed from RHEL 6.0. We have a workaround, but how did we get into a regression?

James, this bugzilla has been dealing with a problem introduced in RHEL 6.0. If you see a regression from 6.0 -> 6.1, then I think it's a different issue. Could you please elaborate on what you have seen?

Jeff, I think I might be commenting in the wrong bug. Let me know. If we're dealing with the specific issue of the lpfc module using the BKL, then I have to get on to another bug. I was speaking to the fact that the emc scsi device handler is loaded in 6.0 and is _not_ in 6.1, which causes all of the blocking on the BKL.

Hi, James, this specific bug addresses race conditions in smp_call_function that result in a hung system. There are separate bugs to address the load order of the modules (scsi_dh_emc, multipath, lpfc) and to address the regression in driver load times for lpfc. So, in short, I agree: you are updating the wrong bug. ;-)
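Tying the smp_call_function discussion together: the caller-side wait that CPU 5 never gets out of looks roughly like the sketch below. The field and flag names match the crash output earlier in this bug; the surrounding code is a from-memory simplification of 2.6.32 kernel/smp.c, not verbatim source, and the race described is the class addressed by the upstream fixes mentioned above.

```c
#include <linux/list.h>

#define CSD_FLAG_LOCK	0x04	/* value as recalled for 2.6.32; approximate */

struct call_single_data {
	struct list_head list;
	void (*func)(void *info);
	void *info;
	unsigned int flags;
};

static void csd_lock_wait(struct call_single_data *data)
{
	/* spin until the last answering CPU clears CSD_FLAG_LOCK */
	while (data->flags & CSD_FLAG_LOCK)
		cpu_relax();
}

/*
 * In the vmcore, every per-cpu cfd_data shows flags == 0 and
 * refs.counter == 0, yet CPU 5 is still inside this loop: the
 * answering CPUs finished, but the caller never observed it.
 * Missing memory ordering and concurrent reuse/clearing of the
 * cfd slot are exactly what the upstream kernel/smp.c series
 * (including commit 723aae2) addresses.
 */
```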
Okay, moved to bug 690523.

Patch(es) available on kernel-2.6.32-151.el6.

This request was evaluated by Red Hat Product Management for inclusion in a Red Hat Enterprise Linux maintenance release. Product Management has requested further review of this request by Red Hat Engineering, for potential inclusion in a Red Hat Enterprise Linux Update release for currently deployed products. This request is not yet committed for inclusion in an Update release.

We cannot find a server with 128 CPUs (HT) and an HBA connected within Red Hat. Can we find a partner or customer for testing this bug? Let me know if we do have devices capable of testing it. Thanks.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHSA-2011-1530.html