Bug 822088
| Summary: | [libfc] WARNING: at drivers/scsi/libfc/fc_exch.c:478 fc_seq_send+0x14f/0x160 [libfc]() (Not tainted) | ||||||||
|---|---|---|---|---|---|---|---|---|---|
| Product: | Red Hat Enterprise Linux 6 | Reporter: | Gris Ge <fge> | ||||||
| Component: | kernel | Assignee: | Neil Horman <nhorman> | ||||||
| Status: | CLOSED DUPLICATE | QA Contact: | Storage QE <storage-qe> | ||||||
| Severity: | low | Docs Contact: | |||||||
| Priority: | unspecified | ||||||||
| Version: | 6.3 | ||||||||
| Target Milestone: | rc | ||||||||
| Target Release: | --- | ||||||||
| Hardware: | x86_64 | ||||||||
| OS: | Linux | ||||||||
| Whiteboard: | |||||||||
| Fixed In Version: | Doc Type: | Bug Fix | |||||||
| Doc Text: | Story Points: | --- | |||||||
| Clone Of: | Environment: | ||||||||
| Last Closed: | 2013-05-11 01:10:29 UTC | Type: | Bug | ||||||
| Regression: | --- | Mount Type: | --- | ||||||
| Documentation: | --- | CRM: | |||||||
| Verified Versions: | Category: | --- | |||||||
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |||||||
| Cloudforms Team: | --- | Target Upstream Version: | |||||||
| Embargoed: | |||||||||
| Attachments: |
|
||||||||
|
Description
Gris Ge
2012-05-16 10:08:53 UTC
This request was not resolved in time for the current release. Red Hat invites you to ask your support representative to propose this request, if still desired, for consideration in the next release of Red Hat Enterprise Linux. This request was erroneously removed from consideration in Red Hat Enterprise Linux 6.4, which is currently under development. This request will be evaluated for inclusion in Red Hat Enterprise Linux 6.4. hmm, looks like fc_seq_send is checking esb_stat without the ep->ex_lock held. It looks to me like the ex_lock should probably be held for at least part of fc_seq_send. I'll have a build going for you shortly. http://brewweb.devel.redhat.com/brew/taskinfo?taskID=5133861 Theres a build for you Gris, please test it out asap and confirm that it solves the reported problem. This is going to need to go upstream too, so given the lateness of the 6.4 window, expidite this if you could please. Thank you. With the build you povide, server stuck at boot time when start fcoe daemon: ====== BUG: soft lockup - CPU#0 stuck for 67s! [fc_rport_eq:2823] Modules linked in: dm_round_robin tcm_fc target_core_iblock target_core_file target_core_pscsi target_core_mod configfs bnx2fc cnic uio fcoe libfcoe 8021q libfc garp stp scsi_transport_fc llc scsi_tgt sunrpc pcc_cpufreq ipv6 hpilo hpwdt e1000e microcode iTCO_wdt iTCO_vendor_support serio_raw shpchp ixgbe dca ptp pps_core mdio sg ext4 mbcache jbd2 sd_mod crc_t10dif pata_acpi ata_generic ata_piix hpsa dm_multipath dm_mirror dm_region_hash dm_log dm_mod [last unloaded: scsi_wait_scan] CPU 0 Modules linked in: dm_round_robin tcm_fc target_core_iblock target_core_file target_core_pscsi target_core_mod configfs bnx2fc cnic uio fcoe libfcoe 8021q libfc garp stp scsi_transport_fc llc scsi_tgt sunrpc pcc_cpufreq ipv6 hpilo hpwdt e1000e microcode iTCO_wdt iTCO_vendor_support serio_raw shpchp ixgbe dca ptp pps_core mdio sg ext4 mbcache jbd2 sd_mod crc_t10dif pata_acpi ata_generic ata_piix hpsa dm_multipath dm_mirror dm_region_hash dm_log dm_mod [last unloaded: scsi_wait_scan] Pid: 2823, comm: fc_rport_eq Not tainted 2.6.32-344.el6.test.x86_64 #1 HP ProLiant DL120 G7 RIP: 0010:[<ffffffff8150f66b>] [<ffffffff8150f66b>] _spin_lock_bh+0x2b/0x40 RSP: 0018:ffff880203ca5c80 EFLAGS: 00000297 RAX: 0000000000000002 RBX: ffff880203ca5c90 RCX: 0000000000007500 RDX: 0000000000000001 RSI: ffff8801f6a6d150 RDI: ffff8801f6a6d0a8 RBP: ffffffff8100bb8e R08: 0000000000000001 R09: 0000000000000000 R10: 00000000000001e0 R11: 0000000000000000 R12: ffff880203ca5c10 R13: ffffffff8136458b R14: 003135323a313532 R15: 0000000000000000 FS: 0000000000000000(0000) GS:ffff88002fa00000(0000) knlGS:0000000000000000 CS: 0010 DS: 0018 ES: 0018 CR0: 000000008005003b CR2: 00007f3b6bfff000 CR3: 000000020634b000 CR4: 00000000000407f0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400 Process fc_rport_eq (pid: 2823, threadinfo ffff880203ca4000, task ffff8802030a4080) Stack: 00000000ffffffff ffff8801f6a6d080 ffff880203ca5cf0 ffffffffa029b5fb <d> 0000000000000000 ffff8801beeef840 0000000000000018 ffff8802030a15e0 <d> ffff880203ca5ce0 ffff8801f6a6d080 ffff8801bef65d80 ffff8801f6a6d150 Call Trace: [<ffffffffa029b5fb>] ? fc_seq_send+0x5b/0x160 [libfc] [<ffffffffa029b836>] ? fc_exch_abort_locked+0x136/0x1c0 [libfc] [<ffffffffa029ba4f>] ? fc_exch_mgr_reset+0x12f/0x260 [libfc] [<ffffffffa02a3e10>] ? fc_rport_work+0x120/0x5f0 [libfc] [<ffffffff81096d8e>] ? prepare_to_wait+0x4e/0x80 [<ffffffffa02a3cf0>] ? fc_rport_work+0x0/0x5f0 [libfc] [<ffffffff810908e0>] ? worker_thread+0x170/0x2a0 [<ffffffff81096aa0>] ? autoremove_wake_function+0x0/0x40 [<ffffffff81090770>] ? worker_thread+0x0/0x2a0 [<ffffffff81096736>] ? kthread+0x96/0xa0 [<ffffffff8100c0ca>] ? child_rip+0xa/0x20 [<ffffffff810966a0>] ? kthread+0x0/0xa0 [<ffffffff8100c0c0>] ? child_rip+0x0/0x20 Code: 55 48 89 e5 53 48 83 ec 08 0f 1f 44 00 00 48 89 fb e8 3a 77 b6 ff b8 00 00 01 00 f0 0f c1 03 0f b7 d0 c1 e8 10 39 c2 74 0e f3 90 <0f> b7 13 eb f5 83 3b 00 75 f4 eb df 48 83 c4 08 5b c9 c3 66 90 ====== The server did not allow me login via ssh or console, no idea where server stuck. You might need to postpone the upstream submittion. Created attachment 653689 [details]
[PATCH] libfc: extend ex_lock to protect all of fc_seq_send
This warning was reported recently:
WARNING: at drivers/scsi/libfc/fc_exch.c:478 fc_seq_send+0x14f/0x160 [libfc]()
(Not tainted)
Hardware name: ProLiant DL120 G7
Modules linked in: tcm_fc target_core_iblock target_core_file target_core_pscsi
target_core_mod configfs dm_round_robin dm_multipath 8021q garp stp llc bnx2fc
cnic uio fcoe libfcoe libfc scsi_transport_fc scsi_tgt autofs4 sunrpc
pcc_cpufreq ipv6 hpilo hpwdt e1000e microcode iTCO_wdt iTCO_vendor_support
serio_raw shpchp ixgbe dca mdio sg ext4 mbcache jbd2 sd_mod crc_t10dif pata_acpi
ata_generic ata_piix hpsa dm_mirror dm_region_hash dm_log dm_mod [last unloaded:
scsi_wait_scan]
Pid: 5464, comm: target_completi Not tainted 2.6.32-272.el6.x86_64 #1
Call Trace:
[<ffffffff8106b747>] ? warn_slowpath_common+0x87/0xc0
[<ffffffff8106b79a>] ? warn_slowpath_null+0x1a/0x20
[<ffffffffa025f7df>] ? fc_seq_send+0x14f/0x160 [libfc]
[<ffffffffa035cbce>] ? ft_queue_status+0x16e/0x210 [tcm_fc]
[<ffffffffa030a660>] ? target_complete_ok_work+0x0/0x4b0 [target_core_mod]
[<ffffffffa030a766>] ? target_complete_ok_work+0x106/0x4b0 [target_core_mod]
[<ffffffffa030a660>] ? target_complete_ok_work+0x0/0x4b0 [target_core_mod]
[<ffffffff8108c760>] ? worker_thread+0x170/0x2a0
[<ffffffff810920d0>] ? autoremove_wake_function+0x0/0x40
[<ffffffff8108c5f0>] ? worker_thread+0x0/0x2a0
[<ffffffff81091d66>] ? kthread+0x96/0xa0
[<ffffffff8100c14a>] ? child_rip+0xa/0x20
[<ffffffff81091cd0>] ? kthread+0x0/0xa0
[<ffffffff8100c140>] ? child_rip+0x0/0x20
It occurs because fc_seq_send can have multiple contexts executing within it at
the same time, and fc_seq_send doesn't consistently use the ep->ex_lock that
protects this structure. Because of that, its possible for one context to clear
the INIT bit in the ep->esb_state field while another checks it, leading to the
above stack trace generated by the WARN_ON in the function.
We should probably undertake the effort to convert access to the fc_exch
structures to use rcu, but that a larger work item. To just fix this specific
issue, we can just extend the ex_lock protection through the entire fc_seq_send
path
Signed-off-by: Neil Horman <nhorman>
---
drivers/scsi/libfc/fc_exch.c | 39 +++++++++++++++++++++++++--------------
1 file changed, 25 insertions(+), 14 deletions(-)
sorry, I did something dumb, and didn't check all the lock paths leading to fc_seq_send. Heres a new build: http://brewweb.devel.redhat.com/brew/taskinfo?taskID=5140784 with the patch from comment 6, I've smoke tested it on my devel system for fcoe, and it works, please confirm that it fixes your bug This request was not resolved in time for the current release. Red Hat invites you to ask your support representative to propose this request, if still desired, for consideration in the next release of Red Hat Enterprise Linux. gris, I think this still needs to be fixed, but I'd really like to get confirmation from your testing before I go upstream with it. Sorry for missed your previous bugzilla comments. The brew build expired. I build a kernel with patch you provided. http://pnate-control-01.lab.bos.redhat.com/stqe/tmp/BZ_822088/kernel-2.6.32-358.el6.bz822088.src.rpm It keep warning in console once the I/O started. ==== BUG: scheduling while atomic: fcoethread/3/1945/0x0010da00 Modules linked in: dm_round_robin scsi_dh_emc bnx2fc cnic uio fcoe libfcoe libfc 8021q scsi_transport_fc garp stp scsi_tgt llc sunrpc cpufreq_ondemand freq_table pcc_cpufreq ipv6 hpilo hpwdt e1000e microcode serio_raw iTCO_wdt iTCO_vendor_support shpchp ixgbe dca ptp pps_core mdio sg ext4 mbcache jbd2 sd_mod crc_t10dif pata_acpi ata_generic ata_piix hpsa dm_multipath dm_mirror dm_region_hash dm_log dm_mod [last unloaded: scsi_wait_scan] Pid: 1945, comm: fcoethread/3 Tainted: G W --------------- 2.6.32-358.el6.bz822088.x86_64 #1 Call Trace: [<ffffffff8105ab36>] ? __schedule_bug+0x66/0x70 [<ffffffff8150dce0>] ? thread_return+0x64e/0x76e [<ffffffffa0286ae0>] ? fc_exch_find+0x90/0xa0 [libfc] [<ffffffffa02bc243>] ? fcoe_percpu_receive_thread+0x2d3/0x530 [fcoe] [<ffffffff8150d6e0>] ? thread_return+0x4e/0x76e [<ffffffffa02bbf70>] ? fcoe_percpu_receive_thread+0x0/0x530 [fcoe] [<ffffffff81096916>] ? kthread+0x96/0xa0 [<ffffffff8100c0ca>] ? child_rip+0xa/0x20 [<ffffffff81096880>] ? kthread+0x0/0xa0 [<ffffffff8100c0c0>] ? child_rip+0x0/0x20 BUG: scheduling while atomic: fcoethread/2/1944/0x000e7600 Modules linked in: dm_round_robin scsi_dh_emc bnx2fc cnic uio fcoe libfcoe libfc 8021q scsi_transport_fc garp stp scsi_tgt llc sunrpc cpufreq_ondemand freq_table pcc_cpufreq ipv6 hpilo hpwdt e1000e microcode serio_raw iTCO_wdt iTCO_vendor_support shpchp ixgbe dca ptp pps_core mdio sg ext4 mbcache jbd2 sd_mod crc_t10dif pata_acpi ata_generic ata_piix hpsa dm_multipath dm_mirror dm_region_hash dm_log dm_mod [last unloaded: scsi_wait_scan] Pid: 1944, comm: fcoethread/2 Tainted: G W --------------- 2.6.32-358.el6.bz822088.x86_64 #1 Call Trace: [<ffffffff8105ab36>] ? __schedule_bug+0x66/0x70 [<ffffffff8150dce0>] ? thread_return+0x64e/0x76e [<ffffffffa0286ae0>] ? fc_exch_find+0x90/0xa0 [libfc] [<ffffffffa02bc243>] ? fcoe_percpu_receive_thread+0x2d3/0x530 [fcoe] [<ffffffff8150d6e0>] ? thread_return+0x4e/0x76e [<ffffffffa02bbf70>] ? fcoe_percpu_receive_thread+0x0/0x530 [fcoe] [<ffffffff81096916>] ? kthread+0x96/0xa0 [<ffffffff8100c0ca>] ? child_rip+0xa/0x20 [<ffffffff81096880>] ? kthread+0x0/0xa0 [<ffffffff8100c0c0>] ? child_rip+0x0/0x20 BUG: scheduling while atomic: fcoethread/3/1945/0x0010da00 ==== Thats a different problem. I'll look into it, but we may need a separate bug for that. I see the problem, I'll have another build for you shortly Created attachment 741991 [details]
[PATCH] libfc: extend ex_lock to protect all of fc_seq_send
This warning was reported recently:
WARNING: at drivers/scsi/libfc/fc_exch.c:478 fc_seq_send+0x14f/0x160 [libfc]()
(Not tainted)
Hardware name: ProLiant DL120 G7
Modules linked in: tcm_fc target_core_iblock target_core_file target_core_pscsi
target_core_mod configfs dm_round_robin dm_multipath 8021q garp stp llc bnx2fc
cnic uio fcoe libfcoe libfc scsi_transport_fc scsi_tgt autofs4 sunrpc
pcc_cpufreq ipv6 hpilo hpwdt e1000e microcode iTCO_wdt iTCO_vendor_support
serio_raw shpchp ixgbe dca mdio sg ext4 mbcache jbd2 sd_mod crc_t10dif pata_acpi
ata_generic ata_piix hpsa dm_mirror dm_region_hash dm_log dm_mod [last unloaded:
scsi_wait_scan]
Pid: 5464, comm: target_completi Not tainted 2.6.32-272.el6.x86_64 #1
Call Trace:
[<ffffffff8106b747>] ? warn_slowpath_common+0x87/0xc0
[<ffffffff8106b79a>] ? warn_slowpath_null+0x1a/0x20
[<ffffffffa025f7df>] ? fc_seq_send+0x14f/0x160 [libfc]
[<ffffffffa035cbce>] ? ft_queue_status+0x16e/0x210 [tcm_fc]
[<ffffffffa030a660>] ? target_complete_ok_work+0x0/0x4b0 [target_core_mod]
[<ffffffffa030a766>] ? target_complete_ok_work+0x106/0x4b0 [target_core_mod]
[<ffffffffa030a660>] ? target_complete_ok_work+0x0/0x4b0 [target_core_mod]
[<ffffffff8108c760>] ? worker_thread+0x170/0x2a0
[<ffffffff810920d0>] ? autoremove_wake_function+0x0/0x40
[<ffffffff8108c5f0>] ? worker_thread+0x0/0x2a0
[<ffffffff81091d66>] ? kthread+0x96/0xa0
[<ffffffff8100c14a>] ? child_rip+0xa/0x20
[<ffffffff81091cd0>] ? kthread+0x0/0xa0
[<ffffffff8100c140>] ? child_rip+0x0/0x20
It occurs because fc_seq_send can have multiple contexts executing within it at
the same time, and fc_seq_send doesn't consistently use the ep->ex_lock that
protects this structure. Because of that, its possible for one context to clear
the INIT bit in the ep->esb_state field while another checks it, leading to the
above stack trace generated by the WARN_ON in the function.
We should probably undertake the effort to convert access to the fc_exch
structures to use rcu, but that a larger work item. To just fix this specific
issue, we can just extend the ex_lock protection through the entire fc_seq_send
path
Signed-off-by: Neil Horman <nhorman>
---
drivers/scsi/libfc/fc_exch.c | 37 ++++++++++++++++++++++++-------------
1 file changed, 24 insertions(+), 13 deletions(-)
http://brewweb.devel.redhat.com/brew/taskinfo?taskID=5705908 There you go, that should solve the above problem. Neil Horman, Good work. 4 hours I/O stress with link failover provide no kernel call trace on kernel-2.6.32-344.el6.test.x86_64. Thanks. Thanks, I'll send this patch upstream today, and backport ASAP. sent the patch upstream, waiting on integration This has been accepted upstream, I'll backport this as part of the fcoe overall backport (Bz 829784) *** This bug has been marked as a duplicate of bug 829784 *** |