Bug 822088 - [libfc] WARNING: at drivers/scsi/libfc/fc_exch.c:478 fc_seq_send+0x14f/0x160 [libfc]() (Not tainted)
[libfc] WARNING: at drivers/scsi/libfc/fc_exch.c:478 fc_seq_send+0x14f/0x160 ...
Status: CLOSED DUPLICATE of bug 829784
Product: Red Hat Enterprise Linux 6
Classification: Red Hat
Component: kernel (Show other bugs)
6.3
x86_64 Linux
unspecified Severity low
: rc
: ---
Assigned To: Neil Horman
Storage QE
:
Depends On:
Blocks:
  Show dependency treegraph
 
Reported: 2012-05-16 06:08 EDT by Gris Ge
Modified: 2013-05-10 21:10 EDT (History)
0 users

See Also:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2013-05-10 21:10:29 EDT
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---


Attachments (Terms of Use)
[PATCH] libfc: extend ex_lock to protect all of fc_seq_send (5.54 KB, patch)
2012-11-28 14:13 EST, Neil Horman
no flags Details | Diff
[PATCH] libfc: extend ex_lock to protect all of fc_seq_send (5.51 KB, patch)
2013-04-30 16:38 EDT, Neil Horman
no flags Details | Diff

  None (edit)
Description Gris Ge 2012-05-16 06:08:53 EDT
Description of problem:
------------[ cut here ]------------
WARNING: at drivers/scsi/libfc/fc_exch.c:478 fc_seq_send+0x14f/0x160 [libfc]() (Not tainted)
Hardware name: ProLiant DL120 G7
Modules linked in: tcm_fc target_core_iblock target_core_file target_core_pscsi target_core_mod configfs dm_round_robin dm_multipath 8021q garp stp llc bnx2fc cnic uio fcoe libfcoe libfc scsi_transport_fc scsi_tgt autofs4 sunrpc pcc_cpufreq ipv6 hpilo hpwdt e1000e microcode iTCO_wdt iTCO_vendor_support serio_raw shpchp ixgbe dca mdio sg ext4 mbcache jbd2 sd_mod crc_t10dif pata_acpi ata_generic ata_piix hpsa dm_mirror dm_region_hash dm_log dm_mod [last unloaded: scsi_wait_scan]
Pid: 5464, comm: target_completi Not tainted 2.6.32-272.el6.x86_64 #1
Call Trace:
 [<ffffffff8106b747>] ? warn_slowpath_common+0x87/0xc0
 [<ffffffff8106b79a>] ? warn_slowpath_null+0x1a/0x20
 [<ffffffffa025f7df>] ? fc_seq_send+0x14f/0x160 [libfc]
 [<ffffffffa035cbce>] ? ft_queue_status+0x16e/0x210 [tcm_fc]
 [<ffffffffa030a660>] ? target_complete_ok_work+0x0/0x4b0 [target_core_mod]
 [<ffffffffa030a766>] ? target_complete_ok_work+0x106/0x4b0 [target_core_mod]
 [<ffffffffa030a660>] ? target_complete_ok_work+0x0/0x4b0 [target_core_mod]
 [<ffffffff8108c760>] ? worker_thread+0x170/0x2a0
 [<ffffffff810920d0>] ? autoremove_wake_function+0x0/0x40
 [<ffffffff8108c5f0>] ? worker_thread+0x0/0x2a0
 [<ffffffff81091d66>] ? kthread+0x96/0xa0
 [<ffffffff8100c14a>] ? child_rip+0xa/0x20
 [<ffffffff81091cd0>] ? kthread+0x0/0xa0
 [<ffffffff8100c140>] ? child_rip+0x0/0x20
---[ end trace 820415b6ee9aa881 ]---

Got these warnings when performing I/O stress for fcoe-target.

Version-Release number of selected component (if applicable):
kernel -272

How reproducible:
no sure.

Steps to Reproduce:
1. Setup FCoE target.
2. Run 16 thread I/O stress on FCoE initiator.
3.
  
Actual results:
kernel warning.

Expected results:
no warning.

Additional info:
Comment 1 RHEL Product and Program Management 2012-07-10 04:29:05 EDT
This request was not resolved in time for the current release.
Red Hat invites you to ask your support representative to
propose this request, if still desired, for consideration in
the next release of Red Hat Enterprise Linux.
Comment 2 RHEL Product and Program Management 2012-07-10 19:41:46 EDT
This request was erroneously removed from consideration in Red Hat Enterprise Linux 6.4, which is currently under development.  This request will be evaluated for inclusion in Red Hat Enterprise Linux 6.4.
Comment 3 Neil Horman 2012-11-26 16:34:32 EST
hmm, looks like fc_seq_send is checking esb_stat without the ep->ex_lock held.  It looks to me like the ex_lock should probably be held for at least part of fc_seq_send. I'll have a build going for you shortly.
Comment 4 Neil Horman 2012-11-27 15:31:09 EST
http://brewweb.devel.redhat.com/brew/taskinfo?taskID=5133861

Theres a build for you Gris, please test it out asap and confirm that it solves the reported problem.  This is going to need to go upstream too, so given the lateness of the 6.4 window, expidite this if you could please.  Thank you.
Comment 5 Gris Ge 2012-11-28 00:59:00 EST
With the build you povide, server stuck at boot time when start fcoe daemon:
======
BUG: soft lockup - CPU#0 stuck for 67s! [fc_rport_eq:2823]
Modules linked in: dm_round_robin tcm_fc target_core_iblock target_core_file target_core_pscsi target_core_mod configfs bnx2fc cnic uio fcoe libfcoe 8021q libfc garp stp scsi_transport_fc llc scsi_tgt sunrpc pcc_cpufreq ipv6 hpilo hpwdt e1000e microcode iTCO_wdt iTCO_vendor_support serio_raw shpchp ixgbe dca ptp pps_core mdio sg ext4 mbcache jbd2 sd_mod crc_t10dif pata_acpi ata_generic ata_piix hpsa dm_multipath dm_mirror dm_region_hash dm_log dm_mod [last unloaded: scsi_wait_scan]
CPU 0 
Modules linked in: dm_round_robin tcm_fc target_core_iblock target_core_file target_core_pscsi target_core_mod configfs bnx2fc cnic uio fcoe libfcoe 8021q libfc garp stp scsi_transport_fc llc scsi_tgt sunrpc pcc_cpufreq ipv6 hpilo hpwdt e1000e microcode iTCO_wdt iTCO_vendor_support serio_raw shpchp ixgbe dca ptp pps_core mdio sg ext4 mbcache jbd2 sd_mod crc_t10dif pata_acpi ata_generic ata_piix hpsa dm_multipath dm_mirror dm_region_hash dm_log dm_mod [last unloaded: scsi_wait_scan]

Pid: 2823, comm: fc_rport_eq Not tainted 2.6.32-344.el6.test.x86_64 #1 HP ProLiant DL120 G7
RIP: 0010:[<ffffffff8150f66b>]  [<ffffffff8150f66b>] _spin_lock_bh+0x2b/0x40
RSP: 0018:ffff880203ca5c80  EFLAGS: 00000297
RAX: 0000000000000002 RBX: ffff880203ca5c90 RCX: 0000000000007500
RDX: 0000000000000001 RSI: ffff8801f6a6d150 RDI: ffff8801f6a6d0a8
RBP: ffffffff8100bb8e R08: 0000000000000001 R09: 0000000000000000
R10: 00000000000001e0 R11: 0000000000000000 R12: ffff880203ca5c10
R13: ffffffff8136458b R14: 003135323a313532 R15: 0000000000000000
FS:  0000000000000000(0000) GS:ffff88002fa00000(0000) knlGS:0000000000000000
CS:  0010 DS: 0018 ES: 0018 CR0: 000000008005003b
CR2: 00007f3b6bfff000 CR3: 000000020634b000 CR4: 00000000000407f0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Process fc_rport_eq (pid: 2823, threadinfo ffff880203ca4000, task ffff8802030a4080)
Stack:
 00000000ffffffff ffff8801f6a6d080 ffff880203ca5cf0 ffffffffa029b5fb
<d> 0000000000000000 ffff8801beeef840 0000000000000018 ffff8802030a15e0
<d> ffff880203ca5ce0 ffff8801f6a6d080 ffff8801bef65d80 ffff8801f6a6d150
Call Trace:
 [<ffffffffa029b5fb>] ? fc_seq_send+0x5b/0x160 [libfc]
 [<ffffffffa029b836>] ? fc_exch_abort_locked+0x136/0x1c0 [libfc]
 [<ffffffffa029ba4f>] ? fc_exch_mgr_reset+0x12f/0x260 [libfc]
 [<ffffffffa02a3e10>] ? fc_rport_work+0x120/0x5f0 [libfc]
 [<ffffffff81096d8e>] ? prepare_to_wait+0x4e/0x80
 [<ffffffffa02a3cf0>] ? fc_rport_work+0x0/0x5f0 [libfc]
 [<ffffffff810908e0>] ? worker_thread+0x170/0x2a0
 [<ffffffff81096aa0>] ? autoremove_wake_function+0x0/0x40
 [<ffffffff81090770>] ? worker_thread+0x0/0x2a0
 [<ffffffff81096736>] ? kthread+0x96/0xa0
 [<ffffffff8100c0ca>] ? child_rip+0xa/0x20
 [<ffffffff810966a0>] ? kthread+0x0/0xa0
 [<ffffffff8100c0c0>] ? child_rip+0x0/0x20
Code: 55 48 89 e5 53 48 83 ec 08 0f 1f 44 00 00 48 89 fb e8 3a 77 b6 ff b8 00 00 01 00 f0 0f c1 03 0f b7 d0 c1 e8 10 39 c2 74 0e f3 90 <0f> b7 13 eb f5 83 3b 00 75 f4 eb df 48 83 c4 08 5b c9 c3 66 90
======

The server did not allow me login via ssh or console, no idea where server stuck.

You might need to postpone the upstream submittion.
Comment 6 Neil Horman 2012-11-28 14:13:17 EST
Created attachment 653689 [details]
[PATCH] libfc: extend ex_lock to protect all of fc_seq_send


This warning was reported recently:

WARNING: at drivers/scsi/libfc/fc_exch.c:478 fc_seq_send+0x14f/0x160 [libfc]()
(Not tainted)
Hardware name: ProLiant DL120 G7
Modules linked in: tcm_fc target_core_iblock target_core_file target_core_pscsi
target_core_mod configfs dm_round_robin dm_multipath 8021q garp stp llc bnx2fc
cnic uio fcoe libfcoe libfc scsi_transport_fc scsi_tgt autofs4 sunrpc
pcc_cpufreq ipv6 hpilo hpwdt e1000e microcode iTCO_wdt iTCO_vendor_support
serio_raw shpchp ixgbe dca mdio sg ext4 mbcache jbd2 sd_mod crc_t10dif pata_acpi
ata_generic ata_piix hpsa dm_mirror dm_region_hash dm_log dm_mod [last unloaded:
scsi_wait_scan]
Pid: 5464, comm: target_completi Not tainted 2.6.32-272.el6.x86_64 #1
Call Trace:
 [<ffffffff8106b747>] ? warn_slowpath_common+0x87/0xc0
 [<ffffffff8106b79a>] ? warn_slowpath_null+0x1a/0x20
 [<ffffffffa025f7df>] ? fc_seq_send+0x14f/0x160 [libfc]
 [<ffffffffa035cbce>] ? ft_queue_status+0x16e/0x210 [tcm_fc]
 [<ffffffffa030a660>] ? target_complete_ok_work+0x0/0x4b0 [target_core_mod]
 [<ffffffffa030a766>] ? target_complete_ok_work+0x106/0x4b0 [target_core_mod]
 [<ffffffffa030a660>] ? target_complete_ok_work+0x0/0x4b0 [target_core_mod]
 [<ffffffff8108c760>] ? worker_thread+0x170/0x2a0
 [<ffffffff810920d0>] ? autoremove_wake_function+0x0/0x40
 [<ffffffff8108c5f0>] ? worker_thread+0x0/0x2a0
 [<ffffffff81091d66>] ? kthread+0x96/0xa0
 [<ffffffff8100c14a>] ? child_rip+0xa/0x20
 [<ffffffff81091cd0>] ? kthread+0x0/0xa0
 [<ffffffff8100c140>] ? child_rip+0x0/0x20

It occurs because fc_seq_send can have multiple contexts executing within it at
the same time, and fc_seq_send doesn't consistently use the ep->ex_lock that
protects this structure.  Because of that, its possible for one context to clear
the INIT bit in the ep->esb_state field while another checks it, leading to the
above stack trace generated by the WARN_ON in the function.

We should probably undertake the effort to convert access to the fc_exch
structures to use rcu, but that a larger work item.  To just fix this specific
issue, we can just extend the ex_lock protection through the entire fc_seq_send
path

Signed-off-by: Neil Horman <nhorman@tuxdriver.com>
---
 drivers/scsi/libfc/fc_exch.c | 39 +++++++++++++++++++++++++--------------
 1 file changed, 25 insertions(+), 14 deletions(-)
Comment 7 Neil Horman 2012-11-28 14:24:16 EST
sorry, I did something dumb, and didn't check all the lock paths leading to fc_seq_send.

Heres a new build:
http://brewweb.devel.redhat.com/brew/taskinfo?taskID=5140784

with the patch from comment 6, I've smoke tested it on my devel system for fcoe, and it works, please confirm that it fixes your bug
Comment 8 RHEL Product and Program Management 2012-12-14 03:14:01 EST
This request was not resolved in time for the current release.
Red Hat invites you to ask your support representative to
propose this request, if still desired, for consideration in
the next release of Red Hat Enterprise Linux.
Comment 10 Neil Horman 2013-04-23 12:57:28 EDT
gris, I think this still needs to be fixed, but I'd really like to get confirmation from your testing before I go upstream with it.
Comment 11 Gris Ge 2013-04-24 01:45:54 EDT
Sorry for missed your previous bugzilla comments.

The brew build expired. 
I build a kernel with patch you provided.
http://pnate-control-01.lab.bos.redhat.com/stqe/tmp/BZ_822088/kernel-2.6.32-358.el6.bz822088.src.rpm

It keep warning in console once the I/O started.

====
BUG: scheduling while atomic: fcoethread/3/1945/0x0010da00
Modules linked in: dm_round_robin scsi_dh_emc bnx2fc cnic uio fcoe libfcoe libfc 8021q scsi_transport_fc garp stp scsi_tgt llc sunrpc cpufreq_ondemand freq_table pcc_cpufreq ipv6 hpilo hpwdt e1000e microcode serio_raw iTCO_wdt iTCO_vendor_support shpchp ixgbe dca ptp pps_core mdio sg ext4 mbcache jbd2 sd_mod crc_t10dif pata_acpi ata_generic ata_piix hpsa dm_multipath dm_mirror dm_region_hash dm_log dm_mod [last unloaded: scsi_wait_scan]
Pid: 1945, comm: fcoethread/3 Tainted: G        W  ---------------    2.6.32-358.el6.bz822088.x86_64 #1
Call Trace:
 [<ffffffff8105ab36>] ? __schedule_bug+0x66/0x70
 [<ffffffff8150dce0>] ? thread_return+0x64e/0x76e
 [<ffffffffa0286ae0>] ? fc_exch_find+0x90/0xa0 [libfc]
 [<ffffffffa02bc243>] ? fcoe_percpu_receive_thread+0x2d3/0x530 [fcoe]
 [<ffffffff8150d6e0>] ? thread_return+0x4e/0x76e
 [<ffffffffa02bbf70>] ? fcoe_percpu_receive_thread+0x0/0x530 [fcoe]
 [<ffffffff81096916>] ? kthread+0x96/0xa0
 [<ffffffff8100c0ca>] ? child_rip+0xa/0x20
 [<ffffffff81096880>] ? kthread+0x0/0xa0
 [<ffffffff8100c0c0>] ? child_rip+0x0/0x20
BUG: scheduling while atomic: fcoethread/2/1944/0x000e7600
Modules linked in: dm_round_robin scsi_dh_emc bnx2fc cnic uio fcoe libfcoe libfc 8021q scsi_transport_fc garp stp scsi_tgt llc sunrpc cpufreq_ondemand freq_table pcc_cpufreq ipv6 hpilo hpwdt e1000e microcode serio_raw iTCO_wdt iTCO_vendor_support shpchp ixgbe dca ptp pps_core mdio sg ext4 mbcache jbd2 sd_mod crc_t10dif pata_acpi ata_generic ata_piix hpsa dm_multipath dm_mirror dm_region_hash dm_log dm_mod [last unloaded: scsi_wait_scan]
Pid: 1944, comm: fcoethread/2 Tainted: G        W  ---------------    2.6.32-358.el6.bz822088.x86_64 #1
Call Trace:
 [<ffffffff8105ab36>] ? __schedule_bug+0x66/0x70
 [<ffffffff8150dce0>] ? thread_return+0x64e/0x76e
 [<ffffffffa0286ae0>] ? fc_exch_find+0x90/0xa0 [libfc]
 [<ffffffffa02bc243>] ? fcoe_percpu_receive_thread+0x2d3/0x530 [fcoe]
 [<ffffffff8150d6e0>] ? thread_return+0x4e/0x76e
 [<ffffffffa02bbf70>] ? fcoe_percpu_receive_thread+0x0/0x530 [fcoe]
 [<ffffffff81096916>] ? kthread+0x96/0xa0
 [<ffffffff8100c0ca>] ? child_rip+0xa/0x20
 [<ffffffff81096880>] ? kthread+0x0/0xa0
 [<ffffffff8100c0c0>] ? child_rip+0x0/0x20
BUG: scheduling while atomic: fcoethread/3/1945/0x0010da00
====
Comment 12 Neil Horman 2013-04-30 14:01:04 EDT
Thats a different problem.  I'll look into it, but we may need a separate bug for that.
Comment 13 Neil Horman 2013-04-30 16:37:52 EDT
I see the problem, I'll have another build for you shortly
Comment 14 Neil Horman 2013-04-30 16:38:56 EDT
Created attachment 741991 [details]
[PATCH] libfc: extend ex_lock to protect all of fc_seq_send


This warning was reported recently:

WARNING: at drivers/scsi/libfc/fc_exch.c:478 fc_seq_send+0x14f/0x160 [libfc]()
(Not tainted)
Hardware name: ProLiant DL120 G7
Modules linked in: tcm_fc target_core_iblock target_core_file target_core_pscsi
target_core_mod configfs dm_round_robin dm_multipath 8021q garp stp llc bnx2fc
cnic uio fcoe libfcoe libfc scsi_transport_fc scsi_tgt autofs4 sunrpc
pcc_cpufreq ipv6 hpilo hpwdt e1000e microcode iTCO_wdt iTCO_vendor_support
serio_raw shpchp ixgbe dca mdio sg ext4 mbcache jbd2 sd_mod crc_t10dif pata_acpi
ata_generic ata_piix hpsa dm_mirror dm_region_hash dm_log dm_mod [last unloaded:
scsi_wait_scan]
Pid: 5464, comm: target_completi Not tainted 2.6.32-272.el6.x86_64 #1
Call Trace:
 [<ffffffff8106b747>] ? warn_slowpath_common+0x87/0xc0
 [<ffffffff8106b79a>] ? warn_slowpath_null+0x1a/0x20
 [<ffffffffa025f7df>] ? fc_seq_send+0x14f/0x160 [libfc]
 [<ffffffffa035cbce>] ? ft_queue_status+0x16e/0x210 [tcm_fc]
 [<ffffffffa030a660>] ? target_complete_ok_work+0x0/0x4b0 [target_core_mod]
 [<ffffffffa030a766>] ? target_complete_ok_work+0x106/0x4b0 [target_core_mod]
 [<ffffffffa030a660>] ? target_complete_ok_work+0x0/0x4b0 [target_core_mod]
 [<ffffffff8108c760>] ? worker_thread+0x170/0x2a0
 [<ffffffff810920d0>] ? autoremove_wake_function+0x0/0x40
 [<ffffffff8108c5f0>] ? worker_thread+0x0/0x2a0
 [<ffffffff81091d66>] ? kthread+0x96/0xa0
 [<ffffffff8100c14a>] ? child_rip+0xa/0x20
 [<ffffffff81091cd0>] ? kthread+0x0/0xa0
 [<ffffffff8100c140>] ? child_rip+0x0/0x20

It occurs because fc_seq_send can have multiple contexts executing within it at
the same time, and fc_seq_send doesn't consistently use the ep->ex_lock that
protects this structure.  Because of that, its possible for one context to clear
the INIT bit in the ep->esb_state field while another checks it, leading to the
above stack trace generated by the WARN_ON in the function.

We should probably undertake the effort to convert access to the fc_exch
structures to use rcu, but that a larger work item.  To just fix this specific
issue, we can just extend the ex_lock protection through the entire fc_seq_send
path

Signed-off-by: Neil Horman <nhorman@tuxdriver.com>
---
 drivers/scsi/libfc/fc_exch.c | 37 ++++++++++++++++++++++++-------------
 1 file changed, 24 insertions(+), 13 deletions(-)
Comment 15 Neil Horman 2013-04-30 16:41:30 EDT
http://brewweb.devel.redhat.com/brew/taskinfo?taskID=5705908

There you go, that should solve the above problem.
Comment 16 Gris Ge 2013-05-02 00:46:36 EDT
Neil Horman,

Good work. 4 hours I/O stress with link failover provide no kernel call trace on kernel-2.6.32-344.el6.test.x86_64.

Thanks.
Comment 17 Neil Horman 2013-05-02 09:55:40 EDT
Thanks, I'll send this patch upstream today, and backport ASAP.
Comment 18 Neil Horman 2013-05-02 16:20:25 EDT
sent the patch upstream, waiting on integration
Comment 19 Neil Horman 2013-05-03 11:27:43 EDT
http://lists.open-fcoe.org/pipermail/fcoe-devel/2013-May/011990.html

upstream thread
Comment 20 Neil Horman 2013-05-10 21:10:29 EDT
This has been accepted upstream, I'll backport this as part of the fcoe overall backport (Bz 829784)

*** This bug has been marked as a duplicate of bug 829784 ***

Note You need to log in before you can comment on or make changes to this bug.