Note: This bug is displayed in read-only format because the product is no longer active in Red Hat Bugzilla.

RHEL Engineering is moving the tracking of its product development work on RHEL 6 through RHEL 9 to Red Hat Jira (issues.redhat.com). If you're a Red Hat customer, please continue to file support cases via the Red Hat customer portal. If you're not, please head to the "RHEL project" in Red Hat Jira and file new tickets here. Individual Bugzilla bugs in the statuses "NEW", "ASSIGNED", and "POST" are being migrated throughout September 2023. Bugs of Red Hat partners with an assigned Engineering Partner Manager (EPM) are migrated in late September as per pre-agreed dates. Bugs against components "kernel", "kernel-rt", and "kpatch" are only migrated if still in "NEW" or "ASSIGNED". If you cannot log in to RH Jira, please consult article #7032570. That failing, please send an e-mail to the RH Jira admins at rh-issues@redhat.com to troubleshoot your issue as a user management inquiry. The email creates a ServiceNow ticket with Red Hat. Individual Bugzilla bugs that are migrated will be moved to status "CLOSED", resolution "MIGRATED", and set with "MigratedToJIRA" in "Keywords". The link to the successor Jira issue will be found under "Links", have a little "two-footprint" icon next to it, and direct you to the "RHEL project" in Red Hat Jira (issue links are of type "https://issues.redhat.com/browse/RHEL-XXXX", where "X" is a digit). This same link will be available in a blue banner at the top of the page informing you that that bug has been migrated.

Bug 822088

Summary:

[libfc] WARNING: at drivers/scsi/libfc/fc_exch.c:478 fc_seq_send+0x14f/0x160 [libfc]() (Not tainted)

Product:

Red Hat Enterprise Linux 6

Reporter:

Gris Ge <fge>

Component:

kernel

Assignee:

Neil Horman <nhorman>

Status:

CLOSED DUPLICATE

QA Contact:

Storage QE <storage-qe>

Severity:

low

Docs Contact:

Priority:

unspecified

Version:

6.3

Target Milestone:

Target Release:

---

Hardware:

x86_64

OS:

Linux

Whiteboard:

Fixed In Version:

Doc Type:

Bug Fix

Doc Text:

Story Points:

---

Clone Of:

Environment:

Last Closed:

2013-05-11 01:10:29 UTC

Type:

Bug

Regression:

---

Mount Type:

---

Documentation:

---

CRM:

Verified Versions:

Category:

---

oVirt Team:

---

RHEL 7.3 requirements from Atomic Host:

Cloudforms Team:

---

Target Upstream Version:

Embargoed:

Attachments:

Description	Flags
[PATCH] libfc: extend ex_lock to protect all of fc_seq_send	none
[PATCH] libfc: extend ex_lock to protect all of fc_seq_send	none

Description Gris Ge 2012-05-16 10:08:53 UTC

Description of problem:
------------[ cut here ]------------
WARNING: at drivers/scsi/libfc/fc_exch.c:478 fc_seq_send+0x14f/0x160 [libfc]() (Not tainted)
Hardware name: ProLiant DL120 G7
Modules linked in: tcm_fc target_core_iblock target_core_file target_core_pscsi target_core_mod configfs dm_round_robin dm_multipath 8021q garp stp llc bnx2fc cnic uio fcoe libfcoe libfc scsi_transport_fc scsi_tgt autofs4 sunrpc pcc_cpufreq ipv6 hpilo hpwdt e1000e microcode iTCO_wdt iTCO_vendor_support serio_raw shpchp ixgbe dca mdio sg ext4 mbcache jbd2 sd_mod crc_t10dif pata_acpi ata_generic ata_piix hpsa dm_mirror dm_region_hash dm_log dm_mod [last unloaded: scsi_wait_scan]
Pid: 5464, comm: target_completi Not tainted 2.6.32-272.el6.x86_64 #1
Call Trace:
 [<ffffffff8106b747>] ? warn_slowpath_common+0x87/0xc0
 [<ffffffff8106b79a>] ? warn_slowpath_null+0x1a/0x20
 [<ffffffffa025f7df>] ? fc_seq_send+0x14f/0x160 [libfc]
 [<ffffffffa035cbce>] ? ft_queue_status+0x16e/0x210 [tcm_fc]
 [<ffffffffa030a660>] ? target_complete_ok_work+0x0/0x4b0 [target_core_mod]
 [<ffffffffa030a766>] ? target_complete_ok_work+0x106/0x4b0 [target_core_mod]
 [<ffffffffa030a660>] ? target_complete_ok_work+0x0/0x4b0 [target_core_mod]
 [<ffffffff8108c760>] ? worker_thread+0x170/0x2a0
 [<ffffffff810920d0>] ? autoremove_wake_function+0x0/0x40
 [<ffffffff8108c5f0>] ? worker_thread+0x0/0x2a0
 [<ffffffff81091d66>] ? kthread+0x96/0xa0
 [<ffffffff8100c14a>] ? child_rip+0xa/0x20
 [<ffffffff81091cd0>] ? kthread+0x0/0xa0
 [<ffffffff8100c140>] ? child_rip+0x0/0x20
---[ end trace 820415b6ee9aa881 ]---

Got these warnings when performing I/O stress for fcoe-target.

Version-Release number of selected component (if applicable):
kernel -272

How reproducible:
no sure.

Steps to Reproduce:
1. Setup FCoE target.
2. Run 16 thread I/O stress on FCoE initiator.
3.
  
Actual results:
kernel warning.

Expected results:
no warning.

Additional info:

Comment 1 RHEL Program Management 2012-07-10 08:29:05 UTC

This request was not resolved in time for the current release.
Red Hat invites you to ask your support representative to
propose this request, if still desired, for consideration in
the next release of Red Hat Enterprise Linux.

Comment 2 RHEL Program Management 2012-07-10 23:41:46 UTC

This request was erroneously removed from consideration in Red Hat Enterprise Linux 6.4, which is currently under development.  This request will be evaluated for inclusion in Red Hat Enterprise Linux 6.4.

Comment 3 Neil Horman 2012-11-26 21:34:32 UTC

hmm, looks like fc_seq_send is checking esb_stat without the ep->ex_lock held.  It looks to me like the ex_lock should probably be held for at least part of fc_seq_send. I'll have a build going for you shortly.

Comment 4 Neil Horman 2012-11-27 20:31:09 UTC

http://brewweb.devel.redhat.com/brew/taskinfo?taskID=5133861

Theres a build for you Gris, please test it out asap and confirm that it solves the reported problem.  This is going to need to go upstream too, so given the lateness of the 6.4 window, expidite this if you could please.  Thank you.

Comment 5 Gris Ge 2012-11-28 05:59:00 UTC

With the build you povide, server stuck at boot time when start fcoe daemon:
======
BUG: soft lockup - CPU#0 stuck for 67s! [fc_rport_eq:2823]
Modules linked in: dm_round_robin tcm_fc target_core_iblock target_core_file target_core_pscsi target_core_mod configfs bnx2fc cnic uio fcoe libfcoe 8021q libfc garp stp scsi_transport_fc llc scsi_tgt sunrpc pcc_cpufreq ipv6 hpilo hpwdt e1000e microcode iTCO_wdt iTCO_vendor_support serio_raw shpchp ixgbe dca ptp pps_core mdio sg ext4 mbcache jbd2 sd_mod crc_t10dif pata_acpi ata_generic ata_piix hpsa dm_multipath dm_mirror dm_region_hash dm_log dm_mod [last unloaded: scsi_wait_scan]
CPU 0 
Modules linked in: dm_round_robin tcm_fc target_core_iblock target_core_file target_core_pscsi target_core_mod configfs bnx2fc cnic uio fcoe libfcoe 8021q libfc garp stp scsi_transport_fc llc scsi_tgt sunrpc pcc_cpufreq ipv6 hpilo hpwdt e1000e microcode iTCO_wdt iTCO_vendor_support serio_raw shpchp ixgbe dca ptp pps_core mdio sg ext4 mbcache jbd2 sd_mod crc_t10dif pata_acpi ata_generic ata_piix hpsa dm_multipath dm_mirror dm_region_hash dm_log dm_mod [last unloaded: scsi_wait_scan]

Pid: 2823, comm: fc_rport_eq Not tainted 2.6.32-344.el6.test.x86_64 #1 HP ProLiant DL120 G7
RIP: 0010:[<ffffffff8150f66b>]  [<ffffffff8150f66b>] _spin_lock_bh+0x2b/0x40
RSP: 0018:ffff880203ca5c80  EFLAGS: 00000297
RAX: 0000000000000002 RBX: ffff880203ca5c90 RCX: 0000000000007500
RDX: 0000000000000001 RSI: ffff8801f6a6d150 RDI: ffff8801f6a6d0a8
RBP: ffffffff8100bb8e R08: 0000000000000001 R09: 0000000000000000
R10: 00000000000001e0 R11: 0000000000000000 R12: ffff880203ca5c10
R13: ffffffff8136458b R14: 003135323a313532 R15: 0000000000000000
FS:  0000000000000000(0000) GS:ffff88002fa00000(0000) knlGS:0000000000000000
CS:  0010 DS: 0018 ES: 0018 CR0: 000000008005003b
CR2: 00007f3b6bfff000 CR3: 000000020634b000 CR4: 00000000000407f0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Process fc_rport_eq (pid: 2823, threadinfo ffff880203ca4000, task ffff8802030a4080)
Stack:
 00000000ffffffff ffff8801f6a6d080 ffff880203ca5cf0 ffffffffa029b5fb
<d> 0000000000000000 ffff8801beeef840 0000000000000018 ffff8802030a15e0
<d> ffff880203ca5ce0 ffff8801f6a6d080 ffff8801bef65d80 ffff8801f6a6d150
Call Trace:
 [<ffffffffa029b5fb>] ? fc_seq_send+0x5b/0x160 [libfc]
 [<ffffffffa029b836>] ? fc_exch_abort_locked+0x136/0x1c0 [libfc]
 [<ffffffffa029ba4f>] ? fc_exch_mgr_reset+0x12f/0x260 [libfc]
 [<ffffffffa02a3e10>] ? fc_rport_work+0x120/0x5f0 [libfc]
 [<ffffffff81096d8e>] ? prepare_to_wait+0x4e/0x80
 [<ffffffffa02a3cf0>] ? fc_rport_work+0x0/0x5f0 [libfc]
 [<ffffffff810908e0>] ? worker_thread+0x170/0x2a0
 [<ffffffff81096aa0>] ? autoremove_wake_function+0x0/0x40
 [<ffffffff81090770>] ? worker_thread+0x0/0x2a0
 [<ffffffff81096736>] ? kthread+0x96/0xa0
 [<ffffffff8100c0ca>] ? child_rip+0xa/0x20
 [<ffffffff810966a0>] ? kthread+0x0/0xa0
 [<ffffffff8100c0c0>] ? child_rip+0x0/0x20
Code: 55 48 89 e5 53 48 83 ec 08 0f 1f 44 00 00 48 89 fb e8 3a 77 b6 ff b8 00 00 01 00 f0 0f c1 03 0f b7 d0 c1 e8 10 39 c2 74 0e f3 90 <0f> b7 13 eb f5 83 3b 00 75 f4 eb df 48 83 c4 08 5b c9 c3 66 90
======

The server did not allow me login via ssh or console, no idea where server stuck.

You might need to postpone the upstream submittion.

Comment 6 Neil Horman 2012-11-28 19:13:17 UTC

Created attachment 653689 [details]
[PATCH] libfc: extend ex_lock to protect all of fc_seq_send


This warning was reported recently:

WARNING: at drivers/scsi/libfc/fc_exch.c:478 fc_seq_send+0x14f/0x160 [libfc]()
(Not tainted)
Hardware name: ProLiant DL120 G7
Modules linked in: tcm_fc target_core_iblock target_core_file target_core_pscsi
target_core_mod configfs dm_round_robin dm_multipath 8021q garp stp llc bnx2fc
cnic uio fcoe libfcoe libfc scsi_transport_fc scsi_tgt autofs4 sunrpc
pcc_cpufreq ipv6 hpilo hpwdt e1000e microcode iTCO_wdt iTCO_vendor_support
serio_raw shpchp ixgbe dca mdio sg ext4 mbcache jbd2 sd_mod crc_t10dif pata_acpi
ata_generic ata_piix hpsa dm_mirror dm_region_hash dm_log dm_mod [last unloaded:
scsi_wait_scan]
Pid: 5464, comm: target_completi Not tainted 2.6.32-272.el6.x86_64 #1
Call Trace:
 [<ffffffff8106b747>] ? warn_slowpath_common+0x87/0xc0
 [<ffffffff8106b79a>] ? warn_slowpath_null+0x1a/0x20
 [<ffffffffa025f7df>] ? fc_seq_send+0x14f/0x160 [libfc]
 [<ffffffffa035cbce>] ? ft_queue_status+0x16e/0x210 [tcm_fc]
 [<ffffffffa030a660>] ? target_complete_ok_work+0x0/0x4b0 [target_core_mod]
 [<ffffffffa030a766>] ? target_complete_ok_work+0x106/0x4b0 [target_core_mod]
 [<ffffffffa030a660>] ? target_complete_ok_work+0x0/0x4b0 [target_core_mod]
 [<ffffffff8108c760>] ? worker_thread+0x170/0x2a0
 [<ffffffff810920d0>] ? autoremove_wake_function+0x0/0x40
 [<ffffffff8108c5f0>] ? worker_thread+0x0/0x2a0
 [<ffffffff81091d66>] ? kthread+0x96/0xa0
 [<ffffffff8100c14a>] ? child_rip+0xa/0x20
 [<ffffffff81091cd0>] ? kthread+0x0/0xa0
 [<ffffffff8100c140>] ? child_rip+0x0/0x20

It occurs because fc_seq_send can have multiple contexts executing within it at
the same time, and fc_seq_send doesn't consistently use the ep->ex_lock that
protects this structure.  Because of that, its possible for one context to clear
the INIT bit in the ep->esb_state field while another checks it, leading to the
above stack trace generated by the WARN_ON in the function.

We should probably undertake the effort to convert access to the fc_exch
structures to use rcu, but that a larger work item.  To just fix this specific
issue, we can just extend the ex_lock protection through the entire fc_seq_send
path

Signed-off-by: Neil Horman <nhorman>
---
 drivers/scsi/libfc/fc_exch.c | 39 +++++++++++++++++++++++++--------------
 1 file changed, 25 insertions(+), 14 deletions(-)

Comment 7 Neil Horman 2012-11-28 19:24:16 UTC

sorry, I did something dumb, and didn't check all the lock paths leading to fc_seq_send.

Heres a new build:
http://brewweb.devel.redhat.com/brew/taskinfo?taskID=5140784

with the patch from comment 6, I've smoke tested it on my devel system for fcoe, and it works, please confirm that it fixes your bug

Comment 8 RHEL Program Management 2012-12-14 08:14:01 UTC

This request was not resolved in time for the current release.
Red Hat invites you to ask your support representative to
propose this request, if still desired, for consideration in
the next release of Red Hat Enterprise Linux.

Comment 10 Neil Horman 2013-04-23 16:57:28 UTC

gris, I think this still needs to be fixed, but I'd really like to get confirmation from your testing before I go upstream with it.

Comment 11 Gris Ge 2013-04-24 05:45:54 UTC

Sorry for missed your previous bugzilla comments.

The brew build expired. 
I build a kernel with patch you provided.
http://pnate-control-01.lab.bos.redhat.com/stqe/tmp/BZ_822088/kernel-2.6.32-358.el6.bz822088.src.rpm

It keep warning in console once the I/O started.

====
BUG: scheduling while atomic: fcoethread/3/1945/0x0010da00
Modules linked in: dm_round_robin scsi_dh_emc bnx2fc cnic uio fcoe libfcoe libfc 8021q scsi_transport_fc garp stp scsi_tgt llc sunrpc cpufreq_ondemand freq_table pcc_cpufreq ipv6 hpilo hpwdt e1000e microcode serio_raw iTCO_wdt iTCO_vendor_support shpchp ixgbe dca ptp pps_core mdio sg ext4 mbcache jbd2 sd_mod crc_t10dif pata_acpi ata_generic ata_piix hpsa dm_multipath dm_mirror dm_region_hash dm_log dm_mod [last unloaded: scsi_wait_scan]
Pid: 1945, comm: fcoethread/3 Tainted: G        W  ---------------    2.6.32-358.el6.bz822088.x86_64 #1
Call Trace:
 [<ffffffff8105ab36>] ? __schedule_bug+0x66/0x70
 [<ffffffff8150dce0>] ? thread_return+0x64e/0x76e
 [<ffffffffa0286ae0>] ? fc_exch_find+0x90/0xa0 [libfc]
 [<ffffffffa02bc243>] ? fcoe_percpu_receive_thread+0x2d3/0x530 [fcoe]
 [<ffffffff8150d6e0>] ? thread_return+0x4e/0x76e
 [<ffffffffa02bbf70>] ? fcoe_percpu_receive_thread+0x0/0x530 [fcoe]
 [<ffffffff81096916>] ? kthread+0x96/0xa0
 [<ffffffff8100c0ca>] ? child_rip+0xa/0x20
 [<ffffffff81096880>] ? kthread+0x0/0xa0
 [<ffffffff8100c0c0>] ? child_rip+0x0/0x20
BUG: scheduling while atomic: fcoethread/2/1944/0x000e7600
Modules linked in: dm_round_robin scsi_dh_emc bnx2fc cnic uio fcoe libfcoe libfc 8021q scsi_transport_fc garp stp scsi_tgt llc sunrpc cpufreq_ondemand freq_table pcc_cpufreq ipv6 hpilo hpwdt e1000e microcode serio_raw iTCO_wdt iTCO_vendor_support shpchp ixgbe dca ptp pps_core mdio sg ext4 mbcache jbd2 sd_mod crc_t10dif pata_acpi ata_generic ata_piix hpsa dm_multipath dm_mirror dm_region_hash dm_log dm_mod [last unloaded: scsi_wait_scan]
Pid: 1944, comm: fcoethread/2 Tainted: G        W  ---------------    2.6.32-358.el6.bz822088.x86_64 #1
Call Trace:
 [<ffffffff8105ab36>] ? __schedule_bug+0x66/0x70
 [<ffffffff8150dce0>] ? thread_return+0x64e/0x76e
 [<ffffffffa0286ae0>] ? fc_exch_find+0x90/0xa0 [libfc]
 [<ffffffffa02bc243>] ? fcoe_percpu_receive_thread+0x2d3/0x530 [fcoe]
 [<ffffffff8150d6e0>] ? thread_return+0x4e/0x76e
 [<ffffffffa02bbf70>] ? fcoe_percpu_receive_thread+0x0/0x530 [fcoe]
 [<ffffffff81096916>] ? kthread+0x96/0xa0
 [<ffffffff8100c0ca>] ? child_rip+0xa/0x20
 [<ffffffff81096880>] ? kthread+0x0/0xa0
 [<ffffffff8100c0c0>] ? child_rip+0x0/0x20
BUG: scheduling while atomic: fcoethread/3/1945/0x0010da00
====

Comment 12 Neil Horman 2013-04-30 18:01:04 UTC

Thats a different problem.  I'll look into it, but we may need a separate bug for that.

Comment 13 Neil Horman 2013-04-30 20:37:52 UTC

I see the problem, I'll have another build for you shortly

Comment 14 Neil Horman 2013-04-30 20:38:56 UTC

Created attachment 741991 [details]
[PATCH] libfc: extend ex_lock to protect all of fc_seq_send


This warning was reported recently:

WARNING: at drivers/scsi/libfc/fc_exch.c:478 fc_seq_send+0x14f/0x160 [libfc]()
(Not tainted)
Hardware name: ProLiant DL120 G7
Modules linked in: tcm_fc target_core_iblock target_core_file target_core_pscsi
target_core_mod configfs dm_round_robin dm_multipath 8021q garp stp llc bnx2fc
cnic uio fcoe libfcoe libfc scsi_transport_fc scsi_tgt autofs4 sunrpc
pcc_cpufreq ipv6 hpilo hpwdt e1000e microcode iTCO_wdt iTCO_vendor_support
serio_raw shpchp ixgbe dca mdio sg ext4 mbcache jbd2 sd_mod crc_t10dif pata_acpi
ata_generic ata_piix hpsa dm_mirror dm_region_hash dm_log dm_mod [last unloaded:
scsi_wait_scan]
Pid: 5464, comm: target_completi Not tainted 2.6.32-272.el6.x86_64 #1
Call Trace:
 [<ffffffff8106b747>] ? warn_slowpath_common+0x87/0xc0
 [<ffffffff8106b79a>] ? warn_slowpath_null+0x1a/0x20
 [<ffffffffa025f7df>] ? fc_seq_send+0x14f/0x160 [libfc]
 [<ffffffffa035cbce>] ? ft_queue_status+0x16e/0x210 [tcm_fc]
 [<ffffffffa030a660>] ? target_complete_ok_work+0x0/0x4b0 [target_core_mod]
 [<ffffffffa030a766>] ? target_complete_ok_work+0x106/0x4b0 [target_core_mod]
 [<ffffffffa030a660>] ? target_complete_ok_work+0x0/0x4b0 [target_core_mod]
 [<ffffffff8108c760>] ? worker_thread+0x170/0x2a0
 [<ffffffff810920d0>] ? autoremove_wake_function+0x0/0x40
 [<ffffffff8108c5f0>] ? worker_thread+0x0/0x2a0
 [<ffffffff81091d66>] ? kthread+0x96/0xa0
 [<ffffffff8100c14a>] ? child_rip+0xa/0x20
 [<ffffffff81091cd0>] ? kthread+0x0/0xa0
 [<ffffffff8100c140>] ? child_rip+0x0/0x20

It occurs because fc_seq_send can have multiple contexts executing within it at
the same time, and fc_seq_send doesn't consistently use the ep->ex_lock that
protects this structure.  Because of that, its possible for one context to clear
the INIT bit in the ep->esb_state field while another checks it, leading to the
above stack trace generated by the WARN_ON in the function.

We should probably undertake the effort to convert access to the fc_exch
structures to use rcu, but that a larger work item.  To just fix this specific
issue, we can just extend the ex_lock protection through the entire fc_seq_send
path

Signed-off-by: Neil Horman <nhorman>
---
 drivers/scsi/libfc/fc_exch.c | 37 ++++++++++++++++++++++++-------------
 1 file changed, 24 insertions(+), 13 deletions(-)

Comment 15 Neil Horman 2013-04-30 20:41:30 UTC

http://brewweb.devel.redhat.com/brew/taskinfo?taskID=5705908

There you go, that should solve the above problem.

Comment 16 Gris Ge 2013-05-02 04:46:36 UTC

Neil Horman,

Good work. 4 hours I/O stress with link failover provide no kernel call trace on kernel-2.6.32-344.el6.test.x86_64.

Thanks.

Comment 17 Neil Horman 2013-05-02 13:55:40 UTC

Thanks, I'll send this patch upstream today, and backport ASAP.

Comment 18 Neil Horman 2013-05-02 20:20:25 UTC

sent the patch upstream, waiting on integration

Comment 19 Neil Horman 2013-05-03 15:27:43 UTC

http://lists.open-fcoe.org/pipermail/fcoe-devel/2013-May/011990.html

upstream thread

Comment 20 Neil Horman 2013-05-11 01:10:29 UTC

This has been accepted upstream, I'll backport this as part of the fcoe overall backport (Bz 829784)

*** This bug has been marked as a duplicate of bug 829784 ***