515176 – scsi_transport_fc: fc_user_scan can loop forever, needs mutex with rport list changes

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	Red Hat Enterprise Linux 5
Classification:	Red Hat
Component:	kernel
Sub Component:
Version:	5.3
Hardware:	All
OS:	Linux
Priority:	urgent
Severity:	urgent
Target Milestone:	rc
Target Release:	---
Assignee:	David Milburn
QA Contact:	Red Hat Kernel QE team
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:	499522 521239 651455

Reported:	2009-08-03 06:14 UTC by Mark Goodwin
Modified:	2018-11-27 19:32 UTC
CC List:	8 users
Fixed In Version:
Doc Type:	Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed:	2010-03-30 06:52:15 UTC
Target Upstream Version:
Embargoed:
Dependent Products:

Attachments
upstream patch: scsi_transport_fc: fc_user_scan correction (3.16 KB, patch) 2009-08-03 06:14 UTC, Mark Goodwin

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Red Hat Product Errata	RHSA-2010:0178	0	normal	SHIPPED_LIVE	Important: Red Hat Enterprise Linux 5.5 kernel security and bug fix update	2010-03-29 12:18:21 UTC

Description Mark Goodwin 2009-08-03 06:14:26 UTC

Created attachment 355956
upstream patch: scsi_transport_fc: fc_user_scan correction

Description of problem:
A customer reported scsi_scan looping forever, eventually causing a soft lockup:

BUG: soft lockup detected on CPU#1!
 [<c044d21c>] softlockup_tick+0x96/0xa4
 [<c042ddb0>] update_process_times+0x39/0x5c
 [<c04196fb>] smp_apic_timer_interrupt+0x5b/0x6c
 [<c04059bf>] apic_timer_interrupt+0x1f/0x24
 [<f88aeccd>] fc_user_scan+0x69/0x72 [scsi_transport_fc]
 [<f88aec64>] fc_user_scan+0x0/0x72 [scsi_transport_fc]
 [<f88704bb>] store_scan+0x83/0xab [scsi_mod]
 [<f8870438>] store_scan+0x0/0xab [scsi_mod]
 [<c054cd24>] class_device_attr_store+0x1b/0x1f
 [<c04a4a3c>] sysfs_write_file+0x91/0xbb
 [<c04a49ab>] sysfs_write_file+0x0/0xbb
 [<c0470254>] vfs_write+0xa1/0x143
 [<c0470846>] sys_write+0x3c/0x63
 [<c0404eff>] syscall_call+0x7/0xb

Version-Release number of selected component (if applicable):
Reported on RHEL 5.3, but all versions of RHEL are believed to be affected.

Steps to Reproduce:
 # echo "- - -" > /sys/class/scsi_host/hostN/scan
(for HBA number N)
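
For context, the three dashes written to the scan file act as wildcards for channel, target ID, and LUN, so the command above requests a scan of everything behind the HBA. The sketch below is a user-space approximation of how such a request string can be parsed; it is not the kernel's code, and parse_field, parse_scan_request, and WILD_CARD are hypothetical names chosen for illustration.

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Illustrative wildcard marker; it mirrors the all-ones value of the
 * kernel's SCAN_WILD_CARD, but the name here is an assumption. */
#define WILD_CARD (~0u)

/* Convert one field: "-" means wildcard, anything else must be a number.
 * Returns 0 on success, -1 on a malformed field. */
static int parse_field(const char *src, unsigned int *val)
{
    char *end;

    if (strcmp(src, "-") == 0) {
        *val = WILD_CARD;
        return 0;
    }
    *val = (unsigned int)strtoul(src, &end, 0);
    return (end != src && *end == '\0') ? 0 : -1;
}

/* Parse "<channel> <target> <lun>"; returns 0 on success, -1 otherwise. */
static int parse_scan_request(const char *str, unsigned int *channel,
                              unsigned int *id, unsigned int *lun)
{
    char s1[16], s2[16], s3[16], junk;

    /* Exactly three whitespace-separated fields are accepted; a fourth
     * non-blank character makes sscanf return 4 and the request fail. */
    if (sscanf(str, "%15s %15s %15s %c", s1, s2, s3, &junk) != 3)
        return -1;
    if (parse_field(s1, channel) || parse_field(s2, id) ||
        parse_field(s3, lun))
        return -1;
    return 0;
}
```

With this sketch, "- - -" yields a wildcard in all three positions, while "0 3 -" would restrict the request to channel 0, target 3, all LUNs.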

Actual Results:
The scan sometimes loops forever. The system then hangs because no I/O is possible on that HBA.

Expected Results:
The scan should complete normally in a reasonable time frame.

Summary of actions taken to resolve issue:
Reboot the system.

Additional info:
We have identified an upstream patch (attached) and built a test kernel
with it; the customer has verified that it resolves the issue (see the
associated IT). In short, the patch re-introduces IRQ locking to guard
against rport list changes during the main loop of the scan.

See the attached patch, or the upstream commit: bda232531f0c117921690ee3c060953c8f12e5a1
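
The general idea of guarding a list walk against concurrent changes can be sketched in user space as follows. This is only an analogue under stated assumptions, not the attached patch: a pthread mutex stands in for the kernel's IRQ-disabling lock, and struct port, struct host, scan_target, and user_scan_target are hypothetical names.

```c
#include <pthread.h>
#include <stddef.h>

/* Hypothetical stand-in for a remote port entry on a host's list. */
struct port {
    int channel;
    int target_id;      /* -1 means no SCSI target is bound */
    int online;
    int scanned;        /* set by scan_target(), for demonstration only */
    struct port *next;
};

struct host {
    pthread_mutex_t lock;   /* guards every change to 'ports' */
    struct port *ports;
};

/* Placeholder for the slow, possibly sleeping, per-target scan. */
static void scan_target(struct port *p)
{
    p->scanned = 1;
}

/* Scan the one port matching (channel, id), if any.
 * Returns 1 if a port was scanned, 0 if none matched. */
static int user_scan_target(struct host *h, int channel, int id)
{
    struct port *p;

    pthread_mutex_lock(&h->lock);
    for (p = h->ports; p != NULL; p = p->next) {
        if (p->target_id == -1 || !p->online)
            continue;
        if (p->channel == channel && p->target_id == id) {
            /* Drop the lock before the slow scan, then stop walking:
             * the list may change once the lock is released. A real
             * implementation must also keep 'p' alive at this point
             * (for example with a reference count); omitted here. */
            pthread_mutex_unlock(&h->lock);
            scan_target(p);
            return 1;
        }
    }
    pthread_mutex_unlock(&h->lock);
    return 0;
}
```

The point of the design is that the walk never continues after the lock has been dropped, so a list change made during the slow scan cannot send the iteration around a stale or re-linked entry.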

Thanks
-- Mark Goodwin

Comment 1 David Milburn 2009-08-05 22:41:25 UTC

I realize that you have tested the upstream patch on a -53 test kernel. Would
you please also try this test kernel built from the latest RHEL 5.4 sources? Thanks.

http://people.redhat.com/dmilburn/.bz515176/

Comment 10 Don Zickus 2009-09-04 18:45:54 UTC

The fix is in kernel-2.6.18-165.el5.
You can download this test kernel from http://people.redhat.com/dzickus/el5

Please do NOT transition this bugzilla state to VERIFIED until our QE team
has sent specific instructions indicating when to do so.  However feel free
to provide a comment indicating that this fix has been verified.

Comment 19 errata-xmlrpc 2010-03-30 06:52:15 UTC

An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2010-0178.html
