Created attachment 355956
upstream patch: scsi_transport_fc: fc_user_scan correction
Description of problem:
Customer reported scsi_scan looping forever, eventually causing a soft lockup:
BUG: soft lockup detected on CPU#1!
[<c044d21c>] softlockup_tick+0x96/0xa4
[<c042ddb0>] update_process_times+0x39/0x5c
[<c04196fb>] smp_apic_timer_interrupt+0x5b/0x6c
[<c04059bf>] apic_timer_interrupt+0x1f/0x24
[<f88aeccd>] fc_user_scan+0x69/0x72 [scsi_transport_fc]
[<f88aec64>] fc_user_scan+0x0/0x72 [scsi_transport_fc]
[<f88704bb>] store_scan+0x83/0xab [scsi_mod]
[<f8870438>] store_scan+0x0/0xab [scsi_mod]
[<c054cd24>] class_device_attr_store+0x1b/0x1f
[<c04a4a3c>] sysfs_write_file+0x91/0xbb
[<c04a49ab>] sysfs_write_file+0x0/0xbb
[<c0470254>] vfs_write+0xa1/0x143
[<c0470846>] sys_write+0x3c/0x63
[<c0404eff>] syscall_call+0x7/0xb
Version-Release number of selected component (if applicable):
Reported on RHEL5.3, but all versions of RHEL believed to be affected
Steps to Reproduce:
# echo "- - -" > /sys/class/scsi_host/hostN/scan
(for HBA number N)
Actual Results:
The scan sometimes loops forever, hanging the system because no I/O is possible on that HBA.
Expected Results:
The scan should complete normally in a reasonable time frame.
Summary of actions taken to resolve issue:
Reboot the system.
Additional info:
We have identified an upstream patch (attached), built a test kernel
and the customer has verified it resolves the issue (see associated IT).
Basically, the patch re-introduces some irq locking to guard against
rport list changes during the main loop of the scan.
See the attached patch, or the upstream commit: bda232531f0c117921690ee3c060953c8f12e5a1
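To illustrate the locking idea only (this is not the patch itself), here is a minimal userspace sketch in C. It assumes a pthread mutex as a stand-in for the host lock taken with IRQs disabled, and the names struct rport, struct host, scan_target and user_scan are hypothetical, not the kernel's. The list is walked with the lock held so it cannot change underneath the loop, and the lock is dropped before the slow scan, after which the function returns instead of continuing to iterate.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

/* Hypothetical stand-in for a remote port entry; not the kernel's struct fc_rport. */
struct rport {
    int channel;
    int target_id;      /* -1 means no SCSI target is bound to this port */
    int online;         /* nonzero if the port is usable */
    int scanned;        /* incremented by the stand-in scan below */
    struct rport *next;
};

/* Hypothetical stand-in for the host; the mutex plays the role of the host lock. */
struct host {
    pthread_mutex_t lock;
    struct rport *rports;
};

/* Stand-in for the slow per-target scan, which must run without the lock held. */
static void scan_target(struct rport *rp)
{
    rp->scanned++;
}

/*
 * Walk the port list under the lock. On a match, drop the lock, scan that
 * one target and return, so the loop never resumes on a list that may have
 * changed while the lock was released.
 * Returns 1 if a matching port was scanned, 0 otherwise.
 */
static int user_scan(struct host *h, int channel, int id)
{
    struct rport *rp;

    assert(h != NULL);

    pthread_mutex_lock(&h->lock);
    for (rp = h->rports; rp != NULL; rp = rp->next) {
        if (rp->target_id == -1 || !rp->online)
            continue;   /* skip unbound or offline ports */
        if (rp->channel == channel && rp->target_id == id) {
            pthread_mutex_unlock(&h->lock);
            scan_target(rp);
            return 1;
        }
    }
    pthread_mutex_unlock(&h->lock);
    return 0;
}
```

The sketch ignores the lifetime of the matched entry after the lock is released and handles only an exact channel/id match; a real implementation has to deal with both.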
Thanks
-- Mark Goodwin
I realize that you have tested the upstream patch on a -53 test kernel. Would
you please also try this test kernel built from the latest RHEL 5.4 sources? Thanks.
http://people.redhat.com/dmilburn/.bz515176/
The fix is in kernel-2.6.18-165.el5.
You can download this test kernel from http://people.redhat.com/dzickus/el5
Please do NOT transition this bugzilla state to VERIFIED until our QE team
has sent specific instructions indicating when to do so. However feel free
to provide a comment indicating that this fix has been verified.
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.
http://rhn.redhat.com/errata/RHSA-2010-0178.html