Bug 208342 - SMP kernel hangs upon resetting Symbios (sym53c8xx) SCSI bus
Summary: SMP kernel hangs upon resetting Symbios (sym53c8xx) SCSI bus
Keywords:
Status: CLOSED CANTFIX
Alias: None
Product: Red Hat Enterprise Linux 4
Classification: Red Hat
Component: kernel
Version: 4.3
Hardware: i686
OS: Linux
Priority: medium
Severity: medium
Target Milestone: ---
Target Release: ---
Assignee: Tom Coughlan
QA Contact: Brian Brock
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2006-09-27 22:00 UTC by E Frank Ball
Modified: 2008-12-10 16:08 UTC

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2008-12-10 16:08:47 UTC
Target Upstream Version:
Embargoed:


Attachments
Output of lspci -vv (4.15 KB, text/plain)
2007-01-22 20:55 UTC, Bojan Smojver

Description E Frank Ball 2006-09-27 22:00:46 UTC
Description of problem: The SMP kernel hangs when loading the sym53c8xx.ko module,
just like http://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=122476,
but I'm using Red Hat Enterprise Linux 4 instead of FC2.
My hardware is also very similar to that reporter's.
I have a Hewlett-Packard Visualize X550.

Version-Release number of selected component (if applicable):
both:
kernel-smp-2.6.9-34.EL
kernel-smp-2.6.9-42.0.2.EL.i686.rpm

The non-SMP kernel works fine.


How reproducible: Every time (100%).


Steps to Reproduce:
1. Boot the SMP kernel.

Actual results:

Loading scsi_mod.ko module
SCSI subsystem initialized
Loading sd_mod.ko module
Loading sym53c8xx.ko module
sym0: <875> rev 0x26 at pci 0000:02:04.0 irq 15
sym0: Symbios NVRAM, ID 7, Fast-20, SE, parity checking
sym0: open drain IRQ line driver, using on-chip SRAM
sym0: using LOAD/STORE-based firmware.
sym0: SCSI BUS has been reset.
scsi0 : sym-2.1.18j
sym0:0:0: ABORT operation started.
sym0:0:0: ABORT operation timed-out.
sym0:0:0: DEVICE RESET operation started.
sym0:0:0: DEVICE RESET operation timed-out.
sym0:0:0: BUS RESET operation started.
sym0:0:0: BUS RESET operation timed-out.
sym0:0:0: HOST RESET operation started.
sym0: SCSI BUS has been reset.

Then the system hangs.

With the non-SMP kernel I get these messages:

SCSI subsystem initialized
ACPI: PCI interrupt 0000:02:04.0[A] -> GSI 15 (level, low) -> IRQ 15
sym0: <875> rev 0x26 at pci 0000:02:04.0 irq 15
sym0: Symbios NVRAM, ID 7, Fast-20, SE, parity checking
sym0: open drain IRQ line driver, using on-chip SRAM
sym0: using LOAD/STORE-based firmware.
sym0: SCSI BUS has been reset.
scsi0 : sym-2.1.18j
ACPI: PCI interrupt 0000:00:08.0[A] -> GSI 11 (level, low) -> IRQ 11
ahc_pci:0:8:0: Illegal cable configuration!!. Only two connectors on the adapter may be used at a time!
scsi1 : Adaptec AIC7XXX EISA/VLB/PCI SCSI HBA DRIVER, Rev 6.2.36
        <Adaptec aic7880 Ultra SCSI adapter>
        aic7880: Ultra Wide Channel A, SCSI Id=7, 16/253 SCBs


This also matches the previous bug report.
I have two SCSI drives plugged in, nothing else.


Expected results:
The system boots without hanging.

Additional info:

Comment 1 Bojan Smojver 2006-12-28 04:10:57 UTC
Same here with a UP kernel (2.6.9-42.0.3.EL):

-----------------------------------------------------------
Dec 28 06:15:59 janis kernel: sym0: SCSI parity error detected: SCR1=132 DBC=50000000 SBCL=0
Dec 28 06:15:59 janis kernel: sym0:1: ERROR (81:0) (8-0-0) (1f/9f/0) @ (mem ba201008:ffffffff).
Dec 28 06:15:59 janis kernel: sym0: regdump: da 00 00 9f 47 1f 01 0b 00 08 81 00 80 00 0f 02 ff a0 da 07 22 ff ff ff.
Dec 28 06:15:59 janis kernel: sym0: SCSI BUS reset detected.
Dec 28 06:15:59 janis kernel: sym0: SCSI BUS has been reset.
Dec 28 06:17:12 janis kernel: sym0: SCSI parity error detected: SCR1=132 DBC=50000000 SBCL=0
Dec 28 06:17:47 janis kernel: sym0:1:0: ABORT operation started.
Dec 28 06:17:52 janis kernel: sym0:1:0: ABORT operation timed-out.
Dec 28 06:17:52 janis kernel: sym0:1:0: ABORT operation started.
Dec 28 06:17:57 janis kernel: sym0:1:0: ABORT operation timed-out.
Dec 28 06:17:57 janis kernel: sym0:1:0: ABORT operation started.
Dec 28 06:18:02 janis kernel: sym0:1:0: ABORT operation timed-out.
Dec 28 06:18:02 janis kernel: sym0:1:0: DEVICE RESET operation started.
Dec 28 06:18:07 janis kernel: sym0:1:0: DEVICE RESET operation timed-out.
Dec 28 06:18:07 janis kernel: sym0:1:0: BUS RESET operation started.
Dec 28 06:18:07 janis kernel: sym0: SCSI BUS reset detected.
Dec 28 06:18:07 janis kernel: sym0: SCSI BUS has been reset.
Dec 28 06:18:07 janis kernel: sym0:1:0: BUS RESET operation complete.
Dec 28 06:19:17 janis kernel: sym0: SCSI parity error detected: SCR1=132 DBC=50000000 SBCL=0
Dec 28 06:19:17 janis kernel: sym0:0: ERROR (81:0) (8-0-0) (1f/9f/0) @ (mem fa201000:e21c0004).
Dec 28 06:19:17 janis kernel: sym0: regdump: da 00 00 9f 47 1f 00 0a 00 08 80 00 80 00 0f 02 18 9c da 07 02 ff ff ff.
Dec 28 06:19:17 janis kernel: sym0: SCSI BUS reset detected.
Dec 28 06:19:17 janis kernel: sym0: SCSI BUS has been reset.
Dec 28 06:19:29 janis kernel: sym0: SCSI parity error detected: SCR1=132 DBC=50000000 SBCL=0
Dec 28 06:19:29 janis kernel: sym0:0: ERROR (81:0) (8-0-0) (1f/9f/0) @ (scripta 38:f31c0004).
Dec 28 06:19:29 janis kernel: sym0: script cmd = e21c0004
Dec 28 06:19:29 janis kernel: sym0: regdump: da 00 00 9f 47 1f 00 0a 00 08 80 00 80 00 0f 02 ff ff ff ff 02 ff ff ff.
Dec 28 06:19:29 janis kernel: sym0: SCSI BUS reset detected.
Dec 28 06:19:29 janis kernel: sym0: SCSI BUS has been reset.
Dec 28 06:21:20 janis kernel: sym0: SCSI parity error detected: SCR1=132 DBC=50000000 SBCL=0
Dec 28 06:22:48 janis last message repeated 2 times
Dec 28 06:24:16 janis last message repeated 2 times
Dec 28 06:24:28 janis kernel: sym0: SCSI parity error detected: SCR1=132 DBC=50000000 SBCL=0
Dec 28 06:24:28 janis kernel: sym0:1: ERROR (81:0) (8-0-0) (1f/9f/0) @ (scripta 38:f31c0004).
Dec 28 06:24:28 janis kernel: sym0: script cmd = e21c0004
Dec 28 06:24:28 janis kernel: sym0: regdump: da 00 00 9f 47 1f 01 0a 00 08 81 00 80 00 0f 02 ff ff ff 00 02 ff ff ff.
Dec 28 06:24:28 janis kernel: sym0: SCSI BUS reset detected.
Dec 28 06:24:28 janis kernel: sym0: SCSI BUS has been reset.
Dec 28 06:25:42 janis kernel: sym0: SCSI parity error detected: SCR1=132 DBC=50000000 SBCL=0
Dec 28 06:27:44 janis last message repeated 2 times
Dec 28 06:28:31 janis kernel: sym0:0:0: ABORT operation started.
Dec 28 06:28:36 janis kernel: sym0:0:0: ABORT operation timed-out.
Dec 28 06:28:36 janis kernel: sym0:1:0: ABORT operation started.
Dec 28 06:28:41 janis kernel: sym0:1:0: ABORT operation timed-out.
Dec 28 06:28:41 janis kernel: sym0:1:0: ABORT operation started.
Dec 28 06:28:46 janis kernel: sym0:1:0: ABORT operation timed-out.
Dec 28 06:28:46 janis kernel: sym0:1:0: ABORT operation started.
Dec 28 06:28:51 janis kernel: sym0:1:0: ABORT operation timed-out.
Dec 28 06:28:51 janis kernel: sym0:0:0: DEVICE RESET operation started.
Dec 28 06:28:56 janis kernel: sym0:0:0: DEVICE RESET operation timed-out.
Dec 28 06:28:56 janis kernel: sym0:1:0: DEVICE RESET operation started.
Dec 28 06:29:01 janis kernel: sym0:1:0: DEVICE RESET operation timed-out.
Dec 28 06:29:13 janis kernel: sym0:0:0: BUS RESET operation started.
Dec 28 06:29:13 janis kernel: sym0: SCSI BUS reset detected.
Dec 28 06:29:13 janis kernel: sym0: SCSI BUS has been reset.
Dec 28 06:29:13 janis kernel: sym0:0:0: BUS RESET operation complete.
Dec 28 06:29:14 janis kernel: sym0: SCSI parity error detected: SCR1=132 DBC=50000000 SBCL=0
Dec 28 06:29:14 janis kernel: sym0:0: ERROR (81:0) (8-0-0) (1f/9f/0) @ (scripta 30:e3100004).
Dec 28 06:29:14 janis kernel: sym0: script cmd = f31c0004
Dec 28 06:29:14 janis kernel: sym0: regdump: da 00 00 9f 47 1f 00 0a 00 08 80 00 80 00 0f 02 ff ff ff 00 02 ff ff ff.
Dec 28 06:29:14 janis kernel: sym0: SCSI BUS reset detected.
Dec 28 06:29:14 janis kernel: sym0: SCSI BUS has been reset.
Dec 28 06:30:57 janis kernel: sym0: SCSI parity error detected: SCR1=132 DBC=50000000 SBCL=0
Dec 28 06:31:36 janis kernel: sym0:0:0: ABORT operation started.
Dec 28 06:31:41 janis kernel: sym0:0:0: ABORT operation timed-out.
Dec 28 06:31:41 janis kernel: sym0:1:0: ABORT operation started.
Dec 28 06:31:46 janis kernel: sym0:1:0: ABORT operation timed-out.
Dec 28 06:31:46 janis kernel: sym0:1:0: ABORT operation started.
Dec 28 06:31:51 janis kernel: sym0:1:0: ABORT operation timed-out.
Dec 28 06:31:51 janis kernel: sym0:0:0: ABORT operation started.
Dec 28 06:31:56 janis kernel: sym0:0:0: ABORT operation timed-out.
Dec 28 06:31:56 janis kernel: sym0:0:0: DEVICE RESET operation started.
Dec 28 06:32:01 janis kernel: sym0:0:0: DEVICE RESET operation timed-out.
Dec 28 06:32:01 janis kernel: sym0:1:0: DEVICE RESET operation started.
Dec 28 06:32:06 janis kernel: sym0:1:0: DEVICE RESET operation timed-out.
Dec 28 06:32:16 janis kernel: sym0:0:0: BUS RESET operation started.
Dec 28 06:32:16 janis kernel: sym0: SCSI BUS reset detected.
Dec 28 06:32:16 janis kernel: sym0: SCSI BUS has been reset.
Dec 28 06:32:16 janis kernel: sym0:0:0: BUS RESET operation complete.
Dec 28 06:32:19 janis kernel: sym0: SCSI parity error detected: SCR1=132 DBC=50000000 SBCL=0
-----------------------------------------------------------
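
To make the escalation pattern in a log like this easier to see, here is a minimal Python sketch (not part of the original report) that counts error-handler steps by operation and outcome. The regular expression and the function name summarize_eh are assumptions made for illustration; the sample lines are a few entries copied from the log above.

```python
import re
from collections import Counter

# Matches SCSI error-handler lines such as
#   "sym0:1:0: ABORT operation timed-out."
EH_LINE = re.compile(
    r"(?P<host>sym\d+):(?P<target>\d+):(?P<lun>\d+): "
    r"(?P<op>ABORT|DEVICE RESET|BUS RESET|HOST RESET) operation "
    r"(?P<state>started|timed-out|complete)\."
)


def summarize_eh(log_text):
    """Count (operation, state) pairs found in a kernel log excerpt."""
    counts = Counter()
    for match in EH_LINE.finditer(log_text):
        counts[(match["op"], match["state"])] += 1
    return counts


# A few lines taken from the log in this comment.
sample = """\
Dec 28 06:17:47 janis kernel: sym0:1:0: ABORT operation started.
Dec 28 06:17:52 janis kernel: sym0:1:0: ABORT operation timed-out.
Dec 28 06:18:02 janis kernel: sym0:1:0: DEVICE RESET operation started.
Dec 28 06:18:07 janis kernel: sym0:1:0: DEVICE RESET operation timed-out.
Dec 28 06:18:07 janis kernel: sym0:1:0: BUS RESET operation started.
Dec 28 06:18:07 janis kernel: sym0:1:0: BUS RESET operation complete.
"""

for (op, state), n in sorted(summarize_eh(sample).items()):
    print(f"{op:12} {state:10} {n}")
```

Feeding the whole of /var/log/messages to summarize_eh should show how often each step times out before a bus reset finally completes.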

Comment 2 Bojan Smojver 2006-12-28 04:13:50 UTC
The above only happens under heavy I/O. Kernel 2.4 seems to run fine on the same
hardware.

Comment 3 Bojan Smojver 2006-12-30 21:43:23 UTC
There are some reports on the net that kernels 2.6.17 and above have a fixed
driver that doesn't exhibit this behaviour. Is anyone in the mood to backport
those fixes to RHEL4?

Comment 6 Ryan Powers 2007-01-02 22:08:56 UTC
Could you enable verbose mode in the sym2 driver? You can do this by adding:
    sym53c8xx=verb:2
to the kernel parameters. It should produce more log messages, which will help
track down the issue.

Comment 7 Bojan Smojver 2007-01-03 23:21:45 UTC
Adding:

options sym53c8xx settle=10

to /etc/modprobe.conf appears to help somewhat. Conditions that would normally
cause a crash no longer do so. It would be interesting to know if this
workaround does anything on other hardware.
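
For anyone who wants to try the same workaround on several machines, the following Python sketch shows one way to add the line without duplicating it. The helper name ensure_option and the sample "alias" line are hypothetical. The function only works on text and does not touch /etc/modprobe.conf itself, and it does not try to merge with an existing options line for the same module.

```python
def ensure_option(conf_text, module, option):
    """Return conf_text with the line 'options <module> <option>' present.

    If an identical line (ignoring whitespace differences) already
    exists, the text is returned unchanged.
    """
    wanted = f"options {module} {option}"
    for line in conf_text.splitlines():
        if line.split() == wanted.split():
            return conf_text
    # Make sure the new line starts on its own line.
    if conf_text and not conf_text.endswith("\n"):
        conf_text += "\n"
    return conf_text + wanted + "\n"


# Hypothetical existing file content, for illustration only.
existing = "alias eth0 e100\n"
print(ensure_option(existing, "sym53c8xx", "settle=10"), end="")
```

The setting only takes effect the next time the sym53c8xx module is loaded.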

Comment 8 Bojan Smojver 2007-01-03 23:34:20 UTC
Regarding comment #6, I'll have to wait a few days before I can reboot the box
again.

Comment 9 Ryan Powers 2007-01-04 17:38:33 UTC
(In reply to comment #8)
> Regarding comment #6, I'll have to wait a few days before I can reboot the box
> again.

You can enable verbose logging at runtime by doing the following:
    echo "setverbose 2" >/proc/scsi/sym53c8xx/0
replacing 0 with your actual controller number.

This should at least give something to start with.
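
If it is easier to script, the same write can be sketched in Python as below. The function name set_verbose and its proc_root parameter are assumptions for illustration; the path and the "setverbose" command are the ones from the echo line above, and root privileges would be needed on a real system.

```python
from pathlib import Path


def set_verbose(level, controller=0, proc_root="/proc/scsi/sym53c8xx"):
    """Write 'setverbose <level>' to the proc entry of one controller.

    proc_root is a parameter only so the function can be tried against
    a scratch directory; by default it is the path used in the echo
    command in this comment.
    """
    entry = Path(proc_root) / str(controller)
    entry.write_text(f"setverbose {level}\n")
    return entry
```

Calling set_verbose(2, controller=0) should be equivalent to the echo command, with 0 again replaced by the actual controller number.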

I just want to clarify, though: Are you also experiencing the hang when using
SMP kernels or are you only experiencing the resets under heavy load?

Comment 10 Bojan Smojver 2007-01-04 20:11:21 UTC
Thanks for the hint.

My box is a UP system. Problems happen during backup time (i.e. heavy I/O),
which is performed using rsync.

Comment 11 Ryan Powers 2007-01-22 19:31:42 UTC
I am unable to reproduce either issue mentioned (bus resets during heavy I/O,
lockup after bus reset on boot) with the configuration I have. Could I get some
more specific information about the hardware configuration that is being used?

Comment 12 Bojan Smojver 2007-01-22 20:54:22 UTC
This is an HP NetServer LPr. I'll attach the output of lspci -vv.

Comment 13 Bojan Smojver 2007-01-22 20:55:20 UTC
Created attachment 146226
Output of lspci -vv

Comment 14 Bojan Smojver 2008-12-01 23:09:48 UTC
This bug should be closed. I cannot replicate this any more (the hardware was decommissioned a long time ago).

