Description of problem:
smp kernel hangs when loading sym53c8xx.ko module

Just like
http://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=122476
but I'm using Enterprise release 4 instead of FC2. My hardware is very similar to his also. I have a Hewlett-Packard Visualize X550.

Version-Release number of selected component (if applicable):
both:
kernel-smp-2.6.9-34.EL
kernel-smp-2.6.9-42.0.2.EL.i686.rpm
non-smp kernel works fine.

How reproducible:
every time. 100%

Steps to Reproduce:
1. boot
2.
3.

Actual results:
Loading scsi_mod.ko module
SCSI subsystem initialized
Loading sd_mod.ko module
Loading sym53c8xx.ko module
sym0: <875> rev 0x26 at pci 0000:02:04.0 irq 15
sym0: Symbios NVRAM, ID 7, Fast-20, SE, parity checking
sym0: open drain IRQ line driver, using on-chip SRAM
sym0: using LOAD/STORE-based firmware.
sym0: SCSI BUS has been reset.
scsi0 : sym-2.1.18j
sym0:0:0: ABORT operation started.
sym0:0:0: ABORT operation timed-out.
sym0:0:0: DEVICE RESET operation started.
sym0:0:0: DEVICE RESET operation timed-out.
sym0:0:0: BUS RESET operation started.
sym0:0:0: BUS RESET operation timed-out.
sym0:0:0: HOST RESET operation started.
sym0: SCSI BUS has been reset.

then it is hung.

With non-smp kernel I get this message:
SCSI subsystem initialized
ACPI: PCI interrupt 0000:02:04.0[A] -> GSI 15 (level, low) -> IRQ 15
sym0: <875> rev 0x26 at pci 0000:02:04.0 irq 15
sym0: Symbios NVRAM, ID 7, Fast-20, SE, parity checking
sym0: open drain IRQ line driver, using on-chip SRAM
sym0: using LOAD/STORE-based firmware.
sym0: SCSI BUS has been reset.
scsi0 : sym-2.1.18j
ACPI: PCI interrupt 0000:00:08.0[A] -> GSI 11 (level, low) -> IRQ 11
ahc_pci:0:8:0: Illegal cable configuration!!. Only two connectors on the adapter may be used at a time!
scsi1 : Adaptec AIC7XXX EISA/VLB/PCI SCSI HBA DRIVER, Rev 6.2.36
        <Adaptec aic7880 Ultra SCSI adapter>
        aic7880: Ultra Wide Channel A, SCSI Id=7, 16/253 SCBs

This is the same as the previous bug report also.
I have two SCSI drives plugged in, nothing else.

Expected results:
not hung.

Additional info:
Same here with a UP kernel (2.6.9-42.0.3.EL):

-----------------------------------------------------------
Dec 28 06:15:59 janis kernel: sym0: SCSI parity error detected: SCR1=132 DBC=50000000 SBCL=0
Dec 28 06:15:59 janis kernel: sym0:1: ERROR (81:0) (8-0-0) (1f/9f/0) @ (mem ba201008:ffffffff).
Dec 28 06:15:59 janis kernel: sym0: regdump: da 00 00 9f 47 1f 01 0b 00 08 81 00 80 00 0f 02 ff a0 da 07 22 ff ff ff.
Dec 28 06:15:59 janis kernel: sym0: SCSI BUS reset detected.
Dec 28 06:15:59 janis kernel: sym0: SCSI BUS has been reset.
Dec 28 06:17:12 janis kernel: sym0: SCSI parity error detected: SCR1=132 DBC=50000000 SBCL=0
Dec 28 06:17:47 janis kernel: sym0:1:0: ABORT operation started.
Dec 28 06:17:52 janis kernel: sym0:1:0: ABORT operation timed-out.
Dec 28 06:17:52 janis kernel: sym0:1:0: ABORT operation started.
Dec 28 06:17:57 janis kernel: sym0:1:0: ABORT operation timed-out.
Dec 28 06:17:57 janis kernel: sym0:1:0: ABORT operation started.
Dec 28 06:18:02 janis kernel: sym0:1:0: ABORT operation timed-out.
Dec 28 06:18:02 janis kernel: sym0:1:0: DEVICE RESET operation started.
Dec 28 06:18:07 janis kernel: sym0:1:0: DEVICE RESET operation timed-out.
Dec 28 06:18:07 janis kernel: sym0:1:0: BUS RESET operation started.
Dec 28 06:18:07 janis kernel: sym0: SCSI BUS reset detected.
Dec 28 06:18:07 janis kernel: sym0: SCSI BUS has been reset.
Dec 28 06:18:07 janis kernel: sym0:1:0: BUS RESET operation complete.
Dec 28 06:19:17 janis kernel: sym0: SCSI parity error detected: SCR1=132 DBC=50000000 SBCL=0
Dec 28 06:19:17 janis kernel: sym0:0: ERROR (81:0) (8-0-0) (1f/9f/0) @ (mem fa201000:e21c0004).
Dec 28 06:19:17 janis kernel: sym0: regdump: da 00 00 9f 47 1f 00 0a 00 08 80 00 80 00 0f 02 18 9c da 07 02 ff ff ff.
Dec 28 06:19:17 janis kernel: sym0: SCSI BUS reset detected.
Dec 28 06:19:17 janis kernel: sym0: SCSI BUS has been reset.
Dec 28 06:19:29 janis kernel: sym0: SCSI parity error detected: SCR1=132 DBC=50000000 SBCL=0
Dec 28 06:19:29 janis kernel: sym0:0: ERROR (81:0) (8-0-0) (1f/9f/0) @ (scripta 38:f31c0004).
Dec 28 06:19:29 janis kernel: sym0: script cmd = e21c0004
Dec 28 06:19:29 janis kernel: sym0: regdump: da 00 00 9f 47 1f 00 0a 00 08 80 00 80 00 0f 02 ff ff ff ff 02 ff ff ff.
Dec 28 06:19:29 janis kernel: sym0: SCSI BUS reset detected.
Dec 28 06:19:29 janis kernel: sym0: SCSI BUS has been reset.
Dec 28 06:21:20 janis kernel: sym0: SCSI parity error detected: SCR1=132 DBC=50000000 SBCL=0
Dec 28 06:22:48 janis last message repeated 2 times
Dec 28 06:24:16 janis last message repeated 2 times
Dec 28 06:24:28 janis kernel: sym0: SCSI parity error detected: SCR1=132 DBC=50000000 SBCL=0
Dec 28 06:24:28 janis kernel: sym0:1: ERROR (81:0) (8-0-0) (1f/9f/0) @ (scripta 38:f31c0004).
Dec 28 06:24:28 janis kernel: sym0: script cmd = e21c0004
Dec 28 06:24:28 janis kernel: sym0: regdump: da 00 00 9f 47 1f 01 0a 00 08 81 00 80 00 0f 02 ff ff ff 00 02 ff ff ff.
Dec 28 06:24:28 janis kernel: sym0: SCSI BUS reset detected.
Dec 28 06:24:28 janis kernel: sym0: SCSI BUS has been reset.
Dec 28 06:25:42 janis kernel: sym0: SCSI parity error detected: SCR1=132 DBC=50000000 SBCL=0
Dec 28 06:27:44 janis last message repeated 2 times
Dec 28 06:28:31 janis kernel: sym0:0:0: ABORT operation started.
Dec 28 06:28:36 janis kernel: sym0:0:0: ABORT operation timed-out.
Dec 28 06:28:36 janis kernel: sym0:1:0: ABORT operation started.
Dec 28 06:28:41 janis kernel: sym0:1:0: ABORT operation timed-out.
Dec 28 06:28:41 janis kernel: sym0:1:0: ABORT operation started.
Dec 28 06:28:46 janis kernel: sym0:1:0: ABORT operation timed-out.
Dec 28 06:28:46 janis kernel: sym0:1:0: ABORT operation started.
Dec 28 06:28:51 janis kernel: sym0:1:0: ABORT operation timed-out.
Dec 28 06:28:51 janis kernel: sym0:0:0: DEVICE RESET operation started.
Dec 28 06:28:56 janis kernel: sym0:0:0: DEVICE RESET operation timed-out.
Dec 28 06:28:56 janis kernel: sym0:1:0: DEVICE RESET operation started.
Dec 28 06:29:01 janis kernel: sym0:1:0: DEVICE RESET operation timed-out.
Dec 28 06:29:13 janis kernel: sym0:0:0: BUS RESET operation started.
Dec 28 06:29:13 janis kernel: sym0: SCSI BUS reset detected.
Dec 28 06:29:13 janis kernel: sym0: SCSI BUS has been reset.
Dec 28 06:29:13 janis kernel: sym0:0:0: BUS RESET operation complete.
Dec 28 06:29:14 janis kernel: sym0: SCSI parity error detected: SCR1=132 DBC=50000000 SBCL=0
Dec 28 06:29:14 janis kernel: sym0:0: ERROR (81:0) (8-0-0) (1f/9f/0) @ (scripta 30:e3100004).
Dec 28 06:29:14 janis kernel: sym0: script cmd = f31c0004
Dec 28 06:29:14 janis kernel: sym0: regdump: da 00 00 9f 47 1f 00 0a 00 08 80 00 80 00 0f 02 ff ff ff 00 02 ff ff ff.
Dec 28 06:29:14 janis kernel: sym0: SCSI BUS reset detected.
Dec 28 06:29:14 janis kernel: sym0: SCSI BUS has been reset.
Dec 28 06:30:57 janis kernel: sym0: SCSI parity error detected: SCR1=132 DBC=50000000 SBCL=0
Dec 28 06:31:36 janis kernel: sym0:0:0: ABORT operation started.
Dec 28 06:31:41 janis kernel: sym0:0:0: ABORT operation timed-out.
Dec 28 06:31:41 janis kernel: sym0:1:0: ABORT operation started.
Dec 28 06:31:46 janis kernel: sym0:1:0: ABORT operation timed-out.
Dec 28 06:31:46 janis kernel: sym0:1:0: ABORT operation started.
Dec 28 06:31:51 janis kernel: sym0:1:0: ABORT operation timed-out.
Dec 28 06:31:51 janis kernel: sym0:0:0: ABORT operation started.
Dec 28 06:31:56 janis kernel: sym0:0:0: ABORT operation timed-out.
Dec 28 06:31:56 janis kernel: sym0:0:0: DEVICE RESET operation started.
Dec 28 06:32:01 janis kernel: sym0:0:0: DEVICE RESET operation timed-out.
Dec 28 06:32:01 janis kernel: sym0:1:0: DEVICE RESET operation started.
Dec 28 06:32:06 janis kernel: sym0:1:0: DEVICE RESET operation timed-out.
Dec 28 06:32:16 janis kernel: sym0:0:0: BUS RESET operation started.
Dec 28 06:32:16 janis kernel: sym0: SCSI BUS reset detected.
Dec 28 06:32:16 janis kernel: sym0: SCSI BUS has been reset.
Dec 28 06:32:16 janis kernel: sym0:0:0: BUS RESET operation complete.
Dec 28 06:32:19 janis kernel: sym0: SCSI parity error detected: SCR1=132 DBC=50000000 SBCL=0
-----------------------------------------------------------
The above only happens under heavy I/O. Kernel 2.4 seems to run fine on the same hardware.
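For anyone comparing notes, a log like the one above can be tallied by event type. This is only a rough sketch: the function name summarize_sym_log is made up for illustration and is not part of the driver or of any Red Hat tool, and it assumes the messages were saved to a file such as /var/log/messages.

```shell
#!/bin/sh
# Hypothetical helper (not part of the sym53c8xx driver): count the three
# kinds of events seen in the excerpt above in a saved syslog file.
# Usage: summarize_sym_log /var/log/messages
summarize_sym_log() {
    log=$1
    # grep -c prints the number of matching lines; "|| true" keeps the
    # function going when there are no matches (grep then exits non-zero).
    parity=$(grep -c 'SCSI parity error detected' "$log" || true)
    timeouts=$(grep -c 'operation timed-out' "$log" || true)
    resets=$(grep -c 'SCSI BUS has been reset' "$log" || true)
    echo "parity_errors=$parity timeouts=$timeouts bus_resets=$resets"
}
```

Note that syslog's "last message repeated N times" lines are not expanded, so the parity error count is a lower bound.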
There are some reports on the net that kernels 2.6.17 and above have a fixed driver that doesn't exhibit this behaviour. Anyone in the mood to backport those fixes to RHEL4?
Could you enable verbose mode in the sym2 driver? You can do this by adding:

sym53c8xx=verb:2

to the kernel parameters. It should produce more messages, which will help track down the issue.
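After rebooting, it may be worth confirming that the parameter actually reached the kernel. A minimal sketch, assuming the running kernel's command line is readable from /proc/cmdline; the function name show_sym_boot_param is hypothetical:

```shell
#!/bin/sh
# Hypothetical helper: print any sym53c8xx= word found on the kernel
# command line. The optional argument lets another file stand in for
# /proc/cmdline.
show_sym_boot_param() {
    cmdline=${1:-/proc/cmdline}
    # Put each word on its own line, then keep only the sym53c8xx= ones.
    tr ' ' '\n' < "$cmdline" | grep '^sym53c8xx=' \
        || echo "no sym53c8xx= parameter found"
}
```

This only shows what was passed at boot; it does not tell you whether the driver accepted the setting.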
Adding:

options sym53c8xx settle=10

to /etc/modprobe.conf appears to help somewhat. Conditions that would normally cause a crash no longer do so. It would be interesting to know if this workaround does anything on other hardware.
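For anyone who wants to try the same workaround, here is a rough sketch that adds the line only if it is not already there. The function name add_settle_option is made up for illustration. The option should only take effect the next time the module is loaded.

```shell
#!/bin/sh
# Hypothetical helper: append the settle workaround to a modprobe
# configuration file unless an identical line already exists.
# Usage (as root): add_settle_option /etc/modprobe.conf
add_settle_option() {
    conf=${1:-/etc/modprobe.conf}
    line='options sym53c8xx settle=10'
    # -x: match whole lines only, -F: treat the pattern as a fixed string.
    if grep -qxF "$line" "$conf" 2>/dev/null; then
        echo "already present in $conf"
    else
        echo "$line" >> "$conf"
        echo "added to $conf"
    fi
}
```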
Regarding comment #6, I'll have to wait a few days before I can reboot the box again.
(In reply to comment #8)
> Regarding comment #6, I'll have to wait a few days before I can reboot the box
> again.

You can enable verbose logging at runtime by doing the following:

echo "setverbose 2" >/proc/scsi/sym53c8xx/0

replacing 0 with your actual controller number. This should at least give something to start with.

I just want to clarify, though: Are you also experiencing the hang when using SMP kernels or are you only experiencing the resets under heavy load?
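On a box with more than one sym53c8xx controller, the runtime command above can be repeated for every entry instead of guessing the number. A minimal sketch, assuming each controller appears as a file under /proc/scsi/sym53c8xx; the function name set_sym_verbose is hypothetical:

```shell
#!/bin/sh
# Hypothetical helper: write "setverbose <level>" to every controller
# entry in the given directory (default /proc/scsi/sym53c8xx).
# Usage (as root): set_sym_verbose 2
set_sym_verbose() {
    level=${1:-2}
    dir=${2:-/proc/scsi/sym53c8xx}
    for ctrl in "$dir"/*; do
        # If the glob matched nothing, the literal pattern is left over.
        [ -e "$ctrl" ] || continue
        echo "setverbose $level" > "$ctrl"
        echo "set verbose=$level on $ctrl"
    done
}
```

Calling it with 0 instead of 2 should turn the extra messages off again, assuming the driver treats level 0 as quiet.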
Thanks for the hint. My box is a UP system. Problems happen during backup time (i.e. heavy I/O), which is performed using rsync.
I am unable to reproduce either issue mentioned (bus resets during heavy I/O, lockup after bus reset on boot) with the configuration I have. Could I get some more specific information about the hardware configuration that is being used?
This is an HP NetServer LPr. I'll attach the output of lspci -vv.
Created attachment 146226
Output of lspci -vv
This bug should be closed. I cannot replicate this any more (the hardware was decommissioned a long time ago).