Bug 208342
| Summary: | SMP kernel hangs upon resetting Symbios (sym53c8xx) SCSI bus | | |
|---|---|---|---|
| Product: | Red Hat Enterprise Linux 4 | Reporter: | E Frank Ball <efball> |
| Component: | kernel | Assignee: | Tom Coughlan <coughlan> |
| Status: | CLOSED CANTFIX | QA Contact: | Brian Brock <bbrock> |
| Severity: | medium | Docs Contact: | |
| Priority: | medium | | |
| Version: | 4.3 | CC: | bojan, efball, jbaron |
| Target Milestone: | --- | | |
| Target Release: | --- | | |
| Hardware: | i686 | | |
| OS: | Linux | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | Bug Fix |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2008-12-10 16:08:47 UTC | Type: | --- |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Attachments: | | | |

Description
E Frank Ball
2006-09-27 22:00:46 UTC
Same here with a UP kernel (2.6.9-42.0.3.EL):
-----------------------------------------------------------
Dec 28 06:15:59 janis kernel: sym0: SCSI parity error detected: SCR1=132 DBC=500 00000 SBCL=0
Dec 28 06:15:59 janis kernel: sym0:1: ERROR (81:0) (8-0-0) (1f/9f/0) @ (mem ba20 1008:ffffffff).
Dec 28 06:15:59 janis kernel: sym0: regdump: da 00 00 9f 47 1f 01 0b 00 08 81 00 80 00 0f 02 ff a0 da 07 22 ff ff ff.
Dec 28 06:15:59 janis kernel: sym0: SCSI BUS reset detected.
Dec 28 06:15:59 janis kernel: sym0: SCSI BUS has been reset.
Dec 28 06:17:12 janis kernel: sym0: SCSI parity error detected: SCR1=132 DBC=500 00000 SBCL=0
Dec 28 06:17:47 janis kernel: sym0:1:0: ABORT operation started.
Dec 28 06:17:52 janis kernel: sym0:1:0: ABORT operation timed-out.
Dec 28 06:17:52 janis kernel: sym0:1:0: ABORT operation started.
Dec 28 06:17:57 janis kernel: sym0:1:0: ABORT operation timed-out.
Dec 28 06:17:57 janis kernel: sym0:1:0: ABORT operation started.
Dec 28 06:18:02 janis kernel: sym0:1:0: ABORT operation timed-out.
Dec 28 06:18:02 janis kernel: sym0:1:0: DEVICE RESET operation started.
Dec 28 06:18:07 janis kernel: sym0:1:0: DEVICE RESET operation timed-out.
Dec 28 06:18:07 janis kernel: sym0:1:0: BUS RESET operation started.
Dec 28 06:18:07 janis kernel: sym0: SCSI BUS reset detected.
Dec 28 06:18:07 janis kernel: sym0: SCSI BUS has been reset.
Dec 28 06:18:07 janis kernel: sym0:1:0: BUS RESET operation complete.
Dec 28 06:19:17 janis kernel: sym0: SCSI parity error detected: SCR1=132 DBC=500 00000 SBCL=0
Dec 28 06:19:17 janis kernel: sym0:0: ERROR (81:0) (8-0-0) (1f/9f/0) @ (mem fa20 1000:e21c0004).
Dec 28 06:19:17 janis kernel: sym0: regdump: da 00 00 9f 47 1f 00 0a 00 08 80 00 80 00 0f 02 18 9c da 07 02 ff ff ff.
Dec 28 06:19:17 janis kernel: sym0: SCSI BUS reset detected.
Dec 28 06:19:17 janis kernel: sym0: SCSI BUS has been reset.
Dec 28 06:19:29 janis kernel: sym0: SCSI parity error detected: SCR1=132 DBC=500 00000 SBCL=0
Dec 28 06:19:29 janis kernel: sym0:0: ERROR (81:0) (8-0-0) (1f/9f/0) @ (scripta 38:f31c0004).
Dec 28 06:19:29 janis kernel: sym0: script cmd = e21c0004
Dec 28 06:19:29 janis kernel: sym0: regdump: da 00 00 9f 47 1f 00 0a 00 08 80 00 80 00 0f 02 ff ff ff ff 02 ff ff ff.
Dec 28 06:19:29 janis kernel: sym0: SCSI BUS reset detected.
Dec 28 06:19:29 janis kernel: sym0: SCSI BUS has been reset.
Dec 28 06:21:20 janis kernel: sym0: SCSI parity error detected: SCR1=132 DBC=500 00000 SBCL=0
Dec 28 06:22:48 janis last message repeated 2 times
Dec 28 06:24:16 janis last message repeated 2 times
Dec 28 06:24:28 janis kernel: sym0: SCSI parity error detected: SCR1=132 DBC=500 00000 SBCL=0
Dec 28 06:24:28 janis kernel: sym0:1: ERROR (81:0) (8-0-0) (1f/9f/0) @ (scripta 38:f31c0004).
Dec 28 06:24:28 janis kernel: sym0: script cmd = e21c0004
Dec 28 06:24:28 janis kernel: sym0: regdump: da 00 00 9f 47 1f 01 0a 00 08 81 00 80 00 0f 02 ff ff ff 00 02 ff ff ff.
Dec 28 06:24:28 janis kernel: sym0: SCSI BUS reset detected.
Dec 28 06:24:28 janis kernel: sym0: SCSI BUS has been reset.
Dec 28 06:25:42 janis kernel: sym0: SCSI parity error detected: SCR1=132 DBC=500 00000 SBCL=0
Dec 28 06:27:44 janis last message repeated 2 times
Dec 28 06:28:31 janis kernel: sym0:0:0: ABORT operation started.
Dec 28 06:28:36 janis kernel: sym0:0:0: ABORT operation timed-out.
Dec 28 06:28:36 janis kernel: sym0:1:0: ABORT operation started.
Dec 28 06:28:41 janis kernel: sym0:1:0: ABORT operation timed-out.
Dec 28 06:28:41 janis kernel: sym0:1:0: ABORT operation started.
Dec 28 06:28:46 janis kernel: sym0:1:0: ABORT operation timed-out.
Dec 28 06:28:46 janis kernel: sym0:1:0: ABORT operation started.
Dec 28 06:28:51 janis kernel: sym0:1:0: ABORT operation timed-out.
Dec 28 06:28:51 janis kernel: sym0:0:0: DEVICE RESET operation started.
Dec 28 06:28:56 janis kernel: sym0:0:0: DEVICE RESET operation timed-out.
Dec 28 06:28:56 janis kernel: sym0:1:0: DEVICE RESET operation started.
Dec 28 06:29:01 janis kernel: sym0:1:0: DEVICE RESET operation timed-out.
Dec 28 06:29:13 janis kernel: sym0:0:0: BUS RESET operation started.
Dec 28 06:29:13 janis kernel: sym0: SCSI BUS reset detected.
Dec 28 06:29:13 janis kernel: sym0: SCSI BUS has been reset.
Dec 28 06:29:13 janis kernel: sym0:0:0: BUS RESET operation complete.
Dec 28 06:29:14 janis kernel: sym0: SCSI parity error detected: SCR1=132 DBC=500 00000 SBCL=0
Dec 28 06:29:14 janis kernel: sym0:0: ERROR (81:0) (8-0-0) (1f/9f/0) @ (scripta 30:e3100004).
Dec 28 06:29:14 janis kernel: sym0: script cmd = f31c0004
Dec 28 06:29:14 janis kernel: sym0: regdump: da 00 00 9f 47 1f 00 0a 00 08 80 00 80 00 0f 02 ff ff ff 00 02 ff ff ff.
Dec 28 06:29:14 janis kernel: sym0: SCSI BUS reset detected.
Dec 28 06:29:14 janis kernel: sym0: SCSI BUS has been reset.
Dec 28 06:30:57 janis kernel: sym0: SCSI parity error detected: SCR1=132 DBC=500 00000 SBCL=0
Dec 28 06:31:36 janis kernel: sym0:0:0: ABORT operation started.
Dec 28 06:31:41 janis kernel: sym0:0:0: ABORT operation timed-out.
Dec 28 06:31:41 janis kernel: sym0:1:0: ABORT operation started.
Dec 28 06:31:46 janis kernel: sym0:1:0: ABORT operation timed-out.
Dec 28 06:31:46 janis kernel: sym0:1:0: ABORT operation started.
Dec 28 06:31:51 janis kernel: sym0:1:0: ABORT operation timed-out.
Dec 28 06:31:51 janis kernel: sym0:0:0: ABORT operation started.
Dec 28 06:31:56 janis kernel: sym0:0:0: ABORT operation timed-out.
Dec 28 06:31:56 janis kernel: sym0:0:0: DEVICE RESET operation started.
Dec 28 06:32:01 janis kernel: sym0:0:0: DEVICE RESET operation timed-out.
Dec 28 06:32:01 janis kernel: sym0:1:0: DEVICE RESET operation started.
Dec 28 06:32:06 janis kernel: sym0:1:0: DEVICE RESET operation timed-out.
Dec 28 06:32:16 janis kernel: sym0:0:0: BUS RESET operation started.
Dec 28 06:32:16 janis kernel: sym0: SCSI BUS reset detected.
Dec 28 06:32:16 janis kernel: sym0: SCSI BUS has been reset.
Dec 28 06:32:16 janis kernel: sym0:0:0: BUS RESET operation complete.
Dec 28 06:32:19 janis kernel: sym0: SCSI parity error detected: SCR1=132 DBC=500 00000 SBCL=0
-----------------------------------------------------------
The above only happens under heavy I/O. Kernel 2.4 seems to run fine on the same hardware. There are some reports on the net that kernels 2.6.17 and above have a fixed driver that does not exhibit this behaviour. Anyone in the mood to backport those fixes to RHEL4?

Could you enable verbose mode in the sym2 driver? You can do this by adding:
sym53c8xx=verb:2
to the kernel parameters. It should produce more messages, which will help track down the issue.

Adding:
options sym53c8xx settle=10
to /etc/modprobe.conf appears to help somewhat. Conditions that would normally cause a crash no longer do so. It would be interesting to know if this workaround does anything on other hardware.

Regarding comment #6, I'll have to wait a few days before I can reboot the box again.

(In reply to comment #8)
> Regarding comment #6, I'll have to wait a few days before I can reboot the box
> again.
You can enable verbose logging at runtime by doing the following:
echo "setverbose 2" >/proc/scsi/sym53c8xx/0
replacing 0 with your actual controller number. This should at least give something to start with. I just want to clarify, though: are you also experiencing the hang when using SMP kernels, or are you only experiencing the resets under heavy load?

Thanks for the hint. My box is a UP system. Problems happen during backup time (i.e. heavy I/O), which is performed using rsync.

I am unable to reproduce either issue mentioned (bus resets during heavy I/O, lockup after bus reset on boot) with the configuration I have. Could I get some more specific information about the hardware configuration that is being used?

This is an HP NetServer LPr. I'll attach the output of lspci -vv.

Created attachment 146226 [details]
Output of lspci -vv
This bug should be closed. I cannot replicate this any more (the hardware was decommissioned a long time ago).
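
For anyone comparing similar logs from other hardware, here is a rough Python sketch that tallies the sym53c8xx messages seen in the syslog excerpt quoted in this report, so the escalation from parity error to abort, device reset and bus reset is easier to see at a glance. The function name `tally_events`, the event labels and the short `SAMPLE` list are made up for illustration; the patterns are taken only from the message text quoted above and are not a complete list of what the driver can print.

```python
import re
from collections import Counter

# Patterns copied from the message text quoted in this report.
# The labels on the left are hypothetical names chosen for this sketch.
EVENT_PATTERNS = {
    "parity_error": re.compile(r"sym\d+: SCSI parity error detected"),
    "bus_reset_detected": re.compile(r"sym\d+: SCSI BUS reset detected"),
    "abort_timeout": re.compile(r"ABORT operation timed-out"),
    "device_reset_timeout": re.compile(r"DEVICE RESET operation timed-out"),
    "bus_reset_complete": re.compile(r"BUS RESET operation complete"),
}


def tally_events(lines):
    """Count how many lines match each known event pattern."""
    counts = Counter()
    for line in lines:
        for name, pattern in EVENT_PATTERNS.items():
            if pattern.search(line):
                counts[name] += 1
                break  # a line is counted under at most one label
    return counts


# A few lines taken from the excerpt above, one syslog entry per line.
SAMPLE = """\
Dec 28 06:15:59 janis kernel: sym0: SCSI parity error detected: SCR1=132 DBC=500 00000 SBCL=0
Dec 28 06:15:59 janis kernel: sym0: SCSI BUS reset detected.
Dec 28 06:17:47 janis kernel: sym0:1:0: ABORT operation started.
Dec 28 06:17:52 janis kernel: sym0:1:0: ABORT operation timed-out.
Dec 28 06:18:02 janis kernel: sym0:1:0: DEVICE RESET operation started.
Dec 28 06:18:07 janis kernel: sym0:1:0: DEVICE RESET operation timed-out.
Dec 28 06:18:07 janis kernel: sym0:1:0: BUS RESET operation started.
Dec 28 06:18:07 janis kernel: sym0:1:0: BUS RESET operation complete.
""".splitlines()

if __name__ == "__main__":
    for name, n in sorted(tally_events(SAMPLE).items()):
        print(f"{name}: {n}")
```

Note that syslog's "last message repeated N times" lines are not expanded by this sketch, so parity error counts taken from a real log would be a lower bound.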