Bug 190760
Summary: | SCSI I/O errors with Fusion MPT Driver | | |
---|---|---|---|
Product: | Red Hat Enterprise Linux 4 | Reporter: | Victor Gregorio <contactvictorg> |
Component: | kernel | Assignee: | Tom Coughlan <coughlan> |
Status: | CLOSED WONTFIX | QA Contact: | Brian Brock <bbrock> |
Severity: | medium | Priority: | medium |
Version: | 4.0 | CC: | andreas, anton.fang, bigwavedave, coughlan, jbaron, jos, magnus.moren, ppokorny, sprelutsky |
Hardware: | x86_64 | OS: | Linux |
Doc Type: | Bug Fix | Last Closed: | 2012-06-20 16:14:40 UTC |
Description
Victor Gregorio
2006-05-04 22:45:12 UTC
Created attachment 128639 [details]
Fusion MPT SCSI I/O Errors
Hmm. "return code = 0x20008" means that the storage device or, more likely, the driver returned "busy" status. Do you get the same result if you make the partition size slightly smaller than 2 TB?

By the way, this is not likely to be related, but just in case: for devices larger than 2 TB, the GPT partition table format must be used. The parted utility must be used to create and manage GPT partitions. To create a GPT partition table, use the parted command mklabel gpt.

Hello.
Yes, the exact same results on smaller partitions. I tried various partition sizes between 1.8TB and 500MB.

I found a workaround. To provide more detail: I am working with an nStor Wahoo RAID controller. The Wahoo controller is connected to the RHEL4 system using the previously mentioned LSI 22320 card. The Wahoo controller assigns a LUN to the RAID controller as well as to each logical volume. By default, the logical volumes start at LUN 0 and the controller gets a higher numbered LUN (5, for example).

Using mptlinux-3.02.62.01rh on 2.6.9-34.ELsmp, LUN 0 must be the Wahoo controller to prevent SCSI I/O errors. It is also possible to avoid errors by not assigning any LUN to the controller.

Interestingly, with older RHEL4 kernels, LUN 0 can be a logical volume and the OS will not exhibit the SCSI I/O errors. The following MPT drivers and kernels worked without the need to assign LUN 0 to the Wahoo controller:
- mptlinux-3.02.18 on 2.6.9-22.0.1.ELsmp
- mptlinux-3.01.16 on 2.6.9-11.ELsmp
Hope this helps.

Update:
Although there are no SCSI errors with mptlinux-3.02.62.01rh and the Wahoo RAID controller on LUN 0, performance is severely degraded. Writes to the disk happen in bursts separated by long wait times. To clarify, this is 2.6.9-34.ELsmp with the controller on LUN 0 and the logical volume on LUN 1.

Using the same kernel and driver, the severe performance problems went away if we did not assign the Wahoo RAID controller to a LUN. In this case, only the logical volume was assigned a LUN.
Since we need access to the RAID controller from the OS for monitoring purposes, we went back to using mptlinux-3.02.18 on 2.6.9-22.0.1.ELsmp. In this configuration, the logical volume is LUN 0 and the controller is LUN 1. As before, there are no SCSI errors with this configuration. This configuration is also noticeably faster than running mptlinux-3.02.62.01rh on 2.6.9-34.ELsmp with no LUN assigned to the controller.

Were there major changes to mptlinux between mptlinux-3.02.18 and mptlinux-3.02.62.01rh?

We are also seeing the same issues.
Card: LSI Logic / Symbios Logic 53c1030 PCI-X Fusion-MPT Dual Ultra320 SCSI
Kernel: 2.6.9-34.ELsmp
Driver: mptlinux-3.02.62.01rh
I have reproduced this by following the steps Victor describes above.

/proc/scsi/scsi
Attached devices:
Host: scsi0 Channel: 00 Id: 04 Lun: 00
Vendor: nStor Model: NexStorWahooSATA Rev:
Type: Direct-Access ANSI SCSI revision: 03
Host: scsi0 Channel: 00 Id: 04 Lun: 01
Vendor: nStor Model: NexStorWahooSATA Rev:
Type: Direct-Access ANSI SCSI revision: 03
Host: scsi0 Channel: 00 Id: 04 Lun: 02
Vendor: nStor Model: NexStorWahooSATA Rev:
Type: Processor ANSI SCSI revision: 03

dmesg:
SCSI error : <0 0 4 0> return code = 0x20008
end_request: I/O error, dev sda, sector 12323531
SCSI error : <0 0 4 0> return code = 0x20008
end_request: I/O error, dev sda, sector 12323150
SCSI error : <0 0 4 0> return code = 0x20008
end_request: I/O error, dev sda, sector 12323532
SCSI error : <0 0 4 0> return code = 0x20008
end_request: I/O error, dev sda, sector 12323533
SCSI error : <0 0 4 0> return code = 0x20008
end_request: I/O error, dev sda, sector 12323534

I am seeing a similar error under VMware ESX while trying to scp 50 GB worth of files from remote to local (i.e., heavy writes). This error causes the ESX host to dump as well.
Attached devices:
Host: scsi0 Channel: 00 Id: 00 Lun: 00
Vendor: VMware Model: Virtual disk Rev: 1.0
Type: Direct-Access ANSI SCSI revision: 02
Host: scsi0 Channel: 00 Id: 01 Lun: 00
Vendor: VMware Model: Virtual disk Rev: 1.0
Type: Direct-Access ANSI SCSI revision: 02

/var/log/messages
May 13 12:06:58 test-vm kernel: Buffer I/O error on device dm-2, logical block 13208799
May 13 12:06:58 test-vm kernel: lost page write due to I/O error on dm-2
May 13 12:06:58 test-vm kernel: SCSI error : <0 0 1 0> return code = 0x20008
May 13 12:06:58 test-vm kernel: end_request: I/O error, dev sdb, sector 105672384

Can you repeat these experiments with the newer fusion driver? In other words, "modprobe mptscsi" instead of "modprobe mptscsih" (the difference between the two names is the final 'h' on the old driver).
Thanks,
Chip

(In reply to comment #10)
> Can you repeat these experiments with the newer fusion driver? In other words,
> "modprobe mptscsi" instead of "modprobe mptscsih" (the difference between the
> two names is the final 'h' on the old driver).
Never mind. The latter is just a wrapper around the former. It shouldn't make any difference.
Chip

Created attachment 129446 [details]
debug mptscsih driver
This is a version of the mptscsih driver that has some additional debugging
info enabled. Can you load this version in your kernel and send the messages
back?
cd /lib/modules/2.6.9-34.ELsmp/kernel/drivers/message
mv fusion fusion.org
tar xvfz ~/fusion.tgz
Chip
Created attachment 129508 [details]
debug-boot.txt
Thank you.
I installed the debug mptscsih driver and rebooted the system.
Relevant kernel logs during boot are attached as debug-boot.txt.
Then, I mounted the volume (/dev/sdb1, scsi1, LUN0) on /scratch and tried to
write to the partition.
[root:/]# mount -o rw -t ext3 -v /dev/sdb1 /scratch/
/dev/sdb1 on /scratch type ext3 (rw)
[root:/]# cd /scratch
[root:/scratch]# cp -a /etc .
cp: writing `./etc/gtk/gtkrc.vi_VN.viscii': Read-only file system
cp: cannot create symbolic link `./etc/gtk/gtkrc.sl': Read-only file system
cp: cannot create regular file `./etc/gtk/gtkrc.iso88592': Read-only file system
cp: cannot create regular file `./etc/gtk/gtkrc.iso885914': Read-only file system
[snip]
Relevant kernel messages and errors during the copy are attached as
debug-mount_and_copy.txt
Other information:
MPT devices from /proc/scsi/scsi:
Host: scsi1 Channel: 00 Id: 04 Lun: 00
Vendor: nStor Model: NexStor Wahoo Rev:
Type: Direct-Access ANSI SCSI revision: 03
Host: scsi1 Channel: 00 Id: 04 Lun: 01
Vendor: nStor Model: NexStor Wahoo Rev:
Type: Processor ANSI SCSI revision: 03
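Because the workaround discussed in this report depends on which LUN carries the controller (the "Processor" device) and which carries the logical volume, it can help to extract that layout from /proc/scsi/scsi mechanically. The following is a minimal sketch, assuming the three-line-per-device layout shown above; the function names parse_proc_scsi and controller_luns are hypothetical and not part of any driver or tool.

```python
import re

# Matches the "Host: scsi1 Channel: 00 Id: 04 Lun: 00" line of each entry.
ENTRY_RE = re.compile(
    r"Host:\s*(?P<host>\S+)\s+Channel:\s*(?P<channel>\d+)\s+"
    r"Id:\s*(?P<id>\d+)\s+Lun:\s*(?P<lun>\d+)"
)
# Matches the "Type: Processor ANSI SCSI revision: 03" line.
TYPE_RE = re.compile(r"Type:\s*(?P<type>.+?)\s+ANSI")


def parse_proc_scsi(text):
    """Return one dict per device listed in /proc/scsi/scsi-style text."""
    devices = []
    for line in text.splitlines():
        entry = ENTRY_RE.search(line)
        if entry:
            devices.append({
                "host": entry["host"],
                "channel": int(entry["channel"]),
                "id": int(entry["id"]),
                "lun": int(entry["lun"]),
                "type": None,
            })
            continue
        dev_type = TYPE_RE.search(line)
        if dev_type and devices:
            # The Type line belongs to the most recent Host line.
            devices[-1]["type"] = dev_type["type"].strip()
    return devices


def controller_luns(devices):
    """LUNs whose device type is 'Processor' (the controller in this report)."""
    return [d["lun"] for d in devices if d["type"] == "Processor"]


# Sample copied from the listing above.
SAMPLE = """\
Host: scsi1 Channel: 00 Id: 04 Lun: 00
Vendor: nStor Model: NexStor Wahoo Rev:
Type: Direct-Access ANSI SCSI revision: 03
Host: scsi1 Channel: 00 Id: 04 Lun: 01
Vendor: nStor Model: NexStor Wahoo Rev:
Type: Processor ANSI SCSI revision: 03
"""

if __name__ == "__main__":
    for dev in parse_proc_scsi(SAMPLE):
        print(dev["host"], dev["id"], dev["lun"], dev["type"])
```

On a live system the same function could be fed the contents of /proc/scsi/scsi. In the sample, the controller sits on LUN 1 and the logical volume on LUN 0, which is the layout reported to trigger the errors with mptlinux-3.02.62.01rh.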
fdisk -l output for volume:
Disk /dev/sdb: 2198.7 GB, 2198750625792 bytes
255 heads, 63 sectors/track, 267316 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes
Device Boot Start End Blocks Id System
/dev/sdb1 1 267316 2147215738+ 83 Linux
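As a quick check on the earlier GPT question, the size fdisk reports can be compared against the MBR limit. An MBR partition entry stores its start and length as 32-bit sector counts, so with 512-byte sectors the largest addressable size is 2^32 × 512 bytes. The sketch below assumes 512-byte sectors, as in the fdisk output above; the helper names are hypothetical.

```python
SECTOR_SIZE = 512          # bytes; assumption taken from the fdisk units above
MBR_MAX_SECTORS = 2 ** 32  # MBR stores start LBA and length as 32-bit values


def mbr_limit_bytes(sector_size=SECTOR_SIZE):
    """Largest size addressable by an MBR partition entry, in bytes."""
    return MBR_MAX_SECTORS * sector_size


def needs_gpt(disk_bytes, sector_size=SECTOR_SIZE):
    """True if the disk is too large for MBR addressing to cover it."""
    return disk_bytes > mbr_limit_bytes(sector_size)


if __name__ == "__main__":
    disk_bytes = 2198750625792  # /dev/sdb size reported by fdisk -l above
    print("MBR limit:", mbr_limit_bytes())
    print("needs GPT:", needs_gpt(disk_bytes))
    print("headroom :", mbr_limit_bytes() - disk_bytes)
```

Under these assumptions the limit is 2199023255552 bytes, so the 2198750625792-byte volume falls just below it, which is consistent with the same errors also appearing on much smaller partitions.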
Created attachment 129510 [details]
debug-mount_and_copy.txt
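The "return code = 0x20008" value that appears throughout this report can be split into its component bytes. The Linux SCSI midlayer packs a command result as driver byte, host byte, message byte and status byte, from high to low. The following sketch decodes a log line on that basis; the lookup tables list only a few values from the kernel's SCSI headers, and the function names are hypothetical.

```python
import re

# Subset of host byte codes from the Linux SCSI headers.
HOST_BYTES = {
    0x00: "DID_OK",
    0x01: "DID_NO_CONNECT",
    0x02: "DID_BUS_BUSY",
    0x03: "DID_TIME_OUT",
    0x07: "DID_ERROR",
    0x0D: "DID_REQUEUE",
}
# Subset of SAM status codes (raw low byte of the result).
SAM_STATUS = {
    0x00: "GOOD",
    0x02: "CHECK CONDITION",
    0x08: "BUSY",
    0x18: "RESERVATION CONFLICT",
    0x28: "TASK SET FULL",
}

# "SCSI error : <host channel id lun> return code = 0x..."
LOG_RE = re.compile(
    r"SCSI error : <(\d+) (\d+) (\d+) (\d+)> return code = (0x[0-9a-fA-F]+)"
)


def decode_result(result):
    """Split a SCSI result word into its four bytes."""
    return {
        "driver_byte": (result >> 24) & 0xFF,
        "host_byte": (result >> 16) & 0xFF,
        "msg_byte": (result >> 8) & 0xFF,
        "status_byte": result & 0xFF,
    }


def describe(line):
    """Decode one 'SCSI error' log line, or return None if it does not match."""
    match = LOG_RE.search(line)
    if match is None:
        return None
    host, channel, target, lun = (int(g) for g in match.groups()[:4])
    fields = decode_result(int(match.group(5), 16))
    fields.update(host=host, channel=channel, id=target, lun=lun)
    fields["host_name"] = HOST_BYTES.get(fields["host_byte"], "unknown")
    fields["status_name"] = SAM_STATUS.get(fields["status_byte"], "unknown")
    return fields


if __name__ == "__main__":
    info = describe("SCSI error : <0 0 4 0> return code = 0x20008")
    print(info["host_name"], info["status_name"], "lun", info["lun"])
```

For 0x20008 this yields a host byte of 0x02 (DID_BUS_BUSY) and a status byte of 0x08 (BUSY), matching the earlier reading that the device or driver returned "busy" status.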
I have verified that the problem still exists on RHEL 4 Update 4 and RHEL 5 Beta.

Fusion MPT base driver 3.04.01

From dmesg:
PM: Adding info for scsi:2:0:5:1
SCSI device sdb: 3866554368 512-byte hdwr sectors (1979676 MB)
sdb: Write Protect is off
sdb: Mode Sense: a7 00 10 08
SCSI device sdb: drive cache: write back w/ FUA
SCSI device sdb: 3866554368 512-byte hdwr sectors (1979676 MB)
sdb: Write Protect is off
sdb: Mode Sense: a7 00 10 08
SCSI device sdb: drive cache: write back w/ FUA
sdb: sdb1
sd 2:0:5:1: Attached scsi disk sdb

Vendor: nStor Model: NexStorWahooSATA Rev:
Type: Direct-Access ANSI SCSI revision: 03

The SCSI I/O errors from /var/log/messages will be attached in Comment 16, as the log is a bit long.

Created attachment 148978 [details]
sdb I/O erros
I/O error file mentioned in Comment 15

As another data point, I have also seen the I/O errors go away by removing the management LUN, just as Victor says.

(In reply to comment #17)
> As another data point, I have also seen the i/o errors go away by removing the
> management LUN just as Victor says.
The I/O errors go away under two conditions:
1) The controller is assigned LUN 0
Negative impact: performance is severely degraded, to the point of being unusable.
2) The controller is not assigned a LUN
Negative impact: it is impossible to monitor the controller for failures from the OS.
Please see comment #4 and comment #5.

Here is an article (and a patch) at kb.vmware.com. Maybe this is related.
"RHEL4 U3, RHEL4 U4, SLES9 SP3 or SLES10 File Systems Might Become Read-Only"
http://kb.vmware.com/selfservice/microsites/search.do?cmd=displayKC&docType=kc&externalId=51306&sliceId=1&docTypeID=DT_KB_1_1&dialogID=619564&stateId=0%200%20615944

(In reply to comment #19)
> Here is an article (and a patch) at kb.vmware.com. Maybe this is related.
No, it is completely unrelated.
Chip

Actually, it might be partially related.
I was having the same problems with the original release of RHEL 5, and the change below resolved the issue for RHEL 5. It was suggested by changes made to the MPT Fusion drivers between 2.6.9-42.0.10 and 2.6.9-55.0.6 (-42 fails reliably and -55.0.6 works reliably).

--- linux-2.6.18.x86_64/drivers/message/fusion/mptscsih.c.orig 2007-09-26 19:00:01.000000000 -0700
+++ linux-2.6.18.x86_64/drivers/message/fusion/mptscsih.c 2007-09-26 19:05:49.000000000 -0700
@@ -770,7 +770,7 @@
 case MPI_IOCSTATUS_SCSI_RECOVERED_ERROR: /* 0x0040 */
 case MPI_IOCSTATUS_SUCCESS: /* 0x0000 */
 if (scsi_status == MPI_SCSI_STATUS_BUSY)
- sc->result = (DID_BUS_BUSY << 16) | scsi_status;
+ sc->result = (DID_REQUEUE << 16) | scsi_status;
 else
 sc->result = (DID_OK << 16) | scsi_status;
 if (scsi_state == 0) {

A similar change was made upstream with

commit ad8c31bb69d60c0c6bc6431bccdf67e5a96c0d31
Author: Eric Moore <eric.moore>
Date: Mon Mar 19 10:31:51 2007 -0600

[SCSI] fusion: remove VMWare guest OS remounted as read only work around

This address the issue of VMWare guest OS being remounted as read-only becuase the underlying device was held busy too long and at the same time address Engenio MPP driver concerns over infinite retries. This patch removes the code that snoops the SAM STATUS on busy, which would be returning DID_BUS_BUSY, instead we return the status as is. Retry hanlding seems to be properly handled in scsi_softirq_done, where a busy sam status would only occurr for the time specified by (cmd->allowed +1) * cmd->timeout_per_command.

Thank you for submitting this issue for consideration in Red Hat Enterprise Linux. The release for which you requested a review is now End of Life. Please see https://access.redhat.com/support/policy/updates/errata/

If you would like Red Hat to reconsider your feature request for an active release, please re-open the request via the appropriate support channels and provide additional supporting details about the importance of this issue.