Red Hat Bugzilla – Bug 509022
scsi timeout could happen repeatedly when device doesn't respond to R/W but TUR
Last modified: 2014-07-25 01:07:57 EDT
Created attachment 350025
scsi timeout injection module
Description of problem:
A storage device can fail in such a way that it does not respond to scsi
commands such as read/write, while it still responds successfully to
scsi commands such as Test Unit Ready.
(This may depend on the implementation of the storage.)
When this type of device failure happens, the scsi mid-layer detects a
timeout for the device and tries to recover from the error.
The scsi mid-layer then concludes from the result of Test Unit Ready
that the device has recovered.
Therefore, the state of the device is not changed to offline, and user
applications can continue to issue I/Os to the device. This may cause
repeated timeout errors on the same device, and applications cannot
take proper action quickly.
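The behaviour described above can be illustrated with a small model. This is only a sketch in Python, not kernel code: the names `FaultyDevice`, `recover_after_timeout` and `issue_read` are hypothetical, and the real error handler has more steps than the single Test Unit Ready check modelled here.

```python
from dataclasses import dataclass


@dataclass
class FaultyDevice:
    """Toy model of a device that ignores READ/WRITE but answers TEST UNIT READY."""
    online: bool = True
    timeouts_seen: int = 0

    def responds_to(self, command: str) -> bool:
        # Assumption for this sketch: only TEST UNIT READY is answered.
        return command == "TEST_UNIT_READY"


def recover_after_timeout(dev: FaultyDevice) -> None:
    """Simplified error handling: a successful TUR is taken as 'recovered'."""
    if not dev.responds_to("TEST_UNIT_READY"):
        dev.online = False  # only reached when the device is completely dead


def issue_read(dev: FaultyDevice) -> str:
    """Return the outcome of one read as seen by the application."""
    if not dev.online:
        return "EIO (fast)"        # offline devices fail immediately
    if dev.responds_to("READ"):
        return "OK"
    dev.timeouts_seen += 1         # the command timed out
    recover_after_timeout(dev)     # TUR succeeds, so the device stays online
    return "EIO (after timeout)"


dev = FaultyDevice()
results = [issue_read(dev) for _ in range(3)]
print(results, dev.online, dev.timeouts_seen)
```

In this model every read pays the full timeout, and the device never leaves the online state, which is the loop this report describes.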
In addition, this issue seriously affects system boot time. During
device scanning in the scsi mid-layer, read I/Os are issued to recognized
devices to get their partition tables in check_partition(). Usually,
many partition table types are registered, and the partition check is
executed for each of them. This is a very long process because
every read I/O ends in a timeout.
Moreover, scsi device scanning is done sequentially, so other devices
have to wait to be scanned. In some Linux distributions, the boot process
proceeds before valid devices are recognized, and the system cannot start
correctly even if the devices are fully redundant through mirroring.
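A rough way to see how the delay adds up during a sequential scan is sketched below. The function name and the input values are illustrative assumptions; only the 36 second stall per read comes from the test in this report.

```python
def scan_delay_seconds(stall_per_read: float,
                       reads_per_device: int,
                       broken_devices: int) -> float:
    """Delay added to a sequential scan when every read to a broken device stalls."""
    return stall_per_read * reads_per_device * broken_devices


# 36 s is the per-read stall measured in this report's test setup.
# reads_per_device and broken_devices are made-up illustration values.
delay = scan_delay_seconds(stall_per_read=36.0, reads_per_device=10, broken_devices=2)
print(f"{delay:.0f} s (~{delay / 60:.0f} min)")  # prints 720 s (~12 min)
```

Because the scan is sequential, healthy devices queued behind the broken ones wait for this whole period.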
Version-Release number of selected component (if applicable):
All RHEL5 kernels
Steps to Reproduce:
kernel ... 2.6.18-128.el5
scsi LLD ... qla2xxx
devices ... /dev/sdc (2:0:0:0)
scsi timeout ... 3 seconds.
1. Getting the address of the scsi_host_template for the LLD
Get the address of the scsi_host_template table specific to the LLD.
For the qla2xxx driver, the table name is "qla2x00_driver_template".
# grep qla2x00_driver_template /proc/kallsyms
f8a323c0 d qla2x00_driver_template [qla2xxx]
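The same lookup can be scripted. The following is a minimal sketch that assumes the line layout shown in the grep output above (address, type, name, optional module); `find_symbol_address` is a hypothetical helper name.

```python
from typing import Iterable, Optional


def find_symbol_address(lines: Iterable[str], symbol: str) -> Optional[str]:
    """Return the address of `symbol` as a 0x-prefixed hex string, or None."""
    for line in lines:
        fields = line.split()
        # Expected layout: <address> <type> <name> [<module>]
        if len(fields) >= 3 and fields[2] == symbol:
            return "0x" + fields[0]
    return None


sample = ["f8a323c0 d qla2x00_driver_template\t[qla2xxx]"]
print(find_symbol_address(sample, "qla2x00_driver_template"))  # prints 0xf8a323c0
```

On a live system, an open file object for /proc/kallsyms can be passed in place of `sample`.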
2. Building and loading the scsi timeout injection module
Load the scsi timeout injection module with a "param" option,
which consists of two parameters: the scsi_host_template address
obtained in step 1 and the scsi device target on which a timeout error is
injected.
Here is an example that injects a scsi timeout into scsi device 2:0:0:0.
# insmod scsi_timeout.ko param=0xf8a323c0,2:0:0:0
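To avoid typing mistakes in the "param" value, the string can be assembled and checked before calling insmod. This is a sketch only: `build_param` is a hypothetical helper, and the checks assume the address is written as 0x-prefixed hex and the target as host:channel:target:lun, as in the example above.

```python
import re


def build_param(template_address: str, hctl: str) -> str:
    """Format the module's "param" value as <address>,<host:channel:target:lun>."""
    if not re.fullmatch(r"0x[0-9a-fA-F]+", template_address):
        raise ValueError(f"not a hex address: {template_address!r}")
    if not re.fullmatch(r"\d+:\d+:\d+:\d+", hctl):
        raise ValueError(f"not in H:C:T:L form: {hctl!r}")
    return f"{template_address},{hctl}"


command = ["insmod", "scsi_timeout.ko", "param=" + build_param("0xf8a323c0", "2:0:0:0")]
print(" ".join(command))  # prints insmod scsi_timeout.ko param=0xf8a323c0,2:0:0:0
```

The resulting list could be handed to `subprocess.run` as root; it is only printed here so that nothing is loaded by accident.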
3. Issuing I/Os to the device (/dev/sdc)
Issue I/Os to the device several times, and you can see that each I/O
takes about 36 seconds.
# dd if=/dev/sdc of=/dev/null bs=4096 count=1
dd: reading `/dev/sdc': Input/output error
0+0 records in
0+0 records out
0 bytes (0 B) copied, 36.0002 seconds, 0.0 kB/s
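To repeat the measurement several times without reading the dd output by hand, a small timing helper along these lines could be used. It is a sketch, not part of the attached module: `timed_read` is a hypothetical name, and it performs a plain buffered read of one 4096-byte block like the dd command above.

```python
import os
import time


def timed_read(path: str, size: int = 4096):
    """Read `size` bytes from `path`; return (elapsed_seconds, error_or_None)."""
    start = time.monotonic()
    error = None
    try:
        fd = os.open(path, os.O_RDONLY)
        try:
            os.read(fd, size)
        finally:
            os.close(fd)
    except OSError as exc:
        error = exc  # e.g. EIO once the scsi layer gives up on the command
    return time.monotonic() - start, error


# Example use (run as root against the device under test):
#   for _ in range(3):
#       elapsed, err = timed_read("/dev/sdc")
#       print(f"{elapsed:.1f} s", err)
```

If the device behaves as described in this report, each call is expected to take roughly as long as the dd run above and to report an I/O error.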
Actual results:
A timeout happens every time a process issues I/O to the broken device,
and the process has to wait for a long time. This happens because the
scsi layer does not change the device to offline state in this case.
Expected results:
The scsi layer changes the state of the device to offline when timeouts
have happened the number of times specified by the user, and the process
that issued the I/O receives -EIO without significant delay.
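The requested behaviour can be modelled as a per-device counter with a user-set limit. The sketch below is an illustration in Python, not the proposed kernel patch: the names `Device`, `handle_timeout`, `issue_io` and `max_timeouts` are hypothetical, and treating a limit of 0 as "disabled" is an assumption of this sketch.

```python
from dataclasses import dataclass


@dataclass
class Device:
    online: bool = True
    timeout_count: int = 0


def handle_timeout(dev: Device, max_timeouts: int) -> None:
    """Count a timeout and offline the device once the limit is reached."""
    dev.timeout_count += 1
    # Assumption: max_timeouts == 0 disables the limit (current behaviour).
    if max_timeouts > 0 and dev.timeout_count >= max_timeouts:
        dev.online = False


def issue_io(dev: Device, max_timeouts: int) -> str:
    """Return the outcome of one I/O as seen by the application."""
    if not dev.online:
        return "EIO (immediate)"
    # The modelled device never answers reads or writes, so every I/O times out.
    handle_timeout(dev, max_timeouts)
    return "EIO (after timeout)"


dev = Device()
outcomes = [issue_io(dev, max_timeouts=2) for _ in range(4)]
print(outcomes)
```

With a limit of 2, the first two I/Os still wait for the timeout, and every later I/O fails immediately because the device is offline.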
- A patch that adds a parameter to limit the timeout count per device and
changes a broken device to offline state is under discussion on linux-scsi.
Introduce the parameter to limit scsi timeout count
Introduce the parameter to limit scsi timeout count (take 2)
Hitachi can close this RHEL5 bug.
We will check if this issue is reproducible in RHEL6.
If yes, we will open a new bug.