Created attachment 342382 [details]
scsi timeout injection module

Description of problem:

A scsi timeout on two or more devices may cause extremely long execution
times for user applications. The SDEV_OFFLINE state is changed back to
SDEV_RUNNING during the scsi error recovery procedures triggered by a bus
reset or a host reset of the scsi LLD, so a scsi timeout can happen on the
same devices many times.

Here is an example of the sequence with two non-responding devices, A and B.
When a test program accesses them alternately, the following sequence happens.

0. Original states of the devices
     device A: "running", device B: "running"
1. Device A is accessed.
2. A scsi timeout happens on device A and its recovery fails.
3. The state of device A is changed to "offline".
     device A: "offline", device B: "running"
4. Device B is accessed.
5. A scsi timeout happens on device B and its recovery fails.
6. The states of the devices are changed as follows.
     device A: "running", device B: "offline"
7. (go back to step 1 and continue this iteration)

Version-Release number of selected component (if applicable):
All RHEL5 kernels

How reproducible:
See below

Steps to Reproduce:

0. Environment

     kernel       ... 2.6.29
     scsi LLD     ... qla2xxx
     devices      ... /dev/sdc (2:0:0:0), /dev/sdd (2:0:0:1)
     scsi timeout ... 3 seconds

1. Getting the address of the scsi_host_template for the LLD

   Get the address of the scsi_host_template table specific to the LLD.
   For the qla2xxx driver, the table name is "qla2x00_driver_template".

     # grep qla2x00_driver_template /proc/kallsyms
     f8a323c0 d qla2x00_driver_template      [qla2xxx]

2. Building and loading the scsi timeout injection module

   Load the scsi timeout injection module with a "param" option, which is
   a series of three parameters: the scsi_host_template address obtained
   in step 1 and the two scsi device targets on which a timeout error is
   injected. Here is an example that injects a scsi timeout on scsi
   devices 2:0:0:0 and 2:0:0:1.

     # insmod scsi_timeout.ko param=0xf8a323c0,2:0:0:0,2:0:0:1

3. Checking device states

   Both devices, /dev/sd[cd], are now in the "running" state.

     # cat /sys/block/sdc/device/state
     running
     # cat /sys/block/sdd/device/state
     running

4. Issuing I/Os to the first device (/dev/sdc)

   Issue I/Os to the first device. This takes about 76 seconds.

     # dd if=/dev/sdc of=/dev/null bs=4096 count=100
     dd: reading `/dev/sdc': Input/output error
     0+0 records in
     0+0 records out
     0 bytes (0 B) copied, 75.4365 seconds, 0.0 kB/s

5. Checking device states

   The first device (/dev/sdc) has changed to "offline".

     # cat /sys/block/sdc/device/state
     offline
     # cat /sys/block/sdd/device/state
     running

6. Issuing I/Os to the second device (/dev/sdd)

   Issue I/Os to the second device. This also takes about 76 seconds.

     # dd if=/dev/sdd of=/dev/null bs=4096 count=100
     dd: reading `/dev/sdd': Input/output error
     0+0 records in
     0+0 records out
     0 bytes (0 B) copied, 75.9649 seconds, 0.0 kB/s

7. Checking device states

   The second device (/dev/sdd) has changed to "offline", but the first
   device (/dev/sdc) has changed back to "running".

     # cat /sys/block/sdc/device/state
     running
     # cat /sys/block/sdd/device/state
     offline

8. Issuing I/Os to the first device (/dev/sdc) again

   I/Os to the first device take about 76 seconds once again, because the
   first device is back in the "running" state and the I/Os issued by the
   dd command are sent to the device.

     # dd if=/dev/sdc of=/dev/null bs=4096 count=100
     dd: reading `/dev/sdc': Input/output error
     0+0 records in
     0+0 records out
     0 bytes (0 B) copied, 75.8986 seconds, 0.0 kB/s

Actual results:

Device states change from "offline" to "running" when the scsi error
recovery procedures are executed. (See "Steps to Reproduce")

Expected results:

Once a device moves to the "offline" state, it does not move back to
"running" during scsi error recovery.

Additional info:

- Problem analysis and reproduction tool
  http://marc.info/?l=linux-scsi&m=124042136915970&w=2

- A patch
  http://marc.info/?l=linux-scsi&m=124102164303979&w=2
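
The alternating sequence from the description can be illustrated with a toy model. This is only a sketch in Python, not kernel code: the names access(), TIMEOUT_S and restore_on_reset are hypothetical, the 76-second cost is taken from the dd timings above, and the model assumes that I/O to an "offline" device fails immediately and that a reset during error recovery puts every "offline" device on the host back to "running", as described in this report.

```python
RUNNING, OFFLINE = "running", "offline"
TIMEOUT_S = 76  # approximate cost of one failed I/O, from the dd runs above


def access(states, dev, restore_on_reset=True):
    """Model one I/O to a non-responding device; return the seconds spent.

    restore_on_reset=True models the reported behaviour, where error
    recovery moves offline devices back to running. False models the
    expected behaviour, where offline devices stay offline.
    """
    if states[dev] == OFFLINE:
        return 0  # assumed: I/O to an offline device is rejected at once
    if restore_on_reset:
        # Bus/host reset during recovery: offline devices become running.
        for other in states:
            if states[other] == OFFLINE:
                states[other] = RUNNING
    # Recovery for the timed-out device fails, so it goes offline.
    states[dev] = OFFLINE
    return TIMEOUT_S


def run(restore_on_reset, order=("A", "B", "A", "B")):
    """Access the devices in the given order; return the total seconds."""
    states = {"A": RUNNING, "B": RUNNING}
    total = 0
    for dev in order:
        total += access(states, dev, restore_on_reset)
        print(dev, states, total)
    return total


if __name__ == "__main__":
    print("reported behaviour:")
    run(restore_on_reset=True)
    print("expected behaviour:")
    run(restore_on_reset=False)
```

In this model, every access costs a full timeout under the reported behaviour, so the total grows with each iteration. Under the expected behaviour, each device costs one timeout and later accesses return immediately.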
Status update: In SCSI tree

X-Git-Tree: SCSI
http://git.kernel.org/?p=linux/kernel/git/jejb/scsi-misc-2.6.git;a=commit;h=f1b95d1b7d621b0922490d27089b005d410e807e
Status update: In Linus tree (2.6.31-rc1)

commit 5c10e63c943b4c67561ddc6bf61e01d4141f881f
Author: Takahiro Yasui <tyasui>
Date:   Wed Apr 29 12:13:02 2009 -0400
*** This bug has been marked as a duplicate of bug 516934 ***