Bug 499031

Summary: scsi timeout could happen repeatedly on the same broken device
Product: Red Hat Enterprise Linux 5
Reporter: Takahiro Yasui <tyasui>
Component: kernel
Assignee: Doug Ledford <dledford>
Status: CLOSED DUPLICATE
QA Contact: Red Hat Kernel QE team <kernel-qe>
Severity: high
Priority: low
Version: 5.3
CC: dzickus, lwang, masaki.kimura.kz, noboru.obata.ar, saguchi, takahiro.yasui.mp
Target Milestone: rc
Target Release: ---
Hardware: All
OS: Linux
Doc Type: Bug Fix
Last Closed: 2009-11-16 20:29:02 UTC
Attachments:
  scsi timeout injection module (flags: none)

Description Takahiro Yasui 2009-05-04 20:53:11 UTC
Created attachment 342382 [details]
scsi timeout injection module

Description of problem:
  scsi timeouts on two or more devices may cause extremely long execution
  times for user applications. During the scsi error recovery procedures
  triggered by a bus reset or a host reset of the scsi LLD, a device in
  the SDEV_OFFLINE state is changed back to the SDEV_RUNNING state, so
  scsi timeouts can happen on the same devices many times.

  Here is an example of the sequence with two non-responding devices,
  A and B. When a test program accesses them alternately, the
  following sequence happens.

  0. Original states of devices
       device A: "running", device B: "running"

  1. Device A is accessed.
  2. A scsi timeout happens on device A and its recovery fails.
  3. The state of device A is changed to "offline".
       device A: "offline", device B: "running"
  4. Device B is accessed.
  5. A scsi timeout happens on device B and its recovery fails.
  6. The states of the devices are changed as follows.
       device A: "running", device B: "offline"

  7. (go back to step 1 and continue this iteration)
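
The sequence above can be written down as a toy model. This is only a sketch of the reported behavior, not of the kernel code: the helper names `fail_recovery` and `show` are hypothetical, and the model simply assumes that a failed recovery takes the accessed device offline while the reset leaves the other device in "running".

```shell
#!/bin/sh
# Toy model of the reported behavior; this is not kernel code.
state_A=running
state_B=running

# fail_recovery DEV (hypothetical helper): a timeout on DEV whose
# recovery fails. DEV goes offline; as a side effect of the reset,
# the other device ends up "running" even if it was "offline".
fail_recovery() {
    for dev in A B; do
        if [ "$dev" = "$1" ]; then
            eval "state_$dev=offline"
        else
            eval "state_$dev=running"
        fi
    done
}

show() {
    printf 'device A: "%s", device B: "%s"\n' "$state_A" "$state_B"
}

show                  # step 0
fail_recovery A; show # steps 1-3
fail_recovery B; show # steps 4-6
fail_recovery A; show # step 7: back to step 1
```

In this model neither device ever stays offline, which is why every access pays the full timeout again.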

Version-Release number of selected component (if applicable):
  All RHEL5 kernels

How reproducible:
  See below

Steps to Reproduce:
0. Environment
    kernel ... 2.6.29
    scsi LLD ... qla2xxx
    devices ... /dev/sdc (2:0:0:0), /dev/sdd (2:0:0:1)
    scsi timeout ... 3 seconds.

1. Getting the address of the scsi_host_template for the LLD
  Get the address of the scsi_host_template table specific to the LLD.
  For the qla2xxx driver, the table name is "qla2x00_driver_template".

    # grep qla2x00_driver_template /proc/kallsyms
    f8a323c0 d qla2x00_driver_template      [qla2xxx]
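
The lookup can be scripted so that the address comes out in the 0x-prefixed form used for the module parameter. A minimal sketch: `kallsyms_addr` is a hypothetical helper name, and its optional second argument (a file to read instead of /proc/kallsyms) is an assumption added so the function can be tried on saved output.

```shell
# kallsyms_addr SYMBOL [FILE]: print the address of SYMBOL as 0x<hex>.
# Each kallsyms line has the form: <address> <type> <symbol> [module]
kallsyms_addr() {
    awk -v sym="$1" '$3 == sym { print "0x" $1; exit }' \
        "${2:-/proc/kallsyms}"
}
```

For example, `kallsyms_addr qla2x00_driver_template` would be run as root on the test machine.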

2. Building and loading the scsi timeout injection module
  Load the scsi timeout injection module with a "param" option, which
  takes three comma-separated values: the scsi_host_template address
  obtained in step 1, followed by the two scsi device targets on which
  a timeout error is injected.

  Here is an example that injects a scsi timeout into scsi devices
  2:0:0:0 and 2:0:0:1.

    # insmod scsi_timeout.ko param=0xf8a323c0,2:0:0:0,2:0:0:1
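
Because a mistyped target would inject the timeout into the wrong device, it may help to assemble the command line with a small check first. The sketch below only prints the command and does not run it; `insmod_cmd` is a hypothetical helper, and the assumption that each target must be four colon-separated numbers is taken from the 2:0:0:0 form shown above.

```shell
# insmod_cmd ADDR TARGET1 TARGET2: print (do not run) the insmod command.
# Rejects targets that are not four colon-separated numbers.
insmod_cmd() {
    for t in "$2" "$3"; do
        printf '%s\n' "$t" |
            grep -Eq '^[0-9]+:[0-9]+:[0-9]+:[0-9]+$' || {
                echo "bad target: $t" >&2
                return 1
            }
    done
    echo "insmod scsi_timeout.ko param=$1,$2,$3"
}
```

The printed line can then be reviewed and executed by hand.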

3. Checking device states
  Both devices, /dev/sd[cd], are now in the "running" state.

    # cat /sys/block/sdc/device/state
    running
    # cat /sys/block/sdd/device/state
    running
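
The same two sysfs files are read again in steps 5 and 7, so a small helper keeps the checks uniform. This is a sketch under assumptions: `show_states` is a hypothetical name, and taking the sysfs root as the first argument is a choice made only so the helper can be exercised against a scratch directory.

```shell
# show_states SYSFS_ROOT DEV...: print "<dev>: <state>" for each device,
# reading SYSFS_ROOT/block/<dev>/device/state.
show_states() {
    root=$1
    shift
    for d in "$@"; do
        printf '%s: %s\n' "$d" "$(cat "$root/block/$d/device/state")"
    done
}
```

On the test machine the call would be `show_states /sys sdc sdd`.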

4. Issuing I/Os to the first device (/dev/sdc)
  Issue I/Os to the first device. The command fails after about 76 seconds.

    # dd if=/dev/sdc of=/dev/null bs=4096 count=100
    dd: reading `/dev/sdc': Input/output error
    0+0 records in
    0+0 records out
    0 bytes (0 B) copied, 75.4365 seconds, 0.0 kB/s

5. Check device states
  The first device (/dev/sdc) is changed to "offline".

    # cat /sys/block/sdc/device/state
    offline
    # cat /sys/block/sdd/device/state
    running

6. Issuing I/Os to the second device (/dev/sdd)
  Issue I/Os to the second device. The command fails after about 76 seconds.

    # dd if=/dev/sdd of=/dev/null bs=4096 count=100
    dd: reading `/dev/sdd': Input/output error
    0+0 records in
    0+0 records out
    0 bytes (0 B) copied, 75.9649 seconds, 0.0 kB/s

7. Check device states
  The second device (/dev/sdd) is changed to "offline", but the first
  device (/dev/sdc) is changed back to "running".

    # cat /sys/block/sdc/device/state
    running
    # cat /sys/block/sdd/device/state
    offline

8. Again issuing I/Os to the first device (/dev/sdc)
  I/Os to the first device take about 76 seconds once again, because the
  first device is back in the "running" state and the I/Os issued by the
  dd command are sent to the device.

    # dd if=/dev/sdc of=/dev/null bs=4096 count=100
    dd: reading `/dev/sdc': Input/output error
    0+0 records in
    0+0 records out
    0 bytes (0 B) copied, 75.8986 seconds, 0.0 kB/s
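
Steps 4, 6 and 8 repeat the same read and compare how long it takes to fail. A minimal sketch of a wrapper that reports the exit status and the elapsed wall-clock time in whole seconds; `timed_read` is a hypothetical helper that reuses the dd options from the steps above.

```shell
# timed_read PATH: read 100 blocks of 4096 bytes from PATH, then print
# the exit status of dd and the elapsed time in whole seconds.
timed_read() {
    start=$(date +%s)
    rc=0
    dd if="$1" of=/dev/null bs=4096 count=100 2>/dev/null || rc=$?
    end=$(date +%s)
    echo "$1: exit=$rc elapsed=$((end - start))s"
    return "$rc"
}
```

Calling `timed_read /dev/sdc` and `timed_read /dev/sdd` alternately would show whether each access pays the full timeout again.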

Actual results:
  Device states change from "offline" back to "running" when the scsi
  error recovery procedures are executed. (See "Steps to Reproduce")

Expected results:
  Once a device moves to the "offline" state, it does not move back to
  "running".
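
This expectation can be stated as a check over the states observed for one device across the reproduction steps. A sketch only: `offline_is_sticky` is a hypothetical helper, and it treats any "running" seen after an "offline" as a violation.

```shell
# offline_is_sticky STATE...: succeed only if no "running" appears
# after the first "offline" in the sequence of observed states.
offline_is_sticky() {
    seen_offline=no
    for s in "$@"; do
        if [ "$s" = "offline" ]; then
            seen_offline=yes
        elif [ "$s" = "running" ] && [ "$seen_offline" = "yes" ]; then
            return 1
        fi
    done
    return 0
}
```

With the states of /dev/sdc from steps 3, 5 and 7 (running, offline, running), the check fails, which matches the actual results reported here.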

Additional info:
  - Problem analysis and reproduction tool
    http://marc.info/?l=linux-scsi&m=124042136915970&w=2
  - A patch
    http://marc.info/?l=linux-scsi&m=124102164303979&w=2

Comment 1 Takahiro Yasui 2009-05-27 22:02:47 UTC
Status update: In SCSI tree

X-Git-Tree: SCSI
http://git.kernel.org/?p=linux/kernel/git/jejb/scsi-misc-2.6.git;a=commit;h=f1b95d1b7d621b0922490d27089b005d410e807e

Comment 2 Takahiro Yasui 2009-06-25 16:57:57 UTC
Status update: In Linus tree (2.6.31-rc1)

commit 5c10e63c943b4c67561ddc6bf61e01d4141f881f
Author: Takahiro Yasui <tyasui>
Date:   Wed Apr 29 12:13:02 2009 -0400

Comment 3 Takahiro Yasui 2009-11-16 20:29:02 UTC

*** This bug has been marked as a duplicate of bug 516934 ***