Bug 499031 - scsi timeout could happen repeatedly on the same broken device
Status: CLOSED DUPLICATE of bug 516934
Product: Red Hat Enterprise Linux 5
Classification: Red Hat
Component: kernel
Version: 5.3
Hardware: All
OS: Linux
Priority: low
Severity: high
Target Milestone: rc
Target Release: ---
Assigned To: Doug Ledford
QA Contact: Red Hat Kernel QE team
Reported: 2009-05-04 16:53 EDT by Takahiro Yasui
Modified: 2014-07-25 01:07 EDT

Doc Type: Bug Fix
Last Closed: 2009-11-16 15:29:02 EST

Attachments
scsi timeout injection module (1.75 KB, text/plain)
2009-05-04 16:53 EDT, Takahiro Yasui

Description Takahiro Yasui 2009-05-04 16:53:11 EDT
Created attachment 342382
scsi timeout injection module

Description of problem:
  scsi timeouts on two or more devices may cause extremely long execution
  times for user applications. During the scsi error recovery procedures
  triggered by a bus reset or a host reset of the scsi LLD, the SDEV_OFFLINE
  state is changed back to SDEV_RUNNING, so scsi timeouts can happen on the
  same devices many times.

  Here is an example of the sequence with two non-responding devices,
  A and B. When a test program accesses them alternately, the following
  sequence happens.

  0. Original states of devices
       device A: "running", device B: "running"

  1. Device A is accessed.
  2. A scsi timeout happens on device A and its recovery fails.
  3. The state of device A is changed to "offline".
       device A: "offline", device B: "running"
  4. Device B is accessed.
  5. A scsi timeout happens on device B and its recovery fails.
  6. The states of the devices are changed as follows.
       device A: "running", device B: "offline"

  7. (go back to step 1 and continue this iteration)
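
  The sequence above can be mimicked with a small toy model in POSIX sh.
  This is only a sketch and not kernel code. It assumes, following the
  description, that a failed recovery escalates to a reset which puts every
  device on the host back to "running" before the failing device is set to
  "offline". The names access, state_a and state_b are made up for this
  illustration.

```shell
#!/bin/sh
# Toy model of the state flapping; not kernel code.
# Assumption: the reset issued during a failed recovery marks every
# device "running" again, then the device that failed goes "offline".

state_a=running
state_b=running

# access DEV: model one I/O attempt to a non-responding device (a or b).
access() {
    dev=$1
    eval "cur=\$state_$dev"
    if [ "$cur" = offline ]; then
        echo "access $dev: fails immediately (offline)"
        return 0
    fi
    # Timeout, then failed recovery: the reset revives all devices ...
    state_a=running
    state_b=running
    # ... and only the device that just failed is taken offline.
    eval "state_$dev=offline"
    echo "access $dev: timeout -> A=$state_a B=$state_b"
}

access a    # access a: timeout -> A=offline B=running
access b    # access b: timeout -> A=running B=offline
access a    # access a: timeout -> A=offline B=running
```

  In this model every access ends in a timeout, because the other device's
  recovery has already brought the target back to "running".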

Version-Release number of selected component (if applicable):
  All RHEL5 kernels

How reproducible:
  See below

Steps to Reproduce:
0. Environment
    kernel ... 2.6.29
    scsi LLD ... qla2xxx
    devices ... /dev/sdc (2:0:0:0), /dev/sdd (2:0:0:1)
    scsi timeout ... 3 seconds.

1. Getting the address of the scsi_host_template for the LLD
  Get the address of the scsi_host_template table specific to the LLD.
  In the case of the qla2xxx driver, the table name is
  "qla2x00_driver_template".

    # grep qla2x00_driver_template /proc/kallsyms
    f8a323c0 d qla2x00_driver_template      [qla2xxx]
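
  The lookup can also be wrapped in a small helper that prints the address
  in the 0x form used for the module parameter. This is a sketch: the
  function name symbol_address and its optional file argument are
  assumptions made for illustration, and it needs to run as root, as in
  the transcript above.

```shell
#!/bin/sh
# symbol_address NAME [FILE]: print the address of symbol NAME as 0x<hex>.
# FILE defaults to /proc/kallsyms; the optional argument (hypothetical)
# only exists so the function can be tried against a saved copy.
# Returns non-zero if the symbol is not listed.
symbol_address() {
    awk -v sym="$1" '
        $3 == sym { print "0x" $1; found = 1; exit }
        END       { exit !found }
    ' "${2:-/proc/kallsyms}"
}

# Example (as root):
#   symbol_address qla2x00_driver_template
```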

2. Building and loading the scsi timeout injection module
  Load the scsi timeout injection module with a "param" option, which
  takes three comma-separated values: the scsi_host_template address
  obtained in step 1 and the two scsi device targets on which a timeout
  error is to be injected.

  Here is an example that injects scsi timeouts into the scsi devices
  2:0:0:0 and 2:0:0:1.

    # insmod scsi_timeout.ko param=0xf8a323c0,2:0:0:0,2:0:0:1
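
  If the H:C:T:L address of a disk is not known, it can be read from sysfs,
  where /sys/block/<dev>/device is a symlink to a directory named after
  that address. A minimal sketch, in which the function name scsi_address
  and the optional sysfs root argument are assumptions for illustration:

```shell
#!/bin/sh
# scsi_address DEV [SYSFS_ROOT]: print the H:C:T:L address of block device
# DEV (e.g. sdc). SYSFS_ROOT defaults to /sys; the override is hypothetical
# and only useful for trying the function on a copied tree.
scsi_address() {
    basename "$(readlink -f "${2:-/sys}/block/$1/device")"
}

# Example, using the address from step 1:
#   insmod scsi_timeout.ko \
#       param=0xf8a323c0,$(scsi_address sdc),$(scsi_address sdd)
```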

3. Checking device states
  Both devices, /dev/sd[cd], are now in the "running" state.

    # cat /sys/block/sdc/device/state
    running
    # cat /sys/block/sdd/device/state
    running
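
  Since the state check is repeated after every step, it may be convenient
  to wrap it in a helper. This is a sketch only: the function names
  device_state and show_states and the SYSFS_ROOT variable are made up
  for illustration.

```shell
#!/bin/sh
# device_state DEV: print the scsi device state of block device DEV.
# SYSFS_ROOT (hypothetical override) defaults to /sys.
device_state() {
    cat "${SYSFS_ROOT:-/sys}/block/$1/device/state"
}

# show_states DEV...: print "DEV: STATE" for each device given.
show_states() {
    for dev in "$@"; do
        printf '%s: %s\n' "$dev" "$(device_state "$dev")"
    done
}

# Example:
#   show_states sdc sdd
```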

4. Issuing I/Os to the first device (/dev/sdc)
  Issue I/Os to the first device. The command fails after about 76 seconds.

    # dd if=/dev/sdc of=/dev/null bs=4096 count=100
    dd: reading `/dev/sdc': Input/output error
    0+0 records in
    0+0 records out
    0 bytes (0 B) copied, 75.4365 seconds, 0.0 kB/s
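
  To compare the duration across steps without reading it from the dd
  output, the command can be timed with a small wrapper. A sketch, where
  the function name elapsed is an assumption for illustration; it reports
  whole seconds only and ignores the command's exit status.

```shell
#!/bin/sh
# elapsed CMD [ARGS...]: run CMD with its output discarded and print the
# wall-clock time it took, in whole seconds. The exit status of CMD is
# ignored so that a failing dd does not abort a calling script.
elapsed() {
    start=$(date +%s)
    "$@" >/dev/null 2>&1 || true
    end=$(date +%s)
    echo $((end - start))
}

# Example (as root, against the device from this step):
#   elapsed dd if=/dev/sdc of=/dev/null bs=4096 count=100
```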

5. Checking device states
  The first device (/dev/sdc) has changed to "offline".

    # cat /sys/block/sdc/device/state
    offline
    # cat /sys/block/sdd/device/state
    running

6. Issuing I/Os to the second device (/dev/sdd)
  Issue I/Os to the second device. The command fails after about 76 seconds.

    # dd if=/dev/sdd of=/dev/null bs=4096 count=100
    dd: reading `/dev/sdd': Input/output error
    0+0 records in
    0+0 records out
    0 bytes (0 B) copied, 75.9649 seconds, 0.0 kB/s

7. Checking device states
  The second device (/dev/sdd) has changed to "offline", but the first
  device (/dev/sdc) has changed back to "running".

    # cat /sys/block/sdc/device/state
    running
    # cat /sys/block/sdd/device/state
    offline

8. Again issuing I/Os to the first device (/dev/sdc)
  I/Os to the first device take about 76 seconds once again, because the
  first device is back in the "running" state and the I/Os issued by the
  dd command are sent to the device.

    # dd if=/dev/sdc of=/dev/null bs=4096 count=100
    dd: reading `/dev/sdc': Input/output error
    0+0 records in
    0+0 records out
    0 bytes (0 B) copied, 75.8986 seconds, 0.0 kB/s

Actual results:
  The states of devices change from "offline" back to "running" when the
  scsi error recovery procedures are executed. (See "Steps to Reproduce")

Expected results:
  Once a device moves to the "offline" state, it does not move back to
  "running".
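
  The expected behaviour can be illustrated with a toy model in POSIX sh.
  This is a sketch of the expectation only, not the actual kernel change:
  it assumes the reset during recovery leaves "offline" devices alone, and
  the names access, state_a and state_b are made up for this illustration.

```shell
#!/bin/sh
# Toy model of the expected behaviour; not kernel code.
# Assumption: the reset during recovery only revives devices that are
# not already "offline".

state_a=running
state_b=running

# access DEV: model one I/O attempt to a non-responding device (a or b).
access() {
    dev=$1
    eval "cur=\$state_$dev"
    if [ "$cur" = offline ]; then
        echo "access $dev: fails immediately (offline)"
        return 0
    fi
    # Reset during recovery: "offline" is left untouched.
    [ "$state_a" = offline ] || state_a=running
    [ "$state_b" = offline ] || state_b=running
    # The device that just failed is taken offline.
    eval "state_$dev=offline"
    echo "access $dev: timeout -> A=$state_a B=$state_b"
}

access a    # access a: timeout -> A=offline B=running
access b    # access b: timeout -> A=offline B=offline
access a    # access a: fails immediately (offline)
```

  Here each broken device costs one timeout only; later accesses are
  rejected without waiting.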

Additional info:
  - Problem analysis and reproduction tool
    http://marc.info/?l=linux-scsi&m=124042136915970&w=2
  - A patch
    http://marc.info/?l=linux-scsi&m=124102164303979&w=2
Comment 1 Takahiro Yasui 2009-05-27 18:02:47 EDT
Status update: In SCSI tree

X-Git-Tree: SCSI
http://git.kernel.org/?p=linux/kernel/git/jejb/scsi-misc-2.6.git;a=commit;h=f1b95d1b7d621b0922490d27089b005d410e807e
Comment 2 Takahiro Yasui 2009-06-25 12:57:57 EDT
Status update: In Linus tree (2.6.31-rc1)

commit 5c10e63c943b4c67561ddc6bf61e01d4141f881f
Author: Takahiro Yasui <tyasui@redhat.com>
Date:   Wed Apr 29 12:13:02 2009 -0400
Comment 3 Takahiro Yasui 2009-11-16 15:29:02 EST

*** This bug has been marked as a duplicate of bug 516934 ***