Bug 678472 - multipath recovery from scsi devices offlined by scsi err handler
Summary: multipath recovery from scsi devices offlined by scsi err handler
Keywords:
Status: CLOSED DUPLICATE of bug 641193
Alias: None
Product: Red Hat Enterprise Linux 5
Classification: Red Hat
Component: device-mapper-multipath
Version: 5.6
Hardware: All
OS: Linux
Priority: high
Severity: high
Target Milestone: rc
Assignee: Ben Marzinski
QA Contact: Red Hat Kernel QE team
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2011-02-18 04:23 UTC by Mark Goodwin
Modified: 2018-11-14 14:46 UTC (History: 10 users)

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2011-03-14 19:49:11 UTC
Target Upstream Version:
Embargoed:



Description Mark Goodwin 2011-02-18 04:23:33 UTC
Description of problem:

A catastrophic Fibre Channel transport or target failure resulting in
a DRIVER_TIMEOUT causes the RHEL SCSI error handler to offline SCSI
devices/paths; see scsi_eh_ready_devs(). Device-mapper-multipath
cannot recover from this without manual intervention to bring the devices
back to the running state via the sysfs interface before failback
is possible (SCSI rejects all I/O to offline devices, so all path_checker
probes fail).

Typical syslog messages when this occurs:

Jan 27 11:42:38 somehost kernel: lpfc 0000:13:00.0: 0:0713 SCSI layer issued LUN reset (0, 38) Data: x0 x3 x2
Jan 27 11:43:11 somehost kernel: lpfc 0000:13:00.0: 0:0714 SCSI layer issued Bus Reset Data: x2002
Jan 27 11:43:31 somehost multipathd: 8:16: readsector0 checker reports path is down
Jan 27 11:43:31 somehost multipathd: checker failed path 8:16 in map mpath1
Jan 27 11:43:31 somehost multipathd: mpath1: remaining active paths: 1
Jan 27 11:43:31 somehost kernel: scsi: Device offlined - not ready after error recovery: host 0 channel 0 id 0 lun 38
Jan 27 11:43:31 somehost kernel: SCSI error : <0 0 0 38> return code = 0x6000000
Jan 27 11:43:31 somehost kernel: device-mapper: dm-multipath: Failing path 8:16.
Jan 27 11:46:30 somehost kernel: lpfc 0000:13:00.1: 1:0713 SCSI layer issued LUN reset (0, 38) Data: x0 x3 x2
Jan 27 11:47:21 somehost kernel: lpfc 0000:13:00.0: 0:0713 SCSI layer issued LUN reset (0, 39) Data: x0 x3 x2
Jan 27 11:47:40 somehost kernel: lpfc 0000:13:00.1: 1:0713 SCSI layer issued LUN reset (0, 39) Data: x0 x3 x2
Jan 27 11:47:54 somehost kernel: lpfc 0000:13:00.0: 0:0714 SCSI layer issued Bus Reset Data: x2002
Jan 27 11:48:13 somehost kernel: lpfc 0000:13:00.1: 1:0714 SCSI layer issued Bus Reset Data: x2002
Jan 27 11:48:14 somehost kernel: scsi: Device offlined - not ready after error recovery: host 0 channel 0 id 0 lun 39
Jan 27 11:48:14 somehost kernel: SCSI error : <0 0 0 39> return code = 0x6000000
Jan 27 11:48:14 somehost kernel: end_request: I/O error, dev sdc, sector 58726016
Jan 27 11:48:14 somehost kernel: device-mapper: dm-multipath: Failing path 8:32.
Jan 27 11:48:14 somehost kernel: scsi0 (0:39): rejecting I/O to offline device
Jan 27 11:48:14 somehost last message repeated 7 times
Jan 27 11:48:14 somehost multipathd: 8:32: readsector0 checker reports path is down
Jan 27 11:48:14 somehost multipathd: checker failed path 8:32 in map mpath2
Jan 27 11:48:14 somehost kernel: scsi0 (0:39): rejecting I/O to offline device
Jan 27 11:48:14 somehost last message repeated 32 times
Jan 27 11:48:14 somehost kernel: SCSI error : <0 0 0 39> return code = 0x6000000 

Version-Release number of selected component (if applicable):
Any version of device-mapper-multipath in RHEL 5 (presumably any RHEL version).


How reproducible:
Does not occur often, but when it does it causes considerable grief.

Steps to Reproduce:
1. set up a multipath config
2. offline one of the active paths, e.g. for sdc :
   # echo offline > /sys/block/sdc/device/state
3. observe dm-multipath fail the paths, and never recover without
   manually changing the state to "running" again.
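The manual intervention mentioned above amounts to writing "running" back into the device's sysfs state file. A minimal sketch follows; SYSFS_ROOT is parameterized only so the function can be exercised outside a live system, and "sdc" is just an example device name:

```shell
# Sketch of the manual recovery step: flip an offlined SCSI device back
# to "running" so the multipathd path_checker can succeed again.
# SYSFS_ROOT defaults to the real sysfs; override it for a dry run.
SYSFS_ROOT="${SYSFS_ROOT:-/sys}"

recover_path() {
    dev="$1"                                  # e.g. "sdc"
    state_file="$SYSFS_ROOT/block/$dev/device/state"
    if [ "$(cat "$state_file")" = "offline" ]; then
        # The SCSI midlayer accepts this transition from userspace.
        echo running > "$state_file"
        echo "$dev: offline -> running"
    fi
}
```

Once the device is back in the running state, the next path_checker probe should succeed and multipathd reinstates the path on its own.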
  
Actual results:
scsi device/paths remain failed

Expected results:
scsi device/paths brought back to the running state without requiring
manual intervention.

Additional info:
This might be considered an RFE rather than a bug: it may not always be
safe to return devices to the running state. If so, this bug should at
least aim to improve the syslog messages emitted when the path_checker
fails.

Comment 1 Ben Marzinski 2011-02-18 22:51:21 UTC
So you want multipath to notice that the SCSI device state is offlined in sysfs, and reset it to running?  I suppose that multipath could do that, but I would make it controllable by a configuration variable, since I don't think that everyone would want this behaviour.
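To make the opt-in behaviour being discussed concrete, a recovery pass could look roughly like the loop below. This is a hypothetical sketch, not an existing multipathd feature: ALLOW_OFFLINE_RECOVERY is an illustrative knob, not a real multipath.conf option, and SYSFS_ROOT is parameterized only so the loop can be tested outside a live system.

```shell
# Hypothetical sketch of an opt-in recovery pass: scan SCSI block devices
# and flip "offline" back to "running", but only when the administrator
# has explicitly enabled it.
SYSFS_ROOT="${SYSFS_ROOT:-/sys}"
ALLOW_OFFLINE_RECOVERY="${ALLOW_OFFLINE_RECOVERY:-0}"   # illustrative knob

recover_offline_devices() {
    [ "$ALLOW_OFFLINE_RECOVERY" = "1" ] || return 0
    for state_file in "$SYSFS_ROOT"/block/*/device/state; do
        [ -f "$state_file" ] || continue
        if [ "$(cat "$state_file")" = "offline" ]; then
            echo running > "$state_file"
        fi
    done
}
```

Defaulting the knob to off matches the concern above: not everyone would want devices forced back to running automatically.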

Comment 2 Mike Christie 2011-02-19 03:55:31 UTC
There was a bug in 5.5 where, if a device was offlined and the transport came back, the fc class could not set the devices back to running. This was fixed in 5.6; not sure if that is what you are hitting. It applies to cases where the remote port goes from Online -> Blocked -> something else (like Not Present) -> Online.

For other cases where the remote port is not affected (so the port state stays Online the entire time), the fix in 5.6 would not help you, and you probably want to make this configurable. If the device has gone bad, I think something like an INQUIRY (some path checkers use that, right?) could work in some cases on some targets, but READs/WRITEs might fail.

Comment 3 Mark Goodwin 2011-02-21 03:02:45 UTC
(In reply to comment #1)
> So you want multipath to notice that the scsi device state is offlined in
> sysfs, and reset it to running?

Well, only if the fix for BZ 641193 doesn't help, but it looks like that bug
may be the root cause here: this site is running 2.6.18-194.11.1.el5
and so is affected by the regression that Mike mentioned (where only
devices in state SDEV_BLOCK would be automatically transitioned back to
SDEV_RUNNING). The fix in RHEL 5.6 transitions devices from any state
(including SDEV_OFFLINE) back to SDEV_RUNNING when the rport returns.
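To tell the two cases apart on an affected host, the FC remote port states can be compared against the SCSI device states. A read-only sketch, with SYSFS_ROOT parameterized only so it can be exercised against a mock tree:

```shell
# Read-only sketch: list FC remote port states alongside SCSI device
# states, to distinguish "rport came back but devices stayed offline"
# from "rport stayed Online while devices were offlined".
SYSFS_ROOT="${SYSFS_ROOT:-/sys}"

dump_states() {
    for f in "$SYSFS_ROOT"/class/fc_remote_ports/*/port_state; do
        [ -f "$f" ] || continue
        p=${f%/port_state}
        echo "${p##*/}: $(cat "$f")"        # e.g. "rport-0:0-1: Online"
    done
    for f in "$SYSFS_ROOT"/block/*/device/state; do
        [ -f "$f" ] || continue
        p=${f%/device/state}
        echo "${p##*/}: $(cat "$f")"        # e.g. "sdc: offline"
    done
}
```

An Online rport alongside an offline device would indicate the case the RHEL 5.6 fix does not cover.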

So I think the only reason we'd want this RFE for multipath to force the
transition back to SDEV_RUNNING is the case where the rport remains online
despite the devices being offlined. I don't know how often that has been
hit in the wild, if ever, so perhaps DUP this to BZ 641193. Thoughts?

Regards and thanks
-- Mark Goodwin

Comment 4 Ben Marzinski 2011-02-23 15:28:09 UTC
This is the first report I've heard of multipath path devices getting incorrectly marked as offline, so I'm leaning towards DUPing it.  Mike, do you know of any other cases where we'd need to worry about this?

Comment 5 Mark Goodwin 2011-03-11 04:23:49 UTC
Ben, are you going to DUP this one? (did you hear back from Mike?)

Regards
-- Mark

Comment 6 Ben Marzinski 2011-03-14 19:49:11 UTC

*** This bug has been marked as a duplicate of bug 641193 ***

