244967 – Frequent path failures during I/O on DM multipath devices

Summary: Frequent path failures during I/O on DM multipath devices

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	Red Hat Enterprise Linux 5
Classification:	Red Hat
Component:	kernel
Sub Component:
Version:	5.2
Hardware:	All
OS:	Linux
Priority:	urgent
Severity:	urgent
Target Milestone:	rc
Target Release:	---
Assignee:	Marcus Barrow
QA Contact:	Red Hat Kernel QE team
Docs Contact:
URL:
Whiteboard:
Duplicates (2):	244968 531002
Depends On:
Blocks:	495635 524179

Reported:	2007-06-20 07:14 UTC by vijay
Modified:	2018-10-20 01:58 UTC
CC List:	33 users
Fixed In Version:
Doc Type:	Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed:	2009-09-02 08:09:34 UTC
Target Upstream Version:
Embargoed:
Dependent Products:

Attachments
use did error for dropped frame (419 bytes, application/octet-stream) 2009-03-17 17:20 UTC, Mike Christie	no flags
also return transport_disrupted (2.28 KB, patch) 2009-04-08 04:07 UTC, Marcus Barrow	no flags

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Red Hat Product Errata	RHEA-2009:1377	0	normal	SHIPPED_LIVE	device-mapper-multipath bug-fix and enhancement update	2009-09-01 12:41:23 UTC
Red Hat Product Errata	RHSA-2009:1243	0	normal	SHIPPED_LIVE	Important: Red Hat Enterprise Linux 5.4 kernel security and bug fix update	2009-09-01 08:53:34 UTC

Description vijay 2007-06-20 07:14:11 UTC

Description of problem:

During I/O on dm multipath devices, we are seeing frequent path failures, which
lead to unexpected I/O failover.

Snippet of syslog during failure :
****************************************
scsi(2:1:16) UNDERRUN status detected 0x15-0x0. 
resid=0x7fff8fff fw_resid=0x7fff8fff cdb=0x28 
os_underflow=0xf400 srb_flags=0x2
scsi(2:0:1:16) Dropped frame(s) detected 
(7fff8fff of f400 bytes)...retrying command.
scsi(2:1:16) qla2x00_done: did_error = 2,
comp-scsi= 0x15-0x0 pid=102056310.
SCSI error : <2 0 1 16> return code = 0x20000
end_request: I/O error, dev sdbm, sector 4192702
end_request: I/O error, dev sdbm, sector 4192708
device-mapper: dm-multipath: Failing path 68:0.


As we understand it, paths are being marked as failed when the driver returns
the status DID_BUS_BUSY. Because I/O on multipath devices has BIO_RW_FAILFAST
set (and hence REQ_FAILFAST), the SCSI midlayer does not retry errors such as
QUEUE FULL or UNDERRUN (as captured in the syslog snippet above). Is there any
way to override BIO_RW_FAILFAST so that retries happen and these unexpected
path failures are avoided?
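
For reference, the "return code = 0x20000" in the syslog snippet can be unpacked by hand. The following is a minimal Python sketch, assuming the conventional Linux SCSI result layout (status byte in bits 0-7, message byte in bits 8-15, host byte in bits 16-23, driver byte in bits 24-31) and the host byte values from the upstream kernel's include/scsi/scsi.h. The function name decode_scsi_result is made up for illustration and is not part of any tool mentioned in this report.

```python
# Host byte values as defined in the upstream Linux kernel (include/scsi/scsi.h).
HOST_BYTE_NAMES = {
    0x00: "DID_OK",
    0x01: "DID_NO_CONNECT",
    0x02: "DID_BUS_BUSY",
    0x03: "DID_TIME_OUT",
    0x04: "DID_BAD_TARGET",
    0x05: "DID_ABORT",
    0x06: "DID_PARITY",
    0x07: "DID_ERROR",
    0x08: "DID_RESET",
    0x09: "DID_BAD_INTR",
    0x0A: "DID_PASSTHROUGH",
    0x0B: "DID_SOFT_ERROR",
    0x0C: "DID_IMM_RETRY",
    0x0D: "DID_REQUEUE",
    0x0E: "DID_TRANSPORT_DISRUPTED",
    0x0F: "DID_TRANSPORT_FAILFAST",
}


def decode_scsi_result(result):
    """Split a 32-bit SCSI result word into its four bytes (hypothetical helper)."""
    host = (result >> 16) & 0xFF
    return {
        "status": result & 0xFF,          # SCSI status byte from the device
        "msg": (result >> 8) & 0xFF,      # message byte
        "host": HOST_BYTE_NAMES.get(host, hex(host)),  # set by the HBA driver
        "driver": (result >> 24) & 0xFF,  # driver byte
    }


if __name__ == "__main__":
    # The value logged as "SCSI error : <2 0 1 16> return code = 0x20000".
    print(decode_scsi_result(0x20000))
```

Under these assumptions the host byte of 0x20000 is 0x02, i.e. DID_BUS_BUSY, which agrees with "did_error = 2" in the qla2x00_done log line.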


Version-Release number of selected component (if applicable):


How reproducible:


Steps to Reproduce:
1. Present at least 50 devices (logical units) with 8 paths each to the host.
2. Start I/O on those 50 devices.
3. Syslog captures "SCSI error" and "dm-multipath: Failing path".
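
When checking syslog in step 3, it can help to map the major:minor pair in "Failing path 68:0" back to a disk name. The sketch below assumes the standard Linux sd major number allocation (majors 8, 65-71 and 128-135, 16 disks per major, 16 minors per disk). The helper names sd_name and failing_paths are hypothetical, and the regular expression is written only against the message format shown in the snippet above.

```python
import re

# Block majors assigned to the sd driver; each major covers 16 disks.
SD_MAJORS = [8] + list(range(65, 72)) + list(range(128, 136))

FAILING_PATH_RE = re.compile(r"dm-multipath: Failing path (\d+):(\d+)")


def sd_name(major, minor):
    """Translate an sd major:minor pair into a name such as 'sdbm' or 'sdb1'."""
    disk = SD_MAJORS.index(major) * 16 + minor // 16
    partition = minor % 16
    letters = ""
    n = disk
    # Disk letters run a..z, aa..zz, ... (bijective base 26).
    while n >= 0:
        letters = chr(ord("a") + n % 26) + letters
        n = n // 26 - 1
    return "sd" + letters + (str(partition) if partition else "")


def failing_paths(lines):
    """Return the sd names for every 'Failing path' message in the given lines."""
    names = []
    for line in lines:
        match = FAILING_PATH_RE.search(line)
        if match:
            names.append(sd_name(int(match.group(1)), int(match.group(2))))
    return names


if __name__ == "__main__":
    print(failing_paths(["device-mapper: dm-multipath: Failing path 68:0."]))
```

With these assumptions, 68:0 resolves to sdbm, the same device named in the end_request errors in the syslog snippet.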
  
Actual results:
Unexpected path failures are seen during I/O.

Expected results:

Additional info:

1. multipath.conf setting:
        device {
                vendor                  "HP"
                product                 "HSV210"
                path_grouping_policy    group_by_prio
                getuid_callout          "/sbin/scsi_id -g -u -s /block/%n"
                path_checker            tur
                path_selector           "round-robin 0"
                prio_callout            "/sbin/mpath_prio_alua %n"
                rr_weight               uniform
                failback                immediate
                hardware_handler        "0"
                no_path_retry           60
        }

Comment 1 Ben Marzinski 2007-10-26 20:35:33 UTC

*** Bug 244968 has been marked as a duplicate of this bug. ***

Comment 2 Richard Wojdak 2009-01-16 16:32:47 UTC

Hi,
  We are experiencing the same issue at the HP Marlboro HBA lab.

Comment 5 Mike Christie 2009-03-17 17:20:49 UTC

Created attachment 335568
use did error for dropped frame

This has qla2xxx use DID_ERROR for dropped frames. For RHEL 5.3 we changed scsi-ml so that it would retry in the SCSI layer for this error. It only retries 5 times, so if you are still getting an error then you really have a problem and probably do not want to use that path anymore.

This syncs qla2xxx with lpfc for this behavior.
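
The behavioral difference can be illustrated with a toy model. This is a sketch based only on the descriptions in this report (DID_BUS_BUSY on a failfast request is failed upward without a retry, while DID_ERROR is retried up to 5 times); it is not the kernel's code, and the function run_command and its arguments are hypothetical.

```python
MAX_RETRIES = 5  # retry limit quoted in the comment above, not read from kernel source


def run_command(results, failfast=True, max_retries=MAX_RETRIES):
    """Model one command whose successive attempts complete with the given host codes.

    Returns (outcome, attempts), where outcome is "ok", "failed", or
    "incomplete" if the supplied results run out before a decision is reached.
    """
    retries = 0
    attempts = 0
    for code in results:
        attempts += 1
        if code == "DID_OK":
            return ("ok", attempts)
        if code == "DID_BUS_BUSY" and failfast:
            # Failfast request: the error goes straight up to dm-multipath,
            # which then fails the path.
            return ("failed", attempts)
        if retries >= max_retries:
            # Retry budget exhausted: give up on this path.
            return ("failed", attempts)
        retries += 1
    return ("incomplete", attempts)


if __name__ == "__main__":
    # One dropped frame followed by a clean completion.
    print(run_command(["DID_BUS_BUSY", "DID_OK"]))  # old driver behavior in this model
    print(run_command(["DID_ERROR", "DID_OK"]))     # behavior with the patch in this model
```

In this model a single transient dropped frame fails the path when the driver reports DID_BUS_BUSY, but is absorbed by a retry when it reports DID_ERROR; a path that keeps returning DID_ERROR still fails once the retries run out.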

Comment 23 Tom Coughlan 2009-04-02 15:49:35 UTC

Marcus,

Please review and post this patch for 5.4 as soon as possible. This needs to be done so that it can be backported to 5.3.z, and provided to the customer. 

Tom

Comment 27 Marcus Barrow 2009-04-08 04:07:40 UTC

Created attachment 338653
also return transport_disrupted

Like Mike's patch, but it also returns DID_TRANSPORT_DISRUPTED in a couple of places.

Comment 28 Marcus Barrow 2009-04-09 19:51:01 UTC

Can I request a version number? 


8.02.00.06.05.03-k  -> 8.02.00.07.05.03

Comment 29 Marcus Barrow 2009-04-09 19:52:33 UTC

I meant for the z-stream backport, if they take this patch out of sequence.

Comment 35 Don Zickus 2009-04-20 17:09:57 UTC

in kernel-2.6.18-140.el5
You can download this test kernel from http://people.redhat.com/dzickus/el5

Please do NOT transition this bugzilla state to VERIFIED until our QE team
has sent specific instructions indicating when to do so.  However feel free
to provide a comment indicating that this fix has been verified.

Comment 41 errata-xmlrpc 2009-09-02 08:09:34 UTC

An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2009-1243.html

Comment 44 Linda Wang 2009-10-26 13:46:23 UTC

*** Bug 531002 has been marked as a duplicate of this bug. ***
