Description of problem:
While running I/O on dm-multipath devices, we are seeing frequent path failures, which lead to unexpected I/O failover.

Snippet of syslog during failure:
****************************************
scsi(2:1:16) UNDERRUN status detected 0x15-0x0. resid=0x7fff8fff fw_resid=0x7fff8fff cdb=0x28 os_underflow=0xf400 srb_flags=0x2
scsi(2:0:1:16) Dropped frame(s) detected (7fff8fff of f400 bytes)...retrying command.
scsi(2:1:16) qla2x00_done: did_error = 2, comp-scsi= 0x15-0x0 pid=102056310.
SCSI error : <2 0 1 16> return code = 0x20000
end_request: I/O error, dev sdbm, sector 4192702
end_request: I/O error, dev sdbm, sector 4192708
device-mapper: dm-multipath: Failing path 68:0.

As we understand it, paths are being marked as failed when the driver returns the status DID_BUS_BUSY. Since I/Os on multipath devices have BIO_RW_FAILFAST set (and hence REQ_FAILFAST), the SCSI mid layer does not retry on errors such as QUEUEFULL, UNDERRUN (as captured in the syslog snippet above), and so on.

Is there any way to override BIO_RW_FAILFAST so that retries happen and unexpected path failures are avoided?

Version-Release number of selected component (if applicable):

How reproducible:

Steps to Reproduce:
1. Present at least 50 devices (Logical Units) with 8 paths each to the host.
2. Start I/O on those 50 devices.
3. Syslog captures "SCSI error" and "dm-multipath: Failing path".

Actual results:
Unexpected path failures are seen during the I/O.

Expected results:

Additional info:
1. multipath.conf setting:
    device {
        vendor "HP"
        product "HSV210"
        path_grouping_policy group_by_prio
        getuid_callout "/sbin/scsi_id -g -u -s /block/%n"
        path_checker tur
        path_selector "round-robin 0"
        prio_callout "/sbin/mpath_prio_alua %n"
        rr_weight uniform
        failback immediate
        hardware_handler "0"
        no_path_retry 60
    }
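For anyone reading the syslog snippet above, here is a small sketch of how "return code = 0x20000" and "Failing path 68:0" can be decoded by hand. It assumes the usual Linux layout of the SCSI result word (host byte in bits 16-23) and the host byte values from include/scsi/scsi.h, and it assumes the classic sd numbering of 16 minors per disk across majors 8, 65-71 and 128-135. The function names host_byte and sd_name are made up for this illustration and are not kernel or multipath-tools APIs.

```python
# Host byte values as defined in Linux include/scsi/scsi.h (subset).
HOST_BYTE_NAMES = {
    0x00: "DID_OK",
    0x01: "DID_NO_CONNECT",
    0x02: "DID_BUS_BUSY",
    0x03: "DID_TIME_OUT",
    0x07: "DID_ERROR",
    0x0E: "DID_TRANSPORT_DISRUPTED",
}

# Block majors used by the sd driver; each major holds 16 disks
# (assumption: 16 minors per disk, i.e. the whole disk plus 15 partitions).
SD_MAJORS = [8] + list(range(65, 72)) + list(range(128, 136))
MINORS_PER_DISK = 16


def host_byte(result):
    """Extract the host byte (bits 16-23) from a SCSI result word."""
    return (result >> 16) & 0xFF


def sd_name(major, minor):
    """Map an sd major:minor pair to its whole-disk name (sda, sdb, ..., sdaa, ...)."""
    index = SD_MAJORS.index(major) * 16 + minor // MINORS_PER_DISK
    # Disk names use a bijective base-26 scheme: a..z, aa..az, ba..bz, ...
    n = index + 1
    letters = ""
    while n > 0:
        n, rem = divmod(n - 1, 26)
        letters = chr(ord("a") + rem) + letters
    return "sd" + letters


if __name__ == "__main__":
    code = 0x20000
    print(hex(code), "->", HOST_BYTE_NAMES.get(host_byte(code), "unknown"))
    print("68:0 ->", sd_name(68, 0))
```

Under these assumptions, 0x20000 decodes to host byte 0x02 (DID_BUS_BUSY), which agrees with "did_error = 2" in the qla2x00_done line, and 68:0 maps to sdbm, the same device named in the end_request errors.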
*** Bug 244968 has been marked as a duplicate of this bug. ***
Hi, we are experiencing the same issue at the HP Marlboro HBA lab.
Created attachment 335568: use did error for dropped frame

This patch has qla2xxx use DID_ERROR for dropped frames. For RHEL 5.3 we changed scsi-ml so that it retries in the SCSI layer for this error. It only retries 5 times, so if you are still getting an error then you really have a problem and probably do not want to use that path anymore. This syncs qla2xxx with lpfc for this behavior.
Marcus,

Please review and post this patch for 5.4 as soon as possible. This needs to be done so that it can be backported to 5.3.z and provided to the customer.

Tom
Created attachment 338653: also return transport_disrupted

Like Mike's patch, but also returns DID_TRANSPORT_DISRUPTED in a couple of places.
Can I request a version number? 8.02.00.06.05.03-k -> 8.02.00.07.05.03
I meant for the z-stream backport, if they take this patch out of sequence.
in kernel-2.6.18-140.el5

You can download this test kernel from http://people.redhat.com/dzickus/el5

Please do NOT transition this bugzilla state to VERIFIED until our QE team has sent specific instructions indicating when to do so. However, feel free to provide a comment indicating that this fix has been verified.
An advisory has been issued which should help with the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2009-1243.html
*** Bug 531002 has been marked as a duplicate of this bug. ***