Bug 516170
| Field | Value |
|---|---|
| Summary | kernel multipath driver behaves badly on medium errors |
| Product | Red Hat Enterprise Linux 5 |
| Component | kernel |
| Version | 5.3 |
| Hardware | x86_64 |
| OS | Linux |
| Status | CLOSED ERRATA |
| Severity | high |
| Priority | high |
| Reporter | David Monro <david.monro> |
| Assignee | Mike Snitzer <msnitzer> |
| QA Contact | Gris Ge <fge> |
| CC | bdonahue, bmarzins, coughlan, cww, david.monro, djeffery, fge, hchiramm, mchristi, qcai, tao, tom |
| Target Milestone | rc |
| Fixed In Version | kernel-2.6.18-290.el5 |
| Doc Type | Bug Fix |
| Last Closed | 2012-02-21 03:26:15 UTC |
| Bug Blocks | 668957, 719046, 726799 |

Description
David Monro
2009-08-07 06:33:47 UTC
Created attachment 361701: use scsi_debug driver to emulate multipath
The problem is confirmed and reproducible with a small code change to the scsi_debug driver. The patch makes multipath think the debug disks on different debug hosts are the same disk.
To reproduce, load the modified scsi_debug driver with the parameters add_host=2 opts=2, which creates two debug disks and enables the medium error option. Configure the devices in multipath.conf with no_path_retry (I used a value of 15) and start multipath. Then read sector 0x1234 on the multipath device. scsi_debug returns a medium error for this sector, so the read request bounces between the two scsi_debug devices indefinitely and never completes.
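The reproduction steps above can be sketched as a small dry-run script. This is only an illustration, not the reporter's exact procedure: the device name /dev/mapper/mpath0 and the helper function repro_cmds are assumptions, and the script prints the commands instead of running them, since they need root and the patched scsi_debug module from the attachment.

```shell
#!/bin/sh
# Dry-run sketch: prints the reproduction commands rather than running them,
# because they need root and the patched scsi_debug module from the attachment.

# Assumed device name; use whatever map name multipath actually creates.
MPATH_DEV="${MPATH_DEV:-/dev/mapper/mpath0}"

# Sector named in the report, converted from hex to decimal for dd.
BAD_SECTOR=$((0x1234))

# Hypothetical helper that lists the two commands from the description.
repro_cmds() {
    # The scsi_debug module parameter is spelled add_host (singular).
    echo "modprobe scsi_debug add_host=2 opts=2"
    # bs=512 makes skip count in 512-byte sectors.
    echo "dd if=${MPATH_DEV} of=/dev/null bs=512 skip=${BAD_SECTOR} count=1"
}

repro_cmds
```

multipath has to be configured and started between the two printed commands, so the output is meant to be followed step by step rather than piped into a shell.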
We have also seen this on HP EVA storage arrays with qla2xxx FC HBAs under RHEL 5.3 x86_64. We implemented the same workaround as the original reporter (disabling queue_if_no_path) for that map.

This request was evaluated by Red Hat Product Management for inclusion in Red Hat Enterprise Linux 5.6, and Red Hat does not plan to fix this issue in the currently developed update. Contact your manager or support representative in case you need to escalate this bug.

(In reply to comment #0)
> Each time it fails a path, multipathd correctly re-enables it (since of course
> you can do IO to the rest of the disk happily). In my view the multipath driver
> shouldn't retry the IO, just return the IO error up the stack to the
> application.

Right, and this is how RHEL6.2 should behave because it includes these upstream commits (all are in Linux 2.6.39) to immediately propagate the error:
http://git.kernel.org/linus/ad63082
http://git.kernel.org/linus/63583cc
http://git.kernel.org/linus/751b2a7
http://git.kernel.org/linus/7977556

I'll need to scope how difficult it would be to backport those changes to 5.8. It could be that pulling in the SCSI patches (listed above) depends on other SCSI patches that RHEL5 doesn't have, making for a backport that snowballs and ultimately isn't doable (due to kABI or some such).

Setting Conditional NAK: Design

Patch(es) available in kernel-2.6.18-290.el5
You can download this test kernel (or newer) from http://people.redhat.com/jwilson/el5
Detailed testing feedback is always welcomed.

Reproduced on RHEL 5.7 GA: multipath tries all paths before returning an I/O error to the application.
On kernel -300, multipath returns the I/O error to the application right after the first path gets "Medium Error".

Commands to set up multipath:
====
modprobe scsi_debug dev_size_mb=100 num_tgts=1 \
    vpd_use_hostno=0 add_host=4 delay=20 max_luns=2 no_lun_0=1 opts=2
====
Command to hit the medium error sector:
====
dd if=/dev/mapper/mpath0 of=/dev/null bs=512 skip=4659 count=1
====
dmesg will show whether multipath checked all paths or only one path.

VERIFY.

Other storage regression tests for this big change will be reported in the errata (so far, no issues found on kernel 300).

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.
http://rhn.redhat.com/errata/RHSA-2012-0150.html