Description of problem:
When an attempt is made to read a bad sector via a multipath device, the multipath driver fails the path it hit the medium error through and retries through the next path. Meanwhile, the multipathd daemon re-enables the failed path. As a result the I/O hangs forever (even if you unset the queue_if_no_path feature and set no_path_retry to something sensible: because the paths keep getting re-enabled, we never reach the point where there are no live paths).

Scenario: we are using an EMC Clariion CX4-240 effectively as JBOD for this host; individual backend disks are in raid groups of type "Disk", with a single lun on each of these raid groups presented to the host. We have configured the Red Hat multipath driver to disable the queue_if_no_path feature and set the no_path_retry option to 60. The idea is that in the event of a disk failure, the IO will eventually fail at the application level, so the application can know the disk has failed and take appropriate action (the application in this instance being Oracle ASM).

Version-Release number of selected component (if applicable):
Tried with both 2.6.18-128.1.6.el5 x86_64 and 2.6.18-157.el5 x86_64.

How reproducible:
Every time (if you have access to a disk with a medium error).

Steps to Reproduce:
1. Configure an EMC Clariion CX4-240 with a disk with a known medium error into a raid group of type 'Disk', bind a lun to the resulting raid group and present it to a host (this gives a one-to-one mapping between the lun presented by the array and the backend disk, with no redundancy - effectively using the array as JBOD).
2. Attempt to read the portion of the disk containing the medium error via the multipath device.

Actual results:
The IO hangs forever, as the multipath driver retries the request down different paths and multipathd keeps re-enabling the paths behind it.
Expected results:
The IO will terminate with an I/O error (just as it does if you attempt a read from one of the /dev/sdX devices making up the multipath device). Reading the same sector via a different path isn't going to make it magically get better.

Additional info:
You will see messages like this in /var/log/messages:

Aug 7 15:11:41 omnitrix kernel: sd 0:0:0:0: SCSI error: return code = 0x08070002
Aug 7 15:11:41 omnitrix kernel: sda: Current: sense key: Medium Error
Aug 7 15:11:41 omnitrix kernel: Add. Sense: Unrecovered read error
Aug 7 15:11:41 omnitrix kernel:
Aug 7 15:11:41 omnitrix kernel: end_request: I/O error, dev sda, sector 23904512
Aug 7 15:11:41 omnitrix kernel: device-mapper: multipath: Failing path 8:0.
Aug 7 15:11:41 omnitrix multipathd: 8:0: mark as failed
Aug 7 15:11:41 omnitrix multipathd: plaza_0_2_6_mediumerr_asm: remaining active paths: 3
Aug 7 15:11:41 omnitrix multipathd: dm-6: add map (uevent)
Aug 7 15:11:41 omnitrix multipathd: dm-6: devmap already registered
Aug 7 15:11:46 omnitrix multipathd: sda: emc_clariion_checker: Path healthy
Aug 7 15:11:46 omnitrix multipathd: 8:0: reinstated
Aug 7 15:11:46 omnitrix multipathd: plaza_0_2_6_mediumerr_asm: remaining active paths: 4
Aug 7 15:11:46 omnitrix multipathd: dm-6: add map (uevent)
Aug 7 15:11:46 omnitrix multipathd: dm-6: devmap already registered
Aug 7 15:13:41 omnitrix kernel: sd 1:0:0:0: SCSI error: return code = 0x08070002
Aug 7 15:13:41 omnitrix kernel: sdg: Current: sense key: Medium Error
Aug 7 15:13:41 omnitrix kernel: Add. Sense: Unrecovered read error
Aug 7 15:13:41 omnitrix kernel:
Aug 7 15:13:41 omnitrix kernel: end_request: I/O error, dev sdg, sector 23904512
Aug 7 15:13:41 omnitrix kernel: device-mapper: multipath: Failing path 8:96.
Aug 7 15:13:41 omnitrix multipathd: 8:96: mark as failed
Aug 7 15:13:41 omnitrix multipathd: plaza_0_2_6_mediumerr_asm: remaining active paths: 3
Aug 7 15:13:41 omnitrix multipathd: dm-6: add map (uevent)
Aug 7 15:13:41 omnitrix multipathd: dm-6: devmap already registered
Aug 7 15:13:46 omnitrix multipathd: sdg: emc_clariion_checker: Path healthy
Aug 7 15:13:46 omnitrix multipathd: 8:96: reinstated
Aug 7 15:13:46 omnitrix multipathd: plaza_0_2_6_mediumerr_asm: remaining active paths: 4
Aug 7 15:13:46 omnitrix multipathd: plaza_0_2_6_mediumerr_asm: remaining active paths: 4
Aug 7 15:13:46 omnitrix multipathd: dm-6: add map (uevent)
Aug 7 15:13:46 omnitrix multipathd: dm-6: devmap already registered
Aug 7 15:14:35 omnitrix ntpd[3722]: synchronized to 10.40.2.1, stratum 3
Aug 7 15:14:43 omnitrix kernel: sd 0:0:0:0: SCSI error: return code = 0x08070002
Aug 7 15:14:43 omnitrix kernel: sda: Current: sense key: Medium Error
Aug 7 15:14:43 omnitrix kernel: Add. Sense: Unrecovered read error
Aug 7 15:14:43 omnitrix kernel:
Aug 7 15:14:43 omnitrix kernel: end_request: I/O error, dev sda, sector 23904512
Aug 7 15:14:43 omnitrix kernel: device-mapper: multipath: Failing path 8:0.
Aug 7 15:14:43 omnitrix multipathd: 8:0: mark as failed

So you can see it is cycling back and forth between devices 0:0:0:0 (sda) and 1:0:0:0 (sdg), which are the two active paths to the device:

# /sbin/multipath -ll
plaza_0_2_6_mediumerr_asm (36006016061b0220089ab8d27bd33de11) dm-6 DGC,DISK
[size=134G][features=1 queue_if_no_path][hwhandler=0][rw]
\_ round-robin 0 [prio=2][active]
 \_ 0:0:0:0 sda 8:0   [active][ready]
 \_ 1:0:0:0 sdg 8:96  [active][ready]
\_ round-robin 0 [prio=0][enabled]
 \_ 0:0:2:0 sdd 8:48  [active][ready]
 \_ 1:0:2:0 sdj 8:144 [active][ready]

Each time it fails a path, multipathd correctly re-enables it (since of course you can do IO to the rest of the disk happily). In my view the multipath driver shouldn't retry the IO, just return the IO error up the stack to the application.
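For reference, the configuration the reporter describes (queueing disabled, a bounded number of retries) corresponds to a multipath.conf fragment along these lines. This is an illustrative sketch only, not the reporter's actual file; the vendor/product strings match the DGC,DISK device shown above, and the other values are assumptions:

devices {
    device {
        vendor          "DGC"
        product         "DISK"
        path_checker    emc_clariion
        features        "0"    # no queue_if_no_path: don't queue I/O when all paths are down
        no_path_retry   60     # keep retrying for 60 checker intervals after the last path fails, then error out
    }
}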
Created attachment 361701 [details] use scsi_debug driver to emulate multipath

The problem is confirmed and reproducible with a small code change to the scsi_debug driver. The patch makes multipath think the debug disks on different debug hosts are the same disk. To reproduce, load the modified scsi_debug driver with parameters add_host=2 opts=2 to get two debug disks and enable the medium error option. Configure the devices in multipath.conf with no_path_retry (I used a value of 15) and start multipath. Then read sector 0x1234 on the multipath device; scsi_debug will return a medium error for this sector. The read request will repeatedly bounce between the two scsi_debug devices and never complete.
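The livelock in this reproduction can be illustrated with a toy model. This is plain Python, not the kernel code; the function, path names, and timing are invented for illustration. The point it demonstrates: the driver fails a path on each medium error, but the checker reinstates healthy-looking paths before they are all gone, so the no_path_retry countdown (which only runs while zero paths are active) never starts.

```python
# Toy model of the dm-multipath medium-error livelock (illustrative only).

def run_io(paths, no_path_retry, checker_period, max_steps=1000):
    """Simulate one read of a bad sector. Returns 'EIO' if the error
    reaches the application, or 'hang' if the retry loop never ends."""
    state = {p: "active" for p in paths}
    retries_left = no_path_retry
    for step in range(max_steps):
        active = [p for p in paths if state[p] == "active"]
        if not active:
            # Only with no live paths does no_path_retry count down.
            if retries_left == 0:
                return "EIO"
            retries_left -= 1
        else:
            # The read hits the medium error; multipath fails that path
            # and retries on the next one instead of returning the error.
            state[active[0]] = "failed"
        # Meanwhile the path checker sees the paths as healthy
        # (the rest of the disk is fine) and reinstates them.
        if step % checker_period == checker_period - 1:
            for p in paths:
                state[p] = "active"
    return "hang"

# Checker reinstates paths faster than they can all fail:
print(run_io(["sda", "sdg"], no_path_retry=60, checker_period=2))  # → hang
```

With the checker disabled (a very large period in the model), the paths stay failed, the countdown runs, and the I/O finally errors out; that is the behavior the reporter expected from no_path_retry.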
We have witnessed this also on HP EVA storage arrays with qla2xxx FC HBAs under RHEL 5.3 x86_64. We have implemented the same workaround as the original reporter (disabling queue_if_no_path) for that map.
This request was evaluated by Red Hat Product Management for inclusion in Red Hat Enterprise Linux 5.6, and Red Hat does not plan to fix this issue in the currently developed update. Contact your manager or support representative in case you need to escalate this bug.
(In reply to comment #0)
> Each time it fails a path, multipathd correctly re-enables it (since of course
> you can do IO to the rest of the disk happily). In my view the multipath driver
> shouldn't retry the IO, just return the IO error up the stack to the
> application.

Right, and this is how RHEL 6.2 should behave, because it includes these upstream commits (all in Linux 2.6.39) to immediately propagate the error:

http://git.kernel.org/linus/ad63082
http://git.kernel.org/linus/63583cc
http://git.kernel.org/linus/751b2a7
http://git.kernel.org/linus/7977556

I'll need to scope how difficult it would be to backport those changes to 5.8. It could be that pulling in the SCSI patches listed above depends on other SCSI patches that RHEL5 doesn't have, making for a backport that snowballs and ultimately isn't doable (due to kABI or some such).

Setting Conditional NAK: Design
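Conceptually, the upstream fix teaches the stack to distinguish target-side errors (the device itself rejected the I/O, e.g. a medium error) from path-side errors (the transport to the device failed), and dm-multipath stops retrying the former. A rough sketch of that decision, as a toy Python model rather than the actual kernel code; the error names and return strings here are invented for illustration:

```python
# Toy classification of I/O errors, modeled loosely on the upstream
# target-error distinction; all names are illustrative.

TARGET_ERRORS = {"medium_error", "data_protect", "blank_check"}
PATH_ERRORS = {"transport_timeout", "host_unreachable"}

def multipath_end_io(error):
    """Decide what dm-multipath should do with a failed request."""
    if error in TARGET_ERRORS:
        # The device itself rejected the I/O: every path leads to the
        # same bad sector, so propagate the error to the application.
        return "return EIO to application"
    if error in PATH_ERRORS:
        # The path, not the device, failed: fail the path and retry
        # the request on another one.
        return "fail path and retry on next path"
    return "retry"

print(multipath_end_io("medium_error"))
```

This is exactly the behavior the reporter asked for: a medium error is returned up the stack instead of triggering a path failover.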
Patch(es) available in kernel-2.6.18-290.el5 You can download this test kernel (or newer) from http://people.redhat.com/jwilson/el5 Detailed testing feedback is always welcomed.
Reproduced on RHEL 5.7 GA: multipath tries all paths before returning an I/O error to the application. On kernel -300, multipath returns the I/O error to the application right after the first path gets a "Medium Error".

Commands to set up multipath:
====
modprobe scsi_debug dev_size_mb=100 num_tgts=1 \
    vpd_use_hostno=0 add_host=4 delay=20 max_luns=2 no_lun_0=1 opts=2
====

Command to hit the medium error sector:
====
dd if=/dev/mapper/mpath0 of=/dev/null bs=512 skip=4659 count=1
====

dmesg will show whether multipath tried all paths or only one path. VERIFIED.

Regression tests on other storage for this change will be reported in the errata (so far, no issues found on kernel -300).
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. http://rhn.redhat.com/errata/RHSA-2012-0150.html