Bug 563343 - failed multipath device caching I/O errors after access is restored
Status: CLOSED DUPLICATE of bug 590763
Alias: None
Product: Red Hat Enterprise Linux 5
Classification: Red Hat
Component: device-mapper-multipath
Version: 5.4
Hardware: All
OS: Linux
Priority: urgent
Severity: urgent
Target Milestone: rc
Assignee: LVM and device-mapper development team
QA Contact: Red Hat Kernel QE team
Blocks: 596334
 
Reported: 2010-02-09 22:07 UTC by David Jeffery
Modified: 2014-01-17 12:23 UTC (History)
CC List: 20 users

Doc Type: Bug Fix
Clone Of:
: 596334 (view as bug list)
Last Closed: 2010-05-10 16:09:15 UTC


Attachments
test patch using ioctl(fd, BLKFLSBUF) (1.37 KB, patch)
2010-03-10 14:20 UTC, David Jeffery

Description David Jeffery 2010-02-09 22:07:51 UTC
Description of problem:

The problem occurs when a multipath device is configured to return errors when all paths are down. If an app is accessing the multipath device and receives an I/O error, future accesses to the same area will return an error even after a path recovers and access to the storage device is restored.

The failed page of the storage device is being cached by the kernel, and the restoration of access doesn't remove this cached bad page. As long as at least one reference to the multipath device is kept open, the error will be maintained in the cache until the page is evicted, either by request (e.g. blockdev --flushbufs) or to free memory.


Reproduction:

This can be reproduced by forcing down the paths of a multipath device.  First, the multipath device needs to be configured to return errors when all paths are down; setting no_path_retry=fail in multipath.conf is one way.  Then an extra reference to the multipath device needs to be kept open.  It can be held open either by a subdevice like a partition or by keeping a file handle to the device open, e.g. sleep 10000000 </dev/mpath/testdev.

Next, start reading the device (dd if=/dev/mpath/testdev of=/dev/null bs=4k), and disconnect access to the storage device.  dd will receive an I/O error and exit.  So long as the amount read by dd was less than the amount that can be cached in memory, running the same dd will successfully read the same amount of data and error out in the same spot.

Now restore access to at least one path.  Running the dd command again will still result in an error in the same spot.  Access to the rest of the multipath device works, but the error is still cached.  Only by forcing out the bad page with blockdev --flushbufs /dev/mpath/testdev or filling up the cache with other data (and causing the errored page to be evicted) will the error stop.
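
For illustration, here is a minimal C version of the same read-back check the dd commands perform. The device path is just the example node used above, and the offset is a placeholder for wherever the error occurred:

/*
 * Minimal sketch of the read-back check done with dd above, assuming the
 * example device node /dev/mpath/testdev.  While the errored page is still
 * in the page cache, pread() keeps failing with EIO even after a path has
 * been restored, until the page is flushed or evicted.
 */
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
	char buf[4096];
	off_t bad_offset = 0;	/* placeholder: offset of the page that returned the I/O error */
	int fd = open("/dev/mpath/testdev", O_RDONLY);

	if (fd < 0) {
		perror("open");
		return 1;
	}

	ssize_t n = pread(fd, buf, sizeof(buf), bad_offset);
	if (n < 0)
		printf("read failed: %s (error still served from the cache?)\n",
		       strerror(errno));
	else
		printf("read %zd bytes OK\n", n);

	close(fd);
	return 0;
}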

I've opened this against the userspace multipath tools, though I don't know if the tools or kernel modules should be responsible.

Comment 3 David Jeffery 2010-03-10 14:20:16 UTC
Created attachment 399095 [details]
test patch using ioctl(fd, BLKFLSBUF)

When a path is restored, the test patch uses the BLKFLSBUF ioctl to flush a multipath device that had previously lost all paths.  The intention is that only devices that have been set to return failure instead of queueing I/O should be flushed.
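
For reference, the flush itself boils down to the BLKFLSBUF ioctl against the device node. A simplified sketch (the device path and the standalone main() are only for illustration, not lifted from the attached patch):

/*
 * Simplified sketch: issue BLKFLSBUF against the multipath device node so
 * the kernel drops its cached (errored) pages for that device.  The device
 * path is only an example.
 */
#include <fcntl.h>
#include <linux/fs.h>		/* BLKFLSBUF */
#include <stdio.h>
#include <sys/ioctl.h>
#include <unistd.h>

static int flush_dev(const char *devnode)
{
	int fd = open(devnode, O_RDONLY);

	if (fd < 0) {
		perror("open");
		return -1;
	}
	if (ioctl(fd, BLKFLSBUF, 0) < 0) {
		perror("ioctl(BLKFLSBUF)");
		close(fd);
		return -1;
	}
	close(fd);
	return 0;
}

int main(void)
{
	return flush_dev("/dev/mpath/testdev") ? 1 : 0;
}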

The patch forks off another process to perform the flush asynchronously.  Since multipathd is multithreaded, a thread may have been better.  I can re-spin and retest if that is desired.

Comment 6 Ben Marzinski 2010-03-10 22:58:23 UTC
Doing this in a thread would be better. That should allow you to reuse the already open file descriptor. Since the root filesystem device itself may be multipathed, and currently down, multipath needs to be able to run without touching device-backed filesystems as much as possible.  It's not such a big deal on RHEL5, since we create our own ramfs that includes the dev directory, but we don't do that in RHEL 6.  You probably want to dup the fd for this, because if you're holding onto the fd that the path has, you need to hold vecs->lock for the entire thread to make sure that it doesn't get closed out from under you.
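
Roughly the shape I have in mind, with BLKFLSBUF moved into a detached thread. The flush_thread()/start_flush() names and the example caller are made up for illustration; none of multipathd's real path structures are shown:

/*
 * Rough sketch only: dup() the path's already-open fd and hand the duplicate
 * to a detached thread that issues BLKFLSBUF, so vecs->lock does not have to
 * be held for the lifetime of the flush.  The duplicate stays valid even if
 * the original fd gets closed out from under us.
 */
#include <fcntl.h>
#include <linux/fs.h>		/* BLKFLSBUF */
#include <pthread.h>
#include <stdint.h>
#include <sys/ioctl.h>
#include <unistd.h>

static void *flush_thread(void *arg)
{
	int fd = (int)(intptr_t)arg;

	ioctl(fd, BLKFLSBUF, 0);
	close(fd);
	return NULL;
}

/* Call with the path's open fd (e.g. while vecs->lock is held); returns quickly. */
static int start_flush(int path_fd)
{
	pthread_t tid;
	int dup_fd = dup(path_fd);

	if (dup_fd < 0)
		return -1;
	if (pthread_create(&tid, NULL, flush_thread, (void *)(intptr_t)dup_fd)) {
		close(dup_fd);
		return -1;
	}
	pthread_detach(tid);
	return 0;
}

int main(void)
{
	/* Example caller only: in multipathd this fd would come from the path. */
	int fd = open("/dev/mpath/testdev", O_RDONLY);

	if (fd < 0)
		return 1;
	start_flush(fd);
	sleep(1);	/* demo only: give the detached thread time to finish */
	close(fd);
	return 0;
}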

Comment 7 Ben Marzinski 2010-03-10 23:15:47 UTC
Also, I'm not sure why this is a big issue.  If you run your reproducer on a regular scsi device, the exact same thing will happen.  It's not that multipath is doing something different than a regular device. It is working just like one.

The point of having the no_path_retry timeout is to allow multipath to wait for a reasonable time before giving up and returning an error.  Customers are supposed to set this value longer than any temporary failure should last.  The idea is that if the no_path_retry timeout expires, you should be able to assume that this was not a transient error. Obviously there is always the possibility that a cord will get unplugged and go unnoticed for a long time, and that the processes using the device will be able to handle the IO errors gracefully.  But this is not often the case.

If customers really want to avoid this, they can have multipath wait forever; then the processes using it won't ever get the errors, and they will be much more likely to survive an extended all-paths-down case.

Comment 8 Bryn M. Reeves 2010-03-16 16:25:53 UTC
I think the proposed fix is racy. The BLKFLSBUF is executed from user space after the path has been reinstated (we couldn't do it before, since we need to have usable paths to the storage). Since BLKFLSBUF may take some time to complete and invalidate all the cached pages for the device, there is a race window where device-mapper reports the map status as good but accesses may still result in an I/O error, because BLKFLSBUF has not yet gotten to all the errored pages.

I think this needs to happen in the kernel.

Comment 9 Ben Marzinski 2010-03-16 22:39:35 UTC
It may be racy, but I'm not sure that it matters too much in this case. It's not like doing this in the kernel will stop applications from getting IO errors.  This problem will only ever happen if the device has already returned IO errors.  In fact, assuming that the user has set no_path_retry to something sensible, as soon as that timeout expires there are going to be minutes' worth of IO errors coming back from the device.  Then there will be all the IO errors that happen between when the timeout expires and when a path comes back online.  Finally, there will be the errors that happen before BLKFLSBUF returns, but these will be dwarfed by the number of IO errors that came before.

Also, aside from a log message that gets reported asynchronously anyway, multipath doesn't do any real notification when a path comes back online, so it's not likely that someone will see the path come back and instantly expect the errors to stop.

Besides, like I said before, what usually happens is that as soon as the no_path_retry timeout expires, all the applications that were actively using the multipath device crash, and filesystems on it may go read-only, panic the machine, or whatever.  Unless they are very sure of everything on top of the device, nobody should assume that they will be able to gracefully recover once no_path_retry times out.

All that being said, I'm not adamantly opposed to this being fixed in the kernel, I just don't think that it being racy is a big problem.

Comment 10 Mike Christie 2010-03-19 03:49:02 UTC
(In reply to comment #7)
> Also, I'm not sure why this is a big issue.  If you run your reproducer on a
> regular scsi device, the exact same thing will happen.  It's not that multipath
> is doing something different than a regular device. It is working just like
> one.

I do not think this should be a dm-multipath-only problem and solution. As Ben stated above, this problem occurs with other devices. You could set the FC fast_io_fail_tmo shorter than dev_loss_tmo, and if the path comes back after the fast fail fires and before dev loss, we hit the same problem. In RHEL 5/4, if the dev loss tmo fires and fc_remove_on_dev_loss is not set (it is not set by default), then we hit this problem. For iSCSI, if the replacement timeout fires, we can hit the problem.

I think we should take this to LKML or at least dm-devel and linux-scsi.

Comment 12 Ben Marzinski 2010-05-06 20:55:14 UTC
A patch for this was actually submitted to LKML a while ago:

http://lkml.org/lkml/2009/1/23/288

in relation to bug #481371.  However, it doesn't seem to have gone anywhere. Perhaps cloning that bug for RHEL5 is the way to go, since this issue really has nothing to do with device-mapper-multipath.

Comment 15 Jeremy West 2010-05-10 16:09:15 UTC

*** This bug has been marked as a duplicate of bug 590763 ***

