Bug 645633

| Field | Value |
| --- | --- |
| Summary | temporary loss of path to SAN results in persistent EIO with msync |
| Product | Red Hat Enterprise Linux 4 |
| Component | kernel |
| Version | 4.6 |
| Status | CLOSED ERRATA |
| Severity | medium |
| Priority | high |
| Reporter | Lachlan McIlroy <lmcilroy> |
| Assignee | Rik van Riel <riel> |
| QA Contact | Gris Ge <fge> |
| CC | aviro, bdonahue, bmr, esandeen, fge, fhirtz, jmoyer, jweiner, riel, rwheeler, vgaikwad |
| Target Milestone | rc |
| Hardware | All |
| OS | Linux |
| Doc Type | Bug Fix |
| Clones | 652369, 652371, 670362 (view as bug list) |
| Bug Blocks | 652369, 652371, 670362 |
| Last Closed | 2011-02-16 15:55:00 UTC |
I think this problem is related to the error handling in this function:

```c
static int mpage_end_io_write(struct bio *bio, unsigned int bytes_done, int err)
{
	const int uptodate = test_bit(BIO_UPTODATE, &bio->bi_flags);
	struct bio_vec *bvec = bio->bi_io_vec + bio->bi_vcnt - 1;

	if (bio->bi_size)
		return 1;

	do {
		struct page *page = bvec->bv_page;

		if (--bvec >= bio->bi_io_vec)
			prefetchw(&bvec->bv_page->flags);

		if (!uptodate)
			SetPageError(page);
		end_page_writeback(page);
	} while (bvec >= bio->bi_io_vec);
	bio_put(bio);
	return 0;
}
```

If the bio fails then uptodate will be false and the page is marked in error. I'm not sure that makes sense - it was the bio that failed and the page contents are still valid. (If it was a read it would make sense to mark the page in error.)

The msync() will be waiting in this function for the pages to be written to disk:

```c
static int wait_on_page_writeback_range(struct address_space *mapping,
					pgoff_t start, pgoff_t end)
{
	struct pagevec pvec;
	int nr_pages;
	int ret = 0;
	pgoff_t index;

	if (end < start)
		return 0;

	pagevec_init(&pvec, 0);
	index = start;
	while ((index <= end) &&
			(nr_pages = pagevec_lookup_tag(&pvec, mapping, &index,
			PAGECACHE_TAG_WRITEBACK,
			min(end - index, (pgoff_t)PAGEVEC_SIZE-1) + 1)) != 0) {
		unsigned i;

		for (i = 0; i < nr_pages; i++) {
			struct page *page = pvec.pages[i];

			/* until radix tree lookup accepts end_index */
			if (page->index > end)
				continue;

			wait_on_page_writeback(page);
			if (PageError(page))
				ret = -EIO;
		}
		pagevec_release(&pvec);
		cond_resched();
	}

	/* Check for outstanding write errors */
	if (test_and_clear_bit(AS_ENOSPC, &mapping->flags))
		ret = -ENOSPC;
	if (test_and_clear_bit(AS_EIO, &mapping->flags))
		ret = -EIO;

	return ret;
}
```

It will detect that the page had an error and fail the msync(). I don't think the error is ever removed from the page, and that's why it's persistent. This seems a bit like the write side of bug 481371 and bug 590763.
I tend to agree with Lachlan that the SetPageError in mpage_end_io_write is wrong; as comment #2 states, uptodate here is the state of the bio, not the page, and since we are in a write path there is no reason to set an error state on the page (nothing is wrong with it; if the device came back to life and we resubmitted the I/O there is no reason why it should not succeed, but by setting PG_error we're preventing that from happening).

Looking at this code upstream, it's changed a little:

```c
	if (!uptodate) {
		SetPageError(page);
		if (page->mapping)
			set_bit(AS_EIO, &page->mapping->flags);
	}
```

The code to set AS_EIO in the mapping flags got added here:

    commit 854715be73b221596c7127d4042e1120d4539e19
    Author: Qu Fuping <fs.ac.cn>
    Date:   Sat Jun 4 15:43:29 2005 -0700

        [PATCH] mpage_end_io_write() I/O error handling fix

        When fsync() runs wait_on_page_writeback_range() it only inspects pages
        which are actually under I/O (PAGECACHE_TAG_WRITEBACK). If a page
        completed I/O prior to wait_on_page_writeback_range() looking at it, it
        is supposed to have recorded its I/O error state in the address_space.
        But mpage_mpage_end_io_write() forgot to set the address_space error
        flag in this case.

        Signed-off-by: Qu Fuping <fs.ac.cn>
        Signed-off-by: Andrew Morton <akpm>
        Signed-off-by: Linus Torvalds <torvalds>

Here's the thread: http://kerneltrap.org/mailarchive/linux-fsdevel/2005/6/2/309344

The patch was added to fix a problem where fsync() returns 0 (via wait_on_page_writeback_range()) when an I/O error has occurred during writeback of a mapping for the file descriptor. The patch correctly marks the mapping flags with AS_EIO and this is reported back to the caller of fsync.

I'm wondering, though, if the patch shouldn't have also deleted the SetPageError(page); maybe there is something I am misunderstanding in the way the page flags should be used, but I don't understand why this should be set here, and removing it seems like it would fix this bug.
The mpage_end_io_write code seems to go back a long way (pre-git) but wasn't present in 2.4.

Clearing PageError should be done before the next time we write the page. Having an old PageError bit on the page before a next write is not going to impede page cache reads from the page, which do not even check the PageError bit if the PageUptodate bit is set. Having said that, I do not know whether PageError is actually used for anything in the write path, or whether the mapping bit is enough to return -EIO to the user. Maybe it would be easier to just never set PageError in the write path...

Setting the error on the address space mapping is definitely needed here since wait_on_page_writeback_range() uses that to detect a failed I/O. I've had to add that fix to RHEL4 (along with removing the SetPageError()) to get the test case to pass. The error on the address space mapping is removed when checked, allowing the msync() to try again. The only problem I see with relying on the address space mapping flag is that if it's possible to have concurrent operations then the flag may be cleared in the wrong context. Ideally we should be propagating the I/O error through the call stack instead of relying on the AS_EIO flag.

Fixing this bug may not just be a case of not setting the error on the page. Since the dirty bit has been cleared before issuing the I/O, it needs to be set again so the page will eventually be written out. If we don't set the dirty bit, the page will be considered clean and may be purged from the cache without being written to disk, and data is lost. I've been able to prove this by dropping caches after offlining the device. Just setting the dirty bit on the page is not enough either - we need to set the dirty tag in the radix tree for the address space mapping so another msync() will see the pages as dirty. The potential problem with re-dirtying the page is that if the device never comes back online we'll end up with dirty pages that are stuck in the cache.
Avoiding this problem usually takes priority, with the argument that any application doing asynchronous writes cannot expect its data to be safe until it syncs it to disk; if the sync fails, the application needs to re-do the writes (or quit). That may apply here too with the mmap writes.

Created attachment 458576 [details]
Patch to remove setting the page error flag

I think the best approach to fix this problem is to remove setting the error flag on the page and also set the AS_EIO flag on the address space mapping - as is done on RHEL5 - so that msync() gets an error. Trying to handle the re-dirtying of the page is not really part of this issue since all we are trying to do here is remove the persistent error. If msync() fails then it's up to the application to ensure that any writes done since the last successful msync() are re-issued. With this patch the reproducer continues successfully after the SAN path is restored.
I have considered that approach, but there seems to be a problem with it.

What if two programs simultaneously msync a different range of the file, and one of the msync calls hits an IO error? It is entirely possible that the msync call that did NOT hit the IO error will end up clearing the AS_EIO flag on the mapping (and returning an error to userspace), while the msync call that did hit the IO error will get delayed by the error, find the AS_EIO flag cleared, and not return an error to userspace.

I agree that the dirty bit is a separate issue and I should probably just discuss that upstream first without even proposing a patch.

That leaves PageError - we should probably clear that from the function where we are waiting on the page IO to complete. I'll whip up a patch that does that - one for you to test on RHEL 4 and another one for upstream discussion. After that, patches for RHEL 6 & RHEL 5, which both seem to have the same behaviour.

Created attachment 458901 [details]
clear PG_error in msync/fsync path
A minimal patch to clear the PG_error bit in the msync/fsync path. This should make the test case work right again.
I will discuss larger changes upstream first, since it is unclear what the correct behaviour should be.
(In reply to comment #7)
> I have considered that approach, but there seems to be a problem with it.
>
> What if two programs simultaneously msync a different range of the file, and
> one of the msync calls hits an IO error. It is entirely possible that the
> msync call that did NOT hit the IO error will end up clearing the AS_EIO flag
> on the mapping (and returning an error to userspace), while the msync call that
> did hit the IO error will get delayed by the error, find the AS_EIO flag
> cleared and not return an error to userspace.

I considered this case too but it can't happen because all msync()/fsync() operations are serialised on the mapping->host->i_sem semaphore.

(In reply to comment #7)
> That leaves PageError - we should probably clear that from the function where
> we are waiting on the page IO to complete. I'll whip up a patch that does that
> - one for you to test on RHEL 4 and another one for upstream discussion. After
> that, patches for RHEL 6 & RHEL 5, which both seem to have the same behaviour.

I don't like setting the error on the page - there's nothing wrong with the page and it just seems like the wrong place to put the error. The same argument applies to setting AS_EIO on the mapping space too, so that's not much better. Anyway, your patch is less invasive and probably less risky too.

Lachlan, just to confirm, does my patch fix the test case issue? I don't see how it couldn't, but it would be good to get confirmation :) I have no SAN to test with here... I also just submitted an equivalent patch upstream and am waiting for comments there. I'll take care of cloning the bug for RHEL 5 & RHEL 6 to get the equivalent patch integrated there.

Rik, I meant to test this yesterday but ran out of time.
You don't need a SAN to run the test case - I do this to simulate the 'hiccup':

```
# cd /sys/class/scsi_host/host0/device/target0:0:0/0:0:0:0/
# echo offline > state ; sleep 1 ; echo running > state
```

In the following test I offlined the device three times and msync() eventually recovered each time.

```
# ~/mmap_test_soc_failafter testfile
fd = 3
mmap addr = 0x2a95576000
msync with pattern = 0
msync with pattern = 1
msync with pattern = 2
msync error : Input/output error
msync with pattern = 3
msync error : Input/output error
msync with pattern = 4
msync error : Input/output error
msync with pattern = 5
msync with pattern = 6
msync with pattern = 7
msync with pattern = 8
msync with pattern = 9
msync with pattern = 10
msync error : Input/output error
msync with pattern = 11
msync error : Input/output error
msync with pattern = 12
msync error : Input/output error
msync with pattern = 13
msync with pattern = 14
msync with pattern = 15
msync with pattern = 16
msync with pattern = 17
msync with pattern = 18
msync error : Input/output error
msync with pattern = 19
msync error : Input/output error
msync with pattern = 20
msync error : Input/output error
msync with pattern = 21
msync with pattern = 22
msync with pattern = 23
msync with pattern = 24
msync with pattern = 25
```

Note that this patch still causes the page to cache the error. If the application quits on receiving an error it may still see errors when it is restarted - well after the device is restored.
```
# ~/mmap_test_soc_failafter testfile
fd = 3
mmap addr = 0x2a95576000
msync with pattern = 0
msync with pattern = 1
msync with pattern = 2
msync with pattern = 3
msync with pattern = 4
msync error : Input/output error
msync with pattern = 5
^C
# sleep 5
# ~/mmap_test_soc_failafter testfile
fd = 3
mmap addr = 0x2a95576000
msync with pattern = 0
msync error : Input/output error
msync with pattern = 1
msync with pattern = 2
msync with pattern = 3
msync with pattern = 4
```

This behaviour does not occur with the patch I proposed because the address space mapping is torn down and recreated, which removes the AS_EIO state.

Are you talking about the same patch you uploaded earlier? All that patch does is set the AS_EIO state in mapping->flags, it does not do any "address space mapping tear down"...

As for setting the error bit in the mapping vs clearing PageError on fsync/msync, that should not make a difference at all. After my patch the code simply clears PageError along with AS_EIO on msync/fsync.

The page cache caching the error is an intended effect - POSIX requires that userspace be informed of an error that happened. Your patch also caches the error, just in another place.

(In reply to comment #13)
> Are you talking about the same patch you uploaded earlier? All that patch does
> is set the AS_EIO state in mapping->flags, it does not do any "address space
> mapping tear down"...

Yes, I'm talking about the patch I attached to this BZ earlier. The address space mapping is torn down and recreated when the application is restarted. The application should not be getting any residual errors when it is restarted if the paths to the SAN have already been restored.

> As for setting the error bit in the mapping vs clearing PageError on
> fsync/msync, that should not make a difference at all. After my patch the code
> simply clears PageError along with AS_EIO on msync/fsync.

If the test program keeps running there is no obvious difference.
But the actual difference is that AS_EIO is cached in one place (and always removed when the address space is synced or unmapped) while the page error is cached on every page that is involved in a failed I/O (and not removed when the address space is unmapped, and not necessarily removed when the address space is synced).

The function wait_on_page_writeback_range() only checks pages that have the writeback flag set. If an I/O fails and completes before wait_on_page_writeback_range() is called then the writeback flag will have already been removed, and wait_on_page_writeback_range() won't find the page and remove its error flag. As a result the error flag on some pages can remain cached and could potentially remain there indefinitely. Some time after the path to the SAN is restored, a msync() will trip over the error flag and fail even though there was no real error at that time.

> The page cache caching the error is an intended effect - POSIX requires that
> userspace be informed of an error that happened. Your patch also caches the
> error, just in another place.

Yes, that's correct, but my approach ensures that the error is not cached any longer than necessary. If you want to leave the page error in place it should probably be cleared earlier in the sync path where every page is examined, maybe in mpage_writepages() - it only looks at dirty pages but that should be enough.

The address_space struct is part of the page cache and is only torn down:

1) on filesystem unmount
2) when the VM evicts it due to memory pressure (can only happen if no application has it open)
3) on a drop_caches event (when the file is not open by any process)

The address_space struct does NOT go away just because a process exits. Munmap also does not imply msync, so the AS_EIO bit is not automatically cleared on exit.
As for the function wait_on_page_writeback_range(), it calls wait_on_page_writeback() (which checks for PageWriteback), but it will look at (and, with my patch, clear) the PageError bit for any page, not just Writeback ones.

> As for the function wait_on_page_writeback_range(), it calls
> wait_on_page_writeback() (which checks for PageWriteback), but it will look at
> (and, with my patch, clear) the PageError bit for any page, not just Writeback
> ones.

Are you sure? wait_on_page_writeback_range() uses PAGECACHE_TAG_WRITEBACK - doesn't that mean it only finds pages that are in writeback state?

```c
	while ((index <= end) &&
			(nr_pages = pagevec_lookup_tag(&pvec, mapping, &index,
			PAGECACHE_TAG_WRITEBACK,
			min(end - index, (pgoff_t)PAGEVEC_SIZE-1) + 1)) != 0) {
```

Otherwise how do you explain why the test program still gets an I/O error after it has been restarted (and that's well after the path has been restored)? This residual error does not seem to happen with the AS_EIO patch.
You are right, I had overlooked that part of the code.

Now it looks like either approach breaks POSIX, which requires us to notify userspace of IO errors.

Storing the error in just the address space can result in us not returning the IO error to the right msync call when there are multiple msync calls on different ranges in the file.

Storing the error in the page can result in us not returning the IO error to user space immediately - only the next time that particular page is written.

(In reply to comment #17)
> You are right, I had overlooked that part of the code.
>
> Now it looks like either approach breaks POSIX, which requires us to notify
> userspace of IO errors.

If set, the AS_EIO state will always be seen in wait_on_page_writeback_range() and returned to the caller. Although now that I think about it, if we never do a msync() the AS_EIO state may be left there for another process to find, which doesn't make it any better than the page error.

> Storing the error in just the address space can result in us not returning the
> IO error to the right msync call when there are multiple msync calls on
> different ranges in the file.

Multiple msync()'s (and fsync()'s) are serialized on the inode's i_sem, and each call to filemap_fdatawait() covers the entire file, so subranges are not possible. I think this is enough to prevent the AS_EIO from being picked up by the wrong msync() (for the case of concurrent msyncs).

> Storing the error in the page can result in us not returning the IO error to
> user space immediately - only the next time that particular page is written.

Yes, it looks that way. To be honest I'm not sold on either way of returning the error state. I'd prefer a mechanism that transfers the error state directly from the bio to the waiting thread, but since the bio is not available in the waiting thread's context that is not possible.

You are right, RHEL4 does indeed not do msync on subranges of the file.
As for AS_EIO, I believe POSIX specifies that it should be returned to all processes calling fsync or msync on the file after an IO error. This is probably not achievable in RHEL4, RHEL5 or RHEL6 though, so we may have to go with your AS_EIO hack for RHEL... For upstream we have a different idea, which I'll take a stab at.

(In reply to comment #19)
> As for AS_EIO, I believe POSIX specifies that it should be returned to all
> processes calling fsync or msync on the file after an IO error.

Every process? If the path to the SAN stays down then this will happen. Are we supposed to return an error to all processes at least once even after the path comes back? That would be tricky.

> This is probably not achievable in RHEL4, RHEL5 or RHEL6 though, so we may
> have to go with your AS_EIO hack for RHEL...
>
> For upstream we have a different idea, which I'll take a stab at.

Great, thanks.

I tried combining the two patches (leaving the SetPageError() in mpage_end_io_write()) and also added a ClearPageError() in mpage_writepages(), and that fixed the problem with the lingering error state on the page. After restarting the test program it never found any residual errors left over from the last run. I'm not sure if that helps in any way. We still need the AS_EIO state to guarantee msync() gets an error, so it's probably easiest to just go with that.

I suspect your original patch may be the way to go after all, because there is no guarantee the page will still have the writeback bit set in the radix tree, causing potential lingering errors. I'll get your patch submitted for RHEL4, RHEL5 & RHEL6 and will implement Johannes' idea for upstream.

Created attachment 459797 [details]
set AS_EIO instead of PG_error - still clear PG_error
Here is a combined patch:
After an IO error, we should report an error to the next process
that wants to make sure the data is on disk, returning EIO from
fsync or msync. However, wait_on_page_writeback_range only iterates
over pages that have the WRITEBACK radix tree bit set.
If the page had an IO error, we may skip over it and not report
the IO error to userspace. To avoid that, we set the AS_EIO
bit in page->mapping from mpage_end_io_write.
Also clear the PageError bit in wait_on_page_writeback_range, which
is called from fsync and msync, in case other IO end handlers set
PageError. Not clearing it has the potential of reporting EIO
over and over again after one temporary IO error (eg. temporarily
missing path to a SAN).
Signed-off-by: Rik van Riel <riel>
Reported-by: Lachlan McIlroy <lmcilroy>
This request was evaluated by Red Hat Product Management for inclusion in a Red Hat Enterprise Linux maintenance release. Product Management has requested further review of this request by Red Hat Engineering, for potential inclusion in a Red Hat Enterprise Linux Update release for currently deployed products. This request is not yet committed for inclusion in an Update release.

Created attachment 463564 [details]
Updated test case
Updated test case noted in item 3) of the previous post
Created attachment 463567 [details]
Data collected from a problem reproduction
Created attachment 463568 [details]
Data collected from a problem reproduction
Created attachment 463569 [details]
Data collected from a problem reproduction
Created attachment 463571 [details]
Zipped set of the test and the data for ease of retrieval
Committed in 93.EL. RPMS are available at http://people.redhat.com/vgoyal/rhel4/

Reproduced this issue on RHEL 4.8 GA kernel-2.6.9-89.EL and got the same error. kernel-2.6.9-95.EL fixed this issue. I used switching a port up/down to simulate the link plug/unplug. Both ext3 and ext2 were checked. Resuming the link does not cause filesystem corruption.

Vivek Goyal, did you contact Frank Hirtz about the comment he left? I cannot reproduce his filesystem corruption problem.

Gris, I've managed to reproduce the filesystem corruption. It's a separate issue to this bug, in a different area of code, but reproducible using the same testing technique. BZ664376 was opened for it. Lachlan

The filesystem corruption issue is tracked in Bug #664376. Setting this bug as Verified.

An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2011-0263.html
Created attachment 454988 [details]
Reproducer test case

Description of problem:

Customer reported a bug with msync() where, if the path to the SAN is lost, msync() returns EIO (this is expected), but even after the path to the SAN is restored the msync() for the same file mapping continues to return EIO, even though mappings for other sections of the same file or other files in the same filesystem work okay. If the filesystem is remounted the problem clears up.

Using the attached test program:

```
# ~/mmap_test_soc_failafter testfile
fd = 3
mmap addr = 0x2a95576000
msync with pattern = 0
msync with pattern = 1
msync with pattern = 2
msync with pattern = 3
msync with pattern = 4
msync with pattern = 5
msync error : Input/output error
msync with pattern = 6
msync error : Input/output error
msync with pattern = 7
msync error : Input/output error
msync with pattern = 8
msync error : Input/output error
...
```

The path to the SAN was unplugged during pattern 5 and reconnected after 1 second. Even though the path is restored, an msync on the same section of file still fails:

```
# ~/mmap_test_soc_failafter testfile
fd = 3
mmap addr = 0x2a95576000
msync with pattern = 0
msync error : Input/output error
msync with pattern = 1
msync error : Input/output error
msync with pattern = 2
msync error : Input/output error
msync with pattern = 3
msync error : Input/output error
...
```

Trying a different file works:

```
# ~/mmap_test_soc_failafter testfile1
fd = 3
mmap addr = 0x2aea749e5000
msync with pattern = 0
msync with pattern = 1
msync with pattern = 2
msync with pattern = 3
msync with pattern = 4
...
```

Back to the original file and it continues to fail:

```
# ~/mmap_test_soc_failafter testfile
fd = 3
mmap addr = 0x2afcc32ff000
msync with pattern = 0
msync error : Input/output error
msync with pattern = 1
msync error : Input/output error
msync with pattern = 2
msync error : Input/output error
msync with pattern = 3
msync error : Input/output error
...
```

So it appears the error is being cached.
The customer uses msync() heavily and is reporting data corruption that may be related to this bug. The customer was experiencing this problem with ext3, but when trying to reproduce it they frequently caused the filesystem to go read-only, so the steps to reproduce use ext2.

Version-Release number of selected component (if applicable):
Reported against RHEL 4.6 but reproduced on kernel-2.6.9-89.EL and kernel-2.6.18-228.el5

How reproducible:
Every time, using the test case.

Steps to Reproduce:
1. Create and mount an ext2 filesystem
2. Compile and run the reproducer program

```
# cd /mnt/ext2
# touch testfile
# ./mmap_test_soc_failafter testfile
fd = 3
mmap addr = 0x7fb5cd150000
msync with pattern = 0
msync with pattern = 1
msync with pattern = 2
msync with pattern = 3
...
```

3. Unplug and reconnect cable to SAN, or simulate with:

```
# cd /sys/class/scsi_host/host0/device/target0:0:0/0:0:0:0/
# echo offline > state ; sleep 1 ; echo running > state
```

Actual results:
msync() system call fails with EIO and continues to fail with EIO even when the path to the SAN is restored. Other sections of the same test file and other files in the same filesystem can be msync'd without error.

Expected results:
msync() fails with EIO upon loss of path and continues to fail until the path is restored; then msync() should succeed.