Bug 435291
Summary: | LTC41974-Pages of a memory mapped NFS file get corrupted. | | |
---|---|---|---|
Product: | Red Hat Enterprise Linux 5 | Reporter: | Jeff Layton <jlayton> |
Component: | kernel | Assignee: | Peter Staubach <staubach> |
Status: | CLOSED ERRATA | QA Contact: | Martin Jenner <mjenner> |
Severity: | urgent | Docs Contact: | |
Priority: | urgent | | |
Version: | 5.1 | CC: | andriusb, bugproxy, clasohm, cmpilato, cward, dhoward, epollard, jjarvis, jkachuck, jpirko, k.georgiou, lwoodman, pcfe, sglass, steved, tao |
Target Milestone: | rc | Keywords: | OtherQA, ZStream |
Target Release: | --- | | |
Hardware: | All | | |
OS: | Linux | | |
Whiteboard: | | | |
Fixed In Version: | | Doc Type: | Bug Fix |
Doc Text: | | Story Points: | --- |
Clone Of: | | Environment: | |
Last Closed: | 2009-01-20 19:40:50 UTC | Type: | --- |
Regression: | --- | Mount Type: | --- |
Documentation: | --- | CRM: | |
Verified Versions: | | Category: | --- |
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
Cloudforms Team: | --- | Target Upstream Version: | |
Embargoed: | | | |
Bug Depends On: | | | |
Bug Blocks: | 391501, 448732, 450335 | | |
Attachments: | | | |
Description
Jeff Layton
2008-02-28 15:11:56 UTC
I've decided to continue working on this in RHEL5 for now. It seems to be easier to reproduce this problem on RHEL5, and its implementation of invalidate_inode_pages2 is more up to date. I backported a couple of NFS patches that seemed like they might help this there (presuming that the race is the same as the one we see in RHEL4). I'll attach them in a bit -- they seem to help some. The test would usually fail on RHEL5 very quickly, but this patch seems to make it take a bit longer. I think I'm going to have to roll a rhel5 version of this debug patch to verify that the problem is the same there...

Created attachment 296231 [details]
proposed patch 1
Subject: [PATCH] NFS: Deal with failure of invalidate_inode_pages2()
If invalidate_inode_pages2() fails, then it should in principle just be
because the current process was signalled. In that case, we just want to
ensure that the inode's page cache remains marked as invalid.
Also add a helper to allow the O_DIRECT code to simply mark the page cache as
invalid once it is finished writing, instead of calling
invalidate_inode_pages2() itself.
Signed-off-by: Trond Myklebust <Trond.Myklebust>
Signed-off-by: Andrew Morton <akpm>
Signed-off-by: Linus Torvalds <torvalds>
...
Proposed patch #1. I don't agree with Trond's statement above though. I think
I've seen invalidate_inode_pages2 return an error even when the process wasn't
signaled. Though maybe there were differences between RHEL5's iip2() and the
upstream one at the time that this patch went in.
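The "helper to mark the page cache as invalid" mentioned in that patch description isn't spelled out here. Assuming it does nothing more than set NFS_INO_INVALID_DATA under the inode spinlock (the same flag and lock that appear in the nfs_revalidate_mapping patch further down in this bug), a minimal sketch might look like the following; the function name is hypothetical:

-------[snip]---------
/* Hypothetical helper (sketch only): flag the inode's page cache as
 * invalid so the next nfs_revalidate_mapping() call flushes and
 * re-reads it, instead of having the caller run
 * invalidate_inode_pages2() itself. */
static void nfs_mark_data_invalid(struct inode *inode)
{
	struct nfs_inode *nfsi = NFS_I(inode);

	spin_lock(&inode->i_lock);
	nfsi->cache_validity |= NFS_INO_INVALID_DATA;
	spin_unlock(&inode->i_lock);
}
-------[snip]---------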
Created attachment 296232 [details]
proposed patch 2
Second patch backported from upstream to deal with races in this codepath.
...might need steved and dhowells to look over it and make sure I've gotten the
fscache stuff right...
These two patches seem to help things on RHEL5 -- the reproducer lasts quite a bit longer with them in place. They need more careful review though; I rather hastily backported them...

There still seems to be a lingering race somewhere, though. It looks like the page is getting dirtied while we're inside nfs_sync_mapping(), but it's returning 0.

Ok, so the page does seem to be getting dirtied while in nfs_sync_mapping(). So we enter invalidate_inode_pages2() with a dirty page in the mapping. It then unmaps the page, and does this near the bottom of the loop in invalidate_inode_pages2_range:

	was_dirty = test_clear_page_dirty(page);
	if (!invalidate_complete_page2(mapping, page)) {
		if (was_dirty)
			set_page_dirty(page);
		ret = -EIO;
	}
	unlock_page(page);

...which seems goofy to me. test_clear_page_dirty(page) is going to clear the dirty flag. We'll then invalidate the page (since it's no longer marked dirty) in invalidate_complete_page2(). Only if that fails for some other reason will we reset the dirty flag and return an error. That seems wrong to me (at least in this situation). If the page is dirty once we get to that point, we shouldn't just be discarding it like this...

Did a similar debug patch to the one I had for rhel4 and instrumented before and after nfs_sync_mapping() and before invalidate_inode_pages2():

NFS: dump_page page=ffff81000184fa60 flags=18080000000021 contents=0 0 0 0 0 0 0 0 0 0
NFS: after nfs_sync_mapping (ret=0):
NFS: dump_page page=ffff81000184fa60 flags=1808000000087c contents=0 0 0 0 0 0 14101 0 0 0
NFS: before invalidate_inode_pages2:
NFS: dump_page page=ffff81000184fa60 flags=1808000000087c contents=0 0 0 0 0 0 14101 0 0 0
NFS: after invalidate_inode_pages2 (ret=0)

The problem was detected by pid 14101 -- it was expecting to see its pid but instead saw a 0 in its slot.

I think I see some of the patchset that went in to fix this final race. This is the main one, I think:

From e3db7691e9f3dff3289f64e3d98583e28afe03db Mon Sep 17 00:00:00 2001
From: Trond Myklebust <Trond.Myklebust>
Date: Wed, 10 Jan 2007 23:15:39 -0800
Subject: [PATCH] [PATCH] NFS: Fix race in nfs_release_page()

NFS: Fix race in nfs_release_page()

invalidate_inode_pages2() may find the dirty bit has been set on a page owing to the fact that the page may still be mapped after it was locked. Only after the call to unmap_mapping_range() are we sure that the page can no longer be dirtied.

In order to fix this, NFS has hooked the releasepage() method and tries to write the page out between the call to unmap_mapping_range() and the call to remove_mapping(). This, however leads to deadlocks in the page reclaim code, where the page may be locked without holding a reference to the inode or dentry.

Fix is to add a new address_space_operation, launder_page(), which will attempt to write out a dirty page without releasing the page lock.

Signed-off-by: Trond Myklebust <Trond.Myklebust>

Also, the bare SetPageDirty() can skew all sort of accounting leading to other nasties.

[akpm: cleanup]
Signed-off-by: Peter Zijlstra <a.p.zijlstra>
Cc: Trond Myklebust <Trond.Myklebust>
Signed-off-by: Andrew Morton <akpm>
Signed-off-by: Linus Torvalds <torvalds>

...the problem here, of course, is that adding a new address_space_op is probably out of the question since that will probably blow kabi out of the water (though maybe we could do it by adding a new address_space flag to signify those that actually have the field).
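For context, the launder_page() idea in that commit message amounts to giving address_space_operations a hook that invalidate_inode_pages2_range() can call on a dirty, still-locked page, and having NFS point that hook at its per-page writeback helper. A rough sketch of the upstream shape (not the RHEL5 backport; exact signatures can differ between kernel versions):

-------[snip]---------
/* New address_space operation (sketch): write out a dirty page
 * without dropping the page lock, so invalidate_inode_pages2_range()
 * can launder the page before invalidating it. */
int (*launder_page)(struct page *page);

/* The NFS side is then little more than a call into its per-page
 * writeback helper (upstream behavior; the RHEL5 nfs_wb_page() is
 * noted later in this bug as behaving differently): */
static int nfs_launder_page(struct page *page)
{
	return nfs_wb_page(page->mapping->host, page);
}
-------[snip]---------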
Still, I may go ahead and just backport this patch and a couple of others as a proof-of-concept and see if I can eliminate the race with it. We can then look at how best to fix up kabi afterward...

Backported 2 more patches to this set to attempt to fix this:

Author: Trond Myklebust <Trond.Myklebust>
Date:   Tue Dec 5 00:35:40 2006 -0500
    BZ#435291: NFS: Add nfs_set_page_dirty()
    Upstream commit 1a54533ec8d92a5edae97ec6ae10023ee71c4b46

...and...

Author: Trond Myklebust <Trond.Myklebust>
Date:   Tue Dec 5 00:35:42 2006 -0500
    BZ#435291: NFS: Fix nfs_release_page
    Upstream commit fa8d8c5b77a2dc467b5365a5651710161b84f16d

...though I can still reproduce this even with this set.

As a final stab at this, I may try to go ahead and backport the patch that adds the launder_page call, but if the other set doesn't fix it then I have a feeling that that won't either. My guess is that we're hitting a problem in the core invalidate_inode_pages2 implementation and we'll need to either patch it or add a parallel call that handles this correctly. I'll plan to take a harder look next week and see if I can pinpoint where exactly the race is coming into play with these other patches in place.

I've found that running against a netapp is a more reliable way to reproduce this...

In addition to the patches that I've backported to try and fix this, I've also added some more debugging info. Unfortunately, adding the debug code seems to change the timing in such a way that it makes the problem go away. When I turn off the debug info, it returns quickly. I'll need to ponder some less invasive method to track this down...

Started with a kernel without all of these patches to make sure I understand the problem in RHEL5 correctly. I added a bit of debug code. Here pid 10979 checks the page validity, finds that it's wrong, dumps the (clean) page from the mapping, does a read, and then updates the page in memory:

NFS:(10979) dump_page page=ffff81000102abe0 flags=808000000086c contents=0 0 0 0 0 0 0 0 0 0
nfs_revalidate_mapping:(10979) invalidate_inode_pages2 returns 0
NFS:(10979) dump_page page=ffff81000102abe0 flags=8080000000004 contents=0 0 0 0 0 0 0 0 0 0
NFS(10979) (0:17/2845690) data cache invalidated
NFS: nfs_readpages (0:17/2845690 1)
NFS: 0 initiated read call (req 0:17/2845690, 4096 bytes @ offset 0)
nfs_readpage_result: 7918, (status 4096)
NFS: read done (0:17/2845690 4096@0)

...next, 10978 calls invalidate_inode_pages2:

nfs_revalidate_mapping:(10978) calling invalidate_inode_pages2

The page is clearly dirty at this point:

NFS:(10978) dump_page page=ffff8100012ad420 flags=808000000003c contents=0 0 0 0 0 0 0 0 10979 0

...invalidate_inode_pages2 returns 0, and the page is now clean, _but_ there doesn't seem to have been any writeback activity in here. Why not?

nfs_revalidate_mapping:(10978) invalidate_inode_pages2 returns 0
NFS:(10978) dump_page page=ffff8100012ad420 flags=8080000000004 contents=0 0 0 0 0 0 0 0 10979 0
NFS(10978) (0:17/2845690) data cache invalidated

The data cache is invalidated and the page is now clean and unattached to the mapping. We do another read into a new page that gets added to the inode mapping:

NFS: nfs_readpages (0:17/2845690 1)
NFS: 0 initiated read call (req 0:17/2845690, 4096 bytes @ offset 0)
nfs_revalidate_mapping:(10980) calling invalidate_inode_pages2
NFS:(10980) dump_page page=ffff810000dd5240 flags=80000000021 contents=0 0 0 0 0 0 0 0 0 0

...so the problem in this situation was that invalidate_inode_pages2 cleaned a dirty page without writing it back. The question is why...
Ok, it looks like upstream had a patch that went in a while back that changed some of the invalidate_inode_pages2 semantics:

From fba2591bf4e418b6c3f9f8794c9dd8fe40ae7bd9 Mon Sep 17 00:00:00 2001
From: Linus Torvalds <torvalds.org>
Date: Wed, 20 Dec 2006 13:46:42 -0800
Subject: [PATCH] VM: Remove "clear_page_dirty()" and "test_clear_page_dirty()" functions

...in particular, it has this delta for iip2, which doesn't seem to depend on the new functions being added. This makes sense to me, I think. The main way that invalidate_complete_page2 returns an error is when PagePrivate is set and we fail to release the page. In this case, it was *not* set, so the page was happily cleaned. This seems wrong to me -- we shouldn't be trying to clean any dirty pages without attempting to write them back...

-		was_dirty = test_clear_page_dirty(page);
-		if (!invalidate_complete_page2(mapping, page)) {
-			if (was_dirty)
-				set_page_dirty(page);
+		if (!invalidate_complete_page2(mapping, page))
 			ret = -EIO;
-		}

...I plan to play around with modifying this and seeing if it makes any difference.

After fixing the above delta and applying the patch to add the launder_page operation, I've been able to run the reproducer for over 5000 iterations without it hitting the inconsistency. The downside is:

1) this patchset is large and is almost certainly a kabi breaker. It also mucks with the invalidate_inode_pages2() function, which is always a bit scary...

2) with this patchset I get a lot of failed mmap() operations. I think what's happening is that nfs_file_mmap calls nfs_revalidate_mapping. This causes invalidate_inode_pages2 to get called, which then returns -EIO because the page is being dirtied by the time it calls invalidate_complete_page2(). This makes the nfs_file_mmap operation fail. I see some similar failures on rawhide kernels, but the frequency on this patched RHEL5 kernel is much higher.

Now that I *think* I see the problem with invalidate_inode_pages2, I may try to shoot for a smaller, more targeted patch and see if we can't fix this a little less invasively.

Looking back at the info in comment #10...

NFS:(10978) dump_page page=ffff8100012ad420 flags=808000000003c contents=0 0 0 0 0 0 0 0 10979 0

...this page was dirty before we enter invalidate_inode_pages2, but PG_private seems to be unset. So even if we stop clearing the dirty bit before calling invalidate_complete_page2, we still won't call try_to_release_page and it won't be written back. That flag mainly gets set during writeback apparently, so since the other process raced in and updated the page after the writeback completed, it's not set here.

I think the right approach might be to turn invalidate_inode_pages2_range into a wrapper around a new function that does close to the same thing but also takes a "launder_page" function pointer. In most cases that will be called with a NULL pointer and it will do what it does now, but when NFS calls this directly we'll take extra care to try and write back the page after unmapping it. This might take some careful adjustment of the nfs_releasepage function as well, since we may be interfering with it.

I am not sure that I see the value in this above just remapping the page when it is discovered that it is dirty? Writing out the contents to the server makes it a valid page, does it not?

Yes, but is there a potential race there? Suppose we have 2 tasks with the same page mapped:

Task1 calls invalidate_inode_pages2_range which calls unmap_mapping_range...
Task2 incurs a page fault and does a readpage to attach a new page where the old one was. It makes a change to that page and dirties it. Task1 checks the original page, finds it dirty, and then remaps the original page back into the mapping.

What happens at that point? (We're getting outside of my meager VM knowledge here :-) Can we even undo unmap_mapping_range like that? If so, how? Presuming we can just put the mapping back, there may be a way to do what you suggest and prevent this race, I suppose -- maybe by holding the i_mutex for the inode?

Right. The trick is to prevent the new page from being created until we are done with the old page. The page fault code needs to be blocked until the process which wants to invalidate all of the pages is completed. Off hand, I don't know how to do this yet, but I will take a look.

At this point, I think we could probably use Larry's expertise here. Let me see if I can summarize the problem:

When the cache revalidation logic decides to invalidate the pagecache for a file on NFS, nfs_revalidate_inode will be called. It in turn will call nfs_sync_mapping() to write back any dirty pages and then call invalidate_inode_pages2() to unmap them. The problem is that there are several races possible here. The page can be redirtied after nfs_sync_mapping completes but before invalidate_inode_pages2 (iip2) can call unmap_mapping_range. It looks like iip2 tries to handle this situation correctly, but it does a pretty poor job. From invalidate_inode_pages2_range:

-------------[snip]--------------
was_dirty = test_clear_page_dirty(page);
if (!invalidate_complete_page2(mapping, page)) {
	if (was_dirty)
		set_page_dirty(page);
	ret = -EIO;
}
-------------[snip]--------------

...so if the page is dirty, we clean it, and only if invalidate_complete_page2() returns an error do we set the page dirty again and return an error. There is some logic in nfs_releasepage to attempt a page writeback, but try_to_release_page() only gets called if PagePrivate is set. That only seems to be so when the page already has an outstanding write request...

So my question is: when the page is redirtied just before the page is unmapped in invalidate_inode_pages2, should it just be clearing the dirty bit on the page like this without attempting to write it back? I'm presuming the answer there is no... If that does occur, I presume that we should expect to get -EIO back from invalidate_inode_pages2. We'll need to change the NFS code that calls that function to expect this possibility and deal with it accordingly.

> Right. The trick is to prevent the new page from being created until we are
> done with the old page. The page fault code needs to be blocked until the
> process which wants to invalidate all of the pages is completed.
I did a bit of reading in Documentation/filesystems/vfs.txt:
"Adding and removing pages to/from an address_space is protected by the
inode's i_mutex."
It's not clear to me where it's taken in that codepath and whether we can call
invalidate_inode_pages2 with that held without deadlocking...
Created attachment 297648 [details]
experimental patch 1 -- don't clean pages as aggressively in invalidate_inode_pages2_range
First patch in a smaller, more targeted patchset I've been working with.
Created attachment 297649 [details]
experimental patch 2 -- have nfs_revalidate_mapping deal with errors from invalidate_inode_pages2
Second patch
I did some testing on a kernel with the two patches in comment #19 and comment #20. I was able to run the reproducer for over an hour with no corruption. I ended up killing the reproducer and it never failed. The downside is that this does cause nfs_file_mmap to occasionally fail when it races with pages being dirtied:

Run 24939
Run 24940
<10910> mmap failed (5:Input/output error)
<10911> mmap failed (5:Input/output error)
<10912> mmap failed (5:Input/output error)
<10913> mmap failed (5:Input/output error)
Run 24941
<10925> mmap failed (5:Input/output error)
Run 24942

...this seems to happen fairly frequently. I think this is probably an improvement -- it's better to have the mmap fail rather than corrupt the data, but it's not ideal. We may be able to fix these, though, with a mutex to serialize mmap and nfs_revalidate_mapping. I want to do a bit more testing with these two patches and then I'll see if we can do that. This is not what Peter was suggesting. I'm still open to his idea, but this does seem to more closely follow upstream behavior. Upstream seems to behave similarly -- we get the occasional failed mmap there, but the frequency seems to be a lot lower than on RHEL5 with these patches.

One of these patches touches invalidate_inode_pages2_range(). I'd like for Larry to look at that one and comment when he gets some time. I think it's a sane change, but that's definitely an area where I could use some guidance.

I don't think my serialization idea in comment #21 will work. The pages are being dirtied via userspace mmap and that's completely disconnected from any kernel activity. I think all we can really do is try to unmap the pages, and then make an attempt to write them back if they're dirty after being unmapped.

I have a patch that adds some launder_page functionality, but it doesn't seem to work correctly at the moment. The nfs_wb_page() function seems to work completely differently upstream than it does in RHEL5, so I think we'd need to have nfs_launder_page do something different. The RHEL5 version doesn't seem to actually clean the page. PG_dirty seems to be consistently set on exit from the function. I've experimented a bit with different ways of attempting to write the page back in nfs_launder_page, but they've all been buggy. It's a bit tricky since we're calling this from invalidate_inode_pages2_range and the page is already locked, etc...

If the page is locked and unmapped and the mapping is locked, then would it be possible to just map it back in instead of attempting to clean it and destroy it?

Good question... Presuming that that's possible and that it's acceptable to do this, what would be the proper return value in this case? -EAGAIN? Then again, is it even necessary to remap the pages like that? Presuming that we don't remove the page from the pagecache, then it should still be lingering around. The next pagefault from userspace should make the page get mapped back into the process' pagetables. That may actually be a source of one of the races I've been seeing, now that I think about it. Even though we call unmap_mapping_range, is there anything that prevents the page from being mapped back and dirtied before we get a chance to write it out?

It seems to me that, at this point, we don't want to have to worry about writing out the page. We tried earlier, and if it is dirty again, then we know that someone is accessing it and we should just leave it alone.
However, we do have to ensure that it is not accessible to userspace while we are checking to see whether it is dirty or not, and also that another page can not be created with this same identity in the meantime.

I don't think we can create another page with the same identity until this one has been removed from the pagecache (as in invalidate_complete_page2()). With this patchset, that should not be happening now when the page is dirty, since invalidate_complete_page2 should generally just fail in that case...

That said, I decided to test the original reproducer for this problem and hit an issue after 253 runs:

Run 253
<5353> can't find me to logout - expected at slot 2
(5353) 0: val=-1 next=1
(5353) 1: val=5351 next=-1
(5353) 2: val=5353 next=-1
(5353) 3: val=-1 next=-1
(5353) 4: val=-1 next=-1
<5352> mmap failed (5:Input/output error)

...I've not done any detailed looking at why it failed yet.

Modified the linked-list reproducer a bit so that it prints out the pids for each run that are running concurrently:

25859: 25860 25861 25862 25863 25864 25865 25866 25867 25868 25869
<25869> can't find me to logout - expected at slot 1
(25869) 0: val=-1 next=-1
(25869) 1: val=25869 next=1
(25869) 2: val=-1 next=-1
(25869) 3: val=-1 next=-1
(25869) 4: val=-1 next=-1
(25869) 5: val=-1 next=-1
(25869) 6: val=-1 next=-1
(25869) 7: val=-1 next=-1
(25869) 8: val=-1 next=-1
(25869) 9: val=-1 next=-1
(25869) 10: val=-1 next=-1

...I also added some debug code to tell me when invalidate_inode_pages2 returns an error, and changed the dfprintk for the "data cache invalidated" message to be printed with my private NFSDBG_MAPPING bit.

NFS: (25862) (0:17/2489340) data cache invalidated
NFS: (25861) (0:17/2489340) data cache invalidated
NFS: (25863) (0:17/2489340) data cache invalidated
NFS: (25869) (0:17/2489340) data cache invalidated
NFS: (25866) (0:17/2489340) data cache invalidated

...so when the above run failed, invalidate_inode_pages2 did not return an error to nfs_revalidate_mapping. It did, however, invalidate the cache.

One thing that might help this is figuring out why we're doing so many cache invalidations with this reproducer. This host is the only writer of the file, so it shouldn't be invalidating the cache so often. While that wouldn't really "fix" this problem, I think that it would make it much more difficult to hit and would probably help performance some here and possibly in other situations.

With the 2 patches I have so far, I'm no longer able to hit the race with my own reproducer, but the mutex_lt3 reproducer can still hit it. I'm not sure why this is, but it may have something to do with the fact that that reproducer deals with multiple pages...
On a hunch, I decided to try the following patch:

--- a/fs/nfs/inode.c
+++ b/fs/nfs/inode.c
@@ -704,13 +704,16 @@ int nfs_revalidate_mapping(struct inode *inode, struct add
 		|| nfs_attribute_timeout(inode))
 		ret = __nfs_revalidate_inode(NFS_SERVER(inode), inode);
+	mutex_lock(&inode->i_mutex);
 	if (nfsi->cache_validity & NFS_INO_INVALID_DATA) {
 		nfs_inc_stats(inode, NFSIOS_DATAINVALIDATE);
 		if (S_ISREG(inode->i_mode))
 			nfs_sync_mapping(mapping);
 		err = invalidate_inode_pages2(mapping);
-		if (err != 0)
-			return err;
+		if (err != 0) {
+			ret = err;
+			goto out;
+		}
 		spin_lock(&inode->i_lock);
 		nfsi->cache_validity &= ~NFS_INO_INVALID_DATA;
@@ -724,6 +727,8 @@ int nfs_revalidate_mapping(struct inode *inode, struct addre
 		inode->i_sb->s_id, (long long)NFS_FILEID(inode));
 	}
+out:
+	mutex_unlock(&inode->i_mutex);
 	return ret;
 }
-------------------------------

There is some similar serialization code upstream, but it's a bit more complex. It's probably not safe to take the i_mutex here in every case, but for this test it seems to work well enough.

This actually seems to make the race harder to hit. I'm still able to hit it with this patch in place, so it's not a complete fix, but it does seem to help for reasons I don't completely understand. I suspect that we might have multiple threads ending up in the nfs_revalidate_mapping logic at once, and that accounts for at least some of the races. Since this doesn't fix it completely though, I have to assume that there are other races we're hitting too...

I spent some time looking to see if we can try to reduce the number of cache invalidations to work around this problem. I've identified some things we can do to try and help that, but haven't been able to eliminate the cache invalidations in this situation. The problem that I've seen now is that there are some races between write calls on the wire and the getattr that is done when the file is opened. When there are writes outstanding and a getattr is done, the mtime can seemingly reflect any intermediate time on the file. One thing we could consider is making sure that we wait for any outstanding writes to complete when doing a getattr, but that could significantly slow things down.

This is the price to pay if there is a need to validate the attribute cache while writing to a file. The file has to be quiesced, i.e. not changing, while the validation is done. Do you know where the GETATTR is coming from? The WRITE responses contain attributes which should be used to update the attributes. There shouldn't be a need to issue the GETATTR for attribute cache validation purposes because the attribute cache is continually getting updated from the WRITE responses.

All this said, I don't think that this problem can be solved by addressing the issue of cache invalidations done as a result of processing WRITE responses. There has to be a safe way implemented which allows the pages to be safely unmapped, cleaned if necessary, and then destroyed. How does the system do it under low memory situations when memory must be reclaimed?

The GETATTR call is coming from another process opening the file. It's doing the
GETATTR as the open part of a CTO check. It just happens to be that the open
sometimes races with writes from other threads.
> All this said, I don't think that this problem can be solved by
> addressing the issue of cache invalidations done as a result of
> processing WRITE responses.
Agreed, while this might help us from getting into this situation as often,
we'll still be vulnerable to it in some cases (particularly if there are other
hosts occasionally writing to the file).
------- Comment From ffilz.com 2008-03-14 10:52 EDT-------
> The problem that I've seen now is that there are some races
> between write calls on the wire and the getattr that is done when the file is
> opened. When there are writes outstanding and a getattr is done, the mtime can
> seemingly reflect any intermediate time on the file.

Ok, so we have an open getattr racing with a write. Hmm, can we skip the getattr for the open if a write is in progress? But what do we do if we want to write and a getattr is in progress? With UDP, we can't guarantee ordering; can we with TCP? (If we can, I'd be comfortable requiring TCP for this type of situation.)

> Agreed, while this might help us from getting into this situation as often,
> we'll still be vulnerable to it in some cases (particularly if there are other
> hosts occasionally writing to the file).

Hmm, can we hope to support two clients mmapping the same file (for write)?

> Hmm, can we skip the getattr for the open if a write is in progress?

Perhaps. That would mean that we'd probably need to reintroduce the data_updates counter that the patch in bug 321111 removes. I played around a bit with reimplementing it today and still hit some races that caused the cache to be invalidated.

I've been using TCP for all of this testing. Packet loss and network delays aren't really the cause of most of this. The problem seems to stem from the fact that the server doesn't necessarily respond to calls in the order that they were received. I suppose this basically comes down to the fact that we need to serialize getattrs vs. writes somehow. Allow writes to occur however the client wants, but make sure they're all complete before we do a getattr, and not allow any new ones until the getattr completes. Maybe we could do this with some sort of rq semaphore? I'd need to think on it a bit.

> Hmm, can we hope to support two clients mmapping the same file (for write)?

Good question. With proper locking it seems like that should work. You'd have to make sure that when the client gets the lock and the file has changed since you last had it, the cache is invalidated and reloaded before allowing a process to make changes to the mmap. Regardless though, it might be best to consider this in the context of a different BZ. Preventing false cache invalidations is really a separate issue from what happens once the invalidation occurs.

I don't think that we want to muck with the open-to-close semantics at this point. If we need to serialize writes and opens, then so be it. This problem can be solved with a safe way to unmap pages which does not leave a window where data can be lost. The other issues of reducing cache invalidations can be solved separately.

Yep... I decided to step back a bit and make sure I understand what's supposed to happen when the page is mapped and dirtied. I've written a simple program to mmap a file, read from it via the mmap, and write back to it via the mmap. Starting out playing with this on a file in /tmp; I will look at NFS after I see what happens here.

I start the program, which opens the file and mmaps it:

# ./mmap-test /tmp/testfile
mmap at address 0x2aaaaaaac000

crash> vtop 0x2aaaaaaac000
VIRTUAL       PHYSICAL
2aaaaaaac000  (not mapped)

...so there is no page attached at this point.
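The test program itself isn't attached; a minimal equivalent of what is described above might look like the sketch below. The one-page size, the pause-for-Enter between steps (to leave time for inspection with the crash utility), and the error handling are all assumptions:

-------[snip]---------
/* Sketch of a simple mmap test: map a file, read through the mapping,
 * write through it, then msync, pausing after each step so the pte and
 * struct page state can be inspected. Not the original mmap-test. */
#include <stdio.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/mman.h>

int main(int argc, char **argv)
{
	char *map, c;
	int fd;

	if (argc != 2) {
		fprintf(stderr, "usage: %s <file>\n", argv[0]);
		return 1;
	}
	fd = open(argv[1], O_RDWR);
	if (fd < 0) {
		perror("open");
		return 1;
	}
	map = mmap(NULL, 4096, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
	if (map == MAP_FAILED) {
		perror("mmap");
		return 1;
	}
	printf("mmap at address %p\n", (void *)map);
	getchar();			/* nothing mapped yet */

	c = map[0];			/* read fault: page comes in clean */
	getchar();

	map[0] = c;			/* write fault: pte and page go dirty */
	getchar();

	msync(map, 4096, MS_SYNC);	/* write back: dirty bits clear again */
	getchar();

	munmap(map, 4096);
	close(fd);
	return 0;
}
-------[snip]---------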
Now, read from the mapping (incur a page fault):

   PTE            PHYSICAL  FLAGS
80000000158cc025  158cc000  (PRESENT|USER|ACCESSED|NX)

      VMA              START         END       FLAGS  FILE
ffff8100153f6218  2aaaaaaac000  2aaaaaaad000  fb  /tmp/testfile

      PAGE         PHYSICAL      MAPPING       INDEX  CNT  FLAGS
ffff810001515c80  158cc000  ffff81001eabfbc8  0  3  1008000000086c

...now, write to it:

   PTE            PHYSICAL  FLAGS
80000000158cc067  158cc000  (PRESENT|RW|USER|ACCESSED|DIRTY|NX)

      PAGE         PHYSICAL      MAPPING       INDEX  CNT  FLAGS
ffff810001515c80  158cc000  ffff81001eabfbc8  0  3  1008000000087c

...now msync:

   PTE            PHYSICAL  FLAGS
80000000158cc025  158cc000  (PRESENT|USER|ACCESSED|NX)

      PAGE         PHYSICAL      MAPPING       INDEX  CNT  FLAGS
ffff810001515c80  158cc000  ffff81001eabfbc8  0  3  1008000000086c

...nothing terribly surprising here.

I suspect that what's happening is that the page is being allowed to be dirtied somehow after unmap_mapping_range() is complete and after the PageDirty check is done in invalidate_complete_page2(). Could it be something like:

thread1: unmap_mapping_range completes (no more mmaps)
thread2: page fault occurs, page is remapped back into process address space
thread1: invalidate_complete_page2 is called and gets past the PageDirty check
thread2: dirties the page
thread1: __remove_from_page_cache

...there's probably some way to test for this, but I need to figure out how to do it in such a way that nothing can slip by. It seems like mapping a page into the process' address space and dirtying it is pretty much lockless...

I'm testing with my current test kernels (which contain the patches I have so far for this). Here's the output from the run:

9216: 9217 9218 9219 9220 9221 9222 9223 9224 9225 9226
<9222> can't find me to logout - expected at slot 1
(9222) 0: val=-1 next=-1
(9222) 1: val=9222 next=1
(9222) 2: val=-1 next=-1
(9222) 3: val=-1 next=-1
(9222) 4: val=-1 next=-1
(9222) 5: val=-1 next=-1
(9222) 6: val=-1 next=-1
(9222) 7: val=-1 next=-1
(9222) 8: val=-1 next=-1
(9222) 9: val=-1 next=-1
(9222) 10: val=-1 next=-1
<9226> looped 2
(9226) 0: val=-1 next=1
(9226) 1: val=9222 next=1
(9226) 2: val=-1 next=-1
(9226) 3: val=-1 next=-1
(9226) 4: val=-1 next=-1
(9226) 5: val=-1 next=-1
(9226) 6: val=-1 next=-1
(9226) 7: val=-1 next=-1
(9226) 8: val=-1 next=-1
(9226) 9: val=-1 next=-1
(9226) 10: val=-1 next=-1

I ran this with the following systemtap script:

probe kernel.function("__remove_from_page_cache") {
	if ($page->_mapcount->counter >= 0) {
		printf("__remove_from_page_cache(%u): _mapcount=%d flags=0x%x\n",
		       pid(), $page->_mapcount->counter, $page->flags);
		print_backtrace();
	}
}

...here's the output:

__remove_from_page_cache(9226): _mapcount=0 flags=0x8000000002d
 0xffffffff800ca61f : __remove_from_page_cache+0x1/0x3d
 0xffffffff800ce245 : invalidate_inode_pages2_range+0x158/0x209
 0xffffffff8842075a : nfs_revalidate_mapping+0xac/0x1af [nfs]
 0xffffffff8841e378 : nfs_file_mmap+0x21/0x4c [nfs]
 0xffffffff8000e584 : do_mmap_pgoff+0x4f1/0x7da
 0xffffffff80024cff : sys_mmap+0x89/0xc1
 0xffffffff800602a6 : tracesys+0xd5/0xdf

...I think this may be related to the problem here. The page should have been unmapped when we called __remove_from_page_cache, but I'm guessing that sometime after unmap_mapping_range was called, a process incurred a page fault and mapped the page back into its page tables. It wasn't dirty at this point, but it easily could have been dirtied after this and before trashing the page. It seems like we need something that prevents the page from getting mapped back into anything's pagetables before we toss it out.
I'm going to see if I can get a bit more mileage out of systemtap here... Some of this may be incidental to the actual problem, though. While testing this, I've seen the _mapcount go >= 0 when no problem occurred, and I've seen the problem occur with no message popping. The problem still could be the same, just with the race happening a bit later. I may see if I can do this same check after invalidate_complete_page2 returns, but that means rebuilding the kernel since that function is static.

Created attachment 298407 [details]
latest linked-list reproducer
This is the latest linked list reproducer that I'm using to test this. It still
fails reliably even with the two experimental patches above. To build it:
# gcc -Wall -g -o ./mutex_lt3 ./mutex_lt3-7.c -lpthread
...I then run it under a shell script that runs it in a loop:
-------[snip]---------
#!/bin/sh
RESULT=0
RUN=1
while [ $RESULT -eq 0 ]; do
	./mutex_lt3 $1
	RESULT=$?
	RUN=`expr $RUN + 1`
done
exit $RESULT
-------[snip]---------
You run the shell script with a file on a NFS share as an argument. I've been
using a netapp server. The fact that it has sub-second mtimes I think makes for
more cache invalidations and helps to trigger the problem.
The code in this reproducer is very hard to follow, but basically it has a
bunch of threads adding and removing entries from a linked list that's on a
mmapped file. It uses a semaphore for serialization.
When run, one of the threads will eventually notice that the list is not what
is expected and will exit with an error.
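The attached mutex_lt3 source is what should actually be used for testing. Purely to illustrate the pattern described above -- threads sharing an mmapped file and updating it under a semaphore until one of them notices stale contents -- here is a stripped-down sketch. The slot-table layout, names, and thread count are invented for illustration; the real reproducer maintains a linked list and is considerably more involved:

-------[snip]---------
/* Illustration only -- not the attached mutex_lt3-7.c. Threads mmap a
 * shared file holding a table of "login" slots plus a process-shared
 * semaphore, repeatedly claim a slot, re-check it, and release it.
 * The failure mode on NFS is a thread no longer finding its own entry. */
#include <stdio.h>
#include <stdlib.h>
#include <fcntl.h>
#include <unistd.h>
#include <pthread.h>
#include <semaphore.h>
#include <sys/mman.h>

#define NSLOTS   16
#define NTHREADS 10

struct shared {
	sem_t lock;		/* lives in the shared mapping */
	long  slot[NSLOTS];	/* -1 = free, else owner id */
};

static struct shared *sh;

static void *worker(void *arg)
{
	long id = (long)arg;
	int i, mine;

	for (;;) {
		sem_wait(&sh->lock);
		mine = -1;
		for (i = 0; i < NSLOTS; i++) {	/* "login" */
			if (sh->slot[i] == -1) {
				sh->slot[i] = id;
				mine = i;
				break;
			}
		}
		sem_post(&sh->lock);

		usleep(100);

		sem_wait(&sh->lock);
		if (mine >= 0) {		/* "logout" */
			if (sh->slot[mine] != id) {
				fprintf(stderr, "<%ld> can't find me at slot %d\n",
					id, mine);
				exit(1);
			}
			sh->slot[mine] = -1;
		}
		sem_post(&sh->lock);
	}
	return NULL;
}

int main(int argc, char **argv)
{
	pthread_t tid[NTHREADS];
	long i;
	int fd;

	if (argc != 2) {
		fprintf(stderr, "usage: %s <file on NFS>\n", argv[0]);
		return 1;
	}
	fd = open(argv[1], O_RDWR | O_CREAT, 0644);
	ftruncate(fd, sizeof(*sh));
	sh = mmap(NULL, sizeof(*sh), PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
	sem_init(&sh->lock, 1, 1);	/* pshared, stored in the mapped file */
	for (i = 0; i < NSLOTS; i++)
		sh->slot[i] = -1;

	for (i = 0; i < NTHREADS; i++)
		pthread_create(&tid[i], NULL, worker, (void *)i);
	for (i = 0; i < NTHREADS; i++)
		pthread_join(tid[i], NULL);
	return 0;
}
-------[snip]---------

As with the real reproducer, this would be built with -lpthread and run against a file on an NFS mount.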
To summarize: the issue with this case seems to be that when a file is mmapped on NFS and processes are updating it, it occasionally ends up having different contents than expected. I *think* the problem is related to the client noticing mtime changes on the file and invalidating its cached contents based on that. I suspect that the problem is either:

1) we're occasionally invalidating a page that has dirty data

...and/or...

2) we're occasionally reading new data onto a dirty page

My initial thought was that the problem was a race where a page would be dirtied after writeback but before being invalidated. The two experimental patches attached to this case try to fix that. While they fix a simpler reproducer that I have, they do not seem to fix the reproducer in comment #38.

Actually.... I'm not convinced that this problem is actually fixed in rawhide. It may be, but it may be that this problem is being sidestepped by reducing cache invalidations. When doing a mutex_lt3_loop test on rawhide (kernel 2.6.25-0.121.rc5.git4.fc9), I see virtually no read calls occurring. This tells me that we're almost never invalidating the cache, and so we virtually never end up clobbering the data in a dirty page...

------- Comment From sglass.com 2008-03-31 12:15 EDT-------
Red Hat, could I get a quick update on where we are with this bug, please? IBM really needs to update the customer on this bug.
Thanks
Stephanie Glass
SWG Program Manager

I am working on instrumenting a kernel so that I can attempt to see what is happening when the corruption occurs. Some of the instrumentation needs to go into the page fault handling code, so this must be done very carefully. Also, due to the high level of concurrency, this must be done very carefully in order to not disturb the ordering by serializing all of the processes on a single lock which is used for logging. I have almost completed the design for this instrumentation and will be implementing it. Once I get it in place, then I would hope that we can find out how the updates to some pages are being lost.

------- Comment From ffilz.com 2008-04-07 09:51 EDT-------
Any updates on this?

Not yet. I needed to spend some time on another high priority bug for a few days. I have come to a stopping point on that other bug and will be getting back to work on this situation this week.

------- Comment From chavez.com 2008-05-01 12:05 EDT-------
Given that Red Hat has requested a RHIT issue to assist in getting a fix into a Z stream, I am going to have to break the mirroring link for a short time while I create a RHIT issue to link the LTC bug to. After this, Red Hat will need to establish a link from the RHIT issue to RH BZ 435291 with their Tao system. Please be patient and refrain from adding comments until this is complete. Thanks for your patience.

Hello ENG,
I have attached the patch IBM tested and said was good.
Thank You
Joe Kachuck
This event sent from IssueTracker by jkachuck issue 178083 it_file 131821

Created attachment 306161 [details]
Proposed patch

Here is a proposed patch to address the situation. I have not completed testing using the patch, but the current results look promising. A kernel with this patch runs the reproducer in this bugzilla and the more complex reproducer from bz435291 without failure on my test systems.

This request was evaluated by Red Hat Product Management for inclusion in a Red Hat Enterprise Linux maintenance release.
Product Management has requested further review of this request by Red Hat Engineering, for potential inclusion in a Red Hat Enterprise Linux Update release for currently deployed products. This request is not yet committed for inclusion in an Update release.

----- Additional Comments From sglass.com 2008-05-21 14:17 EDT -------
From Red Hat bug 435291, which did not get mirrored:

Comment #52 From Peter Staubach (staubach) on 2008-05-20 13:38 EST

Proposed patch

Here is a proposed patch to address the situation. I have not completed testing using the patch, but the current results look promising. A kernel with this patch runs the reproducer in this bugzilla and the more complex reproducer from bz435291 without failure on my test systems. I will post the patch separately.
This event sent from IssueTracker by jkachuck issue 178083

File uploaded: 435291.rhel-5
This event sent from IssueTracker by jkachuck issue 178083 it_file 134389
file attachment from LTC:435291.rhel-5
This event sent from IssueTracker by jkachuck issue 178083

----- Additional Comments From sglass.com 2008-05-21 14:26 EDT -------
New Patch
Please test on both RHEL 4.5 and RHEL 5.1
This event sent from IssueTracker by jkachuck issue 178083

Please be aware that the patch for RHEL-4 is different from the patch for RHEL-5. You will need the patch from bz432974 for the RHEL-4 version.

changed:
What              |Removed            |Added
----------------------------------------------------------------------------
Status            |FIXEDAWAITINGTEST  |TESTED

------- Additional Comments From sglass.com 2008-05-28 14:50 EDT -------
IBM has tested this latest patch and it works great! Thanks for all the hard work.
This event sent from IssueTracker by jkachuck issue 178083

Great news! Thank you for the update. I added this patch to the test kernels that I have here:
http://people.redhat.com/jlayton/

Today, I was doing some testing for an upstream server-side patch I'm working on and I noticed that iozone had hung on the rhel5 client I was using.
Here's the stack there:

PID: 23154  TASK: ffff810003980c00  CPU: 0   COMMAND: "iozone"
 #0 [ffff810013f23cf8] schedule at ffffffff80065f83
 #1 [ffff810013f23dd0] inode_wait at ffffffff800eebb1
 #2 [ffff810013f23de0] __wait_on_bit at ffffffff80066a75
 #3 [ffff810013f23e20] __writeback_single_inode at ffffffff80031779
 #4 [ffff810013f23eb0] sync_sb_inodes at ffffffff80022109
 #5 [ffff810013f23ef0] sync_inodes_sb at ffffffff800f54ec
 #6 [ffff810013f23f40] __sync_inodes at ffffffff800f555d
 #7 [ffff810013f23f60] do_sync at ffffffff800e44e8
 #8 [ffff810013f23f70] sys_sync at ffffffff800e451a
 #9 [ffff810013f23f80] tracesys at ffffffff800602a6 (via system_call)

...looking around, I noticed that rhn_check seemed to also be hung (and it had been running since yesterday):

PID: 20997  TASK: ffff81000333ec80  CPU: 0   COMMAND: "rhn_check"
 #0 [ffff810015fcd998] schedule at ffffffff80065f83
 #1 [ffff810015fcda70] io_schedule at ffffffff80066849
 #2 [ffff810015fcda90] sync_page at ffffffff80029f58
 #3 [ffff810015fcdaa0] __wait_on_bit_lock at ffffffff8006698d
 #4 [ffff810015fcdad0] __lock_page at ffffffff80041cfa
 #5 [ffff810015fcdb20] mpage_writepages at ffffffff8001d9f1
 #6 [ffff810015fcdc30] do_writepages at ffffffff8005d5c4
 #7 [ffff810015fcdc40] __writeback_single_inode at ffffffff800317ed
 #8 [ffff810015fcdcd0] sync_sb_inodes at ffffffff80022109
 #9 [ffff810015fcdd10] writeback_inodes at ffffffff800532ef
#10 [ffff810015fcdd30] balance_dirty_pages_ratelimited_nr at ffffffff800cda2c
#11 [ffff810015fcddd0] __handle_mm_fault at ffffffff80008f1a
#12 [ffff810015fcde60] do_page_fault at ffffffff8006a7db
#13 [ffff810015fcdf50] error_exit at ffffffff80060e9d

...it looks like it's hung trying to lock the page here while in the page fault handler. The page seems to be part of a file on ext3. I'm not that familiar with this code, but perhaps we're trying to lock a page that we already have locked?

Yup, trying to lock a page that is already locked. Deadlock.

Created attachment 309133 [details]
Proposed patch
Here is a new patch which should address the deadlock described in Comment #69. It passes the reproducers and is in use on a test system.

Stephanie, can you please get the appropriate IBM resources testing this new patch?

----- Additional Comments From cguild.com 2008-06-20 12:05 EDT -------
I'll be more than happy to test the patch: If you can tell me the kernel that the patch was based off of, I have several images that have been in use for months. The only one I do not have is RHAS4 U7, and I will see about downloading the latest ISOs.... Or, you can point me to the kernel sources, which would be better. Thanks!
This event sent from IssueTracker by jkachuck issue 178083

If there is a Hotfix for this in RHAS4 U5, can someone point us to where we can get that hotfix, or is it required to go to update 6 to have this fixed, period?

This is the RHEL-5 bugzilla, by the way. The RHEL-4 bugzilla is 432974. But anyway, there isn't either one, that I know of. The patch for this has been proposed for RHEL-4 U8. I believe that there are rpms for testing which contain the patch, but others would know better which ones and where they are located. The changes to address this situation are in very commonly used areas and are considered to be fairly high risk. Until we can get some more experience with them, backporting them to older RHEL-4 releases is not a good idea.

in kernel-2.6.18-99.el5
You can download this test kernel from http://people.redhat.com/dzickus/el5

I've seen serious issues with RHEL 5.0 when accessing BerkeleyDB-backed Subversion repositories housed on a NetApp device. Relatively straightforward operations hang indefinitely with only two concurrent accessing processes in play. An upgrade to RHEL 5.2 didn't help. But Don Zickus' kernel at http://people.redhat.com/dzickus/el5/107.el5 (applied as an upgrade to an RHEL 5.0 installation) seems to have remedied the problem! Can anyone speak about the trajectory of this issue, specifically regarding the possibility of the fix showing up in a blessed Red Hat release?

Sorry about the previous comment. It might be helpful to know that I and my co-workers stumbled upon this issue while testing RHEL 5.0, Subversion 1.5, and Berkeley DB 4.7. We were able to rule out Subversion and Berkeley DB as likely causes, and eventually found that the one key change since our last successful configuration was an upgrade from RHEL-3 to RHEL-5. There is no shortage of NFS-related bugs in this tracker, but this one caught my eye because it referred to memory mapped files. In some of my tests I didn't see an application hang, but I *did* see results that seemed to indicate... "wonkiness" (for lack of a better word) in the data returned from Berkeley DB: Subversion transaction records suddenly disappearing, errors about "unpinned pages", etc. We have many applications reading/writing to the NetApp, but Berkeley DB makes fairly heavy use of file-backed shared memory regions. *That's* what drew my attention here, and prompted us to try out Don's test kernel.

Sorry, you are asking when a kernel containing this patch might be released? This patch is targeted for release in RHEL-5.3.

Attention Partners! RHEL 5.3 public Beta will be released soon. This URGENT priority/severity bug should have a fix in place in the recently released Partner Alpha drop, available at ftp://partners.redhat.com. If you haven't had a chance yet to test this bug, please do so at your earliest convenience, to ensure the highest possible quality bits in the upcoming Beta drop.
Thanks, more information about Beta testing to come.
- Red Hat QE Partner Management

We (not me personally this time, but peers of mine) at CollabNet have tested the RHEL 5.3 alpha drop and found that it did *not* fix our problems. Strangely, Don Zickus' kernel at http://people.redhat.com/dzickus/el5/107.el5 *did*. I'm left wondering if perhaps there were additional patches in Don's kernel that are not present in the RHEL 5.3 alpha kernel. Did the RHEL 5.3 alpha kernel remedy the problems seen by the original reporter of this issue?

Which version of the RHEL-5 kernel was included in the alpha described in Comment #90? The patch was included in 2.6.18-99.el5.

The version is as stated in the RH announcement here:
https://www.redhat.com/archives/rhelv5-announce/2008-October/msg00000.html
(And note that I was saying "alpha" when apparently this was actually a "beta"?)

$ uname -a
Linux *****.collab.net 2.6.18-120.el5PAE #1 SMP Fri Oct 17 18:17:11 EDT 2008 i686 i686 i386 GNU/Linux
$

The patch as attached to this bugzilla does indeed address the problems as reproduced by the reproducer also attached to this bugzilla. The patch was included in RHEL-5 kernels as of 2.6.18-99. If it does not address some other situation, then I will need some way to reproduce that other situation so that I can see it and do some debugging and analysis.

Hrm. We received the following update from Red Hat this morning via other channels:

---------------------------------------
| Case Information |
---------------------------------------
Case Title : LTC41974-Pages of a memory mapped NFS file get corrupted
Case Number : 1857478
Case Open Date : 11-SEP-2008
Problem Type :
Last Update Comment as of 05-NOV-2008 09:10:41 :
Hello, I have checked and the current beta kernel which is included in RHEL5.3 beta *does not* include the patch described in the bugzilla. This is because the bugzilla and patch has still to get a qa acknowledgment from the quality management. Thank you, Shijoe George.

I'm trying to reconcile the various bits of information at my disposal, but that's proving difficult. :-)

Let me assure you -- the patch associated with this bugzilla is in every RHEL-5 kernel after 2.6.18-99. If someone disagrees, please have them contact me. Consider this to be authoritative.

Update from IBM: This has tested well on RHEL 5.3 for IBM and its customer. I have a request that if there is another problem, a new bug be opened on that issue rather than confusing this issue with other problems. I would like to not delay getting this fix out to our customers.
Thanks
Stephanie
LTC Program Manager for SWG

Okay. We've retested this beta kernel, and things *do* seem to work. (There arose some question about the validity of our previous round of tests, which encouraged us to try again.) Embarrassing, yes, but in the long run all we'll care about is that stuff works. :-)

An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2009-0225.html