+++ This bug was initially created as a clone of Bug #432974 +++

[ some comments removed for brevity ]

-- Additional comment from jlayton on 2008-02-20 14:14 EST --

Created an attachment (id=295442)
reproducer 6

Current reproducer that does work, based on the original one from IBM. I'm running this in a shell script that runs it in a loop. RHEL4 routinely fails after just a couple of runs. RHEL5 usually fails within 4 or 5. I haven't seen any rawhide kernels fail with it.

-----------------[snip]--------------
#!/bin/sh
RESULT=0
RUN=1
while [ $RESULT -eq 0 ]; do
	echo "Run $RUN"
	./mutex_lt3 $1
	RESULT=$?
	RUN=`expr $RUN + 1`
done
exit $RESULT
-----------------[snip]--------------

-- Additional comment from tao on 2008-02-20 16:40 EST --

file attachment from LTC:reproducer6.c

This event sent from IssueTracker by Glen Johnson
issue 163042
it_file 120446

-- Additional comment from tao on 2008-02-20 16:41 EST --

----- Additional Comments From ffilz.com 2008-02-20 16:34 EDT -------
reproducer 6

This event sent from IssueTracker by Glen Johnson
issue 163042

-- Additional comment from jlayton on 2008-02-21 12:20 EST --

Still trying to get my arms around this... Reducing the number of threads seems to make this race harder to hit (which isn't too surprising). I've reduced it to 10 and have still been able to hit it pretty consistently, however.
Removing the multiplication busy loop seems to make the problem go away altogether: I ran 20 concurrent threads in a loop and it went over 100 iterations without failing. I'm going to try changing that busy loop slightly and see whether that makes this easier or harder to hit...

-- Additional comment from jlayton on 2008-02-21 12:43 EST --

Reducing the number of forked processes to 5 still allows the problem to occur, but I have yet to see it occur when I reduce the number to 3. Still working on this on RHEL4 at the moment; hopefully once we figure out the problem there, we can see whether RHEL5's problem is the same.

-- Additional comment from jlayton on 2008-02-21 15:47 EST --

Added a new debug flag, changed the dfprintk() in nfs_revalidate_mapping to key off of it, and added the pid of the current task. Here was the run:

# ./mutex_lt3_loop.sh /mnt/rhel5/mtxfile
Run 1
Run 2
Run 3
<17935> can't find me to logout - expected at slot 2
<17936> can't find me to logout - expected at slot 3

...here's the output from the run:

NFS: 17921 (0:13/1755723) data cache invalidated
NFS: 17925 (0:13/1755723) data cache invalidated
NFS: 17927 (0:13/1755723) data cache invalidated
NFS: 17933 (0:13/1755723) data cache invalidated
NFS: 17937 (0:13/1755723) data cache invalidated

...so we know that the data cache was never invalidated in the context of the threads that hit the problem in this case.

-- Additional comment from jlayton on 2008-02-21 16:06 EST --

Added a new printk to show on every entry to nfs_revalidate_mapping too. Here's output from this run:

# ./mutex_lt3_loop.sh /mnt/rhel5/mtxfile
Run 1
Run 2
Run 3
<18255> can't find me to logout - expected at slot 2
<18256> can't find me to logout - expected at slot 3

So I think this means that 18255 and 18256 had their writes get lost... Here, the mapping was invalidated by 18252 and then revalidated by 18255 and 18256 (as well as by others) just before the data cache was invalidated by 18253.
NFS: 18255 (0:13/1755723) nfs_revalidate_mapping on 0000010011fda410
NFS: 18256 (0:13/1755723) nfs_revalidate_mapping on 0000010011fda410
NFS: 18253 (0:13/1755723) nfs_revalidate_mapping on 0000010011fda410
NFS: 18253 (0:13/1755723) nfs_revalidate_mapping on 0000010011fda410
NFS: 18253 (0:13/1755723) data cache invalidated for mapping 0000010011fda410
NFS: 18254 (0:13/1755723) nfs_revalidate_mapping on 0000010011fda410
NFS: 18255 (0:13/1755723) nfs_revalidate_mapping on 0000010011fda410

...the address at the end is the address_space object address. That's the same for everyone, so I don't think the problem is that different processes have inconsistent views of the mmap; rather, I think we're most likely dealing with a race that's causing writes to occasionally get lost. I could be wrong on this, but I think it makes the most sense...

-- Additional comment from jlayton on 2008-02-22 11:20 EST --

Created an attachment (id=295644)
reproducer 7

Some more cleanup of the reproducer:

- Make each "slot" in the array page-aligned (by extending the garbage array to 1008 entries).
- Change MP to be based on the number of spawned processes -- there should be no need to go past that, and it cuts down the run time since the main thread doesn't need to write out the huge file.
- Change a couple of places that check for excessive loop iterations to error out when we go past MP iterations (signifying that we traversed the list more than once).
- Reduce the number of busy-loop multiplication ops -- this "feels" like it makes things reproduce faster, but it might just be my imagination.
- Add a routine to dump the state of the array when an error is detected.

-- Additional comment from jlayton on 2008-02-22 11:25 EST --

That should read...
(by extending the garbage array to 1022 entries)

Here's some interesting output from the latest reproducer:

<26595> can't find me to logout - expected at slot 4
(26595) 0: val=-1 next=2
(26595) 1: val=26593 next=5
(26595) 2: val=26594 next=1
(26595) 3: val=-1 next=-1
(26595) 4: val=26595 next=-1
(26595) 5: val=-1 next=-1
<26596> can't find me to logout - expected at slot 5
(26596) 0: val=-1 next=2
(26596) 1: val=26593 next=5
(26596) 2: val=26594 next=1
(26596) 3: val=-1 next=-1
(26596) 4: val=26595 next=-1
(26596) 5: val=-1 next=-1

...so 26595 couldn't find itself in the array. But, since the loop ran, the mapping has evidently been updated and now it *is* in the right spot.

-- Additional comment from jlayton on 2008-02-25 09:12 EST --

Another run from this morning. Here's the output from the program:

# ./mutex_lt3_loop.sh /mnt/rhel5/mtxfile
Run 1
<2289> can't find me to logout - expected at slot 3
(2289) 0: val=-1 next=4
(2289) 1: val=2287 next=3
(2289) 2: val=-1 next=-1
(2289) 3: val=2289 next=-1
(2289) 4: val=-1 next=-1
<2287> can't find me to logout - expected at slot 1
(2287) 0: val=-1 next=4
(2287) 1: val=2287 next=3
(2287) 2: val=-1 next=-1
(2287) 3: val=2289 next=-1
(2287) 4: val=-1 next=-1
<2288> can't find me to logout - expected at slot 4
(2288) 0: val=-1 next=4
(2288) 1: val=2287 next=3
(2288) 2: val=-1 next=-1
(2288) 3: val=2289 next=-1
(2288) 4: val=-1 next=-1

...here's a hexdump of the final contents. It pretty much matches what we see above. Presuming that the reproducer's locking semantics are correct (and I think they are), it looks like 2288's list insertion got lost.
# hexdump /export/mtxfile
0000000 ffff ffff 0004 0000 0000 0000 0000 0000
0000010 0000 0000 0000 0000 0000 0000 0000 0000
*
0001000 08ef 0000 0003 0000 0000 0000 0000 0000
0001010 0000 0000 0000 0000 0000 0000 0000 0000
*
0002000 ffff ffff ffff ffff 0000 0000 0000 0000
0002010 0000 0000 0000 0000 0000 0000 0000 0000
*
0003000 08f1 0000 ffff ffff 0000 0000 0000 0000
0003010 0000 0000 0000 0000 0000 0000 0000 0000
*
0004000 ffff ffff ffff ffff 0000 0000 0000 0000
0004010 0000 0000 0000 0000 0000 0000 0000 0000
*
0005000

...the good news is that turning up nfs_debug seems to make this race even easier to hit. I have the debug output from the above run and will attach it soon.

-- Additional comment from jlayton on 2008-02-25 09:13 EST --

Created an attachment (id=295797)
debug messages from run in comment #47

nfs_debug output from run in comment #47.

-- Additional comment from jlayton on 2008-02-25 09:47 EST --

Since the slots are now page-aligned, we should be able to determine where they are in the debug output. Slot 4 corresponds to 0x4000 above, so it should start at byte 16384. There are other races involved, since the other processes obviously didn't find their entries, but the missing write from 2288 is probably as good a place to start as any.
Just after the first mmap is done, we do a big read and load the cache for the page starting at 16384:

NFS: read done (0:13/1755723 4096@0)
NFS: read done (0:13/1755723 4096@4096)
NFS: read done (0:13/1755723 4096@8192)
NFS: read done (0:13/1755723 4096@12288)
NFS: read done (0:13/1755723 4096@16384)

...then later, the data cache is invalidated (though it's not clear to me why):

NFS: nfs_update_inode(0:13/1755723 ct=1 info=0x6)
NFS: (0:13/1755723) data cache invalidated
NFS: nfs3_forget_cached_acls(0:13/1755723)
NFS: (0:13/1755723) revalidation complete

...at that point, we do an nfs_readpage call, but for only 4k at page offset 4:

NFS: nfs_readpage (0000010000b64450 4096@4)
NFS: 47 initiated read call (req 0:13/1755723, 4096 bytes @ offset 16384)

...are the other pages already flagged uptodate? Should they be if the cache is invalidated? A little while later:

NFS: 52 initiated write call (req 0:13/1755723, 4096 bytes @ offset 16384)
NFS: 52 nfs_writeback_done (status 4096)
NFS: write (0:13/1755723 4096@16384) marked for commit
NFS: 53 initiated commit call
nfs: fsync(0:13/1755723)
NFS: nfs_update_inode(0:13/1755723 ct=2 info=0x6)
NFS: 53 nfs_commit_done (status 0)

...and that's the first time we see a write go out on the wire for that page. I think the fsync represents an msync call, so at that point one of the processes had detected an error and started exiting.

The hard part here is that we don't have any real visibility into when the changes to the mmap happened. That's all done via direct access to the memory, so we probably won't be able to tell. What might be interesting is to dump the first few bytes of each page just before a read call is done, so that we can see whether potential changes are being clobbered before they ever go out on the wire. Possibly the same for write calls.
-- Additional comment from tao on 2008-02-25 10:15 EST --

changed:
           What    |Removed     |Added
----------------------------------------------------------------------------
           Status  |NEEDINFO    |OPEN

------- Additional Comments From ffilz.com 2008-02-25 10:10 EDT -------

> ...at that point, we do an nfs_readpage call, but for only 4k at page
> offset 4:
>
> NFS: nfs_readpage (0000010000b64450 4096@4)
> NFS: 47 initiated read call (req 0:13/1755723, 4096 bytes @ offset
> 16384)

Is that supposed to be byte offset 4 or page offset 4?

Printing the first few bytes of each page during read and write would be good. Also, printing more details when the cache is invalidated would be good.

Moving to open since development/Red Hat is investigating; we are not waiting on the submitter for anything.

This event sent from IssueTracker by Glen Johnson
issue 163042

-- Additional comment from jlayton on 2008-02-25 10:44 EST --

> NFS: nfs_readpage (0000010000b64450 4096@4)

...4096 bytes at page index 4...

> NFS: 47 initiated read call (req 0:13/1755723, 4096 bytes @ offset 16384)

...which becomes a read call for 4k at byte offset 16384.

I may do that (add some debugging code that shows what data is currently in the page), but first I want to see if I can rewrite this reproducer so that it doesn't use this linked list. If I have different processes updating different offsets in different pages, then I might be able to reproduce the same thing, and it might be a little clearer whose updates are getting clobbered.

-- Additional comment from tao on 2008-02-25 11:05 EST --

----- Additional Comments From ffilz.com 2008-02-25 11:01 EDT -------

Ok, figured that was page index 4. Yes, having each process have its own part of the page to use would help. One thought would be for each process to take a snapshot of the other processes' page areas and put that in its own section, so the layout of each page might be something like a two-dimensional array.
It might be interesting to see what the other processes' views are when a page update appears to be lost.

This event sent from IssueTracker by Glen Johnson
issue 163042

-- Additional comment from jlayton on 2008-02-25 11:39 EST --

Created an attachment (id=295812)
simpler reproducer 1

Here's a simpler reproducer program. It just declares the file to be a large array of pages and gives each process a slot in every page. Each process:

- gets the lock
- checks that its slot is zero in every page
- writes its pid to its slot in every page
- releases the lock
- does a busy loop
- gets the lock again
- checks that its pid is in its slot in every page
- writes 0 to its slot in every page

...it does this for 400 iterations. Usually, it fails after just a few. I need to clean it up a little more and have it dump the state of the problem page when it detects a problem.

-- Additional comment from jlayton on 2008-02-25 11:41 EST --

From what I've seen so far, I don't think the processes have an inconsistent view of the pages. The address_space struct seems to be the same one, so the pages attached to it ought to also be the same. Rather, I think that updates to the pages are occasionally getting clobbered. The locking should be sufficient to prevent that, but it still seems to be happening. I could be wrong here, however...

-- Additional comment from jlayton on 2008-02-25 12:07 EST --

I've been able to reduce NRTHREADS to 10 and still reproduce it, but it takes longer. Sample error from the above program:

7981: error p=1 s=7 v=0

...the interesting thing is that every time I've run this, the message always has p=1 (problem in the first page index). Still need to add a bit of code to dump the state of the problem page when this occurs, though...
-- Additional comment from jlayton on 2008-02-25 12:59 EST --

Running with nfs_debug cranked up, I got an error message (that I lost), but:

p=0 s=0 v=0

...here's the first mmap call:

nfs: mmap(//mtxfile)

...then we see this:

NFS: revalidating (0:13/1755725)
NFS call  getattr
NFS: 15394 initiated write call (req 0:13/1755725, 32768 bytes @ offset 4096)
NFS: 15395 initiated write call (req 0:13/1755725, 4096 bytes @ offset 36864)
NFS: nfs_update_inode(0:13/1755725 ct=2 info=0x6)
NFS: mtime change on server for file 0:13/1755725
NFS: 15394 nfs_writeback_done (status 32768)

...what happened to the stuff at offset 0? Later in the file, I never see that offset written out, but I do see this a little later on:

NFS: nfs_readpages (0:13/1755725 10)
NFS: 15398 initiated read call (req 0:13/1755725, 32768 bytes @ offset 0)
NFS: 15399 initiated read call (req 0:13/1755725, 8192 bytes @ offset 32768)

...so here the kernel decided to go ahead and reload page 0, even though it hadn't been written out. We also know from the design of the program that it should definitely have been written out. Looking now to see if I can determine how the decision to write out pages was made here...

-- Additional comment from tao on 2008-02-25 13:50 EST --

----- Additional Comments From nevdull.com (prefers email at ricklind.com) 2008-02-25 13:47 EDT -------

jlayton: in our early testing we noticed a similar strange coincidence -- when a failure was detected, it was always the first time through the loop for the victim. Made us wonder if our initial view of the world was mistaken when we chose a free slot, but we couldn't spot anything obvious to back up that thinking.

This event sent from IssueTracker by Glen Johnson
issue 163042

-- Additional comment from jlayton on 2008-02-25 14:04 EST --

Ok, bug in my program :-) I had this:

for (i=1; i<NRTHREADS; i++) {

...in both of my insert and delete loops. i needs to start at 0. That explains why the first page wasn't being written out: it was never getting changed.
After making this change, the problem seems to be more difficult to hit for some reason. Probably timing-related somehow, but I'm not sure yet...

-- Additional comment from tao on 2008-02-26 09:46 EST --

Nokia runs their IBM Rational ClearCase SCM environment mainly on top of the RHEL platform. Recently there have been many issues, especially with the newer ClearCase releases. IBM has given Nokia a list of issues on RHEL which affect their ClearCase implementation. This issue is part of that list and has been reported by Nokia as a problem they would like to see fixed.

This event sent from IssueTracker by sprabhu
issue 163042

-- Additional comment from jlayton on 2008-02-26 10:35 EST --

Created an attachment (id=295935)
simpler reproducer 2

Another reproducer. This one dumps the state of the problem page once we hit a failure. I've also tried to make it so that as soon as a problem is detected, the other threads stop what they're doing. I don't think that's working right, though, since the erroring thread is probably exiting before that...

The fact that this reproducer works lends credence to the idea that the problem is a read/write race and not an issue of inconsistency in the mmap between processes. Each process in this reproducer is writing to its own area of each page, so inconsistencies shouldn't really matter...

I've also started running this against a server that uses xfs as the underlying filesystem, since xfs has sub-second timestamps. I think the problem is most likely related to cache invalidation events, and I'm hoping that will make them occur more frequently. In any case, current run:

# ./mmap-repro-loop.sh /mnt/xfs/mtxfile
Run 1
Run 2
16852: error p=0 s=8 v=0
(16852) 0: 0 0 0 0 0 0 0 0 0 16853
<16853> lock errsem (11:Resource temporarily unavailable)

...this means that process 16852 was expecting to see its pid in page 0, slot 8, but there's a 0 in there instead.
I've also added a bit of debug code to dump the first 10 bytes of the page prior to and after a readpage call:

NFS: nfs_readpage (0000010000bc88d8 4096@0)
NFS: readpage_async page=0000010000bc88d8 len=4096
NFS: page contents: 0 0 0 0 0 0 0 0 16852 0
    ^^^ page contents before the readpage was done. The pid seems to be present.
NFS: 25468 initiated read call (req 0:13/131, 4096 bytes @ offset 0)
NFS: nfs_update_inode(0:13/131 ct=1 info=0x6)
NFS: 25468 nfs_readpage_result, (status 4096)
NFS: page contents: 0 0 0 0 0 0 0 0 0 0
    ^^^ page contents after -- it's no longer there.
NFS: read done (0:13/131 4096@0)

...this suggests that the page contents are being overwritten by a read from the server before they are flushed out in a write. I need to instrument the corresponding writepages calls too, however, to make sure I'm understanding this correctly.

-- Additional comment from jlayton on 2008-02-26 10:43 EST --

If my conclusion in comment #60 is correct, this is probably indicative of a couple of different problems:

1) We may be too aggressive about invalidating the cache. If so, this could be causing performance issues. That said, prior to 4.5 we were definitely not aggressive enough about invalidating the cache, and that led to cache consistency problems as well. That change in behavior is probably what makes the next problem more evident. There may not be much that we can do about this problem.

2) We're occasionally allowing a read call to overwrite a dirty page. The mechanism for this isn't clear to me yet, but I think this is the real bug here.

I'm going to continue trying to track this down...
-- Additional comment from tao on 2008-02-26 11:30 EST --

file attachment from LTC:simple-reproducer2.c

This event sent from IssueTracker by Glen Johnson
issue 163042
it_file 121097

-- Additional comment from jlayton on 2008-02-26 13:35 EST --

Another run:

[root@dhcp231-236 432974]# ./mmap-repro-loop.sh /mnt/xfs/mtxfile
Run 1
Run 2
26374: error p=0 s=0 v=0
(26374) 0: 0 0 0 26377 26378 0 0 0 0 0

Slightly updated debug patch (I'll post that in a bit) that shows the page flags:

NFS: nfs_readpage (00000100008612d0 4096@0)
NFS: readpage_async page=00000100008612d0 len=4096
NFS: dump_page flags=20000065 contents= 26374 0 0 0 0 0 0 0 0 0
NFS: 46999 initiated read call (req 0:13/131, 4096 bytes @ offset 0)
NFS reply access, status = 0
NFS: permission(0:13/131), mask=0x6, res=0
NFS: nfs_update_inode(0:13/131 ct=1 info=0x6)
NFS: 46999 nfs_readpage_result, (status 4096)
NFS: dump_page flags=2000006d contents= 0 0 0 0 0 0 0 0 0 0

...the main difference is that the PG_uptodate flag is set after the read completes (which is not unexpected). What I need to understand now is why we're allowing the read to happen when the page has been modified by userspace...

-- Additional comment from tao on 2008-02-26 13:50 EST --

----- Additional Comments From ffilz.com 2008-02-26 13:48 EDT -------

Hmm, any idea why PG_dirty is not set?

This event sent from IssueTracker by Glen Johnson
issue 163042

-- Additional comment from jlayton on 2008-02-26 15:38 EST --

Created an attachment (id=295990)
simpler reproducer 3

Modified reproducer to divorce the number of pages from the number of threads.
I can now reproduce this when dealing with a single page, and that seems to make it occur a bit more easily -- I suppose because access to the page is more hotly contended.

-- Additional comment from jlayton on 2008-02-26 15:43 EST --

Not sure about the PG_dirty bit yet, but I think the real problem is that we're doing this:

NFS: revalidating (0:13/131)
NFS call  getattr
NFS reply getattr
NFS: nfs_update_inode(0:13/131 ct=1 info=0x6)
NFS: (0:13/131) data cache invalidated

...when we see "data cache invalidated", we know that invalidate_inode_pages2() has been called and the PG_uptodate bit has been cleared. In the example in comment #65, had pid 26374 already written to the page when invalidate_inode_pages2() was called here? If so, why wasn't the page written back to the server? If not, why wasn't the PG_uptodate bit set when the page was modified?

Now that we have the reproducer down to using a single page, this is a little easier to focus in on...

-- Additional comment from jlayton on 2008-02-27 09:04 EST --

Created an attachment (id=296065)
kernel debug patch 1

Patch I'm currently using for debugging. It just adds some extra functions to dump page contents and flags at different points along the way.

-- Additional comment from jlayton on 2008-02-27 09:38 EST --

Run with the above debug patch applied and nfs_debug turned up to 32767:

Run 11
5140: error p=0 s=8 v=0
(5140) 0: 0 0 0 0 0 0 0 5139 0 0
<5139> lock errsem (11:Resource temporarily unavailable)
<5137> lock errsem (11:Resource temporarily unavailable)

...looking in the logs:

nfs: mmap(//mtxfile) pid=5140
NFS: nfs_update_inode(0:13/131 ct=2 info=0x6)
NFS: mtime change on server for file 0:13/131
NFS: 832 nfs_writeback_done (status 4096)
NFS: dump_page page=0000010000a67228 flags=2000202c contents= 0 0 0 0 0 0 0 0 0 0

^^^ this is from the nfs_dump_mapping_pages call. The following messages indicate that the page was written back to the server. Notice that the pid has not been written to the page.
2000202C = 100000000000000010000000101100
         = PG_mappedtodisk | PG_lru | PG_uptodate | PG_referenced | PG_writeback

NFS: write (0:13/131 4096@0) marked for commit
NFS: nfs_wait_on_requests returns 1
NFS: 833 initiated commit call
nfs: fsync(0:13/131)
NFS: nfs_update_inode(0:13/131 ct=2 info=0x6)
NFS: 833 nfs_commit_done (status 0)
NFS: commit (0:13/131 4096@0) OK
NFS: nfs_wait_on_requests returns 1
NFS: nfs_wait_on_requests returns 0
nfs: flush(0:13/131)
NFS: nfs_wait_on_requests returns 0

...write, commit, etc...

NFS: revalidating (0:13/131)
NFS call  getattr
NFS reply getattr
NFS: nfs_update_inode(0:13/131 ct=1 info=0x6)

...we revalidate the mapping again...

NFS: dump_page page=0000010000a67228 flags=2000002c contents= 0 0 0 0 0 0 0 0 5140 0
NFS: nfs_wait_on_requests returns 0
NFS: nfs_wait_on_requests returns 0
NFS: (0:13/131) data cache invalidated

...the dump_page output is from the nfs_dump_mapping_page call. It shows that the pid has been written to memory, but nfs_revalidate_mapping did not make it write back the page. Page flags at that time were:

2000002C = 100000000000000000000000101100
         = PG_mappedtodisk | PG_lru | PG_uptodate | PG_referenced

I think the main difference is that in the first case, the page was already under writeback when nfs_revalidate_mapping() was called.

-- Additional comment from jlayton on 2008-02-27 09:41 EST --

This comment above nfs_writeback_done_full() is interesting:

/*
 * Handle a write reply that flushes a whole page.
 *
 * FIXME: There is an inherent race with invalidate_inode_pages and
 *        writebacks since the page->count is kept > 1 for as long
 *        as the page has a write request pending.
 */

...though the nature of the race is a little vague, and I'm not sure it's the problem here.
-- Additional comment from jlayton on 2008-02-27 09:47 EST --

To summarize: the problem appears to be that the client occasionally notices that the file has changed on the server and invalidates pages that have been modified via the shared mmap, without initiating writeback on them.

My main question at this point is: we're definitely calling nfs_writepages() on the mapping, so why is it not picking up that the page needs to be written back? I've never seen the PG_dirty bit set, so I'm not sure that nfs_writepages uses that to decide whether to write back the page. If not, what criteria *does* it use?

-- Additional comment from jlayton on 2008-02-27 14:26 EST --

Actually, I take that back. I generally do see PG_dirty set on the page, but only when the file is initially written out in the reproducer.

In my (possibly wrong) understanding, PG_dirty should be getting set by the MMU when the page is updated via the mmap from userspace, and should only be getting cleared after the page is written out. The writeback should all be done with the page locked, which should prevent userspace from modifying the page while writeback is occurring, correct?

The question is -- why do we never see the PG_dirty bit set in the debug output? Pages are clearly being dirtied and no writeback is occurring, which seems to suggest that PG_dirty is never getting set for some reason.

-- Additional comment from staubach on 2008-02-27 16:52 EST --

It could be that PG_DIRTY is not getting set, but it could also be being cleared in invalidate_inode_pages2() before it can be detected.

-- Additional comment from jlayton on 2008-02-27 18:44 EST --

IRC notes from Peter:

16:26 < ps> Sorry to bother -- but check out invalidate_complete_page2 which is called from invalidate_inode_pages2 in the upstream kernel...
16:26 < ps> It does not invalidate the page if the page is dirty.
16:26  * ps thinks that steved is a funny guy and clearly does not have enough to do, so needs this bz?
16:29 < ps> I think that that might address this, if the problem is that the page is redirtied after being cleaned but before invalidate_inode_pages2 is done.
16:30 < steved> ps: it would not be the first time we found a problem in that invalidate_inode_pagesX path...
16:30 < steved> and no... I do not need any more bz! :_
16:30 < ps> Yes, I am not surprised. Invalidating pages is a tricky business.
16:31 < steved> especially for all the different filesystems...
16:32 < ps> At least, we could instrument invalidate_inode_pages2 to do something interesting when it detects a dirty page...
16:32 < steved> yes...

...I'll plan to try to confirm whether Peter's suspicion is correct tomorrow, and have a look at what he's mentioning in the upstream kernel if so.

-- Additional comment from jlayton on 2008-02-28 08:22 EST --

Good call, Peter! Sprinkled some nfs_dump_page() calls in that area of nfs_revalidate_inode. Snippet from that patch:

----------[snip]-----------
 	if (nfsi->cache_validity & NFS_INO_INVALID_DATA) {
 		if (S_ISREG(inode->i_mode)) {
+			nfs_dump_mapping_page(mapping, 10);
 			if (filemap_fdatawrite(mapping) == 0)
 				filemap_fdatawait(mapping);
 			nfs_wb_all(inode);
+			nfs_dump_mapping_page(mapping, 10);
 		}
 		invalidate_inode_pages2(mapping);
+		nfs_dump_mapping_page(mapping, 10);
----------[snip]-----------

Output from the run:

Run 19
13247: error p=0 s=4 v=0
(13247) 0: 0 0 0 0 0 0 0 0 0 0

...so we were expecting to see pid 13247 in there, but it wasn't.
Debug output:

NFS: nfs_update_inode(0:13/131 ct=4 info=0x6)
NFS: mtime change on server for file 0:13/131
NFS: dump_page page=0000010000616aa8 flags=20000021 contents= 0 0 0 0 0 0 0 0 0 0

^^^ I think this one is from just before we call filemap_fdatawrite()

NFS: nfs_wait_on_requests returns 0
NFS: nfs_wait_on_requests returns 0
NFS: dump_page page=0000010000616aa8 flags=20000021 contents= 0 0 0 0 0 0 0 0 0 0

^^^ just after nfs_wb_all()

NFS: 2255 nfs_readpage_result, (status 4096)
NFS: dump_page page=0000010000616aa8 flags=20000029 contents= 0 0 0 0 0 0 0 0 0 0
NFS: read done (0:13/131 4096@0)

^^^ read that raced in there somehow

NFS: dump_page page=0000010000616aa8 flags=20000024 contents= 0 0 0 0 13247 0 0 0 0 0

^^^ just after invalidate_inode_pages2 was called

NFS: (0:13/131) data cache invalidated
NFS: nfs3_forget_cached_acls(0:13/131)
NFS: (0:13/131) revalidation complete

So it looks like the change to the page contents races in between nfs_wb_all() and invalidate_inode_pages2(). I'll have a look over the invalidate_inode_pages2() codepath today, and the upstream patch that Peter mentioned on IRC.

-- Additional comment from jlayton on 2008-02-28 09:36 EST --

Created an attachment (id=296212)
possible upstream patch

Actually, it looks like RHEL4 already does something similar in invalidate_complete_page(): if the page is dirty, it too returns 0 and shouldn't invalidate anything. That function is not called, however, if the page is mapped (and I think it would be in this case)...

This patch from the old-2.6-bkcvs tree might be where we need to start, but it looks like there were some bugs with it that were fixed later. See also:

commit 1bdeba19a4885374a0031b25e6f2326d486bb65e
Author: zach.brown <zach.brown>
Date:   Sat Mar 5 17:26:38 2005 +0000

    [PATCH] invalidate range of pages after direct IO write

...and...
commit 2eff44802548850761e79c1257db48b9a19d91dd Author: sct <sct> Date: Thu Mar 10 16:39:55 2005 +0000 [PATCH] invalidate_inode_pages2_range livelock fix ...also from old-2.6-bkcvs tree. Still, poking around in this area at this late stage of RHEL4's lifecycle makes me a little nervous. This could break more than it solves...
I've decided to continue working on this in RHEL5 for now. It seems to be easier to reproduce this problem on RHEL5, and its implementation of invalidate_inode_pages2 is more up to date. I backported a couple of NFS patches that seemed like they might help this there (presuming that the race is the same as the one we see in RHEL4). I'll attach them in a bit -- they seem to help some. The test would usually fail on RHEL5 very quickly, but these patches seem to make it take a bit longer. I think I'm going to have to roll a RHEL5 version of this debug patch to verify that the problem is the same there...
Created attachment 296231 [details] proposed patch 1 Subject: [PATCH] NFS: Deal with failure of invalidate_inode_pages2() If invalidate_inode_pages2() fails, then it should in principle just be because the current process was signalled. In that case, we just want to ensure that the inode's page cache remains marked as invalid. Also add a helper to allow the O_DIRECT code to simply mark the page cache as invalid once it is finished writing, instead of calling invalidate_inode_pages2() itself. Signed-off-by: Trond Myklebust <Trond.Myklebust> Signed-off-by: Andrew Morton <akpm> Signed-off-by: Linus Torvalds <torvalds> ... Proposed patch #1. I don't agree with Trond's statement above though. I think I've seen invalidate_inode_pages2 return an error even when the process wasn't signaled. Though maybe there were differences between RHEL5's iip2() and the upstream one at the time that this patch went in.
Created attachment 296232 [details] proposed patch 2 Second patch backported from upstream to deal with races in this codepath. ...might need steved and dhowells to look over it and make sure I've gotten the fscache stuff right...
These two patches seem to help things on RHEL5 -- the reproducer lasts quite a bit longer with them in place. They need more careful review though; I backported them rather hastily... There still seems to be a lingering race somewhere. It looks like the page is getting dirtied while we're inside nfs_sync_mapping(), but it's returning 0.
Ok, so the page does seem to be getting dirtied while in nfs_sync_mapping(). So we enter invalidate_inode_pages2() with a dirty page in the mapping. It then unmaps the page, and does this near the bottom of the loop in invalidate_inode_pages2_range: was_dirty = test_clear_page_dirty(page); if (!invalidate_complete_page2(mapping, page)) { if (was_dirty) set_page_dirty(page); ret = -EIO; } unlock_page(page); ...which seems goofy to me. test_clear_page_dirty(page) is going to clear the dirty flag. We'll then invalidate the page (since it's no longer marked dirty) in invalidate_complete_page2(). Only if that fails for some other reason will we reset the dirty flag and return error. That seems wrong to me (at least in this situation). If the page is dirty once we get to that point we shouldn't just be discarding it like this...
Did a similar debug patch to the one I had for rhel4 and instrumented before and after nfs_sync_mapping() and before invalidate_inode_pages2(): NFS: dump_page page=ffff81000184fa60 flags=18080000000021 contents=0 0 0 0 0 0 0 0 0 0 NFS: after nfs_sync_mapping (ret=0): NFS: dump_page page=ffff81000184fa60 flags=1808000000087c contents=0 0 0 0 0 0 14101 0 0 0 NFS: before invalidate_inode_pages2: NFS: dump_page page=ffff81000184fa60 flags=1808000000087c contents=0 0 0 0 0 0 14101 0 0 0 NFS: after invalidate_inode_pages2 (ret=0) The problem was detected by pid 14101 -- it was expecting to see its pid but instead saw a 0 in its slot.
I think I see some of the patchset that went in to fix this final race. This is the main one I think: From e3db7691e9f3dff3289f64e3d98583e28afe03db Mon Sep 17 00:00:00 2001 From: Trond Myklebust <Trond.Myklebust> Date: Wed, 10 Jan 2007 23:15:39 -0800 Subject: [PATCH] [PATCH] NFS: Fix race in nfs_release_page() NFS: Fix race in nfs_release_page() invalidate_inode_pages2() may find the dirty bit has been set on a page owing to the fact that the page may still be mapped after it was locked. Only after the call to unmap_mapping_range() are we sure that the page can no longer be dirtied. In order to fix this, NFS has hooked the releasepage() method and tries to write the page out between the call to unmap_mapping_range() and the call to remove_mapping(). This, however leads to deadlocks in the page reclaim code, where the page may be locked without holding a reference to the inode or dentry. Fix is to add a new address_space_operation, launder_page(), which will attempt to write out a dirty page without releasing the page lock. Signed-off-by: Trond Myklebust <Trond.Myklebust> Also, the bare SetPageDirty() can skew all sort of accounting leading to other nasties. [akpm: cleanup] Signed-off-by: Peter Zijlstra <a.p.zijlstra> Cc: Trond Myklebust <Trond.Myklebust> Signed-off-by: Andrew Morton <akpm> Signed-off-by: Linus Torvalds <torvalds> ...the problem here, of course, is that adding a new address_space_op is probably out of the question, since that would blow kabi out of the water (though maybe we could do it by adding a new address_space flag to signify those that actually have the field). Still, I may go ahead and just backport this patch and a couple of others as a proof-of-concept and see if I can eliminate the race with it. We can then look at how best to fix up kabi afterward...
Backported 2 more patches to this set to attempt to fix this: Author: Trond Myklebust <Trond.Myklebust> Date: Tue Dec 5 00:35:40 2006 -0500 BZ#435291: NFS: Add nfs_set_page_dirty() Upstream commit 1a54533ec8d92a5edae97ec6ae10023ee71c4b46 ...and... Author: Trond Myklebust <Trond.Myklebust> Date: Tue Dec 5 00:35:42 2006 -0500 BZ#435291: NFS: Fix nfs_release_page Upstream commit fa8d8c5b77a2dc467b5365a5651710161b84f16d ...though I can still reproduce this even with this set. As a final stab at this, I may try to go ahead and backport the patch that adds the launder_page call, but if the other set doesn't fix it then I have a feeling that that won't either. My guess is that we're hitting a problem in the core invalidate_inode_pages2 implementation and we'll need to either patch it or add a parallel call that handles this correctly. I'll plan to take a harder look next week and see if I can pinpoint where exactly the race is coming into play with these other patches in place.
I've found that running against a netapp is a bit more reliable way to reproduce this... In addition to the patches that I've backported to try and fix this, I've also added some more debugging info. Unfortunately, adding the debug code seems to change the timing in such a way that it makes the problem go away. When I turn off the debug info, it returns quickly. I'll need to ponder some less invasive method to track this down...
Started with a kernel without all of these patches to make sure I understand the problem in RHEL5 correctly. I added a bit of debug code: So here pid 10979 checks the page validity, finds that it's wrong, dumps the (clean) page from the mapping and does a read, and then updates the page in memory: NFS:(10979) dump_page page=ffff81000102abe0 flags=808000000086c contents=0 0 0 0 0 0 0 0 0 0 nfs_revalidate_mapping:(10979) invalidate_inode_pages2 returns 0 NFS:(10979) dump_page page=ffff81000102abe0 flags=8080000000004 contents=0 0 0 0 0 0 0 0 0 0 NFS(10979) (0:17/2845690) data cache invalidated NFS: nfs_readpages (0:17/2845690 1) NFS: 0 initiated read call (req 0:17/2845690, 4096 bytes @ offset 0) nfs_readpage_result: 7918, (status 4096) NFS: read done (0:17/2845690 4096@0) ...next 10978 calls invalidate_inode_pages2 nfs_revalidate_mapping:(10978) calling invalidate_inode_pages2 Page is clearly dirty at this point: NFS:(10978) dump_page page=ffff8100012ad420 flags=808000000003c contents=0 0 0 0 0 0 0 0 10979 0 ...invalidate_inode_pages2 returns 0, and the page is now clean, _but_ there doesn't seem to have been any writeback activity in here. Why not? nfs_revalidate_mapping:(10978) invalidate_inode_pages2 returns 0 NFS:(10978) dump_page page=ffff8100012ad420 flags=8080000000004 contents=0 0 0 0 0 0 0 0 10979 0 NFS(10978) (0:17/2845690) data cache invalidated The data cache is invalidated, and the page is now clean and unattached to the mapping. We do another read into a new page that gets added to the inode mapping: NFS: nfs_readpages (0:17/2845690 1) NFS: 0 initiated read call (req 0:17/2845690, 4096 bytes @ offset 0) nfs_revalidate_mapping:(10980) calling invalidate_inode_pages2 NFS:(10980) dump_page page=ffff810000dd5240 flags=80000000021 contents=0 0 0 0 0 0 0 0 0 0 ...so the problem in this situation was that invalidate_inode_pages2 cleaned a dirty page without writing it back. The question is why...
Ok, it looks like upstream had a patch that went in a while back that changed some of the invalidate_inode_pages2 semantics: From fba2591bf4e418b6c3f9f8794c9dd8fe40ae7bd9 Mon Sep 17 00:00:00 2001 From: Linus Torvalds <torvalds.org> Date: Wed, 20 Dec 2006 13:46:42 -0800 Subject: [PATCH] VM: Remove "clear_page_dirty()" and "test_clear_page_dirty()" functions ...in particular, it has this delta for iip2 which doesn't seem to depend on the new functions being added. This makes sense to me, I think. The main way that invalidate_complete_page2 returns an error is when PagePrivate is set and we fail to release the page. In this case, it was *not* set, so the page was happily cleaned. This seems wrong to me -- we shouldn't be trying to clean any dirty pages without attempting to write them back... - was_dirty = test_clear_page_dirty(page); - if (!invalidate_complete_page2(mapping, page)) { - if (was_dirty) - set_page_dirty(page); + if (!invalidate_complete_page2(mapping, page)) ret = -EIO; - } ...I plan to play around with modifying this and seeing if it makes any difference.
After fixing the above delta and applying the patch to add the launder_page operation, I've been able to run the reproducer for over 5000 iterations without it hitting the inconsistency. The downside is: 1) this patchset is large and is almost certainly a kabi breaker. It also mucks with the invalidate_inode_pages2() function, which is always a bit scary... 2) with this patchset I get a lot of failed mmap() operations. I think what's happening is that nfs_file_mmap calls nfs_revalidate_mapping. This causes invalidate_inode_pages2 to get called which then returns -EIO because the page is being dirtied by the time it calls invalidate_complete_page2(). This makes the nfs_file_mmap operation fail. I see some similar failures on rawhide kernels, but the frequency on this patched RHEL5 kernel is much higher. Now that I *think* I see the problem with invalidate_inode_pages2, I may try to shoot for a smaller, more targeted patch and see if we can't fix this a little less invasively.
Looking back at the info in comment #10... NFS:(10978) dump_page page=ffff8100012ad420 flags=808000000003c contents=0 0 0 0 0 0 0 0 10979 0 ...this page was dirty before we enter invalidate_inode_pages2, but PG_private seems to be unset. So even if we stop clearing the dirty bit before calling invalidate_complete_page2, we still won't call try_to_release_page and it won't be written back. That mainly gets set during writeback apparently, so since the other process raced in and updated the page after the writeback completed, it's not set here. I think the right approach might be to turn invalidate_inode_pages2_range into a wrapper around a new function that does close to the same thing but also takes a "launder_page" function pointer. In most cases that will be called with a NULL pointer and it will do what it does now, but when NFS calls this directly we'll take extra care to try and write back the page after unmapping it. This might take some careful adjustment of the nfs_releasepage function as well since we may be interfering with it.
I am not sure that I see the value in this over just remapping the page when it is discovered that it is dirty. Writing out the contents to the server makes it a valid page, does it not?
Yes, but is there a potential race there? Suppose we have 2 tasks with the same page mapped: Task1 calls invalidate_inode_pages2_range which calls unmap_mapping_range... Task2 incurs a page fault and does a readpage to attach a new page where the old one was. Makes a change to that page and dirties it. Task1 checks the original page, finds it dirty and then remaps the original page back into the mapping. What happens at that point? (we're getting outside of my meager VM knowledge here :-) Can we even undo unmap_mapping_range like that? If so, how? Presuming we can just put the mapping back, there may be a way to do what you suggest and prevent this race I suppose -- maybe by holding the i_mutex for the inode?
Right. The trick is to prevent the new page from being created until we are done with the old page. The page fault code needs to be blocked until the process which wants to invalidate all of the pages has completed. Off hand, I don't know how to do this yet, but I will take a look.
At this point, I think we could probably use Larry's expertise here. Let me see if I can summarize the problem: When the cache revalidation logic decides to invalidate the pagecache for a file on NFS, nfs_revalidate_mapping will be called. It in turn will call nfs_sync_mapping() to write back any dirty pages and then call invalidate_inode_pages2() to unmap them. The problem is that several races are possible here. The page can be redirtied after nfs_sync_mapping completes but before invalidate_inode_pages2 (iip2) can call unmap_mapping_range. It looks like iip2 tries to handle this situation correctly, but it does a pretty poor job. From invalidate_inode_pages2_range: -------------[snip]-------------- was_dirty = test_clear_page_dirty(page); if (!invalidate_complete_page2(mapping, page)) { if (was_dirty) set_page_dirty(page); ret = -EIO; } -------------[snip]-------------- ...so if the page is dirty, we clean it, and only if invalidate_complete_page2() returns an error do we set the page dirty again and return an error. There is some logic in nfs_releasepage to attempt a page writeback, but try_to_release_page() only gets called if PagePrivate is set. That only seems to be so when the page already has an outstanding write request... So my question is: When the page is redirtied just before the page is unmapped in invalidate_inode_pages2, should it just be clearing the dirty bit on the page like this without attempting to write it back? I'm presuming the answer there is no... If that does occur, I presume that we should expect to get -EIO back from invalidate_inode_pages2. We'll need to change the NFS code that calls that function to expect this possibility and deal with it accordingly.
> Right. The trick is prevent the new page from being created until we are > done with the old page. The page fault code needs to be blocked until the > process which wants to invalidate all of the pages is completed. I did a bit of reading in Documentation/filesystems/vfs.txt: "Adding and removing pages to/from an address_space is protected by the inode's i_mutex." It's not clear to me where it's taken in that codepath and whether we can call invalidate_inode_pages2 with that held without deadlocking...
Created attachment 297648 [details] experimental patch 1 -- don't clean pages as aggressively in invalidate_inode_pages2_range First patch in a smaller, more targeted patchset I've been working with.
Created attachment 297649 [details] experimental patch 2 -- have nfs_revalidate_mapping deal with errors from invalidate_inode_pages2 Second patch
I did some testing on a kernel with the two patches in comment #19 and comment #20. I was able to run the reproducer for over an hour with no corruption. I ended up killing the reproducer and it never failed. The downside is that this does cause nfs_file_mmap to occasionally fail when it races with pages being dirtied: Run 24939 Run 24940 <10910> mmap failed (5:Input/output error) <10911> mmap failed (5:Input/output error) <10912> mmap failed (5:Input/output error) <10913> mmap failed (5:Input/output error) Run 24941 <10925> mmap failed (5:Input/output error) Run 24942 ...this seems to happen fairly frequently. I think this is probably an improvement -- it's better to have the mmap fail rather than corrupt the data, but it's not ideal. We may be able to fix these though with a mutex to serialize mmap and nfs_revalidate_mapping. I want to do a bit more testing with these two patches and then I'll see if we can do that. This is not what Peter was suggesting. I'm still open to his idea, but this does seem to more closely follow upstream behavior. Upstream seems to behave similarly -- we get the occasional failed mmap there, but the frequency seems to be a lot lower than on RHEL5 with these patches. One of these patches touches invalidate_inode_pages2_range(). I'd like for Larry to look at that one and comment when he gets some time. I think it's a sane change, but that's definitely an area where I could use some guidance.
I don't think my serialization idea in comment #21 will work. The pages are being dirtied via userspace mmap and that's completely disconnected from any kernel activity. I think all we can really do is try to unmap the pages, and then to make an attempt to write them back if they're dirty after being unmapped. I have a patch that adds some launder_page functionality, but it doesn't seem to work correctly at the moment. The nfs_wb_page() function seems to work completely differently upstream than it does in RHEL5, so I think we'd need to have nfs_launder_page do something different. The RHEL5 version doesn't seem to actually clean the page. PG_dirty seems to be consistently set on exit from the function. I've experimented a bit with different ways of attempting to write the page back in nfs_launder_page, but they've all been buggy. It's a bit tricky since we're calling this from invalidate_inode_pages2_range and the page is already locked, etc...
If the page is locked and unmapped and the mapping is locked, then would it be possible to just map it back in instead of attempting to clean it and destroy it?
Good question... Presuming that that's possible and that it's acceptable to do this, what would be the proper return value in this case? -EAGAIN? Then again, is it even necessary to remap the pages like that? Presuming that we don't remove the page from the pagecache, then it should still be lingering around. The next pagefault from userspace should make the page get mapped back into the process' pagetables. That may actually be a source of one of the races I've been seeing, now that I think about it. Even though we call unmap_mapping_range, is there anything that prevents the page from being mapped back and dirtied before we get a chance to write it out?
It seems to me that, at this point, we don't want to have to worry about writing out the page. We tried earlier and if it is dirty again, then we know that someone is accessing it and we should just leave it alone. However, we do have to ensure that it is not accessible to userspace while we are checking to see whether it is dirty or not and also that another page can not be created with this same identity in the meantime.
I don't think we can create another page with the same identity until this one has been removed from the pagecache (as in invalidate_complete_page2()). With this patchset, that should not be happening now when the page is dirty, since invalidate_complete_page2 should generally just fail in that case... That said, I decided to test the original reproducer for this problem and hit an issue after 253 runs: Run 253 <5353> can't find me to logout - expected at slot 2 (5353) 0: val=-1 next=1 (5353) 1: val=5351 next=-1 (5353) 2: val=5353 next=-1 (5353) 3: val=-1 next=-1 (5353) 4: val=-1 next=-1 <5352> mmap failed (5:Input/output error) ...I haven't looked in detail at why it failed yet.
Modified the linked-list reproducer a bit so that it prints out the pids for each run that are running concurrently: 25859: 25860 25861 25862 25863 25864 25865 25866 25867 25868 25869 <25869> can't find me to logout - expected at slot 1 (25869) 0: val=-1 next=-1 (25869) 1: val=25869 next=1 (25869) 2: val=-1 next=-1 (25869) 3: val=-1 next=-1 (25869) 4: val=-1 next=-1 (25869) 5: val=-1 next=-1 (25869) 6: val=-1 next=-1 (25869) 7: val=-1 next=-1 (25869) 8: val=-1 next=-1 (25869) 9: val=-1 next=-1 (25869) 10: val=-1 next=-1 ...I also added some debug code to tell me when invalidate_inode_pages2 returns error, and changed the dfprintk for the "data cache invalidated" message to be printed with my private NFSDBG_MAPPING bit. NFS: (25862) (0:17/2489340) data cache invalidated NFS: (25861) (0:17/2489340) data cache invalidated NFS: (25863) (0:17/2489340) data cache invalidated NFS: (25869) (0:17/2489340) data cache invalidated NFS: (25866) (0:17/2489340) data cache invalidated ...so when the above run failed, invalidate_inode_pages2 did not return error to nfs_revalidate_mapping. It did, however, invalidate the cache. One thing that might help this is figuring out why we're doing so many cache invalidations with this reproducer. This host is the only writer of the file, so it shouldn't be invalidating the cache so often. While that wouldn't really "fix" this problem, I think that it would make it much more difficult to hit and would probably help performance some here and possibly in other situations.
With the 2 patches I have so far, I'm no longer able to hit the race with my own reproducer, but the mutex_lt3 reproducer can still hit it. I'm not sure of why this is, but it may have something to do with the fact that that reproducer deals with multiple pages... On a hunch, I decided to try the following patch: --- a/fs/nfs/inode.c +++ b/fs/nfs/inode.c @@ -704,13 +704,16 @@ int nfs_revalidate_mapping(struct inode *inode, struct add || nfs_attribute_timeout(inode)) ret = __nfs_revalidate_inode(NFS_SERVER(inode), inode); + mutex_lock(&inode->i_mutex); if (nfsi->cache_validity & NFS_INO_INVALID_DATA) { nfs_inc_stats(inode, NFSIOS_DATAINVALIDATE); if (S_ISREG(inode->i_mode)) nfs_sync_mapping(mapping); err = invalidate_inode_pages2(mapping); - if (err != 0) - return err; + if (err != 0) { + ret = err; + goto out; + } spin_lock(&inode->i_lock); nfsi->cache_validity &= ~NFS_INO_INVALID_DATA; @@ -724,6 +727,8 @@ int nfs_revalidate_mapping(struct inode *inode, struct addre inode->i_sb->s_id, (long long)NFS_FILEID(inode)); } +out: + mutex_unlock(&inode->i_mutex); return ret; } ------------------------------- There is some similar serialization code upstream, but it's a bit more complex. It's prob. not safe to take the i_mutex here in every case, but for this test it seems to work well enough. This actually seems to make the race harder to hit. I'm still able to hit it with this patch in place, so it's not a complete fix, but it does seem to help for reasons I don't completely understand. I suspect that we might have multiple threads ending up in the nfs_revalidate_mapping logic at once, and that accounts for at least some of the races. Since this doesn't fix it completely though, I have to assume that there are other races we're hitting too...
I spent some time looking to see if we can try to reduce the number of cache invalidations to work around this problem. I've identified some things we can do to try and help that, but haven't been able to eliminate the cache invalidations in this situation. The problem that I've seen now is that there are some races between write calls on the wire and the getattr that is done when the file is opened. When there are writes outstanding and a getattr is done, the mtime can seemingly reflect any intermediate time on the file. One thing we could consider is making sure that we wait for any outstanding writes to complete when doing a getattr, but that could significantly slow things down.
This is the price to pay if there is a need to validate the attribute cache while writing to a file. The file has to be quiesced, ie. not changing, while the validation is done. Do you know where the GETATTR is coming from? The WRITE responses contain attributes which should be used to update the attributes. There shouldn't be a need to issue the GETATTR for attribute cache validation purposes because the attribute cache is continually getting updated from the WRITE responses. All this said, I don't think that this problem can be solved by addressing the issue of cache invalidations done as a result of processing WRITE responses. There has to be a safe way implemented which allows the pages to be safely unmapped, cleaned if necessary, and then destroyed. How does the system do it under low memory situations when memory must be reclaimed?
The GETATTR call is coming from another process opening the file. It's doing the GETATTR as the open part of a CTO check. It just happens to be that the open sometimes races with writes from other threads. > All this said, I don't think that this problem can be solved by > addressing the issue of cache invalidations done as a result of > processing WRITE responses. Agreed, while this might help us from getting into this situation as often, we'll still be vulnerable to it in some cases (particularly if there are other hosts occasionally writing to the file).
------- Comment From ffilz.com 2008-03-14 10:52 EDT------- > The problem that I've seen now is that there are some races > between write calls on the wire and the getattr that is done when the file is > opened. When there are writes outstanding and a getattr is done, the mtime can > seemingly reflect any intermediate time on the file. Ok, so we have an open getattr racing with a write. Hmm, can we skip the getattr for the open if a write is in progress? But what do we do if we want to write and a getattr is in progress? With UDP, we can't guarantee ordering, can we with TCP? (If we can, I'd be comfortable requiring TCP for this type of situation). > Agreed, while this might help us from getting into this situation as often, > we'll still be vulnerable to it in some cases (particularly if there are other > hosts occasionally writing to the file). Hmm, can we hope to support two clients mmapping the same file (for write)?
> Hmm, can we skip the getattr for the open if a write is in progress? Perhaps. That would mean that we'd probably need to reintroduce the data_updates counter that the patch in bug 321111 removes. I played around a bit with reimplementing it today and still hit some races that caused the cache to be invalidated. I've been using TCP for all of this testing. Packet loss and network delays aren't really the cause of most of this. The problem seems to stem from the fact that the server doesn't necessarily respond to calls in the order that they were received. I suppose this basically comes down to the fact that we need to serialize getattrs vs. writes somehow. Allow writes to occur however the client wants, but make sure they're all complete before we do a getattr, and don't allow any new ones until the getattr completes. Maybe we could do this with some sort of rw semaphore? I'd need to think on it a bit. > Hmm, can we hope to support two clients mmapping the same file (for write)? Good question. With proper locking it seems like that should work. You'd have to make sure that when the client gets the lock and the file has changed since you last had it, the cache is invalidated and reloaded before allowing a process to make changes to the mmap. Regardless though, it might be best to consider this in the context of a different BZ. Preventing false cache invalidations is really a separate issue from what happens once the invalidation occurs.
I don't think that we want to muck with the open-to-close semantics at this point. If we need to serialize writes and opens, then so be it. This problem can be solved with a safe way to unmap pages which does not leave a window where data can be lost. The other issues of reducing cache invalidations can be solved separately.
Yep... I decided to step back a bit and make sure I understand what's supposed to happen when the page is mapped and dirtied. I've written a simple program to mmap a file, read from it via the mmap and write back to it via the mmap. I'm starting out by playing with this on a file in /tmp and will look at NFS after I see what happens here: I start the program, which opens the file and mmaps it: # ./mmap-test /tmp/testfile mmap at address 0x2aaaaaaac000 crash> vtop 0x2aaaaaaac000 VIRTUAL PHYSICAL 2aaaaaaac000 (not mapped) ...so there is no page attached at this point. Now, read from the mapping (incur a page fault): PTE PHYSICAL FLAGS 80000000158cc025 158cc000 (PRESENT|USER|ACCESSED|NX) VMA START END FLAGS FILE ffff8100153f6218 2aaaaaaac000 2aaaaaaad000 fb /tmp/testfile PAGE PHYSICAL MAPPING INDEX CNT FLAGS ffff810001515c80 158cc000 ffff81001eabfbc8 0 3 1008000000086c ...now, write to it: PTE PHYSICAL FLAGS 80000000158cc067 158cc000 (PRESENT|RW|USER|ACCESSED|DIRTY|NX) PAGE PHYSICAL MAPPING INDEX CNT FLAGS ffff810001515c80 158cc000 ffff81001eabfbc8 0 3 1008000000087c ...now msync: PTE PHYSICAL FLAGS 80000000158cc025 158cc000 (PRESENT|USER|ACCESSED|NX) PAGE PHYSICAL MAPPING INDEX CNT FLAGS ffff810001515c80 158cc000 ffff81001eabfbc8 0 3 1008000000086c ...nothing terribly surprising here. I suspect that what's happening is that the page is being allowed to be dirtied somehow after unmap_mapping_range() is complete and after the PageDirty check is done in invalidate_complete_page2(). Could it be something like: thread1: unmap_mapping_range completes (no more mmaps) thread2: page fault occurs, page is remapped back into process address space thread1: invalidate_complete_page2 is called and gets past the PageDirty check thread2: dirties the page thread1: __remove_from_page_cache ...there's probably some way to test for this, but I need to figure out how to do it in such a way that nothing can slip by.
It seems like mapping a page into the process' address space and dirtying it is pretty much lockless...
Testing with my current test kernels (which contain the patches I have so far for this). Here's the output from the run: 9216: 9217 9218 9219 9220 9221 9222 9223 9224 9225 9226 <9222> can't find me to logout - expected at slot 1 (9222) 0: val=-1 next=-1 (9222) 1: val=9222 next=1 (9222) 2: val=-1 next=-1 (9222) 3: val=-1 next=-1 (9222) 4: val=-1 next=-1 (9222) 5: val=-1 next=-1 (9222) 6: val=-1 next=-1 (9222) 7: val=-1 next=-1 (9222) 8: val=-1 next=-1 (9222) 9: val=-1 next=-1 (9222) 10: val=-1 next=-1 <9226> looped 2 (9226) 0: val=-1 next=1 (9226) 1: val=9222 next=1 (9226) 2: val=-1 next=-1 (9226) 3: val=-1 next=-1 (9226) 4: val=-1 next=-1 (9226) 5: val=-1 next=-1 (9226) 6: val=-1 next=-1 (9226) 7: val=-1 next=-1 (9226) 8: val=-1 next=-1 (9226) 9: val=-1 next=-1 (9226) 10: val=-1 next=-1 I ran this with the following systemtap script: probe kernel.function("__remove_from_page_cache") { if ($page->_mapcount->counter >= 0) { printf("__remove_from_page_cache(%u): _mapcount=%d flags=0x%x\n", pid(), $page->_mapcount->counter, $page->flags); print_backtrace(); } } ...here's the output: __remove_from_page_cache(9226): _mapcount=0 flags=0x8000000002d 0xffffffff800ca61f : __remove_from_page_cache+0x1/0x3d 0xffffffff800ce245 : invalidate_inode_pages2_range+0x158/0x209 0xffffffff8842075a : nfs_revalidate_mapping+0xac/0x1af [nfs] 0xffffffff8841e378 : nfs_file_mmap+0x21/0x4c [nfs] 0xffffffff8000e584 : do_mmap_pgoff+0x4f1/0x7da 0xffffffff80024cff : sys_mmap+0x89/0xc1 0xffffffff800602a6 : tracesys+0xd5/0xdf ...I think this may be related to the problem here. The page should have been unmapped when we called __remove_from_page_cache, but I'm guessing that sometime after unmap_mapping_range was called, a process incurred a page fault and mapped the page back into its page tables. It wasn't dirty at this point, but it easily could have been dirtied after this and before trashing the page. 
It seems like we need something that prevents the page from getting mapped back into anything's pagetables before we toss it out. I'm going to see if I can get a bit more mileage out of systemtap here...
Some of this may be incidental to the actual problem, though. While testing this, I've seen _mapcount go >= 0 when no problem occurred, and I've seen the problem occur with no message popping. The problem could still be the same, just with the race happening a bit later. I may see if I can do this same check after invalidate_complete_page2 returns, but that means rebuilding the kernel since that function is static.
Created attachment 298407 [details]
latest linked-list reproducer

This is the latest linked-list reproducer that I'm using to test this. It still fails reliably even with the two experimental patches above. To build it:

# gcc -Wall -g -o ./mutex_lt3 ./mutex_lt3-7.c -lpthread

...I then run it under a shell script that runs it in a loop:

-------[snip]---------
#!/bin/sh
RESULT=0
RUN=1
while [ $RESULT -eq 0 ]; do
    ./mutex_lt3 $1
    RESULT=$?
    RUN=`expr $RUN + 1`
done
exit $RESULT
-------[snip]---------

You run the shell script with a file on an NFS share as an argument. I've been using a netapp server. The fact that it has sub-second mtimes, I think, makes for more cache invalidations and helps to trigger the problem. The code in this reproducer is very hard to follow, but basically it has a bunch of threads adding and removing entries from a linked list that's on an mmapped file. It uses a semaphore for serialization. When run, one of the threads will eventually notice that the list is not what is expected and will exit with an error.
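The overall shape of the reproducer is several tasks mutating shared state that lives in a MAP_SHARED file mapping, then checking the result. The sketch below is illustrative only and deliberately simplified: the real mutex_lt3-7.c keeps a semaphore-serialized linked list of slots in the mapping, whereas here each forked child just fills its own slot (all names are hypothetical):

```c
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

#define NPROC 4

/* Several processes store into a shared file-backed mapping; the parent
 * then validates the combined result, much like the reproducer's list
 * consistency check. */
long run_mmap_pattern(const char *path)
{
    int fd = open(path, O_RDWR | O_CREAT | O_TRUNC, 0644);
    if (fd < 0)
        return -1;
    if (ftruncate(fd, NPROC * sizeof(long)) < 0)
        return -1;
    long *slots = mmap(NULL, NPROC * sizeof(long),
                       PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (slots == MAP_FAILED)
        return -1;

    for (int i = 0; i < NPROC; i++) {
        pid_t pid = fork();
        if (pid == 0) {
            slots[i] = i + 1;  /* store through the mapping: the page
                                  faults in, is mapped writable, dirtied */
            _exit(0);
        }
    }
    for (int i = 0; i < NPROC; i++)
        wait(NULL);

    long sum = 0;              /* on a local fs this is always 1+2+3+4;
                                  the bug here is that on NFS such stores
                                  could be lost across an invalidation */
    for (int i = 0; i < NPROC; i++)
        sum += slots[i];
    munmap(slots, NPROC * sizeof(long));
    close(fd);
    unlink(path);
    return sum;
}
```

Against a local filesystem this always succeeds; the failure mode in this bug only shows up when mtime-driven cache invalidation on NFS races with the stores.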
To summarize: the issue with this case seems to be that when a file is mmapped on NFS and processes are updating it, it occasionally ends up having different contents vs. what's expected. I *think* the problem is related to the client noticing mtime changes on the file and invalidating its cached contents based on that. I suspect that the problem is either:

1) we're occasionally invalidating a page that has dirty data

...and/or...

2) we're occasionally reading new data onto a dirty page

My initial thought was that the problem was a race where a page would be dirtied after writeback but before being invalidated. The two experimental patches attached to this case try to fix that. While they fix a simpler reproducer that I have, they do not seem to fix the reproducer in comment #38.
Actually... I'm not convinced that this problem is actually fixed in rawhide. It may be, but it may be that the problem is being sidestepped by reduced cache invalidations. When doing a mutex_lt3_loop test on rawhide (kernel 2.6.25-0.121.rc5.git4.fc9), I see virtually no read calls occurring. This tells me that we're almost never invalidating the cache, and so we virtually never end up clobbering the data in a dirty page...
------- Comment From sglass.com 2008-03-31 12:15 EDT------- Red Hat, could I get a quick update on where we are with this bug, please? IBM really needs to update the customer on this bug. Thanks, Stephanie Glass, SWG Program Manager
I am working on instrumenting a kernel so that I can attempt to see what is happening when the data is lost. Some of the instrumentation needs to go into the page fault handling code, so it must be done very carefully. Also, due to the high level of concurrency, care is needed not to disturb the ordering by serializing all of the processes on a single lock used for logging. I have almost completed the design for this instrumentation and will be implementing it. Once I get it in place, I would hope that we can find out how the updates to some pages are being lost.
------- Comment From ffilz.com 2008-04-07 09:51 EDT------- Any updates on this?
Not yet. I needed to spend some time on another high priority bug for a few days. I have come to a stopping point on that other bug and will be getting back to work on this situation this week.
------- Comment From chavez.com 2008-05-01 12:05 EDT------- Given that Red Hat has requested a RHIT issue to assist in getting a fix into a Z stream, I am going to have to break the mirroring link for a short time while I create a RHIT issue to link the LTC bug to. After this, Red Hat will need to establish a link from the RHIT issue to RH BZ 435291 with their Tao system. Please be patient and refrain from adding comments until this is complete. Thanks for your patience.
Hello ENG, I have attached the patch IBM tested and said was good. Thank You Joe Kachuck This event sent from IssueTracker by jkachuck issue 178083 it_file 131821
Created attachment 306161 [details] Proposed patch Here is a proposed patch to address the situation. I have not completed testing using the patch, but the current results look promising. A kernel with this patch runs the reproducer in this bugzilla and the more complex reproducer from bz435291 without failure on my test systems.
This request was evaluated by Red Hat Product Management for inclusion in a Red Hat Enterprise Linux maintenance release. Product Management has requested further review of this request by Red Hat Engineering, for potential inclusion in a Red Hat Enterprise Linux Update release for currently deployed products. This request is not yet committed for inclusion in an Update release.
----- Additional Comments From sglass.com 2008-05-21 14:17 EDT ------- From Red Hat bug 435291 which did not get mirrored: Comment #52 From Peter Staubach (staubach) on 2008-05-20 13:38 EST [reply] [edit] Proposed patch Here is a proposed patch to address the situation. I have not completed testing using the patch, but the current results look promising. A kernel with this patch runs the reproducer in this bugzilla and the more complex reproducer from bz435291 without failure on my test systems. I will post the patch separately. This event sent from IssueTracker by jkachuck issue 178083
File uploaded: 435291.rhel-5 This event sent from IssueTracker by jkachuck issue 178083 it_file 134389
file attachment from LTC:435291.rhel-5 This event sent from IssueTracker by jkachuck issue 178083
----- Additional Comments From sglass.com 2008-05-21 14:26 EDT ------- New patch. Please test on both RHEL 4.5 and RHEL 5.1. This event sent from IssueTracker by jkachuck issue 178083
Please be aware that the patch for RHEL-4 is different than the patch for RHEL-5. You will need the patch from bz432974 for the RHEL-4 version.
changed:
What            |Removed            |Added
----------------------------------------------------------------------------
Status          |FIXEDAWAITINGTEST  |TESTED

------- Additional Comments From sglass.com 2008-05-28 14:50 EDT -------
IBM has tested this latest patch and it works great! Thanks for all the hard work. This event sent from IssueTracker by jkachuck issue 178083
Great news! Thank you for the update.
I added this patch to the test kernels that I have here:

http://people.redhat.com/jlayton/

Today, I was doing some testing for an upstream server-side patch I'm working on and I noticed that iozone had hung on the rhel5 client I was using. Here's the stack there:

PID: 23154  TASK: ffff810003980c00  CPU: 0  COMMAND: "iozone"
 #0 [ffff810013f23cf8] schedule at ffffffff80065f83
 #1 [ffff810013f23dd0] inode_wait at ffffffff800eebb1
 #2 [ffff810013f23de0] __wait_on_bit at ffffffff80066a75
 #3 [ffff810013f23e20] __writeback_single_inode at ffffffff80031779
 #4 [ffff810013f23eb0] sync_sb_inodes at ffffffff80022109
 #5 [ffff810013f23ef0] sync_inodes_sb at ffffffff800f54ec
 #6 [ffff810013f23f40] __sync_inodes at ffffffff800f555d
 #7 [ffff810013f23f60] do_sync at ffffffff800e44e8
 #8 [ffff810013f23f70] sys_sync at ffffffff800e451a
 #9 [ffff810013f23f80] tracesys at ffffffff800602a6 (via system_call)

...looking around, I noticed that rhn_check seemed to also be hung (and it had been running since yesterday):

PID: 20997  TASK: ffff81000333ec80  CPU: 0  COMMAND: "rhn_check"
 #0 [ffff810015fcd998] schedule at ffffffff80065f83
 #1 [ffff810015fcda70] io_schedule at ffffffff80066849
 #2 [ffff810015fcda90] sync_page at ffffffff80029f58
 #3 [ffff810015fcdaa0] __wait_on_bit_lock at ffffffff8006698d
 #4 [ffff810015fcdad0] __lock_page at ffffffff80041cfa
 #5 [ffff810015fcdb20] mpage_writepages at ffffffff8001d9f1
 #6 [ffff810015fcdc30] do_writepages at ffffffff8005d5c4
 #7 [ffff810015fcdc40] __writeback_single_inode at ffffffff800317ed
 #8 [ffff810015fcdcd0] sync_sb_inodes at ffffffff80022109
 #9 [ffff810015fcdd10] writeback_inodes at ffffffff800532ef
#10 [ffff810015fcdd30] balance_dirty_pages_ratelimited_nr at ffffffff800cda2c
#11 [ffff810015fcddd0] __handle_mm_fault at ffffffff80008f1a
#12 [ffff810015fcde60] do_page_fault at ffffffff8006a7db
#13 [ffff810015fcdf50] error_exit at ffffffff80060e9d

...it looks like it's hung trying to lock the page here while in the page fault handler.
The page seems to be part of a file on ext3. I'm not that familiar with this code, but perhaps we're trying to lock a page that we already have locked?
Yup, trying to lock a page that is already locked. Deadlock.
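The page lock is a plain, non-recursive bit lock, so a second blocking acquisition from the task that already holds it sleeps forever. Here is a minimal analogue of that shape (hypothetical names, not kernel code: the real lock_page/trylock_page operate on a page flag bit):

```c
/* Analogue of the kernel page lock: a non-recursive bit lock.
 * try_lock_page_analogue() is like trylock_page(); a blocking
 * lock_page()-style caller that already holds the lock would
 * simply wait forever, which is the deadlock in the stacks above. */
static int page_locked;

int try_lock_page_analogue(void)
{
    if (page_locked)
        return 0;              /* already held: a blocking caller waits */
    page_locked = 1;
    return 1;
}

void unlock_page_analogue(void)
{
    page_locked = 0;
}
```

In the rhn_check trace, the fault path had the page locked and then balance_dirty_pages kicked writeback, which tried to take that same lock: the second, blocking acquisition can never succeed.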
Created attachment 309133 [details] Proposed patch
Here is a new patch which should address the deadlock described in Comment #69. It passes the reproducers and is in use on a test system.
Stephanie, can you please get the appropriate IBM resources testing this new patch?
----- Additional Comments From cguild.com 2008-06-20 12:05 EDT ------- I'll be more than happy to test the patch. If you can tell me the kernel that the patch was based off of, I have several images that have been in use for months. The only one I do not have is RHAS4 U7, and I will see about downloading the latest ISOs. Or you can point me to the kernel sources, which would be better. Thanks! This event sent from IssueTracker by jkachuck issue 178083
If there is a hotfix for this in RHAS4 U5, can someone point us to where we can get that hotfix, or is it required to go to Update 6 to have this fixed?
This is the RHEL-5 bugzilla, by the way. The RHEL-4 bugzilla is 432974. But anyway, there isn't a hotfix for either one, that I know of. The patch for this has been proposed for RHEL-4 U8. I believe that there are rpms for testing which contain the patch, but others would know better which ones and where they are located. The changes to address this situation are in very commonly used areas and are considered to be fairly high risk. Until we can get some more experience with them, backporting them to older RHEL-4 releases is not a good idea.
Fixed in kernel-2.6.18-99.el5. You can download this test kernel from http://people.redhat.com/dzickus/el5
I've seen serious issues with RHEL 5.0 when accessing BerkeleyDB-backed Subversion repositories housed on a NetApp device. Relatively straightforward operations hang indefinitely with only two concurrent accessing processes in play. An upgrade to RHEL 5.2 didn't help. But Don Zickus' kernel at http://people.redhat.com/dzickus/el5/107.el5 (applied as an upgrade to an RHEL 5.0 installation) seems to have remedied the problem! Can anyone speak about the trajectory of this issue, specifically regarding the possibility of the fix showing up in a blessed Red Hat release?
Sorry about the previous comment. It might be helpful to know that I and my co-workers stumbled upon this issue while testing RHEL 5.0, Subversion 1.5, and Berkeley DB 4.7. We were able to rule out Subversion and Berkeley DB as likely causes, and eventually found that the one key change since our last successful configuration was an upgrade from RHEL-3 to RHEL-5. There is no shortage of NFS-related bugs in this tracker, but this one caught my eye because it referred to memory mapped files. In some of my tests I didn't see an application hang, but I *did* see results that seemed to indicate ... "wonkiness" (for lack of a better word) in the data returned from Berkeley DB: Subversion transaction records suddenly disappearing, errors about "unpinned pages", etc. We have many applications reading/writing to the NetApp, but Berkeley DB makes fairly heavy use of file-backed shared memory regions. *That's* what drew my attention here, and prompted us to try out Don's test kernel.
Sorry, you are asking when a kernel containing this patch might be released? This patch is targeted for release in RHEL-5.3.
Attention Partners! RHEL 5.3 public Beta will be released soon. This URGENT priority/severity bug should have a fix in place in the recently released Partner Alpha drop, available at ftp://partners.redhat.com. If you haven't had a chance yet to test this bug, please do so at your earliest convenience, to ensure the highest possible quality bits in the upcoming Beta drop. Thanks, more information about Beta testing to come. - Red Hat QE Partner Management
We (not me personally this time, but peers of mine) at CollabNet have tested the RHEL 5.3 alpha drop and found that it did *not* fix our problems. Strangely, Don Zickus' kernel at http://people.redhat.com/dzickus/el5/107.el5 *did*. I'm left wondering if perhaps there were additional patches in Don's kernel that are not present in the RHEL 5.3 alpha kernel. Did the RHEL 5.3 alpha kernel remedy the problems seen by the original reporter of this issue?
Which version of the RHEL-5 kernel was included in the alpha described in Comment #90? The patch was included in 2.6.18-99.el5.
The version is as stated on RH announcement here: https://www.redhat.com/archives/rhelv5-announce/2008-October/msg00000.html (And note that I was saying "alpha" when apparently this was actually a "beta"?) $ uname -a Linux *****.collab.net 2.6.18-120.el5PAE #1 SMP Fri Oct 17 18:17:11 EDT 2008 i686 i686 i386 GNU/Linux $
The patch as attached to this bugzilla does indeed address the problems as reproduced by the reproducer also attached to this bugzilla. The patch was included in RHEL-5 kernels as of 2.6.18-99. If it does not address some other situation, then I will need some way to reproduce that other situation so that I can see it and do some debugging and analysis.
Hrm. We received the following update from Red Hat this morning via other channels:

---------------------------------------
| Case Information |
---------------------------------------
Case Title : LTC41974-Pages of a memory mapped NFS file get corrupted
Case Number : 1857478
Case Open Date : 11-SEP-2008
Problem Type :
Last Update Comment as of 05-NOV-2008 09:10:41 :

Hello,

I have checked and the current beta kernel which is included in RHEL5.3 beta *does not* include the patch described in the bugzilla. This is because the bugzilla and patch has still to get a qa acknowledgment from the quality management.

Thank you,
Shijoe George.

I'm trying to reconcile the various bits of information at my disposal, but that's proving difficult. :-)
Let me assure you -- the patch associated with this bugzilla is in every RHEL-5 kernel after 2.6.18-99. If someone disagrees, please have them contact me. Consider this to be authoritative.
Update from IBM: This has tested well on RHEL 5.3 for IBM and its customer. I have a request: if there is another problem, please open a new bug on that issue rather than confusing this issue with other problems. I would like to not delay getting this fix out to our customers. Thanks, Stephanie, LTC Program Manager for SWG
Okay. We've retested this beta kernel, and things *do* seem to work. (There arose some question about the validity of our previous round of tests which encouraged us to try again.) Embarrassing, yes, but in the long run all we'll care about is that stuff works. :-)
An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you. http://rhn.redhat.com/errata/RHSA-2009-0225.html