The RHEL5.3 client does not always invalidate previously cached READDIRPLUS reply pages after individual pages are repopulated with fresh results across the wire. When the repopulated page is in the middle of the READDIRPLUS stream, and the server replies with less (or more) data, it will throw off the cookie matching algorithm. This causes the client to miss (or possibly duplicate) entries during getdents requests. The good news is that it has already been fixed upstream. The following patch predicted this problem could occur and fixed it by invalidating subsequent pages when an over-the-wire refresh is issued: http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=2aac05a91971fbd1bf6cbed78b8731eb7454b9b7 We've recently started hitting this problem in production with a RHEL5.3 client and a NetApp filer, both under heavy load. In our latest case, an earlier READDIRPLUS sequence populated the readdir cache. Some time later, memory pressure must have forced the client to invalidate a _subset_ of the pages holding the cached READDIRPLUS responses, but not all of them. A new 'ls' invocation, forced an over-the-wire READDIRPLUS for a given cookie to be sent to populate a new page (cached page had been flushed). The server replied with one less entry than it had before (for the same cookie). The server is allowed to return less bytes than asked for, as long as the cookie indicates where to "pick up" the processing on the next READDIRPLUS request. After processing all the entries in the reply, nfs_do_filldir() updates desc->dir_cookie, and increments the page_index in the descriptor. We're now back at the top of the while loop in nfs_readdir(), looking for the next page containing data for this directory. find_dirent_page() finds the next page in memory, and calls find_dirent() to try to find the cookie where we left off. Note that we found a valid mapping in memory, so we did not have to go across the wire. find_dirent(0 XDR decodes the first entry in the page. The nfs_entry prev_cookie pointer is set to the cookie entry of the last entry we had read. ***This is where the problem occurs***. We miss the fact that we've skipped the entry that the filer did not return in its short reply. The check in find_dirent() for entry->pre_cookie == *desc->dir_cookie is successful, and we believe that we got the entry we were looking for. nfs_readdir() will now call nfs_do_filldir() to proceed filling in the dirent with the remaining entries from the page. This is why we miss the entry, because we go back to the cached page instead of issuing a new READDIRPLUS request with the new cookie. I was able to successfully reproduce this problem with an instrumented client and an instrumented server. We created a directory on the filer with 50 entries, enough to require 3 READDIRPLUS requests to fulfill a directory listing. The filer is instrumented to artificially cut the READDIRPLUS reply short every 5 replies, and return the "mising" entry on the next READDIRPLUS reply. This effectively reproduced the situation seen in our production environment. To trigger the problem, I issue a directory listing (ls) which populates 3 pages of the READDIR cache, but artificially invalidates the second page of the READDIR cache after it has filled-in the dirents. This makes a subsequent 'ls' request (getdents syscall) satisfy the first 22 entries from the first cached page, but forces the second page to be filled over the wire on every request. With the instrumented filer, we can see the client failing to list the "missing" entry from the second READDIRPLUS reply. The client never sees it, because it gets the remaining entries from the third cached page, instead of issuing a fresh request for it over-the-wire.
Created attachment 353922 [details] Network Trace file I'm attaching a network trace that shows the reproduced problem. Packets 65, 70, and 73 contain the READDIRPLUS packets that are used to populate the readdir cache. The client indeed lists all 50 entries: $ ls /mnt/readdir f.1 f.13 f.17 f.20 f.24 f.28 f.31 f.35 f.39 f.42 f.46 f.5 f.8 f.10 f.14 f.18 f.21 f.25 f.29 f.32 f.36 f.4 f.43 f.47 f.50 f.9 f.11 f.15 f.19 f.22 f.26 f.3 f.33 f.37 f.40 f.44 f.48 f.6 f.12 f.16 f.2 f.23 f.27 f.30 f.34 f.38 f.41 f.45 f.49 f.7 As noted before, the instrumented client artificially invalidates the second page of the cache, forcing a subsequent 'ls' to issue an over-the-wire READDIRPLUS request for cookie '22' in packet 77. Notice that the server replies with 21 entries (one short than in packet 70). This is legal though. The client should then issue a follow up READDIRPLUS request with cookie '43'. Instead, the client uses the cached contents for the next page, missing entry f.47 altogether. labiaga@royalflush$ ls /mnt/readdir f.1 f.13 f.17 f.20 f.24 f.28 f.31 f.35 f.39 f.42 f.46 f.50 f.9 f.10 f.14 f.18 f.21 f.25 f.29 f.32 f.36 f.4 f.43 f.48 f.6 f.11 f.15 f.19 f.22 f.26 f.3 f.33 f.37 f.40 f.44 f.49 f.7 f.12 f.16 f.2 f.23 f.27 f.30 f.34 f.38 f.41 f.45 f.5 f.8
Created attachment 356164 [details] patchset -- backport 2 fixes to code that calls invalidate_inode_pages2 Initial stab at a patchset to fix this. In addition to the patch that Ricardo pointed out, this backports an earlier patch that makes the code deal correctly with invalidate_inode_pages2 failures. I suspect that that change is not really needed to fix this issue, but it seems reasonable and it makes the other patch apply cleanly.
Ricardo, Any chance you could pass along your fault injection patch so we can use it to verify the fix here? I'd also appreciate your opinion on the possible patchset in comment #3.
I'm also trying to determine why the fix for bug 364351 doesn't help this. It seems like we should be getting an iput on the dir since you're doing multiple ls's with the reproducer in comment #2.
(In reply to comment #4) > Ricardo, > Any chance you could pass along your fault injection patch so we can use it > to verify the fix here? The client side is pretty simple. The diffs are from a pNFS git (2.6.31) tree but hopefully make sense in a vanilla 2.6.31 context. The first nfs_readdir_filler() change simply backs out the fix and brings it to the equivalent RHEL5.3 behavior. The nfs_do_filldir() changes invalidate the second page, so that a subsequent readdir request will be forced across the wire. $ git diff dir.c diff --git a/fs/nfs/dir.c b/fs/nfs/dir.c index 89f98e9..a765c99 100644 --- a/fs/nfs/dir.c +++ b/fs/nfs/dir.c @@ -208,10 +208,8 @@ int nfs_readdir_filler(nfs_readdir_descriptor_t *desc, stru * Note: assumes we have exclusive access to this mapping either * through inode->i_mutex or some other mechanism. */ - if (invalidate_inode_pages2_range(inode->i_mapping, page->index + 1, -1) - /* Should never happen */ - nfs_zap_mapping(inode, inode->i_mapping); - } + if (page->index == 0) + invalidate_inode_pages2_range(inode->i_mapping, PAGE_CACHE_SIZE, unlock_page(page); return 0; error: @@ -404,6 +402,10 @@ int nfs_do_filldir(nfs_readdir_descriptor_t *desc, void *di filldir_t filldir) { struct file *file = desc->file; +/* XXXRLHACK */ + struct inode *inode = file->f_path.dentry->d_inode; +/* XXXRLHACK */ + struct nfs_entry *entry = desc->entry; struct dentry *dentry = NULL; u64 fileid; @@ -439,7 +441,13 @@ int nfs_do_filldir(nfs_readdir_descriptor_t *desc, void *di *desc->dir_cookie = entry->cookie; if (dir_decode(desc) != 0) { desc->page_index ++; +/* XXXRL HACK to test */ +if (desc->page_index == 1) { + printk("Invalidating page 1\n"); + invalidate_inode_pages2_range(inode->i_mapping, 1, 1); +} break; + } if (loop_count++ > 200) { loop_count = 0; Unfortunately, the server side instrumentation was based on ONTAP 8 cluster-mode code. It has not yet been released, and may not be straight forward for you to install inhouse. I see two options: 1) Instrument the Linux NFS server to do the equivalent behavior. 2) Send me a binary (perhaps a virtual machine image) with the fix and the above instrumentation, and I can test it internally. #2 is definitely more expedient. > I'd also appreciate your opinion on the possible > patchset in comment #3. Seems reasonable.
P.S. I've already tested the recommended fix and it indeed addressed the problem.
Thanks for the info Ricardo. I'll see about getting a test kernel rolled up soon with your fault injection code. Do you have a RHEL5 host set up for it already? If so, what kernel variant is it using? I also understand why the patch for bug 364351 doesn't help. We're not deallocating the dentry in this case, so dentry_iput doesn't get called. We'd prefer to have a way to independently verify this, but it doesn't appear to be trivial to do a similar fault injection patch for knfsd. I'll look over the code a bit more before I give up on it, but it's probably not going to be worth it.
(In reply to comment #8) > Do you have a RHEL5 host set up for it > already? If so, what kernel variant is it using? RHEL5.3: 2.6.18-128.el5 #1 SMP Wed Dec 17 11:42:39 EST 2008 i686 i686 i386 GNU/Linux
Created attachment 357200 [details] patchset -- two patches + fault injection patch Hi Ricardo, Sorry for the delay. I built some test kernels with the attached patchset and put the i686 and srpm here: http://people.redhat.com/jlayton/bz511170/ They have the patchset in comment #3 + the fault injection patch that you posted in comment #6. Can you test this kernel and verify that it fixes the problem? Let me know if you need a different arch or kernel variant.
Thanks Jeff. I verified that kernel-2.6.18-162.el5.bz511170.1.i686.rpm indeed addresses the problem when mounting the instrumented ONTAP filer. I'll go ahead and have our QA group test on one of the RHEL5.3 machines that was failing in the "test production" and verify that it indeed addresses the problem in "real life".
Thanks Ricardo. Let me know what your QA group finds.
Ricardo, any feedback from your QA folks?
I hope to hear from them soon.
I should know more in about two weeks when the person who owns the test infrastructure returns to the office to test the patch. As I mentioned before, I verified in my "fault injection" environment, but would be good to verify the fix in the environment that triggered the initial problem.
Thanks Ricardo. No huge hurry from our standpoint. The patch deadline for the next RHEL5 release is still quite a ways out.
Jeff, can you provide a 64-bit patch for test in our QA environment? Thanks.
I assume you mean an x86_64 kernel package? If so, then yes but it'll take a day or so to rebuild. The build system has already cleaned out my last one.
Yes, x86_64. Thanks.
Our QA team was unable to reproduce the problem after applying this patch. Looks good to us, thanks for addressing this problem so rapidly!
Can we get a patch for RHEL5.3 as well, while the patch is going to be available in RHEL5.4?
(In reply to comment #22) > Can we get a patch for RHEL5.3 as well, while the patch is going to be > available in RHEL5.4? Sorry Bikash, missed your request... The same patch will apply to 5.3 kernels with a little fuzz. YMMV with that patch, of course...
Thanks Jeff. I think the question was more along the lines of an official patch. Can you integrate this fix into your patch stream for 5.3 and 5.4?
It would be better for that sort of request to bubble up from our support folks. Andrius, can you help drive that?
Jeff - there is no official path for NetApp through RH Support since they do not have TAM coverage. Let me bring in jwest, maybe he can approve this for errata streams...
This may be a good idea to have in 5.3. While the number of customers we have using 5.3 extended update support is relatively small, most of that number are customers who would be interested in having this patch. I'll approve this from the support side for 5.3 and 5.4. This still needs additional approvals however and isn't a guarantee. Thanks Jeremy West
This request was evaluated by Red Hat Product Management for inclusion in a Red Hat Enterprise Linux maintenance release. Product Management has requested further review of this request by Red Hat Engineering, for potential inclusion in a Red Hat Enterprise Linux Update release for currently deployed products. This request is not yet committed for inclusion in an Update release.
in kernel-2.6.18-168.el5 You can download this test kernel from http://people.redhat.com/dzickus/el5 Please do NOT transition this bugzilla state to VERIFIED until our QE team has sent specific instructions indicating when to do so. However feel free to provide a comment indicating that this fix has been verified.
We are unable to reproduce the problem with the 2.6.18-168.el5 kernel on RHEL5.3. So the fix is verified as far as we're concerned.
Ricardo - thanks for testing this! This will be closed when it is delivered in RHEL 5.5. Thanks again!
*** Bug 570061 has been marked as a duplicate of this bug. ***
An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on therefore solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you. http://rhn.redhat.com/errata/RHSA-2010-0178.html