517508 – nfs_readdir() may fail to return all the files in the directory

Bug 517508 - nfs_readdir() may fail to return all the files in the directory

Summary: nfs_readdir() may fail to return all the files in the directory

Keywords:
Status:	CLOSED NOTABUG
Alias:	None
Product:	Red Hat Enterprise Linux 4
Classification:	Red Hat
Component:	kernel
Sub Component:
Version:	4.8
Hardware:	All
OS:	Linux
Priority:	high
Severity:	high
Target Milestone:	rc
Target Release:	4.9
Assignee:	Jeff Layton
QA Contact:	Red Hat Kernel QE team
Docs Contact:
URL:
Whiteboard:
Depends On:	511170
Blocks:
TreeView+	depends on / blocked

Reported:	2009-08-14 12:28 UTC by Jeff Layton
Modified:	2009-08-28 18:23 UTC (History)
CC List:	8 users (show)
Fixed In Version:
Doc Type:	Bug Fix
Doc Text:
Clone Of:	511170
Environment:
Last Closed:	2009-08-28 18:23:58 UTC
Target Upstream Version:
Embargoed:

Attachments	(Terms of Use)

Description Jeff Layton 2009-08-14 12:28:42 UTC

+++ This bug was initially created as a clone of Bug #511170 +++

The RHEL5.3 client does not always invalidate previously cached READDIRPLUS reply pages after individual pages are repopulated with fresh results across the wire.  When the repopulated page is in the middle of the READDIRPLUS stream, and the server replies with less (or more) data, it will throw off the cookie matching algorithm.  This causes the client to miss (or possibly duplicate) entries during getdents requests.

The good news is that it has already been fixed upstream.  The following patch predicted this problem could occur and fixed it by invalidating subsequent pages when an over-the-wire refresh is issued:

http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=2aac05a91971fbd1bf6cbed78b8731eb7454b9b7

We've recently started hitting this problem in production with a RHEL5.3 client and a NetApp filer, both under heavy load.  

In our latest case, an earlier READDIRPLUS sequence populated the readdir cache.  Some time later, memory pressure must have forced the client to invalidate a _subset_ of the pages holding the cached READDIRPLUS responses, but not all of them.  A new 'ls' invocation, forced an over-the-wire READDIRPLUS for a given cookie to be sent to populate a new page (cached page had been flushed).  The server replied with one less entry than it had before (for the same cookie).  The server is allowed to return less bytes than asked for, as long as the cookie indicates where to "pick up" the processing on the next READDIRPLUS request.

After processing all the entries in the reply, nfs_do_filldir() updates desc->dir_cookie, and increments the page_index in the descriptor.  We're now back at the top of the while loop in nfs_readdir(), looking for the next page containing data for this directory.  find_dirent_page() finds the next page in memory, and calls find_dirent() to try to find the cookie where we left off.  Note that we found a valid mapping in memory, so we did not have to go across the wire.

find_dirent(0 XDR decodes the first entry in the page.  The nfs_entry prev_cookie pointer is set to the cookie entry of the last entry we had read.  ***This is where the problem occurs***.  We miss the fact that we've skipped the entry that the filer did not return in its short reply.  The check in find_dirent() for entry->pre_cookie == *desc->dir_cookie is  successful, and we believe that we got the entry we were looking for.  nfs_readdir() will now call nfs_do_filldir() to proceed filling in the dirent with the remaining entries from the page.  This is why we miss the entry, because we go back to the cached page instead of issuing a new READDIRPLUS request with the new cookie.

I was able to successfully reproduce this problem with an instrumented client and an instrumented server.  We created a directory on the filer with 50 entries, enough to require 3 READDIRPLUS requests to fulfill a directory listing.  The filer is instrumented to artificially cut the READDIRPLUS reply short every 5 replies, and return the "mising" entry on the next READDIRPLUS reply.  This effectively reproduced the situation seen in our production environment.

To trigger the problem, I issue a directory listing (ls) which populates 3 pages of the READDIR cache, but artificially invalidates the second page of the READDIR cache after it has filled-in the dirents.  This makes a subsequent 'ls' request (getdents syscall) satisfy the first 22 entries from the first cached page, but forces the second page to be filled over the wire on every request.  With the instrumented filer, we can see the client failing to list  the "missing" entry from the second READDIRPLUS reply.  The client never sees it, because it gets the remaining entries from the third cached page, instead of issuing a fresh request for it over-the-wire.

--- Additional comment from ricardo.labiaga on 2009-07-15 18:58:04 EDT ---

Created an attachment (id=353922)
Network Trace file

I'm attaching a network trace that shows the reproduced problem.  Packets 65, 70, and 73 contain the READDIRPLUS packets that are used to populate the readdir cache.  The client indeed lists all 50 entries:

$ ls /mnt/readdir
f.1   f.13  f.17  f.20  f.24  f.28  f.31  f.35  f.39  f.42  f.46  f.5   f.8
f.10  f.14  f.18  f.21  f.25  f.29  f.32  f.36  f.4   f.43  f.47  f.50  f.9
f.11  f.15  f.19  f.22  f.26  f.3   f.33  f.37  f.40  f.44  f.48  f.6
f.12  f.16  f.2   f.23  f.27  f.30  f.34  f.38  f.41  f.45  f.49  f.7

As noted before, the instrumented client artificially invalidates the second page of the cache, forcing a subsequent 'ls' to issue an over-the-wire READDIRPLUS request for cookie '22' in packet 77.  Notice that the server replies with 21 entries (one short than in packet 70).  This is legal though.  The client should then issue a follow up READDIRPLUS request with cookie '43'.  Instead, the client uses the cached contents for the next page, missing entry f.47 altogether.

labiaga@royalflush$ ls /mnt/readdir
f.1   f.13  f.17  f.20  f.24  f.28  f.31  f.35  f.39  f.42  f.46  f.50  f.9
f.10  f.14  f.18  f.21  f.25  f.29  f.32  f.36  f.4   f.43  f.48  f.6
f.11  f.15  f.19  f.22  f.26  f.3   f.33  f.37  f.40  f.44  f.49  f.7
f.12  f.16  f.2   f.23  f.27  f.30  f.34  f.38  f.41  f.45  f.5   f.8

--- Additional comment from jlayton on 2009-08-04 08:58:44 EDT ---

Created an attachment (id=356164)
patchset -- backport 2 fixes to code that calls invalidate_inode_pages2

Initial stab at a patchset to fix this. In addition to the patch that Ricardo pointed out, this backports an earlier patch that makes the code deal correctly with invalidate_inode_pages2 failures.

I suspect that that change is not really needed to fix this issue, but it seems reasonable and it makes the other patch apply cleanly.

--- Additional comment from jlayton on 2009-08-04 09:13:16 EDT ---

Ricardo,
  Any chance you could pass along your fault injection patch so we can use it to verify the fix here? I'd also appreciate your opinion on the possible patchset in comment #3.

--- Additional comment from jlayton on 2009-08-04 09:28:32 EDT ---

I'm also trying to determine why the fix for bug 364351 doesn't help this. It seems like we should be getting an iput on the dir since you're doing multiple ls's with the reproducer in comment #2.

--- Additional comment from ricardo.labiaga on 2009-08-04 20:08:58 EDT ---

(In reply to comment #4)
> Ricardo,
>   Any chance you could pass along your fault injection patch so we can use it
> to verify the fix here? 

The client side is pretty simple.  The diffs are from a pNFS git (2.6.31) tree but hopefully make sense in a vanilla 2.6.31 context.  The first nfs_readdir_filler() change simply backs out the fix and brings it to the equivalent RHEL5.3 behavior.  The nfs_do_filldir() changes invalidate the second page, so that a subsequent readdir request will be forced across the wire.

$ git diff dir.c
diff --git a/fs/nfs/dir.c b/fs/nfs/dir.c
index 89f98e9..a765c99 100644
--- a/fs/nfs/dir.c
+++ b/fs/nfs/dir.c
@@ -208,10 +208,8 @@ int nfs_readdir_filler(nfs_readdir_descriptor_t *desc, stru
         * Note: assumes we have exclusive access to this mapping either
         *       through inode->i_mutex or some other mechanism.
         */
-       if (invalidate_inode_pages2_range(inode->i_mapping, page->index + 1, -1)
-               /* Should never happen */
-               nfs_zap_mapping(inode, inode->i_mapping);
-       }
+       if (page->index == 0)
+               invalidate_inode_pages2_range(inode->i_mapping, PAGE_CACHE_SIZE,
        unlock_page(page);
        return 0;
  error:
@@ -404,6 +402,10 @@ int nfs_do_filldir(nfs_readdir_descriptor_t *desc, void *di
                   filldir_t filldir)
 {
        struct file     *file = desc->file;
+/* XXXRLHACK */
+       struct inode    *inode = file->f_path.dentry->d_inode;
+/* XXXRLHACK */
+
        struct nfs_entry *entry = desc->entry;
        struct dentry   *dentry = NULL;
        u64             fileid;
@@ -439,7 +441,13 @@ int nfs_do_filldir(nfs_readdir_descriptor_t *desc, void *di
                *desc->dir_cookie = entry->cookie;
                if (dir_decode(desc) != 0) {
                        desc->page_index ++;
+/* XXXRL HACK to test */
+if (desc->page_index == 1) {
+       printk("Invalidating page 1\n");
+       invalidate_inode_pages2_range(inode->i_mapping, 1, 1);
+}
                        break;
+
                }
                if (loop_count++ > 200) {
                        loop_count = 0;

Unfortunately, the server side instrumentation was based on ONTAP 8 cluster-mode code.  It has not yet been released, and may not be straight forward for you to install inhouse.  I see two options:  

1) Instrument the Linux NFS server to do the equivalent behavior.  
2) Send me a binary (perhaps a virtual machine image) with the fix and the above instrumentation, and I can test it internally.  

#2 is definitely more expedient.

> I'd also appreciate your opinion on the possible
> patchset in comment #3.  

Seems reasonable.

--- Additional comment from ricardo.labiaga on 2009-08-04 20:51:34 EDT ---

P.S. I've already tested the recommended fix and it indeed addressed the problem.

--- Additional comment from jlayton on 2009-08-05 13:44:41 EDT ---

Thanks for the info Ricardo. I'll see about getting a test kernel rolled up soon with your fault injection code. Do you have a RHEL5 host set up for it already? If so, what kernel variant is it using?

I also understand why the patch for bug 364351 doesn't help. We're not deallocating the dentry in this case, so dentry_iput doesn't get called.

We'd prefer to have a way to independently verify this, but it doesn't appear to be trivial to do a similar fault injection patch for knfsd. I'll look over the code a bit more before I give up on it, but it's probably not going to be worth it.

--- Additional comment from ricardo.labiaga on 2009-08-06 12:59:59 EDT ---

(In reply to comment #8)
> Do you have a RHEL5 host set up for it
> already? If so, what kernel variant is it using?

RHEL5.3: 2.6.18-128.el5 #1 SMP Wed Dec 17 11:42:39 EST 2008 i686 i686 i386 GNU/Linux

--- Additional comment from jlayton on 2009-08-12 12:40:11 EDT ---

Created an attachment (id=357200)
patchset -- two patches + fault injection patch

Hi Ricardo,
Sorry for the delay. I built some test kernels with the attached patchset and put the i686 and srpm here:

http://people.redhat.com/jlayton/bz511170/

They have the patchset in comment #3 + the fault injection patch that you posted in comment #6. Can you test this kernel and verify that it fixes the problem?

Let me know if you need a different arch or kernel variant.

--- Additional comment from ricardo.labiaga on 2009-08-12 17:39:08 EDT ---

Thanks Jeff.  I verified that kernel-2.6.18-162.el5.bz511170.1.i686.rpm indeed addresses the problem when mounting the instrumented ONTAP filer.

I'll go ahead and have our QA group test on one of the RHEL5.3 machines that was failing in the "test production" and verify that it indeed addresses the problem in "real life".

Comment 1 Jeff Layton 2009-08-14 12:29:41 UTC

Going ahead and cloning this for RHEL4 since I'm pretty sure it has the same bug.

Comment 2 Jeff Layton 2009-08-28 18:23:58 UTC

I was mistaken. It looks like RHEL4 does not have this bug. Closing as NOTABUG.

Note You need to log in before you can comment on or make changes to this bug.