Bug 511170

Summary: [NetApp 5.5 bug] nfs_readdir() may fail to return all the files in the directory
Product: Red Hat Enterprise Linux 5
Reporter: Ricardo Labiaga <ricardo.labiaga>
Component: kernel
Assignee: Jeff Layton <jlayton>
Status: CLOSED ERRATA
QA Contact: Red Hat Kernel QE team <kernel-qe>
Severity: urgent
Priority: urgent
Version: 5.4
CC: andriusb, bikash.choudhury, bikash, bzeranski, cward, dhoward, dzickus, jlayton, jpirko, jtluka, jwest, rwheeler, samuel.li, sardella, steved, xiaowei.hu
Target Milestone: rc
Keywords: OtherQA, ZStream
Target Release: 5.5
Hardware: All
OS: Linux
Doc Type: Bug Fix
Clones: 517508 (view as bug list)
Last Closed: 2010-03-30 06:54:12 UTC
Bug Blocks: 517508, 526775, 526959, 526960, 533192, 533941
Attachments:
  Network Trace file
  patchset -- backport 2 fixes to code that calls invalidate_inode_pages2
  patchset -- two patches + fault injection patch

Description Ricardo Labiaga 2009-07-14 00:59:15 UTC
The RHEL5.3 client does not always invalidate previously cached READDIRPLUS reply pages after individual pages are repopulated with fresh results across the wire.  When the repopulated page is in the middle of the READDIRPLUS stream and the server replies with less (or more) data than before, the mismatch throws off the cookie matching algorithm.  This causes the client to miss (or possibly duplicate) entries during getdents requests.

The good news is that it has already been fixed upstream.  The following patch predicted this problem could occur and fixed it by invalidating subsequent pages when an over-the-wire refresh is issued:

http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=2aac05a91971fbd1bf6cbed78b8731eb7454b9b7
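
In essence, the change makes nfs_readdir_filler() drop every cached page that follows the one just refilled over the wire, rather than flushing the tail of the cache only when the first page is refilled.  A simplified sketch of the resulting logic (not the verbatim upstream hunk; error handling omitted here):

        /* The page at page->index was just repopulated from a fresh
         * READDIRPLUS reply.  Invalidate every cached page after it
         * (page->index + 1 through EOF), so stale cookie runs can no
         * longer be picked up from the old cache. */
        invalidate_inode_pages2_range(inode->i_mapping, page->index + 1, -1);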

We've recently started hitting this problem in production with a RHEL5.3 client and a NetApp filer, both under heavy load.  

In our latest case, an earlier READDIRPLUS sequence populated the readdir cache.  Some time later, memory pressure must have forced the client to invalidate a _subset_ of the pages holding the cached READDIRPLUS responses, but not all of them.  A new 'ls' invocation forced an over-the-wire READDIRPLUS for a given cookie to be sent to repopulate the page whose cached copy had been flushed.  The server replied with one fewer entry than it had before (for the same cookie).  The server is allowed to return fewer bytes than were asked for, as long as the cookie indicates where to "pick up" processing on the next READDIRPLUS request.

After processing all the entries in the reply, nfs_do_filldir() updates desc->dir_cookie, and increments the page_index in the descriptor.  We're now back at the top of the while loop in nfs_readdir(), looking for the next page containing data for this directory.  find_dirent_page() finds the next page in memory, and calls find_dirent() to try to find the cookie where we left off.  Note that we found a valid mapping in memory, so we did not have to go across the wire.

find_dirent() XDR decodes the first entry in the page.  The nfs_entry prev_cookie field is set to the cookie of the last entry we had read.  ***This is where the problem occurs***.  We miss the fact that we've skipped the entry that the filer did not return in its short reply.  The check in find_dirent() for entry->prev_cookie == *desc->dir_cookie succeeds, and we believe that we got the entry we were looking for.  nfs_readdir() then calls nfs_do_filldir() to continue filling in the dirent with the remaining entries from the page.  This is why we miss the entry: we go back to the cached page instead of issuing a new READDIRPLUS request with the new cookie.
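
For illustration, here is a simplified sketch of the cookie-matching loop in find_dirent() (paraphrased from the 2.6.18-era code, not a verbatim copy).  It accepts the first decoded entry whose prev_cookie matches the cookie we stopped at, which is exactly why a short refreshed reply goes unnoticed:

static int find_dirent(nfs_readdir_descriptor_t *desc)
{
        struct nfs_entry *entry = desc->entry;
        int status;

        /* Walk the XDR-encoded entries in the current cached page... */
        while ((status = dir_decode(desc)) == 0) {
                /* ...and stop at the entry that appears to follow the cookie
                 * we left off at.  If the refreshed previous page came back
                 * one entry short, this match still succeeds against the
                 * stale cached page, and the skipped entry is silently lost. */
                if (entry->prev_cookie == *desc->dir_cookie)
                        break;
        }
        return status;
}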

I was able to successfully reproduce this problem with an instrumented client and an instrumented server.  We created a directory on the filer with 50 entries, enough to require 3 READDIRPLUS requests to fulfill a directory listing.  The filer is instrumented to artificially cut the READDIRPLUS reply short every 5 replies, and to return the "missing" entry on the next READDIRPLUS reply.  This effectively reproduced the situation seen in our production environment.

To trigger the problem, I issue a directory listing (ls) which populates 3 pages of the READDIR cache, but artificially invalidates the second page of the READDIR cache after it has filled in the dirents.  This makes a subsequent 'ls' request (getdents syscall) satisfy the first 22 entries from the first cached page, but forces the second page to be filled over the wire on every request.  With the instrumented filer, we can see the client failing to list the "missing" entry from the second READDIRPLUS reply.  The client never sees it, because it gets the remaining entries from the third cached page instead of issuing a fresh request for it over the wire.

Comment 1 Ricardo Labiaga 2009-07-15 22:58:04 UTC
Created attachment 353922 [details]
Network Trace file

I'm attaching a network trace that shows the reproduced problem.  Packets 65, 70, and 73 contain the READDIRPLUS packets that are used to populate the readdir cache.  The client indeed lists all 50 entries:

$ ls /mnt/readdir
f.1   f.13  f.17  f.20  f.24  f.28  f.31  f.35  f.39  f.42  f.46  f.5   f.8
f.10  f.14  f.18  f.21  f.25  f.29  f.32  f.36  f.4   f.43  f.47  f.50  f.9
f.11  f.15  f.19  f.22  f.26  f.3   f.33  f.37  f.40  f.44  f.48  f.6
f.12  f.16  f.2   f.23  f.27  f.30  f.34  f.38  f.41  f.45  f.49  f.7

As noted before, the instrumented client artificially invalidates the second page of the cache, forcing a subsequent 'ls' to issue an over-the-wire READDIRPLUS request for cookie '22' in packet 77.  Notice that the server replies with 21 entries (one fewer than in packet 70).  This is legal, though.  The client should then issue a follow-up READDIRPLUS request with cookie '43'.  Instead, the client uses the cached contents for the next page, missing entry f.47 altogether.

labiaga@royalflush$ ls /mnt/readdir
f.1   f.13  f.17  f.20  f.24  f.28  f.31  f.35  f.39  f.42  f.46  f.50  f.9
f.10  f.14  f.18  f.21  f.25  f.29  f.32  f.36  f.4   f.43  f.48  f.6
f.11  f.15  f.19  f.22  f.26  f.3   f.33  f.37  f.40  f.44  f.49  f.7
f.12  f.16  f.2   f.23  f.27  f.30  f.34  f.38  f.41  f.45  f.5   f.8

Comment 3 Jeff Layton 2009-08-04 12:58:44 UTC
Created attachment 356164 [details]
patchset -- backport 2 fixes to code that calls invalidate_inode_pages2

Initial stab at a patchset to fix this. In addition to the patch that Ricardo pointed out, this backports an earlier patch that makes the code deal correctly with invalidate_inode_pages2 failures.

I suspect that that change is not really needed to fix this issue, but it seems reasonable and it makes the other patch apply cleanly.
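
For context, that earlier patch boils down to the following pattern (an illustrative sketch of one call site, not the full set of backported hunks): check the return value of the invalidation call and, if some pages could not be dropped, zap the mapping so the directory data is revalidated later.

        /* Illustrative only: don't silently ignore a failed invalidation. */
        if (invalidate_inode_pages2_range(inode->i_mapping, page->index + 1, -1) < 0) {
                /* Should never happen -- but if it does, force the whole
                 * mapping to be re-read rather than trusting stale pages. */
                nfs_zap_mapping(inode, inode->i_mapping);
        }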

Comment 4 Jeff Layton 2009-08-04 13:13:16 UTC
Ricardo,
  Any chance you could pass along your fault injection patch so we can use it to verify the fix here? I'd also appreciate your opinion on the possible patchset in comment #3.

Comment 5 Jeff Layton 2009-08-04 13:28:32 UTC
I'm also trying to determine why the fix for bug 364351 doesn't help this. It seems like we should be getting an iput on the dir since you're doing multiple ls's with the reproducer in comment #2.

Comment 6 Ricardo Labiaga 2009-08-05 00:08:58 UTC
(In reply to comment #4)
> Ricardo,
>   Any chance you could pass along your fault injection patch so we can use it
> to verify the fix here? 

The client side is pretty simple.  The diffs are from a pNFS git (2.6.31) tree but hopefully make sense in a vanilla 2.6.31 context.  The first nfs_readdir_filler() change simply backs out the fix, restoring the equivalent of the RHEL5.3 behavior.  The nfs_do_filldir() changes invalidate the second page, so that a subsequent readdir request will be forced across the wire.

$ git diff dir.c
diff --git a/fs/nfs/dir.c b/fs/nfs/dir.c
index 89f98e9..a765c99 100644
--- a/fs/nfs/dir.c
+++ b/fs/nfs/dir.c
@@ -208,10 +208,8 @@ int nfs_readdir_filler(nfs_readdir_descriptor_t *desc, struct page *page)
         * Note: assumes we have exclusive access to this mapping either
         *       through inode->i_mutex or some other mechanism.
         */
-       if (invalidate_inode_pages2_range(inode->i_mapping, page->index + 1, -1) < 0) {
-               /* Should never happen */
-               nfs_zap_mapping(inode, inode->i_mapping);
-       }
+       if (page->index == 0)
+               invalidate_inode_pages2_range(inode->i_mapping, PAGE_CACHE_SIZE, -1);
        unlock_page(page);
        return 0;
  error:
@@ -404,6 +402,10 @@ int nfs_do_filldir(nfs_readdir_descriptor_t *desc, void *dirent,
                   filldir_t filldir)
 {
        struct file     *file = desc->file;
+/* XXXRLHACK */
+       struct inode    *inode = file->f_path.dentry->d_inode;
+/* XXXRLHACK */
+
        struct nfs_entry *entry = desc->entry;
        struct dentry   *dentry = NULL;
        u64             fileid;
@@ -439,7 +441,13 @@ int nfs_do_filldir(nfs_readdir_descriptor_t *desc, void *di
                *desc->dir_cookie = entry->cookie;
                if (dir_decode(desc) != 0) {
                        desc->page_index ++;
+/* XXXRL HACK to test */
+if (desc->page_index == 1) {
+       printk("Invalidating page 1\n");
+       invalidate_inode_pages2_range(inode->i_mapping, 1, 1);
+}
                        break;
+
                }
                if (loop_count++ > 200) {
                        loop_count = 0;

Unfortunately, the server side instrumentation was based on ONTAP 8 cluster-mode code.  It has not yet been released, and may not be straightforward for you to install in-house.  I see two options:

1) Instrument the Linux NFS server to do the equivalent behavior.  
2) Send me a binary (perhaps a virtual machine image) with the fix and the above instrumentation, and I can test it internally.  

#2 is definitely more expedient.

> I'd also appreciate your opinion on the possible
> patchset in comment #3.  

Seems reasonable.

Comment 7 Ricardo Labiaga 2009-08-05 00:51:34 UTC
P.S. I've already tested the recommended fix and it indeed addressed the problem.

Comment 8 Jeff Layton 2009-08-05 17:44:41 UTC
Thanks for the info Ricardo. I'll see about getting a test kernel rolled up soon with your fault injection code. Do you have a RHEL5 host set up for it already? If so, what kernel variant is it using?

I also understand now why the patch for bug 364351 doesn't help: we're not deallocating the dentry in this case, so dentry_iput doesn't get called.

We'd prefer to have a way to independently verify this, but it doesn't appear to be trivial to do a similar fault injection patch for knfsd. I'll look over the code a bit more before I give up on it, but it's probably not going to be worth it.

Comment 9 Ricardo Labiaga 2009-08-06 16:59:59 UTC
(In reply to comment #8)
> Do you have a RHEL5 host set up for it
> already? If so, what kernel variant is it using?

RHEL5.3: 2.6.18-128.el5 #1 SMP Wed Dec 17 11:42:39 EST 2008 i686 i686 i386 GNU/Linux

Comment 10 Jeff Layton 2009-08-12 16:40:11 UTC
Created attachment 357200 [details]
patchset -- two patches + fault injection patch

Hi Ricardo,
Sorry for the delay. I built some test kernels with the attached patchset and put the i686 and srpm here:

http://people.redhat.com/jlayton/bz511170/

They have the patchset in comment #3 + the fault injection patch that you posted in comment #6. Can you test this kernel and verify that it fixes the problem?

Let me know if you need a different arch or kernel variant.

Comment 11 Ricardo Labiaga 2009-08-12 21:39:08 UTC
Thanks Jeff.  I verified that kernel-2.6.18-162.el5.bz511170.1.i686.rpm indeed addresses the problem when mounting the instrumented ONTAP filer.

I'll go ahead and have our QA group test on one of the RHEL5.3 machines that was failing in the "test production" and verify that it indeed addresses the problem in "real life".

Comment 12 Jeff Layton 2009-08-17 12:52:46 UTC
Thanks Ricardo. Let me know what your QA group finds.

Comment 13 Jeff Layton 2009-08-20 13:19:47 UTC
Ricardo, any feedback from your QA folks?

Comment 14 Ricardo Labiaga 2009-08-21 16:43:41 UTC
I hope to hear from them soon.

Comment 15 Ricardo Labiaga 2009-08-21 19:06:34 UTC
I should know more in about two weeks, when the person who owns the test infrastructure returns to the office to test the patch.  As I mentioned before, I verified the fix in my "fault injection" environment, but it would be good to verify it in the environment that triggered the initial problem.

Comment 16 Jeff Layton 2009-08-21 19:18:47 UTC
Thanks Ricardo. No huge hurry from our standpoint. The patch deadline for the next RHEL5 release is still quite a ways out.

Comment 17 Ricardo Labiaga 2009-08-24 18:10:45 UTC
Jeff, can you provide a 64-bit patch for test in our QA environment?  Thanks.

Comment 18 Jeff Layton 2009-08-24 18:22:59 UTC
I assume you mean an x86_64 kernel package? If so, then yes, but it'll take a day or so to rebuild. The build system has already cleaned out my last one.

Comment 19 Ricardo Labiaga 2009-08-24 18:25:01 UTC
Yes, x86_64.  Thanks.

Comment 20 Ricardo Labiaga 2009-08-27 17:29:21 UTC
Our QA team was unable to reproduce the problem after applying this patch.  Looks good to us, thanks for addressing this problem so rapidly!

Comment 22 Bikash 2009-08-28 20:58:43 UTC
Can we get a patch for RHEL5.3 as well, while the patch is going to be available in RHEL5.4?

Comment 23 Jeff Layton 2009-09-22 14:59:20 UTC
(In reply to comment #22)
> Can we get a patch for RHEL5.3 as well, while the patch is going to be
> available in RHEL5.4?  

Sorry Bikash, missed your request...

The same patch will apply to 5.3 kernels with a little fuzz. YMMV with that patch, of course...

Comment 24 Ricardo Labiaga 2009-09-23 17:46:33 UTC
Thanks Jeff.
I think the question was more along the lines of an official patch.  Can you
integrate this fix into your patch stream for 5.3 and 5.4?

Comment 25 Jeff Layton 2009-09-23 18:01:55 UTC
It would be better for that sort of request to bubble up from our support folks. Andrius, can you help drive that?

Comment 26 Andrius Benokraitis 2009-09-23 19:09:54 UTC
Jeff - there is no official path for NetApp through RH Support since they do not have TAM coverage. Let me bring in jwest, maybe he can approve this for errata streams...

Comment 27 Jeremy West 2009-09-23 20:14:28 UTC
This may be a good idea to have in 5.3. While the number of customers we have using 5.3 extended update support is relatively small, most of them are customers who would be interested in having this patch.  I'll approve this from the support side for 5.3 and 5.4.  This still needs additional approvals, however, and isn't a guarantee.

Thanks
Jeremy West

Comment 30 RHEL Program Management 2009-09-25 17:40:49 UTC
This request was evaluated by Red Hat Product Management for inclusion in a Red
Hat Enterprise Linux maintenance release.  Product Management has requested
further review of this request by Red Hat Engineering, for potential
inclusion in a Red Hat Enterprise Linux Update release for currently deployed
products.  This request is not yet committed for inclusion in an Update
release.

Comment 33 Don Zickus 2009-10-06 19:38:25 UTC
in kernel-2.6.18-168.el5
You can download this test kernel from http://people.redhat.com/dzickus/el5

Please do NOT transition this bugzilla state to VERIFIED until our QE team
has sent specific instructions indicating when to do so.  However feel free
to provide a comment indicating that this fix has been verified.

Comment 35 Ricardo Labiaga 2009-10-15 22:26:14 UTC
We are unable to reproduce the problem with the 2.6.18-168.el5 kernel on RHEL5.3.  So the fix is verified as far as we're concerned.

Comment 36 Andrius Benokraitis 2009-10-16 03:07:32 UTC
Ricardo - thanks for testing this! This will be closed when it is delivered in RHEL 5.5. Thanks again!

Comment 37 Jeff Layton 2010-03-23 10:29:12 UTC
*** Bug 570061 has been marked as a duplicate of this bug. ***

Comment 39 errata-xmlrpc 2010-03-30 06:54:12 UTC
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2010-0178.html