Bug 1740954 - 5.1.20 regression in NFS4 directory listing
Summary: 5.1.20 regression in NFS4 directory listing
Keywords:
Status: CLOSED INSUFFICIENT_DATA
Alias: None
Product: Fedora
Classification: Fedora
Component: kernel
Version: 30
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: unspecified
Target Milestone: ---
Assignee: Kernel Maintainer List
QA Contact: Fedora Extras Quality Assurance
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2019-08-14 00:54 UTC by Jason Tibbitts
Modified: 2020-09-11 02:25 UTC (History)

Fixed In Version:
Doc Text:
Clone Of:
Environment:
Last Closed: 2020-03-25 22:28:35 UTC
Type: Bug
Embargoed:



Description Jason Tibbitts 2019-08-14 00:54:13 UTC
I've reported this to linux-net though the messages don't appear in the spinics.net archive yet.  I figured I would report here just to increase the chance that the issue is seen by someone who might understand what's happening.

Basically, 5.1.20 introduced a regression in the NFS4 client which manifests as one of my users not being able to list the complete contents of his home directory (which contains nearly 8000 individual files for whatever reason):

[root@ld00 ~]# ls -l ~dblecher|wc -l
ls: reading directory '/home/dblecher': Input/output error
1844
[root@ld00 ~]# cat /proc/version
Linux version 5.1.20-300.fc30.x86_64 (mockbuild.fedoraproject.org) (gcc version 9.1.1 20190503 (Red Hat 9.1.1-1) (GCC)) #1 SMP Fri Jul 26 15:03:11 UTC 2019

Mount options are: nfs4 rw,relatime,vers=4.2,rsize=1048576,wsize=1048576,namlen=255,hard,proto=tcp,timeo=600,retrans=2,sec=krb5i
The server is running CentOS 7 (kernel 3.10.0-957.12.2.el7.x86_64).

Nothing is logged, on the server or the client, when this happens.  I don't know what is special about that directory; it's not merely the size, because I can't reproduce by just touching ten thousand randomly named files.  But any client running anything newer than 5.1.19 (including 5.2.8) will show the issue if I just run ls on his directory.  I haven't tried any 5.3 RCs but I can tomorrow.
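
For reference, the failed reproduction attempt can be sketched roughly as below. This is a minimal sketch, not the exact commands that were run: fill_dir and count_entries are hypothetical helper names, and the scratch path in the usage comment is an assumption. It fills a directory with randomly named empty files and then checks whether ls can read all of them back.

```shell
#!/bin/sh
# Hypothetical helpers sketching the reproduction attempt.

# fill_dir <dir> <count>: create <count> randomly named empty files.
fill_dir() (
    dir=$1
    count=$2
    i=0
    mkdir -p "$dir" || exit 1
    while [ "$i" -lt "$count" ]; do
        # mktemp picks a random, unused name for each file
        mktemp "$dir/file.XXXXXXXXXX" >/dev/null || exit 1
        i=$((i + 1))
    done
)

# count_entries <dir>: print the number of entries ls can read.
# Fails if ls exits non-zero, as it does when reading the directory
# returns an error such as "Input/output error".
count_entries() (
    listing=$(ls -A "$1") || exit 1
    if [ -z "$listing" ]; then
        echo 0
    else
        printf '%s\n' "$listing" | wc -l
    fi
)

# Usage (the path is an assumption; run it on the affected NFS mount):
#   fill_dir /home/scratch/repro 10000
#   count_entries /home/scratch/repro || echo "listing failed"
```

As noted above, a directory populated this way listed without errors, so this serves only as a negative control: entry count alone does not trigger the failure.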

I pulled 5.1.19 from koji and it does not have the problem.  Checking the changelog, I saw commit 3536b79ba75ba44b9ac1a9f1634f2e833bbb735c:
  Revert "NFS: readdirplus optimization by cache mechanism" (memleak)
So I built a new RPM with the revert reverted, and it does not have the issue.

I'm not sure where to go from here.  According to the commit description, the revert fixes a real issue so I guess simply undoing the revert isn't the right answer.  But maybe it wasn't a clean revert and the fixup introduces some other issue.  I don't really know.

Comment 1 Laura Abbott 2019-08-14 12:58:50 UTC
We probably need to talk to the NFS team upstream to do anything.

Comment 2 Jason Tibbitts 2019-08-14 14:31:50 UTC
To be clear, I was mistaken in my original message; I actually emailed the linux-nfs mailing list, not "linux-net" as I wrote.  I also pinged on IRC (#linux-nfs on oftc) but that's not often a useful vehicle for support these days.  Either way, I've seen no response but it's only been a day.  I suppose I could repost with the CC list widened, but I'm not sure who I should CC.  The original patch submission was CC'd all over the place.

Also, the thread has now appeared in the spinics.net archive: https://www.spinics.net/lists/linux-nfs/msg74322.html

Comment 3 Jason Tibbitts 2019-10-04 18:29:05 UTC
The underlying problem was found to be a bug in how Kerberos contexts are parsed.  The issue only showed up with krb5i mounts, so I switched to using krb5p as a workaround.  Patches have been produced ("[PATCH V3 1/2] SUNRPC: Fix buffer handling of GSS MIC without slack"), but I'm not sure if they've made it into any of the main trees yet.  Stable was CC'd but I guess they won't be seen there until after they get pulled into Linus's tree.
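
Since only krb5i mounts were affected, one way to find candidates for the krb5p workaround on a client is to scan the mount table for NFSv4 mounts using sec=krb5i. The following is a minimal sketch under assumptions: list_krb5i_mounts is a hypothetical helper name, and it relies on the usual /proc/mounts column layout (device, mount point, filesystem type, options).

```shell
#!/bin/sh
# list_krb5i_mounts [file]: print the mount point of every nfs4 mount
# whose option list contains sec=krb5i. Reads /proc/mounts by default;
# a different file in the same format can be passed for testing.
list_krb5i_mounts() {
    awk '$3 == "nfs4" && $4 ~ /(^|,)sec=krb5i(,|$)/ { print $2 }' \
        "${1:-/proc/mounts}"
}

# Usage:
#   list_krb5i_mounts
```

Any mount point it prints would be a candidate for mounting with sec=krb5p instead, provided the server's export also permits that flavor.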

Comment 4 Jason Tibbitts 2019-10-04 18:34:35 UTC
After digging through the kernel tree I see that this made it into Linus's tree on September 26th:

commit 972a2bf7dfe39ebf49dd47f68d27c416392e53b1
Merge: 7be3cb019db1 a8fd0feeca35
Author: Linus Torvalds <torvalds>
Date:   Thu Sep 26 12:20:14 2019 -0700

    Merge tag 'nfs-for-5.4-1' of git://git.linux-nfs.org/projects/anna/linux-nfs

    Pull NFS client updates from Anna Schumaker:
     "Stable bugfixes:
       - Dequeue the request from the receive queue while we're re-encoding
         # v4.20+
       - Fix buffer handling of GSS MIC without slack # 5.1
[...]


This did not make 5.2.18 but... maybe 5.2.19?
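
Stable backports get a different commit hash than the original in Linus's tree, so checking whether a given stable release carries the fix is easier by commit subject than by hash. A minimal sketch, assuming a local clone of the stable tree: has_commit_subject is a hypothetical helper name, and the clone path in the usage comment is an assumption.

```shell
#!/bin/sh
# has_commit_subject <repo> <rev> <text>: succeed if any commit
# reachable from <rev> in <repo> contains <text> in its commit message.
has_commit_subject() {
    git -C "$1" log --format=%H --fixed-strings --grep="$3" "$2" 2>/dev/null |
        grep -q .
}

# Usage (clone path is an assumption):
#   has_commit_subject ~/src/linux-stable v5.2.19 \
#       'SUNRPC: Fix buffer handling of GSS MIC without slack' &&
#       echo "fix is present"
```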

Comment 5 Justin M. Forbes 2020-03-03 16:31:20 UTC
*********** MASS BUG UPDATE **************

We apologize for the inconvenience.  There are a large number of bugs to go through and several of them have gone stale.  Due to this, we are doing a mass bug update across all of the Fedora 30 kernel bugs.

Fedora 30 has now been rebased to 5.5.7-100.fc30.  Please test this kernel update (or newer) and let us know if your issue has been resolved or if it is still present with the newer kernel.

If you have moved on to Fedora 31, and are still experiencing this issue, please change the version to Fedora 31.

If you experience different issues, please open a new bug report for those.

Comment 6 Justin M. Forbes 2020-03-25 22:28:35 UTC
*********** MASS BUG UPDATE **************
This bug is being closed with INSUFFICIENT_DATA as there has not been a response in 3 weeks. If you are still experiencing this issue, please reopen and attach the relevant data from the latest kernel you are running and any data that might have been requested previously.

