I've reported this to linux-net, though the messages don't appear in the spinics.net archive yet. I figured I would report here as well, just to increase the chance that the issue is seen by someone who might understand what's happening.

Basically, 5.1.20 introduced a regression in the NFS4 client which manifests as one of my users not being able to list the complete contents of his home directory (which contains nearly 8000 individual files for whatever reason):

[root@ld00 ~]# ls -l ~dblecher|wc -l
ls: reading directory '/home/dblecher': Input/output error
1844
[root@ld00 ~]# cat /proc/version
Linux version 5.1.20-300.fc30.x86_64 (mockbuild.fedoraproject.org) (gcc version 9.1.1 20190503 (Red Hat 9.1.1-1) (GCC)) #1 SMP Fri Jul 26 15:03:11 UTC 2019

Mount options are:

nfs4 rw,relatime,vers=4.2,rsize=1048576,wsize=1048576,namlen=255,hard,proto=tcp,timeo=600,retrans=2,sec=krb5i

The server is running CentOS 7 (kernel 3.10.0-957.12.2.el7.x86_64). Nothing is logged, on the server or the client, when this happens.

I don't know what is special about that directory; it's not merely the size, because I can't reproduce the problem by just touching ten thousand randomly named files. But any client running anything newer than 5.1.19 (including 5.2.8) will show the issue if I just run ls on his directory. I haven't tried any 5.3 RCs but I can tomorrow.

I pulled 5.1.19 from koji and it does not have the problem. Checking the changelog, I saw commit 3536b79ba75ba44b9ac1a9f1634f2e833bbb735c:

Revert "NFS: readdirplus optimization by cache mechanism" (memleak)

So I built a new RPM with the revert reverted, and it does not have the issue.

I'm not sure where to go from here. According to the commit description, the revert fixes a real issue, so I guess simply undoing the revert isn't the right answer. But maybe it wasn't a clean revert and the fixup introduces some other issue. I don't really know.
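For anyone who wants to check a directory for the same symptom without relying on ls and wc, here is a minimal Python sketch. It only counts how many entries can be read before the listing fails, which is roughly what the "1844" above shows. The function name count_entries is made up for illustration, and whether the error surfaces as EIO in this exact way on an affected client is an assumption based on the ls output above.

```python
import os


def count_entries(path):
    """Count directory entries, stopping at the first OS-level error.

    Returns (entries_read, error). error is None when the whole directory
    was listed, otherwise the OSError that interrupted the listing.
    """
    count = 0
    try:
        # os.scandir reads the directory incrementally, so an error raised
        # part-way through still leaves us with the count read so far.
        with os.scandir(path) as entries:
            for _ in entries:
                count += 1
    except OSError as exc:
        return count, exc
    return count, None
```

Calling count_entries("/home/dblecher") on a good kernel should return the full count and None; on an affected one, the second element would presumably be an OSError carrying errno EIO.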
We probably need to talk to the NFS team upstream to do anything.
To be clear, I was mistaken in my original message; I actually emailed the linux-nfs mailing list, not "linux-net" as I wrote. I also pinged on IRC (#linux-nfs on oftc) but that's not often a useful vehicle for support these days. Either way, I've seen no response but it's only been a day. I suppose I could repost with the CC list widened, but I'm not sure who I should CC. The original patch submission was CC'd all over the place. Also, the thread has now appeared in the spinics.net archive: https://www.spinics.net/lists/linux-nfs/msg74322.html
The underlying problem was found to be a bug in how Kerberos contexts are parsed. The issue only showed up with krb5i mounts, so I switched to using krb5p as a workaround. Patches have been produced ("[PATCH V3 1/2] SUNRPC: Fix buffer handling of GSS MIC without slack"), but I'm not sure if they've made it into any of the main trees yet. Stable was CC'd, but I guess the patches won't be picked up there until after they get pulled into Linus's tree.
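Since only krb5i mounts were affected, it may help to find which mounts on a client use that security flavor. The following is a rough Python sketch that scans text in /proc/mounts format for NFS mounts with sec=krb5i. The function name krb5i_nfs_mounts and the sample line (including the server name) are hypothetical; the option string is the one quoted earlier in this report.

```python
def krb5i_nfs_mounts(mounts_text):
    """Return mount points of NFS mounts that use sec=krb5i.

    mounts_text is expected in /proc/mounts format:
    device mountpoint fstype options dump pass
    """
    hits = []
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 4:
            continue  # skip blank or malformed lines
        mountpoint, fstype = fields[1], fields[2]
        options = fields[3].split(",")
        if fstype in ("nfs", "nfs4") and "sec=krb5i" in options:
            hits.append(mountpoint)
    return hits


# Hypothetical sample line; "server:/home" is a placeholder.
sample = (
    "server:/home /home nfs4 rw,relatime,vers=4.2,rsize=1048576,"
    "wsize=1048576,namlen=255,hard,proto=tcp,timeo=600,retrans=2,"
    "sec=krb5i 0 0\n"
)
print(krb5i_nfs_mounts(sample))
```

On a live system the input would be the contents of /proc/mounts. Mounts listed by this check are the ones that would need the krb5p workaround until a fixed kernel is installed.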
After digging through the kernel tree I see that this made it into Linus's tree on September 26th:

commit 972a2bf7dfe39ebf49dd47f68d27c416392e53b1
Merge: 7be3cb019db1 a8fd0feeca35
Author: Linus Torvalds <torvalds>
Date:   Thu Sep 26 12:20:14 2019 -0700

    Merge tag 'nfs-for-5.4-1' of git://git.linux-nfs.org/projects/anna/linux-nfs

    Pull NFS client updates from Anna Schumaker:
     "Stable bugfixes:
       - Dequeue the request from the receive queue while we're re-encoding # v4.20+
       - Fix buffer handling of GSS MIC without slack # 5.1
    [...]

This did not make 5.2.18 but... maybe 5.2.19?
*********** MASS BUG UPDATE ************** We apologize for the inconvenience. There are a large number of bugs to go through and several of them have gone stale. Due to this, we are doing a mass bug update across all of the Fedora 30 kernel bugs. Fedora 30 has now been rebased to 5.5.7-100.fc30. Please test this kernel update (or newer) and let us know if your issue has been resolved or if it is still present with the newer kernel. If you have moved on to Fedora 31, and are still experiencing this issue, please change the version to Fedora 31. If you experience different issues, please open a new bug report for those.
*********** MASS BUG UPDATE ************** This bug is being closed with INSUFFICIENT_DATA as there has not been a response in 3 weeks. If you are still experiencing this issue, please reopen and attach the relevant data from the latest kernel you are running and any data that might have been requested previously.