Bug 1740954

Summary: 5.1.20 regression in NFS4 directory listing
Product: [Fedora] Fedora Reporter: Jason Tibbitts <j>
Component: kernel Assignee: Kernel Maintainer List <kernel-maint>
Status: CLOSED INSUFFICIENT_DATA QA Contact: Fedora Extras Quality Assurance <extras-qa>
Severity: unspecified Docs Contact:
Priority: unspecified    
Version: 30 CC: airlied, bskeggs, hdegoede, ichavero, itamar, jarodwilson, jeremy, jglisse, john.j5live, jonathan, josef, kernel-maint, labbott, linville, masami256, mchehab, mjg59, nfs-maint, pasik, steved
Target Milestone: ---   
Target Release: ---   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2020-03-25 22:28:35 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:

Description Jason Tibbitts 2019-08-14 00:54:13 UTC
I've reported this to linux-net though the messages don't appear in the spinics.net archive yet.  I figured I would report here just to increase the chance that the issue is seen by someone who might understand what's happening.

Basically, 5.1.20 introduced a regression in the NFS4 client which manifests as one of my users not being able to list the complete contents of his home directory (which contains nearly 8000 individual files for whatever reason):

[root@ld00 ~]# ls -l ~dblecher|wc -l
ls: reading directory '/home/dblecher': Input/output error
1844
[root@ld00 ~]# cat /proc/version
Linux version 5.1.20-300.fc30.x86_64 (mockbuild.fedoraproject.org) (gcc version 9.1.1 20190503 (Red Hat 9.1.1-1) (GCC)) #1 SMP Fri Jul 26 15:03:11 UTC 2019
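
For anyone who wants to see how far a listing gets before it fails, here is a minimal Python sketch. It is not part of the original report, and the function name count_entries is made up for illustration. It only walks the directory entries; it does not perform the per-file stat that ls -l does, so the count will not necessarily match the wc -l figure above.

```python
import errno
import os


def count_entries(path):
    """Count directory entries in `path`.

    Returns (count, error): `error` is None when the whole directory was
    read, or the OSError that interrupted the listing.
    """
    count = 0
    try:
        with os.scandir(path) as entries:
            for _ in entries:
                count += 1
    except OSError as exc:
        # An EIO here would correspond to the "Input/output error" from ls.
        return count, exc
    return count, None


def describe(count, error):
    """Format the result of count_entries() as a one-line summary."""
    if error is None:
        return f"{count} entries, listing completed"
    name = errno.errorcode.get(error.errno, "unknown errno")
    return f"{count} entries, then {name}: {error.strerror}"
```

Usage would be something like print(describe(*count_entries("/home/dblecher"))) on an affected client.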

Mount options are: nfs4 rw,relatime,vers=4.2,rsize=1048576,wsize=1048576,namlen=255,hard,proto=tcp,timeo=600,retrans=2,sec=krb5i
The server is running CentOS 7 (kernel 3.10.0-957.12.2.el7.x86_64).

Nothing is logged, on the server or the client, when this happens.  I don't know what is special about that directory; it's not merely the size, because I can't reproduce by just touching ten thousand randomly named files.  But any client running anything newer than 5.1.19 (including 5.2.8) will show the issue if I just run ls on his directory.  I haven't tried any 5.3 RCs but I can tomorrow.
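
The reproduction attempt mentioned above (touching many randomly named files) could be scripted roughly as follows. This is a sketch under assumptions, not the command that was actually run: the function name, the hex naming scheme and the name length are all hypothetical. As noted, populating a directory this way did not trigger the error, so this only helps rule out directory size as the cause.

```python
import os
import secrets


def make_random_files(directory, count, name_bytes=8):
    """Create `count` empty files with random hex names in `directory`.

    Returns the number of files created.
    """
    os.makedirs(directory, exist_ok=True)
    created = 0
    while created < count:
        name = secrets.token_hex(name_bytes)
        try:
            # O_EXCL turns a name collision into an error instead of
            # silently reusing an existing file.
            fd = os.open(
                os.path.join(directory, name),
                os.O_CREAT | os.O_EXCL | os.O_WRONLY,
                0o644,
            )
        except FileExistsError:
            continue  # extremely unlikely; just pick another name
        os.close(fd)
        created += 1
    return created
```

For example, make_random_files("/home/testuser/many", 10000) on the NFS mount, followed by ls on that directory from an affected client. The path here is a placeholder.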

I pulled 5.1.19 from koji and it does not have the problem.  Checking the changelog, I saw commit 3536b79ba75ba44b9ac1a9f1634f2e833bbb735c:
  Revert "NFS: readdirplus optimization by cache mechanism" (memleak)
So I built a new RPM with the revert reverted and it does not have the issue.

I'm not sure where to go from here.  According to the commit description, the revert fixes a real issue so I guess simply undoing the revert isn't the right answer.  But maybe it wasn't a clean revert and the fixup introduces some other issue.  I don't really know.

Comment 1 Laura Abbott 2019-08-14 12:58:50 UTC
we probably need to talk to the NFS team upstream to do anything

Comment 2 Jason Tibbitts 2019-08-14 14:31:50 UTC
To be clear, I was mistaken in my original message; I actually emailed the linux-nfs mailing list, not "linux-net" as I wrote.  I also pinged on IRC (#linux-nfs on oftc) but that's not often a useful vehicle for support these days.  Either way, I've seen no response but it's only been a day.  I suppose I could repost with the CC list widened, but I'm not sure who I should CC.  The original patch submission was CC'd all over the place.

Also, the thread has now appeared in the spinics.net archive: https://www.spinics.net/lists/linux-nfs/msg74322.html

Comment 3 Jason Tibbitts 2019-10-04 18:29:05 UTC
The underlying problem was found to be a bug in how Kerberos contexts are parsed.  The issue only showed up with krb5i mounts, so I switched to using krb5p as a workaround.  Patches have been produced ("[PATCH V3 1/2] SUNRPC: Fix buffer handling of GSS MIC without slack"), but I'm not sure if they've made it into any of the main trees yet.  Stable was CC'd but I guess they won't be seen there until after they get pulled into Linus's tree.
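
Since only sec=krb5i mounts showed the problem, a quick way to see which mounts on a client might be affected is to scan the mount table. The following is a minimal sketch, assuming the text is in /proc/mounts format (device, mount point, filesystem type, options, ...); the function name is made up for illustration and this is not a tool from the kernel or nfs-utils.

```python
def krb5i_nfs4_mounts(mounts_text):
    """Return the mount points of NFSv4 mounts that use sec=krb5i.

    `mounts_text` is expected to be in /proc/mounts format:
    device mountpoint fstype options dump pass
    """
    hits = []
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 4 or fields[2] != "nfs4":
            continue
        # Options are a comma-separated list such as "rw,vers=4.2,sec=krb5i".
        options = fields[3].split(",")
        if "sec=krb5i" in options:
            hits.append(fields[1])
    return hits
```

On a client this could be called as krb5i_nfs4_mounts(open("/proc/mounts").read()). Any mount it reports would be a candidate for the krb5p workaround until a kernel with the fix is installed.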

Comment 4 Jason Tibbitts 2019-10-04 18:34:35 UTC
After digging through the kernel tree I see that this made it into Linus's tree on September 26th:

commit 972a2bf7dfe39ebf49dd47f68d27c416392e53b1
Merge: 7be3cb019db1 a8fd0feeca35
Author: Linus Torvalds <torvalds>
Date:   Thu Sep 26 12:20:14 2019 -0700

    Merge tag 'nfs-for-5.4-1' of git://git.linux-nfs.org/projects/anna/linux-nfs

    Pull NFS client updates from Anna Schumaker:
     "Stable bugfixes:
       - Dequeue the request from the receive queue while we're re-encoding
         # v4.20+
       - Fix buffer handling of GSS MIC without slack # 5.1
[...]


This did not make 5.2.18 but... maybe 5.2.19?

Comment 5 Justin M. Forbes 2020-03-03 16:31:20 UTC
*********** MASS BUG UPDATE **************

We apologize for the inconvenience.  There are a large number of bugs to go through and several of them have gone stale.  Due to this, we are doing a mass bug update across all of the Fedora 30 kernel bugs.

Fedora 30 has now been rebased to 5.5.7-100.fc30.  Please test this kernel update (or newer) and let us know if your issue has been resolved or if it is still present with the newer kernel.

If you have moved on to Fedora 31, and are still experiencing this issue, please change the version to Fedora 31.

If you experience different issues, please open a new bug report for those.

Comment 6 Justin M. Forbes 2020-03-25 22:28:35 UTC
*********** MASS BUG UPDATE **************
This bug is being closed with INSUFFICIENT_DATA as there has not been a response in 3 weeks. If you are still experiencing this issue, please reopen and attach the relevant data from the latest kernel you are running and any data that might have been requested previously.