Description of problem:
An NFS mount produces a huge number of [NFSv4 callback] processes.

Version-Release number of selected component (if applicable):
Any Fedora 4.9 kernel. No problem with the Fedora 4.8 kernel. In all cases the default NFSv4.2 mount is used.

NFS server: CentOS 6.8 with 2.6.32-642.11.1.el6.x86_64, or CentOS 7.2.1511 with 4.4.29-1.el7.elrepo.x86_64.
/proc/sys/fs/leases-enable contains '1'.

How reproducible:
Sometimes a new [NFSv4 callback] is created while mounting an NFS share. It is unclear to me how to trigger the creation of a new [NFSv4 callback].

Actual results:
root:~ # ps aux|grep 'NFSv4 callback'|wc -l
1290

Expected results:
root:~ # ps aux|grep 'NFSv4 callback'|wc -l
1
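A side note on counting: a `ps aux | grep ... | wc -l` pipeline can also count the grep process itself, so the totals may be off by one. Below is a minimal sketch of a stricter count. It assumes procps `ps` and that the kernel threads are named exactly `NFSv4 callback`; the function name `count_callback_threads` is made up for illustration and is not part of any tool.

```shell
# Hypothetical helper: count lines on stdin that are exactly "NFSv4 callback".
# Intended input is one command name per line, e.g. from `ps -e -o comm=`.
count_callback_threads() {
    # grep -c prints the number of matching lines; -x requires a whole-line
    # match, so a line such as "grep" can never be counted.
    # grep exits non-zero when nothing matches, hence the "|| true".
    grep -c -x 'NFSv4 callback' || true
}

# Usage (not executed here):
#   ps -e -o comm= | count_callback_threads
```

Because the helper only reads stdin, it can also be fed saved `ps` output from another machine.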
Update: The problem seems to occur if the server is on CentOS 6.8 with 2.6.32-642.11.1.el6.x86_64 or with 3.10.104-1.el6.elrepo.x86_64. In these cases NFSv4.0 is used. It is reproducible with something like this:

for i in $(seq 1 100); do
    echo $i
    mount -t nfs4 fs-scratch1:/theorie/scratch1 /tmp/t
    date +%s >>/tmp/t/nfstest
    umount /tmp/t
done

With this code, 100 new 'NFSv4 callback' processes appear.
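For anyone who wants numbers out of the reproducer, here is a rough sketch that wraps the same mount/write/umount cycle and prints the callback thread count before and after. Assumptions: it is run as root, `ps` is the procps one, and the threads are named exactly `NFSv4 callback`. The export and mount point are the ones from the report; the function names `count_callbacks`, `mount_cycle` and `report_leak` are invented here.

```shell
# Count kernel threads whose command name is exactly "NFSv4 callback".
count_callbacks() {
    # grep exits non-zero on zero matches, so keep the status clean.
    ps -e -o comm= | grep -c -x 'NFSv4 callback' || true
}

# One mount/write/umount cycle, using the export and mount point
# from the report. Adjust both for your own setup.
mount_cycle() {
    mount -t nfs4 fs-scratch1:/theorie/scratch1 /tmp/t &&
        date +%s >> /tmp/t/nfstest
    umount /tmp/t
}

# report_leak N: run N cycles and print "before after difference".
report_leak() {
    before=$(count_callbacks)
    i=0
    while [ "$i" -lt "$1" ]; do
        mount_cycle
        i=$((i + 1))
    done
    after=$(count_callbacks)
    echo "$before $after $((after - before))"
}
```

Calling `report_leak 100` as root prints three numbers: the count before, the count after, and the difference. Going by the report above, an affected client would be expected to show a difference of roughly 100.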
I see the same:

% ps -eaf
root 31317 2 0 Feb23 ? 00:00:00 [NFSv4 callback]
root 31318 2 0 Feb23 ? 00:00:00 [NFSv4 callback]
root 31319 2 0 Feb23 ? 00:00:00 [NFSv4 callback]
root 31452 2 0 Feb23 ? 00:00:00 [NFSv4 callback]
root 31453 2 0 Feb23 ? 00:00:00 [NFSv4 callback]
root 31454 2 0 Feb23 ? 00:00:00 [NFSv4 callback]
root 31687 2 0 Feb23 ? 00:00:00 [NFSv4 callback]
root 31688 2 0 Feb23 ? 00:00:00 [NFSv4 callback]
root 31689 2 0 Feb23 ? 00:00:00 [NFSv4 callback]
root 31796 2 0 Feb23 ? 00:00:00 [NFSv4 callback]
root 31797 2 0 Feb23 ? 00:00:00 [NFSv4 callback]
root 31798 2 0 Feb23 ? 00:00:00 [NFSv4 callback]
root 31839 2 0 Feb23 ? 00:00:00 [NFSv4 callback]
root 31840 2 0 Feb23 ? 00:00:00 [NFSv4 callback]
root 31841 2 0 Feb23 ? 00:00:00 [NFSv4 callback]
root 31943 2 0 Feb23 ? 00:00:00 [NFSv4 callback]

% cat /etc/system-release
Fedora release 24 (Twenty Four)
% uname -a
Linux wasa.netnix.ee 4.9.4-100.fc24.x86_64 #1 SMP Tue Jan 17 19:08:56 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux

# cat /etc/system-release
CentOS release 6.8 (Final)
# uname -a
Linux alca.netnix.ee 2.6.32-642.6.2.el6.x86_64 #1 SMP Wed Oct 26 06:52:09 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux
# cat /proc/sys/fs/leases-enable
1
I use autofs, and with this bug my process table fills up in about two weeks. Current uptime is 9 days:

# ps f -elf | grep "NFSv4 callback" | wc -l
30397
There are definitely some relevant changes in that range:

$ git log --pretty=oneline v4.8..v4.9 fs/nfs/callback*
d55b352b01bc NFSv4.x: hide array-bounds warning
a1d617d8f134 nfs: allow blocking locks to be awoken by lock callbacks
db783688d4a2 nfs: add handling for CB_NOTIFY_LOCK in client
b60475c9401b nfs: the length argument to read_buf should be unsigned
5405fc44c337 NFSv4.x: Add kernel parameter to control the callback server
bb6aeba736ba NFSv4.x: Switch to using svc_set_num_threads() to manage the callback threads
3b01c11ee8bf NFSv4.x: Fix up the global tracking of the callback server
d00252688604 SUNRPC: Initialise struct svc_serv backchannel fields during __svc_create()
f4b52bb08426 NFSv4.x: Set up struct svc_serv_ops for the callback channel

That "svc_set_num_threads()" patch would be at the top of my list of suspects.

On a quick skim I don't see any fixes or discussion upstream. Would it be possible to report this upstream? (linux-nfs at vger.kernel.org) If not I'll get to it eventually.
More reports of this in OFTC/#linux-nfs. I think Kinglong Mee posted two patches on January 19th that might fix it up, but I haven't tested them:

[PATCH v2 1/2] NFSv4.x/callback: Create the callback service through svc_create_pooled
[PATCH v2 2/2] NFSv4.x/callback: make sure callback threads are interruptible
*********** MASS BUG UPDATE **************

We apologize for the inconvenience. There is a large number of bugs to go through and several of them have gone stale. Due to this, we are doing a mass bug update across all of the Fedora 25 kernel bugs.

Fedora 25 has now been rebased to 4.10.9-200.fc25. Please test this kernel update (or newer) and let us know if your issue has been resolved or if it is still present with the newer kernel.

If you have moved on to Fedora 26, and are still experiencing this issue, please change the version to Fedora 26.

If you experience different issues, please open a new bug report for those.
Still present in 4.10.8-200.fc25.x86_64
Still present:

Linux example.com 4.10.10-200.fc25.x86_64 #1 SMP Thu Apr 13 01:11:51 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux
% ps -eaf|grep -c NFSv4
163

I think this is the reason why the KDE desktop's plasmashell and dolphin processes constantly hang. Really annoying.
4.10.11 is also buggy
(In reply to Jürgen Holm from comment #10)
> 4.10.11 is also buggy

I confirm, after a boot:

% ps -eaf|grep -c NFSv4
52
(In reply to Jürgen Holm from comment #10)
> 4.10.11 is also buggy

Do you run a desktop on that system, and can you see any anomalies that could be filesystem related? I constantly have issues with plasmashell and dolphin and need to kill them to get new windows to appear or start.
(In reply to Jürgen Holm from comment #10)
> 4.10.11 is also buggy

We'll probably not see a fix until 4.11. I've asked the maintainers to take the two patches from comment 6, which should fix up the problem:
http://marc.info/?l=linux-nfs&m=149303618609704&w=2
I'm still seeing a lot of this as well. I think it would mostly hit people who use the automounter, so the filesystems are constantly coming and going. @bcodding: Do you think those patches are appropriate for a stable release? Were they sent to the stable maintainer or just to the main tree? If the latter, do you really think there's time before 4.11, given how late in the cycle it is? I'm making a local kernel package now so I can test this. I'm happy to share it once it's done.
(In reply to Jason Tibbitts from comment #14)
> I'm still seeing a lot of this as well. I think it would mostly hit people
> who use the automounter, so the filesystems are constantly coming and going.
>
> @bcodding: Do you think those patches are appropriate for a stable release?
> Were they sent to the stable maintainer or just to the main tree? If the
> latter, do you really think there's time before 4.11, given how late in the
> cycle it is?

They weren't sent to stable yet. I think they need to be picked up mainline first. I think it's probably too late to see them in 4.10.
(In reply to Benjamin Coddington from comment #15)
> (In reply to Jason Tibbitts from comment #14)
> > I'm still seeing a lot of this as well. I think it would mostly hit people
> > who use the automounter, so the filesystems are constantly coming and going.
> >
> > @bcodding: Do you think those patches are appropriate for a stable release?
> > Were they sent to the stable maintainer or just to the main tree? If the
> > latter, do you really think there's time before 4.11, given how late in the
> > cycle it is?
>
> They weren't sent to stable yet. I think they need to be picked up mainline
> first. I think it's probably too late to see them in 4.10.

Oh, I am off by one. I mean it is probably too late to see them in 4.11; I hope to have them in 4.12.
Created attachment 1273977 [details]
Patch adding the two NFS patches

For the record, the two relevant patches are
https://www.spinics.net/lists/linux-nfs/msg61696.html and
https://www.spinics.net/lists/linux-nfs/msg61697.html

If you want to build your own kernel, here's a patch which applies those two patches. You may want to change the "%define buildid" to your liking. I've built the resulting package but haven't booted into it yet.
(In reply to Juha Tuomala from comment #12)
> (In reply to Jürgen Holm from comment #10)
> > 4.10.11 is also buggy
>
> Do you run some desktop with that system, can you see any anomalies that
> could/would be filesystem related?

No, I didn't see any issues.

PS: My fix was to upgrade our last NFS server to COS7
(In reply to Jürgen Holm from comment #18)
> ps: My fix was to upgrade our last NFS server to COS7

with kernel-lt-4.4.59-1.el7.elrepo.x86_64
(In reply to Jason Tibbitts from comment #17)
> Created attachment 1273977 [details]
> Patch adding the two NFS patches
>
> For the record, the two relevant patches are
> https://www.spinics.net/lists/linux-nfs/msg61696.html and
> https://www.spinics.net/lists/linux-nfs/msg61697.html
>
> If you want to build your own kernel, here's a patch which applies those two
> patches. You may want to change the "%define buildid" to your liking. I've
> built the resulting package but haven't booted into it yet.

We can bring those patches into Fedora if they solve the issue. The comments here just mentioned that they were a likely fix and hadn't been tested, so I'm looking for your feedback after testing.
(In reply to Jürgen Holm from comment #18)
> ps: My fix was to upgrade our last NFS server to COS7
> with kernel-lt-4.4.59-1.el7.elrepo.x86_64

By "fix", do you mean that the problem disappeared?
(In reply to Justin M. Forbes from comment #20)
> (In reply to Jason Tibbitts from comment #17)
> > If you want to build your own kernel, here's a patch which applies those two
> > patches. You may want to change the "%define buildid" to your liking. I've
> > built the resulting package but haven't booted into it yet.
>
> We can bring those patches into Fedora if they solve the issue. The comments
> here just mentioned that they were a likely fix and hadn't been tested, so
> looking for your feedback after testing.

If someone can provide a test kernel, I can test it and report back. I have no time to modify the spec and build my own.
(In reply to Juha Tuomala from comment #21)
> (In reply to Jürgen Holm from comment #18)
> > ps: My fix was to upgrade our last NFS server to COS7
> > with kernel-lt-4.4.59-1.el7.elrepo.x86_64
>
> By fix you mean, that problem disappeared?

Yes
(In reply to Jürgen Holm from comment #23)
> (In reply to Juha Tuomala from comment #21)
> > (In reply to Jürgen Holm from comment #18)
> > > ps: My fix was to upgrade our last NFS server to COS7
> > > with kernel-lt-4.4.59-1.el7.elrepo.x86_64
> >
> > By fix you mean, that problem disappeared?
>
> Yes

That's interesting, since this bug has been all about fixing the problem on the client side, and you solved it on the server side.
(In reply to Justin M. Forbes from comment #20)
> We can bring those patches into Fedora if they solve the issue. The comments
> here just mentioned that they were a likely fix and hadn't been tested, so
> looking for your feedback after testing.

Well, I've been running the kernel for a bit now. So far there's no proliferation of "[NFSv4 callback]" processes and in fact I can't seem to make more than just one appear. When I unmount my last imported NFS4 filesystem, the process goes away, which is something that doesn't ever seem to happen with the stock Fedora kernel. Of course I'll have to see how it is in a few days.

(In reply to Juha Tuomala from comment #22)
> If someone can provide test kernel, I can test it and report back. I've no
> time to modify spec and build my own one.

I put what I have at https://www.math.uh.edu/~tibbs/patched-kernel/, assuming you trust kernel packages from some random person. They were built locally in mock and are signed with my personal key, which I believe is linked via trust to the main Fedora signing keys if you want to check.

I also kicked off a scratch build at https://koji.fedoraproject.org/koji/taskinfo?taskID=19237773
Hey Jason, Trond is fixing this a different way upstream -- see the thread: http://marc.info/?l=linux-nfs&m=149322214627678&w=2 We should probably be testing those patches instead.
Yeah, I saw the thread yesterday but there was some following discussion and I wasn't sure if bfields was going to take the patches directly. I was already booted into the patched kernel so I figured that reporting my findings couldn't hurt. Are just those two patches ("SUNRPC: Refactor svc_set_num_threads()" and "NFSv4: Fix callback server shutdown") sufficient? I'll get another kernel build started.
Created attachment 1274721 [details]
Patch against 4.10.13 with upstream NFS patches

Attached is an updated kernel spec patch. I started a scratch build at https://koji.fedoraproject.org/koji/taskinfo?taskID=19239218

I have a local mockbuild running as well and will update https://www.math.uh.edu/~tibbs/patched-kernel/ when it finishes.
Trond's patches fix it for me on the head of Linus's tree.
No issues so far with that kernel. I have been doing mounts from five servers and unmounting them in random orders and there's not been more than a single "[NFSv4 callback]" thread which exits when the last mount goes away.
So last night there was an nfs-utils update and as the service was restarting I got an oops (actually a "divide error") resulting in a broken NFS server; a reboot was required to get NFS services working but otherwise the machine was fine (including client NFS). I only mention this here on the off chance that the problem is related to the two patches I applied. I don't think it is, but I'm just not familiar enough with the internals to know for sure. Here's the log:

divide error: 0000 [#1] SMP
Modules linked in: nfsv4 dns_resolver nfs fscache rfcomm rpcsec_gss_krb5 cmac nf_conntrack tpm nfsd nfs_acl lockd grace auth_rpcgss sunrpc binfmt_misc xfs libcrc32c hid_logitech_hi
CPU: 7 PID: 17065 Comm: rpc.nfsd Not tainted 4.10.13-200.uh.1.fc25.x86_64 #1
Hardware name: To Be Filled By O.E.M. To Be Filled By O.E.M./X99E-ITX/ac, BIOS P3.40 08/03
task: ffff9481ebe60000 task.stack: ffffb7954d298000
RIP: 0010:svc_pool_for_cpu+0x2b/0x80 [sunrpc]
RSP: 0018:ffffb7954d29bc18 EFLAGS: 00010246
RAX: 0000000000000000 RBX: ffff947e17384000 RCX: ffff9482ba94d228
RDX: 0000000000000000 RSI: 0000000000000007 RDI: ffff9482ba94d200
RBP: ffffb7954d29bc18 R08: ffff9482ba94d228 R09: 0000000000018783
R10: ffffe727408c4580 R11: 0000000000000000 R12: ffff947e17384000
R13: ffff947e17384018 R14: ffff9482ba94d200 R15: ffff9482ba94d210
FS: 00007fb243703c40(0000) GS:ffff9482df3c0000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007ffe581110a8 CR3: 0000001044108000 CR4: 00000000003406e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
 svc_xprt_do_enqueue+0xf2/0x2a0 [sunrpc]
 svc_xprt_received+0x51/0xb0 [sunrpc]
 svc_add_new_perm_xprt+0x76/0x90 [sunrpc]
 svc_addsock+0x14b/0x200 [sunrpc]
 ? recalc_sigpending+0x1b/0x50
 ? __getnstimeofday64+0x41/0xd0
 ? do_gettimeofday+0x29/0x90
 write_ports+0x255/0x2c0 [nfsd]
 ? _copy_from_user+0x4e/0x80
 ? write_recoverydir+0x100/0x100 [nfsd]
 nfsctl_transaction_write+0x48/0x80 [nfsd]
 __vfs_write+0x37/0x160
 ? selinux_file_permission+0xd7/0x110
 ? security_file_permission+0x3b/0xc0
 vfs_write+0xb5/0x1a0
 SyS_write+0x55/0xc0
 entry_SYSCALL_64_fastpath+0x1a/0xa9
RIP: 0033:0x7fb24301dc30
RSP: 002b:00007ffe580d3cd8 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
RAX: ffffffffffffffda RBX: 0000000000000002 RCX: 00007fb24301dc30
RDX: 0000000000000002 RSI: 0000561dd3a33640 RDI: 0000000000000003
RBP: 00007ffe580d3cd0 R08: 0000000000000001 R09: 0000000000000002
R10: 0000000000000064 R11: 0000000000000246 R12: 0000000000000004
R13: 0000561dd3cbd7c0 R14: 0000561dd3cbd740 R15: 00007ffe580d3788
Code: 0f 1f 44 00 00 48 8b 87 98 00 00 00 55 48 89 e5 48 83 78 08 00 74 10 8b 05 57 52 02
RIP: svc_pool_for_cpu+0x2b/0x80 [sunrpc] RSP: ffffb7954d29bc18
---[ end trace 75c980265f0dfd0d ]---
(In reply to Jason Tibbitts from comment #31)
> So last night there was an nfs-utils update and as the service was
> restarting I got an oops (actually a "divide error") resulting in a broken
> NFS server; a reboot was required to get NFS services working but otherwise
> the machine was fine (including client NFS). I only mention this here on
> the off chance that the problem is related to the two patches I applied. I
> don't think it is, but I'm just not familiar enough with the internals to
> know for sure. Here's the log:

See Kinglong Mee's df807fffaabd "NFSv4.x/callback: Create the callback service through svc_create_pooled" in my -next tree.
I'm getting the impression I'd be better off just waiting for 4.12. The bug is annoying but not a deal breaker.
I probably got bitten by this *again*. A desktop that is left running constantly accumulates these NFSv4 processes. I made a phone call, needed to write a couple of numbers down in kwrite, pressed Save and it hung; soon the whole desktop crashed. dmesg shows errors about a Q****dbus crash. Embarrassing to call back for the same thing :-(

How come this kind of bug gets introduced so easily, but there is no hurry to roll it back or fix it? Can't that commit just be reverted?
For those seeing hangs (Juha), can you look at my new bug #1455086 to see if you're having the same problem? I'm not seeing more than one callback thread in ps, so I don't think I'm seeing this bug, but any hangs people have might be related. Something was broken in F24's 4.11.x kernel; the 4.10.x series does not have my bug. I'm starting a vanilla bisect now, but at 14 steps it's going to take a couple of weeks.
This message is a reminder that Fedora 25 is nearing its end of life. Approximately 4 (four) weeks from now Fedora will stop maintaining and issuing updates for Fedora 25. It is Fedora's policy to close all bug reports from releases that are no longer maintained. At that time this bug will be closed as EOL if it remains open with a Fedora 'version' of '25'.

Package Maintainer: If you wish for this bug to remain open because you plan to fix it in a currently maintained version, simply change the 'version' to a later Fedora version.

Thank you for reporting this issue and we are sorry that we were not able to fix it before Fedora 25 is end of life. If you would still like to see this bug fixed and are able to reproduce it against a later version of Fedora, you are encouraged to change the 'version' to a later Fedora version before this bug is closed as described in the policy above.

Although we aim to fix as many bugs as possible during every release's lifetime, sometimes those efforts are overtaken by events. Often a more recent Fedora release includes newer upstream software that fixes bugs or makes them obsolete.
Fedora 25 changed to end-of-life (EOL) status on 2017-12-12. Fedora 25 is no longer maintained, which means that it will not receive any further security or bug fix updates. As a result we are closing this bug.

If you can reproduce this bug against a currently maintained version of Fedora please feel free to reopen this bug against that version. If you are unable to reopen this bug, please file a new report against the current release. If you experience problems, please add a comment to this bug.

Thank you for reporting this bug and we are sorry it could not be fixed.