Bug 1427493 (nfscallback) - kernel 4.9/10: NFSv4 callback processes fills process table
Summary: kernel 4.9/10: NFSv4 callback processes fills process table
Keywords:
Status: CLOSED EOL
Alias: nfscallback
Product: Fedora
Classification: Fedora
Component: kernel
Version: 25
Hardware: All
OS: Linux
Priority: unspecified
Severity: high
Target Milestone: ---
Assignee: Kernel Maintainer List
QA Contact: Fedora Extras Quality Assurance
URL:
Whiteboard:
Depends On:
Blocks: 1451996
Reported: 2017-02-28 11:30 UTC by Jürgen Holm
Modified: 2019-01-09 12:54 UTC

Fixed In Version:
Clone Of:
Environment:
Last Closed: 2017-12-12 10:04:56 UTC
Type: Bug
Embargoed:


Attachments
Patch adding the two NFS patches (5.70 KB, patch)
2017-04-25 16:40 UTC, Jason Tibbitts
Patch against 4.10.13 with upstream NFS patches (10.53 KB, patch)
2017-04-27 15:48 UTC, Jason Tibbitts

Description Jürgen Holm 2017-02-28 11:30:12 UTC
Description of problem:
nfs mount produces a huge number of 

 [NFSv4 callback]

processes.

Version-Release number of selected component (if applicable):
Any Fedora 4.9 kernel.
No problem with the Fedora 4.8 kernel.
In all cases the default NFSv4.2 mount is used.

NFS Server: 
CentOS 6.8 with 2.6.32-642.11.1.el6.x86_64 or
CentOS 7.2.1511 with 4.4.29-1.el7.elrepo.x86_64
/proc/sys/fs/leases-enable contains '1'

How reproducible:
Sometimes a new [NFSv4 callback] process is created while mounting an NFS share.
It's unclear to me how to trigger the creation of a new [NFSv4 callback] process.

Actual results:
root:~ # ps aux|grep 'NFSv4 callback'|wc -l
1290


Expected results:
root:~ # ps aux|grep 'NFSv4 callback'|wc -l
1
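
Note that `ps aux | grep 'NFSv4 callback'` can also match the grep process itself, so the counts above may be one higher than the real number of threads. A minimal sketch of a stricter count follows. It assumes the kernel thread's command name is exactly `NFSv4 callback` (as the bracketed `ps` output suggests); the function name `count_callback_threads` is hypothetical.

```shell
# Count lines on stdin that are exactly "NFSv4 callback".
# Input is expected to be one command name per line, e.g. from `ps -e -o comm=`.
count_callback_threads() {
    awk '
        { sub(/[ \t]+$/, "") }            # drop trailing padding, if any
        $0 == "NFSv4 callback" { n++ }    # exact match only, so grep/awk never count
        END { print n + 0 }               # print 0 when nothing matched
    '
}
```

On a live system it could be used as `ps -e -o comm= | count_callback_threads`.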

Comment 1 Jürgen Holm 2017-02-28 12:18:33 UTC
Update:

The problem seems to occur if the server runs CentOS 6.8 with 2.6.32-642.11.1.el6.x86_64 or with 3.10.104-1.el6.elrepo.x86_64.

In these cases NFSv4.0 is used.
It is reproducible with something like this:

for i in $(seq 1 100);do echo $i;mount -tnfs4 fs-scratch1:/theorie/scratch1 /tmp/t; date +%s >>/tmp/t/nfstest;umount /tmp/t;done

With this code, 100 new 'NFSv4 callback' processes appear.
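
The one-liner above can be wrapped so that it reports how many threads a run leaves behind. This is a minimal sketch, not the reporter's script: `repro_callback_leak` and `count_threads` are hypothetical names, it has to run as root, and it assumes the thread's command name is exactly `NFSv4 callback`.

```shell
# Count kernel threads named exactly "NFSv4 callback" (assumed name).
count_threads() {
    ps -e -o comm= | awk '
        { sub(/[ \t]+$/, "") }
        $0 == "NFSv4 callback" { n++ }
        END { print n + 0 }'
}

# Mount, write to, and unmount an NFSv4 export several times, then print
# how many additional callback threads exist compared with the start.
#   $1 = export (e.g. server:/path), $2 = mount point, $3 = rounds (default 100)
repro_callback_leak() {
    export_path=$1
    mount_point=$2
    rounds=${3:-100}

    before=$(count_threads)
    i=1
    while [ "$i" -le "$rounds" ]; do
        mount -t nfs4 "$export_path" "$mount_point" || return 1
        date +%s >> "$mount_point/nfstest"
        umount "$mount_point" || return 1
        i=$((i + 1))
    done
    after=$(count_threads)

    echo $((after - before))
}
```

For example, `repro_callback_leak fs-scratch1:/theorie/scratch1 /tmp/t 100` would be expected to print a number close to 100 on an affected kernel, going by the report above.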

Comment 2 Juha Tuomala 2017-03-02 15:53:51 UTC
I see the same:

% ps -eaf 

root     31317     2  0 Feb23 ?        00:00:00 [NFSv4 callback]
root     31318     2  0 Feb23 ?        00:00:00 [NFSv4 callback]
root     31319     2  0 Feb23 ?        00:00:00 [NFSv4 callback]
root     31452     2  0 Feb23 ?        00:00:00 [NFSv4 callback]
root     31453     2  0 Feb23 ?        00:00:00 [NFSv4 callback]
root     31454     2  0 Feb23 ?        00:00:00 [NFSv4 callback]
root     31687     2  0 Feb23 ?        00:00:00 [NFSv4 callback]
root     31688     2  0 Feb23 ?        00:00:00 [NFSv4 callback]
root     31689     2  0 Feb23 ?        00:00:00 [NFSv4 callback]
root     31796     2  0 Feb23 ?        00:00:00 [NFSv4 callback]
root     31797     2  0 Feb23 ?        00:00:00 [NFSv4 callback]
root     31798     2  0 Feb23 ?        00:00:00 [NFSv4 callback]
root     31839     2  0 Feb23 ?        00:00:00 [NFSv4 callback]
root     31840     2  0 Feb23 ?        00:00:00 [NFSv4 callback]
root     31841     2  0 Feb23 ?        00:00:00 [NFSv4 callback]
root     31943     2  0 Feb23 ?        00:00:00 [NFSv4 callback]

% cat /etc/system-release
Fedora release 24 (Twenty Four)

% uname -a
Linux wasa.netnix.ee 4.9.4-100.fc24.x86_64 #1 SMP Tue Jan 17 19:08:56 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux


# cat /etc/system-release
CentOS release 6.8 (Final)

# uname -a
Linux alca.netnix.ee 2.6.32-642.6.2.el6.x86_64 #1 SMP Wed Oct 26 06:52:09 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux

Comment 3 Juha Tuomala 2017-03-02 15:55:11 UTC
# cat /proc/sys/fs/leases-enable
1

Comment 4 ernest.beinrohr 2017-03-08 08:55:25 UTC
I use autofs, and with this bug my process table fills up in two weeks. Current uptime is 9 days:

# ps f -elf | grep "NFSv4 callback" | wc -l
30397

Comment 5 J. Bruce Fields 2017-03-08 17:18:23 UTC
There are definitely some relevant changes in that range:

$ git log --pretty=oneline v4.8..v4.9 fs/nfs/callback*
d55b352b01bc NFSv4.x: hide array-bounds warning
a1d617d8f134 nfs: allow blocking locks to be awoken by lock callbacks
db783688d4a2 nfs: add handling for CB_NOTIFY_LOCK in client
b60475c9401b nfs: the length argument to read_buf should be unsigned
5405fc44c337 NFSv4.x: Add kernel parameter to control the callback server
bb6aeba736ba NFSv4.x: Switch to using svc_set_num_threads() to manage the callback threads
3b01c11ee8bf NFSv4.x: Fix up the global tracking of the callback server
d00252688604 SUNRPC: Initialise struct svc_serv backchannel fields during __svc_create()
f4b52bb08426 NFSv4.x: Set up struct svc_serv_ops for the callback channel

That "svc_set_num_threads()" patch would be at the top of my list of suspects.

On a quick skim I don't see any fixes or discussion upstream.

Would it be possible to report this upstream?  (linux-nfs at vger.kernel.org)  If not I'll get to it eventually.

Comment 6 Benjamin Coddington 2017-03-20 22:25:35 UTC
More reports of this in OFTC/#linux-nfs.. I think Kinglong Mee posted two patches January 19th that might fix it up, but I haven't tested:

[PATCH v2 1/2] NFSv4.x/callback: Create the callback service through svc_create_pooled
[PATCH v2 2/2] NFSv4.x/callback: make sure callback threads are interruptible

Comment 7 Justin M. Forbes 2017-04-11 14:57:06 UTC
*********** MASS BUG UPDATE **************

We apologize for the inconvenience.  There is a large number of bugs to go through and several of them have gone stale.  Due to this, we are doing a mass bug update across all of the Fedora 25 kernel bugs.

Fedora 25 has now been rebased to 4.10.9-200.fc25.  Please test this kernel update (or newer) and let us know if your issue has been resolved or if it is still present with the newer kernel.

If you have moved on to Fedora 26, and are still experiencing this issue, please change the version to Fedora 26.

If you experience different issues, please open a new bug report for those.

Comment 8 Orion Poplawski 2017-04-11 17:54:51 UTC
Still present in 4.10.8-200.fc25.x86_64

Comment 9 Juha Tuomala 2017-04-24 10:42:58 UTC
Still present

Linux example.com 4.10.10-200.fc25.x86_64 #1 SMP Thu Apr 13 01:11:51 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux

% ps -eaf|grep -c NFSv4
163

I think this is the reason why the KDE desktop's plasmashell and dolphin processes constantly hang. Really annoying.

Comment 10 Jürgen Holm 2017-04-25 11:09:55 UTC
4.10.11 is also buggy

Comment 11 Juha Tuomala 2017-04-25 11:11:42 UTC
(In reply to Jürgen Holm from comment #10)
> 4.10.11 is also buggy

I confirm, after a boot:
% ps -eaf|grep -c NFSv4
52

Comment 12 Juha Tuomala 2017-04-25 11:21:10 UTC
(In reply to Jürgen Holm from comment #10)
> 4.10.11 is also buggy

Do you run a desktop on that system, and can you see any anomalies that could be filesystem related?

I constantly have issues with plasmashell and dolphin and need to kill them to get new windows to appear/start.

Comment 13 Benjamin Coddington 2017-04-25 12:27:01 UTC
(In reply to Jürgen Holm from comment #10)
> 4.10.11 is also buggy

We'll probably not see a fix until 4.11.  I've asked the maintainers to take the two patches from comment 6, which should fix up the problem:
http://marc.info/?l=linux-nfs&m=149303618609704&w=2

Comment 14 Jason Tibbitts 2017-04-25 14:44:09 UTC
I'm still seeing a lot of this as well.  I think it would mostly hit people who use the automounter, so the filesystems are constantly coming and going.

@bcodding:  Do you think those patches are appropriate for a stable release?  Were they sent to the stable maintainer or just to the main tree?  If the latter, do you really think there's time before 4.11, given how late in the cycle it is?

I'm making a local kernel package now so I can test this.  I'm happy to share it once it's done.

Comment 15 Benjamin Coddington 2017-04-25 14:51:05 UTC
(In reply to Jason Tibbitts from comment #14)
> I'm still seeing a lot of this as well.  I think it would mostly hit people
> who use the automounter, so the filesystems are constantly coming and going.
> 
> @bcodding:  Do you think those patches are appropriate for a stable release?
> Were they sent to the stable maintainer or just to the main tree?  If the
> latter, do you really think there's time before 4.11, given how late in the
> cycle it is?

They weren't sent to stable yet. I think they need to be picked up mainline first.  I think it's probably too late to see them in 4.10.

Comment 16 Benjamin Coddington 2017-04-25 14:53:38 UTC
(In reply to Benjamin Coddington from comment #15)
> (In reply to Jason Tibbitts from comment #14)
> > I'm still seeing a lot of this as well.  I think it would mostly hit people
> > who use the automounter, so the filesystems are constantly coming and going.
> > 
> > @bcodding:  Do you think those patches are appropriate for a stable release?
> > Were they sent to the stable maintainer or just to the main tree?  If the
> > latter, do you really think there's time before 4.11, given how late in the
> > cycle it is?
> 
> They weren't sent to stable yet. I think they need to be picked up mainline
> first.  I think it's probably too late to see them in 4.10.

Oh, I am off-by-one.. I mean it is probably too late to see them in 4.11, hope to have them in 4.12.

Comment 17 Jason Tibbitts 2017-04-25 16:40:23 UTC
Created attachment 1273977 [details]
Patch adding the two NFS patches

For the record, the two relevant patches are https://www.spinics.net/lists/linux-nfs/msg61696.html and https://www.spinics.net/lists/linux-nfs/msg61697.html

If you want to build your own kernel, here's a patch which applies those two patches.  You may want to change the "%define buildid" to your liking.  I've built the resulting package but haven't booted into it yet.

Comment 18 Jürgen Holm 2017-04-26 09:50:19 UTC
(In reply to Juha Tuomala from comment #12)
> (In reply to Jürgen Holm from comment #10)
> > 4.10.11 is also buggy
> 
> Do you run some desktop with that system, can you see any anomalies that
> could/would be filesystem related?

No, I didn't see any issues.

PS: My fix was to upgrade our last NFS server to CentOS 7

Comment 19 Jürgen Holm 2017-04-26 09:52:21 UTC
(In reply to Jürgen Holm from comment #18)

> ps: My fix was to upgrade our last NFS server to COS7

with kernel-lt-4.4.59-1.el7.elrepo.x86_64

Comment 20 Justin M. Forbes 2017-04-26 12:50:51 UTC
(In reply to Jason Tibbitts from comment #17)
> Created attachment 1273977 [details]
> Patch adding the two NFS patches
> 
> For the record, the two relevant patches are
> https://www.spinics.net/lists/linux-nfs/msg61696.html and
> https://www.spinics.net/lists/linux-nfs/msg61697.html
> 
> If you want to build your own kernel, here's a patch which applies those two
> patches.  You may want to change the "%define buildid" to your liking.  I've
> built the resulting package but haven't booted into it yet.

We can bring those patches into Fedora if they solve the issue. The comments here just mentioned that they were a likely fix and hadn't been tested, so looking for your feedback after testing.

Comment 21 Juha Tuomala 2017-04-27 08:59:58 UTC
(In reply to Jürgen Holm from comment #18)
> ps: My fix was to upgrade our last NFS server to COS7
> with kernel-lt-4.4.59-1.el7.elrepo.x86_64

By fix, do you mean that the problem disappeared?

Comment 22 Juha Tuomala 2017-04-27 09:01:58 UTC
(In reply to Justin M. Forbes from comment #20)
> (In reply to Jason Tibbitts from comment #17)
> > If you want to build your own kernel, here's a patch which applies those two
> > patches.  You may want to change the "%define buildid" to your liking.  I've
> > built the resulting package but haven't booted into it yet.
> 
> We can bring those patches into Fedora if they solve the issue. The comments
> here just mentioned that they were a likely fix and hadn't been tested, so
> looking for your feedback after testing.

If someone can provide a test kernel, I can test it and report back. I have no time to modify the spec and build my own.

Comment 23 Jürgen Holm 2017-04-27 09:09:22 UTC
(In reply to Juha Tuomala from comment #21)
> (In reply to Jürgen Holm from comment #18)
> > ps: My fix was to upgrade our last NFS server to COS7
> > with kernel-lt-4.4.59-1.el7.elrepo.x86_64
> 
> By fix you mean, that problem disappeared?

Yes

Comment 24 Juha Tuomala 2017-04-27 09:11:32 UTC
(In reply to Jürgen Holm from comment #23)
> (In reply to Juha Tuomala from comment #21)
> > (In reply to Jürgen Holm from comment #18)
> > > ps: My fix was to upgrade our last NFS server to COS7
> > > with kernel-lt-4.4.59-1.el7.elrepo.x86_64
> > 
> > By fix you mean, that problem disappeared?
> 
> Yes

That's interesting, since this bug has been all about fixing the problem on the client side, and you solved it on the server side.

Comment 25 Jason Tibbitts 2017-04-27 14:53:35 UTC
(In reply to Justin M. Forbes from comment #20)
> We can bring those patches into Fedora if they solve the issue. The comments
> here just mentioned that they were a likely fix and hadn't been tested, so
> looking for your feedback after testing.

Well, I've been running the kernel for a bit now.  So far there's no proliferation of "[NFSv4 callback]" processes and in fact I can't seem to make more than just one appear.  When I unmount my last imported NFS4 filesystem, the process goes away, which is something that doesn't ever seem to happen with the stock Fedora kernel.

Of course I'll have to see how it is in a few days.

(In reply to Juha Tuomala from comment #22)
> If someone can provide test kernel, I can test it and report back. I've no
> time to modify spec and build my own one.

I put what I have at https://www.math.uh.edu/~tibbs/patched-kernel/, assuming you trust kernel packages from some random person.  They were built locally in mock and are signed with my personal key, which I believe is linked via trust to the main Fedora signing keys if you want to check.

I also kicked off a scratch build at https://koji.fedoraproject.org/koji/taskinfo?taskID=19237773

Comment 26 Benjamin Coddington 2017-04-27 14:55:35 UTC
Hey Jason,  Trond is fixing this a different way upstream -- see the thread:

http://marc.info/?l=linux-nfs&m=149322214627678&w=2

We should probably be testing those patches instead.

Comment 27 Jason Tibbitts 2017-04-27 15:20:28 UTC
Yeah, I saw the thread yesterday but there was some following discussion and I wasn't sure if bfields was going to take the patches directly.  I was already booted into the patched kernel so I figured that reporting my findings couldn't hurt.

Are just those two patches ("SUNRPC: Refactor svc_set_num_threads()" and "NFSv4: Fix callback server shutdown") sufficient?  I'll get another kernel build started.

Comment 28 Jason Tibbitts 2017-04-27 15:48:22 UTC
Created attachment 1274721 [details]
Patch against 4.10.13 with upstream NFS patches

Attached is an updated kernel spec patch.  I started a scratch build at https://koji.fedoraproject.org/koji/taskinfo?taskID=19239218

I have a local mockbuild running as well and will update https://www.math.uh.edu/~tibbs/patched-kernel/ when it finishes.

Comment 29 David Howells 2017-04-28 16:34:16 UTC
Trond's patches fix it for me on the head of Linus's tree.

Comment 30 Jason Tibbitts 2017-04-28 17:45:03 UTC
No issues so far with that kernel.  I have been doing mounts from five servers and unmounting them in random orders and there's not been more than a single "[NFSv4 callback]" thread which exits when the last mount goes away.
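
The behaviour described here (at most one callback thread, and none once the last NFSv4 mount is gone) can be checked mechanically. The following is a minimal sketch under stated assumptions: NFSv4 mounts appear with filesystem type `nfs4` in `/proc/mounts`, and the function names `count_nfs4_mounts` and `check_callback_threads` are hypothetical.

```shell
# Count entries of type nfs4 in /proc/mounts-style input on stdin
# (the filesystem type is the third whitespace-separated field).
count_nfs4_mounts() {
    awk '$3 == "nfs4" { n++ } END { print n + 0 }'
}

# Compare a mount count ($1) with a callback thread count ($2).
# Prints "ok" if there are no threads without mounts, or at most one
# thread while mounts exist; prints "leak" otherwise.
check_callback_threads() {
    mounts=$1
    threads=$2
    if [ "$mounts" -eq 0 ] && [ "$threads" -eq 0 ]; then
        echo ok
    elif [ "$mounts" -gt 0 ] && [ "$threads" -le 1 ]; then
        echo ok
    else
        echo leak
    fi
}
```

A run might look like `check_callback_threads "$(count_nfs4_mounts < /proc/mounts)" "$(ps -e -o comm= | grep -c '^NFSv4 callback$')"`.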

Comment 31 Jason Tibbitts 2017-05-03 19:01:30 UTC
So last night there was an nfs-utils update and as the service was restarting I got an oops (actually a "divide error") resulting in a broken NFS server; a reboot was required to get NFS services working but otherwise the machine was fine (including client NFS).  I only mention this here on the off chance that the problem is related to the two patches I applied.  I don't think it is, but I'm just not familiar enough with the internals to know for sure.  Here's the log:

divide error: 0000 [#1] SMP
Modules linked in: nfsv4 dns_resolver nfs fscache rfcomm rpcsec_gss_krb5 cmac nf_conntrack
 tpm nfsd nfs_acl lockd grace auth_rpcgss sunrpc binfmt_misc xfs libcrc32c hid_logitech_hi
CPU: 7 PID: 17065 Comm: rpc.nfsd Not tainted 4.10.13-200.uh.1.fc25.x86_64 #1
Hardware name: To Be Filled By O.E.M. To Be Filled By O.E.M./X99E-ITX/ac, BIOS P3.40 08/03
task: ffff9481ebe60000 task.stack: ffffb7954d298000
RIP: 0010:svc_pool_for_cpu+0x2b/0x80 [sunrpc]
RSP: 0018:ffffb7954d29bc18 EFLAGS: 00010246
RAX: 0000000000000000 RBX: ffff947e17384000 RCX: ffff9482ba94d228
RDX: 0000000000000000 RSI: 0000000000000007 RDI: ffff9482ba94d200
RBP: ffffb7954d29bc18 R08: ffff9482ba94d228 R09: 0000000000018783
R10: ffffe727408c4580 R11: 0000000000000000 R12: ffff947e17384000
R13: ffff947e17384018 R14: ffff9482ba94d200 R15: ffff9482ba94d210
FS:  00007fb243703c40(0000) GS:ffff9482df3c0000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007ffe581110a8 CR3: 0000001044108000 CR4: 00000000003406e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
 svc_xprt_do_enqueue+0xf2/0x2a0 [sunrpc]
 svc_xprt_received+0x51/0xb0 [sunrpc]
 svc_add_new_perm_xprt+0x76/0x90 [sunrpc]
 svc_addsock+0x14b/0x200 [sunrpc]
 ? recalc_sigpending+0x1b/0x50
 ? __getnstimeofday64+0x41/0xd0
 ? do_gettimeofday+0x29/0x90
 write_ports+0x255/0x2c0 [nfsd]
 ? _copy_from_user+0x4e/0x80
 ? write_recoverydir+0x100/0x100 [nfsd]
 nfsctl_transaction_write+0x48/0x80 [nfsd]
 __vfs_write+0x37/0x160
 ? selinux_file_permission+0xd7/0x110
 ? security_file_permission+0x3b/0xc0
 vfs_write+0xb5/0x1a0
 SyS_write+0x55/0xc0
 entry_SYSCALL_64_fastpath+0x1a/0xa9
RIP: 0033:0x7fb24301dc30
RSP: 002b:00007ffe580d3cd8 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
RAX: ffffffffffffffda RBX: 0000000000000002 RCX: 00007fb24301dc30
RDX: 0000000000000002 RSI: 0000561dd3a33640 RDI: 0000000000000003
RBP: 00007ffe580d3cd0 R08: 0000000000000001 R09: 0000000000000002
R10: 0000000000000064 R11: 0000000000000246 R12: 0000000000000004
R13: 0000561dd3cbd7c0 R14: 0000561dd3cbd740 R15: 00007ffe580d3788
Code: 0f 1f 44 00 00 48 8b 87 98 00 00 00 55 48 89 e5 48 83 78 08 00 74 10 8b 05 57 52 02
RIP: svc_pool_for_cpu+0x2b/0x80 [sunrpc] RSP: ffffb7954d29bc18
---[ end trace 75c980265f0dfd0d ]---

Comment 32 J. Bruce Fields 2017-05-03 19:05:57 UTC
(In reply to Jason Tibbitts from comment #31)
> So last night there was an nfs-utils update and as the service was
> restarting I got an oops (actually a "divide error") resulting in a broken
> NFS server; a reboot was required to get NFS services working but otherwise
> the machine was fine (including client NFS).  I only mention this here on
> the off chance that the problem is related to the two patches I applied.  I
> don't think it is, but I'm just not familiar enough with the internals to
> know for sure.  Here's the log:

See Kinglong Mee's df807fffaabd "NFSv4.x/callback: Create the callback service through svc_create_pooled" in my -next tree.

Comment 33 Jason Tibbitts 2017-05-04 20:04:53 UTC
I'm getting the impression I'd be better off just waiting for 4.12.  The bug is annoying but not a deal breaker.

Comment 34 Juha Tuomala 2017-05-15 10:32:18 UTC
I probably got bitten by this *again*. Keeping a desktop running constantly accumulates these NFSv4 processes. I made a phone call, needed to write a couple of numbers down in kwrite, pressed Save, and it hung; soon the whole desktop crashed. dmesg shows errors about a Q****dbus crash. Embarrassing to have to call back for the same thing :-(

How come this kind of bug gets introduced so easily, but there is no hurry to roll back or fix it? Can't that commit just be reverted?

Comment 35 Trevor Cordes 2017-07-19 08:50:43 UTC
For those seeing hangs (Juha), can you see my new bug #1455086 to see if you're having the same problem?  I'm not seeing >1 callback in ps so I don't think I'm seeing this bug, but any hangs people have might be related.  Someone screwed something up in F24's 4.11.x kernel.  4.10.x series does not have my bug.  I'm starting a vanilla bisect now, but 14 steps, gonna take a couple of weeks.

Comment 36 Fedora End Of Life 2017-11-16 19:27:12 UTC
This message is a reminder that Fedora 25 is nearing its end of life.
Approximately 4 (four) weeks from now Fedora will stop maintaining
and issuing updates for Fedora 25. It is Fedora's policy to close all
bug reports from releases that are no longer maintained. At that time
this bug will be closed as EOL if it remains open with a Fedora  'version'
of '25'.

Package Maintainer: If you wish for this bug to remain open because you
plan to fix it in a currently maintained version, simply change the 'version'
to a later Fedora version.

Thank you for reporting this issue and we are sorry that we were not
able to fix it before Fedora 25 is end of life. If you would still like
to see this bug fixed and are able to reproduce it against a later version
of Fedora, you are encouraged to change the 'version' to a later Fedora
version before this bug is closed as described in the policy above.

Although we aim to fix as many bugs as possible during every release's
lifetime, sometimes those efforts are overtaken by events. Often a
more recent Fedora release includes newer upstream software that fixes
bugs or makes them obsolete.

Comment 37 Fedora End Of Life 2017-12-12 10:04:56 UTC
Fedora 25 changed to end-of-life (EOL) status on 2017-12-12. Fedora 25 is
no longer maintained, which means that it will not receive any further
security or bug fix updates. As a result we are closing this bug.

If you can reproduce this bug against a currently maintained version of
Fedora please feel free to reopen this bug against that version. If you
are unable to reopen this bug, please file a new report against the
current release. If you experience problems, please add a comment to this
bug.

Thank you for reporting this bug and we are sorry it could not be fixed.

