Bug 1369600 - Console spammed with receive_cb_reply log and sunrpc/sched.c warnings until OOM
Summary: Console spammed with receive_cb_reply log and sunrpc/sched.c warnings until OOM
Keywords:
Status: CLOSED INSUFFICIENT_DATA
Alias: None
Product: Red Hat Enterprise Linux 7
Classification: Red Hat
Component: kernel
Version: 7.2
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: unspecified
Target Milestone: rc
Target Release: ---
Assignee: nfs-maint
QA Contact: Filesystem QE
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2016-08-23 22:45 UTC by Jason Tibbitts
Modified: 2020-02-14 17:53 UTC
CC: 9 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2018-11-26 21:07:30 UTC
Target Upstream Version:
Embargoed:


Attachments: none

Description Jason Tibbitts 2016-08-23 22:45:34 UTC
kernel-3.10.0-327.28.2.el7.x86_64
nfs-utils-1.3.0-0.21.el7_2.1.x86_64

An NFS server running CentOS (not RHEL, but it shouldn't really matter here) is exporting a number of NFSv4.1 sec=krb5p mounts.  It is under a fairly large amount of load and has about 120 clients (all on Fedora 24) active and accessing user home dirs (so mostly small random reads and writes).  It has 48GB of RAM which is generally almost completely unused save for cache.  This morning I found it in a rather poor state; the OOM killer had tripped and killed random bits of the system.  I ended up rebooting it.
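
For context, the exports are basically of this form (an illustrative /etc/exports entry; the path and client spec are placeholders, not the real ones):

# NFSv4.1-capable export with Kerberos privacy enforced
/export/home  *.example.edu(rw,sec=krb5p,no_subtree_check)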

Keeping an eye on the server, I noticed that after a period of peaceful operation, the console was spammed with about a hundred of these a second:

[15045.338602] receive_cb_reply: Got unrecognized reply: calldir 0x1 xpt_bc_xprt ffff8803a7c2e000 xid 08336b5b

The xid changed on each line but the 6b5b was consistent.  Sadly that message does not include anything which identifies the client.  During this time, rpc.gssd would be consuming 100% CPU.  Sometimes only a few would come at random intervals and in blocks of random size, though this was generally followed by a continuous spew which would last until the machine fell over.

Interspersed in the log spew was the following warning:

[14987.600295] ------------[ cut here ]------------
[14987.600316] WARNING: at net/sunrpc/sched.c:692 rpc_exit_task+0x85/0x90 [sunrpc]()
[14987.600339] Modules linked in: nfsv4 dns_resolver nfs fscache cts rpcsec_gss_krb5 ipt_REJECT nf_conntrack_ipv4 nf_defrag_ipv4 xt_conntrack nf_conntrack iptable_filter intel_powerclamp coretemp kvm_intel kvm crc32_pclmul iTCO_wdt iTCO_vendor_support ghash_clmulni_intel aesni_intel sg pcspkr i2c_i801 i7core_edac lpc_ich lrw mfd_core gf128mul glue_helper ablk_helper cryptd edac_core ioatdma shpchp dca acpi_cpufreq nfsd nfs_acl lockd auth_rpcgss grace sunrpc ip_tables xfs libcrc32c raid1 sd_mod crc_t10dif crct10dif_generic mgag200 syscopyarea sysfillrect sysimgblt i2c_algo_bit drm_kms_helper ttm e1000e ahci drm libahci crct10dif_pclmul crct10dif_common ptp libata crc32c_intel serio_raw i2c_core pps_core dm_mirror dm_region_hash dm_log dm_mod
[14987.600342] CPU: 7 PID: 572 Comm: kworker/7:1H Tainted: G        W      ------------   3.10.0-327.28.2.el7.x86_64 #1
[14987.600343] Hardware name: Supermicro X8DTT-H/X8DTT-H, BIOS 2.1c       04/22/2014
[14987.600350] Workqueue: rpciod rpc_async_schedule [sunrpc]
[14987.600354]  0000000000000000 00000000af94c47c ffff880c1a577d40 ffffffff81636553
[14987.600355]  ffff880c1a577d78 ffffffff8107b200 ffff88061a5e2900 ffff8800bada5600
[14987.600357]  ffffffffa02ede00 ffffffffa02ede00 00000000000001e0 ffff880c1a577d88
[14987.600358] Call Trace:
[14987.600366]  [<ffffffff81636553>] dump_stack+0x19/0x1b
[14987.600370]  [<ffffffff8107b200>] warn_slowpath_common+0x70/0xb0
[14987.600377]  [<ffffffffa02ede00>] ? rpc_destroy_wait_queue+0x20/0x20 [sunrpc]
[14987.600384]  [<ffffffffa02ede00>] ? rpc_destroy_wait_queue+0x20/0x20 [sunrpc]
[14987.600385]  [<ffffffff8107b34a>] warn_slowpath_null+0x1a/0x20
[14987.600392]  [<ffffffffa02ede85>] rpc_exit_task+0x85/0x90 [sunrpc]
[14987.600398]  [<ffffffffa02eeb84>] __rpc_execute+0x94/0x420 [sunrpc]
[14987.600405]  [<ffffffffa02eef36>] rpc_async_schedule+0x26/0x30 [sunrpc]
[14987.600409]  [<ffffffff8109d63b>] process_one_work+0x17b/0x470
[14987.600410]  [<ffffffff8109e40b>] worker_thread+0x11b/0x400
[14987.600411]  [<ffffffff8109e2f0>] ? rescuer_thread+0x400/0x400
[14987.600415]  [<ffffffff810a5b2f>] kthread+0xcf/0xe0
[14987.600417]  [<ffffffff810a5a60>] ? kthread_create_on_node+0x140/0x140
[14987.600421]  [<ffffffff81646b98>] ret_from_fork+0x58/0x90
[14987.600422]  [<ffffffff810a5a60>] ? kthread_create_on_node+0x140/0x140
[14987.600423] ---[ end trace c2ac225ae8a62724 ]---

Eventually (even though I didn't notice the free memory count going down) the OOM killer kicked in, and after that the machine was a mess (random things killed, etc.).

This happened a couple more times throughout the day.  Sometimes iftop would identify a client hammering the server.  That client would basically be doing nothing but spewing traffic at the server and would have a pretty high load. Rebooting the client stopped the log spew.  I also found that a simple systemctl restart nfs on the server would generally clean things up (until it happened again, of course).

This server and client configuration has been working well for a while, after some initial issues with NFSv4.1 were ironed out back when it was first enabled in EL7.  What's changed recently:

The load went up as today is the second day of the fall semester.  But it's been this high before.

The server was recently rebooted into kernel-3.10.0-327.28.2.el7.x86_64.  Previously it was running kernel-3.10.0-327.10.1.el7.x86_64 recompiled with a local patch to fix some NFSv4.1 issues that were later fixed officially (in 327.13.1 or 327.18.2, I think).  I did not boot any of the intermediate kernels, but I can try them.

nfs-utils on the server was updated from nfs-utils-1:1.3.0-0.21.el7_2.x86_64 to 1:1.3.0-0.21.el7_2.1.x86_64.

The clients were upgraded from Fedora 23 to Fedora 24.  The kernel version stayed pretty much the same, though.  Some machines had kernel-4.6.5-200.fc23.x86_64 and some had 4.6.6.  They're all on kernel-4.6.6-300.fc24.x86_64 or 4.6.7 now.

My plan for now is to disable NFSv4.1 and just go with 4.0; I never had any issues at all before 4.1 was enabled.  But I am happy to try out any other suggestions first.
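
For reference, the plan on the server is roughly this (a sketch; it assumes the rpc.nfsd shipped in this nfs-utils build accepts minor versions with -N):

# /etc/sysconfig/nfs: stop advertising NFSv4.1, keep 4.0 and below
RPCNFSDARGS="-N 4.1"

# then restart the NFS server
systemctl restart nfs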

Comment 1 Jason Tibbitts 2016-08-23 22:47:25 UTC
Please make this bug public if possible; I can't seem to do that myself.

Comment 3 Jason Tibbitts 2016-08-24 15:09:14 UTC
This has happened on another server this morning, so I know it isn't specific to that one machine.  The hardware and OS install should be identical, however.  Fortunately systemctl restart nfs seems to have cleared it up without any issues, and the clients seem to have recovered.  So at least the workaround is relatively simple, if I can manage to catch the problem before the machine dies from OOM.
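
Something like the following crude watchdog might buy time until there's a real fix (just a sketch, untested; it assumes the RHEL 7 "nfs" service name and that the message lands in the kernel journal):

#!/bin/bash
# Poll the kernel log; if the receive_cb_reply spew shows up, bounce NFS.
while sleep 60; do
    if journalctl -k --since "1 min ago" | grep -q 'receive_cb_reply: Got unrecognized reply'; then
        logger "receive_cb_reply spew detected, restarting nfs"
        systemctl restart nfs
    fi
done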

Comment 4 J. Bruce Fields 2016-08-24 15:54:57 UTC
> net/sunrpc/sched.c:692

That's the RPC_IS_ASSASSINATED(task) check in rpc_exit_task.  Maybe that means some cleanup isn't being done on rpc exit (e.g., rpc_release not getting called?), causing the leak?

And I wonder if there's something wrong with the backchannel locking that's letting the xid's get out of sync.

Comment 5 Jason Tibbitts 2016-08-24 18:26:51 UTC
If you have any suggestions I'm happy to try them, but sadly I cannot reproduce this at will.  I know I can drop back to NFSv4.0 (which worked for years before v4.1 was enabled) but I was keeping things this way in the hope of being able to debug a bit.  Of course it happens just often enough to cause real problems but not often enough to present obvious debugging opportunities.

Comment 6 J. Bruce Fields 2016-08-24 21:28:18 UTC
I have a feeling we did actually fix some sort of callback race there recently, but I'm not finding it at the moment....

Of course, you could just run on an upstream kernel for a few days and see if that helps, but I don't know if that's practical for you.

Comment 7 Jason Tibbitts 2016-08-24 22:00:40 UTC
I can do that, no problem, assuming that the userspace would actually work.  I'm not at all sure how to generate a fresh kernel using the RHEL configs, though.  I can try building a current Fedora kernel on EL7, but I have my doubts that what I'd get at the end would be useful.  RHEL7 is pretty old now in Fedora terms.

Comment 8 Jason Tibbitts 2016-08-25 01:54:29 UTC
Some further notes:

Sometimes I can catch a client in the act of spamming the server.  On such a client, this is generally the only thing in dmesg:

[100711.117385] NFS: state manager: check lease failed on NFSv4 server nas01 with error 13
[101050.513502] NFS: nfs4_reclaim_open_state: Lock reclaim failed!
[...repeated...]
[101050.542845] NFS: nfs4_reclaim_open_state: Lock reclaim failed!
[101050.547814] NFS: nfs4_reclaim_open_state: Lock reclaim failed!
[104950.859195] Callback slot table overflowed
[...repeated...]
[104971.209458] Callback slot table overflowed
[104971.210004] RPC: Could not send backchannel reply error: -105
[104971.243881] Callback slot table overflowed
[104971.289402] Callback slot table overflowed

I managed to capture some traffic from such a host and looked at the XIDs going over the wire and what's odd is that _none_ of them correspond to the XIDs spewed in the receive_cb_reply lines.  (The dumps from the captures are huge and they're made on a krb5p export, but I can supply one if that would be useful.)  I've started exporting everything as krb5i and for whatever reason the clients appear to prefer that, so I should be able to get some more useful captures in the future if necessary.
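
For the record, the capture and XID comparison were done roughly like this (illustrative commands; the interface and file names are placeholders):

# capture NFS traffic
tcpdump -i em1 -s 0 -w /tmp/nfs-cb.pcap port 2049

# dump the RPC XIDs seen on the wire, to compare against the xids
# printed by receive_cb_reply
tshark -r /tmp/nfs-cb.pcap -Y rpc -T fields -e rpc.xid | sort | uniq -c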

In the meantime, elrepo makes upstream kernels easy to get, so I'll try that as soon as I can squeeze in a reboot.

Comment 9 Jason Tibbitts 2016-08-25 02:52:19 UTC
OK, I installed 4.7.2 from elrepo and it fell over in about two minutes with:

kernel BUG at mm/slub.c:3657

So I'm now running 4.4.19 from elrepo and it seems to be OK, but it's only been fifteen minutes.

The other official CentOS kernels fail to boot due to some problem with dracut which causes it to not include the proper modules.  I need to rebuild the boot images before I'll be able to try them.

Comment 10 Jason Tibbitts 2016-08-25 22:34:49 UTC
And, sadly, 4.4.19 seemed to work but appears to have some incompatibility somewhere.  I had to modify /etc/idmapd.conf to add Domain = whatever to get something to stop complaining about not being able to map uids, and then users somehow can't actually see their home directories at login even though they appear to mount just fine.
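
For reference, the idmapd.conf change was just the Domain key in the [General] section, something like this (the domain itself is site-specific):

[General]
Domain = example.edu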

So that's out.

Chuck Lever on IRC suggested that some networking issue could be to blame, and while that hardware has transferred petabytes of information over its lifetime, I notice that the 327.28.2 kernel has a number of changes to the e1000 driver.  I need to stick a 10G card in this machine at some point anyway, so I'll go ahead and do that tonight and see if the situation changes.

Comment 11 Jason Tibbitts 2016-08-26 03:49:11 UTC
I installed a 10G Ethernet card and switched over to it; no change in behavior.

I took a hard look at my logs and found that the problem started on two hosts in the morning and early afternoon of August 18.  Now, I updated pretty much all of my clients from Fedora 23 to 24 the previous evening, and the log spew on the servers started pretty much when people started working.  (A faculty file server at 9AM and then a graduate student server at noon, which makes a lot of sense.)

In the F24 update, the kernel didn't really change; the version's the same and I don't think the config is any different.  nfs-utils was updated, though, and that strikes me as the most suspicious thing. The version before the updates was 1.3.3-8.rc5, which I think was really a release candidate of 1.3.4. (The nfs-utils versioning in Fedora seems pretty broken.)  After the update, it's 1.3.4-1.rc1 (which I think is actually a release candidate of 1.3.5, aargh).

So... could a difference in nfs-utils on the client possibly have anything to do with this?  The client only really uses rpc.gssd and mount.nfs, doesn't it?  rpc.gssd especially sticks out.

So I'll concentrate on the clients now, and try a bit to get 4.4.19 actually working on the server.  But any ideas would be greatly appreciated.

Comment 12 J. Bruce Fields 2016-08-26 14:50:00 UTC
(In reply to Jason Tibbitts from comment #11)
> In the F24 update, the kernel didn't really change; the version's the same
> and I don't think the config is any different. nfs-utils was updated,
> though, and that strikes me as the most suspicious thing. The version before
> the updates was 1.3.3-8.rc5, which I think was really a release candidate of
> 1.3.4. (The nfs-utils versioning in Fedora seems pretty broken.)  After the
> update, it's 1.3.4-1.rc1 (which I think is actually a release candidate of
> 1.3.5, aargh).

Hm.  From a quick look, there might have been bugfixes to version negotiation, so maybe the protocol minor version used by those clients changed?

(Also, if you want to argue with SteveD about nfs-utils versioning, open a bug against nfs-utils and that'll go to him.)

Comment 13 Jason Tibbitts 2016-08-26 15:08:40 UTC
Well, sadly, this morning I downgraded nfs-utils on all client machines and restarted rpc.gssd and the problem still occurred.  Of course, there were existing active mounts so it could have been some residual thing.  I'll leave it that way and weed out problematic clients.

Really the only thing I can think of doing now is booting the clients into the F23 kernel.  Sadly I can't use the F23 version of nfs-utils due to libtirpc being updated.  (Which I guess is something else to look at.)  I'm so out of ideas that I'm even considering subtle miscompilation of things by gcc6.  (F23 had gcc5.)

But really what I need to do is just disable nfs4.1 on the server.  If that still shows the problem then something is seriously off.  I've just been avoiding it because I wanted to better understand the 4.1 problem before just punting.  

Anyway, that's a job for tonight.  Good thing it's Friday as the load will be lower.

I was just going to bug steved the next time I see him on IRC.  That grousing was just me with no sleep, and the fact that I maintain those Fedora packaging guidelines, so I like to see packages following them.

Comment 14 Jason Tibbitts 2016-08-28 05:17:58 UTC
And the latest.  From a packet capture on a krb5i mount, the client gets into a state where it spams the server mercilessly with BIND_CONN_TO_SESSION.  The server responds with NFS4_OK to each, and eventually dies.
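
(Counting the offending compounds in a capture is easy enough; something along these lines, with a placeholder capture file name, shows the flood:

tshark -r /tmp/client.pcap -Y nfs | grep -c BIND_CONN_TO_SESSION
)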

Chuck found that the description of upstream kernel commit bdc54d8e3cb4a41dddcabfd86d9eb3aa5f622b75, which fixes a client problem and went into 4.6-rc1, seems very related:

"
If there were less than 2 entries in the multipath list, then
xprt_iter_next_entry_multiple() would never advance beyond the
first entry, which is correct for round robin behaviour, but not
for the list iteration.

The end result would be infinite looping in rpc_clnt_iterate_for_each_xprt()
as we would never see the xprt == NULL condition fulfilled.
"

I have no idea why clients running Fedora 23 never had this problem given that they're running a 4.6 kernel, but I've deployed a kernel with this patch to all of my clients, so we'll see how things work come Monday.  With luck we'll at least know what client behavior is causing the server to die.

Comment 15 Jason Tibbitts 2016-08-29 19:13:06 UTC
Everything seems to be fine after that client patch.  At least, the clients are no longer tickling the server issue.  Of course, this doesn't actually fix the server issue, but for now at least I can get on with life.

Comment 16 J. Bruce Fields 2016-09-06 14:36:50 UTC
Thanks for the report.  Unfortunately, I think it's likely, then, that there are multiple issues:

  - the client bug causing the excessive BIND_CONN_TO_SESSION calls (sounds like you've identified the bug and the fix, thanks).
  - a server leak in the case of excessive BIND_CONN_TO_SESSION.  It could be as simple as the server lacking any limit on the number of bound connections, so setting some per-session (or per-client) limit may be all that's necessary.
  - the "unrecognized reply" messages: we don't know if this was just another consequence of the client bug, or a separate client or server bug causing the callback xid stream to get out of sync in the presence of multiple connections.  A network trace might help sort that out.

Comment 17 Eric Smith 2016-09-29 20:13:53 UTC
I was having other NFS problems (client side crashes under heavy NFS load), and upgraded my CentOS 7 file server and Fedora 24 clients on the same day. That eliminated the client side crashes, but I experienced the server side connection failures (with a continuous spew of "Got unrecognized reply" in the log) about 20-40 times a day. I mistakenly thought it was the file server upgrade that caused the problem, and wasted a bunch of time trying to debug that.  Now that I've upgraded the kernel on the F24 clients to 4.7.4-200, I've gone a day without seeing the issue.

It would still be nice to get the server side issue fixed. In my experience it is really easy to provoke the issue with clients running the F24 4.6.x kernels, including 4.6.4-301 and 4.6.5-300, when the clients have home directories NFS mounted from the file server.

Comment 18 Dave Wysochanski 2018-11-26 21:07:30 UTC
Closing INSUFFICIENT_DATA.  Please re-open if you have new information, and especially if you can reproduce on a later RHEL7 kernel (such as 3.10.0-957.el7 or above).

