Bug 1375767

Summary: NFS blocks after "Callback slot table overflowed" message
Product: [Fedora] Fedora
Reporter: Brian J. Murrell <brian>
Component: kernel
Assignee: Kernel Maintainer List <kernel-maint>
Status: CLOSED INSUFFICIENT_DATA
QA Contact: Fedora Extras Quality Assurance <extras-qa>
Severity: urgent
Priority: unspecified
Version: 24
CC: anton, Bert.Deknuydt, brian, dougsland, extras-qa, gansalmon, gm.outside+redhat, ichavero, itamar, jonathan, josep.puigdemont, kernel-maint, kevin.constantine, madhu.chinakonda, mchehab, tamisoft
Target Milestone: ---
Flags: jforbes: needinfo?
Target Release: ---
Hardware: All
OS: Linux
Clone Of: 607695
Last Closed: 2017-04-28 17:27:30 UTC
Attachments:
  Blocked tasks (echo w >/proc/sysrq-trigger); flags: none

Description Brian J. Murrell 2016-09-14 01:22:22 UTC
Created attachment 1200673 [details]
Blocked tasks (echo w >/proc/sysrq-trigger)

+++ This bug was initially created as a clone of Bug #607695 +++

My $HOME is an NFS mount. I noticed that most of the time (~90%), copying a large file from the NFS mount to a local mount point such as /tmp completely locks up the desktop and apparently all NFS mounts.

This always happens after the following message appears in /var/log/messages:
kernel: Callback slot table overflowed

Version-Release number of selected component:
$ uname -a
Linux sestows351 2.6.33.5-124.fc13.i686.PAE #1 SMP Fri Jun 11 09:42:24 UTC 2010 i686 i686 i386 GNU/Linux

The NFS server seems to use version 3:
$ cat /proc/fs/nfsfs/servers 
NV SERVER   PORT USE HOSTNAME
v3 ac150302  801   3 nfs_server
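
For readers who want to map the SERVER column back to an address: it appears to be the server's IPv4 address printed as eight hex digits. A minimal Python sketch, assuming the digits are written most-significant byte first (the helper name is hypothetical, not part of any NFS tooling):

```python
import ipaddress


def decode_nfs_server_addr(hex_addr: str) -> str:
    """Convert an 8-digit hex IPv4 address to dotted-quad notation.

    Assumption: the digits are most-significant byte first, as they
    appear to be in the SERVER column of /proc/fs/nfsfs/servers.
    """
    if len(hex_addr) != 8:
        raise ValueError(f"expected 8 hex digits, got {hex_addr!r}")
    return str(ipaddress.IPv4Address(int(hex_addr, 16)))


print(decode_nfs_server_addr("ac150302"))  # prints 172.21.3.2
```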

fstab line for /home:
nfs_server:/home       /home                  nfs     rw,async,rsize=8192,wsize=8192,timeo=14,retrans=6 0 0
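
As a reading aid for the option string above, here is a small Python sketch that splits it into key/value pairs. The function name is hypothetical. The only interpretation added is that `timeo` is expressed in tenths of a second, as described in nfs(5), so `timeo=14` corresponds to 1.4 seconds.

```python
def parse_mount_options(options: str) -> dict:
    """Split a comma-separated mount option string into a dict.

    Flags without a value (such as "rw") map to True; everything else
    keeps its value as a string.
    """
    parsed = {}
    for opt in options.split(","):
        key, sep, value = opt.partition("=")
        parsed[key] = value if sep else True
    return parsed


opts = parse_mount_options("rw,async,rsize=8192,wsize=8192,timeo=14,retrans=6")
# timeo is given in deciseconds, so divide by 10 to get seconds.
timeo_seconds = int(opts["timeo"]) / 10
print(opts["rsize"], opts["retrans"], timeo_seconds)  # prints 8192 6 1.4
```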


I can reproduce this issue fairly easily: I just need to copy one big file from the NFS mount to a locally mounted directory. Sometimes it also happens when an application like Evolution starts.

The only solution I found so far is to reboot the computer. I once left it overnight to see if it recovered, but the next morning the NFS mounts were still not responding.

--- Additional comment from Kevin Constantine on 2010-07-06 21:30:24 EDT ---

I'm seeing similar behavior on both FC13 and RHEL6 beta. My clients hang for up to 15 seconds and then send a TCP reset packet to the server. Pauses of about 3 seconds are far more frequent, after which the TCP conversation seems to get restarted.

When I run a simple lmdd of a 150MB file from an NFSv3 server, measured throughput ranges from 114MB/s when there are no Callback errors down to 2-6MB/s when there are many.
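
For anyone without lmdd at hand, a rough stand-in for the measurement above can be sketched in Python. This is not lmdd and will not reproduce its exact numbers: the function name, the 1 MiB block size, and the use of 10^6 bytes per MB are all assumptions, and a file that is already in the client's page cache will read far faster than the network allows.

```python
import time


def read_throughput(path, block_size=1024 * 1024):
    """Read a file sequentially and return (total_bytes, MB_per_second).

    MB here means 10**6 bytes (an assumption; other tools may differ).
    """
    total = 0
    start = time.perf_counter()
    with open(path, "rb") as f:
        while True:
            chunk = f.read(block_size)
            if not chunk:
                break
            total += len(chunk)
    elapsed = time.perf_counter() - start
    # Guard against a zero interval on very small files.
    rate = total / 1e6 / elapsed if elapsed > 0 else float("inf")
    return total, rate
```

Calling `read_throughput` on a large file under the NFS mount, once with and once without the Callback errors in the log, gives a comparable pair of figures.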

I found that if I set /proc/sys/sunrpc/tcp_slot_table_entries to 2 on the client, the "Callback slot table overflowed" errors disappear and throughput is consistently 114MB/s. With it set to 3, there are a few errors in the logs, but performance is not affected. With it set to 4, performance is significantly affected (30MB/s) and there are consistent Callback errors in the logs.

What's interesting to me is that I see this behavior when reading from one particular vendor's NAS device, and not from another's.
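
The tuning step described above can be sketched in Python as follows. Only the /proc path comes from this comment; the function names are hypothetical. Writing the file requires root, the file exists only when the sunrpc module is loaded, and the new value may apply only to mounts made after the change, so a remount may be needed (an assumption, not verified here).

```python
from pathlib import Path

# Path taken from the comment above; present only when sunrpc is loaded.
SLOT_TABLE_PATH = Path("/proc/sys/sunrpc/tcp_slot_table_entries")


def read_slot_table_entries(path: Path = SLOT_TABLE_PATH) -> int:
    """Return the current number of TCP RPC slot table entries."""
    return int(path.read_text().strip())


def write_slot_table_entries(value: int, path: Path = SLOT_TABLE_PATH) -> None:
    """Set the number of TCP RPC slot table entries (needs root on /proc)."""
    if value < 1:
        raise ValueError("slot table size must be a positive integer")
    path.write_text(f"{value}\n")
```

For example, `write_slot_table_entries(2)` followed by `read_slot_table_entries()` would apply and confirm the value of 2 reported here. The setting does not persist across reboots.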

--- Additional comment from Josep on 2010-08-04 05:18:47 EDT ---

In my case the server is a BlueArc model Titan 2100.

I've set tcp_slot_table_entries to 2 as Kevin mentioned in the previous comment, and so far things look good: no errors.

--- Additional comment from Josep on 2010-08-06 04:56:55 EDT ---

Just wanted to report that setting tcp_slot_table_entries to 2, as mentioned in comment #1, does not do the trick here; I still see this issue.
I'm using the latest kernel available in Fedora 13 as of this writing: 2.6.33.6-147.2.4.fc13.i686.PAE

The kernel is configured to print a stack trace when a process blocks for more than 120 seconds. I am copying it here in case it is helpful:

Callback slot table overflowed
INFO: task gconfd-2:1759 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
gconfd-2      D 00004acf     0  1759      1 0x00000080
f524be54 00000086 01425172 00004acf f524be34 c0a48394 c0a4cf40 c0a4cf40
c0a4cf40 f52b5bdc 00000000 f524bec0 eec41328 00000000 00000000 00004acf
f52b5940 b3a6d1bb 10321c80 f524be9c c2008f40 f52b5940 c2008f40 f524bea4
Call Trace:
[<c0781a57>] io_schedule+0x5a/0x93
[<f93af766>] nfs_wait_bit_uninterruptible+0x8/0xc [nfs]
[<c0781f2c>] __wait_on_bit+0x34/0x5b
[<f93af75e>] ? nfs_wait_bit_uninterruptible+0x0/0xc [nfs]
[<f93af75e>] ? nfs_wait_bit_uninterruptible+0x0/0xc [nfs]
[<c0781fee>] out_of_line_wait_on_bit+0x9b/0xa3
[<c0452b48>] ? wake_bit_function+0x0/0x37
[<f93af75b>] nfs_wait_on_request+0x1e/0x21 [nfs]
[<f93b3274>] nfs_sync_mapping_wait+0xc3/0x1aa [nfs]
[<f93b3a4b>] nfs_write_mapping+0x5a/0x7a [nfs]
[<f93b3a90>] nfs_wb_all+0x10/0x12 [nfs]
[<f93a7a52>] nfs_do_fsync+0x16/0x31 [nfs]
[<f93a7be6>] nfs_file_flush+0x58/0x60 [nfs]
[<c04cef40>] filp_close+0x32/0x5b
[<c04cefc4>] sys_close+0x5b/0x8a
[<c040889f>] sysenter_do_call+0x12/0x28

(The same backtrace is shown for other applications too.)
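
To see which other processes are affected, the "blocked for more than" lines can be pulled out of a saved log. A minimal Python sketch, assuming the message format shown in the trace above (the regular expression and function name are illustrative, not taken from the kernel source):

```python
import re

# Matches lines such as:
#   INFO: task gconfd-2:1759 blocked for more than 120 seconds.
HUNG_TASK_RE = re.compile(
    r"INFO: task (?P<comm>.+):(?P<pid>\d+) "
    r"blocked for more than (?P<secs>\d+) seconds\."
)


def find_hung_tasks(log_text: str):
    """Return a list of (command, pid, seconds) tuples found in log_text."""
    return [
        (m["comm"], int(m["pid"]), int(m["secs"]))
        for m in HUNG_TASK_RE.finditer(log_text)
    ]


sample = (
    "Callback slot table overflowed\n"
    "INFO: task gconfd-2:1759 blocked for more than 120 seconds.\n"
)
print(find_hung_tasks(sample))  # prints [('gconfd-2', 1759, 120)]
```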

--- Additional comment from Josep on 2010-11-11 05:16:41 EST ---

I haven't seen this issue for a very long time, and now with F14 I don't see it either.

--- Additional comment from Levente Tamas on 2011-02-02 05:43:42 EST ---

The problem is easily reproducible by installing FreeNAS and mounting an NFS share from it on Fedora 13 or later. Everything hangs or slows down on big transfers or many concurrent requests, with "kernel: Callback slot table overflowed" in the messages log.

I don't know why Fedora has been shipping a broken kernel since F12. I am currently trying to recompile the kernel RPM without NFS_V4_1, since it seems this message can only appear when that option is defined.

--- Additional comment from Bug Zapper on 2011-06-27 14:52:27 EDT ---
Fedora 13 changed to end-of-life (EOL) status on 2011-06-25. Fedora 13 is 
no longer maintained, which means that it will not receive any further 
security or bug fix updates. As a result we are closing this bug.

If you can reproduce this bug against a currently maintained version of 
Fedora please feel free to reopen this bug against that version.

Thank you for reporting this bug and we are sorry it could not be fixed.

--- Additional comment from (GalaxyMaster) on 2015-04-03 06:07:12 EDT ---

Just experienced this issue on RHEL7.  The NFS mount is mounted with the following options on the client where this issue appears:
===
192.168.0.250:user	/home/user	nfs	_netdev,vers=4.1,bg,intr,ro,soft,rsize=32768,wsize=32768,timeo=100,context=system_u:object_r:httpd_sys_content_t:s0	0 0
===

The system log was spammed with "Callback slot table overflowed":
===
Apr 03 20:44:29 ots-prod-web.production.internal kernel: Callback slot table overflowed
Apr 03 20:44:31 ots-prod-web.production.internal kernel: Callback slot table overflowed
Apr 03 20:44:31 ots-prod-web.production.internal kernel: Callback slot table overflowed
===
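
To put a number on how heavily the log is being spammed, the messages can be tallied per timestamp. A small Python sketch, assuming syslog-style lines like the ones quoted above, where the first three whitespace-separated fields form the timestamp (the function name is hypothetical):

```python
from collections import Counter

MESSAGE = "Callback slot table overflowed"


def count_overflows_per_second(lines):
    """Count overflow messages per timestamp (month, day, HH:MM:SS)."""
    counts = Counter()
    for line in lines:
        if MESSAGE in line:
            # The first three fields are assumed to be the timestamp.
            timestamp = " ".join(line.split()[:3])
            counts[timestamp] += 1
    return counts


log_lines = [
    "Apr 03 20:44:29 ots-prod-web.production.internal kernel: Callback slot table overflowed",
    "Apr 03 20:44:31 ots-prod-web.production.internal kernel: Callback slot table overflowed",
    "Apr 03 20:44:31 ots-prod-web.production.internal kernel: Callback slot table overflowed",
]
for timestamp, count in sorted(count_overflows_per_second(log_lines).items()):
    print(timestamp, count)
```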

The running kernel version is 3.10.0-123.20.1.el7.x86_64:
===
# uname -s -r -v
Linux 3.10.0-123.20.1.el7.x86_64 #1 SMP Thu Jan 29 18:05:33 UTC 2015
# rpm -q kernel
kernel-3.10.0-123.20.1.el7.x86_64
#
===

--- Additional comment from Brian J. Murrell on 2016-09-13 21:13:16 EDT ---

This is happening on F24 also.

# uname -r
4.6.4-301.fc24.x86_64

Comment 1 Brian J. Murrell 2016-09-14 02:33:16 UTC
Just updated to current F24:

$ uname -r
4.7.2-201.fc24.x86_64

This is still happening and is making NFS on F24 pretty much useless.  Having to reboot after every burst of NFS activity is just a complete non-starter, as I am sure you can imagine.

Server is an EL 7.2 machine, up-to-date.

Comment 2 Justin M. Forbes 2017-04-11 14:50:19 UTC
*********** MASS BUG UPDATE **************

We apologize for the inconvenience.  There are a large number of bugs to go through and several of them have gone stale.  Due to this, we are doing a mass bug update across all of the Fedora 24 kernel bugs.

Fedora 24 has now been rebased to 4.10.9-100.fc24.  Please test this kernel update (or newer) and let us know if your issue has been resolved or if it is still present with the newer kernel.

If you have moved on to Fedora 26, and are still experiencing this issue, please change the version to Fedora 26.

If you experience different issues, please open a new bug report for those.

Comment 3 Justin M. Forbes 2017-04-28 17:27:30 UTC
*********** MASS BUG UPDATE **************
This bug is being closed with INSUFFICIENT_DATA as there has not been a response in 2 weeks. If you are still experiencing this issue, please reopen and attach the relevant data from the latest kernel you are running and any data that might have been requested previously.