Bug 468092 - number of lockd socket connections is capped at 80
Summary: number of lockd socket connections is capped at 80
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Linux 5
Classification: Red Hat
Component: kernel
Version: 5.3
Hardware: i386
OS: Linux
Priority: medium
Severity: medium
Target Milestone: rc
Assignee: Jeff Layton
QA Contact: Red Hat Kernel QE team
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2008-10-22 18:47 UTC by Jeff Layton
Modified: 2012-02-06 01:45 UTC
CC List: 6 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2009-09-02 08:48:01 UTC
Target Upstream Version:
Embargoed:


Attachments (Terms of Use)
patch -- add sv_maxconn field to svc_serv (3.72 KB, patch)
2008-12-12 20:41 UTC, Jeff Layton
no flags Details | Diff
patch -- increase sv_maxconn for lockd and add module parameter to tune it (1.92 KB, patch)
2008-12-12 20:42 UTC, Jeff Layton
no flags Details | Diff


Links
System ID Private Priority Status Summary Last Updated
Red Hat Product Errata RHSA-2009:1243 0 normal SHIPPED_LIVE Important: Red Hat Enterprise Linux 5.4 kernel security and bug fix update 2009-09-01 08:53:34 UTC

Description Jeff Layton 2008-10-22 18:47:35 UTC
+++ This bug was initially created as a clone of Bug #457405 +++

Description of problem:
Kernel printk: lockd: too many open TCP sockets, consider increasing number of 
nfsd threads.  

Have increased the thread count to rpc.nfsd 256 and higher while testing locking
of a single nfs file on ~200+ servers at the same time, watching the number of
unique socket connections to the nlockmgr port on the nfs server.  Once the
number of connections reaches 80, the above printk is logged and additional
connections are denied until some of the first 80 established connections start
closing.  Changing the number of nfsd threads running higher or lower makes no
change in the number of rpc connections - it stays at 80.  Eventually all
clients do finish locking the file, however not in the time expected, as this
causes a lot of timeo client retries and very slow lock responses.

net/sunrpc/svcsock.c reads:
  if (serv->sv_tmpcnt > (serv->sv_nrthreads+3)*20) {.....

if "sv_nrthreads=1" then the number of socket connections is limited to 80, 
this is what is happening.  I have rebuilt a test kernel mod'ing 20->40 in the 
above code and this does in fact increase the number of sockets limited to 
160.  Seems to me that sv_nrthreads should equal the actual number of 
rpc.nfsd's running else this limitation makes no sense.
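
(For reference, a minimal sketch of the check being described; the function name
matches the svc_check_conn_limits mentioned later in this bug, but the body here
is illustrative rather than a verbatim copy of the RHEL source:)

#include <linux/sunrpc/svc.h>	/* struct svc_serv */

/* Illustrative sketch of the per-service connection cap discussed above;
 * assumed context: called when a new temporary (TCP) socket is accepted. */
static void svc_check_conn_limits(struct svc_serv *serv)
{
	/* lockd is single-threaded, so sv_nrthreads == 1 and the cap
	 * works out to (1 + 3) * 20 == 80 temporary sockets. */
	if (serv->sv_tmpcnt > (serv->sv_nrthreads + 3) * 20) {
		printk(KERN_NOTICE "%s: too many open TCP sockets, "
		       "consider increasing the number of nfsd threads\n",
		       serv->sv_name);
		/* ... close the oldest temporary socket to make room ... */
	}
}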

Version-Release number of selected component (if applicable):
Have tested on ES4.4 through ES4.6 kernel baselines.

How reproducible:
Every time rpc sockets reach 80 on the nfs server.

Steps to Reproduce:
above....
  
Actual results:
The number of established socket connections is limited to 80.

Expected results:
The number of rpc socket connections should decrease/increase with the number
of nfsd threads running.

--- Additional comment from jlayton on 2008-08-12 07:25:47 EDT ---

Actually...

It looks like the problem is that lockd is running out of sockets. So increasing the number of nfsd's won't have any effect here. The warning message here is bogus. It comes from generic RPC code but is a warning about "nfsd" sockets (that, at the very least, should be fixed). Unfortunately, you can't increase the number of lockd threads -- it's necessarily single-threaded...

From a glance at the upstream code, it looks like the same problem exists there. Are you also able to reproduce this with recent fedora or something closer to current mainline kernels?

--- Additional comment from jlayton on 2008-08-12 07:27:07 EDT ---

Fixing this will probably require changing the check you mention, but I need to first understand the purpose of that check in the first place (i.e. are there other hard limits that we'll hit if we remove it).

--- Additional comment from tjp on 2008-08-12 11:03:16 EDT ---

I have just tested with 2.6.18-53.1.4 (ES5.1) and the same limit/printk exists.  Looking at the code for the 5.2 kernel, it will also exist there.  We were originally thinking that this was a lockd thread limitation, but glancing at the referencing code it seemed to increment sv_nrthreads when more nfsd's were started, which would be in line with the message; I guess this is not the case.  I haven't seen any degraded I/O performance since increasing the count.

--- Additional comment from jlayton on 2008-08-14 08:07:18 EDT ---

Increasing the number of nfsd threads increases sv_nrthreads for nfsd only. It doesn't have any effect on lockd. I think this BZ points out the need for a couple of things:

1) fix this printk to be more generic. It shouldn't explicitly mention nfsd threads since the number of nfsd sockets isn't a problem in this case.

2) check and see why this limit on the number of sockets exists in the first place. It's probably there to try and limit DoS attacks on an RPC service, but it seems like this limit ought to be tunable (or maybe just go away entirely for services that are single-threaded).

I'll probably need to toss this question out to the upstream linux-nfs mailing list since the reason for setting this limit where it is isn't exactly clear...

--- Additional comment from jlayton on 2008-09-12 10:51:36 EDT ---

Sent a patch to make the warning message more generic upstream. I also asked for clarification about why the hardcoded check uses:

(sv_nrthreads + 3) * 20

...as a formula.

--- Additional comment from jlayton on 2008-10-15 11:55:46 EDT ---

Created an attachment (id=320451)
patch -- remove svc_check_conn_limits

RFC patch that I've sent upstream. This just removes the check altogether (and some other code that won't be needed with it gone).

This may get shot down altogether or need some modification, but it's at least a starting point for discussion. Awaiting comment there now.

--- Additional comment from jlayton on 2008-10-20 12:37:11 EDT ---

Created an attachment (id=320884)
patchset -- add sv_maxconn field to svc_serv

After some upstream discussion, this patchset seems to be pretty close to being accepted. We'll probably also be able to do something similar for RHEL, but it's likely to look different since we'll have to fix up kABI.

--- Additional comment from jlayton on 2008-10-22 14:45:10 EDT ---

Bruce Fields took the latest patchset into his tree so it seems likely to go upstream. I don't think this will be appropriate for RHEL4 though. It's a kABI-breaker, for one thing. It's also too late for 4.8 and I don't think it meets the threshold of criticality that 4.9 will have.

For this reason, I'm going to go ahead and close this WONTFIX and clone the bug for RHEL5. We can evaluate it for inclusion there.

Comment 1 Jeff Layton 2008-12-12 20:41:49 UTC
Created attachment 326769 [details]
patch -- add sv_maxconn field to svc_serv
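
(For context, a rough sketch of how such a field could slot into the existing
check; the attached patch may differ in its details:)

/* Rough illustration: a non-zero sv_maxconn on the svc_serv overrides
 * the hard-coded (sv_nrthreads + 3) * 20 formula. */
static void svc_check_conn_limits(struct svc_serv *serv)
{
	unsigned int limit = serv->sv_maxconn ? serv->sv_maxconn :
				(serv->sv_nrthreads + 3) * 20;

	if (serv->sv_tmpcnt > limit) {
		/* warn and close the oldest temporary socket, as before */
	}
}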

Comment 2 Jeff Layton 2008-12-12 20:42:34 UTC
Created attachment 326770 [details]
patch -- increase sv_maxconn for lockd and add module parameter to tune it
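
(A hedged sketch of what the lockd side might look like; the parameter name
nlm_max_connections and the default shown here are assumptions for
illustration, not taken from the attachment:)

#include <linux/module.h>

/* Hypothetical module parameter to tune lockd's connection limit;
 * the name and default value are assumed for illustration only. */
static unsigned int nlm_max_connections = 1024;
module_param(nlm_max_connections, uint, 0644);
MODULE_PARM_DESC(nlm_max_connections,
		 "maximum number of client connections to lockd");

/* ...then, when lockd creates its svc_serv:
 *	serv->sv_maxconn = nlm_max_connections;
 */

Something along those lines would let the limit be raised at module load time
(e.g. an "options lockd nlm_max_connections=2048" line in modprobe.conf, again
assuming that parameter name).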

Comment 3 Jeff Layton 2008-12-12 20:44:09 UTC
I *think* this is still kABI-safe, since anyone creating a new svc_serv should be using the svc_create interface. If it isn't, we'll probably have to take a different approach and put the sv_maxconn field elsewhere (a lookaside cache, maybe?).

That probably won't be as efficient however.

Comment 5 RHEL Program Management 2009-01-27 20:39:03 UTC
This request was evaluated by Red Hat Product Management for inclusion in a Red
Hat Enterprise Linux maintenance release.  Product Management has requested
further review of this request by Red Hat Engineering, for potential
inclusion in a Red Hat Enterprise Linux Update release for currently deployed
products.  This request is not yet committed for inclusion in an Update
release.

Comment 6 RHEL Program Management 2009-02-16 15:36:50 UTC
Updating PM score.

Comment 7 Don Zickus 2009-03-16 15:21:46 UTC
in kernel-2.6.18-135.el5
You can download this test kernel from http://people.redhat.com/dzickus/el5

Please do NOT transition this bugzilla state to VERIFIED until our QE team
has sent specific instructions indicating when to do so.  However feel free
to provide a comment indicating that this fix has been verified.

Comment 10 errata-xmlrpc 2009-09-02 08:48:01 UTC
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2009-1243.html

