Bug 459397 - Cannot create more than 1024 nfsd threads
Summary: Cannot create more than 1024 nfsd threads
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Linux 5
Classification: Red Hat
Component: kernel
Version: 5.2
Hardware: All
OS: Linux
Priority: high
Severity: high
Target Milestone: rc
Assignee: Jeff Layton
QA Contact: Red Hat Kernel QE team
URL:
Whiteboard:
Depends On:
Blocks: 456229 483701 485920
 
Reported: 2008-08-18 13:31 UTC by Jeff Layton
Modified: 2018-10-20 02:00 UTC
CC List: 10 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2009-09-02 08:26:10 UTC
Target Upstream Version:
Embargoed:


Attachments
patchset -- overhaul knfsd readahead cache (13.72 KB, patch)
2008-12-12 20:37 UTC, Jeff Layton


Links
Red Hat Product Errata RHSA-2009:1243 (normal, SHIPPED_LIVE): Important: Red Hat Enterprise Linux 5.4 kernel security and bug fix update (last updated 2009-09-01 08:53:34 UTC)

Description Jeff Layton 2008-08-18 13:31:07 UTC
+++ This bug was initially created as a clone of Bug #456229 +++

Escalated to Bugzilla from IssueTracker

--- Additional comment from tao on 2008-07-22 08:30:22 EDT ---

State the problem

   2. Provide clear and concise problem description as it is understood at the time of escalation

A customer running a large search engine with many NFS clients, and a minimum of 200 NFS threads in use at any given time, cannot increase the thread count above 1024. This is a migration from Solaris, where they ran 2048 threads without issue; they are hitting this limit only on Linux, and it is preventing the setup from working well.

The errors are these:

Feb 13 11:59:41 racedo nfsd[29991]: nfssvc: Cannot allocate memory
Feb 13 11:59:41 racedo kernel: nfsd: Could not allocate memory read-ahead cache.

   3. State specific action requested of SEG

Check whether there is any tunable that would allow increasing this limit. Note that I managed to start 2048 threads by running rpc.nfsd twice, but I cannot start 2048 by setting RPCNFSDCOUNT=1024 in /etc/sysconfig/nfs; please see the reproduction steps.

Provide supporting info

   2. Attach sosreport

done.

   4. Provide issue repro information:

Running 'rpc.nfsd 1024' works fine. If I then run 'rpc.nfsd 2048' afterwards, that works too. If I run 'rpc.nfsd 2048' directly, it does not work. Please follow this sequence:

# killall -2 nfsd
# ulimit -a
core file size          (blocks, -c) unlimited
data seg size           (kbytes, -d) unlimited
file size               (blocks, -f) unlimited
pending signals                 (-i) 1024
max locked memory       (kbytes, -l) 32
max memory size         (kbytes, -m) unlimited
open files                      (-n) 2048
pipe size            (512 bytes, -p) 8
POSIX message queues     (bytes, -q) 819200
stack size              (kbytes, -s) 10240
cpu time               (seconds, -t) unlimited
max user processes              (-u) 16360
virtual memory          (kbytes, -v) unlimited
file locks                      (-x) unlimited
#  rpc.nfsd 1024
# ps -ef|grep nfsd|wc -l
1025
#  rpc.nfsd 2048
# ps -ef|grep nfsd|wc -l
2049
# killall -2 nfsd
#  rpc.nfsd 2048
# ps -ef|grep nfsd|wc -l
1

Note that this is the same in RHEL3 and RHEL4.

Many thanks,

Ramon

This event sent from IssueTracker by sfernand  [Support Engineering Group]
 issue 163807

--- Additional comment from tao on 2008-07-22 08:30:23 EDT ---

File uploaded: kfarmer.tar.bz2
This event sent from IssueTracker by sfernand  [Support Engineering Group]
 issue 163807
it_file 119535

--- Additional comment from tao on 2008-07-22 08:30:24 EDT ---

> IIRC, rpc.nfsd simply writes the argument to /proc/fs/nfsd/threads, so
could you please capture the value in this file between the runs of
rpc.nfsd ?

RHEL3:

# rpc.nfsd 1024
# grep th /proc/net/rpc/nfsd
th 1024 0 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000
# grep th /proc/net/rpc/nfsd
th 2048 0 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000


RHEL4:

# rpc.nfsd 1024
# cat /proc/fs/nfsd/threads
(empty)
# grep th /proc/net/rpc/nfsd
th 1024 0 0.538 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000
# rpc.nfsd 2048
# grep th /proc/net/rpc/nfsd
th 2048 0 0.538 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000


> Since RHEL 3 is in maintenance mode, I think this ticket won't make
much headway if it ever gets to engineering. I suggest that you verify
whether this is reproducible on RHEL 4 (the description did mention it,
but I'm not too sure whether this was validated or just something the
customer claimed). If it does exist, change the 'Product/Version' field
to the correct release.

It does:

# uname -a
Linux racedo.fab.redhat.com 2.6.9-55.0.2.EL #1 Tue Jun 12 17:47:10 EDT
2007 i686 i686 i386 GNU/Linux
#  rpc.nfsd 2048
# tail -2 /var/log/messages
Feb 21 11:02:00 racedo nfsd[940]: nfssvc: Cannot allocate memory
Feb 21 11:02:00 racedo kernel: nfsd: Could not allocate memory read-ahead
cache.

The kernel is not the latest one, but I don't think that makes any
difference.

Thanks!

Ramon


Product changed from 'Red Hat Enterprise Linux 3.9' to 'Red Hat Enterprise
Linux 4.6'
Internal Status set to 'Waiting on SEG'

This event sent from IssueTracker by sfernand  [Support Engineering Group]
 issue 163807

--- Additional comment from tao on 2008-07-22 08:30:24 EDT ---

Hi Eva,

> can I have an update on this?
Sorry about the extremely delayed response. This one is kind of hard to
figure out. I managed (with a bit of effort) to reproduce this issue;
however, I can't really say why it is happening.

Anyway, the short description -- the only way I have managed to reproduce
this is by first killing all nfsd processes uncleanly, in a manner that
leaves some rpc.* processes hanging around[1]:
$ killall nfsd
or 
$ killall -9 nfsd

In such a case, even when you execute ...

$ rpc.nfsd 1

...nfsd complains with the message:

nfsd[9554]: nfssvc: Cannot allocate memory

Now, to recover from this situation you need to cleanly restart nfs (by
first killing off all the rpc.* processes):

[root@dhcp6-104 ~]# ps ax | grep "rpc\\."
 4679 ?        Ss     0:00 rpc.idmapd
[root@dhcp6-104 ~]# killall rpc.idmapd
[root@dhcp6-104 ~]# ps ax | grep "rpc\\."
[root@dhcp6-104 ~]# service nfs start
Starting NFS services:                                     [  OK  ]
Starting NFS quotas:                                       [  OK  ]
Starting NFS daemon:                                       [  OK  ]
Starting NFS mountd:                                       [  OK  ]
Starting RPC idmapd:                                       [  OK  ]
[root@dhcp6-104 ~]# rpc.nfsd 2048
[root@dhcp6-104 ~]# tail -2 /var/log/messages
Jun 13 20:03:49 dhcp6-104 rpcidmapd: rpc.idmapd startup succeeded
Jun 13 20:04:07 dhcp6-104 nfsd[11753]: nfssvc_versbits: +2 +3 +4
[root@dhcp6-104 ~]# ps ax| grep nfsd | wc -l
2049

I'll send out the long description in my next update (after I verify what
I /think/ might be happening). Just FYI, I think this behaviour might have
been reported at least once before, in IT 106266, but that was just closed
without resolution.

regards,
- steve

[1] ie: you get messages similar to this in the log file:
Jun 13 19:48:48 dhcp6-104 kernel: rpciod: active tasks at shutdown?!




This event sent from IssueTracker by sfernand  [Support Engineering Group]
 issue 163807

--- Additional comment from tao on 2008-07-22 08:30:25 EDT ---

also: https://bugzilla.redhat.com/show_bug.cgi?id=202420


This event sent from IssueTracker by sfernand  [Support Engineering Group]
 issue 163807

--- Additional comment from tao on 2008-07-22 08:30:26 EDT ---

Escalating.

Engineering: I am sorry I do not know how to debug this further. Maybe you
could provide some pointers.

- steve


This event sent from IssueTracker by sfernand  [Support Engineering Group]
 issue 163807

--- Additional comment from jlayton on 2008-07-29 09:34:24 EDT ---

We're definitely returning -ENOMEM here:

open("/proc/fs/nfsd/threads", O_WRONLY) = 3
write(3, "1024\n", 5)                   = -1 ENOMEM (Cannot allocate memory)

nfsd_debug doesn't tell us much:

    nfsd: creating service: port 2049 vers 0xe proto 0x30000
    nfsd: Could not allocate memory read-ahead cache.


--- Additional comment from jlayton on 2008-07-29 09:49:39 EDT ---

RHEL5 seems to behave the same way. rawhide seems to do the right thing, but I
recently did a fairly major overhaul of the nfsd startup/shutdown code upstream
so that may be part of the reason.


--- Additional comment from jlayton on 2008-07-29 10:18:26 EDT ---

The problem is that this allocation is failing in nfsd_racache_init():

        raparml = kmalloc(sizeof(struct raparms) * cache_size, GFP_KERNEL);

...cache_size here is 2 * nrthreads. So with 1024 threads, it does 2048 *
sizeof(struct raparms) (not sure how big struct raparms is right offhand).

Upstream does this very similarly (with kcalloc rather than kmalloc, but
basically the same). The structs are different sizes, but I don't think it's
that significant.

It may just be that rawhide is better able to handle these large allocations.


--- Additional comment from jlayton on 2008-07-29 10:29:25 EDT ---

As far as I can tell, there's no real reason that this needs to be a contiguous
allocation anyway. nfsd_racache_init() uses that fact when it sets up the cache,
but it looks like this could be done just as easily if each raparms struct was
separately allocated.

So this may be fixable, but it's probably going to take some upstream work and
may be too invasive for RHEL4 at this stage.
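
As an illustration of that direction (a sketch of the general idea only, not the patch that eventually went upstream), each entry could be allocated on its own and chained as it is created, so no single request ever has to exceed the kmalloc size limit; the shutdown path would need the matching change to walk the list and free each entry:

    /* sketch: per-entry allocations instead of one big array */
    int nfsd_racache_init(int cache_size)
    {
            int i;
            struct raparms *ra;

            for (i = 0; i < cache_size; i++) {
                    ra = kzalloc(sizeof(*ra), GFP_KERNEL);
                    if (!ra)
                            goto out_nomem;
                    ra->p_next = raparm_cache;
                    raparm_cache = ra;
            }
            return 0;

    out_nomem:
            nfsd_racache_shutdown();    /* assumed to walk the list and free it */
            return -ENOMEM;
    }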


--- Additional comment from jlayton on 2008-08-13 07:16:38 EDT ---

On my x86_64 xen guests, sizeof(struct raparms) is:

2.6.27-0.244.rc2.git1.fc10.x86_64 = 72
2.6.18-103.el5.jtltest.45debug = 112

...and rhel4 looks like it has this sized similarly to rhel5. Starting 1024 nfsd threads also fails on rhel5. The breakover point seems to be at 586 threads:

586 * 2 * 112 = 131264

...which is just over 131072. That is the largest kmalloc() that you can do in RHEL4/5, and that explains why this falls down. I think the slub allocator (which is used in recent fedora) has a different scheme for large kmallocs and isn't subject to the same limitation.

Still, doing this as one large allocation means that we need non-fragmented memory if we want to start a bunch of nfsd's, and that can be a problem even in recent kernels.
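
A quick sanity check of that breakover point, as a standalone userspace snippet (the 112-byte struct size and the 131072-byte cap are the figures from this comment, not something the program measures itself):

    #include <stdio.h>

    int main(void)
    {
            const unsigned long kmalloc_max = 131072;   /* largest kmalloc in RHEL4/5 */
            const unsigned long raparms_size = 112;     /* sizeof(struct raparms), RHEL5 x86_64 */
            unsigned long threads;

            for (threads = 585; threads <= 586; threads++) {
                    unsigned long request = 2 * threads * raparms_size;
                    printf("%4lu threads -> %6lu bytes: %s\n", threads, request,
                           request > kmalloc_max ? "too big" : "fits");
            }
            return 0;
    }

    /* prints:
     *  585 threads -> 131040 bytes: fits
     *  586 threads -> 131264 bytes: too big
     */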

--- Additional comment from jlayton on 2008-08-14 07:46:52 EDT ---

I've sent an initial patch upstream for this and am awaiting comment. It has the kernel allocate each raparm struct individually and then puts them together to build up the racache.

This approach seems to work fine, but we might consider adding a new slabcache for this. On my x86_64 rawhide box each of these allocations comes out of the kmalloc-96 slab, so we're wasting 24 bytes on each allocation. This adds up with a lot of nfsd threads. With a dedicated slabcache we can pack these structs into a page more efficiently and waste less memory when there are a lot of them. The downside is that we could waste up to a page minus sizeof(struct raparms), depending on the number allocated. So it might be better to just stick with kmalloc.
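
For comparison, the dedicated-slab variant would look something like the sketch below ("nfsd_raparms" is just an illustrative cache name, and the kmem_cache_create() arguments differ between kernel versions -- older kernels also take a destructor argument). The unused tail of the last partially filled slab page is the "up to a page minus sizeof(struct raparms)" waste mentioned above:

    static struct kmem_cache *raparm_slab;

    /* create a cache sized exactly for struct raparms, so objects pack
     * tightly into each slab page instead of rounding up to kmalloc-96 */
    raparm_slab = kmem_cache_create("nfsd_raparms", sizeof(struct raparms),
                                    0, 0, NULL);
    if (!raparm_slab)
            return -ENOMEM;

    /* each entry then comes out of the dedicated cache */
    ra = kmem_cache_zalloc(raparm_slab, GFP_KERNEL);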

Comment 1 Jeff Layton 2008-08-18 13:33:16 UTC
Upstream patch has been modified some and taken into Bruce Fields' git tree. It looks like it'll be on track for 2.6.28.

Comment 2 Jeff Layton 2008-12-12 20:37:18 UTC
Created attachment 326768 [details]
patchset -- overhaul knfsd readahead cache

I went back through the patch archives and pulled out a couple of other patches that might be useful here. They also make it so that the upstream patch applies cleanly to RHEL.

This set changes the readahead cache to use more granular locking, which supposedly reduces CPU utilization (mostly spinlock contention) on heavily loaded SMP NFS servers. This will need to be tested well, but it looks like a reasonable change to consider.
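
The shape of the locking change is roughly the following (a sketch of the idea only; the attached patchset is the authoritative version). Instead of one global list protected by a single lock, the cache is split into a small number of hash buckets, each with its own spinlock, so nfsd threads working on different files mostly take different locks:

    /* sketch: hash the readahead cache into buckets with per-bucket locks */
    #define RAPARM_HASH_BITS    4
    #define RAPARM_HASH_SIZE    (1 << RAPARM_HASH_BITS)
    #define RAPARM_HASH_MASK    (RAPARM_HASH_SIZE - 1)

    struct raparm_hbucket {
            struct raparms  *pb_head;   /* entries hashed to this bucket */
            spinlock_t      pb_lock;    /* protects only this bucket */
    };

    static struct raparm_hbucket raparm_hash[RAPARM_HASH_SIZE];

    /* requests for the same file hash to the same bucket, so readahead
     * state is still shared, but different files rarely contend */
    static inline struct raparm_hbucket *
    raparm_bucket(dev_t dev, ino_t ino)
    {
            return &raparm_hash[(unsigned long)(dev + ino) & RAPARM_HASH_MASK];
    }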

Comment 5 RHEL Program Management 2009-01-27 20:40:41 UTC
This request was evaluated by Red Hat Product Management for inclusion in a Red
Hat Enterprise Linux maintenance release.  Product Management has requested
further review of this request by Red Hat Engineering, for potential
inclusion in a Red Hat Enterprise Linux Update release for currently deployed
products.  This request is not yet committed for inclusion in an Update
release.

Comment 6 Don Zickus 2009-02-09 18:25:47 UTC
in kernel-2.6.18-131.el5
You can download this test kernel from http://people.redhat.com/dzickus/el5

Please do NOT transition this bugzilla state to VERIFIED until our QE team
has sent specific instructions indicating when to do so.  However feel free
to provide a comment indicating that this fix has been verified.

Comment 7 RHEL Program Management 2009-02-16 15:05:22 UTC
Updating PM score.

Comment 18 errata-xmlrpc 2009-09-02 08:26:10 UTC
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2009-1243.html

