+++ This bug was initially created as a clone of Bug #456229 +++

Escalated to Bugzilla from IssueTracker

--- Additional comment from tao on 2008-07-22 08:30:22 EDT ---

State the problem

2. Provide a clear and concise problem description as it is understood at the
time of escalation

A customer running a large search engine with a large number of NFS clients,
and a minimum of 200 NFS threads in use at any given time, cannot increase the
thread count above 1024. This is a migration from Solaris, where they ran 2048
threads without issue; they are hitting this limit only on Linux. The errors
are:

Feb 13 11:59:41 racedo nfsd[29991]: nfssvc: Cannot allocate memory
Feb 13 11:59:41 racedo kernel: nfsd: Could not allocate memory read-ahead cache.

3. State specific action requested of SEG

Check whether any tunable exists to raise this limit. Note that I managed to
start 2048 threads by running rpc.nfsd twice, but I can't start 2048 by setting
RPCNFSDCOUNT=1024 in /etc/sysconfig/nfs; please see the reproduction steps.

Provide supporting info

2. Attach sosreport: done.

4. Provide issue repro information: running 'rpc.nfsd 1024' works fine, and if
I afterwards run 'rpc.nfsd 2048' that works too. Running 'rpc.nfsd 2048'
directly does not work.
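(For reference, a minimal sketch of the /etc/sysconfig/nfs fragment the init
script reads the thread count from; the path and variable name are the ones
mentioned above, the value shown is just an example:)

```shell
# /etc/sysconfig/nfs -- number of kernel nfsd threads the init script
# passes to rpc.nfsd at "service nfs start"
RPCNFSDCOUNT=1024
```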
Please follow this sequence:

# killall -2 nfsd
# ulimit -a
core file size          (blocks, -c) unlimited
data seg size           (kbytes, -d) unlimited
file size               (blocks, -f) unlimited
pending signals                 (-i) 1024
max locked memory       (kbytes, -l) 32
max memory size         (kbytes, -m) unlimited
open files                      (-n) 2048
pipe size            (512 bytes, -p) 8
POSIX message queues     (bytes, -q) 819200
stack size              (kbytes, -s) 10240
cpu time               (seconds, -t) unlimited
max user processes              (-u) 16360
virtual memory          (kbytes, -v) unlimited
file locks                      (-x) unlimited
# rpc.nfsd 1024
# ps -ef | grep nfsd | wc -l
1025
# rpc.nfsd 2048
# ps -ef | grep nfsd | wc -l
2049
# killall -2 nfsd
# rpc.nfsd 2048
# ps -ef | grep nfsd | wc -l
1

Note this is the same in RHEL3 and RHEL4.

Many thanks,
Ramon

This event sent from IssueTracker by sfernand [Support Engineering Group]
issue 163807

--- Additional comment from tao on 2008-07-22 08:30:23 EDT ---

File uploaded: kfarmer.tar.bz2

it_file 119535

--- Additional comment from tao on 2008-07-22 08:30:24 EDT ---

> IIRC, rpc.nfsd simply writes the argument to /proc/fs/nfsd/threads, so could
> you please capture the value in this file between the runs of rpc.nfsd?

RHEL3:

# rpc.nfsd 1024
# grep th /proc/net/rpc/nfsd
th 1024 0 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000
# grep th /proc/net/rpc/nfsd
th 2048 0 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000

RHEL4:

# rpc.nfsd 1024
# cat /proc/fs/nfsd/threads
(empty)
# grep th /proc/net/rpc/nfsd
th 1024 0 0.538 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000
# rpc.nfsd 2048
# grep th /proc/net/rpc/nfsd
th 2048 0 0.538 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000

> Since RHEL 3 is in maintenance mode, I think this ticket won't make much
> headway if it ever gets to engineering.
I suggest that you verify whether this is reproducible on RHEL 4 (the
description does mention it, but I'm not sure whether that was validated or is
just something the customer claimed). If it does reproduce, change the
'Product/Version' field to the correct release.

It does:

# uname -a
Linux racedo.fab.redhat.com 2.6.9-55.0.2.EL #1 Tue Jun 12 17:47:10 EDT 2007 i686 i686 i386 GNU/Linux
# rpc.nfsd 2048
# tail -2 /var/log/messages
Feb 21 11:02:00 racedo nfsd[940]: nfssvc: Cannot allocate memory
Feb 21 11:02:00 racedo kernel: nfsd: Could not allocate memory read-ahead cache.

The kernel is not the latest, but I don't think that makes any difference.

Thanks!
Ramon

Product changed from 'Red Hat Enterprise Linux 3.9' to 'Red Hat Enterprise Linux 4.6'
Internal Status set to 'Waiting on SEG'

--- Additional comment from tao on 2008-07-22 08:30:24 EDT ---

Hi Eva,

> can I have an update on this?

Sorry about the extremely delayed response. This one is hard to figure out. I
managed (with some effort) to reproduce the issue, but I can't really say why
it happens.

Short description: the only way I have managed to reproduce this is by first
killing all nfsd processes uncleanly, in a manner that leaves some rpc.*
processes hanging around[1]:

$ killall nfsd
or
$ killall -9 nfsd

In such a case, even when you execute...

$ rpc.nfsd 1

...nfsd complains with the message:

nfsd[9554]: nfssvc: Cannot allocate memory

To recover from this situation you need to cleanly restart nfs (by first
killing off all the rpc.* processes):

[root@dhcp6-104 ~]# ps ax | grep "rpc\."
 4679 ?        Ss     0:00 rpc.idmapd
[root@dhcp6-104 ~]# killall rpc.idmapd
[root@dhcp6-104 ~]# ps ax | grep "rpc\."
[root@dhcp6-104 ~]# service nfs start
Starting NFS services:                                     [  OK  ]
Starting NFS quotas:                                       [  OK  ]
Starting NFS daemon:                                       [  OK  ]
Starting NFS mountd:                                       [  OK  ]
Starting RPC idmapd:                                       [  OK  ]
[root@dhcp6-104 ~]# rpc.nfsd 2048
[root@dhcp6-104 ~]# tail -2 /var/log/messages
Jun 13 20:03:49 dhcp6-104 rpcidmapd: rpc.idmapd startup succeeded
Jun 13 20:04:07 dhcp6-104 nfsd[11753]: nfssvc_versbits: +2 +3 +4
[root@dhcp6-104 ~]# ps ax | grep nfsd | wc -l
2049

I'll send out the long description in my next update (after I verify what I
/think/ might be happening). Just FYI: I think this behaviour may have been
reported at least once before, in IT 106266, but that was closed without
resolution.

regards,
- steve

[1] i.e., you get messages similar to this in the log file:
Jun 13 19:48:48 dhcp6-104 kernel: rpciod: active tasks at shutdown?!

--- Additional comment from tao on 2008-07-22 08:30:25 EDT ---

also: https://bugzilla.redhat.com/show_bug.cgi?id=202420

--- Additional comment from tao on 2008-07-22 08:30:26 EDT ---

Escalating.

Engineering: I am sorry, I do not know how to debug this further. Maybe you
could provide some pointers.

- steve

--- Additional comment from jlayton on 2008-07-29 09:34:24 EDT ---

We're definitely returning -ENOMEM here:

open("/proc/fs/nfsd/threads", O_WRONLY) = 3
write(3, "1024\n", 5)                   = -1 ENOMEM (Cannot allocate memory)

nfsd_debug doesn't tell us much:

nfsd: creating service: port 2049 vers 0xe proto 0x30000
nfsd: Could not allocate memory read-ahead cache.

--- Additional comment from jlayton on 2008-07-29 09:49:39 EDT ---

RHEL5 seems to behave the same way.
rawhide seems to do the right thing, but I recently did a fairly major
overhaul of the nfsd startup/shutdown code upstream, so that may be part of
the reason.

--- Additional comment from jlayton on 2008-07-29 10:18:26 EDT ---

The problem is that this allocation is failing in nfsd_racache_init():

	raparml = kmalloc(sizeof(struct raparms) * cache_size, GFP_KERNEL);

...cache_size here is 2 * nrthreads, so with 1024 threads it asks for
2048 * sizeof(struct raparms) bytes (I'm not sure how big struct raparms is
right offhand). Upstream does this very similarly (with kcalloc rather than
kmalloc, but basically the same). The structs are different sizes, but I don't
think that's significant. It may just be that rawhide is better able to handle
these large allocations.

--- Additional comment from jlayton on 2008-07-29 10:29:25 EDT ---

As far as I can tell, there's no real reason this needs to be a contiguous
allocation anyway. nfsd_racache_init() uses that fact when it sets up the
cache, but it looks like this could be done just as easily if each raparms
struct were allocated separately. So this may be fixable, but it's probably
going to take some upstream work and may be too invasive for RHEL4 at this
stage.

--- Additional comment from jlayton on 2008-08-13 07:16:38 EDT ---

On my x86_64 xen guests, sizeof(struct raparms) is:

2.6.27-0.244.rc2.git1.fc10.x86_64 = 72
2.6.18-103.el5.jtltest.45debug    = 112

...and RHEL4 looks like it has this sized similarly to RHEL5. Starting 1024
nfsd threads also fails on RHEL5. The breakover point seems to be at 586
threads:

586 * 2 * 112 = 131264

...which is just over 131072. That is the largest kmalloc() you can do in
RHEL4/5, and that explains why this falls down. I think the slub allocator
(which is used in recent Fedora) has a different scheme for large kmallocs and
isn't subject to the same limitation.
Still, doing this as one large allocation means that we need non-fragmented
memory if we want to start a bunch of nfsd's, and that can be a problem even
in recent kernels.

--- Additional comment from jlayton on 2008-08-14 07:46:52 EDT ---

I've sent an initial patch upstream for this and am awaiting comment. It has
the kernel allocate each raparm struct individually and then put them together
to build up the racache.

This approach seems to work fine, but we might consider adding a new slabcache
for this. On my x86_64 rawhide box each of these allocations comes out of the
kmalloc-96 slab, so we're wasting 24 bytes per allocation, which adds up with
a lot of nfsd threads. With a dedicated slabcache we can pack these structs
into a page more efficiently and waste less memory when there are a lot of
them. The downside is that we could waste up to (page size - sizeof(struct
raparm)) bytes depending on the number allocated, so it might be better to
just stick with kmalloc.
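The per-entry allocation idea described above can be sketched in userspace
roughly like this. The struct layout and names are illustrative stand-ins,
not the kernel's; the real patch builds the racache from individually
kmalloc'd raparm structs inside nfsd_racache_init():

```c
#include <stdlib.h>

/* Stand-in for the kernel's raparm entry; the real struct also carries
 * inode/offset bookkeeping for the readahead state. */
struct raparm {
    struct raparm *p_next;
};

/* Build a free list of 'n' entries one small allocation at a time, so no
 * single contiguous block (and thus no 128 KiB kmalloc limit) is needed.
 * Returns the list head, or NULL after unwinding on allocation failure. */
static struct raparm *racache_init(unsigned int n)
{
    struct raparm *head = NULL;

    for (unsigned int i = 0; i < n; i++) {
        struct raparm *ra = calloc(1, sizeof(*ra)); /* kzalloc in-kernel */
        if (!ra) {
            while (head) {                /* unwind partial progress */
                struct raparm *next = head->p_next;
                free(head);
                head = next;
            }
            return NULL;                  /* -ENOMEM in the kernel */
        }
        ra->p_next = head;
        head = ra;
    }
    return head;
}
```

With 2048 threads this becomes 4096 independent ~112-byte allocations rather
than one ~448 KiB contiguous block, so neither the kmalloc size cap nor memory
fragmentation gets in the way.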
The upstream patch has been modified somewhat and taken into Bruce Fields' git
tree. It looks like it's on track for 2.6.28.
Created attachment 326768 [details]
patchset -- overhaul knfsd readahead cache

I went back through the patch archives and pulled out a couple of other
patches that might be useful here; they also allow the upstream patch to apply
cleanly to RHEL. This set changes the readahead cache to use more granular
locking, which reportedly reduces CPU utilization on heavily loaded SMP NFS
servers (mostly by reducing spinlock contention). This will need to be well
tested, but it looks like a reasonable change to consider.
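The "more granular locking" idea can be illustrated in userspace as splitting
one cache-wide lock into per-hash-bucket locks, so concurrent lookups contend
only when they hash to the same bucket. This is a toy sketch with pthread
mutexes standing in for kernel spinlocks; all names are illustrative, not
taken from the actual patchset:

```c
#include <pthread.h>
#include <stddef.h>

#define RA_HASH_BUCKETS 16

struct ra_entry {
    struct ra_entry *next;
    unsigned long key;          /* stands in for the file's identity */
};

struct ra_bucket {
    pthread_mutex_t lock;       /* per-bucket lock, not cache-wide */
    struct ra_entry *chain;
};

static struct ra_bucket ra_hash[RA_HASH_BUCKETS];

static void ra_hash_init(void)
{
    for (int i = 0; i < RA_HASH_BUCKETS; i++) {
        pthread_mutex_init(&ra_hash[i].lock, NULL);
        ra_hash[i].chain = NULL;
    }
}

/* Insert while holding only the owning bucket's lock, so threads working
 * on entries in other buckets are never blocked by this insert. */
static void ra_insert(struct ra_entry *e)
{
    struct ra_bucket *b = &ra_hash[e->key % RA_HASH_BUCKETS];

    pthread_mutex_lock(&b->lock);
    e->next = b->chain;
    b->chain = e;
    pthread_mutex_unlock(&b->lock);
}
```

With many nfsd threads hammering the cache, the single global lock becomes a
serialization point; spreading it across buckets is the standard way to cut
that contention down.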
This request was evaluated by Red Hat Product Management for inclusion in a Red Hat Enterprise Linux maintenance release. Product Management has requested further review of this request by Red Hat Engineering, for potential inclusion in a Red Hat Enterprise Linux Update release for currently deployed products. This request is not yet committed for inclusion in an Update release.
In kernel-2.6.18-131.el5. You can download this test kernel from
http://people.redhat.com/dzickus/el5

Please do NOT transition this bugzilla state to VERIFIED until our QE team has
sent specific instructions indicating when to do so. However, feel free to
provide a comment indicating that this fix has been verified.
Updating PM score.
An advisory has been issued which should help the problem described in this
bug report. This report is therefore being closed with a resolution of ERRATA.
For more information on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report if the solution
does not work for you.

http://rhn.redhat.com/errata/RHSA-2009-1243.html