Created attachment 317787 [details]
page allocation failure

Description of problem:
There seems to be a leak in the NFS4 code. After 14 days of runtime, some NFS4-related modules have very high refcounts:

# lsmod
Module                  Size  Used by
des_generic            25216  52736
cbc                    12160  52736
rpcsec_gss_krb5        15632  26370

I also get a lot of 'page allocation failure. order:1, mode:0x4000' kernel messages.

Version-Release number of selected component (if applicable):
kernel-2.6.26.3-29.fc9.x86_64 (seen with 2.6.25 too)

How reproducible:
100%

Additional info:
Filesystems are mounted from two NFS4 servers with

| exec,nodev,nosuid,rw,sec=krb5i,posix,nointr
| noexec,nodev,nosuid,rw,sec=krb5i,posix,nointr

mount flags. User credentials expire overnight, so /var/log/messages is cluttered with thousands of

| WARNING: Failed to create krb5 context for user with uid 505 for server dunster....

messages.
Unmounting all NFS4 partitions, stopping the NFS4 server (which exported some dirs) and all the rpc.* daemons, and unloading the 'nfs' module reduces the refcounts only a little (they are still >50000). I am also unable to unload the 'nfsd' module, as it still has a refcount of 1.
Adding Steve D... This sounds somewhat familiar, but I don't recall any specifics. Steve, is this something you've seen?
While the refcount leaks are almost certainly a bug, I don't see any reason offhand why the page allocation should have failed in the log output you attached:

Sep 26 09:21:49 ensc-virt kernel: sshd: page allocation failure. order:1, mode:0x4000

...so we have an order=1 allocation, which means we need 2 contiguous pages:

Sep 26 09:21:49 ensc-virt kernel: Node 0 DMA: 9*4kB 13*8kB 5*16kB 17*32kB 10*64kB 7*128kB 2*256kB 2*512kB 0*1024kB 2*2048kB 0*4096kB = 7932kB
Sep 26 09:21:49 ensc-virt kernel: Node 0 DMA32: 21876*4kB 4*8kB 11*16kB 7*32kB 1*64kB 0*128kB 1*256kB 0*512kB 1*1024kB 0*2048kB 0*4096kB = 89280kB

...and the Mem-info output shows that we have a lot of memory free -- there are 13 8kB chunks that could easily be used here. This seems like a bug to me, but maybe I'm misreading it. It's quite possible that the memory allocation failures are contributing to the module refcount leaks, so I'll have a look at those codepaths.
I had someone better at VM issues sanity-check my thinking on the allocation failures here. His thought was that between the allocation failure and the Mem-info getting printed out, some pages were freed. So the page allocation errors may not point to a bug per se.

The refcounts pretty clearly do seem to be a bug, though. My guess is that we're occasionally bumping the module reference counts and then not putting those references in some failure codepath. The RPC gssapi code is pretty hairy stuff, though, so it's not immediately obvious to me. We'll probably need to come up with a reproducer. Ideally, we'd be able to get the client into the state where the refcounts start increasing and then figure out some action we can take that makes them predictably go higher.

Enrico, when your client does this, are there any particular activities that you can do on the filesystem that always make the module refcounts go higher? I.e. trying to read a file, listing a directory, etc.?
Actually, I have seen refcount problems with the GSSAPI code.... Question: when the user credentials expire, are they ever renewed? If so, how?
I will run a script which logs the refcounts over a day or two. Perhaps this will show a pattern when they begin to increase abnormally. Otherwise, I see refcounts increasing during normal filesystem operations (ls, quota), but these seem to go down after some time.

Credentials expire around 0200 - 0300 and get renewed around 0900 - 1000, either by logging in over ssh as root[1] and executing "su <user> -c 'kinit'", or through pam/xscreensaver.

[1] Logging in as a normal user usually does not work, as ssh blocks when it tries to access $HOME (public key auth or GSSAPI-based auth without credential forwarding is used, so no credentials are created).
Created attachment 319631 [details]
while sleep 5; do printf '%s: %s\n' "`date +'%Y%m%d%H%M%S'`" "`lsmod | grep des_generic | awk '{ print $3 }'`"; done

The refcounts increase constantly. The NFS server is the responsible party, not the client. There is a koji client (kernel 2.6.22.24) which periodically polls data from an NFS share. The share is mounted as

| ensc-virt.intern.sigma-chemnitz.de:/koji /srv/mnt/koji-data nfs4 rw,nosuid,nodev,noexec,vers=4,rsize=262144,wsize=262144,hard,intr,proto=tcp,timeo=600,retrans=3,sec=krb5i,addr=ensc-virt.intern.sigma-chemnitz.de 0 0

I will capture and submit network traffic when the refcount increases....
Created attachment 319633 [details]
tcpdump -U -s 9999 -w /var/tmp/cardiff.tr host cardiff

I saw an increase of the refcounter at 13:08:44 (reltime 00:27), which is around packet #494 in the dump.
Created attachment 320268 [details]
reduced tcpdump log

After disabling a program that polls the filesystem periodically, I could reduce the tcpdump log. There, every 60-65s some NFS activity happens which permanently increases the refcounter.
Looks like we have some problems with refcounting in general with rpc. I did a 'service nfs start' on my rawhide machine and got:

sunrpc                255056  14 nfsd,lockd,nfs_acl,auth_rpcgss

...then did a 'service nfs restart':

sunrpc                255056  12 nfsd,lockd,nfs_acl,auth_rpcgss

...and again:

sunrpc                255056  10 nfsd,lockd,nfs_acl,auth_rpcgss

...so we're losing 2 RPC module references every time we restart nfs. It looks like it's happening as a result of nfsd going up and down. I'm going to see whether we can fix that problem first, since it's easily reproducible, and then we can see whether it helps the problem you're seeing.
Tom Tucker posted a patch over the weekend that should fix the refcounting problems when starting and stopping nfsd. I think, however, that that problem is separate from the one originally reported here.
Applied Tom's svc_addsock patch and started testing with krb5i. I see similar growth in the refcounts. Each time I mount and unmount the share from the client, the server's refcounts for those 3 modules grow:

First mount and unmount:

Module                  Size  Used by
cbc                    12288  2
rpcsec_gss_krb5        19208  1
des_generic            25344  2

Second mount and unmount:

Module                  Size  Used by
cbc                    12288  4
rpcsec_gss_krb5        19208  2
des_generic            25344  4

Third mount and unmount:

Module                  Size  Used by
cbc                    12288  6
rpcsec_gss_krb5        19208  3
des_generic            25344  6

...looks like on each mount the server does some module_get's but never does the corresponding module_put's.
Created attachment 325574 [details]
rpc_debug output plus stack dump on each module_get
The two things of interest here are:

try_module_get(rpcsec_gss_krb5)
Pid: 3215, comm: rpc.svcgssd Not tainted 2.6.28-0.106.rc6.git4.fc11.x86_64 #1
Call Trace:
 [<ffffffffa0226e47>] try_module_get+0x3b/0x5b [auth_rpcgss]
 [<ffffffffa0226f27>] gss_mech_get_by_name+0x3e/0x71 [auth_rpcgss]
 [<ffffffffa0227b6b>] rsc_parse+0x1b0/0x284 [auth_rpcgss]
 [<ffffffffa01d80c1>] cache_write+0x9e/0xbe [sunrpc]
 [<ffffffffa01d8023>] ? cache_write+0x0/0xbe [sunrpc]
 [<ffffffff81118a66>] proc_reg_write+0x87/0xa4
 [<ffffffff810d6a1f>] vfs_write+0xab/0x105
 [<ffffffff810d6b3d>] sys_write+0x47/0x6f
 [<ffffffff8101116a>] system_call_fastpath+0x16/0x1b

__module_get(rpcsec_gss_krb5)
Pid: 3215, comm: rpc.svcgssd Not tainted 2.6.28-0.106.rc6.git4.fc11.x86_64 #1
Call Trace:
 [<ffffffffa0226f8b>] gss_mech_get+0x31/0x5f [auth_rpcgss]
 [<ffffffffa0227000>] gss_import_sec_context+0x47/0x67 [auth_rpcgss]
 [<ffffffffa0227ba7>] rsc_parse+0x1ec/0x284 [auth_rpcgss]
 [<ffffffffa01d80c1>] cache_write+0x9e/0xbe [sunrpc]
 [<ffffffffa01d8023>] ? cache_write+0x0/0xbe [sunrpc]
 [<ffffffff81118a66>] proc_reg_write+0x87/0xa4
 [<ffffffff810d6a1f>] vfs_write+0xab/0x105
 [<ffffffff810d6b3d>] sys_write+0x47/0x6f
 [<ffffffff8101116a>] system_call_fastpath+0x16/0x1b

...I'm guessing that one of these references is never being put.
I don't know whether this is related, but it certainly seems broken:

------------------[snip]----------------------
int
gss_import_sec_context(const void *input_token, size_t bufsize,
		       struct gss_api_mech *mech,
		       struct gss_ctx **ctx_id)
{
	if (!(*ctx_id = kzalloc(sizeof(**ctx_id), GFP_KERNEL)))
		return GSS_S_FAILURE;
	(*ctx_id)->mech_type = gss_mech_get(mech);
	return mech->gm_ops
		->gss_import_sec_context(input_token, bufsize, *ctx_id);
}
------------------[snip]----------------------

...so in one failure case here, we take a reference to the mech; in the other, we don't. The return value of ->gss_import_sec_context also doesn't seem to be consistent with the return value we use when the kzalloc fails: the ->gss_import_sec_context call seems to use "standard" errno return codes. Presumably it returns 0 on success, so it may not matter, but if you're not checking for GSS_S_FAILURE (and we don't), then you don't know whether to put the reference or not.
Seems to be OK (as far as I can see); e.g. rsc_parse() calls rsc_free() -> gss_delete_sec_context() -> gss_mech_put() based upon *ctx_id.
Yep, I noticed that after I posted the above... We also do this just before the rsc_free call there:

	gss_mech_put(gm);

...which should be putting the other reference. So from what I can tell, we should be exiting rsc_parse with 0 extra references taken. I'm tinkering with some printk debugging now to see if I can figure out why that's not working.
Ahh, I see. Some debug output:

gss_import_sec_context returned 0
rsci.mechctx=ffff8800100ca420
after rsc_update, rsci.mechctx=0000000000000000 rscp->mechctx=ffff8800100ca420

We do this before then:

	rscp = rsc_update(&rsci, rscp);

...that does a sunrpc_cache_update(), which has this in the comments:

/* The 'old' entry is to be replaced by 'new'.
 * If 'old' is not VALID, we update it directly,
 * otherwise we need to replace it
 */

...that calls the cache_detail->update method in several places, which for this cache is update_rsc(). That does this:

	new->mechctx = tmp->mechctx;
	tmp->mechctx = NULL;

...so I guess the idea here is that the "new" entry will take over the reference from the old entry (tmp). Not very straightforward to follow, unfortunately... Eventually we're going to want to call rsc_free (probably via rsc_put) on the new cache entry too, I guess. The part I'm not clear on is when that happens...
I'm starting to wonder whether the inflated reference counts here might be normal... It looks like each entry in the cache holds a reference. Eventually those cache entries should get flushed out, at which point their module references get released. Not a particularly great design, since you have to wait for the cache to be purged before you can unplug the module, but I'm not sure the implementation is broken.

I suppose what we need to do is set up a server and do a few mounts/unmounts to inflate the module refcounts, then let it sit idle for a while and see if we can flush the caches and make the refcounts go back to "normal".
Ok, after some upstream discussion, I think I understand the situation... The problem is that the userspace rpc.svcgssd is stuffing the cache with entries that never expire. Kevin Coffman posted a patchset last week to make these cache entries have a timeout that corresponds to the expiration of the ticket. So if you have a ticket that expires in 24h, the rsc cache entry should also expire at the same time. Not only is this making the module refcounts go up, it's also effectively a memory leak: over a long period the cache can grow quite large.

To verify this, I tested Kevin's patchset by setting up krb5.conf to get tickets with a short timeout. With that I was able to verify that the module refcounts went back down to 0 after the ticket timed out. Obviously, this means that you can't actually unplug the module until the cache times out. It's not an ideal situation, but it does seem like there isn't a kernel bug involved here; it's working as designed. Userspace arguably has a bug, however, so I'm moving this to an nfs-utils bug.

At some point in the future, it'll probably make more sense to change how the module refcounts are handled, probably by having a krb5-enabled export cause the kernel to hold a module reference, and purging the cache when the last one goes away (or on rmmod). It's hard to consider that high priority, however, since this does basically work and there's not much reason to unplug this module.

In any case, here's the upstream discussion on the userspace problem:

http://marc.info/?t=122819524700001&r=1&w=2

...and here is the patchset:

http://marc.info/?l=linux-nfs&m=122877893404903&w=2

I'm pretty sure SteveD will take this patchset or some variant, and at that point this will be fixed upstream and eventually in Fedora. I'll leave this open for now to keep an eye on progress.
There is one other piece of info... Recent clients are supposed to send an RPC on unmount that tells the server to destroy the GSS context, which should make the cache entry go away. This does not seem to be working, however, and the reason for it hasn't yet been determined.
Created attachment 326989 [details]
rpcsec_gss_krb5 refcounter

The patchset fixes the refcount problem. Unfortunately, the "page allocation failure" messages still occur :(
The page allocation failures sound like a different problem entirely (though the fact that the gss context cache never shrinks probably exacerbates it). I'd suggest opening a different BZ for that, since it looks like a kernel problem.
This message is a reminder that Fedora 9 is nearing its end of life. Approximately 30 (thirty) days from now Fedora will stop maintaining and issuing updates for Fedora 9. It is Fedora's policy to close all bug reports from releases that are no longer maintained. At that time this bug will be closed as WONTFIX if it remains open with a Fedora 'version' of '9'.

Package Maintainer: If you wish for this bug to remain open because you plan to fix it in a currently maintained version, simply change the 'version' to a later Fedora version prior to Fedora 9's end of life.

Bug Reporter: Thank you for reporting this issue, and we are sorry that we may not be able to fix it before Fedora 9 is end of life. If you would still like to see this bug fixed and are able to reproduce it against a later version of Fedora, please change the 'version' of this bug to the applicable version. If you are unable to change the version, please add a comment here and someone will do it for you.

Although we aim to fix as many bugs as possible during every release's lifetime, sometimes those efforts are overtaken by events. Often a more recent Fedora release includes newer upstream software that fixes bugs or makes them obsolete.

The process we are following is described here:
http://fedoraproject.org/wiki/BugZappers/HouseKeeping
Since we have not seen this for a while, I'm going to close out this bug. Please feel free to reopen it if the problem comes back....