Bug 464123

Summary: module reference counting problem with rpc_auth_gss
Product: [Fedora] Fedora Reporter: Enrico Scholz <rh-bugzilla>
Component: nfs-utilsAssignee: Steve Dickson <steved>
Status: CLOSED CURRENTRELEASE QA Contact: Fedora Extras Quality Assurance <extras-qa>
Severity: medium Docs Contact:
Priority: medium    
Version: 9CC: jlayton, kernel-maint, staubach, steved
Target Milestone: ---   
Target Release: ---   
Hardware: All   
OS: Linux   
Whiteboard:
Fixed In Version: Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2009-06-11 09:42:37 EDT Type: ---
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Attachments:
Description Flags
page allocation failure
none
while sleep 5; do printf '%s: %s\n' "`date +'%Y%m%d%H%M%S'`" "`lsmod | grep des_generic | awk '{ print $3 }'`"; done
none
tcpdump -U -s 9999 -w /var/tmp/cardiff.tr host cardiff
none
reduced tcpdump log
none
rpc_debug output plus stack dump on each module_get
none
rpcsec_gss_krb5 refcounter none

Description Enrico Scholz 2008-09-26 08:33:14 EDT
Created attachment 317787 [details]
page allocation failure

Description of problem:

There seems to be a leak in the NFS4 code. After 14 days runtime, some
NFS4 related modules have very high refcounters:

# lsmod
Module                  Size  Used by
des_generic            25216  52736 
cbc                    12160  52736 
rpcsec_gss_krb5        15632  26370 

I get lot of 'page allocation failure. order:1, mode:0x4000' kernel
messages too.


Version-Release number of selected component (if applicable):

kernel-2.6.26.3-29.fc9.x86_64
(seen with 2.6.25 too)

How reproducible:

100%


Additional info:

There are mounted filesystems from two NFS4 servers with

| exec,nodev,nosuid,rw,sec=krb5i,posix,nointr
| noexec,nodev,nosuid,rw,sec=krb5i,posix,nointr

mount flags.  User credentials expire over night so that /var/log/messages
is cluttered up by thousands of

| WARNING: Failed to create krb5 context for user with uid 505 for server dunster....

messages.
Comment 1 Enrico Scholz 2008-09-26 11:47:25 EDT
umounting all NFS4 partitions, stopping the NFS4 server (which exported some dirs) and all the rpc.* daemons, and unloading the 'nfs' module reduces refcounts only a little bit (they are still >50000).  Nevertheless, I am unable to unload the 'nfsd' module as it has still a refcount of 1.
Comment 2 Jeff Layton 2008-09-30 07:11:18 EDT
Adding Steve D...

This sounds somewhat familiar, but I don't recall any specifics. Steve, is this something you've seen?
Comment 3 Jeff Layton 2008-09-30 07:30:22 EDT
While the refcount leaks are almost certainly a bug, I don't see any reason right offhand why the page allocation should have failed in the log output you attached:

Sep 26 09:21:49 ensc-virt kernel: sshd: page allocation failure. order:1, mode:0x4000

...so we have an order=1 allocation, which means we need 2 contiguous pages:

Sep 26 09:21:49 ensc-virt kernel: Node 0 DMA: 9*4kB 13*8kB 5*16kB 17*32kB 10*64kB 7*128kB 2*256kB 2*512kB 0*1024kB 2*2048kB 0*4096kB = 7932kB
Sep 26 09:21:49 ensc-virt kernel: Node 0 DMA32: 21876*4kB 4*8kB 11*16kB 7*32kB 1*64kB 0*128kB 1*256kB 0*512kB 1*1024kB 0*2048kB 0*4096kB = 89280kB

...and the Mem-info output shows that we have a lot of memory free -- there are 13 8k chunks that could be easily be used here. This seems like a bug to me, but maybe I'm misreading it.

It's quite possible that the memory allocation failures are contributing to the module refcount leaks, so I'll have a look at those codepaths.
Comment 4 Jeff Layton 2008-09-30 10:41:28 EDT
I had someone better at VM issues sanity check my thinking on the allocation failures here. His thoughts were that between the allocation failure and the mem-info getting printed out, some pages were freed. So the page allocation errors may not point to a bug per-se.

The refcounts pretty clearly do seem to be a bug though. My guess is that we're occasionally bumping the module reference counts and then not putting those references in some failure codepath. The RPC gssapi code is pretty hairy stuff though, so it's not immediately obvious to me. We'll probably need to come up with a reproducer.

Ideally, we'd be able to get the client into the state where the refcounts start increasing and then figure out some action we can take that makes them predictably go higher. Enrico, when your client does this, are there any particular activities that you can do on the filesystem that always make the module refcounts go higher? i.e. trying to read a file, listing a directory, etc?
Comment 5 Steve Dickson 2008-09-30 11:24:57 EDT
actually I have seen refcount problems with the GSSAPI code....

Question: when the User credentials expire, are they every renewed? If so how?
Comment 6 Enrico Scholz 2008-09-30 13:06:06 EDT
I will run a script which logs number of refcounts over a day or two.  Perhaps this will show a pattern when they begin to increase abnormally.  Else, I see increasing refcounters by normal filesystem operations (ls, quota) but these seem to go down after some time.

Credentials expire around 0200 - 0300 and get renewed around 0900 - 1000 either by logging in over ssh as root[1] and executing "su <user> -c 'kinit'", or through pam/xscreensaver.


[1]  Logging in as normal user usually does not work as ssh blocks when it tries to access $HOME  (there is used public key auth or GSSAPI based auth without forwarding, so that credentials are not created).
Comment 7 Enrico Scholz 2008-10-07 06:56:50 EDT
Created attachment 319631 [details]
while sleep 5; do printf '%s: %s\n' "`date +'%Y%m%d%H%M%S'`" "`lsmod | grep des_generic | awk '{ print $3 }'`"; done

refcounts increase constantly.

The NFS server is the responsible party; not the client.  There is a
koji client (kernel 2.6.22.24) which polls periodically data from an
NFS share.  The share is mounted as

| ensc-virt.intern.sigma-chemnitz.de:/koji /srv/mnt/koji-data nfs4 rw,nosuid,nodev,noexec,vers=4,rsize=262144,wsize=262144,hard,intr,proto=tcp,timeo=600,retrans=3,sec=krb5i,addr=ensc-virt.intern.sigma-chemnitz.de 0 0


I will capture and submit network traffic when refcount increases....
Comment 8 Enrico Scholz 2008-10-07 07:14:30 EDT
Created attachment 319633 [details]
tcpdump -U -s 9999 -w /var/tmp/cardiff.tr host cardiff

I saw an increase of the refcounter at 13:08:44 (reltime 00:27) which is around packet #494 in the dump.
Comment 9 Enrico Scholz 2008-10-14 05:30:29 EDT
Created attachment 320268 [details]
reduced tcpdump log

after disabling a program polling filesystem periodically, I could reduce the tcpdump log.  There, every 60-65s some NFS activitity happens which increases refcounter permanently.
Comment 10 Jeff Layton 2008-10-16 11:14:05 EDT
Looks like we have some problems with refcounting in general with rpc. I did a service nfs start on my rawhide machine and got:

sunrpc                255056  14 nfsd,lockd,nfs_acl,auth_rpcgss

...then did a service nfs restart:

sunrpc                255056  12 nfsd,lockd,nfs_acl,auth_rpcgss

...and again:

sunrpc                255056  10 nfsd,lockd,nfs_acl,auth_rpcgss


...so we're losing 2 RPC module references every time we restart nfs. It looks like it's happening as a result of nfsd going up and down. I'm going to see whether we can fix that problem first since it's easily reproducible and then we can see whether it helps the problem you're seeing.
Comment 11 Jeff Layton 2008-11-24 06:27:12 EST
Tom Tucker posted a patch over the weekend that should fix the refcounting problems when starting and stopping nfsd. I think, however that that problem is separate from the one originally reported here.
Comment 12 Jeff Layton 2008-12-03 08:23:10 EST
Applied Tom's svc_addsock patch and started testing with krb5i. I see similar growth in the refcounts. Each time I mount and unmount the share from the client, the server's refcounts for those 3 modules grows:

First mount and unmount:
Module                  Size  Used by
cbc                    12288  2 
rpcsec_gss_krb5        19208  1 
des_generic            25344  2 


Second mount and unmount:
Module                  Size  Used by
cbc                    12288  4 
rpcsec_gss_krb5        19208  2 
des_generic            25344  4 

Third mount and unmount:
Module                  Size  Used by
cbc                    12288  6 
rpcsec_gss_krb5        19208  3 
des_generic            25344  6 

...looks like on each mount the server does some module_get's but never does module_put's for them.
Comment 13 Jeff Layton 2008-12-03 13:51:16 EST
Created attachment 325574 [details]
rpc_debug output plus stack dump on each module_get
Comment 14 Jeff Layton 2008-12-03 14:33:38 EST
The two things of interest here are:

try_module_get(rpcsec_gss_krb5)
Pid: 3215, comm: rpc.svcgssd Not tainted 2.6.28-0.106.rc6.git4.fc11.x86_64 #1
Call Trace:
 [<ffffffffa0226e47>] try_module_get+0x3b/0x5b [auth_rpcgss]
 [<ffffffffa0226f27>] gss_mech_get_by_name+0x3e/0x71 [auth_rpcgss]
 [<ffffffffa0227b6b>] rsc_parse+0x1b0/0x284 [auth_rpcgss]
 [<ffffffffa01d80c1>] cache_write+0x9e/0xbe [sunrpc]
 [<ffffffffa01d8023>] ? cache_write+0x0/0xbe [sunrpc]
 [<ffffffff81118a66>] proc_reg_write+0x87/0xa4
 [<ffffffff810d6a1f>] vfs_write+0xab/0x105
 [<ffffffff810d6b3d>] sys_write+0x47/0x6f
 [<ffffffff8101116a>] system_call_fastpath+0x16/0x1b

__module_get(rpcsec_gss_krb5)
Pid: 3215, comm: rpc.svcgssd Not tainted 2.6.28-0.106.rc6.git4.fc11.x86_64 #1
Call Trace:
 [<ffffffffa0226f8b>] gss_mech_get+0x31/0x5f [auth_rpcgss]
 [<ffffffffa0227000>] gss_import_sec_context+0x47/0x67 [auth_rpcgss]
 [<ffffffffa0227ba7>] rsc_parse+0x1ec/0x284 [auth_rpcgss]
 [<ffffffffa01d80c1>] cache_write+0x9e/0xbe [sunrpc]
 [<ffffffffa01d8023>] ? cache_write+0x0/0xbe [sunrpc]
 [<ffffffff81118a66>] proc_reg_write+0x87/0xa4
 [<ffffffff810d6a1f>] vfs_write+0xab/0x105
 [<ffffffff810d6b3d>] sys_write+0x47/0x6f
 [<ffffffff8101116a>] system_call_fastpath+0x16/0x1b

...I'm guessing that one of these references is never being put.
Comment 15 Jeff Layton 2008-12-03 15:23:45 EST
I don't know whether this is related, but it certainly seems broken:

------------------[snip]----------------------
int
gss_import_sec_context(const void *input_token, size_t bufsize,
                       struct gss_api_mech      *mech,
                       struct gss_ctx           **ctx_id)
{
        if (!(*ctx_id = kzalloc(sizeof(**ctx_id), GFP_KERNEL)))
                return GSS_S_FAILURE;
        (*ctx_id)->mech_type = gss_mech_get(mech);

        return mech->gm_ops
                ->gss_import_sec_context(input_token, bufsize, *ctx_id);
}

------------------[snip]----------------------

...so in one failure case here, we take a reference to the mech. The other, we don't. The return from the ->gss_import_sec_context also doesn't seem to be consistent with the return val we use when the kzalloc fails. The ->gss_import_sec_context call seems to use "standard" errno return codes.

Presumably it returns 0 on success, so it may not matter, but if you're not checking for GSS_S_FAILURE (and we don't) then you don't know whether to put the reference or not.
Comment 16 Enrico Scholz 2008-12-03 15:51:58 EST
Seems to be ok (afais); e.g. rsc_parse() calls

  rsc_free() -> gss_delete_sec_context() -> gss_mech_put()

based upon *ctx_id.
Comment 17 Jeff Layton 2008-12-03 16:05:39 EST
Yep, I noticed that after i posted the above...

We also do this just before the rsc_free call there:

gss_mech_put(gm);

...which should be putting the other reference. So from what I can tell, we should be exiting rsc_parse with 0 extra references taken. I'm tinkering with some printk debugging now to see if I can figure out why that's not working.
Comment 18 Jeff Layton 2008-12-03 17:32:07 EST
Ahh I see. Some debug output:

gss_import_sec_context returned 0 rsci.mechctx=ffff8800100ca420
after rsc_update, rsci.mechctx=0000000000000000 rscp->mechctx=ffff8800100ca420

We do this before then:

rscp = rsc_update(&rsci, rscp);

...that does a sunrpc_cache_update(), which has this in the comments:

        /* The 'old' entry is to be replaced by 'new'.
         * If 'old' is not VALID, we update it directly,
         * otherwise we need to replace it
         */

...that calls the cache_detail->update method in several places, which for this cache is update_rsc(). That does this:

        new->mechctx = tmp->mechctx;
        tmp->mechctx = NULL;

...so I guess the idea here is that the "new" entry will take over the reference from the old entry (tmp). Not very straightforward to follow, unfortunately...

Eventually, we're going to want to call rsc_free (probably via rsc_put) on the new cache entry too, I guess. The part I'm not clear on is when that happens...
Comment 19 Jeff Layton 2008-12-08 10:02:28 EST
I'm starting to wonder whether the inflated reference counts here might be normal...

It looks like each entry in the cache will hold a reference. Eventually those cache entries should get flushed out, at which point their module references get released. Not a particularly great design since you'd have to wait for the cache to be purged before you can unplug the module. I'm not sure that the implementation is broken, however.

I suppose what we need to do is to set up a server and do a few mounts/unmounts to inflate the module refcounts. Then, let it sit idle for a while and see if we can flush the caches and make the refcounts go back to "normal".
Comment 20 Jeff Layton 2008-12-10 11:42:14 EST
Ok, after some upstream discussion, I think I understand the situation...

The problem is that the userspace rpc.svcgssd is stuffing the cache with entries that never expire. Kevin Coffman posted a patchset last week to make these cache entries have a timeout that corresponds to the expiration of the ticket. So if you have a ticket that expires in 24h, the rsc cache entry should also expire at the same time.

Not only is this making the module refcounts go up, it's also effectively a memory leak. Over a long period the cache can grow quite large.

To verify this, I tested Kevin's patchset by setting up krb5.conf to get tickets with a short timeout. With that I was able to verify that the module refcounts went back down to 0 after the ticket timed out.

Obviously, this means that you can't actually unplug the module until the cache times out. It's not an ideal situation, but it does seem like there isn't a kernel bug involved here. It's working as designed. Userspace arguably has a bug, however, so I'm moving this to a nfs-utils bug.

At some point in the future, it'll probably make more sense to change how the module refcounts are handled, probably by making it so that having a krb5-enabled export causes the kernel to hold a module reference, and to purge the cache when the last one goes away (or on rmmod).

It's hard to consider that high priority however since this does basically work and there's not much reason to unplug this module.

In any case, here's the upstream discussion on the userspace problem:

http://marc.info/?t=122819524700001&r=1&w=2

...and here is the patchset:

http://marc.info/?l=linux-nfs&m=122877893404903&w=2

I'm pretty sure SteveD will take this patchset or some variant and at that point, this will be fixed upstream and eventually in Fedora.

I'll leave this open for now to keep an eye on progress with that.
Comment 21 Jeff Layton 2008-12-10 11:51:30 EST
There is one other piece of info...

Supposedly recent clients are supposed to send a RPC on unmount that tells the server to destroy the GSS context. That should make the cache entry go away. This does not seem to be working however and the reason for it hasn't yet been determined.
Comment 22 Enrico Scholz 2008-12-15 11:04:38 EST
Created attachment 326989 [details]
rpcsec_gss_krb5 refcounter

The patchset fixes the refcount problem. Unfortunately, the "page allocation failure" still occur :(
Comment 23 Jeff Layton 2008-12-15 11:50:53 EST
The page allocation failures sound like a different problem entirely (though the fact that the gss context cache never shrinks probably exacerbates the problem). I'd suggest opening a different BZ for that since it looks like a kernel problem.
Comment 24 Bug Zapper 2009-06-09 22:49:04 EDT
This message is a reminder that Fedora 9 is nearing its end of life.
Approximately 30 (thirty) days from now Fedora will stop maintaining
and issuing updates for Fedora 9.  It is Fedora's policy to close all
bug reports from releases that are no longer maintained.  At that time
this bug will be closed as WONTFIX if it remains open with a Fedora 
'version' of '9'.

Package Maintainer: If you wish for this bug to remain open because you
plan to fix it in a currently maintained version, simply change the 'version' 
to a later Fedora version prior to Fedora 9's end of life.

Bug Reporter: Thank you for reporting this issue and we are sorry that 
we may not be able to fix it before Fedora 9 is end of life.  If you 
would still like to see this bug fixed and are able to reproduce it 
against a later version of Fedora please change the 'version' of this 
bug to the applicable version.  If you are unable to change the version, 
please add a comment here and someone will do it for you.

Although we aim to fix as many bugs as possible during every release's 
lifetime, sometimes those efforts are overtaken by events.  Often a 
more recent Fedora release includes newer upstream software that fixes 
bugs or makes them obsolete.

The process we are following is described here: 
http://fedoraproject.org/wiki/BugZappers/HouseKeeping
Comment 25 Steve Dickson 2009-06-11 09:42:37 EDT
Since we have not seen this for a while, I'm going to close
out this bug. Please feel free reopen it if the problem 
comes back....