Bug 448603

Summary: holding files under /proc/net open no longer adds to module refcount
Product: Red Hat Enterprise Linux 4 Reporter: Jeff Layton <jlayton>
Component: kernelAssignee: Jeff Layton <jlayton>
Status: CLOSED ERRATA QA Contact: Martin Jenner <mjenner>
Severity: high Docs Contact:
Priority: high    
Version: 4.7CC: nhorman, riek, staubach, steved
Target Milestone: rcKeywords: Regression
Target Release: ---   
Hardware: All   
OS: Linux   
Whiteboard:
Fixed In Version: Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2009-05-18 19:03:15 UTC Type: ---
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 461297    
Attachments:
Description Flags
proposed patch -- keep a counter in proc_inode to make sure we don't do more module_put()'s than gets
none
updated patch
none
patch -- fix off-by-one in module_ref counts
none
patch to make proc file creation / module ownership atomic for any module that wants it
none
new refcnt patch
none
new patch -- use spinlock to manage refcounts
none
fixed patch -- use spinlock to serialize module get refcount
none
patch -- go back to using atomics none

Description Jeff Layton 2008-05-27 19:37:00 UTC
Ran across this while testing a different patch. Do the following steps on a
machine configured as a nfs server:

# service nfs stop
# modprobe -r nfsd
# service nfs start

...you'll get the following messages and a stack trace like this:

nfsd: last server has exited
nfsd: unexporting all filesystems
rpciod: active tasks at shutdown?!
RPC: failed to contact portmap (errno -5).
nfsd: last server has exited
nfsd: unexporting all filesystems
rpciod: active tasks at shutdown?!
RPC: error 5 connecting to server localhost
RPC: failed to contact portmap (errno -5).
remove_proc_entry: nfs4.idtoname/channel busy, count=1
remove_proc_entry: rpc/nfs4.idtoname busy, count=1
remove_proc_entry: nfs4.nametoid/channel busy, count=1
remove_proc_entry: rpc/nfs4.nametoid busy, count=1
Unable to handle kernel NULL pointer dereference at 0000000000000008 RIP: 
<ffffffffa00c7cdb>{:sunrpc:cache_open+111}
PML4 1caab067 PGD 1c785067 PMD 0 
Oops: 0002 [1] SMP 
CPU 0 
Modules linked in: nfsd exportfs lockd nfs_acl md5 ipv6 parport_pc lp parport
i2c_dev i2c_core rpcsec_gss_krb5 auth_rpcgss des sunrpc iptable_filter ip_tables
ds yenta_socket pcmcia_core button battery ac uhci_hcd 8139cp mii floppy
dm_snapshot dm_zero dm_mirror ext3 jbd dm_mod
Pid: 1986, comm: rpc.idmapd Not tainted 2.6.9-70.ELsmp
RIP: 0010:[<ffffffffa00c7cdb>] <ffffffffa00c7cdb>{:sunrpc:cache_open+111}
RSP: 0018:000001001c2b5e48  EFLAGS: 00010246
RAX: 0000000000000000 RBX: ffffffffa01c1520 RCX: 000001001a6a69c0
RDX: ffffffffa01c1590 RSI: 0000000000000202 RDI: ffffffffa00e04d8
RBP: 000001001d3c4ee0 R08: 000001001c2b5d48 R09: 0000000000000000
R10: 0000000000000048 R11: 0000000000000048 R12: 000001001a6a69c0
R13: 000001001e35fc00 R14: 0000000000000000 R15: 000001001b5ffe58
FS:  0000002a95579560(0000) GS:ffffffff8050c480(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 0000000000000008 CR3: 0000000000101000 CR4: 00000000000006e0
Process rpc.idmapd (pid: 1986, threadinfo 000001001c2b4000, task 000001001e19d030)
Stack: 000001001a6a69c0 000001001a6a69c0 000001001b5f0760 ffffffff8017af09 
       0000000000000000 0000000000000000 000001001a6a69c0 0000000000008002 
       0000000000000000 000001001b651000 
Call Trace:<ffffffff8017af09>{__dentry_open+208} <ffffffff8017b0e2>{filp_open+95} 
       <ffffffff80136084>{autoremove_wake_function+0}
<ffffffff80144eb7>{do_sigaction+553} 
       <ffffffff801f19c5>{strncpy_from_user+74} <ffffffff8017b2d1>{sys_open+57} 
       <ffffffff801102b6>{system_call+126} 

Code: 48 89 68 08 48 89 45 00 48 89 55 08 48 89 6b 70 e8 f5 0e 25 
RIP <ffffffffa00c7cdb>{:sunrpc:cache_open+111} RSP <000001001c2b5e48>
CR2: 0000000000000008
 <0>Kernel panic - not syncing: Oops
 
...I've been able to reproduce on -70.ELsmp, but not on -42.0.10.ELsmp. On
-42.0.10, I'm unable to unplug nfsd, so it seems like something has changed wrt
to module refcounting.

I'm not sure when the problem was introduced, but I'll try a bisect search to
see if I can track it down...

Comment 1 Jeff Layton 2008-05-27 20:09:59 UTC
cc'ing Neil since it looks like it might be related to his patch...

Problem introduced in -68.28.EL. Up until that kernel, a running rpc.idmapd
would prevent nfsd.ko from being unplugged.

This patch seems like the most likely candidate:

-Fix race condition in proc file reading that leads to module refcnt imbalance
(Neil Horman) [280431]

Not a particularly serious regression -- unplugging modules generally falls into
the "don't do that category", but it might be indicative of bigger problems.

...Neil, thoughts?

Comment 2 Jeff Layton 2008-05-27 20:20:19 UTC
...on -68.27EL kernel, with rpc.idmapd up and running:

[root@dhcp231-224 ~]# lsmod | head
Module                  Size  Used by
nfsd                  270945  4 

...lsof info:

rpc.idmap 2575 root    0u   CHR                1,3               1295 /dev/null
rpc.idmap 2575 root    1u   CHR                1,3               1295 /dev/null
rpc.idmap 2575 root    2u   CHR                1,3               1295 /dev/null
rpc.idmap 2575 root    3r  0000                0,8       0       5923 eventpoll
rpc.idmap 2575 root    5r   DIR                0,3       0  168755209 /proc/2575/fd
rpc.idmap 2575 root    6u  unix 0x000001001f08ac40               5924 socket
rpc.idmap 2575 root    7u  unix 0x000001001f08a940               5925 socket
rpc.idmap 2575 root    8u   REG                0,3       0 4026532584
/proc/net/rpc/nfs4.nametoid/channel
rpc.idmap 2575 root    9u   REG                0,3       0 4026532580
/proc/net/rpc/nfs4.idtoname/channel
rpc.idmap 2575 root   10r   DIR               0,18       0          4
/var/lib/nfs/rpc_pipefs/nfs

...if I stop rpc.idmapd, the refcount goes to 0.


Comment 3 Jeff Layton 2008-05-27 20:42:33 UTC
Chatted with Neil on IRC about this...the patch for bug 280431 changed it so
that module refcounts were moved from the lookup/delete codepath to read/write
codepath. Simply holding open a file doesn't get you a module reference now.
That explains the behavior change.

This is probably not good. I don't think this particular situation (removing
nfsd module and stopping and starting nfs) is that big of a problem, but this
could cause problems with other programs and kernel modules.



Comment 4 Neil Horman 2008-05-28 01:11:13 UTC
No, its not good, but the alternative isn't any better.  If we do reference
counting in the open and close paths, we are exposed to the possibility that we
open a file prior to its module owner pointer getting set, which results in an
unbalanced reference count, and a module that can never be removed.

I would say that we could probably extend the module structure with a lock and
some flags to identify those modules that are going to register owners to close
the race, but I don't see any way to do that.  We can create locks in the
wrapper functions that I wrote to close the race for those modules which will
set an owner eventually, but that doesn't protect the IIRC many modules that
dont do that in RHEL4.  They'll still be exposed to underflow, and thats just as
unacceptible.

I think the best thing to do is create an rwlock between the sys_module_init
routine and the proc_lookup routine.  If we serialize beweeen all proc reads and
module loads, we can do reference counting on open/close and avoid the race
condition.

Comment 6 RHEL Program Management 2008-05-28 08:11:42 UTC
This bugzilla has Keywords: Regression.  

Since no regressions are allowed between releases, 
it is also being proposed as a blocker for this release.  

Please resolve ASAP.

Comment 7 Jeff Layton 2008-05-28 08:33:26 UTC
Yes. This won't be pretty to fix...

We really need for the refcount to go up when we open a proc file. The problem
is that we need to make sure that nothing tries to increase the refcount before
"owner" is set, or we open ourselves to the original race. There is no common
entry point for the create + set owner, so we potentially will have to wrap
locks around many callsites of create_proc_entry (at least any where we care
about protecting against the original race). Hooking into sys_module_init might
work, though that will just prevent this race for /proc files created on module
load. Anything created in a non-init routine won't be protected.

Also, since create_proc_entry does a kmalloc, I think we'll have to use a rw
semaphore since that can potentially sleep.


Comment 8 Jeff Layton 2008-06-12 13:49:38 UTC
Here's another possibility:

1) back out the patch for 280431

2) check to see if owner is NULL before calling try_module_get in
proc_get_inode. If it is, then don't even do a try_module_get. If it's not NULL,
then do try_module_get. If it's successful, then increment an atomic counter
that we store in PROC_I().

3) in proc_delete_inode, decrement the counter and if it's not 0 then do a
module_put().

...doesn't sound too bad at first glance. The key is to track whether we got a
reference or not. We also assume that it's not possible for the owner to be
non-NULL when we enter proc_get_inode, and then later to become NULL. It seems
unlikely, but there's not much we can do about it.

Neil, thoughts?



Comment 9 Jeff Layton 2008-06-12 18:01:56 UTC
Created attachment 309114 [details]
proposed patch -- keep a counter in proc_inode to make sure we don't do more module_put()'s than gets

This is a quick pass at what I was thinking about in my last comment. Untested,
but it does compile.

Comment 10 Jeff Layton 2008-06-13 18:14:08 UTC
Patch seems to fix the problem, and I think it'll prevent the race that the
original patch was intended to fix. I still need to vet it for kABI though...


Comment 11 Jeff Layton 2008-06-13 20:11:20 UTC
Created attachment 309292 [details]
updated patch

Original patch seems to have been wrong. We should be doing a module_put() if
the atomic_dec_and_test() returns false, not true...

That said, that patch seems to actually work correctly as far as I can tell. I
don't see any module refcount leaks after testing it various ways, and I'm not
sure why that is...

Comment 12 Jeff Layton 2008-06-18 18:46:23 UTC
Created attachment 309776 [details]
patch -- fix off-by-one in module_ref counts

The last patch was not correct. It made the module refcount continually
increase. The problem is that we were bumping the module refcount when
module_ref went from 0 to 1, but were not decrementing it when we went from 1
to 0. This patch should do the right thing.

That said, it's probably somewhat fragile. It seems like we can end up with
subtle races that might throw the refcounts off. We might be better suited to
abandon the simple atomic counter here and just use a regular variable with a
spinlock. With that we could throw in some BUG_ON's if the refcounts get out of
whack.

Or maybe I'm just being paranoid and this is OK. I'll stuff this patch into my
test kernels for now and we can see how it does...

Comment 13 Jeff Layton 2008-06-21 11:54:41 UTC
Hmm...I don't think this patch is quite what we want after all. Consider:

Start a module_refs of 0

proc_get_inode is called, and owner is null... no module ref taken and
module_refs stays at 0

proc_delete_inode is called and owner is now non-NULL... module_refs is now -1.
no module_put is done

proc_get_inode is called again and owner is non-NULL. we take a module reference
and bump module_refs to 0.

proc_delete_inode is called again and module_refs moves to -1. Module reference
isn't dropped.

...I'm not sure that this is really that big a problem though. Will the inode
survive past a proc_delete_inode to keep a negative refcount? It seems like the
strange nature of procfiles (where they're sort of instantiated per-task) might
mean that we don't need this complex refcounting, but can just get away with a
flag that indicates whether a reference was taken or not.


Comment 15 Neil Horman 2008-06-24 13:22:58 UTC
Created attachment 310139 [details]
patch to make proc file creation / module ownership atomic for any module that wants it

So hows this for an idea:

1) We register a notifier in proc for module initalization

2) We add a state for notification MODULE_STATE_FAILED

3) We send out the MODULE_STATE_COMMING notification prior to mod->init being
called in sys_init_module, and MODULE_STATE_LIVE notification after (or
MODULE_STATE_FAILED if something went wrong).

4) The notifier we registered in proc catches these and locks a mutex based on
which notification it is

5) We restore the try_module_get and module_put calls in proc_get_inode and
proc_delete_inode.  WE surround the former in the locking of the notifiers
mutex.

This has the affect of making all of module initalization atomic with respect
to the checking of de->owner in proc_get_inode, which should solve our race 
for any module that wants to set owner.

I'm a bit worried about deadlock form multiple module loads at the same time,
but i think we can solve that with a counting semaphore or a counter and
completion struct.  this isnt tested yet, but it sould build/run without much
work

Comment 16 Neil Horman 2008-06-24 13:42:56 UTC
Created attachment 310142 [details]
new refcnt patch

In fact, I like this variant better.  Its the same patch, but it adds an atomic
counter to the notifier function, this should allow multiple module loads to
work with this patch properly.	Let me know what you think.

Comment 17 RHEL Program Management 2008-09-03 12:52:38 UTC
Updating PM score.

Comment 18 Jeff Layton 2008-09-16 20:03:40 UTC
Created attachment 316894 [details]
new patch -- use spinlock to manage refcounts

After looking at Neil's approach, I'm a little leery of it. Serializing all procfile opens behind module_init's seems a bit scary to me. While I'm not aware of anything that would cause that to deadlock, it's sometimes hard to predict what 3rd party modules will do...

This patch might be a reasonable approach (presuming I haven't horribly broken kABI with it). It's basically the same as my earlier attempts, but it adds a counter and a spinlock to the proc_inode struct. The spinlock just protects the counter. When the counter reaches 0, we don't attempt to do any more module_put's, even if owner is set. This should prevent us from doing more puts than gets. I tried to do this earlier with atomics rather than a lock, but I think it turns out that we do need a lock for this.

Comment 19 Neil Horman 2008-09-16 20:32:06 UTC
I think it seems fine, as long as it doesn't rub the abi checker the wrong way. I would like to see how my patch behaves though, since it relegates serialization to only those times whenmodules are loading and uloading (although as we've discussed, serialization there is fraught with danger itself :) )

Comment 20 Jeff Layton 2008-09-17 12:29:43 UTC
Created attachment 316952 [details]
fixed patch -- use spinlock to serialize module get refcount

New patch, fix some broken variable names.

This uses a spinlock to serialize access to the new mod_ref_count field in proc_inode. The upshot of this patch is that we simply don't allow proc_delete_inode to do more module_put's than module_get's that were done in proc_get_inode.

It's not an ideal fix, but in conjunction with Neil's changes to the read/write codepaths this should cover module references for most of these cases.

Now to see about kABI...

Comment 21 Jeff Layton 2008-09-17 12:41:29 UTC
The patch doesn't seem to change any exported symbols (assuming I'm checking this correctly), so I think we could go with the patch in comment #20. Neil, do you have any concerns or thoughts with this approach?

Comment 22 Neil Horman 2008-09-17 13:34:33 UTC
No, I think its good.  I'm a bit concerned about performance if you have lots of heavy readers on a particular proc file, but I doubt it will be a big deal.  I would suggest that you use an atomic_t as your reference counter however, rather than a combination spinlock and integer.  You can use atomic_inc and atomic_dec_and_test to do your operations then, and save some space in the inode and in your code.

Comment 23 Jeff Layton 2008-09-17 14:01:23 UTC
I'd prefer to use atomic_t, but I don't think it's suitable for this unfortunately. In the proc_delete_inode path, we need to decrement the counter if and only if it's not already 0.

If we check it and then do the the decrement, then there's a potential race between the two. If we instead do a dec_and_test and then try to increment it again if it falls below 0, then we also have a race between those two operations.

I don't see a way to do this with atomics, so I think I'm stuck with a spinlock here. I'm open to suggestions though if you see a way...

I'm not terribly worried about performance here though. This should only be getting hit on the open/close. I'd hope that heavy readers would just hold the file open.

Comment 24 Jeff Layton 2008-09-25 19:40:45 UTC
Created attachment 317720 [details]
patch -- go back to using atomics

Here's a possible patch that should be more efficient. This patch has the refcount be done as an atomic negative value (starting at -1). Taking a reference to the module does an atomic_dec on the count. Putting a reference does this:

+		if (de->owner) {
+			if (atomic_add_negative(1, &ei->mod_refs))
+				module_put(de->owner);
+			else
+				atomic_dec(&ei->mod_refs);
+		}

...so as long as the count is in negative values, we'll do a module_put. If not, then we'll skip it and put the refcount back where it was.

While this seems to work fine in cursory testing, I'm not convinced that there isn't a race lurking between the two atomic operations once we do the -1 to 0 transition on the atomic_add_negative() call. I don't seem to be able to identify any races here, so this might be just fine, but I think it needs careful review.

Neil, any thoughts?

Comment 26 Vivek Goyal 2009-01-16 14:42:49 UTC
Committed in 78.30.EL . RPMS are available at http://people.redhat.com/vgoyal/rhel4/

Comment 29 errata-xmlrpc 2009-05-18 19:03:15 UTC
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on therefore solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2009-1024.html