Bug 448603
Description
Jeff Layton
2008-05-27 19:37:00 UTC
cc'ing Neil since it looks like it might be related to his patch... Problem introduced in -68.28.EL. Up until that kernel, a running rpc.idmapd would prevent nfsd.ko from being unplugged. This patch seems like the most likely candidate: -Fix race condition in proc file reading that leads to module refcnt imbalance (Neil Horman) [280431] Not a particularly serious regression -- unplugging modules generally falls into the "don't do that category", but it might be indicative of bigger problems. ...Neil, thoughts? ...on -68.27EL kernel, with rpc.idmapd up and running: [root@dhcp231-224 ~]# lsmod | head Module Size Used by nfsd 270945 4 ...lsof info: rpc.idmap 2575 root 0u CHR 1,3 1295 /dev/null rpc.idmap 2575 root 1u CHR 1,3 1295 /dev/null rpc.idmap 2575 root 2u CHR 1,3 1295 /dev/null rpc.idmap 2575 root 3r 0000 0,8 0 5923 eventpoll rpc.idmap 2575 root 5r DIR 0,3 0 168755209 /proc/2575/fd rpc.idmap 2575 root 6u unix 0x000001001f08ac40 5924 socket rpc.idmap 2575 root 7u unix 0x000001001f08a940 5925 socket rpc.idmap 2575 root 8u REG 0,3 0 4026532584 /proc/net/rpc/nfs4.nametoid/channel rpc.idmap 2575 root 9u REG 0,3 0 4026532580 /proc/net/rpc/nfs4.idtoname/channel rpc.idmap 2575 root 10r DIR 0,18 0 4 /var/lib/nfs/rpc_pipefs/nfs ...if I stop rpc.idmapd, the refcount goes to 0. Chatted with Neil on IRC about this...the patch for bug 280431 changed it so that module refcounts were moved from the lookup/delete codepath to read/write codepath. Simply holding open a file doesn't get you a module reference now. That explains the behavior change. This is probably not good. I don't think this particular situation (removing nfsd module and stopping and starting nfs) is that big of a problem, but this could cause problems with other programs and kernel modules. No, its not good, but the alternative isn't any better. If we do reference counting in the open and close paths, we are exposed to the possibility that we open a file prior to its module owner pointer getting set, which results in an unbalanced reference count, and a module that can never be removed. I would say that we could probably extend the module structure with a lock and some flags to identify those modules that are going to register owners to close the race, but I don't see any way to do that. We can create locks in the wrapper functions that I wrote to close the race for those modules which will set an owner eventually, but that doesn't protect the IIRC many modules that dont do that in RHEL4. They'll still be exposed to underflow, and thats just as unacceptible. I think the best thing to do is create an rwlock between the sys_module_init routine and the proc_lookup routine. If we serialize beweeen all proc reads and module loads, we can do reference counting on open/close and avoid the race condition. This bugzilla has Keywords: Regression. Since no regressions are allowed between releases, it is also being proposed as a blocker for this release. Please resolve ASAP. Yes. This won't be pretty to fix... We really need for the refcount to go up when we open a proc file. The problem is that we need to make sure that nothing tries to increase the refcount before "owner" is set, or we open ourselves to the original race. There is no common entry point for the create + set owner, so we potentially will have to wrap locks around many callsites of create_proc_entry (at least any where we care about protecting against the original race). Hooking into sys_module_init might work, though that will just prevent this race for /proc files created on module load. Anything created in a non-init routine won't be protected. Also, since create_proc_entry does a kmalloc, I think we'll have to use a rw semaphore since that can potentially sleep. Here's another possibility: 1) back out the patch for 280431 2) check to see if owner is NULL before calling try_module_get in proc_get_inode. If it is, then don't even do a try_module_get. If it's not NULL, then do try_module_get. If it's successful, then increment an atomic counter that we store in PROC_I(). 3) in proc_delete_inode, decrement the counter and if it's not 0 then do a module_put(). ...doesn't sound too bad at first glance. The key is to track whether we got a reference or not. We also assume that it's not possible for the owner to be non-NULL when we enter proc_get_inode, and then later to become NULL. It seems unlikely, but there's not much we can do about it. Neil, thoughts? Created attachment 309114 [details]
proposed patch -- keep a counter in proc_inode to make sure we don't do more module_put()'s than gets
This is a quick pass at what I was thinking about in my last comment. Untested,
but it does compile.
Patch seems to fix the problem, and I think it'll prevent the race that the original patch was intended to fix. I still need to vet it for kABI though... Created attachment 309292 [details]
updated patch
Original patch seems to have been wrong. We should be doing a module_put() if
the atomic_dec_and_test() returns false, not true...
That said, that patch seems to actually work correctly as far as I can tell. I
don't see any module refcount leaks after testing it various ways, and I'm not
sure why that is...
Created attachment 309776 [details]
patch -- fix off-by-one in module_ref counts
The last patch was not correct. It made the module refcount continually
increase. The problem is that we were bumping the module refcount when
module_ref went from 0 to 1, but were not decrementing it when we went from 1
to 0. This patch should do the right thing.
That said, it's probably somewhat fragile. It seems like we can end up with
subtle races that might throw the refcounts off. We might be better suited to
abandon the simple atomic counter here and just use a regular variable with a
spinlock. With that we could throw in some BUG_ON's if the refcounts get out of
whack.
Or maybe I'm just being paranoid and this is OK. I'll stuff this patch into my
test kernels for now and we can see how it does...
Hmm...I don't think this patch is quite what we want after all. Consider: Start a module_refs of 0 proc_get_inode is called, and owner is null... no module ref taken and module_refs stays at 0 proc_delete_inode is called and owner is now non-NULL... module_refs is now -1. no module_put is done proc_get_inode is called again and owner is non-NULL. we take a module reference and bump module_refs to 0. proc_delete_inode is called again and module_refs moves to -1. Module reference isn't dropped. ...I'm not sure that this is really that big a problem though. Will the inode survive past a proc_delete_inode to keep a negative refcount? It seems like the strange nature of procfiles (where they're sort of instantiated per-task) might mean that we don't need this complex refcounting, but can just get away with a flag that indicates whether a reference was taken or not. Created attachment 310139 [details]
patch to make proc file creation / module ownership atomic for any module that wants it
So hows this for an idea:
1) We register a notifier in proc for module initalization
2) We add a state for notification MODULE_STATE_FAILED
3) We send out the MODULE_STATE_COMMING notification prior to mod->init being
called in sys_init_module, and MODULE_STATE_LIVE notification after (or
MODULE_STATE_FAILED if something went wrong).
4) The notifier we registered in proc catches these and locks a mutex based on
which notification it is
5) We restore the try_module_get and module_put calls in proc_get_inode and
proc_delete_inode. WE surround the former in the locking of the notifiers
mutex.
This has the affect of making all of module initalization atomic with respect
to the checking of de->owner in proc_get_inode, which should solve our race
for any module that wants to set owner.
I'm a bit worried about deadlock form multiple module loads at the same time,
but i think we can solve that with a counting semaphore or a counter and
completion struct. this isnt tested yet, but it sould build/run without much
work
Created attachment 310142 [details]
new refcnt patch
In fact, I like this variant better. Its the same patch, but it adds an atomic
counter to the notifier function, this should allow multiple module loads to
work with this patch properly. Let me know what you think.
Updating PM score. Created attachment 316894 [details]
new patch -- use spinlock to manage refcounts
After looking at Neil's approach, I'm a little leery of it. Serializing all procfile opens behind module_init's seems a bit scary to me. While I'm not aware of anything that would cause that to deadlock, it's sometimes hard to predict what 3rd party modules will do...
This patch might be a reasonable approach (presuming I haven't horribly broken kABI with it). It's basically the same as my earlier attempts, but it adds a counter and a spinlock to the proc_inode struct. The spinlock just protects the counter. When the counter reaches 0, we don't attempt to do any more module_put's, even if owner is set. This should prevent us from doing more puts than gets. I tried to do this earlier with atomics rather than a lock, but I think it turns out that we do need a lock for this.
I think it seems fine, as long as it doesn't rub the abi checker the wrong way. I would like to see how my patch behaves though, since it relegates serialization to only those times whenmodules are loading and uloading (although as we've discussed, serialization there is fraught with danger itself :) ) Created attachment 316952 [details]
fixed patch -- use spinlock to serialize module get refcount
New patch, fix some broken variable names.
This uses a spinlock to serialize access to the new mod_ref_count field in proc_inode. The upshot of this patch is that we simply don't allow proc_delete_inode to do more module_put's than module_get's that were done in proc_get_inode.
It's not an ideal fix, but in conjunction with Neil's changes to the read/write codepaths this should cover module references for most of these cases.
Now to see about kABI...
The patch doesn't seem to change any exported symbols (assuming I'm checking this correctly), so I think we could go with the patch in comment #20. Neil, do you have any concerns or thoughts with this approach? No, I think its good. I'm a bit concerned about performance if you have lots of heavy readers on a particular proc file, but I doubt it will be a big deal. I would suggest that you use an atomic_t as your reference counter however, rather than a combination spinlock and integer. You can use atomic_inc and atomic_dec_and_test to do your operations then, and save some space in the inode and in your code. I'd prefer to use atomic_t, but I don't think it's suitable for this unfortunately. In the proc_delete_inode path, we need to decrement the counter if and only if it's not already 0. If we check it and then do the the decrement, then there's a potential race between the two. If we instead do a dec_and_test and then try to increment it again if it falls below 0, then we also have a race between those two operations. I don't see a way to do this with atomics, so I think I'm stuck with a spinlock here. I'm open to suggestions though if you see a way... I'm not terribly worried about performance here though. This should only be getting hit on the open/close. I'd hope that heavy readers would just hold the file open. Created attachment 317720 [details]
patch -- go back to using atomics
Here's a possible patch that should be more efficient. This patch has the refcount be done as an atomic negative value (starting at -1). Taking a reference to the module does an atomic_dec on the count. Putting a reference does this:
+ if (de->owner) {
+ if (atomic_add_negative(1, &ei->mod_refs))
+ module_put(de->owner);
+ else
+ atomic_dec(&ei->mod_refs);
+ }
...so as long as the count is in negative values, we'll do a module_put. If not, then we'll skip it and put the refcount back where it was.
While this seems to work fine in cursory testing, I'm not convinced that there isn't a race lurking between the two atomic operations once we do the -1 to 0 transition on the atomic_add_negative() call. I don't seem to be able to identify any races here, so this might be just fine, but I think it needs careful review.
Neil, any thoughts?
Committed in 78.30.EL . RPMS are available at http://people.redhat.com/vgoyal/rhel4/ An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on therefore solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you. http://rhn.redhat.com/errata/RHSA-2009-1024.html |