Ran across this while testing a different patch. Do the following steps on a machine configured as a nfs server: # service nfs stop # modprobe -r nfsd # service nfs start ...you'll get the following messages and a stack trace like this: nfsd: last server has exited nfsd: unexporting all filesystems rpciod: active tasks at shutdown?! RPC: failed to contact portmap (errno -5). nfsd: last server has exited nfsd: unexporting all filesystems rpciod: active tasks at shutdown?! RPC: error 5 connecting to server localhost RPC: failed to contact portmap (errno -5). remove_proc_entry: nfs4.idtoname/channel busy, count=1 remove_proc_entry: rpc/nfs4.idtoname busy, count=1 remove_proc_entry: nfs4.nametoid/channel busy, count=1 remove_proc_entry: rpc/nfs4.nametoid busy, count=1 Unable to handle kernel NULL pointer dereference at 0000000000000008 RIP: <ffffffffa00c7cdb>{:sunrpc:cache_open+111} PML4 1caab067 PGD 1c785067 PMD 0 Oops: 0002 [1] SMP CPU 0 Modules linked in: nfsd exportfs lockd nfs_acl md5 ipv6 parport_pc lp parport i2c_dev i2c_core rpcsec_gss_krb5 auth_rpcgss des sunrpc iptable_filter ip_tables ds yenta_socket pcmcia_core button battery ac uhci_hcd 8139cp mii floppy dm_snapshot dm_zero dm_mirror ext3 jbd dm_mod Pid: 1986, comm: rpc.idmapd Not tainted 2.6.9-70.ELsmp RIP: 0010:[<ffffffffa00c7cdb>] <ffffffffa00c7cdb>{:sunrpc:cache_open+111} RSP: 0018:000001001c2b5e48 EFLAGS: 00010246 RAX: 0000000000000000 RBX: ffffffffa01c1520 RCX: 000001001a6a69c0 RDX: ffffffffa01c1590 RSI: 0000000000000202 RDI: ffffffffa00e04d8 RBP: 000001001d3c4ee0 R08: 000001001c2b5d48 R09: 0000000000000000 R10: 0000000000000048 R11: 0000000000000048 R12: 000001001a6a69c0 R13: 000001001e35fc00 R14: 0000000000000000 R15: 000001001b5ffe58 FS: 0000002a95579560(0000) GS:ffffffff8050c480(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b CR2: 0000000000000008 CR3: 0000000000101000 CR4: 00000000000006e0 Process rpc.idmapd (pid: 1986, threadinfo 000001001c2b4000, task 000001001e19d030) Stack: 000001001a6a69c0 000001001a6a69c0 000001001b5f0760 ffffffff8017af09 0000000000000000 0000000000000000 000001001a6a69c0 0000000000008002 0000000000000000 000001001b651000 Call Trace:<ffffffff8017af09>{__dentry_open+208} <ffffffff8017b0e2>{filp_open+95} <ffffffff80136084>{autoremove_wake_function+0} <ffffffff80144eb7>{do_sigaction+553} <ffffffff801f19c5>{strncpy_from_user+74} <ffffffff8017b2d1>{sys_open+57} <ffffffff801102b6>{system_call+126} Code: 48 89 68 08 48 89 45 00 48 89 55 08 48 89 6b 70 e8 f5 0e 25 RIP <ffffffffa00c7cdb>{:sunrpc:cache_open+111} RSP <000001001c2b5e48> CR2: 0000000000000008 <0>Kernel panic - not syncing: Oops ...I've been able to reproduce on -70.ELsmp, but not on -42.0.10.ELsmp. On -42.0.10, I'm unable to unplug nfsd, so it seems like something has changed wrt to module refcounting. I'm not sure when the problem was introduced, but I'll try a bisect search to see if I can track it down...
cc'ing Neil since it looks like it might be related to his patch... Problem introduced in -68.28.EL. Up until that kernel, a running rpc.idmapd would prevent nfsd.ko from being unplugged. This patch seems like the most likely candidate: -Fix race condition in proc file reading that leads to module refcnt imbalance (Neil Horman) [280431] Not a particularly serious regression -- unplugging modules generally falls into the "don't do that category", but it might be indicative of bigger problems. ...Neil, thoughts?
...on -68.27EL kernel, with rpc.idmapd up and running: [root@dhcp231-224 ~]# lsmod | head Module Size Used by nfsd 270945 4 ...lsof info: rpc.idmap 2575 root 0u CHR 1,3 1295 /dev/null rpc.idmap 2575 root 1u CHR 1,3 1295 /dev/null rpc.idmap 2575 root 2u CHR 1,3 1295 /dev/null rpc.idmap 2575 root 3r 0000 0,8 0 5923 eventpoll rpc.idmap 2575 root 5r DIR 0,3 0 168755209 /proc/2575/fd rpc.idmap 2575 root 6u unix 0x000001001f08ac40 5924 socket rpc.idmap 2575 root 7u unix 0x000001001f08a940 5925 socket rpc.idmap 2575 root 8u REG 0,3 0 4026532584 /proc/net/rpc/nfs4.nametoid/channel rpc.idmap 2575 root 9u REG 0,3 0 4026532580 /proc/net/rpc/nfs4.idtoname/channel rpc.idmap 2575 root 10r DIR 0,18 0 4 /var/lib/nfs/rpc_pipefs/nfs ...if I stop rpc.idmapd, the refcount goes to 0.
Chatted with Neil on IRC about this...the patch for bug 280431 changed it so that module refcounts were moved from the lookup/delete codepath to read/write codepath. Simply holding open a file doesn't get you a module reference now. That explains the behavior change. This is probably not good. I don't think this particular situation (removing nfsd module and stopping and starting nfs) is that big of a problem, but this could cause problems with other programs and kernel modules.
No, its not good, but the alternative isn't any better. If we do reference counting in the open and close paths, we are exposed to the possibility that we open a file prior to its module owner pointer getting set, which results in an unbalanced reference count, and a module that can never be removed. I would say that we could probably extend the module structure with a lock and some flags to identify those modules that are going to register owners to close the race, but I don't see any way to do that. We can create locks in the wrapper functions that I wrote to close the race for those modules which will set an owner eventually, but that doesn't protect the IIRC many modules that dont do that in RHEL4. They'll still be exposed to underflow, and thats just as unacceptible. I think the best thing to do is create an rwlock between the sys_module_init routine and the proc_lookup routine. If we serialize beweeen all proc reads and module loads, we can do reference counting on open/close and avoid the race condition.
This bugzilla has Keywords: Regression. Since no regressions are allowed between releases, it is also being proposed as a blocker for this release. Please resolve ASAP.
Yes. This won't be pretty to fix... We really need for the refcount to go up when we open a proc file. The problem is that we need to make sure that nothing tries to increase the refcount before "owner" is set, or we open ourselves to the original race. There is no common entry point for the create + set owner, so we potentially will have to wrap locks around many callsites of create_proc_entry (at least any where we care about protecting against the original race). Hooking into sys_module_init might work, though that will just prevent this race for /proc files created on module load. Anything created in a non-init routine won't be protected. Also, since create_proc_entry does a kmalloc, I think we'll have to use a rw semaphore since that can potentially sleep.
Here's another possibility: 1) back out the patch for 280431 2) check to see if owner is NULL before calling try_module_get in proc_get_inode. If it is, then don't even do a try_module_get. If it's not NULL, then do try_module_get. If it's successful, then increment an atomic counter that we store in PROC_I(). 3) in proc_delete_inode, decrement the counter and if it's not 0 then do a module_put(). ...doesn't sound too bad at first glance. The key is to track whether we got a reference or not. We also assume that it's not possible for the owner to be non-NULL when we enter proc_get_inode, and then later to become NULL. It seems unlikely, but there's not much we can do about it. Neil, thoughts?
Created attachment 309114 [details] proposed patch -- keep a counter in proc_inode to make sure we don't do more module_put()'s than gets This is a quick pass at what I was thinking about in my last comment. Untested, but it does compile.
Patch seems to fix the problem, and I think it'll prevent the race that the original patch was intended to fix. I still need to vet it for kABI though...
Created attachment 309292 [details] updated patch Original patch seems to have been wrong. We should be doing a module_put() if the atomic_dec_and_test() returns false, not true... That said, that patch seems to actually work correctly as far as I can tell. I don't see any module refcount leaks after testing it various ways, and I'm not sure why that is...
Created attachment 309776 [details] patch -- fix off-by-one in module_ref counts The last patch was not correct. It made the module refcount continually increase. The problem is that we were bumping the module refcount when module_ref went from 0 to 1, but were not decrementing it when we went from 1 to 0. This patch should do the right thing. That said, it's probably somewhat fragile. It seems like we can end up with subtle races that might throw the refcounts off. We might be better suited to abandon the simple atomic counter here and just use a regular variable with a spinlock. With that we could throw in some BUG_ON's if the refcounts get out of whack. Or maybe I'm just being paranoid and this is OK. I'll stuff this patch into my test kernels for now and we can see how it does...
Hmm...I don't think this patch is quite what we want after all. Consider: Start a module_refs of 0 proc_get_inode is called, and owner is null... no module ref taken and module_refs stays at 0 proc_delete_inode is called and owner is now non-NULL... module_refs is now -1. no module_put is done proc_get_inode is called again and owner is non-NULL. we take a module reference and bump module_refs to 0. proc_delete_inode is called again and module_refs moves to -1. Module reference isn't dropped. ...I'm not sure that this is really that big a problem though. Will the inode survive past a proc_delete_inode to keep a negative refcount? It seems like the strange nature of procfiles (where they're sort of instantiated per-task) might mean that we don't need this complex refcounting, but can just get away with a flag that indicates whether a reference was taken or not.
Created attachment 310139 [details] patch to make proc file creation / module ownership atomic for any module that wants it So hows this for an idea: 1) We register a notifier in proc for module initalization 2) We add a state for notification MODULE_STATE_FAILED 3) We send out the MODULE_STATE_COMMING notification prior to mod->init being called in sys_init_module, and MODULE_STATE_LIVE notification after (or MODULE_STATE_FAILED if something went wrong). 4) The notifier we registered in proc catches these and locks a mutex based on which notification it is 5) We restore the try_module_get and module_put calls in proc_get_inode and proc_delete_inode. WE surround the former in the locking of the notifiers mutex. This has the affect of making all of module initalization atomic with respect to the checking of de->owner in proc_get_inode, which should solve our race for any module that wants to set owner. I'm a bit worried about deadlock form multiple module loads at the same time, but i think we can solve that with a counting semaphore or a counter and completion struct. this isnt tested yet, but it sould build/run without much work
Created attachment 310142 [details] new refcnt patch In fact, I like this variant better. Its the same patch, but it adds an atomic counter to the notifier function, this should allow multiple module loads to work with this patch properly. Let me know what you think.
Updating PM score.
Created attachment 316894 [details] new patch -- use spinlock to manage refcounts After looking at Neil's approach, I'm a little leery of it. Serializing all procfile opens behind module_init's seems a bit scary to me. While I'm not aware of anything that would cause that to deadlock, it's sometimes hard to predict what 3rd party modules will do... This patch might be a reasonable approach (presuming I haven't horribly broken kABI with it). It's basically the same as my earlier attempts, but it adds a counter and a spinlock to the proc_inode struct. The spinlock just protects the counter. When the counter reaches 0, we don't attempt to do any more module_put's, even if owner is set. This should prevent us from doing more puts than gets. I tried to do this earlier with atomics rather than a lock, but I think it turns out that we do need a lock for this.
I think it seems fine, as long as it doesn't rub the abi checker the wrong way. I would like to see how my patch behaves though, since it relegates serialization to only those times whenmodules are loading and uloading (although as we've discussed, serialization there is fraught with danger itself :) )
Created attachment 316952 [details] fixed patch -- use spinlock to serialize module get refcount New patch, fix some broken variable names. This uses a spinlock to serialize access to the new mod_ref_count field in proc_inode. The upshot of this patch is that we simply don't allow proc_delete_inode to do more module_put's than module_get's that were done in proc_get_inode. It's not an ideal fix, but in conjunction with Neil's changes to the read/write codepaths this should cover module references for most of these cases. Now to see about kABI...
The patch doesn't seem to change any exported symbols (assuming I'm checking this correctly), so I think we could go with the patch in comment #20. Neil, do you have any concerns or thoughts with this approach?
No, I think its good. I'm a bit concerned about performance if you have lots of heavy readers on a particular proc file, but I doubt it will be a big deal. I would suggest that you use an atomic_t as your reference counter however, rather than a combination spinlock and integer. You can use atomic_inc and atomic_dec_and_test to do your operations then, and save some space in the inode and in your code.
I'd prefer to use atomic_t, but I don't think it's suitable for this unfortunately. In the proc_delete_inode path, we need to decrement the counter if and only if it's not already 0. If we check it and then do the the decrement, then there's a potential race between the two. If we instead do a dec_and_test and then try to increment it again if it falls below 0, then we also have a race between those two operations. I don't see a way to do this with atomics, so I think I'm stuck with a spinlock here. I'm open to suggestions though if you see a way... I'm not terribly worried about performance here though. This should only be getting hit on the open/close. I'd hope that heavy readers would just hold the file open.
Created attachment 317720 [details] patch -- go back to using atomics Here's a possible patch that should be more efficient. This patch has the refcount be done as an atomic negative value (starting at -1). Taking a reference to the module does an atomic_dec on the count. Putting a reference does this: + if (de->owner) { + if (atomic_add_negative(1, &ei->mod_refs)) + module_put(de->owner); + else + atomic_dec(&ei->mod_refs); + } ...so as long as the count is in negative values, we'll do a module_put. If not, then we'll skip it and put the refcount back where it was. While this seems to work fine in cursory testing, I'm not convinced that there isn't a race lurking between the two atomic operations once we do the -1 to 0 transition on the atomic_add_negative() call. I don't seem to be able to identify any races here, so this might be just fine, but I think it needs careful review. Neil, any thoughts?
Committed in 78.30.EL . RPMS are available at http://people.redhat.com/vgoyal/rhel4/
An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on therefore solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you. http://rhn.redhat.com/errata/RHSA-2009-1024.html