Bug 1025907
| Field | Value |
| --- | --- |
| Summary | use after free in new nfsd DRC code |
| Product | [Fedora] Fedora |
| Component | kernel |
| Version | 19 |
| Hardware | x86_64 |
| OS | All |
| Status | CLOSED ERRATA |
| Severity | high |
| Priority | unspecified |
| Reporter | g. artim <gartim> |
| Assignee | Jeff Layton <jlayton> |
| QA Contact | Fedora Extras Quality Assurance <extras-qa> |
| CC | bfields, gansalmon, gartim, itamar, jlayton, jonathan, kernel-maint, madhu.chinakonda, nfs-maint, steved, trux |
| Type | Bug |
| Doc Type | Bug Fix |
| Clones | 1036971 (view as bug list) |
| Bug Blocks | 1036971, 1036972 |
| Last Closed | 2014-01-08 14:22:49 UTC |
**Description** (g. artim, 2013-11-01 22:36:23 UTC)

Added `noacl,nocto,nodiratime,noac,noatime` to my nfs4 fstab and ran the following in terminal 1 (xtest.txt is a 3GB file of random data):

```shell
#!/bin/bash
echo cp xtest1.txt
time cp xtest.txt xtest1.txt
echo cp xtest2.txt
time cp xtest.txt xtest2.txt
echo cp xtest3.txt
time cp xtest.txt xtest3.txt
echo cp xtest4.txt
time cp xtest.txt xtest4.txt
echo cp xtest5.txt
time cp xtest.txt xtest5.txt
sync
sync
echo rm xtest?.txt
time \rm xtest?.txt
```

In terminal 2, the response to `watch -n 1 ls -la` on the nfs4 directory I ran it in went from 5-15 seconds down to 1 second. This is an nfs4 client, but it behaves the same on an nfs3 client. The setup is:

- server: fc19, btrfs with the lzo option on h/w RAID 5, 20TB
- client: fc19, nfs4

So something is amiss between btrfs and nfs. My guess is it's nodiratime, but I haven't isolated it down to that nfs option.

gary

---

**J. Bruce Fields** (comment 2):

I guess this is the list_move in lru_put_end. Could we in theory hit this case like this?

```c
if (!list_empty(&lru_head)) {
	rp = list_first_entry(&lru_head, struct svc_cacherep, c_lru);
	if (nfsd_cache_entry_expired(rp) ||
	    num_drc_entries >= max_drc_entries) {
		lru_put_end(rp);
		prune_cache_entries();	/* prunes rp */
		goto search_cache;
...
search_cache:
	found = nfsd_cache_search(rqstp, csum);	/* returns NULL */
	if (found) {
		...
	}
	if (!rp) {
		...
	}
	...
	lru_put_end(rp);
```

I don't see what guarantees rp is still good at this point, but I haven't looked closely.

---

**Jeff Layton:**

(In reply to J. Bruce Fields from comment #2)
> I don't see what guarantees rp is still good at this point, but I haven't
> looked closely.

Been a while since I've been in this code too, but I think the fact that we hold the cache_lock over all of the above should ensure that rp doesn't go away out from under us.

Ahh ok, I see where you're saying that prune_cache_entries might prune rp... I think that's pretty unlikely.
lru_put_end does this:

```c
rp->c_timestamp = jiffies;
list_move_tail(&rp->c_lru, &lru_head);
```

...and then prune_cache_entries does this:

```c
list_for_each_entry_safe(rp, tmp, &lru_head, c_lru) {
	if (!nfsd_cache_entry_expired(rp) &&
	    num_drc_entries <= max_drc_entries)
		break;
	nfsd_reply_cache_free_locked(rp);
	freed++;
}
```

...so in order for rp to be pruned, you'd basically have to have max_drc_entries be 0, which I don't think is really possible.

---

**Jeff Layton** (comment 5):

I'm stumped... Is this reproducible at all? If so, would it be possible to get a vmcore?

---

**Jeff Layton:**

Hmm... so it looks like this is the found_entry case (i.e. a cache hit):

```
Reading symbols from /usr/lib/debug/lib/modules/3.11.8-200.fc19.x86_64/kernel/fs/nfsd/nfsd.ko.debug...done.
(gdb) list *(nfsd_cache_lookup+0x388)
0xc208 is in nfsd_cache_lookup (fs/nfsd/nfscache.c:478).
473		age = jiffies - rp->c_timestamp;
474		lru_put_end(rp);
475
476		rtn = RC_DROPIT;
477		/* Request being processed or excessive rexmits */
478		if (rp->c_state == RC_INPROG || age < RC_DELAY)
479			goto out;
480
481		/* From the hall of fame of impractical attacks:
482		 * Is this a user who tries to snoop on the cache? */
```

The kernel version on my box is slightly different here, of course, but I'll see if I can verify that against the version that the oops got reported against. Still unclear on how we could hit such a bug, but I'll take a look tomorrow...

---

**g. artim:**

(In reply to Jeff Layton from comment #5)
> I'm stumped... Is this reproducible at all? If so, would it be possible to
> get a vmcore?

Since I updated to:

```
Linux r1epi 3.11.7-200.fc19.x86_64 #1 SMP Mon Nov 4 14:09:03 UTC 2013 x86_64 x86_64 x86_64 GNU/Linux
```

I haven't gotten a dump of the kernel. I only get the 10-15 second hangs on directory lists (`ls -la`). The options noacl,nocto,nodiratime,noac,noatime eliminated that hang time. Let me know if you want me to somehow force a vmcore, but I'd need some instructions. I do have a backup raid of the same size that I could test with.
It's a different raid level (0), but I get the same results: long hang times listing dirs while a cp or rm of big (3GB) files is in progress.

---

**Jeff Layton** (comment 8):

The list corruption is the more serious problem and the one we really need to address here. That may or may not be associated with the stalls you see. Have you seen the warnings about list corruption more than once?

There's no need to force a vmcore at this point.

---

**Jeff Layton:**

Ok, pulling down the kernel module that the list corruption was noticed on:

```
(gdb) list *(nfsd_cache_lookup+0x388)
0xc208 is in nfsd_cache_lookup (fs/nfsd/nfscache.c:478).
```

...and that's pretty much the same spot as in v3.11.8, so the lru_put_end call here is in the found_entry case. One possibility is that nfsd_cache_search returned an entry that was already deleted.

Hmm... there is one place where we free entries without holding the lock: nfsd_reply_cache_shutdown() will do so, but that should only be called if we couldn't successfully plug in the module in the first place, or if we're unplugging it. Neither should be possible with nfsd threads actually running.

---

**g. artim:**

(In reply to Jeff Layton from comment #8)
> The list corruption is the more serious problem and the one we really need
> to address here. That may or may not be associated with the stalls you see.
> Have you seen the warnings about list corruption more than once?
>
> There's no need to force a vmcore at this point.

I only got it one time, during a heavy load on the nfs server. That workload will return, but I'm not sure when; the group that creates it is MIA now. Holidays!

---

**Jeff Layton:**

Created attachment 831771 [details]
patch: nfsd: when reusing an existing entry, unhash it first
Ok, I think this patch will probably fix it. Christoph Hellwig came up with a reproducer, and with it I was able to track the bug down. I can't reproduce the bug with this patch applied, but I'm also hitting a deadlock of some sort in nfsd, so I can't completely verify the fix just yet.
Still, I'm pretty confident that this is the bug that was causing the list corruption warnings. Hopefully this patch will make 3.13 and stable kernels.
Also, I don't think it's likely that this is directly related to the btrfs stalls you're seeing, but it may be that those stalls make it more likely for this problem to occur.

---

The patch was merged into 3.13-rc4 and should make its way to stable soon.

---

*********** MASS BUG UPDATE **************

We apologize for the inconvenience. There is a large number of bugs to go through, and several of them have gone stale. Because of this, we are doing a mass bug update across all of the Fedora 19 kernel bugs.

Fedora 19 has now been rebased to 3.12.6-200.fc19. Please test this kernel update (or newer) and let us know whether your issue has been resolved or is still present with the newer kernel.

If you have moved on to Fedora 20 and are still experiencing this issue, please change the version to Fedora 20. If you experience different issues, please open a new bug report for those.

---

Should be fixed with 3.12.6 in F19 stable updates. Please reopen if you still see this.