Bug 1025907 - use after free in new nfsd DRC code
Summary: use after free in new nfsd DRC code
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Fedora
Classification: Fedora
Component: kernel
Version: 19
Hardware: x86_64
OS: All
Priority: unspecified
Severity: high
Target Milestone: ---
Assignee: Jeff Layton
QA Contact: Fedora Extras Quality Assurance
URL:
Whiteboard:
Depends On:
Blocks: 1036971 1036972
 
Reported: 2013-11-01 22:36 UTC by g. artim
Modified: 2014-06-18 07:43 UTC
CC List: 11 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
: 1036971
Environment:
Last Closed: 2014-01-08 14:22:49 UTC
Type: Bug
Embargoed:


Attachments
patch -- nfsd: when reusing an existing entry, unhash it first (1.92 KB, patch)
2013-12-02 20:32 UTC, Jeff Layton
no flags

Description g. artim 2013-11-01 22:36:23 UTC
Description of problem:
The NFS server system froze.


Version-Release number of selected component (if applicable):

kernel-tools-libs-3.11.6-200.fc19.x86_64
kernel-headers-3.11.6-200.fc19.x86_64
kernel-3.11.3-201.fc19.x86_64
kernel-tools-3.11.6-200.fc19.x86_64
kernel-devel-3.11.6-200.fc19.x86_64
kernel-devel-3.11.1-200.fc19.x86_64
kernel-3.11.1-200.fc19.x86_64
kernel-devel-3.11.3-201.fc19.x86_64
kernel-3.11.6-200.fc19.x86_64

btrfs-progs-0.20.rc1.20130917git194aa4a-1.fc19.x86_64


How reproducible:
Not sure what caused it.

Steps to Reproduce:
1.
2.
3.

Actual results:
The system froze and produced the following kernel trace:
Nov  1 09:00:22 r1epi kernel: [210852.907240] ------------[ cut here ]------------
Nov  1 09:00:22 r1epi kernel: [210852.907246] WARNING: CPU: 1 PID: 627 at lib/list_debug.c:53 __list_del_entry+0x63/0xd0()
Nov  1 09:00:22 r1epi kernel: [210852.907247] list_del corruption, ffff8800af398010->next is LIST_POISON1 (dead000000100100)
Nov  1 09:00:22 r1epi kernel: [210852.907248] Modules linked in: binfmt_misc xt_iprange bonding ip6t_REJECT nf_conntrack_ipv6 nf_conntrack_ipv4 nf_defrag_ipv6 nf_defrag_ipv4 ip6table_filter xt_conntrack nf_conntrack ip6_tables snd_hda_codec_hdmi snd_hda_codec_realtek iTCO_wdt iTCO_vendor_support btrfs zlib_deflate x86_pkg_temp_thermal coretemp kvm crc32_pclmul snd_hda_intel crc32c_intel raid6_pq btusb snd_hda_codec libcrc32c bluetooth ghash_clmulni_intel eeepc_wmi asus_wmi snd_hwdep sparse_keymap mxm_wmi snd_pcm microcode serio_raw xor rfkill i2c_i801 lpc_ich e1000 snd_page_alloc mfd_core e1000e mei_me snd_timer mei snd ptp soundcore shpchp pps_core wmi mperf nfsd auth_rpcgss nfs_acl lockd sunrpc i915 firewire_ohci i2c_algo_bit firewire_core drm_kms_helper crc_itu_t drm aacraid i2c_core video
Nov  1 09:00:22 r1epi kernel: [210852.907279] CPU: 1 PID: 627 Comm: nfsd Not tainted 3.11.6-200.fc19.x86_64 #1
Nov  1 09:00:22 r1epi kernel: [210852.907280] Hardware name: System manufacturer System Product Name/P8Z68-V PRO, BIOS 0902 09/19/2011
Nov  1 09:00:22 r1epi kernel: [210852.907281]  0000000000000009 ffff8801ff7bfcd8 ffffffff81647c7f ffff8801ff7bfd20
Nov  1 09:00:22 r1epi kernel: [210852.907283]  ffff8801ff7bfd10 ffffffff8106715d ffff8800af398010 0000000000000001
Nov  1 09:00:22 r1epi kernel: [210852.907285]  0000000000000000 0000000039b19996 ffff8801ff6a0000 ffff8801ff7bfd70
Nov  1 09:00:22 r1epi kernel: [210852.907287] Call Trace:
Nov  1 09:00:22 r1epi kernel: [210852.907291]  [<ffffffff81647c7f>] dump_stack+0x45/0x56
Nov  1 09:00:22 r1epi kernel: [210852.907294]  [<ffffffff8106715d>] warn_slowpath_common+0x7d/0xa0
Nov  1 09:00:22 r1epi kernel: [210852.907295]  [<ffffffff810671cc>] warn_slowpath_fmt+0x4c/0x50
Nov  1 09:00:22 r1epi kernel: [210852.907297]  [<ffffffff8118d80c>] ? kmem_cache_free+0x20c/0x220
Nov  1 09:00:22 r1epi kernel: [210852.907299]  [<ffffffff81310963>] __list_del_entry+0x63/0xd0
Nov  1 09:00:22 r1epi kernel: [210852.907306]  [<ffffffffa01c3cc1>] lru_put_end+0x21/0x60 [nfsd]
Nov  1 09:00:22 r1epi kernel: [210852.907310]  [<ffffffffa01c4208>] nfsd_cache_lookup+0x388/0x720 [nfsd]
Nov  1 09:00:22 r1epi kernel: [210852.907315]  [<ffffffffa01b8cf8>] nfsd_dispatch+0x88/0x200 [nfsd]
Nov  1 09:00:22 r1epi kernel: [210852.907323]  [<ffffffffa0187ffd>] svc_process_common+0x46d/0x6e0 [sunrpc]
Nov  1 09:00:22 r1epi kernel: [210852.907329]  [<ffffffffa0188377>] svc_process+0x107/0x170 [sunrpc]
Nov  1 09:00:22 r1epi kernel: [210852.907332]  [<ffffffffa01b871f>] nfsd+0xbf/0x130 [nfsd]
Nov  1 09:00:22 r1epi kernel: [210852.907336]  [<ffffffffa01b8660>] ? nfsd_destroy+0x80/0x80 [nfsd]
Nov  1 09:00:22 r1epi kernel: [210852.907338]  [<ffffffff81088650>] kthread+0xc0/0xd0
Nov  1 09:00:22 r1epi kernel: [210852.907340]  [<ffffffff81088590>] ? insert_kthread_work+0x40/0x40
Nov  1 09:00:22 r1epi kernel: [210852.907342]  [<ffffffff81656dec>] ret_from_fork+0x7c/0xb0
Nov  1 09:00:22 r1epi kernel: [210852.907344]  [<ffffffff81088590>] ? insert_kthread_work+0x40/0x40
Nov  1 09:00:22 r1epi kernel: [210852.907345] ---[ end trace 250b87d973d6ebb7 ]---


Expected results:

System stays up.


Additional info:

The file system on the server (btrfs) is mounted with the lzo compression option.
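Roughly, that corresponds to an fstab line like the following on the server (device and mount point here are placeholders):

  /dev/sda1  /export  btrfs  defaults,compress=lzo  0 0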

Comment 1 g. artim 2013-11-20 17:12:27 UTC
I added the following mount options to my NFSv4 fstab entry:

  noacl,nocto,nodiratime,noac,noatime
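A sketch of the resulting fstab entry (server name and paths are placeholders):

  server:/export  /mnt/export  nfs4  noacl,nocto,nodiratime,noac,noatime  0 0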

Then I ran the following:

terminal 1:

(xtest.txt is a 3 GB file with random data)

#!/bin/bash
# copy a 3 GB file five times over the NFS mount, sync, then remove the copies
echo cp xtest1.txt
time cp xtest.txt xtest1.txt
echo cp xtest2.txt
time cp xtest.txt xtest2.txt
echo cp xtest3.txt
time cp xtest.txt xtest3.txt
echo cp xtest4.txt
time cp xtest.txt xtest4.txt
echo cp xtest5.txt
time cp xtest.txt xtest5.txt
sync
sync
echo rm xtest?.txt
time \rm xtest?.txt

terminal 2:

The response time of 'watch -n 1 ls -la' on the nfs4
directory I ran the test in went from 5-15 seconds
down to about 1 second.

This is an nfs4 client, but the same holds
on an nfs3 client.

The setup is:

server: fc19, btrfs w/lzo option on 20 TB hardware RAID 5

client: fc19, nfs4

So something is amiss between btrfs and NFS. My guess
is that it's nodiratime, but I haven't isolated it
down to that particular NFS option.


gary

Comment 2 J. Bruce Fields 2013-11-20 21:04:08 UTC
I guess this is the list_move in lru_put_end.  Could we in theory hit a case like this?

        if (!list_empty(&lru_head)) {
                rp = list_first_entry(&lru_head, struct svc_cacherep, c_lru);
                if (nfsd_cache_entry_expired(rp) ||
                    num_drc_entries >= max_drc_entries) {
                        lru_put_end(rp);
                        prune_cache_entries(); /* prunes rp */
                        goto search_cache;
....
search_cache:
        found = nfsd_cache_search(rqstp, csum); /* returns NULL */
        if (found) {
                ....
        }
        if (!rp) {
                ....
        }
        ....
        lru_put_end(rp);

I don't see what guarantees rp is still good at this point, but I haven't looked closely.

Comment 3 Jeff Layton 2013-11-20 21:10:27 UTC
(In reply to J. Bruce Fields from comment #2)

> 
> I don't see what guarantees rp is still good at this point, but I haven't
> looked closely.

Been a while since I've been in this code too, but I think the fact that we hold the cache_lock over all of the above should ensure that rp doesn't go away out from under us.

Comment 4 Jeff Layton 2013-11-20 21:24:59 UTC
Ahh ok, I see where you're saying that prune_cache_entries might prune rp...

I think that's pretty unlikely. lru_put_end does this:

        rp->c_timestamp = jiffies;
        list_move_tail(&rp->c_lru, &lru_head);

...and then prune_cache_entries does this:

        list_for_each_entry_safe(rp, tmp, &lru_head, c_lru) {
                if (!nfsd_cache_entry_expired(rp) &&
                    num_drc_entries <= max_drc_entries)
                        break;
                nfsd_reply_cache_free_locked(rp);
                freed++;
        }

...so in order for rp to be pruned, you'd have to basically have max_drc_entries be 0, which I don't think is really possible.

Comment 5 Jeff Layton 2013-11-20 21:32:21 UTC
I'm stumped...is this reproducible at all? If so, would it be possible to get a vmcore?

Comment 6 Jeff Layton 2013-11-20 21:39:33 UTC
Hmm...so it looks like this is the found_entry case (aka, a cache hit):

Reading symbols from /usr/lib/debug/lib/modules/3.11.8-200.fc19.x86_64/kernel/fs/nfsd/nfsd.ko.debug...done.
(gdb) list *(nfsd_cache_lookup+0x388)
0xc208 is in nfsd_cache_lookup (fs/nfsd/nfscache.c:478).
473		age = jiffies - rp->c_timestamp;
474		lru_put_end(rp);
475	
476		rtn = RC_DROPIT;
477		/* Request being processed or excessive rexmits */
478		if (rp->c_state == RC_INPROG || age < RC_DELAY)
479			goto out;
480	
481		/* From the hall of fame of impractical attacks:
482		 * Is this a user who tries to snoop on the cache? */

The kernel version on my box is slightly different here of course, but I'll see if I can verify that against the kernel version the oops was reported against.

Still unclear on how we could hit such a bug, but I'll take a look tomorrow...
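For reference, the recipe used above is roughly: install the matching kernel debuginfo, point gdb at the module's debug file, then resolve the offset from the oops (the package version here just mirrors the session above):

  debuginfo-install kernel-3.11.8-200.fc19.x86_64
  gdb /usr/lib/debug/lib/modules/3.11.8-200.fc19.x86_64/kernel/fs/nfsd/nfsd.ko.debug
  (gdb) list *(nfsd_cache_lookup+0x388)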

Comment 7 g. artim 2013-11-20 22:03:57 UTC
(In reply to Jeff Layton from comment #5)
> I'm stumped...is this reproducible at all? If so, would it be possible to
> get a vmcore?

since I updated to:

Linux r1epi 3.11.7-200.fc19.x86_64 #1 SMP Mon Nov 4 14:09:03 UTC 2013 x86_64 x86_64 x86_64 GNU/Linux

I haven't gotten a kernel dump.

I only get the 10-15 second hangs on directory lists (ls -la). The options --
noacl,nocto,nodiratime,noac,noatime --
eliminated that hang time. 

Let me know if you want me to somehow force a vmcore, but I'd need some instructions. I do have a backup RAID of the same size that I could test with. It's a different RAID level (0), but I get the same results (long hang times listing directories during cp or rm of big 3 GB files).

Comment 8 Jeff Layton 2013-11-21 13:51:17 UTC
The list corruption is the more serious problem and the one we really need to address here. That may or may not be associated with the stalls you see. Have you seen the warnings about list corruption more than once?

There's no need to force a vmcore at this point.
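For future reference, in case we do end up wanting one later, the usual Fedora setup is roughly the following (details can vary a bit by release):

  # install the kdump tooling and reserve crash memory (reboot afterwards)
  yum install kexec-tools
  grubby --update-kernel=ALL --args="crashkernel=auto"
  # enable and start the capture service
  systemctl enable kdump.service
  systemctl start kdump.service
  # later, to force a crash dump by hand when the problem hits:
  echo 1 > /proc/sys/kernel/sysrq
  echo c > /proc/sysrq-trigger

The vmcore should then land under /var/crash.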

Comment 9 Jeff Layton 2013-11-21 14:13:57 UTC
Ok, pulling down the kernel module that the list corruption was noticed on:

(gdb) list *(nfsd_cache_lookup+0x388)
0xc208 is in nfsd_cache_lookup (fs/nfsd/nfscache.c:478).

...and that's pretty much the same spot as in v3.11.8, so the lru_put_end call here is in the found_entry case. One possibility is that nfsd_cache_search returned an entry that was deleted.

Hmm...there is one place where we do free entries without holding the lock. nfsd_reply_cache_shutdown() will do so, but that should only be called if we couldn't successfully plug in the module in the first place or if we're unplugging it. Neither should be possible with nfsd threads actually running.

Comment 10 g. artim 2013-11-21 17:54:26 UTC
(In reply to Jeff Layton from comment #8)
> The list corruption is the more serious problem and the one we really need
> to address here. That may or may not be associated with the stalls you see.
> Have you seen the warnings about list corruption more than once?
> 
> There's no need to force a vmcore at this point.

I only got it one time, during a heavy load on the NFS server. That workload will return, but I'm not sure when -- the group that creates it is MIA now for the holidays!

Comment 11 Jeff Layton 2013-12-02 20:32:05 UTC
Created attachment 831771 [details]
patch -- nfsd: when reusing an existing entry, unhash it first

Ok, I think this patch will probably fix it. Christoph Hellwig came up with a reproducer and I was able to track it down. I can't reproduce the bug with this patch, but I'm also hitting a deadlock of some sort in nfsd so I can't completely verify that just yet.

Still, I'm pretty confident that this is the bug that was causing the list corruption warnings. Hopefully this patch will make 3.13 and stable kernels.
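For anyone following along, the gist of the change is roughly the following sketch (not the attached patch itself; the helper name and the c_hash field name are assumptions here):

        /* Take the recycled entry off the hash chain (and the LRU) before
         * reusing it, so a later nfsd_cache_search() can no longer find it. */
        static void
        nfsd_reply_cache_unhash(struct svc_cacherep *rp)
        {
                hlist_del_init(&rp->c_hash);    /* c_hash: assumed field name */
                list_del_init(&rp->c_lru);
        }

        ...
                if (nfsd_cache_entry_expired(rp) ||
                    num_drc_entries >= max_drc_entries) {
                        nfsd_reply_cache_unhash(rp);    /* instead of lru_put_end(rp) */
                        prune_cache_entries();
                        goto search_cache;
                }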

Comment 12 Jeff Layton 2013-12-03 02:21:51 UTC
Also, I don't think this is directly related to the btrfs stalls you're seeing, but those stalls may make this problem more likely to occur.

Comment 13 Jeff Layton 2013-12-16 11:56:57 UTC
The patch was merged into 3.13-rc4 and should make its way to the stable kernels soon.

Comment 14 Justin M. Forbes 2014-01-03 22:09:39 UTC
*********** MASS BUG UPDATE **************

We apologize for the inconvenience.  There are a large number of bugs to go through, and several of them have gone stale.  Due to this, we are doing a mass bug update across all of the Fedora 19 kernel bugs.

Fedora 19 has now been rebased to 3.12.6-200.fc19.  Please test this kernel update (or newer) and let us know if your issue has been resolved or if it is still present with the newer kernel.

If you have moved on to Fedora 20, and are still experiencing this issue, please change the version to Fedora 20.

If you experience different issues, please open a new bug report for those.

Comment 15 Josh Boyer 2014-01-08 14:22:49 UTC
Should be fixed with 3.12.6 in F19 stable updates.  Please reopen if you still see this.

