Description of problem:
nfs system froze

Version-Release number of selected component (if applicable):
kernel-tools-libs-3.11.6-200.fc19.x86_64
kernel-headers-3.11.6-200.fc19.x86_64
kernel-3.11.3-201.fc19.x86_64
kernel-tools-3.11.6-200.fc19.x86_64
kernel-devel-3.11.6-200.fc19.x86_64
kernel-devel-3.11.1-200.fc19.x86_64
kernel-3.11.1-200.fc19.x86_64
kernel-devel-3.11.3-201.fc19.x86_64
kernel-3.11.6-200.fc19.x86_64
btrfs-progs-0.20.rc1.20130917git194aa4a-1.fc19.x86_64

How reproducible:
not sure what caused it.

Steps to Reproduce:
1.
2.
3.

Actual results:
freeze of system and dump:

Nov 1 09:00:22 r1epi kernel: [210852.907240] ------------[ cut here ]------------
Nov 1 09:00:22 r1epi kernel: [210852.907246] WARNING: CPU: 1 PID: 627 at lib/list_debug.c:53 __list_del_entry+0x63/0xd0()
Nov 1 09:00:22 r1epi kernel: [210852.907247] list_del corruption, ffff8800af398010->next is LIST_POISON1 (dead000000100100)
Nov 1 09:00:22 r1epi kernel: [210852.907248] Modules linked in: binfmt_misc xt_iprange bonding ip6t_REJECT nf_conntrack_ipv6 nf_conntrack_ipv4 nf_defrag_ipv6 nf_defrag_ipv4 ip6table_filter xt_conntrack nf_conntrack ip6_tables snd_hda_codec_hdmi snd_hda_codec_realtek iTCO_wdt iTCO_vendor_support btrfs zlib_deflate x86_pkg_temp_thermal coretemp kvm crc32_pclmul snd_hda_intel crc32c_intel raid6_pq btusb snd_hda_codec libcrc32c bluetooth ghash_clmulni_intel eeepc_wmi asus_wmi snd_hwdep sparse_keymap mxm_wmi snd_pcm microcode serio_raw xor rfkill i2c_i801 lpc_ich e1000 snd_page_alloc mfd_core e1000e mei_me snd_timer mei snd ptp soundcore shpchp pps_core wmi mperf nfsd auth_rpcgss nfs_acl lockd sunrpc i915 firewire_ohci i2c_algo_bit firewire_core drm_kms_helper crc_itu_t drm aacraid i2c_core video
Nov 1 09:00:22 r1epi kernel: [210852.907279] CPU: 1 PID: 627 Comm: nfsd Not tainted 3.11.6-200.fc19.x86_64 #1
Nov 1 09:00:22 r1epi kernel: [210852.907280] Hardware name: System manufacturer System Product Name/P8Z68-V PRO, BIOS 0902 09/19/2011
Nov 1 09:00:22 r1epi kernel: [210852.907281] 0000000000000009 ffff8801ff7bfcd8 ffffffff81647c7f ffff8801ff7bfd20
Nov 1 09:00:22 r1epi kernel: [210852.907283] ffff8801ff7bfd10 ffffffff8106715d ffff8800af398010 0000000000000001
Nov 1 09:00:22 r1epi kernel: [210852.907285] 0000000000000000 0000000039b19996 ffff8801ff6a0000 ffff8801ff7bfd70
Nov 1 09:00:22 r1epi kernel: [210852.907287] Call Trace:
Nov 1 09:00:22 r1epi kernel: [210852.907291] [<ffffffff81647c7f>] dump_stack+0x45/0x56
Nov 1 09:00:22 r1epi kernel: [210852.907294] [<ffffffff8106715d>] warn_slowpath_common+0x7d/0xa0
Nov 1 09:00:22 r1epi kernel: [210852.907295] [<ffffffff810671cc>] warn_slowpath_fmt+0x4c/0x50
Nov 1 09:00:22 r1epi kernel: [210852.907297] [<ffffffff8118d80c>] ? kmem_cache_free+0x20c/0x220
Nov 1 09:00:22 r1epi kernel: [210852.907299] [<ffffffff81310963>] __list_del_entry+0x63/0xd0
Nov 1 09:00:22 r1epi kernel: [210852.907306] [<ffffffffa01c3cc1>] lru_put_end+0x21/0x60 [nfsd]
Nov 1 09:00:22 r1epi kernel: [210852.907310] [<ffffffffa01c4208>] nfsd_cache_lookup+0x388/0x720 [nfsd]
Nov 1 09:00:22 r1epi kernel: [210852.907315] [<ffffffffa01b8cf8>] nfsd_dispatch+0x88/0x200 [nfsd]
Nov 1 09:00:22 r1epi kernel: [210852.907323] [<ffffffffa0187ffd>] svc_process_common+0x46d/0x6e0 [sunrpc]
Nov 1 09:00:22 r1epi kernel: [210852.907329] [<ffffffffa0188377>] svc_process+0x107/0x170 [sunrpc]
Nov 1 09:00:22 r1epi kernel: [210852.907332] [<ffffffffa01b871f>] nfsd+0xbf/0x130 [nfsd]
Nov 1 09:00:22 r1epi kernel: [210852.907336] [<ffffffffa01b8660>] ? nfsd_destroy+0x80/0x80 [nfsd]
Nov 1 09:00:22 r1epi kernel: [210852.907338] [<ffffffff81088650>] kthread+0xc0/0xd0
Nov 1 09:00:22 r1epi kernel: [210852.907340] [<ffffffff81088590>] ? insert_kthread_work+0x40/0x40
Nov 1 09:00:22 r1epi kernel: [210852.907342] [<ffffffff81656dec>] ret_from_fork+0x7c/0xb0
Nov 1 09:00:22 r1epi kernel: [210852.907344] [<ffffffff81088590>] ? insert_kthread_work+0x40/0x40
Nov 1 09:00:22 r1epi kernel: [210852.907345] ---[ end trace 250b87d973d6ebb7 ]---

Expected results:
system stay up

Additional info:
file system is mounted with the lzo option.
added:
+++++
noacl,nocto,nodiratime,noac,noatime to my nfs4 fstab

and ran:
+++
terminal 1: (test.txt is 3GB file with random data)

#!/bin/bash
echo cp xtest1.txt
time cp xtest.txt xtest1.txt
echo cp xtest2.txt
time cp xtest.txt xtest2.txt
echo cp xtest3.txt
time cp xtest.txt xtest3.txt
echo cp xtest4.txt
time cp xtest.txt xtest4.txt
echo cp xtest5.txt
time cp xtest.txt xtest5.txt
sync
sync
echo rm xtest?.txt
time \rm xtest?.txt

terminal 2:
the response to 'watch -n 1 ls -la' on the nfs4 directory I ran it in went from 5-15 seconds to 1 second. This is an nfs4 client, but it behaves the same on an nfs3 client.

the setup is:
++++++++++++
server: fc19, btrfs w/lzo option on h/w raid 5, 20TB
client: fc19, nfs4

so something is amiss between btrfs and nfs; my guess is it's the nodiratime, but I haven't isolated it down to that nfs option.

gary
I guess this is the list_move in lru_put_end. Could we in theory hit this case like this?:

	if (!list_empty(&lru_head)) {
		rp = list_first_entry(&lru_head, struct svc_cacherep, c_lru);
		if (nfsd_cache_entry_expired(rp) ||
		    num_drc_entries >= max_drc_entries) {
			lru_put_end(rp);
			prune_cache_entries();	/* prunes rp */
			goto search_cache;
	....

search_cache:
	found = nfsd_cache_search(rqstp, csum);	/* returns NULL */
	if (found) {
	....
	}

	if (!rp) {
	....
	}
	....
	lru_put_end(rp);

I don't see what guarantees rp is still good at this point, but I haven't looked closely.
(In reply to J. Bruce Fields from comment #2)
>
> I don't see what guarantees rp is still good at this point, but I haven't
> looked closely.

Been a while since I've been in this code too, but I think the fact that we hold the cache_lock over all of the above should ensure that rp doesn't go away out from under us.
Ahh ok, I see where you're saying that prune_cache_entries might prune rp... I think that's pretty unlikely. lru_put_end does this:

	rp->c_timestamp = jiffies;
	list_move_tail(&rp->c_lru, &lru_head);

...and then prune_cache_entries does this:

	list_for_each_entry_safe(rp, tmp, &lru_head, c_lru) {
		if (!nfsd_cache_entry_expired(rp) &&
		    num_drc_entries <= max_drc_entries)
			break;
		nfsd_reply_cache_free_locked(rp);
		freed++;
	}

...so in order for rp to be pruned, you'd basically have to have max_drc_entries be 0, which I don't think is really possible.
I'm stumped...is this reproducible at all? If so, would it be possible to get a vmcore?
Hmm...so it looks like this is the found_entry case (aka, a cache hit):

Reading symbols from /usr/lib/debug/lib/modules/3.11.8-200.fc19.x86_64/kernel/fs/nfsd/nfsd.ko.debug...done.
(gdb) list *(nfsd_cache_lookup+0x388)
0xc208 is in nfsd_cache_lookup (fs/nfsd/nfscache.c:478).
473		age = jiffies - rp->c_timestamp;
474		lru_put_end(rp);
475
476		rtn = RC_DROPIT;
477		/* Request being processed or excessive rexmits */
478		if (rp->c_state == RC_INPROG || age < RC_DELAY)
479			goto out;
480
481		/* From the hall of fame of impractical attacks:
482		 * Is this a user who tries to snoop on the cache? */

The kernel version on my box is slightly different here of course, but I'll see if I can verify that against the version that the oops got reported against. Still unclear on how we could hit such a bug, but I'll take a look tomorrow...
(In reply to Jeff Layton from comment #5)
> I'm stumped...is this reproducible at all? If so, would it be possible to
> get a vmcore?

Since I updated to:

Linux r1epi 3.11.7-200.fc19.x86_64 #1 SMP Mon Nov 4 14:09:03 UTC 2013 x86_64 x86_64 x86_64 GNU/Linux

I haven't gotten a dump of the kernel. I only get the 10-15 second hangs on directory lists (ls -la). The options -- noacl,nocto,nodiratime,noac,noatime -- eliminated that hang time. Let me know if you want me to somehow force a vmcore, but I'd need some instructions. I do have a backup raid of the same size that I could test with. It's a different raid level (0), but I get the same results (long hang times on listing dirs during cp or rm of big files (3GB)).
The list corruption is the more serious problem and the one we really need to address here. That may or may not be associated with the stalls you see. Have you seen the warnings about list corruption more than once? There's no need to force a vmcore at this point.
Ok, pulling down the kernel module that the list corruption was noticed on:

(gdb) list *(nfsd_cache_lookup+0x388)
0xc208 is in nfsd_cache_lookup (fs/nfsd/nfscache.c:478).

...and that's pretty much the same spot as in v3.11.8, so the lru_put_end call here is in the found_entry case. One possibility is that nfsd_cache_search returned an entry that was deleted. Hmm...there is one place where we do free entries without holding the lock. nfsd_reply_cache_shutdown() will do so, but that should only be called if we couldn't successfully plug in the module in the first place or if we're unplugging it. Neither should be possible with nfsd threads actually running.
(In reply to Jeff Layton from comment #8)
> The list corruption is the more serious problem and the one we really need
> to address here. That may or may not be associated with the stalls you see.
> Have you seen the warnings about list corruption more than once?
>
> There's no need to force a vmcore at this point.

I only got it one time, during a heavy load on the nfs server... that workload will return, but I'm not sure when; the group that creates it is MIA now, holidays!
Created attachment 831771 [details]
patch -- nfsd: when reusing an existing entry, unhash it first

Ok, I think this patch will probably fix it. Christoph Hellwig came up with a reproducer and I was able to track it down. I can't reproduce the bug with this patch, but I'm also hitting a deadlock of some sort in nfsd, so I can't completely verify that just yet. Still, I'm pretty confident that this is the bug that was causing the list corruption warnings. Hopefully this patch will make 3.13 and stable kernels.
Also, I don't think it's likely that this is directly related to the btrfs stalls you're seeing, but it may be that those stalls can make it more likely for this problem to occur.
Patch was merged into 3.13-rc4 and should make its way to stable soon.
*********** MASS BUG UPDATE **************

We apologize for the inconvenience. There is a large number of bugs to go through and several of them have gone stale. Due to this, we are doing a mass bug update across all of the Fedora 19 kernel bugs.

Fedora 19 has now been rebased to 3.12.6-200.fc19. Please test this kernel update (or newer) and let us know if your issue has been resolved or if it is still present with the newer kernel.

If you have moved on to Fedora 20, and are still experiencing this issue, please change the version to Fedora 20. If you experience different issues, please open a new bug report for those.
Should be fixed with 3.12.6 in F19 stable updates. Please reopen if you still see this.