Description of problem: NFS server crashes under heavy load Version-Release number of selected component (if applicable): kernel = 4.0.4-303.fc22.x86_64 How reproducible: Highly Steps to Reproduce: 1. Install fedora 22 on an nfs server and 20 or so clients sharing the same ethernet network (using ldap to coordinate user ids). 2. Server-side config: /etc/export = "/local 192.168.36.0/22(rw,no_root_squash,no_subtree_check,fsid=0)" 3. Client-side config: auto.master = "/net /etc/auto.nfs4" auto.nfs4 = "-fstype=nfs4,rw,nosuid,soft $key:/" 4. Hammer the nfs server. (run multiple jobs per client that read/write to the nfs share). Actual results: It works for awhile, but eventually crashes (see example journalctl log below). Short story: kernel traces referring to various nfsd4_* calls lead quickly to general protection faults and an unresponsive server. Clients pause until server is rebooted, and resume until it crashes again. Expected results: Maybe sluggish performance under heavy load, but no crash. Additional info: - The server and all 25 clients have identical hardware: """""""""""" CPU: Intel i7-4770K LGA1150 Memory: 32GB DETAILS? Power Supply: Antec Earthwatts 550W 80 Plus Platinum Case: Zalman ZM-T2 Drive Bay: StarTech 2.5in SATA removable hard drive bay for PCI expansion slot Motherboard: MSi Z87M-G43 """"""""""""" - All machines pass overnight memtest
(In reply to pablofelt from comment #0) > Actual results: > It works for awhile, but eventually crashes (see example journalctl log > below). Did you forget to attach the log? > Short story: kernel traces referring to various nfsd4_* calls lead quickly > to general protection faults and an unresponsive server. Clients pause until > server is rebooted, and resume until it crashes again. We need the log output, especially the first of those kernel traces, but if you can just attach the entire log that might be best.
Created attachment 1044437 [details] output of journalctl -b -1 > crashlog.log
First interesting I saw in the log: Jun 26 16:39:22 perplexity.aml.cs.byu.edu kernel: ------------[ cut here ]------------ Jun 26 16:39:22 perplexity.aml.cs.byu.edu kernel: WARNING: CPU: 0 PID: 1363 at fs/nfsd/nfs4state.c:3833 nfsd4_process_open2+0xcc1/0x1310 [nfsd]() Jun 26 16:39:22 perplexity.aml.cs.byu.edu kernel: Modules linked in: rpcsec_gss_krb5 nfsv4 dns_resolver nfs fscache ebtable_filter ebtables ip6table_filter ip6_tables bridge cfg80211 rfkill hwmon_vid snd_hda_codec_realtek intel_rapl iosf_mbi x86_pkg_temp_thermal coretemp kvm_intel snd_hda_codec_generic snd_hda_codec_hdmi kvm snd_hda_intel snd_hda_controller snd_hda_codec crct10dif_pclmul crc32_pclmul crc32c_intel snd_hwdep ghash_clmulni_intel snd_seq snd_seq_device snd_pcm snd_timer snd soundcore iTCO_wdt mei_me i2c_i801 iTCO_vendor_support ppdev serio_raw mei mxm_wmi lpc_ich mfd_core shpchp parport_pc parport tpm_infineon tpm_tis tpm wmi nfsd auth_rpcgss nfs_acl lockd grace sunrpc binfmt_misc uas usb_storage i915 i2c_algo_bit drm_kms_helper 8021q garp stp llc drm mrp r8169 mii video Jun 26 16:39:22 perplexity.aml.cs.byu.edu kernel: CPU: 0 PID: 1363 Comm: nfsd Not tainted 4.0.4-303.fc22.x86_64 #1 Jun 26 16:39:22 perplexity.aml.cs.byu.edu kernel: Hardware name: MSI MS-7823/Z87M-G43 (MS-7823), BIOS V2.4 09/18/2013 Jun 26 16:39:22 perplexity.aml.cs.byu.edu kernel: 0000000000000000 00000000fbdd6a35 ffff8800d280bba8 ffffffff81783124 Jun 26 16:39:22 perplexity.aml.cs.byu.edu kernel: 0000000000000000 0000000000000000 ffff8800d280bbe8 ffffffff8109c66a Jun 26 16:39:22 perplexity.aml.cs.byu.edu kernel: ffff8807d61383e0 ffff8800d4823bb8 ffff8807d61383e0 ffff8807b9c6c000 Jun 26 16:39:22 perplexity.aml.cs.byu.edu kernel: Call Trace: Jun 26 16:39:22 perplexity.aml.cs.byu.edu kernel: [<ffffffff81783124>] dump_stack+0x45/0x57 Jun 26 16:39:22 perplexity.aml.cs.byu.edu kernel: [<ffffffff8109c66a>] warn_slowpath_common+0x8a/0xc0 Jun 26 16:39:22 perplexity.aml.cs.byu.edu kernel: [<ffffffff8109c79a>] warn_slowpath_null+0x1a/0x20 Jun 26 16:39:22 perplexity.aml.cs.byu.edu kernel: [<ffffffffa029ccd1>] nfsd4_process_open2+0xcc1/0x1310 [nfsd] Jun 26 16:39:22 perplexity.aml.cs.byu.edu kernel: [<ffffffffa028a168>] nfsd4_open+0x558/0x880 [nfsd] Jun 26 16:39:22 perplexity.aml.cs.byu.edu kernel: [<ffffffffa028a867>] nfsd4_proc_compound+0x3d7/0x6f0 [nfsd] Jun 26 16:39:22 perplexity.aml.cs.byu.edu kernel: [<ffffffffa0276103>] nfsd_dispatch+0xc3/0x220 [nfsd] Jun 26 16:39:22 perplexity.aml.cs.byu.edu kernel: [<ffffffffa020e032>] ? svc_tcp_adjust_wspace+0x12/0x30 [sunrpc] Jun 26 16:39:22 perplexity.aml.cs.byu.edu kernel: [<ffffffffa020c0cb>] svc_process_common+0x43b/0x690 [sunrpc] Jun 26 16:39:22 perplexity.aml.cs.byu.edu kernel: [<ffffffffa020d493>] svc_process+0x103/0x1b0 [sunrpc] Jun 26 16:39:22 perplexity.aml.cs.byu.edu kernel: [<ffffffffa0275a57>] nfsd+0x117/0x190 [nfsd] Jun 26 16:39:22 perplexity.aml.cs.byu.edu kernel: [<ffffffffa0275940>] ? nfsd_destroy+0x90/0x90 [nfsd] Jun 26 16:39:22 perplexity.aml.cs.byu.edu kernel: [<ffffffff810bb678>] kthread+0xd8/0xf0 Jun 26 16:39:22 perplexity.aml.cs.byu.edu kernel: [<ffffffff810bb5a0>] ? kthread_worker_fn+0x180/0x180 Jun 26 16:39:22 perplexity.aml.cs.byu.edu kernel: [<ffffffff81789698>] ret_from_fork+0x58/0x90 Jun 26 16:39:22 perplexity.aml.cs.byu.edu kernel: [<ffffffff810bb5a0>] ? kthread_worker_fn+0x180/0x180 Jun 26 16:39:22 perplexity.aml.cs.byu.edu kernel: ---[ end trace f776cf2a7e8bdfd8 ]---
That's the WARN_ON_ONCE(1) in nfs4_setlease() in the case find_readable_file() returns NULL. Turning off delegations (with echo 0 >/proc/fs/sys/leases-enable) might work around the problem. (If so, that would be interesting to know.) It might also be interesting to try a more recent kernel, though a quick check of the git history isn't turning up an obvious fix.
> Turning off delegations (with echo 0 >/proc/fs/sys/leases-enable) might work > around the problem. (If so, that would be interesting to know.) I tried this (sudo echo 0 > /proc/sys/fs/leases-enable) with no effect. Server crashes with the same messages as before. Did I need to restart any services to make this take effect?
(In reply to Paul Felt from comment #5) > > Turning off delegations (with echo 0 >/proc/fs/sys/leases-enable) might work > > around the problem. (If so, that would be interesting to know.) > > I tried this (sudo echo 0 > /proc/sys/fs/leases-enable) with no effect. > Server crashes with the same messages as before. Did I need to restart any > services to make this take effect? I believe the server stops handling out new delegations the moment you do that, but old ones may still hang around. So the best test would be to disable leases before starting the server. But that looks like a bit of a long shot. I'll have to review the code more carefully....
Later there's also a list_debug warning in close. I have two other reports (in email) of similar warnings, but haven't been able to figure them out yet. Something appears to be corrupting the server's NFSv4 state.
For what it is worth (we haven't tested this very rigorously) the problem seems to emerge when machines are competing to access the same set of files.
This problem is still an issue using kernel 4.1.4-200.fc22.x86_64
Note Jeff Layton has a couple fixes that may address this: http://marc.info/?l=linux-nfs&m=144025516314054&w=2 Any testing results from kernels with those patches applied would be welcome.
Yes, I think those patches will probably help the oopses due to double puts of the hash references. There is still another problem that's not fixed -- we can end up allowing a nfsd to find a nfs4_file that is not fully set up yet. I think that's the cause of the WARN that popped in comment #3. I started looking at that, but I think there may be some other bugs in that area and it'd be good to fix those at the same time.
Both of the patches that comment #10 pointed to were backported to 4.1.9. Does the 4.1.10 update fix this issue?
*** Bug 1269390 has been marked as a duplicate of this bug. ***
Apologies for the delayed response. I tested again with kernel 4.2.3-200.fc22.x86_64 and was unable to trigger a system crash. I saw one warning (copied below), but it looks like the issue has been resolved, as far as I can tell. Thanks! Oct 19 08:38:07 perplexity.aml.cs.byu.edu kernel: ------------[ cut here ]------------ Oct 19 08:38:07 perplexity.aml.cs.byu.edu kernel: WARNING: CPU: 2 PID: 1431 at fs/nfsd/nfs4state.c:3955 nfsd4_process_open2+0xfe3/0x1420 [nfsd]() Oct 19 08:38:07 perplexity.aml.cs.byu.edu kernel: Modules linked in: bnep bluetooth fuse rpcsec_gss_krb5 nfsv4 dns_resolver nfs fscache ebtable_filter ebtables ip6table_filter Oct 19 08:38:07 perplexity.aml.cs.byu.edu kernel: CPU: 2 PID: 1431 Comm: nfsd Not tainted 4.2.3-200.fc22.x86_64 #1 Oct 19 08:38:07 perplexity.aml.cs.byu.edu kernel: Hardware name: MSI MS-7823/Z87M-G43 (MS-7823), BIOS V2.4 09/18/2013 Oct 19 08:38:07 perplexity.aml.cs.byu.edu kernel: 0000000000000000 0000000083fd5c61 ffff8807db113ba8 ffffffff8177220a Oct 19 08:38:07 perplexity.aml.cs.byu.edu kernel: 0000000000000000 0000000000000000 ffff8807db113be8 ffffffff8109e4a6 Oct 19 08:38:07 perplexity.aml.cs.byu.edu kernel: ffff8807db7f83e0 ffff8800cc50c240 ffff8807db7f83e0 ffff8800be9abea0 Oct 19 08:38:07 perplexity.aml.cs.byu.edu kernel: Call Trace: Oct 19 08:38:07 perplexity.aml.cs.byu.edu kernel: [<ffffffff8177220a>] dump_stack+0x45/0x57 Oct 19 08:38:07 perplexity.aml.cs.byu.edu kernel: [<ffffffff8109e4a6>] warn_slowpath_common+0x86/0xc0 Oct 19 08:38:07 perplexity.aml.cs.byu.edu kernel: [<ffffffff8109e5da>] warn_slowpath_null+0x1a/0x20 Oct 19 08:38:07 perplexity.aml.cs.byu.edu kernel: [<ffffffffa02a7b93>] nfsd4_process_open2+0xfe3/0x1420 [nfsd] Oct 19 08:38:07 perplexity.aml.cs.byu.edu kernel: [<ffffffffa0295cf8>] nfsd4_open+0x4f8/0x810 [nfsd] Oct 19 08:38:07 perplexity.aml.cs.byu.edu kernel: [<ffffffffa029639a>] nfsd4_proc_compound+0x38a/0x660 [nfsd] Oct 19 08:38:07 perplexity.aml.cs.byu.edu kernel: [<ffffffffa0282e90>] nfsd_dispatch+0xc0/0x200 [nfsd] Oct 19 08:38:07 perplexity.aml.cs.byu.edu kernel: [<ffffffffa0220a92>] ? svc_tcp_adjust_wspace+0x12/0x30 [sunrpc] Oct 19 08:38:07 perplexity.aml.cs.byu.edu kernel: [<ffffffffa021ec3c>] svc_process_common+0x40c/0x650 [sunrpc] Oct 19 08:38:07 perplexity.aml.cs.byu.edu kernel: [<ffffffffa021fd28>] svc_process+0xf8/0x190 [sunrpc] Oct 19 08:38:07 perplexity.aml.cs.byu.edu kernel: [<ffffffffa02828e3>] nfsd+0xf3/0x160 [nfsd] Oct 19 08:38:07 perplexity.aml.cs.byu.edu kernel: [<ffffffffa02827f0>] ? nfsd_destroy+0x70/0x70 [nfsd] Oct 19 08:38:07 perplexity.aml.cs.byu.edu kernel: [<ffffffff810bc8a8>] kthread+0xd8/0xf0 Oct 19 08:38:07 perplexity.aml.cs.byu.edu kernel: [<ffffffff810bc7d0>] ? kthread_worker_fn+0x160/0x160 Oct 19 08:38:07 perplexity.aml.cs.byu.edu kernel: [<ffffffff81778fdf>] ret_from_fork+0x3f/0x70 Oct 19 08:38:07 perplexity.aml.cs.byu.edu kernel: [<ffffffff810bc7d0>] ? kthread_worker_fn+0x160/0x160 Oct 19 08:38:07 perplexity.aml.cs.byu.edu kernel: ---[ end trace e47a34a527629a94 ]--- Oct 19 08:38:08 perplexity.aml.cs.byu.edu abrt-dump-journal-oops[852]: abrt-dump-journal-oops: Found oopses: 1 Oct 19 08:38:08 perplexity.aml.cs.byu.edu abrt-dump-journal-oops[852]: abrt-dump-journal-oops: Creating problem directories Oct 19 08:38:09 perplexity.aml.cs.byu.edu abrt-dump-journal-oops[852]: Reported 1 kernel oopses to Abrt Oct 19 08:38:10 perplexity.aml.cs.byu.edu abrt-server[4668]: Looking for kernel package Oct 19 08:38:10 perplexity.aml.cs.byu.edu abrt-server[4668]: Kernel package kernel-core-4.2.3-200.fc22.x86_64 found
Thanks, that's probably material for another bug. I see there was some speculation at http://lkml.kernel.org/r/20150728114933.6f917374@tlielax.poochiereds.net that this might be due to a race between two opens, in which case 35a92fe8770c "nfsd: serialize state seqid morphing operations" (not yet upstream) might help? Though I'm not certain. In any case, probably worth opening another bug and/or reporting upstream. (And if it were possible to also test my latest tree at git://linux-nfs.org/~bfields/linux.git nfsd-next that could also be interesting.)
Fedora 22 changed to end-of-life (EOL) status on 2016-07-19. Fedora 22 is no longer maintained, which means that it will not receive any further security or bug fix updates. As a result we are closing this bug. If you can reproduce this bug against a currently maintained version of Fedora please feel free to reopen this bug against that version. If you are unable to reopen this bug, please file a new report against the current release. If you experience problems, please add a comment to this bug. Thank you for reporting this bug and we are sorry it could not be fixed.