Bug 1236688 - NFS General Protection Fault under heavy load
Summary: NFS General Protection Fault under heavy load
Keywords:
Status: CLOSED EOL
Alias: None
Product: Fedora
Classification: Fedora
Component: kernel
Version: 22
Hardware: x86_64
OS: Linux
unspecified
high
Target Milestone: ---
Assignee: Kernel Maintainer List
QA Contact: Fedora Extras Quality Assurance
URL:
Whiteboard:
: 1269390 (view as bug list)
Depends On:
Blocks:
TreeView+ depends on / blocked
 
Reported: 2015-06-29 17:59 UTC by Paul Felt
Modified: 2016-07-19 19:13 UTC (History)
13 users (show)

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2016-07-19 19:13:36 UTC
Type: Bug
Embargoed:


Attachments (Terms of Use)
output of journalctl -b -1 > crashlog.log (385.08 KB, text/plain)
2015-06-29 18:36 UTC, Paul Felt
no flags Details

Description Paul Felt 2015-06-29 17:59:19 UTC
Description of problem:
NFS server crashes under heavy load 


Version-Release number of selected component (if applicable):
kernel = 4.0.4-303.fc22.x86_64


How reproducible:
Highly


Steps to Reproduce:
1. Install fedora 22 on an nfs server and 20 or so clients sharing the same ethernet network (using ldap to coordinate user ids). 

2. Server-side config:
/etc/export = "/local 192.168.36.0/22(rw,no_root_squash,no_subtree_check,fsid=0)"

3. Client-side config:
auto.master = "/net  /etc/auto.nfs4"
auto.nfs4 = "-fstype=nfs4,rw,nosuid,soft  $key:/"

4. Hammer the nfs server. (run multiple jobs per client that read/write to the nfs share).


Actual results:
It works for awhile, but eventually crashes (see example journalctl log below). 
Short story: kernel traces referring to various nfsd4_* calls lead quickly to general protection faults and an unresponsive server. Clients pause until server is rebooted, and resume until it crashes again.

Expected results:
Maybe sluggish performance under heavy load, but no crash.


Additional info:

- The server and all 25 clients have identical hardware:
""""""""""""
CPU: Intel i7-4770K  LGA1150 
Memory: 32GB                 DETAILS?
Power Supply: Antec Earthwatts 550W   80 Plus Platinum
Case: Zalman ZM-T2
Drive Bay: StarTech 2.5in SATA removable hard drive bay for PCI expansion slot
Motherboard: MSi Z87M-G43
"""""""""""""

- All machines pass overnight memtest

Comment 1 J. Bruce Fields 2015-06-29 18:34:26 UTC
(In reply to pablofelt from comment #0)
> Actual results:
> It works for awhile, but eventually crashes (see example journalctl log
> below).

Did you forget to attach the log?

> Short story: kernel traces referring to various nfsd4_* calls lead quickly
> to general protection faults and an unresponsive server. Clients pause until
> server is rebooted, and resume until it crashes again.

We need the log output, especially the first of those kernel traces, but if you can just attach the entire log that might be best.

Comment 2 Paul Felt 2015-06-29 18:36:56 UTC
Created attachment 1044437 [details]
output of journalctl -b -1 > crashlog.log

Comment 3 J. Bruce Fields 2015-06-29 21:38:14 UTC
First interesting I saw in the log:

Jun 26 16:39:22 perplexity.aml.cs.byu.edu kernel: ------------[ cut here ]------------
Jun 26 16:39:22 perplexity.aml.cs.byu.edu kernel: WARNING: CPU: 0 PID: 1363 at fs/nfsd/nfs4state.c:3833 nfsd4_process_open2+0xcc1/0x1310 [nfsd]()
Jun 26 16:39:22 perplexity.aml.cs.byu.edu kernel: Modules linked in: rpcsec_gss_krb5 nfsv4 dns_resolver nfs fscache ebtable_filter ebtables ip6table_filter ip6_tables bridge cfg80211 rfkill hwmon_vid snd_hda_codec_realtek intel_rapl iosf_mbi x86_pkg_temp_thermal coretemp kvm_intel snd_hda_codec_generic snd_hda_codec_hdmi kvm snd_hda_intel snd_hda_controller snd_hda_codec crct10dif_pclmul crc32_pclmul crc32c_intel snd_hwdep ghash_clmulni_intel snd_seq snd_seq_device snd_pcm snd_timer snd soundcore iTCO_wdt mei_me i2c_i801 iTCO_vendor_support ppdev serio_raw mei mxm_wmi lpc_ich mfd_core shpchp parport_pc parport tpm_infineon tpm_tis tpm wmi nfsd auth_rpcgss nfs_acl lockd grace sunrpc binfmt_misc uas usb_storage i915 i2c_algo_bit drm_kms_helper 8021q garp stp llc drm mrp r8169 mii video
Jun 26 16:39:22 perplexity.aml.cs.byu.edu kernel: CPU: 0 PID: 1363 Comm: nfsd Not tainted 4.0.4-303.fc22.x86_64 #1
Jun 26 16:39:22 perplexity.aml.cs.byu.edu kernel: Hardware name: MSI MS-7823/Z87M-G43 (MS-7823), BIOS V2.4 09/18/2013
Jun 26 16:39:22 perplexity.aml.cs.byu.edu kernel:  0000000000000000 00000000fbdd6a35 ffff8800d280bba8 ffffffff81783124
Jun 26 16:39:22 perplexity.aml.cs.byu.edu kernel:  0000000000000000 0000000000000000 ffff8800d280bbe8 ffffffff8109c66a
Jun 26 16:39:22 perplexity.aml.cs.byu.edu kernel:  ffff8807d61383e0 ffff8800d4823bb8 ffff8807d61383e0 ffff8807b9c6c000
Jun 26 16:39:22 perplexity.aml.cs.byu.edu kernel: Call Trace:
Jun 26 16:39:22 perplexity.aml.cs.byu.edu kernel:  [<ffffffff81783124>] dump_stack+0x45/0x57
Jun 26 16:39:22 perplexity.aml.cs.byu.edu kernel:  [<ffffffff8109c66a>] warn_slowpath_common+0x8a/0xc0
Jun 26 16:39:22 perplexity.aml.cs.byu.edu kernel:  [<ffffffff8109c79a>] warn_slowpath_null+0x1a/0x20
Jun 26 16:39:22 perplexity.aml.cs.byu.edu kernel:  [<ffffffffa029ccd1>] nfsd4_process_open2+0xcc1/0x1310 [nfsd]
Jun 26 16:39:22 perplexity.aml.cs.byu.edu kernel:  [<ffffffffa028a168>] nfsd4_open+0x558/0x880 [nfsd]
Jun 26 16:39:22 perplexity.aml.cs.byu.edu kernel:  [<ffffffffa028a867>] nfsd4_proc_compound+0x3d7/0x6f0 [nfsd]
Jun 26 16:39:22 perplexity.aml.cs.byu.edu kernel:  [<ffffffffa0276103>] nfsd_dispatch+0xc3/0x220 [nfsd]
Jun 26 16:39:22 perplexity.aml.cs.byu.edu kernel:  [<ffffffffa020e032>] ? svc_tcp_adjust_wspace+0x12/0x30 [sunrpc]
Jun 26 16:39:22 perplexity.aml.cs.byu.edu kernel:  [<ffffffffa020c0cb>] svc_process_common+0x43b/0x690 [sunrpc]
Jun 26 16:39:22 perplexity.aml.cs.byu.edu kernel:  [<ffffffffa020d493>] svc_process+0x103/0x1b0 [sunrpc]
Jun 26 16:39:22 perplexity.aml.cs.byu.edu kernel:  [<ffffffffa0275a57>] nfsd+0x117/0x190 [nfsd]
Jun 26 16:39:22 perplexity.aml.cs.byu.edu kernel:  [<ffffffffa0275940>] ? nfsd_destroy+0x90/0x90 [nfsd]
Jun 26 16:39:22 perplexity.aml.cs.byu.edu kernel:  [<ffffffff810bb678>] kthread+0xd8/0xf0
Jun 26 16:39:22 perplexity.aml.cs.byu.edu kernel:  [<ffffffff810bb5a0>] ? kthread_worker_fn+0x180/0x180
Jun 26 16:39:22 perplexity.aml.cs.byu.edu kernel:  [<ffffffff81789698>] ret_from_fork+0x58/0x90
Jun 26 16:39:22 perplexity.aml.cs.byu.edu kernel:  [<ffffffff810bb5a0>] ? kthread_worker_fn+0x180/0x180
Jun 26 16:39:22 perplexity.aml.cs.byu.edu kernel: ---[ end trace f776cf2a7e8bdfd8 ]---

Comment 4 J. Bruce Fields 2015-06-29 21:45:19 UTC
That's the WARN_ON_ONCE(1) in nfs4_setlease() in the case find_readable_file() returns NULL.

Turning off delegations (with echo 0 >/proc/fs/sys/leases-enable) might work around the problem.  (If so, that would be interesting to know.)

It might also be interesting to try a more recent kernel, though a quick check of the git history isn't turning up an obvious fix.

Comment 5 Paul Felt 2015-06-29 22:47:48 UTC
> Turning off delegations (with echo 0 >/proc/fs/sys/leases-enable) might work
> around the problem.  (If so, that would be interesting to know.)

I tried this (sudo echo 0 > /proc/sys/fs/leases-enable) with no effect. Server crashes with the same messages as before. Did I need to restart any services to make this take effect?

Comment 6 J. Bruce Fields 2015-06-30 21:25:57 UTC
(In reply to Paul Felt from comment #5)
> > Turning off delegations (with echo 0 >/proc/fs/sys/leases-enable) might work
> > around the problem.  (If so, that would be interesting to know.)
> 
> I tried this (sudo echo 0 > /proc/sys/fs/leases-enable) with no effect.
> Server crashes with the same messages as before. Did I need to restart any
> services to make this take effect?

I believe the server stops handling out new delegations the moment you do that, but old ones may still hang around.  So the best test would be to disable leases before starting the server.

But that looks like a bit of a long shot.  I'll have to review the code more carefully....

Comment 7 J. Bruce Fields 2015-07-01 18:19:45 UTC
Later there's also a list_debug warning in close.  I have two other reports (in email) of similar warnings, but haven't been able to figure them out yet.  Something appears to be corrupting the server's NFSv4 state.

Comment 8 Paul Felt 2015-07-02 20:47:29 UTC
For what it is worth (we haven't tested this very rigorously) the problem seems to emerge when machines are competing to access the same set of files.

Comment 9 Paul Felt 2015-08-20 19:12:28 UTC
This problem is still an issue using kernel 4.1.4-200.fc22.x86_64

Comment 10 J. Bruce Fields 2015-08-24 14:46:09 UTC
Note Jeff Layton has a couple fixes that may address this:

  http://marc.info/?l=linux-nfs&m=144025516314054&w=2

Any testing results from kernels with those patches applied would be welcome.

Comment 11 Jeff Layton 2015-08-24 15:43:28 UTC
Yes, I think those patches will probably help the oopses due to double puts of the hash references.

There is still another problem that's not fixed -- we can end up allowing a nfsd to find a nfs4_file that is not fully set up yet. I think that's the cause of the WARN that popped in comment #3. I started looking at that, but I think there may be some other bugs in that area and it'd be good to fix those at the same time.

Comment 12 Josh Boyer 2015-10-08 14:08:14 UTC
Both of the patches that comment #10 pointed to were backported to 4.1.9.  Does the 4.1.10 update fix this issue?

Comment 13 J. Bruce Fields 2015-10-14 15:40:16 UTC
*** Bug 1269390 has been marked as a duplicate of this bug. ***

Comment 14 Paul Felt 2015-10-19 15:37:49 UTC
Apologies for the delayed response. I tested again with kernel 4.2.3-200.fc22.x86_64 and was unable to trigger a system crash. I saw one warning (copied below), but it looks like the issue has been resolved, as far as I can tell. Thanks!


Oct 19 08:38:07 perplexity.aml.cs.byu.edu kernel: ------------[ cut here ]------------
Oct 19 08:38:07 perplexity.aml.cs.byu.edu kernel: WARNING: CPU: 2 PID: 1431 at fs/nfsd/nfs4state.c:3955 nfsd4_process_open2+0xfe3/0x1420 [nfsd]()
Oct 19 08:38:07 perplexity.aml.cs.byu.edu kernel: Modules linked in: bnep bluetooth fuse rpcsec_gss_krb5 nfsv4 dns_resolver nfs fscache ebtable_filter ebtables ip6table_filter
Oct 19 08:38:07 perplexity.aml.cs.byu.edu kernel: CPU: 2 PID: 1431 Comm: nfsd Not tainted 4.2.3-200.fc22.x86_64 #1
Oct 19 08:38:07 perplexity.aml.cs.byu.edu kernel: Hardware name: MSI MS-7823/Z87M-G43 (MS-7823), BIOS V2.4 09/18/2013
Oct 19 08:38:07 perplexity.aml.cs.byu.edu kernel:  0000000000000000 0000000083fd5c61 ffff8807db113ba8 ffffffff8177220a
Oct 19 08:38:07 perplexity.aml.cs.byu.edu kernel:  0000000000000000 0000000000000000 ffff8807db113be8 ffffffff8109e4a6
Oct 19 08:38:07 perplexity.aml.cs.byu.edu kernel:  ffff8807db7f83e0 ffff8800cc50c240 ffff8807db7f83e0 ffff8800be9abea0
Oct 19 08:38:07 perplexity.aml.cs.byu.edu kernel: Call Trace:
Oct 19 08:38:07 perplexity.aml.cs.byu.edu kernel:  [<ffffffff8177220a>] dump_stack+0x45/0x57
Oct 19 08:38:07 perplexity.aml.cs.byu.edu kernel:  [<ffffffff8109e4a6>] warn_slowpath_common+0x86/0xc0
Oct 19 08:38:07 perplexity.aml.cs.byu.edu kernel:  [<ffffffff8109e5da>] warn_slowpath_null+0x1a/0x20
Oct 19 08:38:07 perplexity.aml.cs.byu.edu kernel:  [<ffffffffa02a7b93>] nfsd4_process_open2+0xfe3/0x1420 [nfsd]
Oct 19 08:38:07 perplexity.aml.cs.byu.edu kernel:  [<ffffffffa0295cf8>] nfsd4_open+0x4f8/0x810 [nfsd]
Oct 19 08:38:07 perplexity.aml.cs.byu.edu kernel:  [<ffffffffa029639a>] nfsd4_proc_compound+0x38a/0x660 [nfsd]
Oct 19 08:38:07 perplexity.aml.cs.byu.edu kernel:  [<ffffffffa0282e90>] nfsd_dispatch+0xc0/0x200 [nfsd]
Oct 19 08:38:07 perplexity.aml.cs.byu.edu kernel:  [<ffffffffa0220a92>] ? svc_tcp_adjust_wspace+0x12/0x30 [sunrpc]
Oct 19 08:38:07 perplexity.aml.cs.byu.edu kernel:  [<ffffffffa021ec3c>] svc_process_common+0x40c/0x650 [sunrpc]
Oct 19 08:38:07 perplexity.aml.cs.byu.edu kernel:  [<ffffffffa021fd28>] svc_process+0xf8/0x190 [sunrpc]
Oct 19 08:38:07 perplexity.aml.cs.byu.edu kernel:  [<ffffffffa02828e3>] nfsd+0xf3/0x160 [nfsd]
Oct 19 08:38:07 perplexity.aml.cs.byu.edu kernel:  [<ffffffffa02827f0>] ? nfsd_destroy+0x70/0x70 [nfsd]
Oct 19 08:38:07 perplexity.aml.cs.byu.edu kernel:  [<ffffffff810bc8a8>] kthread+0xd8/0xf0
Oct 19 08:38:07 perplexity.aml.cs.byu.edu kernel:  [<ffffffff810bc7d0>] ? kthread_worker_fn+0x160/0x160
Oct 19 08:38:07 perplexity.aml.cs.byu.edu kernel:  [<ffffffff81778fdf>] ret_from_fork+0x3f/0x70
Oct 19 08:38:07 perplexity.aml.cs.byu.edu kernel:  [<ffffffff810bc7d0>] ? kthread_worker_fn+0x160/0x160
Oct 19 08:38:07 perplexity.aml.cs.byu.edu kernel: ---[ end trace e47a34a527629a94 ]---
Oct 19 08:38:08 perplexity.aml.cs.byu.edu abrt-dump-journal-oops[852]: abrt-dump-journal-oops: Found oopses: 1
Oct 19 08:38:08 perplexity.aml.cs.byu.edu abrt-dump-journal-oops[852]: abrt-dump-journal-oops: Creating problem directories
Oct 19 08:38:09 perplexity.aml.cs.byu.edu abrt-dump-journal-oops[852]: Reported 1 kernel oopses to Abrt
Oct 19 08:38:10 perplexity.aml.cs.byu.edu abrt-server[4668]: Looking for kernel package
Oct 19 08:38:10 perplexity.aml.cs.byu.edu abrt-server[4668]: Kernel package kernel-core-4.2.3-200.fc22.x86_64 found

Comment 15 J. Bruce Fields 2015-10-21 21:08:53 UTC
Thanks, that's probably material for another bug.  I see there was some speculation at http://lkml.kernel.org/r/20150728114933.6f917374@tlielax.poochiereds.net that this might be due to a race between two opens, in which case 35a92fe8770c "nfsd: serialize state seqid morphing operations" (not yet upstream) might help?  Though I'm not certain.

In any case, probably worth opening another bug and/or reporting upstream.  (And if it were possible to also test my latest tree at git://linux-nfs.org/~bfields/linux.git nfsd-next that could also be interesting.)

Comment 16 Fedora End Of Life 2016-07-19 19:13:36 UTC
Fedora 22 changed to end-of-life (EOL) status on 2016-07-19. Fedora 22 is
no longer maintained, which means that it will not receive any further
security or bug fix updates. As a result we are closing this bug.

If you can reproduce this bug against a currently maintained version of
Fedora please feel free to reopen this bug against that version. If you
are unable to reopen this bug, please file a new report against the
current release. If you experience problems, please add a comment to this
bug.

Thank you for reporting this bug and we are sorry it could not be fixed.


Note You need to log in before you can comment on or make changes to this bug.