Created attachment 847339 [details]
Photo of kernel dump from monitor

Description of problem:
The crash appears to be random when working with NFSv4.1 mounts. It has happened numerous times on different workstations, so faulty hardware is unlikely. The setup is a CentOS 6.4 64-bit server providing the NFS exports and Fedora 19 KDE 64-bit clients mounting them with the ACL and Krb5 options (IPA server / IPA client setup). During reads/writes the system randomly crashes with the following (partial) kernel dump:

BUG: unable to handle kernel NULL pointer dereference at 0000000000000014
IP: [<ffffffffa0475c0d>] nfs41_assign_slot+0x3d/0x60 [nfsv4]
PGD 0
Oops: 0002 [#1] SMP
Modules linked in: cts rpcsec_gss_krb5 auth_rpcgss nfsv4 dns_resolver nfs lockd fscache ...
CPU: 1 PID: 6322 Comm: kworker/1:0 Not tainted 3.12.6-200.fc19.x86_64 #1
Hardware name: Hewlett-Packard HP Compaq dx2420 Microtower/2A78h, BIOS 5.18
Workqueue: rpciod rpc_async_schedule [sunrpc]
task: ...
RIP: ..
... ...
Call Trace:
[...] rpc_wake_up_first+0x64/0x1f0 [sunrpc]
[...] ? rpc_destroy_wait_queue+0x20/0x20 [sunrpc]
... ...
[...] ? insert_kthread_work+0x40/0x40
Code: ....
RIP [...] nfs41_assign_slot+0x3d/0x60 [nfsv4]
...
Kernel panic - not syncing: Fatal exception in interrupt
drm_kms_helper: panic occurred, switching back to text console

The full kernel dump is available as a photo at the URL below and attached as a JPG file:
http://oi39.tinypic.com/24xnam1.jpg
(sorry, no kdump as text available yet)

Version-Release number of selected component (if applicable):

How reproducible:
Mount NFSv4 with ACL / Krb5 and perform read/write operations on random files.

Steps to Reproduce:
1. Mount NFSv4 export(s)
2. Perform reads/writes on random files
3.

Actual results:
Kernel panic

Expected results:
Normal operation

Additional info:
This happens at random times on different workstations. Every time, though, the kernel dump shows almost the same information about the [nfsv4] module and rpc_wake_up_first [sunrpc].
Hi, We're experiencing the same crashes. We use NFS4 mounts with sec=sys, so no Kerberos is involved here. Regards, Rik
Hi, I've experienced the same crash again on 3.12.7-300.fc20.x86_64 and have a kdump core dump available. How can I make sure the dump does not contain sensitive information before I upload it? I would like to make sure it doesn't contain SSH keys etc. Can 700MB dumps simply be attached to this bug report, or should I upload them somewhere else? Regards, Rik
Happened again today, many times, with kernel.x86_64 3.12.8-200.fc19. I was able to reproduce it every time by simply copying large files (over 150MB) from an external HDD to an NFS4 mount. Strangely, the file size appears to be of great significance. The "threshold" seems to be around 150MB: above it, the kernel crashes during the file copy; below it, the copy completes without a problem.
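A minimal sketch of the kind of large streaming write described above, for anyone who wants a repeatable trigger. It writes zero-filled data instead of copying from an external HDD, so it only approximates the reported workload. The names stream_write and default_target, the NFS_TEST_DIR environment variable, and the file name are hypothetical and not taken from this report.

```python
import os
import tempfile

CHUNK = 1024 * 1024  # 1 MiB per write call


def default_target():
    """Return a test file path under NFS_TEST_DIR (hypothetical variable).

    Falls back to the local temp directory, which will NOT exercise NFS.
    """
    base = os.environ.get("NFS_TEST_DIR", tempfile.gettempdir())
    return os.path.join(base, "nfs-stream-test.bin")


def stream_write(path, total_mib, chunk=CHUNK):
    """Write total_mib chunks of zero bytes to path; return bytes written."""
    buf = bytes(chunk)
    written = 0
    with open(path, "wb") as f:
        for _ in range(total_mib):
            written += f.write(buf)
        f.flush()
        os.fsync(f.fileno())  # push the data to the server before closing
    return written
```

With NFS_TEST_DIR pointing at an NFSv4 mount, calling stream_write(default_target(), 100) and then stream_write(default_target(), 200) would cover one size below and one above the roughly 150MB threshold reported here. Run it only on a machine that is allowed to panic.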
Seeing exactly the same, multiple times per day, easily reproduced; 3.12.8-200.fc19.x86_64 on nfsv4. Like c#3, it seems to occur only on large streaming writes to the nfs server.

Full fs opts:
type nfs4 (rw,relatime,vers=4.0,rsize=131072,wsize=131072,namlen=255,hard,proto=tcp,port=0,timeo=600,retrans=2,sec=sys,clientaddr=x.x.x.x,local_lock=none,addr=y.y.y.y)

Happens with no other significant background activity or VM pressure present.

vmcore backtrace (multiple identical examples):

    [exception RIP: nfs41_assign_slot+61]
    RIP: ffffffffa0fccc8d  RSP: ffff880820a4dcb8  RFLAGS: 00010246
    RAX: 00000001000bc3cd  RBX: ffff880819457c98  RCX: ffff880819457c00
    RDX: 0000000000000000  RSI: ffff8807cc622280  RDI: ffff8806f438fa00
    RBP: ffff880820a4dcb8   R8: ffff88077c1422b0   R9: 0000000000000000
    R10: dfe3c4f3fbc4bd40  R11: 00000000000005a8  R12: ffff8806f438fa00
    R13: ffffffffa0fccc50  R14: ffff8807cc622280  R15: ffff880819457cb0
    ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
 #9 [ffff880820a4dcc0] rpc_wake_up_first at ffffffffa0f02534 [sunrpc]
#10 [ffff880820a4dd00] nfs41_wake_and_assign_slot at ffffffffa0fcd402 [nfsv4]
#11 [ffff880820a4dd10] nfs40_sequence_done.isra.27 at ffffffffa0fabfcc [nfsv4]
#12 [ffff880820a4dd38] nfs4_sequence_done at ffffffffa0fac76c [nfsv4]
#13 [ffff880820a4dd48] nfs4_write_done at ffffffffa0fadf5e [nfsv4]
#14 [ffff880820a4dd68] nfs_writeback_done at ffffffffa0f856b8 [nfs]
#15 [ffff880820a4dd90] nfs_writeback_done_common at ffffffffa0f858be [nfs]
#16 [ffff880820a4dda0] rpc_exit_task at ffffffffa0f01328 [sunrpc]
#17 [ffff880820a4ddb8] __rpc_execute at ffffffffa0f01fc4 [sunrpc]
#18 [ffff880820a4de10] rpc_async_schedule at ffffffffa0f02366 [sunrpc]
#19 [ffff880820a4de28] process_one_work at ffffffff81083636
#20 [ffff880820a4de70] worker_thread at ffffffff8108426b
#21 [ffff880820a4ded0] kthread at ffffffff8108b110
#22 [ffff880820a4df50] ret_from_fork at ffffffff81675f3c
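As a quick cross-check that this vmcore shows the same fault site as the photographed oops in the original report, the two location strings can be compared numerically. The oops prints the offset in hex (nfs41_assign_slot+0x3d/0x60) while the crash backtrace prints it in decimal (nfs41_assign_slot+61). The helper parse_location below is a hypothetical name used only for this illustration.

```python
import re

# Matches "symbol+OFFSET" with an optional "/SIZE", offset in hex or decimal.
_LOCATION = re.compile(r"(\w[\w.]*)\+(0x[0-9a-fA-F]+|\d+)(?:/(0x[0-9a-fA-F]+))?")


def parse_location(text):
    """Parse 'symbol+OFFSET[/SIZE]' into (symbol, offset, size_or_None)."""
    m = _LOCATION.fullmatch(text)
    if m is None:
        raise ValueError(f"unrecognised location: {text!r}")
    symbol, offset, size = m.groups()
    # int(..., 0) accepts both "0x3d" and "61"
    return symbol, int(offset, 0), (int(size, 0) if size else None)


oops = parse_location("nfs41_assign_slot+0x3d/0x60")  # original report
crash = parse_location("nfs41_assign_slot+61")        # vmcore backtrace
same_site = oops[:2] == crash[:2]                     # compare symbol and offset
print(same_site)
```

Both strings resolve to offset 61 (0x3d) in nfs41_assign_slot, so the two reports fault at the same instruction. The faulting address 0x14 in the original oops is also consistent with a field access at byte offset 20 through a NULL pointer, although nothing in this thread identifies which structure that is.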
Created attachment 858114 [details]
NFSv4: Fix memory corruption in nfs4_proc_open_confirm

This patch fixes an issue with the open() call that has the potential to cause corruption in the NFSv4 slot table. Could you please see if that makes a difference?
Hi Trond, Thanks --- it's too early to be sure but signs are encouraging with that patch. I've done a few GB of the download traffic that was reliably triggering the problem before, and pushed a bunch of synthetic write traffic at the server, with no issues so far. I haven't been using the client much for anything else today, though, so if the problem is a side effect of general client load then I may just not be pushing it hard enough. I'll be travelling for the next week and a half, though, and won't be able to test further in that time. --Stephen
I am having the same problem with kernel panics when saving files over NFS.

Kernel: kernel-3.12.9-301.fc20.x86_64
nfs-utils-1.2.9-3.0.fc20.x86_64
The server is CentOS 6 x86_64.

http://i57.tinypic.com/2v1ard0.png

The problem occurs when saving large files to an NFS share with gimp or transmission. I have three machines, all running Fedora 20, and they all have the same problem. I didn't see this with F19, but now it happens many times a day.

--Dan
Hi, I've been running a patched 3.12.9 for a few days now and haven't had any lock-ups so far. So it looks like the patch fixes this issue. Regards, Rik
(In reply to Stephen Tweedie from comment #6)
> Thanks --- it's too early to be sure but signs are encouraging with that
> patch.
...
> I'll be travelling for the next week and a half, though, and won't be able
> to test further in that time.

Bad news --- I'm back, got my desktop up and running, and had the same error with the patched kernel within 30 minutes. Nothing in logs before the panic.

I never saw this with 3.10. (Never used 3.11 in the same environment.)

I'm going to retry with the -debug kernel.

Thanks,
Stephen
Created attachment 862294 [details]
NFSv4: Fix a slot leak in nfs40_sequence_done

You probably need this patch in addition to the previous one. It prevents further slot leaks in nfs4_open_confirm_done() and nfs4_release_lockowner_done(). Both these patches should be trickling down through stable at this time.
The second patch seems to have fixed it: several days uptime now with no recurrence, thanks! --Stephen
*** Bug 1055537 has been marked as a duplicate of this bug. ***
There was a Debian user still hitting a similar oops even with those two patches: https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=734268 So we may need another follow-on fix... stay tuned.
*********** MASS BUG UPDATE ************** We apologize for the inconvenience. There is a large number of bugs to go through and several of them have gone stale. Due to this, we are doing a mass bug update across all of the Fedora 19 kernel bugs. Fedora 19 has now been rebased to 3.13.5-100.fc19. Please test this kernel update and let us know if your issue has been resolved or if it is still present with the newer kernel. If you experience different issues, please open a new bug report for those.
This one just triggered yesterday for me on 3.13.5-202.fc20.x86_64: problem is still there. Unfortunately I was also able to reproduce (albeit only once) with the two patches above applied against an earlier kernel; those patches certainly made the problem trigger more rarely but did not cure it. Thanks, Stephen
And again on 3.13.6-200.fc20.x86_64. Partial vmcore and logs are available, but the backtrace looks exactly like the one above in comment c#4, so I'm not sure how much more help another dump will be.
Hi Stephen, Commit b7e63a1079b2 (NFSv4: Fix another nfs4_sequence corruptor) has only just been merged into 3.14-rc6, and should start trickling into the stable kernels soon. It addresses a second corruption instance of the same list. In the meantime, please see https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=b7e63a1079b2 if you'd like to apply the patch manually.
Hi Trond, Thanks, I've got a local kernel patched, built and installed with that latest upstream patch, will pick it up in a reboot tonight and will see how it goes. --Stephen
The last of these patches was merged into v3.13.7 so I think we can call this fixed.
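For anyone checking a fleet of clients, a rough sketch of comparing a kernel release string against the v3.13.7 stable release mentioned above. The names FIXED_IN, upstream_version and has_stable_fix are hypothetical. The check looks only at the upstream version number, so it is an assumption-laden heuristic: a locally patched or distribution-backported kernel (such as the patched 3.12.9 reported earlier in this bug) will show as older even though it carries the fixes.

```python
import re

FIXED_IN = (3, 13, 7)  # stable release named in this bug as carrying the last patch


def upstream_version(release):
    """Extract (major, minor, patch) from a string like '3.13.6-200.fc20.x86_64'."""
    m = re.match(r"(\d+)\.(\d+)(?:\.(\d+))?", release)
    if m is None:
        raise ValueError(f"unrecognised kernel release: {release!r}")
    # A missing patch level (e.g. '3.14') is treated as 0.
    return tuple(int(g or 0) for g in m.groups())


def has_stable_fix(release):
    """True if the upstream version is at least 3.13.7 (ignores backports)."""
    return upstream_version(release) >= FIXED_IN
```

On a running client, has_stable_fix(platform.release()) from the standard platform module would give the answer for the booted kernel.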