Bug 1050206
| Summary: | BUG: kernel NULL pointer dereference at nfs41_assign_slot+0x3d [nfsv4] | | |
|---|---|---|---|
| Product: | [Fedora] Fedora | Reporter: | antokarag |
| Component: | kernel | Assignee: | Jeff Layton <jlayton> |
| Status: | CLOSED ERRATA | QA Contact: | Fedora Extras Quality Assurance <extras-qa> |
| Severity: | medium | Docs Contact: | |
| Priority: | unspecified | | |
| Version: | 19 | CC: | antokarag, gansalmon, itamar, jlayton, jonathan, kernel-maint, madhu.chinakonda, nfs-maint, rik.theys, rosario.esposito, sct, steved, trond.myklebust |
| Target Milestone: | --- | | |
| Target Release: | --- | | |
| Hardware: | x86_64 | | |
| OS: | Linux | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | Bug Fix |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | | |
| : | 1061707 | Environment: | |
| Last Closed: | 2014-04-07 18:32:31 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Bug Depends On: | | | |
| Bug Blocks: | 1061707 | | |
Description
antokarag
2014-01-08 20:46:52 UTC
Hi,
We're experiencing the same crashes. We use NFS4 mounts with sec=sys, so no Kerberos involved here.
Regards,
Rik

Hi,
I've experienced the same crash again on 3.12.7-300.fc20.x86_64 and have a kdump core dump available. How can I make sure the dump does not contain sensitive information before I upload it? I would like to make sure it doesn't contain SSH keys etc. Can 700MB dumps simply be attached to this bug report, or should I upload them somewhere else?
Regards,
Rik

Happened again today, many times, with kernel.x86_64 3.12.8-200.fc19. I was able to reproduce it every time by simply copying large files (over 150MB) from an external HDD to an NFS4 mount. Strange, but the file size appears to be of great significance. The "threshold" seems to be around 150MB: above it the kernel crashes on file copy; below it the copy completes without a problem.

Seeing exactly the same, multiple times per day, easily reproduced; 3.12.8-200.fc19.x86_64 on nfsv4.
Like c#3, it seems to occur only on large streaming writes to the nfs server.
Full fs opts: type nfs4 (rw,relatime,vers=4.0,rsize=131072,wsize=131072,namlen=255,hard,proto=tcp,port=0,timeo=600,retrans=2,sec=sys,clientaddr=x.x.x.x,local_lock=none,addr=y.y.y.y)
Happens with no other significant background activity or VM pressure present.
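
The reports above describe the trigger as a large streaming write (over roughly 150MB) to an NFSv4 mount. A minimal sketch of such a write is shown below, for anyone who wants a repeatable test without an external disk. The function name `stream_write` and the example path are hypothetical, the 128 KiB chunk size is only chosen to match the wsize=131072 mount option quoted in this thread, and nothing here is confirmed to reproduce the crash.

```python
import os


def stream_write(path, total_bytes, chunk_bytes=128 * 1024):
    """Write total_bytes of zero bytes to path in chunk_bytes pieces.

    Returns the number of bytes written. The data is fsync'ed before the
    file is closed so that it is actually pushed out to the server rather
    than sitting in the client's page cache.
    """
    chunk = b"\0" * chunk_bytes
    written = 0
    with open(path, "wb") as f:
        while written < total_bytes:
            n = min(chunk_bytes, total_bytes - written)
            f.write(chunk[:n])
            written += n
        f.flush()
        os.fsync(f.fileno())
    return written
```

For example, `stream_write("/mnt/nfs4/bigfile.bin", 200 * 1024 * 1024)` would write 200MB to a (hypothetical) NFSv4 mount point. Only run it on a machine you are prepared to have panic.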
vmcore backtrace (multiple identical examples):
[exception RIP: nfs41_assign_slot+61]
RIP: ffffffffa0fccc8d RSP: ffff880820a4dcb8 RFLAGS: 00010246
RAX: 00000001000bc3cd RBX: ffff880819457c98 RCX: ffff880819457c00
RDX: 0000000000000000 RSI: ffff8807cc622280 RDI: ffff8806f438fa00
RBP: ffff880820a4dcb8 R8: ffff88077c1422b0 R9: 0000000000000000
R10: dfe3c4f3fbc4bd40 R11: 00000000000005a8 R12: ffff8806f438fa00
R13: ffffffffa0fccc50 R14: ffff8807cc622280 R15: ffff880819457cb0
ORIG_RAX: ffffffffffffffff CS: 0010 SS: 0018
#9 [ffff880820a4dcc0] rpc_wake_up_first at ffffffffa0f02534 [sunrpc]
#10 [ffff880820a4dd00] nfs41_wake_and_assign_slot at ffffffffa0fcd402 [nfsv4]
#11 [ffff880820a4dd10] nfs40_sequence_done.isra.27 at ffffffffa0fabfcc [nfsv4]
#12 [ffff880820a4dd38] nfs4_sequence_done at ffffffffa0fac76c [nfsv4]
#13 [ffff880820a4dd48] nfs4_write_done at ffffffffa0fadf5e [nfsv4]
#14 [ffff880820a4dd68] nfs_writeback_done at ffffffffa0f856b8 [nfs]
#15 [ffff880820a4dd90] nfs_writeback_done_common at ffffffffa0f858be [nfs]
#16 [ffff880820a4dda0] rpc_exit_task at ffffffffa0f01328 [sunrpc]
#17 [ffff880820a4ddb8] __rpc_execute at ffffffffa0f01fc4 [sunrpc]
#18 [ffff880820a4de10] rpc_async_schedule at ffffffffa0f02366 [sunrpc]
#19 [ffff880820a4de28] process_one_work at ffffffff81083636
#20 [ffff880820a4de70] worker_thread at ffffffff8108426b
#21 [ffff880820a4ded0] kthread at ffffffff8108b110
#22 [ffff880820a4df50] ret_from_fork at ffffffff81675f3c
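
As a cross-check, the `+61` in the backtrace and the `+0x3d` in the bug summary are the same offset in decimal and hex. In this particular dump, R13 also happens to sit exactly 61 bytes below RIP, which would be consistent with R13 holding the entry address of nfs41_assign_slot (the callback handed to rpc_wake_up_first in frame #9). That reading of R13 is an inference from the numbers, not something stated in the report.

```python
# Values copied from the register dump above.
rip = 0xFFFFFFFFA0FCCC8D  # faulting instruction pointer
r13 = 0xFFFFFFFFA0FCCC50  # assumed (not confirmed) to be the function entry

offset = rip - r13
print(offset, hex(offset))  # prints: 61 0x3d
```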
Created attachment 858114 [details]
NFSv4: Fix memory corruption in nfs4_proc_open_confirm
This patch fixes an issue with the open() call that has potential to cause corruption in the NFSv4 slot table. Could you please see if that makes a difference?
Hi Trond,
Thanks --- it's too early to be sure but signs are encouraging with that patch. I've done a few GB of the download traffic that was reliably triggering the problem before, and pushed a bunch of synthetic write traffic at the server, with no issues so far. I haven't been using the client much for anything else today, though, so if the problem is a side effect of general client load then I may just not be pushing it hard enough. I'll be travelling for the next week and a half, though, and won't be able to test further in that time.
--Stephen

I am having the same problem with kernel panics when saving files over nfs.
Kernel: kernel-3.12.9-301.fc20.x86_64
nfs-utils-1.2.9-3.0.fc20.x86_64
The server is CentOS 6 x86_64
http://i57.tinypic.com/2v1ard0.png
The problem occurs when saving large files to an nfs share with gimp or transmission. I have three machines, all running Fedora 20, and they all have the same problem. I didn't see this with F19, but now it happens many times a day.
--Dan

Hi,
I've been running a patched 3.12.9 for a few days now and haven't had any lock-ups so far. So it looks like the patch fixes this issue.
Regards,
Rik

(In reply to Stephen Tweedie from comment #6)
> Thanks --- it's too early to be sure but signs are encouraging with that
> patch.
...
> I'll be travelling for the next week and a half, though, and won't be able
> to test further in that time.
Bad news --- I'm back, got my desktop up and running, and had the same error with the patched kernel within 30 minutes. Nothing in logs before the panic. I never saw this with 3.10. (Never used 3.11 in the same environment.) I'm going to retry with the -debug kernel.
Thanks,
Stephen

Created attachment 862294 [details]
NFSv4: Fix a slot leak in nfs40_sequence_done
You probably need this patch in addition to the previous one. It prevents
further slot leaks in nfs4_open_confirm_done() and nfs4_release_lockowner_done().
Both these patches should be trickling down through stable at this time.
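
For readers unfamiliar with the term, a "slot leak" in general means that a completion path finishes without handing its slot back, so the slot can never be reused. The toy model below illustrates only that general idea. It is not the kernel's slot table, does not model the list corruption or the NULL dereference in this bug, and all names in it (`ToySlotTable`, `run_request`) are made up for illustration.

```python
class ToySlotTable:
    """Toy fixed-size slot table (NOT the kernel data structure)."""

    def __init__(self, size):
        self.in_use = [False] * size

    def alloc(self):
        """Return the index of a free slot, or None if all are taken."""
        for i, used in enumerate(self.in_use):
            if not used:
                self.in_use[i] = True
                return i
        return None  # a real client would queue the request and wait

    def free(self, slot):
        self.in_use[slot] = False


def run_request(table, release_on_completion):
    """Simulate one request; return True if it obtained a slot."""
    slot = table.alloc()
    if slot is None:
        return False
    if release_on_completion:
        table.free(slot)  # the correct completion path
    # else: the completion path "forgets" the slot, i.e. a leak
    return True


leaky = ToySlotTable(4)
leaky_ok = sum(run_request(leaky, release_on_completion=False) for _ in range(10))

clean = ToySlotTable(4)
clean_ok = sum(run_request(clean, release_on_completion=True) for _ in range(10))

print(leaky_ok, clean_ok)  # prints: 4 10
```

With the leaking path, only the first four requests ever get a slot; with the releasing path, all ten do.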
The second patch seems to have fixed it: several days uptime now with no recurrence, thanks!
--Stephen

*** Bug 1055537 has been marked as a duplicate of this bug. ***

There was a Debian user still hitting a similar oops even with those two patches:
https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=734268
So we may need another follow-on fix...stay tuned.
*********** MASS BUG UPDATE **************
We apologize for the inconvenience. There is a large number of bugs to go through and several of them have gone stale. Due to this, we are doing a mass bug update across all of the Fedora 19 kernel bugs. Fedora 19 has now been rebased to 3.13.5-100.fc19. Please test this kernel update and let us know if your issue has been resolved or if it is still present with the newer kernel. If you experience different issues, please open a new bug report for those.

This one just triggered yesterday for me on 3.13.5-202.fc20.x86_64: problem is still there. Unfortunately I was also able to reproduce (albeit only once) with the two patches above applied against an earlier kernel; those patches certainly made the problem trigger more rarely but did not cure it.
Thanks,
Stephen

And again on 3.13.6-200.fc20.x86_64. Partial vmcore and logs are available, but the backtrace looks exactly like the one above in comment c#4, so I'm not sure how much more help another dump will be.

Hi Stephen,
Commit b7e63a1079b2 (NFSv4: Fix another nfs4_sequence corruptor) has only just been merged into 3.14-rc6, and should start trickling into the stable kernels soon. It addresses a second corruption instance of the same list. In the meantime, please see https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=b7e63a1079b2 if you'd like to apply the patch manually.

Hi Trond,
Thanks, I've got a local kernel patched, built and installed with that latest upstream patch, will pick it up in a reboot tonight and will see how it goes.
--Stephen

The last of these patches was merged into v3.13.7 so I think we can call this fixed.
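
Since the thread ends with the last patch landing in v3.13.7, a rough way to tell whether a given kernel release string is at or past that version is sketched below. The helper names are hypothetical. The check is only a heuristic: it compares the leading three-part version number and knows nothing about distribution backports or about fixes carried in other stable series such as 3.12.y.

```python
import re

# Upstream version named at the end of this thread.
FIXED_IN = (3, 13, 7)


def kernel_tuple(release):
    """Parse the leading 'X.Y.Z' of a release string such as `uname -r` output."""
    m = re.match(r"(\d+)\.(\d+)\.(\d+)", release)
    if m is None:
        raise ValueError("unrecognised kernel release: %r" % release)
    return tuple(int(part) for part in m.groups())


def at_or_past_fix(release):
    """True if the version number is >= 3.13.7 (heuristic, ignores backports)."""
    return kernel_tuple(release) >= FIXED_IN


for rel in ("3.12.8-200.fc19.x86_64",
            "3.13.5-202.fc20.x86_64",
            "3.13.6-200.fc20.x86_64"):
    print(rel, at_or_past_fix(rel))  # each of these prints False
```

The three release strings are the ones on which reporters in this thread still saw the crash.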