Bug 1050206

Summary:

BUG: kernel NULL pointer dereference at nfs41_assign_slot+0x3d [nfsv4]

Product:

[Fedora] Fedora

Reporter:

antokarag

Component:

kernel

Assignee:

Jeff Layton <jlayton>

Status:

CLOSED ERRATA

QA Contact:

Fedora Extras Quality Assurance <extras-qa>

Severity:

medium

Docs Contact:

Priority:

unspecified

Version:

CC:

antokarag, gansalmon, itamar, jlayton, jonathan, kernel-maint, madhu.chinakonda, nfs-maint, rik.theys, rosario.esposito, sct, steved, trond.myklebust

Target Milestone:

---

Target Release:

---

Hardware:

x86_64

OS:

Linux

Whiteboard:

Fixed In Version:

Doc Type:

Bug Fix

Doc Text:

Story Points:

---

Clone Of:

Clones:

1061707

Environment:

Last Closed:

2014-04-07 18:32:31 UTC

Type:

Bug

Bug Depends On:

Bug Blocks:

1061707

Attachments:

Description	Flags
Photo of kernel dump from monitor	none
NFSv4: Fix memory corruption in nfs4_proc_open_confirm	none
NFSv4: Fix a slot leak in nfs40_sequence_done	none

Description antokarag 2014-01-08 20:46:52 UTC

Created attachment 847339
Photo of kernel dump from monitor

Description of problem:

The crash appears to occur at random when working with NFSv4.1 mounts.
It has happened numerous times on different workstations, so faulty hardware is unlikely.

The current setup is a CentOS 6.4 64-bit server providing the NFS exports and Fedora 19 KDE 64-bit clients mounting them with the ACL and Krb5 options (IPA server / IPA client setup).
During reads and writes the system randomly crashes with the following (partial) kernel dump:

BUG: unable to handle kernel NULL pointer dereference at 0000000000000014
IP: [<ffffffffa0475c0d>] nfs41_assign_slot+0x3d/0x60 [nfsv4]
PGD 0
Oops: 0002 [#1] SMP
Modules linked in: cts rpcsec_gss_krb5 auth_rpcgss nfsv4 dns_resolver nfs lockd fscache ...
CPU: 1 PID: 6322 Comm: kworker/1:0 Not tainted 3.12.6-200.fc19.x86_64 #1
Hardware name: Hewlett-Packard HP Compaq dx2420 Microtower/2A78h, BIOS 5.18
Workqueue: rpciod rpc_async_schedule [sunrpc]
task: ...
RIP: ..
...
...
Call Trace:
[...] rpc_wake_up_first+0x64/0x1f0 [sunrpc]
[...] ? rpc_destroy_wait_queue+0x20/0x20 [sunrpc]
...
...
[...] ? insert_kthread_work+0x40/0x40
Code: ....
RIP [...] nfs41_assign_slot+0x3d/0x60 [nfsv4]
...
Kernel panic - not syncing: Fatal exception in interrupt
drm_kms_helper: panic occurred, switching back to text console


The full kernel dump is available as a photo at the URL below and as the attached JPG file:
http://oi39.tinypic.com/24xnam1.jpg
(sorry, no kdump as text available yet)

Version-Release number of selected component (if applicable):


How reproducible:
Mount NFSv4 with ACL / Krb5 and perform read/write operations on random files

Steps to Reproduce:
1. Mount NFSv4 export(s)
2. Perform Read / Writes on random files
3.

Actual results:
Kernel panic


Expected results:
Normal operation


Additional info:
This happens at random times on different workstations.
Every time, though, the kernel dump shows almost the same information: nfs41_assign_slot [nfsv4] and rpc_wake_up_first [sunrpc].
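Editorial note: the two numbers in the first lines of the dump can be read mechanically. The sketch below decodes the error code printed on the "Oops:" line using the standard x86 page-fault error-code bits, and converts the faulting address to decimal. The function name decode_oops_code is made up for this illustration; it is not a kernel or tool API.

```python
# Decode the x86 page-fault error code that follows "Oops:" in a kernel log.
# Each entry: (bit mask, meaning when set, meaning when clear or None).
PF_BITS = [
    (0x01, "protection violation", "page not present"),
    (0x02, "write access", "read access"),
    (0x04, "user mode", "kernel mode"),
    (0x08, "reserved bit set", None),
    (0x10, "instruction fetch", None),
]


def decode_oops_code(code_hex):
    """Return a list of human-readable flags for a hex error code string."""
    code = int(code_hex, 16)
    parts = []
    for mask, if_set, if_clear in PF_BITS:
        if code & mask:
            parts.append(if_set)
        elif if_clear is not None:
            parts.append(if_clear)
    return parts


# Values taken from the dump in this report.
print(decode_oops_code("0002"))
print(int("0000000000000014", 16))
```

Read this way, code 0002 is a kernel-mode write to a page that is not present, and the address 0x14 is 20 bytes past NULL. That pattern is what a store to a structure field through a NULL pointer would produce, but the dump alone does not say which pointer was NULL.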

Comment 1 Rik Theys 2014-01-15 08:19:37 UTC

Hi,

We're experiencing the same crashes. We use NFS4 mounts with sec=sys, so no Kerberos is involved here.

Regards,

Rik

Comment 2 Rik Theys 2014-01-21 14:38:59 UTC

Hi,

I've experienced the same crash again on 3.12.7-300.fc20.x86_64 and have a kdump core dump available.

How can I make sure the dump does not contain sensitive information before I upload it? I would like to make sure it doesn't contain SSH keys etc.

Can 700MB dumps simply be attached to this bug report, or should I upload them somewhere else?

Regards,

Rik

Comment 3 antokarag 2014-01-21 19:39:37 UTC

Happened again today many times with kernel.x86_64 3.12.8-200.fc19.

I was able to reproduce it every time by simply copying large files (over 150MB) from an external HDD to an NFS4 mount.

Strangely, the file size appears to be of great significance.
The "threshold" seems to be around 150MB: above it the kernel crashes during the file copy; below it the copy completes without a problem.
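Editorial note: the reproduction described in this comment (copy a file larger than about 150MB onto the NFS mount) can be scripted roughly as follows. This is only a sketch under assumptions: the original reports copied real files from an external HDD, while this version writes a zero-filled test file, and the helper names and paths are hypothetical. It should only be run on a disposable test machine, since on an affected kernel the expected outcome is a panic.

```python
import os

CHUNK = 1 << 20  # 1 MiB per read/write call


def make_file(path, size_bytes, chunk=CHUNK):
    """Create a zero-filled file of exactly size_bytes at path."""
    block = b"\0" * chunk
    remaining = size_bytes
    with open(path, "wb") as out:
        while remaining > 0:
            n = min(chunk, remaining)
            out.write(block[:n])
            remaining -= n
    return path


def copy_to_dir(src, dest_dir, chunk=CHUNK):
    """Stream src into dest_dir, force it to storage, return the copied size."""
    dst = os.path.join(dest_dir, os.path.basename(src))
    with open(src, "rb") as inp, open(dst, "wb") as out:
        while True:
            data = inp.read(chunk)
            if not data:
                break
            out.write(data)
        out.flush()
        os.fsync(out.fileno())  # push the data to the server, not just the page cache
    return os.path.getsize(dst)


# Example use (paths are placeholders, not taken from the report):
#   src = make_file("/tmp/repro-200M.bin", 200 * 1024 * 1024)
#   copy_to_dir(src, "/mnt/nfs4-test")
```

Running it with sizes on either side of 150MB would show whether the threshold described above holds on a given client.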

Comment 4 Stephen Tweedie 2014-02-01 11:51:55 UTC

Seeing exactly the same, multiple times per day, easily reproduced; 3.12.8-200.fc19.x86_64 on nfsv4.

Like c#3, it seems to occur only on large streaming writes to the nfs server.

Full fs opts: type nfs4 (rw,relatime,vers=4.0,rsize=131072,wsize=131072,namlen=255,hard,proto=tcp,port=0,timeo=600,retrans=2,sec=sys,clientaddr=x.x.x.x,local_lock=none,addr=y.y.y.y)

Happens with no other significant background activity or VM pressure present.
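Editorial note: when comparing affected clients it can help to turn the option string quoted in this comment into a dictionary, so that values such as vers, sec and wsize can be checked side by side. A minimal sketch follows; parse_mount_opts is a name chosen for this illustration.

```python
def parse_mount_opts(opts):
    """Split a comma-separated mount option string into a dict.

    Options of the form key=value map to their string value;
    bare flags such as "rw" map to True.
    """
    result = {}
    for item in opts.split(","):
        key, sep, value = item.partition("=")
        result[key] = value if sep else True
    return result


# Option string as quoted in this comment (addresses were already masked).
opts = parse_mount_opts(
    "rw,relatime,vers=4.0,rsize=131072,wsize=131072,namlen=255,hard,"
    "proto=tcp,port=0,timeo=600,retrans=2,sec=sys,clientaddr=x.x.x.x,"
    "local_lock=none,addr=y.y.y.y"
)
print(opts["vers"], opts["sec"], int(opts["wsize"]))
```

For this client the parsed values show an NFSv4.0 mount with sec=sys, which matches comment 1 in that Kerberos is not required to hit the crash.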

vmcore backtrace (multiple identical examples):

    [exception RIP: nfs41_assign_slot+61]
    RIP: ffffffffa0fccc8d  RSP: ffff880820a4dcb8  RFLAGS: 00010246
    RAX: 00000001000bc3cd  RBX: ffff880819457c98  RCX: ffff880819457c00
    RDX: 0000000000000000  RSI: ffff8807cc622280  RDI: ffff8806f438fa00
    RBP: ffff880820a4dcb8   R8: ffff88077c1422b0   R9: 0000000000000000
    R10: dfe3c4f3fbc4bd40  R11: 00000000000005a8  R12: ffff8806f438fa00
    R13: ffffffffa0fccc50  R14: ffff8807cc622280  R15: ffff880819457cb0
    ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
 #9 [ffff880820a4dcc0] rpc_wake_up_first at ffffffffa0f02534 [sunrpc]
#10 [ffff880820a4dd00] nfs41_wake_and_assign_slot at ffffffffa0fcd402 [nfsv4]
#11 [ffff880820a4dd10] nfs40_sequence_done.isra.27 at ffffffffa0fabfcc [nfsv4]
#12 [ffff880820a4dd38] nfs4_sequence_done at ffffffffa0fac76c [nfsv4]
#13 [ffff880820a4dd48] nfs4_write_done at ffffffffa0fadf5e [nfsv4]
#14 [ffff880820a4dd68] nfs_writeback_done at ffffffffa0f856b8 [nfs]
#15 [ffff880820a4dd90] nfs_writeback_done_common at ffffffffa0f858be [nfs]
#16 [ffff880820a4dda0] rpc_exit_task at ffffffffa0f01328 [sunrpc]
#17 [ffff880820a4ddb8] __rpc_execute at ffffffffa0f01fc4 [sunrpc]
#18 [ffff880820a4de10] rpc_async_schedule at ffffffffa0f02366 [sunrpc]
#19 [ffff880820a4de28] process_one_work at ffffffff81083636
#20 [ffff880820a4de70] worker_thread at ffffffff8108426b
#21 [ffff880820a4ded0] kthread at ffffffff8108b110
#22 [ffff880820a4df50] ret_from_fork at ffffffff81675f3c
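Editorial note: the crash utility prints the offset into the function in decimal (nfs41_assign_slot+61), whereas the console oops in the description prints it in hex (nfs41_assign_slot+0x3d/0x60). The sketch below normalizes both notations to show that they name the same instruction. The helper parse_symbol_offset is hypothetical and handles only these two formats.

```python
import re

# Matches "symbol+0x3d" or "symbol+61"; a trailing "/0x60" size is ignored.
SYMBOL_RE = re.compile(r"([A-Za-z_][\w.]*)\+(0x[0-9a-fA-F]+|\d+)")


def parse_symbol_offset(text):
    """Return (symbol, offset) from a backtrace line, or None if absent."""
    m = SYMBOL_RE.search(text)
    if m is None:
        return None
    # Base 0 lets int() accept both "0x3d" and "61".
    return m.group(1), int(m.group(2), 0)


oops_line = "IP: [<ffffffffa0475c0d>] nfs41_assign_slot+0x3d/0x60 [nfsv4]"
crash_line = "[exception RIP: nfs41_assign_slot+61]"
print(parse_symbol_offset(oops_line) == parse_symbol_offset(crash_line))
```

Both lines resolve to offset 61 (0x3d) in nfs41_assign_slot, so the two reports fault at the same place even though the module load addresses differ.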

Comment 5 Trond Myklebust 2014-02-01 20:17:06 UTC

Created attachment 858114
NFSv4: Fix memory corruption in nfs4_proc_open_confirm

This patch fixes an issue with the open() call that has potential to cause corruption in the NFSv4 slot table. Could you please see if that makes a difference?

Comment 6 Stephen Tweedie 2014-02-02 17:57:32 UTC

Hi Trond,

Thanks --- it's too early to be sure but signs are encouraging with that patch.  I've done a few GB of the download traffic that was reliably triggering the problem before, and pushed a bunch of synthetic write traffic at the server, with no issues so far.

I haven't been using the client much for anything else today, though, so if the problem is a side effect of general client load then I may just not be pushing it hard enough.

I'll be travelling for the next week and a half, though, and won't be able to test further in that time.

--Stephen

Comment 7 Dan Naughton 2014-02-06 00:57:49 UTC

I am having the same problem with kernel panics when saving files over nfs. 

Kernel: kernel-3.12.9-301.fc20.x86_64
nfs-utils-1.2.9-3.0.fc20.x86_64

The server is CentOS 6 x86_64 

http://i57.tinypic.com/2v1ard0.png

The problem occurs when saving large files to an NFS share with GIMP or Transmission.

I have three machines, all running Fedora 20, and they all have the same problem.  I didn't see this with F19, but now it happens many times a day.

--Dan

Comment 8 Rik Theys 2014-02-10 14:58:26 UTC

Hi,

I've been running a patched 3.12.9 for a few days now and haven't had any lock-ups so far. So it looks like the patch fixes this issue.

Regards,

Rik

Comment 9 Stephen Tweedie 2014-02-12 09:02:20 UTC

(In reply to Stephen Tweedie from comment #6)
> Thanks --- it's too early to be sure but signs are encouraging with that
> patch.
...
> I'll be travelling for the next week and a half, though, and won't be able
> to test further in that time.

Bad news --- I'm back, got my desktop up and running, and had the same error with the patched kernel within 30 minutes.  Nothing in logs before the panic.

I never saw this with 3.10.  (Never used 3.11 in the same environment.)

I'm going to retry with the -debug kernel.

Thanks,
 Stephen

Comment 10 Trond Myklebust 2014-02-12 13:34:06 UTC

Created attachment 862294
NFSv4: Fix a slot leak in nfs40_sequence_done

You probably need this patch in addition to the previous one. It prevents
further slot leaks in nfs4_open_confirm_done() and nfs4_release_lockowner_done().

Both these patches should be trickling down through stable at this time.
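Editorial note: for readers unfamiliar with the term, a "slot leak" means a request takes a slot from a fixed-size table and a completion path fails to give it back. The toy model below illustrates only that general idea. It is not the kernel's slot table code, and the class and function names are invented for the example.

```python
class SlotTable:
    """Toy fixed-size slot table: hands out the lowest free slot id."""

    def __init__(self, max_slots):
        self.max_slots = max_slots
        self.used = set()

    def alloc(self):
        for slot_id in range(self.max_slots):
            if slot_id not in self.used:
                self.used.add(slot_id)
                return slot_id
        return None  # table exhausted

    def free(self, slot_id):
        self.used.discard(slot_id)


def run_requests(table, count, release_on_done):
    """Issue up to count requests; return how many obtained a slot."""
    completed = 0
    for _ in range(count):
        slot = table.alloc()
        if slot is None:
            break  # every slot is held, nothing more can be sent
        completed += 1
        if release_on_done:
            table.free(slot)  # correct completion path returns the slot
        # a leaking completion path simply never calls free()
    return completed


print(run_requests(SlotTable(4), 10, release_on_done=True))
print(run_requests(SlotTable(4), 10, release_on_done=False))
```

With correct release all ten requests get a slot; with the leaking path only the first four do. How a leak in the real code interacts with the crash in nfs41_assign_slot is not spelled out in this report.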

Comment 11 Stephen Tweedie 2014-02-17 11:18:42 UTC

The second patch seems to have fixed it: several days uptime now with no recurrence, thanks!

--Stephen

Comment 12 Jeff Layton 2014-02-26 16:23:49 UTC

*** Bug 1055537 has been marked as a duplicate of this bug. ***

Comment 13 Jeff Layton 2014-02-26 19:22:40 UTC

There was a Debian user still hitting a similar oops even with those two patches:

    https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=734268

So we may need another follow-on fix...stay tuned.

Comment 14 Justin M. Forbes 2014-03-10 14:47:41 UTC

*********** MASS BUG UPDATE **************

We apologize for the inconvenience.  There is a large number of bugs to go through and several of them have gone stale.  Due to this, we are doing a mass bug update across all of the Fedora 19 kernel bugs.

Fedora 19 has now been rebased to 3.13.5-100.fc19.  Please test this kernel update and let us know if your issue has been resolved or if it is still present with the newer kernel.

If you experience different issues, please open a new bug report for those.

Comment 15 Stephen Tweedie 2014-03-10 14:59:33 UTC

This one just triggered yesterday for me on 3.13.5-202.fc20.x86_64: problem is still there.

Unfortunately I was also able to reproduce (albeit only once) with the two patches above applied against an earlier kernel; those patches certainly made the problem trigger more rarely but did not cure it.

Thanks,
 Stephen

Comment 16 Stephen Tweedie 2014-03-12 12:53:23 UTC

And again on 3.13.6-200.fc20.x86_64.  Partial vmcore and logs are available, but the backtrace looks exactly like the one above in c#4, so I'm not sure how much more help another dump will be.

Comment 17 Trond Myklebust 2014-03-12 13:50:57 UTC

Hi Stephen,

Commit b7e63a1079b2 (NFSv4: Fix another nfs4_sequence corruptor) has only just been merged into 3.14-rc6, and should start trickling into the stable kernels soon. It addresses a second corruption instance of the same list.

In the meantime, please see https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=b7e63a1079b2 if you'd like to apply the patch manually.

Comment 18 Stephen Tweedie 2014-03-12 16:07:39 UTC

Hi Trond,

Thanks. I've got a local kernel patched, built and installed with that latest upstream patch; I'll pick it up in a reboot tonight and see how it goes.

--Stephen

Comment 19 Jeff Layton 2014-04-07 18:32:31 UTC

The last of these patches was merged into v3.13.7 so I think we can call this fixed.
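Editorial note: a quick way to check whether a given kernel release string is at or past the v3.13.7 mentioned here is to compare the leading version numbers. The sketch below assumes release strings shaped like those quoted in this bug (for example 3.13.6-200.fc20.x86_64). It only compares upstream version numbers, so it cannot tell whether an older stable or distribution kernel carries the patches as backports.

```python
import re

FIXED_IN = (3, 13, 7)  # upstream version named in this comment


def kernel_version(release):
    """Extract (major, minor, patch) from a kernel release string."""
    m = re.match(r"(\d+)\.(\d+)(?:\.(\d+))?", release)
    if m is None:
        raise ValueError("unrecognized kernel release: %r" % release)
    # A missing patch level (e.g. "3.14") is treated as 0.
    return tuple(int(g or 0) for g in m.groups())


def has_upstream_fix(release):
    """True if the version number is at or above FIXED_IN."""
    return kernel_version(release) >= FIXED_IN


# Release strings reported in this bug.
for rel in ("3.12.6-200.fc19.x86_64", "3.13.5-202.fc20.x86_64",
            "3.13.6-200.fc20.x86_64"):
    print(rel, has_upstream_fix(rel))
```

All three releases reported as crashing in this bug fall below the threshold. On a running Linux system the string to test can be obtained from `uname -r` or Python's platform.release().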