Bug 1055537
Summary: | kernel BUG at fs/nfs/nfs4xdr.c:908! | ||
---|---|---|---|
Product: | [Fedora] Fedora | Reporter: | Rosario Esposito <rosario.esposito> |
Component: | kernel | Assignee: | Jeff Layton <jlayton> |
Status: | CLOSED DUPLICATE | QA Contact: | Fedora Extras Quality Assurance <extras-qa> |
Severity: | urgent | Docs Contact: | |
Priority: | unspecified | ||
Version: | 19 | CC: | gansalmon, itamar, jlayton, jonathan, kernel-maint, madhu.chinakonda, nfs-maint, reinier, rosario.esposito, steved, trond.myklebust |
Target Milestone: | --- | ||
Target Release: | --- | ||
Hardware: | x86_64 | ||
OS: | Linux | ||
Whiteboard: | |||
Fixed In Version: | Doc Type: | Bug Fix | |
Doc Text: | Story Points: | --- | |
Clone Of: | Environment: | ||
Last Closed: | 2014-02-26 16:23:49 UTC | Type: | Bug |
Regression: | --- | Mount Type: | --- |
Documentation: | --- | CRM: | |
Verified Versions: | Category: | --- | |
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
Cloudforms Team: | --- | Target Upstream Version: | |
Embargoed: | |||
Attachments: |
That's the BUG_ON in this function: static __be32 *reserve_space(struct xdr_stream *xdr, size_t nbytes) { __be32 *p = xdr_reserve_space(xdr, nbytes); BUG_ON(!p); return p; } ...a full backtrace would be helpful though. Without knowing the call chain that caused this we probably won't make any progress on it. The stuff in the abrt report is more about a problem with a network driver. That may or may not be related (probably not though). Unfortunately I don't have the full backtrace as Kdump was not enabled at that time. Is there any other log files I could check ? By the way yesterday, after the last crash, I switched the NFS mount from version 4 to version 3 and I got no kernel panics since then. I had a similar panic, on Debian's linux 3.12.6-2~bpo70+1, just now. We're running NFSv4 intensively. I happen to have the kdump traces: kernel BUG at /build/linux-Ol1uYP/linux-3.12.6/fs/nfs/nfs4xdr.c:908! Workqueue: rpciod rpc_async_schedule [sunrpc] Call Trace: [<ffffffffa05d18fb>] ? encode_string+0x1b/0x40 [nfsv4] [<ffffffffa05d36f0>] ? nfs4_xdr_enc_open_confirm+0x70/0xd0 [nfsv4] [<ffffffffa04b0429>] ? rpcauth_wrap_req+0xa9/0xe0 [sunrpc] [<ffffffffa04a4210>] ? call_connect+0xe0/0xe0 [sunrpc] [<ffffffffa04a4385>] ? call_transmit+0x175/0x2c0 [sunrpc] [<ffffffffa04adf4e>] ? __rpc_execute+0x8e/0x430 [sunrpc] [<ffffffffa04a4210>] ? call_connect+0xe0/0xe0 [sunrpc] [<ffffffffa04ae30d>] ? rpc_async_schedule+0x1d/0x30 [sunrpc] [<ffffffff8107a7a7>] ? process_one_work+0x157/0x450 [<ffffffff8107bc24>] ? worker_thread+0x114/0x370 [<ffffffff8107bb10>] ? manage_workers.isra.21+0x2d0/0x2d0 [<ffffffff810822d3>] ? kthread+0xb3/0xc0 [<ffffffff81082220>] ? flush_kthread_worker+0xa0/0xa0 [<ffffffff814cc40c>] ? ret_from_fork+0x7c/0xb0 [<ffffffff81082220>] ? flush_kthread_worker+0xa0/0xa0 I hope this helps finding the source of the problem, let me know if additional information regarding the crash is required. We also were running NFSv4 intensively. Yet no kernel panics since I've switched to NFSv3. Makes sense since the panic in NFSv4 code and NFSv3 doesn't travel that codepath. So what happened is that we hit this case in xdr_reserve_space(): if (unlikely(q > xdr->end || q < p)) return NULL; ...basically, that means that we don't have enough room in the xdr buffer to encode the reply or we somehow got a negative offset. It would be interesting to figure out what line of code this address refers to 'nfs4_xdr_enc_open_confirm+0x70'. There are some instructions here though you'll need to substitute the nfsv4.ko for cifs.ko in them: https://wiki.samba.org/index.php/LinuxCIFS_troubleshooting#Oopses In my case: [root@clueet ~]# grep nfs4_xdr_enc_open_confirm /tmp/oops-2014-01-20-13\:45\:53-32607-0/dmesg [ 5989.901230] [<ffffffffa04f0181>] nfs4_xdr_enc_open_confirm+0x61/0xc0 [nfsv4] [ 5993.028997] [<ffffffffa04f0181>] nfs4_xdr_enc_open_confirm+0x61/0xc0 [nfsv4] [ 6015.784755] [<ffffffffa04f0181>] nfs4_xdr_enc_open_confirm+0x61/0xc0 [nfsv4] [ 6053.074343] [<ffffffffa04f0181>] nfs4_xdr_enc_open_confirm+0x61/0xc0 [nfsv4] [ 6232.702430] [<ffffffffa04f0181>] nfs4_xdr_enc_open_confirm+0x61/0xc0 [nfsv4] # gdb /lib/modules/3.12.7-200.fc19.x86_64/kernel/fs/nfs/nfsv4.ko GNU gdb (GDB) Fedora 7.6.1-46.fc19 Copyright (C) 2013 Free Software Foundation, Inc. License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html> This is free software: you are free to change and redistribute it. There is NO WARRANTY, to the extent permitted by law. Type "show copying" and "show warranty" for details. This GDB was configured as "x86_64-redhat-linux-gnu". For bug reporting instructions, please see: <http://www.gnu.org/software/gdb/bugs/>... Reading symbols from /usr/lib/modules/3.12.7-200.fc19.x86_64/kernel/fs/nfs/nfsv4.ko...Reading symbols from /usr/lib/debug/lib/modules/3.12.7-200.fc19.x86_64/kernel/fs/nfs/nfsv4.ko.debug...done. done. (gdb) list *(nfs4_xdr_enc_open_confirm+0x61) 0x11181 is in nfs4_xdr_enc_open_confirm (fs/nfs/nfs4xdr.c:974). 969 970 static void encode_op_hdr(struct xdr_stream *xdr, enum nfs_opnum4 op, 971 uint32_t replen, 972 struct compound_hdr *hdr) 973 { 974 encode_uint32(xdr, op); 975 hdr->nops++; 976 hdr->replen += replen; 977 } 978 (gdb) (the 'dmesg' file is in the previously attached abrt tarball) I hope it helps. Sigh...so most of the functions called by nfs4_xdr_enc_open_confirm end up calling that function to put the op header into the call. So we'll need to do a bit more investigation to determine which one that is. I'll take a closer look when I get time. Looking at the disassembly from 3.12.8-300.fc20.x86_64 (I doubt anything changed in this file between the two versions). 0xffffffffa0fcd179 <nfs4_xdr_enc_open_confirm+0x59>: mov %rbx,%rdi 0xffffffffa0fcd17c <nfs4_xdr_enc_open_confirm+0x5c>: callq 0xffffffffa0fcb400 <encode_putfh> 0xffffffffa0fcd181 <nfs4_xdr_enc_open_confirm+0x61>: mov %rbx,%rdi 0xffffffffa0fcd184 <nfs4_xdr_enc_open_confirm+0x64>: mov $0x14,%esi ...so the oops is likely in the call to encode_string in encode_putfh. That call is this: encode_string(xdr, fh->size, fh->data); ....which would seem to imply that maybe you're getting some very large filehandles? It might be interesting to instrument code just before the BUG_ON call and see if that's the problem... I would very much need this to be fixed as I'm having a number of other issues with NFSv3 file locking. If you could make a modified version of nfsv4.ko I'd be happy to test it. Created attachment 857310 [details] patch -- print number of bytes requested and some info about xdr stream prior to oopsing Ok, there's a patched kernel here that might give us more info: https://koji.fedoraproject.org/koji/taskinfo?taskID=6468532 It just adds a printk before the BUG_ON where this is oopsing. That might give us a better idea of what's going wrong. Created attachment 860881 [details]
dmesg oops with kernel 3.12.8-200.fc19.x86_64
Hi Jeff, I was waiting for a new crash before booting the machine with the patched kernel. It happened today (see attached dmesg) after ~2 weeks running without problems. Now I booted it with kernel 3.12.9-201.bz1055537.1.fc19.x86_64.debug. I'll let you know what happens when it crashes again. Created attachment 860911 [details] oops with kernel 3.12.9-201.bz1055537.1.fc19.x86_64.debug Hi again, I got another oops after ~12 hours. This is the output of the printk you added: Feb 8 21:02:02 clueet kernel: [29990.543184] reserve_space: p=0xffff8807ee0409f0 end=0xffff8807ee040b34 bytes=42236 Feb 8 21:02:02 clueet kernel: [29990.559115] ------------[ cut here ]------------ Feb 8 21:02:02 clueet kernel: [29990.574703] kernel BUG at fs/nfs/nfs4xdr.c:910! Feb 8 21:02:02 clueet kernel: [29990.590071] invalid opcode: 0000 [#1] SMP I hope this helps. Created attachment 860946 [details] second oops with kernel 3.12.9-201.bz1055537.1.fc19.x86_64.debug Another one... Feb 9 04:32:02 clueet kernel: [14826.807581] reserve_space: p=0xffff8810128889f0 end=0xffff881012888b34 bytes=42236 Feb 9 04:32:02 clueet kernel: [14826.822269] ------------[ cut here ]------------ Feb 9 04:32:03 clueet kernel: [14826.836676] kernel BUG at fs/nfs/nfs4xdr.c:910! Feb 9 04:32:03 clueet kernel: [14826.850876] invalid opcode: 0000 [#1] SMP Do you have the rest of the stack trace by any chance (or a vmcore)? 0xffff881012888b34-0xffff8810128889f0 == 0x144 ...which is decimal 324, but this code is trying to reserve 42k. Do you by any chance have the rest of the stack trace? Assuming that the stack trace looks similar to the one in comment #11, then it looks like you're somehow getting a 42k filehandle (which is extremely large). It should be no larger than 128 bytes: #define NFS_MAXFHSIZE 128 struct nfs_fh { unsigned short size; unsigned char data[NFS_MAXFHSIZE]; }; What might be best is to collect a vmcore, as I suspect that there may be some memory corruption... Created attachment 861519 [details] full /var/log messages oops with kernel 3.12.9-201.bz1055537.1.fc19.x86_64.debug I only have a full /var/log/messages, I attached it in comment #19. I was not able to generate a vmcore. Apparently there's a problem in doing it with Fedora 19 https://fedoraproject.org/wiki/QA:Testcase_ABRT_vmcore I followed these instructions: https://fedoraproject.org/wiki/How_to_use_kdump_to_debug_kernel_crashes but my /var/crash was empty after the kernel panic. Do you have any clue how to get a vmcore correctly ? Concerning the memory corruption you suspect, what does that mean ? could it be a hardware issue ? By memory corruption, I mean that something likely scribbled over a region of memory. That can happen for all sorts of reasons faulty hardware is certainly one, but it can happen due to software problems too. The problem is that they're often quite tricky to track down. Looking through the log, I see this a little later: Feb 9 04:33:07 clueet kernel: [14828.773531] BUG: unable to handle kernel paging request at ffffffffffffff98 Feb 9 04:33:07 clueet kernel: [14828.781821] IP: [<ffffffff810a2150>] kthread_data+0x10/0x20 Feb 9 04:33:07 clueet kernel: [14828.789907] PGD 1c0f067 PUD 1c11067 PMD 0 ...which is actually the more troubling problem, since that's a real GFP and not just a spot where someone put a deliberate BUG() assertion. In fact, it looks like that's equivalent to 0 - 104, so it's possible there was an ECONNRESET error in there somewhere. Anyway, my suspicion is that the original BUG() call in reserve_space is just a symptom of a larger problem. That problem may be what's causing kdump not to work correctly too. What may help is to set /proc/sys/kernel/panic_on_oops to 1 (it's generally 0 in Fedora) since that'll make the box panic on the first BUG() call instead of trying to keep going. Hi, since my last post the system has been running with no problems, using NFSv3 mounts. Today I rebooted it, I switched back to NFSv4 + kernel 3.12.9-201.bz1055537.1.fc19.x86_64. I also set /proc/sys/kernel/panic_on_oops to 1. After a few hours, I got a kernel panic and it self rebooted. This time the system generated a ~680MB vmcore which I uploaded here: https://docs.google.com/file/d/0B4tKW-EJ-33qV0N3bS0tQkVfQU0 I was not able to capture the output of dmesg. There was just an empty vmcore-dmesg-incomplete.txt in /var/crash. Please, can you have a look ? [root@clueet ~]# crash /var/crash/127.0.0.1-2014.02.25-13\:38\:12/vmcore /usr/lib/debug/lib/modules/3.12.9-201.bz1055537.1.fc19.x86_64/vmlinux crash 7.0.3-1.fc20 Copyright (C) 2002-2013 Red Hat, Inc. Copyright (C) 2004, 2005, 2006, 2010 IBM Corporation Copyright (C) 1999-2006 Hewlett-Packard Co Copyright (C) 2005, 2006, 2011, 2012 Fujitsu Limited Copyright (C) 2006, 2007 VA Linux Systems Japan K.K. Copyright (C) 2005, 2011 NEC Corporation Copyright (C) 1999, 2002, 2007 Silicon Graphics, Inc. Copyright (C) 1999, 2000, 2001, 2002 Mission Critical Linux, Inc. This program is free software, covered by the GNU General Public License, and you are welcome to change it and/or distribute copies of it under certain conditions. Enter "help copying" to see the conditions. This program has absolutely no warranty. Enter "help warranty" for details. GNU gdb (GDB) 7.6 Copyright (C) 2013 Free Software Foundation, Inc. License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html> This is free software: you are free to change and redistribute it. There is NO WARRANTY, to the extent permitted by law. Type "show copying" and "show warranty" for details. This GDB was configured as "x86_64-unknown-linux-gnu"... KERNEL: /usr/lib/debug/lib/modules/3.12.9-201.bz1055537.1.fc19.x86_64/vmlinux DUMPFILE: /var/crash/127.0.0.1-2014.02.25-13:38:12/vmcore [PARTIAL DUMP] CPUS: 32 DATE: Tue Feb 25 13:38:03 2014 UPTIME: 12:55:24 LOAD AVERAGE: 2.19, 3.30, 3.32 TASKS: 2358 NODENAME: clueet.cern.ch RELEASE: 3.12.9-201.bz1055537.1.fc19.x86_64 VERSION: #1 SMP Wed Jan 29 21:17:10 UTC 2014 MACHINE: x86_64 (2899 Mhz) MEMORY: 64 GB PANIC: "kernel BUG at fs/nfs/nfs4xdr.c:910!" PID: 11996 COMMAND: "kworker/6:1" TASK: ffff880541d539c0 [THREAD_INFO: ffff88084adac000] CPU: 6 STATE: TASK_RUNNING (PANIC) crash> [46511.287488] reserve_space: p=0xffff8805eba79070 end=0xffff8805eba791b4 bytes=57348 [46511.301751] ------------[ cut here ]------------ [46511.315562] kernel BUG at fs/nfs/nfs4xdr.c:910! Great, please do the following: rename the vmcore to something like bz1055537-vmcore and then gzip it. Then, upload that file to: ftp://dropbox.redhat.com/incoming ...and let me know once it's done uploading. I'm traveling this week and might not get a chance to look at it for a bit, but I'll get to it as soon as I can. Done, bz1055537-vmcore.gz uploaded. Thanks a lot. Thanks. Looks like the debuginfo got gc'ed in koji, so I'm having to rebuild the kernel. Trond seems to think that this may be related to bug 1061707. Once I get a look at the core I'll try to confirm that either way... In the meantime, it would be good to test a v3.12.12 or later kernel once those show up in koji. I don't seem to be authorized to see bug 1061707 but I trust you :-) If needed, I still have the debuginfo rpms of kernel 3.12.9-201.bz1055537.1.fc19.x86_64. Ok, looking at the vmcore, I think I agree with Trond and this is likely a dup of bug 1050206. The symptoms look a little different, but the same data structures are being corrupted. It would be best to try a newer kernel on this box that has commit 17ead6c85c3d0ef57a14d1373f1f1cee2ce60ea8. It looks like they're working on v3.13.5 kernels in koji, which have that patch. You may want to try one of those: http://koji.fedoraproject.org/koji/buildinfo?buildID=500653 I'm going to go ahead and close this as a dup of that bug. If the newer kernel doesn't help, we'll plan to reopen this bug and continue looking at it. *** This bug has been marked as a duplicate of bug 1050206 *** Thanks a lot, I'll give it a try ! Hi, no kernel panics observed since I updated to kernel 3.13.5-101.fc19.x86_64. This seems to have fixed it. Thanks ! |
Created attachment 852758 [details] abrt bug report Description of problem: My Fedora 19 server is experiencing kernel panics after several hours of regular usage. I'd like to understand if this is caused by a hardware issue or a kernel instability. Version-Release number of selected component (if applicable): Fedora 19 - kernel 3.12.7-200.fc19.x86_64 How reproducible: Not easy to reproduce.. It happened unexpectedly: 3 times in 2 days. The last messages it logs before dying are: [ 5989.835667] ------------[ cut here ]------------ [ 5989.835825] kernel BUG at fs/nfs/nfs4xdr.c:908! [ 5989.835979] invalid opcode: 0000 [#1] SMP ... ... (see attachment for the full dmesg log) Additional info: The server has an Intel 82599ES 10gbit/s interface (eth4). The sub-interface eth4.2 is used to mount home directories from a nexenta storage system via NFS4 on VLAN2.