Description of problem:
-----------------------

6 node Ganesha cluster, 6 clients mounted a Ganesha export via v4.

Ran kernel untar in different subdirs from the 6 clients.

Almost a minute later, all nodes crashed one by one, each generating a vmcore (thanks Soumya for the initial debug).

I tried this once on FUSE and did not face any problem.

The problem is very easily reproducible on smaller setups, with smaller load, and without HA as well.

Version-Release number of selected component (if applicable):
------------------------------------------------------------

[root@gqas003 /]# rpm -qa|grep ganesha
nfs-ganesha-gluster-2.5.4-1.el7rhgs.x86_64
nfs-ganesha-debuginfo-2.5.4-1.el7rhgs.x86_64
glusterfs-ganesha-3.12.2-1.el7rhgs.x86_64
nfs-ganesha-2.5.4-1.el7rhgs.x86_64
[root@gqas003 /]#

[root@gqas003 /]# rpm -qa|grep kernel
kernel-3.10.0-693.el7.x86_64

How reproducible:
-----------------

100%

Steps to Reproduce:
------------------

1. Create a Ganesha HA cluster.
2. Mount the Ganesha export on multiple clients and trigger any write-intensive workload.

Actual results:
---------------

Nodes crash, all of them, one by one. Quorum gets lost and the application hangs.

Expected results:
-----------------

No crashes.
Pasting BT from core:

[root@gqas003 ~]# crash /usr/lib/debug/lib/modules/3.10.0-799.el7.x86_64/vmlinux /var/crash/127.0.0.1-2017-12-01-03\:24\:18/vmcore

crash 7.2.0-2.el7
Copyright (C) 2002-2017 Red Hat, Inc.
Copyright (C) 2004, 2005, 2006, 2010 IBM Corporation
Copyright (C) 1999-2006 Hewlett-Packard Co
Copyright (C) 2005, 2006, 2011, 2012 Fujitsu Limited
Copyright (C) 2006, 2007 VA Linux Systems Japan K.K.
Copyright (C) 2005, 2011 NEC Corporation
Copyright (C) 1999, 2002, 2007 Silicon Graphics, Inc.
Copyright (C) 1999, 2000, 2001, 2002 Mission Critical Linux, Inc.
This program is free software, covered by the GNU General Public License,
and you are welcome to change it and/or distribute copies of it under
certain conditions. Enter "help copying" to see the conditions.
This program has absolutely no warranty. Enter "help warranty" for details.

GNU gdb (GDB) 7.6
Copyright (C) 2013 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law. Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-unknown-linux-gnu"...

WARNING: kernel relocated [8MB]: patching 80866 gdb minimal_symbol values

      KERNEL: /usr/lib/debug/lib/modules/3.10.0-799.el7.x86_64/vmlinux
    DUMPFILE: /var/crash/127.0.0.1-2017-12-01-03:24:18/vmcore  [PARTIAL DUMP]
        CPUS: 24
        DATE: Fri Dec 1 03:24:12 2017
      UPTIME: 01:31:15
LOAD AVERAGE: 0.62, 0.35, 0.22
       TASKS: 701
    NODENAME: gqas003.sbu.lab.eng.bos.redhat.com
     RELEASE: 3.10.0-799.el7.x86_64
     VERSION: #1 SMP Mon Nov 27 07:04:19 EST 2017
     MACHINE: x86_64 (2666 Mhz)
      MEMORY: 48 GB
       PANIC: "BUG: unable to handle kernel paging request at ffff8e7a00098000"
         PID: 0
     COMMAND: "swapper/11"
        TASK: ffff8e7b76e14f10 (1 of 24) [THREAD_INFO: ffff8e7b76e30000]
         CPU: 11
       STATE: TASK_RUNNING (PANIC)

crash> bt
PID: 0 TASK: ffff8e7b76e14f10 CPU: 11 COMMAND: "swapper/11"
 #0 [ffff8e8017b439b8] machine_kexec at ffffffff8185f68b
 #1 [ffff8e8017b43a18] __crash_kexec at ffffffff8190c6f2
 #2 [ffff8e8017b43ae8] crash_kexec at ffffffff8190c7e0
 #3 [ffff8e8017b43b00] oops_end at ffffffff81ee2af8
 #4 [ffff8e8017b43b28] no_context at ffffffff81ed326b
 #5 [ffff8e8017b43b78] __bad_area_nosemaphore at ffffffff81ed3302
 #6 [ffff8e8017b43bc8] bad_area_nosemaphore at ffffffff81ed3473
 #7 [ffff8e8017b43bd8] __do_page_fault at ffffffff81ee5a70
 #8 [ffff8e8017b43c40] do_page_fault at ffffffff81ee5c65
 #9 [ffff8e8017b43c70] page_fault at ffffffff81ee1d88
    [exception RIP: memcpy+13]
    RIP: ffffffff81b4b59d  RSP: ffff8e8017b43d28  RFLAGS: 00010206
    RAX: ffff8e7a000004eb  RBX: ffff9f52c6fa7000  RCX: 0000000003144f56
    RDX: 0000000000000005  RSI: ffff8e7aceafdffb  RDI: ffff8e7a00097ffb
    RBP: ffff8e8017b43d30   R8: 0000000000000000   R9: 00000000000004eb
    R10: 00000000000001b7  R11: 00000000000000c6  R12: ffff8e861358ecc0
    R13: ffff8e86143c8600  R14: ffff8e8609f59000  R15: 00000000000000c6
    ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
#10 [ffff8e8017b43d28] swiotlb_tbl_sync_single at ffffffff81b64d43
#11 [ffff8e8017b43d38] swiotlb_sync_single at ffffffff81b64d80
#12 [ffff8e8017b43d48] swiotlb_sync_single_for_cpu at ffffffff81b64d9c
#13 [ffff8e8017b43d58] ixgbe_clean_rx_irq at ffffffffc03f3392 [ixgbe]
#14 [ffff8e8017b43de0] ixgbe_poll at ffffffffc03f454e [ixgbe]
#15 [ffff8e8017b43e78] net_rx_action at ffffffff81dbab79
#16 [ffff8e8017b43ef8] __do_softirq at ffffffff8189505f
#17 [ffff8e8017b43f68] call_softirq at ffffffff81eec45c
#18 [ffff8e8017b43f80] do_softirq at ffffffff8182d5b5
#19 [ffff8e8017b43fa0] irq_exit at ffffffff818953e5
#20 [ffff8e8017b43fb8] do_IRQ at ffffffff81eecff6
--- <IRQ stack> ---
bt: cannot transition from IRQ stack to current process stack:
        IRQ stack pointer: ffff8e8017b439b8
    process stack pointer: ffffffff81eecfce
       current stack base: ffff8e7b76e30000
crash>
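
For anyone triaging the core, the register dump can be sanity-checked against the panic address with a few lines of Python. This is only a rough sketch. It assumes (not confirmed from the disassembly) that memcpy+13 sits in the quadword copy loop, so that RDI is the current destination, RCX the number of quadwords still to copy, RDX the trailing byte count, and RAX the original destination. The 4096-byte page size is also an assumption.

```python
# Values copied from the crash output in this report.
FAULT_ADDR = 0xFFFF8E7A00098000  # address in the PANIC line
RAX = 0xFFFF8E7A000004EB         # assumed: original memcpy destination
RDI = 0xFFFF8E7A00097FFB         # assumed: current destination pointer
RCX = 0x0000000003144F56         # assumed: quadwords still to copy
RDX = 0x0000000000000005         # assumed: trailing bytes after the quadwords
PAGE_SIZE = 4096                 # assumed x86_64 base page size


def summarize():
    """Relate the faulting address to the memcpy registers."""
    return {
        # The fault lands exactly on a page boundary.
        "fault_page_aligned": FAULT_ADDR % PAGE_SIZE == 0,
        # Distance from the current destination to the faulting address.
        "fault_offset_from_rdi": FAULT_ADDR - RDI,
        # Bytes written before the fault, if RAX is the original destination.
        "bytes_already_copied": RDI - RAX,
        # Bytes the copy still intended to write, under the assumptions above.
        "bytes_still_requested": RCX * 8 + RDX,
    }


if __name__ == "__main__":
    for key, value in summarize().items():
        print(f"{key}: {value}")
```

If those assumptions hold, the faulting address falls inside the 8-byte store starting at RDI, i.e. the copy ran off the end of a mapped page while a very large length was still outstanding. That would be consistent with a bad length or address reaching swiotlb_tbl_sync_single, but the sketch alone does not prove it.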
This is a problem with receiving on the NIC. I don't see how Ganesha can possibly be causing this. It may be a driver bug? Have the kernels on these boxes been updated recently?
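
To make that reading of the trace easy to repeat on other cores, here is a minimal sketch that pulls the frames out of `bt` output and reports which ones belong to a loadable module. The regular expression and the helper names (`parse_frames`, `module_frames`) are hypothetical, written for this excerpt only, and the "vmlinux" label for untagged frames is my own convention, not something crash prints.

```python
import re

# Frames #9 to #15 copied from the backtrace in this report.
BT = """\
#9 [ffff8e8017b43c70] page_fault at ffffffff81ee1d88
#10 [ffff8e8017b43d28] swiotlb_tbl_sync_single at ffffffff81b64d43
#11 [ffff8e8017b43d38] swiotlb_sync_single at ffffffff81b64d80
#12 [ffff8e8017b43d48] swiotlb_sync_single_for_cpu at ffffffff81b64d9c
#13 [ffff8e8017b43d58] ixgbe_clean_rx_irq at ffffffffc03f3392 [ixgbe]
#14 [ffff8e8017b43de0] ixgbe_poll at ffffffffc03f454e [ixgbe]
#15 [ffff8e8017b43e78] net_rx_action at ffffffff81dbab79
"""

# "#N [stack] symbol at address" with an optional trailing "[module]" tag.
FRAME_RE = re.compile(
    r"#(?P<idx>\d+)[ \t]+\[(?P<sp>[0-9a-f]+)\][ \t]+(?P<sym>\S+)"
    r"[ \t]+at[ \t]+(?P<addr>[0-9a-f]+)(?:[ \t]+\[(?P<mod>\w+)\])?"
)


def parse_frames(text):
    """Return (frame number, symbol, module) for each frame in the text."""
    frames = []
    for m in FRAME_RE.finditer(text):
        # Frames without a module tag are labeled "vmlinux" (our convention).
        frames.append((int(m["idx"]), m["sym"], m["mod"] or "vmlinux"))
    return frames


def module_frames(text):
    """Return only the frames that carry a module tag."""
    return [f for f in parse_frames(text) if f[2] != "vmlinux"]


if __name__ == "__main__":
    for idx, sym, mod in module_frames(BT):
        print(f"#{idx} {sym} [{mod}]")
```

On this excerpt the only module-tagged frames are the two ixgbe ones, sitting directly between net_rx_action and the swiotlb sync calls, with no NFS or Ganesha symbols anywhere in the trace.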
(In reply to Daniel Gryniewicz from comment #5)
> This is a problem with receiving on the NIC. I don't see how Ganesha can
> possibly be causing this. It may be a driver bug? Have the kernels on
> these boxes been updated recently?

Yes, I upgraded from 7.4 to 7.5 and have been having problems with I/O ever since.

For whatever reason, I cannot reproduce this on FUSE.

Should I be cloning this to RHEL - kernel?
(In reply to Ambarish from comment #6)
> Yes,I upgraded from 7.4 to 7.5 and am having problems with I/O ever since.
>
> For whatever reason,I cannot reproduce this on FUSE.
>
> Should I be cloning this to RHEL - kernel?

I think so, yes.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory, and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2018:2610