1519713 – [Ganesha] : Ganesha nodes crash when I/O is started from clients , vmcore generated.

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	Red Hat Gluster Storage
Classification:	Red Hat Storage
Component:	nfs-ganesha
Sub Component:
Version:	rhgs-3.4
Hardware:	x86_64
OS:	Linux
Priority:	unspecified
Severity:	high
Target Milestone:	---
Target Release:	RHGS 3.4.0
Assignee:	Kaleb KEITHLEY
QA Contact:	Manisha Saini
Docs Contact:
URL:
Whiteboard:
Depends On:	1520428
Blocks:	1503137

Reported:	2017-12-01 09:22 UTC by Ambarish
Modified:	2018-09-24 12:44 UTC
CC List:	13 users
Fixed In Version:
Doc Type:	If docs needed, set a value
Doc Text:
Clone Of:
Clones:	1520428
Environment:
Last Closed:	2018-09-04 06:53:36 UTC
Embargoed:
Dependent Products:

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Red Hat Product Errata	RHEA-2018:2610	0	None	None	None	2018-09-04 06:54:41 UTC

Description Ambarish 2017-12-01 09:22:02 UTC

Description of problem:
-----------------------

6-node Ganesha cluster; 6 clients mounted a Ganesha export via NFSv4.

Ran a kernel untar in different subdirs from the 6 clients.

Almost a minute later, all nodes crashed one by one, each generating a vmcore (thanks to Soumya for the initial debugging).

I tried this once on FUSE and did not hit the problem.

The problem is very easily reproducible on smaller setups, with smaller loads, and without HA as well.

Version-Release number of selected component (if applicable):
------------------------------------------------------------
[root@gqas003 /]# rpm -qa|grep ganesha


nfs-ganesha-gluster-2.5.4-1.el7rhgs.x86_64
nfs-ganesha-debuginfo-2.5.4-1.el7rhgs.x86_64
glusterfs-ganesha-3.12.2-1.el7rhgs.x86_64
nfs-ganesha-2.5.4-1.el7rhgs.x86_64

[root@gqas003 /]# 
[root@gqas003 /]# rpm -qa|grep kernel
kernel-3.10.0-693.el7.x86_64



How reproducible:
-----------------

100%

Steps to Reproduce:
------------------

1. Create a Ganesha HA cluster.

2. Mount the Ganesha export on multiple clients and trigger any write-intensive workload.
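
The client side of these steps can be sketched roughly as below. This is only an illustration and not the exact commands used in this report: the server name, export path, mount point, tarball path, and per-client subdirectory names are all hypothetical placeholders, and the helper only builds the command lines without running anything.

```python
import shlex


def build_repro_commands(server, export, mountpoint, tarball, workdir):
    """Return the shell commands one client would run (nothing is executed).

    All arguments are placeholders; none of them come from the bug report.
    """
    target = f"{mountpoint}/{workdir}"
    argv_list = [
        ["mkdir", "-p", mountpoint],
        # Mount the Ganesha export over NFSv4.
        ["mount", "-t", "nfs", "-o", "vers=4", f"{server}:{export}", mountpoint],
        # Each client works in its own subdirectory of the export.
        ["mkdir", "-p", target],
        # Write-intensive workload: untar a kernel source tarball.
        ["tar", "-xf", tarball, "-C", target],
    ]
    return [shlex.join(argv) for argv in argv_list]


if __name__ == "__main__":
    # Hypothetical values for six clients.
    for client_id in range(1, 7):
        print(f"# client{client_id}")
        for cmd in build_repro_commands(
            server="ganesha-vip.example.com",
            export="/testvol",
            mountpoint="/mnt/ganesha",
            tarball="/root/linux.tar.xz",
            workdir=f"client{client_id}",
        ):
            print(cmd)
```

Running the printed commands as root on each client at roughly the same time would approximate the workload described above.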


Actual results:
---------------

All nodes crash, one by one.

Quorum is lost and the application hangs.

Expected results:
-----------------

No crashes.

Comment 4 Ambarish 2017-12-01 11:02:00 UTC

Pasting the backtrace from the vmcore:

[root@gqas003 ~]# crash /usr/lib/debug/lib/modules/3.10.0-799.el7.x86_64/vmlinux /var/crash/127.0.0.1-2017-12-01-03\:24\:18/vmcore

crash 7.2.0-2.el7
Copyright (C) 2002-2017  Red Hat, Inc.
Copyright (C) 2004, 2005, 2006, 2010  IBM Corporation
Copyright (C) 1999-2006  Hewlett-Packard Co
Copyright (C) 2005, 2006, 2011, 2012  Fujitsu Limited
Copyright (C) 2006, 2007  VA Linux Systems Japan K.K.
Copyright (C) 2005, 2011  NEC Corporation
Copyright (C) 1999, 2002, 2007  Silicon Graphics, Inc.
Copyright (C) 1999, 2000, 2001, 2002  Mission Critical Linux, Inc.
This program is free software, covered by the GNU General Public License,
and you are welcome to change it and/or distribute copies of it under
certain conditions.  Enter "help copying" to see the conditions.
This program has absolutely no warranty.  Enter "help warranty" for details.
 
GNU gdb (GDB) 7.6
Copyright (C) 2013 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-unknown-linux-gnu"...

WARNING: kernel relocated [8MB]: patching 80866 gdb minimal_symbol values

      KERNEL: /usr/lib/debug/lib/modules/3.10.0-799.el7.x86_64/vmlinux 
    DUMPFILE: /var/crash/127.0.0.1-2017-12-01-03:24:18/vmcore  [PARTIAL DUMP]
        CPUS: 24
        DATE: Fri Dec  1 03:24:12 2017
      UPTIME: 01:31:15
LOAD AVERAGE: 0.62, 0.35, 0.22
       TASKS: 701
    NODENAME: gqas003.sbu.lab.eng.bos.redhat.com
     RELEASE: 3.10.0-799.el7.x86_64
     VERSION: #1 SMP Mon Nov 27 07:04:19 EST 2017
     MACHINE: x86_64  (2666 Mhz)
      MEMORY: 48 GB
       PANIC: "BUG: unable to handle kernel paging request at ffff8e7a00098000"
         PID: 0
     COMMAND: "swapper/11"
        TASK: ffff8e7b76e14f10  (1 of 24)  [THREAD_INFO: ffff8e7b76e30000]
         CPU: 11
       STATE: TASK_RUNNING (PANIC)

crash> bt
PID: 0      TASK: ffff8e7b76e14f10  CPU: 11  COMMAND: "swapper/11"
 #0 [ffff8e8017b439b8] machine_kexec at ffffffff8185f68b
 #1 [ffff8e8017b43a18] __crash_kexec at ffffffff8190c6f2
 #2 [ffff8e8017b43ae8] crash_kexec at ffffffff8190c7e0
 #3 [ffff8e8017b43b00] oops_end at ffffffff81ee2af8
 #4 [ffff8e8017b43b28] no_context at ffffffff81ed326b
 #5 [ffff8e8017b43b78] __bad_area_nosemaphore at ffffffff81ed3302
 #6 [ffff8e8017b43bc8] bad_area_nosemaphore at ffffffff81ed3473
 #7 [ffff8e8017b43bd8] __do_page_fault at ffffffff81ee5a70
 #8 [ffff8e8017b43c40] do_page_fault at ffffffff81ee5c65
 #9 [ffff8e8017b43c70] page_fault at ffffffff81ee1d88
    [exception RIP: memcpy+13]
    RIP: ffffffff81b4b59d  RSP: ffff8e8017b43d28  RFLAGS: 00010206
    RAX: ffff8e7a000004eb  RBX: ffff9f52c6fa7000  RCX: 0000000003144f56
    RDX: 0000000000000005  RSI: ffff8e7aceafdffb  RDI: ffff8e7a00097ffb
    RBP: ffff8e8017b43d30   R8: 0000000000000000   R9: 00000000000004eb
    R10: 00000000000001b7  R11: 00000000000000c6  R12: ffff8e861358ecc0
    R13: ffff8e86143c8600  R14: ffff8e8609f59000  R15: 00000000000000c6
    ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
#10 [ffff8e8017b43d28] swiotlb_tbl_sync_single at ffffffff81b64d43
#11 [ffff8e8017b43d38] swiotlb_sync_single at ffffffff81b64d80
#12 [ffff8e8017b43d48] swiotlb_sync_single_for_cpu at ffffffff81b64d9c
#13 [ffff8e8017b43d58] ixgbe_clean_rx_irq at ffffffffc03f3392 [ixgbe]
#14 [ffff8e8017b43de0] ixgbe_poll at ffffffffc03f454e [ixgbe]
#15 [ffff8e8017b43e78] net_rx_action at ffffffff81dbab79
#16 [ffff8e8017b43ef8] __do_softirq at ffffffff8189505f
#17 [ffff8e8017b43f68] call_softirq at ffffffff81eec45c
#18 [ffff8e8017b43f80] do_softirq at ffffffff8182d5b5
#19 [ffff8e8017b43fa0] irq_exit at ffffffff818953e5
#20 [ffff8e8017b43fb8] do_IRQ at ffffffff81eecff6
--- <IRQ stack> ---
bt: cannot transition from IRQ stack to current process stack:
        IRQ stack pointer: ffff8e8017b439b8
    process stack pointer: ffffffff81eecfce
       current stack base: ffff8e7b76e30000
crash>
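
A rough reading of the register dump above, as a sketch rather than a confirmed diagnosis. The register values and the faulting address are copied from the output. The interpretation is an assumption: it supposes that memcpy+13 is the 8-byte `rep movsq` copy loop, that RAX still holds the original destination pointer, that RCX counts the 8-byte words left to copy, and that RDX holds the trailing byte count.

```python
PAGE_SIZE = 4096  # x86_64 base page size

# Values copied from the crash output above.
FAULT_ADDR = 0xFFFF8E7A00098000  # address in the PANIC line
RDI = 0xFFFF8E7A00097FFB         # destination pointer at the time of the fault
RAX = 0xFFFF8E7A000004EB         # ASSUMED: original destination (memcpy return value)
RCX = 0x0000000003144F56         # ASSUMED: 8-byte words still to be copied
RDX = 0x0000000000000005         # ASSUMED: trailing bytes (length & 7)


def next_page_start(addr, page_size=PAGE_SIZE):
    """Return the first address of the page that follows the one holding addr."""
    return (addr & ~(page_size - 1)) + page_size


# An 8-byte store at RDI crosses into the next page, which is the faulting address.
crosses_page = RDI + 8 > next_page_start(RDI)

# Under the assumptions above: bytes already copied and bytes still pending.
bytes_copied = RDI - RAX
bytes_remaining = RCX * 8 + RDX

if __name__ == "__main__":
    print(f"next page after RDI : {next_page_start(RDI):#x}")
    print(f"matches fault addr  : {next_page_start(RDI) == FAULT_ADDR}")
    print(f"store crosses page  : {crosses_page}")
    print(f"bytes copied so far : {bytes_copied} ({bytes_copied:#x})")
    print(f"bytes remaining     : {bytes_remaining} ({bytes_remaining / 2**20:.1f} MiB)")
```

The faulting address is the start of the page immediately after the one RDI points into. If the assumptions hold, the copy still had several hundred MiB left to go, which would be an unusually large length for a receive-buffer sync; that reading depends entirely on the assumed memcpy variant.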

Comment 5 Daniel Gryniewicz 2017-12-01 14:25:13 UTC

This is a problem with receiving on the NIC.  I don't see how Ganesha can possibly be causing this.  It may be a driver bug?  Have the kernels on these boxes been updated recently?

Comment 6 Ambarish 2017-12-04 05:30:43 UTC

(In reply to Daniel Gryniewicz from comment #5)
> This is a problem with receiving on the NIC.  I don't see how Ganesha can
> possibly be causing this.  It may be a driver bug?  Have the kernels on
> these boxes been updated recently?

Yes, I upgraded from 7.4 to 7.5 and have been having problems with I/O ever since.

For whatever reason, I cannot reproduce this on FUSE.

Should I clone this to RHEL - kernel?

Comment 7 Daniel Gryniewicz 2017-12-04 13:13:15 UTC

(In reply to Ambarish from comment #6)

> Yes, I upgraded from 7.4 to 7.5 and have been having problems with I/O ever since.
> 
> For whatever reason, I cannot reproduce this on FUSE.
> 
> Should I clone this to RHEL - kernel?

I think so, yes.

Comment 21 errata-xmlrpc 2018-09-04 06:53:36 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2018:2610
