Bug 1398902 - [Ganesha] : Ganesha crashes when nfs-ganesha is restarted amidst continuous I/O from heterogeneous clients.
Summary: [Ganesha] : Ganesha crashes when nfs-ganesha is restarted amidst continuous I/O from heterogeneous clients.
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: Red Hat Gluster Storage
Classification: Red Hat
Component: nfs-ganesha
Version: rhgs-3.2
Hardware: x86_64
OS: Linux
Priority: unspecified
Severity: high
Target Milestone: ---
Target Release: ---
Assignee: Kaleb KEITHLEY
QA Contact: Ambarish
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2016-11-27 07:22 UTC by Ambarish
Modified: 2017-08-23 12:30 UTC
CC List: 11 users

Fixed In Version: rhgs-3.2.0
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2017-08-23 12:30:17 UTC
Target Upstream Version:



Description Ambarish 2016-11-27 07:22:49 UTC
Description of problem:
------------------------

4-node Ganesha cluster.
4 clients mounted a 2x2 (distributed-replicate) volume, 2 via NFSv3 and 2 via NFSv4.

Workload: tarball untar, small-file creates.

Restarted nfs-ganesha on all 4 nodes.

Ganesha crashed on 2 of the 4 nodes. The two BTs have different signatures:

**************
BT from Node 1
**************

(gdb) 
#0  0x00007fdbdb7fe3b0 in GlusterFS () from /usr/lib64/ganesha/libfsalgluster.so.4.2.0
#1  0x00007fdc6c7ec4c3 in mdcache_lru_clean (entry=0x7fd8f4139570)
    at /usr/src/debug/nfs-ganesha-2.4.1/src/FSAL/Stackable_FSALs/FSAL_MDCACHE/mdcache_lru.c:421
#2  mdcache_lru_unref (entry=entry@entry=0x7fd8f4139570, flags=flags@entry=0)
    at /usr/src/debug/nfs-ganesha-2.4.1/src/FSAL/Stackable_FSALs/FSAL_MDCACHE/mdcache_lru.c:1464
#3  0x00007fdc6c7e9e11 in mdcache_put (entry=0x7fd8f4139570)
    at /usr/src/debug/nfs-ganesha-2.4.1/src/FSAL/Stackable_FSALs/FSAL_MDCACHE/mdcache_lru.h:186
#4  mdcache_unexport (exp_hdl=0x7fdbdc0d2130)
    at /usr/src/debug/nfs-ganesha-2.4.1/src/FSAL/Stackable_FSALs/FSAL_MDCACHE/mdcache_export.c:152
#5  0x00007fdc6c7cc473 in clean_up_export (export=0x7fdbdc002e28)
    at /usr/src/debug/nfs-ganesha-2.4.1/src/support/exports.c:2266
#6  unexport (export=export@entry=0x7fdbdc002e28) at /usr/src/debug/nfs-ganesha-2.4.1/src/support/exports.c:2287
#7  0x00007fdc6c7dc8d7 in remove_all_exports () at /usr/src/debug/nfs-ganesha-2.4.1/src/support/export_mgr.c:761
#8  0x00007fdc6c75014a in do_shutdown () at /usr/src/debug/nfs-ganesha-2.4.1/src/MainNFSD/nfs_admin_thread.c:433
#9  admin_thread (UnusedArg=<optimized out>) at /usr/src/debug/nfs-ganesha-2.4.1/src/MainNFSD/nfs_admin_thread.c:466
#10 0x00007fdc6acaedc5 in start_thread () from /lib64/libpthread.so.0
#11 0x00007fdc6a37d73d in clone () from /lib64/libc.so.6
(gdb) 



**************
BT from Node 2
**************


(gdb) 
#0  0x00007f230e17bc05 in __gf_free (free_ptr=0x7f2054008340) at mem-pool.c:314
#1  0x00007f230e1793ae in fd_destroy (bound=_gf_true, fd=0x7f20540d5b3c) at fd.c:523
#2  fd_unref (fd=0x7f20540d5b3c) at fd.c:568
#3  0x00007f22f7b4cd78 in client_local_wipe (local=local@entry=0x7f22f006b260) at client-helpers.c:131
#4  0x00007f22f7b52b38 in client3_3_flush_cbk (req=<optimized out>, iov=<optimized out>, count=<optimized out>, 
    myframe=0x7f2304e43490) at client-rpc-fops.c:921
#5  0x00007f230df1f680 in rpc_clnt_handle_reply (clnt=clnt@entry=0x7f22f00728d0, pollin=pollin@entry=0x7f22f0446d70)
    at rpc-clnt.c:791
#6  0x00007f230df1f95f in rpc_clnt_notify (trans=<optimized out>, mydata=0x7f22f0072900, event=<optimized out>, 
    data=0x7f22f0446d70) at rpc-clnt.c:962
#7  0x00007f230df1b883 in rpc_transport_notify (this=this@entry=0x7f22f0082600, 
    event=event@entry=RPC_TRANSPORT_MSG_RECEIVED, data=data@entry=0x7f22f0446d70) at rpc-transport.c:537
#8  0x00007f22fca20eb4 in socket_event_poll_in (this=this@entry=0x7f22f0082600) at socket.c:2267
#9  0x00007f22fca23365 in socket_event_handler (fd=<optimized out>, idx=1, data=0x7f22f0082600, poll_in=1, 
    poll_out=0, poll_err=0) at socket.c:2397
#10 0x00007f230e1af3d0 in event_dispatch_epoll_handler (event=0x7f22fca18540, event_pool=0x7f2308086710)
    at event-epoll.c:571
#11 event_dispatch_epoll_worker (data=0x7f22f8000920) at event-epoll.c:674
#12 0x00007f2399d39dc5 in start_thread () from /lib64/libpthread.so.0
#13 0x00007f239940873d in clone () from /lib64/libc.so.6
(gdb) 
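
For reference, backtraces like the two above are typically extracted from the core files with gdb against the ganesha binary (with debuginfo installed); a minimal sketch, where the core path is a placeholder rather than a value taken from this report:

# open the core against the ganesha binary
gdb /usr/bin/ganesha.nfsd <path-to-core>
(gdb) bt                        # backtrace of the crashing thread
(gdb) thread apply all bt full  # full state of every thread, if needed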


Once again, these cores were dumped when the ganesha restart stopped the ganesha process, as the process was alive and running after the crash.

Also, this is a different use case from the one reported in BZ#1393526, and the BTs are different as well.

Version-Release number of selected component (if applicable):
-------------------------------------------------------------

nfs-ganesha-gluster-2.4.1-1.el7rhgs.x86_64
glusterfs-ganesha-3.8.4-5.el7rhgs.x86_64


How reproducible:
-----------------

2/2 (reproduced on both attempts).

Steps to Reproduce:
------------------

1. Mount the volume via NFSv3 and NFSv4.

2. Run different types of workloads from the application side on the v3 as well as the v4 mounts (smallfile, kernel untar, etc.).

3. Restart the nfs-ganesha service on all nodes while the I/O is running (a rough sketch of the full sequence follows below).
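
A rough shell sketch of the reproduction sequence; the VIP, mount points, node names and the exact workload commands are placeholders/assumptions, not values taken from this report:

# two clients mount via NFSv3, the other two via NFSv4
mount -t nfs -o vers=3 <ganesha-VIP>:/testvol /mnt/testvol
mount -t nfs -o vers=4 <ganesha-VIP>:/testvol /mnt/testvol

# keep continuous I/O running on every mount, e.g. kernel tarball untar
# and small-file creates
(cd /mnt/testvol && tar xf /tmp/linux-4.x.tar.xz) &

# while the I/O is in flight, restart nfs-ganesha on all 4 cluster nodes
for n in node1 node2 node3 node4; do ssh "$n" systemctl restart nfs-ganesha; done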

Actual results:
---------------

Ganesha crashed and dumped core.

Expected results:
------------------

No crashes.

Additional info:
-----------------

OS: RHEL 7.3

*Vol Config*:

Volume Name: testvol
Type: Distributed-Replicate
Volume ID: aeab0f8a-1e34-4681-bdf4-5b1416e46f27
Status: Started
Snapshot Count: 0
Number of Bricks: 2 x 2 = 4
Transport-type: tcp
Bricks:
Brick1: gqas013.sbu.lab.eng.bos.redhat.com:/bricks/testvol_brick0
Brick2: gqas005.sbu.lab.eng.bos.redhat.com:/bricks/testvol_brick1
Brick3: gqas006.sbu.lab.eng.bos.redhat.com:/bricks/testvol_brick2
Brick4: gqas011.sbu.lab.eng.bos.redhat.com:/bricks/testvol_brick3
Options Reconfigured:
ganesha.enable: on
features.cache-invalidation: on
server.allow-insecure: on
performance.stat-prefetch: off
transport.address-family: inet
performance.readdir-ahead: on
nfs.disable: on
nfs-ganesha: enable
cluster.enable-shared-storage: enable
[root@gqas013 tmp]#
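
For completeness, a rough sketch of how an equivalent volume configuration can be set up with the standard gluster CLI; the exact sequence used on this setup is an assumption:

gluster volume create testvol replica 2 \
    gqas013.sbu.lab.eng.bos.redhat.com:/bricks/testvol_brick0 \
    gqas005.sbu.lab.eng.bos.redhat.com:/bricks/testvol_brick1 \
    gqas006.sbu.lab.eng.bos.redhat.com:/bricks/testvol_brick2 \
    gqas011.sbu.lab.eng.bos.redhat.com:/bricks/testvol_brick3
gluster volume start testvol
gluster volume set testvol features.cache-invalidation on
gluster volume set testvol performance.stat-prefetch off

# shared storage for the ganesha HA cluster, then enable ganesha and export the volume
gluster volume set all cluster.enable-shared-storage enable
gluster nfs-ganesha enable
gluster volume set testvol ganesha.enable on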

Comment 3 Atin Mukherjee 2016-11-30 13:02:53 UTC
This BZ has been taken out of rhgs-3.2.0 as per today's triage exercise.

