Bug 1413350

Summary: [Ganesha] : Subsequent mounts fail and Ganesha crashes (during an attempt to mount) post volume restarts.
Product: [Red Hat Storage] Red Hat Gluster Storage
Reporter: Ambarish <asoman>
Component: nfs-ganesha
Assignee: Daniel Gryniewicz <dang>
Status: CLOSED ERRATA
QA Contact: Ambarish <asoman>
Severity: high
Docs Contact:
Priority: unspecified
Version: rhgs-3.2
CC: aloganat, amukherj, asoman, bturner, dang, ffilz, jthottan, mbenjamin, rcyriac, rhinduja, rhs-bugs, skoduri, storage-qa-internal
Target Milestone: ---
Keywords: Regression
Target Release: RHGS 3.2.0
Hardware: x86_64
OS: Linux
Whiteboard:
Fixed In Version: nfs-ganesha-2.4.1-6
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2017-03-23 06:28:44 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On:
Bug Blocks: 1351528

Description Ambarish 2017-01-15 09:09:27 UTC
Description of problem:
------------------------

4-node Ganesha cluster, 2x2 volume.

Restarted the volume after setting a few options.

Mounts failed after that, and Ganesha crashed on one of the nodes, dumping the following core when a mount was attempted:

***************
BT from gqas006
***************

(gdb) bt
#0  pthread_spin_lock () at ../nptl/sysdeps/x86_64/pthread_spin_lock.S:24
#1  0x00007fb893099ebd in inode_ctx_get0 () from /lib64/libglusterfs.so.0
#2  0x00007fb893099f45 in inode_needs_lookup () from /lib64/libglusterfs.so.0
#3  0x00007fb89336cc86 in __glfs_resolve_inode () from /lib64/libgfapi.so.0
#4  0x00007fb89336cd8b in glfs_resolve_inode () from /lib64/libgfapi.so.0
#5  0x00007fb89336d3f9 in glfs_h_stat () from /lib64/libgfapi.so.0
#6  0x00007fb893788df4 in getattrs (obj_hdl=0x7fb898ce0fe8, attrs=0x7fb85ae87ba0)
    at /usr/src/debug/nfs-ganesha-2.4.1/src/FSAL/FSAL_GLUSTER/handle.c:756
#7  0x00007fb898230a14 in mdcache_refresh_attrs (entry=entry@entry=0x7fb898db0b60, need_acl=<optimized out>)
    at /usr/src/debug/nfs-ganesha-2.4.1/src/FSAL/Stackable_FSALs/FSAL_MDCACHE/mdcache_handle.c:939
#8  0x00007fb89823151a in mdcache_getattrs (obj_hdl=0x7fb898db0b98, attrs_out=0x7fb85ae87d40)
    at /usr/src/debug/nfs-ganesha-2.4.1/src/FSAL/Stackable_FSALs/FSAL_MDCACHE/mdcache_handle.c:1032
#9  0x00007fb8981b6a7f in nfs_SetPostOpAttr (obj=obj@entry=0x7fb898db0b98, Fattr=Fattr@entry=0x7fb7880008c8, 
    attrs=attrs@entry=0x0) at /usr/src/debug/nfs-ganesha-2.4.1/src/Protocols/NFS/nfs_proto_tools.c:91
#10 0x00007fb8981b7fb3 in nfs3_fsinfo (arg=<optimized out>, req=<optimized out>, res=0x7fb7880008c0)
    at /usr/src/debug/nfs-ganesha-2.4.1/src/Protocols/NFS/nfs3_fsinfo.c:133
#11 0x00007fb89817f13c in nfs_rpc_execute (reqdata=reqdata@entry=0x7fb7d40008c0)
    at /usr/src/debug/nfs-ganesha-2.4.1/src/MainNFSD/nfs_worker_thread.c:1281
#12 0x00007fb89818079a in worker_run (ctx=0x7fb898ddfde0)
    at /usr/src/debug/nfs-ganesha-2.4.1/src/MainNFSD/nfs_worker_thread.c:1548
#13 0x00007fb89820a409 in fridgethr_start_routine (arg=0x7fb898ddfde0)
    at /usr/src/debug/nfs-ganesha-2.4.1/src/support/fridgethr.c:550
#14 0x00007fb8966eadc5 in start_thread (arg=0x7fb85ae89700) at pthread_create.c:308
#15 0x00007fb895db973d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:113
(gdb) 



Version-Release number of selected component (if applicable):
--------------------------------------------------------------

nfs-ganesha-gluster-2.4.1-5.el7rhgs.x86_64
glusterfs-ganesha-3.8.4-11.el7rhgs.x86_64


How reproducible:
------------------

2/2 on fresh setups.

Actual results:
----------------

Mounts fail and Ganesha crashes after a volume exported via Ganesha is restarted.

Expected results:
------------------

Mounts should succeed and Ganesha should not crash.

Additional info:
----------------

Client and Server OS : RHEL 7.3

*Vol Config* :

Volume Name: testvol
Type: Distributed-Replicate
Volume ID: 156d084e-2a6e-44e0-b982-750bab037a7d
Status: Started
Snapshot Count: 0
Number of Bricks: 2 x 2 = 4
Transport-type: tcp
Bricks:
Brick1: gqas013.sbu.lab.eng.bos.redhat.com:/bricks/testvol_brick0
Brick2: gqas005.sbu.lab.eng.bos.redhat.com:/bricks/testvol_brick1
Brick3: gqas006.sbu.lab.eng.bos.redhat.com:/bricks/testvol_brick2
Brick4: gqas011.sbu.lab.eng.bos.redhat.com:/bricks/testvol_brick3
Options Reconfigured:
ganesha.enable: on
features.cache-invalidation: off
server.allow-insecure: on
performance.stat-prefetch: off
transport.address-family: inet
performance.readdir-ahead: on
nfs.disable: on
nfs-ganesha: enable
cluster.enable-shared-storage: enable

Comment 4 Ambarish 2017-01-16 09:11:05 UTC
While working on a reproducer, Ganesha crashed on all 4 nodes.
Setup shared with Dev for further RCA.

Comment 5 Arthy Loganathan 2017-01-16 10:32:31 UTC
This issue is not seen with the previous build:

nfs-ganesha-gluster-2.4.1-4.el7rhgs.x86_64
glusterfs-ganesha-3.8.4-11.el7rhgs.x86_64
nfs-ganesha-2.4.1-4.el7rhgs.x86_64

Comment 6 Ambarish 2017-01-16 10:35:44 UTC
Marking it as a regression from 3.1.3 -> 3.2.

More precisely, this regression was introduced between 2.4.1-4 and 2.4.1-5.

Comment 7 Arthy Loganathan 2017-01-16 11:33:41 UTC
Tried in another setup with 

nfs-ganesha-gluster-2.4.1-4.el7rhgs.x86_64
glusterfs-ganesha-3.8.4-11.el7rhgs.x86_64
nfs-ganesha-2.4.1-4.el7rhgs.x86_64

and the issue is reproducible.

Comment 8 Soumya Koduri 2017-01-16 11:48:48 UTC
Thanks Arthy. Not sure which build introduced this regression, but it is definitely not 2.4.1-5. There seems to be a ref leak on the md-cache entry, due to which it is not cleaned up during volume unexport. Hence, when the volume is re-exported with the same export ID, the same md-cache entry is re-used, and it still refers to the old, freed memory (in this case, the glusterfs inode structure).
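
Illustration only (hypothetical names and structures, not the actual gfapi/md-cache code): the minimal C sketch below shows the failure mode described above. A leaked extra reference keeps the cache entry from being unhashed and released at unexport, so a later export with the same export ID finds the stale entry and dereferences memory the backend has already freed.

#include <stdio.h>
#include <stdlib.h>

struct backend_inode { int data; };            /* stands in for the glusterfs inode */

struct cache_entry {
	int refcnt;
	struct backend_inode *inode;           /* points into the backend's memory */
};

static struct cache_entry *cache_by_export_id[16];    /* cache keyed by export ID */

static void entry_put(int export_id)
{
	struct cache_entry *e = cache_by_export_id[export_id];

	if (--e->refcnt == 0) {
		cache_by_export_id[export_id] = NULL;  /* unhash */
		free(e);
	}
}

static void unexport(int export_id, struct backend_inode *inode)
{
	free(inode);             /* the backend tears down its inode */
	entry_put(export_id);    /* drops one reference, but the leaked extra
	                          * reference keeps refcnt above zero, so the
	                          * entry stays cached with a dangling pointer */
}

int main(void)
{
	struct backend_inode *inode = calloc(1, sizeof(*inode));
	struct cache_entry *entry = calloc(1, sizeof(*entry));

	entry->refcnt = 2;       /* 1 for the cache + 1 that is never released */
	entry->inode = inode;
	cache_by_export_id[7] = entry;           /* export with ID 7 */

	unexport(7, inode);

	/* Re-export with the same export ID: the stale entry is found and its
	 * freed inode is dereferenced -- the real crash does the same when
	 * pthread_spin_lock() touches the freed glusterfs inode's lock. */
	struct cache_entry *stale = cache_by_export_id[7];
	printf("stale inode data: %d\n", stale->inode->data);    /* use-after-free */
	return 0;
}

Run under AddressSanitizer or valgrind, the final read is flagged as a use-after-free, which is essentially what the pthread_spin_lock() frame at the top of the backtrace above is hitting.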

Comment 9 Daniel Gryniewicz 2017-01-16 14:02:53 UTC
So quite possibly related to https://bugzilla.redhat.com/show_bug.cgi?id=1413502

Comment 10 Soumya Koduri 2017-01-16 15:31:12 UTC
While debugging this issue using gdb on the QE setup, I found the following in mdcache_unexport():

160		/* Unhash the root object */
161		assert(!cih_remove_checked(root_entry));
162	}

Line 161 never gets processed, because of which root_entry doesn't get unref'ed, resulting in a ref leak. Dan confirmed that this could be the reason for the unexport issue.

Proposed fix by Dan - https://review.gerrithub.io/#/c/343263/
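
The call that unhashes and unrefs the root entry sits inside assert(); since assert() expands to nothing when NDEBUG is defined, the call is silently skipped in such builds. The proposed fix presumably moves the side effect out of the assert. Below is a minimal, self-contained sketch of that pitfall (hypothetical names, not the actual patch; see the review above for the real change):

#include <assert.h>
#include <stdbool.h>
#include <stdio.h>

static int hash_refs = 1;        /* pretend the hash table holds one reference */

/* stand-in for cih_remove_checked(): unhash the entry and drop the hash
 * table's reference, returning false on success */
static bool remove_checked(void)
{
	hash_refs--;
	return false;
}

static void unexport_buggy(void)
{
	/* BUG: with -DNDEBUG the assert() -- and the call inside it -- is
	 * compiled out, so the reference is never dropped. */
	assert(!remove_checked());
}

static void unexport_fixed(void)
{
	/* Keep the side effect outside the assert so it runs in every build;
	 * only the result check is debug-only. */
	bool failed = remove_checked();

	assert(!failed);
	(void)failed;            /* avoid an unused-variable warning with NDEBUG */
}

int main(void)
{
	unexport_buggy();
	printf("after buggy unexport: hash_refs = %d\n", hash_refs);

	hash_refs = 1;
	unexport_fixed();
	printf("after fixed unexport: hash_refs = %d\n", hash_refs);
	return 0;
}

Compiled without NDEBUG, both variants drop the reference; compiled with -DNDEBUG (as release builds often are), the buggy variant leaves hash_refs at 1, i.e. the reference leaks exactly as described in comment 8.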

Comment 11 Arthy Loganathan 2017-01-17 12:57:28 UTC
Hitting the same crash while doing a refresh config on the nfs-ganesha-enabled volume.

(gdb) bt
#0  pthread_spin_lock () at ../nptl/sysdeps/x86_64/pthread_spin_lock.S:24
#1  0x00007f9e5ff88ebd in inode_ctx_get0 () from /lib64/libglusterfs.so.0
#2  0x00007f9e5ff88f45 in inode_needs_lookup () from /lib64/libglusterfs.so.0
#3  0x00007f9e6025bc86 in __glfs_resolve_inode () from /lib64/libgfapi.so.0
#4  0x00007f9e6025bd8b in glfs_resolve_inode () from /lib64/libgfapi.so.0
#5  0x00007f9e6025c3f9 in glfs_h_stat () from /lib64/libgfapi.so.0
#6  0x00007f9e60677df4 in getattrs (obj_hdl=0x7f9d78638558, attrs=0x7f9debfd5d40) at /usr/src/debug/nfs-ganesha-2.4.1/src/FSAL/FSAL_GLUSTER/handle.c:756
#7  0x00007f9e64f0ca14 in mdcache_refresh_attrs (entry=entry@entry=0x7f9d78e508f0, need_acl=<optimized out>)
    at /usr/src/debug/nfs-ganesha-2.4.1/src/FSAL/Stackable_FSALs/FSAL_MDCACHE/mdcache_handle.c:939
#8  0x00007f9e64f0d51a in mdcache_getattrs (obj_hdl=0x7f9d78e50928, attrs_out=0x7f9debfd5fd0) at /usr/src/debug/nfs-ganesha-2.4.1/src/FSAL/Stackable_FSALs/FSAL_MDCACHE/mdcache_handle.c:1032
#9  0x00007f9e64e91e17 in file_To_Fattr (data=data@entry=0x7f9debfd6180, request_mask=1433550, attr=attr@entry=0x7f9debfd5fd0, Fattr=Fattr@entry=0x7f9d78654760, 
    Bitmap=Bitmap@entry=0x7f9d74182e18) at /usr/src/debug/nfs-ganesha-2.4.1/src/Protocols/NFS/nfs_proto_tools.c:3299
#10 0x00007f9e64e6f0c2 in nfs4_op_getattr (op=0x7f9d74182e10, data=0x7f9debfd6180, resp=0x7f9d78654750) at /usr/src/debug/nfs-ganesha-2.4.1/src/Protocols/NFS/nfs4_op_getattr.c:140
#11 0x00007f9e64e69f8d in nfs4_Compound (arg=<optimized out>, req=<optimized out>, res=0x7f9d78e58b20) at /usr/src/debug/nfs-ganesha-2.4.1/src/Protocols/NFS/nfs4_Compound.c:734
#12 0x00007f9e64e5b13c in nfs_rpc_execute (reqdata=reqdata@entry=0x7f9d740008c0) at /usr/src/debug/nfs-ganesha-2.4.1/src/MainNFSD/nfs_worker_thread.c:1281
#13 0x00007f9e64e5c79a in worker_run (ctx=0x7f9e69947f40) at /usr/src/debug/nfs-ganesha-2.4.1/src/MainNFSD/nfs_worker_thread.c:1548
#14 0x00007f9e64ee6409 in fridgethr_start_routine (arg=0x7f9e69947f40) at /usr/src/debug/nfs-ganesha-2.4.1/src/support/fridgethr.c:550
#15 0x00007f9e633c6dc5 in start_thread (arg=0x7f9debfd7700) at pthread_create.c:308
#16 0x00007f9e62a9573d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:113

Comment 13 Ambarish 2017-01-20 07:50:26 UTC
The reported issue was not reproducible on Ganesha 2.4.1-6, Gluster 3.8.4-12.

Verified.

Comment 15 errata-xmlrpc 2017-03-23 06:28:44 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHEA-2017-0493.html