Description of problem:
------------------------

4 node Ganesha cluster, 2*2 volume. Restarted the volume after setting a few options. Mounts failed after that, and Ganesha crashed on one of the nodes, dumping the following core when an attempt to mount was made:

*************** BT from gqas006 ***************

(gdb) bt
#0  pthread_spin_lock () at ../nptl/sysdeps/x86_64/pthread_spin_lock.S:24
#1  0x00007fb893099ebd in inode_ctx_get0 () from /lib64/libglusterfs.so.0
#2  0x00007fb893099f45 in inode_needs_lookup () from /lib64/libglusterfs.so.0
#3  0x00007fb89336cc86 in __glfs_resolve_inode () from /lib64/libgfapi.so.0
#4  0x00007fb89336cd8b in glfs_resolve_inode () from /lib64/libgfapi.so.0
#5  0x00007fb89336d3f9 in glfs_h_stat () from /lib64/libgfapi.so.0
#6  0x00007fb893788df4 in getattrs (obj_hdl=0x7fb898ce0fe8, attrs=0x7fb85ae87ba0) at /usr/src/debug/nfs-ganesha-2.4.1/src/FSAL/FSAL_GLUSTER/handle.c:756
#7  0x00007fb898230a14 in mdcache_refresh_attrs (entry=entry@entry=0x7fb898db0b60, need_acl=<optimized out>) at /usr/src/debug/nfs-ganesha-2.4.1/src/FSAL/Stackable_FSALs/FSAL_MDCACHE/mdcache_handle.c:939
#8  0x00007fb89823151a in mdcache_getattrs (obj_hdl=0x7fb898db0b98, attrs_out=0x7fb85ae87d40) at /usr/src/debug/nfs-ganesha-2.4.1/src/FSAL/Stackable_FSALs/FSAL_MDCACHE/mdcache_handle.c:1032
#9  0x00007fb8981b6a7f in nfs_SetPostOpAttr (obj=obj@entry=0x7fb898db0b98, Fattr=Fattr@entry=0x7fb7880008c8, attrs=attrs@entry=0x0) at /usr/src/debug/nfs-ganesha-2.4.1/src/Protocols/NFS/nfs_proto_tools.c:91
#10 0x00007fb8981b7fb3 in nfs3_fsinfo (arg=<optimized out>, req=<optimized out>, res=0x7fb7880008c0) at /usr/src/debug/nfs-ganesha-2.4.1/src/Protocols/NFS/nfs3_fsinfo.c:133
#11 0x00007fb89817f13c in nfs_rpc_execute (reqdata=reqdata@entry=0x7fb7d40008c0) at /usr/src/debug/nfs-ganesha-2.4.1/src/MainNFSD/nfs_worker_thread.c:1281
#12 0x00007fb89818079a in worker_run (ctx=0x7fb898ddfde0) at /usr/src/debug/nfs-ganesha-2.4.1/src/MainNFSD/nfs_worker_thread.c:1548
#13 0x00007fb89820a409 in fridgethr_start_routine (arg=0x7fb898ddfde0) at /usr/src/debug/nfs-ganesha-2.4.1/src/support/fridgethr.c:550
#14 0x00007fb8966eadc5 in start_thread (arg=0x7fb85ae89700) at pthread_create.c:308
#15 0x00007fb895db973d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:113
(gdb)

Version-Release number of selected component (if applicable):
--------------------------------------------------------------

nfs-ganesha-gluster-2.4.1-5.el7rhgs.x86_64
glusterfs-ganesha-3.8.4-11.el7rhgs.x86_64

How reproducible:
------------------

2/2 on fresh setups.

Actual results:
----------------

Mounts fail and Ganesha crashes after a volume exported via Ganesha is restarted.

Expected results:
------------------

Mounts should succeed and Ganesha should not crash.

Additional info:
----------------

Client and Server OS: RHEL 7.3

*Vol Config*:

Volume Name: testvol
Type: Distributed-Replicate
Volume ID: 156d084e-2a6e-44e0-b982-750bab037a7d
Status: Started
Snapshot Count: 0
Number of Bricks: 2 x 2 = 4
Transport-type: tcp
Bricks:
Brick1: gqas013.sbu.lab.eng.bos.redhat.com:/bricks/testvol_brick0
Brick2: gqas005.sbu.lab.eng.bos.redhat.com:/bricks/testvol_brick1
Brick3: gqas006.sbu.lab.eng.bos.redhat.com:/bricks/testvol_brick2
Brick4: gqas011.sbu.lab.eng.bos.redhat.com:/bricks/testvol_brick3
Options Reconfigured:
ganesha.enable: on
features.cache-invalidation: off
server.allow-insecure: on
performance.stat-prefetch: off
transport.address-family: inet
performance.readdir-ahead: on
nfs.disable: on
nfs-ganesha: enable
cluster.enable-shared-storage: enable
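From the backtrace, the fault is at the very first touch of the inode: inode_ctx_get0() takes the inode's spinlock before reading its context (frame #1 calling into frame #0), so if the glusterfs inode behind the handle was already freed (e.g. torn down by the volume restart), pthread_spin_lock() lands on stale memory. A minimal sketch of that failure mode (assumed structure, not the actual libglusterfs code):

/* spin_uaf.c -- illustrative only; build with: cc -pthread spin_uaf.c */
#include <pthread.h>
#include <stdlib.h>

struct inode_stub {
        pthread_spinlock_t lock;   /* locked before the ctx table is read */
        /* ... ctx table, gfid, refs ... */
};

static void ctx_get0_sketch(struct inode_stub *inode)
{
        pthread_spin_lock(&inode->lock);   /* frame #0 of the BT: first   */
        /* ... read the inode ctx ... */   /* access to the inode memory  */
        pthread_spin_unlock(&inode->lock);
}

int main(void)
{
        struct inode_stub *inode = malloc(sizeof(*inode));
        pthread_spin_init(&inode->lock, PTHREAD_PROCESS_PRIVATE);
        free(inode);              /* inode freed, e.g. on volume restart */
        ctx_get0_sketch(inode);   /* use-after-free: crash like the BT   */
        return 0;
}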
While working on a reproducer, Ganesha crashed on all 4 nodes. The setup has been shared with Dev for further RCA.
This issue is not seen with the previous build:

nfs-ganesha-gluster-2.4.1-4.el7rhgs.x86_64
glusterfs-ganesha-3.8.4-11.el7rhgs.x86_64
nfs-ganesha-2.4.1-4.el7rhgs.x86_64
Marking it as a regression from 3.1.3 -> 3.2. More precisely, this regression was introduced between 2.4.1-4 and 2.4.1-5.
Tried in another setup with:

nfs-ganesha-gluster-2.4.1-4.el7rhgs.x86_64
glusterfs-ganesha-3.8.4-11.el7rhgs.x86_64
nfs-ganesha-2.4.1-4.el7rhgs.x86_64

and the issue is reproducible.
Thanks Arthy. Not sure which build caused this regression, but it is definitely not 2.4.1-5. There seems to be a ref leak on the md-cache entry, due to which it is not cleaned up during volume unexport. Hence, when the volume is re-exported with the same exportID, the same md-cache entry is re-used, and it still refers to the old, freed memory (in this case, the glusterfs inode structure).
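To make that lifecycle concrete, here is a minimal, self-contained sketch (hypothetical names and layout, not the actual nfs-ganesha md-cache code) of how a leaked reference keeps a cache entry alive across unexport, so that re-exporting under the same exportID hands back a pointer into freed memory:

/* stale_export.c -- illustrative only; build with: cc stale_export.c */
#include <stdio.h>
#include <stdlib.h>

struct glfs_inode_stub { int ctx; };          /* freed when glfs is torn down */

struct mdcache_entry_stub {
        struct glfs_inode_stub *inode;        /* FSAL_GLUSTER sub-handle state */
        unsigned refcnt;                      /* entry pinned while refcnt > 0 */
};

#define MAX_EXPORTS 16
static struct mdcache_entry_stub *cache[MAX_EXPORTS];   /* keyed by exportID */

static struct mdcache_entry_stub *export_volume(unsigned export_id)
{
        if (cache[export_id])                 /* stale entry found: re-used! */
                return cache[export_id];

        struct mdcache_entry_stub *e = malloc(sizeof(*e));
        e->inode = malloc(sizeof(*e->inode));
        e->inode->ctx = 42;
        e->refcnt = 2;                        /* the bug: one ref is never dropped */
        cache[export_id] = e;
        return e;
}

static void unexport_volume(unsigned export_id)
{
        struct mdcache_entry_stub *e = cache[export_id];
        free(e->inode);                       /* glfs teardown frees the inode */
        if (--e->refcnt == 0) {               /* leak: never reaches 0, so the */
                cache[export_id] = NULL;      /* entry survives the unexport   */
                free(e);
        }
}

int main(void)
{
        export_volume(77);                                /* initial export          */
        unexport_volume(77);                              /* volume stop / unexport  */
        struct mdcache_entry_stub *e = export_volume(77); /* restart, same exportID  */
        printf("%d\n", e->inode->ctx);                    /* use-after-free, as in   */
        return 0;                                         /* the backtraces above    */
}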
So quite possibly related to https://bugzilla.redhat.com/show_bug.cgi?id=1413502
While debugging this issue using gdb on the QE setup, found that in mdcache_unexport():

160         /* Unhash the root object */
161         assert(!cih_remove_checked(root_entry));
162 }

Line 161 never gets processed, because of which root_entry doesn't get unref'ed, resulting in the ref leak. Dan confirmed that this could be the reason for the unexport issue.

Proposed fix by Dan: https://review.gerrithub.io/#/c/343263/
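One classic way a line like 161 gets skipped (offered as a hedged reading, not confirmed from the patch contents): in C, assert() expands to nothing when NDEBUG is defined, which production builds commonly set, so a call with side effects placed inside assert() is silently compiled out. A small self-contained demo of that pitfall:

/* assert_demo.c -- remove_checked() is a hypothetical stand-in for
 * cih_remove_checked(); it returns false on success, as line 161 expects.
 *
 *   cc assert_demo.c && ./a.out            -> prints "removed"
 *   cc -DNDEBUG assert_demo.c && ./a.out   -> prints nothing: the call
 *                                             was compiled out entirely
 */
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>

static bool remove_checked(void)
{
        puts("removed");   /* the side effect we rely on */
        return false;      /* false == success */
}

int main(void)
{
        assert(!remove_checked());   /* same shape as line 161 above */
        return 0;
}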
Hitting the same crash while doing refresh config on the nfs-ganesha enabled volume.

(gdb) bt
#0  pthread_spin_lock () at ../nptl/sysdeps/x86_64/pthread_spin_lock.S:24
#1  0x00007f9e5ff88ebd in inode_ctx_get0 () from /lib64/libglusterfs.so.0
#2  0x00007f9e5ff88f45 in inode_needs_lookup () from /lib64/libglusterfs.so.0
#3  0x00007f9e6025bc86 in __glfs_resolve_inode () from /lib64/libgfapi.so.0
#4  0x00007f9e6025bd8b in glfs_resolve_inode () from /lib64/libgfapi.so.0
#5  0x00007f9e6025c3f9 in glfs_h_stat () from /lib64/libgfapi.so.0
#6  0x00007f9e60677df4 in getattrs (obj_hdl=0x7f9d78638558, attrs=0x7f9debfd5d40) at /usr/src/debug/nfs-ganesha-2.4.1/src/FSAL/FSAL_GLUSTER/handle.c:756
#7  0x00007f9e64f0ca14 in mdcache_refresh_attrs (entry=entry@entry=0x7f9d78e508f0, need_acl=<optimized out>) at /usr/src/debug/nfs-ganesha-2.4.1/src/FSAL/Stackable_FSALs/FSAL_MDCACHE/mdcache_handle.c:939
#8  0x00007f9e64f0d51a in mdcache_getattrs (obj_hdl=0x7f9d78e50928, attrs_out=0x7f9debfd5fd0) at /usr/src/debug/nfs-ganesha-2.4.1/src/FSAL/Stackable_FSALs/FSAL_MDCACHE/mdcache_handle.c:1032
#9  0x00007f9e64e91e17 in file_To_Fattr (data=data@entry=0x7f9debfd6180, request_mask=1433550, attr=attr@entry=0x7f9debfd5fd0, Fattr=Fattr@entry=0x7f9d78654760, Bitmap=Bitmap@entry=0x7f9d74182e18) at /usr/src/debug/nfs-ganesha-2.4.1/src/Protocols/NFS/nfs_proto_tools.c:3299
#10 0x00007f9e64e6f0c2 in nfs4_op_getattr (op=0x7f9d74182e10, data=0x7f9debfd6180, resp=0x7f9d78654750) at /usr/src/debug/nfs-ganesha-2.4.1/src/Protocols/NFS/nfs4_op_getattr.c:140
#11 0x00007f9e64e69f8d in nfs4_Compound (arg=<optimized out>, req=<optimized out>, res=0x7f9d78e58b20) at /usr/src/debug/nfs-ganesha-2.4.1/src/Protocols/NFS/nfs4_Compound.c:734
#12 0x00007f9e64e5b13c in nfs_rpc_execute (reqdata=reqdata@entry=0x7f9d740008c0) at /usr/src/debug/nfs-ganesha-2.4.1/src/MainNFSD/nfs_worker_thread.c:1281
#13 0x00007f9e64e5c79a in worker_run (ctx=0x7f9e69947f40) at /usr/src/debug/nfs-ganesha-2.4.1/src/MainNFSD/nfs_worker_thread.c:1548
#14 0x00007f9e64ee6409 in fridgethr_start_routine (arg=0x7f9e69947f40) at /usr/src/debug/nfs-ganesha-2.4.1/src/support/fridgethr.c:550
#15 0x00007f9e633c6dc5 in start_thread (arg=0x7f9debfd7700) at pthread_create.c:308
#16 0x00007f9e62a9573d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:113
The reported issue was not reproducible with Ganesha 2.4.1-6 and Gluster 3.8.4-12. Verified.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://rhn.redhat.com/errata/RHEA-2017-0493.html