Description of problem:
----------------------
4-node cluster with a 2x2 volume. The volume is mounted via NFSv3 and NFSv4 on 7 clients, and I/O (dd and tarball untar) is driven from all the mounts. About 1.5 hours into the workload, Ganesha crashed and dumped core on 3 of the 4 nodes. Since pacemaker quorum was lost, all I/O hung at the mount points.

The signature of the backtrace is different from the one reported in https://bugzilla.redhat.com/show_bug.cgi?id=1398921

********** On gqas009 **********

(gdb) bt
#0  remove_recolour (head=head@entry=0x7f0fa4006040, parent=0x7f1094068e00, node=<optimized out>) at /usr/src/debug/ntirpc-1.4.3/src/rbtree.c:331
#1  0x00007f123956cc63 in opr_rbtree_remove (head=head@entry=0x7f0fa4006040, node=<optimized out>, node@entry=0x7f115c024150) at /usr/src/debug/ntirpc-1.4.3/src/rbtree.c:453
#2  0x00007f123b4ba591 in rbtree_x_cached_remove (hk=<optimized out>, nk=0x7f115c024150, t=0x7f0fa4005f90, xt=0x7f0fa40010e8) at /usr/include/ntirpc/misc/rbtree_x.h:154
#3  nfs_dupreq_finish (req=req@entry=0x7f101c81b328, res_nfs=res_nfs@entry=0x7f0ef0012cc0) at /usr/src/debug/nfs-ganesha-2.4.1/src/RPCAL/nfs_dupreq.c:1123
#4  0x00007f123b4402a7 in nfs_rpc_execute (reqdata=reqdata@entry=0x7f101c81b300) at /usr/src/debug/nfs-ganesha-2.4.1/src/MainNFSD/nfs_worker_thread.c:1358
#5  0x00007f123b44178a in worker_run (ctx=0x7f123c9fcac0) at /usr/src/debug/nfs-ganesha-2.4.1/src/MainNFSD/nfs_worker_thread.c:1548
#6  0x00007f123b4cb189 in fridgethr_start_routine (arg=0x7f123c9fcac0) at /usr/src/debug/nfs-ganesha-2.4.1/src/support/fridgethr.c:550
#7  0x00007f12399abdc5 in start_thread () from /lib64/libpthread.so.0
#8  0x00007f123907a73d in clone () from /lib64/libc.so.6

*********** On gqas015 ***********

(gdb) bt
#0  0x00007fd4d52811d7 in raise () from /lib64/libc.so.6
#1  0x00007fd4d52828c8 in abort () from /lib64/libc.so.6
#2  0x00007fd4d52c0f07 in __libc_message () from /lib64/libc.so.6
#3  0x00007fd4d52c8503 in _int_free () from /lib64/libc.so.6
#4  0x00007fd43342f8d7 in wb_forget (this=<optimized out>, inode=<optimized out>) at write-behind.c:2258
#5  0x00007fd44a09e471 in __inode_ctx_free (inode=inode@entry=0x7fd42331480c) at inode.c:332
#6  0x00007fd44a09f652 in __inode_destroy (inode=0x7fd42331480c) at inode.c:353
#7  inode_table_prune (table=table@entry=0x7fd42c002420) at inode.c:1543
#8  0x00007fd44a09f934 in inode_unref (inode=0x7fd42331480c) at inode.c:524
#9  0x00007fd44a3773b6 in pub_glfs_h_close (object=0x7fd14802f610) at glfs-handleops.c:1365
#10 0x00007fd44a790a59 in handle_release (obj_hdl=0x7fd14802f318) at /usr/src/debug/nfs-ganesha-2.4.1/src/FSAL/FSAL_GLUSTER/handle.c:71
#11 0x00007fd4d77b4812 in mdcache_lru_clean (entry=0x7fd1480d0860) at /usr/src/debug/nfs-ganesha-2.4.1/src/FSAL/Stackable_FSALs/FSAL_MDCACHE/mdcache_lru.c:421
#12 mdcache_lru_get (entry=entry@entry=0x7fd4aaa5bd18) at /usr/src/debug/nfs-ganesha-2.4.1/src/FSAL/Stackable_FSALs/FSAL_MDCACHE/mdcache_lru.c:1201
#13 0x00007fd4d77bec7e in mdcache_alloc_handle (fs=0x0, sub_handle=0x7fd2e803abd8, export=0x7fd4440d2130) at /usr/src/debug/nfs-ganesha-2.4.1/src/FSAL/Stackable_FSALs/FSAL_MDCACHE/mdcache_helpers.c:117
#14 mdcache_new_entry (export=export@entry=0x7fd4440d2130, sub_handle=0x7fd2e803abd8, attrs_in=attrs_in@entry=0x7fd4aaa5be70, attrs_out=attrs_out@entry=0x0, new_directory=new_directory@entry=false, entry=entry@entry=0x7fd4aaa5bdd0, state=state@entry=0x0) at /usr/src/debug/nfs-ganesha-2.4.1/src/FSAL/Stackable_FSALs/FSAL_MDCACHE/mdcache_helpers.c:411
#15 0x00007fd4d77b86b4 in mdcache_alloc_and_check_handle (export=export@entry=0x7fd4440d2130, sub_handle=<optimized out>, new_obj=new_obj@entry=0x7fd4aaa5be68, new_directory=new_directory@entry=false, attrs_in=attrs_in@entry=0x7fd4aaa5be70, attrs_out=attrs_out@entry=0x0, tag=tag@entry=0x7fd4d77edb84 "lookup ", parent=parent@entry=0x7fd17c0ce920, name=name@entry=0x7fd2e8010d30 ".gitignore", invalidate=invalidate@entry=true, state=state@entry=0x0) at /usr/src/debug/nfs-ganesha-2.4.1/src/FSAL/Stackable_FSALs/FSAL_MDCACHE/mdcache_handle.c:93
#16 0x00007fd4d77bfefa in mdc_lookup_uncached (mdc_parent=mdc_parent@entry=0x7fd17c0ce920, name=name@entry=0x7fd2e8010d30 ".gitignore", new_entry=new_entry@entry=0x7fd4aaa5c010, attrs_out=attrs_out@entry=0x0) at /usr/src/debug/nfs-ganesha-2.4.1/src/FSAL/Stackable_FSALs/FSAL_MDCACHE/mdcache_helpers.c:1041
#17 0x00007fd4d77c02cd in mdc_lookup (mdc_parent=0x7fd17c0ce920, name=0x7fd2e8010d30 ".gitignore", uncached=uncached@entry=true, new_entry=new_entry@entry=0x7fd4aaa5c010, attrs_out=0x0) at /usr/src/debug/nfs-ganesha-2.4.1/src/FSAL/Stackable_FSALs/FSAL_MDCACHE/mdcache_helpers.c:985
#18 0x00007fd4d77b79eb in mdcache_lookup (parent=<optimized out>, name=<optimized out>, handle=0x7fd4aaa5c098, attrs_out=<optimized out>) at /usr/src/debug/nfs-ganesha-2.4.1/src/FSAL/Stackable_FSALs/FSAL_MDCACHE/mdcache_handle.c:166
#19 0x00007fd4d76efc97 in fsal_lookup (parent=0x7fd17c0ce958, name=0x7fd2e8010d30 ".gitignore", obj=obj@entry=0x7fd4aaa5c098, attrs_out=attrs_out@entry=0x0) at /usr/src/debug/nfs-ganesha-2.4.1/src/FSAL/fsal_helper.c:712
#20 0x00007fd4d7723636 in nfs4_op_lookup (op=<optimized out>, data=0x7fd4aaa5c180, resp=0x7fd2e801cc70) at /usr/src/debug/nfs-ganesha-2.4.1/src/Protocols/NFS/nfs4_op_lookup.c:106
#21 0x00007fd4d7717f7d in nfs4_Compound (arg=<optimized out>, req=<optimized out>, res=0x7fd2e804eb30) at /usr/src/debug/nfs-ganesha-2.4.1/src/Protocols/NFS/nfs4_Compound.c:734
#22 0x00007fd4d770912c in nfs_rpc_execute (reqdata=reqdata@entry=0x7fd26c01c050) at /usr/src/debug/nfs-ganesha-2.4.1/src/MainNFSD/nfs_worker_thread.c:1281
#23 0x00007fd4d770a78a in worker_run (ctx=0x7fd4d7c814c0) at /usr/src/debug/nfs-ganesha-2.4.1/src/MainNFSD/nfs_worker_thread.c:1548
#24 0x00007fd4d7794189 in fridgethr_start_routine (arg=0x7fd4d7c814c0) at /usr/src/debug/nfs-ganesha-2.4.1/src/support/fridgethr.c:550
#25 0x00007fd4d5c74dc5 in start_thread () from /lib64/libpthread.so.0
#26 0x00007fd4d534373d in clone () from /lib64/libc.so.6

*********** On gqas014 ***********

(gdb) bt
#0  0x00007fc652ec11d7 in raise () from /lib64/libc.so.6
#1  0x00007fc652ec28c8 in abort () from /lib64/libc.so.6
#2  0x00007fc652f00f07 in __libc_message () from /lib64/libc.so.6
#3  0x00007fc652f08503 in _int_free () from /lib64/libc.so.6
#4  0x00007fc6553c3522 in gsh_free (p=<optimized out>) at /usr/src/debug/nfs-ganesha-2.4.1/src/include/abstract_mem.h:271
#5  pool_free (pool=<optimized out>, object=<optimized out>) at /usr/src/debug/nfs-ganesha-2.4.1/src/include/abstract_mem.h:420
#6  free_nfs_res (res=<optimized out>) at /usr/src/debug/nfs-ganesha-2.4.1/src/include/nfs_dupreq.h:125
#7  nfs_dupreq_free_dupreq (dv=0x7fc40c22e830) at /usr/src/debug/nfs-ganesha-2.4.1/src/RPCAL/nfs_dupreq.c:784
#8  nfs_dupreq_finish (req=req@entry=0x7fc5880008e8, res_nfs=res_nfs@entry=0x7fc47403a280) at /usr/src/debug/nfs-ganesha-2.4.1/src/RPCAL/nfs_dupreq.c:1133
#9  0x00007fc6553492a7 in nfs_rpc_execute (reqdata=reqdata@entry=0x7fc5880008c0) at /usr/src/debug/nfs-ganesha-2.4.1/src/MainNFSD/nfs_worker_thread.c:1358
#10 0x00007fc65534a78a in worker_run (ctx=0x7fc6556d4ec0) at /usr/src/debug/nfs-ganesha-2.4.1/src/MainNFSD/nfs_worker_thread.c:1548
#11 0x00007fc6553d4189 in fridgethr_start_routine (arg=0x7fc6556d4ec0) at /usr/src/debug/nfs-ganesha-2.4.1/src/support/fridgethr.c:550
#12 0x00007fc6538b4dc5 in start_thread () from /lib64/libpthread.so.0
#13 0x00007fc652f8373d in clone () from /lib64/libc.so.6

Version-Release number of selected component (if applicable):
--------------------------------------------------------------
glusterfs-ganesha-3.8.4-5.el7rhgs.x86_64
nfs-ganesha-2.4.1-1.el7rhgs.x86_64

How reproducible:
-----------------
Reporting the first occurrence.

Steps to Reproduce:
-------------------
1. Create a 4-node cluster and mount the volume via NFSv3 and NFSv4 on the clients.
2. Pump I/O (dd and tarball untar) from all the mounts.
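The step-2 workload can be sketched roughly as below. This is a minimal illustration, not the exact script used during the test: the mount-point list, tarball path, and dd sizes are all placeholders.

```shell
#!/bin/sh
# Hypothetical sketch of the step-2 workload: a dd stream plus a tarball
# untar running in parallel on each NFS mount point.
run_workload() {
    # $1: space-separated list of mount points (split intentionally)
    # $2: path to a tarball to extract on each mount
    for m in $1; do
        # Sequential writes via dd (size is a placeholder)
        dd if=/dev/zero of="$m/ddfile.$$" bs=1M count=8 2>/dev/null &
        # Metadata-heavy untar in parallel
        mkdir -p "$m/untar.$$" && tar -xzf "$2" -C "$m/untar.$$" &
    done
    wait
}

# Example invocation (paths are placeholders, not from the report):
# run_workload "/mnt/nfs3 /mnt/nfs4" /tmp/kernel-source.tar.gz
```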
Actual results:
---------------
Ganesha crashes on 3 of the 4 nodes. I/O hangs at the mount points because pacemaker quorum is lost.

Expected results:
-----------------
No crashes.

Additional info:
----------------
OS: RHEL 7.3

*Vol Config*:

Volume Name: testvol
Type: Distributed-Replicate
Volume ID: db9c8fe1-375d-4375-955b-f8291af4f931
Status: Started
Snapshot Count: 0
Number of Bricks: 2 x 2 = 4
Transport-type: tcp
Bricks:
Brick1: gqas014.sbu.lab.eng.bos.redhat.com:/bricks/testvol_brick0
Brick2: gqas009.sbu.lab.eng.bos.redhat.com:/bricks/testvol_brick1
Brick3: gqas010.sbu.lab.eng.bos.redhat.com:/bricks/testvol_brick2
Brick4: gqas015.sbu.lab.eng.bos.redhat.com:/bricks/testvol_brick3
Options Reconfigured:
ganesha.enable: on
features.cache-invalidation: on
server.allow-insecure: on
performance.stat-prefetch: off
transport.address-family: inet
performance.readdir-ahead: on
nfs.disable: on
nfs-ganesha: enable
cluster.enable-shared-storage: enable
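For anyone trying to recreate the layout above, the 2 x 2 volume corresponds roughly to the following gluster CLI commands. This is a sketch, not the exact commands used; it assumes a working 4-node trusted storage pool and the Ganesha HA setup already in place, so it is not runnable standalone.

```shell
# Sketch only: recreate the 2 x 2 distributed-replicate volume described
# in "Vol Config" above (brick paths taken from the report).
gluster volume create testvol replica 2 \
    gqas014.sbu.lab.eng.bos.redhat.com:/bricks/testvol_brick0 \
    gqas009.sbu.lab.eng.bos.redhat.com:/bricks/testvol_brick1 \
    gqas010.sbu.lab.eng.bos.redhat.com:/bricks/testvol_brick2 \
    gqas015.sbu.lab.eng.bos.redhat.com:/bricks/testvol_brick3
gluster volume start testvol
gluster volume set testvol features.cache-invalidation on
# Export the volume through NFS-Ganesha
gluster volume set testvol ganesha.enable on
```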
This is the same backtrace as bug #1401160.
There is reason to suspect this bug has the same root cause as https://bugzilla.redhat.com/show_bug.cgi?id=1398846 (that bug has been updated with the proposed fix from upstream).
The reported issue was not reproducible on Ganesha 2.4.1-6, Gluster 3.8.4-12 in two attempts. Will reopen if it is hit again during regression runs.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://rhn.redhat.com/errata/RHEA-2017-0493.html
The needinfo request(s) on this closed bug have been removed, as they have been unresolved for 1000 days.