Description of problem:
-----------------------
*This is to track one of the BTs seen in https://bugzilla.redhat.com/show_bug.cgi?id=1401182, possibly in the write-behind (WB) layer*:

4-node cluster with a 2x2 volume. The volume is mounted via NFSv3 and NFSv4 on 7 clients, and I/O (dd and tarball untar) is pumped from all the mounts. About 1.5 hours into the workload, Ganesha crashed on 3 of the 4 nodes and dumped core. Since pacemaker quorum was lost, all I/O hung at the mount points.

(gdb) bt
#0  0x00007fd4d52811d7 in raise () from /lib64/libc.so.6
#1  0x00007fd4d52828c8 in abort () from /lib64/libc.so.6
#2  0x00007fd4d52c0f07 in __libc_message () from /lib64/libc.so.6
#3  0x00007fd4d52c8503 in _int_free () from /lib64/libc.so.6
#4  0x00007fd43342f8d7 in wb_forget (this=<optimized out>, inode=<optimized out>) at write-behind.c:2258
#5  0x00007fd44a09e471 in __inode_ctx_free (inode=inode@entry=0x7fd42331480c) at inode.c:332
#6  0x00007fd44a09f652 in __inode_destroy (inode=0x7fd42331480c) at inode.c:353
#7  inode_table_prune (table=table@entry=0x7fd42c002420) at inode.c:1543
#8  0x00007fd44a09f934 in inode_unref (inode=0x7fd42331480c) at inode.c:524
#9  0x00007fd44a3773b6 in pub_glfs_h_close (object=0x7fd14802f610) at glfs-handleops.c:1365
#10 0x00007fd44a790a59 in handle_release (obj_hdl=0x7fd14802f318) at /usr/src/debug/nfs-ganesha-2.4.1/src/FSAL/FSAL_GLUSTER/handle.c:71
#11 0x00007fd4d77b4812 in mdcache_lru_clean (entry=0x7fd1480d0860) at /usr/src/debug/nfs-ganesha-2.4.1/src/FSAL/Stackable_FSALs/FSAL_MDCACHE/mdcache_lru.c:421
#12 mdcache_lru_get (entry=entry@entry=0x7fd4aaa5bd18) at /usr/src/debug/nfs-ganesha-2.4.1/src/FSAL/Stackable_FSALs/FSAL_MDCACHE/mdcache_lru.c:1201
#13 0x00007fd4d77bec7e in mdcache_alloc_handle (fs=0x0, sub_handle=0x7fd2e803abd8, export=0x7fd4440d2130) at /usr/src/debug/nfs-ganesha-2.4.1/src/FSAL/Stackable_FSALs/FSAL_MDCACHE/mdcache_helpers.c:117
#14 mdcache_new_entry (export=export@entry=0x7fd4440d2130, sub_handle=0x7fd2e803abd8, attrs_in=attrs_in@entry=0x7fd4aaa5be70, attrs_out=attrs_out@entry=0x0, new_directory=new_directory@entry=false, entry=entry@entry=0x7fd4aaa5bdd0, state=state@entry=0x0) at /usr/src/debug/nfs-ganesha-2.4.1/src/FSAL/Stackable_FSALs/FSAL_MDCACHE/mdcache_helpers.c:411
#15 0x00007fd4d77b86b4 in mdcache_alloc_and_check_handle (export=export@entry=0x7fd4440d2130, sub_handle=<optimized out>, new_obj=new_obj@entry=0x7fd4aaa5be68, new_directory=new_directory@entry=false, attrs_in=attrs_in@entry=0x7fd4aaa5be70, attrs_out=attrs_out@entry=0x0, tag=tag@entry=0x7fd4d77edb84 "lookup ", parent=parent@entry=0x7fd17c0ce920, name=name@entry=0x7fd2e8010d30 ".gitignore", invalidate=invalidate@entry=true, state=state@entry=0x0) at /usr/src/debug/nfs-ganesha-2.4.1/src/FSAL/Stackable_FSALs/FSAL_MDCACHE/mdcache_handle.c:93
#16 0x00007fd4d77bfefa in mdc_lookup_uncached (mdc_parent=mdc_parent@entry=0x7fd17c0ce920, name=name@entry=0x7fd2e8010d30 ".gitignore", new_entry=new_entry@entry=0x7fd4aaa5c010, attrs_out=attrs_out@entry=0x0) at /usr/src/debug/nfs-ganesha-2.4.1/src/FSAL/Stackable_FSALs/FSAL_MDCACHE/mdcache_helpers.c:1041
#17 0x00007fd4d77c02cd in mdc_lookup (mdc_parent=0x7fd17c0ce920, name=0x7fd2e8010d30 ".gitignore", uncached=uncached@entry=true, new_entry=new_entry@entry=0x7fd4aaa5c010, attrs_out=0x0) at /usr/src/debug/nfs-ganesha-2.4.1/src/FSAL/Stackable_FSALs/FSAL_MDCACHE/mdcache_helpers.c:985
#18 0x00007fd4d77b79eb in mdcache_lookup (parent=<optimized out>, name=<optimized out>, handle=0x7fd4aaa5c098, attrs_out=<optimized out>) at /usr/src/debug/nfs-ganesha-2.4.1/src/FSAL/Stackable_FSALs/FSAL_MDCACHE/mdcache_handle.c:166
#19 0x00007fd4d76efc97 in fsal_lookup (parent=0x7fd17c0ce958, name=0x7fd2e8010d30 ".gitignore", obj=obj@entry=0x7fd4aaa5c098, attrs_out=attrs_out@entry=0x0) at /usr/src/debug/nfs-ganesha-2.4.1/src/FSAL/fsal_helper.c:712
#20 0x00007fd4d7723636 in nfs4_op_lookup (op=<optimized out>, data=0x7fd4aaa5c180, resp=0x7fd2e801cc70) at /usr/src/debug/nfs-ganesha-2.4.1/src/Protocols/NFS/nfs4_op_lookup.c:106
#21 0x00007fd4d7717f7d in nfs4_Compound (arg=<optimized out>, req=<optimized out>, res=0x7fd2e804eb30) at /usr/src/debug/nfs-ganesha-2.4.1/src/Protocols/NFS/nfs4_Compound.c:734
#22 0x00007fd4d770912c in nfs_rpc_execute (reqdata=reqdata@entry=0x7fd26c01c050) at /usr/src/debug/nfs-ganesha-2.4.1/src/MainNFSD/nfs_worker_thread.c:1281
#23 0x00007fd4d770a78a in worker_run (ctx=0x7fd4d7c814c0) at /usr/src/debug/nfs-ganesha-2.4.1/src/MainNFSD/nfs_worker_thread.c:1548
#24 0x00007fd4d7794189 in fridgethr_start_routine (arg=0x7fd4d7c814c0) at /usr/src/debug/nfs-ganesha-2.4.1/src/support/fridgethr.c:550
#25 0x00007fd4d5c74dc5 in start_thread () from /lib64/libpthread.so.0
#26 0x00007fd4d534373d in clone () from /lib64/libc.so.6
(gdb)

Version-Release number of selected component (if applicable):
-------------------------------------------------------------
glusterfs-ganesha-3.8.4-5.el7rhgs.x86_64
nfs-ganesha-2.4.1-1.el7rhgs.x86_64

How reproducible:
------------------
1/1

Steps to Reproduce:
------------------
1. Create a 4-node cluster and mount the volume via v3 and v4 on the clients.
2. Pump I/O (dd and tarball untar) from all the mounts.

Actual results:
---------------
Ganesha crashes on 3 of the 4 nodes.
Expected results:
----------------
No crashes.

Additional info:
----------------
OS: RHEL 7.3

*Vol Config*:

Volume Name: testvol
Type: Distributed-Replicate
Volume ID: db9c8fe1-375d-4375-955b-f8291af4f931
Status: Started
Snapshot Count: 0
Number of Bricks: 2 x 2 = 4
Transport-type: tcp
Bricks:
Brick1: gqas014.sbu.lab.eng.bos.redhat.com:/bricks/testvol_brick0
Brick2: gqas009.sbu.lab.eng.bos.redhat.com:/bricks/testvol_brick1
Brick3: gqas010.sbu.lab.eng.bos.redhat.com:/bricks/testvol_brick2
Brick4: gqas015.sbu.lab.eng.bos.redhat.com:/bricks/testvol_brick3
Options Reconfigured:
ganesha.enable: on
features.cache-invalidation: on
server.allow-insecure: on
performance.stat-prefetch: off
transport.address-family: inet
performance.readdir-ahead: on
nfs.disable: on
nfs-ganesha: enable
cluster.enable-shared-storage: enable

More info on https://bugzilla.redhat.com/show_bug.cgi?id=1401182

Sosreport and core here: http://rhsqe-repo.lab.eng.blr.redhat.com/sosreports/1401182
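For reference, the client-side workload described above (dd plus tarball untar on the NFS mounts) can be sketched roughly as below. This is a minimal, hedged sketch: the mount point path, file sizes, and file names are assumptions, the real run used 7 clients hammering the mounts in a loop, and the script falls back to a temporary directory when MNT is unset so it can be exercised standalone.

```shell
# MNT would normally point at the NFSv3/v4 mount of testvol on a client;
# default to a scratch temp dir so the sketch runs without a cluster.
MNT="${MNT:-$(mktemp -d)}"

# Sequential-write load, standing in for the dd portion of the workload.
dd if=/dev/zero of="$MNT/ddfile" bs=1M count=8 2>/dev/null

# Tarball-untar load: build a small tarball, then extract it on the mount.
# (The actual run would repeat these steps continuously from every client.)
mkdir -p "$MNT/src"
echo "payload" > "$MNT/src/file.txt"
tar -czf "$MNT/sample.tar.gz" -C "$MNT" src
mkdir -p "$MNT/untar"
tar -xzf "$MNT/sample.tar.gz" -C "$MNT/untar"

ls -l "$MNT"
```

Running several such loops in parallel per client approximates the metadata churn (lookups, creates, inode forgets) visible in the backtrace above.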
The reported issue was not reproducible on Ganesha 2.4.1-6 / Gluster 3.8.4-12 in two tries. Will reopen if hit again during regressions.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://rhn.redhat.com/errata/RHEA-2017-0493.html