Description of problem:
-----------------------
4-node Ganesha cluster. 3 volumes: two 1*1 and one 1*2. 7 clients mount the three volumes (one mount per client, via v3 or v4 at random).

*Workload Details* - Linux tarball untar, dd.

Almost 2.5 hours into the workload, Ganesha crashed on all the nodes one after the other. Since pacemaker quorum was eventually lost, I/O on all clients was halted.

BT from cores:

************ On gqas009 ************

(gdb) bt
#0  0x00007f9cfe31b46f in __inode_ctx_free (inode=inode@entry=0x7f9c9e8d4138) at inode.c:332
#1  0x00007f9cfe31c652 in __inode_destroy (inode=0x7f9c9e8d4138) at inode.c:353
#2  inode_table_prune (table=table@entry=0x7f9ca806a640) at inode.c:1543
#3  0x00007f9cfe31c934 in inode_unref (inode=0x7f9c9e8d4138) at inode.c:524
#4  0x00007f9cfe5f43b6 in pub_glfs_h_close (object=0x7f9c28010ae0) at glfs-handleops.c:1365
#5  0x00007f9d04361a59 in handle_release (obj_hdl=0x7f9c280598c8) at /usr/src/debug/nfs-ganesha-2.4.1/src/FSAL/FSAL_GLUSTER/handle.c:71
#6  0x00007f9d90383812 in mdcache_lru_clean (entry=0x7f9bb80040f0) at /usr/src/debug/nfs-ganesha-2.4.1/src/FSAL/Stackable_FSALs/FSAL_MDCACHE/mdcache_lru.c:421
#7  mdcache_lru_get (entry=entry@entry=0x7f9d6fe43d18) at /usr/src/debug/nfs-ganesha-2.4.1/src/FSAL/Stackable_FSALs/FSAL_MDCACHE/mdcache_lru.c:1201
#8  0x00007f9d9038dc7e in mdcache_alloc_handle (fs=0x0, sub_handle=0x7f9bbc2d7328, export=0x7f9d001382e0) at /usr/src/debug/nfs-ganesha-2.4.1/src/FSAL/Stackable_FSALs/FSAL_MDCACHE/mdcache_helpers.c:117
#9  mdcache_new_entry (export=export@entry=0x7f9d001382e0, sub_handle=0x7f9bbc2d7328, attrs_in=attrs_in@entry=0x7f9d6fe43e70, attrs_out=attrs_out@entry=0x0, new_directory=new_directory@entry=false, entry=entry@entry=0x7f9d6fe43dd0, state=state@entry=0x0) at /usr/src/debug/nfs-ganesha-2.4.1/src/FSAL/Stackable_FSALs/FSAL_MDCACHE/mdcache_helpers.c:411
#10 0x00007f9d903876b4 in mdcache_alloc_and_check_handle (export=export@entry=0x7f9d001382e0, sub_handle=<optimized out>, new_obj=new_obj@entry=0x7f9d6fe43e68, new_directory=new_directory@entry=false, attrs_in=attrs_in@entry=0x7f9d6fe43e70, attrs_out=attrs_out@entry=0x0, tag=tag@entry=0x7f9d903bcb84 "lookup ", parent=parent@entry=0x7f9b8802b9e0, name=name@entry=0x7f9bbc33fe20 "types.h", invalidate=invalidate@entry=true, state=state@entry=0x0) at /usr/src/debug/nfs-ganesha-2.4.1/src/FSAL/Stackable_FSALs/FSAL_MDCACHE/mdcache_handle.c:93
#11 0x00007f9d9038eefa in mdc_lookup_uncached (mdc_parent=mdc_parent@entry=0x7f9b8802b9e0, name=name@entry=0x7f9bbc33fe20 "types.h", new_entry=new_entry@entry=0x7f9d6fe44010, attrs_out=attrs_out@entry=0x0) at /usr/src/debug/nfs-ganesha-2.4.1/src/FSAL/Stackable_FSALs/FSAL_MDCACHE/mdcache_helpers.c:1041
#12 0x00007f9d9038f2cd in mdc_lookup (mdc_parent=0x7f9b8802b9e0, name=0x7f9bbc33fe20 "types.h", uncached=uncached@entry=true, new_entry=new_entry@entry=0x7f9d6fe44010, attrs_out=0x0) at /usr/src/debug/nfs-ganesha-2.4.1/src/FSAL/Stackable_FSALs/FSAL_MDCACHE/mdcache_helpers.c:985
#13 0x00007f9d903869eb in mdcache_lookup (parent=<optimized out>, name=<optimized out>, handle=0x7f9d6fe44098, attrs_out=<optimized out>) at /usr/src/debug/nfs-ganesha-2.4.1/src/FSAL/Stackable_FSALs/FSAL_MDCACHE/mdcache_handle.c:166
#14 0x00007f9d902bec97 in fsal_lookup (parent=0x7f9b8802ba18, name=0x7f9bbc33fe20 "types.h", obj=obj@entry=0x7f9d6fe44098, attrs_out=attrs_out@entry=0x0) at /usr/src/debug/nfs-ganesha-2.4.1/src/FSAL/fsal_helper.c:712
#15 0x00007f9d902f2636 in nfs4_op_lookup (op=<optimized out>, data=0x7f9d6fe44180, resp=0x7f9bbc0fbee0) at /usr/src/debug/nfs-ganesha-2.4.1/src/Protocols/NFS/nfs4_op_lookup.c:106
#16 0x00007f9d902e6f7d in nfs4_Compound (arg=<optimized out>, req=<optimized out>, res=0x7f9bbc1a5360) at /usr/src/debug/nfs-ganesha-2.4.1/src/Protocols/NFS/nfs4_Compound.c:734
#17 0x00007f9d902d812c in nfs_rpc_execute (reqdata=reqdata@entry=0x7f9c740d2480) at /usr/src/debug/nfs-ganesha-2.4.1/src/MainNFSD/nfs_worker_thread.c:1281
#18 0x00007f9d902d978a in worker_run (ctx=0x7f9d90bdae00) at /usr/src/debug/nfs-ganesha-2.4.1/src/MainNFSD/nfs_worker_thread.c:1548
#19 0x00007f9d90363189 in fridgethr_start_routine (arg=0x7f9d90bdae00) at /usr/src/debug/nfs-ganesha-2.4.1/src/support/fridgethr.c:550
#20 0x00007f9d8e843dc5 in start_thread () from /lib64/libpthread.so.0
#21 0x00007f9d8df1273d in clone () from /lib64/libc.so.6
(gdb)

************ On gqas010 ************

(gdb) bt
#0  0x00007f20935afe6d in default_unlink (frame=0x7f202e948e8c, this=0x7f2020007590, loc=0x7f202e17c678, flags=0, xdata=0x0) at defaults.c:2832
#1  0x00007f20935afe74 in default_unlink (frame=0x7f202e948e8c, this=0x7f2020008410, loc=0x7f202e17c678, flags=0, xdata=0x0) at defaults.c:2832
#2  0x00007f20935afe74 in default_unlink (frame=0x7f202e948e8c, this=0x7f2020009180, loc=0x7f202e17c678, flags=0, xdata=0x0) at defaults.c:2832
#3  0x00007f20935afe74 in default_unlink (frame=0x7f202e948e8c, this=0x7f202000a010, loc=0x7f202e17c678, flags=0, xdata=0x0) at defaults.c:2832
#4  0x00007f20935afe74 in default_unlink (frame=0x7f202e948e8c, this=0x7f202000ad80, loc=0x7f202e17c678, flags=0, xdata=0x0) at defaults.c:2832
#5  0x00007f20935cb2ed in default_unlink_resume (frame=0x7f202e9449d0, this=0x7f202000bb60, loc=0x7f202e17c678, flags=0, xdata=0x0) at defaults.c:2081
#6  0x00007f209355856d in call_resume (stub=0x7f202e17c628) at call-stub.c:2508
#7  0x00007f206c30a2e8 in open_and_resume (this=this@entry=0x7f202000bb60, fd=fd@entry=0x0, stub=stub@entry=0x7f202e17c628) at open-behind.c:245
#8  0x00007f206c30c443 in ob_unlink (frame=<optimized out>, this=0x7f202000bb60, loc=0x7f202e1750bc, xflags=<optimized out>, xdata=<optimized out>) at open-behind.c:777
#9  0x00007f206c0f5a27 in mdc_unlink (frame=0x7f202e955970, this=0x7f202000c940, loc=0x7f202e1750bc, xflag=0, xdata=0x0) at md-cache.c:1432
#10 0x00007f20935cb2ed in default_unlink_resume (frame=0x7f202e92bb7c, this=0x7f202000d720, loc=0x7f202e1750bc, flags=0, xdata=0x0) at defaults.c:2081
#11 0x00007f209355856d in call_resume (stub=0x7f202e17506c) at call-stub.c:2508
#12 0x00007f205fdf9857 in iot_worker (data=0x7f202001b2e0) at io-threads.c:220
#13 0x00007f2123997dc5 in start_thread () from /lib64/libpthread.so.0
#14 0x00007f212306673d in clone () from /lib64/libc.so.6
(gdb)

*********** On gqas014 ***********

(gdb) bt
#0  0x00007f23280a2698 in ?? ()
#1  0x00007f260daa7b51 in mdcache_close (obj_hdl=<optimized out>) at /usr/src/debug/nfs-ganesha-2.4.1/src/FSAL/Stackable_FSALs/FSAL_MDCACHE/mdcache_file.c:440
#2  0x00007f260daa0b79 in fsal_close (obj_hdl=0x7f2454067d28) at /usr/src/debug/nfs-ganesha-2.4.1/src/include/fsal.h:416
#3  mdcache_lru_clean (entry=0x7f2454067cf0) at /usr/src/debug/nfs-ganesha-2.4.1/src/FSAL/Stackable_FSALs/FSAL_MDCACHE/mdcache_lru.c:411
#4  mdcache_lru_get (entry=entry@entry=0x7f25a94d8d18) at /usr/src/debug/nfs-ganesha-2.4.1/src/FSAL/Stackable_FSALs/FSAL_MDCACHE/mdcache_lru.c:1201
#5  0x00007f260daaac7e in mdcache_alloc_handle (fs=0x0, sub_handle=0x7f22cc117e88, export=0x7f257c1382e0) at /usr/src/debug/nfs-ganesha-2.4.1/src/FSAL/Stackable_FSALs/FSAL_MDCACHE/mdcache_helpers.c:117
#6  mdcache_new_entry (export=export@entry=0x7f257c1382e0, sub_handle=0x7f22cc117e88, attrs_in=attrs_in@entry=0x7f25a94d8e70, attrs_out=attrs_out@entry=0x0, new_directory=new_directory@entry=false, entry=entry@entry=0x7f25a94d8dd0, state=state@entry=0x0) at /usr/src/debug/nfs-ganesha-2.4.1/src/FSAL/Stackable_FSALs/FSAL_MDCACHE/mdcache_helpers.c:411
#7  0x00007f260daa46b4 in mdcache_alloc_and_check_handle (export=export@entry=0x7f257c1382e0, sub_handle=<optimized out>, new_obj=new_obj@entry=0x7f25a94d8e68, new_directory=new_directory@entry=false, attrs_in=attrs_in@entry=0x7f25a94d8e70, attrs_out=attrs_out@entry=0x0, tag=tag@entry=0x7f260dad9b84 "lookup ", parent=parent@entry=0x7f2420027590, name=name@entry=0x7f22cc14b430 "bounds.c", invalidate=invalidate@entry=true, state=state@entry=0x0) at /usr/src/debug/nfs-ganesha-2.4.1/src/FSAL/Stackable_FSALs/FSAL_MDCACHE/mdcache_handle.c:93
#8  0x00007f260daabefa in mdc_lookup_uncached (mdc_parent=mdc_parent@entry=0x7f2420027590, name=name@entry=0x7f22cc14b430 "bounds.c", new_entry=new_entry@entry=0x7f25a94d9010, attrs_out=attrs_out@entry=0x0) at /usr/src/debug/nfs-ganesha-2.4.1/src/FSAL/Stackable_FSALs/FSAL_MDCACHE/mdcache_helpers.c:1041
#9  0x00007f260daac2cd in mdc_lookup (mdc_parent=0x7f2420027590, name=0x7f22cc14b430 "bounds.c", uncached=uncached@entry=true, new_entry=new_entry@entry=0x7f25a94d9010, attrs_out=0x0) at /usr/src/debug/nfs-ganesha-2.4.1/src/FSAL/Stackable_FSALs/FSAL_MDCACHE/mdcache_helpers.c:985
#10 0x00007f260daa39eb in mdcache_lookup (parent=<optimized out>, name=<optimized out>, handle=0x7f25a94d9098, attrs_out=<optimized out>) at /usr/src/debug/nfs-ganesha-2.4.1/src/FSAL/Stackable_FSALs/FSAL_MDCACHE/mdcache_handle.c:166
#11 0x00007f260d9dbc97 in fsal_lookup (parent=0x7f24200275c8, name=0x7f22cc14b430 "bounds.c", obj=obj@entry=0x7f25a94d9098, attrs_out=attrs_out@entry=0x0) at /usr/src/debug/nfs-ganesha-2.4.1/src/FSAL/fsal_helper.c:712
#12 0x00007f260da0f636 in nfs4_op_lookup (op=<optimized out>, data=0x7f25a94d9180, resp=0x7f22cc2381b0) at /usr/src/debug/nfs-ganesha-2.4.1/src/Protocols/NFS/nfs4_op_lookup.c:106
#13 0x00007f260da03f7d in nfs4_Compound (arg=<optimized out>, req=<optimized out>, res=0x7f22cc028260) at /usr/src/debug/nfs-ganesha-2.4.1/src/Protocols/NFS/nfs4_Compound.c:734
#14 0x00007f260d9f512c in nfs_rpc_execute (reqdata=reqdata@entry=0x7f25780008c0) at /usr/src/debug/nfs-ganesha-2.4.1/src/MainNFSD/nfs_worker_thread.c:1281
#15 0x00007f260d9f678a in worker_run (ctx=0x7f260f110400) at /usr/src/debug/nfs-ganesha-2.4.1/src/MainNFSD/nfs_worker_thread.c:1548
#16 0x00007f260da80189 in fridgethr_start_routine (arg=0x7f260f110400) at /usr/src/debug/nfs-ganesha-2.4.1/src/support/fridgethr.c:550
#17 0x00007f260bf60dc5 in start_thread () from /lib64/libpthread.so.0
#18 0x00007f260b62f73d in clone () from /lib64/libc.so.6
(gdb)

*********** On gqas015 ***********

(gdb) bt
#0  0x00007fcb557db45f in wb_setattr_helper (frame=0x7fcb244f9014, this=0x7fcb08007590, loc=0x7fcb1aa2bb4c, stbuf=0x7fcb1aa2c084, valid=48, xdata=0x0) at write-behind.c:2078
#1  0x00007fcb63d5956d in call_resume (stub=0x7fcb1aa2bafc) at call-stub.c:2508
#2  0x00007fcb557def19 in wb_do_winds (wb_inode=wb_inode@entry=0x7fcb084d8c50, tasks=tasks@entry=0x7fca10ef6020) at write-behind.c:1509
#3  0x00007fcb557df03b in wb_process_queue (wb_inode=wb_inode@entry=0x7fcb084d8c50) at write-behind.c:1544
#4  0x00007fcb557e1428 in wb_setattr (frame=0x7fcb244f9014, this=<optimized out>, loc=0x7fcb1aa4923c, stbuf=0x7fcb1aa49774, valid=48, xdata=0x0) at write-behind.c:2103
#5  0x00007fcb63db1181 in default_setattr (frame=0x7fcb244f9014, this=0x7fcb08008410, loc=0x7fcb1aa4923c, stbuf=0x7fcb1aa49774, valid=48, xdata=0x0) at defaults.c:2880
#6  0x00007fcb63db1181 in default_setattr (frame=0x7fcb244f9014, this=0x7fcb08009180, loc=0x7fcb1aa4923c, stbuf=0x7fcb1aa49774, valid=48, xdata=0x0) at defaults.c:2880
#7  0x00007fcb551b51c4 in ioc_setattr (frame=0x7fcb244918a8, this=0x7fcb0800a010, loc=0x7fcb1aa4923c, stbuf=0x7fcb1aa49774, valid=48, xdata=0x0) at io-cache.c:168
#8  0x00007fcb63db1181 in default_setattr (frame=0x7fcb244918a8, this=0x7fcb0800ad80, loc=0x7fcb1aa4923c, stbuf=0x7fcb1aa49774, valid=48, xdata=0x0) at defaults.c:2880
#9  0x00007fcb63db1181 in default_setattr (frame=0x7fcb244918a8, this=0x7fcb0800bb60, loc=0x7fcb1aa4923c, stbuf=0x7fcb1aa49774, valid=48, xdata=0x0) at defaults.c:2880
#10 0x00007fcb54b8cd74 in mdc_setattr (frame=0x7fcb24487a7c, this=0x7fcb0800c940, loc=0x7fcb1aa4923c, stbuf=0x7fcb1aa49774, valid=48, xdata=0x0) at md-cache.c:1855
#11 0x00007fcb63dcc9e4 in default_setattr_resume (frame=0x7fcb24497218, this=0x7fcb0800d720, loc=0x7fcb1aa4923c, stbuf=0x7fcb1aa49774, valid=48, xdata=0x0) at defaults.c:2120
#12 0x00007fcb63d5956d in call_resume (stub=0x7fcb1aa491ec) at call-stub.c:2508
#13 0x00007fcb54981857 in iot_worker (data=0x7fcb0801b2e0) at io-threads.c:220
#14 0x00007fcc0c400dc5 in start_thread () from /lib64/libpthread.so.0
#15 0x00007fcc0bacf73d in clone () from /lib64/libc.so.6
(gdb)

Version-Release number of selected component (if applicable):
-------------------------------------------------------------
glusterfs-ganesha-3.8.4-5.el7rhgs.x86_64
nfs-ganesha-2.4.1-1.el7rhgs.x86_64

How reproducible:
-----------------
Reporting the first occurrence since it is a crash.

Steps to Reproduce:
-------------------
1. Create more than one volume. Mount them via v3 and v4 on different clients.
2. Pump I/O.

Actual results:
---------------
Ganesha crashes on all the nodes.

Expected results:
-----------------
No crashes.
Additional info:
----------------
OS: RHEL 7.3

*Vol Config*:

Volume Name: testvol1
Type: Distribute
Volume ID: 8e44ea57-950d-477a-8e45-d07711919016
Status: Started
Snapshot Count: 0
Number of Bricks: 1
Transport-type: tcp
Bricks:
Brick1: gqas014.sbu.lab.eng.bos.redhat.com:/bricks/testvol1_brick0
Options Reconfigured:
ganesha.enable: on
features.cache-invalidation: on
transport.address-family: inet
performance.readdir-ahead: on
nfs.disable: on
nfs-ganesha: enable
cluster.enable-shared-storage: enable

Volume Name: testvol2
Type: Distribute
Volume ID: 3ab53906-26d4-440b-8888-0c7ee2a8ef45
Status: Started
Snapshot Count: 0
Number of Bricks: 1
Transport-type: tcp
Bricks:
Brick1: gqas015.sbu.lab.eng.bos.redhat.com:/bricks/testvol2_brick1
Options Reconfigured:
ganesha.enable: on
features.cache-invalidation: on
transport.address-family: inet
performance.readdir-ahead: on
nfs.disable: on
nfs-ganesha: enable
cluster.enable-shared-storage: enable

Volume Name: testvol3
Type: Replicate
Volume ID: 71425134-0a46-4b66-9282-89f20df1d772
Status: Started
Snapshot Count: 0
Number of Bricks: 1 x 2 = 2
Transport-type: tcp
Bricks:
Brick1: gqas009.sbu.lab.eng.bos.redhat.com:/bricks/testvol3_brick2
Brick2: gqas010.sbu.lab.eng.bos.redhat.com:/bricks/testvol3_brick3
Options Reconfigured:
ganesha.enable: on
features.cache-invalidation: on
transport.address-family: inet
performance.readdir-ahead: on
nfs.disable: on
nfs-ganesha: enable
cluster.enable-shared-storage: enable
The crashes reported (1 & 3) on nodes gqas009 & gqas014 look similar to the one reported in https://bugzilla.redhat.com/show_bug.cgi?id=1379665#c0

The crash happens when an LRU entry is being re-used/purged, possibly due to memory corruption. Since it was not reproducible, we thought it was fixed in the latest code, but it seems not. From code inspection, I see a possible cause for the corruption:

mdcache_lru_clean -> release -> FSAL_GLUSTER's 'handle_release' -> fsal_obj_handle_fini (obj_handle->handle)

followed by

mdcache_lru_clean -> fsal_obj_handle_fini (entry->obj_handle)

fsal_obj_handle_fini() is thus called twice on the same handle, destroying its mutex lock twice. Not sure if that is the cause of this corruption, but it definitely needs to be fixed.
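For illustration, here is a minimal standalone sketch of why a double fini on the same handle would be dangerous, assuming the lock in question is a pthread mutex as described above. fake_obj_handle and fake_handle_fini are hypothetical stand-ins, not nfs-ganesha symbols; the second destroy is undefined behavior, which matches the kind of "sometimes crashes, sometimes silent corruption" symptoms seen in the cores.

/*
 * Sketch only (not nfs-ganesha code): destroying a handle's lock twice.
 * Build with: cc -pthread double_fini_sketch.c
 */
#include <pthread.h>
#include <stdio.h>

struct fake_obj_handle {
	pthread_mutex_t obj_lock;   /* stand-in for the handle's lock */
};

/* stand-in for fsal_obj_handle_fini() */
static void fake_handle_fini(struct fake_obj_handle *hdl)
{
	pthread_mutex_destroy(&hdl->obj_lock);
}

int main(void)
{
	struct fake_obj_handle hdl;

	pthread_mutex_init(&hdl.obj_lock, NULL);

	fake_handle_fini(&hdl);  /* first release path (e.g. handle_release) */
	fake_handle_fini(&hdl);  /* second release path (e.g. mdcache_lru_clean):
	                          * destroying an already-destroyed mutex is
	                          * undefined behavior from here on */

	printf("second destroy is UB: a crash or silent memory corruption is possible\n");
	return 0;
}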
(In reply to Soumya Koduri from comment #3)
> The crashes reported (1 & 3) on nodes gqas009 & gqas014 look similar to the
> one reported in https://bugzilla.redhat.com/show_bug.cgi?id=1379665#c0
>
> The crash happens when an LRU entry is being re-used/purged, possibly due to
> memory corruption. Since it was not reproducible, we thought it was fixed in
> the latest code, but it seems not. From code inspection, I see a possible
> cause for the corruption:
>
> mdcache_lru_clean -> release -> FSAL_GLUSTER's 'handle_release' ->
> fsal_obj_handle_fini (obj_handle->handle)
>
> followed by
>
> mdcache_lru_clean -> fsal_obj_handle_fini (entry->obj_handle)
>
> fsal_obj_handle_fini() is thus called twice on the same handle, destroying
> its mutex lock twice. Not sure if that is the cause of this corruption, but
> it definitely needs to be fixed.

Sorry. The handles which md-cache and the sub-FSALs operate on are different, so the above claim is not valid.
Crash 3 *might* be caused by this: https://review.gerrithub.io/304538
The reported issue was not reproducible on Ganesha 2.4.1-6, Gluster 3.8.4-12 in two tries. Will reopen if it is hit again during regression runs.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://rhn.redhat.com/errata/RHEA-2017-0493.html