Description of problem:
------------------------
4-node cluster, 3 volumes - one 2 x (4+2) disperse, one 2 x 2 replicate, and one single brick - all exported via Ganesha. 6 clients were pumping IO, and each mounted more than one volume on different mount points via v3/v4. Within 3 hours, Ganesha crashed on one of the nodes, and within 6 hours it crashed on another. The pacemaker quorum was eventually lost, causing an IO halt on the application side.

*************** BT from gqas006 ***************

(gdb) bt
#0  0x00007f85912481d7 in __GI_raise (sig=sig@entry=6) at ../nptl/sysdeps/unix/sysv/linux/raise.c:56
#1  0x00007f85912498c8 in __GI_abort () at abort.c:90
#2  0x00007f8591287f07 in __libc_message (do_abort=do_abort@entry=2, fmt=fmt@entry=0x7f8591392b48 "*** Error in `%s': %s: 0x%s ***\n") at ../sysdeps/unix/sysv/linux/libc_fatal.c:196
#3  0x00007f859128f503 in malloc_printerr (ar_ptr=0x7f8350000020, ptr=<optimized out>, str=0x7f8591392bb8 "double free or corruption (fasttop)", action=3) at malloc.c:5013
#4  _int_free (av=0x7f8350000020, p=<optimized out>, have_lock=0) at malloc.c:3835
#5  0x00007f850633d82e in pub_glfs_h_close (object=object@entry=0x7f83500d1e50) at glfs-handleops.c:1366
#6  0x00007f850675c17a in gluster_cleanup_vars (glhandle=glhandle@entry=0x7f83500d1e50) at /usr/src/debug/nfs-ganesha-2.4.1/src/FSAL/FSAL_GLUSTER/gluster_internal.c:171
#7  0x00007f8506758f8d in glusterfs_open2 (obj_hdl=0x7f82cc0cb008, state=0x0, openflags=<optimized out>, createmode=<optimized out>, name=0x7f839802fb40 "cache.c", attrib_set=<optimized out>, verifier=0x7f8566a22e70 "\001v'\206", new_obj=0x7f8566a229e0, attrs_out=0x7f8566a229f0, caller_perm_check=0x7f8566a22b5f) at /usr/src/debug/nfs-ganesha-2.4.1/src/FSAL/FSAL_GLUSTER/handle.c:1672
#8  0x00007f8593783e1f in mdcache_open2 (obj_hdl=0x7f83e004fa68, state=0x0, openflags=<optimized out>, createmode=FSAL_EXCLUSIVE, name=0x7f839802fb40 "cache.c", attrs_in=0x7f8566a22cb0, verifier=0x7f8566a22e70 "\001v'\206", new_obj=0x7f8566a22c88, attrs_out=0x7f8566a22d90, caller_perm_check=0x7f8566a22b5f) at /usr/src/debug/nfs-ganesha-2.4.1/src/FSAL/Stackable_FSALs/FSAL_MDCACHE/mdcache_file.c:657
#9  0x00007f85936b5fe9 in open2_by_name (in_obj=in_obj@entry=0x7f83e004fa68, state=state@entry=0x0, openflags=<optimized out>, createmode=createmode@entry=FSAL_EXCLUSIVE, name=name@entry=0x7f839802fb40 "cache.c", attr=attr@entry=0x7f8566a22cb0, verifier=0x7f8566a22e70 "\001v'\206", obj=obj@entry=0x7f8566a22c88, attrs_out=attrs_out@entry=0x7f8566a22d90) at /usr/src/debug/nfs-ganesha-2.4.1/src/FSAL/fsal_helper.c:399
#10 0x00007f85936b8db8 in fsal_open2 (in_obj=in_obj@entry=0x7f83e004fa68, state=state@entry=0x0, openflags=<optimized out>, openflags@entry=3, createmode=createmode@entry=FSAL_EXCLUSIVE, name=name@entry=0x7f839802fb40 "cache.c", attr=attr@entry=0x7f8566a22cb0, verifier=verifier@entry=0x7f8566a22e70 "\001v'\206", obj=obj@entry=0x7f8566a22c88, attrs_out=attrs_out@entry=0x7f8566a22d90) at /usr/src/debug/nfs-ganesha-2.4.1/src/FSAL/fsal_helper.c:1816
#11 0x00007f8593708b52 in nfs3_create (arg=0x7f8398064aa8, req=<optimized out>, res=0x7f83500c59c0) at /usr/src/debug/nfs-ganesha-2.4.1/src/Protocols/NFS/nfs3_create.c:177
#12 0x00007f85936d013c in nfs_rpc_execute (reqdata=reqdata@entry=0x7f83980648c0) at /usr/src/debug/nfs-ganesha-2.4.1/src/MainNFSD/nfs_worker_thread.c:1281
#13 0x00007f85936d179a in worker_run (ctx=0x7f8594fb54b0) at /usr/src/debug/nfs-ganesha-2.4.1/src/MainNFSD/nfs_worker_thread.c:1548
#14 0x00007f859375b409 in fridgethr_start_routine (arg=0x7f8594fb54b0) at /usr/src/debug/nfs-ganesha-2.4.1/src/support/fridgethr.c:550
#15 0x00007f8591c3bdc5 in start_thread (arg=0x7f8566a24700) at pthread_create.c:308
#16 0x00007f859130a73d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:113

Version-Release number of selected component (if applicable):
-------------------------------------------------------------
nfs-ganesha-2.4.1-5.el7rhgs.x86_64
glusterfs-ganesha-3.8.4-12.el7rhgs.x86_64

How reproducible:
-----------------
1/1

Actual results:
---------------
Ganesha crashed on 2/4 nodes; IO comes to a halt.

Expected results:
-----------------
No crashes.

Additional info:
-----------------
*Client and Server OS* : RHEL 7.3

*Vol Config* :

Volume Name: butcher
Type: Distributed-Disperse
Volume ID: 305e78e0-8e67-4673-a614-a73d06824dfb
Status: Started
Snapshot Count: 0
Number of Bricks: 2 x (4 + 2) = 12
Transport-type: tcp
Bricks:
Brick1: gqas013.sbu.lab.eng.bos.redhat.com:/bricks1/brick
Brick2: gqas005.sbu.lab.eng.bos.redhat.com:/bricks1/brick
Brick3: gqas006.sbu.lab.eng.bos.redhat.com:/bricks1/brick
Brick4: gqas011.sbu.lab.eng.bos.redhat.com:/bricks1/brick
Brick5: gqas013.sbu.lab.eng.bos.redhat.com:/bricks3/brick
Brick6: gqas005.sbu.lab.eng.bos.redhat.com:/bricks3/brick
Brick7: gqas013.sbu.lab.eng.bos.redhat.com:/bricks2/brick
Brick8: gqas005.sbu.lab.eng.bos.redhat.com:/bricks2/brick
Brick9: gqas006.sbu.lab.eng.bos.redhat.com:/bricks2/brick
Brick10: gqas011.sbu.lab.eng.bos.redhat.com:/bricks2/brick
Brick11: gqas006.sbu.lab.eng.bos.redhat.com:/bricks3/brick
Brick12: gqas011.sbu.lab.eng.bos.redhat.com:/bricks3/brick
Options Reconfigured:
ganesha.enable: on
performance.md-cache-timeout: 600
performance.cache-invalidation: on
performance.stat-prefetch: on
features.cache-invalidation-timeout: 600
features.cache-invalidation: on
transport.address-family: inet
performance.readdir-ahead: on
nfs.disable: on
nfs-ganesha: enable
cluster.enable-shared-storage: enable

Volume Name: replicate
Type: Distributed-Replicate
Volume ID: ece1a987-2b47-444d-b30c-2d6d8474904c
Status: Started
Snapshot Count: 0
Number of Bricks: 2 x 2 = 4
Transport-type: tcp
Bricks:
Brick1: gqas011.sbu.lab.eng.bos.redhat.com:/bricks4/A1
Brick2: gqas013.sbu.lab.eng.bos.redhat.com:/bricks4/A1
Brick3: gqas005.sbu.lab.eng.bos.redhat.com:/bricks4/A1
Brick4: gqas006.sbu.lab.eng.bos.redhat.com:/bricks4/A1
Options Reconfigured:
ganesha.enable: on
features.cache-invalidation: on
transport.address-family: inet
performance.readdir-ahead: on
nfs.disable: on
nfs-ganesha: enable
cluster.enable-shared-storage: enable

Volume Name: single
Type: Distribute
Volume ID: 9cf434ba-8833-40aa-8eb8-f3345bd73899
Status: Started
Snapshot Count: 0
Number of Bricks: 1
Transport-type: tcp
Bricks:
Brick1: gqas011.sbu.lab.eng.bos.redhat.com:/bricks8/1
Options Reconfigured:
ganesha.enable: on
features.cache-invalidation: on
transport.address-family: inet
performance.readdir-ahead: on
nfs.disable: on
nfs-ganesha: enable
cluster.enable-shared-storage: enable
(gdb) f 5
#5  0x00007f850633d82e in pub_glfs_h_close (object=object@entry=0x7f83500d1e50) at glfs-handleops.c:1366
1366            GF_FREE (object);
(gdb) p object
$1 = (struct glfs_object *) 0x7f83500d1e50
(gdb) p sizeof(struct mem_header)
$2 = 64
(gdb) p $1-64
$3 = (struct glfs_object *) 0x7f83500d1850
(gdb) p (struct mem_header *)$3
$4 = (struct mem_header *) 0x7f83500d1850
(gdb) p *$
$5 = {type = 0, size = 0, mem_acct = 0x0, magic = 0, padding = {0, 0, 0, 0, 0, 0, 0, 0}}
(gdb) p/x $4->magic
$6 = 0x0
(gdb) p errno
$7 = 6
(gdb) f 7
#7  0x00007f8506758f8d in glusterfs_open2 (obj_hdl=0x7f82cc0cb008, state=0x0, openflags=<optimized out>, createmode=<optimized out>, name=0x7f839802fb40 "cache.c", attrib_set=<optimized out>, verifier=0x7f8566a22e70 "\001v'\206", new_obj=0x7f8566a229e0, attrs_out=0x7f8566a229f0, caller_perm_check=0x7f8566a22b5f) at /usr/src/debug/nfs-ganesha-2.4.1/src/FSAL/FSAL_GLUSTER/handle.c:1672
1672            gluster_cleanup_vars(glhandle);
(gdb) p status
$8 = <optimized out>
(gdb) p new_obj
$9 = (struct fsal_obj_handle **) 0x7f8566a229e0
(gdb) p myself
$10 = (struct glusterfs_handle *) 0x7f8350039a80
(gdb) p *$
$11 = {glhandle = 0x7f83500a21b0, globjhdl = "x\000\000P\203\177\000\000\246\024\247=\006\202M\373\a\312\355A80MA\253\350\027\256\231p", <incomplete sequence \337>, globalfd = {openflags = 0, glfd = 0x0}, handle = {handles = {next = 0x0, prev = 0x0}, fs = 0x0, fsal = 0x0, obj_ops = {get_ref = 0x0, put_ref = 0x0, release = 0x0, merge = 0x0, lookup = 0x0, readdir = 0x0, create = 0x0, mkdir = 0x0, mknode = 0x0, symlink = 0x0, readlink = 0x0, test_access = 0x0, getattrs = 0x0, setattrs = 0x0, link = 0x0, fs_locations = 0x0, rename = 0x0, unlink = 0x0, open = 0x0, reopen = 0x0, status = 0x0, read = 0x0, read_plus = 0x0, write = 0x0, write_plus = 0x0, seek = 0x0, io_advise = 0x0, commit = 0x0, lock_op = 0x0, share_op = 0x0, close = 0x0, list_ext_attrs = 0x0, getextattr_id_by_name = 0x0, getextattr_value_by_name = 0x0, getextattr_value_by_id = 0x0, setextattr_value = 0x0, setextattr_value_by_id = 0x0, remove_extattr_by_id = 0x0, remove_extattr_by_name = 0x0, handle_is = 0x0, handle_digest = 0x0, handle_to_key = 0x0, handle_cmp = 0x0, layoutget = 0x0, layoutreturn = 0x0, layoutcommit = 0x0, getxattrs = 0x0, setxattrs = 0x0, removexattrs = 0x0, listxattrs = 0x0, open2 = 0x0, check_verifier = 0x0, status2 = 0x0, reopen2 = 0x0, read2 = 0x0, write2 = 0x0, seek2 = 0x0, io_advise2 = 0x0, commit2 = 0x0, lock_op2 = 0x0, setattr2 = 0x0, close2 = 0x0}, lock = {__data = {__lock = 0, __nr_readers = 0, __readers_wakeup = 0, __writer_wakeup = 0, __nr_readers_queued = 0, __nr_writers_queued = 0, __writer = 0, __shared = 0, __pad1 = 0, __pad2 = 0, __flags = 0}, __size = '\000' <repeats 55 times>, __align = 0}, type = REGULAR_FILE, fsid = {major = 1189, minor = 635256}, fileid = 12387176813748553183, state_hdl = 0x0}, share = {share_access_read = 0, share_access_write = 0, share_deny_read = 0, share_deny_write = 0, share_deny_write_mand = 0}, rd_issued = 0, rd_serial = 0, rw_issued = 0, rw_serial = 0, rw_max_len = 752}
(gdb) p myself->glhandle
$12 = (struct glfs_object *) 0x7f83500a21b0
(gdb) p &myself->handle
$13 = (struct fsal_obj_handle *) 0x7f8350039ab8
(gdb) p *new_obj
$14 = (struct fsal_obj_handle *) 0x0
(gdb) p *myself->glhandle
$15 = {inode = 0x0, gfid = "\301\000\000\000\000\000\000\000\060\305\nP\203\177\000"}
(gdb) p *glhandle
value has been optimized out
(gdb) p glhandle
$16 = <optimized out>
(gdb) f 5
#5  0x00007f850633d82e in pub_glfs_h_close (object=object@entry=0x7f83500d1e50) at glfs-handleops.c:1366
1366            GF_FREE (object);
(gdb) p *object
$17 = {inode = 0x7f84dd935400, gfid = "\a\312\355A80MA\253\350\027\256\231p", <incomplete sequence \337>}
(gdb) f 7
#7  0x00007f8506758f8d in glusterfs_open2 (obj_hdl=0x7f82cc0cb008, state=0x0, openflags=<optimized out>, createmode=<optimized out>, name=0x7f839802fb40 "cache.c", attrib_set=<optimized out>, verifier=0x7f8566a22e70 "\001v'\206", new_obj=0x7f8566a229e0, attrs_out=0x7f8566a229f0, caller_perm_check=0x7f8566a22b5f) at /usr/src/debug/nfs-ganesha-2.4.1/src/FSAL/FSAL_GLUSTER/handle.c:1672
1672            gluster_cleanup_vars(glhandle);
(gdb) p &myself->handle
$18 = (struct fsal_obj_handle *) 0x7f8350039ab8
(gdb) p *$
$19 = {handles = {next = 0x0, prev = 0x0}, fs = 0x0, fsal = 0x0, obj_ops = {get_ref = 0x0, put_ref = 0x0, release = 0x0, merge = 0x0, lookup = 0x0, readdir = 0x0, create = 0x0, mkdir = 0x0, mknode = 0x0, symlink = 0x0, readlink = 0x0, test_access = 0x0, getattrs = 0x0, setattrs = 0x0, link = 0x0, fs_locations = 0x0, rename = 0x0, unlink = 0x0, open = 0x0, reopen = 0x0, status = 0x0, read = 0x0, read_plus = 0x0, write = 0x0, write_plus = 0x0, seek = 0x0, io_advise = 0x0, commit = 0x0, lock_op = 0x0, share_op = 0x0, close = 0x0, list_ext_attrs = 0x0, getextattr_id_by_name = 0x0, getextattr_value_by_name = 0x0, getextattr_value_by_id = 0x0, setextattr_value = 0x0, setextattr_value_by_id = 0x0, remove_extattr_by_id = 0x0, remove_extattr_by_name = 0x0, handle_is = 0x0, handle_digest = 0x0, handle_to_key = 0x0, handle_cmp = 0x0, layoutget = 0x0, layoutreturn = 0x0, layoutcommit = 0x0, getxattrs = 0x0, setxattrs = 0x0, removexattrs = 0x0, listxattrs = 0x0, open2 = 0x0, check_verifier = 0x0, status2 = 0x0, reopen2 = 0x0, read2 = 0x0, write2 = 0x0, seek2 = 0x0, io_advise2 = 0x0, commit2 = 0x0, lock_op2 = 0x0, setattr2 = 0x0, close2 = 0x0}, lock = {__data = {__lock = 0, __nr_readers = 0, __readers_wakeup = 0, __writer_wakeup = 0, __nr_readers_queued = 0, __nr_writers_queued = 0, __writer = 0, __shared = 0, __pad1 = 0, __pad2 = 0, __flags = 0}, __size = '\000' <repeats 55 times>, __align = 0}, type = REGULAR_FILE, fsid = {major = 1189, minor = 635256}, fileid = 12387176813748553183, state_hdl = 0x0}
(gdb) p *new_obj
$20 = (struct fsal_obj_handle *) 0x0
(gdb) q

[2017-01-16 20:24:47.722199] E [MSGID: 109040] [dht-helper.c:1198:dht_migration_complete_check_task] 0-butcher-dht: /d1/dir8/linux-4.9.4/arch/mips/mm/cache.c: failed to lookup the file on butcher-dht [Stale file handle]

There is a STALE_FH error returned for cache.c. It is not clear why the file has gone stale, but in the Ganesha layer, if we receive an error for setattrs, we seem to be freeing the glusterfs handle twice.

Fix posted upstream for review - https://review.gerrithub.io/#/c/343338/
The reported issue was not reproducible on Ganesha 2.4.1-6, Gluster 3.8.4-12.

Verified.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://rhn.redhat.com/errata/RHEA-2017-0493.html