Description of problem:
-----------------------

2 node setup, 4 clients mount the gluster volume via v4 (2 clients:1 server).

Ganesha crashed on one of my nodes and dumped the following core :

<BT>

(gdb) bt
#0  0x00007fbc473fe5db in malloc_consolidate (av=av@entry=0x7fbadc000020) at malloc.c:4164
#1  0x00007fbc474003a5 in _int_malloc (av=av@entry=0x7fbadc000020, bytes=bytes@entry=1032) at malloc.c:3446
#2  0x00007fbc47403b64 in __libc_calloc (n=<optimized out>, elem_size=<optimized out>) at malloc.c:3243
#3  0x00007fbc4476c3e6 in __gf_calloc (nmemb=nmemb@entry=1, size=<optimized out>, type=type@entry=17, typestr=typestr@entry=0x7fbc447e05eb "gf_common_mt_inode_ctx") at mem-pool.c:117
#4  0x00007fbc44754205 in __inode_create (table=table@entry=0x7fbc1c029a80) at inode.c:614
#5  0x00007fbc4475564b in inode_new (table=0x7fbc1c029a80) at inode.c:647
#6  0x00007fbc44a2a8e6 in glfs_resolve_component (fs=fs@entry=0x563be72d6060, subvol=subvol@entry=0x7fbc2c04b3e0, parent=parent@entry=0x7fb944005d70, component=component@entry=0x7fbadc055110 "0000003811yF", iatt=iatt@entry=0x7fbc027dab00, force_lookup=<optimized out>) at glfs-resolve.c:367
#7  0x00007fbc44a2adfb in priv_glfs_resolve_at (fs=fs@entry=0x563be72d6060, subvol=subvol@entry=0x7fbc2c04b3e0, at=at@entry=0x7fb944005d70, origpath=origpath@entry=0x7fbadc051140 "0000003811yF", loc=loc@entry=0x7fbc027dac00, iatt=iatt@entry=0x7fbc027dac40, follow=follow@entry=0, reval=reval@entry=0) at glfs-resolve.c:501
#8  0x00007fbc44a2c7b4 in pub_glfs_h_lookupat (fs=0x563be72d6060, parent=<optimized out>, path=path@entry=0x7fbadc051140 "0000003811yF", stat=stat@entry=0x7fbc027dad20, follow=follow@entry=0) at glfs-handleops.c:102
#9  0x00007fbc44a2c898 in pub_glfs_h_lookupat34 (fs=<optimized out>, parent=<optimized out>, path=path@entry=0x7fbadc051140 "0000003811yF", stat=stat@entry=0x7fbc027dad20) at glfs-handleops.c:133
#10 0x00007fbc44e4939f in lookup (parent=0x7fb94409a898, path=0x7fbadc051140 "0000003811yF", handle=0x7fbc027dae60, attrs_out=0x7fbc027dae70) at /usr/src/debug/nfs-ganesha-2.4.4/src/FSAL/FSAL_GLUSTER/handle.c:117
#11 0x0000563be67e9f8f in mdc_lookup_uncached (mdc_parent=mdc_parent@entry=0x7fb8ac097870, name=name@entry=0x7fbadc051140 "0000003811yF", new_entry=new_entry@entry=0x7fbc027db010, attrs_out=attrs_out@entry=0x0) at /usr/src/debug/nfs-ganesha-2.4.4/src/FSAL/Stackable_FSALs/FSAL_MDCACHE/mdcache_helpers.c:1046
#12 0x0000563be67ecc5b in mdc_lookup (mdc_parent=0x7fb8ac097870, name=0x7fbadc051140 "0000003811yF", uncached=uncached@entry=true, new_entry=new_entry@entry=0x7fbc027db010, attrs_out=0x0) at /usr/src/debug/nfs-ganesha-2.4.4/src/FSAL/Stackable_FSALs/FSAL_MDCACHE/mdcache_helpers.c:1004
#13 0x0000563be67e1b7b in mdcache_lookup (parent=<optimized out>, name=<optimized out>, handle=0x7fbc027db098, attrs_out=<optimized out>) at /usr/src/debug/nfs-ganesha-2.4.4/src/FSAL/Stackable_FSALs/FSAL_MDCACHE/mdcache_handle.c:167
#14 0x0000563be671aac7 in fsal_lookup (parent=0x7fb8ac0978a8, name=0x7fbadc051140 "0000003811yF", obj=obj@entry=0x7fbc027db098, attrs_out=attrs_out@entry=0x0) at /usr/src/debug/nfs-ganesha-2.4.4/src/FSAL/fsal_helper.c:712
#15 0x0000563be674e066 in nfs4_op_lookup (op=<optimized out>, data=0x7fbc027db180, resp=0x7fbadc05c870) at /usr/src/debug/nfs-ganesha-2.4.4/src/Protocols/NFS/nfs4_op_lookup.c:106
#16 0x0000563be674297d in nfs4_Compound (arg=<optimized out>, req=<optimized out>, res=0x7fbadc05b3d0) at /usr/src/debug/nfs-ganesha-2.4.4/src/Protocols/NFS/nfs4_Compound.c:734
#17 0x0000563be6733b1c in nfs_rpc_execute (reqdata=reqdata@entry=0x7fbb880008c0) at /usr/src/debug/nfs-ganesha-2.4.4/src/MainNFSD/nfs_worker_thread.c:1281
#18 0x0000563be673518a in worker_run (ctx=0x563bea352b00) at /usr/src/debug/nfs-ganesha-2.4.4/src/MainNFSD/nfs_worker_thread.c:1548
#19 0x0000563be67be889 in fridgethr_start_routine (arg=0x563bea352b00) at /usr/src/debug/nfs-ganesha-2.4.4/src/support/fridgethr.c:550
#20 0x00007fbc47dade25 in start_thread (arg=0x7fbc027dc700) at pthread_create.c:308
#21 0x00007fbc4747b34d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:113
(gdb)

</BT>

Version-Release number of selected component (if applicable):
--------------------------------------------------------------

nfs-ganesha-2.4.4-10.el7rhgs.x86_64
glusterfs-ganesha-3.8.4-31.el7rhgs.x86_64

How reproducible:
-----------------

Reporting the first occurrence

Additional info:
----------------

Volume Name: butcher
Type: Distributed-Disperse
Volume ID: 22c652d8-0754-438a-8131-373bad7c12ab
Status: Started
Snapshot Count: 0
Number of Bricks: 4 x (4 + 2) = 24
Transport-type: tcp
Bricks:
Brick1: gqas014.sbu.lab.eng.bos.redhat.com:/bricks1/A1
Brick2: gqas007.sbu.lab.eng.bos.redhat.com:/bricks1/A1
Brick3: gqas014.sbu.lab.eng.bos.redhat.com:/bricks2/A1
Brick4: gqas007.sbu.lab.eng.bos.redhat.com:/bricks2/A1
Brick5: gqas014.sbu.lab.eng.bos.redhat.com:/bricks3/A1
Brick6: gqas007.sbu.lab.eng.bos.redhat.com:/bricks3/A1
Brick7: gqas014.sbu.lab.eng.bos.redhat.com:/bricks4/A1
Brick8: gqas007.sbu.lab.eng.bos.redhat.com:/bricks4/A1
Brick9: gqas014.sbu.lab.eng.bos.redhat.com:/bricks5/A1
Brick10: gqas007.sbu.lab.eng.bos.redhat.com:/bricks5/A1
Brick11: gqas014.sbu.lab.eng.bos.redhat.com:/bricks6/A1
Brick12: gqas007.sbu.lab.eng.bos.redhat.com:/bricks6/A1
Brick13: gqas014.sbu.lab.eng.bos.redhat.com:/bricks7/A1
Brick14: gqas007.sbu.lab.eng.bos.redhat.com:/bricks7/A1
Brick15: gqas014.sbu.lab.eng.bos.redhat.com:/bricks8/A1
Brick16: gqas007.sbu.lab.eng.bos.redhat.com:/bricks8/A1
Brick17: gqas014.sbu.lab.eng.bos.redhat.com:/bricks9/A1
Brick18: gqas007.sbu.lab.eng.bos.redhat.com:/bricks9/A1
Brick19: gqas014.sbu.lab.eng.bos.redhat.com:/bricks10/A1
Brick20: gqas007.sbu.lab.eng.bos.redhat.com:/bricks10/A1
Brick21: gqas014.sbu.lab.eng.bos.redhat.com:/bricks11/A1
Brick22: gqas007.sbu.lab.eng.bos.redhat.com:/bricks11/A1
Brick23: gqas014.sbu.lab.eng.bos.redhat.com:/bricks12/A1
Brick24: gqas007.sbu.lab.eng.bos.redhat.com:/bricks12/A1
Options Reconfigured:
ganesha.enable: on
network.inode-lru-limit: 50000
performance.md-cache-timeout: 600
performance.cache-invalidation: on
performance.stat-prefetch: on
features.cache-invalidation-timeout: 600
features.cache-invalidation: on
transport.address-family: inet
nfs.disable: on
nfs-ganesha: enable
cluster.enable-shared-storage: enable

This is definitely memory corruption. I notice that all these crashes are on the same machine (gqas007). Is it possible that there's a hardware issue with the RAM on that box? Do these crashes happen on any other box? Can we attempt to reproduce on a set of boxes excluding this one?
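
For anyone triaging the other cores, here is a rough sketch of the check behind the "memory corruption" reading: a crash whose innermost frames are inside the glibc allocator usually means the heap metadata was damaged earlier by some other code, so the first non-allocator caller is the victim and not necessarily the culprit. The helper names and the allocator frame list below are my own assumptions for illustration; they are not part of gdb, ganesha, or gluster.

```python
import re

# Allocator entry points commonly seen in glibc backtraces.
# Assumption: this list is illustrative, not exhaustive.
ALLOCATOR_FRAMES = {
    "malloc_consolidate", "_int_malloc", "_int_free",
    "__libc_calloc", "__libc_malloc", "malloc", "calloc", "free",
}

# Matches "#N 0xADDR in func (" as well as "#N func (" (no address).
FRAME_RE = re.compile(r"#(\d+)\s+(?:0x[0-9a-f]+ in )?([A-Za-z_]\w*) \(")


def parse_frames(backtrace):
    """Return [(frame_number, function_name), ...] in backtrace order."""
    return [(int(num), func) for num, func in FRAME_RE.findall(backtrace)]


def crashed_in_allocator(backtrace):
    """True if the innermost frame (#0) is a known allocator function."""
    frames = parse_frames(backtrace)
    return bool(frames) and frames[0][1] in ALLOCATOR_FRAMES


def first_caller_outside_allocator(backtrace):
    """Return the first frame that is not an allocator function, or None."""
    for num, func in parse_frames(backtrace):
        if func not in ALLOCATOR_FRAMES:
            return num, func
    return None


# First four frames from the backtrace in this report.
SAMPLE = (
    "#0 0x00007fbc473fe5db in malloc_consolidate (av=av@entry=0x7fbadc000020) at malloc.c:4164 "
    "#1 0x00007fbc474003a5 in _int_malloc (av=av@entry=0x7fbadc000020, bytes=bytes@entry=1032) at malloc.c:3446 "
    "#2 0x00007fbc47403b64 in __libc_calloc (n=<optimized out>, elem_size=<optimized out>) at malloc.c:3243 "
    "#3 0x00007fbc4476c3e6 in __gf_calloc (nmemb=nmemb@entry=1, size=<optimized out>) at mem-pool.c:117"
)

if __name__ == "__main__":
    print(crashed_in_allocator(SAMPLE))            # True
    print(first_caller_outside_allocator(SAMPLE))  # (3, '__gf_calloc')
```

Running this over the backtraces from each crash, grouped by host, would show whether they all die inside the allocator and whether the callers differ, which is what you would expect from either random heap corruption or bad RAM.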