1467035 – [Ganesha]: Ganesha crashed during iozones/bonnie from multiple clients in malloc_consolidate, possible memory corruption

Keywords:
Status:	CLOSED WONTFIX
Alias:	None
Product:	Red Hat Gluster Storage
Classification:	Red Hat Storage
Component:	nfs-ganesha
Sub Component:
Version:	rhgs-3.3
Hardware:	x86_64
OS:	Linux
Priority:	unspecified
Severity:	high
Target Milestone:	---
Target Release:	---
Assignee:	Kaleb KEITHLEY
QA Contact:	Ambarish
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:

Reported:	2017-07-02 04:03 UTC by Ambarish
Modified:	2017-08-10 07:10 UTC
CC List:	11 users
Fixed In Version:
Doc Type:	If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed:	2017-08-10 07:10:09 UTC
Embargoed:
Dependent Products:

Description Ambarish 2017-07-02 04:03:40 UTC

Description of problem:
-----------------------

Two-node setup; four clients mount the Gluster volume via NFSv4 (two clients per server).

Ganesha crashed on one of the nodes and dumped a core with the following backtrace:

<BT>

(gdb) bt
#0  0x00007fbc473fe5db in malloc_consolidate (av=av@entry=0x7fbadc000020) at malloc.c:4164
#1  0x00007fbc474003a5 in _int_malloc (av=av@entry=0x7fbadc000020, bytes=bytes@entry=1032) at malloc.c:3446
#2  0x00007fbc47403b64 in __libc_calloc (n=<optimized out>, elem_size=<optimized out>) at malloc.c:3243
#3  0x00007fbc4476c3e6 in __gf_calloc (nmemb=nmemb@entry=1, size=<optimized out>, type=type@entry=17, 
    typestr=typestr@entry=0x7fbc447e05eb "gf_common_mt_inode_ctx") at mem-pool.c:117
#4  0x00007fbc44754205 in __inode_create (table=table@entry=0x7fbc1c029a80) at inode.c:614
#5  0x00007fbc4475564b in inode_new (table=0x7fbc1c029a80) at inode.c:647
#6  0x00007fbc44a2a8e6 in glfs_resolve_component (fs=fs@entry=0x563be72d6060, subvol=subvol@entry=0x7fbc2c04b3e0, 
    parent=parent@entry=0x7fb944005d70, component=component@entry=0x7fbadc055110 "0000003811yF", 
    iatt=iatt@entry=0x7fbc027dab00, force_lookup=<optimized out>) at glfs-resolve.c:367
#7  0x00007fbc44a2adfb in priv_glfs_resolve_at (fs=fs@entry=0x563be72d6060, subvol=subvol@entry=0x7fbc2c04b3e0, 
    at=at@entry=0x7fb944005d70, origpath=origpath@entry=0x7fbadc051140 "0000003811yF", loc=loc@entry=0x7fbc027dac00, 
    iatt=iatt@entry=0x7fbc027dac40, follow=follow@entry=0, reval=reval@entry=0) at glfs-resolve.c:501
#8  0x00007fbc44a2c7b4 in pub_glfs_h_lookupat (fs=0x563be72d6060, parent=<optimized out>, 
    path=path@entry=0x7fbadc051140 "0000003811yF", stat=stat@entry=0x7fbc027dad20, follow=follow@entry=0)
    at glfs-handleops.c:102
#9  0x00007fbc44a2c898 in pub_glfs_h_lookupat34 (fs=<optimized out>, parent=<optimized out>, 
    path=path@entry=0x7fbadc051140 "0000003811yF", stat=stat@entry=0x7fbc027dad20) at glfs-handleops.c:133
#10 0x00007fbc44e4939f in lookup (parent=0x7fb94409a898, path=0x7fbadc051140 "0000003811yF", handle=0x7fbc027dae60, 
    attrs_out=0x7fbc027dae70) at /usr/src/debug/nfs-ganesha-2.4.4/src/FSAL/FSAL_GLUSTER/handle.c:117
#11 0x0000563be67e9f8f in mdc_lookup_uncached (mdc_parent=mdc_parent@entry=0x7fb8ac097870, 
    name=name@entry=0x7fbadc051140 "0000003811yF", new_entry=new_entry@entry=0x7fbc027db010, 
    attrs_out=attrs_out@entry=0x0)
    at /usr/src/debug/nfs-ganesha-2.4.4/src/FSAL/Stackable_FSALs/FSAL_MDCACHE/mdcache_helpers.c:1046
#12 0x0000563be67ecc5b in mdc_lookup (mdc_parent=0x7fb8ac097870, name=0x7fbadc051140 "0000003811yF", 
    uncached=uncached@entry=true, new_entry=new_entry@entry=0x7fbc027db010, attrs_out=0x0)
    at /usr/src/debug/nfs-ganesha-2.4.4/src/FSAL/Stackable_FSALs/FSAL_MDCACHE/mdcache_helpers.c:1004
#13 0x0000563be67e1b7b in mdcache_lookup (parent=<optimized out>, name=<optimized out>, handle=0x7fbc027db098, 
    attrs_out=<optimized out>)
    at /usr/src/debug/nfs-ganesha-2.4.4/src/FSAL/Stackable_FSALs/FSAL_MDCACHE/mdcache_handle.c:167
#14 0x0000563be671aac7 in fsal_lookup (parent=0x7fb8ac0978a8, name=0x7fbadc051140 "0000003811yF", 
    obj=obj@entry=0x7fbc027db098, attrs_out=attrs_out@entry=0x0)
    at /usr/src/debug/nfs-ganesha-2.4.4/src/FSAL/fsal_helper.c:712
#15 0x0000563be674e066 in nfs4_op_lookup (op=<optimized out>, data=0x7fbc027db180, resp=0x7fbadc05c870)
    at /usr/src/debug/nfs-ganesha-2.4.4/src/Protocols/NFS/nfs4_op_lookup.c:106
#16 0x0000563be674297d in nfs4_Compound (arg=<optimized out>, req=<optimized out>, res=0x7fbadc05b3d0)
    at /usr/src/debug/nfs-ganesha-2.4.4/src/Protocols/NFS/nfs4_Compound.c:734
#17 0x0000563be6733b1c in nfs_rpc_execute (reqdata=reqdata@entry=0x7fbb880008c0)
    at /usr/src/debug/nfs-ganesha-2.4.4/src/MainNFSD/nfs_worker_thread.c:1281
#18 0x0000563be673518a in worker_run (ctx=0x563bea352b00)
    at /usr/src/debug/nfs-ganesha-2.4.4/src/MainNFSD/nfs_worker_thread.c:1548
#19 0x0000563be67be889 in fridgethr_start_routine (arg=0x563bea352b00)
    at /usr/src/debug/nfs-ganesha-2.4.4/src/support/fridgethr.c:550
#20 0x00007fbc47dade25 in start_thread (arg=0x7fbc027dc700) at pthread_create.c:308
#21 0x00007fbc4747b34d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:113
(gdb) 
</BT>
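
For readers skimming the trace: the faulting frame is inside glibc's allocator, reached through an ordinary calloc from __inode_create during an NFSv4 LOOKUP. A crash in malloc_consolidate on an otherwise unremarkable allocation usually means the heap metadata was damaged earlier by some other code path, so the frames above are likely the victim rather than the culprit. The sketch below is only an illustration of how the call path can be pulled out of gdb `bt` output for comparison across cores; the regex and the helper name `frame_functions` are assumptions of this example, not part of gdb or Ganesha, and the sample is the first five frames copied from the trace above.

```python
import re

# Matches the start of a gdb frame line such as
#   "#3  0x00007fbc4476c3e6 in __gf_calloc (nmemb=..."
# Continuation lines (indented argument lists) do not start with '#'
# and are therefore skipped.
FRAME_RE = re.compile(
    r"^#(\d+)\s+(?:0x[0-9a-fA-F]+\s+in\s+)?([^\s(]+)\s*\(",
    re.MULTILINE,
)


def frame_functions(bt_text):
    """Return [(frame_number, function_name), ...] from gdb `bt` output."""
    return [(int(num), func) for num, func in FRAME_RE.findall(bt_text)]


# First five frames from the backtrace in this report.
sample = """\
#0  0x00007fbc473fe5db in malloc_consolidate (av=av@entry=0x7fbadc000020) at malloc.c:4164
#1  0x00007fbc474003a5 in _int_malloc (av=av@entry=0x7fbadc000020, bytes=bytes@entry=1032) at malloc.c:3446
#2  0x00007fbc47403b64 in __libc_calloc (n=<optimized out>, elem_size=<optimized out>) at malloc.c:3243
#3  0x00007fbc4476c3e6 in __gf_calloc (nmemb=nmemb@entry=1, size=<optimized out>, type=type@entry=17,
    typestr=typestr@entry=0x7fbc447e05eb "gf_common_mt_inode_ctx") at mem-pool.c:117
#4  0x00007fbc44754205 in __inode_create (table=table@entry=0x7fbc1c029a80) at inode.c:614
"""

frames = frame_functions(sample)
for num, func in frames:
    print(f"#{num} {func}")
```

Running the same helper over several cores makes it easy to see whether they all die in the allocator from different callers, which would be consistent with corruption happening elsewhere.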

Version-Release number of selected component (if applicable):
--------------------------------------------------------------

nfs-ganesha-2.4.4-10.el7rhgs.x86_64
glusterfs-ganesha-3.8.4-31.el7rhgs.x86_64


How reproducible:
-----------------

Reporting the first occurrence



Additional info:
----------------

Volume Name: butcher
Type: Distributed-Disperse
Volume ID: 22c652d8-0754-438a-8131-373bad7c12ab
Status: Started
Snapshot Count: 0
Number of Bricks: 4 x (4 + 2) = 24
Transport-type: tcp
Bricks:
Brick1: gqas014.sbu.lab.eng.bos.redhat.com:/bricks1/A1
Brick2: gqas007.sbu.lab.eng.bos.redhat.com:/bricks1/A1
Brick3: gqas014.sbu.lab.eng.bos.redhat.com:/bricks2/A1
Brick4: gqas007.sbu.lab.eng.bos.redhat.com:/bricks2/A1
Brick5: gqas014.sbu.lab.eng.bos.redhat.com:/bricks3/A1
Brick6: gqas007.sbu.lab.eng.bos.redhat.com:/bricks3/A1
Brick7: gqas014.sbu.lab.eng.bos.redhat.com:/bricks4/A1
Brick8: gqas007.sbu.lab.eng.bos.redhat.com:/bricks4/A1
Brick9: gqas014.sbu.lab.eng.bos.redhat.com:/bricks5/A1
Brick10: gqas007.sbu.lab.eng.bos.redhat.com:/bricks5/A1
Brick11: gqas014.sbu.lab.eng.bos.redhat.com:/bricks6/A1
Brick12: gqas007.sbu.lab.eng.bos.redhat.com:/bricks6/A1
Brick13: gqas014.sbu.lab.eng.bos.redhat.com:/bricks7/A1
Brick14: gqas007.sbu.lab.eng.bos.redhat.com:/bricks7/A1
Brick15: gqas014.sbu.lab.eng.bos.redhat.com:/bricks8/A1
Brick16: gqas007.sbu.lab.eng.bos.redhat.com:/bricks8/A1
Brick17: gqas014.sbu.lab.eng.bos.redhat.com:/bricks9/A1
Brick18: gqas007.sbu.lab.eng.bos.redhat.com:/bricks9/A1
Brick19: gqas014.sbu.lab.eng.bos.redhat.com:/bricks10/A1
Brick20: gqas007.sbu.lab.eng.bos.redhat.com:/bricks10/A1
Brick21: gqas014.sbu.lab.eng.bos.redhat.com:/bricks11/A1
Brick22: gqas007.sbu.lab.eng.bos.redhat.com:/bricks11/A1
Brick23: gqas014.sbu.lab.eng.bos.redhat.com:/bricks12/A1
Brick24: gqas007.sbu.lab.eng.bos.redhat.com:/bricks12/A1
Options Reconfigured:
ganesha.enable: on
network.inode-lru-limit: 50000
performance.md-cache-timeout: 600
performance.cache-invalidation: on
performance.stat-prefetch: on
features.cache-invalidation-timeout: 600
features.cache-invalidation: on
transport.address-family: inet
nfs.disable: on
nfs-ganesha: enable
cluster.enable-shared-storage: enable
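
As a quick sanity check on the layout above, "4 x (4 + 2) = 24" reads as four disperse sets of six bricks each (four data plus two redundancy). The sketch below redoes that arithmetic and counts how many bricks of each set land on each host. It assumes that sets are formed from consecutive bricks in listing order; the variable names are made up for this example.

```python
# Values taken from "Number of Bricks: 4 x (4 + 2) = 24".
subvolumes = 4         # number of disperse sets being distributed over
data_bricks = 4        # data fragments per set
redundancy_bricks = 2  # redundancy fragments per set

bricks_per_set = data_bricks + redundancy_bricks
total_bricks = subvolumes * bricks_per_set
usable_fraction = data_bricks / bricks_per_set

# In the brick list, odd-numbered bricks are on gqas014 and
# even-numbered bricks are on gqas007.
hosts = ["gqas014", "gqas007"]
brick_hosts = [hosts[i % 2] for i in range(total_bricks)]

# Assumption: each set is made of consecutive bricks in listing order.
sets = [
    brick_hosts[i:i + bricks_per_set]
    for i in range(0, total_bricks, bricks_per_set)
]
per_host = [{h: s.count(h) for h in hosts} for s in sets]

print(total_bricks, round(usable_fraction, 3))
for index, counts in enumerate(per_host, start=1):
    print(f"set {index}: {counts}")
```

Under that assumption every set has three bricks on each of the two nodes.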

Comment 3 Daniel Gryniewicz 2017-07-05 12:52:59 UTC

This is definitely memory corruption.

I notice that all these crashes are on the same machine (gqas007).  Is it possible that there's a hardware issue with the RAM on that box?  Do these crashes happen on any other box?  Can we attempt to reproduce on a set of boxes excluding this one?
