Description of problem:
-----------------------
*This is to track one of the BTs seen in https://bugzilla.redhat.com/show_bug.cgi?id=1401182, possibly in the write-behind (WB) layer*:

4-node cluster with a 2x2 volume. The volume is mounted via NFSv3 and NFSv4 on 7 clients, and I/O (dd and tarball untar) is pumped from all the mounts. About 1.5 hours into the workload, Ganesha crashed on 3 of the 4 nodes and dumped core. Since pacemaker quorum was lost, all I/O hung at the mount points.

(gdb) bt
#0  0x00007fd4d52811d7 in raise () from /lib64/libc.so.6
#1  0x00007fd4d52828c8 in abort () from /lib64/libc.so.6
#2  0x00007fd4d52c0f07 in __libc_message () from /lib64/libc.so.6
#3  0x00007fd4d52c8503 in _int_free () from /lib64/libc.so.6
#4  0x00007fd43342f8d7 in wb_forget (this=<optimized out>, inode=<optimized out>) at write-behind.c:2258
#5  0x00007fd44a09e471 in __inode_ctx_free (inode=inode@entry=0x7fd42331480c) at inode.c:332
#6  0x00007fd44a09f652 in __inode_destroy (inode=0x7fd42331480c) at inode.c:353
#7  inode_table_prune (table=table@entry=0x7fd42c002420) at inode.c:1543
#8  0x00007fd44a09f934 in inode_unref (inode=0x7fd42331480c) at inode.c:524
#9  0x00007fd44a3773b6 in pub_glfs_h_close (object=0x7fd14802f610) at glfs-handleops.c:1365
#10 0x00007fd44a790a59 in handle_release (obj_hdl=0x7fd14802f318) at /usr/src/debug/nfs-ganesha-2.4.1/src/FSAL/FSAL_GLUSTER/handle.c:71
#11 0x00007fd4d77b4812 in mdcache_lru_clean (entry=0x7fd1480d0860) at /usr/src/debug/nfs-ganesha-2.4.1/src/FSAL/Stackable_FSALs/FSAL_MDCACHE/mdcache_lru.c:421
#12 mdcache_lru_get (entry=entry@entry=0x7fd4aaa5bd18) at /usr/src/debug/nfs-ganesha-2.4.1/src/FSAL/Stackable_FSALs/FSAL_MDCACHE/mdcache_lru.c:1201
#13 0x00007fd4d77bec7e in mdcache_alloc_handle (fs=0x0, sub_handle=0x7fd2e803abd8, export=0x7fd4440d2130) at /usr/src/debug/nfs-ganesha-2.4.1/src/FSAL/Stackable_FSALs/FSAL_MDCACHE/mdcache_helpers.c:117
#14 mdcache_new_entry (export=export@entry=0x7fd4440d2130, sub_handle=0x7fd2e803abd8, attrs_in=attrs_in@entry=0x7fd4aaa5be70, attrs_out=attrs_out@entry=0x0, new_directory=new_directory@entry=false, entry=entry@entry=0x7fd4aaa5bdd0, state=state@entry=0x0) at /usr/src/debug/nfs-ganesha-2.4.1/src/FSAL/Stackable_FSALs/FSAL_MDCACHE/mdcache_helpers.c:411
#15 0x00007fd4d77b86b4 in mdcache_alloc_and_check_handle (export=export@entry=0x7fd4440d2130, sub_handle=<optimized out>, new_obj=new_obj@entry=0x7fd4aaa5be68, new_directory=new_directory@entry=false, attrs_in=attrs_in@entry=0x7fd4aaa5be70, attrs_out=attrs_out@entry=0x0, tag=tag@entry=0x7fd4d77edb84 "lookup ", parent=parent@entry=0x7fd17c0ce920, name=name@entry=0x7fd2e8010d30 ".gitignore", invalidate=invalidate@entry=true, state=state@entry=0x0) at /usr/src/debug/nfs-ganesha-2.4.1/src/FSAL/Stackable_FSALs/FSAL_MDCACHE/mdcache_handle.c:93
#16 0x00007fd4d77bfefa in mdc_lookup_uncached (mdc_parent=mdc_parent@entry=0x7fd17c0ce920, name=name@entry=0x7fd2e8010d30 ".gitignore", new_entry=new_entry@entry=0x7fd4aaa5c010, attrs_out=attrs_out@entry=0x0) at /usr/src/debug/nfs-ganesha-2.4.1/src/FSAL/Stackable_FSALs/FSAL_MDCACHE/mdcache_helpers.c:1041
#17 0x00007fd4d77c02cd in mdc_lookup (mdc_parent=0x7fd17c0ce920, name=0x7fd2e8010d30 ".gitignore", uncached=uncached@entry=true, new_entry=new_entry@entry=0x7fd4aaa5c010, attrs_out=0x0) at /usr/src/debug/nfs-ganesha-2.4.1/src/FSAL/Stackable_FSALs/FSAL_MDCACHE/mdcache_helpers.c:985
#18 0x00007fd4d77b79eb in mdcache_lookup (parent=<optimized out>, name=<optimized out>, handle=0x7fd4aaa5c098, attrs_out=<optimized out>) at /usr/src/debug/nfs-ganesha-2.4.1/src/FSAL/Stackable_FSALs/FSAL_MDCACHE/mdcache_handle.c:166
#19 0x00007fd4d76efc97 in fsal_lookup (parent=0x7fd17c0ce958, name=0x7fd2e8010d30 ".gitignore", obj=obj@entry=0x7fd4aaa5c098, attrs_out=attrs_out@entry=0x0) at /usr/src/debug/nfs-ganesha-2.4.1/src/FSAL/fsal_helper.c:712
#20 0x00007fd4d7723636 in nfs4_op_lookup (op=<optimized out>, data=0x7fd4aaa5c180, resp=0x7fd2e801cc70) at /usr/src/debug/nfs-ganesha-2.4.1/src/Protocols/NFS/nfs4_op_lookup.c:106
#21 0x00007fd4d7717f7d in nfs4_Compound (arg=<optimized out>, req=<optimized out>, res=0x7fd2e804eb30) at /usr/src/debug/nfs-ganesha-2.4.1/src/Protocols/NFS/nfs4_Compound.c:734
#22 0x00007fd4d770912c in nfs_rpc_execute (reqdata=reqdata@entry=0x7fd26c01c050) at /usr/src/debug/nfs-ganesha-2.4.1/src/MainNFSD/nfs_worker_thread.c:1281
#23 0x00007fd4d770a78a in worker_run (ctx=0x7fd4d7c814c0) at /usr/src/debug/nfs-ganesha-2.4.1/src/MainNFSD/nfs_worker_thread.c:1548
#24 0x00007fd4d7794189 in fridgethr_start_routine (arg=0x7fd4d7c814c0) at /usr/src/debug/nfs-ganesha-2.4.1/src/support/fridgethr.c:550
#25 0x00007fd4d5c74dc5 in start_thread () from /lib64/libpthread.so.0
#26 0x00007fd4d534373d in clone () from /lib64/libc.so.6
(gdb)

Version-Release number of selected component (if applicable):
-------------------------------------------------------------
glusterfs-ganesha-3.8.4-5.el7rhgs.x86_64
nfs-ganesha-2.4.1-1.el7rhgs.x86_64

How reproducible:
------------------
1/1

Steps to Reproduce:
------------------
1. Create a 4-node cluster and mount the volume via v3 and v4 on the clients.
2. Pump I/O (dd and tarball untar) from all the mounts.

Actual results:
---------------
Ganesha crashes on 3 of the 4 nodes.
Expected results:
----------------
No crashes.

Additional info:
----------------
OS: RHEL 7.3

*Vol Config*:

Volume Name: testvol
Type: Distributed-Replicate
Volume ID: db9c8fe1-375d-4375-955b-f8291af4f931
Status: Started
Snapshot Count: 0
Number of Bricks: 2 x 2 = 4
Transport-type: tcp
Bricks:
Brick1: gqas014.sbu.lab.eng.bos.redhat.com:/bricks/testvol_brick0
Brick2: gqas009.sbu.lab.eng.bos.redhat.com:/bricks/testvol_brick1
Brick3: gqas010.sbu.lab.eng.bos.redhat.com:/bricks/testvol_brick2
Brick4: gqas015.sbu.lab.eng.bos.redhat.com:/bricks/testvol_brick3
Options Reconfigured:
ganesha.enable: on
features.cache-invalidation: on
server.allow-insecure: on
performance.stat-prefetch: off
transport.address-family: inet
performance.readdir-ahead: on
nfs.disable: on
nfs-ganesha: enable
cluster.enable-shared-storage: enable

More info on https://bugzilla.redhat.com/show_bug.cgi?id=1401182

Sosreport and core here: http://rhsqe-repo.lab.eng.blr.redhat.com/sosreports/1401182
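For reference, the client-side workload described above (dd plus tarball untar on the NFS mounts) can be sketched roughly as below. This is a minimal, hedged sketch: the mount point path, file sizes, and file names are assumptions, the real run used 7 clients hammering the mounts in a loop, and the script falls back to a temporary directory when MNT is unset so it can be exercised standalone.

```shell
# MNT would normally point at the NFSv3/v4 mount of testvol on a client;
# default to a scratch temp dir so the sketch runs without a cluster.
MNT="${MNT:-$(mktemp -d)}"

# Sequential-write load, standing in for the dd portion of the workload.
dd if=/dev/zero of="$MNT/ddfile" bs=1M count=8 2>/dev/null

# Tarball-untar load: build a small tarball, then extract it on the mount.
# (The actual run would repeat these steps continuously from every client.)
mkdir -p "$MNT/src"
echo "payload" > "$MNT/src/file.txt"
tar -czf "$MNT/sample.tar.gz" -C "$MNT" src
mkdir -p "$MNT/untar"
tar -xzf "$MNT/sample.tar.gz" -C "$MNT/untar"

ls -l "$MNT"
```

Running several such loops in parallel per client approximates the metadata churn (lookups, creates, inode forgets) visible in the backtrace above.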
The reported issue was not reproducible on Ganesha 2.4.1-6 / Gluster 3.8.4-12 in two tries. Will reopen if hit again during regressions.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://rhn.redhat.com/errata/RHEA-2017-0493.html