Bug 1401182 - [Tracker] : Ganesha crashes on writes from heterogeneous clients ; Pacemaker quorum lost ; I/O halted on application [NEEDINFO]
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Gluster Storage
Classification: Red Hat
Component: nfs-ganesha
Version: rhgs-3.2
Hardware: x86_64
OS: Linux
Priority: unspecified
Severity: urgent
Target Milestone: ---
Target Release: RHGS 3.2.0
Assignee: Matt Benjamin (redhat)
QA Contact: Ambarish
URL:
Whiteboard:
Depends On: 1398846 1403706
Blocks: 1351528
 
Reported: 2016-12-03 11:13 UTC by Ambarish
Modified: 2017-03-28 06:54 UTC
12 users

Fixed In Version: nfs-ganesha-2.4.1-4
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2017-03-23 06:26:10 UTC
Target Upstream Version:
skoduri: needinfo? (mbenjamin)




Links
System ID Private Priority Status Summary Last Updated
Red Hat Product Errata RHEA-2017:0493 0 normal SHIPPED_LIVE Red Hat Gluster Storage 3.2.0 nfs-ganesha bug fix and enhancement update 2017-03-23 09:19:13 UTC

Description Ambarish 2016-12-03 11:13:54 UTC
Description of problem:
----------------------

4-node cluster with a 2x2 (distributed-replicate) volume.
The volume is mounted via NFSv3 and NFSv4 on 7 clients, and I/O (dd and tarball untar) is pumped from all the mounts.

About 1.5 hours into the workload, Ganesha crashed and dumped core on 3 of the 4 nodes. Since pacemaker quorum was lost, all I/O hung at the mount points.

The signature of the backtrace is different from the one I reported in https://bugzilla.redhat.com/show_bug.cgi?id=1398921

**********
On gqas009
**********

(gdb) bt
#0  remove_recolour (head=head@entry=0x7f0fa4006040, parent=0x7f1094068e00, node=<optimized out>)
    at /usr/src/debug/ntirpc-1.4.3/src/rbtree.c:331
#1  0x00007f123956cc63 in opr_rbtree_remove (head=head@entry=0x7f0fa4006040, node=<optimized out>, 
    node@entry=0x7f115c024150) at /usr/src/debug/ntirpc-1.4.3/src/rbtree.c:453
#2  0x00007f123b4ba591 in rbtree_x_cached_remove (hk=<optimized out>, nk=0x7f115c024150, t=0x7f0fa4005f90, 
    xt=0x7f0fa40010e8) at /usr/include/ntirpc/misc/rbtree_x.h:154
#3  nfs_dupreq_finish (req=req@entry=0x7f101c81b328, res_nfs=res_nfs@entry=0x7f0ef0012cc0)
    at /usr/src/debug/nfs-ganesha-2.4.1/src/RPCAL/nfs_dupreq.c:1123
#4  0x00007f123b4402a7 in nfs_rpc_execute (reqdata=reqdata@entry=0x7f101c81b300)
    at /usr/src/debug/nfs-ganesha-2.4.1/src/MainNFSD/nfs_worker_thread.c:1358
#5  0x00007f123b44178a in worker_run (ctx=0x7f123c9fcac0)
    at /usr/src/debug/nfs-ganesha-2.4.1/src/MainNFSD/nfs_worker_thread.c:1548
#6  0x00007f123b4cb189 in fridgethr_start_routine (arg=0x7f123c9fcac0)
    at /usr/src/debug/nfs-ganesha-2.4.1/src/support/fridgethr.c:550
#7  0x00007f12399abdc5 in start_thread () from /lib64/libpthread.so.0
#8  0x00007f123907a73d in clone () from /lib64/libc.so.6
(gdb) 



***********
On gqas015
***********

(gdb) bt
#0  0x00007fd4d52811d7 in raise () from /lib64/libc.so.6
#1  0x00007fd4d52828c8 in abort () from /lib64/libc.so.6
#2  0x00007fd4d52c0f07 in __libc_message () from /lib64/libc.so.6
#3  0x00007fd4d52c8503 in _int_free () from /lib64/libc.so.6
#4  0x00007fd43342f8d7 in wb_forget (this=<optimized out>, inode=<optimized out>) at write-behind.c:2258
#5  0x00007fd44a09e471 in __inode_ctx_free (inode=inode@entry=0x7fd42331480c) at inode.c:332
#6  0x00007fd44a09f652 in __inode_destroy (inode=0x7fd42331480c) at inode.c:353
#7  inode_table_prune (table=table@entry=0x7fd42c002420) at inode.c:1543
#8  0x00007fd44a09f934 in inode_unref (inode=0x7fd42331480c) at inode.c:524
#9  0x00007fd44a3773b6 in pub_glfs_h_close (object=0x7fd14802f610) at glfs-handleops.c:1365
#10 0x00007fd44a790a59 in handle_release (obj_hdl=0x7fd14802f318)
    at /usr/src/debug/nfs-ganesha-2.4.1/src/FSAL/FSAL_GLUSTER/handle.c:71
#11 0x00007fd4d77b4812 in mdcache_lru_clean (entry=0x7fd1480d0860)
    at /usr/src/debug/nfs-ganesha-2.4.1/src/FSAL/Stackable_FSALs/FSAL_MDCACHE/mdcache_lru.c:421
#12 mdcache_lru_get (entry=entry@entry=0x7fd4aaa5bd18)
    at /usr/src/debug/nfs-ganesha-2.4.1/src/FSAL/Stackable_FSALs/FSAL_MDCACHE/mdcache_lru.c:1201
#13 0x00007fd4d77bec7e in mdcache_alloc_handle (fs=0x0, sub_handle=0x7fd2e803abd8, export=0x7fd4440d2130)
    at /usr/src/debug/nfs-ganesha-2.4.1/src/FSAL/Stackable_FSALs/FSAL_MDCACHE/mdcache_helpers.c:117
#14 mdcache_new_entry (export=export@entry=0x7fd4440d2130, sub_handle=0x7fd2e803abd8, 
    attrs_in=attrs_in@entry=0x7fd4aaa5be70, attrs_out=attrs_out@entry=0x0, new_directory=new_directory@entry=false, 
    entry=entry@entry=0x7fd4aaa5bdd0, state=state@entry=0x0)
    at /usr/src/debug/nfs-ganesha-2.4.1/src/FSAL/Stackable_FSALs/FSAL_MDCACHE/mdcache_helpers.c:411
#15 0x00007fd4d77b86b4 in mdcache_alloc_and_check_handle (export=export@entry=0x7fd4440d2130, 
    sub_handle=<optimized out>, new_obj=new_obj@entry=0x7fd4aaa5be68, new_directory=new_directory@entry=false, 
    attrs_in=attrs_in@entry=0x7fd4aaa5be70, attrs_out=attrs_out@entry=0x0, tag=tag@entry=0x7fd4d77edb84 "lookup ", 
    parent=parent@entry=0x7fd17c0ce920, name=name@entry=0x7fd2e8010d30 ".gitignore", 
    invalidate=invalidate@entry=true, state=state@entry=0x0)
    at /usr/src/debug/nfs-ganesha-2.4.1/src/FSAL/Stackable_FSALs/FSAL_MDCACHE/mdcache_handle.c:93
#16 0x00007fd4d77bfefa in mdc_lookup_uncached (mdc_parent=mdc_parent@entry=0x7fd17c0ce920, 
    name=name@entry=0x7fd2e8010d30 ".gitignore", new_entry=new_entry@entry=0x7fd4aaa5c010, 
    attrs_out=attrs_out@entry=0x0)
    at /usr/src/debug/nfs-ganesha-2.4.1/src/FSAL/Stackable_FSALs/FSAL_MDCACHE/mdcache_helpers.c:1041
#17 0x00007fd4d77c02cd in mdc_lookup (mdc_parent=0x7fd17c0ce920, name=0x7fd2e8010d30 ".gitignore", 
    uncached=uncached@entry=true, new_entry=new_entry@entry=0x7fd4aaa5c010, attrs_out=0x0)
    at /usr/src/debug/nfs-ganesha-2.4.1/src/FSAL/Stackable_FSALs/FSAL_MDCACHE/mdcache_helpers.c:985
#18 0x00007fd4d77b79eb in mdcache_lookup (parent=<optimized out>, name=<optimized out>, handle=0x7fd4aaa5c098, 
    attrs_out=<optimized out>)
    at /usr/src/debug/nfs-ganesha-2.4.1/src/FSAL/Stackable_FSALs/FSAL_MDCACHE/mdcache_handle.c:166
#19 0x00007fd4d76efc97 in fsal_lookup (parent=0x7fd17c0ce958, name=0x7fd2e8010d30 ".gitignore", 
    obj=obj@entry=0x7fd4aaa5c098, attrs_out=attrs_out@entry=0x0)
    at /usr/src/debug/nfs-ganesha-2.4.1/src/FSAL/fsal_helper.c:712
#20 0x00007fd4d7723636 in nfs4_op_lookup (op=<optimized out>, data=0x7fd4aaa5c180, resp=0x7fd2e801cc70)
    at /usr/src/debug/nfs-ganesha-2.4.1/src/Protocols/NFS/nfs4_op_lookup.c:106
#21 0x00007fd4d7717f7d in nfs4_Compound (arg=<optimized out>, req=<optimized out>, res=0x7fd2e804eb30)
    at /usr/src/debug/nfs-ganesha-2.4.1/src/Protocols/NFS/nfs4_Compound.c:734
#22 0x00007fd4d770912c in nfs_rpc_execute (reqdata=reqdata@entry=0x7fd26c01c050)
    at /usr/src/debug/nfs-ganesha-2.4.1/src/MainNFSD/nfs_worker_thread.c:1281
#23 0x00007fd4d770a78a in worker_run (ctx=0x7fd4d7c814c0)
    at /usr/src/debug/nfs-ganesha-2.4.1/src/MainNFSD/nfs_worker_thread.c:1548
#24 0x00007fd4d7794189 in fridgethr_start_routine (arg=0x7fd4d7c814c0)
    at /usr/src/debug/nfs-ganesha-2.4.1/src/support/fridgethr.c:550
#25 0x00007fd4d5c74dc5 in start_thread () from /lib64/libpthread.so.0
#26 0x00007fd4d534373d in clone () from /lib64/libc.so.6
(gdb) 


***********
On gqas014
***********

(gdb) bt
#0  0x00007fc652ec11d7 in raise () from /lib64/libc.so.6
#1  0x00007fc652ec28c8 in abort () from /lib64/libc.so.6
#2  0x00007fc652f00f07 in __libc_message () from /lib64/libc.so.6
#3  0x00007fc652f08503 in _int_free () from /lib64/libc.so.6
#4  0x00007fc6553c3522 in gsh_free (p=<optimized out>)
    at /usr/src/debug/nfs-ganesha-2.4.1/src/include/abstract_mem.h:271
#5  pool_free (pool=<optimized out>, object=<optimized out>)
    at /usr/src/debug/nfs-ganesha-2.4.1/src/include/abstract_mem.h:420
#6  free_nfs_res (res=<optimized out>) at /usr/src/debug/nfs-ganesha-2.4.1/src/include/nfs_dupreq.h:125
#7  nfs_dupreq_free_dupreq (dv=0x7fc40c22e830) at /usr/src/debug/nfs-ganesha-2.4.1/src/RPCAL/nfs_dupreq.c:784
#8  nfs_dupreq_finish (req=req@entry=0x7fc5880008e8, res_nfs=res_nfs@entry=0x7fc47403a280)
    at /usr/src/debug/nfs-ganesha-2.4.1/src/RPCAL/nfs_dupreq.c:1133
#9  0x00007fc6553492a7 in nfs_rpc_execute (reqdata=reqdata@entry=0x7fc5880008c0)
    at /usr/src/debug/nfs-ganesha-2.4.1/src/MainNFSD/nfs_worker_thread.c:1358
#10 0x00007fc65534a78a in worker_run (ctx=0x7fc6556d4ec0)
    at /usr/src/debug/nfs-ganesha-2.4.1/src/MainNFSD/nfs_worker_thread.c:1548
#11 0x00007fc6553d4189 in fridgethr_start_routine (arg=0x7fc6556d4ec0)
    at /usr/src/debug/nfs-ganesha-2.4.1/src/support/fridgethr.c:550
#12 0x00007fc6538b4dc5 in start_thread () from /lib64/libpthread.so.0
#13 0x00007fc652f8373d in clone () from /lib64/libc.so.6
(gdb) 


Version-Release number of selected component (if applicable):
--------------------------------------------------------------

glusterfs-ganesha-3.8.4-5.el7rhgs.x86_64
nfs-ganesha-2.4.1-1.el7rhgs.x86_64


How reproducible:
-----------------

Reporting the first occurrence.

Steps to Reproduce:
-------------------

1. Create a 4 node cluster and mount the volume via v3 and v4 on the clients.

2. Pump I/O.
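The per-client workload implied by the steps above can be sketched as follows; the mount point, tarball, and sizes are placeholders (the report only says "dd and tarball untar"), and the NFS version split is an assumption based on the description:

```shell
#!/bin/sh
# Hypothetical per-client workload; MNT and TARBALL are placeholders.
# In a real run, MNT is an NFS mount of testvol (some clients via
# "mount -t nfs -o vers=3 ...", the rest via "-o vers=4"), TARBALL is
# a large source tarball, and COUNT is raised to push gigabytes.
MNT=${MNT:-$(mktemp -d)}
COUNT=${COUNT:-8}

# Large sequential writes...
dd if=/dev/zero of="$MNT/ddfile.$$" bs=1M count="$COUNT" status=none &
# ...plus a metadata-heavy untar, in parallel.
if [ -n "${TARBALL:-}" ]; then tar -xf "$TARBALL" -C "$MNT" & fi
wait
```

Running this concurrently from all 7 clients, split across NFSv3 and NFSv4 mounts, matches the heterogeneous-client write load described in the summary.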

Actual results:
---------------

Ganesha crashes on 3 of the 4 nodes. I/O hangs at the mount points because pacemaker quorum is lost.


Expected results:
-----------------

No crashes.


Additional info:
-----------------

OS : RHEL 7.3

*Vol Config* :
Volume Name: testvol
Type: Distributed-Replicate
Volume ID: db9c8fe1-375d-4375-955b-f8291af4f931
Status: Started
Snapshot Count: 0
Number of Bricks: 2 x 2 = 4
Transport-type: tcp
Bricks:
Brick1: gqas014.sbu.lab.eng.bos.redhat.com:/bricks/testvol_brick0
Brick2: gqas009.sbu.lab.eng.bos.redhat.com:/bricks/testvol_brick1
Brick3: gqas010.sbu.lab.eng.bos.redhat.com:/bricks/testvol_brick2
Brick4: gqas015.sbu.lab.eng.bos.redhat.com:/bricks/testvol_brick3
Options Reconfigured:
ganesha.enable: on
features.cache-invalidation: on
server.allow-insecure: on
performance.stat-prefetch: off
transport.address-family: inet
performance.readdir-ahead: on
nfs.disable: on
nfs-ganesha: enable
cluster.enable-shared-storage: enable

Comment 3 Daniel Gryniewicz 2016-12-05 13:47:40 UTC
This is the same backtrace as bug #1401160

Comment 4 Matt Benjamin (redhat) 2016-12-06 15:49:41 UTC
There is reason to suspect this bug has the same root cause as https://bugzilla.redhat.com/show_bug.cgi?id=1398846 (updated with proposed fix from upstream).

Comment 12 Ambarish 2017-01-20 07:54:37 UTC
The reported issue was not reproducible with nfs-ganesha 2.4.1-6 and glusterfs 3.8.4-12 in two attempts.

Will reopen if hit again during regressions.

Comment 14 errata-xmlrpc 2017-03-23 06:26:10 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHEA-2017-0493.html

