Description of problem:
----------------------
4-node cluster with a 2x2 volume. The volume is mounted via NFSv3 and NFSv4 on 7 clients, and I/O (dd and tarball untar) is driven from all the mounts. About 1.5 hours into the workload, Ganesha crashed and dumped core on 3 of the 4 nodes. Since pacemaker quorum was lost, all I/O hung at the mount points.

The signature of the backtrace is different from the one reported in https://bugzilla.redhat.com/show_bug.cgi?id=1398921

********** On gqas009 **********

(gdb) bt
#0  remove_recolour (head=head@entry=0x7f0fa4006040, parent=0x7f1094068e00, node=<optimized out>) at /usr/src/debug/ntirpc-1.4.3/src/rbtree.c:331
#1  0x00007f123956cc63 in opr_rbtree_remove (head=head@entry=0x7f0fa4006040, node=<optimized out>, node@entry=0x7f115c024150) at /usr/src/debug/ntirpc-1.4.3/src/rbtree.c:453
#2  0x00007f123b4ba591 in rbtree_x_cached_remove (hk=<optimized out>, nk=0x7f115c024150, t=0x7f0fa4005f90, xt=0x7f0fa40010e8) at /usr/include/ntirpc/misc/rbtree_x.h:154
#3  nfs_dupreq_finish (req=req@entry=0x7f101c81b328, res_nfs=res_nfs@entry=0x7f0ef0012cc0) at /usr/src/debug/nfs-ganesha-2.4.1/src/RPCAL/nfs_dupreq.c:1123
#4  0x00007f123b4402a7 in nfs_rpc_execute (reqdata=reqdata@entry=0x7f101c81b300) at /usr/src/debug/nfs-ganesha-2.4.1/src/MainNFSD/nfs_worker_thread.c:1358
#5  0x00007f123b44178a in worker_run (ctx=0x7f123c9fcac0) at /usr/src/debug/nfs-ganesha-2.4.1/src/MainNFSD/nfs_worker_thread.c:1548
#6  0x00007f123b4cb189 in fridgethr_start_routine (arg=0x7f123c9fcac0) at /usr/src/debug/nfs-ganesha-2.4.1/src/support/fridgethr.c:550
#7  0x00007f12399abdc5 in start_thread () from /lib64/libpthread.so.0
#8  0x00007f123907a73d in clone () from /lib64/libc.so.6

*********** On gqas015 ***********

(gdb) bt
#0  0x00007fd4d52811d7 in raise () from /lib64/libc.so.6
#1  0x00007fd4d52828c8 in abort () from /lib64/libc.so.6
#2  0x00007fd4d52c0f07 in __libc_message () from /lib64/libc.so.6
#3  0x00007fd4d52c8503 in _int_free () from /lib64/libc.so.6
#4  0x00007fd43342f8d7 in wb_forget (this=<optimized out>, inode=<optimized out>) at write-behind.c:2258
#5  0x00007fd44a09e471 in __inode_ctx_free (inode=inode@entry=0x7fd42331480c) at inode.c:332
#6  0x00007fd44a09f652 in __inode_destroy (inode=0x7fd42331480c) at inode.c:353
#7  inode_table_prune (table=table@entry=0x7fd42c002420) at inode.c:1543
#8  0x00007fd44a09f934 in inode_unref (inode=0x7fd42331480c) at inode.c:524
#9  0x00007fd44a3773b6 in pub_glfs_h_close (object=0x7fd14802f610) at glfs-handleops.c:1365
#10 0x00007fd44a790a59 in handle_release (obj_hdl=0x7fd14802f318) at /usr/src/debug/nfs-ganesha-2.4.1/src/FSAL/FSAL_GLUSTER/handle.c:71
#11 0x00007fd4d77b4812 in mdcache_lru_clean (entry=0x7fd1480d0860) at /usr/src/debug/nfs-ganesha-2.4.1/src/FSAL/Stackable_FSALs/FSAL_MDCACHE/mdcache_lru.c:421
#12 mdcache_lru_get (entry=entry@entry=0x7fd4aaa5bd18) at /usr/src/debug/nfs-ganesha-2.4.1/src/FSAL/Stackable_FSALs/FSAL_MDCACHE/mdcache_lru.c:1201
#13 0x00007fd4d77bec7e in mdcache_alloc_handle (fs=0x0, sub_handle=0x7fd2e803abd8, export=0x7fd4440d2130) at /usr/src/debug/nfs-ganesha-2.4.1/src/FSAL/Stackable_FSALs/FSAL_MDCACHE/mdcache_helpers.c:117
#14 mdcache_new_entry (export=export@entry=0x7fd4440d2130, sub_handle=0x7fd2e803abd8, attrs_in=attrs_in@entry=0x7fd4aaa5be70, attrs_out=attrs_out@entry=0x0, new_directory=new_directory@entry=false, entry=entry@entry=0x7fd4aaa5bdd0, state=state@entry=0x0) at /usr/src/debug/nfs-ganesha-2.4.1/src/FSAL/Stackable_FSALs/FSAL_MDCACHE/mdcache_helpers.c:411
#15 0x00007fd4d77b86b4 in mdcache_alloc_and_check_handle (export=export@entry=0x7fd4440d2130, sub_handle=<optimized out>, new_obj=new_obj@entry=0x7fd4aaa5be68, new_directory=new_directory@entry=false, attrs_in=attrs_in@entry=0x7fd4aaa5be70, attrs_out=attrs_out@entry=0x0, tag=tag@entry=0x7fd4d77edb84 "lookup ", parent=parent@entry=0x7fd17c0ce920, name=name@entry=0x7fd2e8010d30 ".gitignore", invalidate=invalidate@entry=true, state=state@entry=0x0) at /usr/src/debug/nfs-ganesha-2.4.1/src/FSAL/Stackable_FSALs/FSAL_MDCACHE/mdcache_handle.c:93
#16 0x00007fd4d77bfefa in mdc_lookup_uncached (mdc_parent=mdc_parent@entry=0x7fd17c0ce920, name=name@entry=0x7fd2e8010d30 ".gitignore", new_entry=new_entry@entry=0x7fd4aaa5c010, attrs_out=attrs_out@entry=0x0) at /usr/src/debug/nfs-ganesha-2.4.1/src/FSAL/Stackable_FSALs/FSAL_MDCACHE/mdcache_helpers.c:1041
#17 0x00007fd4d77c02cd in mdc_lookup (mdc_parent=0x7fd17c0ce920, name=0x7fd2e8010d30 ".gitignore", uncached=uncached@entry=true, new_entry=new_entry@entry=0x7fd4aaa5c010, attrs_out=0x0) at /usr/src/debug/nfs-ganesha-2.4.1/src/FSAL/Stackable_FSALs/FSAL_MDCACHE/mdcache_helpers.c:985
#18 0x00007fd4d77b79eb in mdcache_lookup (parent=<optimized out>, name=<optimized out>, handle=0x7fd4aaa5c098, attrs_out=<optimized out>) at /usr/src/debug/nfs-ganesha-2.4.1/src/FSAL/Stackable_FSALs/FSAL_MDCACHE/mdcache_handle.c:166
#19 0x00007fd4d76efc97 in fsal_lookup (parent=0x7fd17c0ce958, name=0x7fd2e8010d30 ".gitignore", obj=obj@entry=0x7fd4aaa5c098, attrs_out=attrs_out@entry=0x0) at /usr/src/debug/nfs-ganesha-2.4.1/src/FSAL/fsal_helper.c:712
#20 0x00007fd4d7723636 in nfs4_op_lookup (op=<optimized out>, data=0x7fd4aaa5c180, resp=0x7fd2e801cc70) at /usr/src/debug/nfs-ganesha-2.4.1/src/Protocols/NFS/nfs4_op_lookup.c:106
#21 0x00007fd4d7717f7d in nfs4_Compound (arg=<optimized out>, req=<optimized out>, res=0x7fd2e804eb30) at /usr/src/debug/nfs-ganesha-2.4.1/src/Protocols/NFS/nfs4_Compound.c:734
#22 0x00007fd4d770912c in nfs_rpc_execute (reqdata=reqdata@entry=0x7fd26c01c050) at /usr/src/debug/nfs-ganesha-2.4.1/src/MainNFSD/nfs_worker_thread.c:1281
#23 0x00007fd4d770a78a in worker_run (ctx=0x7fd4d7c814c0) at /usr/src/debug/nfs-ganesha-2.4.1/src/MainNFSD/nfs_worker_thread.c:1548
#24 0x00007fd4d7794189 in fridgethr_start_routine (arg=0x7fd4d7c814c0) at /usr/src/debug/nfs-ganesha-2.4.1/src/support/fridgethr.c:550
#25 0x00007fd4d5c74dc5 in start_thread () from /lib64/libpthread.so.0
#26 0x00007fd4d534373d in clone () from /lib64/libc.so.6

*********** On gqas014 ***********

(gdb) bt
#0  0x00007fc652ec11d7 in raise () from /lib64/libc.so.6
#1  0x00007fc652ec28c8 in abort () from /lib64/libc.so.6
#2  0x00007fc652f00f07 in __libc_message () from /lib64/libc.so.6
#3  0x00007fc652f08503 in _int_free () from /lib64/libc.so.6
#4  0x00007fc6553c3522 in gsh_free (p=<optimized out>) at /usr/src/debug/nfs-ganesha-2.4.1/src/include/abstract_mem.h:271
#5  pool_free (pool=<optimized out>, object=<optimized out>) at /usr/src/debug/nfs-ganesha-2.4.1/src/include/abstract_mem.h:420
#6  free_nfs_res (res=<optimized out>) at /usr/src/debug/nfs-ganesha-2.4.1/src/include/nfs_dupreq.h:125
#7  nfs_dupreq_free_dupreq (dv=0x7fc40c22e830) at /usr/src/debug/nfs-ganesha-2.4.1/src/RPCAL/nfs_dupreq.c:784
#8  nfs_dupreq_finish (req=req@entry=0x7fc5880008e8, res_nfs=res_nfs@entry=0x7fc47403a280) at /usr/src/debug/nfs-ganesha-2.4.1/src/RPCAL/nfs_dupreq.c:1133
#9  0x00007fc6553492a7 in nfs_rpc_execute (reqdata=reqdata@entry=0x7fc5880008c0) at /usr/src/debug/nfs-ganesha-2.4.1/src/MainNFSD/nfs_worker_thread.c:1358
#10 0x00007fc65534a78a in worker_run (ctx=0x7fc6556d4ec0) at /usr/src/debug/nfs-ganesha-2.4.1/src/MainNFSD/nfs_worker_thread.c:1548
#11 0x00007fc6553d4189 in fridgethr_start_routine (arg=0x7fc6556d4ec0) at /usr/src/debug/nfs-ganesha-2.4.1/src/support/fridgethr.c:550
#12 0x00007fc6538b4dc5 in start_thread () from /lib64/libpthread.so.0
#13 0x00007fc652f8373d in clone () from /lib64/libc.so.6

Version-Release number of selected component (if applicable):
--------------------------------------------------------------
glusterfs-ganesha-3.8.4-5.el7rhgs.x86_64
nfs-ganesha-2.4.1-1.el7rhgs.x86_64

How reproducible:
-----------------
Reporting the first occurrence.

Steps to Reproduce:
-------------------
1. Create a 4-node cluster and mount the volume via NFSv3 and NFSv4 on the clients.
2. Pump I/O (dd and tarball untar) from all the mounts.
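The step-2 workload can be sketched roughly as below. This is a minimal illustration, not the exact script used during the test: the mount-point list, tarball path, and dd sizes are all placeholders.

```shell
#!/bin/sh
# Hypothetical sketch of the step-2 workload: a dd stream plus a tarball
# untar running in parallel on each NFS mount point.
run_workload() {
    # $1: space-separated list of mount points (split intentionally)
    # $2: path to a tarball to extract on each mount
    for m in $1; do
        # Sequential writes via dd (size is a placeholder)
        dd if=/dev/zero of="$m/ddfile.$$" bs=1M count=8 2>/dev/null &
        # Metadata-heavy untar in parallel
        mkdir -p "$m/untar.$$" && tar -xzf "$2" -C "$m/untar.$$" &
    done
    wait
}

# Example invocation (paths are placeholders, not from the report):
# run_workload "/mnt/nfs3 /mnt/nfs4" /tmp/kernel-source.tar.gz
```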
Actual results:
---------------
Ganesha crashes on 3 of the 4 nodes. I/O hangs at the mount points because pacemaker quorum is lost.

Expected results:
-----------------
No crashes.

Additional info:
----------------
OS: RHEL 7.3

*Vol Config*:

Volume Name: testvol
Type: Distributed-Replicate
Volume ID: db9c8fe1-375d-4375-955b-f8291af4f931
Status: Started
Snapshot Count: 0
Number of Bricks: 2 x 2 = 4
Transport-type: tcp
Bricks:
Brick1: gqas014.sbu.lab.eng.bos.redhat.com:/bricks/testvol_brick0
Brick2: gqas009.sbu.lab.eng.bos.redhat.com:/bricks/testvol_brick1
Brick3: gqas010.sbu.lab.eng.bos.redhat.com:/bricks/testvol_brick2
Brick4: gqas015.sbu.lab.eng.bos.redhat.com:/bricks/testvol_brick3
Options Reconfigured:
ganesha.enable: on
features.cache-invalidation: on
server.allow-insecure: on
performance.stat-prefetch: off
transport.address-family: inet
performance.readdir-ahead: on
nfs.disable: on
nfs-ganesha: enable
cluster.enable-shared-storage: enable
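For anyone trying to recreate the layout above, the 2 x 2 volume corresponds roughly to the following gluster CLI commands. This is a sketch, not the exact commands used; it assumes a working 4-node trusted storage pool and the Ganesha HA setup already in place, so it is not runnable standalone.

```shell
# Sketch only: recreate the 2 x 2 distributed-replicate volume described
# in "Vol Config" above (brick paths taken from the report).
gluster volume create testvol replica 2 \
    gqas014.sbu.lab.eng.bos.redhat.com:/bricks/testvol_brick0 \
    gqas009.sbu.lab.eng.bos.redhat.com:/bricks/testvol_brick1 \
    gqas010.sbu.lab.eng.bos.redhat.com:/bricks/testvol_brick2 \
    gqas015.sbu.lab.eng.bos.redhat.com:/bricks/testvol_brick3
gluster volume start testvol
gluster volume set testvol features.cache-invalidation on
# Export the volume through NFS-Ganesha
gluster volume set testvol ganesha.enable on
```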
This is the same backtrace as bug #1401160.
There is reason to suspect this bug has the same root cause as https://bugzilla.redhat.com/show_bug.cgi?id=1398846 (that bug has been updated with the proposed fix from upstream).
The reported issue was not reproducible on Ganesha 2.4.1-6, Gluster 3.8.4-12 in two attempts. Will reopen if it is hit again during regression runs.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://rhn.redhat.com/errata/RHEA-2017-0493.html
The needinfo request(s) on this closed bug have been removed, as they have been unresolved for 1000 days.