Description of problem:
----------------------
4-node cluster with a 2 x 2 volume.
The volume is mounted via NFSv3 and NFSv4 on 7 clients, and I/O (dd and tarball untar) is run from all the mounts.
About 1.5 hours into the workload, Ganesha crashed on 3 of the 4 nodes and dumped core. Since pacemaker quorum was lost, all I/O hung at the mount points.
The backtrace signature differs from the one I reported in https://bugzilla.redhat.com/show_bug.cgi?id=1398921
**********
On gqas009
**********
(gdb) bt
#0 remove_recolour (head=head@entry=0x7f0fa4006040, parent=0x7f1094068e00, node=<optimized out>)
at /usr/src/debug/ntirpc-1.4.3/src/rbtree.c:331
#1 0x00007f123956cc63 in opr_rbtree_remove (head=head@entry=0x7f0fa4006040, node=<optimized out>,
node@entry=0x7f115c024150) at /usr/src/debug/ntirpc-1.4.3/src/rbtree.c:453
#2 0x00007f123b4ba591 in rbtree_x_cached_remove (hk=<optimized out>, nk=0x7f115c024150, t=0x7f0fa4005f90,
xt=0x7f0fa40010e8) at /usr/include/ntirpc/misc/rbtree_x.h:154
#3 nfs_dupreq_finish (req=req@entry=0x7f101c81b328, res_nfs=res_nfs@entry=0x7f0ef0012cc0)
at /usr/src/debug/nfs-ganesha-2.4.1/src/RPCAL/nfs_dupreq.c:1123
#4 0x00007f123b4402a7 in nfs_rpc_execute (reqdata=reqdata@entry=0x7f101c81b300)
at /usr/src/debug/nfs-ganesha-2.4.1/src/MainNFSD/nfs_worker_thread.c:1358
#5 0x00007f123b44178a in worker_run (ctx=0x7f123c9fcac0)
at /usr/src/debug/nfs-ganesha-2.4.1/src/MainNFSD/nfs_worker_thread.c:1548
#6 0x00007f123b4cb189 in fridgethr_start_routine (arg=0x7f123c9fcac0)
at /usr/src/debug/nfs-ganesha-2.4.1/src/support/fridgethr.c:550
#7 0x00007f12399abdc5 in start_thread () from /lib64/libpthread.so.0
#8 0x00007f123907a73d in clone () from /lib64/libc.so.6
(gdb)
***********
On gqas015
***********
(gdb) bt
#0 0x00007fd4d52811d7 in raise () from /lib64/libc.so.6
#1 0x00007fd4d52828c8 in abort () from /lib64/libc.so.6
#2 0x00007fd4d52c0f07 in __libc_message () from /lib64/libc.so.6
#3 0x00007fd4d52c8503 in _int_free () from /lib64/libc.so.6
#4 0x00007fd43342f8d7 in wb_forget (this=<optimized out>, inode=<optimized out>) at write-behind.c:2258
#5 0x00007fd44a09e471 in __inode_ctx_free (inode=inode@entry=0x7fd42331480c) at inode.c:332
#6 0x00007fd44a09f652 in __inode_destroy (inode=0x7fd42331480c) at inode.c:353
#7 inode_table_prune (table=table@entry=0x7fd42c002420) at inode.c:1543
#8 0x00007fd44a09f934 in inode_unref (inode=0x7fd42331480c) at inode.c:524
#9 0x00007fd44a3773b6 in pub_glfs_h_close (object=0x7fd14802f610) at glfs-handleops.c:1365
#10 0x00007fd44a790a59 in handle_release (obj_hdl=0x7fd14802f318)
at /usr/src/debug/nfs-ganesha-2.4.1/src/FSAL/FSAL_GLUSTER/handle.c:71
#11 0x00007fd4d77b4812 in mdcache_lru_clean (entry=0x7fd1480d0860)
at /usr/src/debug/nfs-ganesha-2.4.1/src/FSAL/Stackable_FSALs/FSAL_MDCACHE/mdcache_lru.c:421
#12 mdcache_lru_get (entry=entry@entry=0x7fd4aaa5bd18)
at /usr/src/debug/nfs-ganesha-2.4.1/src/FSAL/Stackable_FSALs/FSAL_MDCACHE/mdcache_lru.c:1201
#13 0x00007fd4d77bec7e in mdcache_alloc_handle (fs=0x0, sub_handle=0x7fd2e803abd8, export=0x7fd4440d2130)
at /usr/src/debug/nfs-ganesha-2.4.1/src/FSAL/Stackable_FSALs/FSAL_MDCACHE/mdcache_helpers.c:117
#14 mdcache_new_entry (export=export@entry=0x7fd4440d2130, sub_handle=0x7fd2e803abd8,
attrs_in=attrs_in@entry=0x7fd4aaa5be70, attrs_out=attrs_out@entry=0x0, new_directory=new_directory@entry=false,
entry=entry@entry=0x7fd4aaa5bdd0, state=state@entry=0x0)
at /usr/src/debug/nfs-ganesha-2.4.1/src/FSAL/Stackable_FSALs/FSAL_MDCACHE/mdcache_helpers.c:411
#15 0x00007fd4d77b86b4 in mdcache_alloc_and_check_handle (export=export@entry=0x7fd4440d2130,
sub_handle=<optimized out>, new_obj=new_obj@entry=0x7fd4aaa5be68, new_directory=new_directory@entry=false,
attrs_in=attrs_in@entry=0x7fd4aaa5be70, attrs_out=attrs_out@entry=0x0, tag=tag@entry=0x7fd4d77edb84 "lookup ",
parent=parent@entry=0x7fd17c0ce920, name=name@entry=0x7fd2e8010d30 ".gitignore",
invalidate=invalidate@entry=true, state=state@entry=0x0)
at /usr/src/debug/nfs-ganesha-2.4.1/src/FSAL/Stackable_FSALs/FSAL_MDCACHE/mdcache_handle.c:93
#16 0x00007fd4d77bfefa in mdc_lookup_uncached (mdc_parent=mdc_parent@entry=0x7fd17c0ce920,
name=name@entry=0x7fd2e8010d30 ".gitignore", new_entry=new_entry@entry=0x7fd4aaa5c010,
attrs_out=attrs_out@entry=0x0)
at /usr/src/debug/nfs-ganesha-2.4.1/src/FSAL/Stackable_FSALs/FSAL_MDCACHE/mdcache_helpers.c:1041
#17 0x00007fd4d77c02cd in mdc_lookup (mdc_parent=0x7fd17c0ce920, name=0x7fd2e8010d30 ".gitignore",
uncached=uncached@entry=true, new_entry=new_entry@entry=0x7fd4aaa5c010, attrs_out=0x0)
at /usr/src/debug/nfs-ganesha-2.4.1/src/FSAL/Stackable_FSALs/FSAL_MDCACHE/mdcache_helpers.c:985
#18 0x00007fd4d77b79eb in mdcache_lookup (parent=<optimized out>, name=<optimized out>, handle=0x7fd4aaa5c098,
attrs_out=<optimized out>)
at /usr/src/debug/nfs-ganesha-2.4.1/src/FSAL/Stackable_FSALs/FSAL_MDCACHE/mdcache_handle.c:166
#19 0x00007fd4d76efc97 in fsal_lookup (parent=0x7fd17c0ce958, name=0x7fd2e8010d30 ".gitignore",
obj=obj@entry=0x7fd4aaa5c098, attrs_out=attrs_out@entry=0x0)
at /usr/src/debug/nfs-ganesha-2.4.1/src/FSAL/fsal_helper.c:712
#20 0x00007fd4d7723636 in nfs4_op_lookup (op=<optimized out>, data=0x7fd4aaa5c180, resp=0x7fd2e801cc70)
at /usr/src/debug/nfs-ganesha-2.4.1/src/Protocols/NFS/nfs4_op_lookup.c:106
#21 0x00007fd4d7717f7d in nfs4_Compound (arg=<optimized out>, req=<optimized out>, res=0x7fd2e804eb30)
at /usr/src/debug/nfs-ganesha-2.4.1/src/Protocols/NFS/nfs4_Compound.c:734
#22 0x00007fd4d770912c in nfs_rpc_execute (reqdata=reqdata@entry=0x7fd26c01c050)
at /usr/src/debug/nfs-ganesha-2.4.1/src/MainNFSD/nfs_worker_thread.c:1281
#23 0x00007fd4d770a78a in worker_run (ctx=0x7fd4d7c814c0)
at /usr/src/debug/nfs-ganesha-2.4.1/src/MainNFSD/nfs_worker_thread.c:1548
#24 0x00007fd4d7794189 in fridgethr_start_routine (arg=0x7fd4d7c814c0)
at /usr/src/debug/nfs-ganesha-2.4.1/src/support/fridgethr.c:550
#25 0x00007fd4d5c74dc5 in start_thread () from /lib64/libpthread.so.0
#26 0x00007fd4d534373d in clone () from /lib64/libc.so.6
(gdb)
***********
On gqas014
***********
(gdb) bt
#0 0x00007fc652ec11d7 in raise () from /lib64/libc.so.6
#1 0x00007fc652ec28c8 in abort () from /lib64/libc.so.6
#2 0x00007fc652f00f07 in __libc_message () from /lib64/libc.so.6
#3 0x00007fc652f08503 in _int_free () from /lib64/libc.so.6
#4 0x00007fc6553c3522 in gsh_free (p=<optimized out>)
at /usr/src/debug/nfs-ganesha-2.4.1/src/include/abstract_mem.h:271
#5 pool_free (pool=<optimized out>, object=<optimized out>)
at /usr/src/debug/nfs-ganesha-2.4.1/src/include/abstract_mem.h:420
#6 free_nfs_res (res=<optimized out>) at /usr/src/debug/nfs-ganesha-2.4.1/src/include/nfs_dupreq.h:125
#7 nfs_dupreq_free_dupreq (dv=0x7fc40c22e830) at /usr/src/debug/nfs-ganesha-2.4.1/src/RPCAL/nfs_dupreq.c:784
#8 nfs_dupreq_finish (req=req@entry=0x7fc5880008e8, res_nfs=res_nfs@entry=0x7fc47403a280)
at /usr/src/debug/nfs-ganesha-2.4.1/src/RPCAL/nfs_dupreq.c:1133
#9 0x00007fc6553492a7 in nfs_rpc_execute (reqdata=reqdata@entry=0x7fc5880008c0)
at /usr/src/debug/nfs-ganesha-2.4.1/src/MainNFSD/nfs_worker_thread.c:1358
#10 0x00007fc65534a78a in worker_run (ctx=0x7fc6556d4ec0)
at /usr/src/debug/nfs-ganesha-2.4.1/src/MainNFSD/nfs_worker_thread.c:1548
#11 0x00007fc6553d4189 in fridgethr_start_routine (arg=0x7fc6556d4ec0)
at /usr/src/debug/nfs-ganesha-2.4.1/src/support/fridgethr.c:550
#12 0x00007fc6538b4dc5 in start_thread () from /lib64/libpthread.so.0
#13 0x00007fc652f8373d in clone () from /lib64/libc.so.6
(gdb)
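To compare the signatures of the three cores, the frame names can be pulled out of the gdb output and intersected. The following is a minimal sketch, not a diagnosis: the helper `frame_functions` and the regular expression are illustrative assumptions, and the sample text is a subset of the frames copied from the gqas009 and gqas014 traces above. It only shows that both of those traces pass through `nfs_dupreq_finish`, while the gqas015 trace aborts under `wb_forget` instead.

```python
import re

# Matches gdb frame lines such as:
#   "#3 nfs_dupreq_finish (req=..."
#   "#4 0x00007fc6553c3522 in gsh_free (p=<optimized out>)"
FRAME_RE = re.compile(r"^#(\d+)\s+(?:0x[0-9a-fA-F]+ in )?([A-Za-z_]\w*) \(")


def frame_functions(backtrace_text):
    """Return the function names of the frames in a gdb 'bt' listing, innermost first."""
    funcs = []
    for line in backtrace_text.splitlines():
        match = FRAME_RE.match(line.strip())
        if match:
            funcs.append(match.group(2))
    return funcs


# Subset of frames copied from the gqas009 trace.
GQAS009 = """\
#0 remove_recolour (head=head@entry=0x7f0fa4006040, parent=0x7f1094068e00, node=<optimized out>)
#1 0x00007f123956cc63 in opr_rbtree_remove (head=head@entry=0x7f0fa4006040, node=<optimized out>,
#2 0x00007f123b4ba591 in rbtree_x_cached_remove (hk=<optimized out>, nk=0x7f115c024150, t=0x7f0fa4005f90,
#3 nfs_dupreq_finish (req=req@entry=0x7f101c81b328, res_nfs=res_nfs@entry=0x7f0ef0012cc0)
"""

# Subset of frames copied from the gqas014 trace.
GQAS014 = """\
#3 0x00007fc652f08503 in _int_free () from /lib64/libc.so.6
#4 0x00007fc6553c3522 in gsh_free (p=<optimized out>)
#7 nfs_dupreq_free_dupreq (dv=0x7fc40c22e830) at /usr/src/debug/nfs-ganesha-2.4.1/src/RPCAL/nfs_dupreq.c:784
#8 nfs_dupreq_finish (req=req@entry=0x7fc5880008e8, res_nfs=res_nfs@entry=0x7fc47403a280)
"""

# Functions that appear in both traces.
shared = sorted(set(frame_functions(GQAS009)) & set(frame_functions(GQAS014)))
print(shared)
```

Feeding the full `bt` output of each core into `frame_functions` works the same way; continuation lines that start with `at ...` are skipped because they do not begin with a frame number.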
Version-Release number of selected component (if applicable):
--------------------------------------------------------------
glusterfs-ganesha-3.8.4-5.el7rhgs.x86_64
nfs-ganesha-2.4.1-1.el7rhgs.x86_64
How reproducible:
-----------------
Reporting the first occurrence.
Steps to Reproduce:
-------------------
1. Create a 4-node cluster and mount the volume via NFSv3 and NFSv4 on the clients.
2. Run I/O (dd and tarball untar) from all the mounts.
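The exact dd and tar invocations used in the run are not recorded in this report. As a rough sketch of the kind of workload meant, the snippet below writes zero-filled files (similar in spirit to dd) and extracts a tarball under a target directory. The function names, sizes, and the temporary directory standing in for an NFS mount point are all assumptions for illustration.

```python
import os
import tarfile
import tempfile


def write_zeros(path, block_size=1 << 20, count=4):
    """Write count blocks of zeros to path, roughly like dd from /dev/zero."""
    block = b"\0" * block_size
    with open(path, "wb") as f:
        for _ in range(count):
            f.write(block)
        f.flush()
        os.fsync(f.fileno())  # push the data to the server, not just the page cache
    return block_size * count


def untar(tarball, dest):
    """Extract tarball under dest and return the number of archive members."""
    with tarfile.open(tarball) as tar:
        members = tar.getmembers()
        tar.extractall(dest)
    return len(members)


if __name__ == "__main__":
    # Stand-in for a client mount point; replace with a real NFSv3/NFSv4 mount.
    mount = tempfile.mkdtemp()
    data_file = os.path.join(mount, "ddfile")
    written = write_zeros(data_file, block_size=4096, count=8)

    # Build a small tarball locally, then extract it onto the "mount".
    tarball = os.path.join(tempfile.mkdtemp(), "sample.tar")
    with tarfile.open(tarball, "w") as tar:
        tar.add(data_file, arcname="ddfile")
    extract_dir = os.path.join(mount, "untar")
    os.mkdir(extract_dir)
    print(written, untar(tarball, extract_dir))
```

To approximate the reported setup, both functions would be run in a loop, in parallel, against every v3 and v4 mount on each client.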
Actual results:
---------------
Ganesha crashes on 3 of the 4 nodes. I/O hangs because pacemaker quorum is lost.
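The quorum loss follows from simple majority arithmetic. The sketch below assumes one vote per node and a plain majority rule, with no special quorum options configured; the cluster's actual quorum settings are not shown in this report, and `has_quorum` is a hypothetical helper.

```python
def has_quorum(total_nodes, live_nodes):
    """Majority rule: quorum needs more than half of the total votes."""
    return live_nodes >= total_nodes // 2 + 1


# 4-node cluster with 3 nodes down leaves 1 vote; a majority needs 3.
print(has_quorum(4, 1))  # False
print(has_quorum(4, 3))  # True
```

Under these assumptions, a 4-node cluster tolerates the loss of only one node, so three simultaneous crashes leave the surviving node without quorum.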
Expected results:
-----------------
No crashes.
Additional info:
-----------------
OS : RHEL 7.3
*Vol Config* :
Volume Name: testvol
Type: Distributed-Replicate
Volume ID: db9c8fe1-375d-4375-955b-f8291af4f931
Status: Started
Snapshot Count: 0
Number of Bricks: 2 x 2 = 4
Transport-type: tcp
Bricks:
Brick1: gqas014.sbu.lab.eng.bos.redhat.com:/bricks/testvol_brick0
Brick2: gqas009.sbu.lab.eng.bos.redhat.com:/bricks/testvol_brick1
Brick3: gqas010.sbu.lab.eng.bos.redhat.com:/bricks/testvol_brick2
Brick4: gqas015.sbu.lab.eng.bos.redhat.com:/bricks/testvol_brick3
Options Reconfigured:
ganesha.enable: on
features.cache-invalidation: on
server.allow-insecure: on
performance.stat-prefetch: off
transport.address-family: inet
performance.readdir-ahead: on
nfs.disable: on
nfs-ganesha: enable
cluster.enable-shared-storage: enable
Comment 3 Daniel Gryniewicz
2016-12-05 13:47:40 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.
For information on the advisory, and where to find the updated
files, follow the link below.
If the solution does not work for you, open a new bug report.
https://rhn.redhat.com/errata/RHEA-2017-0493.html
Comment 15 Red Hat Bugzilla
2023-09-14 03:35:33 UTC
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 1000 days.