Description of problem:
-----------------------
4-node cluster, 4 clients accessing the export via NFSv4. Kill nfs-ganesha on any node to simulate failover. Ganesha crashes within a minute, just as failover is about to complete. Restart Ganesha to simulate failback; Ganesha dumps the same core again.

(gdb) bt
#0  0x00007fb042dac8a0 in glusterfs_normalize_dentry () from /lib64/libglusterfs.so.0
#1  0x00007fb04307eac8 in glfs_resolve_at () from /lib64/libgfapi.so.0
#2  0x00007fb0430805c4 in glfs_h_lookupat () from /lib64/libgfapi.so.0
#3  0x00007fb04349d6df in lookup (parent=0x7faf74003fa8, path=0x55a9dec3a482 "..", handle=0x7fb019db0af8, attrs_out=0x0) at /usr/src/debug/nfs-ganesha-2.4.4/src/FSAL/FSAL_GLUSTER/handle.c:113
#4  0x000055a9dec28b7f in mdc_get_parent (export=export@entry=0x55a9e066bbf0, entry=0x7faf74004260) at /usr/src/debug/nfs-ganesha-2.4.4/src/FSAL/Stackable_FSALs/FSAL_MDCACHE/mdcache_helpers.c:287
#5  0x000055a9dec257f5 in mdcache_create_handle (exp_hdl=0x55a9e066bbf0, hdl_desc=<optimized out>, handle=0x7fb019db0be8, attrs_out=0x0) at /usr/src/debug/nfs-ganesha-2.4.4/src/FSAL/Stackable_FSALs/FSAL_MDCACHE/mdcache_handle.c:1768
#6  0x000055a9deb91daa in nfs4_mds_putfh (data=data@entry=0x7fb019db1180) at /usr/src/debug/nfs-ganesha-2.4.4/src/Protocols/NFS/nfs4_op_putfh.c:211
#7  0x000055a9deb922c0 in nfs4_op_putfh (op=0x7faf880017e0, data=0x7fb019db1180, resp=0x7faf74000ad0) at /usr/src/debug/nfs-ganesha-2.4.4/src/Protocols/NFS/nfs4_op_putfh.c:281
#8  0x000055a9deb81bbd in nfs4_Compound (arg=<optimized out>, req=<optimized out>, res=0x7faf740009c0) at /usr/src/debug/nfs-ganesha-2.4.4/src/Protocols/NFS/nfs4_Compound.c:734
#9  0x000055a9deb72d6c in nfs_rpc_execute (reqdata=reqdata@entry=0x7faf880008c0) at /usr/src/debug/nfs-ganesha-2.4.4/src/MainNFSD/nfs_worker_thread.c:1281
#10 0x000055a9deb743ca in worker_run (ctx=0x55a9e0758ec0) at /usr/src/debug/nfs-ganesha-2.4.4/src/MainNFSD/nfs_worker_thread.c:1548
#11 0x000055a9debfd999 in fridgethr_start_routine (arg=0x55a9e0758ec0) at /usr/src/debug/nfs-ganesha-2.4.4/src/support/fridgethr.c:550
#12 0x00007fb046401e25 in start_thread () from /lib64/libpthread.so.0
#13 0x00007fb045acf34d in clone () from /lib64/libc.so.6

Version-Release number of selected component (if applicable):
--------------------------------------------------------------
[root@gqas013 tmp]# rpm -qa|grep ganesha
glusterfs-ganesha-3.8.4-26.el7rhgs.x86_64
nfs-ganesha-gluster-2.4.4-6.el7rhgs.x86_64
nfs-ganesha-debuginfo-2.4.4-6.el7rhgs.x86_64
nfs-ganesha-2.4.4-6.el7rhgs.x86_64
[root@gqas013 tmp]# rpm -qa|grep libnti
libntirpc-1.4.3-1.el7rhgs.x86_64
libntirpc-devel-1.4.3-1.el7rhgs.x86_64
[root@gqas013 tmp]# rpm -qa|grep pacem
pacemaker-cluster-libs-1.1.16-10.el7.x86_64
pacemaker-cli-1.1.16-10.el7.x86_64
pacemaker-1.1.16-10.el7.x86_64
pacemaker-libs-1.1.16-10.el7.x86_64
[root@gqas013 tmp]# rpm -qa|grep coros
corosynclib-2.4.0-9.el7.x86_64
corosync-2.4.0-9.el7.x86_64
[root@gqas013 tmp]# rpm -qa|grep resource-ag
resource-agents-3.9.5-100.el7.x86_64
[root@gqas013 tmp]# cat /etc/redhat-release
Red Hat Enterprise Linux Server release 7.4 Beta (Maipo)

How reproducible:
-----------------
3/3
Clearer BT:
#0  glusterfs_normalize_dentry (parent=parent@entry=0x7fc99fb96778, component=component@entry=0x7fc99fb96780, dentry_name=dentry_name@entry=0x7fc99fb96840 "") at inode.c:2646
#1  0x00007fca36f12ac8 in priv_glfs_resolve_at (fs=fs@entry=0x55f5f8b01060, subvol=subvol@entry=0x7fca1c01fe00, at=at@entry=0x7fc734020090, origpath=origpath@entry=0x55f5f6f1b482 "..", loc=loc@entry=0x7fc99fb978d0, iatt=iatt@entry=0x7fc99fb97910, follow=follow@entry=0, reval=reval@entry=0) at glfs-resolve.c:412
#2  0x00007fca36f145c4 in pub_glfs_h_lookupat (fs=0x55f5f8b01060, parent=<optimized out>, path=path@entry=0x55f5f6f1b482 "..", stat=stat@entry=0x7fc99fb979f0, follow=follow@entry=0) at glfs-handleops.c:102
#3  0x00007fca36f146a8 in pub_glfs_h_lookupat34 (fs=<optimized out>, parent=<optimized out>, path=path@entry=0x55f5f6f1b482 "..", stat=stat@entry=0x7fc99fb979f0) at glfs-handleops.c:133
#4  0x00007fca373316df in lookup (parent=0x7fc734031fa8, path=0x55f5f6f1b482 "..", handle=0x7fc99fb97af8, attrs_out=0x0) at /usr/src/debug/nfs-ganesha-2.4.4/src/FSAL/FSAL_GLUSTER/handle.c:113
#5  0x000055f5f6f09b7f in mdc_get_parent (export=export@entry=0x55f5f8b00bf0, entry=0x7fc734033ee0) at /usr/src/debug/nfs-ganesha-2.4.4/src/FSAL/Stackable_FSALs/FSAL_MDCACHE/mdcache_helpers.c:287
#6  0x000055f5f6f067f5 in mdcache_create_handle (exp_hdl=0x55f5f8b00bf0, hdl_desc=<optimized out>, handle=0x7fc99fb97be8, attrs_out=0x0) at /usr/src/debug/nfs-ganesha-2.4.4/src/FSAL/Stackable_FSALs/FSAL_MDCACHE/mdcache_handle.c:1768
#7  0x000055f5f6e72daa in nfs4_mds_putfh (data=data@entry=0x7fc99fb98180) at /usr/src/debug/nfs-ganesha-2.4.4/src/Protocols/NFS/nfs4_op_putfh.c:211
#8  0x000055f5f6e732c0 in nfs4_op_putfh (op=0x7fc97c243bd0, data=0x7fc99fb98180, resp=0x7fc734031bd0) at /usr/src/debug/nfs-ganesha-2.4.4/src/Protocols/NFS/nfs4_op_putfh.c:281
#9  0x000055f5f6e62bbd in nfs4_Compound (arg=<optimized out>, req=<optimized out>, res=0x7fc734029050) at /usr/src/debug/nfs-ganesha-2.4.4/src/Protocols/NFS/nfs4_Compound.c:734
#10 0x000055f5f6e53d6c in nfs_rpc_execute (reqdata=reqdata@entry=0x7fc97c243400) at /usr/src/debug/nfs-ganesha-2.4.4/src/MainNFSD/nfs_worker_thread.c:1281
#11 0x000055f5f6e553ca in worker_run (ctx=0x55f5f8c2d870) at /usr/src/debug/nfs-ganesha-2.4.4/src/MainNFSD/nfs_worker_thread.c:1548
#12 0x000055f5f6ede999 in fridgethr_start_routine (arg=0x55f5f8c2d870) at /usr/src/debug/nfs-ganesha-2.4.4/src/support/fridgethr.c:550
#13 0x00007fca3a295e25 in start_thread () from /lib64/libpthread.so.0
#14 0x00007fca3996334d in clone () from /lib64/libc.so.6
(gdb)
The crash is not specific to RHEL 7.4; it is also observed on RHEL 7.3. After a node reboot, the node is unable to come up.

bt:
(no debugging symbols found)...done.
Loaded symbols for /usr/lib64/gssproxy/proxymech.so
Reading symbols from /lib64/libgssrpc.so.4...(no debugging symbols found)...done.
Loaded symbols for /lib64/libgssrpc.so.4
0x00007f5fe2aceef7 in pthread_join () from /lib64/libpthread.so.0
Missing separate debuginfos, use: debuginfo-install nfs-ganesha-2.4.4-6.el7rhgs.x86_64
(gdb) c
Continuing.
[New Thread 0x7f5f002bb700 (LWP 14671)]

Program received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x7f5f81bbe700 (LWP 11392)]
0x00007f5fdf2758a0 in glusterfs_normalize_dentry () from /lib64/libglusterfs.so.0
(gdb) bt
#0  0x00007f5fdf2758a0 in glusterfs_normalize_dentry () from /lib64/libglusterfs.so.0
#1  0x00007f5fdf547ac8 in glfs_resolve_at () from /lib64/libgfapi.so.0
#2  0x00007f5fdf5495c4 in glfs_h_lookupat () from /lib64/libgfapi.so.0
#3  0x00007f5fdf9666df in lookup () from /usr/lib64/ganesha/libfsalgluster.so
#4  0x00007f5fe4617b7f in mdc_get_parent ()
#5  0x00007f5fe46147f5 in mdcache_create_handle ()
#6  0x00007f5fe4580daa in nfs4_mds_putfh ()
#7  0x00007f5fe45812c0 in nfs4_op_putfh ()
#8  0x00007f5fe4570bbd in nfs4_Compound ()
#9  0x00007f5fe4561d6c in nfs_rpc_execute ()
#10 0x00007f5fe45633ca in worker_run ()
#11 0x00007f5fe45ec999 in fridgethr_start_routine ()
#12 0x00007f5fe2acddc5 in start_thread () from /lib64/libpthread.so.0
#13 0x00007f5fe219c76d in clone () from /lib64/libc.so.6

Version: nfs-ganesha-2.4.4-6.el7rhgs.x86_64
(09:44:53 AM) jiffin: kkeithley, dang it kinda of regression caused from gluster
(09:44:55 AM) jiffin: patch
(09:45:08 AM) jiffin: https://review.gluster.org/#/c/17177/6/libglusterfs/src/inode.c
(09:45:19 AM) jiffin: i am looking to that issue
RCA:
In the above case, the following should have happened. A Linux kernel untar was running from four clients against four different servers, each in a different directory. Say client1 was creating file a/b/c/file when failover happened (server1 was killed), so it now sends its requests to server2. In the NFS world, all communication happens via file handles. Server2 first creates a handle for that file from the given GFID, but it has no context about a/b/c at this point. After creating the handle, it tries to look up the parent using "..". During parent inode resolution, glusterfs_normalize_dentry() tries to replace ".." with the parent's actual name and crashes, because the function assumes the parent inode is already linked to the inode table; in this case the parent inode was never linked. This ends up killing all the ganesha servers.

There are two possible solutions:
[1] handle inode_parent() failures in glusterfs_normalize_dentry(), or
[2] revert the changes made to glfs_resolve_component() in https://review.gluster.org/#/c/17177 and call glusterfs_normalize_dentry() after it.

Also, thanks to Soumya & Rafi for their help in debugging the issue.
Both Manisha's and my setups are in an unrecoverable state. Marking this as a Test Blocker for Ganesha.
Upstream patch: https://review.gluster.org/#/c/17502/
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2017:2774