Bug 1747349 - [Ganesha] Ganesha crashed on one of the nodes in _setglustercreds_
Summary: [Ganesha] Ganesha crashed on one of the nodes in _setglustercreds_
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Gluster Storage
Classification: Red Hat Storage
Component: nfs-ganesha
Version: rhgs-3.5
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: ---
Target Release: RHGS 3.5.0
Assignee: Daniel Gryniewicz
QA Contact: Manisha Saini
URL:
Whiteboard:
Depends On:
Blocks: 1696809
 
Reported: 2019-08-30 07:37 UTC by Manisha Saini
Modified: 2019-11-03 22:00 UTC
CC List: 10 users

Fixed In Version: nfs-ganesha-2.7.3-8
Doc Type: No Doc Update
Doc Text:
Clone Of:
Environment:
Last Closed: 2019-10-30 12:15:39 UTC
Embargoed:


Links:
Red Hat Product Errata RHEA-2019:3252 (last updated 2019-10-30 12:15:52 UTC)

Description Manisha Saini 2019-08-30 07:37:00 UTC
Description of problem:
========================

Hit this crash while verifying BZ - https://bugzilla.redhat.com/show_bug.cgi?id=1383559

Ganesha crashed on the node whose VIP was used to mount the volume on all the clients.

--------------
Reading symbols from /usr/bin/ganesha.nfsd...Reading symbols from /usr/lib/debug/usr/bin/ganesha.nfsd.debug...done.
done.
Missing separate debuginfo for /lib64/libntirpc.so.1.7
Try: yum --enablerepo='*debug*' install /usr/lib/debug/.build-id/ba/f482e25a6a8dbfb3665ccc5c81f3bef51b5b30
Missing separate debuginfo for /lib64/libwbclient.so.0
Try: yum --enablerepo='*debug*' install /usr/lib/debug/.build-id/b4/1852bdd635e26adba49a0e2f4e2f6e0165e27b
Missing separate debuginfo for 
Try: yum --enablerepo='*debug*' install /usr/lib/debug/.build-id/ba/3372036854a6399a5260547cce7841b54ad536
Missing separate debuginfo for /lib64/libntirpc.so.1.7
Try: yum --enablerepo='*debug*' install /usr/lib/debug/.build-id/ba/f482e25a6a8dbfb3665ccc5c81f3bef51b5b30.debug
Missing separate debuginfo for /lib64/libwbclient.so.0
Try: yum --enablerepo='*debug*' install /usr/lib/debug/.build-id/b4/1852bdd635e26adba49a0e2f4e2f6e0165e27b.debug
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
Core was generated by `/usr/bin/ganesha.nfsd -L /var/log/ganesha/ganesha.log -f /etc/ganesha/ganesha.c'.
Program terminated with signal 11, Segmentation fault.
#0  0x00007f4585db3a83 in setglustercreds (glfs_export=glfs_export@entry=0xffffffffffffffe0, uid=uid@entry=0x7f44d03c18d8, gid=gid@entry=0x7f44d03c18dc, ngrps=1, 
    groups=0x7f44d00365c0, file=file@entry=0x7f4585db6378 "/builddir/build/BUILD/nfs-ganesha-2.7.3/src/FSAL/FSAL_GLUSTER/handle.c", line=line@entry=1180, 
    function=function@entry=0x7f4585db6ae0 <__func__.24208> "glusterfs_close_my_fd") at /usr/src/debug/nfs-ganesha-2.7.3/src/FSAL/FSAL_GLUSTER/gluster_internal.c:219
219			if (*uid != glfs_export->saveduid)
Missing separate debuginfos, use: debuginfo-install bzip2-libs-1.0.6-13.el7.x86_64 dbus-libs-1.10.24-13.el7_6.x86_64 elfutils-libelf-0.176-2.el7.x86_64 elfutils-libs-0.176-2.el7.x86_64 glibc-2.17-292.el7.x86_64 gssproxy-0.7.0-26.el7.x86_64 keyutils-libs-1.5.8-3.el7.x86_64 krb5-libs-1.15.1-37.el7_6.x86_64 libacl-2.2.51-14.el7.x86_64 libattr-2.4.46-13.el7.x86_64 libblkid-2.23.2-61.el7.x86_64 libcap-2.22-10.el7.x86_64 libcom_err-1.42.9-16.el7.x86_64 libgcc-4.8.5-39.el7.x86_64 libgcrypt-1.5.3-14.el7.x86_64 libgpg-error-1.12-3.el7.x86_64 libnfsidmap-0.25-19.el7.x86_64 libselinux-2.5-14.1.el7.x86_64 libuuid-2.23.2-61.el7.x86_64 libwbclient-4.9.1-6.el7.x86_64 lz4-1.7.5-3.el7.x86_64 openssl-libs-1.0.2k-19.el7.x86_64 pcre-8.32-17.el7.x86_64 samba-client-libs-4.9.1-6.el7.x86_64 systemd-libs-219-67.el7_7.1.x86_64 xz-libs-5.2.2-1.el7.x86_64 zlib-1.2.7-18.el7.x86_64
(gdb) bt
#0  0x00007f4585db3a83 in setglustercreds (glfs_export=glfs_export@entry=0xffffffffffffffe0, uid=uid@entry=0x7f44d03c18d8, gid=gid@entry=0x7f44d03c18dc, ngrps=1, 
    groups=0x7f44d00365c0, file=file@entry=0x7f4585db6378 "/builddir/build/BUILD/nfs-ganesha-2.7.3/src/FSAL/FSAL_GLUSTER/handle.c", line=line@entry=1180, 
    function=function@entry=0x7f4585db6ae0 <__func__.24208> "glusterfs_close_my_fd") at /usr/src/debug/nfs-ganesha-2.7.3/src/FSAL/FSAL_GLUSTER/gluster_internal.c:219
#1  0x00007f4585dae010 in glusterfs_close_my_fd (my_fd=my_fd@entry=0x7f44d03c1890) at /usr/src/debug/nfs-ganesha-2.7.3/src/FSAL/FSAL_GLUSTER/handle.c:1177
#2  0x00007f4585dae140 in glusterfs_close2 (obj_hdl=0x7f44201f9328, state=0x7f44d03c17b0) at /usr/src/debug/nfs-ganesha-2.7.3/src/FSAL/FSAL_GLUSTER/handle.c:2828
#3  0x000055ed16ac69a7 in mdcache_close2 (obj_hdl=0x7f4544417f98, state=<optimized out>)
    at /usr/src/debug/nfs-ganesha-2.7.3/src/FSAL/Stackable_FSALs/FSAL_MDCACHE/mdcache_file.c:830
#4  0x000055ed16a656c3 in _state_del_locked (state=0x7f44d03c17b0, func=func@entry=0x55ed16aff5d0 <__func__.20790> "state_nfs4_state_wipe", line=line@entry=640)
    at /usr/src/debug/nfs-ganesha-2.7.3/src/SAL/nfs4_state.c:398
#5  0x000055ed16a67078 in state_nfs4_state_wipe (ostate=0x7f4544418200) at /usr/src/debug/nfs-ganesha-2.7.3/src/SAL/nfs4_state.c:640
#6  0x000055ed16a5e14c in state_wipe_file (obj=0x7f4544417f98) at /usr/src/debug/nfs-ganesha-2.7.3/src/SAL/state_misc.c:1309
#7  0x000055ed16ab7aaf in _mdcache_lru_unref (entry=entry@entry=0x7f4544417f60, flags=flags@entry=0, func=func@entry=0x55ed16b0f780 <__func__.20988> "mdcache_put", 
    line=line@entry=196) at /usr/src/debug/nfs-ganesha-2.7.3/src/FSAL/Stackable_FSALs/FSAL_MDCACHE/mdcache_lru.c:1965
#8  0x000055ed16ac5148 in mdcache_put (entry=0x7f4544417f60) at /usr/src/debug/nfs-ganesha-2.7.3/src/FSAL/Stackable_FSALs/FSAL_MDCACHE/mdcache_lru.h:196
#9  mdc_read_cb (obj=<optimized out>, ret=..., obj_data=0x7f4485ee3d40, caller_data=0x7f4420124ea0)
    at /usr/src/debug/nfs-ganesha-2.7.3/src/FSAL/Stackable_FSALs/FSAL_MDCACHE/mdcache_file.c:555
#10 0x00007f4585db2739 in glusterfs_read2 (obj_hdl=0x7f44201f9328, bypass=<optimized out>, done_cb=0x55ed16ac50b0 <mdc_read_cb>, read_arg=0x7f4485ee3d40, 
    caller_arg=0x7f4420124ea0) at /usr/src/debug/nfs-ganesha-2.7.3/src/FSAL/FSAL_GLUSTER/handle.c:2084
#11 0x000055ed16ac6654 in mdcache_read2 (obj_hdl=0x7f4544417f98, bypass=<optimized out>, done_cb=<optimized out>, read_arg=0x7f4485ee3d40, caller_arg=<optimized out>)
    at /usr/src/debug/nfs-ganesha-2.7.3/src/FSAL/Stackable_FSALs/FSAL_MDCACHE/mdcache_file.c:589
#12 0x000055ed16a15699 in nfs4_read (op=<optimized out>, data=<optimized out>, resp=0x7f4420505b90, io=<optimized out>, info=0x0)
    at /usr/src/debug/nfs-ganesha-2.7.3/src/Protocols/NFS/nfs4_op_read.c:562
#13 0x000055ed16a02703 in nfs4_Compound (arg=<optimized out>, req=<optimized out>, res=0x7f4420469170)
    at /usr/src/debug/nfs-ganesha-2.7.3/src/Protocols/NFS/nfs4_Compound.c:942
#14 0x000055ed169f5b1f in nfs_rpc_process_request (reqdata=0x7f4420115030) at /usr/src/debug/nfs-ganesha-2.7.3/src/MainNFSD/nfs_worker_thread.c:1328
#15 0x000055ed169f4fca in nfs_rpc_decode_request (xprt=0x7f4544000f90, xdrs=0x7f442004f5d0)
    at /usr/src/debug/nfs-ganesha-2.7.3/src/MainNFSD/nfs_rpc_dispatcher_thread.c:1345
#16 0x00007f458cbd062d in svc_rqst_xprt_task () from /lib64/libntirpc.so.1.7
#17 0x00007f458cbd0b6a in svc_rqst_run_task () from /lib64/libntirpc.so.1.7
#18 0x00007f458cbd8c0b in work_pool_thread () from /lib64/libntirpc.so.1.7
#19 0x00007f458af6eea5 in start_thread () from /lib64/libpthread.so.0
#20 0x00007f458a8798cd in clone () from /lib64/libc.so.6
----------------------
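
Note on the faulting pointer: glfs_export=0xffffffffffffffe0 in frame #0 is NULL minus 0x20 on a 64-bit build, the signature of container_of() applied to a NULL inner pointer whose embedded member sits at offset 0x20. The standalone C sketch below reproduces the arithmetic; the struct layout is hypothetical, only the 0x20 member offset is inferred from the crash address, and this is not the real glusterfs_export definition.

--------------
/* Minimal sketch (not from the ganesha sources): shows how container_of()
 * on a NULL inner pointer produces an address like 0xffffffffffffffe0.
 * The struct below is a hypothetical stand-in; only the 0x20 offset of the
 * embedded member is taken from the faulting pointer value above. */
#include <stddef.h>
#include <stdio.h>

#define container_of(ptr, type, member) \
        ((type *)((char *)(ptr) - offsetof(type, member)))

struct fsal_export { int dummy; };

struct glusterfs_export {
        char pad[0x20];            /* whatever fields precede the member */
        struct fsal_export export; /* embedded member at offset 0x20 */
        unsigned int saveduid;     /* field read at gluster_internal.c:219 */
};

int main(void)
{
        struct fsal_export *null_export = NULL; /* op_ctx->fsal_export == NULL */
        struct glusterfs_export *glfs_export =
                container_of(null_export, struct glusterfs_export, export);

        /* Prints 0xffffffffffffffe0 on x86_64, matching frame #0; reading
         * glfs_export->saveduid from such a pointer is the SIGSEGV. */
        printf("%p\n", (void *)glfs_export);
        return 0;
}
--------------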


Version-Release number of selected component (if applicable):
=========================================================

# rpm -qa | grep ganesha
nfs-ganesha-2.7.3-7.el7rhgs.x86_64
nfs-ganesha-debuginfo-2.7.3-7.el7rhgs.x86_64
glusterfs-ganesha-6.0-11.el7rhgs.x86_64
nfs-ganesha-gluster-2.7.3-7.el7rhgs.x86_64


How reproducible:
================
1/1

Steps to Reproduce:
====================

1. Create a 4-node ganesha cluster
2. Create 1 Distributed-Disperse volume, 2 x (4 + 2) = 12 bricks
3. Mount the volume on 3 clients via v4.1
4. Run Linux untars from all 3 clients. Wait for the untars to complete
5. Now mount the volume on 2 more clients
6. Run Linux untars again in different directories from all 5 clients
7. Wait for around 1 hour, then run rm -rf * from another client while I/O is
still running in parallel
 

Actual results:
==============
After some time, Ganesha crashed on one of the nodes.


Expected results:
================
Ganesha should not crash

Additional info:
==================


[root@gprfs040 abrt]# pcs status
Cluster name: ganesha-ha
Stack: corosync
Current DC: gprfs035.sbu.lab.eng.bos.redhat.com (version 1.1.20-5.el7-3c4c782f70) - partition with quorum
Last updated: Fri Aug 30 03:36:39 2019
Last change: Thu Aug 29 17:47:10 2019 by root via crm_attribute on gprfs033.sbu.lab.eng.bos.redhat.com

4 nodes configured
24 resources configured

Online: [ gprfs033.sbu.lab.eng.bos.redhat.com gprfs034.sbu.lab.eng.bos.redhat.com gprfs035.sbu.lab.eng.bos.redhat.com gprfs040.sbu.lab.eng.bos.redhat.com ]

Full list of resources:

 Clone Set: nfs_setup-clone [nfs_setup]
     Started: [ gprfs033.sbu.lab.eng.bos.redhat.com gprfs034.sbu.lab.eng.bos.redhat.com gprfs035.sbu.lab.eng.bos.redhat.com gprfs040.sbu.lab.eng.bos.redhat.com ]
 Clone Set: nfs-mon-clone [nfs-mon]
     Started: [ gprfs033.sbu.lab.eng.bos.redhat.com gprfs034.sbu.lab.eng.bos.redhat.com gprfs035.sbu.lab.eng.bos.redhat.com gprfs040.sbu.lab.eng.bos.redhat.com ]
 Clone Set: nfs-grace-clone [nfs-grace]
     Started: [ gprfs034.sbu.lab.eng.bos.redhat.com gprfs035.sbu.lab.eng.bos.redhat.com gprfs040.sbu.lab.eng.bos.redhat.com ]
     Stopped: [ gprfs033.sbu.lab.eng.bos.redhat.com ]
 Resource Group: gprfs033.sbu.lab.eng.bos.redhat.com-group
     gprfs033.sbu.lab.eng.bos.redhat.com-nfs_block	(ocf::heartbeat:portblock):	Started gprfs040.sbu.lab.eng.bos.redhat.com
     gprfs033.sbu.lab.eng.bos.redhat.com-cluster_ip-1	(ocf::heartbeat:IPaddr):	Started gprfs040.sbu.lab.eng.bos.redhat.com
     gprfs033.sbu.lab.eng.bos.redhat.com-nfs_unblock	(ocf::heartbeat:portblock):	Started gprfs040.sbu.lab.eng.bos.redhat.com
 Resource Group: gprfs034.sbu.lab.eng.bos.redhat.com-group
     gprfs034.sbu.lab.eng.bos.redhat.com-nfs_block	(ocf::heartbeat:portblock):	Started gprfs034.sbu.lab.eng.bos.redhat.com
     gprfs034.sbu.lab.eng.bos.redhat.com-cluster_ip-1	(ocf::heartbeat:IPaddr):	Started gprfs034.sbu.lab.eng.bos.redhat.com
     gprfs034.sbu.lab.eng.bos.redhat.com-nfs_unblock	(ocf::heartbeat:portblock):	Started gprfs034.sbu.lab.eng.bos.redhat.com
 Resource Group: gprfs035.sbu.lab.eng.bos.redhat.com-group
     gprfs035.sbu.lab.eng.bos.redhat.com-nfs_block	(ocf::heartbeat:portblock):	Started gprfs035.sbu.lab.eng.bos.redhat.com
     gprfs035.sbu.lab.eng.bos.redhat.com-cluster_ip-1	(ocf::heartbeat:IPaddr):	Started gprfs035.sbu.lab.eng.bos.redhat.com
     gprfs035.sbu.lab.eng.bos.redhat.com-nfs_unblock	(ocf::heartbeat:portblock):	Started gprfs035.sbu.lab.eng.bos.redhat.com
 Resource Group: gprfs040.sbu.lab.eng.bos.redhat.com-group
     gprfs040.sbu.lab.eng.bos.redhat.com-nfs_block	(ocf::heartbeat:portblock):	Started gprfs040.sbu.lab.eng.bos.redhat.com
     gprfs040.sbu.lab.eng.bos.redhat.com-cluster_ip-1	(ocf::heartbeat:IPaddr):	Started gprfs040.sbu.lab.eng.bos.redhat.com
     gprfs040.sbu.lab.eng.bos.redhat.com-nfs_unblock	(ocf::heartbeat:portblock):	Started gprfs040.sbu.lab.eng.bos.redhat.com

Failed Resource Actions:
* gprfs035.sbu.lab.eng.bos.redhat.com-nfs_unblock_monitor_10000 on gprfs035.sbu.lab.eng.bos.redhat.com 'unknown error' (1): call=73, status=Timed Out, exitreason='',
    last-rc-change='Sun Aug 25 23:05:43 2019', queued=0ms, exec=0ms
* gprfs033.sbu.lab.eng.bos.redhat.com-nfs_unblock_monitor_10000 on gprfs040.sbu.lab.eng.bos.redhat.com 'unknown error' (1): call=161, status=Timed Out, exitreason='',
    last-rc-change='Sun Aug 25 23:05:47 2019', queued=0ms, exec=0ms
* gprfs040.sbu.lab.eng.bos.redhat.com-nfs_unblock_monitor_10000 on gprfs040.sbu.lab.eng.bos.redhat.com 'unknown error' (1): call=153, status=Timed Out, exitreason='',
    last-rc-change='Sun Aug 25 23:05:41 2019', queued=0ms, exec=0ms

Daemon Status:
  corosync: active/enabled
  pacemaker: active/enabled
  pcsd: active/enabled

Comment 3 Frank Filz 2019-08-30 16:28:26 UTC
Is it possible to try recreating this with Ganesha V2.7.6? It looks like op_ctx->fsal_export becomes NULL at some point in the call stack. I can't identify a fix since V2.7.3 for sure, but there are a few that might address this issue. If it still fails with V2.7.6, we should also try next (V2.9-dev) to determine if it is an upstream problem.
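
The faulting address is consistent with this hypothesis: if op_ctx->fsal_export is NULL by the time glusterfs_close_my_fd() runs from the read completion path (frames #1 through #9), the container_of() that FSAL_GLUSTER uses to recover its private export yields NULL - 0x20, exactly the glfs_export value in frame #0. The sketch below shows one possible defensive guard at the lookup; the type names are minimal stand-ins for the real ganesha structures, and the fix actually shipped in nfs-ganesha-2.7.3-8 may well address the root cause differently (e.g. by keeping fsal_export populated across the async callback).

--------------
/* Hedged sketch of a guarded export lookup; the struct definitions are
 * stand-ins, not the real ganesha types, and this is illustrative rather
 * than the shipped fix. */
#include <stddef.h>
#include <stdio.h>

#define container_of(ptr, type, member) \
        ((type *)((char *)(ptr) - offsetof(type, member)))

struct fsal_export { int dummy; };
struct req_op_context { struct fsal_export *fsal_export; };
struct glusterfs_export {
        char pad[0x20];
        struct fsal_export export;
        unsigned int saveduid;
};

/* Stand-in for ganesha's thread-local op_ctx. */
static __thread struct req_op_context *op_ctx;

static struct glusterfs_export *get_glfs_export(void)
{
        /* Guard before container_of(): the unguarded form hands back
         * NULL - 0x20 when fsal_export is NULL, which is what frame #0
         * received and then dereferenced. */
        if (op_ctx == NULL || op_ctx->fsal_export == NULL)
                return NULL;
        return container_of(op_ctx->fsal_export,
                            struct glusterfs_export, export);
}

int main(void)
{
        struct req_op_context ctx = { .fsal_export = NULL };
        op_ctx = &ctx;
        printf("guarded lookup: %p\n", (void *)get_glfs_export()); /* (nil) */
        return 0;
}
--------------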

Comment 8 Manisha Saini 2019-10-09 08:52:59 UTC
Verified this BZ with


# rpm -qa | grep ganesha
nfs-ganesha-2.7.3-9.el7rhgs.x86_64
glusterfs-ganesha-6.0-15.el7rhgs.x86_64
nfs-ganesha-gluster-2.7.3-9.el7rhgs.x86_64


Steps for verification:
===================

1. Create an 8-node ganesha cluster
2. Create a Distributed-Disperse 2 x (4 + 2) volume. Export the volume via ganesha
3. Mount the volume on 6 clients via the v4.1 protocol
4. Start running Linux untars from 2 clients and readdir operations from the other 4 clients (du -sh, ls -laRt, find)
5. Wait for the Linux untars to complete.
6. Now mount the volume on 2 more clients and run Linux untars from 5 clients in different directories
7. Run I/O for 1 hour.
8. Now trigger rm -rf * from one of the clients while I/O is running in parallel.

No crashes were observed. Moving this BZ to the verified state.

Comment 10 errata-xmlrpc 2019-10-30 12:15:39 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2019:3252

