Bug 1414410 - [NFS-Ganesha] Ganesha service crashed while doing refresh config on volume and when IOs are running.
Summary: [NFS-Ganesha] Ganesha service crashed while doing refresh config on volume an...
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Gluster Storage
Classification: Red Hat Storage
Component: common-ha
Version: rhgs-3.2
Hardware: x86_64
OS: Linux
Priority: unspecified
Severity: high
Target Milestone: ---
Target Release: RHGS 3.3.0
Assignee: Soumya Koduri
QA Contact: Manisha Saini
URL:
Whiteboard:
Depends On: 1370027
Blocks: 1417147
 
Reported: 2017-01-18 12:39 UTC by Arthy Loganathan
Modified: 2017-09-21 04:56 UTC
CC List: 11 users

Fixed In Version: glusterfs-3.8.4-19
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2017-09-21 04:30:55 UTC
Embargoed:


Attachments


Links
System ID Private Priority Status Summary Last Updated
Red Hat Bugzilla 1415669 0 unspecified CLOSED NFS-Ganesha: Do not perform 'Refresh-config' while there are I/Os going on 2021-02-22 00:41:40 UTC
Red Hat Product Errata RHBA-2017:2774 0 normal SHIPPED_LIVE glusterfs bug fix and enhancement update 2017-09-21 08:16:29 UTC

Internal Links: 1415669

Description Arthy Loganathan 2017-01-18 12:39:20 UTC
Description of problem:
Ganesha service crashed when refresh config was run on a volume with IOs running in parallel.

Version-Release number of selected component (if applicable):
nfs-ganesha-gluster-2.4.1-6.el7rhgs.x86_64
nfs-ganesha-2.4.1-6.el7rhgs.x86_64
glusterfs-ganesha-3.8.4-12.el7rhgs.x86_64


How reproducible:
Always

Steps to Reproduce:
1. Create ganesha cluster and create a volume.
2. Export the volume.
3. Run refresh config on the volume.
   /usr/libexec/ganesha/ganesha-ha.sh --refresh-config /var/run/gluster/shared_storage/nfs-ganesha/ vol_ec
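
(A minimal reproduction sketch, following the summary's note that IOs run in parallel with refresh-config; the client mount point, VIP placeholder, and dd workload are illustrative assumptions, not taken from the report.)

   # On an NFS client, keep I/O running against the export:
   mount -t nfs -o vers=4 <ganesha-vip>:/vol_ec /mnt/vol_ec
   dd if=/dev/zero of=/mnt/vol_ec/testfile bs=1M count=10240 &

   # On a ganesha cluster node, while the I/O above is still in flight:
   /usr/libexec/ganesha/ganesha-ha.sh --refresh-config /var/run/gluster/shared_storage/nfs-ganesha/ vol_ec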

Actual results:
Ganesha service crashed.

Expected results:
No crash should be observed.

Additional info:

During refresh config, while IOs were running, the following two crashes were seen.

1st crash:
----------

[Thread 0x7f8887e25700 (LWP 21862) exited]

Program received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x7f878aefd700 (LWP 22287)]
inode_forget (inode=0x7f886e8908f4, nlookup=nlookup@entry=0) at inode.c:1132
1132	        table = inode->table;
(gdb) bt
#0  inode_forget (inode=0x7f886e8908f4, nlookup=nlookup@entry=0) at inode.c:1132
#1  0x00007f88943ad81e in pub_glfs_h_close (object=0x7f880c003a80) at glfs-handleops.c:1364
#2  0x00007f88947c7cb9 in handle_release (obj_hdl=0x7f880c025a18) at /usr/src/debug/nfs-ganesha-2.4.1/src/FSAL/FSAL_GLUSTER/handle.c:71
#3  0x00007f8899053a23 in mdcache_lru_clean (entry=0x7f880c025ec0) at /usr/src/debug/nfs-ganesha-2.4.1/src/FSAL/Stackable_FSALs/FSAL_MDCACHE/mdcache_lru.c:421
#4  mdcache_lru_unref (entry=entry@entry=0x7f880c025ec0, flags=flags@entry=0) at /usr/src/debug/nfs-ganesha-2.4.1/src/FSAL/Stackable_FSALs/FSAL_MDCACHE/mdcache_lru.c:1464
#5  0x00007f88990510a1 in mdcache_put (entry=0x7f880c025ec0) at /usr/src/debug/nfs-ganesha-2.4.1/src/FSAL/Stackable_FSALs/FSAL_MDCACHE/mdcache_lru.h:186
#6  mdcache_unexport (exp_hdl=0x7f87846478c0, root_obj=<optimized out>) at /usr/src/debug/nfs-ganesha-2.4.1/src/FSAL/Stackable_FSALs/FSAL_MDCACHE/mdcache_export.c:158
#7  0x00007f8899033386 in clean_up_export (root_obj=0x7f8784648b78, export=0x7f878400a128) at /usr/src/debug/nfs-ganesha-2.4.1/src/support/exports.c:2220
#8  release_export (export=0x7f878400a128) at /usr/src/debug/nfs-ganesha-2.4.1/src/support/exports.c:2264
#9  unexport (export=export@entry=0x7f878400a128) at /usr/src/debug/nfs-ganesha-2.4.1/src/support/exports.c:2287
#10 0x00007f88990431f8 in gsh_export_removeexport (args=<optimized out>, reply=<optimized out>, error=0x7f878aefc2e0) at /usr/src/debug/nfs-ganesha-2.4.1/src/support/export_mgr.c:1092
#11 0x00007f8899065869 in dbus_message_entrypoint (conn=0x7f889a4a5c30, msg=0x7f889a4a5eb0, user_data=<optimized out>) at /usr/src/debug/nfs-ganesha-2.4.1/src/dbus/dbus_server.c:512
#12 0x00007f88988fec76 in _dbus_object_tree_dispatch_and_unlock () from /lib64/libdbus-1.so.3
#13 0x00007f88988f0e49 in dbus_connection_dispatch () from /lib64/libdbus-1.so.3
#14 0x00007f88988f10e2 in _dbus_connection_read_write_dispatch () from /lib64/libdbus-1.so.3
#15 0x00007f8899066931 in gsh_dbus_thread (arg=<optimized out>) at /usr/src/debug/nfs-ganesha-2.4.1/src/dbus/dbus_server.c:737
#16 0x00007f8897515dc5 in start_thread () from /lib64/libpthread.so.0
#17 0x00007f8896be473d in clone () from /lib64/libc.so.6
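
(For context, frames #10-#11 show the unexport being driven over D-Bus by refresh-config. A hedged sketch of the kind of export-manager calls involved is below; the export ID and config file path are assumptions for illustration, not values from this setup.)

# Remove and re-add the export via the ganesha export manager:
dbus-send --print-reply --system --dest=org.ganesha.nfsd \
    /org/ganesha/nfsd/ExportMgr org.ganesha.nfsd.exportmgr.RemoveExport \
    uint16:2
dbus-send --print-reply --system --dest=org.ganesha.nfsd \
    /org/ganesha/nfsd/ExportMgr org.ganesha.nfsd.exportmgr.AddExport \
    string:/var/run/gluster/shared_storage/nfs-ganesha/exports/export.vol_ec.conf \
    string:'EXPORT(Export_Id=2)'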

/var/log/messages snippet:
--------------------------

Jan 18 16:01:02 dhcp46-111 kernel: ganesha.nfsd[21715]: segfault at 7f487c008288 ip 00007f49365796a0 sp 00007f493d03bf70 error 4 in libglusterfs.so.0.0.1[7f4936540000+ed000]
Jan 18 16:01:02 dhcp46-111 systemd: nfs-ganesha.service: main process exited, code=killed, status=11/SEGV


[root@dhcp46-111 ~]# service nfs-ganesha status -l
Redirecting to /bin/systemctl status  -l nfs-ganesha.service
● nfs-ganesha.service - NFS-Ganesha file server
   Loaded: loaded (/usr/lib/systemd/system/nfs-ganesha.service; disabled; vendor preset: disabled)
   Active: failed (Result: signal) since Wed 2017-01-18 16:01:02 IST; 1min 13s ago
     Docs: http://github.com/nfs-ganesha/nfs-ganesha/wiki
  Process: 21408 ExecStop=/bin/dbus-send --system --dest=org.ganesha.nfsd --type=method_call /org/ganesha/nfsd/admin org.ganesha.nfsd.admin.shutdown (code=exited, status=0/SUCCESS)
 Main PID: 21446 (code=killed, signal=SEGV)

Jan 18 15:24:25 dhcp46-111.lab.eng.blr.redhat.com systemd[1]: Starting NFS-Ganesha file server...
Jan 18 15:24:25 dhcp46-111.lab.eng.blr.redhat.com systemd[1]: Started NFS-Ganesha file server.
Jan 18 16:01:02 dhcp46-111.lab.eng.blr.redhat.com systemd[1]: nfs-ganesha.service: main process exited, code=killed, status=11/SEGV
Jan 18 16:01:02 dhcp46-111.lab.eng.blr.redhat.com systemd[1]: Unit nfs-ganesha.service entered failed state.
Jan 18 16:01:02 dhcp46-111.lab.eng.blr.redhat.com systemd[1]: nfs-ganesha.service failed.
[root@dhcp46-111 ~]# service nfs-ganesha start
Redirecting to /bin/systemctl start  nfs-ganesha.service


2nd crash:
----------

(gdb) bt
#0  0x00007f3e9f9d5210 in pthread_spin_lock () from /lib64/libpthread.so.0
#1  0x00007f3e9c592ebd in inode_ctx_get0 (inode=0x7f3e76889954, xlator=xlator@entry=0x7f3d8c62ee20, value1=value1@entry=0x7f3dbd757ee0) at inode.c:2145
#2  0x00007f3e9c592f45 in inode_needs_lookup (inode=0x7f3e76889954, this=0x7f3d8c62ee20) at inode.c:1924
#3  0x00007f3e9c865c86 in __glfs_resolve_inode (fs=fs@entry=0x7f3d8c0153e0, subvol=subvol@entry=0x7f3ea6eed120, object=object@entry=0x7f3e24011f30) at glfs-resolve.c:1025
#4  0x00007f3e9c865d8b in glfs_resolve_inode (fs=fs@entry=0x7f3d8c0153e0, subvol=subvol@entry=0x7f3ea6eed120, object=object@entry=0x7f3e24011f30) at glfs-resolve.c:1051
#5  0x00007f3e9c867262 in pub_glfs_h_open (fs=0x7f3d8c0153e0, object=0x7f3e24011f30, flags=flags@entry=513) at glfs-handleops.c:637
#6  0x00007f3e9cc83160 in glusterfs_open_my_fd (objhandle=objhandle@entry=0x7f3e24014ea0, openflags=openflags@entry=66, posix_flags=513, my_fd=my_fd@entry=0x7f3dbd758120)
    at /usr/src/debug/nfs-ganesha-2.4.1/src/FSAL/FSAL_GLUSTER/handle.c:1029
#7  0x00007f3e9cc844ea in glusterfs_open2 (obj_hdl=0x7f3e24014ed8, state=0x7f3e2c8233f0, openflags=<optimized out>, createmode=FSAL_UNCHECKED, name=<optimized out>, 
    attrib_set=<optimized out>, verifier=0x7f3dbd7586c0 "atime=11/01/2017 15:07:36 mtime=18/01/2017 17:42:15", new_obj=0x7f3dbd758340, attrs_out=0x7f3dbd758350, 
    caller_perm_check=0x7f3dbd7584bf) at /usr/src/debug/nfs-ganesha-2.4.1/src/FSAL/FSAL_GLUSTER/handle.c:1336
#8  0x00007f3ea15190ef in mdcache_open2 (obj_hdl=0x7f3e2400bb58, state=0x7f3e2c8233f0, openflags=<optimized out>, createmode=FSAL_UNCHECKED, name=0x0, attrs_in=0x7f3dbd7585e0, 
    verifier=0x7f3dbd7586c0 "atime=11/01/2017 15:07:36 mtime=18/01/2017 17:42:15", new_obj=0x7f3dbd758580, attrs_out=0x0, caller_perm_check=0x7f3dbd7584bf)
    at /usr/src/debug/nfs-ganesha-2.4.1/src/FSAL/Stackable_FSALs/FSAL_MDCACHE/mdcache_file.c:657
#9  0x00007f3ea144de9b in fsal_open2 (in_obj=0x7f3e2400bb58, state=0x7f3e2c8233f0, openflags=openflags@entry=66, createmode=createmode@entry=FSAL_UNCHECKED, name=<optimized out>, 
    attr=attr@entry=0x7f3dbd7585e0, verifier=verifier@entry=0x7f3dbd7586c0 "atime=11/01/2017 15:07:36 mtime=18/01/2017 17:42:15", obj=obj@entry=0x7f3dbd758580, 
    attrs_out=attrs_out@entry=0x0) at /usr/src/debug/nfs-ganesha-2.4.1/src/FSAL/fsal_helper.c:1846
#10 0x00007f3ea1439486 in open4_ex (arg=arg@entry=0x7f3d88187728, data=data@entry=0x7f3dbd759180, res_OPEN4=res_OPEN4@entry=0x7f3e2c82e308, clientid=<optimized out>, owner=0x7f3e2c81c480, 
    file_state=file_state@entry=0x7f3dbd758fa0, new_state=new_state@entry=0x7f3dbd758f8f) at /usr/src/debug/nfs-ganesha-2.4.1/src/Protocols/NFS/nfs4_op_open.c:1441
#11 0x00007f3ea1481a49 in nfs4_op_open (op=0x7f3d88187720, data=0x7f3dbd759180, resp=0x7f3e2c82e300) at /usr/src/debug/nfs-ganesha-2.4.1/src/Protocols/NFS/nfs4_op_open.c:1844
#12 0x00007f3ea1473f8d in nfs4_Compound (arg=<optimized out>, req=<optimized out>, res=0x7f3e2c81a560) at /usr/src/debug/nfs-ganesha-2.4.1/src/Protocols/NFS/nfs4_Compound.c:734
#13 0x00007f3ea146513c in nfs_rpc_execute (reqdata=reqdata@entry=0x7f3d880008c0) at /usr/src/debug/nfs-ganesha-2.4.1/src/MainNFSD/nfs_worker_thread.c:1281
#14 0x00007f3ea146679a in worker_run (ctx=0x7f3ea6eaac40) at /usr/src/debug/nfs-ganesha-2.4.1/src/MainNFSD/nfs_worker_thread.c:1548
#15 0x00007f3ea14f0409 in fridgethr_start_routine (arg=0x7f3ea6eaac40) at /usr/src/debug/nfs-ganesha-2.4.1/src/support/fridgethr.c:550
#16 0x00007f3e9f9d0dc5 in start_thread () from /lib64/libpthread.so.0
#17 0x00007f3e9f09f73d in clone () from /lib64/libc.so.6
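
(A hedged sketch for re-examining these cores with gdb; the package set and core file path are assumptions and depend on how core dumps are collected on the node.)

# Install matching debuginfo, then load the core against the ganesha binary:
debuginfo-install -y nfs-ganesha nfs-ganesha-gluster glusterfs
gdb /usr/bin/ganesha.nfsd /path/to/core.<pid>
(gdb) bt
(gdb) thread apply all bt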


sosreports and ganesha logs will be attached soon.

Comment 2 Soumya Koduri 2017-01-18 12:46:48 UTC
This looks similar to bug 1413350, but this time not with the root entry. md-cache entries may have been re-used after the volume was unexported and re-exported.

Since this is not a recommended use case, i.e., performing refresh-config or volume unexport while I/Os are going on the same volume (not sure if it is documented; if not, we must document it), could you try other valid scenarios and check whether you hit this issue? Thanks!

Comment 3 Arthy Loganathan 2017-01-18 12:54:15 UTC
sosreport and ganesha logs are at:
http://rhsqe-repo.lab.eng.blr.redhat.com/sosreports/1414410/

Comment 4 Soumya Koduri 2017-01-23 12:04:39 UTC
Raised bug 1415669 to document in the admin guide that it is recommended not to perform refresh-config while I/Os are going on any volume.

Comment 9 Arthy Loganathan 2017-05-03 06:25:29 UTC
Checked the behavior multiple times with the build:

[root@dhcp46-111 ~]# rpm -qa | grep ganesha
glusterfs-ganesha-3.8.4-23.el7rhgs.x86_64
nfs-ganesha-gluster-2.4.4-4.el7rhgs.x86_64
nfs-ganesha-2.4.4-4.el7rhgs.x86_64

During refresh config, while IOs were running, crashes were not seen.

Comment 12 Atin Mukherjee 2017-05-08 07:37:44 UTC
downstream patch : https://code.engineering.redhat.com/gerrit/#/c/101283

Comment 13 Manisha Saini 2017-05-10 10:11:36 UTC
Verified this bug on glusterfs-ganesha-3.8.4-24.el7rhgs.x86_64

Performing refresh-config while I/Os are running no longer leads to a crash.
I/Os continue to run after performing refresh-config, with the support of the dynamic export refresh-config option.
Moving this bug to verified state.
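
(A hedged sketch of one way to re-check this during verification; the VIP and volume name are placeholders, and the ShowExports output format may vary by ganesha version.)

# With client I/O still running, re-run refresh-config and confirm the export
# stays listed and I/O does not stall:
/usr/libexec/ganesha/ganesha-ha.sh --refresh-config /var/run/gluster/shared_storage/nfs-ganesha/ vol_ec
showmount -e <ganesha-vip>
dbus-send --print-reply --system --dest=org.ganesha.nfsd \
    /org/ganesha/nfsd/ExportMgr org.ganesha.nfsd.exportmgr.ShowExports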

Comment 15 errata-xmlrpc 2017-09-21 04:30:55 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2017:2774


