Bug 1393526

Summary:	[Ganesha] : Ganesha crashes intermittently during nfs-ganesha restarts.
Product:	[Red Hat Storage] Red Hat Gluster Storage	Reporter:	Ambarish <asoman>
Component:	io-threads	Assignee:	Pranith Kumar K <pkarampu>
Status:	CLOSED ERRATA	QA Contact:	Ambarish <asoman>
Severity:	high	Docs Contact:
Priority:	medium
Version:	rhgs-3.2	CC:	amukherj, asoman, bturner, jthottan, ndevos, pkarampu, rgowdapp, rhinduja, rhs-bugs, skoduri, storage-qa-internal
Target Milestone:	---
Target Release:	RHGS 3.2.0
Hardware:	x86_64
OS:	Linux
Whiteboard:
Fixed In Version:	glusterfs-3.8.4-6	Doc Type:	If docs needed, set a value
Doc Text:		Story Points:	---
Clone Of:
Clones:	1396793		Environment:
Last Closed:	2017-03-23 06:17:50 UTC	Type:	Bug
Regression:	---	Mount Type:	---
Documentation:	---	CRM:
Verified Versions:		Category:	---
oVirt Team:	---	RHEL 7.3 requirements from Atomic Host:
Cloudforms Team:	---	Target Upstream Version:
Embargoed:
Bug Depends On:	1396793
Bug Blocks:	1351528

Description Ambarish 2016-11-09 18:24:51 UTC

Description of problem:
----------------------

After setting up Ganesha (i.e., after installing the latest RPMs, running pcs auth, enabling Ganesha, and exporting the volume), nfs-ganesha crashed on 2 of 4 servers when I tried to restart the ganesha service.
The process came back up, so my guess is that it dumped core when the Ganesha process was stopped.

*************
BT from crash
*************

(gdb) bt
#0  0x00007fb6f39e780c in ?? ()
#1  0x0000000000000000 in ?? ()
(gdb) 

The signature of the BT looks similar to the one reported in BZ#1380619.

client-io-threads was on during my testing. I'll update the BZ soon with the result after setting it to off as well.

Version-Release number of selected component (if applicable):
-------------------------------------------------------------
[root@gqas013 tmp]# rpm -qa|grep ganesha
glusterfs-ganesha-3.8.4-3.el7rhgs.x86_64
nfs-ganesha-2.4.1-1.el7rhgs.x86_64
[root@gqas013 tmp]# 


How reproducible:
-----------------

2/4

Steps to Reproduce:
------------------

> After a fresh install, perform the steps to set up Ganesha: install RPMs, run pcs auth, enable Ganesha, and export the volume.

> Start the volume, then restart glusterd, rpcbind, and nfs-ganesha.


Actual results:
---------------

Ganesha crashed and dumped core on 2/4 servers.
The process was alive afterwards, so the core was presumably dumped when Ganesha was stopped during the restart.

Expected results:
-----------------

No crashes while restarting system services.

Additional info:
----------------

OS : RHEL 7.3

*Vol config* :

Volume Name: testvol
Type: Distributed-Replicate
Volume ID: 7b413fd4-9775-44a2-bfa8-23d206db9dfe
Status: Started
Snapshot Count: 0
Number of Bricks: 2 x 2 = 4
Transport-type: tcp
Bricks:
Brick1: gqas013.sbu.lab.eng.bos.redhat.com:/bricks/testvol_brick0
Brick2: gqas005.sbu.lab.eng.bos.redhat.com:/bricks/testvol_brick1
Brick3: gqas006.sbu.lab.eng.bos.redhat.com:/bricks/testvol_brick2
Brick4: gqas011.sbu.lab.eng.bos.redhat.com:/bricks/testvol_brick3
Options Reconfigured:
nfs.disable: on
performance.readdir-ahead: on
transport.address-family: inet
performance.stat-prefetch: off
server.allow-insecure: on
features.cache-invalidation: off
ganesha.enable: on
cluster.enable-shared-storage: enable
nfs-ganesha: enable
[root@gqas013 tmp]#

Comment 3 Soumya Koduri 2016-11-09 18:33:18 UTC

Ambarish,
If you happen to reproduce the issue, please take the core (using gdb) before running service stop/restart so as to compare the threads before and after the crash. Thanks!
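
A minimal sketch of one way to capture that "before" state, assuming gdb (and its gcore helper) is installed on the server and the caller already knows the PID of the ganesha.nfsd process. The function names and the /var/tmp output prefix are hypothetical and chosen only for illustration.

```python
import subprocess
from typing import List


def gcore_command(pid: int, prefix: str) -> List[str]:
    """Build the gcore invocation; gcore writes the core to <prefix>.<pid>."""
    return ["gcore", "-o", prefix, str(pid)]


def thread_backtrace_command(pid: int) -> List[str]:
    """Build a batch-mode gdb invocation that prints a backtrace for every thread."""
    return ["gdb", "-p", str(pid), "-batch", "-ex", "thread apply all bt"]


def snapshot_before_restart(pid: int, prefix: str = "/var/tmp/ganesha-pre-restart") -> str:
    """Dump a core of the live process, then return its per-thread backtraces.

    The default prefix is a hypothetical path; run this before the
    service stop/restart so the output can be compared with the crash core.
    """
    subprocess.run(gcore_command(pid, prefix), check=True)
    result = subprocess.run(
        thread_backtrace_command(pid), check=True, capture_output=True, text=True
    )
    return result.stdout
```

Saving the returned text next to the core makes it easier to diff the thread list against the one from the post-crash core.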

Comment 5 Ambarish 2016-11-10 11:34:54 UTC

I tried it twice, but I could not reproduce the issue after setting client-io-threads to "off".

The issue is intermittent, though, so it's hard to say with certainty whether client-io-threads is the culprit.

Comment 8 Ambarish 2016-11-11 05:05:50 UTC

Soumya,

I tried the steps twice on fresh setups, this time keeping my volume in the "Started" state before setting up the Ganesha cluster and exporting the volume, and I could not reproduce the crash across multiple system service restarts.

Comment 9 Soumya Koduri 2016-11-11 06:03:25 UTC

Thanks, Ambarish. That almost confirms the theory that this crash is hit only if a volume is exported via nfs-ganesha before it is even started.

Since this is not a recommended configuration, lowering the priority of the bug for now.
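
To avoid that configuration, a pre-export check could look roughly like the sketch below. It only parses text in the format of the `gluster volume info` output pasted in the description; the function names and the sample string are hypothetical, and this is not part of any Gluster or Ganesha tooling.

```python
def volume_status(volume_info: str) -> str:
    """Return the value of the 'Status:' line from `gluster volume info` text."""
    for line in volume_info.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "Status":
            return value.strip()
    raise ValueError("no Status line found in volume info text")


def safe_to_export(volume_info: str) -> bool:
    """True only when the volume reports itself as started."""
    return volume_status(volume_info) == "Started"


# Sample text trimmed from the volume configuration in the description.
SAMPLE = """\
Volume Name: testvol
Type: Distributed-Replicate
Status: Started
Snapshot Count: 0
"""

if __name__ == "__main__":
    print(safe_to_export(SAMPLE))
```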

I suspect that when the volume is not started, the flow is:

glfs_init() -> xlator_init() of all the child subvols -> rpc_connection to the brick, which fails.

After that, glfs_fini() is called. Since glfs_init() itself failed, the graph may not have been set up and PARENT_DOWN may not have been sent to the io-threads xlator, resulting in a dangling thread.

This is just a theory off the top of my head. I will look through the code a bit. CCing Pranith too.
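
The suspected sequence can be modelled with a toy example. This is only an illustration of the theory above, not GlusterFS code: the class, the function, and both flags are hypothetical, and a Python thread stands in for the io-threads worker.

```python
import threading


class IoThreadsModel:
    """Toy stand-in for a translator that owns one worker thread."""

    def __init__(self) -> None:
        self._stop = threading.Event()
        self.worker = threading.Thread(target=self._run, daemon=True)

    def init(self) -> None:
        # Models the xlator_init() step: the worker is spawned here.
        self.worker.start()

    def _run(self) -> None:
        # The worker idles until it is told to shut down.
        self._stop.wait()

    def parent_down(self) -> None:
        # Models the PARENT_DOWN notification: stop the worker and wait for it.
        self._stop.set()
        self.worker.join()


def init_then_fini(connect_ok: bool, notify_on_failed_init: bool) -> bool:
    """Return True if the worker is still alive after teardown (dangling)."""
    xl = IoThreadsModel()
    xl.init()
    graph_active = connect_ok  # the brick connection fails if the volume is not started
    # Teardown notifies the translator only if the graph became active,
    # unless the teardown path is assumed to handle a failed init as well.
    if graph_active or notify_on_failed_init:
        xl.parent_down()
    dangling = xl.worker.is_alive()
    # Clean up so the demonstration itself does not leak a thread.
    xl._stop.set()
    xl.worker.join()
    return dangling
```

In this model the worker is left running only when init fails and teardown skips the notification, which is the case the theory describes.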

Comment 19 Ambarish 2016-12-27 03:57:27 UTC

I could not reproduce this crash on multiple tries with the following builds:

gluster : glusterfs-3.8.4-10
ganesha : 2.4.1-3

Verified.

Comment 21 errata-xmlrpc 2017-03-23 06:17:50 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHSA-2017-0486.html