Bug 1393526

Summary: [Ganesha] : Ganesha crashes intermittently during nfs-ganesha restarts.
Product: [Red Hat Storage] Red Hat Gluster Storage
Reporter: Ambarish <asoman>
Component: io-threads
Assignee: Pranith Kumar K <pkarampu>
Status: CLOSED ERRATA
QA Contact: Ambarish <asoman>
Severity: high
Docs Contact:
Priority: medium
Version: rhgs-3.2
CC: amukherj, asoman, bturner, jthottan, ndevos, pkarampu, rgowdapp, rhinduja, rhs-bugs, skoduri, storage-qa-internal
Target Milestone: ---
Target Release: RHGS 3.2.0
Hardware: x86_64
OS: Linux
Whiteboard:
Fixed In Version: glusterfs-3.8.4-6
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
: 1396793
Environment:
Last Closed: 2017-03-23 06:17:50 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On: 1396793
Bug Blocks: 1351528

Description Ambarish 2016-11-09 18:24:51 UTC
Description of problem:
----------------------

After setting up Ganesha (installing the latest RPMs, running pcs auth, enabling Ganesha and exporting the volume), nfs-ganesha crashed on 2 of 4 servers when I tried to restart the ganesha service.
The process came back alive, so my guess is that it dumped core when the Ganesha process was stopped.

*************
BT from crash
*************

(gdb) bt
#0  0x00007fb6f39e780c in ?? ()
#1  0x0000000000000000 in ?? ()
(gdb) 

The signature of the BT looks similar to the one reported in BZ#1380619.

client-io-threads was on during my testing. I will also test with it set to off and update this BZ with the result soon.

Version-Release number of selected component (if applicable):
-------------------------------------------------------------
[root@gqas013 tmp]# rpm -qa|grep ganesha
glusterfs-ganesha-3.8.4-3.el7rhgs.x86_64
nfs-ganesha-2.4.1-1.el7rhgs.x86_64
[root@gqas013 tmp]# 


How reproducible:
-----------------

2/4

Steps to Reproduce:
------------------

> After a fresh install, perform the steps to set up Ganesha: install the RPMs, run pcs auth, enable Ganesha and export the volume.

> Start the volume, then restart glusterd, rpcbind and nfs-ganesha.
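
The second step above can be sketched as an ordered command list. This is only an illustration: the report does not say how the services were restarted, so the use of systemctl (the default on RHEL 7) is an assumption, and restart_sequence is a hypothetical helper name. The volume name comes from the volume info further down.

```python
VOLUME = "testvol"  # volume name taken from the report's volume info


def restart_sequence(volume):
    """Return the commands for the second reproduction step, in order.

    Assumption: services are restarted with systemctl; the report only
    says the services were restarted.
    """
    return [
        ["gluster", "volume", "start", volume],
        ["systemctl", "restart", "glusterd"],
        ["systemctl", "restart", "rpcbind"],
        ["systemctl", "restart", "nfs-ganesha"],
    ]


if __name__ == "__main__":
    # Print the commands instead of running them, so the sketch is safe
    # to execute on a machine without Gluster installed.
    for cmd in restart_sequence(VOLUME):
        print(" ".join(cmd))
```

Printing rather than executing keeps the sketch harmless; on a test node each list could be passed to subprocess.run(cmd, check=True) instead.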


Actual results:
---------------

Ganesha crashed and dumped core on 2 of 4 servers.
The process was alive afterwards, so the core was most likely dumped when Ganesha was stopped during the restart.

Expected results:
-----------------

No crashes while restarting system services.

Additional info:
----------------

OS : RHEL 7.3

*Vol config* :

Volume Name: testvol
Type: Distributed-Replicate
Volume ID: 7b413fd4-9775-44a2-bfa8-23d206db9dfe
Status: Started
Snapshot Count: 0
Number of Bricks: 2 x 2 = 4
Transport-type: tcp
Bricks:
Brick1: gqas013.sbu.lab.eng.bos.redhat.com:/bricks/testvol_brick0
Brick2: gqas005.sbu.lab.eng.bos.redhat.com:/bricks/testvol_brick1
Brick3: gqas006.sbu.lab.eng.bos.redhat.com:/bricks/testvol_brick2
Brick4: gqas011.sbu.lab.eng.bos.redhat.com:/bricks/testvol_brick3
Options Reconfigured:
nfs.disable: on
performance.readdir-ahead: on
transport.address-family: inet
performance.stat-prefetch: off
server.allow-insecure: on
features.cache-invalidation: off
ganesha.enable: on
cluster.enable-shared-storage: enable
nfs-ganesha: enable
[root@gqas013 tmp]#

Comment 3 Soumya Koduri 2016-11-09 18:33:18 UTC
Ambarish,
If you happen to reproduce the issue, please take the core (using gdb) before running service stop/restart so as to compare the threads before and after the crash. Thanks!
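
One way to capture such a "before" core without stopping the process is gdb's gcore utility. The sketch below is a suggestion, not the procedure used in this bug: the helper names and the output prefix are hypothetical, and the PID of the Ganesha daemon has to be supplied by the caller.

```python
import shutil
import subprocess


def gcore_command(pid, prefix):
    """Build the gcore invocation; gcore writes the dump to <prefix>.<pid>."""
    return ["gcore", "-o", prefix, str(pid)]


def snapshot_core(pid, prefix="/tmp/ganesha-before-restart"):
    """Dump a core of a live process and return the expected file name.

    The default prefix is a made-up path used only for illustration.
    """
    if shutil.which("gcore") is None:
        raise RuntimeError("gcore (shipped with gdb) was not found on PATH")
    subprocess.run(gcore_command(pid, prefix), check=True)
    return "{}.{}".format(prefix, pid)
```

Loading the resulting file in gdb and running "thread apply all bt" would give the thread list to compare against the core dumped during the restart.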

Comment 5 Ambarish 2016-11-10 11:34:54 UTC
I tried it twice, but I could not reproduce the issue after setting client-io-threads to "off".

The issue is intermittent, though, so it is hard to say with certainty whether client-io-threads is the culprit.

Comment 8 Ambarish 2016-11-11 05:05:50 UTC
Soumya,

I repeated the steps with my volume already in the "Started" state before setting up the Ganesha cluster and exporting the volume. I did this twice on fresh setups and could not reproduce the crash across multiple service restarts.

Comment 9 Soumya Koduri 2016-11-11 06:03:25 UTC
Thanks, Ambarish. That almost confirms the theory that this crash is hit only if a volume is exported via nfs-ganesha before it has been started.

Since this is not a recommended configuration, lowering the priority of the bug for now.

I suspect that when the volume is not started, the flow is:

glfs_init() -> xlator_init() of all the child subvols -> RPC connection to the brick, which fails.

After that, glfs_fini() is called. Since glfs_init() itself failed, the graph may not have been set up, so PARENT_DOWN may not have been sent to the io-threads xlator, leaving a dangling thread.

This is just the theory I have off the top of my head. I will look through the code a bit. CCing Pranith too.
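
The suspected sequence can be illustrated with a toy model. This is not GlusterFS code: IoThreadsModel, parent_down and fini are hypothetical names that only mimic the idea that a worker thread is stopped by a shutdown notification, and that teardown skips the notification when the graph was never activated.

```python
import threading


class IoThreadsModel:
    """Toy stand-in for a translator that owns a worker thread."""

    def __init__(self):
        self._stop = threading.Event()
        # daemon=True so a leaked worker cannot keep this demo from exiting
        self.worker = threading.Thread(target=self._run, daemon=True)
        self.worker.start()

    def _run(self):
        # The worker idles until it is told to shut down.
        self._stop.wait()

    def parent_down(self):
        # Models the shutdown notification: stop the worker and wait for it.
        self._stop.set()
        self.worker.join()


def fini(xlator, graph_was_activated):
    """Tear down; return True if the worker thread is left running."""
    if graph_was_activated:
        xlator.parent_down()
    # In the real process the library code would be released at this point,
    # so a surviving thread would keep executing memory that is gone.
    return xlator.worker.is_alive()


if __name__ == "__main__":
    print("init failed, worker left behind:", fini(IoThreadsModel(), False))
    print("init succeeded, worker left behind:", fini(IoThreadsModel(), True))
```

If the theory holds, the first case corresponds to exporting a volume that was never started. A thread left running in unloaded code would also be consistent with the unresolved "?? ()" frames in the backtrace, though the model cannot confirm that.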

Comment 19 Ambarish 2016-12-27 03:57:27 UTC
I could not reproduce this crash on multiple tries.

gluster : glusterfs-3.8.4-10
ganesha : 2.4.1-3

Verified.

Comment 21 errata-xmlrpc 2017-03-23 06:17:50 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHSA-2017-0486.html