Bug 1381353

Summary: Ganesha crashes on volume restarts
Product: Red Hat Gluster Storage
Reporter: Ambarish <asoman>
Component: nfs-ganesha
Assignee: Soumya Koduri <skoduri>
Status: CLOSED ERRATA
QA Contact: Ambarish <asoman>
Severity: urgent
Docs Contact:
Priority: unspecified
Version: rhgs-3.2
CC: amukherj, asoman, jthottan, ndevos, rhinduja, rhs-bugs, skoduri, storage-qa-internal
Target Milestone: ---
Keywords: Triaged
Target Release: RHGS 3.2.0
Hardware: x86_64
OS: Linux
Whiteboard:
Fixed In Version: glusterfs-3.8.4-3
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2017-03-23 06:07:35 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Bug Depends On: 1380619
Bug Blocks: 1351528

Description Ambarish 2016-10-03 19:37:58 UTC
Description of problem:
-----------------------

4-node Ganesha cluster. Restarted the volume. Ganesha crashed on 3 of the 4 nodes.

*BT from crash* [the full "t a a bt" (thread apply all bt) output with 256 threads is lengthy, so only a snippet is included]:

Thread 281 (Thread 0x7f3814780700 (LWP 20103)):
#0  0x00007f3889588a82 in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x00007f388b01cdbf in nfs_rpc_dequeue_req (worker=worker@entry=0x7f388c774040)
    at /usr/src/debug/nfs-ganesha-2.4.0/src/MainNFSD/nfs_rpc_dispatcher_thread.c:1612
#2  0x00007f388b017a79 in worker_run (ctx=0x7f388c774040)
    at /usr/src/debug/nfs-ganesha-2.4.0/src/MainNFSD/nfs_worker_thread.c:1519
#3  0x00007f388b0a2029 in fridgethr_start_routine (arg=0x7f388c774040)
    at /usr/src/debug/nfs-ganesha-2.4.0/src/support/fridgethr.c:550
#4  0x00007f3889584dc5 in start_thread () from /lib64/libpthread.so.0
#5  0x00007f3888c521cd in clone () from /lib64/libc.so.6

Thread 280 (Thread 0x7f37dc710700 (LWP 20215)):
#0  0x00007f3889588a82 in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x00007f388b01cdbf in nfs_rpc_dequeue_req (worker=worker@entry=0x7f388c795440)
    at /usr/src/debug/nfs-ganesha-2.4.0/src/MainNFSD/nfs_rpc_dispatcher_thread.c:1612
#2  0x00007f388b017a79 in worker_run (ctx=0x7f388c795440)
    at /usr/src/debug/nfs-ganesha-2.4.0/src/MainNFSD/nfs_worker_thread.c:1519
#3  0x00007f388b0a2029 in fridgethr_start_routine (arg=0x7f388c795440)
    at /usr/src/debug/nfs-ganesha-2.4.0/src/support/fridgethr.c:550
#4  0x00007f3889584dc5 in start_thread () from /lib64/libpthread.so.0
#5  0x00007f3888c521cd in clone () from /lib64/libc.so.6

Thread 279 (Thread 0x7f37f0f39700 (LWP 20174)):
#0  0x00007f3889588a82 in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x00007f388b01cdbf in nfs_rpc_dequeue_req (worker=worker@entry=0x7f388c789180)
    at /usr/src/debug/nfs-ganesha-2.4.0/src/MainNFSD/nfs_rpc_dispatcher_thread.c:1612
#2  0x00007f388b017a79 in worker_run (ctx=0x7f388c789180)
    at /usr/src/debug/nfs-ganesha-2.4.0/src/MainNFSD/nfs_worker_thread.c:1519
#3  0x00007f388b0a2029 in fridgethr_start_routine (arg=0x7f388c789180)
    at /usr/src/debug/nfs-ganesha-2.4.0/src/support/fridgethr.c:550
#4  0x00007f3889584dc5 in start_thread () from /lib64/libpthread.so.0
#5  0x00007f3888c521cd in clone () from /lib64/libc.so.6

Thread 278 (Thread 0x7f37eaf2d700 (LWP 20186)):
#0  0x00007f3889588a82 in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x00007f388b01cdbf in nfs_rpc_dequeue_req (worker=worker@entry=0x7f388c78ca80)
    at /usr/src/debug/nfs-ganesha-2.4.0/src/MainNFSD/nfs_rpc_dispatcher_thread.c:1612
#2  0x00007f388b017a79 in worker_run (ctx=0x7f388c78ca80)
    at /usr/src/debug/nfs-ganesha-2.4.0/src/MainNFSD/nfs_worker_thread.c:1519
#3  0x00007f388b0a2029 in fridgethr_start_routine (arg=0x7f388c78ca80)
    at /usr/src/debug/nfs-ganesha-2.4.0/src/support/fridgethr.c:550
#4  0x00007f3889584dc5 in start_thread () from /lib64/libpthread.so.0
#5  0x00007f3888c521cd in clone () from /lib64/libc.so.6

Thread 277 (Thread 0x7f383ffd7700 (LWP 20016)):
#0  0x00007f3889588a82 in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x00007f388b01cdbf in nfs_rpc_dequeue_req (worker=worker@entry=0x7f388c75a300)
    at /usr/src/debug/nfs-ganesha-2.4.0/src/MainNFSD/nfs_rpc_dispatcher_thread.c:1612
#2  0x00007f388b017a79 in worker_run (ctx=0x7f388c75a300)
    at /usr/src/debug/nfs-ganesha-2.4.0/src/MainNFSD/nfs_worker_thread.c:1519
#3  0x00007f388b0a2029 in fridgethr_start_routine (arg=0x7f388c75a300)
    at /usr/src/debug/nfs-ganesha-2.4.0/src/support/fridgethr.c:550
#4  0x00007f3889584dc5 in start_thread () from /lib64/libpthread.so.0
#5  0x00007f3888c521cd in clone () from /lib64/libc.so.6

Thread 276 (Thread 0x7f37e3f1f700 (LWP 20200)):
#0  0x00007f3889588a82 in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x00007f388b01cdbf in nfs_rpc_dequeue_req (worker=worker@entry=0x7f388c790d00)
    at /usr/src/debug/nfs-ganesha-2.4.0/src/MainNFSD/nfs_rpc_dispatcher_thread.c:1612
#2  0x00007f388b017a79 in worker_run (ctx=0x7f388c790d00)
    at /usr/src/debug/nfs-ganesha-2.4.0/src/MainNFSD/nfs_worker_thread.c:1519
#3  0x00007f388b0a2029 in fridgethr_start_routine (arg=0x7f388c790d00)
    at /usr/src/debug/nfs-ganesha-2.4.0/src/support/fridgethr.c:550
#4  0x00007f3889584dc5 in start_thread () from /lib64/libpthread.so.0
#5  0x00007f3888c521cd in clone () from /lib64/libc.so.6



Version-Release number of selected component (if applicable):
-------------------------------------------------------------

nfs-ganesha-2.4.0-2.el7rhgs.x86_64
glusterfs-ganesha-3.8.4-2.el7rhgs.x86_64

How reproducible:
-----------------

2/2

Steps to Reproduce:
------------------

1. Set up a 4-node Ganesha cluster.

2. Run gluster v stop <vol>, then gluster v start <vol>.

3. Check whether the Ganesha process is still alive on each server.
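
The steps above can be scripted roughly as follows. This is a sketch under assumptions, not the procedure actually used for this report: it assumes passwordless ssh to each server, that the daemon runs under the process name ganesha.nfsd, and that the gluster CLI is invoked with --mode=script to skip the confirmation prompt on stop. The helper names (run, restart_volume, ganesha_alive, crashed_nodes) are made up for illustration.

```python
import subprocess


def run(cmd):
    """Run a command, discard its output, and return its exit status."""
    return subprocess.run(
        cmd, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL
    ).returncode


def restart_volume(volume, runner=run):
    """Stop and then start the volume; return True only if both commands succeed."""
    # --mode=script avoids the interactive y/n prompt on "volume stop".
    if runner(["gluster", "--mode=script", "volume", "stop", volume]) != 0:
        return False
    return runner(["gluster", "--mode=script", "volume", "start", volume]) == 0


def ganesha_alive(host, runner=run):
    """Return True if a ganesha.nfsd process exists on host (process name is an assumption)."""
    # pgrep exits 0 when at least one process matches.
    return runner(["ssh", host, "pgrep", "-x", "ganesha.nfsd"]) == 0


def crashed_nodes(volume, hosts, runner=run):
    """Restart the volume once and list the hosts where Ganesha is no longer running."""
    if not restart_volume(volume, runner):
        raise RuntimeError("volume restart failed for " + volume)
    return [host for host in hosts if not ganesha_alive(host, runner)]
```

The runner argument exists only so the logic can be exercised without a real cluster; on a live setup, crashed_nodes("testvol", [...]) with the four server hostnames would be the whole check.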

Actual results:
---------------

Ganesha crashed on 3 of the 4 nodes.

Expected results:
-----------------

Ganesha should not crash on a volume restart.

Additional info:
----------------

Mount vers=4

On Dev's suggestion, "GANESHA_DIR=/etc/ganesha/" was changed to "GANESHA_DIR=/var/run/gluster/shared_storage/nfs-ganesha" in /var/lib/glusterd/hooks/1/start/post/S31ganesha-start.sh.
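
For anyone repeating that hook-script change on several nodes, the edit can be expressed as a small text transformation. This is a hedged sketch only: it assumes the script contains a plain GANESHA_DIR=... assignment on its own line, and the function name set_ganesha_dir is hypothetical.

```python
import re

# Target directory taken from the change described above.
NEW_DIR = "/var/run/gluster/shared_storage/nfs-ganesha"


def set_ganesha_dir(script_text, new_dir=NEW_DIR):
    """Return script_text with its first GANESHA_DIR assignment pointing at new_dir."""
    pattern = re.compile(r"^(\s*GANESHA_DIR=).*$", re.MULTILINE)
    new_text, count = pattern.subn(
        lambda m: m.group(1) + new_dir, script_text, count=1
    )
    if count == 0:
        raise ValueError("no GANESHA_DIR assignment found")
    return new_text
```

The function works on the script's text rather than on the file itself, so the result can be reviewed before it is written back to S31ganesha-start.sh.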


Client and Server OS : RHEL 7.2

Volume Configuration :

Volume Name: testvol
Type: Distributed-Replicate
Volume ID: b93b99bd-d1d2-4236-98bc-08311f94e7dc
Status: Started
Snapshot Count: 0
Number of Bricks: 2 x 2 = 4
Transport-type: tcp
Bricks:
Brick1: gqas013.sbu.lab.eng.bos.redhat.com:/bricks/testvol_brick0
Brick2: gqas005.sbu.lab.eng.bos.redhat.com:/bricks/testvol_brick1
Brick3: gqas006.sbu.lab.eng.bos.redhat.com:/bricks/testvol_brick2
Brick4: gqas011.sbu.lab.eng.bos.redhat.com:/bricks/testvol_brick3
Options Reconfigured:
diagnostics.count-fop-hits: on
diagnostics.latency-measurement: on
ganesha.enable: on
features.cache-invalidation: off
nfs.disable: on
performance.readdir-ahead: on
performance.stat-prefetch: off
server.allow-insecure: on
nfs-ganesha: enable
cluster.enable-shared-storage: enable

Comment 3 Soumya Koduri 2016-10-04 04:49:11 UTC
As mentioned in https://bugzilla.redhat.com/show_bug.cgi?id=1380619#c6, this issue looks similar to the one raised in bug 1380619. Will check the cores and confirm.

Comment 4 Soumya Koduri 2016-10-05 06:19:44 UTC
From the cores provided, we can see stack corruption. This is exactly the same issue that is being addressed as part of bug 1380619. Since the use cases are different, marking this bug as dependent on that one.

Comment 14 Jiffin 2016-11-08 06:31:24 UTC
This issue is fixed by the patch https://code.engineering.redhat.com/gerrit/87972, hence changing the status.

Comment 15 Ambarish 2016-11-10 09:45:54 UTC
Verified on 3.8.4-3.

Restarted the volume a couple of times; I did not see any crashes.

Comment 19 errata-xmlrpc 2017-03-23 06:07:35 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHSA-2017-0486.html