Description of problem:
-----------------------
4 node Ganesha cluster. Restarted the volume. Ganesha crashed on 3/4 nodes.

*BT from crash* [a 'thread apply all bt' across 256 threads is quite lengthy; inserting a snippet]:

Thread 281 (Thread 0x7f3814780700 (LWP 20103)):
#0  0x00007f3889588a82 in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x00007f388b01cdbf in nfs_rpc_dequeue_req (worker=worker@entry=0x7f388c774040) at /usr/src/debug/nfs-ganesha-2.4.0/src/MainNFSD/nfs_rpc_dispatcher_thread.c:1612
#2  0x00007f388b017a79 in worker_run (ctx=0x7f388c774040) at /usr/src/debug/nfs-ganesha-2.4.0/src/MainNFSD/nfs_worker_thread.c:1519
#3  0x00007f388b0a2029 in fridgethr_start_routine (arg=0x7f388c774040) at /usr/src/debug/nfs-ganesha-2.4.0/src/support/fridgethr.c:550
#4  0x00007f3889584dc5 in start_thread () from /lib64/libpthread.so.0
#5  0x00007f3888c521cd in clone () from /lib64/libc.so.6

Thread 280 (Thread 0x7f37dc710700 (LWP 20215)):
#0  0x00007f3889588a82 in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x00007f388b01cdbf in nfs_rpc_dequeue_req (worker=worker@entry=0x7f388c795440) at /usr/src/debug/nfs-ganesha-2.4.0/src/MainNFSD/nfs_rpc_dispatcher_thread.c:1612
#2  0x00007f388b017a79 in worker_run (ctx=0x7f388c795440) at /usr/src/debug/nfs-ganesha-2.4.0/src/MainNFSD/nfs_worker_thread.c:1519
#3  0x00007f388b0a2029 in fridgethr_start_routine (arg=0x7f388c795440) at /usr/src/debug/nfs-ganesha-2.4.0/src/support/fridgethr.c:550
#4  0x00007f3889584dc5 in start_thread () from /lib64/libpthread.so.0
#5  0x00007f3888c521cd in clone () from /lib64/libc.so.6

Thread 279 (Thread 0x7f37f0f39700 (LWP 20174)):
#0  0x00007f3889588a82 in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x00007f388b01cdbf in nfs_rpc_dequeue_req (worker=worker@entry=0x7f388c789180) at /usr/src/debug/nfs-ganesha-2.4.0/src/MainNFSD/nfs_rpc_dispatcher_thread.c:1612
#2  0x00007f388b017a79 in worker_run (ctx=0x7f388c789180) at /usr/src/debug/nfs-ganesha-2.4.0/src/MainNFSD/nfs_worker_thread.c:1519
#3  0x00007f388b0a2029 in fridgethr_start_routine (arg=0x7f388c789180) at /usr/src/debug/nfs-ganesha-2.4.0/src/support/fridgethr.c:550
#4  0x00007f3889584dc5 in start_thread () from /lib64/libpthread.so.0
#5  0x00007f3888c521cd in clone () from /lib64/libc.so.6

Thread 278 (Thread 0x7f37eaf2d700 (LWP 20186)):
#0  0x00007f3889588a82 in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x00007f388b01cdbf in nfs_rpc_dequeue_req (worker=worker@entry=0x7f388c78ca80) at /usr/src/debug/nfs-ganesha-2.4.0/src/MainNFSD/nfs_rpc_dispatcher_thread.c:1612
#2  0x00007f388b017a79 in worker_run (ctx=0x7f388c78ca80) at /usr/src/debug/nfs-ganesha-2.4.0/src/MainNFSD/nfs_worker_thread.c:1519
#3  0x00007f388b0a2029 in fridgethr_start_routine (arg=0x7f388c78ca80) at /usr/src/debug/nfs-ganesha-2.4.0/src/support/fridgethr.c:550
#4  0x00007f3889584dc5 in start_thread () from /lib64/libpthread.so.0
#5  0x00007f3888c521cd in clone () from /lib64/libc.so.6

Thread 277 (Thread 0x7f383ffd7700 (LWP 20016)):
#0  0x00007f3889588a82 in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x00007f388b01cdbf in nfs_rpc_dequeue_req (worker=worker@entry=0x7f388c75a300) at /usr/src/debug/nfs-ganesha-2.4.0/src/MainNFSD/nfs_rpc_dispatcher_thread.c:1612
#2  0x00007f388b017a79 in worker_run (ctx=0x7f388c75a300) at /usr/src/debug/nfs-ganesha-2.4.0/src/MainNFSD/nfs_worker_thread.c:1519
#3  0x00007f388b0a2029 in fridgethr_start_routine (arg=0x7f388c75a300) at /usr/src/debug/nfs-ganesha-2.4.0/src/support/fridgethr.c:550
#4  0x00007f3889584dc5 in start_thread () from /lib64/libpthread.so.0
#5  0x00007f3888c521cd in clone () from /lib64/libc.so.6

Thread 276 (Thread 0x7f37e3f1f700 (LWP 20200)):
#0  0x00007f3889588a82 in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x00007f388b01cdbf in nfs_rpc_dequeue_req (worker=worker@entry=0x7f388c790d00) at /usr/src/debug/nfs-ganesha-2.4.0/src/MainNFSD/nfs_rpc_dispatcher_thread.c:1612
#2  0x00007f388b017a79 in worker_run (ctx=0x7f388c790d00) at /usr/src/debug/nfs-ganesha-2.4.0/src/MainNFSD/nfs_worker_thread.c:1519
#3  0x00007f388b0a2029 in fridgethr_start_routine (arg=0x7f388c790d00) at /usr/src/debug/nfs-ganesha-2.4.0/src/support/fridgethr.c:550
#4  0x00007f3889584dc5 in start_thread () from /lib64/libpthread.so.0
#5  0x00007f3888c521cd in clone () from /lib64/libc.so.6

Version-Release number of selected component (if applicable):
-------------------------------------------------------------
nfs-ganesha-2.4.0-2.el7rhgs.x86_64
glusterfs-ganesha-3.8.4-2.el7rhgs.x86_64

How reproducible:
-----------------
2/2

Steps to Reproduce:
------------------
1. Set up a 4 node Ganesha cluster.
2. gluster v stop <vol> / gluster v start <vol>
3. Check if the Ganesha process is alive on the servers.
(See the command sketch after the volume configuration below.)

Actual results:
---------------
Ganesha crashed on 3/4 nodes.

Expected results:
-----------------
Ganesha should not crash on a volume restart.

Additional info:
----------------
mount vers=4
On Dev's suggestion, "GANESHA_DIR=/etc/ganesha/" was changed to "GANESHA_DIR=/var/run/gluster/shared_storage/nfs-ganesha" inside /var/lib/glusterd/hooks/1/start/post/S31ganesha-start.sh.
Client and Server OS: RHEL 7.2

Volume Configuration:

Volume Name: testvol
Type: Distributed-Replicate
Volume ID: b93b99bd-d1d2-4236-98bc-08311f94e7dc
Status: Started
Snapshot Count: 0
Number of Bricks: 2 x 2 = 4
Transport-type: tcp
Bricks:
Brick1: gqas013.sbu.lab.eng.bos.redhat.com:/bricks/testvol_brick0
Brick2: gqas005.sbu.lab.eng.bos.redhat.com:/bricks/testvol_brick1
Brick3: gqas006.sbu.lab.eng.bos.redhat.com:/bricks/testvol_brick2
Brick4: gqas011.sbu.lab.eng.bos.redhat.com:/bricks/testvol_brick3
Options Reconfigured:
diagnostics.count-fop-hits: on
diagnostics.latency-measurement: on
ganesha.enable: on
features.cache-invalidation: off
nfs.disable: on
performance.readdir-ahead: on
performance.stat-prefetch: off
server.allow-insecure: on
nfs-ganesha: enable
cluster.enable-shared-storage: enable
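A minimal shell sketch of steps 2 and 3 above, assuming the volume name from the configuration (testvol); the core file location is an assumption and depends on the system's core_pattern:

    # Restart the volume from any one node in the trusted storage pool.
    # 'volume stop' asks for confirmation; answer y (or pass --mode=script).
    gluster volume stop testvol
    gluster volume start testvol

    # On each Ganesha node, confirm the nfs-ganesha daemon survived the restart
    # and look for a fresh core file:
    pgrep -l ganesha.nfsd || echo "ganesha.nfsd is not running"
    ls -l /core.* 2>/dev/null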
As mentioned in https://bugzilla.redhat.com/show_bug.cgi?id=1380619#c6, this issue looks similar to the one raised in bug 1380619. I will check the cores and confirm.
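For reference, the per-thread backtraces quoted in the description can be pulled from a core with gdb along these lines (the binary and core paths here are assumptions; adjust them to wherever the cores were collected):

    gdb /usr/bin/ganesha.nfsd /path/to/core.<pid>
    (gdb) set pagination off
    (gdb) thread apply all bt
    (gdb) bt full        # more detail on the thread gdb selected (the crashing one, for a core)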
From the cores provided, we can see stack corruption. This is exactly the same issue as the one being addressed in bug 1380619. Since the use cases are different, marking this as dependent on that bug.
This issue is fixed by the patch https://code.engineering.redhat.com/gerrit/87972, hence changing the status.
Verified on 3.8.4-3. Restarted the volume a couple of times; I did not see any crashes.
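A sketch of the kind of restart loop used for this verification, assuming the same testvol volume; the iteration count and sleep are arbitrary:

    for i in 1 2 3; do
        gluster --mode=script volume stop testvol
        gluster volume start testvol
        sleep 30    # give the hook scripts and ganesha.nfsd time to settle
        pgrep ganesha.nfsd >/dev/null && echo "iteration $i: ganesha.nfsd alive" \
                                      || echo "iteration $i: ganesha.nfsd DOWN"
    done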
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://rhn.redhat.com/errata/RHSA-2017-0486.html