Description of problem:
-----------------------
4 node Ganesha cluster. Restarted the volume. Ganesha crashed on 3/4 nodes.

*BT from crash* [a 'thread apply all bt' across 256 threads is quite lengthy; inserting a snippet]:

Thread 281 (Thread 0x7f3814780700 (LWP 20103)):
#0  0x00007f3889588a82 in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x00007f388b01cdbf in nfs_rpc_dequeue_req (worker=worker@entry=0x7f388c774040) at /usr/src/debug/nfs-ganesha-2.4.0/src/MainNFSD/nfs_rpc_dispatcher_thread.c:1612
#2  0x00007f388b017a79 in worker_run (ctx=0x7f388c774040) at /usr/src/debug/nfs-ganesha-2.4.0/src/MainNFSD/nfs_worker_thread.c:1519
#3  0x00007f388b0a2029 in fridgethr_start_routine (arg=0x7f388c774040) at /usr/src/debug/nfs-ganesha-2.4.0/src/support/fridgethr.c:550
#4  0x00007f3889584dc5 in start_thread () from /lib64/libpthread.so.0
#5  0x00007f3888c521cd in clone () from /lib64/libc.so.6

Thread 280 (Thread 0x7f37dc710700 (LWP 20215)):
#0  0x00007f3889588a82 in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x00007f388b01cdbf in nfs_rpc_dequeue_req (worker=worker@entry=0x7f388c795440) at /usr/src/debug/nfs-ganesha-2.4.0/src/MainNFSD/nfs_rpc_dispatcher_thread.c:1612
#2  0x00007f388b017a79 in worker_run (ctx=0x7f388c795440) at /usr/src/debug/nfs-ganesha-2.4.0/src/MainNFSD/nfs_worker_thread.c:1519
#3  0x00007f388b0a2029 in fridgethr_start_routine (arg=0x7f388c795440) at /usr/src/debug/nfs-ganesha-2.4.0/src/support/fridgethr.c:550
#4  0x00007f3889584dc5 in start_thread () from /lib64/libpthread.so.0
#5  0x00007f3888c521cd in clone () from /lib64/libc.so.6

Thread 279 (Thread 0x7f37f0f39700 (LWP 20174)):
#0  0x00007f3889588a82 in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x00007f388b01cdbf in nfs_rpc_dequeue_req (worker=worker@entry=0x7f388c789180) at /usr/src/debug/nfs-ganesha-2.4.0/src/MainNFSD/nfs_rpc_dispatcher_thread.c:1612
#2  0x00007f388b017a79 in worker_run (ctx=0x7f388c789180) at /usr/src/debug/nfs-ganesha-2.4.0/src/MainNFSD/nfs_worker_thread.c:1519
#3  0x00007f388b0a2029 in fridgethr_start_routine (arg=0x7f388c789180) at /usr/src/debug/nfs-ganesha-2.4.0/src/support/fridgethr.c:550
#4  0x00007f3889584dc5 in start_thread () from /lib64/libpthread.so.0
#5  0x00007f3888c521cd in clone () from /lib64/libc.so.6

Thread 278 (Thread 0x7f37eaf2d700 (LWP 20186)):
#0  0x00007f3889588a82 in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x00007f388b01cdbf in nfs_rpc_dequeue_req (worker=worker@entry=0x7f388c78ca80) at /usr/src/debug/nfs-ganesha-2.4.0/src/MainNFSD/nfs_rpc_dispatcher_thread.c:1612
#2  0x00007f388b017a79 in worker_run (ctx=0x7f388c78ca80) at /usr/src/debug/nfs-ganesha-2.4.0/src/MainNFSD/nfs_worker_thread.c:1519
#3  0x00007f388b0a2029 in fridgethr_start_routine (arg=0x7f388c78ca80) at /usr/src/debug/nfs-ganesha-2.4.0/src/support/fridgethr.c:550
#4  0x00007f3889584dc5 in start_thread () from /lib64/libpthread.so.0
#5  0x00007f3888c521cd in clone () from /lib64/libc.so.6

Thread 277 (Thread 0x7f383ffd7700 (LWP 20016)):
#0  0x00007f3889588a82 in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x00007f388b01cdbf in nfs_rpc_dequeue_req (worker=worker@entry=0x7f388c75a300) at /usr/src/debug/nfs-ganesha-2.4.0/src/MainNFSD/nfs_rpc_dispatcher_thread.c:1612
#2  0x00007f388b017a79 in worker_run (ctx=0x7f388c75a300) at /usr/src/debug/nfs-ganesha-2.4.0/src/MainNFSD/nfs_worker_thread.c:1519
#3  0x00007f388b0a2029 in fridgethr_start_routine (arg=0x7f388c75a300) at /usr/src/debug/nfs-ganesha-2.4.0/src/support/fridgethr.c:550
#4  0x00007f3889584dc5 in start_thread () from /lib64/libpthread.so.0
#5  0x00007f3888c521cd in clone () from /lib64/libc.so.6

Thread 276 (Thread 0x7f37e3f1f700 (LWP 20200)):
#0  0x00007f3889588a82 in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x00007f388b01cdbf in nfs_rpc_dequeue_req (worker=worker@entry=0x7f388c790d00) at /usr/src/debug/nfs-ganesha-2.4.0/src/MainNFSD/nfs_rpc_dispatcher_thread.c:1612
#2  0x00007f388b017a79 in worker_run (ctx=0x7f388c790d00) at /usr/src/debug/nfs-ganesha-2.4.0/src/MainNFSD/nfs_worker_thread.c:1519
#3  0x00007f388b0a2029 in fridgethr_start_routine (arg=0x7f388c790d00) at /usr/src/debug/nfs-ganesha-2.4.0/src/support/fridgethr.c:550
#4  0x00007f3889584dc5 in start_thread () from /lib64/libpthread.so.0
#5  0x00007f3888c521cd in clone () from /lib64/libc.so.6

Version-Release number of selected component (if applicable):
-------------------------------------------------------------
nfs-ganesha-2.4.0-2.el7rhgs.x86_64
glusterfs-ganesha-3.8.4-2.el7rhgs.x86_64

How reproducible:
-----------------
2/2

Steps to Reproduce:
------------------
1. Set up a 4 node Ganesha cluster.
2. gluster v stop <vol> / gluster v start <vol>
3. Check if the Ganesha process is alive on the servers.
(See the command sketch after the volume configuration below.)

Actual results:
---------------
Ganesha crashed on 3/4 nodes.

Expected results:
-----------------
Ganesha should not crash on a volume restart.

Additional info:
----------------
mount vers=4
On Dev's suggestion, "GANESHA_DIR=/etc/ganesha/" was changed to "GANESHA_DIR=/var/run/gluster/shared_storage/nfs-ganesha" inside /var/lib/glusterd/hooks/1/start/post/S31ganesha-start.sh.
Client and Server OS: RHEL 7.2

Volume Configuration:

Volume Name: testvol
Type: Distributed-Replicate
Volume ID: b93b99bd-d1d2-4236-98bc-08311f94e7dc
Status: Started
Snapshot Count: 0
Number of Bricks: 2 x 2 = 4
Transport-type: tcp
Bricks:
Brick1: gqas013.sbu.lab.eng.bos.redhat.com:/bricks/testvol_brick0
Brick2: gqas005.sbu.lab.eng.bos.redhat.com:/bricks/testvol_brick1
Brick3: gqas006.sbu.lab.eng.bos.redhat.com:/bricks/testvol_brick2
Brick4: gqas011.sbu.lab.eng.bos.redhat.com:/bricks/testvol_brick3
Options Reconfigured:
diagnostics.count-fop-hits: on
diagnostics.latency-measurement: on
ganesha.enable: on
features.cache-invalidation: off
nfs.disable: on
performance.readdir-ahead: on
performance.stat-prefetch: off
server.allow-insecure: on
nfs-ganesha: enable
cluster.enable-shared-storage: enable
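A minimal shell sketch of steps 2 and 3 above, assuming the volume name from the configuration (testvol); the core file location is an assumption and depends on the system's core_pattern:

    # Restart the volume from any one node in the trusted storage pool.
    # 'volume stop' asks for confirmation; answer y (or pass --mode=script).
    gluster volume stop testvol
    gluster volume start testvol

    # On each Ganesha node, confirm the nfs-ganesha daemon survived the restart
    # and look for a fresh core file:
    pgrep -l ganesha.nfsd || echo "ganesha.nfsd is not running"
    ls -l /core.* 2>/dev/null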
As mentioned in https://bugzilla.redhat.com/show_bug.cgi?id=1380619#c6, this issue looks similar to the one raised in bug 1380619. I will check the cores and confirm.
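For reference, the per-thread backtraces quoted in the description can be pulled from a core with gdb along these lines (the binary and core paths here are assumptions; adjust them to wherever the cores were collected):

    gdb /usr/bin/ganesha.nfsd /path/to/core.<pid>
    (gdb) set pagination off
    (gdb) thread apply all bt
    (gdb) bt full        # more detail on the thread gdb selected (the crashing one, for a core)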
From the cores provided, we can see stack corruption. This is exactly the same issue as the one being addressed in bug 1380619. Since the use cases are different, marking this as dependent on that bug.
This issue is fixed by the patch https://code.engineering.redhat.com/gerrit/87972, hence changing the status.
Verified on 3.8.4-3. Restarted the volume a couple of times; I did not see any crashes.
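A sketch of the kind of restart loop used for this verification, assuming the same testvol volume; the iteration count and sleep are arbitrary:

    for i in 1 2 3; do
        gluster --mode=script volume stop testvol
        gluster volume start testvol
        sleep 30    # give the hook scripts and ganesha.nfsd time to settle
        pgrep ganesha.nfsd >/dev/null && echo "iteration $i: ganesha.nfsd alive" \
                                      || echo "iteration $i: ganesha.nfsd DOWN"
    done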
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://rhn.redhat.com/errata/RHSA-2017-0486.html