Bug 1467896

Summary: [GANESHA] Ganesha crashed while running diskfill utility on nfs share mounted on windows client
Product: [Red Hat Storage] Red Hat Gluster Storage
Reporter: Manisha Saini <msaini>
Component: nfs-ganesha
Assignee: Frank Filz <ffilz>
Status: CLOSED ERRATA
QA Contact: Manisha Saini <msaini>
Severity: unspecified
Docs Contact:
Priority: urgent
Version: rhgs-3.3
CC: amukherj, dang, ffilz, jthottan, kkeithle, rhinduja, rhs-bugs, sheggodu, skoduri, storage-qa-internal
Target Milestone: ---
Target Release: RHGS 3.4.0
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version: nfs-ganesha-2.5.4-1
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2018-09-04 06:53:35 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On: 1562766, 1562774
Bug Blocks: 1503134
Attachments:
Description Flags
Windows snippet none

Description Manisha Saini 2017-07-05 12:43:01 UTC
Description of problem:
Ganesha crashed while running the diskfill utility on an NFS share mounted on a Windows client.

Version-Release number of selected component (if applicable):

# rpm -qa | grep ganesha
nfs-ganesha-gluster-2.4.4-10.el7rhgs.x86_64
nfs-ganesha-debuginfo-2.4.4-10.el7rhgs.x86_64
glusterfs-ganesha-3.8.4-31.el7rhgs.x86_64
nfs-ganesha-2.4.4-10.el7rhgs.x86_64


How reproducible:
3/3 times

Steps to Reproduce:
1. Create a 4-node Ganesha cluster.
2. Create a 6*2 distributed-replicate volume and enable Ganesha on it.
3. Mount the volume on a Windows client (NFSv3).
4. Run the diskfill utility from the Windows client.

Actual results:
Ganesha crashed on the node from which the volume was mounted on the Windows client.

Expected results:
Ganesha should not crash.

Additional info:


Program received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x7fc16d772700 (LWP 26131)]
dec_nlm_state_ref (state=0x0) at /usr/src/debug/nfs-ganesha-2.4.4/src/SAL/nlm_state.c:276
276		refcount = atomic_dec_int32_t(&state->state_refcount);
(gdb) bt
#0  dec_nlm_state_ref (state=0x0) at /usr/src/debug/nfs-ganesha-2.4.4/src/SAL/nlm_state.c:276
#1  0x0000556f2173596e in nlm4_Lock (args=<optimized out>, req=<optimized out>, res=0x7fc1fc008770)
    at /usr/src/debug/nfs-ganesha-2.4.4/src/Protocols/NLM/nlm_Lock.c:206
#2  0x0000556f216f6b1c in nfs_rpc_execute (reqdata=reqdata@entry=0x7fc1cc015090)
    at /usr/src/debug/nfs-ganesha-2.4.4/src/MainNFSD/nfs_worker_thread.c:1281
#3  0x0000556f216f818a in worker_run (ctx=0x556f2310cdb0) at /usr/src/debug/nfs-ganesha-2.4.4/src/MainNFSD/nfs_worker_thread.c:1548
#4  0x0000556f21781889 in fridgethr_start_routine (arg=0x556f2310cdb0) at /usr/src/debug/nfs-ganesha-2.4.4/src/support/fridgethr.c:550
#5  0x00007fc213805e25 in start_thread () from /lib64/libpthread.so.0
#6  0x00007fc212ed334d in clone () from /lib64/libc.so.6
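
The backtrace above shows dec_nlm_state_ref() being called with state=0x0 and faulting on the atomic decrement of state->state_refcount. As a minimal sketch of that failure mode only (this is not the upstream fix, and the struct and every sketch_* name below are hypothetical stand-ins that merely mirror the state_refcount field seen in the trace), a release helper that tolerates a NULL state could look like this:

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical stand-in for the NLM state object; only the refcount matters. */
struct sketch_state {
	_Atomic int32_t state_refcount;
};

/* Allocate a state holding initial_refs references (hypothetical helper). */
static struct sketch_state *sketch_new_state(int32_t initial_refs)
{
	assert(initial_refs > 0);
	struct sketch_state *state = malloc(sizeof(*state));

	if (state != NULL)
		atomic_init(&state->state_refcount, initial_refs);
	return state;
}

/*
 * Drop one reference. Returns the remaining count, or -1 when the caller
 * never obtained a state (state == NULL), instead of dereferencing it.
 */
static int32_t sketch_dec_state_ref(struct sketch_state *state)
{
	if (state == NULL)
		return -1;

	/* atomic_fetch_sub returns the old value, so subtract 1 for the new one. */
	int32_t refcount = atomic_fetch_sub(&state->state_refcount, 1) - 1;

	if (refcount == 0)
		free(state);
	return refcount;
}
```

A guard like this only hides the symptom. Whether the caller should ever reach the release path without a state is the real question, and that is what the later comments discuss.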


Ran the diskfill utility on a RHEL client with both V3 and V4 mounts.
It runs fine with RHEL clients.


Will attach sosreports and the core file shortly.

Comment 5 Jiffin 2017-07-06 06:00:13 UTC
(In reply to Daniel Gryniewicz from comment #4)
> This may be
> https://github.com/nfs-ganesha/nfs-ganesha/commit/52e0e125322fb0cc5c608be4cd43b90a702d88e2

The issue in the above case was that "state" was not being allocated by get_nlm_state(), whereas the above change addresses a refcount issue. Correct me if I am wrong, but will the above change prevent that scenario (that is, guarantee that the state entry will be present in the hash table)?

Comment 6 Daniel Gryniewicz 2017-07-06 13:08:48 UTC
Hmmm... I missed that nsm_state_applies is false.  This should *not* be possible for a LOCK op.  The only way to call nlm4_Lock() from that call path is to have req->rq_proc == NLMPROC4_NM_LOCK, which makes nsm_state_applies true.  So something else is very very wrong here, I think.

Comment 7 Frank Filz 2017-07-06 13:18:02 UTC
(In reply to Daniel Gryniewicz from comment #6)
> Hmmm... I missed that nsm_state_applies is false.  This should *not* be
> possible for a LOCK op.  The only way to call nlm4_Lock() from that call
> path is to have req->rq_proc == NLMPROC4_NM_LOCK, which makes
> nsm_state_applies true.  So something else is very very wrong here, I think.

Windows client uses NLMPROC4_NM_LOCK...

I probably messed something up for NM_LOCK...

Crud, we really need access to Windows client in development...

Comment 9 Manisha Saini 2017-07-11 19:21:42 UTC
Tested the same use case with 3.2.

Ganesha crashed while extracting a zip folder:

Program received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x7f3331db0700 (LWP 22701)]
0x0000560f118a49c5 in dec_nlm_state_ref ()
(gdb) bt
#0  0x0000560f118a49c5 in dec_nlm_state_ref ()
#1  0x0000560f11873cbe in nlm4_Lock ()
#2  0x0000560f11834eec in nfs_rpc_execute ()
#3  0x0000560f1183655a in worker_run ()
#4  0x0000560f118c02f9 in fridgethr_start_routine ()
#5  0x00007f334299ce25 in start_thread () from /lib64/libpthread.so.0
#6  0x00007f334206a34d in clone () from /lib64/libc.so.6


However, the crash was not hit while running the diskfill utility on the Windows mount point.

Comment 10 Manisha Saini 2017-07-11 19:24:13 UTC
Accidentally changed the assignee. Resetting it again.

Comment 11 Frank Filz 2017-07-12 15:43:03 UTC
(In reply to Manisha Saini from comment #9)
> Tested the same use case with 3.2.
> 
> Ganesha crashed while extracting a zip folder:
> 
> Program received signal SIGSEGV, Segmentation fault.
> [Switching to Thread 0x7f3331db0700 (LWP 22701)]
> 0x0000560f118a49c5 in dec_nlm_state_ref ()
> (gdb) bt
> #0  0x0000560f118a49c5 in dec_nlm_state_ref ()
> #1  0x0000560f11873cbe in nlm4_Lock ()
> #2  0x0000560f11834eec in nfs_rpc_execute ()
> #3  0x0000560f1183655a in worker_run ()
> #4  0x0000560f118c02f9 in fridgethr_start_routine ()
> #5  0x00007f334299ce25 in start_thread () from /lib64/libpthread.so.0
> #6  0x00007f334206a34d in clone () from /lib64/libc.so.6
> 
> 
> However, the crash was not hit while running the diskfill utility on the
> Windows mount point.

Yes, this fix needs backporting to 3.2 as well.

Comment 16 Daniel Gryniewicz 2017-08-29 17:33:18 UTC
I confirm that this commit is in 2.5.2.

Comment 22 Manisha Saini 2018-05-07 17:20:08 UTC
Created attachment 1432751 [details]
Windows snippet

Comment 24 errata-xmlrpc 2018-09-04 06:53:35 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2018:2610