Bug 1411344

Summary: [GNFS] GNFS crashed while taking lock on a file from 2 different clients having same volume mounted from 2 different servers
Product: [Red Hat Storage] Red Hat Gluster Storage Reporter: Manisha Saini <msaini>
Component: gluster-nfs    Assignee: Niels de Vos <ndevos>
Status: CLOSED ERRATA QA Contact: Manisha Saini <msaini>
Severity: unspecified Docs Contact:
Priority: medium    
Version: rhgs-3.2    CC: amukherj, asrivast, giuseppe.ragusa, jthottan, jules, msaini, ndevos, rcyriac, rhinduja, rhs-bugs, sbhaloth, skoduri, storage-qa-internal
Target Milestone: ---    Keywords: Reproducer
Target Release: RHGS 3.3.0   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard: 3.3.0-devel-freeze-exception
Fixed In Version: glusterfs-3.8.4-33 Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2017-09-21 04:30:55 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On: 1381970, 1467313    
Bug Blocks: 1411338, 1417147    
Attachments:
Test script (flags: none)
Test script that causes Gluster/NFS to crash in NLM code (flags: none)
Locking script (flags: none)

Description Manisha Saini 2017-01-09 14:28:31 UTC
Description of problem:
Mount a volume from 2 different servers on 2 different clients.
Create a file.
Take a lock on the same file from both clients.
In that case the GNFS server crashed.


Version-Release number of selected component (if applicable):


How reproducible:


Steps to Reproduce:
1. Create disperseVol, a 2 x (4 + 2) disperse volume, and enable MD-Cache and GNFS on it
2. Mount the volume from two different servers on 2 different clients
3. Create a 512-byte file from 1 client on the mount point
4. Take a lock from client 1. The lock is acquired
5. Try taking a lock from client 2. The lock request is blocked (as the lock is already held by client 1)
6. Release the lock from client 1. Take the lock from client 2
7. Again try taking the lock from client 1.

Actual results:
The lock is granted to client 1, which it should not be; that issue is reported in bug https://bugzilla.redhat.com/show_bug.cgi?id=1411338.
The GNFS server crashed.

Expected results:
GNFS should handle locks taken from 2 different clients on the same volume mounted from 2 different servers, without crashing.

Additional info:
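For reference, "taking a lock" in steps 4-7 above boils down to an fcntl() byte-range lock on the NFS mount, which the kernel client turns into NLM4 requests served by the GNFS process. A minimal sketch of what each client does; the path and the 30-second hold time are illustrative assumptions, not taken from the attached scripts:

/* Take an exclusive whole-file POSIX lock on a file on the NFS mount.
 * Over NFSv3 this becomes NLM4 LOCK/UNLOCK traffic to the GNFS server.
 * The path and sleep time below are placeholders. */
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
        int fd = open("/mnt/nfs/file1", O_RDWR);
        if (fd < 0) {
                perror("open");
                return 1;
        }

        struct flock fl = {
                .l_type   = F_WRLCK,    /* exclusive (write) lock */
                .l_whence = SEEK_SET,
                .l_start  = 0,
                .l_len    = 0,          /* 0 = lock the whole file */
        };

        if (fcntl(fd, F_SETLKW, &fl) < 0) {     /* blocks until granted */
                perror("fcntl(F_SETLKW)");
                return 1;
        }
        printf("lock acquired\n");
        sleep(30);                              /* hold the lock for a while */

        fl.l_type = F_UNLCK;
        fcntl(fd, F_SETLK, &fl);                /* release the lock */
        close(fd);
        return 0;
}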

Comment 3 Niels de Vos 2017-01-10 11:02:59 UTC
When you run this test, how often does it crash for you? Is it a one time occurrence, or is it very reliable?

Comment 4 Manisha Saini 2017-01-10 12:08:56 UTC
I hit this same crash twice while running this test scenario.

Comment 5 Manisha Saini 2017-01-10 12:10:19 UTC
Missed some additional info

Version-Release number of selected component (if applicable):
glusterfs-ganesha-3.8.4-10.el7rhgs.x86_64

Comment 6 Niels de Vos 2017-01-10 12:30:35 UTC
Created attachment 1239086 [details]
Test script

While working on the attached test script I managed to get a coredump too. This happened while manually executing the commands I wanted to put in the script. The script itself is now running, has already completed 100+ iterations, and there are still no crashes...

Core was generated by `/usr/sbin/glusterfs -s localhost --volfile-id gluster/nfs -p /var/lib/glusterd/'.
Program terminated with signal 11, Segmentation fault.
#0  __strcmp_sse42 () at ../sysdeps/x86_64/multiarch/strcmp-sse42.S:164
164             movdqu  (%rdi), %xmm1
(gdb) bt
#0  __strcmp_sse42 () at ../sysdeps/x86_64/multiarch/strcmp-sse42.S:164
#1  0x00007fafa65986f2 in nlm_set_rpc_clnt (rpc_clnt=0x7faf8c005200, caller_name=0x0) at nlm4.c:345
#2  0x00007fafa659b1d5 in nlm_rpcclnt_notify (rpc_clnt=0x7faf8c005200, mydata=0x7faf9f66b06c, fn=<optimized out>, data=<optimized out>) at nlm4.c:930
#3  0x00007fafb48a0a84 in rpc_clnt_notify (trans=<optimized out>, mydata=0x7faf8c005230, event=<optimized out>, data=0x7faf8c00cd70) at rpc-clnt.c:994
#4  0x00007fafb489c973 in rpc_transport_notify (this=this@entry=0x7faf8c00cd70, event=event@entry=RPC_TRANSPORT_CONNECT, data=data@entry=0x7faf8c00cd70) at rpc-transport.c:541
#5  0x00007fafa9391c67 in socket_connect_finish (this=0x7faf8c00cd70) at socket.c:2343
#6  0x00007fafa9396315 in socket_event_handler (fd=<optimized out>, idx=10, data=0x7faf8c00cd70, poll_in=0, poll_out=4, poll_err=0) at socket.c:2386
#7  0x00007fafb4b2ece0 in event_dispatch_epoll_handler (event=0x7faf9e568e80, event_pool=0x7fafb545e6e0) at event-epoll.c:571
#8  event_dispatch_epoll_worker (data=0x7fafa0033d50) at event-epoll.c:674
#9  0x00007fafb3937df5 in start_thread (arg=0x7faf9e569700) at pthread_create.c:308
#10 0x00007fafb327e1ad in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:113

(gdb) f 2
#2  0x00007fafa659b1d5 in nlm_rpcclnt_notify (rpc_clnt=0x7faf8c005200, mydata=0x7faf9f66b06c, fn=<optimized out>, data=<optimized out>) at nlm4.c:930
930                     ret = nlm_set_rpc_clnt (rpc_clnt, caller_name);
(gdb) l
925             cs = mydata;
926             caller_name = cs->args.nlm4_lockargs.alock.caller_name;
927
928             switch (fn) {
929             case RPC_CLNT_CONNECT:
930                     ret = nlm_set_rpc_clnt (rpc_clnt, caller_name);
931                     if (ret == -1) {
932                             gf_msg (GF_NLM, GF_LOG_ERROR, 0,
933                                     NFS_MSG_RPC_CLNT_ERROR, "Failed to set "
934                                     "rpc clnt");
(gdb) p cs->args.nlm4_lockargs                                                
$1 = {
  cookie = {
    nlm4_netobj_len = 0, 
    nlm4_netobj_val = 0x0
  }, 
  block = 0, 
  exclusive = 0, 
  alock = {
    caller_name = 0x0, 
    fh = {
      nlm4_netobj_len = 0, 
      nlm4_netobj_val = 0x0
    }, 
    oh = {
      nlm4_netobj_len = 0, 
      nlm4_netobj_val = 0x0
    }, 
    svid = 0, 
    l_offset = 0, 
    l_len = 0
  }, 
  reclaim = 0, 
  state = 0
}


It seems that the nlm4_lockargs are empty... No idea how that can happen, will investigate a little more.

Comment 8 Niels de Vos 2017-01-10 17:09:28 UTC
On the occasions (not sure yet what the trigger is) when the Gluster/NFS service crashes for me, I get the following log:

[2017-01-10 15:19:05.722973] W [socket.c:590:__socket_rwv] 0-NLM-client: readv on 172.31.100.3:58171 failed (No data available)
pending frames:
frame : type(0) op(0)
patchset: git://git.gluster.com/glusterfs.git
signal received: 11
time of crash: 
2017-01-10 15:19:16
configuration details:
argp 1
backtrace 1
dlfcn 1
libpthread 1
llistxattr 1
setfsid 1
spinlock 1
epoll.h 1
xattr.h 1
st_atim.tv_nsec 1
package-string: glusterfs 3.8.5
/lib64/libglusterfs.so.0(_gf_msg_backtrace_nomem+0xc2)[0x7f831d790592]
/lib64/libglusterfs.so.0(gf_print_trace+0x324)[0x7f831d79a004]
/lib64/libc.so.6(+0x35650)[0x7f831be78650]
/lib64/libc.so.6(+0x131f26)[0x7f831bf74f26]
/usr/lib64/glusterfs/3.8.5/xlator/nfs/server.so(+0x3f6f2)[0x7f830f2536f2]
/usr/lib64/glusterfs/3.8.5/xlator/nfs/server.so(+0x421d5)[0x7f830f2561d5]
/lib64/libgfrpc.so.0(rpc_clnt_notify+0x214)[0x7f831d55ba84]
/lib64/libgfrpc.so.0(rpc_transport_notify+0x23)[0x7f831d557973]
/usr/lib64/glusterfs/3.8.5/rpc-transport/socket.so(+0x4c67)[0x7f831204cc67]
/usr/lib64/glusterfs/3.8.5/rpc-transport/socket.so(+0x9315)[0x7f8312051315]
/lib64/libglusterfs.so.0(+0x82ce0)[0x7f831d7e9ce0]
/lib64/libpthread.so.0(+0x7df5)[0x7f831c5f2df5]
/lib64/libc.so.6(clone+0x6d)[0x7f831bf391ad]


It might be that __socket_rwv() calls the notify functions even when there is no data. This would be a connect to the NLM port with a packet that at least contains enough information to identify it as an NLM procedure. Unfortunately I have not succeeded in reproducing the problem while capturing a tcpdump...

Comment 10 Manisha Saini 2017-01-12 07:02:02 UTC
As discussed in the bug triage meeting, I tried the same scenario with MD-Cache disabled.

With MD-Cache disabled as well, the same GNFS crash was observed.

# tailf /var/log/glusterfs/nfs.log

[2017-01-12 06:57:29.337698] W [socket.c:590:__socket_rwv] 0-NLM-client: readv on 10.70.37.142:40829 failed (No data available)
pending frames:
frame : type(0) op(0)
patchset: git://git.gluster.com/glusterfs.git
signal received: 11
time of crash: 
2017-01-12 06:57:40
configuration details:
argp 1
backtrace 1
dlfcn 1
libpthread 1
llistxattr 1
setfsid 1
spinlock 1
epoll.h 1
xattr.h 1
st_atim.tv_nsec 1
package-string: glusterfs 3.8.4
/lib64/libglusterfs.so.0(_gf_msg_backtrace_nomem+0xc2)[0x7ff93eb51b92]
/lib64/libglusterfs.so.0(gf_print_trace+0x324)[0x7ff93eb5b624]
/lib64/libc.so.6(+0x35250)[0x7ff93d235250]
/lib64/libc.so.6(+0x133586)[0x7ff93d333586]
/usr/lib64/glusterfs/3.8.4/xlator/nfs/server.so(+0x3f742)[0x7ff93033d742]
/usr/lib64/glusterfs/3.8.4/xlator/nfs/server.so(+0x42235)[0x7ff930340235]
/lib64/libgfrpc.so.0(rpc_clnt_notify+0x232)[0x7ff93e91ba12]
/lib64/libgfrpc.so.0(rpc_transport_notify+0x23)[0x7ff93e917893]
/usr/lib64/glusterfs/3.8.4/rpc-transport/socket.so(+0x5167)[0x7ff933407167]
/usr/lib64/glusterfs/3.8.4/rpc-transport/socket.so(+0x9738)[0x7ff93340b738]
/lib64/libglusterfs.so.0(+0x835b0)[0x7ff93ebab5b0]
/lib64/libpthread.so.0(+0x7dc5)[0x7ff93d9b2dc5]
/lib64/libc.so.6(clone+0x6d)[0x7ff93d2f773d]


# showmount -e localhost
clnt_create: RPC: Program not registered

Comment 11 Niels de Vos 2017-01-12 15:33:16 UTC
Created attachment 1239987 [details]
Test script that causes Gluster/NFS to crash in NLM code

I am able to crash Gluster/NFS pretty reliably now with the attached script. The following actions are taken:

1. mount a volume from server_a
2. take a whole-file lock (and keep it held in the background)
3. mount the same volume from server_b
4. try to take the whole-file lock there too
5. cancel the 2nd lock request (a sketch of this step follows at the end of this comment)
6. release the lock obtained in step 2
7. try to get the lock again
8. the lock is granted and Gluster/NFS segfaults

It seems to take somewhere between 10 and 15 minutes before step 8 happens. Example output with timings:

# ./bz1411344_cancel.sh 
[A] trying to get lock: Thu Jan 12 16:06:42 CET 2017
[A] got lock: Thu Jan 12 16:06:42 CET 2017
[B] trying to get lock: Thu Jan 12 16:06:43 CET 2017
[B] cancelling lock: Thu Jan 12 16:06:45 CET 2017
[A] released lock: Thu Jan 12 16:07:02 CET 2017
[A] trying to get lock: Thu Jan 12 16:07:04 CET 2017
[A] got lock: Thu Jan 12 16:18:54 CET 2017


The environment for this script:
- two Gluster servers for mounting the NFS export (can be a single brick volume)
- single NFS-client, but with two different mountpoints
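The cancel in step 5 is the interesting part: when a blocking fcntl() request is interrupted, the NFS client sends an NLM4 CANCEL for the outstanding lock. A rough sketch of that step, under the assumption that the attached script uses fcntl-style locking on the second mountpoint; the path and the 2-second timeout are illustrative:

/* Sketch of steps 4-5: a blocking lock request on the second mountpoint
 * that is cancelled after a couple of seconds. Interrupting F_SETLKW with
 * a signal makes the NFS client send an NLM4 CANCEL for the pending lock. */
#include <errno.h>
#include <fcntl.h>
#include <signal.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

static void on_alarm(int sig) { (void)sig; /* only interrupts fcntl() */ }

int main(void)
{
        int fd = open("/mnt/nfs-server-b/file1", O_RDWR);  /* 2nd mountpoint */
        if (fd < 0) {
                perror("open");
                return 1;
        }

        struct sigaction sa;
        memset(&sa, 0, sizeof(sa));
        sa.sa_handler = on_alarm;       /* no SA_RESTART, so fcntl() returns EINTR */
        sigaction(SIGALRM, &sa, NULL);
        alarm(2);                       /* give up on the lock after 2 seconds */

        struct flock fl = {
                .l_type   = F_WRLCK,
                .l_whence = SEEK_SET,
                .l_start  = 0,
                .l_len    = 0,
        };
        if (fcntl(fd, F_SETLKW, &fl) < 0 && errno == EINTR)
                printf("[B] lock request cancelled\n");

        close(fd);
        return 0;
}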

Comment 12 Niels de Vos 2017-01-12 15:35:02 UTC
Manisha, I am not sure if you did a lock-cancel in your test. Without it I am not able to reproduce the segfault. Could you please confirm the exact steps that you did to get the crashes?

Comment 13 Manisha Saini 2017-01-13 06:56:42 UTC
I used the attached script to take the lock. Below are the steps:

Steps:
1. Mount a volume from server1 on client1
2. Create a 512-byte file, say file1, using dd from client1
3. Mount the same volume from server2 on client2
4. Take a lock from client1 on file1 --> lock is granted
5. Try taking a lock from client2 on file1 --> the lock request is blocked (not granted)
6. Release the lock from client1
7. Take the lock from client2 on file1 --> lock is granted
8. Try taking the lock from client1 on file1 --> lock is granted (which it should not be; for this I filed bug https://bugzilla.redhat.com/show_bug.cgi?id=1411338)
9. Release the lock from client2 and client1

As you mentioned in comment 11, it takes around 5-10 minutes for GNFS to crash after these steps.

Comment 14 Manisha Saini 2017-01-13 06:58:08 UTC
Created attachment 1240221 [details]
Locking script

Comment 15 Manisha Saini 2017-01-16 13:37:02 UTC
Observed the same crash while taking an exclusive write lock from client1 on an existing file and, at the same time, trying to overwrite the same file from client2.


(gdb) bt
#0  0x00007fcdc2a1f586 in __strcmp_sse42 () from /lib64/libc.so.6
#1  0x00007fcdb5a29742 in nlm_set_rpc_clnt () from /usr/lib64/glusterfs/3.8.4/xlator/nfs/server.so
#2  0x00007fcdb5a2c235 in nlm_rpcclnt_notify () from /usr/lib64/glusterfs/3.8.4/xlator/nfs/server.so
#3  0x00007fcdc4007a12 in rpc_clnt_notify () from /lib64/libgfrpc.so.0
#4  0x00007fcdc4003893 in rpc_transport_notify () from /lib64/libgfrpc.so.0
#5  0x00007fcdb8af3167 in socket_connect_finish ()
   from /usr/lib64/glusterfs/3.8.4/rpc-transport/socket.so
#6  0x00007fcdb8af7738 in socket_event_handler ()
   from /usr/lib64/glusterfs/3.8.4/rpc-transport/socket.so
#7  0x00007fcdc42975b0 in event_dispatch_epoll_worker () from /lib64/libglusterfs.so.0
#8  0x00007fcdc309edc5 in start_thread () from /lib64/libpthread.so.0
#9  0x00007fcdc29e373d in clone () from /lib64/libc.so.6

Comment 19 Manisha Saini 2017-01-27 07:20:47 UTC
Observing the same crash as reported in this bug with the cthon locking test suite.

pending frames:
frame : type(0) op(0)
patchset: git://git.gluster.com/glusterfs.git
signal received: 11
time of crash: 
2017-01-27 06:59:06
configuration details:
argp 1
backtrace 1
dlfcn 1
libpthread 1
llistxattr 1
setfsid 1
spinlock 1
epoll.h 1
xattr.h 1
st_atim.tv_nsec 1
package-string: glusterfs 3.8.4
/lib64/libglusterfs.so.0(_gf_msg_backtrace_nomem+0xc2)[0x7f0645675b92]
/lib64/libglusterfs.so.0(gf_print_trace+0x324)[0x7f064567f624]
/lib64/libc.so.6(+0x35250)[0x7f0643d59250]
/lib64/libc.so.6(+0x133586)[0x7f0643e57586]
/usr/lib64/glusterfs/3.8.4/xlator/nfs/server.so(+0x3f742)[0x7f06370ca742]
/usr/lib64/glusterfs/3.8.4/xlator/nfs/server.so(+0x42235)[0x7f06370cd235]
/lib64/libgfrpc.so.0(rpc_clnt_notify+0x232)[0x7f064543fa12]
/lib64/libgfrpc.so.0(rpc_transport_notify+0x23)[0x7f064543b893]
/usr/lib64/glusterfs/3.8.4/rpc-transport/socket.so(+0x5167)[0x7f0639f2b167]
/usr/lib64/glusterfs/3.8.4/rpc-transport/socket.so(+0x9738)[0x7f0639f2f738]
/lib64/libglusterfs.so.0(+0x835b0)[0x7f06456cf5b0]
/lib64/libpthread.so.0(+0x7dc5)[0x7f06444d6dc5]
/lib64/libc.so.6(clone+0x6d)[0x7f0643e1b73d]
---------


========

(gdb) bt
#0  0x00007f0643e57586 in __strcmp_sse42 () from /lib64/libc.so.6
#1  0x00007f06370ca742 in nlm_set_rpc_clnt () from /usr/lib64/glusterfs/3.8.4/xlator/nfs/server.so
#2  0x00007f06370cd235 in nlm_rpcclnt_notify () from /usr/lib64/glusterfs/3.8.4/xlator/nfs/server.so
#3  0x00007f064543fa12 in rpc_clnt_notify () from /lib64/libgfrpc.so.0
#4  0x00007f064543b893 in rpc_transport_notify () from /lib64/libgfrpc.so.0
#5  0x00007f0639f2b167 in socket_connect_finish () from /usr/lib64/glusterfs/3.8.4/rpc-transport/socket.so
#6  0x00007f0639f2f738 in socket_event_handler () from /usr/lib64/glusterfs/3.8.4/rpc-transport/socket.so
#7  0x00007f06456cf5b0 in event_dispatch_epoll_worker () from /lib64/libglusterfs.so.0
#8  0x00007f06444d6dc5 in start_thread () from /lib64/libpthread.so.0
#9  0x00007f0643e1b73d in clone () from /lib64/libc.so.6

===============

Comment 20 Manisha Saini 2017-01-27 07:22:23 UTC
The cthon test suite passes.
But after completion of the test, GNFS crashes on the server node from which the volume is mounted.

Comment 22 Jules 2017-02-05 09:57:10 UTC
Even without locking (I've set the NFS client mounts to nolock,local_lock=none) the NFS server crashes under some workloads. It has become even worse since I upgraded from 3.7.8 to 3.9.1-1.

I submitted a bug report for this a long time ago already:
https://bugzilla.redhat.com/show_bug.cgi?id=1381970

Comment 24 Giuseppe Ragusa 2017-02-17 09:34:53 UTC
Given that the bug has been confirmed present even with locking out of the picture in comment #22 above (a condition that I'm not able to test/replicate ATM, unfortunately), would it be possible that the following is involved?

https://review.gluster.org/#/c/16613/

Comment 25 Niels de Vos 2017-02-17 10:31:28 UTC
(In reply to Giuseppe Ragusa from comment #24)
> Given that the bug has been confirmed present even with locking out of the
> picture in comment #22 above (a condition that I'm not able to
> test/replicate ATM, unfortunately), would it be possible that the following
> is involved?
> 
> https://review.gluster.org/#/c/16613/

For any of the segfaults that have references to 'nlm' in the backtrace, locking over NFS has not been completely disabled.

You would need to remove all nlockmgr services from the 'rpcbind' registrations on both the client and the server. This can be checked with

    # rpcinfo | grep nlockmgr

on the client and server, and registrations can be removed with

    # rpcinfo -d 100021 1
    # rpcinfo -d 100021 4

This will cause all locking related requests initiated from both the server and the client to (silently?) fail.

Comment 26 Jules 2017-02-17 10:49:29 UTC
(In reply to Niels de Vos from comment #25)
> (In reply to Giuseppe Ragusa from comment #24)
> > Given that the bug has been confirmed present even with locking out of the
> > picture in comment #22 above (a condition that I'm not able to
> > test/replicate ATM, unfortunately), would it be possible that the following
> > is involved?
> > 
> > https://review.gluster.org/#/c/16613/
> 
> For any of the segfaults that have references to 'nlm' in the backtrace,
> locking over NFS has not been completely disabled.
> 
> It would be needed to remove all nlockmgr services from both the client and
> server registrations at 'rpcbind'. This can be checked with
> 
>     # rpcinfo | grep nlockmgr
> 
> on the client and server, and registrations can be removed with
> 
>     # rpcinfo -d 100021 1
>     # rpcinfo -d 100021 4
> 
> This will cause all locking related requests initiated from both the server
> and the client to (silently?) fail.

Sounds conclusive. But one thing I don't understand: if I force the NFS client not to send any locks by setting the mount options nolock,local_lock=none, and I also don't see any locks on the mounts when checking via lsof, where should the write locks come from?

Comment 27 Niels de Vos 2017-02-17 11:21:35 UTC
(In reply to Jules from comment #26)
> (In reply to Niels de Vos from comment #25)
> > (In reply to Giuseppe Ragusa from comment #24)
> > > Given that the bug has been confirmed present even with locking out of the
> > > picture in comment #22 above (a condition that I'm not able to
> > > test/replicate ATM, unfortunately), would it be possible that the following
> > > is involved?
> > > 
> > > https://review.gluster.org/#/c/16613/
> > 
> > For any of the segfaults that have references to 'nlm' in the backtrace,
> > locking over NFS has not been completely disabled.
> > 
> > It would be needed to remove all nlockmgr services from both the client and
> > server registrations at 'rpcbind'. This can be checked with
> > 
> >     # rpcinfo | grep nlockmgr
> > 
> > on the client and server, and registrations can be removed with
> > 
> >     # rpcinfo -d 100021 1
> >     # rpcinfo -d 100021 4
> > 
> > This will cause all locking related requests initiated from both the server
> > and the client to (silently?) fail.
> 
> Sounds conclusive. But one thing I don't understand is, if I force the nfs
> client to  not sending any locks by setting mount options: no_lock,
> local_lock=none and I also don't see any locks on the mounts checking via
> lsof, were do the write locks should come from?

Upon mounting, the NFS-server initiates a callback channel to the NLM service running on the client. It will still try to track certain events in the NLM part of the NFS-server. When these connections are set up, the NFS-server will try to use them, and this may still cause the segfaults.

Comment 30 surabhi 2017-04-19 09:25:52 UTC
Niels,

Please provide the rpm's with the fixes. We shall be able to test it and confirm if the issue is fixed.

Comment 32 Manisha Saini 2017-04-21 14:37:57 UTC
Hey Niels,

I tried verifying this bug with the RPMs given in comment #16.

Created a 2 x (4 + 2) Distributed-Disperse Volume.
Started the volume.

While enabling GNFS on the EC volume, GNFS crashed.

# gluster v set disperseVol nfs.disable off
Gluster NFS is being deprecated in favor of NFS-Ganesha Enter "yes" to continue using Gluster NFS (y/n) y
volume set: success

# showmount -e localhost
clnt_create: RPC: Program not registered


In nfs.log:
--------------
[2017-04-21 14:31:23.448381] I [MSGID: 100030] [glusterfsd.c:2455:main] 0-/usr/sbin/glusterfs: Started running /usr/sbin/glusterfs version 3.10dev (args: /usr/sbin/glusterfs -s localhost --volfile-id gluster/nfs -p /var/lib/glusterd/nfs/run/nfs.pid -l /var/log/glusterfs/nfs.log -S /var/run/gluster/a791065172835cf260dc66c9aa9e0a7e.socket)
[2017-04-21 14:31:23.467324] I [MSGID: 101190] [event-epoll.c:628:event_dispatch_epoll_worker] 0-epoll: Started thread with index 1
[2017-04-21 14:31:23.467388] E [socket.c:2307:socket_connect_finish] 0-glusterfs: connection to ::1:24007 failed (Connection refused)
[2017-04-21 14:31:23.467439] E [glusterfsd-mgmt.c:1931:mgmt_rpc_notify] 0-glusterfsd-mgmt: failed to connect with remote-host: localhost (Transport endpoint is not connected)
[2017-04-21 14:31:27.462129] I [rpcsvc.c:2208:rpcsvc_set_outstanding_rpc_limit] 0-rpc-service: Configured rpc.outstanding-rpc-limit with value 16
[2017-04-21 14:31:27.490973] I [MSGID: 112153] [mount3.c:3992:mnt3svc_init] 0-nfs-mount: Exports auth has been disabled!
[2017-04-21 14:31:27.560772] I [rpc-drc.c:689:rpcsvc_drc_init] 0-rpc-service: DRC is turned OFF
[2017-04-21 14:31:27.560822] I [MSGID: 112110] [nfs.c:1524:init] 0-nfs: NFS service started
[2017-04-21 14:31:27.574268] I [MSGID: 122067] [ec-code.c:896:ec_code_detect] 0-disperseVol-disperse-1: Using 'sse' CPU extensions
pending frames:
frame : type(0) op(0)
patchset: git://git.gluster.org/glusterfs.git
signal received: 11
time of crash: 
2017-04-21 14:31:27
configuration details:
argp 1
backtrace 1
dlfcn 1
libpthread 1
llistxattr 1
setfsid 1
spinlock 1
epoll.h 1
xattr.h 1
st_atim.tv_nsec 1
package-string: glusterfs 3.10dev
/lib64/libglusterfs.so.0(_gf_msg_backtrace_nomem+0xa0)[0x7f1e244d1e90]
/lib64/libglusterfs.so.0(gf_print_trace+0x324)[0x7f1e244db794]
/lib64/libc.so.6(+0x35250)[0x7f1e22bb6250]
/usr/lib64/glusterfs/3.10dev/xlator/cluster/disperse.so(+0x34dc1)[0x7f1e16839dc1]
/usr/lib64/glusterfs/3.10dev/xlator/cluster/disperse.so(+0x33964)[0x7f1e16838964]
/usr/lib64/glusterfs/3.10dev/xlator/cluster/disperse.so(init+0x1f4)[0x7f1e16812184]
/lib64/libglusterfs.so.0(xlator_init+0x50)[0x7f1e244cf5f0]
/lib64/libglusterfs.so.0(glusterfs_graph_init+0x29)[0x7f1e24506c29]
/lib64/libglusterfs.so.0(glusterfs_graph_activate+0x3b)[0x7f1e2450754b]
/usr/sbin/glusterfs(glusterfs_process_volfp+0x12d)[0x7f1e249cb58d]
/usr/sbin/glusterfs(mgmt_getspec_cbk+0x3c1)[0x7f1e249d0f51]
/lib64/libgfrpc.so.0(rpc_clnt_handle_reply+0x90)[0x7f1e2429c720]
/lib64/libgfrpc.so.0(rpc_clnt_notify+0x1df)[0x7f1e2429c9ff]
/lib64/libgfrpc.so.0(rpc_transport_notify+0x23)[0x7f1e242988e3]
/usr/lib64/glusterfs/3.10dev/rpc-transport/socket.so(+0x72f4)[0x7f1e18d862f4]
/usr/lib64/glusterfs/3.10dev/rpc-transport/socket.so(+0x9795)[0x7f1e18d88795]
/lib64/libglusterfs.so.0(+0x846a0)[0x7f1e2452c6a0]
/lib64/libpthread.so.0(+0x7dc5)[0x7f1e23333dc5]
/lib64/libc.so.6(clone+0x6d)[0x7f1e22c7873d]

----------------

# gluster v info
 
Volume Name: disperseVol
Type: Distributed-Disperse
Volume ID: 9c4e5c69-6a53-4ba3-877f-20536bc2f0d2
Status: Started
Snapshot Count: 0
Number of Bricks: 2 x (4 + 2) = 12
Transport-type: tcp
Bricks:
Brick1: 10.70.47.47:/gluster/brick1/1
Brick2: 10.70.46.38:/gluster/brick1/1
Brick3: 10.70.46.109:/gluster/brick1/1
Brick4: 10.70.46.56:/gluster/brick1/1
Brick5: 10.70.47.126:/gluster/brick1/1
Brick6: 10.70.46.135:/gluster/brick1/1
Brick7: 10.70.47.47:/gluster/brick2/2
Brick8: 10.70.46.38:/gluster/brick2/2
Brick9: 10.70.46.109:/gluster/brick2/2
Brick10: 10.70.46.56:/gluster/brick2/2
Brick11: 10.70.47.126:/gluster/brick2/1
Brick12: 10.70.46.135:/gluster/brick2/1
Options Reconfigured:
transport.address-family: inet
performance.readdir-ahead: on
nfs.disable: off

---------

# rpm -qa | grep gluster
glusterfs-libs-3.10dev-0.323.gitdbbd275.el7.centos.x86_64
glusterfs-api-3.10dev-0.323.gitdbbd275.el7.centos.x86_64
glusterfs-fuse-3.10dev-0.323.gitdbbd275.el7.centos.x86_64
glusterfs-resource-agents-3.10dev-0.323.gitdbbd275.el7.centos.noarch
glusterfs-rdma-3.10dev-0.323.gitdbbd275.el7.centos.x86_64
glusterfs-regression-tests-3.10dev-0.323.gitdbbd275.el7.centos.x86_64
glusterfs-3.10dev-0.323.gitdbbd275.el7.centos.x86_64
glusterfs-client-xlators-3.10dev-0.323.gitdbbd275.el7.centos.x86_64
python-gluster-3.10dev-0.323.gitdbbd275.el7.centos.noarch
nfs-ganesha-gluster-2.4.4-4.el7rhgs.x86_64
glusterfs-extra-xlators-3.10dev-0.323.gitdbbd275.el7.centos.x86_64
glusterfs-cli-3.10dev-0.323.gitdbbd275.el7.centos.x86_64
glusterfs-api-devel-3.10dev-0.323.gitdbbd275.el7.centos.x86_64
glusterfs-ganesha-3.10dev-0.323.gitdbbd275.el7.centos.x86_64
glusterfs-geo-replication-3.10dev-0.323.gitdbbd275.el7.centos.x86_64
glusterfs-debuginfo-3.10dev-0.323.gitdbbd275.el7.centos.x86_64
glusterfs-server-3.10dev-0.323.gitdbbd275.el7.centos.x86_64
glusterfs-devel-3.10dev-0.323.gitdbbd275.el7.centos.x86_64
glusterfs-events-3.10dev-0.323.gitdbbd275.el7.centos.x86_64

Comment 34 Manisha Saini 2017-05-03 12:47:01 UTC
Niels,


With the test build provided in comment #33, I created the EC volume and mounted it over GNFS, then ran the cthon lock test suite.

The cthon test passed and no crash was observed.

I could not try the scenario mentioned in this bug due to
Bug 1444515 - [GNFS+EC] Unable to release the lock when the other client tries to acquire the lock on the same file.

But the crash reported in this bug was also observed earlier with the cthon lock test suite, and in this test build that appears to be fixed.

Comment 35 Niels de Vos 2017-05-05 12:37:00 UTC
Rebased patches have been sent for review for the Gluster Community master branch: https://review.gluster.org/#/q/topic:bug-1381970

Once these are merged, I'll backport them to the other branches and RHGS repository.

Comment 38 Manisha Saini 2017-06-19 15:32:47 UTC
Hey Niels,

I tried the use case of taking a lock from 2 clients on 1 file on a dist-replicate volume. NFS crashes.

# rpm -qa | grep gluster
glusterfs-3.8.4-28.el7rhgs.x86_64
glusterfs-server-3.8.4-28.el7rhgs.x86_64
python-gluster-3.8.4-28.el7rhgs.noarch
vdsm-gluster-4.17.33-1.1.el7rhgs.noarch
glusterfs-client-xlators-3.8.4-28.el7rhgs.x86_64
glusterfs-fuse-3.8.4-28.el7rhgs.x86_64
glusterfs-geo-replication-3.8.4-28.el7rhgs.x86_64
gluster-nagios-common-0.2.4-1.el7rhgs.noarch
glusterfs-libs-3.8.4-28.el7rhgs.x86_64
glusterfs-api-3.8.4-28.el7rhgs.x86_64
glusterfs-cli-3.8.4-28.el7rhgs.x86_64
libvirt-daemon-driver-storage-gluster-3.2.0-4.el7.x86_64
gluster-nagios-addons-0.2.9-1.el7rhgs.x86_64
glusterfs-rdma-3.8.4-28.el7rhgs.x86_64



ify?! [Invalid argument]
[2017-06-19 13:08:46.117375] W [socket.c:595:__socket_rwv] 0-NLM-client: readv on 10.70.37.142:34033 failed (No data available)
[2017-06-19 13:08:46.117529] W [socket.c:595:__socket_rwv] 0-NLM-client: readv on 10.70.37.142:34033 failed (No data available)
pending frames:
frame : type(0) op(0)
frame : type(0) op(0)
patchset: git://git.gluster.com/glusterfs.git
signal received: 11
time of crash: 
2017-06-19 13:08:48
configuration details:
argp 1
backtrace 1
dlfcn 1
libpthread 1
llistxattr 1
setfsid 1
spinlock 1
epoll.h 1
xattr.h 1
st_atim.tv_nsec 1
package-string: glusterfs 3.8.4
/lib64/libglusterfs.so.0(_gf_msg_backtrace_nomem+0xc2)[0x7f6ec83b54b2]
/lib64/libglusterfs.so.0(gf_print_trace+0x324)[0x7f6ec83befe4]
/lib64/libc.so.6(+0x35270)[0x7f6ec6a1e270]
/lib64/libc.so.6(+0x165921)[0x7f6ec6b4e921]
/usr/lib64/glusterfs/3.8.4/xlator/nfs/server.so(+0x3f9aa)[0x7f6eba13a9aa]
/usr/lib64/glusterfs/3.8.4/xlator/nfs/server.so(+0x42349)[0x7f6eba13d349]
/lib64/libgfrpc.so.0(rpc_clnt_notify+0x214)[0x7f6ec817eb54]
/lib64/libgfrpc.so.0(rpc_transport_notify+0x23)[0x7f6ec817a9e3]
/usr/lib64/glusterfs/3.8.4/rpc-transport/socket.so(+0x51d7)[0x7f6ebcfa71d7]
/usr/lib64/glusterfs/3.8.4/rpc-transport/socket.so(+0x9918)[0x7f6ebcfab918]
/lib64/libglusterfs.so.0(+0x849d6)[0x7f6ec840f9d6]
/lib64/libpthread.so.0(+0x7e25)[0x7f6ec7214e25]
/lib64/libc.so.6(clone+0x6d)[0x7f6ec6ae134d]


Volume Name: GNFSVol
Type: Distributed-Replicate
Volume ID: 6e809852-79b2-4348-b051-df4d837ab585
Status: Started
Snapshot Count: 0
Number of Bricks: 6 x 2 = 12
Transport-type: tcp
Bricks:
Brick1: dhcp42-1.lab.eng.blr.redhat.com:/gluster/brick1/1
Brick2: dhcp41-251.lab.eng.blr.redhat.com:/gluster/brick1/1
Brick3: dhcp41-248.lab.eng.blr.redhat.com:/gluster/brick1/1
Brick4: dhcp41-245.lab.eng.blr.redhat.com:/gluster/brick1/1
Brick5: dhcp42-1.lab.eng.blr.redhat.com:/gluster/brick2/1
Brick6: dhcp41-251.lab.eng.blr.redhat.com:/gluster/brick2/1
Brick7: dhcp41-248.lab.eng.blr.redhat.com:/gluster/brick2/1
Brick8: dhcp41-245.lab.eng.blr.redhat.com:/gluster/brick2/1
Brick9: dhcp42-1.lab.eng.blr.redhat.com:/gluster/brick3/1
Brick10: dhcp41-251.lab.eng.blr.redhat.com:/gluster/brick3/1
Brick11: dhcp41-248.lab.eng.blr.redhat.com:/gluster/brick3/1
Brick12: dhcp41-245.lab.eng.blr.redhat.com:/gluster/brick3/1
Options Reconfigured:
transport.address-family: inet
nfs.disable: off




I ran the case on a dist-replicate volume because of the already-known issues with EC volumes.

Comment 40 Manisha Saini 2017-06-20 07:42:26 UTC
Based on comment #38, moving it to the assigned state.

Comment 41 Manisha Saini 2017-06-20 09:37:47 UTC
This can easily be reproduced with the cthon lock test suite.
Earlier, in 3.2, GNFS used to crash after completion of the cthon test suite (the test suite itself used to pass).

Now GNFS crashes while the cthon test suite is running, during test 12 itself.


Test #11 - Make sure close() releases the process's locks.
	Parent: 11.0  - F_TLOCK [               0,          ENDING] PASSED.
	Parent: Closed testfile.
	Child:  11.1  - F_TLOCK [               0,          ENDING] PASSED.
	Child:  11.2  - F_ULOCK [               0,          ENDING] PASSED.
	Parent: 11.3  - F_TLOCK [              1d,             5b7] PASSED.
	Parent: 11.4  - F_TLOCK [            2000,              57] PASSED.
	Parent: Closed testfile.
	Child:  11.5  - F_TLOCK [               0,          ENDING] PASSED.
	Child:  11.6  - F_ULOCK [               0,          ENDING] PASSED.
	Parent: Wrote '123456789abcdef' to testfile [ 0, 16 ].
	Parent: 11.7  - F_TLOCK [               0,          ENDING] PASSED.
	Parent: Wrote '123456789abcdef' to testfile [ 13, 16 ].
	Parent: Closed testfile.
	Child:  11.8  - F_TLOCK [               0,          ENDING] PASSED.
	Child:  11.9  - F_ULOCK [               0,          ENDING] PASSED.
	Parent: Wrote '123456789abcdef' to testfile [ 0, 16 ].
	Parent: 11.10 - F_TLOCK [               0,          ENDING] PASSED.
	Parent: Truncated testfile.
	Parent: Closed testfile.
	Child:  11.11 - F_TLOCK [               0,          ENDING] PASSED.
	Child:  11.12 - F_ULOCK [               0,          ENDING] PASSED.

Test #12 - Signalled process should release locks.
	Child:  12.0  - F_TLOCK [               0,          ENDING] PASSED.
	Parent: Killed child process.

Comment 44 Niels de Vos 2017-06-20 15:31:57 UTC
Depending on which procedure the re-connect and re-try is being done for, the following '.nlm4_lockargs' structure does not contain valid information:

xlators/nfs/server/src/nlm4.c:
 932         case RPC_CLNT_CONNECT:
 933                 if (!cs->req) {
 934                         gf_msg (GF_NLM, GF_LOG_ERROR, EINVAL,
 935                                 NFS_MSG_RPC_CLNT_ERROR, "Spurious notify?!");
 936                         goto err;
 937                 }
 938 
 939                 caller_name = cs->args.nlm4_lockargs.alock.caller_name;
 940                 ret = nlm_set_rpc_clnt (rpc_clnt, caller_name);

Further on in nlm_set_rpc_clnt() a strncmp(caller_name, ...) is done, but caller_name may point to garbage.

Depending on the procedure that is executed on the re-connect, the details will be in one of the 'args' union members:

from nfs3.h:
152 typedef union args_ {
153         nlm4_stat nlm4_stat;
154         nlm4_holder nlm4_holder;
155         nlm4_lock nlm4_lock;
156         nlm4_share nlm4_share;
157         nlm4_testrply nlm4_testrply;
158         nlm4_testres nlm4_testres;
159         nlm4_testargs nlm4_testargs;
160         nlm4_res nlm4_res;
161         nlm4_lockargs nlm4_lockargs;
162         nlm4_cancargs nlm4_cancargs;
163         nlm4_unlockargs nlm4_unlockargs;
164         nlm4_shareargs nlm4_shareargs;
165         nlm4_shareres nlm4_shareres;
166         nlm4_freeallargs nlm4_freeallargs;
...
171 } args;

Comment 45 Niels de Vos 2017-06-20 15:41:42 UTC
It seems that the procedure that is executed can only be accessed through cs->req->procnum. The value will need to be matched against these from nlm4.h:
 29 #define NLM4_NULL          0
 30 #define NLM4_TEST          1
 31 #define NLM4_LOCK          2
 32 #define NLM4_CANCEL        3
 33 #define NLM4_UNLOCK        4
 34 #define NLM4_GRANTED       5
 35 #define NLM4_TEST_MSG      6
 36 #define NLM4_LOCK_MSG      7
 37 #define NLM4_CANCEL_MSG    8
 38 #define NLM4_UNLOCK_MSG    9
 39 #define NLM4_GRANTED_MSG   10
 40 #define NLM4_TEST_RES      11
 41 #define NLM4_LOCK_RES      12
 42 #define NLM4_CANCEL_RES    13
 43 #define NLM4_UNLOCK_RES    14
 44 #define NLM4_GRANTED_RES   15
 45 #define NLM4_SM_NOTIFY     16
 46 #define NLM4_SEVENTEEN     17
 47 #define NLM4_EIGHTEEN      18
 48 #define NLM4_NINETEEN      19
 49 #define NLM4_SHARE         20
 50 #define NLM4_UNSHARE       21
 51 #define NLM4_NM_LOCK       22
 52 #define NLM4_FREE_ALL      23
 53 #define NLM4_PROC_COUNT    24

Not all of these are implemented, so the 'union args_' from the previous comment will need to be used for reference.
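Purely to illustrate the problem: a notify handler that needs a valid caller_name would have to look at cs->req->procnum first and read the matching member of the args union, instead of unconditionally using nlm4_lockargs. A rough sketch along those lines follows; this is not the actual upstream fix (see comment 50 for the real patches), and it assumes the standard NLM4 XDR layout where the testargs/lockargs/cancargs/unlockargs structures each embed an 'alock' with a caller_name:

/* Illustrative only: pick the caller_name that matches the procedure
 * being retried, instead of unconditionally reading nlm4_lockargs.
 * Assumes the usual NLM4 XDR definitions for the args members. */
static char *
nlm_caller_name_for_proc (nfs3_call_state_t *cs)
{
        switch (cs->req->procnum) {
        case NLM4_TEST:
        case NLM4_TEST_MSG:
                return cs->args.nlm4_testargs.alock.caller_name;
        case NLM4_LOCK:
        case NLM4_LOCK_MSG:
                return cs->args.nlm4_lockargs.alock.caller_name;
        case NLM4_CANCEL:
        case NLM4_CANCEL_MSG:
                return cs->args.nlm4_cancargs.alock.caller_name;
        case NLM4_UNLOCK:
        case NLM4_UNLOCK_MSG:
                return cs->args.nlm4_unlockargs.alock.caller_name;
        default:
                return NULL;    /* no alock for this procedure */
        }
}

A NULL return would still have to be handled by the caller; in the crash from comment 6, caller_name is exactly such a NULL pointer that ends up in strcmp().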

Comment 50 Niels de Vos 2017-07-04 20:06:23 UTC
Posted changes for upstream:

nfs: make nfs3_call_state_t refcounted
- https://review.gluster.org/17696

nfs/nlm: unref fds in nlm_client_free()
- https://review.gluster.org/17697

nfs/nlm: handle reconnect for non-NLM4_LOCK requests
- https://review.gluster.org/17698

nfs/nlm: use refcounting for nfs3_call_state_t
- https://review.gluster.org/17699

nfs/nlm: keep track of the call-state and frame for notifications
- https://review.gluster.org/17700
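For reference, the refcounting introduced by the first and fourth patches follows the usual pattern of an atomic counter with get/put helpers, so the call-state stays alive for as long as any connection notification may still dereference it. A generic sketch of that pattern, not the actual glusterfs helpers (whose names and locking primitives differ):

/* Generic refcount sketch: keep a call-state alive until the last user
 * drops its reference. Mirrors the idea of the patches above, but is not
 * the real nfs3_call_state_t code. */
#include <stdatomic.h>
#include <stdlib.h>

struct call_state {
        atomic_int refcount;
        /* ... request data, lock args, frame, etc. ... */
};

static struct call_state *
call_state_new (void)
{
        struct call_state *cs = calloc (1, sizeof (*cs));
        if (cs)
                atomic_init (&cs->refcount, 1);         /* caller owns one ref */
        return cs;
}

static struct call_state *
call_state_get (struct call_state *cs)
{
        atomic_fetch_add (&cs->refcount, 1);            /* e.g. per pending notify */
        return cs;
}

static void
call_state_put (struct call_state *cs)
{
        /* Free only when the last reference is dropped. */
        if (atomic_fetch_sub (&cs->refcount, 1) == 1)
                free (cs);
}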

Comment 53 Jules 2017-07-16 09:10:59 UTC
Is it possible to merge these bugfixes into glusterfs 3.10 also?

Comment 54 Niels de Vos 2017-07-17 14:55:58 UTC
(In reply to Jules from comment #53)
> Is it possible to merge these bufixes into glusterfs3.10 also?

These changes are being backported through bug 1471870.

Note that this is a bug for Red Hat Gluster Storage, and not the Gluster Community project.

Comment 55 Manisha Saini 2017-07-24 12:46:57 UTC
Verified this bug on 

glusterfs-fuse-3.8.4-35.el7rhgs.x86_64

Use Case 1:
1. Create a 2 x (4 + 2) Distributed-Disperse volume.
2. Export the volume via GNFS.
3. Mount the volume on 2 clients.
4. Create a 1G file f1 from client 1.
5. Take a lock on file f1 from client 1 -> lock granted.
6. Try taking a lock from the 2nd client on the same file -> fails to get the lock, as it is held by client 1.
7. Release the lock from client 1 on file f1 -> the lock is released from client 1 and is granted to client 2.
8. Again try taking the lock from client 1 -> lock is granted; the issue is reported in bug https://bugzilla.redhat.com/show_bug.cgi?id=1411338

Use Case 2:
1. Create a 2 x (4 + 2) Distributed-Disperse volume.
2. Export the volume via GNFS.
3. Mount the volume on 1 client.
4. Run the cthon lock test suite.


No GNFS crash is observed. Moving this bug to the verified state.

Comment 57 errata-xmlrpc 2017-09-21 04:30:55 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2017:2774
