Bug 1450377

Summary: GNFS crashed while taking lock on a file from 2 different clients having same volume mounted from 2 different servers
Product: [Community] GlusterFS
Reporter: Niels de Vos <ndevos>
Component: nfs
Assignee: bugs <bugs>
Status: CLOSED CURRENTRELEASE
Severity: unspecified
Priority: medium
Version: 3.11
CC: bugs
Keywords: Triaged
Hardware: Unspecified
OS: Unspecified
Fixed In Version: glusterfs-3.11.0
Last Closed: 2017-05-30 18:52:18 UTC
Type: Bug
Bug Depends On: 1381970

Description Niels de Vos 2017-05-12 11:30:00 UTC
Description of problem:
Mount a volume from two different servers on two different clients.
Create a file.
Take a lock on the same file from both clients.
In that case the GNFS server crashes.


Version-Release number of selected component (if applicable):


How reproducible:


Steps to Reproduce:
1. Create a disperse volume, 2 x (4 + 2), and enable MDCache and GNFS on it.
2. Mount the volume from two different servers on two different clients.
3. Create a 512-byte file on the mount point from client 1.
4. Take a lock from client 1. The lock is acquired.
5. Try taking the lock from client 2. The lock is blocked (as it is already held by client 1).
6. Release the lock from client 1. Take the lock from client 2.
7. Try taking the lock from client 1 again.
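
The locking in steps 4 to 7 can be approximated with POSIX byte-range locks. This is a minimal sketch, assuming the clients use fcntl() locks on an NFSv3 mount, which the Linux NFS client forwards to the server as NLM requests. The report does not say which tool or call was actually used, and take_lock() and release_lock() are hypothetical helper names.

```c
#include <fcntl.h>
#include <unistd.h>

/* Take an exclusive POSIX lock on the whole file behind fd.
 * wait != 0: block until the lock is granted (F_SETLKW).
 * wait == 0: return -1 immediately if another owner holds the lock.
 * Hypothetical helper, not taken from the reporter's test script. */
int take_lock(int fd, int wait)
{
    struct flock fl = {0};

    fl.l_type = F_WRLCK;     /* exclusive (write) lock */
    fl.l_whence = SEEK_SET;  /* offsets are relative to start of file */
    fl.l_start = 0;
    fl.l_len = 0;            /* 0 means "until end of file" */

    return fcntl(fd, wait ? F_SETLKW : F_SETLK, &fl);
}

/* Release any lock this process holds on the whole file. */
int release_lock(int fd)
{
    struct flock fl = {0};

    fl.l_type = F_UNLCK;
    fl.l_whence = SEEK_SET;
    fl.l_start = 0;
    fl.l_len = 0;

    return fcntl(fd, F_SETLK, &fl);
}
```

On client 1, take_lock(fd, 0) followed later by release_lock(fd) would correspond to steps 4 and 6. On client 2, take_lock(fd, 1) would block in step 5 until client 1 releases the lock.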

Actual results:
The lock is granted to client 1, which should not happen.
That issue is reported in https://bugzilla.redhat.com/show_bug.cgi?id=1411338
The GNFS server crashed.

Expected results:
GNFS should handle lock requests from two different clients on the same volume mounted from two different servers.

Additional info:

--- Additional comment from Niels de Vos on 2017-01-10 13:30 CET ---

While working on the attached test-script I managed to get a coredump too. This happened while manually executing the commands I wanted to put in the script. The script is now running and has already completed 100+ iterations, still without any crashes...

Core was generated by `/usr/sbin/glusterfs -s localhost --volfile-id gluster/nfs -p /var/lib/glusterd/'.
Program terminated with signal 11, Segmentation fault.
#0  __strcmp_sse42 () at ../sysdeps/x86_64/multiarch/strcmp-sse42.S:164
164             movdqu  (%rdi), %xmm1
(gdb) bt
#0  __strcmp_sse42 () at ../sysdeps/x86_64/multiarch/strcmp-sse42.S:164
#1  0x00007fafa65986f2 in nlm_set_rpc_clnt (rpc_clnt=0x7faf8c005200, caller_name=0x0) at nlm4.c:345
#2  0x00007fafa659b1d5 in nlm_rpcclnt_notify (rpc_clnt=0x7faf8c005200, mydata=0x7faf9f66b06c, fn=<optimized out>, data=<optimized out>) at nlm4.c:930
#3  0x00007fafb48a0a84 in rpc_clnt_notify (trans=<optimized out>, mydata=0x7faf8c005230, event=<optimized out>, data=0x7faf8c00cd70) at rpc-clnt.c:994
#4  0x00007fafb489c973 in rpc_transport_notify (this=this@entry=0x7faf8c00cd70, event=event@entry=RPC_TRANSPORT_CONNECT, data=data@entry=0x7faf8c00cd70) at rpc-transport.c:541
#5  0x00007fafa9391c67 in socket_connect_finish (this=0x7faf8c00cd70) at socket.c:2343
#6  0x00007fafa9396315 in socket_event_handler (fd=<optimized out>, idx=10, data=0x7faf8c00cd70, poll_in=0, poll_out=4, poll_err=0) at socket.c:2386
#7  0x00007fafb4b2ece0 in event_dispatch_epoll_handler (event=0x7faf9e568e80, event_pool=0x7fafb545e6e0) at event-epoll.c:571
#8  event_dispatch_epoll_worker (data=0x7fafa0033d50) at event-epoll.c:674
#9  0x00007fafb3937df5 in start_thread (arg=0x7faf9e569700) at pthread_create.c:308
#10 0x00007fafb327e1ad in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:113

(gdb) f 2
#2  0x00007fafa659b1d5 in nlm_rpcclnt_notify (rpc_clnt=0x7faf8c005200, mydata=0x7faf9f66b06c, fn=<optimized out>, data=<optimized out>) at nlm4.c:930
930                     ret = nlm_set_rpc_clnt (rpc_clnt, caller_name);
(gdb) l
925             cs = mydata;
926             caller_name = cs->args.nlm4_lockargs.alock.caller_name;
927
928             switch (fn) {
929             case RPC_CLNT_CONNECT:
930                     ret = nlm_set_rpc_clnt (rpc_clnt, caller_name);
931                     if (ret == -1) {
932                             gf_msg (GF_NLM, GF_LOG_ERROR, 0,
933                                     NFS_MSG_RPC_CLNT_ERROR, "Failed to set "
934                                     "rpc clnt");
(gdb) p cs->args.nlm4_lockargs                                                
$1 = {
  cookie = {
    nlm4_netobj_len = 0, 
    nlm4_netobj_val = 0x0
  }, 
  block = 0, 
  exclusive = 0, 
  alock = {
    caller_name = 0x0, 
    fh = {
      nlm4_netobj_len = 0, 
      nlm4_netobj_val = 0x0
    }, 
    oh = {
      nlm4_netobj_len = 0, 
      nlm4_netobj_val = 0x0
    }, 
    svid = 0, 
    l_offset = 0, 
    l_len = 0
  }, 
  reclaim = 0, 
  state = 0
}


It seems that the nlm4_lockargs are empty... No idea how that can happen, will investigate a little more.
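
The segfault in frame #0 follows directly from caller_name being 0x0: passing a NULL pointer to strcmp() is undefined behaviour. The sketch below shows a defensive comparison, assuming a hypothetical helper caller_name_matches(). It is not the code in nlm4.c and not the fix that was committed for this bug; it only illustrates the kind of check that avoids dereferencing the NULL name.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: compare two NLM caller names without ever
 * handing a NULL pointer to strcmp().
 * Returns 1 when both names are present and equal, 0 otherwise. */
int caller_name_matches(const char *stored, const char *incoming)
{
    if (stored == NULL || incoming == NULL)
        return 0; /* a missing name never matches */

    return strcmp(stored, incoming) == 0;
}
```

A caller in the position of nlm_set_rpc_clnt() could treat a 0 result for a NULL caller_name as "no matching client" and return an error instead of crashing.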

Comment 1 Worker Ant 2017-05-12 11:43:15 UTC
REVIEW: https://review.gluster.org/17264 (nfs/nlm: unref rpc-client after nlm4svc_send_granted()) posted (#1) for review on release-3.11 by Niels de Vos (ndevos)

Comment 2 Worker Ant 2017-05-12 11:43:18 UTC
REVIEW: https://review.gluster.org/17265 (nfs/nlm: ignore notify when there is no matching rpc request) posted (#1) for review on release-3.11 by Niels de Vos (ndevos)

Comment 3 Worker Ant 2017-05-12 11:43:22 UTC
REVIEW: https://review.gluster.org/17266 (nfs/nlm: log the caller_name if nlm_client_t can be found) posted (#1) for review on release-3.11 by Niels de Vos (ndevos)

Comment 4 Worker Ant 2017-05-12 11:43:25 UTC
REVIEW: https://review.gluster.org/17267 (nfs/nlm: free the nlm_client upon RPC_DISCONNECT) posted (#1) for review on release-3.11 by Niels de Vos (ndevos)

Comment 5 Worker Ant 2017-05-12 11:43:28 UTC
REVIEW: https://review.gluster.org/17268 (nfs/nlm: remove lock request from the list after cancel) posted (#1) for review on release-3.11 by Niels de Vos (ndevos)

Comment 6 Worker Ant 2017-05-17 23:24:41 UTC
COMMIT: https://review.gluster.org/17264 committed in release-3.11 by Shyamsundar Ranganathan (srangana) 
------
commit 6ae25897843160bbe7354e55ee888b5ff95111e8
Author: Niels de Vos <ndevos>
Date:   Fri Jan 13 16:05:02 2017 +0100

    nfs/nlm: unref rpc-client after nlm4svc_send_granted()
    
    nlm4svc_send_granted() uses the rpc_clnt by getting it from the
    call-state structure. It is safer to unref the rpc_clnt after the
    function is done with it.
    
    Cherry picked from commit 52c28c0c04722a9ffaa7c39c49ffebdf0a5c75e1:
    > Change-Id: I7cb7c4297801463d21259c58b50d7df7c57aec5e
    > BUG: 1381970
    > Signed-off-by: Niels de Vos <ndevos>
    > Reviewed-on: https://review.gluster.org/17187
    > Smoke: Gluster Build System <jenkins.org>
    > NetBSD-regression: NetBSD Build System <jenkins.org>
    > CentOS-regression: Gluster Build System <jenkins.org>
    > Reviewed-by: soumya k <skoduri>
    > Reviewed-by: Jeff Darcy <jeff.us>
    
    Change-Id: I7cb7c4297801463d21259c58b50d7df7c57aec5e
    BUG: 1450377
    Signed-off-by: Niels de Vos <ndevos>
    Reviewed-on: https://review.gluster.org/17264
    Smoke: Gluster Build System <jenkins.org>
    NetBSD-regression: NetBSD Build System <jenkins.org>
    Reviewed-by: jiffin tony Thottan <jthottan>
    Reviewed-by: Kaleb KEITHLEY <kkeithle>
    CentOS-regression: Gluster Build System <jenkins.org>
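
The ordering described in the commit message above (use the rpc_clnt first, drop the reference afterwards) can be sketched with a toy reference counter. All demo_* names are hypothetical stand-ins and not GlusterFS APIs; the freed flag exists only so that the moment of release can be observed.

```c
#include <stdlib.h>

/* Toy reference-counted object standing in for an rpc client. */
struct demo_clnt {
    int refcount;
    int *freed_flag; /* set to 1 when the object is released */
};

struct demo_clnt *demo_clnt_new(int *freed_flag)
{
    struct demo_clnt *c = malloc(sizeof(*c));

    if (c != NULL) {
        c->refcount = 1;
        c->freed_flag = freed_flag;
    }
    return c;
}

struct demo_clnt *demo_clnt_ref(struct demo_clnt *c)
{
    c->refcount++;
    return c;
}

void demo_clnt_unref(struct demo_clnt *c)
{
    if (--c->refcount == 0) {
        *c->freed_flag = 1;
        free(c);
    }
}

/* Stand-in for work that needs the client to still be alive;
 * it reports the reference count it observed. */
int demo_send_granted(const struct demo_clnt *c)
{
    return c->refcount;
}

/* Use the client first, then drop our reference. Unreferencing
 * before the call could free the object while it is still needed. */
int demo_reply(struct demo_clnt *c)
{
    int seen = demo_send_granted(c);

    demo_clnt_unref(c);
    return seen;
}
```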

Comment 7 Worker Ant 2017-05-17 23:25:35 UTC
COMMIT: https://review.gluster.org/17265 committed in release-3.11 by Shyamsundar Ranganathan (srangana) 
------
commit 0533eb3c4af148d588bddd6ed967f253f99b8e33
Author: Niels de Vos <ndevos>
Date:   Fri Jan 13 14:02:45 2017 +0100

    nfs/nlm: ignore notify when there is no matching rpc request
    
    In certain (unclear) occasions it seems to happen that there are
    notifications sent to the Gluster/NFS NLM service, but no call-state can
    be found. Instead of segfaulting, log an error but keep on running.
    
    Cherry picked from commit e997d752ba08f80b1b00d2c0035874befafe5200:
    > Change-Id: I0f186e56e46a86ca40314d230c1cc7719c61f0b5
    > BUG: 1381970
    > Signed-off-by: Niels de Vos <ndevos>
    > Reviewed-on: https://review.gluster.org/17185
    > Smoke: Gluster Build System <jenkins.org>
    > NetBSD-regression: NetBSD Build System <jenkins.org>
    > CentOS-regression: Gluster Build System <jenkins.org>
    > Reviewed-by: soumya k <skoduri>
    > Reviewed-by: jiffin tony Thottan <jthottan>
    > Reviewed-by: Jeff Darcy <jeff.us>
    
    Change-Id: I0f186e56e46a86ca40314d230c1cc7719c61f0b5
    BUG: 1450377
    Signed-off-by: Niels de Vos <ndevos>
    Reviewed-on: https://review.gluster.org/17265
    Smoke: Gluster Build System <jenkins.org>
    NetBSD-regression: NetBSD Build System <jenkins.org>
    CentOS-regression: Gluster Build System <jenkins.org>
    Reviewed-by: jiffin tony Thottan <jthottan>
    Reviewed-by: Kaleb KEITHLEY <kkeithle>

Comment 8 Worker Ant 2017-05-17 23:25:58 UTC
COMMIT: https://review.gluster.org/17266 committed in release-3.11 by Shyamsundar Ranganathan (srangana) 
------
commit 8eceda342eb4b5581b5eb6cc43f8f1e758975809
Author: Niels de Vos <ndevos>
Date:   Fri Jan 13 14:46:17 2017 +0100

    nfs/nlm: log the caller_name if nlm_client_t can be found
    
    In order to help tracking possible misbehaving clients down, log the
    'caller_name' (hostname of the NFS client) that does not have a matching
    nlm_client_t structure.
    
    Cherry picked from commit 9bfb74a39954a7e63bfd762c816efc7e64b9df65:
    > Change-Id: Ib514a78d1809719a3d0274acc31ee632727d746d
    > BUG: 1381970
    > Signed-off-by: Niels de Vos <ndevos>
    > Reviewed-on: https://review.gluster.org/17186
    > Smoke: Gluster Build System <jenkins.org>
    > NetBSD-regression: NetBSD Build System <jenkins.org>
    > CentOS-regression: Gluster Build System <jenkins.org>
    > Reviewed-by: soumya k <skoduri>
    > Reviewed-by: Jeff Darcy <jeff.us>
    
    Change-Id: Ib514a78d1809719a3d0274acc31ee632727d746d
    BUG: 1450377
    Signed-off-by: Niels de Vos <ndevos>
    Reviewed-on: https://review.gluster.org/17266
    Smoke: Gluster Build System <jenkins.org>
    NetBSD-regression: NetBSD Build System <jenkins.org>
    Reviewed-by: jiffin tony Thottan <jthottan>
    Reviewed-by: Kaleb KEITHLEY <kkeithle>
    CentOS-regression: Gluster Build System <jenkins.org>

Comment 9 Worker Ant 2017-05-17 23:26:03 UTC
COMMIT: https://review.gluster.org/17267 committed in release-3.11 by Shyamsundar Ranganathan (srangana) 
------
commit c71645532523191d3a364a8d1771e377db512e6b
Author: Niels de Vos <ndevos>
Date:   Fri Jan 20 14:15:31 2017 +0100

    nfs/nlm: free the nlm_client upon RPC_DISCONNECT
    
    When an NLM client disconnects, it should be removed from the list and
    free'd.
    
    > Cherry picked from commit 6897ba5c51b29c05b270c447adb1a34cb8e61911:
    > Change-Id: Ib427c896bfcdc547a3aee42a652578ffd076e2ad
    > BUG: 1381970
    > Signed-off-by: Niels de Vos <ndevos>
    > Reviewed-on: https://review.gluster.org/17189
    > Smoke: Gluster Build System <jenkins.org>
    > NetBSD-regression: NetBSD Build System <jenkins.org>
    > Reviewed-by: Kaleb KEITHLEY <kkeithle>
    > CentOS-regression: Gluster Build System <jenkins.org>
    > Reviewed-by: jiffin tony Thottan <jthottan>
    
    Change-Id: Ib427c896bfcdc547a3aee42a652578ffd076e2ad
    BUG: 1450377
    Signed-off-by: Niels de Vos <ndevos>
    Reviewed-on: https://review.gluster.org/17267
    Smoke: Gluster Build System <jenkins.org>
    NetBSD-regression: NetBSD Build System <jenkins.org>
    CentOS-regression: Gluster Build System <jenkins.org>
    Reviewed-by: jiffin tony Thottan <jthottan>
    Reviewed-by: Kaleb KEITHLEY <kkeithle>

Comment 10 Worker Ant 2017-05-17 23:26:07 UTC
COMMIT: https://review.gluster.org/17268 committed in release-3.11 by Shyamsundar Ranganathan (srangana) 
------
commit 55a23c268a2451f2f4d25c06c05e4c3bc0933d5d
Author: Niels de Vos <ndevos>
Date:   Fri Jan 13 13:02:23 2017 +0100

    nfs/nlm: remove lock request from the list after cancel
    
    Once an NLM client cancels a lock request, it should be removed from the
    list. The list can also be cleaned of unneeded entries once the client
    does not have any outstanding lock/share requests/granted.
    
    Cherry picked from commit 71cb7f3eb4fb706aab7f83906592942a2ff2e924:
    > Change-Id: I2f2b666b627dcb52cddc6d5b95856e420b2b2e26
    > BUG: 1381970
    > Signed-off-by: Niels de Vos <ndevos>
    > Reviewed-on: https://review.gluster.org/17188
    > Smoke: Gluster Build System <jenkins.org>
    > NetBSD-regression: NetBSD Build System <jenkins.org>
    > Reviewed-by: Kaleb KEITHLEY <kkeithle>
    > CentOS-regression: Gluster Build System <jenkins.org>
    > Reviewed-by: jiffin tony Thottan <jthottan>
    
    Change-Id: I2f2b666b627dcb52cddc6d5b95856e420b2b2e26
    BUG: 1450377
    Signed-off-by: Niels de Vos <ndevos>
    Reviewed-on: https://review.gluster.org/17268
    Smoke: Gluster Build System <jenkins.org>
    NetBSD-regression: NetBSD Build System <jenkins.org>
    CentOS-regression: Gluster Build System <jenkins.org>
    Reviewed-by: jiffin tony Thottan <jthottan>
    Reviewed-by: Kaleb KEITHLEY <kkeithle>
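
Removing a cancelled request from a list, as the commit message above describes, can be sketched with a plain singly linked list. This is an illustration only: the demo_req type and functions are hypothetical, and the actual NLM code uses its own list structures.

```c
#include <stdlib.h>

/* Toy pending-lock request, identified by an integer id. */
struct demo_req {
    int id;
    struct demo_req *next;
};

/* Add a request at the head of the list; returns 0 on success. */
int demo_req_add(struct demo_req **head, int id)
{
    struct demo_req *r = malloc(sizeof(*r));

    if (r == NULL)
        return -1;
    r->id = id;
    r->next = *head;
    *head = r;
    return 0;
}

/* Unlink and free the request with the given id.
 * Returns 1 if an entry was removed, 0 if none matched. */
int demo_req_cancel(struct demo_req **head, int id)
{
    for (struct demo_req **pp = head; *pp != NULL; pp = &(*pp)->next) {
        if ((*pp)->id == id) {
            struct demo_req *dead = *pp;

            *pp = dead->next; /* bypass the entry before freeing it */
            free(dead);
            return 1;
        }
    }
    return 0;
}

/* Count the requests still on the list. */
int demo_req_count(const struct demo_req *head)
{
    int n = 0;

    for (; head != NULL; head = head->next)
        n++;
    return n;
}
```

Once the last entry is cancelled the head pointer is NULL again, which is the point at which a per-client list could be dropped as a whole.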

Comment 11 Shyamsundar 2017-05-30 18:52:18 UTC
This bug is getting closed because a release has been made available that should address the reported issue. In case the problem is still not fixed with glusterfs-3.11.0, please open a new bug report.

glusterfs-3.11.0 has been announced on the Gluster mailing lists [1]; packages for several distributions should become available in the near future. Keep an eye on the Gluster Users mailing list [2] and the update infrastructure for your distribution.

[1] http://lists.gluster.org/pipermail/announce/2017-May/000073.html
[2] https://www.gluster.org/pipermail/gluster-users/