Bug 1411344
| Summary: | [GNFS] GNFS crashed while taking lock on a file from 2 different clients having same volume mounted from 2 different servers | | |
|---|---|---|---|
| Product: | [Red Hat Storage] Red Hat Gluster Storage | Reporter: | Manisha Saini <msaini> |
| Component: | gluster-nfs | Assignee: | Niels de Vos <ndevos> |
| Status: | CLOSED ERRATA | QA Contact: | Manisha Saini <msaini> |
| Severity: | unspecified | Docs Contact: | |
| Priority: | medium | | |
| Version: | rhgs-3.2 | CC: | amukherj, asrivast, giuseppe.ragusa, jthottan, jules, msaini, ndevos, rcyriac, rhinduja, rhs-bugs, sbhaloth, skoduri, storage-qa-internal |
| Target Milestone: | --- | Keywords: | Reproducer |
| Target Release: | RHGS 3.3.0 | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | 3.3.0-devel-freeze-exception | | |
| Fixed In Version: | glusterfs-3.8.4-33 | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2017-09-21 04:30:55 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Bug Depends On: | 1381970, 1467313 | | |
| Bug Blocks: | 1411338, 1417147 | | |
| Attachments: | | | |
Description
Manisha Saini
2017-01-09 14:28:31 UTC
When you run this test, how often does it crash for you? Is it a one-time occurrence, or is it very reliable? I hit this same crash twice while running this test scenario.

Missed some additional info. Version-Release number of selected component (if applicable): glusterfs-ganesha-3.8.4-10.el7rhgs.x86_64

Created attachment 1239086 [details]
Test script
While working on the attached test script I managed to get a coredump too. This happened while manually executing the commands I wanted to put in the script. The script itself has now been running for 100+ iterations and still no crashes...
Core was generated by `/usr/sbin/glusterfs -s localhost --volfile-id gluster/nfs -p /var/lib/glusterd/'.
Program terminated with signal 11, Segmentation fault.
#0 __strcmp_sse42 () at ../sysdeps/x86_64/multiarch/strcmp-sse42.S:164
164 movdqu (%rdi), %xmm1
(gdb) bt
#0 __strcmp_sse42 () at ../sysdeps/x86_64/multiarch/strcmp-sse42.S:164
#1 0x00007fafa65986f2 in nlm_set_rpc_clnt (rpc_clnt=0x7faf8c005200, caller_name=0x0) at nlm4.c:345
#2 0x00007fafa659b1d5 in nlm_rpcclnt_notify (rpc_clnt=0x7faf8c005200, mydata=0x7faf9f66b06c, fn=<optimized out>, data=<optimized out>) at nlm4.c:930
#3 0x00007fafb48a0a84 in rpc_clnt_notify (trans=<optimized out>, mydata=0x7faf8c005230, event=<optimized out>, data=0x7faf8c00cd70) at rpc-clnt.c:994
#4 0x00007fafb489c973 in rpc_transport_notify (this=this@entry=0x7faf8c00cd70, event=event@entry=RPC_TRANSPORT_CONNECT, data=data@entry=0x7faf8c00cd70) at rpc-transport.c:541
#5 0x00007fafa9391c67 in socket_connect_finish (this=0x7faf8c00cd70) at socket.c:2343
#6 0x00007fafa9396315 in socket_event_handler (fd=<optimized out>, idx=10, data=0x7faf8c00cd70, poll_in=0, poll_out=4, poll_err=0) at socket.c:2386
#7 0x00007fafb4b2ece0 in event_dispatch_epoll_handler (event=0x7faf9e568e80, event_pool=0x7fafb545e6e0) at event-epoll.c:571
#8 event_dispatch_epoll_worker (data=0x7fafa0033d50) at event-epoll.c:674
#9 0x00007fafb3937df5 in start_thread (arg=0x7faf9e569700) at pthread_create.c:308
#10 0x00007fafb327e1ad in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:113
(gdb) f 2
#2 0x00007fafa659b1d5 in nlm_rpcclnt_notify (rpc_clnt=0x7faf8c005200, mydata=0x7faf9f66b06c, fn=<optimized out>, data=<optimized out>) at nlm4.c:930
930 ret = nlm_set_rpc_clnt (rpc_clnt, caller_name);
(gdb) l
925 cs = mydata;
926 caller_name = cs->args.nlm4_lockargs.alock.caller_name;
927
928 switch (fn) {
929 case RPC_CLNT_CONNECT:
930 ret = nlm_set_rpc_clnt (rpc_clnt, caller_name);
931 if (ret == -1) {
932 gf_msg (GF_NLM, GF_LOG_ERROR, 0,
933 NFS_MSG_RPC_CLNT_ERROR, "Failed to set "
934 "rpc clnt");
(gdb) p cs->args.nlm4_lockargs
$1 = {
cookie = {
nlm4_netobj_len = 0,
nlm4_netobj_val = 0x0
},
block = 0,
exclusive = 0,
alock = {
caller_name = 0x0,
fh = {
nlm4_netobj_len = 0,
nlm4_netobj_val = 0x0
},
oh = {
nlm4_netobj_len = 0,
nlm4_netobj_val = 0x0
},
svid = 0,
l_offset = 0,
l_len = 0
},
reclaim = 0,
state = 0
}
It seems that the nlm4_lockargs are empty... No idea how that can happen, will investigate a little more.
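For context on why this shows up as a SIGSEGV inside __strcmp_sse42: strcmp() dereferences both of its arguments unconditionally, and frame #1 above shows nlm_set_rpc_clnt() being handed caller_name=0x0. Below is a minimal sketch of that failure mode, with a made-up client-list walk standing in for whatever comparison nlm_set_rpc_clnt() actually performs; the struct and lookup are illustrative assumptions, not Gluster's real code.

```c
#include <string.h>

/* Hypothetical stand-in for a per-client NLM bookkeeping entry. */
struct nlm_client {
    const char *caller_name;
    struct nlm_client *next;
};

/* Illustrative lookup in the spirit of nlm_set_rpc_clnt(): if the
 * caller_name taken from cs->args.nlm4_lockargs is NULL, as in the
 * core dump above, strcmp() dereferences NULL and the process takes
 * SIGSEGV on the first vector load of the SSE4.2 implementation. */
static struct nlm_client *find_client(struct nlm_client *head,
                                      const char *caller_name)
{
    for (struct nlm_client *c = head; c; c = c->next) {
        if (strcmp(c->caller_name, caller_name) == 0) /* faults if NULL */
            return c;
    }
    return NULL;
}

int main(void)
{
    struct nlm_client a = { "client-a", 0 };

    /* Deliberately crashes, mirroring the backtrace above. */
    find_client(&a, (const char *)0);
    return 0;
}
```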
On the occasions (not sure yet what the trigger is) when the Gluster/NFS service crashes for me, I get the following log:

[2017-01-10 15:19:05.722973] W [socket.c:590:__socket_rwv] 0-NLM-client: readv on 172.31.100.3:58171 failed (No data available)
pending frames:
frame : type(0) op(0)
patchset: git://git.gluster.com/glusterfs.git
signal received: 11
time of crash: 2017-01-10 15:19:16
configuration details:
argp 1
backtrace 1
dlfcn 1
libpthread 1
llistxattr 1
setfsid 1
spinlock 1
epoll.h 1
xattr.h 1
st_atim.tv_nsec 1
package-string: glusterfs 3.8.5
/lib64/libglusterfs.so.0(_gf_msg_backtrace_nomem+0xc2)[0x7f831d790592]
/lib64/libglusterfs.so.0(gf_print_trace+0x324)[0x7f831d79a004]
/lib64/libc.so.6(+0x35650)[0x7f831be78650]
/lib64/libc.so.6(+0x131f26)[0x7f831bf74f26]
/usr/lib64/glusterfs/3.8.5/xlator/nfs/server.so(+0x3f6f2)[0x7f830f2536f2]
/usr/lib64/glusterfs/3.8.5/xlator/nfs/server.so(+0x421d5)[0x7f830f2561d5]
/lib64/libgfrpc.so.0(rpc_clnt_notify+0x214)[0x7f831d55ba84]
/lib64/libgfrpc.so.0(rpc_transport_notify+0x23)[0x7f831d557973]
/usr/lib64/glusterfs/3.8.5/rpc-transport/socket.so(+0x4c67)[0x7f831204cc67]
/usr/lib64/glusterfs/3.8.5/rpc-transport/socket.so(+0x9315)[0x7f8312051315]
/lib64/libglusterfs.so.0(+0x82ce0)[0x7f831d7e9ce0]
/lib64/libpthread.so.0(+0x7df5)[0x7f831c5f2df5]
/lib64/libc.so.6(clone+0x6d)[0x7f831bf391ad]

It might be that __socket_rwv() calls the notify functions even when there is no data. This would be a connect to the NLM port with a packet that at least contains enough information to identify it as an NLM procedure. Unfortunately I have not succeeded in reproducing the problem while capturing a tcpdump...

As discussed in the bug triage meeting, I tried the same scenario with MD-Cache disabled. With MD-Cache disabled too, the same GNFS crash was observed.

# tailf /var/log/glusterfs/nfs.log
[2017-01-12 06:57:29.337698] W [socket.c:590:__socket_rwv] 0-NLM-client: readv on 10.70.37.142:40829 failed (No data available)
pending frames:
frame : type(0) op(0)
patchset: git://git.gluster.com/glusterfs.git
signal received: 11
time of crash: 2017-01-12 06:57:40
configuration details:
argp 1
backtrace 1
dlfcn 1
libpthread 1
llistxattr 1
setfsid 1
spinlock 1
epoll.h 1
xattr.h 1
st_atim.tv_nsec 1
package-string: glusterfs 3.8.4
/lib64/libglusterfs.so.0(_gf_msg_backtrace_nomem+0xc2)[0x7ff93eb51b92]
/lib64/libglusterfs.so.0(gf_print_trace+0x324)[0x7ff93eb5b624]
/lib64/libc.so.6(+0x35250)[0x7ff93d235250]
/lib64/libc.so.6(+0x133586)[0x7ff93d333586]
/usr/lib64/glusterfs/3.8.4/xlator/nfs/server.so(+0x3f742)[0x7ff93033d742]
/usr/lib64/glusterfs/3.8.4/xlator/nfs/server.so(+0x42235)[0x7ff930340235]
/lib64/libgfrpc.so.0(rpc_clnt_notify+0x232)[0x7ff93e91ba12]
/lib64/libgfrpc.so.0(rpc_transport_notify+0x23)[0x7ff93e917893]
/usr/lib64/glusterfs/3.8.4/rpc-transport/socket.so(+0x5167)[0x7ff933407167]
/usr/lib64/glusterfs/3.8.4/rpc-transport/socket.so(+0x9738)[0x7ff93340b738]
/lib64/libglusterfs.so.0(+0x835b0)[0x7ff93ebab5b0]
/lib64/libpthread.so.0(+0x7dc5)[0x7ff93d9b2dc5]
/lib64/libc.so.6(clone+0x6d)[0x7ff93d2f773d]

# showmount -e localhost
clnt_create: RPC: Program not registered

Created attachment 1239987 [details]
Test script that causes Gluster/NFS to crash in NLM code

I am able to crash Gluster/NFS pretty reliably now with the attached script. The following actions are taken:

1. mount a volume from server_a
2. take a whole-file lock (and keep it in the background)
3. mount a volume from server_b
4. try to take the whole-file lock too
5. cancel the 2nd lock request
6. release the obtained lock
7. try to get the lock again
8. get the lock, and the segfault happens

This seems to take somewhere between 10-15 minutes before step 8 happens. Example output with timings:

# ./bz1411344_cancel.sh
[A] trying to get lock: Thu Jan 12 16:06:42 CET 2017
[A] got lock: Thu Jan 12 16:06:42 CET 2017
[B] trying to get lock: Thu Jan 12 16:06:43 CET 2017
[B] cancelling lock: Thu Jan 12 16:06:45 CET 2017
[A] released lock: Thu Jan 12 16:07:02 CET 2017
[A] trying to get lock: Thu Jan 12 16:07:04 CET 2017
[A] got lock: Thu Jan 12 16:18:54 CET 2017

The environment for this script:
- two Gluster servers for mounting the NFS export (can be a single brick volume)
- single NFS-client, but with two different mountpoints

Manisha, I am not sure if you did a lock-cancel in your test. Without it I am not able to reproduce the segfault. Could you please confirm the exact steps that you did to get the crashes?

I used the attached script to take the lock. Below are the steps:

1. Mount a volume from server1 to client1
2. Create a 512-byte file, say file1, using dd from client1
3. Mount the same volume from server2 to client2
4. Take a lock from client1 on file1 --> lock is granted
5. Try taking a lock from client2 on file1 --> locking is blocked (not granted)
6. Release the lock from client1
7. Take a lock from client2 on file1 --> lock is granted
8. Try taking a lock from client1 on file1 --> lock is granted (which it should not be; for this I filed bug https://bugzilla.redhat.com/show_bug.cgi?id=1411338)
9. Release the lock from client2 and client1

As you mentioned in comment 11, it takes around 5-10 minutes to crash GNFS after these steps.

Created attachment 1240221 [details]
Locking script
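(The attached scripts are not inlined in this report. As illustration only, here is a minimal fcntl()-based locker of the kind such a script would drive; the file path, hold time, and messages below are placeholder assumptions, not the attachment's contents.)

```c
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

/* Take a blocking whole-file POSIX lock and release it after a while.
 * The kernel NFS client translates these fcntl() calls into NLM4_LOCK
 * and NLM4_UNLOCK requests against the Gluster/NFS server. Run one
 * instance per mountpoint, e.g.:
 *   ./locker /mnt/server_a/file1 &
 *   ./locker /mnt/server_b/file1
 */
int main(int argc, char **argv)
{
    if (argc != 2) {
        fprintf(stderr, "usage: %s <file-on-nfs-mount>\n", argv[0]);
        return 1;
    }

    int fd = open(argv[1], O_RDWR);
    if (fd < 0) {
        perror("open");
        return 1;
    }

    struct flock fl = {
        .l_type   = F_WRLCK,   /* exclusive lock */
        .l_whence = SEEK_SET,
        .l_start  = 0,
        .l_len    = 0,         /* 0 = whole file */
    };

    printf("trying to get lock\n");
    if (fcntl(fd, F_SETLKW, &fl) < 0) {  /* blocks until granted */
        perror("fcntl(F_SETLKW)");
        return 1;
    }
    printf("got lock; holding for 20s\n");
    sleep(20);

    fl.l_type = F_UNLCK;
    fcntl(fd, F_SETLKW, &fl);
    printf("released lock\n");
    close(fd);
    return 0;
}
```

Killing an instance while it is still blocked in F_SETLKW (the lock cancel of step 5 above) should make the kernel client emit an NLM4_CANCEL, which per the analysis further down is the kind of non-NLM4_LOCK request whose reconnect handling goes wrong.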
Observed the same crash while taking an exclusive write lock from client1 on an existing file and at the same time trying to overwrite the same file from client2.

(gdb) bt
#0 0x00007fcdc2a1f586 in __strcmp_sse42 () from /lib64/libc.so.6
#1 0x00007fcdb5a29742 in nlm_set_rpc_clnt () from /usr/lib64/glusterfs/3.8.4/xlator/nfs/server.so
#2 0x00007fcdb5a2c235 in nlm_rpcclnt_notify () from /usr/lib64/glusterfs/3.8.4/xlator/nfs/server.so
#3 0x00007fcdc4007a12 in rpc_clnt_notify () from /lib64/libgfrpc.so.0
#4 0x00007fcdc4003893 in rpc_transport_notify () from /lib64/libgfrpc.so.0
#5 0x00007fcdb8af3167 in socket_connect_finish () from /usr/lib64/glusterfs/3.8.4/rpc-transport/socket.so
#6 0x00007fcdb8af7738 in socket_event_handler () from /usr/lib64/glusterfs/3.8.4/rpc-transport/socket.so
#7 0x00007fcdc42975b0 in event_dispatch_epoll_worker () from /lib64/libglusterfs.so.0
#8 0x00007fcdc309edc5 in start_thread () from /lib64/libpthread.so.0
#9 0x00007fcdc29e373d in clone () from /lib64/libc.so.6

Observing the same crash as reported in this bug with the cthon locking test suite:

pending frames:
frame : type(0) op(0)
patchset: git://git.gluster.com/glusterfs.git
signal received: 11
time of crash: 2017-01-27 06:59:06
configuration details:
argp 1
backtrace 1
dlfcn 1
libpthread 1
llistxattr 1
setfsid 1
spinlock 1
epoll.h 1
xattr.h 1
st_atim.tv_nsec 1
package-string: glusterfs 3.8.4
/lib64/libglusterfs.so.0(_gf_msg_backtrace_nomem+0xc2)[0x7f0645675b92]
/lib64/libglusterfs.so.0(gf_print_trace+0x324)[0x7f064567f624]
/lib64/libc.so.6(+0x35250)[0x7f0643d59250]
/lib64/libc.so.6(+0x133586)[0x7f0643e57586]
/usr/lib64/glusterfs/3.8.4/xlator/nfs/server.so(+0x3f742)[0x7f06370ca742]
/usr/lib64/glusterfs/3.8.4/xlator/nfs/server.so(+0x42235)[0x7f06370cd235]
/lib64/libgfrpc.so.0(rpc_clnt_notify+0x232)[0x7f064543fa12]
/lib64/libgfrpc.so.0(rpc_transport_notify+0x23)[0x7f064543b893]
/usr/lib64/glusterfs/3.8.4/rpc-transport/socket.so(+0x5167)[0x7f0639f2b167]
/usr/lib64/glusterfs/3.8.4/rpc-transport/socket.so(+0x9738)[0x7f0639f2f738]
/lib64/libglusterfs.so.0(+0x835b0)[0x7f06456cf5b0]
/lib64/libpthread.so.0(+0x7dc5)[0x7f06444d6dc5]
/lib64/libc.so.6(clone+0x6d)[0x7f0643e1b73d]

(gdb) bt
#0 0x00007f0643e57586 in __strcmp_sse42 () from /lib64/libc.so.6
#1 0x00007f06370ca742 in nlm_set_rpc_clnt () from /usr/lib64/glusterfs/3.8.4/xlator/nfs/server.so
#2 0x00007f06370cd235 in nlm_rpcclnt_notify () from /usr/lib64/glusterfs/3.8.4/xlator/nfs/server.so
#3 0x00007f064543fa12 in rpc_clnt_notify () from /lib64/libgfrpc.so.0
#4 0x00007f064543b893 in rpc_transport_notify () from /lib64/libgfrpc.so.0
#5 0x00007f0639f2b167 in socket_connect_finish () from /usr/lib64/glusterfs/3.8.4/rpc-transport/socket.so
#6 0x00007f0639f2f738 in socket_event_handler () from /usr/lib64/glusterfs/3.8.4/rpc-transport/socket.so
#7 0x00007f06456cf5b0 in event_dispatch_epoll_worker () from /lib64/libglusterfs.so.0
#8 0x00007f06444d6dc5 in start_thread () from /lib64/libpthread.so.0
#9 0x00007f0643e1b73d in clone () from /lib64/libc.so.6

The cthon test suite is passing, but after completion of the test GNFS crashes on the server node from which the volume is mounted.

Even without locking (I've set the NFS client mounts to nolock,local_lock=none) the NFS server crashes on some workloads. It became even more evil since I upgraded from 3.7.8 to 3.9.1-1. I submitted a bug report for this a long time ago already: https://bugzilla.redhat.com/show_bug.cgi?id=1381970

Given that the bug has been confirmed present even with locking out of the picture in comment #22 above (a condition that I'm not able to test/replicate ATM, unfortunately), would it be possible that the following is involved?

https://review.gluster.org/#/c/16613/

(In reply to Giuseppe Ragusa from comment #24)
> Given that the bug has been confirmed present even with locking out of the
> picture in comment #22 above, would it be possible that the following is
> involved?
> https://review.gluster.org/#/c/16613/

For any of the segfaults that have references to 'nlm' in the backtrace, locking over NFS has not been completely disabled.

It would be needed to remove all nlockmgr services from both the client and server registrations at 'rpcbind'. This can be checked with

# rpcinfo | grep nlockmgr

on the client and server, and registrations can be removed with

# rpcinfo -d 100021 1
# rpcinfo -d 100021 4

This will cause all locking related requests initiated from both the server and the client to (silently?) fail.

(In reply to Niels de Vos from comment #25)
> For any of the segfaults that have references to 'nlm' in the backtrace,
> locking over NFS has not been completely disabled. [...]

Sounds conclusive. But one thing I don't understand: if I force the NFS client not to send any locks by setting the mount options nolock,local_lock=none, and I also don't see any locks on the mounts when checking via lsof, where should the write locks come from?

Upon mounting, the NFS-server initiates a callback channel to the NLM service running on the client. It will still try to track certain events in the NLM part of the NFS-server. When these connections are set up, the NFS-server will try to use them, and this may still cause the segfaults.

Niels, please provide the rpms with the fixes. We shall be able to test them and confirm whether the issue is fixed.

Hey Niels, I tried verifying this bug with the rpms given in comment #16. Created a 2 x (4 + 2) Distributed-Disperse volume and started the volume. While enabling GNFS on the EC volume, GNFS crashed.

# gluster v set disperseVol nfs.disable off
Gluster NFS is being deprecated in favor of NFS-Ganesha
Enter "yes" to continue using Gluster NFS (y/n) y
volume set: success

# showmount -e localhost
clnt_create: RPC: Program not registered

In nfs.log:

[2017-04-21 14:31:23.448381] I [MSGID: 100030] [glusterfsd.c:2455:main] 0-/usr/sbin/glusterfs: Started running /usr/sbin/glusterfs version 3.10dev (args: /usr/sbin/glusterfs -s localhost --volfile-id gluster/nfs -p /var/lib/glusterd/nfs/run/nfs.pid -l /var/log/glusterfs/nfs.log -S /var/run/gluster/a791065172835cf260dc66c9aa9e0a7e.socket)
[2017-04-21 14:31:23.467324] I [MSGID: 101190] [event-epoll.c:628:event_dispatch_epoll_worker] 0-epoll: Started thread with index 1
[2017-04-21 14:31:23.467388] E [socket.c:2307:socket_connect_finish] 0-glusterfs: connection to ::1:24007 failed (Connection refused)
[2017-04-21 14:31:23.467439] E [glusterfsd-mgmt.c:1931:mgmt_rpc_notify] 0-glusterfsd-mgmt: failed to connect with remote-host: localhost (Transport endpoint is not connected)
[2017-04-21 14:31:27.462129] I [rpcsvc.c:2208:rpcsvc_set_outstanding_rpc_limit] 0-rpc-service: Configured rpc.outstanding-rpc-limit with value 16
[2017-04-21 14:31:27.490973] I [MSGID: 112153] [mount3.c:3992:mnt3svc_init] 0-nfs-mount: Exports auth has been disabled!
[2017-04-21 14:31:27.560772] I [rpc-drc.c:689:rpcsvc_drc_init] 0-rpc-service: DRC is turned OFF
[2017-04-21 14:31:27.560822] I [MSGID: 112110] [nfs.c:1524:init] 0-nfs: NFS service started
[2017-04-21 14:31:27.574268] I [MSGID: 122067] [ec-code.c:896:ec_code_detect] 0-disperseVol-disperse-1: Using 'sse' CPU extensions
pending frames:
frame : type(0) op(0)
patchset: git://git.gluster.org/glusterfs.git
signal received: 11
time of crash: 2017-04-21 14:31:27
configuration details:
argp 1
backtrace 1
dlfcn 1
libpthread 1
llistxattr 1
setfsid 1
spinlock 1
epoll.h 1
xattr.h 1
st_atim.tv_nsec 1
package-string: glusterfs 3.10dev
/lib64/libglusterfs.so.0(_gf_msg_backtrace_nomem+0xa0)[0x7f1e244d1e90]
/lib64/libglusterfs.so.0(gf_print_trace+0x324)[0x7f1e244db794]
/lib64/libc.so.6(+0x35250)[0x7f1e22bb6250]
/usr/lib64/glusterfs/3.10dev/xlator/cluster/disperse.so(+0x34dc1)[0x7f1e16839dc1]
/usr/lib64/glusterfs/3.10dev/xlator/cluster/disperse.so(+0x33964)[0x7f1e16838964]
/usr/lib64/glusterfs/3.10dev/xlator/cluster/disperse.so(init+0x1f4)[0x7f1e16812184]
/lib64/libglusterfs.so.0(xlator_init+0x50)[0x7f1e244cf5f0]
/lib64/libglusterfs.so.0(glusterfs_graph_init+0x29)[0x7f1e24506c29]
/lib64/libglusterfs.so.0(glusterfs_graph_activate+0x3b)[0x7f1e2450754b]
/usr/sbin/glusterfs(glusterfs_process_volfp+0x12d)[0x7f1e249cb58d]
/usr/sbin/glusterfs(mgmt_getspec_cbk+0x3c1)[0x7f1e249d0f51]
/lib64/libgfrpc.so.0(rpc_clnt_handle_reply+0x90)[0x7f1e2429c720]
/lib64/libgfrpc.so.0(rpc_clnt_notify+0x1df)[0x7f1e2429c9ff]
/lib64/libgfrpc.so.0(rpc_transport_notify+0x23)[0x7f1e242988e3]
/usr/lib64/glusterfs/3.10dev/rpc-transport/socket.so(+0x72f4)[0x7f1e18d862f4]
/usr/lib64/glusterfs/3.10dev/rpc-transport/socket.so(+0x9795)[0x7f1e18d88795]
/lib64/libglusterfs.so.0(+0x846a0)[0x7f1e2452c6a0]
/lib64/libpthread.so.0(+0x7dc5)[0x7f1e23333dc5]
/lib64/libc.so.6(clone+0x6d)[0x7f1e22c7873d]

# gluster v info
Volume Name: disperseVol
Type: Distributed-Disperse
Volume ID: 9c4e5c69-6a53-4ba3-877f-20536bc2f0d2
Status: Started
Snapshot Count: 0
Number of Bricks: 2 x (4 + 2) = 12
Transport-type: tcp
Bricks:
Brick1: 10.70.47.47:/gluster/brick1/1
Brick2: 10.70.46.38:/gluster/brick1/1
Brick3: 10.70.46.109:/gluster/brick1/1
Brick4: 10.70.46.56:/gluster/brick1/1
Brick5: 10.70.47.126:/gluster/brick1/1
Brick6: 10.70.46.135:/gluster/brick1/1
Brick7: 10.70.47.47:/gluster/brick2/2
Brick8: 10.70.46.38:/gluster/brick2/2
Brick9: 10.70.46.109:/gluster/brick2/2
Brick10: 10.70.46.56:/gluster/brick2/2
Brick11: 10.70.47.126:/gluster/brick2/1
Brick12: 10.70.46.135:/gluster/brick2/1
Options Reconfigured:
transport.address-family: inet
performance.readdir-ahead: on
nfs.disable: off

# rpm -qa | grep gluster
glusterfs-libs-3.10dev-0.323.gitdbbd275.el7.centos.x86_64
glusterfs-api-3.10dev-0.323.gitdbbd275.el7.centos.x86_64
glusterfs-fuse-3.10dev-0.323.gitdbbd275.el7.centos.x86_64
glusterfs-resource-agents-3.10dev-0.323.gitdbbd275.el7.centos.noarch
glusterfs-rdma-3.10dev-0.323.gitdbbd275.el7.centos.x86_64
glusterfs-regression-tests-3.10dev-0.323.gitdbbd275.el7.centos.x86_64
glusterfs-3.10dev-0.323.gitdbbd275.el7.centos.x86_64
glusterfs-client-xlators-3.10dev-0.323.gitdbbd275.el7.centos.x86_64
python-gluster-3.10dev-0.323.gitdbbd275.el7.centos.noarch
nfs-ganesha-gluster-2.4.4-4.el7rhgs.x86_64
glusterfs-extra-xlators-3.10dev-0.323.gitdbbd275.el7.centos.x86_64
glusterfs-cli-3.10dev-0.323.gitdbbd275.el7.centos.x86_64
glusterfs-api-devel-3.10dev-0.323.gitdbbd275.el7.centos.x86_64
glusterfs-ganesha-3.10dev-0.323.gitdbbd275.el7.centos.x86_64
glusterfs-geo-replication-3.10dev-0.323.gitdbbd275.el7.centos.x86_64
glusterfs-debuginfo-3.10dev-0.323.gitdbbd275.el7.centos.x86_64
glusterfs-server-3.10dev-0.323.gitdbbd275.el7.centos.x86_64
glusterfs-devel-3.10dev-0.323.gitdbbd275.el7.centos.x86_64
glusterfs-events-3.10dev-0.323.gitdbbd275.el7.centos.x86_64

Niels, with the test build provided in comment #33 I created the EC volume and mounted it over GNFS, then ran the cthon lock test suite. The cthon test passed and no crash was observed. I could not try the scenario mentioned in this bug due to Bug 1444515 - [GNFS+EC] Unable to release the lock when the other client tries to acquire the lock on the same file. But the crash reported in this bug was previously observed even with the cthon lock test suite, which with this test build appears to be fixed.

Rebased patches have been sent for review for the Gluster Community master branch: https://review.gluster.org/#/q/topic:bug-1381970

Once these are merged, I'll backport them to the other branches and the RHGS repository.

Hey Niels, I tried the use case of taking a lock from 2 clients on 1 file on a dist-replicate volume. NFS is crashing.

# rpm -qa | grep gluster
glusterfs-3.8.4-28.el7rhgs.x86_64
glusterfs-server-3.8.4-28.el7rhgs.x86_64
python-gluster-3.8.4-28.el7rhgs.noarch
vdsm-gluster-4.17.33-1.1.el7rhgs.noarch
glusterfs-client-xlators-3.8.4-28.el7rhgs.x86_64
glusterfs-fuse-3.8.4-28.el7rhgs.x86_64
glusterfs-geo-replication-3.8.4-28.el7rhgs.x86_64
gluster-nagios-common-0.2.4-1.el7rhgs.noarch
glusterfs-libs-3.8.4-28.el7rhgs.x86_64
glusterfs-api-3.8.4-28.el7rhgs.x86_64
glusterfs-cli-3.8.4-28.el7rhgs.x86_64
libvirt-daemon-driver-storage-gluster-3.2.0-4.el7.x86_64
gluster-nagios-addons-0.2.9-1.el7rhgs.x86_64
glusterfs-rdma-3.8.4-28.el7rhgs.x86_64

[...] Spurious notify?! [Invalid argument]
[2017-06-19 13:08:46.117375] W [socket.c:595:__socket_rwv] 0-NLM-client: readv on 10.70.37.142:34033 failed (No data available)
[2017-06-19 13:08:46.117529] W [socket.c:595:__socket_rwv] 0-NLM-client: readv on 10.70.37.142:34033 failed (No data available)
pending frames:
frame : type(0) op(0)
frame : type(0) op(0)
patchset: git://git.gluster.com/glusterfs.git
signal received: 11
time of crash: 2017-06-19 13:08:48
configuration details:
argp 1
backtrace 1
dlfcn 1
libpthread 1
llistxattr 1
setfsid 1
spinlock 1
epoll.h 1
xattr.h 1
st_atim.tv_nsec 1
package-string: glusterfs 3.8.4
/lib64/libglusterfs.so.0(_gf_msg_backtrace_nomem+0xc2)[0x7f6ec83b54b2]
/lib64/libglusterfs.so.0(gf_print_trace+0x324)[0x7f6ec83befe4]
/lib64/libc.so.6(+0x35270)[0x7f6ec6a1e270]
/lib64/libc.so.6(+0x165921)[0x7f6ec6b4e921]
/usr/lib64/glusterfs/3.8.4/xlator/nfs/server.so(+0x3f9aa)[0x7f6eba13a9aa]
/usr/lib64/glusterfs/3.8.4/xlator/nfs/server.so(+0x42349)[0x7f6eba13d349]
/lib64/libgfrpc.so.0(rpc_clnt_notify+0x214)[0x7f6ec817eb54]
/lib64/libgfrpc.so.0(rpc_transport_notify+0x23)[0x7f6ec817a9e3]
/usr/lib64/glusterfs/3.8.4/rpc-transport/socket.so(+0x51d7)[0x7f6ebcfa71d7]
/usr/lib64/glusterfs/3.8.4/rpc-transport/socket.so(+0x9918)[0x7f6ebcfab918]
/lib64/libglusterfs.so.0(+0x849d6)[0x7f6ec840f9d6]
/lib64/libpthread.so.0(+0x7e25)[0x7f6ec7214e25]
/lib64/libc.so.6(clone+0x6d)[0x7f6ec6ae134d]

Volume Name: GNFSVol
Type: Distributed-Replicate
Volume ID: 6e809852-79b2-4348-b051-df4d837ab585
Status: Started
Snapshot Count: 0
Number of Bricks: 6 x 2 = 12
Transport-type: tcp
Bricks:
Brick1: dhcp42-1.lab.eng.blr.redhat.com:/gluster/brick1/1
Brick2: dhcp41-251.lab.eng.blr.redhat.com:/gluster/brick1/1
Brick3: dhcp41-248.lab.eng.blr.redhat.com:/gluster/brick1/1
Brick4: dhcp41-245.lab.eng.blr.redhat.com:/gluster/brick1/1
Brick5: dhcp42-1.lab.eng.blr.redhat.com:/gluster/brick2/1
Brick6: dhcp41-251.lab.eng.blr.redhat.com:/gluster/brick2/1
Brick7: dhcp41-248.lab.eng.blr.redhat.com:/gluster/brick2/1
Brick8: dhcp41-245.lab.eng.blr.redhat.com:/gluster/brick2/1
Brick9: dhcp42-1.lab.eng.blr.redhat.com:/gluster/brick3/1
Brick10: dhcp41-251.lab.eng.blr.redhat.com:/gluster/brick3/1
Brick11: dhcp41-248.lab.eng.blr.redhat.com:/gluster/brick3/1
Brick12: dhcp41-245.lab.eng.blr.redhat.com:/gluster/brick3/1
Options Reconfigured:
transport.address-family: inet
nfs.disable: off

I ran the case on a dist-replicate volume because of the already known issues with the EC volume. Based on comment #38, moving it to the assigned state.

Can be easily reproduced with the cthon lock test suite. Earlier in 3.2, GNFS used to crash after completion of the cthon test suite (the test suite itself used to pass). Now GNFS is crashing in the middle of running the cthon test suite, during test 12:

Test #11 - Make sure close() releases the process's locks.
Parent: 11.0 - F_TLOCK [ 0, ENDING] PASSED.
Parent: Closed testfile.
Child: 11.1 - F_TLOCK [ 0, ENDING] PASSED.
Child: 11.2 - F_ULOCK [ 0, ENDING] PASSED.
Parent: 11.3 - F_TLOCK [ 1d, 5b7] PASSED.
Parent: 11.4 - F_TLOCK [ 2000, 57] PASSED.
Parent: Closed testfile.
Child: 11.5 - F_TLOCK [ 0, ENDING] PASSED.
Child: 11.6 - F_ULOCK [ 0, ENDING] PASSED.
Parent: Wrote '123456789abcdef' to testfile [ 0, 16 ].
Parent: 11.7 - F_TLOCK [ 0, ENDING] PASSED.
Parent: Wrote '123456789abcdef' to testfile [ 13, 16 ].
Parent: Closed testfile.
Child: 11.8 - F_TLOCK [ 0, ENDING] PASSED.
Child: 11.9 - F_ULOCK [ 0, ENDING] PASSED.
Parent: Wrote '123456789abcdef' to testfile [ 0, 16 ].
Parent: 11.10 - F_TLOCK [ 0, ENDING] PASSED.
Parent: Truncated testfile.
Parent: Closed testfile.
Child: 11.11 - F_TLOCK [ 0, ENDING] PASSED.
Child: 11.12 - F_ULOCK [ 0, ENDING] PASSED.
Test #12 - Signalled process should release locks.
Child: 12.0 - F_TLOCK [ 0, ENDING] PASSED.
Parent: Killed child process.

Depending on what re-connect and re-try procedure is being done, the following '.nlm4_lockargs' structure does not contain valid information:
xlators/nfs/server/src/nlm4.c:
932 case RPC_CLNT_CONNECT:
933 if (!cs->req) {
934 gf_msg (GF_NLM, GF_LOG_ERROR, EINVAL,
935 NFS_MSG_RPC_CLNT_ERROR, "Spurious notify?!");
936 goto err;
937 }
938
939 caller_name = cs->args.nlm4_lockargs.alock.caller_name;
940 ret = nlm_set_rpc_clnt (rpc_clnt, caller_name);
Further on in nlm_set_rpc_clnt(), a strncmp(caller_name, ...) is done, but caller_name may point to garbage.
Depending on the procedure that is executed with the re-connect, the details will be in any of the 'args' structures:
from nfs3.h:
152 typedef union args_ {
153 nlm4_stat nlm4_stat;
154 nlm4_holder nlm4_holder;
155 nlm4_lock nlm4_lock;
156 nlm4_share nlm4_share;
157 nlm4_testrply nlm4_testrply;
158 nlm4_testres nlm4_testres;
159 nlm4_testargs nlm4_testargs;
160 nlm4_res nlm4_res;
161 nlm4_lockargs nlm4_lockargs;
162 nlm4_cancargs nlm4_cancargs;
163 nlm4_unlockargs nlm4_unlockargs;
164 nlm4_shareargs nlm4_shareargs;
165 nlm4_shareres nlm4_shareres;
166 nlm4_freeallargs nlm4_freeallargs;
...
171 } args;
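As a hedged sketch of what selecting the correct caller_name for a retried procedure could look like: the types below are simplified stand-ins for the structures quoted above, the member layout is assumed, and this is not the actual upstream fix (the posted patches, listed in the next comment, rework the call-state tracking instead).

```c
#include <stdio.h>

/* Simplified stand-ins for the NLM argument structures; only the
 * caller_name-carrying members are modelled here. */
typedef struct { char *caller_name; } nlm4_lock;
typedef struct { nlm4_lock alock; } nlm4_lockargs;
typedef struct { nlm4_lock alock; } nlm4_cancargs;
typedef struct { nlm4_lock alock; } nlm4_unlockargs;
typedef struct { nlm4_lock alock; } nlm4_testargs;

typedef union {
    nlm4_lockargs   nlm4_lockargs;
    nlm4_cancargs   nlm4_cancargs;
    nlm4_unlockargs nlm4_unlockargs;
    nlm4_testargs   nlm4_testargs;
} args;

/* Procedure numbers, matching the nlm4.h excerpt in the next comment. */
#define NLM4_TEST   1
#define NLM4_LOCK   2
#define NLM4_CANCEL 3
#define NLM4_UNLOCK 4

/* Pick the caller_name for the procedure that is being retried on
 * reconnect. Returning NULL for procedures that carry no lock argument
 * lets the notify path skip nlm_set_rpc_clnt() instead of comparing
 * against garbage. */
static const char *nlm_caller_name_for_proc(int procnum, args *a)
{
    switch (procnum) {
    case NLM4_LOCK:   return a->nlm4_lockargs.alock.caller_name;
    case NLM4_CANCEL: return a->nlm4_cancargs.alock.caller_name;
    case NLM4_UNLOCK: return a->nlm4_unlockargs.alock.caller_name;
    case NLM4_TEST:   return a->nlm4_testargs.alock.caller_name;
    default:          return NULL;
    }
}

int main(void)
{
    args a = { .nlm4_lockargs = { .alock = { .caller_name = "client-a" } } };

    printf("NLM4_LOCK caller: %s\n", nlm_caller_name_for_proc(NLM4_LOCK, &a));
    printf("unknown proc:     %p\n", (void *)nlm_caller_name_for_proc(99, &a));
    return 0;
}
```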
It seems that the procedure that is executed can only be accessed through cs->req->procnum. The value will need to be matched against these from nlm4.h:

29 #define NLM4_NULL 0
30 #define NLM4_TEST 1
31 #define NLM4_LOCK 2
32 #define NLM4_CANCEL 3
33 #define NLM4_UNLOCK 4
34 #define NLM4_GRANTED 5
35 #define NLM4_TEST_MSG 6
36 #define NLM4_LOCK_MSG 7
37 #define NLM4_CANCEL_MSG 8
38 #define NLM4_UNLOCK_MSG 9
39 #define NLM4_GRANTED_MSG 10
40 #define NLM4_TEST_RES 11
41 #define NLM4_LOCK_RES 12
42 #define NLM4_CANCEL_RES 13
43 #define NLM4_UNLOCK_RES 14
44 #define NLM4_GRANTED_RES 15
45 #define NLM4_SM_NOTIFY 16
46 #define NLM4_SEVENTEEN 17
47 #define NLM4_EIGHTEEN 18
48 #define NLM4_NINETEEN 19
49 #define NLM4_SHARE 20
50 #define NLM4_UNSHARE 21
51 #define NLM4_NM_LOCK 22
52 #define NLM4_FREE_ALL 23
53 #define NLM4_PROC_COUNT 24

Not all of these are implemented, so the 'union args_' from the previous comment will need to be used for reference.

Posted changes for upstream:

- nfs: make nfs3_call_state_t refcounted - https://review.gluster.org/17696
- nfs/nlm: unref fds in nlm_client_free() - https://review.gluster.org/17697
- nfs/nlm: handle reconnect for non-NLM4_LOCK requests - https://review.gluster.org/17698
- nfs/nlm: use refcounting for nfs3_call_state_t - https://review.gluster.org/17699
- nfs/nlm: keep track of the call-state and frame for notifications - https://review.gluster.org/17700

Is it possible to merge these bugfixes into glusterfs 3.10 also?

(In reply to Jules from comment #53)
> Is it possible to merge these bugfixes into glusterfs 3.10 also?

These changes are being backported through bug 1471870. Note that this is a bug for Red Hat Gluster Storage, and not the Gluster Community project.

Verified this bug on glusterfs-fuse-3.8.4-35.el7rhgs.x86_64.

Use Case 1:
1. Create a 2 x (4 + 2) Distributed-Disperse volume.
2. Export the volume via GNFS.
3. Mount the volume on 2 clients.
4. Create a 1G file f1 from client 1.
5. Take a lock on file f1 from client 1 -> lock granted.
6. Try taking a lock from the 2nd client on the same file -> failed to get the lock, as it is held by client 1.
7. Release the lock from client 1 on file f1 -> the lock is released by client 1 and granted to client 2.
8. Again try taking the lock from client 1 -> lock is granted. (This issue is reported in bug https://bugzilla.redhat.com/show_bug.cgi?id=1411338.)

Use Case 2:
1. Create a 2 x (4 + 2) Distributed-Disperse volume.
2. Export the volume via GNFS.
3. Mount the volume on 1 client.
4. Run the cthon lock test suite.

No GNFS crash was observed; moving this bug to the verified state.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2017:2774