Bug 1421759 - Gluster NFS server crashing in __mnt3svc_umountall
Summary: Gluster NFS server crashing in __mnt3svc_umountall
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: GlusterFS
Classification: Community
Component: nfs
Version: mainline
Hardware: All
OS: All
Priority: urgent
Severity: urgent
Target Milestone: ---
Assignee: Soumya Koduri
QA Contact:
URL:
Whiteboard:
Depends On:
Blocks: 1315544 1422391 1422392 1422394 1422404
 
Reported: 2017-02-13 15:38 UTC by Raghavendra Bhat
Modified: 2017-11-15 06:38 UTC
CC: 18 users

Fixed In Version: glusterfs-3.11.0
Doc Type: If docs needed, set a value
Doc Text:
Clone Of: 1315544
Clones: 1422391
Environment:
Last Closed: 2017-05-30 18:43:01 UTC
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Embargoed:



Description Raghavendra Bhat 2017-02-13 15:38:27 UTC
+++ This bug was initially created as a clone of Bug #1315544 +++

Description of problem:
[2016-02-17 21:51:29.261660] E [mem-pool.c:305:__gf_free] (-->/usr/lib64/glusterfs/3.7.1/xlator/nfs/server.so(__mnt3svc_umountall+0x63) [0x7f0b67db3eb3] -->/usr/lib64/libglusterfs.so.0(dict_del+0xd7) [0x7f0b7a0ccb77] -->/usr/lib64/libglusterfs.so.0(__gf_free+0xb4) [0x7f0b7a108dd4] ) 0-: Assertion failed: GF_MEM_HEADER_MAGIC == header->magic
.....
signal received: 6
.....
package-string: glusterfs 3.7.1
/usr/lib64/libglusterfs.so.0(_gf_msg_backtrace_nomem+0xb6)[0x7f0b7a0d28e6]
/usr/lib64/libglusterfs.so.0(gf_print_trace+0x33f)[0x7f0b7a0f24af]
/lib64/libc.so.6(+0x39a80326a0)[0x7f0b78a716a0]
/lib64/libc.so.6(gsignal+0x35)[0x7f0b78a71625]
/lib64/libc.so.6(abort+0x175)[0x7f0b78a72e05]
/lib64/libc.so.6(+0x39a8070537)[0x7f0b78aaf537]
/lib64/libc.so.6(+0x39a8075e66)[0x7f0b78ab4e66]
/usr/lib64/libglusterfs.so.0(dict_del+0xd7)[0x7f0b7a0ccb77]
/usr/lib64/glusterfs/3.7.1/xlator/nfs/server.so(__mnt3svc_umountall+0x63)[0x7f0b67db3eb3]
/usr/lib64/glusterfs/3.7.1/xlator/nfs/server.so(mnt3svc_umountall+0x34)[0x7f0b67db3f14]
/usr/lib64/glusterfs/3.7.1/xlator/nfs/server.so(mnt3svc_umntall+0x2c)[0x7f0b67db47dc]
/usr/lib64/libglusterfs.so.0(synctask_wrap+0x12)[0x7f0b7a11a2b2]
/lib64/libc.so.6(+0x39a80438f0)[0x7f0b78a828f0]
---------
=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
[2016-02-22 21:54:29.319978] E [mem-pool.c:305:__gf_free] (-->/usr/lib64/glusterfs/3.7.1/xlator/nfs/server.so(__mnt3svc_umountall+0x63) [0x7f3dd0488eb3] -->/usr/lib64/libglusterfs.so.0(dict_del+0xd7) [0x7f3dde701b77] -->/usr/lib64/libglusterfs.so.0(__gf_free+0xb4) [0x7f3dde73ddd4] ) 0-: Assertion failed: GF_MEM_HEADER_MAGIC == header->magic
pending frames: 
.....
signal received: 11
.....
package-string: glusterfs 3.7.1
/usr/lib64/libglusterfs.so.0(_gf_msg_backtrace_nomem+0xb6)[0x7f51e5f8d8e6]
/usr/lib64/libglusterfs.so.0(gf_print_trace+0x33f)[0x7f51e5fad4af]
/lib64/libc.so.6(+0x39a80326a0)[0x7f51e492c6a0]
/usr/lib64/libglusterfs.so.0(__gf_free+0xc5)[0x7f51e5fc3de5]
/usr/lib64/libglusterfs.so.0(dict_del+0xd7)[0x7f51e5f87b77]
/usr/lib64/glusterfs/3.7.1/xlator/nfs/server.so(__mnt3svc_umountall+0x63)[0x7f51d3b95eb3]
/usr/lib64/glusterfs/3.7.1/xlator/nfs/server.so(mnt3svc_umountall+0x34)[0x7f51d3b95f14]
/usr/lib64/glusterfs/3.7.1/xlator/nfs/server.so(mnt3svc_umntall+0x2c)[0x7f51d3b967dc]
/usr/lib64/libglusterfs.so.0(synctask_wrap+0x12)[0x7f51e5fd52b2]
/lib64/libc.so.6(+0x39a80438f0)[0x7f51e493d8f0]
---------
=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
pending frames:
.....
signal received: 11
.....
package-string: glusterfs 3.7.1
/usr/lib64/libglusterfs.so.0(_gf_msg_backtrace_nomem+0xb6)[0x7f03e05ed8e6]
/usr/lib64/libglusterfs.so.0(gf_print_trace+0x33f)[0x7f03e060d4af]
/lib64/libc.so.6(+0x39a80326a0)[0x7f03def8c6a0]
/lib64/libpthread.so.0(pthread_spin_lock+0x0)[0x7f03df6dd400]
/usr/lib64/libglusterfs.so.0(mem_put+0x4b)[0x7f03e062402b]
/usr/lib64/libglusterfs.so.0(dict_del+0xec)[0x7f03e05e7b8c]
/usr/lib64/glusterfs/3.7.1/xlator/nfs/server.so(__mnt3svc_umountall+0x63)[0x7f03d236eeb3]
/usr/lib64/glusterfs/3.7.1/xlator/nfs/server.so(mnt3svc_umountall+0x34)[0x7f03d236ef14]
/usr/lib64/glusterfs/3.7.1/xlator/nfs/server.so(mnt3svc_umntall+0x2c)[0x7f03d236f7dc]
/usr/lib64/libglusterfs.so.0(synctask_wrap+0x12)[0x7f03e06352b2]
/lib64/libc.so.6(+0x39a80438f0)[0x7f03def9d8f0]
---------
=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=

Version-Release number of selected component (if applicable):
Red Hat Enterprise Linux Server release 6.7 (Santiago)
glusterfs-3.7.1-11.el6rhs.x86_64

How reproducible:
easily/repeatably

Steps to Reproduce:
1. Run NFS

Actual results:
Crashes in __mnt3svc_umountall

Expected results:
No crashes


From the coredump:

(gdb) f 6
#6  0x00007f4658e65eb3 in __mountdict_remove (ms=0x7f465437fcf0) at mount3.c:213
213             dict_del (ms->mountdict, me->hashkey);
(gdb) l
208      * Not for external use.
209      */
210     inline void
211     __mountdict_remove (struct mount3_state *ms, struct mountentry *me)
212     {
213             dict_del (ms->mountdict, me->hashkey);
214     }
215
216     /* Generic error reply function, just pass the err status
217      * and it will do the rest, including transmission.
(gdb) p *me
$3 = {
  mlist = {
    next = 0xbabebabe, 
    prev = 0xcafecafe
  }, 
  exname = "/puppet_dev", '\000' <repeats 1012 times>, 
  hostname = "10.9.156.184", '\000' <repeats 1011 times>, 
  fullpath = '\000' <repeats 1023 times>, 
  has_full_path = _gf_false, 
  hashkey = '\000' <repeats 2049 times>
}


This shows that *me has been freed already. I'm not sure yet how this can happen; maybe the NFS client had mounted the export twice, but it was only contained once in the dict?

Continuing to look into this, and still waiting for usable steps to reproduce.


[2016-05-13 08:59:41.472502] C [mem-pool.c:571:mem_put] (-->/usr/lib64/glusterfs/3.7.1/xlator/nfs/server.so(__mnt3svc_umountall+0x63) [0x7f3c83db3eb3] -->/usr/lib64/libglusterfs.so.0(dict_del+0xec) [0x7f3c96233bac] -->/usr/lib64/libglusterfs.so.0(mem_put+0xf6) [0x7f3c96270146] ) 0-mem-pool: mem_put called on freed ptr 0x7f3c936f2a34 of mem pool 0x7f3c96e59a20
pending frames:
frame : type(0) op(0)
frame : type(0) op(0)
frame : type(0) op(0)
frame : type(0) op(0)
frame : type(0) op(0)
frame : type(0) op(0)
patchset: git://git.gluster.com/glusterfs.git
signal received: 6
time of crash:
2016-05-13 08:59:41
configuration details:
argp 1
backtrace 1
dlfcn 1
libpthread 1
llistxattr 1
setfsid 1
spinlock 1
epoll.h 1
xattr.h 1
st_atim.tv_nsec 1
package-string: glusterfs 3.7.1
/usr/lib64/libglusterfs.so.0(_gf_msg_backtrace_nomem+0xb6)[0x7f3c962398b6]
/usr/lib64/libglusterfs.so.0(gf_print_trace+0x33f)[0x7f3c9625947f]
/lib64/libc.so.6(+0x32500326a0)[0x7f3c94bd76a0]
/lib64/libc.so.6(gsignal+0x35)[0x7f3c94bd7625]
/lib64/libc.so.6(abort+0x175)[0x7f3c94bd8e05]
/lib64/libc.so.6(+0x3250070537)[0x7f3c94c15537]
/lib64/libc.so.6(+0x3250075f4e)[0x7f3c94c1af4e]
/lib64/libc.so.6(+0x3250079066)[0x7f3c94c1e066]
/usr/lib64/libglusterfs.so.0(fd_unref+0xba)[0x7f3c9626e17a]
/usr/lib64/glusterfs/3.7.1/xlator/protocol/client.so(client_local_wipe+0x28)[0x7f3c88b0e5e8]
/usr/lib64/glusterfs/3.7.1/xlator/protocol/client.so(client3_3_readv_cbk+0x4bd)[0x7f3c88b2940d]
/usr/lib64/libgfrpc.so.0(rpc_clnt_handle_reply+0xa5)[0x7f3c96007455]
/usr/lib64/libgfrpc.so.0(rpc_clnt_notify+0x1a1)[0x7f3c96008981]
/usr/lib64/libgfrpc.so.0(rpc_transport_notify+0x28)[0x7f3c96003ad8]
/usr/lib64/glusterfs/3.7.1/rpc-transport/socket.so(+0xa265)[0x7f3c8ad69265]
/usr/lib64/glusterfs/3.7.1/rpc-transport/socket.so(+0xbe3d)[0x7f3c8ad6ae3d]
/usr/lib64/libglusterfs.so.0(+0x89b20)[0x7f3c9629db20]
/lib64/libpthread.so.0(+0x3250407a51)[0x7f3c95323a51]
/lib64/libc.so.6(clone+0x6d)[0x7f3c94c8d93d]
---------
From node 9:

[2016-05-12 11:46:25.718617] E [mem-pool.c:305:__gf_free] (-->/usr/lib64/glusterfs/3.7.1/xlator/nfs/server.so(__mnt3svc_umountall+0x63) [0x7f3a5c825eb3] -->/usr/lib64/libglusterfs.so.0(dict_del+0xd7) [0x7f3a6aca6b97] -->/usr/lib64/libglusterfs.so.0(__gf_free+0xb4) [0x7f3a6ace2e44] ) 0-: Assertion failed: GF_MEM_HEADER_MAGIC == header->magic
pending frames:
frame : type(0) op(0)
frame : type(0) op(0)
patchset: git://git.gluster.com/glusterfs.git
signal received: 6
time of crash:
2016-05-12 11:46:25
configuration details:
argp 1
backtrace 1
dlfcn 1
libpthread 1
llistxattr 1
setfsid 1
spinlock 1
epoll.h 1
xattr.h 1
st_atim.tv_nsec 1
package-string: glusterfs 3.7.1
/usr/lib64/libglusterfs.so.0(_gf_msg_backtrace_nomem+0xb6)[0x7f3a6acac8b6]
/usr/lib64/libglusterfs.so.0(gf_print_trace+0x33f)[0x7f3a6accc47f]
/lib64/libc.so.6(+0x32500326a0)[0x7f3a6964a6a0]
/lib64/libc.so.6(gsignal+0x35)[0x7f3a6964a625]
/lib64/libc.so.6(abort+0x175)[0x7f3a6964be05]
/lib64/libc.so.6(+0x3250070537)[0x7f3a69688537]
/lib64/libc.so.6(+0x3250075f4e)[0x7f3a6968df4e]
/lib64/libc.so.6(+0x3250078cad)[0x7f3a69690cad]
/usr/lib64/libglusterfs.so.0(dict_del+0xd7)[0x7f3a6aca6b97]
/usr/lib64/glusterfs/3.7.1/xlator/nfs/server.so(__mnt3svc_umountall+0x63)[0x7f3a5c825eb3]
/usr/lib64/glusterfs/3.7.1/xlator/nfs/server.so(mnt3svc_umountall+0x34)[0x7f3a5c825f14]
/usr/lib64/glusterfs/3.7.1/xlator/nfs/server.so(mnt3svc_umntall+0x2c)[0x7f3a5c8267dc]
/usr/lib64/libglusterfs.so.0(synctask_wrap+0x12)[0x7f3a6acf4332]
/lib64/libc.so.6(+0x32500438f0)[0x7f3a6965b8f0]
---------
The latest crash from node 8:

[2016-05-12 10:49:54.013722] E [mem-pool.c:305:__gf_free] (-->/usr/lib64/glusterfs/3.7.1/xlator/nfs/server.so(__mnt3svc_umountall+0x63) [0x7f9e2e13ceb3] -->/usr/lib64/libglusterfs.so.0(dict_del+0xd7) [0x7f9e3c3bab97] -->/usr/lib64/libglusterfs.so.0(__gf_free+0xb4) [0x7f9e3c3f6e44] ) 0-: Assertion failed: GF_MEM_HEADER_MAGIC == header->magic
pending frames:
frame : type(0) op(0)
frame : type(0) op(0)
patchset: git://git.gluster.com/glusterfs.git
signal received: 6
time of crash:
2016-05-12 10:49:54
configuration details:
argp 1
backtrace 1
dlfcn 1
libpthread 1
llistxattr 1
setfsid 1
spinlock 1
epoll.h 1
xattr.h 1
st_atim.tv_nsec 1
package-string: glusterfs 3.7.1
/usr/lib64/libglusterfs.so.0(_gf_msg_backtrace_nomem+0xb6)[0x7f9e3c3c08b6]
/usr/lib64/libglusterfs.so.0(gf_print_trace+0x33f)[0x7f9e3c3e047f]
/lib64/libc.so.6(+0x32500326a0)[0x7f9e3ad5e6a0]
/lib64/libc.so.6(gsignal+0x35)[0x7f9e3ad5e625]
/lib64/libc.so.6(abort+0x175)[0x7f9e3ad5fe05]
/lib64/libc.so.6(+0x3250070537)[0x7f9e3ad9c537]
/lib64/libc.so.6(+0x3250075f4e)[0x7f9e3ada1f4e]
/lib64/libc.so.6(+0x3250078cad)[0x7f9e3ada4cad]
/usr/lib64/libglusterfs.so.0(dict_del+0xd7)[0x7f9e3c3bab97]
/usr/lib64/glusterfs/3.7.1/xlator/nfs/server.so(__mnt3svc_umountall+0x63)[0x7f9e2e13ceb3]
/usr/lib64/glusterfs/3.7.1/xlator/nfs/server.so(mnt3svc_umountall+0x34)[0x7f9e2e13cf14]
/usr/lib64/glusterfs/3.7.1/xlator/nfs/server.so(mnt3svc_umntall+0x2c)[0x7f9e2e13d7dc]
/usr/lib64/libglusterfs.so.0(synctask_wrap+0x12)[0x7f9e3c408332]
/lib64/libc.so.6(+0x32500438f0)[0x7f9e3ad6f8f0]
---------

--- Additional comment from Oonkwee Lim_ on 2016-05-13 20:33:21 EDT ---

Not sure whether the following will give more clues:

Below are some old crashes with SIGSEGV (11) instead of SIGABRT (6):

Node 8:
[2016-03-17 13:32:09.891045] E [MSGID: 112006] [nfs3.c:1571:nfs3svc_lookup] 0-nfs-nfsv3: Error decoding args
The message "E [MSGID: 112006] [nfs3.c:4576:nfs3svc_readdirp] 0-nfs-nfsv3: Error decoding args" repeated 3 times between [2016-03-17 13:31:34.966156] and [2016-03-17 13:32:09.892507]
[2016-03-17 13:32:29.714812] E [mem-pool.c:305:__gf_free] (-->/usr/lib64/glusterfs/3.7.1/xlator/nfs/server.so(__mnt3svc_umountall+0x63) [0x7f9389716eb3] -->/usr/lib64/libglusterfs.so.0(dict_del+0xd7) [0x7f9397994b97] -->/usr/lib64/libglusterfs.so.0(__gf_free+0xb4) [0x7f93979d0e44] ) 0-: Assertion failed: GF_MEM_HEADER_MAGIC == header->magic
[2016-03-17 13:32:29.714990] E [mem-pool.c:314:__gf_free] (-->/usr/lib64/glusterfs/3.7.1/xlator/nfs/server.so(__mnt3svc_umountall+0x63) [0x7f9389716eb3] -->/usr/lib64/libglusterfs.so.0(dict_del+0xd7) [0x7f9397994b97] -->/usr/lib64/libglusterfs.so.0(__gf_free+0x114) [0x7f93979d0ea4] ) 0-: Assertion failed: GF_MEM_TRAILER_MAGIC == *(uint32_t *)((char *)free_ptr + header->size)
pending frames:
frame : type(0) op(0)
frame : type(0) op(0)
patchset: git://git.gluster.com/glusterfs.git
signal received: 11
time of crash:
2016-03-17 13:32:29
configuration details:
argp 1
backtrace 1
dlfcn 1
libpthread 1
llistxattr 1
setfsid 1
spinlock 1
epoll.h 1
xattr.h 1
st_atim.tv_nsec 1
package-string: glusterfs 3.7.1
/usr/lib64/libglusterfs.so.0(_gf_msg_backtrace_nomem+0xb6)[0x7f939799a8b6]
/usr/lib64/libglusterfs.so.0(gf_print_trace+0x33f)[0x7f93979ba47f]
/lib64/libc.so.6(+0x32500326a0)[0x7f93963386a0]
/lib64/libpthread.so.0(pthread_spin_lock+0x0)[0x7f9396a89400]
/usr/lib64/libglusterfs.so.0(__gf_free+0x125)[0x7f93979d0eb5]
/usr/lib64/libglusterfs.so.0(dict_del+0xd7)[0x7f9397994b97]
/usr/lib64/glusterfs/3.7.1/xlator/nfs/server.so(__mnt3svc_umountall+0x63)[0x7f9389716eb3]
/usr/lib64/glusterfs/3.7.1/xlator/nfs/server.so(mnt3svc_umountall+0x34)[0x7f9389716f14]
/usr/lib64/glusterfs/3.7.1/xlator/nfs/server.so(mnt3svc_umntall+0x2c)[0x7f93897177dc]
/usr/lib64/libglusterfs.so.0(synctask_wrap+0x12)[0x7f93979e2332]
/lib64/libc.so.6(+0x32500438f0)[0x7f93963498f0]
---------
[2016-04-05 10:54:31.535218] W [MSGID: 114031] [client-rpc-fops.c:2666:client3_3_readdirp_cbk] 0-Broadcasting-client-10: remote operation failed [Permission denied]
[2016-04-05 10:54:31.527183] E [MSGID: 114031] [client-rpc-fops.c:2883:client3_3_opendir_cbk] 0-Broadcasting-client-0: remote operation failed. Path: / (00000000-0000-0000-0000-000000000001) [Permission denied]
[2016-04-05 10:54:31.527225] E [MSGID: 114031] [client-rpc-fops.c:2883:client3_3_opendir_cbk] 0-Broadcasting-client-1: remote operation failed. Path: / (00000000-0000-0000-0000-000000000001) [Permission denied]
[2016-04-05 10:54:31.527273] E [MSGID: 114031] [client-rpc-fops.c:2883:client3_3_opendir_cbk] 0-Broadcasting-client-2: remote operation failed. Path: / (00000000-0000-0000-0000-000000000001) [Permission denied]
[2016-04-05 10:54:31.535730] W [MSGID: 114031] [client-rpc-fops.c:2666:client3_3_readdirp_cbk] 0-Broadcasting-client-11: remote operation failed [Permission denied]
pending frames:
frame : type(0) op(0)
frame : type(0) op(0)
patchset: git://git.gluster.com/glusterfs.git
signal received: 11
time of crash:
2016-04-05 10:54:46
configuration details:
argp 1
backtrace 1
dlfcn 1
libpthread 1
llistxattr 1
setfsid 1
spinlock 1
epoll.h 1
xattr.h 1
st_atim.tv_nsec 1
package-string: glusterfs 3.7.1
/usr/lib64/libglusterfs.so.0(_gf_msg_backtrace_nomem+0xb6)[0x7fe8fa0718b6]
/usr/lib64/libglusterfs.so.0(gf_print_trace+0x33f)[0x7fe8fa09147f]
/lib64/libc.so.6(+0x32500326a0)[0x7fe8f8a0f6a0]
/usr/lib64/libglusterfs.so.0(__gf_free+0xc5)[0x7fe8fa0a7e55]
/usr/lib64/libglusterfs.so.0(dict_del+0xd7)[0x7fe8fa06bb97]
/usr/lib64/glusterfs/3.7.1/xlator/nfs/server.so(__mnt3svc_umountall+0x63)[0x7fe8e7b95eb3]
/usr/lib64/glusterfs/3.7.1/xlator/nfs/server.so(mnt3svc_umountall+0x34)[0x7fe8e7b95f14]
/usr/lib64/glusterfs/3.7.1/xlator/nfs/server.so(mnt3svc_umntall+0x2c)[0x7fe8e7b967dc]
/usr/lib64/libglusterfs.so.0(synctask_wrap+0x12)[0x7fe8fa0b9332]
/lib64/libc.so.6(+0x32500438f0)[0x7fe8f8a208f0]
---------
Node 9:

[2016-03-31 16:12:22.335497] W [rpcsvc.c:278:rpcsvc_program_actor] 0-rpc-service: RPC program version not available (req 100003 2) for 10.69.41.19:49261
[2016-03-31 16:12:22.335519] E [rpcsvc.c:565:rpcsvc_check_and_reply_error] 0-rpcsvc: rpc actor failed to complete successfully
The message "E [MSGID: 112006] [nfs3.c:934:nfs3svc_getattr] 0-nfs-nfsv3: Error decoding args" repeated 5 times between [2016-03-31 16:11:36.699818] and [2016-03-31 16:11:45.494469]
pending frames:
frame : type(0) op(0)
frame : type(0) op(0)
frame : type(0) op(0)
frame : type(0) op(0)
patchset: git://git.gluster.com/glusterfs.git
signal received: 11
time of crash:
2016-03-31 16:12:26
configuration details:
argp 1
backtrace 1
dlfcn 1
libpthread 1
llistxattr 1
setfsid 1
spinlock 1
epoll.h 1
xattr.h 1
st_atim.tv_nsec 1
package-string: glusterfs 3.7.1
[2016-03-31 16:12:26.516704] C [mem-pool.c:571:mem_put] (-->/usr/lib64/glusterfs/3.7.1/xlator/nfs/server.so(__mnt3svc_umountall+0x63) [0x7fad4f71ceb3] -->/usr/lib64/libglusterfs.so.0(dict_del+0xec) [0x7fad61afcbac] -->/usr/lib64/libglusterfs.so.0(mem_put+0xf6) [0x7fad61b39146] ) 0-mem-pool: mem_put called on freed ptr 0x7fad5efc629c of mem pool 0x7fad63814a20
/usr/lib64/libglusterfs.so.0(_gf_msg_backtrace_nomem+0xb6)[0x7fad61b028b6]
/usr/lib64/libglusterfs.so.0(gf_print_trace+0x33f)[0x7fad61b2247f]
/lib64/libc.so.6(+0x32500326a0)[0x7fad604a06a0]
/usr/lib64/libglusterfs.so.0(__gf_free+0xc5)[0x7fad61b38e55]
/usr/lib64/libglusterfs.so.0(loc_wipe+0x27)[0x7fad61b00067]
/usr/lib64/glusterfs/3.7.1/xlator/cluster/replicate.so(afr_local_cleanup+0x4d)[0x7fad541a474d]
/usr/lib64/glusterfs/3.7.1/xlator/cluster/replicate.so(afr_stat_cbk+0xf7)[0x7fad54179f67]
/usr/lib64/glusterfs/3.7.1/xlator/protocol/client.so(client3_3_stat_cbk+0x3df)[0x7fad543f7b4f]
/usr/lib64/libgfrpc.so.0(rpc_clnt_handle_reply+0xa5)[0x7fad618d0455]
/usr/lib64/libgfrpc.so.0(rpc_clnt_notify+0x1a1)[0x7fad618d1981]
/usr/lib64/libgfrpc.so.0(rpc_transport_notify+0x28)[0x7fad618ccad8]
/usr/lib64/glusterfs/3.7.1/rpc-transport/socket.so(+0xa265)[0x7fad56632265]
/usr/lib64/glusterfs/3.7.1/rpc-transport/socket.so(+0xbe3d)[0x7fad56633e3d]
/usr/lib64/libglusterfs.so.0(+0x89b20)[0x7fad61b66b20]
/lib64/libpthread.so.0(+0x3250407a51)[0x7fad60beca51]
/lib64/libc.so.6(clone+0x6d)[0x7fad6055693d]
---------

--- Additional comment from Niels de Vos on 2016-05-14 08:56:27 EDT ---

- Do you know the steps to reproduce this? Could you reproduce this yourself?
- Are there any coredumps that got captured?

Looking at the backtraces, this looks more like a glibc issue.

# file *
160-core.23327.1463050194.dump.1: ELF 64-bit LSB core file x86-64, version 1 (SYSV), SVR4-style, from '/usr/sbin/glusterfs -s localhost --volfile-id gluster/nfs -p /var/lib/glusterd/'
200-core.34740.1463053585.dump.1: ELF 64-bit LSB core file x86-64, version 1 (SYSV), SVR4-style, from '/usr/sbin/glusterfs -s localhost --volfile-id gluster/nfs -p /var/lib/glusterd/'

# gdb -c 160-core.23327.1463050194.dump.1  /usr/sbin/glusterfs

(gdb) bt
#0  0x00007f9e3ad5e625 in raise (sig=<value optimized out>) at ../nptl/sysdeps/unix/sysv/linux/raise.c:64
#1  0x00007f9e3ad5fe05 in abort () at abort.c:92
#2  0x00007f9e3ad9c537 in __libc_message (do_abort=2, fmt=0x7f9e3ae848c0 "*** glibc detected *** %s: %s: 0x%s ***\n") at ../sysdeps/unix/sysv/linux/libc_fatal.c:198
#3  0x00007f9e3ada1f4e in malloc_printerr (action=3, str=0x7f9e3ae829ae "free(): invalid pointer", ptr=<value optimized out>, ar_ptr=<value optimized out>) at malloc.c:6350
#4  0x00007f9e3ada4cad in _int_free (av=0x7f9e3b0bbe80, p=0x7f9e107bee10, have_lock=0) at malloc.c:4836
#5  0x00007f9e3c3bab97 in dict_del (this=0x7f9e39978e14, key=0x7f9e04001534 "") at dict.c:528
#6  0x00007f9e2e13ceb3 in __mountdict_remove (ms=0x7f9e28198f30) at mount3.c:213
#7  __mnt3svc_umountall (ms=0x7f9e28198f30) at mount3.c:2508
#8  0x00007f9e2e13cf14 in mnt3svc_umountall (ms=0x7f9e28198f30) at mount3.c:2527
#9  0x00007f9e2e13d7dc in mnt3svc_umntall (req=0x7f9e2da61bd4) at mount3.c:2553
#10 0x00007f9e3c408332 in synctask_wrap (old_task=<value optimized out>) at syncop.c:381
#11 0x00007f9e3ad6f8f0 in ?? () from /lib64/libc.so.6
#12 0x0000000000000000 in ?? ()
(gdb) quit

# gdb -c 200-core.34740.1463053585.dump.1 /usr/sbin/glusterfs

(gdb) bt
#0  0x00007f3a6964a625 in raise (sig=<value optimized out>) at ../nptl/sysdeps/unix/sysv/linux/raise.c:64
#1  0x00007f3a6964be05 in abort () at abort.c:92
#2  0x00007f3a69688537 in __libc_message (do_abort=2, fmt=0x7f3a697708c0 "*** glibc detected *** %s: %s: 0x%s ***\n") at ../sysdeps/unix/sysv/linux/libc_fatal.c:198
#3  0x00007f3a6968df4e in malloc_printerr (action=3, str=0x7f3a6976e9ae "free(): invalid pointer", ptr=<value optimized out>, ar_ptr=<value optimized out>) at malloc.c:6350
#4  0x00007f3a69690cad in _int_free (av=0x7f3a699a7e80, p=0x7f3a3c03e030, have_lock=0) at malloc.c:4836
#5  0x00007f3a6aca6b97 in dict_del (this=0x7f3a68264e14, key=0x7f3a34001534 "") at dict.c:528
#6  0x00007f3a5c825eb3 in __mountdict_remove (ms=0x7f3a58198f30) at mount3.c:213
#7  __mnt3svc_umountall (ms=0x7f3a58198f30) at mount3.c:2508
#8  0x00007f3a5c825f14 in mnt3svc_umountall (ms=0x7f3a58198f30) at mount3.c:2527
#9  0x00007f3a5c8267dc in mnt3svc_umntall (req=0x7f3a5c13c3d8) at mount3.c:2553
#10 0x00007f3a6acf4332 in synctask_wrap (old_task=<value optimized out>) at syncop.c:381
#11 0x00007f3a6965b8f0 in ?? () from /lib64/libc.so.6
#12 0x0000000000000000 in ?? ()

--- Additional comment from Bipin Kunal on 2016-05-16 11:09:21 EDT ---

It looks similar to these bugs:
https://bugzilla.redhat.com/show_bug.cgi?id=1027101
https://bugzilla.redhat.com/show_bug.cgi?id=1320423

According to Bug #1027101 the issue has been fixed in glibc-2.12-1.137.el6, and the customer is using glibc-2.12-1.166.el6_7.3.x86_64. So it may not be the same issue.

--- Additional comment from Bipin Kunal on 2016-05-16 11:54:48 EDT ---

This doesn't seem to be a glibc bug anymore. It seems to be a gluster issue.

(gdb) p *pair
$6 = {
  hash_next = 0x7f9e39879efc, 
  prev = 0x0, 
  next = 0x7f9e39879efc, 
  value = 0x7f9e397a5358, 
  key = 0x7f9e107bee60 ""  <---
}
(gdb) f 6
#6  0x00007f9e2e13ceb3 in __mountdict_remove (ms=0x7f9e28198f30) at mount3.c:213
213	        dict_del (ms->mountdict, me->hashkey);
(gdb) l
208	 * Not for external use.
209	 */
210	inline void
211	__mountdict_remove (struct mount3_state *ms, struct mountentry *me)
212	{
213	        dict_del (ms->mountdict, me->hashkey);
214	}
215	
216	/* Generic error reply function, just pass the err status
217	 * and it will do the rest, including transmission.
(gdb) p *me
$7 = {
  mlist = {
    next = 0xbabebabe, 
    prev = 0xcafecafe
  }, 
  exname = "/vCDN", '\000' <repeats 1018 times>, 
  hostname = "10.69.37.51", '\000' <repeats 1012 times>, 
  fullpath = '\000' <repeats 1023 times>,    <----
  has_full_path = _gf_false, 
  hashkey = '\000' <repeats 2049 times>
}

This shows that *me has been freed already. Same as comment #3.

--- Additional comment from Niels de Vos on 2016-05-17 06:13:14 EDT ---

This crash happened while a client sent a UMNTALL procedure to the NFS-server. I am not sure whether a Linux client actually implements UMNTALL, or whether it sends several UMNT calls instead...

From 160-core.23327.1463050194.dump.1 (case 01633922), this is the IP-address of the client doing the UMNTALL:

(gdb) f 9
#9  0x00007f9e2e13d7dc in mnt3svc_umntall (req=0x7f9e2da61bd4) at mount3.c:2553
2553            mnt3svc_umountall (ms);

(gdb) p req->trans->peerinfo->identifier
$26 = "10.69.41.91:930", '\000' <repeats 92 times>


It would be helpful to know what OS the client is running, and what NFS-client is used. With this information we may be able to create a reproducer for this problem.

Reading through the sources of the Gluster/NFS server that handle the UMNTALL call did not suggest any obvious issues yet.

--- Additional comment from Niels de Vos on 2016-05-17 06:36:56 EDT ---

Maybe related, it seems that Windows 7 and 2008 send UMNTALL requests:
  https://bugzilla.redhat.com/show_bug.cgi?id=GLUSTER-1666

These requests are normally sent after a(n unclean?) reboot. A few more details are in https://tools.ietf.org/html/rfc1813#section-5.2.4

Linux and pynfs do not implement UMNTALL, so it might only be reproducible with Windows or another OS.

--- Additional comment from Bipin Kunal on 2016-05-17 08:52:59 EDT ---

 === In Red Hat Customer Portal Case 01633922 ===
--- Comment by IT Operations, Sky on 5/17/2016 5:37 PM ---

[2016-05-18 09:27:56.568638] E [mem-pool.c:305:__gf_free] (-->/usr/lib64/glusterfs/3.7.1/xlator/nfs/server.so(__mnt3svc_umountall+0x63) [0x7f83090c6eb3] -->/usr/lib64/libglusterfs.so.0(dict_del+0xd7) [0x7f8317541b77] -->/usr/lib64/libglusterfs.so.0(__gf_free+0xb4) [0x7f831757ddd4] ) 0-: Assertion failed: GF_MEM_HEADER_MAGIC == header->magic
pending frames:
frame : type(0) op(0)
frame : type(0) op(0)
patchset: git://git.gluster.com/glusterfs.git
signal received: 6
time of crash:
2016-05-18 09:27:56
configuration details:
argp 1
backtrace 1
dlfcn 1
libpthread 1
llistxattr 1
setfsid 1
spinlock 1
epoll.h 1
xattr.h 1
st_atim.tv_nsec 1
package-string: glusterfs 3.7.1
/usr/lib64/libglusterfs.so.0(_gf_msg_backtrace_nomem+0xb6)[0x7f83175478e6]
/usr/lib64/libglusterfs.so.0(gf_print_trace+0x33f)[0x7f83175674af]
/lib64/libc.so.6(+0x326a0)[0x7f8315ee66a0]
/lib64/libc.so.6(gsignal+0x35)[0x7f8315ee6625]
/lib64/libc.so.6(abort+0x175)[0x7f8315ee7e05]
/lib64/libc.so.6(+0x70537)[0x7f8315f24537]
/lib64/libc.so.6(+0x75e66)[0x7f8315f29e66]
/lib64/libc.so.6(+0x7897a)[0x7f8315f2c97a]
/usr/lib64/libglusterfs.so.0(dict_del+0xd7)[0x7f8317541b77]
/usr/lib64/glusterfs/3.7.1/xlator/nfs/server.so(__mnt3svc_umountall+0x63)[0x7f83090c6eb3]
/usr/lib64/glusterfs/3.7.1/xlator/nfs/server.so(mnt3svc_umountall+0x34)[0x7f83090c6f14]
/usr/lib64/glusterfs/3.7.1/xlator/nfs/server.so(mnt3svc_umntall+0x2c)[0x7f83090c77dc]
/usr/lib64/libglusterfs.so.0(synctask_wrap+0x12)[0x7f831758f2b2]
/lib64/libc.so.6(+0x438f0)[0x7f8315ef78f0]
---------


--- Additional comment from Niels de Vos on 2016-05-23 04:39 EDT ---

Completely untested patch that addresses the following two points:

  1. fix UMNTALL to only UMNT the exports from the client calling the procedure
  2. remove the duplication of structures in mountdict, use mountlist everywhere

I am confident that these two patches prevent the crashes that have been seen. The mountdict was used to optimize lookups when many clients are mounting exports. The removal of the mountdict may cause some performance drop while (un)mounting, but I do not expect that to be critical.

These patches were only build-tested, but they show the approach I'd like to take to fix this problem; see the sketch below.
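
For illustration only, here is a rough sketch of what point 1 could look like. This is not the attached patch; the helper name is made up, and the mountentry fields (hostname, mlist) are taken from the gdb output quoted earlier in this bug. In line with point 2, it does not touch a mountdict at all:

    int
    __mnt3svc_umountall_client (struct mount3_state *ms, const char *host)
    {
            struct mountentry *me  = NULL;
            struct mountentry *tmp = NULL;

            if (!ms || !host)
                    return -1;

            list_for_each_entry_safe (me, tmp, &ms->mountlist, mlist) {
                    /* only drop the entries recorded for the calling client */
                    if (strcmp (me->hostname, host) != 0)
                            continue;
                    list_del (&me->mlist);
                    GF_FREE (me);
            }

            return 0;
    }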

--- Additional comment from Niels de Vos on 2016-05-23 04:40:14 EDT ---

(In reply to Bipin Kunal from comment #24)
> I tried playing with Windows 2012 NFS client yesterday but could not produce
> anything useful.
> 
> Can you suggest me the test or steps for reproducer?

Sorry, I still have no idea how this can happen. There might need to be multiple clients mounting at the same time while one client sends the UMNTALL procedure. But without more details from the customers that hit this problem, it is very difficult, if not impossible, to understand the cause.

Maybe one of the customers can capture a tcpdump on the NFS-server for all the MOUNT procedures? Depending on their configuration, this would be done on a different port than the NFS traffic, so the tcpdump should not contain much data, only the MNT, UMNT and UMNTALL procedures. Check with 'rpcinfo -p $SERVER' which port is used. Provide the tcpdump and the nfs.log once the process crashes again.
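
For example, something along these lines (a sketch; the port is an assumption — Gluster/NFS usually registers mountd on 38465, but the rpcinfo output is authoritative):

    # rpcinfo -p $SERVER | grep -w mountd
    # tcpdump -i any -s 0 -w /var/tmp/mount-procedures.pcap port 38465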

--- Additional comment from Oonkwee Lim_ on 2016-07-21 15:00:09 EDT ---

Hello Niels,

Which release contains the fixes from comment #25?

--- Additional comment from Oonkwee Lim_ on 2016-08-03 01:42:41 EDT ---

More of the same:

[2016-07-31 20:25:23.323701] E [mem-pool.c:305:__gf_free] (-->/usr/lib64/glusterfs/3.7.1/xlator/nfs/server.so(__mnt3svc_umountall+0x63) [0x7f705c4b4eb3] -->/usr/lib64/libglusterfs.so.0(dict_del+0xd7) [0x7f706a72db77] -->/usr/lib64/libglusterfs.so.0(__gf_free+0xb4) [0x7f706a769dd4] ) 0-: Assertion failed: GF_MEM_HEADER_MAGIC == header->magic
pending frames:
frame : type(0) op(0)
frame : type(0) op(0)
patchset: git://git.gluster.com/glusterfs.git
signal received: 11
time of crash:
2016-07-31 20:25:23
configuration details:
argp 1
backtrace 1
dlfcn 1
libpthread 1
llistxattr 1
setfsid 1
spinlock 1
epoll.h 1
xattr.h 1
st_atim.tv_nsec 1
package-string: glusterfs 3.7.1
/usr/lib64/libglusterfs.so.0(_gf_msg_backtrace_nomem+0xb6)[0x7f706a7338e6]
/usr/lib64/libglusterfs.so.0(gf_print_trace+0x33f)[0x7f706a7534af]
/lib64/libc.so.6(+0x39a80326a0)[0x7f70690d26a0]
/usr/lib64/libglusterfs.so.0(__gf_free+0xc5)[0x7f706a769de5]
/usr/lib64/libglusterfs.so.0(dict_del+0xd7)[0x7f706a72db77]
/usr/lib64/glusterfs/3.7.1/xlator/nfs/server.so(__mnt3svc_umountall+0x63)[0x7f705c4b4eb3]
/usr/lib64/glusterfs/3.7.1/xlator/nfs/server.so(mnt3svc_umountall+0x34)[0x7f705c4b4f14]
/usr/lib64/glusterfs/3.7.1/xlator/nfs/server.so(mnt3svc_umntall+0x2c)[0x7f705c4b57dc]
/usr/lib64/libglusterfs.so.0(synctask_wrap+0x12)[0x7f706a77b2b2]
/lib64/libc.so.6(+0x39a80438f0)[0x7f70690e38f0]

--- Additional comment from Niels de Vos on 2016-08-03 03:57:47 EDT ---

Atin, this is really Gluster/NFS, not NFS-Ganesha. Moving back to the gluster-nfs component.


[2017-01-01 21:25:22.152530] I [rpc-clnt.c:1819:rpc_clnt_reconfig] 0-webfactsdev-client-3: changing port to 49514 (from 0)
[2017-01-01 21:25:22.158285] E [socket.c:2332:socket_connect_finish] 0-webfactsdev-client-3: connection to 10.9.125.58:49514 failed (Connection refused)
[2017-01-01 21:25:23.993903] C [mem-pool.c:571:mem_put] (-->/usr/lib64/glusterfs/3.7.1/xlator/nfs/server.so(__mnt3svc_umountall+0x63) [0x7f33b3987eb3] -->/usr/lib64/libglusterfs.so.0(dict_del+0xec) [0x7f33c5c55b8c] -->/usr/lib64/libglusterfs.so.0(mem_put+0xf6) [0x7f33c5c920d6] ) 0-mem-pool: mem_put called on freed ptr 0x7f33c3136d30 of mem pool 0x7f33c76d4cb0
pending frames:
frame : type(0) op(0)
frame : type(0) op(0)
patchset: git://git.gluster.com/glusterfs.git
signal received: 11
time of crash:
2017-01-01 21:25:23
configuration details:
argp 1
backtrace 1
dlfcn 1
libpthread 1
llistxattr 1
setfsid 1
spinlock 1
epoll.h 1
xattr.h 1
st_atim.tv_nsec 1
package-string: glusterfs 3.7.1
/usr/lib64/libglusterfs.so.0(_gf_msg_backtrace_nomem+0xb6)[0x7f33c5c5b8e6]
/usr/lib64/libglusterfs.so.0(gf_print_trace+0x33f)[0x7f33c5c7b4af]
/lib64/libc.so.6(+0x39a80326a0)[0x7f33c45fa6a0]
/usr/lib64/libglusterfs.so.0(__gf_free+0xc5)[0x7f33c5c91de5]
/usr/lib64/libglusterfs.so.0(iobref_destroy+0x54)[0x7f33c5c94db4]
/usr/lib64/libgfrpc.so.0(rpcsvc_request_destroy+0x1a)[0x7f33c5a2410a]
/usr/lib64/libgfrpc.so.0(rpcsvc_submit_generic+0x256)[0x7f33c5a243b6]
/usr/lib64/glusterfs/3.7.1/xlator/nfs/server.so(mnt3svc_submit_reply+0x160)[0x7f33b39885f0]
/usr/lib64/glusterfs/3.7.1/xlator/nfs/server.so(mnt3svc_umntall+0x40)[0x7f33b39887f0]
/usr/lib64/libglusterfs.so.0(synctask_wrap+0x12)[0x7f33c5ca32b2]
/lib64/libc.so.6(+0x39a80438f0)[0x7f33c460b8f0]
---------

--- Additional comment from Oonkwee Lim_ on 2017-01-03 19:08:16 EST ---

Below is the customer reply at this point:

I'm not able to replicate this issue, that I know of.

--- Additional comment from Oonkwee Lim_ on 2017-01-06 12:45:19 EST ---

From the customer:

Typical NFS mount options used:

RHEL
/etc/fstab
oktgluster1a:/DotHill-Agile /mnt/oktgluster1a/DotHill-Agile    nfs rsize=8192,wsize=8192,timeo=14,intr 0 0
oktgluster1a:/DotHill-Agile-tst /mnt/oktnfs1/DotHill-Agile-tst nfs defaults,_netdev 0 0
oktgluster1a:/DotHill-Agile-tst1 /mnt/oktnfs1/DotHill-Agile-tst1 nfs defaults,_netdev 0 0

IBM AIX
/etc/filesystems
/mnt/oktnfs1/oktgluster1/bobjdevlcm:
        dev             = "/bobjdevlcm"
        vfs             = nfs
        nodename        = oktgluster1
        mount           = true
        type            = nfs
        options         = bg,hard,intr,sec=sys
        account         = false

--- Additional comment from Soumya Koduri on 2017-02-07 02:46:17 EST ---

Okay...From the code, I see a potential issue with mountdict.

During gluster-nfs process initialization:

mnt3svc_init (xlator_t *nfsx)
{
        ...
        mstate->mountdict = dict_new ();
        ...
}

The ref taken on mountdict above seems to be getting unref'd in __mnt3svc_umountall():

__mnt3svc_umountall (struct mount3_state *ms)
{
        ...
        dict_unref (ms->mountdict);
}

So, over the lifetime of a gNFS process, more than one UMNTALL request may result in a double unref, which means accessing freed memory.
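
To illustrate the suspected lifecycle problem, a minimal hypothetical sketch (dict_new()/dict_unref() are the libglusterfs refcounting calls involved; the comments map each call to the events in this bug):

    dict_t *mountdict = dict_new ();  /* mnt3svc_init: refcount == 1, the only ref ever taken */

    dict_unref (mountdict);           /* 1st UMNTALL: refcount drops to 0, the dict is freed */
    dict_unref (mountdict);           /* 2nd UMNTALL: use-after-free, crashes in dict/mem-pool code */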

Ideally, the above dict_unref (ms->mountdict) should be done in mnt3svc_deinit() IMO.

Based on the above, I am working with Riyas to check whether we can reproduce the issue by trying out multiple mount/umount or reboot scenarios using a Windows client.

--- Additional comment from Riyas Abdulrasak on 2017-02-07 10:10:24 EST ---

Soumya and I were able to reproduce the gluster/nfs crash with the following steps.

1. From the Windows 2012 machine, mount multiple shares using the command below:

#net use <driveletter>: \\serverip\share_name

Do 5-6 mounts.

2. Unmount all shares together with "net use /d *".


Below are the Windows test machine details:

x.x.x.x  (use rdesktop to connect)

Regards
Riyas

--- Additional comment from Soumya Koduri on 2017-02-08 03:52:24 EST ---

The crash happened during the second UMNTALL, as suspected in comment #58.

Today while testing I was unable to reproduce this issue using rhgs-3.2 bits. Maybe this issue doesn't exist any more. Even Customer updated that they are not able to hit this issue in their pre-production environment which is on rhgs-3.1.3 (comment#57). Riyas shall try to test the scenario using rhgs-3.1.3 build and update.

--- Additional comment from Riyas Abdulrasak on 2017-02-08 09:21:57 EST ---

(In reply to Soumya Koduri from comment #60)
> The crash happened during second UMOUNTALL as suspected in comment#58. 
> 
> Today while testing I was unable to reproduce this issue using rhgs-3.2
> bits. Maybe this issue doesn't exist any more. Even Customer updated that
> they are not able to hit this issue in their pre-production environment
> which is on rhgs-3.1.3 (comment#57). Riyas shall try to test the scenario
> using rhgs-3.1.3 build and update.

Hello Soumya, 

I have tried reproducing it using 3.1.3, following exactly the same steps we used yesterday. It never crashed, but I was not able to see any UMNTALL call either. Below are the mount calls we saw in the tcpdump:

~~~~~~~~~
$ tshark -r nfs_new.pcap  rpc.program == 100005
445   5.007142    127.0.0.1 → 127.0.0.1    MOUNT 112 V1 NULL Call
447   5.007496    127.0.0.1 → 127.0.0.1    MOUNT 96 V1 NULL Reply (Call In 445)
1190  20.265579    127.0.0.1 → 127.0.0.1    MOUNT 112 V1 NULL Call
1192  20.265937    127.0.0.1 → 127.0.0.1    MOUNT 96 V1 NULL Reply (Call In 1190)
2091  35.523741    127.0.0.1 → 127.0.0.1    MOUNT 112 V1 NULL Call
2093  35.524059    127.0.0.1 → 127.0.0.1    MOUNT 96 V1 NULL Reply (Call In 2091)
2919  50.788065    127.0.0.1 → 127.0.0.1    MOUNT 112 V1 NULL Call
2921  50.788422    127.0.0.1 → 127.0.0.1    MOUNT 96 V1 NULL Reply (Call In 2919)
3783  66.025470    127.0.0.1 → 127.0.0.1    MOUNT 112 V1 NULL Call
3785  66.025801    127.0.0.1 → 127.0.0.1    MOUNT 96 V1 NULL Reply (Call In 3783)
4526  81.267281    127.0.0.1 → 127.0.0.1    MOUNT 112 V1 NULL Call
4528  81.267587    127.0.0.1 → 127.0.0.1    MOUNT 96 V1 NULL Reply (Call In 4526)
5429  96.644392    127.0.0.1 → 127.0.0.1    MOUNT 112 V1 NULL Call
5431  96.644774    127.0.0.1 → 127.0.0.1    MOUNT 96 V1 NULL Reply (Call In 5429)
5816 102.296431  10.65.8.138 → 10.65.8.138  MOUNT 148 V3 EXPORT Call
5818 102.310090  10.65.8.138 → 10.65.8.138  MOUNT 188 V3 EXPORT Reply (Call In 5816)
~~~~~~~~~

Regards
Riyas

--- Additional comment from Raghavendra Bhat on 2017-02-10 17:29:31 EST ---


I have been able to recreate the issue with the 3.1.3 code as well. These are my observations:

* The issue happens when the UMNTALL RPC call is made by an NFS client.

* From my understanding, when "net use /d *" is issued, the Windows NFS client sends the UMNTALL RPC call along with UMNT calls as well.

* Linux NFS clients send UMNT calls, so with Linux NFS clients alone the UMNTALL call does not reach the NFS server. I mounted 2 exports on a Linux machine, and "umount <mount 1> <mount 2>" did not send a UMNTALL call; instead it sent 2 separate UMNT calls.

Theory:

The NFS server maintains a list of mount points, called mountlist, in the structure "struct mount3_state" (ms->mountlist in the code base).

This is the __mnt3svc_umountall function that gets called in the NFS server:

"
int
__mnt3svc_umountall (struct mount3_state *ms)
{
 	struct mountentry       *me = NULL;
        struct mountentry       *tmp = NULL;

	if (!ms)
                return -1;

        if (list_empty (&ms->mountlist))
                return 0;

        list_for_each_entry_safe (me, tmp, &ms->mountlist, mlist) {
                list_del (&me->mlist);       /* Remove from the mount list */
	        __mountdict_remove (ms, me); /* Remove from the mount dict */
	        GF_FREE (me);
	}

        dict_unref (ms->mountdict);

	return 0;
}
"

The first UMNTALL unrefs the dict (ms->mountdict) and hence frees it.

Now, when the second UMNTALL request comes, if ms->mountlist is not empty, ms->mountdict is referenced again in order to unref it, and this leads to a crash. Otherwise, if ms->mountlist is empty when the second UMNTALL request comes, the function will simply return.

IIUC, if we have a mixture of Windows and Linux NFS clients, then we should be able to hit the issue. (Linux clients will keep ms->mountlist non-empty, and Windows clients will send the UMNTALL call.)

I am still trying to recreate the issue, but the PowerShell on the Windows server x.x.x.x hangs every time and logs me out.

As Soumya mentioned in comment #58, the dict_unref should not be done in the __mnt3svc_umountall function. It should be done in mnt3svc_deinit.

--- Additional comment from Raghavendra Bhat on 2017-02-10 17:47:06 EST ---


More observations:

* Linux clients alone cannot cause the crash, as they don't send the UMNTALL call (I checked the Linux kernel source; the UMNTALL call is not made).

* Windows clients alone can recreate the crash, depending on the order of the UMNT and UMNTALL calls (Windows sends both). When I attached gdb to the NFS server and unmounted the Windows clients (net use /d *), both UMNT and UMNTALL calls came.

Example: if 2 volumes exported via NFS are mounted in Windows,

1) the following order will not lead to the crash:

UMNT->UMNT->UMNTALL->UMNTALL

That's because both UMNT calls would have removed the mounts from the list maintained by the NFS server, and the dict would not be unref'd in UMNTALL (as mentioned in comment #66, __mnt3svc_umountall will simply return if ms->mountlist is empty).

2) UMNT->UMNTALL->UMNT->UMNTALL

The above order will not lead to an immediate crash, but the dict (ms->mountdict) is still freed, and any future reference to it will lead to a crash.

3) UMNTALL->UMNT->UMNTALL->UMNT

This will lead to the crash: by the time the 2nd UMNTALL request comes, the NFS server still has one mount in its list, and it will crash the moment the 2nd UMNTALL call arrives.

4) UMNTALL->UMNTALL->UMNT->UMNT

This will lead to the crash: by the time the 2nd UMNTALL request comes, the NFS server still has both mounts in its list, and it will crash the moment the 2nd UMNTALL call arrives.

Soumya, can you please verify whether the above scenarios (this comment and the previous comment, i.e. #66) are correct?

I have been able to hit the crash with the 3.1.3 code base once.

--- Additional comment from Soumya Koduri on 2017-02-11 12:42:21 EST ---

Raghavendra,

You are absolutely right. The crash is seen when the UMNTALL call is made twice with an outstanding mount in ms->mountdict (i.e., there should be at least one UMNT call not made by the client).

Riyas and I have been trying out various combinations using the Windows client to get that sequence of calls, but every time the Windows client behaved differently. Your suggestion to use a Linux client as well, to always have an outstanding mount in the list, seems like a good idea. With that, we should be able to hit this issue consistently, right?

--- Additional comment from Raghavendra Bhat on 2017-02-11 13:42:38 EST ---


Soumya,

I was successfully able to recreate the issue only once. I was able to hit it with the RHGS-3.1.3 source installed on my local laptop, used as the NFS server.

I made those observations when I attached gdb to the NFS server and put breakpoints on both the mnt3svc_umnt and mnt3svc_umntall functions.

I wanted to reproduce the issue consistently. But later, for some reason, every time I tried to mount the NFS share on Windows, the PowerShell on the Windows 2012 server hung and crashed, logging me out every time.

I used the same windows machine that Riyas mentioned in the earlier comment.

--- Additional comment from Riyas Abdulrasak on 2017-02-13 05:04:37 EST ---

Hi Raghavendra, 

Soumya and I tried to reproduce the issue with two different Windows 2012 machines. But after one "net use /d *", both Windows 2012 machines started to crash when we tried to mount the NFS share.

Regards
Riyas

Comment 1 Niels de Vos 2017-02-13 16:19:00 UTC
Possible fix(es) are in attachment #1160478 from bug 1315544.

Raghavendra, will you be verifying those changes and post them for review?

Comment 2 Worker Ant 2017-02-13 17:11:24 UTC
REVIEW: https://review.gluster.org/16611 (gNFS: Keep the mountdict as long as the service is active) posted (#2) for review on master by soumya k (skoduri)

Comment 3 Worker Ant 2017-02-14 09:16:50 UTC
COMMIT: https://review.gluster.org/16611 committed in master by Niels de Vos (ndevos) 
------
commit a88ae92de190af0956013780939ba6bdfd509ff8
Author: Soumya Koduri <skoduri>
Date:   Mon Feb 13 16:52:07 2017 +0530

    gNFS: Keep the mountdict as long as the service is active
    
    We initialize and take ref once on mountdict during NFS/MNT3 server
    initialization but seem to be unref'in it for every UMNTALL request.
    This can lead to crash when there are multiple UMNTALL requests
    with >=1 active mount entry(/ies) in the mountlist.
    
    Since we take the ref only once, we should keep the mountdict through
    out the life of the process and dereference it only during unitialization
    of mnt3 service.
    
    Change-Id: I3238a8df09b8972e56dd93fee426d866d40d9959
    BUG: 1421759
    Signed-off-by: Soumya Koduri <skoduri>
    Reviewed-on: https://review.gluster.org/16611
    Smoke: Gluster Build System <jenkins.org>
    Reviewed-by: Raghavendra Bhat <raghavendra>
    NetBSD-regression: NetBSD Build System <jenkins.org>
    CentOS-regression: Gluster Build System <jenkins.org>
    Reviewed-by: jiffin tony Thottan <jthottan>
    Reviewed-by: Niels de Vos <ndevos>
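
Schematically, the change moves the unref out of the UMNTALL path and into the teardown path. This is a sketch of the direction described in the commit message, not the literal diff:

    int
    __mnt3svc_umountall (struct mount3_state *ms)
    {
            ...
            list_for_each_entry_safe (me, tmp, &ms->mountlist, mlist) {
                    list_del (&me->mlist);
                    __mountdict_remove (ms, me);
                    GF_FREE (me);
            }
            /* the dict_unref (ms->mountdict) that used to be here dropped the only ref */
            return 0;
    }

    int
    mnt3svc_deinit (xlator_t *nfsx)
    {
            ...
            if (ms && ms->mountdict)
                    dict_unref (ms->mountdict);  /* balances the dict_new () in mnt3svc_init () */
            ...
    }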

Comment 4 Niels de Vos 2017-02-14 09:17:34 UTC
You have a good theory on how this use-after-free can occur. You should be able to reproduce this with nfsshell from https://github.com/NetDirect/nfsshell, as it can do UMNTALL too. Using this program it should be possible to trigger the problem. An example of running the shell:

    $ ./nfsshell
nfs> host storage.lan.example.net
    Open storage.lan.example.net (172.31.96.2) TCP
    nfs> mount /repos
    Mount `/repos', TCP, transfer size 1048576 bytes.
    nfs> ls
    .
    ..
    .trashcan
    fedora
    rhel-7
    nfs> umountall
    Unmount `/repos'
    nfs> quit
Close `storage.lan.example.net'

You can also put the commands in a textfile and pipe that to nfsshell:

    $ cat /tmp/umountall-reproducer.nfsshell
    host storage.lan.example.net
    mount /repos
    ls
    umountall
    quit

    $ ./nfsshell < /tmp/umountall-reproducer.nfsshell
nfs> host storage.lan.example.net
    Open storage.lan.example.net (172.31.96.2) TCP
    nfs> mount /repos
    Mount `/repos', TCP, transfer size 1048576 bytes.
    nfs> ls
    .
    ..
    .trashcan
    fedora
    rhel-7
    nfs> umountall
    Unmount `/repos'
    nfs> quit
Close `storage.lan.example.net'

I'd appreciate it if someone could look into reproducing the problem and confirm that the patch fixes it.
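
Based on the analysis in this bug, a sequence along the following lines should get a fresh mount entry into the mountlist after the mountdict has already been unref'd once. This is untested; server and export names are placeholders:

    $ cat /tmp/double-umountall.nfsshell
    host storage.lan.example.net
    mount /repos
    umountall
    mount /repos
    umountall
    quit

    $ ./nfsshell < /tmp/double-umountall.nfsshell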

Comment 5 Niels de Vos 2017-02-15 08:40:55 UTC
Riyas, could you post the exact steps with nfsshell to reproduce the segfault? I'd like them to be recorded here as well.

Comment 6 Worker Ant 2017-02-15 08:58:33 UTC
REVIEW: https://review.gluster.org/16625 (gNFS: Keep the mountdict as long as the service is active) posted (#1) for review on release-3.10 by Niels de Vos (ndevos)

Comment 7 Worker Ant 2017-02-15 08:59:10 UTC
REVIEW: https://review.gluster.org/16626 (gNFS: Keep the mountdict as long as the service is active) posted (#1) for review on release-3.9 by Niels de Vos (ndevos)

Comment 8 Shyamsundar 2017-05-30 18:43:01 UTC
This bug is getting closed because a release has been made available that should address the reported issue. In case the problem is still not fixed with glusterfs-3.11.0, please open a new bug report.

glusterfs-3.11.0 has been announced on the Gluster mailinglists [1], packages for several distributions should become available in the near future. Keep an eye on the Gluster Users mailinglist [2] and the update infrastructure for your distribution.

[1] http://lists.gluster.org/pipermail/announce/2017-May/000073.html
[2] https://www.gluster.org/pipermail/gluster-users/

