Description of problem:
-----------------------
4-node cluster. 4 clients mount a 2x2 export via NFSv4. Ran iozone random writes in a distributed-multithreaded way. Ganesha crashed and dumped core on one of the nodes.

BT from crash:

(gdb) bt
#0  0x00007fa3b4c5c1d7 in raise () from /lib64/libc.so.6
#1  0x00007fa3b4c5d8c8 in abort () from /lib64/libc.so.6
#2  0x00007fa3b715f80a in nfs_dupreq_rele (req=0x7f9ff4fb3348, func=<optimized out>)
    at /usr/src/debug/nfs-ganesha-2.4.1/src/RPCAL/nfs_dupreq.c:1234
#3  0x00007fa3b70e3ef1 in nfs_rpc_execute (reqdata=reqdata@entry=0x7f9ff4fb3320)
    at /usr/src/debug/nfs-ganesha-2.4.1/src/MainNFSD/nfs_worker_thread.c:1405
#4  0x00007fa3b70e578a in worker_run (ctx=0x7fa3b8537400)
    at /usr/src/debug/nfs-ganesha-2.4.1/src/MainNFSD/nfs_worker_thread.c:1548
#5  0x00007fa3b716f189 in fridgethr_start_routine (arg=0x7fa3b8537400)
    at /usr/src/debug/nfs-ganesha-2.4.1/src/support/fridgethr.c:550
#6  0x00007fa3b564fdc5 in start_thread () from /lib64/libpthread.so.0
#7  0x00007fa3b4d1e73d in clone () from /lib64/libc.so.6
(gdb)

Version-Release number of selected component (if applicable):
------------------------------------------------------------
nfs-ganesha-gluster-2.4.1-1.el7rhgs.x86_64
glusterfs-ganesha-3.8.4-5.el7rhgs.x86_64
openssl-1.0.1e-60.el7.x86_64

How reproducible:
-----------------
Reporting the first occurrence.

Steps to Reproduce:
-------------------
1. TLS-authenticate all 4 nodes and the clients.
2. Mount the volume via Ganesha with vers=4. Create the data set using iozone sequential writes.
3. Run iozone random writes (iozone -+m <conf file> -+h <host> -C -w -c -e -i 2 -J 3 -+n -r 64k -s 2g -t 16).

Actual results:
---------------
Ganesha crashes on 1 of the nodes.

Expected results:
-----------------
No crashes.
Additional info:
----------------
Volume Name: testvol
Type: Distributed-Replicate
Volume ID: c3d28ba6-98a2-4e66-b412-2a6580030ffd
Status: Started
Snapshot Count: 0
Number of Bricks: 2 x 2 = 4
Transport-type: tcp
Bricks:
Brick1: gqas013.sbu.lab.eng.bos.redhat.com:/bricks/testvol_brick0
Brick2: gqas005.sbu.lab.eng.bos.redhat.com:/bricks/testvol_brick1
Brick3: gqas006.sbu.lab.eng.bos.redhat.com:/bricks/testvol_brick2
Brick4: gqas011.sbu.lab.eng.bos.redhat.com:/bricks/testvol_brick3
Options Reconfigured:
ganesha.enable: on
features.cache-invalidation: on
server.ssl: on
client.ssl: on
nfs.disable: on
performance.readdir-ahead: on
transport.address-family: inet
performance.stat-prefetch: off
server.allow-insecure: on
nfs-ganesha: enable
cluster.enable-shared-storage: enable
One thing I forgot to mention in the description is that both I/O and management encryption are enabled:

[2016-11-26 12:41:45.479309] I [socket.c:4021:socket_init] 0-testvol-client-2: SSL support on the I/O path is ENABLED
[2016-11-26 12:41:45.479330] I [socket.c:4024:socket_init] 0-testvol-client-2: SSL support for glusterd is ENABLED
(gdb) bt
#0  0x00007fa3b4c5c1d7 in raise () from /lib64/libc.so.6
#1  0x00007fa3b4c5d8c8 in abort () from /lib64/libc.so.6
#2  0x00007fa3b715f80a in nfs_dupreq_rele (req=0x7f9ff4fb3348, func=<optimized out>)
    at /usr/src/debug/nfs-ganesha-2.4.1/src/RPCAL/nfs_dupreq.c:1234
#3  0x00007fa3b70e3ef1 in nfs_rpc_execute (reqdata=reqdata@entry=0x7f9ff4fb3320)
    at /usr/src/debug/nfs-ganesha-2.4.1/src/MainNFSD/nfs_worker_thread.c:1405
#4  0x00007fa3b70e578a in worker_run (ctx=0x7fa3b8537400)
    at /usr/src/debug/nfs-ganesha-2.4.1/src/MainNFSD/nfs_worker_thread.c:1548
#5  0x00007fa3b716f189 in fridgethr_start_routine (arg=0x7fa3b8537400)
    at /usr/src/debug/nfs-ganesha-2.4.1/src/support/fridgethr.c:550
#6  0x00007fa3b564fdc5 in start_thread () from /lib64/libpthread.so.0
#7  0x00007fa3b4d1e73d in clone () from /lib64/libc.so.6
(gdb) f 2
#2  0x00007fa3b715f80a in nfs_dupreq_rele (req=0x7f9ff4fb3348, func=<optimized out>)
    at /usr/src/debug/nfs-ganesha-2.4.1/src/RPCAL/nfs_dupreq.c:1234
1234            PTHREAD_MUTEX_lock(&dv->mtx);
(gdb) l
1229                    func->free_function(req->rq_u2);
1230                    free_nfs_res(req->rq_u2);
1231                    goto out;
1232            }
1233
1234            PTHREAD_MUTEX_lock(&dv->mtx);
1235
1236            LogFullDebug(COMPONENT_DUPREQ,
1237                         "releasing dv=%p xid=%u on DRC=%p state=%s, refcnt=%d",
1238                         dv, dv->hin.tcp.rq_xid, dv->hin.drc,
(gdb) p &dv
Address requested for identifier "dv" which is in register $rbp
(gdb) p dv
$1 = (dupreq_entry_t *) 0x7fa264002460
(gdb) p dv->mtx
$2 = {__data = {__lock = 0, __count = 0, __owner = 0, __nusers = 0, __kind = -1, __spins = 0,
    __list = {__prev = 0x0, __next = 0x0}},
  __size = '\000' <repeats 16 times>, "\377\377\377\377", '\000' <repeats 19 times>, __align = 0}
(gdb) p &dv->mtx
$3 = (pthread_mutex_t *) 0x7fa264002490
(gdb) p *$
$4 = {__data = {__lock = 0, __count = 0, __owner = 0, __nusers = 0, __kind = -1, __spins = 0,
    __list = {__prev = 0x0, __next = 0x0}},
  __size = '\000' <repeats 16 times>, "\377\377\377\377", '\000' <repeats 19 times>, __align = 0}
(gdb)

/var/log/messages:

Nov 26 09:14:33 gqas011 systemd: nfs-ganesha.service: main process exited, code=killed, status=6/ABRT
Nov 26 09:14:33 gqas011 systemd: Unit nfs-ganesha.service entered failed state.
Nov 26 09:14:33 gqas011 systemd: nfs-ganesha.service failed.

/var/log/ganesha.log:

26/11/2016 08:55:19 : epoch 58200000 : gqas011.sbu.lab.eng.bos.redhat.com : ganesha.nfsd-16848[dbus_heartbeat] dbus_heartbeat_cb :DBUS :WARN :Health status is unhealthy. Not sending heartbeat
26/11/2016 09:14:28 : epoch 58200000 : gqas011.sbu.lab.eng.bos.redhat.com : ganesha.nfsd-16848[work-56] nfs_dupreq_rele :RW LOCK :CRIT :Error 22, acquiring mutex 0x7fa264002490 (&dv->mtx) at /builddir/build/BUILD/nfs-ganesha-2.4.1/src/RPCAL/nfs_dupreq.c:1234

pthread_mutex_lock failed with Error 22 (EINVAL), which suggests the mutex was not initialized properly, or was no longer valid, when it was accessed. From the code, it looks like we do not abort if init of this mutex fails:

static inline dupreq_entry_t *alloc_dupreq(void)
{
	dupreq_entry_t *dv;

	dv = pool_alloc(dupreq_pool);
	gsh_mutex_init(&dv->mtx, NULL);
	TAILQ_INIT_ENTRY(dv, fifo_q);

	return dv;
}

Not sure if that is the actual cause. Requesting Frank/Dan/Matt to comment. Attaching bt of all threads.
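One detail worth noting: the gdb dump of dv->mtx shows __kind = -1, which is the value glibc's pthread_mutex_destroy() writes into the mutex. That points at a destroyed (or freed and reused) mutex rather than a failed init. A minimal, glibc-specific sketch (hypothetical helper name, not ganesha code) that reproduces the same Error 22:

```c
#include <errno.h>
#include <pthread.h>

/* Hypothetical reproducer for the EINVAL seen in ganesha.log. On glibc,
 * pthread_mutex_destroy() sets the internal __kind field to -1 (matching
 * the "__kind = -1" in the gdb dump of dv->mtx), and a subsequent
 * pthread_mutex_lock() on that memory fails with EINVAL (Error 22).
 * POSIX leaves locking a destroyed mutex undefined; the EINVAL here is
 * glibc behavior, which is what these RHEL 7 nodes run. */
static int lock_after_destroy(void)
{
	pthread_mutex_t m;

	pthread_mutex_init(&m, NULL);
	pthread_mutex_destroy(&m);	/* leaves __kind == -1 on glibc */
	return pthread_mutex_lock(&m);	/* glibc returns EINVAL here */
}
```

This is only suggestive, of course: the same bit pattern could also come from freed pool memory being scribbled on.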
Created attachment 1225880 [details] thread_all_bt.txt
This smells like a double-free to me, rather than a failure to initialize. I don't know the dupreq code very well, but it looks like maybe both nfs_dupreq_finish() and nfs_dupreq_rele() can be called on the same dupreq? Maybe there's a refcount bug here. Frank?
Initialization failure is more or less impossible, so yes, I'd guess there was an attempt to access a destroyed mutex (or freed memory). I don't see an easy way for the glusterfs setup (e.g., SSL) to be influencing this, unless we somehow scribbled on req->rq_u1 (Ganesha private data). Moving the trace statement at line 1236 out of the critical section, and logging, could be helpful.
(In reply to Matt Benjamin (redhat) from comment #8)

addl notes (debugging the core):

1. dv looks vaguely sane, and it has refcnt==1 and state=DUPREQ_COMPLETE, which happens in nfs_dupreq_finish

2. one thing that jumps out at me is that it looks like nfs_dupreq_rele() and nfs_dupreq_finish() could maybe free the same entry if dv==ov in nfs_dupreq_finish(); worth checking for that case and asserting?
(In reply to Matt Benjamin (redhat) from comment #9)
> (In reply to Matt Benjamin (redhat) from comment #8)
> addl notes (debugging the core)
>
> 1. dv looks vaguely sane, and it has refcnt==1 and state=DUPREQ_COMPLETE,
> which happens in nfs_dupreq_finish
>
> 2. one thing that jumps out at me is that it looks like nfs_dupreq_rele()
> and nfs_dupreq_finish() could maybe free the same entry if dv==ov in
> nfs_dupreq_finish(); worth checking for that case and asserting?

(Not really that trivial: in nfs_dupreq_finish() we check ov's refcnt and don't decrement it; the decrement doesn't happen until nfs_dupreq_rele(). As well, we have refcnt==1, as above, so I don't see how that call path could have happened.)
This BZ has been accepted to be fixed in rhgs-3.2.0 as per today's triage exercise.
I think the answer is the one deduced by Malahal Naineni upstream. Basically, the retire logic in nfs_dupreq_finish() races with the logic in nfs_dupreq_start() to re-use an entry. The existing code incorrectly tries to serialize on a mutex in the drc entry itself, which does not control the entry's reachability. Malahal fixes this by taking the rbt partition lock and rechecking refcnt in the retire path.

The upstream change (not yet merged to nfs-ganesha master) is here:

git fetch ssh://mattbenjamin.io:29419/ffilz/nfs-ganesha refs/changes/01/304701/2 && git cherry-pick FETCH_HEAD

At the same time, working through this code, I found 3 issues that appear to make the caches grow too large. Proposed changes for early review here:

https://review.gerrithub.io/#/c/304819/
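For anyone following along, the serialization pattern described above can be sketched in a few lines. This is a hypothetical simplification (none of these names are ganesha's): the point is that the retire path must recheck refcnt under the same partition lock that gates whether a start-path lookup can still find and revive the entry.

```c
#include <pthread.h>
#include <stdbool.h>

/* Hypothetical sketch of the fix: an entry's reachability is governed by
 * the hash-partition lock, not by the entry's own mutex, so the retire
 * path must take the partition lock and recheck refcnt before freeing.
 * Otherwise a concurrent start-path can revive the entry being freed. */

struct partition {
	pthread_mutex_t lock;	/* serializes lookup/insert/retire */
};

struct entry {
	int refcnt;		/* guarded by the partition lock in this sketch */
	bool retired;		/* stands in for "unhashed and freed" */
};

/* start-path: revive a cached entry, taking a reference while it is
 * still reachable under the partition lock */
static bool try_reuse(struct partition *p, struct entry *e)
{
	bool ok;

	pthread_mutex_lock(&p->lock);
	ok = !e->retired;
	if (ok)
		e->refcnt++;
	pthread_mutex_unlock(&p->lock);
	return ok;
}

static void put_ref(struct partition *p, struct entry *e)
{
	pthread_mutex_lock(&p->lock);
	e->refcnt--;
	pthread_mutex_unlock(&p->lock);
}

/* retire-path: free the entry only if the refcnt recheck, done under the
 * lock that controls reachability, still shows no users */
static bool try_retire(struct partition *p, struct entry *e)
{
	bool ok;

	pthread_mutex_lock(&p->lock);
	ok = (e->refcnt == 0 && !e->retired);
	if (ok)
		e->retired = true;	/* real code would unhash + free here */
	pthread_mutex_unlock(&p->lock);
	return ok;
}

/* single-threaded walk-through of the racy interleaving: once the
 * start-path has revived the entry, retire must observe the reference
 * and back off; returns 1 if every step behaves as intended */
static int demo(void)
{
	struct partition p;
	struct entry e = { 0, false };

	pthread_mutex_init(&p.lock, NULL);
	if (!try_reuse(&p, &e))		/* revive: refcnt 0 -> 1 */
		return 0;
	if (try_retire(&p, &e))		/* must refuse while referenced */
		return 0;
	put_ref(&p, &e);		/* last reference dropped */
	if (!try_retire(&p, &e))	/* now retiring is safe */
		return 0;
	if (try_reuse(&p, &e))		/* a retired entry can't be revived */
		return 0;
	pthread_mutex_destroy(&p.lock);
	return 1;
}
```

In the buggy code, the retire-side check ran under the entry's own dv->mtx, which a start-side thread does not need to hold to find the entry, so the "refcnt == 0" decision could go stale between check and free.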
Could not reproduce on 2.4.1-6/3.8.4-13. Verified on the latest bits.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://rhn.redhat.com/errata/RHEA-2017-0493.html