Description of problem:
-----------------------
nfs-ganesha entered the failed state on one node of the four-node cluster while exporting volumes in a loop. Tried to export 109 volumes one after the other in a loop.

===============================================================================

Version-Release number of selected component (if applicable):
-------------------------------------------------------------
nfs-ganesha-2.5.5-10.el7rhgs.x86_64
nfs-ganesha-gluster-2.5.5-10.el7rhgs.x86_64
glusterfs-ganesha-3.12.2-28.el7rhgs.x86_64

===============================================================================

How reproducible:
-----------------
1/1

===============================================================================

Steps to Reproduce:
-------------------
1. Create a 4-node ganesha cluster.
2. Create and start 100 or more volumes.
3. Verify the status of all volumes.
4. Export the volumes one after the other in a loop.

===============================================================================

Actual results:
---------------
Ganesha entered the failed state on one of the nodes.

===============================================================================

Expected results:
-----------------
No failure should be observed.

===============================================================================

Additional info:
----------------
* All volumes were exported on the other 3 nodes of the 4-node cluster.
* The failure was observed on a different node than the one from which the export operation was executed.

The setup has been kept in the same state and can be shared if required.
Logs and sos report will be shared shortly.
(gdb) bt
#0  __memcmp_sse4_1 () at ../sysdeps/x86_64/multiarch/memcmp-sse4.S:74
#1  0x00007f5d18130664 in glfs_mgmt_getspec_cbk (req=<optimized out>, iov=<optimized out>, count=<optimized out>, myframe=0x7f5b74002cb0) at glfs-mgmt.c:625
#2  0x00007f5c8e9e8960 in rpc_clnt_handle_reply (clnt=clnt@entry=0x7f5a08cc5760, pollin=pollin@entry=0x7f5b7f09acb0) at rpc-clnt.c:778
#3  0x00007f5c8e9e8d03 in rpc_clnt_notify (trans=<optimized out>, mydata=0x7f5a08cc5790, event=<optimized out>, data=0x7f5b7f09acb0) at rpc-clnt.c:971
#4  0x00007f5c8e9e4a73 in rpc_transport_notify (this=this@entry=0x7f5a08cc5930, event=event@entry=RPC_TRANSPORT_MSG_RECEIVED, data=data@entry=0x7f5b7f09acb0) at rpc-transport.c:538
#5  0x00007f5c849e5576 in socket_event_poll_in (this=this@entry=0x7f5a08cc5930, notify_handled=<optimized out>) at socket.c:2322
#6  0x00007f5c849e7b1c in socket_event_handler (fd=565, idx=0, gen=1, data=0x7f5a08cc5930, poll_in=1, poll_out=0, poll_err=0) at socket.c:2474
#7  0x00007f5c8ec7e824 in event_dispatch_epoll_handler (event=0x7f59e1f44500, event_pool=0x7f5a08cb74f0) at event-epoll.c:583
#8  event_dispatch_epoll_worker (data=0x7f5b760922a0) at event-epoll.c:659
#9  0x00007f5d20e44dd5 in start_thread (arg=0x7f59e1f45700) at pthread_create.c:307
#10 0x00007f5d2050fead in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111
(gdb) f 1
#1  0x00007f5d18130664 in glfs_mgmt_getspec_cbk (req=<optimized out>, iov=<optimized out>, count=<optimized out>, myframe=0x7f5b74002cb0) at glfs-mgmt.c:625
625                 (memcmp (fs->oldvolfile, rsp.spec, size) == 0)) {
(gdb) l
620
621         ret = 0;
622         size = rsp.op_ret;
623
624         if ((size == fs->oldvollen) &&
625             (memcmp (fs->oldvolfile, rsp.spec, size) == 0)) {
626                 gf_msg (frame->this->name, GF_LOG_INFO, 0,
627                         API_MSG_VOLFILE_INFO,
628                         "No change in volfile, continuing");
629                 goto out;
(gdb) p fs->olvollen
There is no member named olvollen.
(gdb) p fs->oldvollen
$1 = 1674
(gdb) p size
$2 = 1674
(gdb) p fs->oldvolfile
$3 = 0x7f5b76097cd0 "volume testvol82201-client-0\n    type protocol/client\n    option send-gids true\n    option transport.socket.keepalive-count 9\n    option transport.socket.keepalive-interval 2\n    option transport.sock"...
(gdb) p rsp.spec
$4 = 0x7f5b7f9da9d0 "volume testvol82201-client-0\n    type protocol/client\n    option send-gids true\n    option transport.socket.keepalive-count 9\n    option transport.socket.keepalive-interval 2\n    option transport.sock"...
(gdb)

The crash happened while doing a memcmp of fs->oldvolfile and the new volfile received in the response (rsp.spec). The contents of both variables look fine in the core.

From code reading, I observed that we update fs->oldvollen and fs->oldvolfile under the fs->mutex lock, but that lock is not taken while reading those values here in glfs_mgmt_getspec_cbk. That could have resulted in the crash while accessing an uninitialized or partially initialized variable.

@Jilju,

Are you able to consistently reproduce this issue?
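To make the suggested direction concrete, below is a minimal, self-contained C sketch of the locking pattern described above: take the same mutex around both the update and the read of the cached volfile length and contents, so the callback can never compare against a half-updated pair. The fake_fs struct and the cache_volfile()/volfile_unchanged() helpers are hypothetical stand-ins that only mimic the fs->oldvollen/fs->oldvolfile/fs->mutex fields of the real gfapi object; this is not the actual glusterfs patch, whose shape may differ.

/* Sketch only: simplified model of the gfapi volfile cache and the
 * locked comparison; not the upstream fix. */
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>

struct fake_fs {
        pthread_mutex_t mutex;     /* guards oldvollen and oldvolfile */
        ssize_t         oldvollen;
        char           *oldvolfile;
};

/* Writer: cache a newly received volfile under fs->mutex (the real code
 * already takes the lock on this path). */
static void
cache_volfile (struct fake_fs *fs, const char *spec, ssize_t size)
{
        pthread_mutex_lock (&fs->mutex);
        {
                free (fs->oldvolfile);
                fs->oldvolfile = malloc (size + 1);
                memcpy (fs->oldvolfile, spec, size);
                fs->oldvolfile[size] = '\0';
                fs->oldvollen = size;
        }
        pthread_mutex_unlock (&fs->mutex);
}

/* Reader: the length check and memcmp that crashed in
 * glfs_mgmt_getspec_cbk, now done under the same lock so the pair can
 * never be observed mid-update. */
static int
volfile_unchanged (struct fake_fs *fs, const char *spec, ssize_t size)
{
        int unchanged = 0;

        pthread_mutex_lock (&fs->mutex);
        {
                if ((size == fs->oldvollen) && fs->oldvolfile &&
                    (memcmp (fs->oldvolfile, spec, size) == 0))
                        unchanged = 1;
        }
        pthread_mutex_unlock (&fs->mutex);

        return unchanged;
}

int
main (void)
{
        struct fake_fs fs = { .mutex = PTHREAD_MUTEX_INITIALIZER };
        const char *spec = "volume testvol-client-0\n    type protocol/client\n";

        cache_volfile (&fs, spec, strlen (spec));
        printf ("volfile unchanged: %d\n",
                volfile_unchanged (&fs, spec, strlen (spec)));
        free (fs.oldvolfile);
        return 0;
}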
Are the buffers smaller than 1674? It might be going off the end of one of the buffers.
(In reply to Soumya Koduri from comment #4)

> @Jilju,
>
> Are you able to consistently reproduce this issue?

Hi Soumya,

This is the first occurrence and it is reported here. I have kept the setup in the same state to aid debugging. I can share the setup if required, or I can try to reproduce the issue.
The fix is available in the current rhgs-3.5 gluster rebase. Kindly validate and update.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (glusterfs bug fix and enhancement update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2020:5603