+++ This bug was initially created as a clone of Bug #1660577 +++
+++ This bug was initially created as a clone of Bug #1658132 +++

Description of problem:
-----------------------
ganesha entered a failed state on one node of the four-node cluster while exporting volumes in a loop. Tried to export 109 volumes one after the other in a loop.

===============================================================================
Version-Release number of selected component (if applicable):
-------------------------------------------------------------
nfs-ganesha-2.5.5-10.el7rhgs.x86_64
nfs-ganesha-gluster-2.5.5-10.el7rhgs.x86_64
glusterfs-ganesha-3.12.2-28.el7rhgs.x86_64

===============================================================================
How reproducible:
-----------------
1/1

===============================================================================
Steps to Reproduce:
-------------------
1. Create a 4-node ganesha cluster.
2. Create and start 100 or more volumes.
3. Verify the status of all volumes.
4. Export the volumes one after the other in a loop.

===============================================================================
Actual results:
---------------
Ganesha entered a failed state on one of the nodes.

===============================================================================
Expected results:
-----------------
No failure should be observed.

===============================================================================
Additional info:
----------------
* All volumes were exported on the other 3 nodes of the 4-node cluster.
* The failure was observed on a different node than the one from which the export operation was executed.

The setup has been kept in the same state and can be shared if required.

--- Additional comment from Red Hat Bugzilla Rules Engine on 2018-12-11 10:35:37 UTC ---

This bug is automatically being proposed for a Z-stream release of Red Hat Gluster Storage 3 under active development and open for bug fixes, by setting the release flag 'rhgs-3.4.z' to '?'.

If this bug should be proposed for a different release, please manually change the proposed release flag.

--- Additional comment from Jilju Joy on 2018-12-11 10:37:00 UTC ---

Logs and sos report will be shared shortly.
--- Additional comment from Jilju Joy on 2018-12-11 11:59:20 UTC ---

Logs and sosreport: http://rhsqe-repo.lab.eng.blr.redhat.com/sosreports/jj/1658132/

--- Additional comment from Soumya Koduri on 2018-12-11 16:26:13 UTC ---

(gdb) bt
#0  __memcmp_sse4_1 () at ../sysdeps/x86_64/multiarch/memcmp-sse4.S:74
#1  0x00007f5d18130664 in glfs_mgmt_getspec_cbk (req=<optimized out>, iov=<optimized out>, count=<optimized out>, myframe=0x7f5b74002cb0) at glfs-mgmt.c:625
#2  0x00007f5c8e9e8960 in rpc_clnt_handle_reply (clnt=clnt@entry=0x7f5a08cc5760, pollin=pollin@entry=0x7f5b7f09acb0) at rpc-clnt.c:778
#3  0x00007f5c8e9e8d03 in rpc_clnt_notify (trans=<optimized out>, mydata=0x7f5a08cc5790, event=<optimized out>, data=0x7f5b7f09acb0) at rpc-clnt.c:971
#4  0x00007f5c8e9e4a73 in rpc_transport_notify (this=this@entry=0x7f5a08cc5930, event=event@entry=RPC_TRANSPORT_MSG_RECEIVED, data=data@entry=0x7f5b7f09acb0) at rpc-transport.c:538
#5  0x00007f5c849e5576 in socket_event_poll_in (this=this@entry=0x7f5a08cc5930, notify_handled=<optimized out>) at socket.c:2322
#6  0x00007f5c849e7b1c in socket_event_handler (fd=565, idx=0, gen=1, data=0x7f5a08cc5930, poll_in=1, poll_out=0, poll_err=0) at socket.c:2474
#7  0x00007f5c8ec7e824 in event_dispatch_epoll_handler (event=0x7f59e1f44500, event_pool=0x7f5a08cb74f0) at event-epoll.c:583
#8  event_dispatch_epoll_worker (data=0x7f5b760922a0) at event-epoll.c:659
#9  0x00007f5d20e44dd5 in start_thread (arg=0x7f59e1f45700) at pthread_create.c:307
#10 0x00007f5d2050fead in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111
(gdb) f 1
#1  0x00007f5d18130664 in glfs_mgmt_getspec_cbk (req=<optimized out>, iov=<optimized out>, count=<optimized out>, myframe=0x7f5b74002cb0) at glfs-mgmt.c:625
625                 (memcmp (fs->oldvolfile, rsp.spec, size) == 0)) {
(gdb) l
620
621             ret = 0;
622             size = rsp.op_ret;
623
624             if ((size == fs->oldvollen) &&
625                 (memcmp (fs->oldvolfile, rsp.spec, size) == 0)) {
626                     gf_msg (frame->this->name, GF_LOG_INFO, 0,
627                             API_MSG_VOLFILE_INFO,
628                             "No change in volfile, continuing");
629                     goto out;
(gdb) p fs->olvollen
There is no member named olvollen.
(gdb) p fs->oldvollen
$1 = 1674
(gdb) p size
$2 = 1674
(gdb) p fs->oldvolfile
$3 = 0x7f5b76097cd0 "volume testvol82201-client-0\n type protocol/client\n option send-gids true\n option transport.socket.keepalive-count 9\n option transport.socket.keepalive-interval 2\n option transport.sock"...
(gdb) p rsp.spec
$4 = 0x7f5b7f9da9d0 "volume testvol82201-client-0\n type protocol/client\n option send-gids true\n option transport.socket.keepalive-count 9\n option transport.socket.keepalive-interval 2\n option transport.sock"...
(gdb)

The crash happened while doing a memcmp of fs->oldvolfile and the new volfile received in the response (rsp.spec). The contents of both variables look fine in the core.

From code reading I observed that we update fs->oldvollen and fs->oldvolfile under the fs->mutex lock, but that lock is not taken while reading those values here in glfs_mgmt_getspec_cbk. That could have resulted in the crash while accessing an uninitialized or partially initialized variable.

@Jilju,

Are you able to consistently reproduce this issue?

--- Additional comment from Daniel Gryniewicz on 2018-12-11 16:33:41 UTC ---

Are the buffers smaller than 1674? It might be going off the end of one of the buffers.
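(For illustration only: a minimal, self-contained sketch of the race described above and of the locking pattern a fix would apply. The struct and function names below are hypothetical, not gluster source; in gfapi the equivalent fields are fs->oldvollen and fs->oldvolfile, guarded by fs->mutex.)

#include <pthread.h>
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>

/* Hypothetical stand-in for the relevant part of the gfapi fs object. */
struct fake_fs {
        pthread_mutex_t mutex;
        ssize_t         oldvollen;
        char           *oldvolfile;
};

/* Writer: replaces the cached volfile. Both fields change together
 * under the mutex, as the gfapi update path already does. */
static void store_volfile(struct fake_fs *fs, const char *spec, ssize_t size)
{
        pthread_mutex_lock(&fs->mutex);
        free(fs->oldvolfile);
        fs->oldvolfile = malloc(size);
        if (fs->oldvolfile != NULL) {
                memcpy(fs->oldvolfile, spec, size);
                fs->oldvollen = size;
        } else {
                fs->oldvollen = 0;
        }
        pthread_mutex_unlock(&fs->mutex);
}

/* Reader: the crashing code did the length check and memcmp() without
 * the lock, so it could pair a new length with an old or freed buffer.
 * Holding the same mutex across both checks avoids that race. */
static int volfile_unchanged(struct fake_fs *fs, const char *spec, ssize_t size)
{
        int unchanged = 0;

        pthread_mutex_lock(&fs->mutex);
        if (size == fs->oldvollen && fs->oldvolfile != NULL &&
            memcmp(fs->oldvolfile, spec, size) == 0)
                unchanged = 1;
        pthread_mutex_unlock(&fs->mutex);

        return unchanged;
}

int main(void)
{
        struct fake_fs fs = { .mutex = PTHREAD_MUTEX_INITIALIZER,
                              .oldvollen = 0, .oldvolfile = NULL };
        const char *spec = "volume testvol-client-0\n    type protocol/client\n";
        ssize_t len = (ssize_t) strlen(spec);
        int rc;

        store_volfile(&fs, spec, len);
        rc = volfile_unchanged(&fs, spec, len) ? 0 : 1;
        free(fs.oldvolfile);
        return rc;
}

Judging by its title, the patch referenced in the following comments (https://review.gluster.org/21882, "gfapi: Access fs->oldvolfile under mutex lock") takes the same approach inside glfs_mgmt_getspec_cbk itself.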
--- Additional comment from Jilju Joy on 2018-12-12 04:50:00 UTC ---

(In reply to Soumya Koduri from comment #4)
> @Jilju,
>
> Are you able to consistently reproduce this issue?

Hi Soumya,

This is the first occurrence reported. The setup has been kept in the same state to help with debugging. I can share the setup if required, or I can try to reproduce the issue.
--- Additional comment from Worker Ant on 2018-12-18 17:05:42 UTC ---

REVIEW: https://review.gluster.org/21882 (gfapi: Access fs->oldvolfile under mutex lock) posted (#1) for review on master by soumya k

--- Additional comment from Worker Ant on 2018-12-26 02:17:03 UTC ---

REVIEW: https://review.gluster.org/21882 (gfapi: Access fs->oldvolfile under mutex lock) posted (#2) for review on master by Amar Tumballi

--- Additional comment from Worker Ant on 2018-12-26 10:33:07 UTC ---

REVIEW: https://review.gluster.org/21927 (gfapi: nit cleanup related to releasing fs->mutex lock) posted (#1) for review on master by soumya k

--- Additional comment from Worker Ant on 2018-12-31 16:10:41 UTC ---

REVIEW: https://review.gluster.org/21927 (gfapi: nit cleanup related to releasing fs->mutex lock) posted (#2) for review on master by Kaleb KEITHLEY
REVIEW: https://review.gluster.org/21986 (gfapi: Access fs->oldvolfile under mutex lock) posted (#1) for review on release-4.1 by soumya k
This bug is getting closed because a release has been made available that should address the reported issue. If the problem is still not fixed with glusterfs-4.1.7, please open a new bug report.

glusterfs-4.1.7 has been announced on the Gluster mailing lists [1]; packages for several distributions should become available in the near future. Keep an eye on the Gluster Users mailing list [2] and the update infrastructure for your distribution.

[1] https://lists.gluster.org/pipermail/announce/2019-January/000118.html
[2] https://www.gluster.org/pipermail/gluster-users/