glusterfsd crashes when running the smallfile performance benchmark:

- reproduced on a single vm
- create a replica 2 volume
- mount natively and run the smallfile benchmark:

    smallfile_cli.py --top /mnt/test --threads 4 --file-size 32 --file-size-distribution exponential --files 1000 --operation create

- the test fails after a few seconds with the following error and both glusterfsd instances are gone:

0.00% of requested files processed, minimum is 70.00
Traceback (most recent call last):
  File "/root/smallfile/smallfile_cli.py", line 359, in <module>
    run_workload()
  File "/root/smallfile/smallfile_cli.py", line 334, in run_workload
    output_results.output_results( invoke_list, [ 'localhost' ], prm_thread_count, pct_files_min )
  File "/root/smallfile/output_results.py", line 39, in output_results
    raise SMFResultException('at least one thread encountered error, test may be incomplete')
smallfile_cli.SMFResultException: at least one thread encountered error, test may be incomplete
Running glusterfsd in gdb shows the following:

Program received signal SIGSEGV, Segmentation fault.
0x00007fffecf1260f in iot_schedule (frame=<optimized out>, this=0x64ce00, stub=0x96f7f4c)
    at io-threads.c:327
327             gf_log (this->name, GF_LOG_DEBUG, "%s scheduled as %s fop",
(gdb) bt
#0  0x00007fffecf1260f in iot_schedule (frame=<optimized out>, this=0x64ce00, stub=0x96f7f4c) at io-threads.c:327
#1  0x00007fffecf132e3 in iot_finodelk (frame=0x99bb5c, this=0x64ce00, volume=<optimized out>, fd=<optimized out>, cmd=<optimized out>, lock=<optimized out>, xdata=0x0) at io-threads.c:2155
#2  0x00007ffff7da1353 in default_finodelk (frame=0x99bb5c, this=0x64df50, volume=0x998d50 "", fd=0x96b618c, cmd=<optimized out>, lock=<optimized out>, xdata=0x0) at defaults.c:1194
#3  0x00007ffff7da1353 in default_finodelk (frame=0x99bb5c, this=0x64f0c0, volume=0x998d50 "", fd=0x96b618c, cmd=<optimized out>, lock=<optimized out>, xdata=0x0) at defaults.c:1194
#4  0x00007ffff7da1353 in default_finodelk (frame=0x99bb5c, this=0x6502f0, volume=0x998d50 "", fd=0x96b618c, cmd=<optimized out>, lock=<optimized out>, xdata=0x0) at defaults.c:1194
#5  0x00007fffec6b646d in io_stats_finodelk (frame=0x96b43dc, this=0x6515a0, volume=0x998d50 "", fd=0x96b618c, cmd=6, flock=0x96b4ff8, xdata=0x0) at io-stats.c:1846
#6  0x00007fffec49155e in server_finodelk_resume (frame=0x96b4b84, bound_xl=0x6515a0) at server-rpc-fops.c:2345
#7  0x00007fffec483fd3 in server_resolve_done (frame=0x96b4b84) at server-resolve.c:530
#8  0x00007fffec48405d in server_resolve_all (frame=<optimized out>) at server-resolve.c:565
#9  0x00007fffec48434d in server_resolve (frame=0x96b4b84) at server-resolve.c:512
#10 0x00007fffec48409e in server_resolve_all (frame=<optimized out>) at server-resolve.c:561
#11 0x00007fffec484151 in server_resolve_fd (frame=0x96b4b84) at server-resolve.c:476
#12 0x00007fffec4842f1 in server_resolve (frame=0x96b4b84) at server-resolve.c:493
#13 0x00007fffec48407e in server_resolve_all (frame=<optimized out>) at server-resolve.c:554
#14 0x00007fffec4848f4 in resolve_and_resume (frame=<optimized out>, fn=<optimized out>) at server-resolve.c:584
#15 0x00007fffec492faf in server3_3_finodelk (req=0x96b695c) at server-rpc-fops.c:5348
#16 0x00007ffff7b69f15 in rpcsvc_handle_rpc_call (svc=0x654530, trans=<optimized out>, msg=0x8009640) at rpcsvc.c:586
#17 0x00007ffff7b6a453 in rpcsvc_notify (trans=0x673c80, mydata=<optimized out>, event=<optimized out>, data=0x8009640) at rpcsvc.c:680
#18 0x00007ffff7b6d357 in rpc_transport_notify (this=<optimized out>, event=<optimized out>, data=<optimized out>) at rpc-transport.c:499
#19 0x00007fffee3a4d1b in socket_event_poll_in (this=0x673c80) at socket.c:2119
#20 0x00007fffee3a544c in socket_event_handler (fd=<optimized out>, idx=<optimized out>, data=0x673c80, poll_in=1, poll_out=0, poll_err=0) at socket.c:2234
#21 0x00007ffff7ddc3f3 in event_dispatch_epoll_handler (i=<optimized out>, events=0x6419d0, event_pool=0x62f710) at event-epoll.c:384
#22 event_dispatch_epoll (event_pool=0x62f710) at event-epoll.c:445
#23 0x00000000004049fb in main (argc=20, argv=0x7fffffffe2e8) at glusterfsd.c:1953
(gdb) p stub
$1 = (call_stub_t *) 0x96f7f4c
(gdb) p stub->fop
Cannot access memory at address 0x96f7f6c

I suspect this is a use-after-free race, since this log message appears after a call to do_iot_schedule(), which enqueues the stub and returns.
The test runs without failure if I swap the order of the debug log message and the call to do_iot_schedule().
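To illustrate the ordering issue, here is a minimal, self-contained sketch of the hand-off pattern. This is not GlusterFS code; the stub type, enqueue() and schedule() functions below are hypothetical stand-ins for call_stub_t, do_iot_schedule() and iot_schedule(). The point is simply that once the stub has been handed to the worker queue, the producing thread must not dereference it again.

/*
 * Hypothetical stand-in for the io-threads hand-off.  Once schedule()
 * has pushed the stub onto the worker queue, the worker may resume and
 * destroy it at any moment, so the caller must not touch it afterwards.
 * Compile with: gcc -pthread uaf_sketch.c
 */
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

struct stub {                    /* stand-in for call_stub_t */
        char         fop[16];    /* name of the queued operation */
        struct stub *next;
};

static struct stub     *queue_head;
static pthread_mutex_t  lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t   cond = PTHREAD_COND_INITIALIZER;

/* stand-in for do_iot_schedule(): enqueue the stub and wake a worker */
static void enqueue(struct stub *stub)
{
        pthread_mutex_lock(&lock);
        stub->next = queue_head;
        queue_head = stub;
        pthread_cond_signal(&cond);
        pthread_mutex_unlock(&lock);
}

/* worker thread: resume (here: print) and destroy each stub it receives */
static void *worker(void *arg)
{
        (void)arg;
        for (;;) {
                pthread_mutex_lock(&lock);
                while (!queue_head)
                        pthread_cond_wait(&cond, &lock);
                struct stub *stub = queue_head;
                queue_head = stub->next;
                pthread_mutex_unlock(&lock);

                printf("resumed %s\n", stub->fop);
                free(stub);      /* the stub is gone after this point */
        }
        return NULL;
}

/* stand-in for iot_schedule() with the reordered (safe) sequence */
static void schedule(struct stub *stub)
{
        /* Safe: log while this thread still owns the stub ... */
        printf("%s scheduled\n", stub->fop);
        /* ... then hand it off.  Reading stub->fop *after* enqueue()
         * would race with the worker's free() -- the use-after-free
         * pattern suspected in the backtrace above. */
        enqueue(stub);
}

int main(void)
{
        pthread_t tid;
        pthread_create(&tid, NULL, worker, NULL);

        struct stub *stub = calloc(1, sizeof(*stub));
        strcpy(stub->fop, "FINODELK");
        schedule(stub);

        sleep(1);                /* give the worker time to run, then exit */
        return 0;
}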
REVIEW: http://review.gluster.org/5418 (performance/io-threads: fix potential use after free crash) posted (#1) for review on master by Brian Foster (bfoster)
COMMIT: http://review.gluster.org/5418 committed in master by Anand Avati (avati)
------
commit a1ece79c84c0306d38cb42cfa947dbc387af3ad0
Author: Brian Foster <bfoster>
Date:   Mon Jul 29 11:18:49 2013 -0400

    performance/io-threads: fix potential use after free crash

    do_iot_schedule() enqueues the stub and kicks the worker thread. The
    stub is eventually destroyed after it has been resumed and thus unsafe
    to access after being enqueued. Though likely difficult to reproduce in
    a real deployment, a crash is reproducible by running a smallfile
    benchmark on a replica 2 volume on a single vm. Reorder the debug log
    message prior to the do_iot_schedule() call to avoid the crash.

    BUG: 989579
    Change-Id: Ifc6502c02ae455c959a90ff1ca62a690e31ceafb
    Signed-off-by: Brian Foster <bfoster>
    Reviewed-on: http://review.gluster.org/5418
    Reviewed-by: Santosh Pradhan <spradhan>
    Tested-by: Gluster Build System <jenkins.com>
    Reviewed-by: Anand Avati <avati>
REVIEW: http://review.gluster.org/5815 (performance/io-threads: fix potential use after free crash) posted (#1) for review on release-3.4 by Vijay Bellur (vbellur)
REVIEW: http://review.gluster.org/5815 (performance/io-threads: fix potential use after free crash) posted (#2) for review on release-3.4 by Vijay Bellur (vbellur)
REVIEW: http://review.gluster.org/5815 (performance/io-threads: fix potential use after free crash) posted (#3) for review on release-3.4 by Vijay Bellur (vbellur)
REVIEW: http://review.gluster.org/5815 (performance/io-threads: fix potential use after free crash) posted (#4) for review on release-3.4 by Vijay Bellur (vbellur)
REVIEW: http://review.gluster.org/5815 (performance/io-threads: fix potential use after free crash) posted (#5) for review on release-3.4 by Vijay Bellur (vbellur)
REVIEW: http://review.gluster.org/5815 (performance/io-threads: fix potential use after free crash) posted (#6) for review on release-3.4 by Vijay Bellur (vbellur)
COMMIT: http://review.gluster.org/5815 committed in release-3.4 by Anand Avati (avati)
------
commit 1e09a63fe4ce14ee6a3f8f3a0cc5657812adcbf9
Author: Brian Foster <bfoster>
Date:   Mon Jul 29 11:18:49 2013 -0400

    performance/io-threads: fix potential use after free crash

    do_iot_schedule() enqueues the stub and kicks the worker thread. The
    stub is eventually destroyed after it has been resumed and thus unsafe
    to access after being enqueued. Though likely difficult to reproduce in
    a real deployment, a crash is reproducible by running a smallfile
    benchmark on a replica 2 volume on a single vm. Reorder the debug log
    message prior to the do_iot_schedule() call to avoid the crash.

    BUG: 989579
    Change-Id: Ifc6502c02ae455c959a90ff1ca62a690e31ceafb
    Signed-off-by: Brian Foster <bfoster>
    Reviewed-on: http://review.gluster.org/5418
    Reviewed-by: Santosh Pradhan <spradhan>
    Tested-by: Gluster Build System <jenkins.com>
    Reviewed-by: Anand Avati <avati>
    Reviewed-on: http://review.gluster.org/5815
    Reviewed-by: Kaleb KEITHLEY <kkeithle>
This bug is being closed because a release is now available that should address the reported issue. If the problem is still not fixed with glusterfs-3.4.3, please reopen this bug report.

glusterfs-3.4.3 has been announced on the Gluster Developers mailing list [1]; packages for several distributions should already be available or will become available in the near future. Keep an eye on the Gluster Users mailing list [2] and on the update infrastructure for your distribution.

The fix for this bug is likely to be included in all future GlusterFS releases, i.e. releases newer than 3.4.3. Likewise, the most recent release, glusterfs-3.5.0 [3], is likely to contain the fix. You can verify this by reading the comments in this bug report and checking for comments mentioning "committed in release-3.5".

[1] http://thread.gmane.org/gmane.comp.file-systems.gluster.devel/5978
[2] http://news.gmane.org/gmane.comp.file-systems.gluster.user
[3] http://thread.gmane.org/gmane.comp.file-systems.gluster.devel/6137