Bug 989579 - glusterfsd crashes on smallfile benchmark
Summary: glusterfsd crashes on smallfile benchmark
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: GlusterFS
Classification: Community
Component: core
Version: 3.4.0
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: unspecified
Target Milestone: ---
Assignee: Brian Foster
QA Contact:
URL:
Whiteboard:
Depends On:
Blocks: 996312
 
Reported: 2013-07-29 14:51 UTC by Brian Foster
Modified: 2014-04-17 13:13 UTC (History)

Fixed In Version: glusterfs-3.4.3
Clone Of:
: 996312 (view as bug list)
Environment:
Last Closed: 2014-04-17 13:13:45 UTC
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Embargoed:



Description Brian Foster 2013-07-29 14:51:01 UTC
glusterfsd crashes when running the smallfile performance benchmark:

- reproduced on a single vm
- create a replica 2 volume
- mount natively and run the smallfile benchmark:

smallfile_cli.py --top /mnt/test --threads 4 --file-size 32 --file-size-distribution exponential --files 1000 --operation create

- the test fails after a few seconds with the following error and both glusterfsd instances are gone:

  0.00% of requested files processed, minimum is  70.00
Traceback (most recent call last):
  File "/root/smallfile/smallfile_cli.py", line 359, in <module>
    run_workload()
  File "/root/smallfile/smallfile_cli.py", line 334, in run_workload
    output_results.output_results( invoke_list, [ 'localhost' ], prm_thread_count, pct_files_min )
  File "/root/smallfile/output_results.py", line 39, in output_results
    raise SMFResultException('at least one thread encountered error, test may be incomplete')
smallfile_cli.SMFResultException: at least one thread encountered error, test may be incomplete

Comment 1 Brian Foster 2013-07-29 14:55:43 UTC
Running glusterfsd in gdb shows the following:

Program received signal SIGSEGV, Segmentation fault.
0x00007fffecf1260f in iot_schedule (frame=<optimized out>, this=0x64ce00, stub=0x96f7f4c) at io-threads.c:327
327	        gf_log (this->name, GF_LOG_DEBUG, "%s scheduled as %s fop",
(gdb) bt
#0  0x00007fffecf1260f in iot_schedule (frame=<optimized out>, this=0x64ce00, stub=0x96f7f4c) at io-threads.c:327
#1  0x00007fffecf132e3 in iot_finodelk (frame=0x99bb5c, this=0x64ce00, volume=<optimized out>, fd=<optimized out>, cmd=<optimized out>, 
    lock=<optimized out>, xdata=0x0) at io-threads.c:2155
#2  0x00007ffff7da1353 in default_finodelk (frame=0x99bb5c, this=0x64df50, volume=0x998d50 "", fd=0x96b618c, cmd=<optimized out>, lock=<optimized out>, 
    xdata=0x0) at defaults.c:1194
#3  0x00007ffff7da1353 in default_finodelk (frame=0x99bb5c, this=0x64f0c0, volume=0x998d50 "", fd=0x96b618c, cmd=<optimized out>, lock=<optimized out>, 
    xdata=0x0) at defaults.c:1194
#4  0x00007ffff7da1353 in default_finodelk (frame=0x99bb5c, this=0x6502f0, volume=0x998d50 "", fd=0x96b618c, cmd=<optimized out>, lock=<optimized out>, 
    xdata=0x0) at defaults.c:1194
#5  0x00007fffec6b646d in io_stats_finodelk (frame=0x96b43dc, this=0x6515a0, volume=0x998d50 "", fd=0x96b618c, cmd=6, flock=0x96b4ff8, xdata=0x0)
    at io-stats.c:1846
#6  0x00007fffec49155e in server_finodelk_resume (frame=0x96b4b84, bound_xl=0x6515a0) at server-rpc-fops.c:2345
#7  0x00007fffec483fd3 in server_resolve_done (frame=0x96b4b84) at server-resolve.c:530
#8  0x00007fffec48405d in server_resolve_all (frame=<optimized out>) at server-resolve.c:565
#9  0x00007fffec48434d in server_resolve (frame=0x96b4b84) at server-resolve.c:512
#10 0x00007fffec48409e in server_resolve_all (frame=<optimized out>) at server-resolve.c:561
#11 0x00007fffec484151 in server_resolve_fd (frame=0x96b4b84) at server-resolve.c:476
#12 0x00007fffec4842f1 in server_resolve (frame=0x96b4b84) at server-resolve.c:493
#13 0x00007fffec48407e in server_resolve_all (frame=<optimized out>) at server-resolve.c:554
#14 0x00007fffec4848f4 in resolve_and_resume (frame=<optimized out>, fn=<optimized out>) at server-resolve.c:584
#15 0x00007fffec492faf in server3_3_finodelk (req=0x96b695c) at server-rpc-fops.c:5348
#16 0x00007ffff7b69f15 in rpcsvc_handle_rpc_call (svc=0x654530, trans=<optimized out>, msg=0x8009640) at rpcsvc.c:586
#17 0x00007ffff7b6a453 in rpcsvc_notify (trans=0x673c80, mydata=<optimized out>, event=<optimized out>, data=0x8009640) at rpcsvc.c:680
#18 0x00007ffff7b6d357 in rpc_transport_notify (this=<optimized out>, event=<optimized out>, data=<optimized out>) at rpc-transport.c:499
#19 0x00007fffee3a4d1b in socket_event_poll_in (this=0x673c80) at socket.c:2119
#20 0x00007fffee3a544c in socket_event_handler (fd=<optimized out>, idx=<optimized out>, data=0x673c80, poll_in=1, poll_out=0, poll_err=0) at socket.c:2234
#21 0x00007ffff7ddc3f3 in event_dispatch_epoll_handler (i=<optimized out>, events=0x6419d0, event_pool=0x62f710) at event-epoll.c:384
#22 event_dispatch_epoll (event_pool=0x62f710) at event-epoll.c:445
#23 0x00000000004049fb in main (argc=20, argv=0x7fffffffe2e8) at glusterfsd.c:1953
(gdb) p stub
$1 = (call_stub_t *) 0x96f7f4c
(gdb) p stub->fop
Cannot access memory at address 0x96f7f6c

I suspect this is a use-after-free race, since this log message appears after a call to do_iot_schedule(), which enqueues the stub and returns.
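For illustration, the unsafe ordering looks roughly like the following. This is a sketch of the pattern being described, not the verbatim io-threads.c source; priv, pri, and pri_name are assumed placeholder names, while do_iot_schedule(), gf_log(), gf_fop_list, and stub->fop come from the report above:

    /* sketch of the racy ordering in iot_schedule() -- not verbatim source */
    ret = do_iot_schedule (priv, stub, pri);   /* stub is enqueued here; a worker
                                                  thread may resume and destroy it
                                                  at any point after this call */
    gf_log (this->name, GF_LOG_DEBUG, "%s scheduled as %s fop",
            gf_fop_list[stub->fop], pri_name); /* dereferences stub->fop: possible
                                                  use after free */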

Comment 2 Brian Foster 2013-07-29 15:03:04 UTC
The test runs without failure if I move the debug log message ahead of the call to do_iot_schedule().
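A sketch of that reordering (the assumed shape of the fix, not the exact patch; placeholder names as in the previous sketch):

    /* log first, while the stub is still owned by this thread ... */
    gf_log (this->name, GF_LOG_DEBUG, "%s scheduled as %s fop",
            gf_fop_list[stub->fop], pri_name);
    /* ... then hand the stub off; it must not be touched afterwards */
    ret = do_iot_schedule (priv, stub, pri);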

Comment 3 Anand Avati 2013-07-29 15:19:31 UTC
REVIEW: http://review.gluster.org/5418 (performance/io-threads: fix potential use after free crash) posted (#1) for review on master by Brian Foster (bfoster)

Comment 4 Anand Avati 2013-08-02 04:03:59 UTC
COMMIT: http://review.gluster.org/5418 committed in master by Anand Avati (avati) 
------
commit a1ece79c84c0306d38cb42cfa947dbc387af3ad0
Author: Brian Foster <bfoster>
Date:   Mon Jul 29 11:18:49 2013 -0400

    performance/io-threads: fix potential use after free crash
    
    do_iot_schedule() enqueues the stub and kicks the worker thread.
    The stub is eventually destroyed after it has been resumed and thus
    unsafe to access after being enqueued.
    
    Though likely difficult to reproduce in a real deployment, a crash
    is reproducible by running a smallfile benchmark on a replica 2
    volume on a single vm. Reorder the debug log message prior to the
    do_iot_schedule() call to avoid the crash.
    
    BUG: 989579
    Change-Id: Ifc6502c02ae455c959a90ff1ca62a690e31ceafb
    Signed-off-by: Brian Foster <bfoster>
    Reviewed-on: http://review.gluster.org/5418
    Reviewed-by: Santosh Pradhan <spradhan>
    Tested-by: Gluster Build System <jenkins.com>
    Reviewed-by: Anand Avati <avati>

Comment 5 Anand Avati 2013-09-05 10:41:28 UTC
REVIEW: http://review.gluster.org/5815 (performance/io-threads: fix potential use after free crash) posted (#1) for review on release-3.4 by Vijay Bellur (vbellur)

Comment 6 Anand Avati 2013-09-05 14:48:39 UTC
REVIEW: http://review.gluster.org/5815 (performance/io-threads: fix potential use after free crash) posted (#2) for review on release-3.4 by Vijay Bellur (vbellur)

Comment 7 Anand Avati 2013-09-05 15:11:05 UTC
REVIEW: http://review.gluster.org/5815 (performance/io-threads: fix potential use after free crash) posted (#3) for review on release-3.4 by Vijay Bellur (vbellur)

Comment 8 Anand Avati 2013-09-06 04:25:13 UTC
REVIEW: http://review.gluster.org/5815 (performance/io-threads: fix potential use after free crash) posted (#4) for review on release-3.4 by Vijay Bellur (vbellur)

Comment 9 Anand Avati 2013-09-06 07:41:13 UTC
REVIEW: http://review.gluster.org/5815 (performance/io-threads: fix potential use after free crash) posted (#5) for review on release-3.4 by Vijay Bellur (vbellur)

Comment 10 Anand Avati 2013-09-07 14:59:19 UTC
REVIEW: http://review.gluster.org/5815 (performance/io-threads: fix potential use after free crash) posted (#6) for review on release-3.4 by Vijay Bellur (vbellur)

Comment 11 Anand Avati 2013-09-10 00:26:20 UTC
COMMIT: http://review.gluster.org/5815 committed in release-3.4 by Anand Avati (avati) 
------
commit 1e09a63fe4ce14ee6a3f8f3a0cc5657812adcbf9
Author: Brian Foster <bfoster>
Date:   Mon Jul 29 11:18:49 2013 -0400

    performance/io-threads: fix potential use after free crash
    
    do_iot_schedule() enqueues the stub and kicks the worker thread.
    The stub is eventually destroyed after it has been resumed and thus
    unsafe to access after being enqueued.
    
    Though likely difficult to reproduce in a real deployment, a crash
    is reproducible by running a smallfile benchmark on a replica 2
    volume on a single vm. Reorder the debug log message prior to the
    do_iot_schedule() call to avoid the crash.
    
    BUG: 989579
    Change-Id: Ifc6502c02ae455c959a90ff1ca62a690e31ceafb
    Signed-off-by: Brian Foster <bfoster>
    Reviewed-on: http://review.gluster.org/5418
    Reviewed-by: Santosh Pradhan <spradhan>
    Tested-by: Gluster Build System <jenkins.com>
    Reviewed-by: Anand Avati <avati>
    Reviewed-on: http://review.gluster.org/5815
    Reviewed-by: Kaleb KEITHLEY <kkeithle>

Comment 12 Niels de Vos 2014-04-17 13:13:45 UTC
This bug is being closed because a release that should address the reported issue has been made available. If the problem is still present with glusterfs-3.4.3, please reopen this bug report.

glusterfs-3.4.3 has been announced on the Gluster Developers mailing list [1]; packages for several distributions should already be available or will become available in the near future. Keep an eye on the Gluster Users mailing list [2] and the update infrastructure for your distribution.

The fix for this bug is likely to be included in all future GlusterFS releases, i.e. releases newer than 3.4.3. Likewise, the recent glusterfs-3.5.0 release [3] is likely to contain the fix; you can verify this by reading the comments in this bug report and checking for comments mentioning "committed in release-3.5".

[1] http://thread.gmane.org/gmane.comp.file-systems.gluster.devel/5978
[2] http://news.gmane.org/gmane.comp.file-systems.gluster.user
[3] http://thread.gmane.org/gmane.comp.file-systems.gluster.devel/6137

