Bug 1180231 - glusterfs-fuse: Crash due to race in FUSE notify when multiple epoll threads invoke the routine
Summary: glusterfs-fuse: Crash due to race in FUSE notify when multiple epoll threads invoke the routine
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: GlusterFS
Classification: Community
Component: fuse
Version: mainline
Hardware: Unspecified
OS: Unspecified
Target Milestone: ---
Assignee: Shyamsundar
QA Contact:
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2015-01-08 16:39 UTC by Shyamsundar
Modified: 2015-05-14 17:45 UTC (History)
2 users

Fixed In Version: glusterfs-3.7.0
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2015-05-14 17:28:55 UTC
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Embargoed:



Description Shyamsundar 2015-01-08 16:39:12 UTC
Description of problem:
While running the regression test suite for the multi-threaded epoll patch (http://review.gluster.org/#/c/3842/), glusterfs crashed as shown below. The crashes mostly occur just after the volume is mounted via FUSE and a graph change event is triggered (and occasionally even when no such event is generated).

Core details:
Core was generated by `glusterfs --volfile-server=127.1.1.1 --volfile-id=patchy /mnt/glusterfs/0'.
Program terminated with signal 11, Segmentation fault.
#0  get_fuse_state (this=0x1f9c8e0, finh=0x7f3c28000900) at fuse-helpers.c:127
127                     active_subvol->winds++;

(gdb) bt
#0  get_fuse_state (this=0x1f9c8e0, finh=0x7f3c28000900) at fuse-helpers.c:127
#1  0x00007f3c472ef6cd in fuse_getattr (this=0x1f9c8e0, finh=0x7f3c28000900, msg=<value optimized out>) at fuse-bridge.c:846
#2  0x00007f3c472f9950 in fuse_thread_proc (data=0x1f9c8e0) at fuse-bridge.c:4899
#3  0x00007f3c4f0b19d1 in start_thread () from /lib64/libpthread.so.0
#4  0x00007f3c4ea1b9dd in clone () from /lib64/libc.so.6

(gdb) p active_subvol 
$1 = (xlator_t *) 0x0

<<edited output below>>
(gdb) p *(fuse_private_t *)state->this->private 
$7 = {fd = 8, volfile = 0x0, mount_point = 0x1f9db20 "/mnt/glusterfs/0", fuse_thread = 139896664205056, fuse_thread_started = 1 '\001',<...> event_recvd = 1 '\001', init_recvd = 1 '\001', <...> next_graph = 0x7f3c34000900, active_subvol = 0x0, <...> use_readdirp = _gf_true}

How reproducible:
Happens fairly often on the regression runs, but is quite difficult to reproduce on other machines.

Steps to Reproduce:
Regression runs failed in the above review; specifically, test case bug-948686.t.

Additional info:
The root cause is as follows:

1) As there are 2 (or more) epoll threads, both threads receive a CHILD_UP notification for the same graph (graph ID 0).

2) When both threads race into notify and call fuse_graph_setup, one succeeds in setting graph->used and the other bails out on seeing that the graph is already in use.

3) The thread that bails out then starts fuse_thread_proc while the other thread is still updating the fuse private structure.

4) Because fuse_thread_proc starts before the other thread has completed fuse_graph_setup, priv->next_graph is still NULL at this point, so fuse_graph_sync does not promote it to active_subvol.

As a result, when the request is processed, active_subvol is NULL and we crash (a simplified sketch of the race follows below).
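
To make the sequence above concrete, here is a minimal, self-contained sketch of the racy pattern. The names and structures (fuse_priv, graph0, notify_child_up, start_worker) are invented for this illustration and do not match the GlusterFS sources; the sketch only shows how the thread that loses the graph->used check can start the worker before the winner has published next_graph.

/* Minimal, hypothetical illustration of the race (names invented for
 * this sketch).  Compile with:  gcc -pthread race_sketch.c -o race_sketch
 */
#include <pthread.h>
#include <stdio.h>
#include <unistd.h>

struct graph {
        int used;                      /* "graph->used" in the description */
};

struct fuse_priv {
        pthread_mutex_t  lock;
        struct graph    *next_graph;   /* promoted to active_subvol later */
};

static struct fuse_priv priv  = { PTHREAD_MUTEX_INITIALIZER, NULL };
static struct graph     graph0;

static void start_worker (void)
{
        /* Stands in for fuse_thread_proc: if next_graph is still NULL,
         * nothing gets promoted to active_subvol and the first request
         * dereferences a NULL pointer. */
        if (priv.next_graph == NULL)
                printf ("worker started with next_graph == NULL -> crash path\n");
}

static void *notify_child_up (void *arg)
{
        int winner = 0;
        (void)arg;

        pthread_mutex_lock (&priv.lock);
        if (!graph0.used) {
                graph0.used = 1;            /* claim the graph ...             */
                winner = 1;
        }
        pthread_mutex_unlock (&priv.lock);  /* ... but drop the lock too early */

        if (winner) {
                usleep (1000);              /* window in which the race hits   */
                priv.next_graph = &graph0;  /* published outside the lock      */
        } else {
                start_worker ();            /* loser starts the worker early   */
        }
        return NULL;
}

int main (void)
{
        pthread_t t1, t2;

        /* Two "epoll threads" delivering CHILD_UP for the same graph. */
        pthread_create (&t1, NULL, notify_child_up, NULL);
        pthread_create (&t2, NULL, notify_child_up, NULL);
        pthread_join (t1, NULL);
        pthread_join (t2, NULL);
        return 0;
}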

Other places where the graph could become NULL were examined (by Du and me); the graph can never be NULL there, so this happens at startup itself, due to the race above.

The resolution looks like extending the critical section in which the graph is marked used and set into next_graph, so that fuse_graph_sync sees the right state.
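
A hedged sketch of that direction, reusing the simplified types from the illustration above (and again not the actual patch under review): mark the graph used and publish next_graph inside the same critical section, and only start the worker once next_graph is visible.

static int worker_started;              /* protected by priv.lock */

static void *notify_child_up_fixed (void *arg)
{
        int start = 0;
        (void)arg;

        pthread_mutex_lock (&priv.lock);
        if (!graph0.used) {
                graph0.used     = 1;
                priv.next_graph = &graph0;   /* published under the same lock  */
        }
        if (!worker_started) {               /* decided under the same lock,   */
                worker_started = 1;          /* so next_graph is already set   */
                start = 1;
        }
        pthread_mutex_unlock (&priv.lock);

        if (start)
                start_worker ();             /* sees next_graph != NULL        */
        return NULL;
}

With both updates in one critical section, whichever thread starts the worker is guaranteed to see next_graph set, so fuse_graph_sync can promote it to active_subvol before any request is processed.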

I added log messages for when graph_sync is called, when graph_setup is done, and so on. What they show is that graph_sync is called while 2 graph_setup calls are racing at that point, causing the issue described above.

I did not verify that one of the threads bailed out on finding graph->used set to true, though. That seems most likely, since there are multiple epoll threads and the notify calls would race.

Comment 1 Anand Avati 2015-01-08 19:02:50 UTC
REVIEW: http://review.gluster.org/9421 (fuse: Fix cores in notify function when this is executed in parallel) posted (#1) for review on master by Shyamsundar Ranganathan (srangana)

Comment 2 Anand Avati 2015-01-12 16:25:48 UTC
REVIEW: http://review.gluster.org/9421 (fuse: Fix cores in notify function when this is executed in parallel) posted (#2) for review on master by Shyamsundar Ranganathan (srangana)

Comment 3 Anand Avati 2015-01-12 16:30:45 UTC
REVIEW: http://review.gluster.org/9421 (fuse: Fix cores in notify function when this is executed in parallel) posted (#3) for review on master by Shyamsundar Ranganathan (srangana)

Comment 4 Anand Avati 2015-01-13 05:14:58 UTC
COMMIT: http://review.gluster.org/9421 committed in master by Raghavendra G (rgowdapp) 
------
commit 3971315248c57386e05e6c8f57369a4571555cb2
Author: Shyam <srangana>
Date:   Thu Jan 8 13:56:08 2015 -0500

    fuse: Fix cores in notify function when this is executed in parallel
    
    The fuse notify function gets called by the epoll or the poll thread,
    and as long as there is a single epoll thread, 2 notify instances
    cannot race with each other.

    With the upcoming multi-threaded epoll changes, it is possible that
    2 epoll threads invoke the notify function. This commit fixes the
    resulting races in that function.

    The races seen are detailed in the bug. The fix is to enforce a
    (slightly) longer critical section when updating the fuse private
    structure, and to defer state updates until after error handling.
    
    Change-Id: I6974bc043cb59eb6dc39c5777123364dcefca358
    BUG: 1180231
    Signed-off-by: Shyam <srangana>
    Reviewed-on: http://review.gluster.org/9421
    Tested-by: Gluster Build System <jenkins.com>
    Reviewed-by: Raghavendra G <rgowdapp>
    Tested-by: Raghavendra G <rgowdapp>

Comment 5 Niels de Vos 2015-05-14 17:28:55 UTC
This bug is being closed because a release that should address the reported issue has been made available. If the problem is still not fixed with glusterfs-3.7.0, please open a new bug report.

glusterfs-3.7.0 has been announced on the Gluster mailing lists [1]; packages for several distributions should become available in the near future. Keep an eye on the Gluster Users mailing list [2] and on the update infrastructure for your distribution.

[1] http://thread.gmane.org/gmane.comp.file-systems.gluster.devel/10939
[2] http://thread.gmane.org/gmane.comp.file-systems.gluster.user

