Bug 1246121

Summary: Disperse volume: client glusterfs crashed while running IO
Product: [Community] GlusterFS
Component: disperse
Version: 3.7.2
Reporter: Xavi Hernandez <jahernan>
Assignee: Xavi Hernandez <jahernan>
Status: CLOSED CURRENTRELEASE
Severity: unspecified
Priority: unspecified
CC: annair, asrivast, bugs, byarlaga, gluster-bugs, nsathyan, pkarampu, rhs-bugs
Keywords: Triaged
Target Milestone: ---
Target Release: ---
Hardware: Unspecified
OS: Unspecified
Fixed In Version: glusterfs-3.7.9
Doc Type: Bug Fix
Story Points: ---
Clone Of: 1243187
Last Closed: 2015-07-30 09:47:42 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
Category: ---
oVirt Team: ---
Cloudforms Team: ---
Bug Depends On: 1243187
Bug Blocks: 1233025

Description Xavi Hernandez 2015-07-23 13:32:51 UTC
+++ This bug was initially created as a clone of Bug #1243187 +++

Description of problem:
======================

The client glusterfs process crashed while running I/O on a FUSE mount (file creation with dd using varying block sizes, directory creation, and Linux untars).

Backtrace:
=========

(gdb) bt
#0  ec_unlock_timer_del (link=0x7fe43c231c2c) at ec-common.c:1696
#1  0x00007fe451f28713 in gf_timer_proc (ctx=0x7fe452d71010) at timer.c:194
#2  0x00007fe450ff1a51 in start_thread () from /lib64/libpthread.so.0
#3  0x00007fe45095b96d in clone () from /lib64/libc.so.6
(gdb)  t a a  bt

Thread 9 (Thread 0x7fe437fff700 (LWP 15718)):
#0  0x00007fe450953a57 in writev () from /lib64/libc.so.6
#1  0x00007fe4492251ff in send_fuse_iov (this=<value optimized out>, finh=<value optimized out>, iov_out=0x7fe437ffe120, count=2) at fuse-bridge.c:191
#2  0x00007fe44922541a in send_fuse_data (this=<value optimized out>, finh=<value optimized out>, data=<value optimized out>, size=<value optimized out>)
    at fuse-bridge.c:230
#3  0x00007fe449235bda in fuse_writev_cbk (frame=0x7fe44f9c0958, cookie=<value optimized out>, this=0x7fe452d9e600, op_ret=65536, op_errno=0, 
    stbuf=0x7fe437ffe500, postbuf=0x7fe437ffe500, xdata=0x0) at fuse-bridge.c:2269
#4  0x00007fe43eb2a3ae in io_stats_writev_cbk (frame=0x7fe44fb0b380, cookie=<value optimized out>, this=<value optimized out>, op_ret=65536, op_errno=0, 
    prebuf=0x7fe437ffe500, postbuf=0x7fe437ffe500, xdata=0x0) at io-stats.c:1404
#5  0x00007fe43ef4ddb0 in mdc_writev_cbk (frame=0x7fe44fb57fb0, cookie=<value optimized out>, this=<value optimized out>, op_ret=65536, 
    op_errno=<value optimized out>, prebuf=0x7fe437ffe500, postbuf=0x7fe437ffe500, xdata=0x0) at md-cache.c:1524
#6  0x00007fe451f1706c in default_writev_cbk (frame=0x7fe44fb28dd8, cookie=<value optimized out>, this=<value optimized out>, op_ret=65536, op_errno=0, 
    prebuf=<value optimized out>, postbuf=0x7fe437ffe500, xdata=0x0) at defaults.c:1021
#7  0x00007fe451f1706c in default_writev_cbk (frame=0x7fe44fb3a0a4, cookie=<value optimized out>, this=<value optimized out>, op_ret=65536, op_errno=0, 
    prebuf=<value optimized out>, postbuf=0x7fe437ffe500, xdata=0x0) at defaults.c:1021
#8  0x00007fe43f56e974 in ioc_writev_cbk (frame=0x7fe44fb2da70, cookie=<value optimized out>, this=<value optimized out>, op_ret=65536, op_errno=0, 
    prebuf=0x7fe437ffe500, postbuf=0x7fe437ffe500, xdata=0x0) at io-cache.c:1244
#9  0x00007fe43f982bd9 in ra_writev_cbk (frame=0x7fe44fb35360, cookie=<value optimized out>, this=<value optimized out>, op_ret=65536, op_errno=0, 
    prebuf=0x7fe437ffe500, postbuf=0x7fe437ffe500, xdata=0x0) at read-ahead.c:661
#10 0x00007fe43fb8f30c in wb_do_unwinds (wb_inode=<value optimized out>, lies=0x7fe437ffe5d0) at write-behind.c:924
#11 0x00007fe43fb91b9c in wb_process_queue (wb_inode=0x7fe4258cb4a0) at write-behind.c:1213
#12 0x00007fe43fb924f8 in wb_writev (frame=0x7fe44fadae2c, this=<value optimized out>, fd=0x7fe441da9cd0, vector=0x7fe3eb17e6f0, count=1, offset=6160384, 
    flags=32769, iobref=0x7fe401611f30, xdata=0x0) at write-behind.c:1333
#13 0x00007fe43f98288a in ra_writev (frame=0x7fe44fb35360, this=0x7fe439c61a20, fd=0x7fe441da9cd0, vector=0x7fe3eb17e6f0, count=1, offset=6160384, 
    flags=32769, iobref=0x7fe401611f30, xdata=0x0) at read-ahead.c:689
#14 0x00007fe451f0f960 in default_writev (frame=0x7fe44fb35360, this=0x7fe439c62780, fd=0x7fe441da9cd0, vector=0x7fe3eb17e6f0, count=<value optimized out>, 
    off=<value optimized out>, flags=32769, iobref=0x7fe401611f30, xdata=0x0) at defaults.c:1836
#15 0x00007fe43f56e6c6 in ioc_writev (frame=0x7fe44fb2da70, this=0x7fe439c63610, fd=0x7fe441da9cd0, vector=0x7fe3eb17e6f0, count=1, offset=6160384, 
    flags=32769, iobref=0x7fe401611f30, xdata=0x0) at io-cache.c:1285
#16 0x00007fe43f362040 in qr_writev (frame=0x7fe44fb3a0a4, this=0x7fe439c64360, fd=0x7fe441da9cd0, iov=0x7fe3eb17e6f0, count=1, offset=6160384, flags=32769, 
    iobref=0x7fe401611f30, xdata=0x0) at quick-read.c:636
#17 0x00007fe451f13a23 in default_writev_resume (frame=0x7fe44fb28dd8, this=0x7fe439c65120, fd=0x7fe441da9cd0, vector=0x7fe3eb17e6f0, count=1, off=6160384, 
    flags=32769, iobref=0x7fe401611f30, xdata=0x0) at defaults.c:1395
#18 0x00007fe451f2d3aa in call_resume_wind (stub=<value optimized out>) at call-stub.c:2124
#19 0x00007fe451f31640 in call_resume (stub=0x7fe44f5a5c24) at call-stub.c:2576
#20 0x00007fe43f158760 in open_and_resume (this=0x7fe439c65120, fd=0x7fe441da9cd0, stub=0x7fe44f5a5c24) at open-behind.c:242
#21 0x00007fe43f159dfb in ob_writev (frame=0x7fe44fb28dd8, this=0x7fe439c65120, fd=0x7fe441da9cd0, iov=<value optimized out>, count=<value optimized out>, 
    offset=<value optimized out>, flags=32769, iobref=0x7fe401611f30, xdata=0x0) at open-behind.c:414
#22 0x00007fe43ef4ab3b in mdc_writev (frame=0x7fe44fb57fb0, this=0x7fe439c65ee0, fd=0x7fe441da9cd0, vector=0x7fe405cc40c0, count=1, offset=6160384, 
    flags=32769, iobref=0x7fe401611f30, xdata=0x0) at md-cache.c:1542
#23 0x00007fe451f0f960 in default_writev (frame=0x7fe44fb57fb0, this=0x7fe439c68030, fd=0x7fe441da9cd0, vector=0x7fe405cc40c0, count=<value optimized out>, 
    off=<value optimized out>, flags=32769, iobref=0x7fe401611f30, xdata=0x0) at defaults.c:1836
#24 0x00007fe43eb227a7 in io_stats_writev (frame=0x7fe44fb0b380, this=0x7fe439c69180, fd=0x7fe441da9cd0, vector=0x7fe405cc40c0, count=1, offset=6160384, 
    flags=32769, iobref=0x7fe401611f30, xdata=0x0) at io-stats.c:2164
#25 0x00007fe451f0f960 in default_writev (frame=0x7fe44fb0b380, this=0x7fe439c6a180, fd=0x7fe441da9cd0, vector=0x7fe405cc40c0, count=<value optimized out>, 
    off=<value optimized out>, flags=32769, iobref=0x7fe401611f30, xdata=0x0) at defaults.c:1836
#26 0x00007fe43e90904e in meta_writev (frame=0x7fe44fb0b380, this=0x7fe439c6a180, fd=0x7fe441da9cd0, iov=0x7fe405cc40c0, count=<value optimized out>, 
    offset=<value optimized out>, flags=32769, iobref=0x7fe401611f30, xdata=0x0) at meta.c:147
#27 0x00007fe4492313eb in fuse_write_resume (state=<value optimized out>) at fuse-bridge.c:2309
#28 0x00007fe4492245f5 in fuse_resolve_done (state=<value optimized out>) at fuse-resolve.c:644
#29 fuse_resolve_all (state=<value optimized out>) at fuse-resolve.c:671
#30 0x00007fe449224326 in fuse_resolve (state=0x7fe405cc39d0) at fuse-resolve.c:635
#31 0x00007fe44922463e in fuse_resolve_all (state=<value optimized out>) at fuse-resolve.c:667
#32 0x00007fe4492246a3 in fuse_resolve_continue (state=0x7fe405cc39d0) at fuse-resolve.c:687
#33 0x00007fe4492242ce in fuse_resolve_fd (state=0x7fe405cc39d0) at fuse-resolve.c:547
#34 fuse_resolve (state=0x7fe405cc39d0) at fuse-resolve.c:624
#35 0x00007fe44922461e in fuse_resolve_all (state=<value optimized out>) at fuse-resolve.c:660
#36 0x00007fe449224668 in fuse_resolve_and_resume (state=0x7fe405cc39d0, fn=0x7fe449231170 <fuse_write_resume>) at fuse-resolve.c:699
#37 0x00007fe449239240 in fuse_thread_proc (data=0x7fe452d9e600) at fuse-bridge.c:4903
#38 0x00007fe450ff1a51 in start_thread () from /lib64/libpthread.so.0
#39 0x00007fe45095b96d in clone () from /lib64/libc.so.6

Thread 8 (Thread 0x7fe44881a700 (LWP 15709)):
#0  0x00007fe450ff9535 in sigwait () from /lib64/libpthread.so.0
#1  0x00007fe4523d002b in glusterfs_sigwaiter ()
#2  0x00007fe450ff1a51 in start_thread () from /lib64/libpthread.so.0
#3  0x00007fe45095b96d in clone () from /lib64/libc.so.6

Thread 7 (Thread 0x7fe4523b3740 (LWP 15707)):
#0  0x00007fe450ff22ad in pthread_join () from /lib64/libpthread.so.0
#1  0x00007fe451f6a53d in event_dispatch_epoll (event_pool=0x7fe452d8fc90) at event-epoll.c:762
#2  0x00007fe4523d1ef1 in main ()

Thread 6 (Thread 0x7fe4375fe700 (LWP 15719)):
#0  0x00007fe450ff87dd in read () from /lib64/libpthread.so.0
#1  0x00007fe44922c9b3 in read (data=<value optimized out>) at /usr/include/bits/unistd.h:45
#2  notify_kernel_loop (data=<value optimized out>) at fuse-bridge.c:3824
#3  0x00007fe450ff1a51 in start_thread () from /lib64/libpthread.so.0
#4  0x00007fe45095b96d in clone () from /lib64/libc.so.6

Thread 5 (Thread 0x7fe43e903700 (LWP 15715)):
#0  0x00007fe45095bf63 in epoll_wait () from /lib64/libc.so.6
#1  0x00007fe451f6a8c1 in event_dispatch_epoll_worker (data=0x7fe440028e40) at event-epoll.c:668
#2  0x00007fe450ff1a51 in start_thread () from /lib64/libpthread.so.0
#3  0x00007fe45095b96d in clone () from /lib64/libc.so.6

Thread 4 (Thread 0x7fe447418700 (LWP 15711)):
#0  0x00007fe450ff5a0e in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x00007fe451f4dd6b in syncenv_task (proc=0x7fe452db1a50) at syncop.c:607
#2  0x00007fe451f52b80 in syncenv_processor (thdata=0x7fe452db1a50) at syncop.c:699
#3  0x00007fe450ff1a51 in start_thread () from /lib64/libpthread.so.0
#4  0x00007fe45095b96d in clone () from /lib64/libc.so.6

Thread 3 (Thread 0x7fe447e19700 (LWP 15710)):
#0  0x00007fe450ff5a0e in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x00007fe451f4dd6b in syncenv_task (proc=0x7fe452db1690) at syncop.c:607
#2  0x00007fe451f52b80 in syncenv_processor (thdata=0x7fe452db1690) at syncop.c:699
#3  0x00007fe450ff1a51 in start_thread () from /lib64/libpthread.so.0
#4  0x00007fe45095b96d in clone () from /lib64/libc.so.6

Thread 2 (Thread 0x7fe444fdf700 (LWP 15714)):
#0  0x00007fe45095bf63 in epoll_wait () from /lib64/libc.so.6
#1  0x00007fe451f6a8c1 in event_dispatch_epoll_worker (data=0x7fe452de4c00) at event-epoll.c:668
#2  0x00007fe450ff1a51 in start_thread () from /lib64/libpthread.so.0
#3  0x00007fe45095b96d in clone () from /lib64/libc.so.6

Thread 1 (Thread 0x7fe44921b700 (LWP 15708)):
#0  ec_unlock_timer_del (link=0x7fe43c231c2c) at ec-common.c:1696
#1  0x00007fe451f28713 in gf_timer_proc (ctx=0x7fe452d71010) at timer.c:194
#2  0x00007fe450ff1a51 in start_thread () from /lib64/libpthread.so.0
#3  0x00007fe45095b96d in clone () from /lib64/libc.so.6
(gdb)

--- Additional comment from Anand Avati on 2015-07-15 03:27:30 CEST ---

REVIEW: http://review.gluster.org/6459 (timer: fix race between gf_timer_call_cancel() and gf_timer_proc()) posted (#2) for review on master by Pranith Kumar Karampuri (pkarampu)

--- Additional comment from Anand Avati on 2015-07-15 03:27:33 CEST ---

REVIEW: http://review.gluster.org/11670 (cluster/ec: Handle race between unlock-timer, new lock) posted (#1) for review on master by Pranith Kumar Karampuri (pkarampu)

--- Additional comment from Anand Avati on 2015-07-16 04:14:59 CEST ---

REVIEW: http://review.gluster.org/6459 (timer: fix race between gf_timer_call_cancel() and gf_timer_proc()) posted (#4) for review on master by Pranith Kumar Karampuri (pkarampu)

--- Additional comment from Anand Avati on 2015-07-23 15:31:00 CEST ---

COMMIT: http://review.gluster.org/11670 committed in master by Xavier Hernandez (xhernandez) 
------
commit 332cb6a6901c68f0c79070d7103cc7a728ea6d26
Author: Pranith Kumar K <pkarampu>
Date:   Wed Jul 15 06:16:54 2015 +0530

    cluster/ec: Handle race between unlock-timer, new lock
    
    Problem:
    New lock could come at the time timer is on the way to unlock. This was leading
    to crash in timer thread because thread executing new lock can free up the
    timer_link->fop and then timer thread will try to access structures already
    freed.
    
    Fix:
    If the timer event is fired, set lock->release to true and wait for unlock to
    complete.
    
    Thanks to Xavi and Bhaskar for helping in confirming that this race is the RC.
    Thanks to Kritika for pointing out and explaining how Avati's patch can be used
    to fix this bug.
    
    Change-Id: I45fa5470bbc1f03b5f3d133e26d1e0ab24303378
    BUG: 1243187
    Signed-off-by: Pranith Kumar K <pkarampu>
    Reviewed-on: http://review.gluster.org/11670
    Tested-by: Gluster Build System <jenkins.com>
    Reviewed-by: Xavier Hernandez <xhernandez>
    Tested-by: NetBSD Build System <jenkins.org>
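
For context, here is a minimal C sketch of the pattern the commit message above describes: once the unlock timer has fired, the cached lock is marked with a release flag, and a thread that wants to reuse it waits for the in-flight unlock to finish instead of freeing state the timer thread is still using. This is an illustration only, not the actual xlators/cluster/ec code; all names (cached_lock_t, unlock_timer_fired, try_reuse_cached_lock) are hypothetical.

/* Illustrative sketch, hypothetical names -- not the actual ec xlator code. */
#include <pthread.h>
#include <stdbool.h>

typedef struct cached_lock {
    pthread_mutex_t mutex;
    pthread_cond_t  unlocked;   /* signalled when the unlock completes     */
    bool            release;    /* set once the unlock timer has fired     */
    bool            unlocking;  /* unlock in progress on the timer thread  */
} cached_lock_t;

/* Timer callback: release the cached lock. */
static void unlock_timer_fired(cached_lock_t *lk)
{
    pthread_mutex_lock(&lk->mutex);
    lk->release   = true;       /* new lock requests must not reuse this lock */
    lk->unlocking = true;
    pthread_mutex_unlock(&lk->mutex);

    /* ... perform the actual unlock here ... */

    pthread_mutex_lock(&lk->mutex);
    lk->unlocking = false;
    pthread_cond_broadcast(&lk->unlocked);
    pthread_mutex_unlock(&lk->mutex);
}

/* New lock request: wait for any in-flight unlock to complete before
 * deciding whether the cached lock can still be reused. */
static bool try_reuse_cached_lock(cached_lock_t *lk)
{
    bool reusable;

    pthread_mutex_lock(&lk->mutex);
    while (lk->unlocking)
        pthread_cond_wait(&lk->unlocked, &lk->mutex);
    reusable = !lk->release;    /* false => caller must acquire a new lock */
    pthread_mutex_unlock(&lk->mutex);

    return reusable;
}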

Comment 1 Anand Avati 2015-07-23 13:49:12 UTC
REVIEW: http://review.gluster.org/11752 (cluster/ec: Handle race between unlock-timer, new lock) posted (#1) for review on release-3.7 by Xavier Hernandez (xhernandez)

Comment 2 Anand Avati 2015-07-23 16:36:59 UTC
COMMIT: http://review.gluster.org/11752 committed in release-3.7 by Xavier Hernandez (xhernandez) 
------
commit 75d50eaba3fd7d24874ba8acc9a776c863a932e2
Author: Pranith Kumar K <pkarampu>
Date:   Wed Jul 15 06:16:54 2015 +0530

    cluster/ec: Handle race between unlock-timer, new lock
    
    Problem:
    New lock could come at the time timer is on the way to unlock. This was leading
    to crash in timer thread because thread executing new lock can free up the
    timer_link->fop and then timer thread will try to access structures already
    freed.
    
    Fix:
    If the timer event is fired, set lock->release to true and wait for unlock to
    complete.
    
    Thanks to Xavi and Bhaskar for helping in confirming that this race is the RC.
    Thanks to Kritika for pointing out and explaining how Avati's patch can be used
    to fix this bug.
    
    > Change-Id: I45fa5470bbc1f03b5f3d133e26d1e0ab24303378
    > BUG: 1243187
    > Signed-off-by: Pranith Kumar K <pkarampu>
    > Reviewed-on: http://review.gluster.org/11670
    > Tested-by: Gluster Build System <jenkins.com>
    > Reviewed-by: Xavier Hernandez <xhernandez>
    > Tested-by: NetBSD Build System <jenkins.org>
    
    Change-Id: I9af012e717493684b7cd7d1c63baf2fa401fb542
    BUG: 1246121
    Signed-off-by: Pranith Kumar K <pkarampu>
    Reviewed-on: http://review.gluster.org/11752
    Tested-by: NetBSD Build System <jenkins.org>
    Tested-by: Gluster Build System <jenkins.com>
    Reviewed-by: Xavier Hernandez <xhernandez>

Comment 3 Anand Avati 2015-07-29 11:19:47 UTC
REVIEW: http://review.gluster.org/11796 (timer: fix race between gf_timer_call_cancel() and gf_timer_proc()) posted (#1) for review on release-3.7 by Pranith Kumar Karampuri (pkarampu)

Comment 4 Kaushal 2015-07-30 09:47:42 UTC
This bug is getting closed because a release has been made available that should address the reported issue. In case the problem is still not fixed with glusterfs-3.7.3, please open a new bug report.

glusterfs-3.7.3 has been announced on the Gluster mailing lists [1]; packages for several distributions should become available in the near future. Keep an eye on the Gluster Users mailing list [2] and the update infrastructure for your distribution.

[1] http://thread.gmane.org/gmane.comp.file-systems.gluster.devel/12078
[2] http://thread.gmane.org/gmane.comp.file-systems.gluster.user

Comment 5 Vijay Bellur 2016-02-08 10:30:06 UTC
REVIEW: http://review.gluster.org/11796 (timer: fix race between gf_timer_call_cancel() and gf_timer_proc()) posted (#2) for review on release-3.7 by Pranith Kumar Karampuri (pkarampu)

Comment 6 Vijay Bellur 2016-02-17 05:05:06 UTC
COMMIT: http://review.gluster.org/11796 committed in release-3.7 by Pranith Kumar Karampuri (pkarampu) 
------
commit dcd1db157479bf9f9ac3c39fd3fd8cb0359c0779
Author: Anand Avati <avati>
Date:   Fri Dec 6 17:31:57 2013 -0800

    timer: fix race between gf_timer_call_cancel() and gf_timer_proc()
    
     >Change-Id: Ie264d3d591352e4a8ddaa90ae2174d9c552396f1
     >BUG: 1243187
     >Signed-off-by: Anand Avati <avati>
     >Reviewed-on: http://review.gluster.org/6459
     >Reviewed-by: Krutika Dhananjay <kdhananj>
     >Reviewed-by: Poornima G <pgurusid>
     >Tested-by: Gluster Build System <jenkins.com>
     >Reviewed-by: Niels de Vos <ndevos>
     >(cherry picked from commit ea90c92820ee0ca500345863cdfb5009d08b6ca7)
    
    Change-Id: I2bc136e7d676826428afbf57f5afe50e2238fd33
    BUG: 1246121
    Signed-off-by: Anand Avati <avati>
    Reviewed-on: http://review.gluster.org/11796
    Reviewed-by: Pranith Kumar Karampuri <pkarampu>
    Tested-by: Pranith Kumar Karampuri <pkarampu>
    Smoke: Gluster Build System <jenkins.com>
    CentOS-regression: Gluster Build System <jenkins.com>
    Reviewed-by: Niels de Vos <ndevos>
    NetBSD-regression: NetBSD Build System <jenkins.org>
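
The timer-side patch above addresses the companion race between gf_timer_call_cancel() and gf_timer_proc(). As a rough illustration of the usual shape of such a fix (not the actual libglusterfs timer code; all names below are hypothetical): the timer thread detaches an event from the registry under its lock before invoking the callback, and cancellation reports failure when it loses that race, so the caller knows the callback may still be running and must not free the data it uses.

/* Illustrative sketch, hypothetical names -- not the libglusterfs timer code. */
#include <pthread.h>
#include <stdbool.h>
#include <stdlib.h>

typedef struct timer_event {
    struct timer_event *next;
    void (*callback)(void *data);
    void *data;
} timer_event_t;

typedef struct timer_registry {
    pthread_mutex_t lock;
    timer_event_t  *head;           /* pending events */
} timer_registry_t;

/* Timer thread: detach one event under the lock, then fire it. */
static void timer_proc_step(timer_registry_t *reg)
{
    timer_event_t *ev;

    pthread_mutex_lock(&reg->lock);
    ev = reg->head;
    if (ev)
        reg->head = ev->next;       /* detached: cancel() can no longer find it */
    pthread_mutex_unlock(&reg->lock);

    if (ev) {
        ev->callback(ev->data);
        free(ev);
    }
}

/* Returns true only if the event was still pending and is now removed;
 * false means it already fired (or is firing), so the caller must not
 * free data the callback may still be using. */
static bool timer_cancel(timer_registry_t *reg, timer_event_t *ev)
{
    timer_event_t **p;
    bool cancelled = false;

    pthread_mutex_lock(&reg->lock);
    for (p = &reg->head; *p != NULL; p = &(*p)->next) {
        if (*p == ev) {
            *p = ev->next;
            cancelled = true;
            break;
        }
    }
    pthread_mutex_unlock(&reg->lock);

    if (cancelled)
        free(ev);
    return cancelled;
}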

Comment 7 Kaushal 2016-04-19 07:25:16 UTC
This bug is getting closed because a release has been made available that should address the reported issue. In case the problem is still not fixed with glusterfs-3.7.9, please open a new bug report.

glusterfs-3.7.9 has been announced on the Gluster mailing lists [1]; packages for several distributions should become available in the near future. Keep an eye on the Gluster Users mailing list [2] and the update infrastructure for your distribution.

[1] https://www.gluster.org/pipermail/gluster-users/2016-March/025922.html
[2] http://thread.gmane.org/gmane.comp.file-systems.gluster.user