Bug 1243187 - Disperse volume : client glusterfs crashed while running IO
Summary: Disperse volume : client glusterfs crashed while running IO
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: GlusterFS
Classification: Community
Component: disperse
Version: mainline
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: unspecified
Target Milestone: ---
Assignee: Pranith Kumar K
QA Contact:
URL:
Whiteboard:
Depends On: 1242423
Blocks: 1223636 1246121
 
Reported: 2015-07-15 00:57 UTC by Pranith Kumar K
Modified: 2016-06-16 13:24 UTC (History)
9 users

Fixed In Version: glusterfs-3.8rc2
Doc Type: Bug Fix
Doc Text:
Clone Of: 1242423
: 1246121 (view as bug list)
Environment:
Last Closed: 2016-06-16 13:24:08 UTC
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:



Comment 1 Pranith Kumar K 2015-07-15 00:58:34 UTC
Description of problem:
======================

The client glusterfs process crashed while running I/O on a FUSE mount (file creation with dd at varying block sizes, directory creation, and Linux kernel untars).
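A minimal sketch of the workload described above. The mount path and the exact sizes/counts are assumptions, not taken from the report; point MNT at a FUSE mount of the disperse volume:

```shell
#!/bin/sh
# Hypothetical reproduction sketch: file creation with dd at varying
# block sizes plus directory creation, similar to the reported workload.
# MNT is an assumption; it defaults to a temp dir so the script is safe
# to dry-run, but the crash scenario needs a real FUSE mount.
MNT="${MNT:-$(mktemp -d)}"
mkdir -p "$MNT/io-test"

# Files with varying dd block sizes
for bs in 4k 64k 1M; do
    dd if=/dev/zero of="$MNT/io-test/file-$bs" bs="$bs" count=100 2>/dev/null
done

# Directory creation in parallel with the writes
for i in 1 2 3; do
    mkdir -p "$MNT/io-test/dir-$i"
done

ls "$MNT/io-test"
```

A kernel untar (`tar xf linux-*.tar.xz -C "$MNT/io-test"`) run alongside this loop would complete the workload mix described in the report.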

Backtrace:
=========

(gdb) bt
#0  ec_unlock_timer_del (link=0x7fe43c231c2c) at ec-common.c:1696
#1  0x00007fe451f28713 in gf_timer_proc (ctx=0x7fe452d71010) at timer.c:194
#2  0x00007fe450ff1a51 in start_thread () from /lib64/libpthread.so.0
#3  0x00007fe45095b96d in clone () from /lib64/libc.so.6
(gdb)  t a a  bt

Thread 9 (Thread 0x7fe437fff700 (LWP 15718)):
#0  0x00007fe450953a57 in writev () from /lib64/libc.so.6
#1  0x00007fe4492251ff in send_fuse_iov (this=<value optimized out>, finh=<value optimized out>, iov_out=0x7fe437ffe120, count=2) at fuse-bridge.c:191
#2  0x00007fe44922541a in send_fuse_data (this=<value optimized out>, finh=<value optimized out>, data=<value optimized out>, size=<value optimized out>)
    at fuse-bridge.c:230
#3  0x00007fe449235bda in fuse_writev_cbk (frame=0x7fe44f9c0958, cookie=<value optimized out>, this=0x7fe452d9e600, op_ret=65536, op_errno=0, 
    stbuf=0x7fe437ffe500, postbuf=0x7fe437ffe500, xdata=0x0) at fuse-bridge.c:2269
#4  0x00007fe43eb2a3ae in io_stats_writev_cbk (frame=0x7fe44fb0b380, cookie=<value optimized out>, this=<value optimized out>, op_ret=65536, op_errno=0, 
    prebuf=0x7fe437ffe500, postbuf=0x7fe437ffe500, xdata=0x0) at io-stats.c:1404
#5  0x00007fe43ef4ddb0 in mdc_writev_cbk (frame=0x7fe44fb57fb0, cookie=<value optimized out>, this=<value optimized out>, op_ret=65536, 
    op_errno=<value optimized out>, prebuf=0x7fe437ffe500, postbuf=0x7fe437ffe500, xdata=0x0) at md-cache.c:1524
#6  0x00007fe451f1706c in default_writev_cbk (frame=0x7fe44fb28dd8, cookie=<value optimized out>, this=<value optimized out>, op_ret=65536, op_errno=0, 
    prebuf=<value optimized out>, postbuf=0x7fe437ffe500, xdata=0x0) at defaults.c:1021
#7  0x00007fe451f1706c in default_writev_cbk (frame=0x7fe44fb3a0a4, cookie=<value optimized out>, this=<value optimized out>, op_ret=65536, op_errno=0, 
    prebuf=<value optimized out>, postbuf=0x7fe437ffe500, xdata=0x0) at defaults.c:1021
#8  0x00007fe43f56e974 in ioc_writev_cbk (frame=0x7fe44fb2da70, cookie=<value optimized out>, this=<value optimized out>, op_ret=65536, op_errno=0, 
    prebuf=0x7fe437ffe500, postbuf=0x7fe437ffe500, xdata=0x0) at io-cache.c:1244
#9  0x00007fe43f982bd9 in ra_writev_cbk (frame=0x7fe44fb35360, cookie=<value optimized out>, this=<value optimized out>, op_ret=65536, op_errno=0, 
    prebuf=0x7fe437ffe500, postbuf=0x7fe437ffe500, xdata=0x0) at read-ahead.c:661
#10 0x00007fe43fb8f30c in wb_do_unwinds (wb_inode=<value optimized out>, lies=0x7fe437ffe5d0) at write-behind.c:924
#11 0x00007fe43fb91b9c in wb_process_queue (wb_inode=0x7fe4258cb4a0) at write-behind.c:1213
#12 0x00007fe43fb924f8 in wb_writev (frame=0x7fe44fadae2c, this=<value optimized out>, fd=0x7fe441da9cd0, vector=0x7fe3eb17e6f0, count=1, offset=6160384, 
    flags=32769, iobref=0x7fe401611f30, xdata=0x0) at write-behind.c:1333
#13 0x00007fe43f98288a in ra_writev (frame=0x7fe44fb35360, this=0x7fe439c61a20, fd=0x7fe441da9cd0, vector=0x7fe3eb17e6f0, count=1, offset=6160384, 
    flags=32769, iobref=0x7fe401611f30, xdata=0x0) at read-ahead.c:689
#14 0x00007fe451f0f960 in default_writev (frame=0x7fe44fb35360, this=0x7fe439c62780, fd=0x7fe441da9cd0, vector=0x7fe3eb17e6f0, count=<value optimized out>, 
    off=<value optimized out>, flags=32769, iobref=0x7fe401611f30, xdata=0x0) at defaults.c:1836
#15 0x00007fe43f56e6c6 in ioc_writev (frame=0x7fe44fb2da70, this=0x7fe439c63610, fd=0x7fe441da9cd0, vector=0x7fe3eb17e6f0, count=1, offset=6160384, 
    flags=32769, iobref=0x7fe401611f30, xdata=0x0) at io-cache.c:1285
#16 0x00007fe43f362040 in qr_writev (frame=0x7fe44fb3a0a4, this=0x7fe439c64360, fd=0x7fe441da9cd0, iov=0x7fe3eb17e6f0, count=1, offset=6160384, flags=32769, 
    iobref=0x7fe401611f30, xdata=0x0) at quick-read.c:636
#17 0x00007fe451f13a23 in default_writev_resume (frame=0x7fe44fb28dd8, this=0x7fe439c65120, fd=0x7fe441da9cd0, vector=0x7fe3eb17e6f0, count=1, off=6160384, 
    flags=32769, iobref=0x7fe401611f30, xdata=0x0) at defaults.c:1395
#18 0x00007fe451f2d3aa in call_resume_wind (stub=<value optimized out>) at call-stub.c:2124
#19 0x00007fe451f31640 in call_resume (stub=0x7fe44f5a5c24) at call-stub.c:2576
#20 0x00007fe43f158760 in open_and_resume (this=0x7fe439c65120, fd=0x7fe441da9cd0, stub=0x7fe44f5a5c24) at open-behind.c:242
#21 0x00007fe43f159dfb in ob_writev (frame=0x7fe44fb28dd8, this=0x7fe439c65120, fd=0x7fe441da9cd0, iov=<value optimized out>, count=<value optimized out>, 
    offset=<value optimized out>, flags=32769, iobref=0x7fe401611f30, xdata=0x0) at open-behind.c:414
#22 0x00007fe43ef4ab3b in mdc_writev (frame=0x7fe44fb57fb0, this=0x7fe439c65ee0, fd=0x7fe441da9cd0, vector=0x7fe405cc40c0, count=1, offset=6160384, 
    flags=32769, iobref=0x7fe401611f30, xdata=0x0) at md-cache.c:1542
#23 0x00007fe451f0f960 in default_writev (frame=0x7fe44fb57fb0, this=0x7fe439c68030, fd=0x7fe441da9cd0, vector=0x7fe405cc40c0, count=<value optimized out>, 
    off=<value optimized out>, flags=32769, iobref=0x7fe401611f30, xdata=0x0) at defaults.c:1836
#24 0x00007fe43eb227a7 in io_stats_writev (frame=0x7fe44fb0b380, this=0x7fe439c69180, fd=0x7fe441da9cd0, vector=0x7fe405cc40c0, count=1, offset=6160384, 
    flags=32769, iobref=0x7fe401611f30, xdata=0x0) at io-stats.c:2164
#25 0x00007fe451f0f960 in default_writev (frame=0x7fe44fb0b380, this=0x7fe439c6a180, fd=0x7fe441da9cd0, vector=0x7fe405cc40c0, count=<value optimized out>, 
    off=<value optimized out>, flags=32769, iobref=0x7fe401611f30, xdata=0x0) at defaults.c:1836
#26 0x00007fe43e90904e in meta_writev (frame=0x7fe44fb0b380, this=0x7fe439c6a180, fd=0x7fe441da9cd0, iov=0x7fe405cc40c0, count=<value optimized out>, 
    offset=<value optimized out>, flags=32769, iobref=0x7fe401611f30, xdata=0x0) at meta.c:147
#27 0x00007fe4492313eb in fuse_write_resume (state=<value optimized out>) at fuse-bridge.c:2309
#28 0x00007fe4492245f5 in fuse_resolve_done (state=<value optimized out>) at fuse-resolve.c:644
#29 fuse_resolve_all (state=<value optimized out>) at fuse-resolve.c:671
#30 0x00007fe449224326 in fuse_resolve (state=0x7fe405cc39d0) at fuse-resolve.c:635
#31 0x00007fe44922463e in fuse_resolve_all (state=<value optimized out>) at fuse-resolve.c:667
#32 0x00007fe4492246a3 in fuse_resolve_continue (state=0x7fe405cc39d0) at fuse-resolve.c:687
#33 0x00007fe4492242ce in fuse_resolve_fd (state=0x7fe405cc39d0) at fuse-resolve.c:547
#34 fuse_resolve (state=0x7fe405cc39d0) at fuse-resolve.c:624
#35 0x00007fe44922461e in fuse_resolve_all (state=<value optimized out>) at fuse-resolve.c:660
#36 0x00007fe449224668 in fuse_resolve_and_resume (state=0x7fe405cc39d0, fn=0x7fe449231170 <fuse_write_resume>) at fuse-resolve.c:699
#37 0x00007fe449239240 in fuse_thread_proc (data=0x7fe452d9e600) at fuse-bridge.c:4903
#38 0x00007fe450ff1a51 in start_thread () from /lib64/libpthread.so.0
#39 0x00007fe45095b96d in clone () from /lib64/libc.so.6

Thread 8 (Thread 0x7fe44881a700 (LWP 15709)):
#0  0x00007fe450ff9535 in sigwait () from /lib64/libpthread.so.0
#1  0x00007fe4523d002b in glusterfs_sigwaiter ()
#2  0x00007fe450ff1a51 in start_thread () from /lib64/libpthread.so.0
#3  0x00007fe45095b96d in clone () from /lib64/libc.so.6

Thread 7 (Thread 0x7fe4523b3740 (LWP 15707)):
#0  0x00007fe450ff22ad in pthread_join () from /lib64/libpthread.so.0
#1  0x00007fe451f6a53d in event_dispatch_epoll (event_pool=0x7fe452d8fc90) at event-epoll.c:762
#2  0x00007fe4523d1ef1 in main ()

Thread 6 (Thread 0x7fe4375fe700 (LWP 15719)):
#0  0x00007fe450ff87dd in read () from /lib64/libpthread.so.0
#1  0x00007fe44922c9b3 in read (data=<value optimized out>) at /usr/include/bits/unistd.h:45
#2  notify_kernel_loop (data=<value optimized out>) at fuse-bridge.c:3824
#3  0x00007fe450ff1a51 in start_thread () from /lib64/libpthread.so.0
#4  0x00007fe45095b96d in clone () from /lib64/libc.so.6

Thread 5 (Thread 0x7fe43e903700 (LWP 15715)):
#0  0x00007fe45095bf63 in epoll_wait () from /lib64/libc.so.6
#1  0x00007fe451f6a8c1 in event_dispatch_epoll_worker (data=0x7fe440028e40) at event-epoll.c:668
#2  0x00007fe450ff1a51 in start_thread () from /lib64/libpthread.so.0
#3  0x00007fe45095b96d in clone () from /lib64/libc.so.6

Thread 4 (Thread 0x7fe447418700 (LWP 15711)):
#0  0x00007fe450ff5a0e in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x00007fe451f4dd6b in syncenv_task (proc=0x7fe452db1a50) at syncop.c:607
#2  0x00007fe451f52b80 in syncenv_processor (thdata=0x7fe452db1a50) at syncop.c:699
#3  0x00007fe450ff1a51 in start_thread () from /lib64/libpthread.so.0
#4  0x00007fe45095b96d in clone () from /lib64/libc.so.6

Thread 3 (Thread 0x7fe447e19700 (LWP 15710)):
#0  0x00007fe450ff5a0e in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x00007fe451f4dd6b in syncenv_task (proc=0x7fe452db1690) at syncop.c:607
#2  0x00007fe451f52b80 in syncenv_processor (thdata=0x7fe452db1690) at syncop.c:699
#3  0x00007fe450ff1a51 in start_thread () from /lib64/libpthread.so.0
#4  0x00007fe45095b96d in clone () from /lib64/libc.so.6

Thread 2 (Thread 0x7fe444fdf700 (LWP 15714)):
#0  0x00007fe45095bf63 in epoll_wait () from /lib64/libc.so.6
#1  0x00007fe451f6a8c1 in event_dispatch_epoll_worker (data=0x7fe452de4c00) at event-epoll.c:668
#2  0x00007fe450ff1a51 in start_thread () from /lib64/libpthread.so.0
#3  0x00007fe45095b96d in clone () from /lib64/libc.so.6

Thread 1 (Thread 0x7fe44921b700 (LWP 15708)):
#0  ec_unlock_timer_del (link=0x7fe43c231c2c) at ec-common.c:1696
#1  0x00007fe451f28713 in gf_timer_proc (ctx=0x7fe452d71010) at timer.c:194
#2  0x00007fe450ff1a51 in start_thread () from /lib64/libpthread.so.0
#3  0x00007fe45095b96d in clone () from /lib64/libc.so.6
(gdb)

Comment 2 Anand Avati 2015-07-15 01:27:30 UTC
REVIEW: http://review.gluster.org/6459 (timer: fix race between gf_timer_call_cancel() and gf_timer_proc()) posted (#2) for review on master by Pranith Kumar Karampuri (pkarampu@redhat.com)

Comment 3 Anand Avati 2015-07-15 01:27:33 UTC
REVIEW: http://review.gluster.org/11670 (cluster/ec: Handle race between unlock-timer, new lock) posted (#1) for review on master by Pranith Kumar Karampuri (pkarampu@redhat.com)

Comment 4 Anand Avati 2015-07-16 02:14:59 UTC
REVIEW: http://review.gluster.org/6459 (timer: fix race between gf_timer_call_cancel() and gf_timer_proc()) posted (#4) for review on master by Pranith Kumar Karampuri (pkarampu@redhat.com)

Comment 5 Anand Avati 2015-07-23 13:31:00 UTC
COMMIT: http://review.gluster.org/11670 committed in master by Xavier Hernandez (xhernandez@datalab.es) 
------
commit 332cb6a6901c68f0c79070d7103cc7a728ea6d26
Author: Pranith Kumar K <pkarampu@redhat.com>
Date:   Wed Jul 15 06:16:54 2015 +0530

    cluster/ec: Handle race between unlock-timer, new lock
    
    Problem:
    A new lock could arrive while the timer is on its way to unlock. This was
    leading to a crash in the timer thread: the thread executing the new lock
    can free timer_link->fop, after which the timer thread accesses structures
    that have already been freed.
    
    Fix:
    If the timer event is fired, set lock->release to true and wait for unlock to
    complete.
    
    Thanks to Xavi and Bhaskar for helping in confirming that this race is the RC.
    Thanks to Kritika for pointing out and explaining how Avati's patch can be used
    to fix this bug.
    
    Change-Id: I45fa5470bbc1f03b5f3d133e26d1e0ab24303378
    BUG: 1243187
    Signed-off-by: Pranith Kumar K <pkarampu@redhat.com>
    Reviewed-on: http://review.gluster.org/11670
    Tested-by: Gluster Build System <jenkins@build.gluster.com>
    Reviewed-by: Xavier Hernandez <xhernandez@datalab.es>
    Tested-by: NetBSD Build System <jenkins@build.gluster.org>

Comment 6 Nagaprasad Sathyanarayana 2015-10-25 14:47:06 UTC
The fix for this BZ is already present in a GlusterFS release. A clone of this BZ was fixed in a GlusterFS release and closed, hence this mainline BZ is being closed as well.

Comment 7 Niels de Vos 2016-06-16 13:24:08 UTC
This bug is getting closed because a release has been made available that should address the reported issue. In case the problem is still not fixed with glusterfs-3.8.0, please open a new bug report.

glusterfs-3.8.0 has been announced on the Gluster mailing lists [1], and packages for several distributions should become available in the near future. Keep an eye on the Gluster Users mailing list [2] and the update infrastructure for your distribution.

[1] http://blog.gluster.org/2016/06/glusterfs-3-8-released/
[2] http://thread.gmane.org/gmane.comp.file-systems.gluster.user

