Description of problem:
Created a 3-way replicate volume and ran some operations on it, such as untarring the Linux kernel, building glusterfs, and running the POSIX compliance test. Meanwhile, I was repeatedly taking down and bringing back up the same replicate subvolume. The client process then crashed with a segfault: fd was NULL in 'afr_fd_ctx_get'.

Version-Release number of selected component (if applicable):
c3aa99d907591f72b6302287b9b8899514fb52f1

How reproducible:
1/1

Steps to Reproduce:
1. Create and start a 3-way replicate volume.
2. Mount via FUSE and untar the Linux kernel and glusterfs.
3. From a different terminal, run the POSIX compliance test and 'make' of the glusterfs source.
4. Bring down 2 of the replicate subvolumes and after some time bring them back online. Keep doing this.
5. After the POSIX compliance test and 'make', run fileop and dbench.

Actual results:
make, fileop and dbench all failed because the FUSE client crashed with the following backtrace:

(gdb) bt
#0  0x000000334d20c100 in pthread_spin_lock () from /lib64/libpthread.so.0
#1  0x00007fa92bb6fc23 in afr_fd_ctx_get (fd=0x0, this=0x10cfff0) at afr-transaction.c:74
#2  0x00007fa92bb6e934 in afr_openfd_fix_open_cbk (frame=0x7fa932f00f58, cookie=0x2, this=0x10cfff0, op_ret=-1, op_errno=107, fd=0x0) at afr-open.c:335
#3  0x00007fa92bde7382 in client3_1_open (frame=0x7fa933f8fcb0, this=0x10cf110, data=0x7fa928cb2eb0) at client3_1-fops.c:3542
#4  0x00007fa92bdcfcbc in client_open (frame=0x7fa933f8fcb0, this=0x10cf110, loc=0x7fa924726e48, flags=32770, fd=0x7fa928cb62fc, wbflags=0) at client.c:743
#5  0x00007fa92bb6f45a in afr_fix_open (frame=0x7fa933f14560, this=0x10cfff0, fd_ctx=0x18bef00, need_open_count=1, need_open=0x7fa9255a2080) at afr-open.c:435
#6  0x00007fa92bb63efd in afr_open_fd_fix (frame=0x7fa933f14560, this=0x10cfff0, pause_fop=_gf_false) at afr-inode-write.c:431
#7  0x00007fa92bb61de5 in afr_readv (frame=0x7fa933f14560, this=0x10cfff0, fd=0x7fa928cb62fc, size=131072, offset=0) at afr-inode-read.c:1147
#8  0x00007fa92b93919e in wb_readv_helper (frame=0x7fa933f4ea9c, this=0x10d12f0, fd=0x7fa928cb62fc, size=131072, offset=0) at write-behind.c:2241
#9  0x00007fa935304712 in call_resume_wind (stub=0x7fa932db0d34) at call-stub.c:2257
#10 0x00007fa93530be65 in call_resume (stub=0x7fa932db0d34) at call-stub.c:3932
#11 0x00007fa92b9375cf in wb_resume_other_requests (frame=0x7fa933f4ea9c, file=0x1365f60, other_requests=0x7fa928cb33c0) at write-behind.c:1832
#12 0x00007fa92b937833 in wb_do_ops (frame=0x7fa933f4ea9c, file=0x1365f60, winds=0x7fa928cb33e0, unwinds=0x7fa928cb33d0, other_requests=0x7fa928cb33c0) at write-behind.c:1870
#13 0x00007fa92b938033 in wb_process_queue (frame=0x7fa933f4ea9c, file=0x1365f60) at write-behind.c:2053
#14 0x00007fa92b9394cc in wb_readv (frame=0x7fa933f4ea9c, this=0x10d12f0, fd=0x7fa928cb62fc, size=131072, offset=0) at write-behind.c:2302
#15 0x00007fa92b72940a in ra_page_fault (file=0x118a060, frame=0x7fa933f8f29c, offset=0) at page.c:278
#16 0x00007fa92b723c95 in dispatch_requests (frame=0x7fa933f8f29c, file=0x118a060) at read-ahead.c:435
#17 0x00007fa92b72458f in ra_readv (frame=0x7fa933f8f29c, this=0x10d2570, fd=0x7fa928cb62fc, size=131072, offset=0) at read-ahead.c:543
#18 0x00007fa92b5196f1 in ioc_page_fault (ioc_inode=0x6128e20, frame=0x7fa933f49140, fd=0x7fa928cb62fc, offset=0) at page.c:631
#19 0x00007fa92b512cb8 in ioc_dispatch_requests (frame=0x7fa933f49140, ioc_inode=0x6128e20, fd=0x7fa928cb62fc, offset=0, size=131072) at io-cache.c:1041
#20 0x00007fa92b513be7 in ioc_readv (frame=0x7fa933f49140, this=0x10d3840, fd=0x7fa928cb62fc, size=131072, offset=0) at io-cache.c:1204
#21 0x00007fa92b2f904f in qr_readv (frame=0x7fa933f93ddc, this=0x10d49a0, fd=0x7fa928cb62fc, size=131072, offset=0) at quick-read.c:1320
#22 0x00007fa92b0dfc9c in sp_readv (frame=0x7fa933f1639c, this=0x10d5c60, fd=0x7fa928cb62fc, size=131072, offset=0) at stat-prefetch.c:2817
#23 0x00007fa92aebffd3 in io_stats_readv (frame=0x7fa933f49cac, this=0x10d6f20, fd=0x7fa928cb62fc, size=131072, offset=0) at io-stats.c:2064
#24 0x00007fa932b81ad0 in fuse_readv_resume (state=0x7fa92408ae10) at fuse-bridge.c:2036
#25 0x00007fa932b75c81 in fuse_resolve_and_resume (state=0x7fa92408ae10, fn=0x7fa932b81695 <fuse_readv_resume>) at fuse-resolve.c:578
#26 0x00007fa932b81c7c in fuse_readv (this=0x10c2680, finh=0x7fa9244778f0, msg=0x7fa924477918) at fuse-bridge.c:2065
#27 0x00007fa932b8b042 in fuse_thread_proc (data=0x10c2680) at fuse-bridge.c:3707
#28 0x000000334d2077e1 in start_thread () from /lib64/libpthread.so.0
#29 0x000000334cae577d in clone () from /lib64/libc.so.6

(gdb) f 2
#2  0x00007fa92bb6e934 in afr_openfd_fix_open_cbk (frame=0x7fa932f00f58, cookie=0x2, this=0x10cfff0, op_ret=-1, op_errno=107, fd=0x0) at afr-open.c:335
335             fd_ctx = afr_fd_ctx_get (fd, this);
(gdb) f 1
#1  0x00007fa92bb6fc23 in afr_fd_ctx_get (fd=0x0, this=0x10cfff0) at afr-transaction.c:74
74              LOCK(&fd->lock);

Expected results:
The FUSE client should not crash.

Additional info:
I have attached the client log file and archived the other log files and the core.
Created attachment 559033 [details] 500 entries from client log
The actual client log file is too big to attach (42 MB). Attaching a file that contains only the last 500 entries from the client log.
CHANGE: http://review.gluster.com/2792 (cluster/afr: Don't trust the fd returned in open_cbk) merged in master by Vijay Bellur (vijay)
This is not consistently reproducible, but the crash was not seen with glusterfs-3.3.0qa41.