+++ This bug was initially created as a clone of Bug #1123289 +++

[2014-07-25 02:54:44.925942] W [fuse-resolve.c:546:fuse_resolve_fd] 0-fuse-resolve: migration of basefd (ptr:0x7f1a9c6f0dcc inode-gfid:66143cb4-7050-4c4a-a638-4f4c549c1a16) did not complete, failing fop with EBADF (old-subvolume:gv-nova-3 new-subvolume:gv-nova-4)

pending frames:
frame : type(1) op(FSYNC)
frame : type(0) op(0)
frame : type(0) op(0)
frame : type(1) op(OPEN)
frame : type(1) op(OPEN)
frame : type(0) op(0)
frame : type(0) op(0)
frame : type(1) op(LOOKUP)
frame : type(1) op(LOOKUP)
[... remaining frames all type(0) op(0) ...]

patchset: git://git.gluster.com/glusterfs.git
signal received: 11
time of crash: 2014-07-25 02:54:44
configuration details:
argp 1
backtrace 1
dlfcn 1
fdatasync 1
libpthread 1
llistxattr 1
setfsid 1
spinlock 1
epoll.h 1
xattr.h 1
st_atim.tv_nsec 1
package-string: glusterfs 3.4.3
/lib/x86_64-linux-gnu/libc.so.6(+0x364a0)[0x7f1a98f5e4a0]
/usr/lib/x86_64-linux-gnu/glusterfs/3.4.3/xlator/cluster/distribute.so(dht_fsync+0x166)[0x7f1a8fdee2a6]
/usr/lib/x86_64-linux-gnu/glusterfs/3.4.3/xlator/performance/write-behind.so(wb_fsync+0x2e5)[0x7f1a94019965]
/usr/lib/x86_64-linux-gnu/libglusterfs.so.0(default_fsync_resume+0x13f)[0x7f1a9994d21f]
/usr/lib/x86_64-linux-gnu/libglusterfs.so.0(call_resume+0x1f7)[0x7f1a99965457]
/usr/lib/x86_64-linux-gnu/glusterfs/3.4.3/xlator/performance/open-behind.so(open_and_resume+0xf0)[0x7f1a8fbb41a0]
/usr/lib/x86_64-linux-gnu/glusterfs/3.4.3/xlator/performance/open-behind.so(ob_fsync+0x40)[0x7f1a8fbb5130]
/usr/lib/x86_64-linux-gnu/glusterfs/3.4.3/xlator/debug/io-stats.so(io_stats_fsync+0x16a)[0x7f1a8f99d9fa]
/usr/lib/x86_64-linux-gnu/glusterfs/3.4.3/xlator/mount/fuse.so(fuse_fsync_resume+0x230)[0x7f1a971837b0]
/usr/lib/x86_64-linux-gnu/glusterfs/3.4.3/xlator/mount/fuse.so(+0x8196)[0x7f1a97177196]
/usr/lib/x86_64-linux-gnu/glusterfs/3.4.3/xlator/mount/fuse.so(+0x7ed6)[0x7f1a97176ed6]
/usr/lib/x86_64-linux-gnu/glusterfs/3.4.3/xlator/mount/fuse.so(+0x81de)[0x7f1a971771de]
/usr/lib/x86_64-linux-gnu/glusterfs/3.4.3/xlator/mount/fuse.so(fuse_resolve_continue+0x41)[0x7f1a97176ab1]
/usr/lib/x86_64-linux-gnu/glusterfs/3.4.3/xlator/mount/fuse.so(+0x7e5e)[0x7f1a97176e5e]
/usr/lib/x86_64-linux-gnu/glusterfs/3.4.3/xlator/mount/fuse.so(+0x81be)[0x7f1a971771be]
/usr/lib/x86_64-linux-gnu/glusterfs/3.4.3/xlator/mount/fuse.so(fuse_resolve_and_resume+0x28)[0x7f1a971776c8]
/usr/lib/x86_64-linux-gnu/glusterfs/3.4.3/xlator/mount/fuse.so(+0x201f4)[0x7f1a9718f1f4]
/lib/x86_64-linux-gnu/libpthread.so.0(+0x7e9a)[0x7f1a992efe9a]
/lib/x86_64-linux-gnu/libc.so.6(clone+0x6d)[0x7f1a9901c3fd]

--- Additional comment from Joe Julian on 2014-07-25 17:11:03 EDT ---

Second client to fail in 12 hours:

[2014-07-25 07:00:29.287680] W [fuse-resolve.c:546:fuse_resolve_fd] 0-fuse-resolve: migration of basefd (ptr:0x7f17cb846444 inode-gfid:87544fde-9bad-46d8-b610-1a8c93b85113) did not complete, failing fop with EBADF (old-subvolume:gv-nova-3 new-subvolume:gv-nova-4)

pending frames:
frame : type(1) op(FSYNC)
frame : type(0) op(0)
frame : type(1) op(FSYNC)
frame : type(0) op(0)
frame : type(1) op(OPEN)
frame : type(1) op(OPEN)
[... remaining frames all type(0) op(0) ...]

patchset: git://git.gluster.com/glusterfs.git
signal received: 11
time of crash: 2014-07-25 07:00:29
configuration details:
argp 1
backtrace 1
dlfcn 1
fdatasync 1
libpthread 1
llistxattr 1
setfsid 1
spinlock 1
epoll.h 1
xattr.h 1
st_atim.tv_nsec 1
package-string: glusterfs 3.4.4
/lib/x86_64-linux-gnu/libc.so.6(+0x364a0)[0x7f17c763f4a0]
/usr/lib/x86_64-linux-gnu/glusterfs/3.4.4/xlator/cluster/distribute.so(dht_fsync+0x166)[0x7f17c26ee296]
/usr/lib/x86_64-linux-gnu/glusterfs/3.4.4/xlator/performance/write-behind.so(wb_fsync+0x2e5)[0x7f17c24b3965]
/usr/lib/x86_64-linux-gnu/libglusterfs.so.0(default_fsync_resume+0x13f)[0x7f17c802e21f]
/usr/lib/x86_64-linux-gnu/libglusterfs.so.0(call_resume+0x1f7)[0x7f17c8046457]
/usr/lib/x86_64-linux-gnu/glusterfs/3.4.4/xlator/performance/open-behind.so(open_and_resume+0xf0)[0x7f17c22a71a0]
/usr/lib/x86_64-linux-gnu/glusterfs/3.4.4/xlator/performance/open-behind.so(ob_fsync+0x40)[0x7f17c22a8130]
/usr/lib/x86_64-linux-gnu/glusterfs/3.4.4/xlator/debug/io-stats.so(io_stats_fsync+0x16a)[0x7f17c20909fa]
/usr/lib/x86_64-linux-gnu/glusterfs/3.4.4/xlator/mount/fuse.so(fuse_fsync_resume+0x230)[0x7f17c58647b0]
/usr/lib/x86_64-linux-gnu/glusterfs/3.4.4/xlator/mount/fuse.so(+0x8196)[0x7f17c5858196]
/usr/lib/x86_64-linux-gnu/glusterfs/3.4.4/xlator/mount/fuse.so(+0x7ed6)[0x7f17c5857ed6]
/usr/lib/x86_64-linux-gnu/glusterfs/3.4.4/xlator/mount/fuse.so(+0x81de)[0x7f17c58581de]
/usr/lib/x86_64-linux-gnu/glusterfs/3.4.4/xlator/mount/fuse.so(fuse_resolve_continue+0x41)[0x7f17c5857ab1]
/usr/lib/x86_64-linux-gnu/glusterfs/3.4.4/xlator/mount/fuse.so(+0x7e5e)[0x7f17c5857e5e]
/usr/lib/x86_64-linux-gnu/glusterfs/3.4.4/xlator/mount/fuse.so(+0x81be)[0x7f17c58581be]
/usr/lib/x86_64-linux-gnu/glusterfs/3.4.4/xlator/mount/fuse.so(fuse_resolve_and_resume+0x28)[0x7f17c58586c8]
/usr/lib/x86_64-linux-gnu/glusterfs/3.4.4/xlator/mount/fuse.so(+0x201f4)[0x7f17c58701f4]
/lib/x86_64-linux-gnu/libpthread.so.0(+0x7e9a)[0x7f17c79d0e9a]
/lib/x86_64-linux-gnu/libc.so.6(clone+0x6d)[0x7f17c76fd3fd]

--- Additional comment from Joe Julian on 2014-07-25 18:10:10 EDT ---

I replaced a brick on Monday. Could this be related to that?

--- Additional comment from Raghavendra G on 2014-07-26 06:58:02 EDT ---

Joe,

Replacing a brick results in a graph switch to a new graph containing the new brick. That is when migration of fds gets triggered. So, replacing the brick uncovered this bug.

regards,
Raghavendra.
--- Additional comment from Joe Julian on 2014-07-26 13:52:32 EDT ---

Volume Name: gv-nova
Type: Distributed-Replicate
Volume ID: 978c6843-1d6b-48dc-a498-862c57637916
Status: Started
Number of Bricks: 7 x 2 = 14
Transport-type: tcp
Bricks:
Brick1: storage02-stor:/gluster/brick02/nova
Brick2: storage06-stor:/gluster/brick02/nova
Brick3: storage03-stor:/gluster/brick03/nova
Brick4: storage07-stor:/gluster/brick03/nova
Brick5: storage03-stor:/gluster/brick04/nova
Brick6: storage08-stor:/gluster/brick03/nova
Brick7: storage04-stor:/gluster/brick03/nova
Brick8: storage07-stor:/gluster/brick04/nova
Brick9: storage04-stor:/gluster/brick04/nova
Brick10: storage08-stor:/gluster/brick04/nova
Brick11: storage02-stor:/gluster/brick03/nova
Brick12: storage07-stor:/gluster/brick02/nova
Brick13: storage02-stor:/gluster/brick04/nova
Brick14: storage08-stor:/gluster/brick02/nova
Options Reconfigured:
cluster.heal-timeout: 500
cluster.self-heal-window-size: 2
cluster.self-heal-readdir-size: 2KB
cluster.background-self-heal-count: 32
performance.quick-read: off
performance.read-ahead: off
performance.io-cache: off
performance.stat-prefetch: off
cluster.eager-lock: enable
network.remote-dio: enable
storage.owner-uid: 500
storage.owner-gid: 500
performance.io-thread-count: 24
cluster.self-heal-daemon: on
diagnostics.latency-measurement: on
diagnostics.count-fop-hits: on

--- Additional comment from Joe Julian on 2014-07-26 13:54:28 EDT ---

The graph switch occurs at 2014-07-24 05:23:43 in the attached log file.

--- Additional comment from Joe Julian on 2014-07-29 16:29:09 EDT ---

The series of:

[2014-07-29 06:39:13.410535] W [socket.c:514:__socket_rwv] 3-gv-nova-client-0: readv failed (No data available)
[2014-07-29 06:39:13.410560] I [client.c:2098:client_rpc_notify] 3-gv-nova-client-0: disconnected

is the client querying for the removed brick's port number:

116 4.036814 10.30.6.28 10.30.6.3 Gluster Portmap 154 V1 PORTBYBRICK Call (Reply In 117)
  + Gluster Portmap
    [Program Version: 1]
    [Gluster Portmap: PORTBYBRICK (1)]
    + Brick: /gluster/st01brick01/nova
117 4.036993 10.30.6.3 10.30.6.28 Gluster Portmap 98 V1 PORTBYBRICK Reply (Call In 116)
  + Gluster Portmap
    [Program Version: 1]
    [Gluster Portmap: PORTBYBRICK (1)]
    Return value: -1
    Errno: 0 (Success)
    Status: 0
    Port: 0

Since that brick no longer exists due to using the deprecated command, "remove-brick ... commit force", that continues until the crash happens.

--- Additional comment from Joe Julian on 2014-07-30 04:26:36 EDT ---

replace-brick, not remove-brick

--- Additional comment from Pranith Kumar K on 2014-07-30 08:58:26 EDT ---

Sorry, I was on leave until yesterday, so I couldn't debug this issue. I am able to re-create the crash and have found its root cause. All I need to find now is why the name-less lookup is failing on some of the files, which led to the fd-migration failures on the new graph.

For the crash to happen, fd-migration needs to fail and an fsync needs to arrive on the fd whose migration failed. There are checks in the fuse code to prevent sending fops on fds that failed migration, but that error-checking code has a bug introduced by http://review.gluster.org/6262. That patch fixes an issue with the 'link' fop, so I need to revert it and fix the 'link' fop bug separately to fix this issue. This fixes the crash, but we still don't know why fd-migration failed; we need to find the root cause for that as well.

Joe, please check whether there are any other mounts with fd-migration errors. They may also crash when an fsync is issued on those fds. Could you please tell me which files are represented by the following gfids and on which bricks those files live? Brick logs would be helpful as well.
pk1@pranithk-laptop - ~/Downloads/gluster-users/JoeJulian
18:21:12 :) ⚡ grep Invalid client.log
[2014-07-24 05:23:44.931310] W [fuse-bridge.c:3932:fuse_migrate_fd_open] 0-glusterfs-fuse: name-less lookup of gfid (b7e794e9-67c0-4d47-9ee5-2b41d1de792f) failed (Invalid argument)(old-subvolume:gv-nova-3 new-subvolume:gv-nova-4)
[2014-07-24 05:23:44.980643] W [fuse-bridge.c:3932:fuse_migrate_fd_open] 0-glusterfs-fuse: name-less lookup of gfid (9abcf733-05b4-45e1-9845-ff8f4b97fbc4) failed (Invalid argument)(old-subvolume:gv-nova-3 new-subvolume:gv-nova-4)
[2014-07-24 05:23:45.088577] W [fuse-bridge.c:3932:fuse_migrate_fd_open] 0-glusterfs-fuse: name-less lookup of gfid (87544fde-9bad-46d8-b610-1a8c93b85113) failed (Invalid argument)(old-subvolume:gv-nova-3 new-subvolume:gv-nova-4)
[2014-07-24 05:23:45.122628] W [fuse-bridge.c:3932:fuse_migrate_fd_open] 0-glusterfs-fuse: name-less lookup of gfid (37c5ffcb-3937-4bd5-a719-244526c278f5) failed (Invalid argument)(old-subvolume:gv-nova-3 new-subvolume:gv-nova-4)

Here is the backtrace when the crash happens after an fsync on an fd which failed migration:

Program received signal SIGSEGV, Segmentation fault.
0x00007f1003b494a9 in dht_fsync (frame=0x7f100b2ef1a4, this=0x13f8620, fd=0x13cd5c8, datasync=0, xdata=0x0) at dht-inode-read.c:817
817         STACK_WIND (frame, dht_fsync_cbk, subvol, subvol->fops->fsync,
(gdb) bt
#0  0x00007f1003b494a9 in dht_fsync (frame=0x7f100b2ef1a4, this=0x13f8620, fd=0x13cd5c8, datasync=0, xdata=0x0) at dht-inode-read.c:817
#1  0x00007f100803ba8d in wb_fsync (frame=0x7f100b2ef3a8, this=0x13fdfc0, fd=0x13cd5c8, datasync=0, xdata=0x0) at write-behind.c:1523
#2  0x00007f10038ee95c in ra_fsync (frame=0x7f100b2ef704, this=0x13fea20, fd=0x13cd5c8, datasync=0, xdata=0x0) at read-ahead.c:628
#3  0x00007f100c4e5aca in default_fsync (frame=0x7f100b2ef704, this=0x13ff480, fd=0x13cd5c8, flags=0, xdata=0x0) at defaults.c:991
#4  0x00007f100c4e5aca in default_fsync (frame=0x7f100b2ef704, this=0x13ffee0, fd=0x13cd5c8, flags=0, xdata=0x0) at defaults.c:991
#5  0x00007f100c4dfd63 in default_fsync_resume (frame=0x7f100b2efbb8, this=0x1400a60, fd=0x13cd5c8, flags=0, xdata=0x0) at defaults.c:561
#6  0x00007f100c4fe7bc in call_resume_wind (stub=0x7f100ad78da4) at call-stub.c:2389
#7  0x00007f100c5064d5 in call_resume (stub=0x7f100ad78da4) at call-stub.c:2841
#8  0x00007f10032c2c1a in open_and_resume (this=0x1400a60, fd=0x13cd5c8, stub=0x7f100ad78da4) at open-behind.c:236
#9  0x00007f10032c3e9f in ob_fsync (frame=0x7f100b2efbb8, this=0x1400a60, fd=0x13cd5c8, flag=0, xdata=0x0) at open-behind.c:487
#10 0x00007f10030b644c in mdc_fsync (frame=0x7f100b2efb0c, this=0x1403710, fd=0x13cd5c8, datasync=0, xdata=0x0) at md-cache.c:1635
#11 0x00007f1002ea1a3c in io_stats_fsync (frame=0x7f100b2ef5ac, this=0x1404170, fd=0x13cd5c8, flags=0, xdata=0x0) at io-stats.c:2168
#12 0x00007f100a8eb1b3 in fuse_fsync_resume (state=0x7f0ffc016f80) at fuse-bridge.c:2384
#13 0x00007f100a8da00b in fuse_resolve_done (state=0x7f0ffc016f80) at fuse-resolve.c:663
#14 0x00007f100a8da0e1 in fuse_resolve_all (state=0x7f0ffc016f80) at fuse-resolve.c:692
#15 0x00007f100a8d9fd4 in fuse_resolve (state=0x7f0ffc016f80) at fuse-resolve.c:649
#16 0x00007f100a8da0b8 in fuse_resolve_all (state=0x7f0ffc016f80) at fuse-resolve.c:688
#17 0x00007f100a8da13f in fuse_resolve_continue (state=0x7f0ffc016f80) at fuse-resolve.c:708
#18 0x00007f100a8d9d8d in fuse_resolve_fd (state=0x7f0ffc016f80) at fuse-resolve.c:561
#19 0x00007f100a8d9f82 in fuse_resolve (state=0x7f0ffc016f80) at fuse-resolve.c:638
#20 0x00007f100a8da063 in fuse_resolve_all (state=0x7f0ffc016f80) at fuse-resolve.c:681
#21 0x00007f100a8da17d in fuse_resolve_and_resume (state=0x7f0ffc016f80, fn=0x7f100a8eaba4 <fuse_fsync_resume>) at fuse-resolve.c:721
#22 0x00007f100a8eb2fa in fuse_fsync (this=0x1338860, finh=0x7f0ffc016160, msg=0x7f0ffc016188) at fuse-bridge.c:2403
#23 0x00007f100a8f8af9 in fuse_thread_proc (data=0x1338860) at fuse-bridge.c:4798
#24 0x00000038e4807f33 in start_thread () from /lib64/libpthread.so.0
#25 0x00000038e40f4ded in clone () from /lib64/libc.so.6
(gdb) fr 0
#0  0x00007f1003b494a9 in dht_fsync (frame=0x7f100b2ef1a4, this=0x13f8620, fd=0x13cd5c8, datasync=0, xdata=0x0) at dht-inode-read.c:817
817         STACK_WIND (frame, dht_fsync_cbk, subvol, subvol->fops->fsync,
(gdb) p subvol    <<----- This is the reason for the crash.
$1 = (xlator_t *) 0x0

--- Additional comment from Joe Julian on 2014-07-31 01:29:28 EDT ---
REVIEW: http://review.gluster.org/8402 (mount/fuse: Fail FUSE_FOP on resolution errors after resume) posted (#1) for review on master by Pranith Kumar Karampuri (pkarampu)
REVIEW: http://review.gluster.org/8402 (mount/fuse: Handle fd resolution failures) posted (#2) for review on master by Pranith Kumar Karampuri (pkarampu)
You can't have a mainline bug block the glusterfs-3.5.3 tracker. If this change needs to go into 3.5.x and 3.6.x you will need to create clones for each version and have them block their own tracker (glusterfs-3.5.3 or glusterfs-3.6.0).
REVIEW: http://review.gluster.org/8402 (mount/fuse: Handle fd resolution failures) posted (#3) for review on master by Pranith Kumar Karampuri (pkarampu)
COMMIT: http://review.gluster.org/8402 committed in master by Vijay Bellur (vbellur)

------

commit b87f92c0eee630690e2b515a7f79c28af6496a5e
Author: Pranith Kumar K <pkarampu>
Date: Fri Aug 1 18:30:32 2014 +0530

    mount/fuse: Handle fd resolution failures

    Problem:
    Even when the fd resolution failed, the fop is continuing on the new
    graph which may not have valid inode. This lead to NULL layout subvols
    in dht which lead to crash in fsync after graph migration.

    Fix:
    - Remove resolution error handling in FUSE_FOP as it was only added
      to handle fd migration failures.
    - Check in fuse_resolve_done for fd resolution failures and fail the
      fop right away.
    - loc resolution failures are already handled in the corresponding
      fops.
    - Return errno from state->resolve.op_errno in resume functions.
    - Send error to fuse on frame allocation failures.
    - Removed unused variable state->resolved
    - Removed unused macro FUSE_FOP_COOKIE

    Change-Id: I479d6e1ff2ca626ad8c8fcb6f293022149474992
    BUG: 1126048
    Signed-off-by: Pranith Kumar K <pkarampu>
    Reviewed-on: http://review.gluster.org/8402
    Tested-by: Gluster Build System <jenkins.com>
    Reviewed-by: Raghavendra G <rgowdapp>
    Reviewed-by: Vijay Bellur <vbellur>
This bug is getting closed because a release has been made available that should address the reported issue. In case the problem is still not fixed with glusterfs-3.7.0, please open a new bug report.

glusterfs-3.7.0 has been announced on the Gluster mailing lists [1]; packages for several distributions should become available in the near future. Keep an eye on the Gluster Users mailing list [2] and the update infrastructure for your distribution.

[1] http://thread.gmane.org/gmane.comp.file-systems.gluster.devel/10939
[2] http://thread.gmane.org/gmane.comp.file-systems.gluster.user