Description of problem:
One of my clients has been crashing every night for the past few days. This is the only client for this volume, and the client machine doesn't mount any other GlusterFS volumes. The servers do host several other volumes, mounted by several other client machines all running the same GlusterFS and OS/kernel versions, and none of those have any problems.

Version-Release number of selected component (if applicable):
GlusterFS 3.1.7 on Ubuntu Oneiric 11.10 (servers and client)
Server and client kernels: 3.0.0-26-virtual (latest currently)

How reproducible:
I don't know how to reproduce it, but it has happened a few times this week already.

Here is the log file from the last crash. It begins with the mount and shows no activity until the crash many hours later:

[2012-09-19 15:10:59.251188] I [client-handshake.c:1016:select_server_supported_programs] 0-builder-client-1: Using Program GlusterFS-3.1.0, Num (1298437), Version (310)
[2012-09-19 15:10:59.251361] I [client-handshake.c:1016:select_server_supported_programs] 0-builder-client-0: Using Program GlusterFS-3.1.0, Num (1298437), Version (310)
[2012-09-19 15:10:59.270508] I [client-handshake.c:852:client_setvolume_cbk] 0-builder-client-0: Connected to 10.44.185.22:24020, attached to remote volume '/bricks/builder0'.
[2012-09-19 15:10:59.270587] I [afr-common.c:2646:afr_notify] 0-builder-replicate-0: Subvolume 'builder-client-0' came back up; going online.
[2012-09-19 15:10:59.286563] I [client-handshake.c:852:client_setvolume_cbk] 0-builder-client-1: Connected to 10.4.126.119:24036, attached to remote volume '/bricks/builder0'.
[2012-09-19 15:10:59.295463] I [fuse-bridge.c:3312:fuse_graph_setup] 0-fuse: switched graph to 0
[2012-09-19 15:10:59.295686] I [fuse-bridge.c:2900:fuse_init] 0-glusterfs-fuse: FUSE inited with protocol versions: glusterfs 7.13 kernel 7.16
[2012-09-19 15:10:59.320368] I [afr-common.c:893:afr_fresh_lookup_cbk] 0-builder-replicate-0: added root inode
[2012-09-20 05:03:01.834802] W [fuse-bridge.c:1751:fuse_readv_cbk] 0-glusterfs-fuse: 6750101: READ => -1 (No such file or directory)
[2012-09-20 05:03:01.834911] E [mem-pool.c:469:mem_put] 0-mem-pool: invalid argument
[2012-09-20 05:03:01.836458] W [fuse-bridge.c:1751:fuse_readv_cbk] 0-glusterfs-fuse: 6750104: READ => -1 (No such file or directory)
[2012-09-20 05:03:01.836496] E [mem-pool.c:469:mem_put] 0-mem-pool: invalid argument
pending frames:
frame : type(1) op(CREATE)
frame : type(1) op(CREATE)
frame : type(1) op(CREATE)
frame : type(1) op(READ)
frame : type(1) op(LOOKUP)
frame : type(1) op(FLUSH)
frame : type(1) op(FLUSH)
patchset: git://git.gluster.com/glusterfs.git
signal received: 11
time of crash: 2012-09-20 05:03:01
configuration details:
argp 1
backtrace 1
dlfcn 1
fdatasync 1
libpthread 1
llistxattr 1
setfsid 1
spinlock 1
epoll.h 1
xattr.h 1
st_atim.tv_nsec 1
package-string: glusterfs 3.1.7
/lib/x86_64-linux-gnu/libc.so.6(+0x36420)[0x7f54074dd420]
/lib/x86_64-linux-gnu/libpthread.so.0(pthread_spin_lock+0x0)[0x7f5407854d60]
/usr/lib/libglusterfs.so.0(fd_unref+0x3b)[0x7f5407ec74ab]
/usr/lib/glusterfs/3.1.7/xlator/protocol/client.so(client_local_wipe+0x1f)[0x7f54045d824f]
/usr/lib/glusterfs/3.1.7/xlator/protocol/client.so(client3_1_open_cbk+0x19b)[0x7f54045dd08b]
/usr/lib/libgfrpc.so.0(rpc_clnt_handle_reply+0xa5)[0x7f5407c8a065]
/usr/lib/libgfrpc.so.0(rpc_clnt_notify+0x7d)[0x7f5407c8a42d]
/usr/lib/libgfrpc.so.0(rpc_transport_notify+0x27)[0x7f5407c86867]
/usr/lib/glusterfs/3.1.7/rpc-transport/socket.so(socket_event_poll_in+0x34)[0x7f5405635984]
/usr/lib/glusterfs/3.1.7/rpc-transport/socket.so(socket_event_handler+0xb3)[0x7f5405635c23]
/usr/lib/libglusterfs.so.0(+0x37f81)[0x7f5407ec8f81]
/usr/sbin/glusterfs(main+0x23a)[0x40313a]
/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xed)[0x7f54074c830d]
/usr/sbin/glusterfs[0x4031d5]
Here is the volume info for this volume:

Volume Name: builder
Type: Replicate
Status: Started
Number of Bricks: 2
Transport-type: tcp
Bricks:
Brick1: server1:/bricks/builder0
Brick2: server2:/bricks/builder0
Options Reconfigured:
diagnostics.client-log-level: INFO

And here is a sample line from both of the brick logs on the servers; they produce a batch of lines like this right before the client crashes:

[2012-09-20 05:03:04.976830] I [server-helpers.c:459:do_fd_cleanup] 0-builder-server: fd cleanup on /nexus/sonatype-work/nexus/timeline/index/_3pi.prx
3.1.7? Any chance of upgrading to at least the 3.2.x series? In the meantime, I will try to figure out the issue in the release-3.1 branch.
The client made it through the night without crashing yesterday. And yes, I will work on upgrading to 3.2. I was hoping someone would see that stack trace or log and recognize an obvious problem, because I don't think this will be easy to reproduce. I have lots of clients, many with uptimes of months, and have almost never seen a crash.
I was able to reproduce the crash again this afternoon, since my last comment. This volume stores SVN repos and a Nexus Maven repository, among other things. When I tried doing a build that checked out from SVN and Nexus, the mount crashed. This explains the nightly crashes: they happened when Jenkins ran the nightly builds.

Joe Julian suggested (in IRC) that I try stopping and starting the volume. I did that, and also rebooted the client machine, and now everything seems to be working fine -- I am able to run the Jenkins builds without the client crashing.

This whole problem seems to have been caused by a rolling reboot of the servers for this volume. I have done such reboots many times in the past, on this volume and on other volumes, and never ran into this kind of trouble. In any case, it appears to be resolved now that I have stopped and restarted the volume.
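For anyone hitting the same symptom, the workaround above amounts to bouncing the volume on a server and remounting on the client. A rough sketch, assuming the volume name from this report; the mount point /mnt/builder is a hypothetical example, not from the original report:

```shell
# On one of the servers: stop and restart the volume.
gluster volume stop builder
gluster volume start builder

# On the client: remount the FUSE mount (path is an assumed example).
umount /mnt/builder
mount -t glusterfs server1:/builder /mnt/builder
```

Note that `gluster volume stop` will interrupt all clients of the volume, so this is a disruptive workaround, not a fix.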
Moving the priority down, as a workaround exists, and also because the version is 3.1.x, which is not *actively* looked into. Louis, does that sound OK to you?
Yes, that is fine with me. I have not seen this bug happen again since I reported it; my volumes have been very stable. Thanks!
WORKSFORME with the latest release, then. Please upgrade to 3.3.x (or at least 3.2.x); don't remain on the 3.1.x releases.