Description of problem:
I'm sharing a volume between two web servers. After a problem on one brick, the volume became unreachable ("(107) Transport endpoint is not connected") on the other brick too. It was impossible to umount, mount, or restart glusterfsd; I had to reboot both servers. This happened three times this week (after more than two years of running perfectly).

For the history: I used to mount the partitions with "glusterfs" as the fstype. Four months ago I switched to "nfs" as the fstype, but we found a lot of stale NFS file handle errors over the last few weeks, so we switched back to "glusterfs" this week. Since the switch, one server (always the same one) has crashed three times. A crash can happen, but a hang of the entire volume is a problem.

Additional information:

Volume Name: ApacheRoot
Type: Replicate
Status: Started
Number of Bricks: 2
Transport-type: tcp
Bricks:
Brick1: xxxxx-srv-web06:/glusterfs/ApacheRoot
Brick2: xxxxx-srv-web05:/glusterfs/ApacheRoot
Options Reconfigured:
performance.io-thread-count: 32
performance.write-behind-window-size: 8MB
performance.cache-refresh-timeout: 3
performance.cache-max-file-size: 2MB
performance.cache-size: 1GB

df output:
xxxxx-srv-web06:/ApacheRoot 92G 17G 71G 19% /var/www

Here is what I found in the logs:

[2014-05-06 18:37:00.363251] W [afr-common.c:1121:afr_conflicting_iattrs] 0-ApacheRoot-replicate-0: /html/yyyown/cms/agents/cbnbdb/prod/_library/photo_gallery/kpa-map.jpg: gfid differs on subvolume 1 (26e0651f-8397-459d-b8cb-0657df085925, e3e846f9-0165-41bd-9f6e-e22825705653)
[2014-05-06 18:37:00.363278] E [afr-self-heal-common.c:1333:afr_sh_common_lookup_cbk] 0-ApacheRoot-replicate-0: Conflicting entries for /html/yyyown/cms/agents/cbnbdb/prod/_library/photo_gallery/kpa-map.jpg
[2014-05-06 18:37:00.364115] E [afr-self-heal-common.c:2074:afr_self_heal_completion_cbk] 0-ApacheRoot-replicate-0: background entry self-heal failed on /html/yyyown/cms/agents/cbnbdb/prod/_library/photo_gallery
[2014-05-06 18:37:00.465741] E [client3_1-fops.c:2228:client3_1_lookup_cbk] 0-ApacheRoot-client-0: remote operation failed: Stale NFS file handle
[2014-05-06 18:37:00.466103] W [stat-prefetch.c:1549:sp_open_helper] 0-ApacheRoot-stat-prefetch: lookup-behind has failed for path (/html/yyyown/cms/agents/cbnbdb/prod/_library/photo_gallery/kpa-map.jpg)(Stale NFS file handle), unwinding open call waiting on it
[2014-05-06 18:37:00.466155] W [fuse-bridge.c:588:fuse_fd_cbk] 0-glusterfs-fuse: 34582930: OPEN() /html/yyyown/cms/agents/cbnbdb/prod/_library/photo_gallery/kpa-map.jpg => -1 (Stale NFS file handle)
[2014-05-06 18:37:00.467172] E [client3_1-fops.c:366:client3_1_open_cbk] 0-ApacheRoot-client-0: remote operation failed: No such file or directory
pending frames:
frame : type(1) op(OPEN)
frame : type(1) op(OPEN)

patchset: git://git.gluster.com/glusterfs.git
signal received: 11
time of crash: 2014-05-06 18:37:00
configuration details:
argp 1
backtrace 1
dlfcn 1
fdatasync 1
libpthread 1
llistxattr 1
setfsid 1
spinlock 1
epoll.h 1
xattr.h 1
st_atim.tv_nsec 1
package-string: glusterfs 3.2.5
/lib/x86_64-linux-gnu/libc.so.6(+0x364a0)[0x7f9693c004a0]
/usr/lib/glusterfs/3.2.5/xlator/performance/io-cache.so(ioc_open_cbk+0x99)[0x7f9690883af9]
/usr/lib/glusterfs/3.2.5/xlator/performance/read-ahead.so(ra_open_cbk+0x1ac)[0x7f9690a93bcc]
/usr/lib/glusterfs/3.2.5/xlator/performance/write-behind.so(wb_open_cbk+0x127)[0x7f9690c9fa17]
/usr/lib/glusterfs/3.2.5/xlator/cluster/replicate.so(afr_open_cbk+0x25e)[0x7f9690ecc6be]
/usr/lib/glusterfs/3.2.5/xlator/protocol/client.so(client3_1_open_cbk+0x228)[0x7f969111e188]
/usr/lib/libgfrpc.so.0(rpc_clnt_handle_reply+0xa5)[0x7f96943caec5]
/usr/lib/libgfrpc.so.0(rpc_clnt_notify+0x7d)[0x7f96943cb84d]
/usr/lib/libgfrpc.so.0(rpc_transport_notify+0x27)[0x7f96943c7ab7]
/usr/lib/glusterfs/3.2.5/rpc-transport/socket.so(socket_event_poll_in+0x34)[0x7f9691d58074]
/usr/lib/glusterfs/3.2.5/rpc-transport/socket.so(socket_event_handler+0xc7)[0x7f9691d583c7]
/usr/lib/libglusterfs.so.0(+0x3bce7)[0x7f969460ece7]
/usr/sbin/glusterfs(main+0x2a5)[0x7f9694a5e4b5]
/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xed)[0x7f9693beb76d]
/usr/sbin/glusterfs(+0x4745)[0x7f9694a5e745]

Version-Release number of selected component (if applicable):
ubuntu 12.10 LTS
glusterfs 3.2.5

How reproducible:
Unknown

Steps to Reproduce:
1.
2.
3.

Actual results:
Volume hang

Expected results:
No volume hang

Additional info:
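To see which files are hit by the "gfid differs" warning shown in the log above, the client log can be filtered for those lines. The following is only a minimal sketch: it assumes the log lines look exactly like the afr_conflicting_iattrs entry quoted in this report (glusterfs 3.2.5), and the names GfidConflict and find_gfid_conflicts are made up for illustration. Other versions may format the message differently.

```python
import re
import sys
from typing import List, NamedTuple


class GfidConflict(NamedTuple):
    """One 'gfid differs' warning (hypothetical helper type)."""
    timestamp: str
    path: str
    subvolume: int
    gfid_a: str
    gfid_b: str


# Pattern modelled on the single log line quoted in this report; it is an
# assumption that other occurrences follow the same layout.
_GFID_CONFLICT = re.compile(
    r"\[(?P<ts>[^\]]+)\]\s+W\s+"
    r"\[afr-common\.c:\d+:afr_conflicting_iattrs\]\s+"
    r"\S+:\s+(?P<path>.+?):\s+gfid differs on subvolume\s+(?P<subvol>\d+)\s+"
    r"\((?P<gfid_a>[0-9a-f-]{36}),\s*(?P<gfid_b>[0-9a-f-]{36})\)"
)


def find_gfid_conflicts(log_text: str) -> List[GfidConflict]:
    """Return every gfid-mismatch warning found in the given log text."""
    return [
        GfidConflict(
            timestamp=m.group("ts"),
            path=m.group("path"),
            subvolume=int(m.group("subvol")),
            gfid_a=m.group("gfid_a"),
            gfid_b=m.group("gfid_b"),
        )
        for m in _GFID_CONFLICT.finditer(log_text)
    ]


if __name__ == "__main__":
    # Usage: python gfid_conflicts.py <path-to-client-log>
    with open(sys.argv[1], encoding="utf-8", errors="replace") as fh:
        for hit in find_gfid_conflicts(fh.read()):
            print(hit.timestamp, hit.path, hit.gfid_a, hit.gfid_b)
```

The script only reports paths and the two gfids; it does not touch the bricks or attempt any repair.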
Good evening, It happened again. I am also having problems with some files: "Input/output error while trying to stat ..." It is impossible to view, list, modify, or delete those files. Any help is welcome. Regards
Could you let us know if this problem is still happening on a current version? 3.2.x is not getting updated anymore. Version 3.4 and newer are actively maintained and could have fixes for the issue that you are facing. Thanks!
The version that this bug has been reported against does not get any updates from the Gluster Community anymore. Please verify whether this report is still valid against a current (3.4, 3.5 or 3.6) release and update the version, or close this bug. If there has been no update before 9 December 2014, this bug will be closed automatically.
Good morning, You can close this bug; the problem has not happened again since this ticket was opened. Regards