Bug 1095179

Summary: Gluster volume inaccessible on all bricks after a glusterfsd segfault on one brick
Product: [Community] GlusterFS
Component: replicate
Version: 3.2.5
Hardware: x86_64
OS: Linux
Status: CLOSED WONTFIX
Severity: high
Priority: unspecified
Reporter: psk <spammable>
Assignee: bugs <bugs>
CC: bugs, gluster-bugs, spammable
Type: Bug
Doc Type: Bug Fix
Last Closed: 2014-12-14 19:40:33 UTC

Description psk 2014-05-07 09:01:11 UTC
Description of problem:
I'm sharing a replicated volume between two web servers.
After a problem on one brick, the volume became unreachable ((107)Transport endpoint is not connected) on the other brick too. It was impossible to umount, remount, or restart glusterfsd; I had to reboot both servers. This has happened three times this week (after more than two years of running perfectly).
Some history: I originally mounted the volume using "glusterfs" as the fstab fstype. Four months ago I switched to "nfs", but over the last few weeks we discovered a lot of stale NFS file handles, so this week we switched back to "glusterfs". Since the switch, one server (always the same one) has crashed three times. The two fstab variants are sketched below.
A crash by itself can happen, but the entire volume hanging is the real problem.
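For reference, the two mount styles were plain fstab entries along the following lines (the mount options are not preserved in this report, so treat these as an illustrative sketch; Gluster's built-in NFS server only speaks NFSv3, hence vers=3):

# native FUSE client (what we use now)
xxxxx-srv-web06:/ApacheRoot  /var/www  glusterfs  defaults,_netdev                 0 0
# Gluster NFS server (the variant used during the last four months)
xxxxx-srv-web06:/ApacheRoot  /var/www  nfs        defaults,_netdev,vers=3,nolock   0 0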


Additional information:
Volume Name: ApacheRoot
Type: Replicate
Status: Started
Number of Bricks: 2
Transport-type: tcp
Bricks:
Brick1: xxxxx-srv-web06:/glusterfs/ApacheRoot
Brick2: xxxxx-srv-web05:/glusterfs/ApacheRoot
Options Reconfigured:
performance.io-thread-count: 32
performance.write-behind-window-size: 8MB
performance.cache-refresh-timeout: 3
performance.cache-max-file-size: 2MB
performance.cache-size: 1GB
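
The listing above is the output of "gluster volume info ApacheRoot"; the "Options Reconfigured" entries would have been set through the standard gluster CLI, roughly like this (a reconstruction of the likely commands, not the actual shell history):

gluster volume info ApacheRoot
gluster volume set ApacheRoot performance.io-thread-count 32
gluster volume set ApacheRoot performance.write-behind-window-size 8MB
gluster volume set ApacheRoot performance.cache-refresh-timeout 3
gluster volume set ApacheRoot performance.cache-max-file-size 2MB
gluster volume set ApacheRoot performance.cache-size 1GB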

df output:
xxxxx-srv-web06:/ApacheRoot      92G   17G   71G  19% /var/www


Here is what I found in the logs; a note on checking the gfid mismatch they report follows the backtrace:

[2014-05-06 18:37:00.363251] W [afr-common.c:1121:afr_conflicting_iattrs] 0-ApacheRoot-replicate-0: /html/yyyown/cms/agents/cbnbdb/prod/_library/photo_gallery/kpa-map.jpg: gfid differs on subvolume 1 (26e0651f-8397-459d-b8cb-0657df085925, e3e846f9-0165-41bd-9f6e-e22825705653)
[2014-05-06 18:37:00.363278] E [afr-self-heal-common.c:1333:afr_sh_common_lookup_cbk] 0-ApacheRoot-replicate-0: Conflicting entries for /html/yyyown/cms/agents/cbnbdb/prod/_library/photo_gallery/kpa-map.jpg
[2014-05-06 18:37:00.364115] E [afr-self-heal-common.c:2074:afr_self_heal_completion_cbk] 0-ApacheRoot-replicate-0: background  entry self-heal failed on /html/yyyown/cms/agents/cbnbdb/prod/_library/photo_gallery
[2014-05-06 18:37:00.465741] E [client3_1-fops.c:2228:client3_1_lookup_cbk] 0-ApacheRoot-client-0: remote operation failed: Stale NFS file handle
[2014-05-06 18:37:00.466103] W [stat-prefetch.c:1549:sp_open_helper] 0-ApacheRoot-stat-prefetch: lookup-behind has failed for path (/html/yyyown/cms/agents/cbnbdb/prod/_library/photo_gallery/kpa-map.jpg)(Stale NFS file handle), unwinding open call waiting on it
[2014-05-06 18:37:00.466155] W [fuse-bridge.c:588:fuse_fd_cbk] 0-glusterfs-fuse: 34582930: OPEN() /html/yyyown/cms/agents/cbnbdb/prod/_library/photo_gallery/kpa-map.jpg => -1 (Stale NFS file handle)
[2014-05-06 18:37:00.467172] E [client3_1-fops.c:366:client3_1_open_cbk] 0-ApacheRoot-client-0: remote operation failed: No such file or directory
pending frames:
frame : type(1) op(OPEN)
frame : type(1) op(OPEN)

patchset: git://git.gluster.com/glusterfs.git
signal received: 11
time of crash: 2014-05-06 18:37:00
configuration details:
argp 1
backtrace 1
dlfcn 1
fdatasync 1
libpthread 1
llistxattr 1
setfsid 1
spinlock 1
epoll.h 1
xattr.h 1
st_atim.tv_nsec 1
package-string: glusterfs 3.2.5
/lib/x86_64-linux-gnu/libc.so.6(+0x364a0)[0x7f9693c004a0]
/usr/lib/glusterfs/3.2.5/xlator/performance/io-cache.so(ioc_open_cbk+0x99)[0x7f9690883af9]
/usr/lib/glusterfs/3.2.5/xlator/performance/read-ahead.so(ra_open_cbk+0x1ac)[0x7f9690a93bcc]
/usr/lib/glusterfs/3.2.5/xlator/performance/write-behind.so(wb_open_cbk+0x127)[0x7f9690c9fa17]
/usr/lib/glusterfs/3.2.5/xlator/cluster/replicate.so(afr_open_cbk+0x25e)[0x7f9690ecc6be]
/usr/lib/glusterfs/3.2.5/xlator/protocol/client.so(client3_1_open_cbk+0x228)[0x7f969111e188]
/usr/lib/libgfrpc.so.0(rpc_clnt_handle_reply+0xa5)[0x7f96943caec5]
/usr/lib/libgfrpc.so.0(rpc_clnt_notify+0x7d)[0x7f96943cb84d]
/usr/lib/libgfrpc.so.0(rpc_transport_notify+0x27)[0x7f96943c7ab7]
/usr/lib/glusterfs/3.2.5/rpc-transport/socket.so(socket_event_poll_in+0x34)[0x7f9691d58074]
/usr/lib/glusterfs/3.2.5/rpc-transport/socket.so(socket_event_handler+0xc7)[0x7f9691d583c7]
/usr/lib/libglusterfs.so.0(+0x3bce7)[0x7f969460ece7]
/usr/sbin/glusterfs(main+0x2a5)[0x7f9694a5e4b5]
/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xed)[0x7f9693beb76d]
/usr/sbin/glusterfs(+0x4745)[0x7f9694a5e745]
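
The warnings just before the segfault point to a gfid split-brain on kpa-map.jpg. One way to confirm such a mismatch, assuming shell access to both brick servers (this is a diagnostic sketch, not output taken from the machines), is to compare the Gluster xattrs of the file directly under each brick path:

# run on both xxxxx-srv-web05 and xxxxx-srv-web06, against the brick, not the mount
getfattr -d -m . -e hex /glusterfs/ApacheRoot/html/yyyown/cms/agents/cbnbdb/prod/_library/photo_gallery/kpa-map.jpg
# trusted.gfid should be identical on the two bricks; differing values match the
# "gfid differs on subvolume 1" warning, and non-zero trusted.afr.ApacheRoot-client-*
# entries indicate pending self-heal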


Version-Release number of selected component (if applicable):
Ubuntu 12.10 LTS
glusterfs 3.2.5




How reproducible:
Unknown; it has happened several times this week, but I cannot trigger it on demand.

Steps to Reproduce:
1.
2.
3.

Actual results:
The entire volume hangs on both servers ((107)Transport endpoint is not connected) and only a reboot of both servers restores access.

Expected results:
A crash of a single brick process should not take down the whole volume; the replica on the other server should keep it accessible.

Additional info:

Comment 1 psk 2014-05-07 14:27:22 UTC
Good evening,

It happened again.
I also got problems with some files: "Input/output error while trying to stat ..."
It is impossible to view, list, modify, or delete those files.

Any help is welcome.

Regards

Comment 2 Niels de Vos 2014-11-11 09:39:56 UTC
Could you let us know if this problem is still happening on a current version? 3.2.x is not getting updated anymore. Version 3.4 and newer are actively maintained and could have fixes for the issue that you are facing.

Thanks!

Comment 3 Niels de Vos 2014-11-27 14:54:55 UTC
The version that this bug has been reported against does not get any updates from the Gluster Community anymore. Please verify whether this report is still valid against a current (3.4, 3.5 or 3.6) release and update the version, or close this bug.

If there has been no update before 9 December 2014, this bug will get automatically closed.

Comment 4 psk 2015-05-04 11:07:44 UTC
Good morning,

You can forget this bug; it has not happened again since this ticket was opened.

Regards