Bug 762920 (GLUSTER-1188) - 3.0.5 client crash - afr_set_split_brain
Summary: 3.0.5 client crash - afr_set_split_brain
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: GLUSTER-1188
Product: GlusterFS
Classification: Community
Component: replicate
Version: 3.0.5
Hardware: All
OS: Linux
Priority: low
Severity: high
Target Milestone: ---
Assignee: Pranith Kumar K
QA Contact:
URL:
Whiteboard:
Duplicates: 763888
Depends On:
Blocks:
Reported: 2010-07-21 10:22 UTC by Anush Shetty
Modified: 2015-12-01 16:45 UTC (History)
CC List: 5 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed:
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:



Description Anush Shetty 2010-07-21 10:22:16 UTC
Reported by Jon Swanson on gluster-users

--Setup--
Client:
Fedora 12 2.6.32.16-141.fc12.x86_64
# rpm -qa |egrep 'fuse|glust'
fuse-2.8.4-1.fc12.x86_64
glusterfs-client-3.0.5-1.fc11.x86_64
fuse-libs-2.8.4-1.fc12.x86_64
glusterfs-common-3.0.5-1.fc11.x86_64


Servers - 6 nodes with a 3 x distribute:
Fedora 12 2.6.32.9-70.fc12.x86_64
[root.prod ~]# rpm -qa | grep glust
glusterfs-common-3.0.5-1.fc11.x86_64
glusterfs-server-3.0.5-1.fc11.x86_64


Process:
1. Client copies a large number of files to the gluster mount
2. Client tries to do a recursive list of all files copied (ls -R)
3. Recursive list comes across a file where the checksum does not match for some reason (see following log snippet)
4. Client dies horribly, and the mount point becomes invalid with the following error:
gluster-mount/file: Transport endpoint is not connected

I've tried to keep the snippets below as brief as possible. If you think the volume definition files would help, let me know and I'll be happy to post those here as well.

Any help or suggestions are most welcome.

Thanks!

---

This is the corresponding snippet from 'tail -f gluster-mount.log':

> [2010-07-21 16:34:48] N [client-protocol.c:6288:client_setvolume_cbk] pdbindex2-1: Connected to 192.168.201.88:6996, attached to remote volume 'brick'.

> [2010-07-21 16:35:33] E [afr.c:107:afr_set_split_brain] mirror-0: invalid argument: inode
> [2010-07-21 16:35:33] E [afr-self-heal-algorithm.c:768:sh_diff_checksum_cbk] mirror-0: checksum on /index.201007211105.deploy/file failed on subvolume indexcopy-0 (File descriptor in bad state)
> [2010-07-21 16:35:33] E [afr-self-heal-algorithm.c:768:sh_diff_checksum_cbk] mirror-0: checksum on /index.201007211105.deploy/file failed on subvolume indexcopy-1 (File descriptor in bad state)
> pending frames:
> frame : type(1) op(LOOKUP)
> frame : type(1) op(LOOKUP)
> frame : type(1) op(LOOKUP)
>
> patchset: v3.0.5
> signal received: 11
> time of crash: 2010-07-21 16:35:33
> configuration details:
> argp 1
> backtrace 1
> dlfcn 1
> fdatasync 1
> libpthread 1
> llistxattr 1
> setfsid 1
> spinlock 1
> epoll.h 1
> xattr.h 1
> st_atim.tv_nsec 1
> package-string: glusterfs 3.0.5
> /lib64/libc.so.6(+0x32740)[0x7fa9c949b740]
> /usr/lib64/glusterfs/3.0.5/xlator/cluster/replicate.so(+0x4b2ea)[0x7fa9c85ff2ea]
> /usr/lib64/glusterfs/3.0.5/xlator/cluster/replicate.so(+0x4b557)[0x7fa9c85ff557]
> /usr/lib64/glusterfs/3.0.5/xlator/cluster/replicate.so(+0x4be10)[0x7fa9c85ffe10]
> /usr/lib64/glusterfs/3.0.5/xlator/cluster/replicate.so(afr_sh_algo_diff+0x196)[0x7fa9c85fffc2]
> /usr/lib64/glusterfs/3.0.5/xlator/cluster/replicate.so(afr_sh_data_sync_prepare+0x256)[0x7fa9c85e9a91]
> /usr/lib64/glusterfs/3.0.5/xlator/cluster/replicate.so(afr_sh_data_fix+0x5db)[0x7fa9c85ea078]
> /usr/lib64/glusterfs/3.0.5/xlator/cluster/replicate.so(afr_sh_data_fstat_cbk+0x167)[0x7fa9c85ea34e]
> /usr/lib64/glusterfs/3.0.5/xlator/cluster/distribute.so(dht_attr_cbk+0x238)[0x7fa9c8820e08]
> /usr/lib64/glusterfs/3.0.5/xlator/protocol/client.so(client_fstat_cbk+0x178)[0x7fa9c8a59868]
> /usr/lib64/glusterfs/3.0.5/xlator/protocol/client.so(protocol_client_interpret+0x1df)[0x7fa9c8a60274]
> /usr/lib64/glusterfs/3.0.5/xlator/protocol/client.so(protocol_client_pollin+0xc6)[0x7fa9c8a60ff5]
> /usr/lib64/glusterfs/3.0.5/xlator/protocol/client.so(notify+0x158)[0x7fa9c8a6154d]
> /usr/lib64/libglusterfs.so.0(xlator_notify+0xd8)[0x7fa9c9c1b639]
> /usr/lib64/glusterfs/3.0.5/transport/socket.so(socket_event_poll_in+0x46)[0x7fa9c6f59249]
> /usr/lib64/glusterfs/3.0.5/transport/socket.so(socket_event_handler+0xc4)[0x7fa9c6f5957c]
> /usr/lib64/libglusterfs.so.0(+0x3eefc)[0x7fa9c9c40efc]
> /usr/lib64/libglusterfs.so.0(+0x3f0ee)[0x7fa9c9c410ee]
> /usr/lib64/libglusterfs.so.0(event_dispatch+0x74)[0x7fa9c9c4140d]
> /usr/sbin/glusterfs(main+0xf53)[0x406187]
> /lib64/libc.so.6(__libc_start_main+0xfd)[0x7fa9c9487b1d]
> /usr/sbin/glusterfs[0x402679]
> ---------
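
Several frames in the backtrace above carry only a module and an offset (for example `replicate.so(+0x4b2ea)`), with no symbol name. The following is a minimal sketch, not part of the original report, that pulls those module/offset pairs out of the log so they can be resolved separately. The helper names `parse_frame` and `unresolved_frames` are hypothetical, and the regular expression assumes the usual glibc `backtrace_symbols()` line layout seen in this log.

```python
import re

# Assumed line layout (glibc backtrace_symbols style), optionally "> " quoted:
#   /path/lib.so(symbol+0x196)[0x7fa9c85fffc2]
#   /path/lib.so(+0x4b2ea)[0x7fa9c85ff2ea]
#   /usr/sbin/glusterfs[0x402679]
FRAME_RE = re.compile(
    r"^(?:>\s*)?(?P<module>/[^\s(\[]+)"
    r"(?:\((?P<symbol>[^+)]*)\+(?P<offset>0x[0-9a-fA-F]+)\))?"
    r"\[(?P<address>0x[0-9a-fA-F]+)\]$"
)


def parse_frame(line):
    """Parse one backtrace line; return a dict, or None if it is not a frame."""
    m = FRAME_RE.match(line.strip())
    if m is None:
        return None
    return {
        "module": m.group("module"),
        "symbol": m.group("symbol") or None,  # empty symbol becomes None
        "offset": m.group("offset"),
        "address": m.group("address"),
    }


def unresolved_frames(lines):
    """Return (module, offset) pairs for frames that have no symbol name."""
    pairs = []
    for line in lines:
        frame = parse_frame(line)
        if frame and frame["symbol"] is None and frame["offset"]:
            pairs.append((frame["module"], frame["offset"]))
    return pairs
```

Each returned pair can then be handed to a symbolizer such as `addr2line -f -e <module> <offset>`, provided debug symbols matching the installed 3.0.5 build are available.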

If we look at the respective files, their checksums are fine:
> [16:40] ~> for i in `seq 10 15`; do echo -n "search$i: "; ssh search$i md5sum /data/export/index.201007211105.deploy/file; done
> search10: md5sum: /data/export/index.201007211105.deploy/file: No such file or directory
> search11: 8605b1467bece54ed7ccd13e086ee299  /data/export/index.201007211105.deploy/file
> search12: md5sum: /data/export/index.201007211105.deploy/file: No such file or directory
> search13: md5sum: /data/export/index.201007211105.deploy/file: No such file or directory
> search14: 8605b1467bece54ed7ccd13e086ee299  /data/export/index.201007211105.deploy/file
> search15: md5sum: /data/export/index.201007211105.deploy/file: No such file or directory
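
The comparison above was done by eye. As a rough sketch (the function names `replica_digests` and `replicas_agree` are made up for illustration), the same check can be automated by collecting the digest from every host that actually holds the file and confirming they are identical:

```python
import re

# Matches lines like "search11: <32 hex digits>  /path", optionally "> " quoted.
# Hosts that print "No such file or directory" do not match and are skipped.
MD5_LINE = re.compile(
    r"^(?:>\s*)?(?P<host>\S+):\s+(?P<digest>[0-9a-f]{32})\s+(?P<path>\S.*)$"
)


def replica_digests(lines):
    """Map host -> md5 digest for hosts that returned a checksum."""
    digests = {}
    for line in lines:
        m = MD5_LINE.match(line.strip())
        if m:
            digests[m.group("host")] = m.group("digest")
    return digests


def replicas_agree(digests):
    """True when at least one copy exists and all copies share one digest."""
    return len(digests) > 0 and len(set(digests.values())) == 1
```

For the output shown here, only search11 and search14 hold a copy, and both report the same digest.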

If we look at extended attributes however, we notice that 'trusted.posix.gen' is different:
> for i in `seq 10 15`; do echo -n "search$i: "; ssh pdbsearch$i getfattr -d -m - /data/export/index.201007211105.deploy/file; done
> search10: getfattr: /data/export/index.201007211105.deploy/file: No such file or directory
> search11: getfattr: Removing leading '/' from absolute path names
> # file: data/export/index.201007211105.deploy/file
> security.selinux="unconfined_u:object_r:default_t:s0
> trusted.afr.indexcopy-0=0sAAAAAQAAAAAAAAAA
> trusted.afr.indexcopy-1=0sAAAAAQAAAAAAAAAA
> trusted.posix.gen=0sTEFukQAAAEY=
>
> search12: getfattr: /data/export/index.201007211105.deploy/file: No such file or directory
> search13: getfattr: /data/export/index.201007211105.deploy/file: No such file or directory
> search14: getfattr: Removing leading '/' from absolute path names
> # file: data/export/index.201007211105.deploy/file
> security.selinux="unconfined_u:object_r:default_t:s0
> trusted.afr.indexcopy-0=0sAAAAAQAAAAAAAAAA
> trusted.afr.indexcopy-1=0sAAAAAQAAAAAAAAAA
> trusted.posix.gen=0sTEaPaAAAAAI=
>
> search15: getfattr: /data/export/index.201007211105.deploy/file: No such file or directory
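
The `0s` prefix in the getfattr output marks a base64-encoded value, so the attributes are easier to compare once decoded. The sketch below is an illustration only: it decodes the values quoted above and splits them into big-endian 32-bit words. The helper names are hypothetical, and the word-wise split is an assumption about the layout, not something confirmed in this report.

```python
import base64
import struct


def decode_xattr(value):
    """Decode a getfattr value; a leading '0s' marks base64 encoding."""
    if not value.startswith("0s"):
        raise ValueError("expected a base64 value prefixed with '0s'")
    return base64.b64decode(value[2:])


def as_be_u32(raw):
    """Split raw bytes into big-endian unsigned 32-bit words (assumed layout)."""
    count = len(raw) // 4
    return list(struct.unpack(">%dI" % count, raw[: count * 4]))


# Values copied from the getfattr output above.
for host, value in [("search11", "0sTEFukQAAAEY="), ("search14", "0sTEaPaAAAAAI=")]:
    print(host, "trusted.posix.gen", as_be_u32(decode_xattr(value)))

print("trusted.afr.indexcopy-*", as_be_u32(decode_xattr("0sAAAAAQAAAAAAAAAA")))
```

Under that assumption, `trusted.posix.gen` decodes to two words that differ between search11 and search14 in both positions, while every `trusted.afr.indexcopy-*` value decodes to `[1, 0, 0]`. If the AFR attribute follows the commonly described data/metadata/entry pending-counter layout (again an assumption here), both copies record one pending data operation against both subvolumes.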

Comment 1 Vijay Bellur 2011-01-03 06:43:18 UTC
*** Bug 2156 has been marked as a duplicate of this bug. ***

Comment 2 Anand Avati 2011-02-04 05:39:07 UTC
PATCH: http://patches.gluster.com/patch/6095 in master (cluster/afr: fix races in self-heal)

