Description of problem: ------------------------ appending data continuously to a file using "echo" command on a fuse mount reports EIO after sometime. [06/15/12 - 01:47:31 root@ARF-Client1 ~]# ls -lh /mnt/gfsc1/output_file.20120615010422 -rw-r--r--. 1 root root 14M Jun 15 01:27 /mnt/gfsc1/output_file.20120615010422 Version-Release number of selected component (if applicable): ------------------------------------------------------------ 3.3.0qa45 How reproducible: ----------------- Often append_data.sh:- --------------- #!/bin/bash path=$1 start_time=$(date +%s) append=`date '+%Y%m%d%H%M%S'` for i in {1..100000};do echo -e "Date: `date`\n" >> $path/output_file.$append echo -e "~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~\n" >> $path/output_file.$append echo -e "Calender:\n" >> $path/output_file.$append echo -e "`cal`\n" >> $path/output_file.$append echo -e "~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~\n" >> $path/output_file.$append done finish_time=$(date +%s) echo "Time duration: $((finish_time - start_time)) secs." >> $path/output_file.$append Steps to Reproduce: ------------------- 1.Create a replicate volume (1x3) 2.Set "write-behind on" and "eager-lock on" on the volume 3.Start the volume 4.Create a fuse mount. 5.execute the script "append_data <mount_point>" . mount_point as argument to the script. 6.After sometime the script fails with Input/Output Error Actual results: --------------- [2012-06-15 01:27:51.461317] W [afr-self-heal-data.c:831:afr_lookup_select_read_child_by_txn_type] 0-vol-replicate-0: /output_file.20120615010422: Possible split-brain [2012-06-15 01:27:51.461524] W [afr-common.c:1226:afr_detect_self_heal_by_lookup_status] 0-vol-replicate-0: split brain detected during lookup of /output_file.20120615010422. [2012-06-15 01:27:51.461736] W [afr-open.c:213:afr_open] 0-vol-replicate-0: failed to open as split brain seen, returning EIO [2012-06-15 01:27:51.461972] W [fuse-bridge.c:713:fuse_fd_cbk] 0-glusterfs-fuse: 2565915: OPEN() /output_file.20120615010422 => -1 (Input/output error) [2012-06-15 01:27:51.462492] W [afr-open.c:213:afr_open] 0-vol-replicate-0: failed to open as split brain seen, returning EIO [2012-06-15 01:27:51.462560] W [fuse-bridge.c:713:fuse_fd_cbk] 0-glusterfs-fuse: 2565916: OPEN() /output_file.20120615010422 => -1 (Input/output error) [2012-06-15 01:27:51.465513] W [afr-open.c:213:afr_open] 0-vol-replicate-0: failed to open as split brain seen, returning EIO [2012-06-15 01:27:51.465558] W [fuse-bridge.c:713:fuse_fd_cbk] 0-glusterfs-fuse: 2565917: OPEN() /output_file.20120615010422 => -1 (Input/output error) [2012-06-15 01:27:51.466011] W [afr-open.c:213:afr_open] 0-vol-replicate-0: failed to open as split brain seen, returning EIO [2012-06-15 01:27:51.466056] W [fuse-bridge.c:713:fuse_fd_cbk] 0-glusterfs-fuse: 2565918: OPEN() /output_file.20120615010422 => -1 (Input/output error) [2012-06-15 01:27:51.468327] W [afr-open.c:213:afr_open] 0-vol-replicate-0: failed to open as split brain seen, returning EIO Output of stat on file after the script failure: ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ [06/15/12 - 01:40:28 root@ARF-Client1 ~]# stat /mnt/gfsc1/output_file.20120615010422 File: `/mnt/gfsc1/output_file.20120615010422' Size: 13812273 Blocks: 32776 IO Block: 131072 regular file Device: 15h/21d Inode: 9581628177895579223 Links: 1 Access: (0644/-rw-r--r--) Uid: ( 0/ root) Gid: ( 0/ root) Access: 2012-06-15 01:04:27.966005057 -0400 Modify: 2012-06-15 01:27:51.459004464 -0400 Change: 2012-06-15 01:27:51.460121389 -0400 Fuse Mount log when executed stat:- ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ [2012-06-15 01:47:31.337301] W [afr-common.c:1226:afr_detect_self_heal_by_lookup_status] 0-vol-replicate-0: split brain detected during lookup of /output_file.20120615010422. [2012-06-15 01:47:31.337387] I [afr-common.c:1340:afr_launch_self_heal] 0-vol-replicate-0: background data gfid self-heal triggered. path: /output_file.20120615010422, reason: lookup detected pending operations [2012-06-15 01:47:31.339249] I [afr-self-heal-common.c:1189:afr_sh_missing_entry_call_impunge_recreate] 0-vol-replicate-0: no missing files - /output_file.20120615010422. proceeding to metadata check [2012-06-15 01:47:31.340112] I [afr-self-heal-common.c:994:afr_sh_missing_entries_done] 0-vol-replicate-0: split brain found, aborting selfheal of /output_file.20120615010422 [2012-06-15 01:47:31.340185] E [afr-self-heal-common.c:2156:afr_self_heal_completion_cbk] 0-vol-replicate-0: background data gfid self-heal failed on /output_file.20120615010422 Xattrs of file after split-brain reported:- ------------------------------------------ brick3: ~~~~~~~ [06/15/12 - 01:41:18 root@AFR-Server3 ~]# getfattr -d -m. -ehex /export_b1/dir1/output_file.20120615010422 getfattr: Removing leading '/' from absolute path names # file: export_b1/dir1/output_file.20120615010422 security.selinux=0x756e636f6e66696e65645f753a6f626a6563745f723a66696c655f743a733000 trusted.afr.vol-client-0=0x000000000000000000000000 trusted.afr.vol-client-1=0x000000000000000000000000 trusted.afr.vol-client-2=0x000000000000000000000000 trusted.gfid=0x8438a813ad7646c684f8c80fe956de57 brick2: ~~~~~~~~ [06/15/12 - 01:41:13 root@AFR-Server2 ~]# getfattr -d -m. -ehex /export_b1/dir1/output_file.20120615010422 getfattr: Removing leading '/' from absolute path names # file: export_b1/dir1/output_file.20120615010422 security.selinux=0x756e636f6e66696e65645f753a6f626a6563745f723a66696c655f743a733000 trusted.afr.vol-client-0=0x000000000000000000000000 trusted.afr.vol-client-1=0x000000000000000000000000 trusted.afr.vol-client-2=0x000000000000000000000000 trusted.gfid=0x8438a813ad7646c684f8c80fe956de57 brick1: ~~~~~~~~ [06/15/12 - 01:41:03 root@AFR-Server1 ~]# getfattr -d -m. -ehex /export_b1/dir1/output_file.20120615010422 getfattr: Removing leading '/' from absolute path names # file: export_b1/dir1/output_file.20120615010422 security.selinux=0x756e636f6e66696e65645f753a6f626a6563745f723a66696c655f743a733000 trusted.afr.vol-client-0=0x000000000000000000000000 trusted.afr.vol-client-1=0x000000000000000000000000 trusted.afr.vol-client-2=0x000000000000000000000000 trusted.gfid=0x8438a813ad7646c684f8c80fe956de57
Created attachment 592039 [details] glusterfs logs, history of commands executed on storage_node1 and fuse mount log
md5sum on brick1:- ----------------- [06/15/12 - 02:28:59 root@AFR-Server1 ~]# arequal-checksum -p /export_b1/dir1/output_file.20120615010422 Entry counts Regular files : 1 Directories : 0 Symbolic links : 0 Other : 0 Total : 1 Metadata checksums Regular files : 486e85 Directories : 3e9 Symbolic links : 3e9 Other : 3e9 Checksums Regular files : b05e703fbeac795f934ec0aa043387c9 Directories : 0 Symbolic links : 0 Other : 0 Total : 2310b095ba9ffe96 md5sum on brick2:- ------------------ [06/15/12 - 02:29:04 root@AFR-Server2 ~]# arequal-checksum -p /export_b1/dir1/output_file.20120615010422 Entry counts Regular files : 1 Directories : 0 Symbolic links : 0 Other : 0 Total : 1 Metadata checksums Regular files : 486e85 Directories : 3e9 Symbolic links : 3e9 Other : 3e9 Checksums Regular files : b05e703fbeac795f934ec0aa043387c9 Directories : 0 Symbolic links : 0 Other : 0 Total : 2310b095ba9ffe96 md5sum on brick3:- ------------------ [06/15/12 - 02:30:48 root@AFR-Server3 ~]# arequal-checksum -p /export_b1/dir1/output_file.20120615010422 Entry counts Regular files : 1 Directories : 0 Symbolic links : 0 Other : 0 Total : 1 Metadata checksums Regular files : 486e85 Directories : 3e9 Symbolic links : 3e9 Other : 3e9 Checksums Regular files : b05e703fbeac795f934ec0aa043387c9 Directories : 0 Symbolic links : 0 Other : 0 Total : 2310b095ba9ffe96
There are 2 parts to the bug. 1) For some reason split-brain flag is set, we haven't figured out why that happened. 2) Once the flag is set, there is no way to reset the flag. This fixes the second part. The first part does not happen in my setup. With this fix even in her setup it did not happen. She is trying to recreate it without this patch once more. This bug is tracking 2). 833727 is created to track 1). Moving this bug to ON_QA
Verified the fix on RC2-C. Bug is fixed.
*** Bug 836101 has been marked as a duplicate of this bug. ***
(In reply to comment #5) > *** Bug 836101 has been marked as a duplicate of this bug. *** I would like to check the fix in my setup (bug 836101). Where can I obtain the fix or RC2-C?
http://review.gluster.com/3583
*** Bug 857270 has been marked as a duplicate of this bug. ***