Bug 832305 - appending data to a file on fuse mount fails with EIO
Summary: appending data to a file on fuse mount fails with EIO
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: GlusterFS
Classification: Community
Component: replicate
Version: 3.3-beta
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: urgent
Target Milestone: ---
Assignee: Pranith Kumar K
QA Contact:
URL:
Whiteboard:
Duplicates: 857270
Depends On:
Blocks: 833727 840394 840820
Reported: 2012-06-15 06:13 UTC by Shwetha Panduranga
Modified: 2013-07-24 17:13 UTC
CC: 4 users

Fixed In Version: glusterfs-3.4.0
Doc Type: Bug Fix
Doc Text:
Clone Of:
Clones: 833727
Environment:
Last Closed: 2013-07-24 17:13:40 UTC
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Embargoed:


Attachments
glusterfs logs, history of commands executed on storage_node1 and fuse mount log (3.35 MB, application/x-gzip)
2012-06-15 06:16 UTC, Shwetha Panduranga

Description Shwetha Panduranga 2012-06-15 06:13:41 UTC
Description of problem:
------------------------
Appending data continuously to a file using the "echo" command on a fuse mount reports EIO after some time.

[06/15/12 - 01:47:31 root@ARF-Client1 ~]# ls -lh /mnt/gfsc1/output_file.20120615010422 
-rw-r--r--. 1 root root 14M Jun 15 01:27 /mnt/gfsc1/output_file.20120615010422


Version-Release number of selected component (if applicable):
------------------------------------------------------------
3.3.0qa45

How reproducible:
-----------------
Often

append_data.sh:-
---------------
#!/bin/bash

path=$1
start_time=$(date +%s)
append=$(date '+%Y%m%d%H%M%S')

for i in {1..100000}; do
        echo -e "Date: $(date)\n" >> "$path/output_file.$append"
        echo -e "~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~\n" >> "$path/output_file.$append"
        echo -e "Calendar:\n" >> "$path/output_file.$append"
        echo -e "$(cal)\n" >> "$path/output_file.$append"
        echo -e "~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~\n" >> "$path/output_file.$append"
done
finish_time=$(date +%s)
echo "Time duration: $((finish_time - start_time)) secs." >> "$path/output_file.$append"

Steps to Reproduce:
-------------------
1. Create a replicate volume (1x3).

2. Set "write-behind on" and "eager-lock on" on the volume.

3. Start the volume.

4. Create a fuse mount.

5. Execute the script as "append_data.sh <mount_point>", passing the mount point as the argument (see the CLI sketch after this list).

6. After some time the script fails with Input/output error.
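
A rough CLI transcript of steps 1-5, for reference. The volume name (vol), brick path (/export_b1/dir1), server names, and mount point (/mnt/gfsc1) are taken from the logs and prompts in this report; the exact option keys (performance.write-behind, cluster.eager-lock) are my assumption for this 3.3 build:

# step 1: 1x3 replicate volume across the three servers seen in this report
gluster volume create vol replica 3 \
    AFR-Server1:/export_b1/dir1 \
    AFR-Server2:/export_b1/dir1 \
    AFR-Server3:/export_b1/dir1

# step 2: enable the two options (option keys assumed for 3.3)
gluster volume set vol performance.write-behind on
gluster volume set vol cluster.eager-lock on

# step 3
gluster volume start vol

# steps 4-5: fuse mount on the client, then run the script against it
mount -t glusterfs AFR-Server1:/vol /mnt/gfsc1
./append_data.sh /mnt/gfsc1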

Actual results:
---------------
[2012-06-15 01:27:51.461317] W [afr-self-heal-data.c:831:afr_lookup_select_read_child_by_txn_type] 0-vol-replicate-0: /output_file.20120615010422: Possible split-brain
[2012-06-15 01:27:51.461524] W [afr-common.c:1226:afr_detect_self_heal_by_lookup_status] 0-vol-replicate-0: split brain detected during lookup of /output_file.20120615010422.
[2012-06-15 01:27:51.461736] W [afr-open.c:213:afr_open] 0-vol-replicate-0: failed to open as split brain seen, returning EIO
[2012-06-15 01:27:51.461972] W [fuse-bridge.c:713:fuse_fd_cbk] 0-glusterfs-fuse: 2565915: OPEN() /output_file.20120615010422 => -1 (Input/output error)
[2012-06-15 01:27:51.462492] W [afr-open.c:213:afr_open] 0-vol-replicate-0: failed to open as split brain seen, returning EIO
[2012-06-15 01:27:51.462560] W [fuse-bridge.c:713:fuse_fd_cbk] 0-glusterfs-fuse: 2565916: OPEN() /output_file.20120615010422 => -1 (Input/output error)
[2012-06-15 01:27:51.465513] W [afr-open.c:213:afr_open] 0-vol-replicate-0: failed to open as split brain seen, returning EIO
[2012-06-15 01:27:51.465558] W [fuse-bridge.c:713:fuse_fd_cbk] 0-glusterfs-fuse: 2565917: OPEN() /output_file.20120615010422 => -1 (Input/output error)
[2012-06-15 01:27:51.466011] W [afr-open.c:213:afr_open] 0-vol-replicate-0: failed to open as split brain seen, returning EIO
[2012-06-15 01:27:51.466056] W [fuse-bridge.c:713:fuse_fd_cbk] 0-glusterfs-fuse: 2565918: OPEN() /output_file.20120615010422 => -1 (Input/output error)
[2012-06-15 01:27:51.468327] W [afr-open.c:213:afr_open] 0-vol-replicate-0: failed to open as split brain seen, returning EIO

Output of stat on file after the script failure:
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
[06/15/12 - 01:40:28 root@ARF-Client1 ~]# stat /mnt/gfsc1/output_file.20120615010422
  File: `/mnt/gfsc1/output_file.20120615010422'
  Size: 13812273  	Blocks: 32776      IO Block: 131072 regular file
Device: 15h/21d	Inode: 9581628177895579223  Links: 1
Access: (0644/-rw-r--r--)  Uid: (    0/    root)   Gid: (    0/    root)
Access: 2012-06-15 01:04:27.966005057 -0400
Modify: 2012-06-15 01:27:51.459004464 -0400
Change: 2012-06-15 01:27:51.460121389 -0400


Fuse mount log when stat was executed:-
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
[2012-06-15 01:47:31.337301] W [afr-common.c:1226:afr_detect_self_heal_by_lookup_status] 0-vol-replicate-0: split brain detected during lookup of /output_file.20120615010422.
[2012-06-15 01:47:31.337387] I [afr-common.c:1340:afr_launch_self_heal] 0-vol-replicate-0: background  data gfid self-heal triggered. path: /output_file.20120615010422, reason: lookup detected pending operations
[2012-06-15 01:47:31.339249] I [afr-self-heal-common.c:1189:afr_sh_missing_entry_call_impunge_recreate] 0-vol-replicate-0: no missing files - /output_file.20120615010422. proceeding to metadata check
[2012-06-15 01:47:31.340112] I [afr-self-heal-common.c:994:afr_sh_missing_entries_done] 0-vol-replicate-0: split brain found, aborting selfheal of /output_file.20120615010422
[2012-06-15 01:47:31.340185] E [afr-self-heal-common.c:2156:afr_self_heal_completion_cbk] 0-vol-replicate-0: background  data gfid self-heal failed on /output_file.20120615010422
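
Side note for triage: 3.3 also provides a CLI to list files the self-heal daemon currently flags as split-brained. A one-line check, with the volume name taken from the logs above (the subcommand is my assumption for this 3.3 build):

gluster volume heal vol info split-brain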


Xattrs of the file after split-brain was reported:-
------------------------------------------
brick3:
~~~~~~~
[06/15/12 - 01:41:18 root@AFR-Server3 ~]# getfattr -d -m. -ehex /export_b1/dir1/output_file.20120615010422
getfattr: Removing leading '/' from absolute path names
# file: export_b1/dir1/output_file.20120615010422
security.selinux=0x756e636f6e66696e65645f753a6f626a6563745f723a66696c655f743a733000
trusted.afr.vol-client-0=0x000000000000000000000000
trusted.afr.vol-client-1=0x000000000000000000000000
trusted.afr.vol-client-2=0x000000000000000000000000
trusted.gfid=0x8438a813ad7646c684f8c80fe956de57

brick2:
~~~~~~~~
[06/15/12 - 01:41:13 root@AFR-Server2 ~]# getfattr -d -m. -ehex /export_b1/dir1/output_file.20120615010422
getfattr: Removing leading '/' from absolute path names
# file: export_b1/dir1/output_file.20120615010422
security.selinux=0x756e636f6e66696e65645f753a6f626a6563745f723a66696c655f743a733000
trusted.afr.vol-client-0=0x000000000000000000000000
trusted.afr.vol-client-1=0x000000000000000000000000
trusted.afr.vol-client-2=0x000000000000000000000000
trusted.gfid=0x8438a813ad7646c684f8c80fe956de57

brick1:
~~~~~~~~
[06/15/12 - 01:41:03 root@AFR-Server1 ~]# getfattr -d -m. -ehex /export_b1/dir1/output_file.20120615010422
getfattr: Removing leading '/' from absolute path names
# file: export_b1/dir1/output_file.20120615010422
security.selinux=0x756e636f6e66696e65645f753a6f626a6563745f723a66696c655f743a733000
trusted.afr.vol-client-0=0x000000000000000000000000
trusted.afr.vol-client-1=0x000000000000000000000000
trusted.afr.vol-client-2=0x000000000000000000000000
trusted.gfid=0x8438a813ad7646c684f8c80fe956de57
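
A note on reading these dumps: each trusted.afr.vol-client-N value is a 12-byte changelog holding three big-endian 4-byte counters (data, metadata, and entry pending operations). All zeros on every brick, as above, means no pending operations are recorded on disk even though the client keeps reporting split-brain. A minimal shell sketch for decoding a value:

xattr=000000000000000000000000    # hex payload with the 0x prefix stripped
echo "data=$((16#${xattr:0:8})) metadata=$((16#${xattr:8:8})) entry=$((16#${xattr:16:8}))"
# prints: data=0 metadata=0 entry=0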

Comment 1 Shwetha Panduranga 2012-06-15 06:16:48 UTC
Created attachment 592039 [details]
glusterfs logs, history of commands executed on storage_node1 and fuse mount log

Comment 2 Shwetha Panduranga 2012-06-15 06:31:19 UTC
arequal-checksum on brick1:-
-----------------

[06/15/12 - 02:28:59 root@AFR-Server1 ~]# arequal-checksum -p /export_b1/dir1/output_file.20120615010422

Entry counts
Regular files   : 1
Directories     : 0
Symbolic links  : 0
Other           : 0
Total           : 1

Metadata checksums
Regular files   : 486e85
Directories     : 3e9
Symbolic links  : 3e9
Other           : 3e9

Checksums
Regular files   : b05e703fbeac795f934ec0aa043387c9
Directories     : 0
Symbolic links  : 0
Other           : 0
Total           : 2310b095ba9ffe96

arequal-checksum on brick2:-
------------------
[06/15/12 - 02:29:04 root@AFR-Server2 ~]# arequal-checksum -p /export_b1/dir1/output_file.20120615010422 

Entry counts
Regular files   : 1
Directories     : 0
Symbolic links  : 0
Other           : 0
Total           : 1

Metadata checksums
Regular files   : 486e85
Directories     : 3e9
Symbolic links  : 3e9
Other           : 3e9

Checksums
Regular files   : b05e703fbeac795f934ec0aa043387c9
Directories     : 0
Symbolic links  : 0
Other           : 0
Total           : 2310b095ba9ffe96

arequal-checksum on brick3:-
------------------
[06/15/12 - 02:30:48 root@AFR-Server3 ~]# arequal-checksum -p /export_b1/dir1/output_file.20120615010422 

Entry counts
Regular files   : 1
Directories     : 0
Symbolic links  : 0
Other           : 0
Total           : 1

Metadata checksums
Regular files   : 486e85
Directories     : 3e9
Symbolic links  : 3e9
Other           : 3e9

Checksums
Regular files   : b05e703fbeac795f934ec0aa043387c9
Directories     : 0
Symbolic links  : 0
Other           : 0
Total           : 2310b095ba9ffe96

Comment 3 Pranith Kumar K 2012-06-20 09:44:13 UTC
There are two parts to this bug: 1) for some reason the split-brain flag gets set, and we haven't figured out why; 2) once the flag is set, there is no way to reset it. This fix addresses the second part. The first part does not happen in my setup, and with this fix it did not happen in her setup either. She is trying to recreate it without the patch once more.
This bug is tracking 2).

Bug 833727 was created to track 1).

Moving this bug to ON_QA
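
One way to recover before the fix, given that the flag cannot be reset through the filesystem, is to drop the client-side state by remounting. A sketch, with the mount point and volume name taken from this report; this presumes the stale flag lives only on the client and that the copies genuinely agree, as the identical checksums in comment 2 suggest:

umount /mnt/gfsc1
mount -t glusterfs AFR-Server1:/vol /mnt/gfsc1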

Comment 4 Shwetha Panduranga 2012-06-25 09:43:56 UTC
Verified the fix on RC2-C. Bug is fixed.

Comment 5 Pranith Kumar K 2012-07-02 10:32:01 UTC
*** Bug 836101 has been marked as a duplicate of this bug. ***

Comment 6 Johannes Martin 2012-07-02 11:03:39 UTC
(In reply to comment #5)
> *** Bug 836101 has been marked as a duplicate of this bug. ***

I would like to check the fix in my setup (bug 836101). Where can I obtain the fix or RC2-C?

Comment 7 Pranith Kumar K 2012-07-04 07:25:48 UTC
http://review.gluster.com/3583

Comment 8 Joe Julian 2012-09-14 07:31:43 UTC
*** Bug 857270 has been marked as a duplicate of this bug. ***

