Bug 761897 (GLUSTER-165)

Summary: distribute/replicate: Error on fsync using unfs3booster
Product: [Community] GlusterFS
Component: replicate
Version: 2.0.5
Reporter: Shehjar Tikoo <shehjart>
Assignee: Vikas Gorur <vikas>
CC: gluster-bugs, lakshmipathi, vijay, vikas
Status: CLOSED NOTABUG
Severity: medium
Priority: low
Hardware: All
OS: Linux
Doc Type: Bug Fix
Regression: RTNR

Description Shehjar Tikoo 2009-07-25 20:11:07 UTC
This is actually seen on 2.0.5, but that version is not yet listed in the Bugzilla setup.

Comment 1 Shehjar Tikoo 2009-07-25 20:15:09 UTC
The hang on ls -lh is in fact a self-healing bottleneck, as seen in the log:

[2009-07-25 11:07:26] D [libglusterfsclient.c:4330:__glusterfs_stat] XXXXXXXX: 27 1
[2009-07-25 11:07:26] N [trace.c:1888:trace_readdir] tr: 14: (fd=0x1be198c0, size=4096, offset=8192)
[2009-07-25 11:07:26] N [trace.c:272:trace_readdir_cbk] tr: 14 :(op_ret=1, op_errno=0)
[2009-07-25 11:07:26] D [libglusterfsclient.c:1410:libgf_vmp_search_entry] libglusterfsclient: VMP Entry found: /testpath//testfile: /testpath/
[2009-07-25 11:07:26] D [libglusterfsclient-dentry.c:385:libgf_client_path_lookup] XXXXXXX: Need path resolution
[2009-07-25 11:07:26] D [libglusterfsclient-dentry.c:224:__do_path_resolve] libglusterfsclient-dentry: loc->parent(1) already present. sending lookup for 1//testfile
[2009-07-25 11:07:26] D [libglusterfsclient-dentry.c:245:__do_path_resolve] libglusterfsclient-dentry: resolved path(/testfile) till 1(/). sending lookup for remaining path
[2009-07-25 11:07:26] D [libglusterfsclient-dentry.c:307:__do_path_resolve] XXXXXX: Must Lookup basename
[2009-07-25 11:07:26] N [trace.c:1245:trace_lookup] tr: 15: (loc {path=/testfile, ino=0})
[2009-07-25 11:07:26] D [afr-self-heal-data.c:797:afr_sh_data_sync_prepare] repl2: self-healing file /testfile from subvolume brick5client to 1 other

Comment 2 Shehjar Tikoo 2009-07-25 23:09:05 UTC
I am trying to write a 78.12 GiB file sequentially over an NFS mount using dd as shown below, and I am getting the following error. The server at the other end is unfs3booster running over a fully loaded distribute-replicate setup.

 $ dd if=/dev/zero of=/root/shehjart/mount/testfile bs=65536 count=1280000
dd: closing output file `/root/shehjart/mount/testfile': Input/output error
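
For reference, the size dd is expected to write is just bs times count, which works out to 78.125 GiB:

# Expected size of the dd write above: block size times block count.
size_bytes = 65536 * 1280000
print(size_bytes)               # 83886080000 bytes
print(size_bytes / 1024.0**3)   # 78.125 GiB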

The relevant log lines while using the trace translator are below:
[2009-07-25 09:10:03] N [trace.c:1738:trace_fsync] tr: 9020257: (flags=0, *fd=0x2aaab561b6f0)
[2009-07-25 09:10:03] D [dht-common.c:1405:dht_err_cbk] dist-repl: subvolume repl1 returned -1 (File descriptor in bad state)
[2009-07-25 09:10:03] N [trace.c:292:trace_fsync_cbk] tr: 9020257: (op_ret=-1, op_errno=77)
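
For reference, op_errno=77 in the fsync callback above corresponds to EBADFD on Linux, matching the "File descriptor in bad state" message from distribute. A quick way to confirm the mapping (plain Python, nothing GlusterFS-specific):

import errno, os

# op_errno=77 from the trace_fsync_cbk line above
print(errno.errorcode[77])   # 'EBADFD'
print(os.strerror(77))       # 'File descriptor in bad state'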


It looks like the error is reported to distribute by replicate, but there is no corresponding message from replicate itself.

Please note that the log file containing the above messages was generated by a slightly hacked trace translator. The default behaviour of trace is to force the log level to normal; the above lines appear only because I changed the source to force the log level to debug.

After the dd command fails, an ls does return the right file size, for example:

[root@client12 shehjart]# ls mount -lh
total 79G
-rw-r--r-- 1 root root 79G Jul 25  2009 testfile

I can't post the full log as it is over 500 MB in size.

Another strange thing: if I unmount the NFS client and re-mount it, then run the previous ls command again, the command just waits there for minutes until killed. Does this sound self-heal related?

Comment 3 Vikas Gorur 2009-07-26 07:39:27 UTC
This almost certainly looks like a case of AFR split-brain. AFR generates the I/O error when self-heal fails because there has been a split-brain (the two servers have been written to separately and the data is now inconsistent).

Can you check whether this is actually the case, or whether it is a spurious split-brain?

Getting the xattrs on the backends will help.

# getfattr -d -m '.*' <file>

Comment 4 Shehjar Tikoo 2009-07-27 03:35:08 UTC
Here is more info.

It definitely looks like a case of split-brain, as shown below.

After unmounting the NFS export on the NFS client and killing the NFS server and all GlusterFS instances, the following shows the file info from the two replica backends:

First replica:
[root@brick5 shehjart]# ls /jbod/shehjart/testdir/ -lh
total 79G
-rw-r--r-- 1 root root 79G Jul 26 23:12 testfile
[root@brick5 shehjart]# getfattr -d -m '.*' /jbod/shehjart/testdir/testfile
getfattr: Removing leading '/' from absolute path names
# file: jbod/shehjart/testdir/testfile
trusted.afr.brick4client=0sAACwawAAAAAAAAAA
trusted.afr.brick5client=0sAAAAAAAAAAAAAAAA

Second replica:
[root@brick4 shehjart]# ls /jbod/shehjart/testdir/ -lh
total 7.6G
-rw-r--r-- 1 root root 8.4G Jul 26 23:11 testfile
[root@brick4 shehjart]# getfattr -d -m '.*' /jbod/shehjart/testdir/testfile
getfattr: Removing leading '/' from absolute path names
# file: jbod/shehjart/testdir/testfile
trusted.afr.brick4client=0sAACwawAAAAAAAAAA
trusted.afr.brick5client=0sAAAAAAAAAAAAAAAA


Vikas,
Is that info of any help here?

Comment 5 Shehjar Tikoo 2009-07-27 06:58:17 UTC
Here is the data from another test, with the xattrs in hex. In this case I killed the dd in the middle of a really large sequential write, but the effect, as shown below, is the same:

First replica:
[root@brick5 shehjart]# ls -lh /jbod/shehjart/testdir/
total 170G
-rw-r--r-- 1 root root 170G Jul 27 02:53 testfile
[root@brick5 shehjart]# getfattr -d -m '.*' -e hex /jbod/shehjart/testdir/testfile
getfattr: Removing leading '/' from absolute path names
# file: jbod/shehjart/testdir/testfile
trusted.afr.brick4client=0x000198950000000000000000
trusted.afr.brick5client=0x000000000000000000000000


Second replica:
[root@brick4 shehjart]# ls -lh /jbod/shehjart/testdir/
total 7.6G
-rw-r--r-- 1 root root 8.1G Jul 27 02:53 testfile
[root@brick4 shehjart]# getfattr -d -m '.*' -e hex /jbod/shehjart/testdir/testfile
getfattr: Removing leading '/' from absolute path names
# file: jbod/shehjart/testdir/testfile
trusted.afr.brick4client=0x000198950000000000000000
trusted.afr.brick5client=0x000000000000000000000000
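
For what it's worth, here is a small Python sketch I used to read these values. It assumes AFR's changelog layout of three 32-bit big-endian counters for pending data, metadata and entry operations, and that getfattr's "0s"/"0x" prefixes mean base64/hex; the decode_afr_changelog helper is just mine, not anything shipped with GlusterFS:

import base64, binascii, struct

def decode_afr_changelog(value):
    # value is a trusted.afr.* string as printed by getfattr:
    # "0s..." means base64, "0x..." means hex. Assumed layout:
    # 12 bytes = three 32-bit big-endian counters of pending
    # data, metadata and entry operations.
    if value.startswith("0s"):
        raw = base64.b64decode(value[2:])
    elif value.startswith("0x"):
        raw = binascii.unhexlify(value[2:])
    else:
        raise ValueError("unexpected getfattr encoding: %r" % value)
    data, metadata, entry = struct.unpack(">III", raw)
    return {"data": data, "metadata": metadata, "entry": entry}

# Values from comments 4 and 5:
print(decode_afr_changelog("0sAACwawAAAAAAAAAA"))           # {'data': 45163, 'metadata': 0, 'entry': 0}
print(decode_afr_changelog("0x000198950000000000000000"))   # {'data': 104597, 'metadata': 0, 'entry': 0}

If I am reading the layout right, both copies carry a non-zero pending data count against brick4client and zeros against brick5client, i.e. both replicas blame brick4 rather than each other.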

Comment 6 Shehjar Tikoo 2009-07-27 07:03:00 UTC
FWIW, I think this is a false alarm. The /jbod directory on brick4 is a directory on the root filesystem, not a mount of the real JBOD attached to the system. The problem is the disk running out of space.

Comment 7 Shehjar Tikoo 2009-07-27 12:12:35 UTC
Still, I feel it is counter-intuitive for replicate not to return ENOSPC when one replica runs out of space. I am pretty sure we are going against the principle of least surprise here.