| Summary: | distribute/replicate: Error on fsync using unfs3booster | | |
|---|---|---|---|
| Product: | [Community] GlusterFS | Reporter: | Shehjar Tikoo <shehjart> |
| Component: | replicate | Assignee: | Vikas Gorur <vikas> |
| Status: | CLOSED NOTABUG | QA Contact: | |
| Severity: | medium | Docs Contact: | |
| Priority: | low | | |
| Version: | 2.0.5 | CC: | gluster-bugs, lakshmipathi, vijay, vikas |
| Target Milestone: | --- | | |
| Target Release: | --- | | |
| Hardware: | All | | |
| OS: | Linux | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | Bug Fix |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | | Type: | --- |
| Regression: | RTNR | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
Description
Shehjar Tikoo
2009-07-25 20:11:07 UTC
The hang on ls -lh is in fact a self-healing bottleneck, as seen in the log:
[2009-07-25 11:07:26] D [libglusterfsclient.c:4330:__glusterfs_stat] XXXXXXXX: 27 1
[2009-07-25 11:07:26] N [trace.c:1888:trace_readdir] tr: 14: (fd=0x1be198c0, size=4096, offset=8192)
[2009-07-25 11:07:26] N [trace.c:272:trace_readdir_cbk] tr: 14 :(op_ret=1, op_errno=0)
[2009-07-25 11:07:26] D [libglusterfsclient.c:1410:libgf_vmp_search_entry] libglusterfsclient: VMP Entry found: /testpath//testfile: /testpath/
[2009-07-25 11:07:26] D [libglusterfsclient-dentry.c:385:libgf_client_path_lookup] XXXXXXX: Need path resolution
[2009-07-25 11:07:26] D [libglusterfsclient-dentry.c:224:__do_path_resolve] libglusterfsclient-dentry: loc->parent(1) already present. sending lookup for 1//testfile
[2009-07-25 11:07:26] D [libglusterfsclient-dentry.c:245:__do_path_resolve] libglusterfsclient-dentry: resolved path(/testfile) till 1(/). sending lookup for remaining path
[2009-07-25 11:07:26] D [libglusterfsclient-dentry.c:307:__do_path_resolve] XXXXXX: Must Lookup basename
[2009-07-25 11:07:26] N [trace.c:1245:trace_lookup] tr: 15: (loc {path=/testfile, ino=0})
[2009-07-25 11:07:26] D [afr-self-heal-data.c:797:afr_sh_data_sync_prepare] repl2: self-healing file /testfile from subvolume brick5client to 1 other
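As an aside (not part of the original report), a quick way to confirm that the post-remount ls hang is self-heal making progress rather than a stall is to watch the AFR pending counters on the backends while the hang is in progress. This is a minimal sketch; the file path and the trusted.afr.* xattr names are the ones shown in the backend outputs later in this report, and the counters should tick down as self-heal copies data across:

# Sketch: poll the pending changelog counters on one backend every 10 seconds.
watch -n 10 "getfattr -d -m 'trusted.afr.*' -e hex /jbod/shehjart/testdir/testfile"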
I am trying to write a 78.12GiB file sequentially using dd as shown below, and I am getting the following error over an NFS mount. The server at the other end is unfs3booster over a fully-loaded distribute-replicate setup.

$ dd if=/dev/zero of=/root/shehjart/mount/testfile bs=65536 count=1280000
dd: closing output file `/root/shehjart/mount/testfile': Input/output error

The relevant log lines while using the trace translator are below:

[2009-07-25 09:10:03] N [trace.c:1738:trace_fsync] tr: 9020257: (flags=0, *fd=0x2aaab561b6f0)
[2009-07-25 09:10:03] D [dht-common.c:1405:dht_err_cbk] dist-repl: subvolume repl1 returned -1 (File descriptor in bad state)
[2009-07-25 09:10:03] N [trace.c:292:trace_fsync_cbk] tr: 9020257: (op_ret=-1, op_errno=77)

It looks like the error is reported to distribute by replicate, but there is no corresponding message from replicate itself. Please note that the log file containing the above messages was generated by a slightly hacked trace translator: the default behaviour of trace is to force the log-level to NORMAL, whereas the lines above are printed because I changed the source to force the log-level to DEBUG.

After the dd command fails, an ls does return the right file size, for example:

[root@client12 shehjart]# ls mount -lh
total 79G
-rw-r--r-- 1 root root 79G Jul 25 2009 testfile

I can't post the full log as it is over 500MB in size.

Another strange thing: if I unmount the NFS client, re-mount it, and then run the previous ls command again, the command just waits there for minutes until it is killed. Does this sound self-healing related?

This almost certainly looks like a case of AFR split-brain. AFR generates the I/O error when self-heal fails because there has been a split-brain (the two servers have been written to separately and the data is now inconsistent). Can you check whether this is actually the case, or whether it is a spurious split-brain? Getting the xattrs on the backends will help:

# getfattr -d -m '.*' <file>

Here is more info. It definitely looks like a case of split-brain, as shown below. After I unmount the NFS export on the NFS client and kill the NFS server and all GlusterFS instances, the following shows the file info from the two replica backends.

First replica:

[root@brick5 shehjart]# ls /jbod/shehjart/testdir/ -lh
total 79G
-rw-r--r-- 1 root root 79G Jul 26 23:12 testfile
[root@brick5 shehjart]# getfattr -d -m '.*' /jbod/shehjart/testdir/testfile
getfattr: Removing leading '/' from absolute path names
# file: jbod/shehjart/testdir/testfile
trusted.afr.brick4client=0sAACwawAAAAAAAAAA
trusted.afr.brick5client=0sAAAAAAAAAAAAAAAA

Second replica:

[root@brick4 shehjart]# ls /jbod/shehjart/testdir/ -lh
total 7.6G
-rw-r--r-- 1 root root 8.4G Jul 26 23:11 testfile
[root@brick4 shehjart]# getfattr -d -m '.*' /jbod/shehjart/testdir/testfile
getfattr: Removing leading '/' from absolute path names
# file: jbod/shehjart/testdir/testfile
trusted.afr.brick4client=0sAACwawAAAAAAAAAA
trusted.afr.brick5client=0sAAAAAAAAAAAAAAAA

Vikas, is that info of any help here?

Here is the data from another test, with the xattrs in hex.
In this case, I killed the dd in the middle of a really large sequential write, but the effect, as shown below, is the same.

First replica:

[root@brick5 shehjart]# ls -lh /jbod/shehjart/testdir/
total 170G
-rw-r--r-- 1 root root 170G Jul 27 02:53 testfile
[root@brick5 shehjart]# getfattr -d -m '.*' -e hex /jbod/shehjart/testdir/testfile
getfattr: Removing leading '/' from absolute path names
# file: jbod/shehjart/testdir/testfile
trusted.afr.brick4client=0x000198950000000000000000
trusted.afr.brick5client=0x000000000000000000000000

Second replica:

[root@brick4 shehjart]# ls -lh /jbod/shehjart/testdir/
total 7.6G
-rw-r--r-- 1 root root 8.1G Jul 27 02:53 testfile
[root@brick4 shehjart]# getfattr -d -m '.*' -e hex /jbod/shehjart/testdir/testfile
getfattr: Removing leading '/' from absolute path names
# file: jbod/shehjart/testdir/testfile
trusted.afr.brick4client=0x000198950000000000000000
trusted.afr.brick5client=0x000000000000000000000000

FWIW, I think this is a false alarm. The /jbod directory on brick4 is a directory on the root filesystem and is not mounted on the real JBOD attached to the system, so the problem is simply the disk running out of space. Still, I feel it is counter-intuitive for replicate not to return ENOSPC when one replica runs out of space; I am pretty sure we are going against the principle of least surprise here.
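For reference (my own reading, not stated in the comments above): each trusted.afr.<client> value is a 12-byte AFR changelog made of three big-endian 32-bit counters, recording pending data, metadata, and entry operations against the subvolume named in the xattr key. Assuming that layout, the values posted above can be decoded with plain shell, and the df line at the end is a quick check of the disk-full hypothesis on brick4:

# Sketch, assuming the 12-byte data/metadata/entry changelog layout.
v=000198950000000000000000
printf 'data=%d metadata=%d entry=%d\n' \
    $((16#${v:0:8})) $((16#${v:8:8})) $((16#${v:16:8}))
# -> data=104597 metadata=0 entry=0 pending operations against brick4client

# The base64 form from the first test decodes the same way:
echo AACwawAAAAAAAAAA | base64 -d | od -An -tx1
# -> 00 00 b0 6b 00 00 ...  (0xb06b = 45163 pending data operations)

# Check whether /jbod on brick4 is really a separate mount and how full it is:
df -h /jbod/shehjart/testdir

If I am reading the changelog semantics correctly, both bricks carry a non-zero pending-data count only in trusted.afr.brick4client while trusted.afr.brick5client stays at zero, i.e. the failed writes are attributed to brick4 alone; in a genuine split-brain each side would also hold non-zero counters against the other. That is consistent with the disk-full explanation and with the bug being closed NOTABUG.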