Created attachment 935083 [details]
A small python script that reproduces the file corruption

Description of problem:
Using the Fuse client on a distributed, replicated volume results in permanently corrupt data if the file is opened simultaneously for reading and writing and the O_APPEND flag is set. This is a single-reader, single-writer scenario. The issue is that the Fuse client is sending the wrong offsets to the remote glusterfsd daemons; I verified this with tcpdump and by stracing the target glusterfsd processes. I am confident this is a Fuse-only issue because the problem is not reproducible when the volume is mounted via NFS.

There is a thread on gluster-users about this issue:
http://supercolony.gluster.org/pipermail/gluster-users/2014-September/018652.html

Version-Release number of selected component (if applicable):
3.5.1, 3.5.2

How reproducible:
It is not deterministic (the record number varies), but it occurs 100% of the time with the attached python script.

Steps to Reproduce:
1. python2.7 gluster-fuse-client-interleaved-read-write-corruption.py /path/to/file/mounted/on/gluster

Actual results:
If you see output like "FAILED: expected record id 21 retrieved record id 22", you have reproduced the issue. The record ids vary from run to run.

Expected results:
The program should exit with no output.

Additional info:
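The attached script is not inlined in this report. Purely as an illustration, here is a minimal sketch of the access pattern it exercises (one process appending fixed-size records, another reading and verifying them in order); the record layout, names, and timings below are assumptions, not the attached code.

# Minimal sketch of the access pattern, NOT the attached reproducer.
# Record layout, function names and sleep intervals are illustrative assumptions.
import struct
import time

RECORD_FMT = '>I'                        # one 4-byte big-endian record id
RECORD_SIZE = struct.calcsize(RECORD_FMT)

def writer(path, count=1000):
    f = open(path, 'ab', 0)              # O_APPEND, unbuffered
    for rid in range(count):
        f.write(struct.pack(RECORD_FMT, rid))
        time.sleep(0.001)
    f.close()

def reader(path, count=1000):
    f = open(path, 'rb', 0)
    expected = 0
    while expected < count:
        f.seek(expected * RECORD_SIZE)
        data = f.read(RECORD_SIZE)
        if len(data) < RECORD_SIZE:      # record not fully visible yet; retry
            time.sleep(0.001)
            continue
        rid = struct.unpack(RECORD_FMT, data)[0]
        if rid != expected:
            print('FAILED: expected record id %d retrieved record id %d'
                  % (expected, rid))
            return
        expected += 1
    f.close()

# Run writer() and reader() concurrently (e.g. in two processes) against a
# file on the FUSE mount to exercise the interleaved read/write pattern.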
Comment on attachment 935083 [details]
A small python script that reproduces the file corruption

This script has a tiny syntax bug.
Created attachment 935084 [details]
Python script to reproduce corruption

Fixes a syntax bug in the first version I uploaded.
Raghavendra, I observed that disabling write-behind prevents the issue. Do you mind taking a look?

Pranith
@Pranith, was the test to disable write-behind (as a part of comment #3) done on 3.5.2 code or on master? I am thinking of this patch on master, http://review.gluster.org/#/c/6835/, which may help with this bug.
Shyam, 3.5.2 has that patch. Pranith
Has there been any more progress on this? This is still a major concern for our deployment. Is there some way I can better isolate which layer is corrupting the offsets?
I have confirmed that disabling write-behind and mounting with the fuse client does allow the test to succeed.
It seems there is no permanent corruption (though a reader running concurrently with the writer gets a wrong sequence number).

[root@booradley 1138970]# >/mnt/glusterfs/file; ./gluster-fuse-client-interleaved-read-write-corruption.py /mnt/glusterfs/file
[root@booradley 1138970]# cp -fv /mnt/glusterfs/file .
`/mnt/glusterfs/file' -> `./file'
[root@booradley 1138970]# >/mnt/glusterfs/file; ./gluster-fuse-client-interleaved-read-write-corruption.py /mnt/glusterfs/file
[root@booradley 1138970]# >/mnt/glusterfs/file; ./gluster-fuse-client-interleaved-read-write-corruption.py /mnt/glusterfs/file
FAILED: expected record id 58 retrieved record id 59
[root@booradley 1138970]# diff /mnt/glusterfs/file ./file
[root@booradley 1138970]# >/mnt/glusterfs/file; ./gluster-fuse-client-interleaved-read-write-corruption.py /mnt/glusterfs/file
FAILED: expected record id 39 retrieved record id 40
[root@booradley 1138970]# diff /mnt/glusterfs/file ./file

As can be seen above, I am comparing the resulting file for two cases:
1. the script reported an error
2. the script didn't report an error

diff did not report any difference between the two files. This points to the corruption being transient: the reader gets corrupted data only while executing with a concurrent writer.
The issue is reproducible only when BOTH write-behind and stat-prefetch are turned on. If either one of them is off, we don't hit the issue.
Also, the presence or absence of the read-caching translators (read-ahead, io-cache, quick-read) has no impact on the reproducibility of the issue.
The culprit seems to be stale stat information cached in stat-prefetch/md-cache. Consider the following sequence of events:

1. A stat is cached with size=10 in md-cache.
2. A write of 10 bytes happens, and the cached stat is flushed.
3. The write is _NOT_ flushed from write-behind and _IS CACHED_.
4. An operation's reply carries stat information (with size=10, since the write was not flushed to the brick) and this information is cached in md-cache.
5. Now, any fstat calls from applications still get a stale stat with size=10. This is because the stat is served from md-cache and the fstat call never reaches write-behind, hence the cached write is never flushed.

The evidence for this can be seen in the following logs:

[2014-10-13 10:28:23.357618] I [trace.c:2320:trace_writev] 0-ptop-trace: 2966: gfid=496dbe26-3103-4a33-84d4-4facb08c7623 fd=0x18c1f38, count=1, offset=120 flags=08401)
[2014-10-13 10:28:23.357652] I [trace.c:267:trace_writev_cbk] 0-ptop-trace: 2966: (op_ret=10, *prebuf = {gfid=00000000-0000-0000-0000-000000000000 ino=0, mode=0, nlink=0, uid=0, gid=0, size=0, blocks=0, atime=[Jan 01 05:30:00], mtime=[Jan 01 05:30:00], ctime=[Jan 01 05:30:00]}, *postbuf = {gfid=00000000-0000-0000-0000-000000000000 ino=0, mode=0, nlink=0, uid=0, gid=0, size=0, blocks=0, atime=[Jan 01 05:30:00], mtime=[Jan 01 05:30:00], ctime=[Jan 01 05:30:00]})
[2014-10-13 10:28:23.357781] I [trace.c:1225:trace_fstat_cbk] 0-ptop-trace: 2964: gfid=496dbe26-3103-4a33-84d4-4facb08c7623 op_ret=0 buf=gfid=496dbe26-3103-4a33-84d4-4facb08c7623 ino=9571362711184111139, mode=100644, nlink=1, uid=0, gid=0, size=120, blocks=1, atime=[Oct 13 15:58:23], mtime=[Oct 13 15:58:23], ctime=[Oct 13 15:58:23]
[2014-10-13 10:28:23.357870] I [trace.c:2848:trace_fstat] 0-ptop-trace: 2967: gfid=496dbe26-3103-4a33-84d4-4facb08c7623 fd=0x18c1ecc
[2014-10-13 10:28:23.357891] I [trace.c:1225:trace_fstat_cbk] 0-ptop-trace: 2967: gfid=496dbe26-3103-4a33-84d4-4facb08c7623 op_ret=0 buf=gfid=496dbe26-3103-4a33-84d4-4facb08c7623 ino=9571362711184111139, mode=100644, nlink=1, uid=0, gid=0, size=120, blocks=1, atime=[Oct 13 15:58:23], mtime=[Oct 13 15:58:23], ctime=[Oct 13 15:58:23]
[2014-10-13 10:28:23.357931] I [trace.c:2287:trace_readv] 0-ptop-trace: 2968: gfid=496dbe26-3103-4a33-84d4-4facb08c7623 fd=0x18c1ecc, size=4096offset=0 flags=08000)
[2014-10-13 10:28:23.358140] I [trace.c:228:trace_readv_cbk] 0-ptop-trace: 2968: gfid=496dbe26-3103-4a33-84d4-4facb08c7623 op_ret=130 buf=gfid=00000000-0000-0000-0000-000000000000 ino=0, mode=0, nlink=0, uid=0, gid=0, size=0, blocks=0, atime=[Jan 01 05:30:00], mtime=[Jan 01 05:30:00], ctime=[Jan 01 05:30:00]
[2014-10-13 10:28:23.359398] I [trace.c:2848:trace_fstat] 0-ptop-trace: 2969: gfid=496dbe26-3103-4a33-84d4-4facb08c7623 fd=0x18c1ecc
[2014-10-13 10:28:23.359489] I [trace.c:1225:trace_fstat_cbk] 0-ptop-trace: 2969: gfid=496dbe26-3103-4a33-84d4-4facb08c7623 op_ret=0 buf=gfid=496dbe26-3103-4a33-84d4-4facb08c7623 ino=9571362711184111139, mode=100644, nlink=1, uid=0, gid=0, size=120, blocks=1, atime=[Oct 13 15:58:23], mtime=[Oct 13 15:58:23], ctime=[Oct 13 15:58:23]
[2014-10-13 10:28:23.363943] I [trace.c:2320:trace_writev] 0-ptop-trace: 2971: gfid=496dbe26-3103-4a33-84d4-4facb08c7623 fd=0x18c1f38, count=1, offset=120 flags=08401)

As can be seen, two writes are issued to the same offset 120 above. Also note that fstat returns size 120 even though a write of 10 bytes at offset 120 succeeded and its reply was already returned to the application.
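As a toy illustration of the sequence above (plain python, not gluster code; all class and attribute names here are invented), the key point is that a stat served from the cache never forces the pending write to be flushed:

# Toy simulation of the sequence above; not gluster code, names invented.
class FakeWriteBehind(object):
    def __init__(self, brick_size):
        self.brick_size = brick_size     # bytes actually on the brick
        self.pending = []                # cached (not yet flushed) write sizes

    def writev(self, nbytes):
        self.pending.append(nbytes)      # step 3: write is cached, not flushed

    def flush(self):
        self.brick_size += sum(self.pending)
        self.pending = []

class FakeMdCache(object):
    def __init__(self, wb):
        self.wb = wb
        self.cached_size = wb.brick_size  # step 1: stat cached with size=10

    def fstat(self):
        # step 5: served from the cache; the call never reaches write-behind,
        # so the pending write is never flushed before answering
        return self.cached_size

wb = FakeWriteBehind(brick_size=10)
md = FakeMdCache(wb)
wb.writev(10)       # steps 2-4 (collapsed): a 10-byte append is cached,
                    # and replies keep re-populating md-cache with size=10
print(md.fstat())   # prints 10: applications see a stale size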
For the sake of brevity I am pasting just the write traces around offset 120 below:

[2014-10-13 10:28:23.356296] I [trace.c:2320:trace_writev] 0-ptop-trace: 2963: gfid=496dbe26-3103-4a33-84d4-4facb08c7623 fd=0x18c1f38, count=1, offset=110 flags=08401)
[2014-10-13 10:28:23.357618] I [trace.c:2320:trace_writev] 0-ptop-trace: 2966: gfid=496dbe26-3103-4a33-84d4-4facb08c7623 fd=0x18c1f38, count=1, offset=120 flags=08401)
[2014-10-13 10:28:23.363943] I [trace.c:2320:trace_writev] 0-ptop-trace: 2971: gfid=496dbe26-3103-4a33-84d4-4facb08c7623 fd=0x18c1f38, count=1, offset=120 flags=08401)
[2014-10-13 10:28:23.365178] I [trace.c:2320:trace_writev] 0-ptop-trace: 2974: gfid=496dbe26-3103-4a33-84d4-4facb08c7623 fd=0x18c1f38, count=1, offset=130 flags=08401)

However, there is one hole in this hypothesis. Since we pass the same flags to the underlying fs in storage/posix during open, the file is opened in O_APPEND mode on the bricks. From man 2 pwrite, on the interaction of O_APPEND and pwrite:

<man 2 pwrite>
BUGS
POSIX requires that opening a file with the O_APPEND flag should have no effect on the location at which pwrite() writes data. However, on Linux, if a file is opened with O_APPEND, pwrite() appends data to the end of the file, regardless of the value of offset.
</man 2 pwrite>

The logs also show that the last 10-byte write happened at offset 970, yet the file size is 990 bytes, as shown below:

[root@booradley 1138970]# ls -l /home/export/ptop/file
-rw-r--r--. 2 root root 990 Oct 13 15:58 /home/export/ptop/file

So whatever offset is passed to storage/posix is ignored by pwrite when O_APPEND is present. This behaviour is also in line with the observation in comment 8 that the file on the backend is no different when this issue is reported than when it is not.
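The man-page behaviour quoted above can be reproduced with a standalone snippet on a local filesystem (this is just an illustration of the Linux semantics, not of gluster; it needs os.pwrite, i.e. python 3.3+):

# Demonstrates that on Linux pwrite() ignores the offset when the fd
# was opened with O_APPEND.
import os
import tempfile

fd, path = tempfile.mkstemp()
os.close(fd)

fd = os.open(path, os.O_WRONLY | os.O_APPEND)
os.write(fd, b'0123456789')        # file is now 10 bytes
os.pwrite(fd, b'ABCDEFGHIJ', 0)    # offset 0 is ignored because of O_APPEND
os.close(fd)

print(os.stat(path).st_size)       # prints 20 on Linux, not 10
os.unlink(path)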
Created attachment 946342 [details]
trace logs
Flipping the order of the md-cache and write-behind xlators in the xlator graph makes the issue go away. Combined with the observation that the file is never corrupted on the back-end, this points to caching gone bad in the layers above glusterfs, either the linux vfs or the python File object libraries in this case. I tried to verify this by setting --direct-io-mode=yes during the glusterfs mount; however, the attached script hangs in an infinite loop while trying to read the file (except for the first few reads, reads are not sent to glusterfs at all).

So, there are two possible fixes to this problem:

1. Have some communication between write-behind and md-cache to indicate that the stat is invalid while writes are cached by write-behind. The communication can be explicit through xdata, or implicit in the form of a zeroed-out iatt, etc. (see the sketch below)
2. Flip the order of write-behind and md-cache in the xlator graph.
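To make option 1 concrete, here is a toy continuation of the earlier simulation (again invented names, not actual xlator code): while writes are pending in write-behind, the stat handed upward is marked invalid, so md-cache falls through to the lower layers instead of serving a stale size.

# Toy sketch of option 1; invented names, not actual xlator code.
class FakeWriteBehind(object):
    def __init__(self, brick_size):
        self.brick_size = brick_size
        self.pending = []

    def writev(self, nbytes):
        self.pending.append(nbytes)

    def flush(self):
        self.brick_size += sum(self.pending)
        self.pending = []

    def stat_in_reply(self):
        # If writes are cached, hand back an "invalid" stat so the layer
        # above knows not to cache it (the zeroed-out iatt idea).
        return None if self.pending else self.brick_size

class FakeMdCache(object):
    def __init__(self, wb):
        self.wb = wb
        self.cached_size = None

    def on_reply(self):
        self.cached_size = self.wb.stat_in_reply()   # None => don't cache

    def fstat(self):
        if self.cached_size is None:
            self.wb.flush()          # cache miss: the call goes down the stack
            return self.wb.brick_size
        return self.cached_size

wb = FakeWriteBehind(brick_size=10)
md = FakeMdCache(wb)
wb.writev(10)
md.on_reply()
print(md.fstat())    # prints 20: the stale size is no longer served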
I should have been clearer while specifying the xlator graph. The order I am suggesting is:

Fuse-bridge -> io-stats -> open-behind -> quick-read -> io-cache -> read-ahead -> write-behind -> md-cache -> distribute -> replicate ...
From what I recollect, write-behind already explicitly wipes the post-iatt in write_cbk, so md-cache should not be caching it.

My impression from looking at the backlog is that you found the corruption symptoms disappeared in an experiment where the order of the xlators was reversed, and that based on this alone you are suggesting the change in xlator order itself as a fix. It appears it is still not clear what the exact reason for the corruption symptoms is. There is uncertainty at various levels: whether the "caching" problem is in the VFS or within gluster itself, the unexplained "hang" of reads with direct-io-mode=yes, etc. Please do not propose or go ahead with a fix until there is a clear understanding of the exact root cause.
I changed the python script not to do buffered reads/writes, by passing 0 as the buffer length (the third argument) to open:

writer = RecordWriter(open(fname, 'a', 0))
reader = RecordReader(open(fname, 'r', 0))

The issue is still reproducible. The test script also still hangs when run on a gluster volume mounted with direct-io-mode=yes. Following is the output of strace:

[pid 22338] set_robust_list(0x7f2a364f2a20, 0x18) = 0
[pid 22337] lseek(4, 0, SEEK_CUR) = 0
[pid 22337] read(4, <unfinished ...>
[pid 22338] write(3, "\0\0\0\1\0001\00010", 9 <unfinished ...>
[pid 22337] <... read resumed> "", 8) = 0
[pid 22337] select(0, NULL, NULL, NULL, {0, 1000} <unfinished ...>
[pid 22338] <... write resumed> ) = 9
[pid 22338] select(0, NULL, NULL, NULL, {0, 3000} <unfinished ...>
[pid 22337] <... select resumed> ) = 0 (Timeout)
[pid 22337] lseek(4, 0, SEEK_SET) = 0
[pid 22337] read(4, "\0\0\0\0\0\0\0\0", 8) = 8
[pid 22337] select(0, NULL, NULL, NULL, {0, 1000}) = 0 (Timeout)
[pid 22337] lseek(4, 8, SEEK_SET) = 8
[pid 22337] select(0, NULL, NULL, NULL, {0, 1000} <unfinished ...>
[pid 22338] <... select resumed> ) = 0 (Timeout)
[pid 22338] write(3, "\0\0\0\1\0002\00021", 9 <unfinished ...>
[pid 22337] <... select resumed> ) = 0 (Timeout)

After the last read in the above output, process 22337 (the reader) never issues a read or fstat. It spends all its time in an infinite repetition of:

[pid 22337] lseek(4, 8, SEEK_SET) = 8
[pid 22337] select(0, NULL, NULL, NULL, {0, 1000}) = 0 (Timeout)

This behaviour of the reader is puzzling: without issuing a stat or read call, how can it detect whether the file is growing? I think this is why the reader is stuck in the infinite loop of the _read function in the attached test script. So, the remaining assumption to verify is whether the issue is caused by buffering in the kernel VFS layer, but the above problem is preventing me from verifying it.
<Avati>
My impression from looking at the backlog is that you found the corruption symptoms disappeared in an experiment where the order of the xlators was reversed, and that based on this alone you are suggesting the change in xlator order itself as a fix.
</Avati>

No, I arrived at the experiment by observing that the file was never really corrupted on the bricks, which pointed to some caching gone bad. I had disabled io-cache, read-ahead and quick-read on the client and was still able to reproduce the issue. This left three components which might be doing caching:

1. the python libraries
2. the client-side vfs
3. the vfs on the brick

Since I had seen two writes being issued to the same offset (which should never happen when this bug is not hit), I suspected something going wrong with the stat as seen by python and the vfs. So I suspected caching in 1 and 2 and tried flipping the order of md-cache and write-behind. In any case, to ascertain the actual cause I need to find some way to test 2 above.
Update on the script hang when run on glusterfs mounted with direct-io-mode on:

<read_record>
        if not header_data:
            return None
        data_len, checksum = struct.unpack(self.record_header_format, header_data)
        data = self._read(data_len)
</read_record>

It seems we are getting an invalid header_data which, when unpacked, gives both data_len and checksum as 0. The _read function then goes into an infinite loop when called with size 0. Though this is a bug to be fixed in _read, the more interesting question is why we got a non-NULL header_data whose length and checksum fields are 0. The reason: even though glusterfs returns 0 bytes in the read reply (I also checked the file size in stat, which was 0), the read call on the file object returns an object whose len() is 8 bytes. I checked through the 2.7.8 python source code but couldn't find anything suspicious. I will try to work around this problem and get the script running with direct-io-mode on.
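For reference, the kind of guard _read needs might look like the sketch below. This is NOT the code from the attached script (which is not shown here); the standalone function name and the blocking-retry behaviour are assumptions.

import time

# Hedged sketch of a zero-length guard for a blocking "read exactly n bytes"
# helper; names and retry behaviour are assumptions, not the attached script.
def read_exactly(fileobj, size, poll_interval=0.001):
    if size <= 0:
        # A header that unpacks to data_len == 0 must not make us loop
        # forever waiting for bytes that will never arrive.
        return ''
    chunks = []
    remaining = size
    while remaining > 0:
        chunk = fileobj.read(remaining)
        if not chunk:
            time.sleep(poll_interval)   # file has not grown yet; retry
            continue
        chunks.append(chunk)
        remaining -= len(chunk)
    return ''.join(chunks)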
I know this is an old thread, but this one just caught us out too. Disabling stat-prefetch on the volume prevents the issue in our case, but can I please confirm whether this issue has been fixed or whether it is still present in the latest release?

Thanks
This bug is being closed because the 3.5 release is marked End-Of-Life. There will be no further updates to this version. Please open a new bug against a version that still receives bugfixes if you are still facing this issue in a more current release.