+++ This bug was initially created as a clone of Bug #1138970 +++

Description of problem:

Using the Fuse client on a distributed, replicated volume results in permanently corrupt data if a file is opened simultaneously for reading and writing and the O_APPEND flag is set. This is a single-reader, single-writer scenario.

The issue is that the Fuse client is sending the wrong offsets to the remote glusterfsd daemons. I verified this with tcpdump and by stracing the target glusterfsd processes. I am confident this is a Fuse-only issue because the problem is not reproducible when the volume is mounted via NFS.

There is a thread on gluster-users about this issue:
http://supercolony.gluster.org/pipermail/gluster-users/2014-September/018652.html

Version-Release number of selected component (if applicable):
3.5.1, 3.5.2

How reproducible:
It is not deterministic (the record number varies), but it occurs 100% of the time with the attached python script.

Steps to Reproduce:
1. python2.7 gluster-fuse-client-interleaved-read-write-corruption.py /path/to/file/mounted/on/gluster

Actual results:
If you see output like "FAILED: expected record id 21 retrieved record id 22" you have reproduced the issue. The record ids vary from run to run.

Expected results:
The program should exit with no output.

Additional info:

--- Additional comment from on 2014-09-06 19:00:00 EDT ---

This script has a tiny syntax bug.

--- Additional comment from on 2014-09-06 19:01:29 EDT ---

Fixes a syntax bug in the first version I uploaded.

--- Additional comment from Pranith Kumar K on 2014-09-07 09:20:52 EDT ---

Raghavendra,
I observed that disabling write-behind prevents the issue. Do you mind taking a look once?

Pranith

--- Additional comment from Shyamsundar on 2014-09-08 10:12:51 EDT ---

@Pranith, was the test to disable write-behind (as a part of comment #3) made on 3.5.2 code or on master?

I am also thinking this patch at master may help with this bug: http://review.gluster.org/#/c/6835/

--- Additional comment from Pranith Kumar K on 2014-09-08 11:08:16 EDT ---

Shyam,
3.5.2 has that patch.

Pranith

--- Additional comment from on 2014-09-12 00:56:42 EDT ---

Has there been any more progress on this? This is still a major concern for our deployment. Is there some way I can better isolate which layer is corrupting the offsets?

--- Additional comment from on 2014-09-12 13:23:43 EDT ---

I have confirmed that disabling write-behind and mounting with the fuse client does allow the test to succeed.
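The attached reproducer script is not included inline. For reference, a minimal sketch of the scenario it exercises (one writer appending fixed-width, sequentially numbered records to an O_APPEND fd while one reader re-reads each record by offset and checks its id) could look like the following; the record format, record count, and function names here are assumptions for illustration, not the attached script.

#!/usr/bin/env python2.7
# Minimal sketch (not the attached reproducer): a single writer appends
# fixed-width numbered records with O_APPEND while a single reader reads
# each record back at its expected offset and checks the record id.
import os
import sys

RECORD = "record %09d\n"   # fixed-width record, 17 bytes
COUNT = 1000               # arbitrary number of records

def run(path):
    rec_len = len(RECORD % 0)
    # Start from an empty file, then append records one by one.
    wfd = os.open(path, os.O_WRONLY | os.O_APPEND | os.O_CREAT | os.O_TRUNC, 0o644)
    reader = open(path, "rb")
    for i in range(COUNT):
        os.write(wfd, RECORD % i)      # append record i
        reader.seek(i * rec_len)       # read it back at its expected offset
        data = reader.read(rec_len)
        if len(data) < rec_len:
            print("FAILED: short read for record id %d" % i)
            return 1
        got = int(data.split()[1])
        if got != i:
            print("FAILED: expected record id %d retrieved record id %d"
                  % (i, got))
            return 1
    os.close(wfd)
    reader.close()
    return 0

if __name__ == "__main__":
    sys.exit(run(sys.argv[1]))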
--- Additional comment from Raghavendra G on 2014-10-13 02:54:34 EDT ---

It seems there is no permanent corruption (though a reader running concurrently with the writer gets a wrong sequence number):

[root@booradley 1138970]# >/mnt/glusterfs/file; ./gluster-fuse-client-interleaved-read-write-corruption.py /mnt/glusterfs/file
[root@booradley 1138970]# cp -fv /mnt/glusterfs/file .
`/mnt/glusterfs/file' -> `./file'
[root@booradley 1138970]# >/mnt/glusterfs/file; ./gluster-fuse-client-interleaved-read-write-corruption.py /mnt/glusterfs/file
[root@booradley 1138970]# >/mnt/glusterfs/file; ./gluster-fuse-client-interleaved-read-write-corruption.py /mnt/glusterfs/file
FAILED: expected record id 58 retrieved record id 59
[root@booradley 1138970]# diff /mnt/glusterfs/file ./file
[root@booradley 1138970]# >/mnt/glusterfs/file; ./gluster-fuse-client-interleaved-read-write-corruption.py /mnt/glusterfs/file
FAILED: expected record id 39 retrieved record id 40
[root@booradley 1138970]# diff /mnt/glusterfs/file ./file

As can be seen above, I am comparing the file from a run where:
1. the script reported an error
2. the script didn't report an error

diff didn't report any difference between the two files. This points to the corruption being transient: the reader gets corrupted data only while executing alongside a concurrent writer.

--- Additional comment from Raghavendra G on 2014-10-13 03:15:31 EDT ---

The issue is reproducible only when BOTH write-behind and stat-prefetch are turned on. If either one of them is off, we don't hit the issue.

--- Additional comment from Raghavendra G on 2014-10-13 03:23:38 EDT ---

Also, the presence/absence of the read-cache translators (read-ahead, io-cache, quick-read) has no impact on the reproducibility of the issue.
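For anyone trying to confirm the same interaction locally, the two translators can be toggled per volume with the gluster CLI; a sketch, assuming the stock option names performance.write-behind and performance.stat-prefetch and a placeholder volume name:

gluster volume set <VOLNAME> performance.write-behind off
gluster volume set <VOLNAME> performance.stat-prefetch off

Re-enabling either is the same command with "on"; going by the observation above, the failure should only reappear once both are back on.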
--- Additional comment from Raghavendra G on 2014-10-13 07:44:09 EDT ---

The culprit seems to be stale stat information cached in stat-prefetch/md-cache. Consider the following sequence of events:

1. A stat is cached with size=10 in md-cache.
2. A write of 10 bytes happens, and the cached stat is flushed.
3. The write is _NOT_ flushed from write-behind and _IS CACHED_ there.
4. Some operation returns stat information in its reply (with size=10, since the write was not flushed to the brick), and this information is cached in md-cache.
5. Now, any fstat call from the application still gets a stale stat with size=10, because the stat is served from md-cache and the fstat call never reaches write-behind, so the pending write is not flushed.

The evidence for this seems to be in the following logs:

[2014-10-13 10:28:23.357618] I [trace.c:2320:trace_writev] 0-ptop-trace: 2966: gfid=496dbe26-3103-4a33-84d4-4facb08c7623 fd=0x18c1f38, count=1, offset=120 flags=08401)
[2014-10-13 10:28:23.357652] I [trace.c:267:trace_writev_cbk] 0-ptop-trace: 2966: (op_ret=10, *prebuf = {gfid=00000000-0000-0000-0000-000000000000 ino=0, mode=0, nlink=0, uid=0, gid=0, size=0, blocks=0, atime=[Jan 01 05:30:00], mtime=[Jan 01 05:30:00], ctime=[Jan 01 05:30:00]}, *postbuf = {gfid=00000000-0000-0000-0000-000000000000 ino=0, mode=0, nlink=0, uid=0, gid=0, size=0, blocks=0, atime=[Jan 01 05:30:00], mtime=[Jan 01 05:30:00], ctime=[Jan 01 05:30:00]})
[2014-10-13 10:28:23.357781] I [trace.c:1225:trace_fstat_cbk] 0-ptop-trace: 2964: gfid=496dbe26-3103-4a33-84d4-4facb08c7623 op_ret=0 buf=gfid=496dbe26-3103-4a33-84d4-4facb08c7623 ino=9571362711184111139, mode=100644, nlink=1, uid=0, gid=0, size=120, blocks=1, atime=[Oct 13 15:58:23], mtime=[Oct 13 15:58:23], ctime=[Oct 13 15:58:23]
[2014-10-13 10:28:23.357870] I [trace.c:2848:trace_fstat] 0-ptop-trace: 2967: gfid=496dbe26-3103-4a33-84d4-4facb08c7623 fd=0x18c1ecc
[2014-10-13 10:28:23.357891] I [trace.c:1225:trace_fstat_cbk] 0-ptop-trace: 2967: gfid=496dbe26-3103-4a33-84d4-4facb08c7623 op_ret=0 buf=gfid=496dbe26-3103-4a33-84d4-4facb08c7623 ino=9571362711184111139, mode=100644, nlink=1, uid=0, gid=0, size=120, blocks=1, atime=[Oct 13 15:58:23], mtime=[Oct 13 15:58:23], ctime=[Oct 13 15:58:23]
[2014-10-13 10:28:23.357931] I [trace.c:2287:trace_readv] 0-ptop-trace: 2968: gfid=496dbe26-3103-4a33-84d4-4facb08c7623 fd=0x18c1ecc, size=4096offset=0 flags=08000)
[2014-10-13 10:28:23.358140] I [trace.c:228:trace_readv_cbk] 0-ptop-trace: 2968: gfid=496dbe26-3103-4a33-84d4-4facb08c7623 op_ret=130 buf=gfid=00000000-0000-0000-0000-000000000000 ino=0, mode=0, nlink=0, uid=0, gid=0, size=0, blocks=0, atime=[Jan 01 05:30:00], mtime=[Jan 01 05:30:00], ctime=[Jan 01 05:30:00]
[2014-10-13 10:28:23.359398] I [trace.c:2848:trace_fstat] 0-ptop-trace: 2969: gfid=496dbe26-3103-4a33-84d4-4facb08c7623 fd=0x18c1ecc
[2014-10-13 10:28:23.359489] I [trace.c:1225:trace_fstat_cbk] 0-ptop-trace: 2969: gfid=496dbe26-3103-4a33-84d4-4facb08c7623 op_ret=0 buf=gfid=496dbe26-3103-4a33-84d4-4facb08c7623 ino=9571362711184111139, mode=100644, nlink=1, uid=0, gid=0, size=120, blocks=1, atime=[Oct 13 15:58:23], mtime=[Oct 13 15:58:23], ctime=[Oct 13 15:58:23]
[2014-10-13 10:28:23.363943] I [trace.c:2320:trace_writev] 0-ptop-trace: 2971: gfid=496dbe26-3103-4a33-84d4-4facb08c7623 fd=0x18c1f38, count=1, offset=120 flags=08401)

As can be seen above, two writes are issued to the same offset 120. Note also that fstat returns a size of 120 even though a write of 10 bytes at offset 120 has already succeeded and its reply has been returned to the application.
For the sake of brevity, I am pasting below just the write traces around offset 120:

[2014-10-13 10:28:23.356296] I [trace.c:2320:trace_writev] 0-ptop-trace: 2963: gfid=496dbe26-3103-4a33-84d4-4facb08c7623 fd=0x18c1f38, count=1, offset=110 flags=08401)
[2014-10-13 10:28:23.357618] I [trace.c:2320:trace_writev] 0-ptop-trace: 2966: gfid=496dbe26-3103-4a33-84d4-4facb08c7623 fd=0x18c1f38, count=1, offset=120 flags=08401)
[2014-10-13 10:28:23.363943] I [trace.c:2320:trace_writev] 0-ptop-trace: 2971: gfid=496dbe26-3103-4a33-84d4-4facb08c7623 fd=0x18c1f38, count=1, offset=120 flags=08401)
[2014-10-13 10:28:23.365178] I [trace.c:2320:trace_writev] 0-ptop-trace: 2974: gfid=496dbe26-3103-4a33-84d4-4facb08c7623 fd=0x18c1f38, count=1, offset=130 flags=08401)

However, there is one hole in this hypothesis. Since we pass the same flags to the underlying filesystem in storage/posix during open, the file is opened in O_APPEND mode on the bricks. From man 2 pwrite, about the interaction of O_APPEND and pwrite:

<man 2 pwrite>
BUGS
POSIX requires that opening a file with the O_APPEND flag should have no effect on the location at which pwrite() writes data. However, on Linux, if a file is opened with O_APPEND, pwrite() appends data to the end of the file, regardless of the value of offset.
</man 2 pwrite>

The logs also show that the last write of 10 bytes happened at offset 970, yet the file size happens to be 990 bytes, as shown below:

[root@booradley 1138970]# ls -l /home/export/ptop/file
-rw-r--r--. 2 root root 990 Oct 13 15:58 /home/export/ptop/file

So, whatever offset is passed to storage/posix is ignored by pwrite when O_APPEND is present. This behaviour is also in line with the observation in comment 8 that the file on the backend is no different when this issue is reported than when no issue is reported.
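The pwrite()/O_APPEND interaction quoted from the man page is easy to confirm outside of gluster; a minimal sketch, assuming Python 3.3+ for os.pwrite and an arbitrary temporary file (this only illustrates the man-page note, it is not GlusterFS code):

import os
import tempfile

# Illustrates the Linux pwrite()/O_APPEND behaviour quoted above: with
# O_APPEND set, pwrite() ignores the supplied offset and appends instead.
fd, path = tempfile.mkstemp()
os.close(fd)

fd = os.open(path, os.O_WRONLY | os.O_APPEND)
os.pwrite(fd, b"0123456789", 0)   # 10 bytes "at offset 0"
os.pwrite(fd, b"0123456789", 0)   # another 10 bytes, again "at offset 0"
print(os.fstat(fd).st_size)       # prints 20 on Linux, not 10
os.close(fd)
os.remove(path)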
--- Additional comment from Raghavendra G on 2014-10-13 07:45:03 EDT ---

Cloning to 3.1. To be fixed in a future release.