Bug 1153892 - file corruption during concurrent read/write
Summary: file corruption during concurrent read/write
Keywords:
Status: CLOSED DEFERRED
Alias: None
Product: Red Hat Gluster Storage
Classification: Red Hat Storage
Component: core
Version: 2.1
Hardware: Unspecified
OS: Linux
Priority: unspecified
Severity: urgent
Target Milestone: ---
Assignee: Raghavendra G
QA Contact: storage-qa-internal@redhat.com
URL:
Whiteboard:
Depends On: 1138970
Blocks: 1286102
 
Reported: 2014-10-17 05:14 UTC by Nagaprasad Sathyanarayana
Modified: 2016-09-17 14:43 UTC
CC: 11 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of: 1138970
Environment:
Last Closed: 2015-11-27 10:49:01 UTC



Description Nagaprasad Sathyanarayana 2014-10-17 05:14:44 UTC
+++ This bug was initially created as a clone of Bug #1138970 +++

Description of problem:

Using the Fuse client on a distributed, replicated volume results in permanently corrupt data if the file is opened simultaneously for reading and writing and the O_APPEND flag is set. This is a single reader, single writer scenario.

The issue is that the Fuse client is sending the wrong offsets to the remote glusterfsd daemons. I verified this with tcpdump and by stracing the target glusterfsd processes.

I am confident this is a Fuse-only issue because the problem is not reproducible when the volume is mounted via NFS.

There is a thread on gluster-users about this issue:

http://supercolony.gluster.org/pipermail/gluster-users/2014-September/018652.html

Version-Release number of selected component (if applicable):
3.5.1, 3.5.2

How reproducible:

It is not deterministic (the record number varies), but it occurs 100% of the time with the attached python script.

Steps to Reproduce:
1. python2.7 gluster-fuse-client-interleaved-read-write-corruption.py /path/to/file/mounted/on/gluster

Actual results:
If you see output like "FAILED: expected record id 21 retrieved record id 22" you have reproduced the issue. The record ids vary from run to run.

Expected results:
The program should exit with no output.

Additional info:
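
For readers without access to the attachment, here is a minimal sketch of the kind of workload involved. This is an illustrative approximation only, NOT the attached gluster-fuse-client-interleaved-read-write-corruption.py; the record size, record count, and timing are assumptions:

import os
import sys
import time

# One writer appends fixed-size numbered records with O_APPEND while a
# concurrent reader reads them back sequentially and checks that the ids
# are consecutive (single reader, single writer, same FUSE mount).
path = sys.argv[1]
RECORDS = 200
RECSIZE = 10                      # "%09d\n" -> 10 bytes per record

open(path, "wb").close()          # start with an empty file

pid = os.fork()
if pid == 0:
    # writer child: append one record at a time
    fd = os.open(path, os.O_WRONLY | os.O_APPEND)
    for i in range(RECORDS):
        os.write(fd, ("%09d\n" % i).encode())
        time.sleep(0.001)
    os.close(fd)
    os._exit(0)

# reader parent: verify the records come back in order
expected = 0
f = open(path, "rb")
while expected < RECORDS:
    rec = f.read(RECSIZE)
    if len(rec) < RECSIZE:        # writer has not caught up yet; retry
        time.sleep(0.001)
        f.seek(expected * RECSIZE)
        continue
    got = int(rec)
    if got != expected:
        print("FAILED: expected record id %d retrieved record id %d" % (expected, got))
        break
    expected += 1
f.close()
os.waitpid(pid, 0)

On an affected FUSE mount, a run of a script along these lines is expected to eventually print a FAILED line similar to the one quoted above; on a healthy mount it prints nothing.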

--- Additional comment from  on 2014-09-06 19:00:00 EDT ---

This script has a tiny syntax bug.

--- Additional comment from  on 2014-09-06 19:01:29 EDT ---

Fixes a syntax bug in the first version I uploaded.

--- Additional comment from Pranith Kumar K on 2014-09-07 09:20:52 EDT ---

Raghavendra,
    I observed that disabling write-behind prevents the issue. Do you mind taking a look?

Pranith

--- Additional comment from Shyamsundar on 2014-09-08 10:12:51 EDT ---

@Pranith, was the test to disable write-behind (as part of comment #3) done on 3.5.2 code or on master?

Also, could this patch on master, http://review.gluster.org/#/c/6835/, help with this bug?

--- Additional comment from Pranith Kumar K on 2014-09-08 11:08:16 EDT ---

Shyam,
     3.5.2 has that patch.

Pranith

--- Additional comment from  on 2014-09-12 00:56:42 EDT ---

Has there been any more progress on this? This is still a major concern for our deployment.

Is there some way I can better isolate which layer is corrupting the offsets?

--- Additional comment from  on 2014-09-12 13:23:43 EDT ---

I have confirmed that disabling write-behind and mounting with the fuse client does allow the test to succeed.

--- Additional comment from Raghavendra G on 2014-10-13 02:54:34 EDT ---

Seems like there is no permanent corruption (though a reader running concurrently with the writer does get a wrong sequence number).

[root@booradley 1138970]# >/mnt/glusterfs/file; ./gluster-fuse-client-interleaved-read-write-corruption.py /mnt/glusterfs/file
[root@booradley 1138970]# cp -fv /mnt/glusterfs/file .
`/mnt/glusterfs/file' -> `./file'
[root@booradley 1138970]# >/mnt/glusterfs/file; ./gluster-fuse-client-interleaved-read-write-corruption.py /mnt/glusterfs/file
[root@booradley 1138970]# >/mnt/glusterfs/file; ./gluster-fuse-client-interleaved-read-write-corruption.py /mnt/glusterfs/file
FAILED: expected record id 58 retrieved record id 59
[root@booradley 1138970]# diff /mnt/glusterfs/file ./file 
[root@booradley 1138970]# >/mnt/glusterfs/file; ./gluster-fuse-client-interleaved-read-write-corruption.py /mnt/glusterfs/file
FAILED: expected record id 39 retrieved record id 40
[root@booradley 1138970]# diff /mnt/glusterfs/file ./file 

As can be seen above, I am comparing the file from runs where:
1. the script reported an error
2. the script did not report an error

diff did not report any difference between the two files. This seems to point to the corruption being transient: the reader got corrupted data only while executing concurrently with the writer.

--- Additional comment from Raghavendra G on 2014-10-13 03:15:31 EDT ---

The issue is reproducible only when BOTH write-behind and stat-prefetch are turned on. If either one of them is off, we don't hit the issue.
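
For reference, the two translators can be toggled per volume with the usual volume options (substitute the volume name for VOLNAME):

gluster volume set VOLNAME performance.write-behind off
gluster volume set VOLNAME performance.stat-prefetch off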

--- Additional comment from Raghavendra G on 2014-10-13 03:23:38 EDT ---

Also, presence/absence of read-cache translators (read-ahead, io-cache, quick-read) has no impact on the reproducibility of the issue.

--- Additional comment from Raghavendra G on 2014-10-13 07:44:09 EDT ---

The culprit seems to be stale stat information cached in stat-prefetch/md-cache. Consider the following sequence of events (a toy sketch of this interaction follows the list):

1. A stat with size=10 is cached in md-cache.
2. A write of 10 bytes happens, and the cached stat is flushed from md-cache.
3. The write is _NOT_ flushed from write-behind and _IS CACHED_ there.
4. Some operation returns stat information in its reply (still with size=10, since the write has not been flushed to the brick), and this information is cached in md-cache again.
5. Now, any fstat call from the application gets this stale stat with size=10, because the stat is served from md-cache; the fstat call never reaches write-behind, and hence the write is not flushed.
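
For illustration only, a toy Python sketch of this interaction (not GlusterFS code; the classes and numbers are made up to mirror the traces below, where the stale size is 120):

# Toy model of the md-cache + write-behind interaction described above.
class Brick(object):
    """Stands in for storage/posix: the file is opened with O_APPEND there,
    so pwrite() on Linux appends regardless of the offset it is given."""
    def __init__(self, size):
        self.size = size
    def writev(self, offset, nbytes):
        self.size += nbytes          # offset is effectively ignored
        return self.size

brick = Brick(size=120)              # what is actually on disk
md_cache_size = 120                  # stale stat cached in md-cache
# A pending 10-byte write is still held by write-behind while fstat is
# answered from md-cache, so the appending writer keeps seeing size=120
# and computes the same offset for two successive appends:
first_offset = md_cache_size         # 120
brick.writev(first_offset, 10)
second_offset = md_cache_size        # still 120 -> same offset reused
brick.writev(second_offset, 10)

print("offsets sent: %d, %d; size on brick: %d" % (first_offset, second_offset, brick.size))
# offsets sent: 120, 120; size on brick: 140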

The evidence for this seems to be in the following logs:
[2014-10-13 10:28:23.357618] I [trace.c:2320:trace_writev] 0-ptop-trace: 2966: gfid=496dbe26-3103-4a33-84d4-4facb08c7623 fd=0x18c1f38, count=1,  offset=120 flags=08401)
[2014-10-13 10:28:23.357652] I [trace.c:267:trace_writev_cbk] 0-ptop-trace: 2966: (op_ret=10, *prebuf = {gfid=00000000-0000-0000-0000-000000000000 ino=0, mode=0, nlink=0, uid=0, gid=0, size=0, blocks=0, atime=[Jan 01 05:30:00], mtime=[Jan 01 05:30:00], ctime=[Jan 01 05:30:00]}, *postbuf = {gfid=00000000-0000-0000-0000-000000000000 ino=0, mode=0, nlink=0, uid=0, gid=0, size=0, blocks=0, atime=[Jan 01 05:30:00], mtime=[Jan 01 05:30:00], ctime=[Jan 01 05:30:00]})
[2014-10-13 10:28:23.357781] I [trace.c:1225:trace_fstat_cbk] 0-ptop-trace: 2964: gfid=496dbe26-3103-4a33-84d4-4facb08c7623 op_ret=0 buf=gfid=496dbe26-3103-4a33-84d4-4facb08c7623 ino=9571362711184111139, mode=100644, nlink=1, uid=0, gid=0, size=120, blocks=1, atime=[Oct 13 15:58:23], mtime=[Oct 13 15:58:23], ctime=[Oct 13 15:58:23]
[2014-10-13 10:28:23.357870] I [trace.c:2848:trace_fstat] 0-ptop-trace: 2967: gfid=496dbe26-3103-4a33-84d4-4facb08c7623 fd=0x18c1ecc
[2014-10-13 10:28:23.357891] I [trace.c:1225:trace_fstat_cbk] 0-ptop-trace: 2967: gfid=496dbe26-3103-4a33-84d4-4facb08c7623 op_ret=0 buf=gfid=496dbe26-3103-4a33-84d4-4facb08c7623 ino=9571362711184111139, mode=100644, nlink=1, uid=0, gid=0, size=120, blocks=1, atime=[Oct 13 15:58:23], mtime=[Oct 13 15:58:23], ctime=[Oct 13 15:58:23]
[2014-10-13 10:28:23.357931] I [trace.c:2287:trace_readv] 0-ptop-trace: 2968: gfid=496dbe26-3103-4a33-84d4-4facb08c7623 fd=0x18c1ecc, size=4096offset=0 flags=08000)
[2014-10-13 10:28:23.358140] I [trace.c:228:trace_readv_cbk] 0-ptop-trace: 2968: gfid=496dbe26-3103-4a33-84d4-4facb08c7623 op_ret=130 buf=gfid=00000000-0000-0000-0000-000000000000 ino=0, mode=0, nlink=0, uid=0, gid=0, size=0, blocks=0, atime=[Jan 01 05:30:00], mtime=[Jan 01 05:30:00], ctime=[Jan 01 05:30:00]
[2014-10-13 10:28:23.359398] I [trace.c:2848:trace_fstat] 0-ptop-trace: 2969: gfid=496dbe26-3103-4a33-84d4-4facb08c7623 fd=0x18c1ecc
[2014-10-13 10:28:23.359489] I [trace.c:1225:trace_fstat_cbk] 0-ptop-trace: 2969: gfid=496dbe26-3103-4a33-84d4-4facb08c7623 op_ret=0 buf=gfid=496dbe26-3103-4a33-84d4-4facb08c7623 ino=9571362711184111139, mode=100644, nlink=1, uid=0, gid=0, size=120, blocks=1, atime=[Oct 13 15:58:23], mtime=[Oct 13 15:58:23], ctime=[Oct 13 15:58:23]
[2014-10-13 10:28:23.363943] I [trace.c:2320:trace_writev] 0-ptop-trace: 2971: gfid=496dbe26-3103-4a33-84d4-4facb08c7623 fd=0x18c1f38, count=1,  offset=120 flags=08401)

As can be seen, two writes are issued to the same offset 120 above. Also note that stat returns size=120 even though a write of 10 bytes at offset 120 has already succeeded and its reply has been returned to the application. For the sake of brevity, I am pasting below just the write traces around offset 120:

[2014-10-13 10:28:23.356296] I [trace.c:2320:trace_writev] 0-ptop-trace: 2963: gfid=496dbe26-3103-4a33-84d4-4facb08c7623 fd=0x18c1f38, count=1,  offset=110 flags=08401)
[2014-10-13 10:28:23.357618] I [trace.c:2320:trace_writev] 0-ptop-trace: 2966: gfid=496dbe26-3103-4a33-84d4-4facb08c7623 fd=0x18c1f38, count=1,  offset=120 flags=08401)
[2014-10-13 10:28:23.363943] I [trace.c:2320:trace_writev] 0-ptop-trace: 2971: gfid=496dbe26-3103-4a33-84d4-4facb08c7623 fd=0x18c1f38, count=1,  offset=120 flags=08401)
[2014-10-13 10:28:23.365178] I [trace.c:2320:trace_writev] 0-ptop-trace: 2974: gfid=496dbe26-3103-4a33-84d4-4facb08c7623 fd=0x18c1f38, count=1,  offset=130 flags=08401)

However, there is one hole in this hypothesis. Since we pass the same flags to the underlying filesystem in storage/posix during open, the file would be opened in O_APPEND mode on the bricks. From man 2 pwrite, on the interaction of O_APPEND and pwrite:

<man 2 pwrite>
BUGS        
       POSIX requires that opening a file with the O_APPEND flag should have
       no effect on the location at which pwrite() writes data.  However, on
       Linux, if a file is opened with O_APPEND, pwrite() appends data to
       the end of the file, regardless of the value of offset.
</man 2 pwrite>
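
This Linux behaviour is easy to observe outside of gluster. A minimal sketch (Python 3; the scratch file path is arbitrary):

import os

# Open with O_APPEND and write 5 bytes at the start of the file.
fd = os.open("/tmp/pwrite-append-test", os.O_CREAT | os.O_TRUNC | os.O_WRONLY | os.O_APPEND, 0o644)
os.write(fd, b"AAAAA")

# POSIX says this pwrite() should land at offset 0; on Linux it appends instead.
os.pwrite(fd, b"BBBBB", 0)
os.close(fd)

print(os.path.getsize("/tmp/pwrite-append-test"))   # 10 on Linux, not 5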

Also, the logs show that the last write of 10 bytes happened at offset 970, but the file size is 990 bytes, as shown below.

[root@booradley 1138970]# ls -l /home/export/ptop/file 
-rw-r--r--. 2 root root 990 Oct 13 15:58 /home/export/ptop/file

So, whatever offset is passed to storage/posix is ignored by pwrite() when O_APPEND is present. This behaviour is also in line with the observation in comment 8 that the file on the backend is no different when the issue is reported than when no issue is reported.

--- Additional comment from Raghavendra G on 2014-10-13 07:45:03 EDT ---

Comment 2 Susant Kumar Palai 2015-11-27 10:49:01 UTC
Cloning to 3.1. To be fixed in a future release.

