Bug 811919 - [6a995ab3300a5ee0ee79a4d7d75281a79deec96e] Truncated read output
Summary: [6a995ab3300a5ee0ee79a4d7d75281a79deec96e] Truncated read output
Keywords:
Status: CLOSED NOTABUG
Alias: None
Product: GlusterFS
Classification: Community
Component: fuse
Version: mainline
Hardware: Unspecified
OS: Unspecified
medium
unspecified
Target Milestone: ---
Assignee: Raghavendra G
QA Contact:
URL:
Whiteboard:
: 784205 (view as bug list)
Depends On:
Blocks:
TreeView+ depends on / blocked
 
Reported: 2012-04-12 10:22 UTC by Anush Shetty
Modified: 2012-06-07 07:22 UTC (History)
3 users (show)

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2012-06-06 12:28:04 UTC
Regression: ---
Mount Type: fuse
Documentation: ---
CRM:
Verified Versions:
Embargoed:


Attachments (Terms of Use)

Description Anush Shetty 2012-04-12 10:22:56 UTC
Description of problem: When a file is written with different content in a loop and read from the same client, the read output is not as expected always.

This issue was seen even with all performance translators disabled and the fuse client mounted with direct-io-mode=on

Version-Release number of selected component (if applicable): Upstream


How reproducible: Consistently


Steps to Reproduce:
1. while true; do echo 'abc' > /mnt/gluster/dot; done
2. while true; do echo 'defghijklmnopqrstuvwxyz' > /mnt/gluster/dot; done
3.  while true; do cat /mnt/gluster/dot;done
  
Actual results:

Read output:
abc
hijklmnopqrstuvwxyz
abc
abc
abc
abc
abc
defgabc
hijklmnopqrstuvwxyz
abc
hijklmnopqrstuvwxyz
defghijklmnopqrstuvwxyz
defghijklmnopqrstuvwxyz
abc


Additional info:

This was on a single brick.

Comment 1 Raghavendra G 2012-06-06 12:28:04 UTC

*** This bug has been marked as a duplicate of bug 762533 ***

Comment 2 Anand Avati 2012-06-07 06:41:51 UTC
This has been wrongly analyzed as a bug. While the shell commands makes it intuitively appear that the output should always either be "abc" or "defghijklmnopqrstuvwxyz", in reality the sequence of syscalls for the writer (taken from strace sh -c "echo 'abc' > /mnt/dot") are:

-----------------------------
open("/mnt/dot", O_WRONLY|O_CREAT|O_TRUNC, 0666) = 3
...
write(1, "abc\n", 4)                    = 4
-----------------------------

Note the "\n" at the end.

And the sequence of syscalls for the reader (taken from strace sh -c "cat /mnt/dot") are:

-----------------------------
open("/tmp/dot", O_RDONLY)              = 3
fstat(3, {st_mode=S_IFREG|0644, st_size=4, ...}) = 0
read(3, "abc\n", 32768)                 = 4
write(1, "abc\n", 4)                    = 4 # write to stdout
read(3, "", 32768)                      = 0
-----------------------------

It is important to note that the two writer thread's truncate + write is not atomic. Both can truncate in parallel, followed by thread 2 write the longer string and then thread 1 overwriting the initial few bytes with its contents. This results in outputs like:

----------------------------
open("/mnt/dot", O_RDONLY)              = 3
fstat(3, {st_mode=S_IFREG|0644, st_size=0, ...}) = 0
read(3, "abc\nhijklmnopqrstuvwxyz\n", 131072) = 24
write(1, "abc\nhijklmnopqrstuvwxyz\n", 24) = 24
read(3, "", 131072)                     = 0
----------------------------

The "\n" in the middle makes it look like this on the screen:

----------------------------
...
abc
hijklmnopqrstuvwxyz
...
---------------------------

There is another sequence of operations which leads to the above behavior. This is due to the nature of how the cat command behaves. Notice in all the above reader strace logs, there is always an extra read() attempt in the end which returns 0. This is how cat checks if the file had "grown" while it was ready. It basically keeps looping by attempting to read more till it gets a 0 byte (there is no other way it can detect EOF on stdin) even if the previous read attempt had not filled the buffer completely. So, in situations where the first read() attempt of cat returns "abc\n", before the next read() attempt, the second writer could have overwritten the file with new content, but of course, the fd of the reader has seek'd to offset 5 already. This results in the following sequence of syscalls:

---------------------------
open("/mnt/dot", O_RDONLY)              = 3
fstat(3, {st_mode=S_IFREG|0644, st_size=0, ...}) = 0
read(3, "abc\n", 131072)                = 4
write(1, "abc\n", 4)                    = 4
read(3, "hijklmnopqrstuvwxyz\n", 131072) = 20
write(1, "hijklmnopqrstuvwxyz\n", 20)   = 20
read(3, "", 131072)                     = 0
----------------------------

And all this is reproducible over ext3/ext4/XFS as well!

The FUSE/gluster specific part of the behavior comes with the output line:

---------------------------
...
defgabc
...
---------------------------

In this case, the reader's strace looks like this:

-------------------------
open("/mnt/dot", O_RDONLY)              = 3
fstat(3, {st_mode=S_IFREG|0644, st_size=0, ...}) = 0
read(3, "defg", 131072)                 = 4
write(1, "defg", 4)                     = 4
read(3, "", 131072)                     = 0
----------------------------

Notice that "defg" is not followed by a "\n". This actually shows the line "defgabc" is actually the output of two consecutive cat command outputs. The first shown above, followed by an "OK" cat which displayed "abc\n". Now the question is how the hell did we get "defg" and not "defghijklmnopqrstuvwxyz"? This has got to do with how FUSE behaves. FUSE (VFS) only reads as many bytes as the inode size attribute. My theory is that, there has been a race in FUSE, between the open() and read() of the reader where the first writer's acknowledgement came from gluster first and the inode attribute in FUSE was set to 4, then the second write proceeded to gluster which updated the content in the backend but the reply hasn't yet arrived back to FUSE. Now the read is issued for only 4 bytes as that is the size cached in the inode, and those 4 bytes are displayed. By the time the next read() is attempted by cat, very likely yet another open("/mnt/dot", O_TRUNC) had truncated, resulting in a 0 byte read.

Note that, the validity of this theory (read issued by FUSE only up to the cached inode size) is corroborated by the observation that read call's size paramter in gluster is 4 while cat's read size was 131072. (how did FUSE know to read only 4 bytes?). Whether or not this behavior is a "bug" is debatable. In any case this does not look like a Gluster bug (it might be a FUSE bug, if at all it is a bug)

Comment 3 shishir gowda 2012-06-07 07:22:39 UTC
*** Bug 784205 has been marked as a duplicate of this bug. ***


Note You need to log in before you can comment on or make changes to this bug.