Bug 1765547 - Fallocate on XFS may discard concurrent AIO write
Summary: Fallocate on XFS may discard concurrent AIO write
Keywords:
Status: ON_QA
Alias: None
Product: Red Hat Enterprise Linux 8
Classification: Red Hat
Component: kernel
Version: 8.0
Hardware: Unspecified
OS: Unspecified
Target Milestone: rc
Target Release: 8.0
Assignee: Eric Sandeen
QA Contact: Zorro Lang
URL:
Whiteboard:
Duplicates: 1782090
Depends On:
Blocks:
 
Reported: 2019-10-25 11:47 UTC by Max Reitz
Modified: 2019-12-20 13:06 UTC
CC: 6 users

Fixed In Version: kernel-4.18.0-165.el8
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed:
Type: Bug
Target Upstream Version:


Attachments
Test program to demonstrate the issue (1.69 KB, text/plain)
2019-10-25 11:47 UTC, Max Reitz
Test program to demonstrate the issue (under MIT license) (2.92 KB, text/x-csrc)
2019-10-29 07:33 UTC, Max Reitz

Description Max Reitz 2019-10-25 11:47:27 UTC
Created attachment 1629144 [details]
Test program to demonstrate the issue

Description of problem:

On XFS, when running fallocate()[1] and concurrently[2] submitting a pwrite through the AIO interface (io_submit()), the write may be lost if:
- the fallocate() touches a region beyond the file’s EOF,
- the pwrite targets a range beyond the fallocate()’d region (so even further beyond the EOF), and
- the two operations overlap in time (each starts before the other has finished).

[1] I suspect this applies to all fallocate() calls without FALLOC_FL_KEEP_SIZE, but I know it to happen for mode=FALLOC_FL_ZERO_RANGE and mode=0.

[2] I.e., in two different threads.

To me it looks like fallocate() at some point decides whether it will need to increase the file’s length, and if so later truncates the file to that length regardless of whether any other requests have grown the file even further in the meantime, thus dropping the data that was written.


Version-Release number of selected component (if applicable):

4.18.0-80.el8.x86_64
5.3.6-200.fc30.x86_64
5.3.7-arch1-1-ARCH


How reproducible:

Depending on your disk, the attached program reproduces the issue in roughly 50 % to 100 % of runs.


Steps to Reproduce:
1. Download the attached C source file

2. Compile it:

$ gcc parallel-falloc-and-pwrite.c -pthread -laio -Wall -Wextra -pedantic -std=c11

3. Run it:
$ df -hT . | awk '{ print $2 }'
Type
xfs
$ ./a.out tmp-file

Actual results:

submitting pwrite
starting fallocate
pwrite done
fallocate done

Hexdump should show 4k of 0s and 4k of 42s:

00000000  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
*
00001000


Expected results:

submitting pwrite
starting fallocate
pwrite done
fallocate done

Hexdump should show 4k of 0s and 4k of 42s:

00000000  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
*
00001000  2a 2a 2a 2a 2a 2a 2a 2a  2a 2a 2a 2a 2a 2a 2a 2a  |****************|
*
00002000


Additional info:

On btrfs and ext4, the result is as expected.  It is also as expected when defining the IN_ORDER macro in the attached C source file, as this causes the fallocate() thread to be joined before the pwrite request is submitted.

When using the synchronous I/O interface instead of io_submit(), the result is as expected, too.

Comment 1 Eric Sandeen 2019-10-25 12:35:41 UTC
Max, thank you for working this down to a clear testcase.

Comment 2 Max Reitz 2019-10-28 10:18:40 UTC
(In case this is really an XFS bug and not something that turns out to be caused by my misunderstanding of how fallocate() is supposed to work, or something like that,)

It would be very good if there were a way to detect whether this has been fixed in the XFS driver.  We plan on implementing a workaround in qemu, and of course we’d like to disable it where it isn’t necessary.

Stefan Hajnoczi has proposed adding a feature/bugfix bitmap ioctl that would report which features and/or bug fixes the current XFS driver contains.  Would that be feasible?

Max

Comment 3 Eric Sandeen 2019-10-28 13:32:29 UTC
Pretty sure we agree it's a bug.  And I totally understand the desire for runtime detection, but that sounds quite tricky.
I don't really see a bug-bitmap as feasible.  We can see what others think, though.

Of course upstream, you could reasonably count on uname/kernel version.  But I understand that doesn't help distro OSes that backport kernel fixes.

Comment 4 Max Reitz 2019-10-28 13:43:25 UTC
Hm, yes, uname(2) is one of the fallback ideas.  (In addition to a ./configure option for qemu that disables the workaround at the discretion of the distribution.)

Max

Comment 5 Eric Sandeen 2019-10-28 14:00:43 UTC
Let me talk w/ other developers and see if there might be a relatively harmless way to work around the bug - like possibly truncating out the file prior to the fallocate, or something like that.  Of course if the workaround has negative performance effects that won't be desirable but let's see if we can think of anything that could help.

Comment 6 Max Reitz 2019-10-28 14:58:51 UTC
We did truncate before and thus we actually got the very same bug in qemu itself at one point. :-)  (It’s been fixed.)

(The problem was that we used xfsctl(XFS_IOC_ZERO_RANGE) before instead of fallocate(FALLOC_FL_ZERO_RANGE), but that doesn’t grow the file for ranges beyond the EOF.  So we had to truncate it, but that caused basically the same problems as described in this bug, namely concurrent writes being dropped because the truncate interferes with them.)

So our current idea for a workaround in qemu is to simply prevent concurrent write requests that happen after a write-zeroes operation after the EOF.  Because that’s such a corner case (otherwise people would have noticed this problem before), I don’t think it should have much impact on performance.  (And we have all the infrastructure in place to prevent concurrent writes on certain conditions, so the patch just adds 42 lines of code.)

It’s just that we are still a bit uncomfortable about keeping that workaround in qemu indefinitely when it’s probably rarely going to be necessary after a year or so.

Max

Comment 7 Dave Chinner 2019-10-28 23:06:02 UTC
(In reply to Max Reitz from comment #6)
> We did truncate before and thus we actually got the very same bug in qemu
> itself at one point. :-)  (It’s been fixed.)

Doing AIO writes to extend files is -tricky- to get correct and maintain performance.

> (The problem was that we used xfsctl(XFS_IOC_ZERO_RANGE) before instead of
> fallocate(FALLOC_FL_ZERO_RANGE), but that doesn’t grow the file for ranges
> beyond the EOF.  So we had to truncate it, but that caused basically the
> same problems as described in this bug, namely concurrent writes being
> dropped because the truncate interferes with them.)

What does "concurrent writes being dropped" mean? That's the bug in qemu you are referencing above?

I have to ask, though: do you know that fallocate() does exactly the same thing as truncate w.r.t. AIO writes? That is, it drains and blocks all other IO in progress until it completes. IOWs, if ftruncate caused performance problems extending the file in the presence of AIO writes, then using fallocate to extend the file will cause you exactly the same performance problems for exactly the same reasons...

This is why doing 3 sequential AIOs to zero-around an AIO-DIO write is preferred and often much faster than using fallocate to zero - the block layer combines the 3 AIO writes into a single physical IO and they are done asynchronously along with all the other AIO *reads* and writes in progress. Writing zeros via AIO writes does not create an IO pipeline bubble like truncate/fallocate does, and as such you should *always* try to avoid using fallocate() as much as possible when using AIO.

My rule of thumb for AIO is that fallocate() is only used for setting up the file initially, and then it never gets used again. If you need zero-around for small DIO writes, then you start with a sparse file and use extent size hints on the file to force the allocator to allocate large unwritten extents around the DIO that is being done. This is identical to doing FALLOC_FL_ZERO_RANGE then writing over the top of the zero'd range, but without any serialisation or additional performance overhead at all. It doesn't even require the application to do anything special except set up extent size hints when the file is created (i.e. via FS_IOC_FS[GS]ETXATTR).

> So our current idea for a workaround in qemu is to simply prevent concurrent
> write requests that happen after a write-zeroes operation after the EOF.

I think you need to do more than this - there are several caveats around using AIO+DIO to extend the file size. The main one is that when io_submit() returns to userspace, the file size is not guaranteed to reflect the size of the IOs in progress. The file size may not change until the AIOs complete, and so doing something like:

fd = open(O_TRUNC)
io_prep_write(fd, 4k, 4k)
io_submit()
fstat(fd)
if (st.size == 0)
  printf("size is zero")

will output "size is zero" more often than not. The behaviour is actually dependent on the underlying storage: e.g. pmem/ramdisk will never output "size is zero" because they are synchronous storage and so AIO is not possible, whilst a spinning disk will almost always report "size is zero" because it takes milliseconds for the IO to complete.

IOWs, using AIO+DIO to extend files requires the application to be very careful about asynchronous size updates, and as such if it is deciding to extend files manually itself it needs to serialise all non-IO operations that either depend on the size being correct or modify the inode size against AIO writes in progress.

The filesystem can't do this for the application - the filesystem has no knowledge of the ordering of operations that the application requires. Hence the responsibility for ensuring AIO-DIO is coherent at the application level lies with the application, not the filesystem.

> Because that’s such a corner case (otherwise people would have noticed this
> problem before), I don’t think it should have much impact on performance. 
> (And we have all the infrastructure in place to prevent concurrent writes on
> certain conditions, so the patch just adds 42 lines of code.)

Great!

> It’s just that we are still a bit uncomfortable about keeping that
> workaround in qemu indefinitely when it’s probably rarely going to be
> necessary after a year or so.

At a fundamental level, AIO+DIO requires applications to manage their data integrity directly and correctly. That means applications need to serialise file extension operations themselves via similar mechanisms that filesystems do for normal synchronous IO. e.g. consider this sequence of operations, and tell me what the result is supposed to be:

open(O_TRUNC)
io_prep_write(4k, 4k)
io_submit()
io_prep_fsync()
io_submit()
io_prep_read(4k,4k)
io_submit()
ftruncate(0);
io_getevents(3)

I can tell you now that you could successfully write the data, read 0 bytes, successfully truncate the file, and still end up with the file being 8kB in size, yet an immediate crash results in a zero-length file. This is despite the fact that the filesystem has correctly serialised all the individual operations internally. This is because AIO does not guarantee that AIOs are submitted immediately or even in order. They are not guaranteed to complete in order, and they are not guaranteed to be ordered correctly against other concurrent synchronous operations.

So the AIO read can run before the write, be beyond EOF, and so return 0 bytes. The ftruncate can then run before the write, truncating successfully to zero bytes. The write then runs, extending the file. The fsync can run at any time, so if we crash directly after this sequence, there is no guarantee that the file contains anything even though the fsync() completed successfully.

The only way to solve these AIO write extension ordering problems is by correctly ordering application operations by their completion events, not by expecting synchronous syscalls to block in-flight AIO. That is, the fsync cannot be submitted until the write completes if it is expected to stabilise the write. The read can be submitted concurrently with the fsync(), but we have to wait for both to complete before running the truncate. This ensures that there is no AIO in progress when a truncate/fallocate is run, thereby guaranteeing the order of operations is correct at the application level and there is no chance of a race (or a serialisation bug at the filesystem level) causing data corruption.

Order-by-completion is the model kernel filesystems are based on - they do not start new operations until all other dependent operations have been completed. The bug reported here w/ XFS is because the fallocate operation isn't draining AIO early enough in the ZERO_RANGE operation, and so its file size checking is racing with the AIO completion that updates the file size. I'm working on fixing that - it's clearly a bug in XFS - but that doesn't solve the application-level AIO ordering and serialisation issues I mention above that led to uncovering this XFS issue...

-Dave.

Comment 8 Eric Sandeen 2019-10-29 02:33:35 UTC
Max, can you update your testcase w/ your copyright & GPL license so that it can be included in the xfstests regression suite?

Comment 9 Max Reitz 2019-10-29 07:25:03 UTC
(In reply to Dave Chinner from comment #7)
> > (The problem was that we used xfsctl(XFS_IOC_ZERO_RANGE) before instead of
> > fallocate(FALLOC_FL_ZERO_RANGE), but that doesn’t grow the file for ranges
> > beyond the EOF.  So we had to truncate it, but that caused basically the
> > same problems as described in this bug, namely concurrent writes being
> > dropped because the truncate interferes with them.)
> 
> What does "concurrent writes being dropped" mean? That's the bug in qemu you
> are referencing above?

Yes, everything in parentheses here references the bug in qemu, not the one in the XFS driver.

> I have to ask, though: do you know that fallocate() does exactly the same
> thing as truncate w.r.t. aio writes? That is, it drains and block all other
> IO in progress until it completes? IOWs, if ftruncate caused performance
> problems extending the file in teh presence of AIO writes, then using
> fallocate to extend the file will cause you exactly the same performance
> problems for exactly the same reasons...

I didn’t, actually.  That’s very interesting and explains why the fallocate() generally completes after concurrent write operations.

> This is why doing 3 sequential AIOs to zero-around a AIO-DIO write is
> prefered and often much faster than using fallocate to zero - the block
> layer combines the 3 AIO writes into a single physical IO and they are done
> asynchronously along with all the other AIO *read* and writes in progress.
> Writing zeros via AIO wites does not create an IO pipeline bubble like
> truncate/fallocate does, and as such you should *always* try to avoid using
> fallocate() as much as possible when using AIO. 

Thanks a lot for this explanation.  We currently have an upstream discussion about exactly that performance impact.  (Because a recent qemu patch added that use of fallocate() where we didn’t have it before.)

> My rule of thumb for AIO is that fallocate() is only used for setting up the
> file initially, and then it never gets used again. If you need zero-around
> for small DIO writes, then you start with a sparse file and use extent size
> hints on the file to force the allocator to allocate large unwritten extents
> around the DIO that is being done. This is identical to doing
> FALLOC_FL_ZERO_RANGE then writing over the top of the zero'd range, but
> without any serialisation or additional performance overhead at all. It
> doesn't even require the application to do anything special except set up
> extent size hints when the file is created (i.e. via FS_IOC_FS[GS]ETXATTR).

Actually, we don’t necessarily need the explicit zero extents either.  From qemu’s POV, all that matters is that we read back zeroes, so we could leave the range unallocated (as far as I know).  It’s just that we also support images on block devices and cannot assume that those read back as zero, so the current version just makes sure they do.

In any case, thanks a lot, this is very valuable information for the upstream qemu discussion.

> > So our current idea for a workaround in qemu is to simply prevent concurrent
> > write requests that happen after a write-zeroes operation after the EOF.
> 
> I think you need to do more than this - there are several caveats around
> using AIO+DIO to extend the file size. The main one being is that when
> io_submit() reutrns to userspace the file size is not guaranteed to reflect
> the size of the IOs in progress.

We generally don’t query the size of the file while I/O is ongoing, because we keep track of it in an internal variable (by assuming that post-EOF writes implicitly grow the file).

But maybe it’s still something to audit the code for...

> The file size may not change until the AIOs
> complete, and so doing something like:
> 
> fd = open(O_TRUNC)
> io_prep_write(fd, 4k, 4k)
> io_submit()
> fstat(fd)
> if (st.size == 0)
>   printf("size is zero")
> 
> Will output "size is zero" more often than not. The behaviour is actually
> dependent on the underlying storage. e.g. pmem/ramdisk will never output
> "size is zero" because they are synchronous storage and so AIO is not
> possible, whilst a spinning disk will almost always report "size is zero"
> because it takes milliseconds for the IO to complete.
> 
> IOWs, using AIO+DIO to extend files requires the application to be very
> careful about asynchronous size updates, and as such if it is deciding to
> extend files manually itself it needs to serialise all non-IO operations
> that either depend on the size being correct or modify the inode size
> against AIO writes in progress.
> 
> The filesystem can't do this for the application - the filesystem has no
> knowledge of the ordering of operations that the application requires. Hence
> the responsibility for ensuring AIO-DIO is coherent at the applicaiton level
> lies with the application, not the filesystem.
> 
> > Because that’s such a corner case (otherwise people would have noticed this
> > problem before), I don’t think it should have much impact on performance. 
> > (And we have all the infrastructure in place to prevent concurrent writes on
> > certain conditions, so the patch just adds 42 lines of code.)
> 
> Great!
> 
> > It’s just that we are still a bit uncomfortable about keeping that
> > workaround in qemu indefinitely when it’s probably rarely going to be
> > necessary after a year or so.
> 
> At a fundamental level, AIO+DIO requires applications to manage their data
> integrity directly and correctly. That means applications need to serialise
> file extension operations themselves via similar mechanisms that filesystems
> do for normal synchronous IO.

Well, we (qemu) generally don’t do ftruncate() while I/O is ongoing.  Or at least we try to (except for when we have a bug, as described before :-/).

Again, thanks a lot for the very detailed explanations.


(In reply to Eric Sandeen from comment #8)
> Max, can you update your testcase w/ your copyright & GPL license so that it
> can be included in the xfstests regression suite?

Sure.

Max

Comment 10 Max Reitz 2019-10-29 07:33:41 UTC
Created attachment 1630037 [details]
Test program to demonstrate the issue (under MIT license)

Comment 11 Max Reitz 2019-10-29 13:15:47 UTC
FWIW, applying “xfs: properly serialise fallocate against AIO+DIO” fixes the original qemu problem (BZ 1751934) for me.

Max

Comment 14 Eric Sandeen 2019-12-11 21:50:27 UTC
*** Bug 1782090 has been marked as a duplicate of this bug. ***

Comment 15 Bruno Meneguele 2019-12-14 12:36:28 UTC
Patch(es) available on kernel-4.18.0-165.el8

