Bug 1765547 - Fallocate on XFS may discard concurrent AIO write
Summary: Fallocate on XFS may discard concurrent AIO write
Keywords:
Status: ON_QA
Alias: None
Product: Red Hat Enterprise Linux 8
Classification: Red Hat
Component: kernel
Version: 8.0
Hardware: Unspecified
OS: Unspecified
Target Milestone: rc
Target Release: 8.0
Assignee: Eric Sandeen
QA Contact: Zorro Lang
URL:
Whiteboard:
Duplicates: 1782090
Depends On:
Blocks:
 
Reported: 2019-10-25 11:47 UTC by Max Reitz
Modified: 2019-12-20 13:06 UTC
CC: 6 users

Fixed In Version: kernel-4.18.0-165.el8
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed:
Type: Bug
Target Upstream Version:


Attachments
Test program to demonstrate the issue (1.69 KB, text/plain)
2019-10-25 11:47 UTC, Max Reitz
Test program to demonstrate the issue (under MIT license) (2.92 KB, text/x-csrc)
2019-10-29 07:33 UTC, Max Reitz

Description Max Reitz 2019-10-25 11:47:27 UTC
Created attachment 1629144 [details]
Test program to demonstrate the issue

Description of problem:

On XFS, when running fallocate()[1] and concurrently[2] submitting a pwrite through the AIO interface (io_submit()), the write may be lost if:
- the fallocate() touches a region beyond the file’s EOF,
- the pwrite targets a range beyond the fallocate()’d region (so even further beyond the EOF), and
- the two operations overlap in time (each starts before the other has finished).

[1] I suspect this applies to all fallocate() calls without FALLOC_FL_KEEP_SIZE, but I know it to happen for mode=FALLOC_FL_ZERO_RANGE and mode=0.

[2] I.e., in two different threads.

To me it looks like fallocate() at some point decides whether it will need to increase the file’s length, and if so later truncates the file to that length regardless of whether any other requests have grown the file even further in the meantime, thus dropping the data that was written.


Version-Release number of selected component (if applicable):

4.18.0-80.el8.x86_64
5.3.6-200.fc30.x86_64
5.3.7-arch1-1-ARCH


How reproducible:

Depending on your disk, the attached program reproduces the issue in roughly 50 % to 100 % of runs.


Steps to Reproduce:
1. Download the attached C source file

2. Compile it:

$ gcc parallel-falloc-and-pwrite.c -pthread -laio -Wall -Wextra -pedantic -std=c11

3. Run it:
$ df -hT . | awk '{ print $2 }'
Type
xfs
$ ./a.out tmp-file

Actual results:

submitting pwrite
starting fallocate
pwrite done
fallocate done

Hexdump should show 4k of 0s and 4k of 42s:

00000000  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
*
00001000


Expected results:

submitting pwrite
starting fallocate
pwrite done
fallocate done

Hexdump should show 4k of 0s and 4k of 42s:

00000000  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
*
00001000  2a 2a 2a 2a 2a 2a 2a 2a  2a 2a 2a 2a 2a 2a 2a 2a  |****************|
*
00002000


Additional info:

On btrfs and ext4, the result is as expected.  It is also as expected when defining the IN_ORDER macro in the attached C source file, as this causes the fallocate() thread to be joined before the pwrite request is submitted.

When using the synchronous I/O interface instead of io_submit(), the result is as expected, too.

Comment 1 Eric Sandeen 2019-10-25 12:35:41 UTC
Max, thank you for working this down to a clear testcase.

Comment 2 Max Reitz 2019-10-28 10:18:40 UTC
(In case this is really an XFS bug and not something that turns out to be caused by my misunderstanding of how fallocate() is supposed to work, or something like that,)

It would be very good if there were a way to detect whether this has been fixed in the XFS driver.  We plan on implementing a workaround in qemu, and of course we’d like to disable it where it isn’t necessary.

Stefan Hajnoczi has proposed adding a feature/bugfix bitmap ioctl that would report which features and/or bug fixes the current XFS driver contains.  Would that be feasible?

Max

Comment 3 Eric Sandeen 2019-10-28 13:32:29 UTC
Pretty sure we agree it's a bug.  And I totally understand the desire for runtime detection, but that sounds quite tricky.
I don't really see a bug-bitmap as feasible.  We can see what others think, though.

Of course upstream, you could reasonably count on uname/kernel version.  But I understand that doesn't help distro OSes that backport kernel fixes.

Comment 4 Max Reitz 2019-10-28 13:43:25 UTC
Hm, yes, uname(2) is one of the fallback ideas.  (In addition to a ./configure option for qemu that disables the workaround at the discretion of the distribution.)

Max

Comment 5 Eric Sandeen 2019-10-28 14:00:43 UTC
Let me talk w/ other developers and see if there might be a relatively harmless way to work around the bug - like possibly truncating out the file prior to the fallocate, or something like that.  Of course if the workaround has negative performance effects that won't be desirable but let's see if we can think of anything that could help.

Comment 6 Max Reitz 2019-10-28 14:58:51 UTC
We did truncate before and thus we actually got the very same bug in qemu itself at one point. :-)  (It’s been fixed.)

(The problem was that we used xfsctl(XFS_IOC_ZERO_RANGE) before instead of fallocate(FALLOC_FL_ZERO_RANGE), but that doesn’t grow the file for ranges beyond the EOF.  So we had to truncate it, but that caused basically the same problems as described in this bug, namely concurrent writes being dropped because the truncate interferes with them.)

So our current idea for a workaround in qemu is to simply prevent concurrent write requests that happen after a write-zeroes operation after the EOF.  Because that’s such a corner case (otherwise people would have noticed this problem before), I don’t think it should have much impact on performance.  (And we have all the infrastructure in place to prevent concurrent writes on certain conditions, so the patch just adds 42 lines of code.)

It’s just that we are still a bit uncomfortable about keeping that workaround in qemu indefinitely when it’s probably rarely going to be necessary after a year or so.

Max

Comment 7 Dave Chinner 2019-10-28 23:06:02 UTC
(In reply to Max Reitz from comment #6)
> We did truncate before and thus we actually got the very same bug in qemu
> itself at one point. :-)  (It’s been fixed.)

Doing AIO writes to extend files is -tricky- to get correct and maintain performance.

> (The problem was that we used xfsctl(XFS_IOC_ZERO_RANGE) before instead of
> fallocate(FALLOC_FL_ZERO_RANGE), but that doesn’t grow the file for ranges
> beyond the EOF.  So we had to truncate it, but that caused basically the
> same problems as described in this bug, namely concurrent writes being
> dropped because the truncate interferes with them.)

What does "concurrent writes being dropped" mean? That's the bug in qemu you are referencing above?

I have to ask, though: do you know that fallocate() does exactly the same thing as truncate w.r.t. AIO writes? That is, it drains and blocks all other IO in progress until it completes. IOWs, if ftruncate caused performance problems extending the file in the presence of AIO writes, then using fallocate to extend the file will cause you exactly the same performance problems for exactly the same reasons...

This is why doing 3 sequential AIOs to zero-around an AIO-DIO write is preferred and often much faster than using fallocate to zero - the block layer combines the 3 AIO writes into a single physical IO and they are done asynchronously along with all the other AIO *reads* and writes in progress. Writing zeros via AIO writes does not create an IO pipeline bubble like truncate/fallocate does, and as such you should *always* try to avoid using fallocate() as much as possible when using AIO.

My rule of thumb for AIO is that fallocate() is only used for setting up the file initially, and then it never gets used again. If you need zero-around for small DIO writes, then you start with a sparse file and use extent size hints on the file to force the allocator to allocate large unwritten extents around the DIO that is being done. This is identical to doing FALLOC_FL_ZERO_RANGE then writing over the top of the zero'd range, but without any serialisation or additional performance overhead at all. It doesn't even require the application to do anything special except set up extent size hints when the file is created (i.e. via FS_IOC_FS[GS]ETXATTR).

> So our current idea for a workaround in qemu is to simply prevent concurrent
> write requests that happen after a write-zeroes operation after the EOF.

I think you need to do more than this - there are several caveats around using AIO+DIO to extend the file size. The main one is that when io_submit() returns to userspace, the file size is not guaranteed to reflect the size of the IOs in progress. The file size may not change until the AIOs complete, and so doing something like:

fd = open(O_TRUNC)
io_prep_write(fd, 4k, 4k)
io_submit()
fstat(fd)
if (st.size == 0)
  printf("size is zero")

will output "size is zero" more often than not. The behaviour is actually dependent on the underlying storage: e.g. pmem/ramdisk will never output "size is zero" because they are synchronous storage and so AIO is not possible, whilst a spinning disk will almost always report "size is zero" because it takes milliseconds for the IO to complete.

IOWs, using AIO+DIO to extend files requires the application to be very careful about asynchronous size updates, and as such if it is deciding to extend files manually itself it needs to serialise all non-IO operations that either depend on the size being correct or modify the inode size against AIO writes in progress.

The filesystem can't do this for the application - the filesystem has no knowledge of the ordering of operations that the application requires. Hence the responsibility for ensuring AIO-DIO is coherent at the application level lies with the application, not the filesystem.

> Because that’s such a corner case (otherwise people would have noticed this
> problem before), I don’t think it should have much impact on performance. 
> (And we have all the infrastructure in place to prevent concurrent writes on
> certain conditions, so the patch just adds 42 lines of code.)

Great!

> It’s just that we are still a bit uncomfortable about keeping that
> workaround in qemu indefinitely when it’s probably rarely going to be
> necessary after a year or so.

At a fundamental level, AIO+DIO requires applications to manage their data integrity directly and correctly. That means applications need to serialise file extension operations themselves via similar mechanisms that filesystems do for normal synchronous IO. e.g. consider this sequence of operations, and tell me what the result is supposed to be:

open(O_TRUNC)
io_prep_write(4k, 4k)
io_submit()
io_prep_fsync()
io_submit()
io_prep_read(4k,4k)
io_submit()
ftruncate(0);
io_getevents(3)

I can tell you now that you could successfully write the data, read 0 bytes, successfully truncate the file, and still end up with the file being 8kB in size, yet an immediate crash results in a zero-length file. This is despite the fact that the filesystem has correctly serialised all the individual operations internally. This is because AIO does not guarantee that AIOs are submitted immediately or even in order. They are not guaranteed to complete in order, and they are not guaranteed to be ordered correctly against other concurrent synchronous operations.

So the AIO read can run before the write, be beyond EOF, and so return 0 bytes. The ftruncate can then run before the write, truncating successfully to zero bytes. The write then runs, extending the file. The fsync can run at any time, so if we crash directly after this sequence, there is no guarantee that the file contains anything even though the fsync() completed successfully.

The only way to solve these AIO write extension ordering problems is by correctly ordering application operations by their completion events, not by expecting synchronous syscalls to block in-flight AIO. That is, the fsync cannot be submitted until the write completes if it is expected to stabilise the write. The read can be submitted concurrently with the fsync(), but we have to wait for both to complete before running the truncate. This ensures that there is no AIO in progress when a truncate/fallocate is run, thereby guaranteeing the order of operations is correct at the application level and there is no chance of a race (or a serialisation bug at the filesystem level) causing data corruption.

Order-by-completion is the model kernel filesystems are based on - they do not start new operations until all other dependent operations have been completed. The bug reported here w/ XFS is because the fallocate operation isn't draining AIO early enough in the ZERO_RANGE operation, and so its file size checking is racing with the AIO completion that updates the file size. I'm working on fixing that - it's clearly a bug in XFS - but that doesn't solve the application-level AIO ordering and serialisation issues I mention above that led to uncovering this XFS issue...

-Dave.

Comment 8 Eric Sandeen 2019-10-29 02:33:35 UTC
Max, can you update your testcase w/ your copyright & GPL license so that it can be included in the xfstests regression suite?

Comment 9 Max Reitz 2019-10-29 07:25:03 UTC
(In reply to Dave Chinner from comment #7)
> > (The problem was that we used xfsctl(XFS_IOC_ZERO_RANGE) before instead of
> > fallocate(FALLOC_FL_ZERO_RANGE), but that doesn’t grow the file for ranges
> > beyond the EOF.  So we had to truncate it, but that caused basically the
> > same problems as described in this bug, namely concurrent writes being
> > dropped because the truncate interferes with them.)
> 
> What does "concurrent writes being dropped" mean? That's the bug in qemu you
> are referencing above?

Yes, everything in parentheses here references the bug in qemu, not the one in the XFS driver.

> I have to ask, though: do you know that fallocate() does exactly the same
> thing as truncate w.r.t. aio writes? That is, it drains and block all other
> IO in progress until it completes? IOWs, if ftruncate caused performance
> problems extending the file in teh presence of AIO writes, then using
> fallocate to extend the file will cause you exactly the same performance
> problems for exactly the same reasons...

I didn’t, actually.  That’s very interesting and explains why the fallocate() generally completes after concurrent write operations.

> This is why doing 3 sequential AIOs to zero-around a AIO-DIO write is
> prefered and often much faster than using fallocate to zero - the block
> layer combines the 3 AIO writes into a single physical IO and they are done
> asynchronously along with all the other AIO *read* and writes in progress.
> Writing zeros via AIO wites does not create an IO pipeline bubble like
> truncate/fallocate does, and as such you should *always* try to avoid using
> fallocate() as much as possible when using AIO. 

Thanks a lot for this explanation.  We currently have an upstream discussion about exactly that performance impact.  (Because a recent qemu patch added that use of fallocate() where we didn’t have it before.)

> My rule of thumb for AIO is that fallocate() is only used for setting up the
> file initially, and then it never gets used again. If you need zero-around
> for small DIO writes, then you start with a sparse file and use extent size
> hints on the file to force the allocator to allocate large unwritten extents
> around the DIO that is being done. This is identical to doing
> FALLOC_FL_ZERO_RANGE then writing over the top of the zero'd range, but
> without any serialisation or additional performance overhead at all. It
> doesn't even require the application to do anything special except set up
> extent size hints when the file is created (i.e. via FS_IOC_FS[GS]ETXATTR).

Actually, we don’t necessarily need the explicit zero extents either.  From qemu’s POV, all that matters is that we read back zeroes, so we could leave the range unallocated (as far as I know).  It’s just that we also support images on block devices and cannot assume that those read back as zero, so the current version just makes sure they do.

In any case, thanks a lot, this is very valuable information for the upstream qemu discussion.

> > So our current idea for a workaround in qemu is to simply prevent concurrent
> > write requests that happen after a write-zeroes operation after the EOF.
> 
> I think you need to do more than this - there are several caveats around
> using AIO+DIO to extend the file size. The main one being is that when
> io_submit() reutrns to userspace the file size is not guaranteed to reflect
> the size of the IOs in progress.

We generally don’t query the size of the file while I/O is ongoing, because we keep track of it in an internal variable (by assuming that post-EOF writes implicitly grow the file).

But maybe it’s still something to audit the code for...

> The file size may not change until the AIOs
> complete, and so doing something like:
> 
> fd = open(O_TRUNC)
> io_prep_write(fd, 4k, 4k)
> io_submit()
> fstat(fd)
> if (st.size == 0)
>   printf("size is zero")
> 
> Will output "size is zero" more often than not. The behaviour is actually
> dependent on the underlying storage. e.g. pmem/ramdisk will never output
> "size is zero" because they are synchronous storage and so AIO is not
> possible, whilst a spinning disk will almost always report "size is zero"
> because it takes milliseconds for the IO to complete.
> 
> IOWs, using AIO+DIO to extend files requires the application to be very
> careful about asynchronous size updates, and as such if it is deciding to
> extend files manually itself it needs to serialise all non-IO operations
> that either depend on the size being correct or modify the inode size
> against AIO writes in progress.
> 
> The filesystem can't do this for the application - the filesystem has no
> knowledge of the ordering of operations that the application requires. Hence
> the responsibility for ensuring AIO-DIO is coherent at the applicaiton level
> lies with the application, not the filesystem.
> 
> > Because that’s such a corner case (otherwise people would have noticed this
> > problem before), I don’t think it should have much impact on performance. 
> > (And we have all the infrastructure in place to prevent concurrent writes on
> > certain conditions, so the patch just adds 42 lines of code.)
> 
> Great!
> 
> > It’s just that we are still a bit uncomfortable about keeping that
> > workaround in qemu indefinitely when it’s probably rarely going to be
> > necessary after a year or so.
> 
> At a fundamental level, AIO+DIO requires applications to manage their data
> integrity directly and correctly. That means applications need to serialise
> file extension operations themselves via similar mechanisms that filesystems
> do for normal synchronous IO.

Well, we (qemu) generally don’t do ftruncate() while I/O is ongoing.  Or at least we try to (except for when we have a bug, as described before :-/).

Again, thanks a lot for the very detailed explanations.


(In reply to Eric Sandeen from comment #8)
> Max, can you update your testcase w/ your copyright & GPL license so that it
> can be included in the xfstests regression suite?

Sure.

Max

Comment 10 Max Reitz 2019-10-29 07:33:41 UTC
Created attachment 1630037 [details]
Test program to demonstrate the issue (under MIT license)

Comment 11 Max Reitz 2019-10-29 13:15:47 UTC
FWIW, applying “xfs: properly serialise fallocate against AIO+DIO” fixes the original qemu problem (BZ 1751934) for me.

Max

Comment 14 Eric Sandeen 2019-12-11 21:50:27 UTC
*** Bug 1782090 has been marked as a duplicate of this bug. ***

Comment 15 Bruno Meneguele 2019-12-14 12:36:28 UTC
Patch(es) available on kernel-4.18.0-165.el8

