1684780 – aio O_DIRECT writes to non-page-aligned file locations on ext4 can result in the overlapped portion of the page containing zeros

RHEL Engineering is moving the tracking of its product development work on RHEL 6 through RHEL 9 to Red Hat Jira (issues.redhat.com). If you're a Red Hat customer, please continue to file support cases via the Red Hat customer portal. If you're not, please head to the "RHEL project" in Red Hat Jira and file new tickets here. Individual Bugzilla bugs in the statuses "NEW", "ASSIGNED", and "POST" are being migrated throughout September 2023. Bugs of Red Hat partners with an assigned Engineering Partner Manager (EPM) are migrated in late September as per pre-agreed dates. Bugs against components "kernel", "kernel-rt", and "kpatch" are only migrated if still in "NEW" or "ASSIGNED". If you cannot log in to RH Jira, please consult article #7032570. That failing, please send an e-mail to the RH Jira admins at rh-issues@redhat.com to troubleshoot your issue as a user management inquiry. The email creates a ServiceNow ticket with Red Hat. Individual Bugzilla bugs that are migrated will be moved to status "CLOSED", resolution "MIGRATED", and set with "MigratedToJIRA" in "Keywords". The link to the successor Jira issue will be found under "Links", have a little "two-footprint" icon next to it, and direct you to the "RHEL project" in Red Hat Jira (issue links are of type "https://issues.redhat.com/browse/RHEL-XXXX", where "X" is a digit). This same link will be available in a blue banner at the top of the page informing you that that bug has been migrated.

Bug 1684780 - aio O_DIRECT writes to non-page-aligned file locations on ext4 can result in the overlapped portion of the page containing zeros

Summary: aio O_DIRECT writes to non-page-aligned file locations on ext4 can result in ...

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	Red Hat Enterprise Linux 7
Classification:	Red Hat
Component:	kernel
Sub Component:
Version:	7.6
Hardware:	Unspecified
OS:	Unspecified
Priority:	urgent
Severity:	urgent
Target Milestone:	rc
Target Release:	---
Assignee:	Lukáš Czerner
QA Contact:	Boyang Xue
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:	1690497 1693561
TreeView+	depends on / blocked

Reported:	2019-03-02 12:05 UTC by Frank Sorenson
Modified:	2021-08-30 12:08 UTC (History)
CC List:	15 users (show)
Fixed In Version:	kernel-3.10.0-1031.el7
Doc Type:	If docs needed, set a value
Doc Text:
Clone Of:
Clones:	1690497 1693561 (view as bug list)
Environment:
Last Closed:	2019-08-06 12:28:05 UTC
Target Upstream Version:
Embargoed:

Attachments	(Terms of Use)
reproducer (3.66 KB, text/x-csrc) 2019-03-02 12:05 UTC, Frank Sorenson	no flags	Details
updated/simplified reproducer (2.31 KB, text/x-csrc) 2019-03-07 21:28 UTC, Frank Sorenson	no flags	Details
updated updated reproducer (2.32 KB, text/x-csrc) 2019-03-11 16:55 UTC, Frank Sorenson	no flags	Details
Show Obsolete (1) View All

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Red Hat Knowledge Base (Solution)	3953911	0	Troubleshoot	None	Possible Oracle corruption when using filesystemio_options=SETALL	2019-03-05 02:30:18 UTC
Red Hat Product Errata	RHSA-2019:2029	0	None	None	None	2019-08-06 12:28:28 UTC

Description Frank Sorenson 2019-03-02 12:05:05 UTC

Created attachment 1540033 [details]
reproducer

Description of problem:

Sequential non-page-aligned aio writes can result in a portion of the shared page containing all zeros.

This only occurs on ext4 with aio + O_DIRECT


         **************** **************** ****************
         *    page 1    * *    page 2    * *    page 3    *
         **************** **************** ****************
existing 0000000000000000 0000000000000000 0000000000000000
write 1    AAAAAAAAAAAAAA AA
write 2                     BBBBBBBBBBBBBB BB

result   00AAAAAAAAAAAAAA 00BBBBBBBBBBBBBB BB00000000000000

desired  00AAAAAAAAAAAAAA AABBBBBBBBBBBBBB BB00000000000000


Version-Release number of selected component (if applicable):

RHEL 7.4+ (and quite probably earlier as well)
also appears to occur up through at least 4.19


How reproducible:

reproducer attached


Steps to Reproduce:

reproducer submits several sequential writes which are not aligned to 4 KiB (these begin 512 bytes into the page):

    ftruncate(fd, 65012224)
    io_prep_pwrite(iocbs[0], fd, buf[0], 1048576, 63963648);
    io_prep_pwrite(iocbs[1], fd, buf[1], 1048576, 65012224);

    io_prep_pwrite(iocbs[2], fd, buf[2], 1048576, 66060800);
    io_prep_pwrite(iocbs[3], fd, buf[3], 1048576, 67109376);

    io_submit(io_ctx, 1, &iocbs[0]);
    io_submit(io_ctx, 1, &iocbs[1]);
    io_submit(io_ctx, 1, &iocbs[2]);
    io_submit(io_ctx, 1, &iocbs[3]);

    io_getevents(io_ctx, 4, 4, events, NULL)


Actual results:

a portion of the page contains all 0s

result   00AAAAAAAAAAAAAA 00BBBBBBBBBBBBBB BB00000000000000


Expected results:

the writes occur to the expected file locations

desired  00AAAAAAAAAAAAAA AABBBBBBBBBBBBBB BB00000000000000



Additional info:

introducing a delay between each io_submit can prevent the writes from interfering with each other

does not occur on xfs, nfs, etc.

Comment 2 Frank Sorenson 2019-03-03 06:34:34 UTC

This particular condition is obviously known to be problematic, as shown by numerous comments in the ext4 kernel code:

for example, commit e9e3bcecf44c0, which went into 2.6.38 added:
/*
 * This tests whether the IO in question is block-aligned or not.
 * Ext4 utilizes unwritten extents when hole-filling during direct IO, and they
 * are converted to written only after the IO is complete.  Until they are
 * mapped, these blocks appear as holes, so dio_zero_block() will assume that
 * it needs to zero out portions of the start and/or end block.  If 2 AIO
 * threads are at work on the same unwritten block, they must be synchronized
 * or one thread will zero the other's data, causing corruption.
 */

and upstream commit e142d05263a4b went into 4.6 added:

+        * Unaligned direct AIO must be serialized among each other as zeroing
+        * of partial blocks of two competing unaligned AIOs can result in data
+        * corruption.

however I've reproduced this with 4.19+, so this does not appear resolved.

Comment 15 Frank Sorenson 2019-03-07 21:28:40 UTC

Created attachment 1541975 [details]
updated/simplified reproducer

Here's a much-simplified reproducer

Comment 16 Zorro Lang 2019-03-09 13:19:29 UTC

(In reply to Frank Sorenson from comment #15)
> Created attachment 1541975 [details]
> updated/simplified reproducer
> 
> Here's a much-simplified reproducer

Can someone help to explain why the reopen is needed after ftruncate?

        ftruncate(fd, WRITE0_SIZE);
        close(fd);
        ....
        fd = open(filename, O_RDWR|O_DIRECT);

I've tried, if no this reopen, can't reproduce this bug.


Thanks,
Zorro

Comment 17 Frank Sorenson 2019-03-11 16:54:16 UTC

Okay, the close()/open() can be eliminated if you initially open the file with 'O_DIRECT' as well (the unlink() can also be removed, since the following open() call includes O_TRUNC):

        if ((fd = open(filename, O_RDWR|O_CREAT|O_TRUNC|O_DIRECT, 0660)) == -1)

after changing the initial open options like this, the bug seems to reproduce consistently again.


However, also note that the file corruption is dependent on timing of the IOs, so there is a possibility that the test may return SUCCESS, rather than FAILURE; some light testing leads me to believe this may occur once about every 150-250 runs (on my test VMs).  Separating the ftruncate() from the io_submit() with the close()/open() appears to double the number of iterations between successful runs (failures to FAIL?).

Might be best to run the test several times, just in case it doesn't reproduce the bug the first time (though I suppose a very slow machine could potentially always return SUCCESS, even though the bug is still there).

I'm attaching an updated reproducer that tests multiple times, and eliminates the reopen.  With testing 5 times, I haven't had it incorrectly report SUCCESS in hundreds of thousands of runs.

Comment 18 Frank Sorenson 2019-03-11 16:55:13 UTC

Created attachment 1542966 [details]
updated updated reproducer

Comment 19 Zorro Lang 2019-03-12 03:26:37 UTC

(In reply to Frank Sorenson from comment #17)
> Okay, the close()/open() can be eliminated if you initially open the file
> with 'O_DIRECT' as well (the unlink() can also be removed, since the
> following open() call includes O_TRUNC):
> 
>         if ((fd = open(filename, O_RDWR|O_CREAT|O_TRUNC|O_DIRECT, 0660)) ==
> -1)

Oh, sorry I didn't notice that the 1st open didn't use 'O_DIRECT'.

I've written a case for xfstests, but I still have one question which I don't know
how to comment it: why the ftruncate is necessary? And looks like must truncate
to the end of the 1st write operation (offset + 1st write size) to reproduce this bug.

Thanks,
Zorro

> 
> after changing the initial open options like this, the bug seems to
> reproduce consistently again.
> 
> 
> However, also note that the file corruption is dependent on timing of the
> IOs, so there is a possibility that the test may return SUCCESS, rather than
> FAILURE; some light testing leads me to believe this may occur once about
> every 150-250 runs (on my test VMs).  Separating the ftruncate() from the
> io_submit() with the close()/open() appears to double the number of
> iterations between successful runs (failures to FAIL?).
> 
> Might be best to run the test several times, just in case it doesn't
> reproduce the bug the first time (though I suppose a very slow machine could
> potentially always return SUCCESS, even though the bug is still there).
> 
> I'm attaching an updated reproducer that tests multiple times, and
> eliminates the reopen.  With testing 5 times, I haven't had it incorrectly
> report SUCCESS in hundreds of thousands of runs.

Comment 20 Lukáš Czerner 2019-03-12 06:35:04 UTC

Hi Zorro,

the ftruncate is absolutely necessary because this bug is about writes at the i_size boundary, or rather in the same fs block as i_size (see the description of my ext4 patch).

https://patchwork.ozlabs.org/patch/1052277/

-Lukas

Comment 21 Zorro Lang 2019-03-12 16:43:46 UTC

(In reply to Lukáš Czerner from comment #20)
> Hi Zorro,
> 
> the ftruncate is absolutely necessary because this bug is about writes at
> the i_size boundary, or rather in the same fs block as i_size (see the
> description of my ext4 patch).
> 
> https://patchwork.ozlabs.org/patch/1052277/

Thanks Lukas, finally I can understand all things by reading the patch:)
I've written a case to xfstests, please feel free to review it if you'd like.

> 
> -Lukas

Comment 27 Bruno Meneguele 2019-03-27 20:57:00 UTC

Patch(es) committed on kernel-3.10.0-1031.el7

Comment 37 Eric Sandeen 2019-05-02 17:34:53 UTC

*** Bug 1699799 has been marked as a duplicate of this bug. ***

Comment 40 errata-xmlrpc 2019-08-06 12:28:05 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2019:2029

Note You need to log in before you can comment on or make changes to this bug.