Bug 1684780

Summary: aio O_DIRECT writes to non-page-aligned file locations on ext4 can result in the overlapped portion of the page containing zeros
Product: Red Hat Enterprise Linux 7 Reporter: Frank Sorenson <fsorenso>
Component: kernelAssignee: Lukáš Czerner <lczerner>
kernel sub component: ext* QA Contact: Boyang Xue <bxue>
Status: CLOSED ERRATA Docs Contact:
Severity: urgent    
Priority: urgent CC: bxue, dhoward, dwysocha, esandeen, herbert.van.den.bergh, ibodunov, jmagrini, jshivers, lczerner, mlinden, nmurray, pdwyer, plambri, xzhou, zlang
Version: 7.6Keywords: Reproducer, ZStream
Target Milestone: rc   
Target Release: ---   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: kernel-3.10.0-1031.el7 Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
: 1690497 1693561 (view as bug list) Environment:
Last Closed: 2019-08-06 12:28:05 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 1690497, 1693561    
Attachments:
Description Flags
reproducer
none
updated/simplified reproducer
none
updated updated reproducer none

Description Frank Sorenson 2019-03-02 12:05:05 UTC
Created attachment 1540033 [details]
reproducer

Description of problem:

Sequential non-page-aligned aio writes can result in a portion of the shared page containing all zeros.

This only occurs on ext4 with aio + O_DIRECT


         **************** **************** ****************
         *    page 1    * *    page 2    * *    page 3    *
         **************** **************** ****************
existing 0000000000000000 0000000000000000 0000000000000000
write 1    AAAAAAAAAAAAAA AA
write 2                     BBBBBBBBBBBBBB BB

result   00AAAAAAAAAAAAAA 00BBBBBBBBBBBBBB BB00000000000000

desired  00AAAAAAAAAAAAAA AABBBBBBBBBBBBBB BB00000000000000


Version-Release number of selected component (if applicable):

RHEL 7.4+ (and quite probably earlier as well)
also appears to occur up through at least 4.19


How reproducible:

reproducer attached


Steps to Reproduce:

reproducer submits several sequential writes which are not aligned to 4 KiB (these begin 512 bytes into the page):

    ftruncate(fd, 65012224)
    io_prep_pwrite(iocbs[0], fd, buf[0], 1048576, 63963648);
    io_prep_pwrite(iocbs[1], fd, buf[1], 1048576, 65012224);

    io_prep_pwrite(iocbs[2], fd, buf[2], 1048576, 66060800);
    io_prep_pwrite(iocbs[3], fd, buf[3], 1048576, 67109376);

    io_submit(io_ctx, 1, &iocbs[0]);
    io_submit(io_ctx, 1, &iocbs[1]);
    io_submit(io_ctx, 1, &iocbs[2]);
    io_submit(io_ctx, 1, &iocbs[3]);

    io_getevents(io_ctx, 4, 4, events, NULL)


Actual results:

a portion of the page contains all 0s

result   00AAAAAAAAAAAAAA 00BBBBBBBBBBBBBB BB00000000000000


Expected results:

the writes occur to the expected file locations

desired  00AAAAAAAAAAAAAA AABBBBBBBBBBBBBB BB00000000000000



Additional info:

introducing a delay between each io_submit can prevent the writes from interfering with each other

does not occur on xfs, nfs, etc.

Comment 2 Frank Sorenson 2019-03-03 06:34:34 UTC
This particular condition is obviously known to be problematic, as shown by numerous comments in the ext4 kernel code:

for example, commit e9e3bcecf44c0, which went into 2.6.38 added:
/*
 * This tests whether the IO in question is block-aligned or not.
 * Ext4 utilizes unwritten extents when hole-filling during direct IO, and they
 * are converted to written only after the IO is complete.  Until they are
 * mapped, these blocks appear as holes, so dio_zero_block() will assume that
 * it needs to zero out portions of the start and/or end block.  If 2 AIO
 * threads are at work on the same unwritten block, they must be synchronized
 * or one thread will zero the other's data, causing corruption.
 */

and upstream commit e142d05263a4b went into 4.6 added:

+        * Unaligned direct AIO must be serialized among each other as zeroing
+        * of partial blocks of two competing unaligned AIOs can result in data
+        * corruption.

however I've reproduced this with 4.19+, so this does not appear resolved.

Comment 15 Frank Sorenson 2019-03-07 21:28:40 UTC
Created attachment 1541975 [details]
updated/simplified reproducer

Here's a much-simplified reproducer

Comment 16 Zorro Lang 2019-03-09 13:19:29 UTC
(In reply to Frank Sorenson from comment #15)
> Created attachment 1541975 [details]
> updated/simplified reproducer
> 
> Here's a much-simplified reproducer

Can someone help to explain why the reopen is needed after ftruncate?

        ftruncate(fd, WRITE0_SIZE);
        close(fd);
        ....
        fd = open(filename, O_RDWR|O_DIRECT);

I've tried, if no this reopen, can't reproduce this bug.


Thanks,
Zorro

Comment 17 Frank Sorenson 2019-03-11 16:54:16 UTC
Okay, the close()/open() can be eliminated if you initially open the file with 'O_DIRECT' as well (the unlink() can also be removed, since the following open() call includes O_TRUNC):

        if ((fd = open(filename, O_RDWR|O_CREAT|O_TRUNC|O_DIRECT, 0660)) == -1)

after changing the initial open options like this, the bug seems to reproduce consistently again.


However, also note that the file corruption is dependent on timing of the IOs, so there is a possibility that the test may return SUCCESS, rather than FAILURE; some light testing leads me to believe this may occur once about every 150-250 runs (on my test VMs).  Separating the ftruncate() from the io_submit() with the close()/open() appears to double the number of iterations between successful runs (failures to FAIL?).

Might be best to run the test several times, just in case it doesn't reproduce the bug the first time (though I suppose a very slow machine could potentially always return SUCCESS, even though the bug is still there).

I'm attaching an updated reproducer that tests multiple times, and eliminates the reopen.  With testing 5 times, I haven't had it incorrectly report SUCCESS in hundreds of thousands of runs.

Comment 18 Frank Sorenson 2019-03-11 16:55:13 UTC
Created attachment 1542966 [details]
updated updated reproducer

Comment 19 Zorro Lang 2019-03-12 03:26:37 UTC
(In reply to Frank Sorenson from comment #17)
> Okay, the close()/open() can be eliminated if you initially open the file
> with 'O_DIRECT' as well (the unlink() can also be removed, since the
> following open() call includes O_TRUNC):
> 
>         if ((fd = open(filename, O_RDWR|O_CREAT|O_TRUNC|O_DIRECT, 0660)) ==
> -1)

Oh, sorry I didn't notice that the 1st open didn't use 'O_DIRECT'.

I've written a case for xfstests, but I still have one question which I don't know
how to comment it: why the ftruncate is necessary? And looks like must truncate
to the end of the 1st write operation (offset + 1st write size) to reproduce this bug.

Thanks,
Zorro

> 
> after changing the initial open options like this, the bug seems to
> reproduce consistently again.
> 
> 
> However, also note that the file corruption is dependent on timing of the
> IOs, so there is a possibility that the test may return SUCCESS, rather than
> FAILURE; some light testing leads me to believe this may occur once about
> every 150-250 runs (on my test VMs).  Separating the ftruncate() from the
> io_submit() with the close()/open() appears to double the number of
> iterations between successful runs (failures to FAIL?).
> 
> Might be best to run the test several times, just in case it doesn't
> reproduce the bug the first time (though I suppose a very slow machine could
> potentially always return SUCCESS, even though the bug is still there).
> 
> I'm attaching an updated reproducer that tests multiple times, and
> eliminates the reopen.  With testing 5 times, I haven't had it incorrectly
> report SUCCESS in hundreds of thousands of runs.

Comment 20 Lukáš Czerner 2019-03-12 06:35:04 UTC
Hi Zorro,

the ftruncate is absolutely necessary because this bug is about writes at the i_size boundary, or rather in the same fs block as i_size (see the description of my ext4 patch).

https://patchwork.ozlabs.org/patch/1052277/

-Lukas

Comment 21 Zorro Lang 2019-03-12 16:43:46 UTC
(In reply to Lukáš Czerner from comment #20)
> Hi Zorro,
> 
> the ftruncate is absolutely necessary because this bug is about writes at
> the i_size boundary, or rather in the same fs block as i_size (see the
> description of my ext4 patch).
> 
> https://patchwork.ozlabs.org/patch/1052277/

Thanks Lukas, finally I can understand all things by reading the patch:)
I've written a case to xfstests, please feel free to review it if you'd like.

> 
> -Lukas

Comment 27 Bruno Meneguele 2019-03-27 20:57:00 UTC
Patch(es) committed on kernel-3.10.0-1031.el7

Comment 37 Eric Sandeen 2019-05-02 17:34:53 UTC
*** Bug 1699799 has been marked as a duplicate of this bug. ***

Comment 40 errata-xmlrpc 2019-08-06 12:28:05 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2019:2029