Bug 1684780
Summary: | aio O_DIRECT writes to non-page-aligned file locations on ext4 can result in the overlapped portion of the page containing zeros | ||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|
Product: | Red Hat Enterprise Linux 7 | Reporter: | Frank Sorenson <fsorenso> | ||||||||
Component: | kernel | Assignee: | Lukáš Czerner <lczerner> | ||||||||
kernel sub component: | ext* | QA Contact: | Boyang Xue <bxue> | ||||||||
Status: | CLOSED ERRATA | Docs Contact: | |||||||||
Severity: | urgent | ||||||||||
Priority: | urgent | CC: | bxue, dhoward, dwysocha, esandeen, herbert.van.den.bergh, ibodunov, jmagrini, jshivers, lczerner, mlinden, nmurray, pdwyer, plambri, xzhou, zlang | ||||||||
Version: | 7.6 | Keywords: | Reproducer, ZStream | ||||||||
Target Milestone: | rc | ||||||||||
Target Release: | --- | ||||||||||
Hardware: | Unspecified | ||||||||||
OS: | Unspecified | ||||||||||
Whiteboard: | |||||||||||
Fixed In Version: | kernel-3.10.0-1031.el7 | Doc Type: | If docs needed, set a value | ||||||||
Doc Text: | Story Points: | --- | |||||||||
Clone Of: | |||||||||||
: | 1690497 1693561 (view as bug list) | Environment: | |||||||||
Last Closed: | 2019-08-06 12:28:05 UTC | Type: | Bug | ||||||||
Regression: | --- | Mount Type: | --- | ||||||||
Documentation: | --- | CRM: | |||||||||
Verified Versions: | Category: | --- | |||||||||
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |||||||||
Cloudforms Team: | --- | Target Upstream Version: | |||||||||
Embargoed: | |||||||||||
Bug Depends On: | |||||||||||
Bug Blocks: | 1690497, 1693561 | ||||||||||
Attachments: |
|
This particular condition is obviously known to be problematic, as shown by numerous comments in the ext4 kernel code: for example, commit e9e3bcecf44c0, which went into 2.6.38 added: /* * This tests whether the IO in question is block-aligned or not. * Ext4 utilizes unwritten extents when hole-filling during direct IO, and they * are converted to written only after the IO is complete. Until they are * mapped, these blocks appear as holes, so dio_zero_block() will assume that * it needs to zero out portions of the start and/or end block. If 2 AIO * threads are at work on the same unwritten block, they must be synchronized * or one thread will zero the other's data, causing corruption. */ and upstream commit e142d05263a4b went into 4.6 added: + * Unaligned direct AIO must be serialized among each other as zeroing + * of partial blocks of two competing unaligned AIOs can result in data + * corruption. however I've reproduced this with 4.19+, so this does not appear resolved. Created attachment 1541975 [details]
updated/simplified reproducer
Here's a much-simplified reproducer
(In reply to Frank Sorenson from comment #15) > Created attachment 1541975 [details] > updated/simplified reproducer > > Here's a much-simplified reproducer Can someone help to explain why the reopen is needed after ftruncate? ftruncate(fd, WRITE0_SIZE); close(fd); .... fd = open(filename, O_RDWR|O_DIRECT); I've tried, if no this reopen, can't reproduce this bug. Thanks, Zorro Okay, the close()/open() can be eliminated if you initially open the file with 'O_DIRECT' as well (the unlink() can also be removed, since the following open() call includes O_TRUNC): if ((fd = open(filename, O_RDWR|O_CREAT|O_TRUNC|O_DIRECT, 0660)) == -1) after changing the initial open options like this, the bug seems to reproduce consistently again. However, also note that the file corruption is dependent on timing of the IOs, so there is a possibility that the test may return SUCCESS, rather than FAILURE; some light testing leads me to believe this may occur once about every 150-250 runs (on my test VMs). Separating the ftruncate() from the io_submit() with the close()/open() appears to double the number of iterations between successful runs (failures to FAIL?). Might be best to run the test several times, just in case it doesn't reproduce the bug the first time (though I suppose a very slow machine could potentially always return SUCCESS, even though the bug is still there). I'm attaching an updated reproducer that tests multiple times, and eliminates the reopen. With testing 5 times, I haven't had it incorrectly report SUCCESS in hundreds of thousands of runs. Created attachment 1542966 [details]
updated updated reproducer
(In reply to Frank Sorenson from comment #17) > Okay, the close()/open() can be eliminated if you initially open the file > with 'O_DIRECT' as well (the unlink() can also be removed, since the > following open() call includes O_TRUNC): > > if ((fd = open(filename, O_RDWR|O_CREAT|O_TRUNC|O_DIRECT, 0660)) == > -1) Oh, sorry I didn't notice that the 1st open didn't use 'O_DIRECT'. I've written a case for xfstests, but I still have one question which I don't know how to comment it: why the ftruncate is necessary? And looks like must truncate to the end of the 1st write operation (offset + 1st write size) to reproduce this bug. Thanks, Zorro > > after changing the initial open options like this, the bug seems to > reproduce consistently again. > > > However, also note that the file corruption is dependent on timing of the > IOs, so there is a possibility that the test may return SUCCESS, rather than > FAILURE; some light testing leads me to believe this may occur once about > every 150-250 runs (on my test VMs). Separating the ftruncate() from the > io_submit() with the close()/open() appears to double the number of > iterations between successful runs (failures to FAIL?). > > Might be best to run the test several times, just in case it doesn't > reproduce the bug the first time (though I suppose a very slow machine could > potentially always return SUCCESS, even though the bug is still there). > > I'm attaching an updated reproducer that tests multiple times, and > eliminates the reopen. With testing 5 times, I haven't had it incorrectly > report SUCCESS in hundreds of thousands of runs. Hi Zorro, the ftruncate is absolutely necessary because this bug is about writes at the i_size boundary, or rather in the same fs block as i_size (see the description of my ext4 patch). https://patchwork.ozlabs.org/patch/1052277/ -Lukas (In reply to Lukáš Czerner from comment #20) > Hi Zorro, > > the ftruncate is absolutely necessary because this bug is about writes at > the i_size boundary, or rather in the same fs block as i_size (see the > description of my ext4 patch). > > https://patchwork.ozlabs.org/patch/1052277/ Thanks Lukas, finally I can understand all things by reading the patch:) I've written a case to xfstests, please feel free to review it if you'd like. > > -Lukas Patch(es) committed on kernel-3.10.0-1031.el7 *** Bug 1699799 has been marked as a duplicate of this bug. *** Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2019:2029 |
Created attachment 1540033 [details] reproducer Description of problem: Sequential non-page-aligned aio writes can result in a portion of the shared page containing all zeros. This only occurs on ext4 with aio + O_DIRECT **************** **************** **************** * page 1 * * page 2 * * page 3 * **************** **************** **************** existing 0000000000000000 0000000000000000 0000000000000000 write 1 AAAAAAAAAAAAAA AA write 2 BBBBBBBBBBBBBB BB result 00AAAAAAAAAAAAAA 00BBBBBBBBBBBBBB BB00000000000000 desired 00AAAAAAAAAAAAAA AABBBBBBBBBBBBBB BB00000000000000 Version-Release number of selected component (if applicable): RHEL 7.4+ (and quite probably earlier as well) also appears to occur up through at least 4.19 How reproducible: reproducer attached Steps to Reproduce: reproducer submits several sequential writes which are not aligned to 4 KiB (these begin 512 bytes into the page): ftruncate(fd, 65012224) io_prep_pwrite(iocbs[0], fd, buf[0], 1048576, 63963648); io_prep_pwrite(iocbs[1], fd, buf[1], 1048576, 65012224); io_prep_pwrite(iocbs[2], fd, buf[2], 1048576, 66060800); io_prep_pwrite(iocbs[3], fd, buf[3], 1048576, 67109376); io_submit(io_ctx, 1, &iocbs[0]); io_submit(io_ctx, 1, &iocbs[1]); io_submit(io_ctx, 1, &iocbs[2]); io_submit(io_ctx, 1, &iocbs[3]); io_getevents(io_ctx, 4, 4, events, NULL) Actual results: a portion of the page contains all 0s result 00AAAAAAAAAAAAAA 00BBBBBBBBBBBBBB BB00000000000000 Expected results: the writes occur to the expected file locations desired 00AAAAAAAAAAAAAA AABBBBBBBBBBBBBB BB00000000000000 Additional info: introducing a delay between each io_submit can prevent the writes from interfering with each other does not occur on xfs, nfs, etc.