Bug 184086
Summary: | aio_return incorrectly returns 0 sometimes
---|---
Product: | Red Hat Enterprise Linux 3
Reporter: | Matthew Gregan [:kinetik] <kinetik>
Component: | glibc
Assignee: | Jakub Jelinek <jakub>
Status: | CLOSED ERRATA
QA Contact: | Brian Brock <bbrock>
Severity: | medium
Priority: | high
Version: | 3.0
CC: | drepper, olivier, rhentosh, tao
Hardware: | i386
OS: | Linux
Fixed In Version: | RHBA-2007-0471
Doc Type: | Bug Fix
Last Closed: | 2007-06-11 18:49:07 UTC
Bug Blocks: | 179629
Description
Matthew Gregan [:kinetik]
2006-03-06 05:34:45 UTC
Created attachment 125693 [details]
testcase
I've been able to reproduce this bug fairly consistently on one of my systems (i.e. within 4 iterations) using redhatready's STORAGE2 test, and occasionally on another box. STORAGE2 failed because dt was failing with error status 254 (i.e. spurious end of file). My investigation led to the same conclusion as kinetik. There's basically a race in /lib/tls/librt.so.1's kernel_callback():

    int
    aio_error (aiocbp)
         const struct aiocb *aiocbp;
    {
      int ret = aiocbp->__error_code;

      if (ret == EINPROGRESS)
        {
          __aio_read_one_event ();
          ret = aiocbp->__error_code;
        }
      return ret;
    }

    ssize_t
    aio_return (aiocbp)
         struct aiocb *aiocbp;
    {
      if (aiocbp->__error_code == EINPROGRESS)
        __aio_read_one_event ();
      return aiocbp->__return_value;
    }

    static void
    kernel_callback (kctx_t ctx, struct kiocb *kiocb, long res, long res2)
    {
      struct requestlist *req = (struct requestlist *)kiocb;

    => req->aiocbp->aiocb.__error_code = 0;
      req->aiocbp->aiocb.__return_value = res;
      if (res < 0 && res > -1000)
        {
          req->aiocbp->aiocb.__error_code = -res;
          req->aiocbp->aiocb.__return_value = -1;
        }
      __aio_notify (req);
      assert (req->running == allocated);
      req->running = done;
      __aio_remove_krequest (req);
      __aio_free_request (req);
    }

Explanation: kernel_callback() updates __error_code before __return_value. So, if a user calls aio_error() between these two instructions, he gets a return status of 0. That's != EINPROGRESS, so he calls aio_return() to get the number of bytes transferred / error code. aio_return() returns a zero __return_value because it was zeroed out in the issuing path. And in POSIX, 0 bytes transferred means end of file. dt (in this case) gets confused because the I/O is nowhere near the end of the file.
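The window described above can be replayed deterministically in a single thread. The following is only a minimal sketch: `struct model_aiocb` and the `model_*` functions are hypothetical stand-ins invented for illustration, not glibc code, and the two stores of the callback are split into separate functions so that a poll can be placed between them.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <sys/types.h>

/* Hypothetical stand-in for the two aiocb fields involved in the race. */
struct model_aiocb {
    int error_code;        /* models __error_code */
    ssize_t return_value;  /* models __return_value */
};

/* Issuing path: zero the result and mark the request as in flight. */
static void model_submit(struct model_aiocb *cb)
{
    cb->return_value = 0;
    cb->error_code = EINPROGRESS;
}

/* First store of the buggy callback: status is cleared before the result. */
static void model_buggy_store_status(struct model_aiocb *cb)
{
    cb->error_code = 0;
}

/* Second store of the buggy callback: the result arrives too late. */
static void model_buggy_store_result(struct model_aiocb *cb, long res)
{
    cb->return_value = res;
}

/* Caller-side polling, in the style of aio_error() followed by aio_return().
 * Returns 0 while the request is in flight; otherwise stores the result
 * in *out and returns 1. */
static int model_poll(const struct model_aiocb *cb, ssize_t *out)
{
    assert(cb != NULL && out != NULL);
    if (cb->error_code == EINPROGRESS)
        return 0;
    *out = cb->return_value;
    return 1;
}
```

Calling `model_poll()` after `model_buggy_store_status()` but before `model_buggy_store_result()` reports a completed request with a result of 0, which is exactly what a caller would interpret as end of file.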
The proper code should be something like this:

    req->aiocbp->aiocb.__return_value = res;
    write_memory_barrier();
    req->aiocbp->aiocb.__error_code = 0;

I've done a binary patch of /lib/tls/librt.so.1 to reverse the two instructions:

    000036d0 <kernel_callback>:
        36d0: 55                      push   %ebp
        36d1: 89 e5                   mov    %esp,%ebp
        36d3: 56                      push   %esi
        36d4: 8b 75 0c                mov    0xc(%ebp),%esi
        36d7: 8b 55 10                mov    0x10(%ebp),%edx
        36da: 8b 4e 58                mov    0x58(%esi),%ecx
        36dd: 85 d2                   test   %edx,%edx
        36df: c7 41 60 00 00 00 00    movl   $0x0,0x60(%ecx)
        36e6: 8b 4e 58                mov    0x58(%esi),%ecx
        36e9: 89 51 64                mov    %edx,0x64(%ecx)

Patched:

    =>  36df: 89 51 64                mov    %edx,0x64(%ecx)
    =>  36e2: 8b 4e 58                mov    0x58(%esi),%ecx
    =>  36e5: c7 41 60 00 00 00 00    movl   $0x0,0x60(%ecx)

That has passed over 600 iterations of STORAGE2 with no spurious end of file or any other error, versus one failure within 4 iterations previously.

There are a couple of other places in the code where __error_code is updated before __return_value. These would need to be fixed as well. But the most damaging one is the one in kernel_callback().

BTW, looking at the code, the same defect is also present in RHEL4.

Created attachment 132724 [details]
glibc-rtkaio-errretval-order.patch
Untested patch that could fix this.
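For readers without access to the attachment, the ordering proposed earlier in this report (result first, barrier, then status) can be sketched with C11 atomics. This is not the attached patch and not glibc's code: `struct sketch_req` and the `sketch_*` functions are hypothetical names, and a release store paired with an acquire load is assumed here as a stand-in for the `write_memory_barrier()` mentioned above.

```c
#include <assert.h>
#include <errno.h>
#include <stdatomic.h>
#include <stddef.h>
#include <sys/types.h>

/* Hypothetical request record; only the two fields relevant to the race. */
struct sketch_req {
    atomic_int error_code;  /* models __error_code */
    ssize_t return_value;   /* models __return_value */
};

/* Issuing path: zero the result, then mark the request as in flight. */
static void sketch_submit(struct sketch_req *r)
{
    assert(r != NULL);
    r->return_value = 0;
    atomic_store_explicit(&r->error_code, EINPROGRESS, memory_order_relaxed);
}

/* Completion path: write the result FIRST, then publish the status.
 * The release store keeps the result write from being reordered after it. */
static void sketch_complete(struct sketch_req *r, long res)
{
    int err = 0;
    ssize_t ret = res;

    assert(r != NULL);
    if (res < 0 && res > -1000) {  /* same error range test as kernel_callback() */
        err = (int)-res;
        ret = -1;
    }
    r->return_value = ret;
    atomic_store_explicit(&r->error_code, err, memory_order_release);
}

/* Reader side: the acquire load pairs with the release store above. */
static int sketch_error(struct sketch_req *r)
{
    return atomic_load_explicit(&r->error_code, memory_order_acquire);
}

/* Only meaningful after sketch_error() has returned something
 * other than EINPROGRESS. */
static ssize_t sketch_return(struct sketch_req *r)
{
    return r->return_value;
}
```

With this ordering, a reader that observes a status other than EINPROGRESS through `sketch_error()` is guaranteed to see the final result in `sketch_return()`, so the zeroed value from the issuing path can no longer leak out as a false end of file.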
I made the patch public.

I rebuilt glibc 2.3.2-95.39 from the SRPM with the patch provided by Jakub included. I've run my simple reproducer against a local install of the patched glibc (using "LD_LIBRARY_PATH=$(pwd) ./ld-linux.so.2 ~/aio") for around four hours without seeing a recurrence of the race condition. It seems that this patch resolves the problem I reported.

Yep, the patch fixes it for me too. Thanks.

This should be fixed before we go into maintenance and work is already done. PM ACK for 3.9

An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you.

http://rhn.redhat.com/errata/RHBA-2007-0471.html