Bug 1845326
| Summary: | libaio is returning duplicate events | |||
|---|---|---|---|---|
| Product: | Red Hat Enterprise Linux 7 | Reporter: | Clebert Suconic <csuconic> | |
| Component: | kernel | Assignee: | Jeff Moyer <jmoyer> | |
| kernel sub component: | Storage - Other | QA Contact: | ChanghuiZhong <czhong> | |
| Status: | CLOSED ERRATA | Docs Contact: | ||
| Severity: | unspecified | |||
| Priority: | high | CC: | afield, czhong, dhoward, dhowells, jmoyer, jpittman, kdsouza, minlei, mmillson, mszeredi, pvlasin, revers, rmonther, skudupud, smatasar, swhiteho, tbueno | |
| Version: | 7.8 | Keywords: | ZStream | |
| Target Milestone: | rc | |||
| Target Release: | --- | |||
| Hardware: | Unspecified | |||
| OS: | Unspecified | |||
| Whiteboard: | ||||
| Fixed In Version: | kernel-3.10.0-1152.el7 | Doc Type: | If docs needed, set a value | |
| Doc Text: | Story Points: | --- | ||
| Clone Of: | ||||
| : | 1850055 (view as bug list) | Environment: | ||
| Last Closed: | 2020-09-29 21:16:46 UTC | Type: | Bug | |
| Regression: | --- | Mount Type: | --- | |
| Documentation: | --- | CRM: | ||
| Verified Versions: | Category: | --- | ||
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | ||
| Cloudforms Team: | --- | Target Upstream Version: | ||
| Embargoed: | ||||
| Bug Depends On: | ||||
| Bug Blocks: | 1850055 | |||
|
Description
Clebert Suconic
2020-06-08 21:49:09 UTC
I was able to reproduce it on: Red Hat Enterprise Linux Server release 7.9 Beta (Maipo) Linux 3.10.0-1136.el7.x86_64 #1 SMP Fri Apr 17 11:40:59 EDT 2020 x86_64 x86_64 x86_64 GNU/Linux Red Hat Enterprise Linux Server release 7.8 (Maipo) Linux 3.10.0-1127.8.2.el7.x86_64 #1 SMP Thu May 7 19:30:37 EDT 2020 x86_64 x86_64 x86_64 GNU/Linux Looks like the problem do not occur on: Red Hat Enterprise Linux Server release 6.10 (Santiago) Linux host-10-0-133-19 2.6.32-754.28.1.el6.x86_64 #1 SMP Fri Jan 31 06:05:42 EST 2020 x86_64 x86_64 x86_64 GNU/Linux Red Hat Enterprise Linux Server release 7.7 (Maipo) Linux 3.10.0-1062.18.1.el7.x86_64 #1 SMP Wed Feb 12 14:08:31 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux Red Hat Enterprise Linux release 8.2 (Ootpa) Linux 4.18.0-193.1.2.el8_2.x86_64 #1 SMP Thu May 7 16:37:54 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux The sequence of events goes like this:
Userspace code checks the ring. head is equal to tail (no events in the completion queue), so it calls the io_getevents system call.
The io_getevents system call waits for events to become available. One event completes, and this code runs:
aio_complete():
smp_wmb(); /* make event visible before updating tail */
ctx->tail = tail;
ring = kmap_atomic(ctx->ring_pages[0]);
head = ring->head;
ring->tail = tail;
kunmap_atomic(ring);
flush_dcache_page(ctx->ring_pages[0]);
Then read_events rings does the following:
aio_read_events_ring():
ring = kmap_atomic(ctx->ring_pages[0]);
head = ring->head;
kunmap_atomic(ring);
/*
* Ensure that once we've read the current tail pointer, that
* we also see the events that were stored up to the tail.
*/
tail = ctx->tail;
smp_rmb();
Notice how we are using the kernel's copy of the tail pointer, stored in ctx (not the copy in the ring). Also notice that the aio_complete code updates the in-kernel copy of tail before updating the shared copy in the ring. What happens is we enter the kernel with head equal to tail. When we return from io_getevents, both head and tail have advanced. However, there is a race condition by which the userspace code sees the udpated head pointer, but NOT the updated tail pointer. This makes it appear as though the ring buffer is full of completions.
This problem was introduced by RHEL 7 commit e62960f36f9ed ("[fs] aio: get rid of unnecessary locking in aio_read_events_ring"). That commit does this:
@@ -1118,21 +1117,11 @@ static long aio_read_events_ring(struct kioctx *ctx,
head %= ctx->nr_events;
}
- /*
- * The aio ring pages are user space pages, so they can be migrated.
- * When writing to an aio ring page, we should ensure the page is not
- * being migrated. Aio page migration procedure is protected by
- * ctx->completion_lock, so we add this lock here.
- */
- spin_lock_irqsave(&ctx->completion_lock, flags);
-
ring = kmap_atomic(ctx->ring_pages[0]);
ring->head = head;
kunmap_atomic(ring);
flush_dcache_page(ctx->ring_pages[0]);
- spin_unlock_irqrestore(&ctx->completion_lock, flags);
-
The completion_lock is also taken in the aio_complete path. By taking that lock in this code, we ensure that the contents of ring->tail and ctx->tail are the same. We cannot re-introduce that lock, though, because it was causing long interrupt service routine times in some environments. It turns out the problem was solved upstream as a side-effect of commit 5ffac122dbda8 ("aio: Don't use ctx->tail unnecessarily"). Specifically, this part:
@@ -877,11 +880,12 @@ static long aio_read_events_ring(struct kioctx *ctx,
ring = kmap_atomic(ctx->ring_pages[0]);
head = ring->head;
+ tail = ring->tail;
kunmap_atomic(ring);
- pr_debug("h%u t%u m%u\n", head, ctx->tail, ctx->nr_events);
+ pr_debug("h%u t%u m%u\n", head, tail, ctx->nr_events);
- if (head == ctx->tail)
+ if (head == tail)
This changes the event reaping code from using the kernel copy of tail to the shared copy, which prevents the race from occuring (both the kernel event reaping code and the userspace code operate on the same data, the ring tail pointer). I will attach a targeted fix for this to the bug after it's been thoroughly tested.
It will likely look something like this:
diff --git a/fs/aio.c b/fs/aio.c
index 5f3bd80f5a75..619d6f43b7e3 100644
--- a/fs/aio.c
+++ b/fs/aio.c
@@ -1069,13 +1069,13 @@ static long aio_read_events_ring(struct kioctx *ctx,
/* Access to ->ring_pages here is protected by ctx->ring_lock. */
ring = kmap_atomic(ctx->ring_pages[0]);
head = ring->head;
+ tail = ring->tail;
kunmap_atomic(ring);
/*
* Ensure that once we've read the current tail pointer, that
* we also see the events that were stored up to the tail.
*/
- tail = ctx->tail;
smp_rmb();
pr_debug("h%u t%u m%u\n", head, tail, ctx->nr_events);
@@ -1084,6 +1084,7 @@ static long aio_read_events_ring(struct kioctx *ctx,
goto out;
head %= ctx->nr_events;
+ tail %= ctx->nr_events;
while (ret < nr) {
long avail;
Patch(es) committed on kernel-3.10.0-1152.el7 reproduce this issue with kernel-3.10.0-1127.8.2.el7,output: SUREFIRE-859: ouch!!!! it was null!!!!, system would crash!!!! SUREFIRE-859: ouch!!!! it was null!!!!, system would crash!!!! verified with kernel-3.10.0-1152.el7, this issue has fixed. no iocb duplicated and the fprintf not called. move the status to VERIFIED. Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Important: kernel security, bug fix, and enhancement update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2020:4060 |