Bug 1845326 - libaio is returning duplicate events
Summary: libaio is returning duplicate events
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Linux 7
Classification: Red Hat
Component: kernel
Version: 7.8
Hardware: Unspecified
OS: Unspecified
high
unspecified
Target Milestone: rc
: ---
Assignee: Jeff Moyer
QA Contact: ChanghuiZhong
URL:
Whiteboard:
Depends On:
Blocks: 1850055
TreeView+ depends on / blocked
 
Reported: 2020-06-08 21:49 UTC by Clebert Suconic
Modified: 2021-09-06 15:00 UTC (History)
17 users (show)

Fixed In Version: kernel-3.10.0-1152.el7
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
: 1850055 (view as bug list)
Environment:
Last Closed: 2020-09-29 21:16:46 UTC
Target Upstream Version:


Attachments (Terms of Use)


Links
System ID Private Priority Status Summary Last Updated
Red Hat Knowledge Base (Solution) 5749061 0 None None None 2021-01-27 20:28:41 UTC
Red Hat Product Errata RHSA-2020:4060 0 None None None 2020-09-29 21:17:26 UTC

Description Clebert Suconic 2020-06-08 21:49:09 UTC
Description of problem:

libaio is returning duplicate events, in a case where I'm interchanging io_getevents and user space ring operations.


Version-Release number of selected component (if applicable):

7.8 (3.10.0-1062.el7.x86_64)

How reproducible:
100%. However I could only reproduce this with KVM, and RHEL 7.8. it could be that the hypervisor makes things slower making it more likely to fail.


Steps to Reproduce:

This issue manifested into AMQ Artemis Native.

I isolated the issue on my personal github project.
You will need maven and openjdk installed.

1. git clone https://github.com/clebertsuconic/activemq-artemis-native.git
2. cd activemq-artemis-native
3. git checkout replicate-issue
4. mvn install
5. mvn test

(if you wish to change the JNI library code, you may do the following:)

cmake .
make


Actual results:

On the output you will see results from this printf:

fprintf (stderr, "ouch!!!! it was null!!!!, system would crash!!!! \n");

I'm avoiding the system from crashing with a "hack" in place, by setting the returned data as null and ignore eventual duplicates. (This could be actually a candidate fix, but we need to investigate it first).


Expected results:
no iocb should be duplicated and the fprintf above should not be called.

Additional info:

Comment 5 Tiago Bueno 2020-06-09 19:35:34 UTC
I was able to reproduce it on:

Red Hat Enterprise Linux Server release 7.9 Beta (Maipo)
Linux 3.10.0-1136.el7.x86_64 #1 SMP Fri Apr 17 11:40:59 EDT 2020 x86_64 x86_64 x86_64 GNU/Linux

Red Hat Enterprise Linux Server release 7.8 (Maipo)
Linux 3.10.0-1127.8.2.el7.x86_64 #1 SMP Thu May 7 19:30:37 EDT 2020 x86_64 x86_64 x86_64 GNU/Linux



Looks like the problem do not occur on:

Red Hat Enterprise Linux Server release 6.10 (Santiago)
Linux host-10-0-133-19 2.6.32-754.28.1.el6.x86_64 #1 SMP Fri Jan 31 06:05:42 EST 2020 x86_64 x86_64 x86_64 GNU/Linux

Red Hat Enterprise Linux Server release 7.7 (Maipo)
Linux 3.10.0-1062.18.1.el7.x86_64 #1 SMP Wed Feb 12 14:08:31 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux

Red Hat Enterprise Linux release 8.2 (Ootpa)
Linux 4.18.0-193.1.2.el8_2.x86_64 #1 SMP Thu May 7 16:37:54 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux

Comment 8 Jeff Moyer 2020-06-10 21:06:43 UTC
The sequence of events goes like this:

Userspace code checks the ring.  head is equal to tail (no events in the completion queue), so it calls the io_getevents system call.
The io_getevents system call waits for events to become available.  One event completes, and this code runs:

aio_complete():

        smp_wmb();      /* make event visible before updating tail */

        ctx->tail = tail;

        ring = kmap_atomic(ctx->ring_pages[0]);
        head = ring->head;
        ring->tail = tail;
        kunmap_atomic(ring);
        flush_dcache_page(ctx->ring_pages[0]);

Then read_events rings does the following:

aio_read_events_ring():

        ring = kmap_atomic(ctx->ring_pages[0]);
        head = ring->head;
	kunmap_atomic(ring);

        /*                                                                      
         * Ensure that once we've read the current tail pointer, that           
         * we also see the events that were stored up to the tail.              
         */
        tail = ctx->tail;                                                       
        smp_rmb();

Notice how we are using the kernel's copy of the tail pointer, stored in ctx (not the copy in the ring).  Also notice that the aio_complete code updates the in-kernel copy of tail before updating the shared copy in the ring.  What happens is we enter the kernel with head equal to tail.  When we return from io_getevents, both head and tail have advanced.  However, there is a race condition by which the userspace code sees the udpated head pointer, but NOT the updated tail pointer.  This makes it appear as though the ring buffer is full of completions.

This problem was introduced by RHEL 7 commit e62960f36f9ed ("[fs] aio: get rid of unnecessary locking in aio_read_events_ring").  That commit does this:

@@ -1118,21 +1117,11 @@ static long aio_read_events_ring(struct kioctx *ctx,
                head %= ctx->nr_events;
        }
 
-       /*
-        * The aio ring pages are user space pages, so they can be migrated.
-        * When writing to an aio ring page, we should ensure the page is not
-        * being migrated. Aio page migration procedure is protected by
-        * ctx->completion_lock, so we add this lock here.
-        */
-       spin_lock_irqsave(&ctx->completion_lock, flags);
-
        ring = kmap_atomic(ctx->ring_pages[0]);
        ring->head = head;
        kunmap_atomic(ring);
        flush_dcache_page(ctx->ring_pages[0]);
 
-       spin_unlock_irqrestore(&ctx->completion_lock, flags);
-

The completion_lock is also taken in the aio_complete path.  By taking that lock in this code, we ensure that the contents of ring->tail and ctx->tail are the same.  We cannot re-introduce that lock, though, because it was causing long interrupt service routine times in some environments.  It turns out the problem was solved upstream as a side-effect of commit 5ffac122dbda8 ("aio: Don't use ctx->tail unnecessarily").  Specifically, this part:

@@ -877,11 +880,12 @@ static long aio_read_events_ring(struct kioctx *ctx,
 
        ring = kmap_atomic(ctx->ring_pages[0]);
        head = ring->head;
+       tail = ring->tail;
        kunmap_atomic(ring);
 
-       pr_debug("h%u t%u m%u\n", head, ctx->tail, ctx->nr_events);
+       pr_debug("h%u t%u m%u\n", head, tail, ctx->nr_events);
 
-       if (head == ctx->tail)
+       if (head == tail)

This changes the event reaping code from using the kernel copy of tail to the shared copy, which prevents the race from occuring (both the kernel event reaping code and the userspace code operate on the same data, the ring tail pointer).  I will attach a targeted fix for this to the bug after it's been thoroughly tested.
It will likely look something like this:

diff --git a/fs/aio.c b/fs/aio.c
index 5f3bd80f5a75..619d6f43b7e3 100644
--- a/fs/aio.c
+++ b/fs/aio.c
@@ -1069,13 +1069,13 @@ static long aio_read_events_ring(struct kioctx *ctx,
        /* Access to ->ring_pages here is protected by ctx->ring_lock. */
        ring = kmap_atomic(ctx->ring_pages[0]);
        head = ring->head;
+       tail = ring->tail;
        kunmap_atomic(ring);
 
        /*
         * Ensure that once we've read the current tail pointer, that
         * we also see the events that were stored up to the tail.
         */
-       tail = ctx->tail;
        smp_rmb();
 
        pr_debug("h%u t%u m%u\n", head, tail, ctx->nr_events);
@@ -1084,6 +1084,7 @@ static long aio_read_events_ring(struct kioctx *ctx,
                goto out;
 
        head %= ctx->nr_events;
+       tail %= ctx->nr_events;
 
        while (ret < nr) {
                long avail;

Comment 19 Jan Stancek 2020-06-20 05:30:44 UTC
Patch(es) committed on kernel-3.10.0-1152.el7

Comment 24 ChanghuiZhong 2020-06-30 06:16:21 UTC
reproduce this issue with kernel-3.10.0-1127.8.2.el7,output:
SUREFIRE-859: ouch!!!! it was null!!!!, system would crash!!!! 
SUREFIRE-859: ouch!!!! it was null!!!!, system would crash!!!!  

verified with kernel-3.10.0-1152.el7, this issue has fixed.
no iocb duplicated and the fprintf not called.

move the status to VERIFIED.

Comment 32 errata-xmlrpc 2020-09-29 21:16:46 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Important: kernel security, bug fix, and enhancement update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2020:4060


Note You need to log in before you can comment on or make changes to this bug.