1845326 – libaio is returning duplicate events

RHEL Engineering is moving the tracking of its product development work on RHEL 6 through RHEL 9 to Red Hat Jira (issues.redhat.com). If you're a Red Hat customer, please continue to file support cases via the Red Hat customer portal. If you're not, please head to the "RHEL project" in Red Hat Jira and file new tickets here. Individual Bugzilla bugs in the statuses "NEW", "ASSIGNED", and "POST" are being migrated throughout September 2023. Bugs of Red Hat partners with an assigned Engineering Partner Manager (EPM) are migrated in late September as per pre-agreed dates. Bugs against components "kernel", "kernel-rt", and "kpatch" are only migrated if still in "NEW" or "ASSIGNED". If you cannot log in to RH Jira, please consult article #7032570. That failing, please send an e-mail to the RH Jira admins at rh-issues@redhat.com to troubleshoot your issue as a user management inquiry. The email creates a ServiceNow ticket with Red Hat. Individual Bugzilla bugs that are migrated will be moved to status "CLOSED", resolution "MIGRATED", and set with "MigratedToJIRA" in "Keywords". The link to the successor Jira issue will be found under "Links", have a little "two-footprint" icon next to it, and direct you to the "RHEL project" in Red Hat Jira (issue links are of type "https://issues.redhat.com/browse/RHEL-XXXX", where "X" is a digit). This same link will be available in a blue banner at the top of the page informing you that that bug has been migrated.

Bug 1845326 - libaio is returning duplicate events

Summary: libaio is returning duplicate events

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	Red Hat Enterprise Linux 7
Classification:	Red Hat
Component:	kernel
Sub Component:
Version:	7.8
Hardware:	Unspecified
OS:	Unspecified
Priority:	high
Severity:	unspecified
Target Milestone:	rc
Target Release:	---
Assignee:	Jeff Moyer
QA Contact:	ChanghuiZhong
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:	1850055
TreeView+	depends on / blocked

Reported:	2020-06-08 21:49 UTC by Clebert Suconic
Modified:	2023-12-15 18:06 UTC (History)
CC List:	17 users (show)
Fixed In Version:	kernel-3.10.0-1152.el7
Doc Type:	If docs needed, set a value
Doc Text:
Clone Of:
Clones:	1850055 (view as bug list)
Environment:
Last Closed:	2020-09-29 21:16:46 UTC
Target Upstream Version:
Embargoed:

Attachments	(Terms of Use)

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Red Hat Knowledge Base (Solution)	5749061	0	None	None	None	2021-01-27 20:28:41 UTC
Red Hat Product Errata	RHSA-2020:4060	0	None	None	None	2020-09-29 21:17:26 UTC

Description Clebert Suconic 2020-06-08 21:49:09 UTC

Description of problem:

libaio is returning duplicate events, in a case where I'm interchanging io_getevents and user space ring operations.


Version-Release number of selected component (if applicable):

7.8 (3.10.0-1062.el7.x86_64)

How reproducible:
100%. However I could only reproduce this with KVM, and RHEL 7.8. it could be that the hypervisor makes things slower making it more likely to fail.


Steps to Reproduce:

This issue manifested into AMQ Artemis Native.

I isolated the issue on my personal github project.
You will need maven and openjdk installed.

1. git clone https://github.com/clebertsuconic/activemq-artemis-native.git
2. cd activemq-artemis-native
3. git checkout replicate-issue
4. mvn install
5. mvn test

(if you wish to change the JNI library code, you may do the following:)

cmake .
make


Actual results:

On the output you will see results from this printf:

fprintf (stderr, "ouch!!!! it was null!!!!, system would crash!!!! \n");

I'm avoiding the system from crashing with a "hack" in place, by setting the returned data as null and ignore eventual duplicates. (This could be actually a candidate fix, but we need to investigate it first).


Expected results:
no iocb should be duplicated and the fprintf above should not be called.

Additional info:

Comment 5 Tiago Bueno 2020-06-09 19:35:34 UTC

I was able to reproduce it on:

Red Hat Enterprise Linux Server release 7.9 Beta (Maipo)
Linux 3.10.0-1136.el7.x86_64 #1 SMP Fri Apr 17 11:40:59 EDT 2020 x86_64 x86_64 x86_64 GNU/Linux

Red Hat Enterprise Linux Server release 7.8 (Maipo)
Linux 3.10.0-1127.8.2.el7.x86_64 #1 SMP Thu May 7 19:30:37 EDT 2020 x86_64 x86_64 x86_64 GNU/Linux



Looks like the problem do not occur on:

Red Hat Enterprise Linux Server release 6.10 (Santiago)
Linux host-10-0-133-19 2.6.32-754.28.1.el6.x86_64 #1 SMP Fri Jan 31 06:05:42 EST 2020 x86_64 x86_64 x86_64 GNU/Linux

Red Hat Enterprise Linux Server release 7.7 (Maipo)
Linux 3.10.0-1062.18.1.el7.x86_64 #1 SMP Wed Feb 12 14:08:31 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux

Red Hat Enterprise Linux release 8.2 (Ootpa)
Linux 4.18.0-193.1.2.el8_2.x86_64 #1 SMP Thu May 7 16:37:54 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux

Comment 8 Jeff Moyer 2020-06-10 21:06:43 UTC

The sequence of events goes like this:

Userspace code checks the ring.  head is equal to tail (no events in the completion queue), so it calls the io_getevents system call.
The io_getevents system call waits for events to become available.  One event completes, and this code runs:

aio_complete():

        smp_wmb();      /* make event visible before updating tail */

        ctx->tail = tail;

        ring = kmap_atomic(ctx->ring_pages[0]);
        head = ring->head;
        ring->tail = tail;
        kunmap_atomic(ring);
        flush_dcache_page(ctx->ring_pages[0]);

Then read_events rings does the following:

aio_read_events_ring():

        ring = kmap_atomic(ctx->ring_pages[0]);
        head = ring->head;
	kunmap_atomic(ring);

        /*                                                                      
         * Ensure that once we've read the current tail pointer, that           
         * we also see the events that were stored up to the tail.              
         */
        tail = ctx->tail;                                                       
        smp_rmb();

Notice how we are using the kernel's copy of the tail pointer, stored in ctx (not the copy in the ring).  Also notice that the aio_complete code updates the in-kernel copy of tail before updating the shared copy in the ring.  What happens is we enter the kernel with head equal to tail.  When we return from io_getevents, both head and tail have advanced.  However, there is a race condition by which the userspace code sees the udpated head pointer, but NOT the updated tail pointer.  This makes it appear as though the ring buffer is full of completions.

This problem was introduced by RHEL 7 commit e62960f36f9ed ("[fs] aio: get rid of unnecessary locking in aio_read_events_ring").  That commit does this:

@@ -1118,21 +1117,11 @@ static long aio_read_events_ring(struct kioctx *ctx,
                head %= ctx->nr_events;
        }
 
-       /*
-        * The aio ring pages are user space pages, so they can be migrated.
-        * When writing to an aio ring page, we should ensure the page is not
-        * being migrated. Aio page migration procedure is protected by
-        * ctx->completion_lock, so we add this lock here.
-        */
-       spin_lock_irqsave(&ctx->completion_lock, flags);
-
        ring = kmap_atomic(ctx->ring_pages[0]);
        ring->head = head;
        kunmap_atomic(ring);
        flush_dcache_page(ctx->ring_pages[0]);
 
-       spin_unlock_irqrestore(&ctx->completion_lock, flags);
-

The completion_lock is also taken in the aio_complete path.  By taking that lock in this code, we ensure that the contents of ring->tail and ctx->tail are the same.  We cannot re-introduce that lock, though, because it was causing long interrupt service routine times in some environments.  It turns out the problem was solved upstream as a side-effect of commit 5ffac122dbda8 ("aio: Don't use ctx->tail unnecessarily").  Specifically, this part:

@@ -877,11 +880,12 @@ static long aio_read_events_ring(struct kioctx *ctx,
 
        ring = kmap_atomic(ctx->ring_pages[0]);
        head = ring->head;
+       tail = ring->tail;
        kunmap_atomic(ring);
 
-       pr_debug("h%u t%u m%u\n", head, ctx->tail, ctx->nr_events);
+       pr_debug("h%u t%u m%u\n", head, tail, ctx->nr_events);
 
-       if (head == ctx->tail)
+       if (head == tail)

This changes the event reaping code from using the kernel copy of tail to the shared copy, which prevents the race from occuring (both the kernel event reaping code and the userspace code operate on the same data, the ring tail pointer).  I will attach a targeted fix for this to the bug after it's been thoroughly tested.
It will likely look something like this:

diff --git a/fs/aio.c b/fs/aio.c
index 5f3bd80f5a75..619d6f43b7e3 100644
--- a/fs/aio.c
+++ b/fs/aio.c
@@ -1069,13 +1069,13 @@ static long aio_read_events_ring(struct kioctx *ctx,
        /* Access to ->ring_pages here is protected by ctx->ring_lock. */
        ring = kmap_atomic(ctx->ring_pages[0]);
        head = ring->head;
+       tail = ring->tail;
        kunmap_atomic(ring);
 
        /*
         * Ensure that once we've read the current tail pointer, that
         * we also see the events that were stored up to the tail.
         */
-       tail = ctx->tail;
        smp_rmb();
 
        pr_debug("h%u t%u m%u\n", head, tail, ctx->nr_events);
@@ -1084,6 +1084,7 @@ static long aio_read_events_ring(struct kioctx *ctx,
                goto out;
 
        head %= ctx->nr_events;
+       tail %= ctx->nr_events;
 
        while (ret < nr) {
                long avail;

Comment 19 Jan Stancek 2020-06-20 05:30:44 UTC

Patch(es) committed on kernel-3.10.0-1152.el7

Comment 24 ChanghuiZhong 2020-06-30 06:16:21 UTC

reproduce this issue with kernel-3.10.0-1127.8.2.el7,output:
SUREFIRE-859: ouch!!!! it was null!!!!, system would crash!!!! 
SUREFIRE-859: ouch!!!! it was null!!!!, system would crash!!!!  

verified with kernel-3.10.0-1152.el7, this issue has fixed.
no iocb duplicated and the fprintf not called.

move the status to VERIFIED.

Comment 32 errata-xmlrpc 2020-09-29 21:16:46 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Important: kernel security, bug fix, and enhancement update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2020:4060

Note You need to log in before you can comment on or make changes to this bug.