Bug 1952483 - QEMU's coroutines fail with CFLAGS=-flto on non-x86_64 architectures
Summary: QEMU's coroutines fail with CFLAGS=-flto on non-x86_64 architectures
Keywords:
Status: NEW
Alias: None
Product: Red Hat Enterprise Linux 9
Classification: Red Hat
Component: qemu-kvm
Version: 9.0
Hardware: All
OS: Linux
Priority: medium
Severity: high
Target Milestone: beta
Target Release: 9.0
Assignee: Stefan Hajnoczi
QA Contact: Virtualization Bugs
URL:
Whiteboard:
Depends On:
Blocks: 1939500 1971841
Reported: 2021-04-22 11:11 UTC by Miroslav Rezanina
Modified: 2021-10-25 14:15 UTC (History)
CC List: 23 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed:
Type: Bug
Target Upstream Version:


Links
System ID Private Priority Status Summary Last Updated
Red Hat Bugzilla 1924014 1 low VERIFIED RHEL9: qemu-kvm core dumped(Co-routine is yielding to no one) 2021-11-21 14:14:01 UTC
Red Hat Bugzilla 1924974 1 low VERIFIED RHEL9: when ioeventfd=off, qemu-kvm core dumped(../block/aio_task.c:64: aio_task_pool_wait_one: Assertion `qemu_coroutin... 2021-11-21 14:12:49 UTC
Red Hat Bugzilla 1950192 1 high VERIFIED RHEL9: when ioeventfd=off and 8.4guest, (qemu) qemu-kvm: ../util/qemu-coroutine-lock.c:57: qemu_co_queue_wait_impl: Asse... 2021-11-21 14:42:59 UTC

Internal Links: 1924014 1924974 1950192

Description Miroslav Rezanina 2021-04-22 11:11:47 UTC
When building qemu-kvm for RHEL 9, the test-block-iothread test run during "make check" fails for /attach/blockjob on the aarch64, ppc64le and s390x architectures (it passes on x86_64):

ERROR test-block-iothread - Bail out! ERROR:../tests/unit/test-block-iothread.c:379:test_job_run: assertion failed: (qemu_get_current_aio_context() == job->aio_context)

The same code passes the test on RHEL 8.

Comment 1 smitterl 2021-05-05 11:32:29 UTC
I cannot reproduce this on RHEL 9 Beta compose ID RHEL-9.0.0-20210504.5 on s390x (z15) with qemu master@3e13d8e34b53d8f9a3421a816ccfbdc5fa874e98.
I ran
# ../configure --target-list=s390x-softmmu
# make
# make check-unit
I had to install TAP::Parser through CPAN, though.

Comment 4 Eric Auger 2021-06-17 11:12:54 UTC
Hi Kevin,

we seem to have a bunch of issues related to presumed coroutine races:
- https://bugzilla.redhat.com/show_bug.cgi?id=1924014
- https://bugzilla.redhat.com/show_bug.cgi?id=1950192
- https://bugzilla.redhat.com/show_bug.cgi?id=1924974
also I saw https://bugs.launchpad.net/ubuntu/+source/qemu/+bug/1921664

I don't understand yet if/how it relates to this BZ.

The above BZs mention:
- qemu-system-xxx: ../../util/qemu-coroutine-lock.c:57: qemu_co_queue_wait_impl: Assertion `qemu_in_coroutine()' failed.
- qemu-system-xxx: ../../block/aio_task.c:64: aio_task_pool_wait_one: Assertion `qemu_coroutine_self() == pool->main_co' failed.

Thanks

Eric

Comment 5 Kevin Wolf 2021-06-17 16:28:00 UTC
So far it seems we are only seeing this kind of problem on RHEL 9, and only on non-x86. My best guess is still that something is wrong with the TLS implementation there.

If you can reproduce the problems, you could try to figure out which of the components makes the difference: Does the problem still occur when compiling and running the RHEL 9 qemu source on RHEL 8, and when compiling and running the RHEL 8 qemu source on RHEL 9? If the problem is not in the QEMU source, is it the compiler/toolchain, the kernel, or can we identify any other specific component that causes the difference when changed individually?

Comment 6 Thomas Huth 2021-07-15 09:19:16 UTC
Stefan, could you please have a look at this BZ? It's easy to reproduce the coroutine failures when compiling with -flto on a non-x86 box:

 tar -xaf ~/qemu-6.0.0.tar.xz 
 cd qemu-6.0.0/
 ./configure --disable-docs --extra-cflags='-O2 -flto=auto -ffat-lto-objects' --target-list='s390x-softmmu'
 cd build/
 make -j8 tests/unit/test-block-iothread
 tests/unit/test-block-iothread

Fails with:

 ERROR:../tests/unit/test-block-iothread.c:379:test_job_run: assertion failed: (qemu_get_current_aio_context() == job->aio_context)
 Bail out! ERROR:../tests/unit/test-block-iothread.c:379:test_job_run: assertion failed: (qemu_get_current_aio_context() == job->aio_context)

Comment 7 Thomas Huth 2021-07-15 09:55:04 UTC
Backtrace looks like this:

(gdb) bt full
#0  0x000003fffca9f4ae in __pthread_kill_internal () from /lib64/libc.so.6
No symbol table info available.
#1  0x000003fffca4fa20 in raise () from /lib64/libc.so.6
No symbol table info available.
#2  0x000003fffca31398 in abort () from /lib64/libc.so.6
No symbol table info available.
#3  0x000003fffde00de6 in g_assertion_message () from /lib64/libglib-2.0.so.0
No symbol table info available.
#4  0x000003fffde00e46 in g_assertion_message_expr () from /lib64/libglib-2.0.so.0
No symbol table info available.
#5  0x000002aa00023c20 in test_job_run (job=0x2aa001f5f50, errp=<optimized out>) at ../tests/unit/test-block-iothread.c:379
        s = 0x2aa001f5f50
        __func__ = "test_job_run"
        __mptr = <optimized out>
#6  0x000002aa0005a546 in job_co_entry (opaque=0x2aa001f5f50) at ../job.c:914
        job = 0x2aa001f5f50
        __PRETTY_FUNCTION__ = "job_co_entry"
#7  0x000002aa000e2d7a in coroutine_trampoline (i0=<optimized out>, i1=<optimized out>) at ../util/coroutine-ucontext.c:173
        arg = {p = <optimized out>, i = {<optimized out>, <optimized out>}}
        self = <optimized out>
        co = 0x2aa001f6110
        fake_stack_save = 0x0
#8  0x000003fffca65092 in __makecontext_ret () from /lib64/libc.so.6

Comment 8 Stefan Hajnoczi 2021-07-19 13:52:07 UTC
Thanks to Florian Weimer for input from the toolchain side.

It looks grim, unfortunately. The toolchain may cache TLS values regardless of stack/thread switches. This means a coroutine running in thread 1 that switches to thread 2 might see TLS values from thread 1.

To recap the issue:

  static int coroutine_fn test_job_run(Job *job, Error **errp)
  {
      TestBlockJob *s = container_of(job, TestBlockJob, common.job);

      job_transition_to_ready(&s->common.job);
      while (!s->should_complete) {
          s->n++;
          g_assert(qemu_get_current_aio_context() == job->aio_context);

          /* Avoid job_sleep_ns() because it marks the job as !busy. We want to
           * emulate some actual activity (probably some I/O) here so that the
           * drain involved in AioContext switches has to wait for this activity
           * to stop. */
          qemu_co_sleep_ns(QEMU_CLOCK_REALTIME, 1000000);
          ^^^^^^^^^^^^^^^^

          job_pause_point(&s->common.job);
      }

      g_assert(qemu_get_current_aio_context() == job->aio_context);
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Both the qemu_co_sleep_ns() and qemu_get_current_aio_context() functions load the same TLS variable. The compiler may cache the value because it assumes other threads will not modify the TLS value.

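In pseudocode the hazard looks roughly like this (a sketch with illustrative names and minimal scaffolding, not actual QEMU code):

```c
#include <assert.h>

typedef struct AioContext AioContext;
typedef struct Job { AioContext *aio_context; } Job;

extern __thread AioContext *my_aiocontext;  /* one instance per thread */
extern void qemu_coroutine_yield(void);

void coroutine_body(Job *job)
{
    /* First access: the TLS address is computed for thread 1. */
    assert(my_aiocontext == job->aio_context);

    qemu_coroutine_yield();  /* the coroutine may resume on thread 2 */

    /* Second access: a compiler that assumes TLS is only modified by the
     * current thread may reuse the address/value cached above, so this
     * still sees thread 1's my_aiocontext even though we are now running
     * in thread 2 -- which is how assertions like the one in test_job_run
     * end up firing. */
    assert(my_aiocontext == job->aio_context);
}
```
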
We have discussed changing how QEMU uses TLS from coroutines, but TLS is used widely so it's not easy to fix this.

Kevin mentioned that he was able to work around the issue by disabling inlining in one case, but the concern is that there's no systematic way of preventing these bugs.

From a QEMU perspective the easiest option would be a "TLS barrier" primitive that tells the toolchain that TLS accesses cannot be cached across a certain point.

Another toolchain option is to disable TLS caching optimizations - i.e. stop assuming that TLS memory can only be modified from the current thread. I wonder how much of a performance impact this has.

Finally, a hack might be to find a way to convert a TLS variable's address into a regular pointer so the compiler can no longer assume other threads don't modify it. Then loads shouldn't be cached across sequence points according to the C standard.
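
For illustration, that last hack could look something like this (a sketch, not a committed QEMU API; whether LTO can still see through the call is exactly the open question):

```c
typedef struct AioContext AioContext;

static __thread AioContext *my_aiocontext;

/* noinline: hide the TLS address computation behind a call the optimizer
 * cannot look through, so the address is recomputed on every invocation
 * instead of being cached across a coroutine switch. */
__attribute__((noinline))
static AioContext **get_my_aiocontext_ptr(void)
{
    return &my_aiocontext;
}

AioContext *qemu_get_current_aio_context(void)
{
    return *get_my_aiocontext_ptr();
}
```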

Florian: Are any of these toolchain approaches possible?

Comment 9 Florian Weimer 2021-07-21 08:23:45 UTC
I have started a thread on the gcc list:

Disabling TLS address caching to help QEMU on GNU/Linux
https://gcc.gnu.org/pipermail/gcc/2021-July/236831.html

Comment 12 Daniel Berrangé 2021-08-20 12:54:32 UTC
(In reply to Florian Weimer from comment #9)
> I have started a thread on the gcc list:
> 
> Disabling TLS address caching to help QEMU on GNU/Linux
> https://gcc.gnu.org/pipermail/gcc/2021-July/236831.html

This discussion rather implies that we ought to have a RFE bug open against GCC to request a solution and track its progress towards RHEL.

This is complicated in RHEL 9 by the fact that we're now using Clang too, so presumably we might need an RFE against Clang instead of, or as well as, GCC.

Comment 13 Stefan Hajnoczi 2021-08-26 13:52:49 UTC
Hi Florian,
Thanks for starting the mailing list thread. Has there been activity in the gcc community?

As Daniel Berrange mentioned, clang has entered the picture. I wanted to check if you see anything happening. If not, then I'll open RFEs as suggested.

Comment 14 Florian Weimer 2021-08-26 13:57:55 UTC
(In reply to Stefan Hajnoczi from comment #13)
> Hi Florian,
> Thanks for starting the mailing list thread. Has there been activity in the
> gcc community?

There has been the discussion, but nothing else that I know of.

(If you have switched to clang, this is not really relevant to you anyway, so I don't know what the priority is for the gcc side.)

Comment 15 Thomas Huth 2021-08-26 14:58:21 UTC
We switched to Clang with qemu-kvm-6.0.0-12, but if I've understood a comment in another BZ correctly (https://bugzilla.redhat.com/show_bug.cgi?id=1940132#c47), that build still fails on aarch64. It seems to work on s390x at first glance, though.

Comment 16 Tom Stellard 2021-08-26 17:26:49 UTC
(In reply to Thomas Huth from comment #15)
> We switched to Clang with qemu-kvm-6.0.0-12, but if I've got a comment in
> another BZ right (https://bugzilla.redhat.com/show_bug.cgi?id=1940132#c47)
> that build still fails on aarch64. It seems to work on s390x on a first
> glance, though.

It's a little hard to follow all the different bugzilla links; would someone be able to file a bug against the clang component with a summary of the failures specific to clang? That would make it easier for our team to analyze and track.

Comment 17 Thomas Huth 2021-08-27 06:25:51 UTC
@eric.auger@redhat.com : Can you still reproduce the problem with a Clang build (that uses -flto) on aarch64? If so, could you please open a BZ against "clang" as Tom suggested, with the instructions how to reproduce the problem there?

Comment 20 Tom Stellard 2021-09-03 04:54:47 UTC
Is there a reduced test case that will demonstrate the problem mentioned in comment 8?

Comment 21 Eric Auger 2021-09-03 07:08:43 UTC
(In reply to Tom Stellard from comment #20)
> Is there a reduced test case that will demonstrate the problem mentioned in
> comment 8?

Tom, I opened https://bugzilla.redhat.com/show_bug.cgi?id=2000479 against CLANG as you suggested. Here you will find the qemu configuration and test case I used to trigger the issue. It is basically the same as the one described by Thomas in https://bugzilla.redhat.com/show_bug.cgi?id=1952483#c6, with CLANG.

Comment 22 Tom Stellard 2021-09-03 21:35:18 UTC
(In reply to Eric Auger from comment #21)
> (In reply to Tom Stellard from comment #20)
> > Is there a reduced test case that will demonstrate the problem mentioned in
> > comment 8.
> 
> Tom, I opened https://bugzilla.redhat.com/show_bug.cgi?id=2000479 against
> CLANG as you suggested. Here you will find the qemu configuration and test
> case I used to trigger the issue. It is basically the same as the one
> described by Thomas in
> https://bugzilla.redhat.com/show_bug.cgi?id=1952483#c6, with CLANG.

OK, thanks.  Am I correct that the core problem is sequences like this:

thread_local int *ptr;

read(ptr);
context_switch();
read(ptr);

Where the two read() calls see the same address even though they may run in different threads.

Comment 23 Stefan Hajnoczi 2021-09-06 08:44:04 UTC
(In reply to Tom Stellard from comment #22)
> (In reply to Eric Auger from comment #21)
> > (In reply to Tom Stellard from comment #20)
> > > Is there a reduced test case that will demonstrate the problem mentioned in
> > > comment 8?
> > 
> > Tom, I opened https://bugzilla.redhat.com/show_bug.cgi?id=2000479 against
> > CLANG as you suggested. Here you will find the qemu configuration and test
> > case I used to trigger the issue. It is basically the same as the one
> > described by Thomas in
> > https://bugzilla.redhat.com/show_bug.cgi?id=1952483#c6, with CLANG.
> 
> OK, thanks.  Am I correct that the core problem is sequences like this:
> 
> thread_local int *ptr;
> 
> read(ptr);
> context_switch();
> read(ptr);
> 
> Where the two read() calls see the same address even though they may run in
> different threads.

Yes.

Comment 24 serge_sans_paille 2021-09-08 14:57:14 UTC
This may sound a bit naive, but... if the coroutines are implemented in terms of setjmp/longjmp, then the C11 standard says:

```
7.13.2.1 [The longjmp function]

1            #include <setjmp.h>
             _Noreturn void longjmp(jmp_buf env, int val);
    Description

2   The longjmp function restores the environment saved by the most recent invocation of
    the setjmp macro in the same invocation of the program with the corresponding
    jmp_buf argument. If there has been no such invocation, or **if the invocation was from
    another thread of execution**, or if the function containing the invocation of the setjmp
    macro has terminated execution in the interim, or if the invocation of the setjmp
    macro was within the scope of an identifier with variably modified type and execution has
    left that scope in the interim, the behavior is undefined.

```

Then isn't that prone to failure? Wouldn't declaring the thread-local variables volatile change something?

This small godbolt experiment is promising:

https://gcc.godbolt.org/z/6nznnMvTs
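
Presumably the experiment is along these lines (a sketch of the volatile idea, not the exact godbolt contents):

```c
typedef struct AioContext AioContext;

/* volatile: every access must perform a fresh load of the pointer value
 * instead of reusing one cached in a register. */
static __thread AioContext *volatile my_aiocontext;

AioContext *qemu_get_current_aio_context(void)
{
    return my_aiocontext;
}
```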

Comment 25 Stefan Hajnoczi 2021-09-08 15:26:04 UTC
(In reply to serge_sans_paille from comment #24)
> This may sound a bit naive, but... if the coroutines are implemented in
> terms of setjmp/longjmp, then the C11 standard says:
> 
> ```
> 7.13.2.1 [The longjmp function]
> 
> 1            #include <setjmp.h>
>              _Noreturn void longjmp(jmp_buf env, int val);
>     Description
> 
> 2   The longjmp function restores the environment saved by the most recent
>     invocation of the setjmp macro in the same invocation of the program
>     with the corresponding jmp_buf argument. If there has been no such
>     invocation, or **if the invocation was from another thread of
>     execution**, or if the function containing the invocation of the
>     setjmp macro has terminated execution in the interim, or if the
>     invocation of the setjmp macro was within the scope of an identifier
>     with variably modified type and execution has left that scope in the
>     interim, the behavior is undefined.
> 
> ```
> 
> Then isn't that prone to failure?

Yes, according to the spec the behavior is undefined. QEMU has other coroutine implementations too, e.g. assembly or using other OS/runtime APIs.

I don't think setjmp is the culprit here since QEMU could switch to the assembly implementation and it would still have the TLS problem.

> Wouldn't declaring the thread-local variables volatile change something?
> 
> This small godbolt experiment is promising:
> 
> https://gcc.godbolt.org/z/6nznnMvTs

I don't think that approach is workable because:
1. It's very tricky to get it right. The relatively innocuous addition I made here is broken: https://gcc.godbolt.org/z/8GP4dTP56. The likelihood of errors like this slipping past code review is high.
2. All __thread variables in QEMU need to be converted to volatile pointers, including auditing and rewriting code that uses the variables.

Comment 26 Kevin Wolf 2021-09-08 16:11:53 UTC
(In reply to serge_sans_paille from comment #24)
> Wouldn't setting the thread local variables as volatile change something?

When I tried that with the original reproducer, it was not enough. In hindsight that made sense to me: it's not the value of the variable that is volatile and must not be cached, but its address, which changes between threads. I don't think this can be expressed with volatile.

Also note that the bug didn't reproduce on x86 without -mtls-dialect=gnu2 (which apparently is the default on other architectures).
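
To illustrate the address-vs-value distinction (a sketch; `x` is just an example variable):

```c
__thread int x;

int *tls_addr(void)
{
    /* &x evaluates to a different address in every thread. A volatile
     * qualifier on x would force the *value* to be reloaded, but nothing
     * tells the compiler to recompute this *address* after the coroutine
     * has migrated to another thread. */
    return &x;
}
```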

Comment 27 serge_sans_paille 2021-09-16 13:10:30 UTC
Another attempt: if we force an indirect access to the TLS through `-mno-tls-direct-seg-refs`, that should prevent hoisting (?)

Comment 28 Stefan Hajnoczi 2021-09-16 13:23:59 UTC
Do you have time to try the suggestion in Comment 27? I am on PTO until Sept 29th. Thank you!

Comment 29 Florian Weimer 2021-09-16 13:48:54 UTC
(In reply to serge_sans_paille from comment #27)
> Another attempt: if we force an indirect access to the TLS through
> `-mno-tls-direct-seg-refs`, that should prevent hoisting (?)

-mno-tls-direct-seg-refs is an x86 option, introduced for i386 para-virtualization (with i386 host kernels), so it's thoroughly obsolete by now. It also goes in the wrong direction: -mtls-direct-seg-refs (the default) completely avoids materializing the thread pointer for direct accesses to thread-local variables (of the initial-exec or local-exec kind). And if the thread pointer is not loaded into a general-purpose register, it can't be out of date after a context switch.
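
For context, with -mtls-direct-seg-refs an initial-exec TLS access on x86-64 compiles to a single segment-relative load, roughly like this (a sketch; the asm in the comment is approximate):

```c
__thread int counter;  /* initial-exec/local-exec TLS */

int get_counter(void)
{
    /* With -mtls-direct-seg-refs this is approximately:
     *     movl %fs:counter@tpoff, %eax
     * The thread pointer stays in the %fs segment register and is never
     * copied into a general-purpose register, so there is no stale copy
     * to carry across a context switch. */
    return counter;
}
```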

Comment 30 serge_sans_paille 2021-09-20 21:59:36 UTC
The following diff

```
--- qemu-6.1.0.orig/util/async.c	2021-08-24 13:35:41.000000000 -0400
+++ qemu-6.1.0/util/async.c	2021-09-20 17:48:15.404681749 -0400
@@ -673,6 +673,10 @@
 
 AioContext *qemu_get_current_aio_context(void)
 {
+    if (qemu_in_coroutine()) {
+      Coroutine *self = qemu_coroutine_self();
+      return self->ctx;
+    }
     if (my_aiocontext) {
         return my_aiocontext;
     }
```

fixes the scenario proposed by Thomas in https://bugzilla.redhat.com/show_bug.cgi?id=1952483#c6 (but it does not fix all tests).

I understand this puts an extra burden on qemu developers, but it also seems sane to me to prevent coroutines from accessing thread-local variables from a thread other than the one they were created in (interesting read on that topic: http://www.crystalclearsoftware.com/soc/coroutine/coroutine/coroutine_thread.html).

Would it be acceptable to enforce that property upstream?

Comment 31 Stefan Hajnoczi 2021-09-28 14:56:55 UTC
(In reply to serge_sans_paille from comment #30)
> The following diff
> 
> ```
> --- qemu-6.1.0.orig/util/async.c	2021-08-24 13:35:41.000000000 -0400
> +++ qemu-6.1.0/util/async.c	2021-09-20 17:48:15.404681749 -0400
> @@ -673,6 +673,10 @@
>  
>  AioContext *qemu_get_current_aio_context(void)
>  {
> +    if (qemu_in_coroutine()) {

This uses the `current` TLS variable. Are you sure this works? It seems like the same problem :).

Comment 33 serge_sans_paille 2021-10-06 18:50:17 UTC
The patch above fixes the LTO issue, and once applied, I've successfully built qemu with LTO with GCC: https://koji.fedoraproject.org/koji/taskinfo?taskID=76803353 (all archs) and with Clang: https://koji.fedoraproject.org/koji/taskinfo?taskID=76802978 (s390x only).

It's a compiler-agnostic patch: it works for any compiler that honors __attribute__((noinline)), as long as the compiler doesn't try to do interprocedural optimization across non-inlinable functions.
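
For reference, the overall shape of the combined fix is roughly the following (a sketch reconstructed from the descriptions in this BZ, not the literal patch):

```c
typedef struct AioContext AioContext;
typedef struct Coroutine { AioContext *ctx; } Coroutine;

static __thread Coroutine *current;        /* currently running coroutine */
static __thread AioContext *my_aiocontext; /* this thread's AioContext */

/* noinline keeps the TLS access behind an opaque call even under LTO,
 * so the TLS address cannot be cached across a coroutine switch. */
__attribute__((noinline))
static Coroutine *get_current(void)
{
    return current;
}

AioContext *qemu_get_current_aio_context(void)
{
    Coroutine *self = get_current();

    if (self) {
        /* In a coroutine: take the context from the coroutine object,
         * which is valid regardless of the thread we resumed on. */
        return self->ctx;
    }
    /* Outside a coroutine the thread never changes under us, so a plain
     * TLS read is fine here. */
    return my_aiocontext;
}
```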

Comment 34 Stefan Hajnoczi 2021-10-25 14:15:37 UTC
RFC patch posted upstream based on the patch Serge attached to this BZ:
https://patchew.org/QEMU/20211025140716.166971-1-stefanha@redhat.com/

