Bug 1952483

Summary:	RFE: QEMU's coroutines fail with CFLAGS=-flto on non-x86_64 architectures
Product:	Red Hat Enterprise Linux 9	Reporter:	Miroslav Rezanina <mrezanin>
Component:	qemu-kvm	Assignee:	Virtualization Maintenance <virt-maint>
qemu-kvm sub component:	Storage	QA Contact:	Yihuang Yu <yihyu>
Status:	CLOSED MIGRATED	Docs Contact:
Severity:	medium
Priority:	medium	CC:	alougovs, berrange, bfu, coli, eric.auger, fweimer, hreitz, jinzhao, juzhang, kwolf, lijin, mdean, mdeng, ngu, npopov, qzhang, scoady, smitterl, stefanha, thuth, tstaudt, tstellar, vgoyal, virt-maint, virt-qe-z, xuma, yihyu, zhencliu, zhenyzha
Version:	9.0	Keywords:	FutureFeature, MigratedToJIRA, Reopened, RFE, Triaged
Target Milestone:	beta	Flags:	pm-rhel: mirror+
Target Release:	9.0
Hardware:	All
OS:	Linux
Whiteboard:
Fixed In Version:		Doc Type:	If docs needed, set a value
Doc Text:		Story Points:	---
Clone Of:		Environment:
Last Closed:	2023-09-22 16:35:57 UTC	Type:	Story
Regression:	---	Mount Type:	---
Documentation:	---	CRM:
Verified Versions:		Category:	---
oVirt Team:	---	RHEL 7.3 requirements from Atomic Host:
Cloudforms Team:	---	Target Upstream Version:
Embargoed:
Bug Depends On:
Bug Blocks:	1939500, 1971841

Description Miroslav Rezanina 2021-04-22 11:11:47 UTC

When running build for qemu-kvm for RHEL 9, test-block-iothread during "make check " fails on aarch64, ppc64le and s390x architecture for /attach/blockjob (pass on x86_64):

ERROR test-block-iothread - Bail out! ERROR:../tests/unit/test-block-iothread.c:379:test_job_run: assertion failed: (qemu_get_current_aio_context() == job->aio_context)

Same code passes test on RHEL 8.

Comment 1 smitterl 2021-05-05 11:32:29 UTC

I cannot reproduce this on RHEL 9 Beta compose ID RHEL-9.0.0-20210504.5 on s390x (z15) with qemu master@3e13d8e34b53d8f9a3421a816ccfbdc5fa874e98.
I ran
# ../configure --target-list=s390x-softmmu
# make
# make check-unit
I had to install TAP::Parser though CPAN.

Comment 4 Eric Auger 2021-06-17 11:12:54 UTC

Hi Kevin,

we seem to have a bunch of issues related to presumed coroutine races:
- https://bugzilla.redhat.com/show_bug.cgi?id=1924014
- https://bugzilla.redhat.com/show_bug.cgi?id=1950192
- https://bugzilla.redhat.com/show_bug.cgi?id=1924974
also I saw https://bugs.launchpad.net/ubuntu/+source/qemu/+bug/1921664

I don't understand yet if / how it relates to that BZ.

Above BZs mention  
- qemu-system-xxx: ../../util/qemu-coroutine-lock.c:57: qemu_co_queue_wait_impl: Assertion `qemu_in_coroutine()' failed.
- qemu-system-xxx: ../../block/aio_task.c:64: aio_task_pool_wait_one: Assertion `qemu_coroutine_self() == pool->main_co' failed.

Thanks

Eric

Comment 5 Kevin Wolf 2021-06-17 16:28:00 UTC

So far it seems we are only seeing this kind of problems on RHEL 9 and only on non-x86. My best guess is still that something is wrong with the TLS implementation there.

If you can reproduce the problems, you could try to figure out which of the components makes the difference: Does the problem still occur when compiling and running the RHEL 9 qemu source on RHEL 8, and when building and compiling RHEL 8 qemu on RHEL 9? If the problem is not in the QEMU source, is it the compiler/toolchain, the kernel or can we identify any other specific component that causes the difference when changed individually?

Comment 6 Thomas Huth 2021-07-15 09:19:16 UTC

Stefan, could you please have a look at this BZ here? It's easy to reproduce the problem with failing coroutines when compiling with -flto on a non-x86 box here:

 tar -xaf ~/qemu-6.0.0.tar.xz 
 cd qemu-6.0.0/
 ./configure --disable-docs --extra-cflags='-O2 -flto=auto -ffat-lto-objects' --target-list='s390x-softmmu'
 cd build/
 make -j8 tests/unit/test-block-iothread
 tests/unit/test-block-iothread

Fails with:

 ERROR:../tests/unit/test-block-iothread.c:379:test_job_run: assertion failed: (qemu_get_current_aio_context() == job->aio_context)
 Bail out! ERROR:../tests/unit/test-block-iothread.c:379:test_job_run: assertion failed: (qemu_get_current_aio_context() == job->aio_context)

Comment 7 Thomas Huth 2021-07-15 09:55:04 UTC

Backtrace looks like this:

(gdb) bt full
#0  0x000003fffca9f4ae in __pthread_kill_internal () from /lib64/libc.so.6
No symbol table info available.
#1  0x000003fffca4fa20 in raise () from /lib64/libc.so.6
No symbol table info available.
#2  0x000003fffca31398 in abort () from /lib64/libc.so.6
No symbol table info available.
#3  0x000003fffde00de6 in g_assertion_message () from /lib64/libglib-2.0.so.0
No symbol table info available.
#4  0x000003fffde00e46 in g_assertion_message_expr () from /lib64/libglib-2.0.so.0
No symbol table info available.
#5  0x000002aa00023c20 in test_job_run (job=0x2aa001f5f50, errp=<optimized out>) at ../tests/unit/test-block-iothread.c:379
        s = 0x2aa001f5f50
        __func__ = "test_job_run"
        __mptr = <optimized out>
#6  0x000002aa0005a546 in job_co_entry (opaque=0x2aa001f5f50) at ../job.c:914
        job = 0x2aa001f5f50
        __PRETTY_FUNCTION__ = "job_co_entry"
#7  0x000002aa000e2d7a in coroutine_trampoline (i0=<optimized out>, i1=<optimized out>) at ../util/coroutine-ucontext.c:173
        arg = {p = <optimized out>, i = {<optimized out>, <optimized out>}}
        self = <optimized out>
        co = 0x2aa001f6110
        fake_stack_save = 0x0
#8  0x000003fffca65092 in __makecontext_ret () from /lib64/libc.so.6

Comment 8 Stefan Hajnoczi 2021-07-19 13:52:07 UTC

Thanks to Florian Weimer for input from the toolchain side.

It looks grim, unfortunately. The toolchain may cache TLS values regardless of stack/thread switches. This means a coroutine running in thread 1 that switches to thread 2 might see TLS values from thread 1.

To recap the issue:

  static int coroutine_fn test_job_run(Job *job, Error **errp)
  {
      TestBlockJob *s = container_of(job, TestBlockJob, common.job);

      job_transition_to_ready(&s->common.job);
      while (!s->should_complete) {
          s->n++;
          g_assert(qemu_get_current_aio_context() == job->aio_context);

          /* Avoid job_sleep_ns() because it marks the job as !busy. We want to
           * emulate some actual activity (probably some I/O) here so that the
           * drain involved in AioContext switches has to wait for this activity
           * to stop. */
          qemu_co_sleep_ns(QEMU_CLOCK_REALTIME, 1000000);
          ^^^^^^^^^^^^^^^^

          job_pause_point(&s->common.job);
      }

      g_assert(qemu_get_current_aio_context() == job->aio_context);
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Both qemu_co_sleep_ns() and qemu_get_current_aio_context() function load the same TLS variable. The compiler could cache the value since it knows other threads will not modify the TLS value.

We have discussed changing how QEMU uses TLS from coroutines, but TLS is used widely so it's not easy to fix this.

Kevin mentioned that he was able to work around the issue by disabling inlining in one case, but the concern is that there's no systematic way of preventing these bugs.

From a QEMU perspective the easiest option would be a "TLS barrier" primitive that tells the toolchain that TLS accesses cannot be cached across a certain point.

Another toolchain option is to disable TLS caching optimizations - i.e. stop assuming that TLS memory can only be modified from the current thread. I wonder how much of a performance impact this has.

Finally, a hack might be to find a way to convert a TLS variable's address into a regular pointer so the compiler can no longer assume other threads don't modify it. Then loads shouldn't be cached across sequence points according to the C standard.

Florian: Are any of these toolchain approaches possible?

Comment 9 Florian Weimer 2021-07-21 08:23:45 UTC

I have started a thread on the gcc list:

Disabling TLS address caching to help QEMU on GNU/Linux
https://gcc.gnu.org/pipermail/gcc/2021-July/236831.html

Comment 12 Daniel Berrangé 2021-08-20 12:54:32 UTC

(In reply to Florian Weimer from comment #9)
> I have started a thread on the gcc list:
> 
> Disabling TLS address caching to help QEMU on GNU/Linux
> https://gcc.gnu.org/pipermail/gcc/2021-July/236831.html

This discussion rather implies that we ought to have a RFE bug open against GCC to request a solution and track its progress towards RHEL.

This is complicated in 9 by the fact that we're now using CLang too, so presumably we might need an RFE against Clang instead of, or as well as, GCC.

Comment 13 Stefan Hajnoczi 2021-08-26 13:52:49 UTC

Hi Florian,
Thanks for starting the mailing list thread. Has there been activity in the gcc community?

As Daniel Berrange mentioned, clang has entered the picture. I wanted to check if you see anything happening. If not, then I'll open RFEs as suggested.

Comment 14 Florian Weimer 2021-08-26 13:57:55 UTC

(In reply to Stefan Hajnoczi from comment #13)
> Hi Florian,
> Thanks for starting the mailing list thread. Has there been activity in the
> gcc community?

There has been the discussion, but nothing else that I know of.

(If you have switched to clang, this is not really relevant to you anyway, so I don't know what the priority is for the gcc side.)

Comment 15 Thomas Huth 2021-08-26 14:58:21 UTC

We switched to Clang with qemu-kvm-6.0.0-12, but if I've got a comment in another BZ right (https://bugzilla.redhat.com/show_bug.cgi?id=1940132#c47) that build still fails on aarch64. It seems to work on s390x on a first glance, though.

Comment 16 Tom Stellard 2021-08-26 17:26:49 UTC

(In reply to Thomas Huth from comment #15)
> We switched to Clang with qemu-kvm-6.0.0-12, but if I've got a comment in
> another BZ right (https://bugzilla.redhat.com/show_bug.cgi?id=1940132#c47)
> that build still fails on aarch64. It seems to work on s390x on a first
> glance, though.

It's a little hard to follow through all the different bugzilla links, would someone be able to file a bug against the clang component with the summary of the failures specific to clang?  That would make it easier for our team to analyze and track.

Comment 17 Thomas Huth 2021-08-27 06:25:51 UTC

@eric.auger : Can you still reproduce the problem with a Clang build (that uses -flto) on aarch64? If so, could you please open a BZ against "clang" as Tom suggested, with the instructions how to reproduce the problem there?

Comment 20 Tom Stellard 2021-09-03 04:54:47 UTC

Is there a reduced test case that will demonstrate the problem mentioned in comment 8.

Comment 21 Eric Auger 2021-09-03 07:08:43 UTC

(In reply to Tom Stellard from comment #20)
> Is there a reduced test case that will demonstrate the problem mentioned in
> comment 8.

Tom, I opened https://bugzilla.redhat.com/show_bug.cgi?id=2000479 against CLANG as you suggested. Here you will find the qemu configuration and test case I used to trigger the issue. It is basically the same as the one described by Thomas in https://bugzilla.redhat.com/show_bug.cgi?id=1952483#c6, with CLANG.

Comment 22 Tom Stellard 2021-09-03 21:35:18 UTC

(In reply to Eric Auger from comment #21)
> (In reply to Tom Stellard from comment #20)
> > Is there a reduced test case that will demonstrate the problem mentioned in
> > comment 8.
> 
> Tom, I opened https://bugzilla.redhat.com/show_bug.cgi?id=2000479 against
> CLANG as you suggested. Here you will find the qemu configuration and test
> case I used to trigger the issue. It is basically the same as the one
> described by Thomas in
> https://bugzilla.redhat.com/show_bug.cgi?id=1952483#c6, with CLANG.

OK, thanks.  Am I correct that the core problem is sequences like this:

thread_local int *ptr;

read(ptr);
context_switch();
read(ptr);

Where the 2 read functions read the same address even though they may run in different threads.

Comment 23 Stefan Hajnoczi 2021-09-06 08:44:04 UTC

(In reply to Tom Stellard from comment #22)
> (In reply to Eric Auger from comment #21)
> > (In reply to Tom Stellard from comment #20)
> > > Is there a reduced test case that will demonstrate the problem mentioned in
> > > comment 8.
> > 
> > Tom, I opened https://bugzilla.redhat.com/show_bug.cgi?id=2000479 against
> > CLANG as you suggested. Here you will find the qemu configuration and test
> > case I used to trigger the issue. It is basically the same as the one
> > described by Thomas in
> > https://bugzilla.redhat.com/show_bug.cgi?id=1952483#c6, with CLANG.
> 
> OK, thanks.  Am I correct that the core problem is sequences like this:
> 
> thread_local int *ptr;
> 
> read(ptr);
> context_switch();
> read(ptr);
> 
> Where the 2 read functions read the same address even though they may run in
> different threads.

Yes.

Comment 24 serge_sans_paille 2021-09-08 14:57:14 UTC

This may sound a bit naive, but... if the couroutine are implemented in terms of setjmp/longjmp, then the c11 standard says that

```
7.13.2.1 [The longjmp function]

1            #include <setjmp.h>
             _Noreturn void longjmp(jmp_buf env, int val);
    Description

2   The longjmp function restores the environment saved by the most recent invocation of
    the setjmp macro in the same invocation of the program with the corresponding
    jmp_buf argument. If there has been no such invocation, or **if the invocation was from
    another thread of execution**, or if the function containing the invocation of the setjmp
    macro has terminated execution248) in the interim, or if the invocation of the setjmp
    macro was within the scope of an identifier with variably modified type and execution has
    left that scope in the interim, the behavior is undefined.

```

Then isn't that prone to failure? Wouldn't setting the thread local variables as volatile change something?

This small godbolt experiment is promising:

https://gcc.godbolt.org/z/6nznnMvTs

Comment 25 Stefan Hajnoczi 2021-09-08 15:26:04 UTC

(In reply to serge_sans_paille from comment #24)
> This may sound a bit naive, but... if the couroutine are implemented in
> terms of setjmp/longjmp, then the c11 standard says that
> 
> ```
> 7.13.2.1 [The longjmp function]
> 
> 1            #include <setjmp.h>
>              _Noreturn void longjmp(jmp_buf env, int val);
>     Description
> 
> 2   The longjmp function restores the environment saved by the most recent
> invocation of
>     the setjmp macro in the same invocation of the program with the
> corresponding
>     jmp_buf argument. If there has been no such invocation, or **if the
> invocation was from
>     another thread of execution**, or if the function containing the
> invocation of the setjmp
>     macro has terminated execution248) in the interim, or if the invocation
> of the setjmp
>     macro was within the scope of an identifier with variably modified type
> and execution has
>     left that scope in the interim, the behavior is undefined.
> 
> ```
> 
> Then isn't that prone to failure?

Yes, according to the spec the behavior is undefined. QEMU has other coroutine implementations too, e.g. assembly or using other OS/runtime APIs.

I don't think setjmp is the culprit here since QEMU could switch to the assembly implementation and it would still have the TLS problem.

> Wouldn't setting the thread local
> variables as volatile change something?
> 
> This small godbolt experiment is promising:
> 
> https://gcc.godbolt.org/z/6nznnMvTs

I don't think that approach is workable because:
1. It's very tricky to get it right. The relatively innocuous addition I made here is broken: https://gcc.godbolt.org/z/8GP4dTP56. The likelihood of errors like this slipping past code review is high.
2. All __thread variables in QEMU need to be converted to volatile pointers, including auditing and rewriting code that uses the variables.

Comment 26 Kevin Wolf 2021-09-08 16:11:53 UTC

(In reply to serge_sans_paille from comment #24)
> Wouldn't setting the thread local variables as volatile change something?

When I tried that with the original reproducer, it was not enough. In hindsight that made sense to me: It's not the value of the variable that is volatile and must not be cached, but it's its address that changes between threads. I don't think this can be expressed with volatile.

Also note that the bug didn't reproduce on x86 without -mtls-dialect=gnu2 (which apparently is the default on other architectures).

Comment 27 serge_sans_paille 2021-09-16 13:10:30 UTC

Another attempt: if we force an indirect access to the TLS through `-mno-tls-direct-seg-refs`, that should prevent hoisting (?)

Comment 28 Stefan Hajnoczi 2021-09-16 13:23:59 UTC

Do you have time to try the suggestion in Comment 27? I am on PTO until Sept 29th. Thank you!

Comment 29 Florian Weimer 2021-09-16 13:48:54 UTC

(In reply to serge_sans_paille from comment #27)
> Another attempt: if we force an indirect access to the TLS through
> `-mno-tls-direct-seg-refs`, that should prevent hoisting (?)

-mno-tls-direct-seg-refs is an x86 option, introduced for i386 para-virtualization (with i386 host kernels), so it's thoroughly obslete by now. It also goes in the wrong direction: -mtls-direct-seg-refs (the default) completely avoids materializing the thread pointer for direct accesses to thread-local variables (of the initial-exec or local-exec kind). And if the thread pointer is not loaded in to a general-purpose register, it can't be out-of-date after a context switch.

Comment 30 serge_sans_paille 2021-09-20 21:59:36 UTC

The following diff

```
--- qemu-6.1.0.orig/util/async.c	2021-08-24 13:35:41.000000000 -0400
+++ qemu-6.1.0/util/async.c	2021-09-20 17:48:15.404681749 -0400
@@ -673,6 +673,10 @@
 
 AioContext *qemu_get_current_aio_context(void)
 {
+    if (qemu_in_coroutine()) {
+      Coroutine *self = qemu_coroutine_self();
+      return self->ctx;
+    }
     if (my_aiocontext) {
         return my_aiocontext;
     }
```

fixes the scenario proposed by Thomas in https://bugzilla.redhat.com/show_bug.cgi?id=1952483#c6 (but it does not fix all tests).

I understand this puts an extra burden on qemu developers, but it also seems sane to me to prevent coroutine from accessing thread local variable from another thread than the one they were created (interesting read on that topic: http://www.crystalclearsoftware.com/soc/coroutine/coroutine/coroutine_thread.html)

Would that be acceptable to enforce that property upstream?

Comment 31 Stefan Hajnoczi 2021-09-28 14:56:55 UTC

(In reply to serge_sans_paille from comment #30)
> The following diff
> 
> ```
> --- qemu-6.1.0.orig/util/async.c	2021-08-24 13:35:41.000000000 -0400
> +++ qemu-6.1.0/util/async.c	2021-09-20 17:48:15.404681749 -0400
> @@ -673,6 +673,10 @@
>  
>  AioContext *qemu_get_current_aio_context(void)
>  {
> +    if (qemu_in_coroutine()) {

This uses the `current` TLS variable. Are you sure this works? It seems like the same problem :).

Comment 33 serge_sans_paille 2021-10-06 18:50:17 UTC

The patch above fixes the LTO issue, and once applied, I've been successfully building qemu with LTO with GCC: https://koji.fedoraproject.org/koji/taskinfo?taskID=76803353 (all archs) and with Clang : https://koji.fedoraproject.org/koji/taskinfo?taskID=76802978 (s390x only).

It's a compiler-agnostic patch, it works for any compiler that honors __attribute__((noinline)), as long as the compiler doesn't tries to do inter procedural optimization across non inlinable functions.

Comment 34 Stefan Hajnoczi 2021-10-25 14:15:37 UTC

RFC patch posted upstream based on the patch Serge attached to this BZ:
https://patchew.org/QEMU/20211025140716.166971-1-stefanha@redhat.com/

Comment 35 Klaus Heinrich Kiwi 2022-02-16 13:21:37 UTC

Based on recent discussions with Stefan/Thomas and others, I'm moving this to ITR 9.1.0 as a "FutureFeature" since we don't yet enable LTO downstream on non-x86 architectures. We do have an RFC patch upstream, so hopefully this can be added soon.

Comment 36 Stefan Hajnoczi 2022-03-08 10:33:02 UTC

The following have been merged:
d5d2b15ecf cpus: use coroutine TLS macros for iothread_locked
17c78154b0 rcu: use coroutine TLS macros
47b7446456 util/async: replace __thread with QEMU TLS macros
7d29c341c9 tls: add macros for coroutine-safe TLS variables

I sent another 3 patches as a follow-up series.

Comment 37 Min Deng 2022-03-18 03:28:08 UTC

(In reply to Miroslav Rezanina from comment #0)
> When running build for qemu-kvm for RHEL 9, test-block-iothread during "make
> check " fails on aarch64, ppc64le and s390x architecture for
> /attach/blockjob (pass on x86_64):
  FYI, qemu-kvm isn't supported on RHEL 9 on Power.
Thanks

Comment 39 Stefan Hajnoczi 2022-05-17 09:51:16 UTC

The following patches were merged upstream:
c1fe694357 coroutine-win32: use QEMU_DEFINE_STATIC_CO_TLS()
ac387a08a9 coroutine: use QEMU_DEFINE_STATIC_CO_TLS()
34145a307d coroutine-ucontext: use QEMU_DEFINE_STATIC_CO_TLS()

Comment 42 lijin 2022-06-27 12:55:50 UTC

Hi Yihuang and Boqiao,

Could you do the pre-verify on aarch64 and s390x with the fixed version?

Thanks.

Comment 43 Yihuang Yu 2022-06-28 02:51:04 UTC

Analyzed the build log, "-flto" is still not in the configure setting, is this expected? The full configure from here: http://download.eng.bos.redhat.com/brewroot/vol/rhel-9/packages/qemu-kvm/7.0.0/7.el9/data/logs/aarch64/build.log

I can see x86 enabled -flto: http://download.eng.bos.redhat.com/brewroot/vol/rhel-9/packages/qemu-kvm/7.0.0/7.el9/data/logs/x86_64/build.log, and I also compiled with -flto myself on aarch64, it passed. Steps refer to Eric's bug 2000479

# ./configure --cc=clang --cxx=/bin/false --prefix=/usr --libdir=/usr/lib64 --datadir=/usr/share --sysconfdir=/etc --interp-prefix=/usr/qemu-%M --localstatedir=/var --docdir=/usr/share/doc --libexecdir=/usr/libexec '--extra-ldflags=-Wl,-z,relro -Wl,--as-needed  -Wl,-z,now   -flto' '--extra-cflags=-O2 -flto -fexceptions -g -grecord-gcc-switches -pipe -Wall -Werror=format-security -Wp,-D_FORTIFY_SOURCE=2 -Wp,-D_GLIBCXX_ASSERTIONS --config /usr/lib/rpm/redhat/redhat-hardened-clang.cfg -fstack-protector-strong   -fasynchronous-unwind-tables ' --target-list=aarch64-softmmu --enable-kvm --extra-cflags=-Wstrict-prototypes --extra-cflags=-Wredundant-decls --enable-trace-backends=log --enable-seccomp --enable-cap-ng --disable-werror --without-default-devices --disable-capstone --target-list='aarch64-softmmu'

# make check-unit -j16
......
......
22/92 qemu:unit / test-block-iothread                 OK              0.64s   16 subtests passed
......
......
Ok:                 92
Expected Fail:      0
Fail:               0
Unexpected Pass:    0
Skipped:            0
Timeout:            0

So in my opinion, maybe we can also enable -flto on other architectures?

Anyway, the test result on the official build is passed.

Result: PASS as no Critical Regression or TestBlocker found

Test Environment:
Host Distro: RHEL-9.1.0-20220627.0 BaseOS aarch64
Host Kernel: kernel-5.14.0-119.el9.aarch64
QEMU: qemu-kvm-7.0.0-7.el9.aarch64
edk2: edk2-aarch64-20220526git16779ede2d36-1.el9.noarch
Guest: RHEL.9.1.0

Results Analysis:
From 85 tests executed, 84 passed and 0 warned - success rate of 98.82% (excluding SKIP and CANCEL)
1 test case failed with an auto issue but retes passed

New bugs(0):
Existing bugs(0):

Job link:
http://10.0.136.47/6759356/results.html

Comment 44 Yanan Fu 2022-06-28 06:07:27 UTC

QE bot(pre verify): Set 'Verified:Tested,SanityOnly' as gating/tier1 test pass.

Comment 45 Eric Auger 2022-06-28 08:37:44 UTC

(In reply to Yihuang Yu from comment #43)
> Analyzed the build log, "-flto" is still not in the configure setting, is
> this expected? The full configure from here:
> http://download.eng.bos.redhat.com/brewroot/vol/rhel-9/packages/qemu-kvm/7.0.
> 0/7.el9/data/logs/aarch64/build.log
> 
> I can see x86 enabled -flto:
> http://download.eng.bos.redhat.com/brewroot/vol/rhel-9/packages/qemu-kvm/7.0.
> 0/7.el9/data/logs/x86_64/build.log, and I also compiled with -flto myself on
> aarch64, it passed. Steps refer to Eric's bug 2000479
> 
> # ./configure --cc=clang --cxx=/bin/false --prefix=/usr --libdir=/usr/lib64
> --datadir=/usr/share --sysconfdir=/etc --interp-prefix=/usr/qemu-%M
> --localstatedir=/var --docdir=/usr/share/doc --libexecdir=/usr/libexec
> '--extra-ldflags=-Wl,-z,relro -Wl,--as-needed  -Wl,-z,now   -flto'
> '--extra-cflags=-O2 -flto -fexceptions -g -grecord-gcc-switches -pipe -Wall
> -Werror=format-security -Wp,-D_FORTIFY_SOURCE=2 -Wp,-D_GLIBCXX_ASSERTIONS
> --config /usr/lib/rpm/redhat/redhat-hardened-clang.cfg
> -fstack-protector-strong   -fasynchronous-unwind-tables '
> --target-list=aarch64-softmmu --enable-kvm
> --extra-cflags=-Wstrict-prototypes --extra-cflags=-Wredundant-decls
> --enable-trace-backends=log --enable-seccomp --enable-cap-ng
> --disable-werror --without-default-devices --disable-capstone
> --target-list='aarch64-softmmu'
> 
> # make check-unit -j16
> ......
> ......
> 22/92 qemu:unit / test-block-iothread                 OK              0.64s 
> 16 subtests passed
> ......
> ......
> Ok:                 92
> Expected Fail:      0
> Fail:               0
> Unexpected Pass:    0
> Skipped:            0
> Timeout:            0
> 
> So in my opinion, maybe we can also enable -flto on other architectures?
> 
> Anyway, the test result on the official build is passed.
> 
> Result: PASS as no Critical Regression or TestBlocker found
> 
> Test Environment:
> Host Distro: RHEL-9.1.0-20220627.0 BaseOS aarch64
> Host Kernel: kernel-5.14.0-119.el9.aarch64
> QEMU: qemu-kvm-7.0.0-7.el9.aarch64
> edk2: edk2-aarch64-20220526git16779ede2d36-1.el9.noarch
> Guest: RHEL.9.1.0
> 
> Results Analysis:
> From 85 tests executed, 84 passed and 0 warned - success rate of 98.82%
> (excluding SKIP and CANCEL)
> 1 test case failed with an auto issue but retes passed
> 
> New bugs(0):
> Existing bugs(0):
> 
> Job link:
> http://10.0.136.47/6759356/results.html

While at it, would you have cycles to test with Safestack enabled (https://bugzilla.redhat.com/show_bug.cgi?id=1992968)? We had the same symptoms and maybe Stefan's series also fixes that other BZ. Thank you in advance!

Comment 46 Thomas Huth 2022-06-28 08:50:50 UTC

(In reply to Yihuang Yu from comment #43)
> Analyzed the build log, "-flto" is still not in the configure setting, is
> this expected? The full configure from here:

I think we also need a change to the qemu-kvm.spec file to enable LTO on non-x86 again. There's a hack there at the top of the file that looks like this:

%ifnarch x86_64
     %global _lto_cflags %%{nil}
%endif

Without removing that, we don't get LTO on s390x and aarch64, so I think this cannot properly verified. @stefanha, could you add such a patch on top, please?

Comment 47 Yihuang Yu 2022-06-28 09:07:25 UTC

(In reply to Eric Auger from comment #45)
> (In reply to Yihuang Yu from comment #43)
> > Analyzed the build log, "-flto" is still not in the configure setting, is
> > this expected? The full configure from here:
> > http://download.eng.bos.redhat.com/brewroot/vol/rhel-9/packages/qemu-kvm/7.0.
> > 0/7.el9/data/logs/aarch64/build.log
> > 
> > I can see x86 enabled -flto:
> > http://download.eng.bos.redhat.com/brewroot/vol/rhel-9/packages/qemu-kvm/7.0.
> > 0/7.el9/data/logs/x86_64/build.log, and I also compiled with -flto myself on
> > aarch64, it passed. Steps refer to Eric's bug 2000479
> > 
> > # ./configure --cc=clang --cxx=/bin/false --prefix=/usr --libdir=/usr/lib64
> > --datadir=/usr/share --sysconfdir=/etc --interp-prefix=/usr/qemu-%M
> > --localstatedir=/var --docdir=/usr/share/doc --libexecdir=/usr/libexec
> > '--extra-ldflags=-Wl,-z,relro -Wl,--as-needed  -Wl,-z,now   -flto'
> > '--extra-cflags=-O2 -flto -fexceptions -g -grecord-gcc-switches -pipe -Wall
> > -Werror=format-security -Wp,-D_FORTIFY_SOURCE=2 -Wp,-D_GLIBCXX_ASSERTIONS
> > --config /usr/lib/rpm/redhat/redhat-hardened-clang.cfg
> > -fstack-protector-strong   -fasynchronous-unwind-tables '
> > --target-list=aarch64-softmmu --enable-kvm
> > --extra-cflags=-Wstrict-prototypes --extra-cflags=-Wredundant-decls
> > --enable-trace-backends=log --enable-seccomp --enable-cap-ng
> > --disable-werror --without-default-devices --disable-capstone
> > --target-list='aarch64-softmmu'
> > 
> > # make check-unit -j16
> > ......
> > ......
> > 22/92 qemu:unit / test-block-iothread                 OK              0.64s 
> > 16 subtests passed
> > ......
> > ......
> > Ok:                 92
> > Expected Fail:      0
> > Fail:               0
> > Unexpected Pass:    0
> > Skipped:            0
> > Timeout:            0
> > 
> > So in my opinion, maybe we can also enable -flto on other architectures?
> > 
> > Anyway, the test result on the official build is passed.
> > 
> > Result: PASS as no Critical Regression or TestBlocker found
> > 
> > Test Environment:
> > Host Distro: RHEL-9.1.0-20220627.0 BaseOS aarch64
> > Host Kernel: kernel-5.14.0-119.el9.aarch64
> > QEMU: qemu-kvm-7.0.0-7.el9.aarch64
> > edk2: edk2-aarch64-20220526git16779ede2d36-1.el9.noarch
> > Guest: RHEL.9.1.0
> > 
> > Results Analysis:
> > From 85 tests executed, 84 passed and 0 warned - success rate of 98.82%
> > (excluding SKIP and CANCEL)
> > 1 test case failed with an auto issue but retes passed
> > 
> > New bugs(0):
> > Existing bugs(0):
> > 
> > Job link:
> > http://10.0.136.47/6759356/results.html
> 
> While at it, would you have cycles to test with Safestack enabled
> (https://bugzilla.redhat.com/show_bug.cgi?id=1992968)? We had the same
> symptoms and maybe Stefan's series also fixes that other BZ. Thank you in
> advance!

OK Eric, I will enable both flto and safe-stack, and then trigger a tier1 testing. Will update test result later.

Comment 48 Yihuang Yu 2022-06-28 11:22:09 UTC

Unfortunately, I cannot rebuild the qemu-kvm rpm package from src.rpm if I have both flto and safe-stack enabled. Eric, so I don't think now is the right time to enable the safe-stack. Maybe we need to tweak some CFLAGS?

# diff /root/rpmbuild/SPECS/qemu-kvm.spec /home/qemu-kvm.spec.backup
7a8,13
> # LTO does not work with the coroutines of QEMU on non-x86 architectures
> # (see BZ 1952483 and 1950192 for more information)
> %ifnarch x86_64
>     %global _lto_cflags %%{nil}
> %endif
>
18c24
< %global have_safe_stack 1
---
> %global have_safe_stack 0
22a29,31
> %ifarch x86_64
> %global have_safe_stack 1
> %endif

flto + safe-stack:

 27/128 qemu:unit / test-bdrv-drain                                                 ERROR           1.01s   killed by signal 11 SIGSEGV
―――――――――――――――――――――――――――――――――――――――― ✀  ――――――――――――――――――――――――――――――――――――――――
stderr:

TAP parsing error: Too few tests run (expected 42, got 20)
(test program exited with status code -11)

 33/128 qemu:unit / test-block-iothread                                             ERROR           1.66s   killed by signal 6 SIGABRT
―――――――――――――――――――――――――――――――――――――――― ✀  ――――――――――――――――――――――――――――――――――――――――
stderr:
qemu_aio_coroutine_enter: Co-routine was already scheduled in ''


TAP parsing error: Too few tests run (expected 16, got 10)
(test program exited with status code -6)


Summary of Failures:

 27/128 qemu:unit / test-bdrv-drain                                        ERROR           0.94s   killed by signal 11 SIGSEGV
 33/128 qemu:unit / test-block-iothread                                    ERROR           1.54s   killed by signal 6 SIGABRT


Ok:                 123
Expected Fail:      0
Fail:               2
Unexpected Pass:    0
Skipped:            3
Timeout:            0

Comment 53 bfu 2022-07-13 08:51:22 UTC

(In reply to lijin from comment #42)
> Hi Yihuang and Boqiao,
> 
> Could you do the pre-verify on aarch64 and s390x with the fixed version?
> 
> Thanks.

[root@l42 build]# tests/unit/test-block-iothread
# random seed: R02Sdf2c11a84ebf6fa4a3bf33e5f4ba9f5c
1..16
# Start of sync-op tests
ok 1 /sync-op/pread
ok 2 /sync-op/pwrite
ok 3 /sync-op/load_vmstate
ok 4 /sync-op/save_vmstate
ok 5 /sync-op/pdiscard
ok 6 /sync-op/truncate
ok 7 /sync-op/block_status
ok 8 /sync-op/flush
ok 9 /sync-op/check
ok 10 /sync-op/activate
# End of sync-op tests
# Start of attach tests
ok 11 /attach/blockjob
ok 12 /attach/second_node
ok 13 /attach/preserve_blk_ctx
# End of attach tests
# Start of propagate tests
ok 14 /propagate/basic
ok 15 /propagate/diamond
ok 16 /propagate/mirror
# End of propagate tests

I didn't see an error on s390x

Comment 54 Stefan Hajnoczi 2022-07-14 11:06:01 UTC

Based on comment 48 there are still issues, probably related to coroutines, that need to be debugged if we want to enable LTO + SafeStack on non-x86 architectures.

The coroutine TLS patches were already merged in 7.0.0-7 for this BZ.

I am on PTO until August. At that time I can investigate the root cause. Let's keep LTO disabled until the root cause is understood.

If someone else wants to take over this BZ while I'm away, feel free.

Comment 55 Yihuang Yu 2022-07-15 07:07:15 UTC

OK, Stefan.

Then let me move the ITM to a bit later until we decide to fix the compile issue in which release, thanks for understanding.

Comment 67 RHEL Program Management 2023-04-03 07:28:07 UTC

After evaluating this issue, there are no plans to address it further or fix it in an upcoming release.  Therefore, it is being closed.  If plans change such that this issue will be fixed in an upcoming release, then the bug can be reopened.

Comment 71 RHEL Program Management 2023-09-22 16:35:26 UTC

Issue migration from Bugzilla to Jira is in process at this time. This will be the last message in Jira copied from the Bugzilla bug.

Comment 72 RHEL Program Management 2023-09-22 16:35:57 UTC

This BZ has been automatically migrated to the issues.redhat.com Red Hat Issue Tracker. All future work related to this report will be managed there.

Due to differences in account names between systems, some fields were not replicated.  Be sure to add yourself to Jira issue's "Watchers" field to continue receiving updates and add others to the "Need Info From" field to continue requesting information.

To find the migrated issue, look in the "Links" section for a direct link to the new issue location. The issue key will have an icon of 2 footprints next to it, and begin with "RHEL-" followed by an integer.  You can also find this issue by visiting https://issues.redhat.com/issues/?jql= and searching the "Bugzilla Bug" field for this BZ's number, e.g. a search like:

"Bugzilla Bug" = 1234567

In the event you have trouble locating or viewing this issue, you can file an issue by sending mail to rh-issues. You can also visit https://access.redhat.com/articles/7032570 for general account information.