Bug 1952483
Summary: | RFE: QEMU's coroutines fail with CFLAGS=-flto on non-x86_64 architectures | ||
---|---|---|---|
Product: | Red Hat Enterprise Linux 9 | Reporter: | Miroslav Rezanina <mrezanin> |
Component: | qemu-kvm | Assignee: | Virtualization Maintenance <virt-maint> |
qemu-kvm sub component: | Storage | QA Contact: | Yihuang Yu <yihyu> |
Status: | CLOSED MIGRATED | Docs Contact: | |
Severity: | medium | ||
Priority: | medium | CC: | alougovs, berrange, bfu, coli, eric.auger, fweimer, hreitz, jinzhao, juzhang, kwolf, lijin, mdean, mdeng, ngu, npopov, qzhang, scoady, smitterl, stefanha, thuth, tstaudt, tstellar, vgoyal, virt-maint, virt-qe-z, xuma, yihyu, zhencliu, zhenyzha |
Version: | 9.0 | Keywords: | FutureFeature, MigratedToJIRA, Reopened, RFE, Triaged |
Target Milestone: | beta | Flags: | pm-rhel: mirror+ |
Target Release: | 9.0 | ||
Hardware: | All | ||
OS: | Linux | ||
Whiteboard: | |||
Fixed In Version: | Doc Type: | If docs needed, set a value | |
Doc Text: | Story Points: | --- | |
Clone Of: | Environment: | ||
Last Closed: | 2023-09-22 16:35:57 UTC | Type: | Story |
Regression: | --- | Mount Type: | --- |
Documentation: | --- | CRM: | |
Verified Versions: | Category: | --- | |
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
Cloudforms Team: | --- | Target Upstream Version: | |
Embargoed: | |||
Bug Depends On: | |||
Bug Blocks: | 1939500, 1971841 |
Description
Miroslav Rezanina
2021-04-22 11:11:47 UTC
I cannot reproduce this on RHEL 9 Beta compose ID RHEL-9.0.0-20210504.5 on s390x (z15) with qemu master@3e13d8e34b53d8f9a3421a816ccfbdc5fa874e98. I ran

# ../configure --target-list=s390x-softmmu
# make
# make check-unit

I had to install TAP::Parser through CPAN.

Hi Kevin,

we seem to have a bunch of issues related to presumed coroutine races:
- https://bugzilla.redhat.com/show_bug.cgi?id=1924014
- https://bugzilla.redhat.com/show_bug.cgi?id=1950192
- https://bugzilla.redhat.com/show_bug.cgi?id=1924974

I also saw https://bugs.launchpad.net/ubuntu/+source/qemu/+bug/1921664, but I don't understand yet if or how it relates to that BZ.

The BZs above mention:
- qemu-system-xxx: ../../util/qemu-coroutine-lock.c:57: qemu_co_queue_wait_impl: Assertion `qemu_in_coroutine()' failed.
- qemu-system-xxx: ../../block/aio_task.c:64: aio_task_pool_wait_one: Assertion `qemu_coroutine_self() == pool->main_co' failed.

Thanks
Eric

So far it seems we are only seeing this kind of problem on RHEL 9 and only on non-x86. My best guess is still that something is wrong with the TLS implementation there.

If you can reproduce the problems, you could try to figure out which of the components makes the difference: Does the problem still occur when compiling and running the RHEL 9 qemu source on RHEL 8, and when compiling and running RHEL 8 qemu on RHEL 9? If the problem is not in the QEMU source, is it the compiler/toolchain or the kernel, or can we identify any other specific component that causes the difference when changed individually?

Stefan, could you please have a look at this BZ here?
It's easy to reproduce the problem with failing coroutines when compiling with -flto on a non-x86 box here:

tar -xaf ~/qemu-6.0.0.tar.xz
cd qemu-6.0.0/
./configure --disable-docs --extra-cflags='-O2 -flto=auto -ffat-lto-objects' --target-list='s390x-softmmu'
cd build/
make -j8 tests/unit/test-block-iothread
tests/unit/test-block-iothread

Fails with:

ERROR:../tests/unit/test-block-iothread.c:379:test_job_run: assertion failed: (qemu_get_current_aio_context() == job->aio_context)
Bail out! ERROR:../tests/unit/test-block-iothread.c:379:test_job_run: assertion failed: (qemu_get_current_aio_context() == job->aio_context)

Backtrace looks like this:

(gdb) bt full
#0  0x000003fffca9f4ae in __pthread_kill_internal () from /lib64/libc.so.6
No symbol table info available.
#1  0x000003fffca4fa20 in raise () from /lib64/libc.so.6
No symbol table info available.
#2  0x000003fffca31398 in abort () from /lib64/libc.so.6
No symbol table info available.
#3  0x000003fffde00de6 in g_assertion_message () from /lib64/libglib-2.0.so.0
No symbol table info available.
#4  0x000003fffde00e46 in g_assertion_message_expr () from /lib64/libglib-2.0.so.0
No symbol table info available.
#5  0x000002aa00023c20 in test_job_run (job=0x2aa001f5f50, errp=<optimized out>) at ../tests/unit/test-block-iothread.c:379
        s = 0x2aa001f5f50
        __func__ = "test_job_run"
        __mptr = <optimized out>
#6  0x000002aa0005a546 in job_co_entry (opaque=0x2aa001f5f50) at ../job.c:914
        job = 0x2aa001f5f50
        __PRETTY_FUNCTION__ = "job_co_entry"
#7  0x000002aa000e2d7a in coroutine_trampoline (i0=<optimized out>, i1=<optimized out>) at ../util/coroutine-ucontext.c:173
        arg = {p = <optimized out>, i = {<optimized out>, <optimized out>}}
        self = <optimized out>
        co = 0x2aa001f6110
        fake_stack_save = 0x0
#8  0x000003fffca65092 in __makecontext_ret () from /lib64/libc.so.6

Thanks to Florian Weimer for input from the toolchain side. It looks grim, unfortunately. The toolchain may cache TLS values regardless of stack/thread switches.
This means a coroutine running in thread 1 that switches to thread 2 might see TLS values from thread 1. To recap the issue:

static int coroutine_fn test_job_run(Job *job, Error **errp)
{
    TestBlockJob *s = container_of(job, TestBlockJob, common.job);

    job_transition_to_ready(&s->common.job);
    while (!s->should_complete) {
        s->n++;
        g_assert(qemu_get_current_aio_context() == job->aio_context);

        /* Avoid job_sleep_ns() because it marks the job as !busy. We want to
         * emulate some actual activity (probably some I/O) here so that the
         * drain involved in AioContext switches has to wait for this activity
         * to stop. */
        qemu_co_sleep_ns(QEMU_CLOCK_REALTIME, 1000000);
        ^^^^^^^^^^^^^^^^

        job_pause_point(&s->common.job);
    }

    g_assert(qemu_get_current_aio_context() == job->aio_context);
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Both the qemu_co_sleep_ns() and qemu_get_current_aio_context() functions load the same TLS variable. The compiler could cache the value since it knows other threads will not modify the TLS value.

We have discussed changing how QEMU uses TLS from coroutines, but TLS is used widely so it's not easy to fix this.

Kevin mentioned that he was able to work around the issue by disabling inlining in one case, but the concern is that there's no systematic way of preventing these bugs.

From a QEMU perspective the easiest option would be a "TLS barrier" primitive that tells the toolchain that TLS accesses cannot be cached across a certain point.

Another toolchain option is to disable TLS caching optimizations, i.e. stop assuming that TLS memory can only be modified from the current thread. I wonder how much of a performance impact this has.

Finally, a hack might be to find a way to convert a TLS variable's address into a regular pointer so the compiler can no longer assume other threads don't modify it. Then loads shouldn't be cached across sequence points according to the C standard.

Florian: Are any of these toolchain approaches possible?
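To make the stale-TLS hazard concrete, here is a minimal sketch (not QEMU code) showing that each thread gets its own copy of a thread-local variable at its own address. The names `tls_slot`, `record_tls_address` and `tls_address_differs_between_threads` are hypothetical, and the sketch assumes GCC or Clang (for `__thread`) and POSIX threads. A coroutine that keeps using an address computed in thread 1 after being resumed in thread 2 would still be reading thread 1's copy.

```c
#include <assert.h>
#include <pthread.h>
#include <stdint.h>

/* Hypothetical thread-local variable; every thread has its own copy. */
static __thread int tls_slot;

/* Thread body: report where this thread's copy of tls_slot lives. */
static void *record_tls_address(void *out)
{
    *(uintptr_t *)out = (uintptr_t)&tls_slot;
    return NULL;
}

/*
 * Returns 1 if a second thread sees tls_slot at a different address than
 * the calling thread, 0 if the addresses match, and -1 if the helper
 * thread could not be created.
 */
int tls_address_differs_between_threads(void)
{
    uintptr_t here = (uintptr_t)&tls_slot;
    uintptr_t there = 0;
    pthread_t thread;
    int rc;

    if (pthread_create(&thread, NULL, record_tls_address, &there) != 0) {
        return -1;
    }
    rc = pthread_join(thread, NULL);
    assert(rc == 0);
    (void)rc; /* silence unused-variable warnings when NDEBUG is set */

    return there != here;
}
```

Because the calling thread is still alive while the helper thread runs, both copies exist at the same time and the function is expected to return 1. Older glibc versions may need `-pthread` when building.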
I have started a thread on the gcc list:

Disabling TLS address caching to help QEMU on GNU/Linux
https://gcc.gnu.org/pipermail/gcc/2021-July/236831.html

(In reply to Florian Weimer from comment #9)
> I have started a thread on the gcc list:
>
> Disabling TLS address caching to help QEMU on GNU/Linux
> https://gcc.gnu.org/pipermail/gcc/2021-July/236831.html

This discussion rather implies that we ought to have an RFE bug open against GCC to request a solution and track its progress towards RHEL. This is complicated in 9 by the fact that we're now using Clang too, so presumably we might need an RFE against Clang instead of, or as well as, GCC.

Hi Florian,
Thanks for starting the mailing list thread. Has there been activity in the gcc community? As Daniel Berrange mentioned, clang has entered the picture. I wanted to check if you see anything happening. If not, then I'll open RFEs as suggested.

(In reply to Stefan Hajnoczi from comment #13)
> Hi Florian,
> Thanks for starting the mailing list thread. Has there been activity in the
> gcc community?

There has been the discussion, but nothing else that I know of. (If you have switched to clang, this is not really relevant to you anyway, so I don't know what the priority is for the gcc side.)

We switched to Clang with qemu-kvm-6.0.0-12, but if I've got a comment in another BZ right (https://bugzilla.redhat.com/show_bug.cgi?id=1940132#c47), that build still fails on aarch64. It seems to work on s390x at first glance, though.

(In reply to Thomas Huth from comment #15)
> We switched to Clang with qemu-kvm-6.0.0-12, but if I've got a comment in
> another BZ right (https://bugzilla.redhat.com/show_bug.cgi?id=1940132#c47)
> that build still fails on aarch64. It seems to work on s390x on a first
> glance, though.

It's a little hard to follow through all the different bugzilla links. Would someone be able to file a bug against the clang component with the summary of the failures specific to clang?
That would make it easier for our team to analyze and track.

@eric.auger: Can you still reproduce the problem with a Clang build (that uses -flto) on aarch64? If so, could you please open a BZ against "clang" as Tom suggested, with instructions on how to reproduce the problem there?

Is there a reduced test case that will demonstrate the problem mentioned in comment 8?

(In reply to Tom Stellard from comment #20)
> Is there a reduced test case that will demonstrate the problem mentioned in
> comment 8.

Tom, I opened https://bugzilla.redhat.com/show_bug.cgi?id=2000479 against Clang as you suggested. There you will find the qemu configuration and test case I used to trigger the issue. It is basically the same as the one described by Thomas in https://bugzilla.redhat.com/show_bug.cgi?id=1952483#c6, with Clang.

(In reply to Eric Auger from comment #21)
> Tom, I opened https://bugzilla.redhat.com/show_bug.cgi?id=2000479 against
> CLANG as you suggested. Here you will find the qemu configuration and test
> case I used to trigger the issue. It is basically the same as the one
> described by Thomas in
> https://bugzilla.redhat.com/show_bug.cgi?id=1952483#c6, with CLANG.

OK, thanks. Am I correct that the core problem is sequences like this:

thread_local int *ptr;

read(ptr);
context_switch();
read(ptr);

where the two read functions read the same address even though they may run in different threads?

(In reply to Tom Stellard from comment #22)
> OK, thanks. Am I correct that the core problem is sequences like this:
>
> thread_local int *ptr;
>
> read(ptr);
> context_switch();
> read(ptr);
>
> Where the 2 read functions read the same address even though they may run in
> different threads.

Yes.

This may sound a bit naive, but... if the coroutines are implemented in terms of setjmp/longjmp, then the C11 standard says that

```
7.13.2.1 [The longjmp function]

1 #include <setjmp.h>
  _Noreturn void longjmp(jmp_buf env, int val);
Description

2 The longjmp function restores the environment saved by the most recent
invocation of the setjmp macro in the same invocation of the program with the
corresponding jmp_buf argument. If there has been no such invocation, or **if
the invocation was from another thread of execution**, or if the function
containing the invocation of the setjmp macro has terminated execution
(footnote 248) in the interim, or if the invocation of the setjmp macro was
within the scope of an identifier with variably modified type and execution
has left that scope in the interim, the behavior is undefined.
```

Then isn't that prone to failure? Wouldn't setting the thread local variables as volatile change something?

This small godbolt experiment is promising:

https://gcc.godbolt.org/z/6nznnMvTs

(In reply to serge_sans_paille from comment #24)
> This may sound a bit naive, but... if the coroutines are implemented in
> terms of setjmp/longjmp, then the C11 standard says that
> [...]
> Then isn't that prone to failure?

Yes, according to the spec the behavior is undefined. QEMU has other coroutine implementations too, e.g. assembly or using other OS/runtime APIs. I don't think setjmp is the culprit here since QEMU could switch to the assembly implementation and it would still have the TLS problem.

> Wouldn't setting the thread local
> variables as volatile change something?
>
> This small godbolt experiment is promising:
>
> https://gcc.godbolt.org/z/6nznnMvTs

I don't think that approach is workable because:

1. It's very tricky to get it right. The relatively innocuous addition I made here is broken: https://gcc.godbolt.org/z/8GP4dTP56. The likelihood of errors like this slipping past code review is high.

2. All __thread variables in QEMU need to be converted to volatile pointers, including auditing and rewriting code that uses the variables.

(In reply to serge_sans_paille from comment #24)
> Wouldn't setting the thread local variables as volatile change something?

When I tried that with the original reproducer, it was not enough. In hindsight that made sense to me: it's not the value of the variable that is volatile and must not be cached; it's the variable's address that changes between threads. I don't think this can be expressed with volatile.

Also note that the bug didn't reproduce on x86 without -mtls-dialect=gnu2 (which apparently is the default on other architectures).

Another attempt: if we force an indirect access to the TLS through `-mno-tls-direct-seg-refs`, that should prevent hoisting (?)
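Since it is the address that goes stale, one way to keep the compiler from hoisting it is to hide each thread-local variable behind accessor functions that cannot be inlined, so that every access recomputes the address in whichever thread is currently running. The following is a minimal sketch of that idea, assuming GCC or Clang. The names `co_tls_counter`, `get_counter`, `set_counter`, `bump_counter_twice` and `demo_counter` are hypothetical and are not QEMU's actual macros, and the approach only holds as long as the compiler does not optimize across non-inlinable functions.

```c
#include <assert.h>

/* Hypothetical thread-local state; not a real QEMU variable. */
static __thread int co_tls_counter;

/*
 * noinline keeps the TLS address computation inside the callee, so a
 * caller cannot cache the address across a coroutine switch. GCC
 * documents an empty asm statement as a way to keep calls to a noinline
 * function from being optimized away.
 */
__attribute__((noinline)) int get_counter(void)
{
    __asm__ volatile("");
    return co_tls_counter;
}

__attribute__((noinline)) void set_counter(int value)
{
    __asm__ volatile("");
    co_tls_counter = value;
}

/* Caller code only ever goes through the accessors. */
int bump_counter_twice(void)
{
    set_counter(get_counter() + 1);
    /* A coroutine yield here could resume this function on another thread. */
    set_counter(get_counter() + 1);
    return get_counter();
}

/* Usage sketch: in a single thread the accessors behave like a plain variable. */
void demo_counter(void)
{
    set_counter(0);
    assert(bump_counter_twice() == 2);
}
```

The cost of this design is a function call per access, and every direct use of the variable has to be rewritten to call the accessors.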
Do you have time to try the suggestion in Comment 27? I am on PTO until Sept 29th. Thank you!

(In reply to serge_sans_paille from comment #27)
> Another attempt: if we force an indirect access to the TLS through
> `-mno-tls-direct-seg-refs`, that should prevent hoisting (?)

-mno-tls-direct-seg-refs is an x86 option, introduced for i386 para-virtualization (with i386 host kernels), so it's thoroughly obsolete by now. It also goes in the wrong direction: -mtls-direct-seg-refs (the default) completely avoids materializing the thread pointer for direct accesses to thread-local variables (of the initial-exec or local-exec kind). And if the thread pointer is not loaded into a general-purpose register, it can't be out-of-date after a context switch.

The following diff

```
--- qemu-6.1.0.orig/util/async.c 2021-08-24 13:35:41.000000000 -0400
+++ qemu-6.1.0/util/async.c 2021-09-20 17:48:15.404681749 -0400
@@ -673,6 +673,10 @@

 AioContext *qemu_get_current_aio_context(void)
 {
+    if (qemu_in_coroutine()) {
+        Coroutine *self = qemu_coroutine_self();
+        return self->ctx;
+    }
     if (my_aiocontext) {
         return my_aiocontext;
     }
```

fixes the scenario proposed by Thomas in https://bugzilla.redhat.com/show_bug.cgi?id=1952483#c6 (but it does not fix all tests).

I understand this puts an extra burden on qemu developers, but it also seems sane to me to prevent coroutines from accessing thread-local variables from a thread other than the one in which they were created (interesting read on that topic: http://www.crystalclearsoftware.com/soc/coroutine/coroutine/coroutine_thread.html).

Would it be acceptable to enforce that property upstream?

(In reply to serge_sans_paille from comment #30)
> The following diff
>
> ```
> --- qemu-6.1.0.orig/util/async.c 2021-08-24 13:35:41.000000000 -0400
> +++ qemu-6.1.0/util/async.c 2021-09-20 17:48:15.404681749 -0400
> @@ -673,6 +673,10 @@
>
> AioContext *qemu_get_current_aio_context(void)
> {
> +    if (qemu_in_coroutine()) {

This uses the `current` TLS variable.
Are you sure this works? It seems like the same problem :).

The patch above fixes the LTO issue, and once applied, I've been successfully building qemu with LTO with GCC: https://koji.fedoraproject.org/koji/taskinfo?taskID=76803353 (all archs) and with Clang: https://koji.fedoraproject.org/koji/taskinfo?taskID=76802978 (s390x only).

It's a compiler-agnostic patch: it works for any compiler that honors __attribute__((noinline)), as long as the compiler doesn't try to do interprocedural optimization across non-inlinable functions.

RFC patch posted upstream based on the patch Serge attached to this BZ:
https://patchew.org/QEMU/20211025140716.166971-1-stefanha@redhat.com/

Based on recent discussions with Stefan/Thomas and others, I'm moving this to ITR 9.1.0 as a "FutureFeature" since we don't yet enable LTO downstream on non-x86 architectures. We do have an RFC patch upstream, so hopefully this can be added soon.

The following have been merged:
d5d2b15ecf cpus: use coroutine TLS macros for iothread_locked
17c78154b0 rcu: use coroutine TLS macros
47b7446456 util/async: replace __thread with QEMU TLS macros
7d29c341c9 tls: add macros for coroutine-safe TLS variables

I sent another 3 patches as a follow-up series.

(In reply to Miroslav Rezanina from comment #0)
> When running build for qemu-kvm for RHEL 9, test-block-iothread during "make
> check " fails on aarch64, ppc64le and s390x architecture for
> /attach/blockjob (pass on x86_64):

FYI, qemu-kvm isn't supported on RHEL 9 on Power.

Thanks

The following patches were merged upstream:
c1fe694357 coroutine-win32: use QEMU_DEFINE_STATIC_CO_TLS()
ac387a08a9 coroutine: use QEMU_DEFINE_STATIC_CO_TLS()
34145a307d coroutine-ucontext: use QEMU_DEFINE_STATIC_CO_TLS()

Hi Yihuang and Boqiao,

Could you do the pre-verify on aarch64 and s390x with the fixed version?

Thanks.

I analyzed the build log: "-flto" is still not in the configure settings. Is this expected?
The full configure from here: http://download.eng.bos.redhat.com/brewroot/vol/rhel-9/packages/qemu-kvm/7.0.0/7.el9/data/logs/aarch64/build.log

I can see x86 enabled -flto: http://download.eng.bos.redhat.com/brewroot/vol/rhel-9/packages/qemu-kvm/7.0.0/7.el9/data/logs/x86_64/build.log, and I also compiled with -flto myself on aarch64, and it passed. The steps follow Eric's bug 2000479:

# ./configure --cc=clang --cxx=/bin/false --prefix=/usr --libdir=/usr/lib64 --datadir=/usr/share --sysconfdir=/etc --interp-prefix=/usr/qemu-%M --localstatedir=/var --docdir=/usr/share/doc --libexecdir=/usr/libexec '--extra-ldflags=-Wl,-z,relro -Wl,--as-needed -Wl,-z,now -flto' '--extra-cflags=-O2 -flto -fexceptions -g -grecord-gcc-switches -pipe -Wall -Werror=format-security -Wp,-D_FORTIFY_SOURCE=2 -Wp,-D_GLIBCXX_ASSERTIONS --config /usr/lib/rpm/redhat/redhat-hardened-clang.cfg -fstack-protector-strong -fasynchronous-unwind-tables ' --target-list=aarch64-softmmu --enable-kvm --extra-cflags=-Wstrict-prototypes --extra-cflags=-Wredundant-decls --enable-trace-backends=log --enable-seccomp --enable-cap-ng --disable-werror --without-default-devices --disable-capstone --target-list='aarch64-softmmu'

# make check-unit -j16
......
......
22/92 qemu:unit / test-block-iothread OK 0.64s 16 subtests passed
......
......
Ok: 92
Expected Fail: 0
Fail: 0
Unexpected Pass: 0
Skipped: 0
Timeout: 0

So in my opinion, maybe we can also enable -flto on other architectures?

Anyway, the test result on the official build is passed.
Result: PASS as no Critical Regression or TestBlocker found

Test Environment:
Host Distro: RHEL-9.1.0-20220627.0 BaseOS aarch64
Host Kernel: kernel-5.14.0-119.el9.aarch64
QEMU: qemu-kvm-7.0.0-7.el9.aarch64
edk2: edk2-aarch64-20220526git16779ede2d36-1.el9.noarch
Guest: RHEL.9.1.0

Results Analysis:
From 85 tests executed, 84 passed and 0 warned - success rate of 98.82% (excluding SKIP and CANCEL)
1 test case failed with an auto issue but passed on retest

New bugs(0):
Existing bugs(0):

Job link:
http://10.0.136.47/6759356/results.html

QE bot(pre verify): Set 'Verified:Tested,SanityOnly' as gating/tier1 test pass.

(In reply to Yihuang Yu from comment #43)

While at it, would you have cycles to test with Safestack enabled (https://bugzilla.redhat.com/show_bug.cgi?id=1992968)? We had the same symptoms and maybe Stefan's series also fixes that other BZ.
Thank you in advance!

(In reply to Yihuang Yu from comment #43)
> Analyzed the build log, "-flto" is still not in the configure setting, is
> this expected? The full configure from here:

I think we also need a change to the qemu-kvm.spec file to enable LTO on non-x86 again. There's a hack there at the top of the file that looks like this:

%ifnarch x86_64
%global _lto_cflags %%{nil}
%endif

Without removing that, we don't get LTO on s390x and aarch64, so I think this cannot be properly verified.

@stefanha, could you add such a patch on top, please?

(In reply to Eric Auger from comment #45)
> While at it, would you have cycles to test with Safestack enabled
> (https://bugzilla.redhat.com/show_bug.cgi?id=1992968)? We had the same
> symptoms and maybe Stefan's series also fixes that other BZ. Thank you in
> advance!

OK Eric, I will enable both flto and safe-stack, and then trigger tier1 testing. I will update the test result later.

Unfortunately, I cannot rebuild the qemu-kvm rpm package from src.rpm if I have both flto and safe-stack enabled. So, Eric, I don't think now is the right time to enable safe-stack. Maybe we need to tweak some CFLAGS?
# diff /root/rpmbuild/SPECS/qemu-kvm.spec /home/qemu-kvm.spec.backup
7a8,13
> # LTO does not work with the coroutines of QEMU on non-x86 architectures
> # (see BZ 1952483 and 1950192 for more information)
> %ifnarch x86_64
> %global _lto_cflags %%{nil}
> %endif
>
18c24
< %global have_safe_stack 1
---
> %global have_safe_stack 0
22a29,31
> %ifarch x86_64
> %global have_safe_stack 1
> %endif

flto + safe-stack:

27/128 qemu:unit / test-bdrv-drain ERROR 1.01s killed by signal 11 SIGSEGV
―――――――――――――――――――――――――――――――――――――――― ✀ ――――――――――――――――――――――――――――――――――――――――
stderr:
TAP parsing error: Too few tests run (expected 42, got 20)
(test program exited with status code -11)

33/128 qemu:unit / test-block-iothread ERROR 1.66s killed by signal 6 SIGABRT
―――――――――――――――――――――――――――――――――――――――― ✀ ――――――――――――――――――――――――――――――――――――――――
stderr:
qemu_aio_coroutine_enter: Co-routine was already scheduled in ''
TAP parsing error: Too few tests run (expected 16, got 10)
(test program exited with status code -6)

Summary of Failures:
27/128 qemu:unit / test-bdrv-drain ERROR 0.94s killed by signal 11 SIGSEGV
33/128 qemu:unit / test-block-iothread ERROR 1.54s killed by signal 6 SIGABRT

Ok: 123
Expected Fail: 0
Fail: 2
Unexpected Pass: 0
Skipped: 3
Timeout: 0

(In reply to lijin from comment #42)
> Hi Yihuang and Boqiao,
>
> Could you do the pre-verify on aarch64 and s390x with the fixed version?
>
> Thanks.
[root@l42 build]# tests/unit/test-block-iothread
# random seed: R02Sdf2c11a84ebf6fa4a3bf33e5f4ba9f5c
1..16
# Start of sync-op tests
ok 1 /sync-op/pread
ok 2 /sync-op/pwrite
ok 3 /sync-op/load_vmstate
ok 4 /sync-op/save_vmstate
ok 5 /sync-op/pdiscard
ok 6 /sync-op/truncate
ok 7 /sync-op/block_status
ok 8 /sync-op/flush
ok 9 /sync-op/check
ok 10 /sync-op/activate
# End of sync-op tests
# Start of attach tests
ok 11 /attach/blockjob
ok 12 /attach/second_node
ok 13 /attach/preserve_blk_ctx
# End of attach tests
# Start of propagate tests
ok 14 /propagate/basic
ok 15 /propagate/diamond
ok 16 /propagate/mirror
# End of propagate tests

I didn't see an error on s390x.

Based on comment 48 there are still issues, probably related to coroutines, that need to be debugged if we want to enable LTO + SafeStack on non-x86 architectures. The coroutine TLS patches were already merged in 7.0.0-7 for this BZ.

I am on PTO until August. At that time I can investigate the root cause. Let's keep LTO disabled until the root cause is understood. If someone else wants to take over this BZ while I'm away, feel free.

OK, Stefan. Then let me move the ITM to a bit later, until we decide in which release to fix the compile issue. Thanks for understanding.

After evaluating this issue, there are no plans to address it further or fix it in an upcoming release. Therefore, it is being closed. If plans change such that this issue will be fixed in an upcoming release, then the bug can be reopened.

Issue migration from Bugzilla to Jira is in process at this time. This will be the last message in Jira copied from the Bugzilla bug.

This BZ has been automatically migrated to the issues.redhat.com Red Hat Issue Tracker. All future work related to this report will be managed there.

Due to differences in account names between systems, some fields were not replicated.
Be sure to add yourself to the Jira issue's "Watchers" field to continue receiving updates and add others to the "Need Info From" field to continue requesting information.

To find the migrated issue, look in the "Links" section for a direct link to the new issue location. The issue key will have an icon of 2 footprints next to it, and begin with "RHEL-" followed by an integer. You can also find this issue by visiting https://issues.redhat.com/issues/?jql= and searching the "Bugzilla Bug" field for this BZ's number, e.g. a search like:

"Bugzilla Bug" = 1234567

In the event you have trouble locating or viewing this issue, you can file an issue by sending mail to rh-issues. You can also visit https://access.redhat.com/articles/7032570 for general account information.