Description of problem:
Feb 20 11:18:57 ovirt-ovn001 abrt-hook-ccpp[306988]: Process 2814 (libvirtd) of user 0 killed by SIGSEGV - dumping core
Feb 20 11:18:58 ovirt-ovn001 abrt-hook-ccpp[306994]: Can't generate core backtrace: dwfl_getthread_frames failed: No DWARF information found
Feb 20 11:18:58 ovirt-ovn001 abrt-hook-ccpp[306988]: Core backtrace generator exited with error 1

Version-Release number of selected component (if applicable):
libvirt 6.6.0-7.3.el8.x86_64

How reproducible:
Happens from time to time (roughly once a week per node)

Additional info:
(gdb) bt full
#0  0x00007f66500000f0 in ?? ()
No symbol table info available.
#1  0x00007f6677c9e7bd in g_source_unref_internal () from /lib64/libglib-2.0.so.0
No symbol table info available.
#2  0x00007f6677c9e90e in g_source_iter_next () from /lib64/libglib-2.0.so.0
No symbol table info available.
#3  0x00007f6677ca0e23 in g_main_context_prepare () from /lib64/libglib-2.0.so.0
No symbol table info available.
#4  0x00007f6677ca18eb in g_main_context_iterate.isra () from /lib64/libglib-2.0.so.0
No symbol table info available.
#5  0x00007f6677ca1d72 in g_main_loop_run () from /lib64/libglib-2.0.so.0
No symbol table info available.
#6  0x00007f667b03937e in virEventThreadWorker (opaque=0x7f6650269980) at ../../src/util/vireventthread.c:120
        data = 0x7f6650269980
        running = 0x7f656c000b60
#7  0x00007f6677cc9d4a in g_thread_proxy () from /lib64/libglib-2.0.so.0
No symbol table info available.
#8  0x00007f667783814a in start_thread () from /lib64/libpthread.so.0
No symbol table info available.
#9  0x00007f667714df23 in clone () from /lib64/libc.so.6
No symbol table info available.
Created attachment 1758571: backtrace for all threads
This was supposed to be fixed by https://gitlab.com/libvirt/libvirt/-/commit/0db4743645b7a0611a3c0687f834205c9956f7fc in 6.7.0, which was back-ported to various 6.6.0 versions, such as 6.6.0-8 (BZ 1894045) and 6.6.0-7.2 (BZ 1915601). However, since your version should already include that fix, it looks like either the fix did not resolve the issue or there is a different issue here. Daniel: any idea whether this is related, since you did the original fix? The backtrace suggests it very possibly is.
(In reply to Martin Kletzander from comment #2)
> [...]
> Daniel: Any idea whether this is related since you did the original fix?
> The backtrace suggests it very possibly is.

Ooooh, it is the same bug, but occurring in a different piece of code. The patches we took fixed this problem in the virEventLoop implementation, but I forgot that virEventThread is separate. It is a dedicated per-VM thread that handles I/O watches for the QEMU monitor and guest agent. I suspect that we're unrefing a GSource in a different thread and triggering the same race.
I took a stab at it here: https://listman.redhat.com/archives/libvir-list/2021-March/msg00226.html
This should be fixed upstream by v7.1.0-96-g2a490ce5a03e:

commit 2a490ce5a03ef6607fe55515ba55d6cfd2016bef
Author: Martin Kletzander <mkletzan>
Date:   Thu Mar 4 10:00:06 2021 +0100

    glib: Use safe glib event workaround in other event loops
Martin, do you have any idea how to reproduce this issue?
(In reply to Han Han from comment #10)
Unfortunately, this is very similar to BZ 1894045 and BZ 1915601, so there is no reliable reproducer. Even the reporter hits this only occasionally.
*** Bug 1931929 has been marked as a duplicate of this bug. ***
Well, we have multiple nodes, and across all of them together I hit this issue roughly once per day. So if there is a build with a patched libvirt, I think we can confirm within a week whether it's fully fixed :)
Created attachment 1762892: Coredump backtrace

Hello, I just found a way to reproduce the virEventThreadWorker crash:

#!/bin/bash
# Domain to hammer; "virsh guestinfo" needs the QEMU guest agent running in it
VM=8.3

# Each function sends one kind of request to libvirtd in a tight loop
function loop_list(){
    while true; do virsh list --all; done
}
function loop_guestinfo(){
    while true; do virsh guestinfo "$VM"; done
}
function loop_domstats(){
    while true; do virsh domstats "$VM"; done
}

# Run all three loops concurrently in the background
loop_list &
loop_guestinfo &
loop_domstats &

The backtrace:
(gdb) bt
#0  0x00007fd9a75477d0 in g_source_unref_internal (source=0x7fd938450da0, context=0x7fd92816f1d0, have_lock=1) at gmain.c:2140
#1  0x00007fd9a7547a0e in g_source_iter_next (iter=iter@entry=0x7fd95292fa10, source=source@entry=0x7fd95292fa08) at gmain.c:980
#2  0x00007fd9a7549f23 in g_main_context_prepare (context=context@entry=0x7fd92816f1d0, priority=priority@entry=0x7fd95292fa90) at gmain.c:3452
#3  0x00007fd9a754a9eb in g_main_context_iterate (context=0x7fd92816f1d0, block=block@entry=1, dispatch=dispatch@entry=1, self=<optimized out>) at gmain.c:3882
#4  0x00007fd9a754ae72 in g_main_loop_run (loop=0x7fd928095860) at gmain.c:4098
#5  0x00007fd9a7cf1ede in virEventThreadWorker (opaque=0x7fd92816fd70) at ../src/util/vireventthread.c:124
#6  0x00007fd9a7572e1a in g_thread_proxy (data=0x7fd93800aca0) at gthread.c:784
#7  0x00007fd9a403e14a in start_thread (arg=<optimized out>) at pthread_create.c:479
#8  0x00007fd9a67ecdb3 in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95

However, the part of the backtrace below virEventThreadWorker differs from the reporter's backtrace.
Martin, could you help check whether comment 19 is the same bug or a new one? BTW, versions:
libvirt-7.0.0-8.module+el8.4.0+10233+8b7fd9eb.x86_64
glib2-2.56.4-9.el8.x86_64
Exception approved in review meeting on 12 Mar 2021.
(In reply to Han Han from comment #20)
That certainly looks like the one we needed, good job! Also, good luck catching the output =D
(In reply to Jean-Louis Dupond from comment #13)
Sure, however I cannot make a new build right now: three consecutive builds failed for various reasons, so something fishy is going on. If I get to it before someone else does, I will let you know.
It looks like the performance impact of the workaround is pretty significant, and our longest tests that use the event loop can time out on slower build machines. Until this is fixed in glib, though, we have to keep the workaround in place, so I'll see what can be done to get this in soon.
(In reply to Martin Kletzander from comment #24)
> [...] Until this is fixed in glib we have to have this workaround in place
> though, so I'll see what can be done to get this in soon.

Could we ask the glib2 maintainers to backport https://gitlab.gnome.org/GNOME/glib/-/commit/b06c48de7554607ff3fb58d6c0510cfa5088e909 ?
(In reply to Han Han from comment #25)
We could, although it would not help us here: the workaround needs to be in place anyway, and we already have a complete fix "tested" in a scratch build. We are just waiting for the backport. Of course, having the glib fix backported as well would be nice overall.
We were waiting for one last patch that should hopefully finish the fix:

commit 695bdb3841ca20e905680a7eec8ca040ec28e459
Author: Daniel P. Berrangé <berrange>
Date:   Tue Mar 16 16:26:06 2021 +0000

    src: ensure GSource background unref happens in correct event loop
*** Bug 1939874 has been marked as a duplicate of this bug. ***
Ran the script from comment 19 for half an hour with these versions:
libvirt-7.0.0-10.module+el8.4.0+10417+37f6984d.x86_64
qemu-kvm-5.2.0-14.module+el8.4.0+10425+ad586fa5.x86_64
The crash never reproduced.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (virt:av bug fix and enhancement update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2021:2098