Bug 1931331 - libvirtd crashes in virEventThreadWorker
Summary: libvirtd crashes in virEventThreadWorker
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Linux Advanced Virtualization
Classification: Red Hat
Component: libvirt
Version: 8.3
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: rc
Target Release: 8.4
Assignee: Martin Kletzander
QA Contact: Han Han
URL:
Whiteboard:
Duplicates: 1931929 1939874
Depends On:
Blocks: 1940484 1942010 1985906
 
Reported: 2021-02-22 08:01 UTC by Jean-Louis Dupond
Modified: 2024-10-01 17:32 UTC

Fixed In Version: libvirt-7.0.0-10.el8
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Clones: 1942010
Environment:
Last Closed: 2021-05-25 06:47:27 UTC
Type: Bug
Target Upstream Version:
Embargoed:


Attachments
backtrace for all threads (49.16 KB, text/plain)
2021-02-22 08:08 UTC, Jean-Louis Dupond
Coredump backtrace (67.37 KB, text/plain)
2021-03-12 04:48 UTC, Han Han


Links
System ID Private Priority Status Summary Last Updated
Red Hat Knowledge Base (Solution) 5694061 0 None None None 2021-03-11 13:58:52 UTC

Description Jean-Louis Dupond 2021-02-22 08:01:21 UTC
Description of problem:
Feb 20 11:18:57 ovirt-ovn001 abrt-hook-ccpp[306988]: Process 2814 (libvirtd) of user 0 killed by SIGSEGV - dumping core
Feb 20 11:18:58 ovirt-ovn001 abrt-hook-ccpp[306994]: Can't generate core backtrace: dwfl_getthread_frames failed: No DWARF information found
Feb 20 11:18:58 ovirt-ovn001 abrt-hook-ccpp[306988]: Core backtrace generator exited with error 1


Version-Release number of selected component (if applicable):
libvirt 6.6.0-7.3.el8.x86_64

How reproducible:
Happens from time to time (roughly once a week per node)

Additional info:
(gdb) bt full
#0  0x00007f66500000f0 in ?? ()
No symbol table info available.
#1  0x00007f6677c9e7bd in g_source_unref_internal () from /lib64/libglib-2.0.so.0
No symbol table info available.
#2  0x00007f6677c9e90e in g_source_iter_next () from /lib64/libglib-2.0.so.0
No symbol table info available.
#3  0x00007f6677ca0e23 in g_main_context_prepare () from /lib64/libglib-2.0.so.0
No symbol table info available.
#4  0x00007f6677ca18eb in g_main_context_iterate.isra () from /lib64/libglib-2.0.so.0
No symbol table info available.
#5  0x00007f6677ca1d72 in g_main_loop_run () from /lib64/libglib-2.0.so.0
No symbol table info available.
#6  0x00007f667b03937e in virEventThreadWorker (opaque=0x7f6650269980) at ../../src/util/vireventthread.c:120
        data = 0x7f6650269980
        running = 0x7f656c000b60
#7  0x00007f6677cc9d4a in g_thread_proxy () from /lib64/libglib-2.0.so.0
No symbol table info available.
#8  0x00007f667783814a in start_thread () from /lib64/libpthread.so.0
No symbol table info available.
#9  0x00007f667714df23 in clone () from /lib64/libc.so.6
No symbol table info available.

Comment 1 Jean-Louis Dupond 2021-02-22 08:08:44 UTC
Created attachment 1758571
backtrace for all threads

Comment 2 Martin Kletzander 2021-02-22 10:40:18 UTC
This was supposed to be fixed by https://gitlab.com/libvirt/libvirt/-/commit/0db4743645b7a0611a3c0687f834205c9956f7fc, which landed in 6.7.0 and was back-ported to various 6.6.0 builds, such as 6.6.0-8 (BZ 1894045) and 6.6.0-7.2 (BZ 1915601).  However, since your version should already include that fix, it looks like either it did not fix the issue or there is a different issue here.

Daniel: Any idea whether this is related since you did the original fix?  The backtrace suggests it very possibly is.

Comment 3 Daniel Berrangé 2021-02-22 10:54:28 UTC
(In reply to Martin Kletzander from comment #2)
> This was supposed to be fixed by:
> https://gitlab.com/libvirt/libvirt/-/commit/
> 0db4743645b7a0611a3c0687f834205c9956f7fc from 6.7.0 and back-ported to
> various 6.6.0 versions, like 6.6.0-8 (BZ 1894045) and 6.6.0-7.2 (BZ
> 1915601).  However, since your version should already include that fix it
> looks like it either did not fix the issue or that there is different issue
> there.
> 
> Daniel: Any idea whether this is related since you did the original fix? 
> The backtrace suggests it very possibly is.

Ooooh, it is the same bug, but occurring in a different piece of code. 

The patches we took fixed this problem in the virEventLoop impl, but I forgot that virEventThread is separate. This is a dedicated thread per-VM that handles I/O watches for the QEMU monitor and guest agent.

I suspect that we're unrefing GSource in a different thread and triggering the same race bug.

Comment 4 Martin Kletzander 2021-03-04 09:51:11 UTC
I took a stab at it here:

  https://listman.redhat.com/archives/libvir-list/2021-March/msg00226.html

Comment 5 Martin Kletzander 2021-03-05 13:59:35 UTC
This should be fixed upstream by v7.1.0-96-g2a490ce5a03e:

commit 2a490ce5a03ef6607fe55515ba55d6cfd2016bef
Author: Martin Kletzander <mkletzan>
Date:   Thu Mar 4 10:00:06 2021 +0100

    glib: Use safe glib event workaround in other event loops

Comment 10 Han Han 2021-03-10 09:54:49 UTC
Martin, do you have any idea how to reproduce this issue?

Comment 11 Martin Kletzander 2021-03-10 10:45:36 UTC
(In reply to Han Han from comment #10)
Unfortunately this is very similar to the BZs 1894045 and 1915601.  Even the reporter notices this only occasionally.

Comment 12 Peter Krempa 2021-03-10 12:29:39 UTC
*** Bug 1931929 has been marked as a duplicate of this bug. ***

Comment 13 Jean-Louis Dupond 2021-03-11 08:29:38 UTC
Well, we have multiple nodes, and I must say that across all of them we hit this issue about once per day.
So if we get a build with a patched libvirt, I think we can confirm within a week whether it's fully fixed :)

Comment 19 Han Han 2021-03-12 04:48:34 UTC
Created attachment 1762892
Coredump backtrace

Hello, I just found a way to reproduce the virEventThreadWorker crash:

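#!/bin/bash
# Reproducer: hammer libvirtd with three concurrent virsh query loops.
VM=8.3  # name of the domain under test

# List all domains in a tight loop.
function loop_list(){
        while true; do virsh list --all; done
}

# Query guest information (via the guest agent) in a tight loop.
function loop_guestinfo(){
        while true; do virsh guestinfo "$VM"; done
}

# Query domain statistics in a tight loop.
function loop_domstats(){
        while true; do virsh domstats "$VM"; done
}

# Run all three loops concurrently in the background.
loop_list &
loop_guestinfo &
loop_domstats &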


The backtrace:
(gdb) bt
#0  0x00007fd9a75477d0 in g_source_unref_internal (source=0x7fd938450da0, context=0x7fd92816f1d0, have_lock=1) at gmain.c:2140
#1  0x00007fd9a7547a0e in g_source_iter_next (iter=iter@entry=0x7fd95292fa10, source=source@entry=0x7fd95292fa08) at gmain.c:980
#2  0x00007fd9a7549f23 in g_main_context_prepare (context=context@entry=0x7fd92816f1d0, priority=priority@entry=0x7fd95292fa90) at gmain.c:3452
#3  0x00007fd9a754a9eb in g_main_context_iterate (context=0x7fd92816f1d0, block=block@entry=1, dispatch=dispatch@entry=1, self=<optimized out>) at gmain.c:3882
#4  0x00007fd9a754ae72 in g_main_loop_run (loop=0x7fd928095860) at gmain.c:4098
#5  0x00007fd9a7cf1ede in virEventThreadWorker (opaque=0x7fd92816fd70) at ../src/util/vireventthread.c:124
#6  0x00007fd9a7572e1a in g_thread_proxy (data=0x7fd93800aca0) at gthread.c:784
#7  0x00007fd9a403e14a in start_thread (arg=<optimized out>) at pthread_create.c:479
#8  0x00007fd9a67ecdb3 in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95

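However, the frames called from virEventThreadWorker differ from the reporter's backtrace: this crash is in g_source_unref_internal itself, while the reporter's top frame was an unresolved address called from it.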
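
Note: the reproducer script above loops until interrupted and does not itself report when libvirtd dies. A minimal sketch of a helper that polls for a change in the daemon's PID is given here. The function name watch_for_restart and its arguments are hypothetical, and passing `pidof libvirtd` as the PID command assumes the service manager restarts libvirtd after the SIGSEGV; neither comes from this report.

```shell
# Hypothetical helper (not part of libvirt or of the original reproducer).
# Polls a PID-reporting command until its output changes or the poll
# budget is used up.
#   $1 = command that prints the daemon PID (assumption: "pidof libvirtd")
#   $2 = maximum number of polls
#   $3 = seconds to sleep between polls
watch_for_restart() {
    local pid_cmd max_polls interval initial current i
    pid_cmd=$1
    max_polls=$2
    interval=$3

    # $pid_cmd is left unquoted on purpose so "pidof libvirtd" splits
    # into a command and its argument.
    initial=$($pid_cmd) || true
    i=0
    while [ "$i" -lt "$max_polls" ]; do
        current=$($pid_cmd) || true
        if [ "$current" != "$initial" ]; then
            # PID changed (or disappeared): the daemon died or was restarted.
            echo "changed: ${initial:-none} -> ${current:-none}"
            return 0
        fi
        sleep "$interval"
        i=$((i + 1))
    done
    echo "unchanged: ${initial:-none}"
    return 1
}
```

For example, `watch_for_restart 'pidof libvirtd' 1800 1` would poll once per second for up to half an hour while the three loops run in the background.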

Comment 20 Han Han 2021-03-12 04:51:12 UTC
Martin, could you help check whether comment 19 is the same bug or a new one?
BTW, versions: libvirt-7.0.0-8.module+el8.4.0+10233+8b7fd9eb.x86_64 glib2-2.56.4-9.el8.x86_64

Comment 21 Jeff Nelson 2021-03-12 16:30:17 UTC
Exception approved in review meeting on 12 Mar 2021.

Comment 22 Martin Kletzander 2021-03-13 22:51:14 UTC
(In reply to Han Han from comment #20)
Certainly looks like the one we needed, good job!



Also good luck catching the output =D

Comment 23 Martin Kletzander 2021-03-13 23:35:32 UTC
(In reply to Jean-Louis Dupond from comment #13)
Sure, but I cannot make a new build right now: three consecutive ones failed for various reasons, so there is something fishy going on.  If I get to it before someone else does, I will let you know.

Comment 24 Martin Kletzander 2021-03-16 15:03:34 UTC
It looks like the performance impact is pretty significant and our longest tests that use the event loop can time out on slower build machines.  Until this is fixed in glib we have to have this workaround in place though, so I'll see what can be done to get this in soon.

Comment 25 Han Han 2021-03-17 01:24:29 UTC
(In reply to Martin Kletzander from comment #24)
> It looks like the performance impact is pretty significant and our longest
> tests that use the event loop can time out on slower build machines.  Until
> this is fixed in glib we have to have this workaround in place though, so
> I'll see what can be done to get this in soon.

Could we request glib2 to backport https://gitlab.gnome.org/GNOME/glib/-/commit/b06c48de7554607ff3fb58d6c0510cfa5088e909 ?

Comment 26 Martin Kletzander 2021-03-17 02:08:09 UTC
(In reply to Han Han from comment #25)
We could, although it would not help us.  The workaround needs to be in place anyway and we already have a complete fix "tested" in a scratch build.  Just waiting for the backport.  Of course having the glib fix backported as well would be nice overall.

Comment 28 Martin Kletzander 2021-03-17 13:10:24 UTC
We were waiting for one last patch that should hopefully finish the fix:

commit 695bdb3841ca20e905680a7eec8ca040ec28e459
Author: Daniel P. Berrangé <berrange>
Date:   Tue Mar 16 16:26:06 2021 +0000

    src: ensure GSource background unref happens in correct event loop

Comment 29 Peter Krempa 2021-03-17 14:26:41 UTC
*** Bug 1939874 has been marked as a duplicate of this bug. ***

Comment 37 Han Han 2021-03-23 01:43:32 UTC
Ran the script from comment 19 for half an hour with libvirt-7.0.0-10.module+el8.4.0+10417+37f6984d.x86_64 and
qemu-kvm-5.2.0-14.module+el8.4.0+10425+ad586fa5.x86_64. The crash did not reproduce.
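
Note: to check whether a host already carries at least the build named in "Fixed In Version" (libvirt-7.0.0-10.el8), a rough comparison can be sketched as follows. The helper name version_ge is hypothetical, and it relies on the ordering of `sort -V`, which only approximates RPM's own version comparison, so treat the result as a hint rather than a definitive answer.

```shell
# Hypothetical helper: succeeds (exit 0) if version-release string $1
# sorts at or after $2 under `sort -V` ordering.
# This is an approximation, not RPM's comparison algorithm.
version_ge() {
    # The smaller of the two strings sorts first; $1 >= $2 exactly when
    # that first line is $2.
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

# Example use on an RPM-based host (the package name libvirt-daemon is an
# assumption about the installed package set):
#   installed=$(rpm -q --qf '%{VERSION}-%{RELEASE}' libvirt-daemon)
#   if version_ge "$installed" 7.0.0-10; then
#       echo "fixed build or newer"
#   else
#       echo "older than the fixed build"
#   fi
```

With the versions mentioned in this report, 6.6.0-7.3.el8 and 7.0.0-8 compare as older than 7.0.0-10.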

Comment 43 errata-xmlrpc 2021-05-25 06:47:27 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (virt:av bug fix and enhancement update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2021:2098

