Bug 1895359 - libvirtd fails to release lock on resources and crashes
Summary: libvirtd fails to release lock on resources and crashes
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Linux Advanced Virtualization
Classification: Red Hat
Component: libvirt
Version: 8.3
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: unspecified
Target Milestone: rc
Target Release: 8.4
Assignee: Michal Privoznik
QA Contact: yafu
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2020-11-06 13:22 UTC by Katerina Koukiou
Modified: 2021-05-25 06:45 UTC (History)
CC List: 7 users

Fixed In Version: libvirt-7.0.0-3.el8
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2021-05-25 06:45:10 UTC
Type: Bug
Target Upstream Version: 6.10.0
Embargoed:


Attachments
coredump file used for the stacktrace (6.70 MB, application/x-lz4)
2020-11-06 13:22 UTC, Katerina Koukiou
Output for 't a a bt' in gdb (14.83 KB, text/plain)
2020-11-06 20:43 UTC, Katerina Koukiou

Description Katerina Koukiou 2020-11-06 13:22:29 UTC
Created attachment 1727126 [details]
coredump file used for the stacktrace

Description of problem:
In Cockpit tests we often see libvirtd crash with the following stack trace. It's not 100% reproducible and I have only seen it in CI, so unfortunately I don't have a reproducer. However, we have core dump files (also attached), and with debug symbols installed, the backtrace from gdb is pasted below.
 

Version-Release number of selected component (if applicable):

libvirt-daemon-6.0.0-28.module+el8.3.0+7827+5e65edd7.x86_64


How reproducible:

Sometimes

(gdb) bt full
#0  __GI___pthread_mutex_lock (mutex=mutex@entry=0x0) at ../nptl/pthread_mutex_lock.c:67
        type = <optimized out>
        __PRETTY_FUNCTION__ = "__pthread_mutex_lock"
        id = <optimized out>
#1  0x00007fa783819d69 in virMutexLock (m=m@entry=0x0) at ../../src/util/virthread.c:79
No locals.
#2  0x00007fa741e963d2 in qemuDriverLock (driver=0x0) at ../../src/qemu/qemu_conf.c:1177
No locals.
#3  virQEMUDriverGetConfig (driver=0x0) at ../../src/qemu/qemu_conf.c:1177
        conf = <optimized out>
#4  0x00007fa741edb620 in qemuStateStop () at ../../src/qemu/qemu_driver.c:1070
        ret = -1
        conn = 0x0
        numDomains = 0
        i = <optimized out>
        state = 32679
        domains = 0x0
        flags = 0x0
        cfg = <optimized out>
#5  0x00007fa7839ae9bf in virStateStop () at ../../src/libvirt.c:713
        i = 7
        ret = 0
#6  0x000055a7dd300db1 in daemonStopWorker (opaque=0x55a7ddf819f0) at ../../src/remote/remote_daemon.c:759
        __x = <optimized out>
        dmn = 0x55a7ddf819f0
        __func__ = "daemonStopWorker"
#7  0x00007fa783819c0a in virThreadHelper (data=<optimized out>) at ../../src/util/virthread.c:196
        args = 0x0
        local = {func = 0x55a7dd300d70 <daemonStopWorker>, funcName = 0x55a7dd32bf41 "daemonStopWorker", worker = false, opaque = 0x55a7ddf819f0}
#8  0x00007fa77fba914a in start_thread (arg=<optimized out>) at pthread_create.c:479
        ret = <optimized out>
        pd = <optimized out>
        unwind_buf = {cancel_jmp_buf = {{jmp_buf = {140356264179456, -2127972736968815236, 140736926283710, 140736926283711, 140736926283888, 140356264176384, 2105879883266788732, 2105735624623394172}, mask_was_saved = 0}}, priv = {
            pad = {0x0, 0x0, 0x0, 0x0}, data = {prev = 0x0, cleanup = 0x0, canceltype = 0}}}
        not_first_call = <optimized out>
#9  0x00007fa77f8daf23 in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95
No locals.


I also reported the exact same crash on Debian buster [1] a few days ago; it seems this has existed at least since version 5.0.0.

[1] https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=973758

Comment 1 Jaroslav Suchanek 2020-11-06 14:07:57 UTC
Hi Katerina,

do you have workers with newer libvirt too? Some recent Fedoras, for example f32 or f33? Do you observe crashes while stopping the daemon there as well?

Comment 2 Katerina Koukiou 2020-11-06 14:22:22 UTC
@Jaroslav actually it does happen on fedora-33 and fedora-32 as well. I just found an upstream Fedora report describing the exact same crash, see https://bugzilla.redhat.com/show_bug.cgi?id=1828207

I guess we should keep this one open separately for RHEL, though.

Comment 3 Katerina Koukiou 2020-11-06 20:43:46 UTC
Created attachment 1727277 [details]
Output for 't a a bt' in gdb

Output from gdb as requested by mprivoznik on IRC.

Comment 4 Jaroslav Suchanek 2020-11-12 16:30:22 UTC
Michal, can you please triage this bz? How serious/difficult would it be to fix? Is there any interim workaround, so that the Cockpit team is not blocked by it? I can see one from Daniel in https://bugzilla.redhat.com/show_bug.cgi?id=1828207#c8 which might be sufficient for now.

Thanks.

Comment 5 Michal Privoznik 2020-11-12 17:03:59 UTC
Yes, I think this is the same bug. What's essentially happening is that a thread is spawned when we see the PrepareForShutdown D-Bus signal. Before that thread gets to run, libvirtd finishes and calls qemuStateCleanup(), which frees the QEMU driver. This can be seen in the stack trace where the main thread is executing _dl_fini(), which calls the destructors of all libraries. Only after that does the thread get to run, and it then tries to access the (now freed) QEMU driver.

The naive fix is to make the thread check whether the QEMU driver is NULL and do nothing if it is. But that is just papering over the real issue.

Comment 6 Michal Privoznik 2020-11-12 17:13:17 UTC
I've resurrected the old discussion about shutdown:

https://www.redhat.com/archives/libvir-list/2020-November/msg00632.html

Comment 7 Michal Privoznik 2020-11-12 18:46:07 UTC
Patch proposed upstream:

https://www.redhat.com/archives/libvir-list/2020-November/msg00639.html

Comment 9 Michal Privoznik 2020-11-24 16:56:09 UTC
Fix pushed upstream:

commit a42b46dd7db2cafe77010bfae55f2e4631a26844
Author:     Michal Prívozník <mprivozn>
AuthorDate: Fri Nov 13 10:56:59 2020 +0100
Commit:     Michal Prívozník <mprivozn>
CommitDate: Tue Nov 24 17:52:54 2020 +0100

    virnetdaemon: Wait for "daemon-stop" thread to finish before quitting
    
    When the host is shutting down then we get PrepareForShutdown
    signal on DBus to which we react by creating a thread which
    runs virStateStop() and thus qemuStateStop(). But if scheduling
    the thread is delayed just a bit it may happen that we receive
    SIGTERM (sent by systemd) to which we respond by quitting our
    event loop and cleaning up everything (including drivers). And
    only after that the thread gets to run only to find qemu_driver
    being NULL.
    
    What we can do is to delay exiting event loop and join the thread
    that's executing virStateStop(). If the join doesn't happen in
    given timeout (currently 30 seconds) then libvirtd shuts down
    forcefully anyways (see virNetDaemonRun()).
    
    Resolves: https://bugzilla.redhat.com/show_bug.cgi?id=1895359
    Resolves: https://bugzilla.redhat.com/show_bug.cgi?id=1739564
    
    Signed-off-by: Michal Privoznik <mprivozn>
    Reviewed-by: Ján Tomko <jtomko>

v6.9.0-373-ga42b46dd7d
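
To illustrate the general shape of that fix (do not tear the drivers down until the "daemon-stop" thread has finished, or a timeout expires), here is a minimal, self-contained pthread sketch. The names (stopWorker, waitForStopWorker) and the plain condition-variable/timeout machinery are my own illustrative assumptions, not libvirt's actual virNetDaemon implementation; the real change lives in virNetDaemonRun() as the commit message says.

#include <errno.h>
#include <pthread.h>
#include <stdbool.h>
#include <stdio.h>
#include <time.h>
#include <unistd.h>

static pthread_mutex_t stop_lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  stop_cond = PTHREAD_COND_INITIALIZER;
static bool stop_finished;

/* Stand-in for the "daemon-stop" thread that runs virStateStop(). */
static void *stopWorker(void *opaque)
{
    (void)opaque;
    sleep(1);                         /* simulate suspending running domains */

    pthread_mutex_lock(&stop_lock);
    stop_finished = true;
    pthread_cond_signal(&stop_cond);
    pthread_mutex_unlock(&stop_lock);
    return NULL;
}

/* Wait up to timeout_sec for the stop worker; true means it finished. */
static bool waitForStopWorker(unsigned int timeout_sec)
{
    struct timespec deadline;
    bool finished;

    clock_gettime(CLOCK_REALTIME, &deadline);
    deadline.tv_sec += timeout_sec;

    pthread_mutex_lock(&stop_lock);
    while (!stop_finished &&
           pthread_cond_timedwait(&stop_cond, &stop_lock, &deadline) != ETIMEDOUT)
        ;
    finished = stop_finished;
    pthread_mutex_unlock(&stop_lock);
    return finished;
}

int main(void)
{
    pthread_t thr;
    bool finished;

    /* PrepareForShutdown arrives: spawn the stop worker. */
    pthread_create(&thr, NULL, stopWorker, NULL);

    /* SIGTERM arrives: instead of cleaning up the drivers right away,
     * give the stop worker a bounded chance to finish first. */
    finished = waitForStopWorker(30);
    if (finished)
        pthread_join(thr, NULL);      /* worker done, reap it */
    else
        pthread_detach(thr);          /* timed out, shut down forcefully */

    /* Only now is it safe to run the cleanup that frees the drivers. */
    printf("stop worker finished: %s\n", finished ? "yes" : "no (timed out)");
    return 0;
}

Compile with -pthread; the 30-second argument mirrors the timeout mentioned in the commit message.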

Comment 14 yafu 2021-01-22 06:52:11 UTC
Hi Michal,

I hit another crash when trying to verify the bug with libvirt-daemon-7.0.0-2.module+el8.4.0+9520+ef609c5f.x86_64.
Would you help check whether it is caused by the patches for this bug, please? Thanks a lot.

Reproduce steps:
1.# stress --cpu 16 --io 8 --vm 2 --vm-bytes 128M --timeout 100s &
(This stress load works well on my host; please adjust the options according to your host environment)

2.# kill -SIGUSR1 $(pgrep libvirtd); sleep 1; kill -SIGINT $(pgrep libvirtd)

3.# abrt-cli list
id cb95788b1719b72a6f5a983d1f20a628be22efa4
reason:         qemuStateShutdownPrepare(): libvirtd killed by SIGSEGV
time:           Thu 21 Jan 2021 04:29:51 EST
cmdline:        /usr/sbin/libvirtd --timeout 120
package:        libvirt-daemon-7.0.0-2.module+el8.4.0+9520+ef609c5f
uid:            0 (root)
count:          3
Directory:      /var/spool/abrt/ccpp-2021-01-21-04:29:51-23902
Run 'abrt-cli report /var/spool/abrt/ccpp-2021-01-21-04:29:51-23902' for creating a case in Red Hat Customer Portal

4.Backtrace:
(gdb) t a a bt

Thread 15 (Thread 0x7fc7ad7c7700 (LWP 23909)):
#0  0x00007fc7b5a2462c in __lll_lock_wait_private () at ../sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:63
#1  0x00007fc7b5a1a4bf in __deallocate_stack (pd=0x7fc7b5c32388 <stack_cache_lock>, pd@entry=0x7fc7ad7c7700) at allocatestack.c:791
#2  0x00007fc7b5a1b03d in __free_tcb (pd=pd@entry=0x7fc7ad7c7700) at pthread_create.c:368
#3  0x00007fc7b5a1b3dc in start_thread (arg=<optimized out>) at pthread_create.c:575
#4  0x00007fc7b81c9db3 in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95

Thread 14 (Thread 0x7fc767fff700 (LWP 23915)):
#0  0x00007fc7b81c489b in madvise () at ../sysdeps/unix/syscall-template.S:78
#1  0x00007fc7b5a1b404 in advise_stack_range (guardsize=<optimized out>, pd=140494420047616, size=<optimized out>, mem=0x7fc7677ff000)
    at allocatestack.c:392
#2  0x00007fc7b5a1b404 in start_thread (arg=<optimized out>) at pthread_create.c:569
#3  0x00007fc7b81c9db3 in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95

Thread 13 (Thread 0x7fc78ffff700 (LWP 23911)):
#0  0x00007fc7b81c489b in madvise () at ../sysdeps/unix/syscall-template.S:78
#1  0x00007fc7b5a1b404 in advise_stack_range (guardsize=<optimized out>, pd=140495091136256, size=<optimized out>, mem=0x7fc78f7ff000)
    at allocatestack.c:392
#2  0x00007fc7b5a1b404 in start_thread (arg=<optimized out>) at pthread_create.c:569
#3  0x00007fc7b81c9db3 in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95

Thread 12 (Thread 0x7fc7acfc6700 (LWP 23910)):
#0  0x00007fc7b81c489b in madvise () at ../sysdeps/unix/syscall-template.S:78
#1  0x00007fc7b5a1b404 in advise_stack_range (guardsize=<optimized out>, pd=140495577442048, size=<optimized out>, mem=0x7fc7ac7c6000)
    at allocatestack.c:392
#2  0x00007fc7b5a1b404 in start_thread (arg=<optimized out>) at pthread_create.c:569
#3  0x00007fc7b81c9db3 in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95

Thread 11 (Thread 0x7fc7aefca700 (LWP 23906)):
#0  0x00007fc7b5a2462c in __lll_lock_wait_private () at ../sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:63
#1  0x00007fc7b5a1a4bf in __deallocate_stack (pd=0x7fc7b5c32388 <stack_cache_lock>, pd@entry=0x7fc7aefca700) at allocatestack.c:791
#2  0x00007fc7b5a1b03d in __free_tcb (pd=pd@entry=0x7fc7aefca700) at pthread_create.c:368
#3  0x00007fc7b5a1b3dc in start_thread (arg=<optimized out>) at pthread_create.c:575
#4  0x00007fc7b81c9db3 in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95
--Type <RET> for more, q to quit, c to continue without paging--

Thread 10 (Thread 0x7fc76e851700 (LWP 23917)):
#0  0x00007fc7b81c489b in madvise () at ../sysdeps/unix/syscall-template.S:78
#1  0x00007fc7b5a1b404 in advise_stack_range (guardsize=<optimized out>, pd=140494529435392, size=<optimized out>, mem=0x7fc76e051000)
    at allocatestack.c:392
#2  0x00007fc7b5a1b404 in start_thread (arg=<optimized out>) at pthread_create.c:569
#3  0x00007fc7b81c9db3 in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95

Thread 9 (LWP 23908):
#0  0x00007fc7b5a1b206 in start_thread (arg=<optimized out>) at ../sysdeps/unix/sysv/linux/exit-thread.h:36
#1  0x00007fc7b81c9db3 in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95

Thread 8 (Thread 0x7fc76e050700 (LWP 23937)):
#0  0x00007fc7b9c7f747 in __mmap64 (offset=0, fd=3, flags=2, prot=1, len=45059, addr=0x0) at ../sysdeps/unix/sysv/linux/mmap64.c:52
#1  0x00007fc7b9c7f747 in __mmap64 (addr=0x0, len=45059, prot=1, flags=2, fd=3, offset=0) at ../sysdeps/unix/sysv/linux/mmap64.c:40
#2  0x00007fc7b9c7176d in _dl_sysdep_read_whole_file
    (file=file@entry=0x7fc7b9c858e3 "/etc/ld.so.cache", sizep=sizep@entry=0x7fc7b9e8f108 <cachesize>, prot=prot@entry=1) at dl-misc.c:57
#3  0x00007fc7b9c78b6c in _dl_load_cache_lookup (name=name@entry=0x7fc76e04f810 "libnss_files.so.2") at dl-cache.c:414
#4  0x00007fc7b9c6a785 in _dl_map_object (loader=loader@entry=0x7fc7b9e7d9e0, name=<optimized out>, 
    name@entry=0x7fc76e04f810 "libnss_files.so.2", type=type@entry=2, trace_mode=trace_mode@entry=0, mode=mode@entry=-1879048190, nsid=<optimized out>) at dl-load.c:2138
#5  0x00007fc7b9c74b49 in dl_open_worker (a=a@entry=0x7fc76e04f580) at dl-open.c:526
#6  0x00007fc7b8205214 in __GI__dl_catch_exception
    (exception=exception@entry=0x7fc76e04f560, operate=operate@entry=0x7fc7b9c74aa0 <dl_open_worker>, args=args@entry=0x7fc76e04f580)
    at dl-error-skeleton.c:208
#7  0x00007fc7b9c746b1 in _dl_open
    (file=<optimized out>, mode=-2147483646, caller_dlopen=0x7fc7b81ebc6e <nss_load_library+350>, nsid=-2, argc=3, argv=<optimized out>, env=0x7ffe1a01d268) at dl-open.c:869
#8  0x00007fc7b82046a1 in do_dlopen (ptr=ptr@entry=0x7fc76e04f7d0) at dl-libc.c:96
#9  0x00007fc7b8205214 in __GI__dl_catch_exception
    (exception=exception@entry=0x7fc76e04f750, operate=operate@entry=0x7fc7b8204660 <do_dlopen>, args=args@entry=0x7fc76e04f7d0)
    at dl-error-skeleton.c:208
#10 0x00007fc7b82052d3 in __GI__dl_catch_error
    (objname=objname@entry=0x7fc76e04f7a8, errstring=errstring@entry=0x7fc76e04f7b0, mallocedp=mallocedp@entry=0x7fc76e04f7a7, operate=operate@entry=0x7fc7b8204660 <do_dlopen>, args=args@entry=0x7fc76e04f7d0) at dl-error-skeleton.c:227
--Type <RET> for more, q to quit, c to continue without paging--
#11 0x00007fc7b82047a7 in dlerror_run (operate=operate@entry=0x7fc7b8204660 <do_dlopen>, args=args@entry=0x7fc76e04f7d0) at dl-libc.c:46
#12 0x00007fc7b820483a in __GI___libc_dlopen_mode (name=name@entry=0x7fc76e04f810 "libnss_files.so.2", mode=mode@entry=-2147483646)
    at dl-libc.c:195
#13 0x00007fc7b81ebc6e in nss_load_library (ni=ni@entry=0x7fc754001dc0) at nsswitch.c:359
#14 0x00007fc7b81ec568 in __GI___nss_lookup_function (ni=0x7fc754001dc0, fct_name=<optimized out>, fct_name@entry=0x7fc7b8255131 "getpwuid_r")
    at nsswitch.c:456
#15 0x00007fc7b81ec792 in __GI___nss_next2
    (status=<optimized out>, all_values=<optimized out>, fctp=<optimized out>, fct2_name=<optimized out>, fct_name=<optimized out>, ni=<optimized out>) at nsswitch.c:251
#16 0x00007fc7b81ec792 in __GI___nss_next2
    (ni=ni@entry=0x7fc76e04f948, fct_name=fct_name@entry=0x7fc7b8255131 "getpwuid_r", fct2_name=fct2_name@entry=0x0, fctp=fctp@entry=0x7fc76e04f950, status=status@entry=0, all_values=all_values@entry=0) at nsswitch.c:222
#17 0x00007fc7b81953b6 in __getpwuid_r (uid=<optimized out>, resbuf=0x7fc76e04f9e0, buffer=<optimized out>, buflen=1024, result=<optimized out>)
    at ../nss/getXXbyYY_r.c:385
#18 0x00007fc7b972f3bb in virGetUserEnt () at /lib64/libvirt.so.0
#19 0x00007fc7b973063a in virGetUserName () at /lib64/libvirt.so.0
#20 0x00007fc7b96e1db7 in virIdentityGetSystem () at /lib64/libvirt.so.0
#21 0x0000562958136b23 in daemonRunStateInit ()
#22 0x00007fc7b972995b in virThreadHelper () at /lib64/libvirt.so.0
#23 0x00007fc7b5a1b14a in start_thread (arg=<optimized out>) at pthread_create.c:479
#24 0x00007fc7b81c9db3 in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95

Thread 7 (Thread 0x7fc76f052700 (LWP 23916)):
#0  0x00007fc7b5a2462c in __lll_lock_wait_private () at ../sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:63
#1  0x00007fc7b5a1a4bf in __deallocate_stack (pd=0x7fc7b5c32388 <stack_cache_lock>, pd@entry=0x7fc76f052700) at allocatestack.c:791
#2  0x00007fc7b5a1b03d in __free_tcb (pd=pd@entry=0x7fc76f052700) at pthread_create.c:368
#3  0x00007fc7b5a1b3dc in start_thread (arg=<optimized out>) at pthread_create.c:575
#4  0x00007fc7b81c9db3 in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95

Thread 6 (LWP 23905):
#0  0x00007fc7b5a1b206 in start_thread (arg=<optimized out>) at ../sysdeps/unix/sysv/linux/exit-thread.h:36
#1  0x00007fc7b81c9db3 in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95

Thread 5 (Thread 0x7fc770054700 (LWP 23913)):
#0  0x00007fc7b81c489b in madvise () at ../sysdeps/unix/syscall-template.S:78
--Type <RET> for more, q to quit, c to continue without paging--
#1  0x00007fc7b5a1b404 in advise_stack_range (guardsize=<optimized out>, pd=140494554613504, size=<optimized out>, mem=0x7fc76f854000)
    at allocatestack.c:392
#2  0x00007fc7b5a1b404 in start_thread (arg=<optimized out>) at pthread_create.c:569
#3  0x00007fc7b81c9db3 in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95

Thread 4 (Thread 0x7fc7ae7c9700 (LWP 23907)):
#0  0x00007fc7b5a2462c in __lll_lock_wait_private () at ../sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:63
#1  0x00007fc7b5a1a4bf in __deallocate_stack (pd=0x7fc7b5c32388 <stack_cache_lock>, pd@entry=0x7fc7ae7c9700) at allocatestack.c:791
#2  0x00007fc7b5a1b03d in __free_tcb (pd=pd@entry=0x7fc7ae7c9700) at pthread_create.c:368
#3  0x00007fc7b5a1b3dc in start_thread (arg=<optimized out>) at pthread_create.c:575
#4  0x00007fc7b81c9db3 in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95

Thread 3 (Thread 0x7fc76f853700 (LWP 23914)):
#0  0x00007fc7b81c489b in madvise () at ../sysdeps/unix/syscall-template.S:78
#1  0x00007fc7b5a1b404 in advise_stack_range (guardsize=<optimized out>, pd=140494546220800, size=<optimized out>, mem=0x7fc76f053000)
    at allocatestack.c:392
#2  0x00007fc7b5a1b404 in start_thread (arg=<optimized out>) at pthread_create.c:569
#3  0x00007fc7b81c9db3 in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95

Thread 2 (LWP 23912):
#0  0x00007fc7b81c479b in munmap () at ../sysdeps/unix/syscall-template.S:78
#1  0x00007fc7b5a19f35 in free_stacks (limit=41943040) at allocatestack.c:281
#2  0x00007fc7b5a1a5aa in queue_stack (stack=0x7fc78f7fe700) at allocatestack.c:311
#3  0x00007fc7b5a1a5aa in __deallocate_stack (pd=pd@entry=0x7fc78f7fe700) at allocatestack.c:802
#4  0x00007fc7b5a1b03d in __free_tcb (pd=pd@entry=0x7fc78f7fe700) at pthread_create.c:368
#5  0x00007fc7b5a1b3dc in start_thread (arg=<optimized out>) at pthread_create.c:575
#6  0x00007fc7b81c9db3 in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95

Thread 1 (Thread 0x7fc7b9e62b00 (LWP 23902)):
#0  0x00007fc77010eaaf in qemuStateShutdownPrepare () at /usr/lib64/libvirt/connection-driver/libvirt_driver_qemu.so
#1  0x00007fc7b98d3f20 in virStateShutdownPrepare () at /lib64/libvirt.so.0
#2  0x00007fc7b97ed6be in virNetDaemonRun () at /lib64/libvirt.so.0
#3  0x0000562958135790 in main ()

Comment 15 Michal Privoznik 2021-01-22 09:17:59 UTC
This is a bug, yes. It is caused by slowing the machine down so much that the initialization of the state driver (the QEMU driver in this case) hasn't finished - in fact, it hasn't even started - when SIGUSR1 is delivered (BTW: libvirtd does not handle SIGUSR1, so the default action is taken - termination). Even worse, spawning of the worker threads isn't finished either - pthread is still creating/setting up threads. I believe this is very unlikely to be hit by users - they would have to start the daemon and kill it right away, and it would have to happen at the exact moment when the driver isn't initialized.

Anyway, fixing it should be fairly easy - just check whether the QEMU driver is allocated in qemuStateShutdownPrepare() and qemuStateShutdownWait(). Let me post a patch. Meanwhile, moving this back to ASSIGNED.
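
For illustration, a guard of that shape might look like the following minimal sketch; FakeQemuDriver and fakeQemuStateShutdownPrepare() are made-up stand-ins, not libvirt's actual code.

#include <pthread.h>
#include <stdio.h>

/* Simplified stand-in for the QEMU driver state object. */
typedef struct {
    pthread_mutex_t lock;
    /* ... config, domain list, ... */
} FakeQemuDriver;

/* NULL until driver initialization finishes; NULL again after cleanup. */
static FakeQemuDriver *fake_qemu_driver;

/* Hypothetical shutdown-prepare callback: bail out early when the driver was
 * never initialized (or has already been freed) instead of dereferencing a
 * NULL pointer; the same guard would be duplicated in the wait callback. */
static int fakeQemuStateShutdownPrepare(void)
{
    if (!fake_qemu_driver)
        return 0;                     /* nothing to prepare, avoid the crash */

    pthread_mutex_lock(&fake_qemu_driver->lock);
    /* ... mark running domains for suspend ... */
    pthread_mutex_unlock(&fake_qemu_driver->lock);
    return 0;
}

int main(void)
{
    /* Driver initialization never ran, as in the SIGUSR1 reproducer above. */
    printf("prepare returned %d\n", fakeQemuStateShutdownPrepare());
    return 0;
}

Note that a guard like this only covers the "driver never initialized / already freed" case from comment 14; the original shutdown race was addressed separately by waiting for the stop thread (comment 9).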

Comment 16 Michal Privoznik 2021-01-22 09:46:55 UTC
Patch proposed upstream:

https://www.redhat.com/archives/libvir-list/2021-January/msg00955.html

Comment 18 yafu 2021-02-03 08:39:54 UTC
Reproduced with libvirt-daemon-6.10.0-1.el8.x86_64. 

Reproduce steps:
1.Download libvirt-6.10.0-1.scrmod+el8.4.0+9260+5df81558.src.rpm and install it:
#rpm -ivh libvirt-6.10.0-1.scrmod+el8.4.0+9260+5df81558.src.rpm;

2.Apply the patch from https://www.redhat.com/archives/libvir-list/2020-November/msg00709.html and build the pkgs;

3.Install pkgs built in step 2;

4.Start the libvirtd service and attach gdb to its pid:
#gdb attach `pidof libvirtd`

5.Open another terminal and execute the following cmds:
#kill -SIGUSR1 $(pgrep libvirtd); sleep 1; kill -SIGINT $(pgrep libvirtd)

6.libvirtd sometimes hangs and the backtrace is as follows:
(gdb) t a a bt

Thread 4 (Thread 0x7fb8a688b700 (LWP 567442)):
#0  0x00007fb90427aa31 in __GI___poll (fds=0x7fb89400b560, nfds=1, timeout=-1) at ../sysdeps/unix/sysv/linux/poll.c:29
#1  0x00007fb904fe3ab6 in g_main_context_poll
    (priority=<optimized out>, n_fds=1, fds=0x7fb89400b560, timeout=<optimized out>, context=0x7fb89400a170) at gmain.c:4203
#2  0x00007fb904fe3ab6 in g_main_context_iterate
    (context=context@entry=0x7fb89400a170, block=block@entry=1, dispatch=dispatch@entry=1, self=<optimized out>) at gmain.c:3897
#3  0x00007fb904fe3be0 in g_main_context_iteration (context=0x7fb89400a170, may_block=may_block@entry=1) at gmain.c:3963
#4  0x00007fb904fe3c31 in glib_worker_main (data=<optimized out>) at gmain.c:5772
#5  0x00007fb90500be1a in g_thread_proxy (data=0x7fb89400a400) at gthread.c:784
#6  0x00007fb901ad714a in start_thread (arg=<optimized out>) at pthread_create.c:479
#7  0x00007fb904285db3 in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95

Thread 3 (Thread 0x7fb905f19b00 (LWP 567425)):
#0  0x00007fb904251d98 in __GI___nanosleep (requested_time=0x7ffeaf519c30, remaining=0x7ffeaf519c30) at ../sysdeps/unix/sysv/linux/nanosleep.c:28
#1  0x00007fb904251c9e in __sleep (seconds=0) at ../sysdeps/posix/sleep.c:55
#2  0x0000565055d30bec in main ()

Thread 2 (Thread 0x7fb8a608a700 (LWP 567446)):
#0  0x00007fb90427aa31 in __GI___poll (fds=0x7fb89401ff30, nfds=2, timeout=-1) at ../sysdeps/unix/sysv/linux/poll.c:29
--Type <RET> for more, q to quit, c to continue without paging--
#1  0x00007fb904fe3ab6 in g_main_context_poll
    (priority=<optimized out>, n_fds=2, fds=0x7fb89401ff30, timeout=<optimized out>, context=0x7fb89401dd60) at gmain.c:4203
#2  0x00007fb904fe3ab6 in g_main_context_iterate (context=0x7fb89401dd60, block=block@entry=1, dispatch=dispatch@entry=1, self=<optimized out>)
    at gmain.c:3897
#3  0x00007fb904fe3e72 in g_main_loop_run (loop=0x7fb89401e6d0) at gmain.c:4098
#4  0x00007fb904a7059a in gdbus_shared_thread_func (user_data=0x7fb89401dd30) at gdbusprivate.c:275
#5  0x00007fb90500be1a in g_thread_proxy (data=0x7fb89400a4a0) at gthread.c:784
#6  0x00007fb901ad714a in start_thread (arg=<optimized out>) at pthread_create.c:479
#7  0x00007fb904285db3 in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95

Thread 1 (Thread 0x7fb8a4887700 (LWP 567471)):
#0  0x00007fb901ad9924 in __GI___pthread_mutex_lock (mutex=0x0) at ../nptl/pthread_mutex_lock.c:67
#1  0x00007fb8aefbb362 in qemuDriverLock (driver=0x0) at ../src/qemu/qemu_conf.c:1271
#2  0x00007fb8aefbb362 in virQEMUDriverGetConfig (driver=0x0) at ../src/qemu/qemu_conf.c:1271
#3  0x00007fb8aefd86b7 in qemuStateStop () at ../src/qemu/qemu_driver.c:1049
#4  0x00007fb90598e88f in virStateStop () at /lib64/libvirt.so.0
#5  0x0000565055d329fb in daemonStopWorker ()
#6  0x00007fb9057e508b in virThreadHelper () at /lib64/libvirt.so.0
#7  0x00007fb901ad714a in start_thread (arg=<optimized out>) at pthread_create.c:479
#8  0x00007fb904285db3 in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95

Comment 19 yafu 2021-02-04 06:51:45 UTC
Hi, when I tried to verify the bug with libvirt-7.0.0-3.el8 and modified the source code according to https://www.redhat.com/archives/libvir-list/2020-November/msg00709.html (sleep 10 seconds in daemonStopWorker() and main(), and handle SIGUSR1 in daemonSetupSignals()), another crash sometimes happened.
Would you help check it, please? Thanks.

Reproduce steps:
1.Download libvirt-7.0.0-3.scrmod+el8.4.0+9783+d29cc7db.src.rpm and install it:
#rpm -ivh libvirt-7.0.0-3.scrmod+el8.4.0+9783+d29cc7db.src.rpm

2.Modify src/remote/remote_daemon.c as described in https://www.redhat.com/archives/libvir-list/2020-November/msg00709.html, then rebuild the pkgs;

3.Install the pkgs built in step 2;

4.Start the libvirtd service and attach gdb to its pid:
#gdb attach `pidof libvirtd`

5.Open another terminal and execute the following cmds several times:
#kill -SIGUSR1 $(pgrep libvirtd); sleep 1; kill -SIGINT $(pgrep libvirtd)

6.The libvirtd service sometimes crashes and the backtrace is as follows:
Thread 25 "daemon-stop" received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x7fdb6aefa700 (LWP 646439)]
__GI___pthread_mutex_lock (mutex=0x0) at ../nptl/pthread_mutex_lock.c:67
67	  unsigned int type = PTHREAD_MUTEX_TYPE_ELISION (mutex);
(gdb) t a a bt

Thread 25 (Thread 0x7fdb6aefa700 (LWP 646439)):
#0  0x00007fdbbf63c924 in __GI___pthread_mutex_lock (mutex=0x0) at ../nptl/pthread_mutex_lock.c:67
#1  0x00007fdb6cf9b4d2 in qemuDriverLock (driver=0x0) at ../src/qemu/qemu_conf.c:1271
#2  0x00007fdb6cf9b4d2 in virQEMUDriverGetConfig (driver=0x0) at ../src/qemu/qemu_conf.c:1271
#3  0x00007fdb6cfb8b07 in qemuStateStop () at ../src/qemu/qemu_driver.c:1031
#4  0x00007fdbc34f316f in virStateStop () at ../src/libvirt.c:777
#5  0x0000560a8442fa9b in daemonStopWorker (opaque=0x560a85765030) at ../src/remote/remote_daemon.c:493
#6  0x00007fdbc33489eb in virThreadHelper (data=<optimized out>) at ../src/util/virthread.c:233
#7  0x00007fdbbf63a14a in start_thread (arg=<optimized out>) at pthread_create.c:479
#8  0x00007fdbc1de8db3 in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95

Thread 18 (Thread 0x7fdb68b68700 (LWP 646379)):
#0  0x00007fdbc1ddda31 in __GI___poll (fds=0x7fdb5001ff30, nfds=2, timeout=-1) at ../sysdeps/unix/sysv/linux/poll.c:29
#1  0x00007fdbc2b46ab6 in g_main_context_poll
    (priority=<optimized out>, n_fds=2, fds=0x7fdb5001ff30, timeout=<optimized out>, context=0x7fdb5001dd60) at gmain.c:4203
#2  0x00007fdbc2b46ab6 in g_main_context_iterate (context=0x7fdb5001dd60, block=block@entry=1, dispatch=dispatch@entry=1, self=<optimized out>)
    at gmain.c:3897
#3  0x00007fdbc2b46e72 in g_main_loop_run (loop=0x7fdb5001e6d0) at gmain.c:4098
#4  0x00007fdbc25d359a in gdbus_shared_thread_func (user_data=0x7fdb5001dd30) at gdbusprivate.c:275
--Type <RET> for more, q to quit, c to continue without paging--
#5  0x00007fdbc2b6ee1a in g_thread_proxy (data=0x7fdb5000a4a0) at gthread.c:784
#6  0x00007fdbbf63a14a in start_thread (arg=<optimized out>) at pthread_create.c:479
#7  0x00007fdbc1de8db3 in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95

Thread 17 (Thread 0x7fdb69369700 (LWP 646378)):
#0  0x00007fdbc1ddda31 in __GI___poll (fds=0x7fdb5000b560, nfds=1, timeout=-1) at ../sysdeps/unix/sysv/linux/poll.c:29
#1  0x00007fdbc2b46ab6 in g_main_context_poll
    (priority=<optimized out>, n_fds=1, fds=0x7fdb5000b560, timeout=<optimized out>, context=0x7fdb5000a170) at gmain.c:4203
#2  0x00007fdbc2b46ab6 in g_main_context_iterate
    (context=context@entry=0x7fdb5000a170, block=block@entry=1, dispatch=dispatch@entry=1, self=<optimized out>) at gmain.c:3897
#3  0x00007fdbc2b46be0 in g_main_context_iteration (context=0x7fdb5000a170, may_block=may_block@entry=1) at gmain.c:3963
#4  0x00007fdbc2b46c31 in glib_worker_main (data=<optimized out>) at gmain.c:5772
#5  0x00007fdbc2b6ee1a in g_thread_proxy (data=0x7fdb5000a400) at gthread.c:784
#6  0x00007fdbbf63a14a in start_thread (arg=<optimized out>) at pthread_create.c:479
#7  0x00007fdbc1de8db3 in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95

Thread 1 (Thread 0x7fdbc3a7eb00 (LWP 646361)):
#0  0x00007fdbc1db4d98 in __GI___nanosleep (requested_time=0x7ffd0e3fc5a0, remaining=0x7ffd0e3fc5a0) at ../sysdeps/unix/sysv/linux/nanosleep.c:28
#1  0x00007fdbc1db4c9e in __sleep (seconds=0, seconds@entry=10) at ../sysdeps/posix/sleep.c:55
#2  0x0000560a8442dc6c in main (argc=<optimized out>, argv=<optimized out>) at ../src/remote/remote_daemon.c:1264

Comment 20 Michal Privoznik 2021-02-04 07:38:43 UTC
I think you are missing this commit:

commit 69977ff10560a80bcf5bf93f1a3f819a2d1623ca
Author:     Michal Prívozník <mprivozn>
AuthorDate: Fri Jan 22 10:25:45 2021 +0100
Commit:     Michal Prívozník <mprivozn>
CommitDate: Wed Jan 27 09:39:40 2021 +0100

    qemu: Avoid crash in qemuStateShutdownPrepare() and qemuStateShutdownWait()
    
    If QEMU driver fails to initialize for whatever reason (it can be
    as trivial as a typo on qemu.conf), the control jumps to error
    label in qemuStateInitialize() where qemuStateCleanup() is called
    which frees the driver. But the daemon then asks drivers to
    prepare for shutdown, which in case of QEMU driver is implemented
    in qemuStateShutdownPrepare(). In here, the driver is
    dereferenced but since it was freed earlier, the pointer is NULL
    which leads to instant crash.
    
    Solution is simple - just check if qemu_driver is not NULL. But
    doing so only in qemuStateShutdownPrepare() would push the
    problem down to virStateShutdownWait(), well
    qemuStateShutdownWait(). Therefore, duplicate the trick there
    too.
    
    Resolves: https://bugzilla.redhat.com/show_bug.cgi?id=1895359#c14
    Signed-off-by: Michal Privoznik <mprivozn>
    Reviewed-by: Jiri Denemark <jdenemar>


It resolves the issue you ran into: calling qemuStateStop() when driver is still NULL. Let me backport it.

Comment 21 yafu 2021-02-04 07:50:56 UTC
(In reply to Michal Privoznik from comment #20)
> I think you are missing this commit:
> 
> commit 69977ff10560a80bcf5bf93f1a3f819a2d1623ca
> Author:     Michal Prívozník <mprivozn>
> AuthorDate: Fri Jan 22 10:25:45 2021 +0100
> Commit:     Michal Prívozník <mprivozn>
> CommitDate: Wed Jan 27 09:39:40 2021 +0100
> 
>     qemu: Avoid crash in qemuStateShutdownPrepare() and
> qemuStateShutdownWait()
>     
>     If QEMU driver fails to initialize for whatever reason (it can be
>     as trivial as a typo on qemu.conf), the control jumps to error
>     label in qemuStateInitialize() where qemuStateCleanup() is called
>     which frees the driver. But the daemon then asks drivers to
>     prepare for shutdown, which in case of QEMU driver is implemented
>     in qemuStateShutdownPrepare(). In here, the driver is
>     dereferenced but since it was freed earlier, the pointer is NULL
>     which leads to instant crash.
>     
>     Solution is simple - just check if qemu_driver is not NULL. But
>     doing so only in qemuStateShutdownPrepare() would push the
>     problem down to virStateShutdownWait(), well
>     qemuStateShutdownWait(). Therefore, duplicate the trick there
>     too.
>     
>     Resolves: https://bugzilla.redhat.com/show_bug.cgi?id=1895359#c14
>     Signed-off-by: Michal Privoznik <mprivozn>
>     Reviewed-by: Jiri Denemark <jdenemar>
> 
> 
> It resolves the issue you ran into: calling qemuStateStop() when driver is
> still NULL. Let me backport it.

I checked the source code and it includes this patch.

Comment 22 Michal Privoznik 2021-02-04 08:08:03 UTC
Oh yeah, that patch is backported. So I think what you are facing is a problem with the DO NOT MERGE patch itself. Its intent was to help reproduce the original problem, but as I look into it, it is not obvious that it results in virStateStop() being called twice! Hence the problem you're seeing. I believe that if you don't apply that patch then there is no bug and no crash.

Comment 23 yafu 2021-02-04 09:42:43 UTC
(In reply to Michal Privoznik from comment #22)
> Oh yeah, that patch is backported. So I think what are you facing is the
> problem with that DO NOT MERGE patch itself. It's intent was to help
> reproduce original problem. But as I look into it, it is not obvious that it
> calls virStateStop() twice! Hence the problem you're seeing. I believe that
> if you don't apply the patch then there is no bug, no crash.

Thanks for your clarification.

Yes, no crash happens if that patch is not applied.

Comment 24 yafu 2021-02-04 09:48:48 UTC
Verified the bug with libvirt-7.0.0-3.scrmod+el8.4.0+9783+d29cc7db.src.rpm.

Test steps are the same as in comment 14, and no crash happens.

Comment 26 errata-xmlrpc 2021-05-25 06:45:10 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (virt:av bug fix and enhancement update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2021:2098

