Bug 1836865 - libvirt crashes when stopping daemon after virsh command
Summary: libvirt crashes when stopping daemon after virsh command
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Linux Advanced Virtualization
Classification: Red Hat
Component: libvirt
Version: 8.2
Hardware: All
OS: Unspecified
Priority: unspecified
Severity: medium
Target Milestone: rc
Target Release: 8.5
Assignee: Jonathon Jongsma
QA Contact: yafu
URL:
Whiteboard:
Duplicates: 1933590 1942127
Depends On:
Blocks: 1845468 1916117 1949342
 
Reported: 2020-05-18 11:30 UTC by smitterl
Modified: 2021-11-16 07:52 UTC
CC List: 15 users

Fixed In Version: libvirt-7.3.0-1.el8
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
: 1845468 1949342
Environment:
Last Closed: 2021-11-16 07:49:57 UTC
Type: Bug
Target Upstream Version: 7.2.0
Embargoed:
pm-rhel: mirror+


Attachments
backtraces_s390x (10.80 KB, text/plain), 2020-05-18 14:50 UTC, smitterl
backtraces_x86_64 (10.80 KB, text/plain), 2020-05-18 14:54 UTC, smitterl


Links
Red Hat Bugzilla 1828207: CLOSED, "Libvirtd dumps core when stopped too quickly", last updated 2021-05-25 16:00:35 UTC
Red Hat Bugzilla 1942127: private, last updated 2021-09-07 21:09:02 UTC

Internal Links: 1942127

Description smitterl 2020-05-18 11:30:50 UTC
Description of problem:
segfault in host dmesg when stopping libvirtd after issuing virsh command

Version-Release number of selected component (if applicable):
libvirt-daemon-driver-nodedev-6.0.0-17.module+el8.3.0+6423+e4cb6418.x86_64

How reproducible:
Almost always

Steps to Reproduce:
1. dmesg -C; systemctl start libvirtd; virsh list; systemctl stop libvirtd; dmesg

Actual results:
[   69.860597] nodedev-init[1745]: segfault at 20 ip 00007f56a76dd9f9 sp 00007f565f7fd858 error 4 in libvirt.so.0.6003.0[7f56a75d7000+43b000]
[   69.863838] Code: 48 89 d8 5b c3 0f 1f 40 00 31 db 48 89 d8 5b c3 90 90 eb ec 0f 1f 00 66 2e 0f 1f 84 00 00 00 00 00 f3 0f 1e fa 48 85 ff 74 0c <8b> 07 66 31 c0 3d 00 00 fe ca 74 0b 31 c0 c3 0f 1f 84 00 00 00 00

Expected results:
No dmesg messages

Additional info:
1. On s390x this is more reliably reproducible, with a different error message but the same component:
libvirt-daemon-driver-nodedev-6.0.0-19.module+el8.2.1+6538+c148631f.s390x
[436639.909467] User process fault: interruption code 003b ilc:3 in libvirt_driver_nodedev.so[3ff95700000+11000]
[436639.909483] Failing address: 0000000000000000 TEID: 0000000000000800
[436639.909485] Fault in primary space mode while using user ASCE.
[436639.909487] AS:00000001406781c7 R3:0000000000000024 
[436639.909492] CPU: 7 PID: 259260 Comm: libvirtd Kdump: loaded Not tainted 4.18.0-193.el8.s390x #1
[436639.909494] Hardware name: IBM 2964 N96 400 (LPAR)
[436639.909496] User PSW : 0705000180000000 000003ff9570938c
[436639.909499]            R:0 T:1 IO:1 EX:1 Key:0 M:1 W:0 P:1 AS:0 CC:0 PM:0 RI:0 EA:3
[436639.909501] User GPRS: 0000000000000000 0000000000000000 0000000000000000 0000000000000000
[436639.909503]            000003ff9570da34 0000000000000496 000003ff9570e6b0 000003ff400f4490
[436639.909505]            000003ff38186f30 000003ff00000000 0000000000000000 000003ff38049f50
[436639.909507]            000003ffa42a5f40 000003ff9570e2ce 000003ff9570937c 000003ff5f37e888
[436639.909517] User Code: 000003ff9570937c: c41800004e1e	lgrl	%r1,3ff95712fb8
                           000003ff95709382: e31010000004	lg	%r1,0(%r1)
                          #000003ff95709388: a7390000		lghi	%r3,0
                          >000003ff9570938c: e32010500004	lg	%r2,80(%r1)
                           000003ff95709392: c0e5ffffe067	brasl	%r14,3ff95705460
                           000003ff95709398: a7f4f953		brc	15,3ff9570863e
                           000003ff9570939c: c020000025e5	larl	%r2,3ff9570df66
                           000003ff957093a2: a7f4f95d		brc	15,3ff9570865c
[436639.909574] Last Breaking-Event-Address:
[436639.909579]  [<000003ffa3d693d6>] 0x3ffa3d693d6
2. It doesn't reproduce when sleeping after the virsh command: "dmesg -C; systemctl start libvirtd; virsh list; sleep 1; systemctl stop libvirtd; dmesg"
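
A minimal Python driver for the reproducer above, assuming a systemd-managed libvirtd and matching on the fault strings seen in the dmesg output; this is an illustrative sketch, not part of the original report:

```python
import subprocess
import time

def run(cmd):
    # Run a shell command and return its stdout (needs root for systemctl/dmesg).
    return subprocess.run(cmd, shell=True, capture_output=True, text=True).stdout

for round_no in range(1, 51):
    run('dmesg -C')
    run('systemctl start libvirtd')
    run('virsh list')
    run('systemctl stop libvirtd')
    time.sleep(1)  # give the kernel a moment to log any fault
    messages = run('dmesg')
    if 'segfault' in messages or 'User process fault' in messages:
        print(f'round {round_no}: crash detected')
        print(messages)
        break
else:
    print('no crash observed in 50 rounds')
```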

Comment 2 Peter Krempa 2020-05-18 12:24:29 UTC
Looks like another random crash on the shutdown code path similar to other reports we have:

https://bugzilla.redhat.com/show_bug.cgi?id=1739564
https://bugzilla.redhat.com/show_bug.cgi?id=1832498

Please attach the backtrace of all threads of the crashed process.
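
One possible way to collect such a backtrace on a host where systemd-coredump caught the crash; the paths and the availability of debuginfo packages are assumptions, so adjust as needed:

```python
import subprocess

core_file = '/tmp/core.libvirtd'

# Extract the most recent libvirtd core collected by systemd-coredump.
subprocess.run(['coredumpctl', '-o', core_file, 'dump', 'libvirtd'], check=True)

# Print a backtrace of every thread (requires the libvirt debuginfo packages).
subprocess.run(['gdb', '/usr/sbin/libvirtd', core_file,
                '-batch', '-ex', 'thread apply all bt full'], check=True)
```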

Comment 3 smitterl 2020-05-18 14:50:41 UTC
Created attachment 1689603 [details]
backtraces_s390x

Comment 4 smitterl 2020-05-18 14:54:56 UTC
Created attachment 1689604 [details]
backtraces_x86_64

Comment 5 Jaroslav Suchanek 2020-06-08 10:08:42 UTC
Moving to Advanced Virtualization as it should be targeted there first.

Btw, Daniel Berrange started a discussion about this and similar problems on the list:
https://www.redhat.com/archives/libvir-list/2020-April/msg01328.html

Comment 7 smitterl 2020-09-14 12:02:13 UTC
Couldn't reproduce on current master. For details, see https://bugzilla.redhat.com/show_bug.cgi?id=1845468#c4

Comment 9 smitterl 2021-03-15 15:33:25 UTC
Yan, I'm qa_ack'ing this. I hope that's okay. Please, let me know:

This is something I tested with upstream 6.8.0, see https://bugzilla.redhat.com/show_bug.cgi?id=1845468#c4

Today I additionally verified this with the following RHEL 8.4 AV version, which is in the AV Beta:
libvirt-daemon-7.0.0-8.module+el8.4.0+10233+8b7fd9eb.s390x

The command succeeds with no crash in 50 consecutive rounds (100%):

# for i in $(seq 50); do dmesg -C; systemctl start libvirtd; virsh list; systemctl stop libvirtd; dmesg; sleep 2; done
 Id   Name   State
--------------------

Warning: Stopping libvirtd.service, but it can still be activated by:
  libvirtd-admin.socket
  libvirtd-ro.socket
  libvirtd.socket
...

Therefore, I'm setting TestOnly.

Finally, I think the scope is to stop ignoring the reported error specifically in the automated tests where this occurred [1], and maybe to re-run the above command for verification. Is this okay, Yan? Do you want me to do something else? Please feel free to assign this BZ to me if you are okay with the scoping.

[1]
          - daemon.functional
          - conf_file.sysconfig_libvirtd.libvirtd_config
          - virsh.screenshot.normal_test.acl_test.screen_0.paused_option
          - virsh.snapshot.live.no_halt
          - virsh.managedsave.status_error_no.id_option.no_opt.no_progress

Comment 10 yafu 2021-03-16 03:28:01 UTC
(In reply to smitterl from comment #9)
> Yan, I'm qa_ack'ing this. I hope that's okay. Please, let me know:
> 
> This is something I tested with upstream 6.8.0, see
> https://bugzilla.redhat.com/show_bug.cgi?id=1845468#c4
> 
> Today, I've verified this furthermore with the following RHEL 8.4 AV version
> which is in AV Beta:
> libvirt-daemon-7.0.0-8.module+el8.4.0+10233+8b7fd9eb.s390x
> 
> The command succeeds/no crash for 100% in 50 consecutive rounds:
> 
> # for i in $(seq 50); do dmesg -C; systemctl start libvirtd; virsh list;
> systemctl stop libvirtd; dmesg; sleep 2; done
>  Id   Name   State
> --------------------
> 
> Warning: Stopping libvirtd.service, but it can still be activated by:
>   libvirtd-admin.socket
>   libvirtd-ro.socket
>   libvirtd.socket
> ...
> 
> Therefore, I'm setting TestOnly.
> 
> Finally, I think the scope is to stop ignoring the reported error
> specifically for automated tests where this occurred[1] and maybe re-run the
> above command for verification. Is this okay, Yan? Do you want me to do
> something else? Please, feel free to assign this BZ to me if you are okay
> with the scoping.
> 
> [1]
>           - daemon.functional
>           - conf_file.sysconfig_libvirtd.libvirtd_config
>           - virsh.screenshot.normal_test.acl_test.screen_0.paused_option
>           - virsh.snapshot.live.no_halt
>           - virsh.managedsave.status_error_no.id_option.no_opt.no_progress


Hi Sebastian,

I can still reproduce this bug with:
# systemctl start libvirtd; virsh list; systemctl stop libvirtd
and can reproduce it more easily with:
# systemctl restart libvirtd; virsh list ; systemctl restart libvirtd

But I cannot reproduce the bug on one of my hosts. I think it's the same bug as https://bugzilla.redhat.com/show_bug.cgi?id=1933590.

@Peter,
Would you help confirm whether the two bugs are the same issue, please? Thanks.

Comment 11 Peter Krempa 2021-03-16 11:56:51 UTC
Yes, it seems to be the same problem. The crash happens in udevAddOneDevice.

Comment 12 yalzhang@redhat.com 2021-03-17 01:37:02 UTC
*** Bug 1933590 has been marked as a duplicate of this bug. ***

Comment 13 yafu 2021-03-17 03:33:10 UTC
According to comments 10-11, moving the bug to FailedQA.

Comment 14 Jonathon Jongsma 2021-03-18 19:46:13 UTC
Patch merged upstream:

commit caf23cdc9b628ae3acd89478fbace093b23649b8
Author: Jonathon Jongsma <jjongsma>
Date:   Tue Mar 16 17:27:25 2021 -0500

    nodedev: Don't crash when exiting before init is done
    
    If libvirtd is terminated before the node driver finishes
    initialization, it can crash with a backtrace similar to the following:
    
        Stack trace of thread 1922933:
        #0  0x00007f8515178774 g_hash_table_find (libglib-2.0.so.0)
        #1  0x00007f851593ea98 virHashSearch (libvirt.so.0)
        #2  0x00007f8515a1dd83 virNodeDeviceObjListSearch (libvirt.so.0)
        #3  0x00007f84cceb40a1 udevAddOneDevice (libvirt_driver_nodedev.so)
        #4  0x00007f84cceb5fae nodeStateInitializeEnumerate (libvirt_driver_nodedev.so)
        #5  0x00007f85159840cb virThreadHelper (libvirt.so.0)
        #6  0x00007f8511c7d14a start_thread (libpthread.so.0)
        #7  0x00007f851442bdb3 __clone (libc.so.6)
    
        Stack trace of thread 1922863:
        #0  0x00007f851442651d syscall (libc.so.6)
        #1  0x00007f85159842d4 virThreadSelfID (libvirt.so.0)
        #2  0x00007f851594e240 virLogFormatString (libvirt.so.0)
        #3  0x00007f851596635d vir_object_finalize (libvirt.so.0)
        #4  0x00007f8514efe8e9 g_object_unref (libgobject-2.0.so.0)
        #5  0x00007f85159667f8 virObjectUnref (libvirt.so.0)
        #6  0x00007f851517755f g_hash_table_remove_all_nodes.part.0 (libglib-2.0.so.0)
        #7  0x00007f8515177e62 g_hash_table_unref (libglib-2.0.so.0)
        #8  0x00007f851596637e vir_object_finalize (libvirt.so.0)
        #9  0x00007f8514efe8e9 g_object_unref (libgobject-2.0.so.0)
        #10 0x00007f85159667f8 virObjectUnref (libvirt.so.0)
        #11 0x00007f84cceb2b42 nodeStateCleanup (libvirt_driver_nodedev.so)
        #12 0x00007f8515b37950 virStateCleanup (libvirt.so.0)
        #13 0x00005648085348e8 main (libvirtd)
        #14 0x00007f8514352493 __libc_start_main (libc.so.6)
        #15 0x00005648085350fe _start (libvirtd)
    
    This is because the initial population of the device list is done in a
    separate initialization thread. If we attempt to exit libvirtd before
    this init thread has completed, we'll try to free the device list while
    accessing it from the other thread. In order to guarantee that this
    init thread is not accessing the device list when we're cleaning up the
    nodedev driver, make it joinable and wait for it to finish before
    proceding with the cleanup. This is similar to how we handle the udev
    event handler thread.
    
    The separate initialization thread was added in commit
    9f0ae0b1.
    
    Resolves: https://bugzilla.redhat.com/show_bug.cgi?id=1836865
    
    Signed-off-by: Jonathon Jongsma <jjongsma>
    Reviewed-by: Michal Privoznik <mprivozn>
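
The synchronization pattern the commit describes (a joinable init thread that is joined before the driver state is torn down) can be sketched in Python as an analogy; this is not libvirt's actual C code, and the names below are illustrative:

```python
import threading

class NodeDevDriver:
    """Toy model of the nodedev driver: an init thread populates a shared
    device list while the daemon may already be shutting down."""

    def __init__(self):
        self.devices = {}
        self.lock = threading.Lock()
        # The fix makes the init thread joinable instead of fire-and-forget.
        self.init_thread = threading.Thread(target=self._enumerate_devices)
        self.init_thread.start()

    def _enumerate_devices(self):
        # Stand-in for the udev enumeration done in nodeStateInitializeEnumerate().
        for i in range(10000):
            with self.lock:
                self.devices[f'dev{i}'] = object()

    def cleanup(self):
        # Wait for the init thread before freeing the list it is still filling.
        # Without this join, cleanup() races with _enumerate_devices(); in the C
        # code that race is the use-after-free shown in the backtrace above.
        self.init_thread.join()
        with self.lock:
            self.devices.clear()

driver = NodeDevDriver()
driver.cleanup()   # simulates stopping libvirtd right after it started
```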

Comment 15 Thomas Huth 2021-03-24 06:29:56 UTC
Maybe a duplicate: BZ 1942127

Comment 16 John Ferlan 2021-03-24 13:43:37 UTC
I see this was pushed upstream as commit caf23cdc9b628ae3acd89478fbace093b23649b8

Will this be POSTed downstream? If so, please be sure to set the exception? flag, provide a downstream build, and work with QE to set the ITM properly. If not, the ITR would change to 8.5.0, but it would seem a crash should be fixed.

BTW: I'm clearing the ITM value since it wasn't cleared automagically on FailedQA

Comment 17 Jonathon Jongsma 2021-03-24 19:11:55 UTC
(In reply to John Ferlan from comment #16)
> I see this was pushed upstream as commit
> caf23cdc9b628ae3acd89478fbace093b23649b8
> 
> Will this be POSTed downstream ?  If so, then please be sure to set the
> exception?, provide a downstream build, and work w/ QE to set ITM properly. 
> If not, then ITR would change to 8.5.0, but it would seem a crash should be
> fixed.
> 
> BTW: I'm clearing the ITM value since it wasn't cleared automagically on
> FailedQA

As far as I can tell (I haven't actually gone back and tested older versions; I'm just going by the commit that introduced this separate nodedev init thread), this issue has been around since libvirt 4.0. It only occurs if the daemon is stopped before the initialization of the library completes (in other words, right after the daemon has started). So it is both rarely encountered and fairly low impact, since the crash happens while the daemon is exiting. I'm therefore inclined to simply push it to 8.5.0 rather than try to get an exception to push it into 8.4.

Comment 18 John Ferlan 2021-03-25 12:07:38 UTC
OK, given that, let's move this to RHEL-AV 8.5.0 and let the libvirt rebase pick this up (moved to POST).

Comment 19 Jonathon Jongsma 2021-04-09 17:29:46 UTC
*** Bug 1942127 has been marked as a duplicate of this bug. ***

Comment 20 IBM Bug Proxy 2021-04-09 17:35:49 UTC
------- Comment From Max.Bender1 2021-03-24 09:05 EDT-------
(In reply to comment #10)
> Looks similar to BZ 1836865 ... but to be sure, could you please specify the
> exact versions of libvirt, qemu and kernel that you were using on the system
> where the crash occurred?

Kernel : Linux zt93k9 4.18.0-283.el8.s390x #1 SMP Thu Feb 4 05:52:52 EST 2021 s390x s390x s390x GNU/Linux
libvirt: 7.0.0
qemu-kvm: 5.2.0

Comment 21 IBM Bug Proxy 2021-04-09 17:50:55 UTC
------- Comment From Max.Bender1 2021-04-09 13:43 EDT-------
(In reply to comment #14)
> I'm certain that this is a duplicate of bug 1836865, so I'm going to close
> it as a duplicate. Feel free to re-open if you can reproduce it on an
> upstream release.
> *** This bug has been marked as a duplicate of bug 1836865 ***

Ok, we will retest this on our side with an upgrade release. Thanks!

Comment 24 yafu 2021-06-08 06:59:10 UTC
Verified with libvirt-daemon-7.4.0-1.module+el8.5.0+11218+83343022.x86_64.

Test steps:
1. # for i in {1..100}; do systemctl restart libvirtd; virsh list ; systemctl restart libvirtd ; sleep 30; done

2. Check the coredump log after step 1:
# cat /var/log/messages | grep -i coredump
no output

Comment 25 IBM Bug Proxy 2021-08-18 15:50:53 UTC
------- Comment From Max.Bender1 2021-08-18 11:42 EDT-------
I am still able to reproduce this with the same procedure noted in the original bug description on RHEL 8.5 AV. This leads me to believe that this issue is not related to https://bugzilla.redhat.com/show_bug.cgi?id=1836865, the theorized duplicate. The issue is still present on:

OS: RHEL 8.5
virsh : 7.4.0
qemu-img : 6.0.0
kernel : 4.18.0-325.el8.s390x

We are not restarting libvirtd explicitly anywhere in our test, but maybe it's crashing/unavailable intermittently when 50 operations are run against it?

---

Again the steps to reproduce can be found below

Provision 50 guests and save their XML definitions to some directory, then shut down those guests. This was done with some Python automation...

```
import os

xml_dir_path = '/root/scalebug1/xmls'

all_guests = os.popen('virsh list --all --name').read()

for guest in all_guests.split('\n'):
    print(guest)
    os.system(f"virsh dumpxml {guest} > {xml_dir_path}/{guest}.xml")
```

Then start to provision 100+ guests (just to give you time to reproduce problem), at the same time start running undefine and define operations against the 50 shutdown guests. We utilized virt-install (define & provision) to provision these guests. Let us know if you need some automation to do this!

```
virt-install \
--connect qemu:///system \
--name {{ item }} \
--metadata description="qcow:{{ image_dir }}/{{ item }}/{{ item }}.qcow2" \
--vcpus {{ vcpu }} \
--memory {{ memory_mb }} \
--disk {{ item }}.qcow2{{ cache_mode }}{{ aio_mode }} \
--disk path=seed.iso,device=cdrom,bus=scsi \
--network {{ virt_network_parm }} \
--boot hd \
--noautoconsole
```

Again we utilized some hacky python scripts to do these operations at scale.

"UNDEFINE"
----------
```
import os

xml_dir_path = '/root/scalebug1/xmls'

all_guests = os.popen('virsh list --all --name').read()

for guest in all_guests.split('\n'):
    if 'scalebug1' in guest:
        print(guest)
        os.system(f"virsh undefine {guest}")
```

"DEFINE"
---------
```
import os

xml_dir_path = '/root/scalebug1/xmls'

from os import listdir
from os.path import isfile, join
onlyfiles = [f for f in listdir(xml_dir_path) if isfile(join(xml_dir_path, f))]

print(onlyfiles)

for domain in onlyfiles:
os.system(f"virsh define {xml_dir_path}/{domain}")
```
Make sure to keep an eye on your provisioning; you will start to see libvirt lose track of the UUIDs, showing the error described below:

ERROR    Domain not found: no domain with matching uuid '5e6072ae-e5c3-453f-b7dd-6c8bf3144a0f' (zt93k9-scalebug1-10-20-112-22)
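
For reference, the two undefine/define scripts above can be consolidated into one repeatable churn loop that runs while the virt-install provisioning proceeds in parallel; the paths and naming convention are the ones assumed in those scripts, and this sketch is illustrative rather than the exact automation used:

```python
import os
import subprocess

# Same directory the guest XML definitions were dumped to above.
xml_dir_path = '/root/scalebug1/xmls'

def churn_once():
    """Undefine every dumped guest, then re-define them all from their XML files."""
    xml_files = [f for f in os.listdir(xml_dir_path) if f.endswith('.xml')]
    for xml_file in xml_files:
        guest = xml_file[:-len('.xml')]
        # check=False: an already-undefined guest is not an error for this test.
        subprocess.run(['virsh', 'undefine', guest], check=False)
    for xml_file in xml_files:
        subprocess.run(['virsh', 'define', os.path.join(xml_dir_path, xml_file)], check=False)

if __name__ == '__main__':
    # Repeat the churn while new guests are being provisioned elsewhere.
    for _ in range(20):
        churn_once()
```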

------- Comment From Max.Bender1 2021-08-18 11:44 EDT-------
As soon as I stop running the undefines & redefines, provisioning returns to normal, scoping the issue to one or both of those operations.

Comment 26 smitterl 2021-08-19 15:23:16 UTC
Please note that Max's last comment was meant for the duplicate RH bug https://bugzilla.redhat.com/show_bug.cgi?id=1942127.

Comment 27 Jonathon Jongsma 2021-09-07 21:08:06 UTC
*** Bug 1942127 has been marked as a duplicate of this bug. ***

Comment 28 IBM Bug Proxy 2021-09-07 21:16:30 UTC
------- Comment From tstaudt.com 2021-08-23 10:14 EDT-------
IBM 192126 - RH1942127- RHEL8.4 Nightly[0208] - No domain with matching uuid using "virt-install" during many undefine operations (kvm/libvirt) (-> Red Hat Enterprise Linux Advanced Virtualization)
now points to
Red Hat Bug 1942127 - RHEL8.4 Nightly[0208] - No domain with matching uuid using "virt-install" during many undefine operations (kvm/libvirt) (-> Red Hat Enterprise Linux Advanced Virtualization)
again.

------- Comment From Max.Bender1 2021-08-25 11:27 EDT-------
(In reply to comment #34)
> So, I am trying to reproduce the issue here and haven't managed to yet. But
> I'm also a bit unclear about the actual procedure. So, let me ask a few
> questions to try to clarify things.
> A) What are you actually trying to do here? Is it a real-world scenario, or
> just a somewhat contrived procedure that allows you to reproduce the bug?

This issue was originally found during our scaling tests where we had some guests running through lifecycle operations to validate the behavior while the system was under load. After we completed the test we had a retrospective to determine why some guests didn't come up in time and eventually scoped it as a consequence of the lifecycle guests running undefine & redefine operations.

The procedure noted in the original bug report was just a way we managed to reproduce the issue outside of the scaling test. We obviously wouldn't expect an end-user to run these operations at scale like we are but were able to utilize it to identify an issue.

The use case that we are targeting is a dynamic cloud environment where VMs are provisioned and deprovisioned often to run things like CI/CD tasks, helper jobs, or other non-static Linux configurations. Technologies like Terraform are a potential end-user target where many of these operations might happen at once.

> B) Do you know which command is returning the error message "Domain not
> found: no domain with matching uuid '5e6072ae-e5c3-453f-b7dd-6c8bf3144a0f'"?
> I assume it's one of the the 'undefine' commands?

`virt-install` is the command that is returning this error. Undefines/Redefines all execute successfully. It is a NEW GUEST that experiences this issue, not one of the 50 being redefined/undefined

> C) Above you say "at the same time start running undefine and define
> operations against the 50 shutdown guests". It sounds almost as if you're
> running the undefine and define operations at the same time. But I assume
> you're doing all of the undefines first and then doing all of the defines?

Correct: undefine and wait until completed, then redefine. FYI, I didn't combine those operations in a programmatic loop; I just manually executed them back to back.

> So this is how I interpret your instructions above. Please tell me if I'm
> incorrect:
> - provision 50 guests
> - dump xml for all 50 guests
> - shutdown all 50 guests
> At this point we kick off two parallel paths. In the first path, we
> provision 100 new guests just to keep libvirt busy. In the second path, we
> first undefine all of the original 50 guest domains. After all of those
> domains are undefined, we try to re-define all of those 50 domain from the
> xml files. Is that correct?

Correct. FYI, I observed the UUID error within the first 20 NEW guests being provisioned.

> Oh, I just noticed the following line in the backtrace you attached above.
> Are you certain that you're testing the version of libvirt that you think
> you are testing?
> Reading symbols from /usr/sbin/libvirtd...Reading symbols from
> /usr/lib/debug/usr/sbin/libvirtd-6.3.0-1.module+el8.3.0+6478+69f490bb.s390x.
> debug...done.

The backtrace was from the original bug report against RHEL 8.4, which likely had libvirt 6.3.0. I can re-capture it if needed.

Confirmed that the recent reproduction occurred on:

OS: RHEL 8.5
virsh : 7.4.0
qemu-img : 6.0.0
kernel : 4.18.0-325.el8.s390x

------- Comment From Max.Bender1 2021-08-31 11:22 EDT-------
(In reply to comment #42)
> Have you been able to reproduce on libvirt 7.6 yet? Is it possible to get a
> stack trace for this version?

Still WIP on this item; I have some other bugs that are taking priority. Can you expand on what you would like in regard to a "stack trace"? Do you mean just enabling debug logging for libvirt?

------- Comment From Max.Bender1 2021-09-07 15:07 EDT-------
I am unable to reproduce this on RHEL 8.5 AV 1.1 (libvirt 7.6.0) after a few attempts. Looks like something got fixed?

Do we care about backports to 8.4?

Comment 30 errata-xmlrpc 2021-11-16 07:49:57 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (virt:av bug fix and enhancement update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2021:4684

