Description of problem:
Segfault in host dmesg when stopping libvirtd after issuing a virsh command.

Version-Release number of selected component (if applicable):
libvirt-daemon-driver-nodedev-6.0.0-17.module+el8.3.0+6423+e4cb6418.x86_64

How reproducible:
Almost always

Steps to Reproduce:
1. dmesg -C; systemctl start libvirtd; virsh list; systemctl stop libvirtd; dmesg

Actual results:
[ 69.860597] nodedev-init[1745]: segfault at 20 ip 00007f56a76dd9f9 sp 00007f565f7fd858 error 4 in libvirt.so.0.6003.0[7f56a75d7000+43b000]
[ 69.863838] Code: 48 89 d8 5b c3 0f 1f 40 00 31 db 48 89 d8 5b c3 90 90 eb ec 0f 1f 00 66 2e 0f 1f 84 00 00 00 00 00 f3 0f 1e fa 48 85 ff 74 0c <8b> 07 66 31 c0 3d 00 00 fe ca 74 0b 31 c0 c3 0f 1f 84 00 00 00 00

Expected results:
No dmesg messages

Additional info:
1. On s390x it is more reliably reproducible, with a different error message but the same component: libvirt-daemon-driver-nodedev-6.0.0-19.module+el8.2.1+6538+c148631f.s390x

[436639.909467] User process fault: interruption code 003b ilc:3 in libvirt_driver_nodedev.so[3ff95700000+11000]
[436639.909483] Failing address: 0000000000000000 TEID: 0000000000000800
[436639.909485] Fault in primary space mode while using user ASCE.
[436639.909487] AS:00000001406781c7 R3:0000000000000024
[436639.909492] CPU: 7 PID: 259260 Comm: libvirtd Kdump: loaded Not tainted 4.18.0-193.el8.s390x #1
[436639.909494] Hardware name: IBM 2964 N96 400 (LPAR)
[436639.909496] User PSW : 0705000180000000 000003ff9570938c
[436639.909499] R:0 T:1 IO:1 EX:1 Key:0 M:1 W:0 P:1 AS:0 CC:0 PM:0 RI:0 EA:3
[436639.909501] User GPRS: 0000000000000000 0000000000000000 0000000000000000 0000000000000000
[436639.909503]            000003ff9570da34 0000000000000496 000003ff9570e6b0 000003ff400f4490
[436639.909505]            000003ff38186f30 000003ff00000000 0000000000000000 000003ff38049f50
[436639.909507]            000003ffa42a5f40 000003ff9570e2ce 000003ff9570937c 000003ff5f37e888
[436639.909517] User Code: 000003ff9570937c: c41800004e1e  lgrl  %r1,3ff95712fb8
                           000003ff95709382: e31010000004  lg    %r1,0(%r1)
                          #000003ff95709388: a7390000      lghi  %r3,0
                          >000003ff9570938c: e32010500004  lg    %r2,80(%r1)
                           000003ff95709392: c0e5ffffe067  brasl %r14,3ff95705460
                           000003ff95709398: a7f4f953      brc   15,3ff9570863e
                           000003ff9570939c: c020000025e5  larl  %r2,3ff9570df66
                           000003ff957093a2: a7f4f95d      brc   15,3ff9570865c
[436639.909574] Last Breaking-Event-Address:
[436639.909579]  [<000003ffa3d693d6>] 0x3ffa3d693d6

2. Doesn't reproduce when sleeping after the virsh command: "dmesg -C; systemctl start libvirtd; virsh list; sleep 1; systemctl stop libvirtd; dmesg"
Looks like another random crash on the shutdown code path, similar to other reports we have:

https://bugzilla.redhat.com/show_bug.cgi?id=1739564
https://bugzilla.redhat.com/show_bug.cgi?id=1832498

Please attach the backtrace of all threads of the crashed process.
Created attachment 1689603 [details] backtraces_s390x
Created attachment 1689604 [details] backtraces_x86_64
Moving to Advanced Virtualization as it should be targeted there first. Btw, Daniel Berrange started a discussion about this and similar problems on the list: https://www.redhat.com/archives/libvir-list/2020-April/msg01328.html
Couldn't reproduce on current master. For details, see https://bugzilla.redhat.com/show_bug.cgi?id=1845468#c4
Yan, I'm qa_ack'ing this. I hope that's okay. Please let me know:

This is something I tested with upstream 6.8.0, see https://bugzilla.redhat.com/show_bug.cgi?id=1845468#c4

Today, I've further verified this with the following RHEL 8.4 AV version, which is in AV Beta:
libvirt-daemon-7.0.0-8.module+el8.4.0+10233+8b7fd9eb.s390x

The command succeeds with no crash in 100% of 50 consecutive rounds:

# for i in $(seq 50); do dmesg -C; systemctl start libvirtd; virsh list; systemctl stop libvirtd; dmesg; sleep 2; done
 Id   Name   State
--------------------

Warning: Stopping libvirtd.service, but it can still be activated by:
  libvirtd-admin.socket
  libvirtd-ro.socket
  libvirtd.socket
...

Therefore, I'm setting TestOnly.

Finally, I think the scope is to stop ignoring the reported error specifically for the automated tests where this occurred[1], and maybe re-run the above command for verification. Is this okay, Yan? Do you want me to do something else? Please feel free to assign this BZ to me if you are okay with the scoping.

[1]
- daemon.functional
- conf_file.sysconfig_libvirtd.libvirtd_config
- virsh.screenshot.normal_test.acl_test.screen_0.paused_option
- virsh.snapshot.live.no_halt
- virsh.managedsave.status_error_no.id_option.no_opt.no_progress
(In reply to smitterl from comment #9)
> Yan, I'm qa_ack'ing this. I hope that's okay. Please, let me know:
>
> This is something I tested with upstream 6.8.0, see
> https://bugzilla.redhat.com/show_bug.cgi?id=1845468#c4
>
> Today, I've verified this furthermore with the following RHEL 8.4 AV version
> which is in AV Beta:
> libvirt-daemon-7.0.0-8.module+el8.4.0+10233+8b7fd9eb.s390x
>
> The command succeeds/no crash for 100% in 50 consecutive rounds:
>
> # for i in $(seq 50); do dmesg -C; systemctl start libvirtd; virsh list;
> systemctl stop libvirtd; dmesg; sleep 2; done
>  Id   Name   State
> --------------------
>
> Warning: Stopping libvirtd.service, but it can still be activated by:
>   libvirtd-admin.socket
>   libvirtd-ro.socket
>   libvirtd.socket
> ...
>
> Therefore, I'm setting TestOnly.
>
> Finally, I think the scope is to stop ignoring the reported error
> specifically for automated tests where this occurred[1] and maybe re-run the
> above command for verification. Is this okay, Yan? Do you want me to do
> something else? Please, feel free to assign this BZ to me if you are okay
> with the scoping.
>
> [1]
> - daemon.functional
> - conf_file.sysconfig_libvirtd.libvirtd_config
> - virsh.screenshot.normal_test.acl_test.screen_0.paused_option
> - virsh.snapshot.live.no_halt
> - virsh.managedsave.status_error_no.id_option.no_opt.no_progress

Hi Sebastian,

I can still reproduce this bug with the following step:
# systemctl start libvirtd; virsh list; systemctl stop libvirtd

and can reproduce it more easily with:
# systemctl restart libvirtd; virsh list; systemctl restart libvirtd

However, I cannot reproduce the bug on one of my hosts. I also think it's the same bug as https://bugzilla.redhat.com/show_bug.cgi?id=1933590.

@Peter, would you help confirm whether the two bugs are the same issue, please? Thanks.
Yes, it seems to be the same problem. The crash happens in udevAddOneDevice.
*** Bug 1933590 has been marked as a duplicate of this bug. ***
According to comments 10 - 11, moving the bug to FailedQA.
Patch merged upstream:

commit caf23cdc9b628ae3acd89478fbace093b23649b8
Author: Jonathon Jongsma <jjongsma>
Date:   Tue Mar 16 17:27:25 2021 -0500

    nodedev: Don't crash when exiting before init is done

    If libvirtd is terminated before the node driver finishes initialization,
    it can crash with a backtrace similar to the following:

    Stack trace of thread 1922933:
    #0  0x00007f8515178774 g_hash_table_find (libglib-2.0.so.0)
    #1  0x00007f851593ea98 virHashSearch (libvirt.so.0)
    #2  0x00007f8515a1dd83 virNodeDeviceObjListSearch (libvirt.so.0)
    #3  0x00007f84cceb40a1 udevAddOneDevice (libvirt_driver_nodedev.so)
    #4  0x00007f84cceb5fae nodeStateInitializeEnumerate (libvirt_driver_nodedev.so)
    #5  0x00007f85159840cb virThreadHelper (libvirt.so.0)
    #6  0x00007f8511c7d14a start_thread (libpthread.so.0)
    #7  0x00007f851442bdb3 __clone (libc.so.6)

    Stack trace of thread 1922863:
    #0  0x00007f851442651d syscall (libc.so.6)
    #1  0x00007f85159842d4 virThreadSelfID (libvirt.so.0)
    #2  0x00007f851594e240 virLogFormatString (libvirt.so.0)
    #3  0x00007f851596635d vir_object_finalize (libvirt.so.0)
    #4  0x00007f8514efe8e9 g_object_unref (libgobject-2.0.so.0)
    #5  0x00007f85159667f8 virObjectUnref (libvirt.so.0)
    #6  0x00007f851517755f g_hash_table_remove_all_nodes.part.0 (libglib-2.0.so.0)
    #7  0x00007f8515177e62 g_hash_table_unref (libglib-2.0.so.0)
    #8  0x00007f851596637e vir_object_finalize (libvirt.so.0)
    #9  0x00007f8514efe8e9 g_object_unref (libgobject-2.0.so.0)
    #10 0x00007f85159667f8 virObjectUnref (libvirt.so.0)
    #11 0x00007f84cceb2b42 nodeStateCleanup (libvirt_driver_nodedev.so)
    #12 0x00007f8515b37950 virStateCleanup (libvirt.so.0)
    #13 0x00005648085348e8 main (libvirtd)
    #14 0x00007f8514352493 __libc_start_main (libc.so.6)
    #15 0x00005648085350fe _start (libvirtd)

    This is because the initial population of the device list is done in a
    separate initialization thread. If we attempt to exit libvirtd before this
    init thread has completed, we'll try to free the device list while
    accessing it from the other thread.

    In order to guarantee that this init thread is not accessing the device
    list when we're cleaning up the nodedev driver, make it joinable and wait
    for it to finish before proceeding with the cleanup. This is similar to
    how we handle the udev event handler thread.

    The separate initialization thread was added in commit 9f0ae0b1.

    Resolves: https://bugzilla.redhat.com/show_bug.cgi?id=1836865
    Signed-off-by: Jonathon Jongsma <jjongsma>
    Reviewed-by: Michal Privoznik <mprivozn>
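For illustration only: the actual fix is C code in the libvirt nodedev driver, but the pattern the commit message describes (make the asynchronous enumeration thread joinable and join it before tearing down the shared device list) can be sketched in Python. All names below are invented for the sake of the example:

```
# Minimal analogy of the fix, not libvirt code: join the enumeration thread
# before destroying the shared device list it populates.
import threading

class NodeDevDriver:
    def __init__(self):
        self.devices = {}  # shared device list, populated asynchronously
        self._init_thread = threading.Thread(target=self._enumerate)
        self._init_thread.start()

    def _enumerate(self):
        # Stands in for the loop that calls udevAddOneDevice() per device.
        for i in range(1000):
            self.devices[f"dev{i}"] = {"name": f"dev{i}"}

    def cleanup(self):
        # The fix: wait for the init thread to finish before freeing shared
        # state. Without this join, cleanup could destroy self.devices while
        # _enumerate() is still inserting into it.
        self._init_thread.join()
        self.devices.clear()

driver = NodeDevDriver()
driver.cleanup()   # safe even if called immediately after startup
```

In the C driver, tearing down the device list while the init thread is still using it produces the use-after-free seen in the backtraces above; the join-before-cleanup ordering removes that window.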
Maybe a duplicate: BZ 1942127
I see this was pushed upstream as commit caf23cdc9b628ae3acd89478fbace093b23649b8.

Will this be POSTed downstream? If so, then please be sure to set the exception? flag, provide a downstream build, and work w/ QE to set the ITM properly. If not, then the ITR would change to 8.5.0, but it would seem a crash should be fixed.

BTW: I'm clearing the ITM value since it wasn't cleared automagically on FailedQA.
(In reply to John Ferlan from comment #16)
> I see this was pushed upstream as commit
> caf23cdc9b628ae3acd89478fbace093b23649b8
>
> Will this be POSTed downstream ? If so, then please be sure to set the
> exception?, provide a downstream build, and work w/ QE to set ITM properly.
> If not, then ITR would change to 8.5.0, but it would seem a crash should be
> fixed.
>
> BTW: I'm clearing the ITM value since it wasn't cleared automagically on
> FailedQA

As far as I can tell (I haven't actually gone back and tested older versions, just going by the commit that introduced this separate nodedev init thread), this issue has been around since libvirt 4.0. It only occurs if the daemon is stopped before the initialization of the library completes (in other words, right after the daemon has started). So it is both rarely encountered and has a fairly low impact, since the crash happens in the process of exiting the daemon.

So I'm inclined to simply push it to 8.5.0 rather than try to get an exception to push it into 8.4.
OK, given that, let's move this to RHEL-AV 8.5.0 and let the libvirt rebase pick this up (moved to POST).
*** Bug 1942127 has been marked as a duplicate of this bug. ***
------- Comment From Max.Bender1 2021-03-24 09:05 EDT-------

(In reply to comment #10)
> Looks similar to BZ 1836865 ... but to be sure, could you please specify the
> exact versions of libvirt, qemu and kernel that you were using on the system
> where the crash occurred?

Kernel:   Linux zt93k9 4.18.0-283.el8.s390x #1 SMP Thu Feb 4 05:52:52 EST 2021 s390x s390x s390x GNU/Linux
libvirt:  7.0.0
qemu-kvm: 5.2.0
------- Comment From Max.Bender1 2021-04-09 13:43 EDT-------

(In reply to comment #14)
> I'm certain that this is a duplicate of bug 1836865, so I'm going to close
> it as a duplicate. Feel free to re-open if you can reproduce it on an
> upstream release.
>
> *** This bug has been marked as a duplicate of bug 1836865 ***

OK, we will retest this on our side with an upgraded release. Thanks!
Verified with libvirt-daemon-7.4.0-1.module+el8.5.0+11218+83343022.x86_64.

Test steps:
1. # for i in {1..100}; do systemctl restart libvirtd; virsh list; systemctl restart libvirtd; sleep 30; done
2. Check the coredump log after step 1:
   # cat /var/log/messages | grep -i coredump
   no output
------- Comment From Max.Bender1 2021-08-18 11:42 EDT-------

I am still able to reproduce this with the same procedure noted in the original bug description on RHEL 8.5 AV. This leads me to believe that this issue is not related to https://bugzilla.redhat.com/show_bug.cgi?id=1836865 - the theorized duplicate.

Issue is still present on:
OS:       RHEL 8.5
virsh:    7.4.0
qemu-img: 6.0.0
kernel:   4.18.0-325.el8.s390x

We are not restarting libvirtd explicitly anywhere in our test, but maybe it's crashing/unavailable intermittently when 50 operations are run against it?

---

Again, the steps to reproduce can be found below.

Provision 50 guests and save their XML definitions to some directory, then shut down those guests. This was done with some Python automation:

```
import os

xml_dir_path = '/root/scalebug1/xmls'
all_guests = os.popen('virsh list --all --name').read()
for guest in all_guests.split('\n'):
    print(guest)
    os.system(f"virsh dumpxml {guest} > {xml_dir_path}/{guest}.xml")
```

Then start to provision 100+ guests (just to give you time to reproduce the problem); at the same time, start running undefine and define operations against the 50 shut-down guests. We utilized virt-install (define & provision) to provision these guests. Let us know if you need some automation to do this!

```
virt-install --connect qemu:///system --name {{ item }} --metadata description="qcow:{{ image_dir }}/{{ item }}/{{ item }}.qcow2" --vcpus {{ vcpu }} --memory {{ memory_mb }} --disk {{ item }}.qcow2{{ cache_mode }}{{ aio_mode }} --disk path=seed.iso,device=cdrom,bus=scsi --network {{ virt_network_parm }} --boot hd --noautoconsole
```

Again, we utilized some hacky Python scripts to do these operations at scale.

"UNDEFINE"
----------
```
import os

xml_dir_path = '/root/scalebug1/xmls'
all_guests = os.popen('virsh list --all --name').read()
for guest in all_guests.split('\n'):
    if 'scalebug1' in guest:
        print(guest)
        os.system(f"virsh undefine {guest}")
```

"DEFINE"
---------
```
import os
from os import listdir
from os.path import isfile, join

xml_dir_path = '/root/scalebug1/xmls'
onlyfiles = [f for f in listdir(xml_dir_path) if isfile(join(xml_dir_path, f))]
print(onlyfiles)
for domain in onlyfiles:
    os.system(f"virsh define {xml_dir_path}/{domain}")
```

Make sure to keep an eye on your provisioning; you will start to see libvirt lose track of the UUIDs, showing the error described below:

ERROR Domain not found: no domain with matching uuid '5e6072ae-e5c3-453f-b7dd-6c8bf3144a0f' (zt93k9-scalebug1-10-20-112-22)

------- Comment From Max.Bender1 2021-08-18 11:44 EDT-------

As soon as I stop running the undefines & redefines, provisioning returns to normal, scoping the issue to one or both of those operations.
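For convenience, the undefine/define churn above could be driven by a single loop so it can be left running while virt-install provisions the new guests in another shell. This is a sketch only, not part of the original automation; the XML directory, the 'scalebug1' name filter, and the round count are assumptions carried over from the scripts above:

```
# Sketch: repeatedly undefine and re-define the shut-down test guests while
# new guests are provisioned elsewhere. Paths and the name filter are assumed.
import os
import subprocess

XML_DIR = '/root/scalebug1/xmls'

def churn_once():
    # Undefine every guest belonging to the test set.
    names = subprocess.run(['virsh', 'list', '--all', '--name'],
                           capture_output=True, text=True).stdout.split()
    for name in names:
        if 'scalebug1' in name:
            subprocess.run(['virsh', 'undefine', name], check=False)
    # Re-define them from the previously dumped XML files.
    for xml in sorted(os.listdir(XML_DIR)):
        subprocess.run(['virsh', 'define', os.path.join(XML_DIR, xml)],
                       check=False)

if __name__ == '__main__':
    for _ in range(10):   # keep the churn going while provisioning runs
        churn_once()
```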
Please note Max's last comment was meant for the duplicate RH bug https://bugzilla.redhat.com/show_bug.cgi?id=1942127
------- Comment From tstaudt.com 2021-08-23 10:14 EDT-------

IBM 192126 - RH1942127 - RHEL8.4 Nightly[0208] - No domain with matching uuid using "virt-install" during many undefine operations (kvm/libvirt) (-> Red Hat Enterprise Linux Advanced Virtualization) now points to Red Hat Bug 1942127 - RHEL8.4 Nightly[0208] - No domain with matching uuid using "virt-install" during many undefine operations (kvm/libvirt) (-> Red Hat Enterprise Linux Advanced Virtualization) again.

------- Comment From Max.Bender1 2021-08-25 11:27 EDT-------

(In reply to comment #34)
> So, I am trying to reproduce the issue here and haven't managed to yet. But
> I'm also a bit unclear about the actual procedure. So, let me ask a few
> questions to try to clarify things.
> A) What are you actually trying to do here? Is it a real-world scenario, or
> just a somewhat contrived procedure that allows you to reproduce the bug?

This issue was originally found during our scaling tests, where we had some guests running through lifecycle operations to validate the behavior while the system was under load. After we completed the test we had a retrospective to determine why some guests didn't come up in time, and eventually scoped it as a consequence of the lifecycle guests running undefine & redefine operations.

The procedure noted in the original bug report was just a way we managed to reproduce the issue outside of the scaling test. We obviously wouldn't expect an end user to run these operations at scale like we are, but we were able to utilize it to identify an issue. The use case that we are targeting is a dynamic cloud environment where VMs are getting provisioned & deprovisioned often to run things like CI/CD tasks, helper jobs, or other non-static Linux configurations. Technologies like Terraform are a potential end-user target for where many of these operations might happen at once.

> B) Do you know which command is returning the error message "Domain not
> found: no domain with matching uuid '5e6072ae-e5c3-453f-b7dd-6c8bf3144a0f'"?
> I assume it's one of the 'undefine' commands?

`virt-install` is the command that is returning this error. Undefines/redefines all execute successfully. It is a NEW GUEST that experiences this issue, not one of the 50 being redefined/undefined.

> C) Above you say "at the same time start running undefine and define
> operations against the 50 shutdown guests". It sounds almost as if you're
> running the undefine and define operations at the same time. But I assume
> you're doing all of the undefines first and then doing all of the defines?

Correct: undefine and wait till completed, then redefine. FYI, I didn't combine those operations themselves in a programmatic loop, just manually executed them back to back.

> So this is how I interpret your instructions above. Please tell me if I'm
> incorrect:
> - provision 50 guests
> - dump xml for all 50 guests
> - shutdown all 50 guests
> At this point we kick off two parallel paths. In the first path, we
> provision 100 new guests just to keep libvirt busy. In the second path, we
> first undefine all of the original 50 guest domains. After all of those
> domains are undefined, we try to re-define all of those 50 domains from the
> xml files. Is that correct?

Correct. FYI, I observed the UUID error within the first 20 NEW guests provisioning.

> Oh, I just noticed the following line in the backtrace you attached above.
> Are you certain that you're testing the version of libvirt that you think
> you are testing?
> Reading symbols from /usr/sbin/libvirtd...Reading symbols from
> /usr/lib/debug/usr/sbin/libvirtd-6.3.0-1.module+el8.3.0+6478+69f490bb.s390x.debug...done.

The backtrace was from the original bug report against RHEL 8.4, which likely had libvirt 6.3.0. I can re-capture if needed. Confirmed that the recent reproduction occurred on:

OS:       RHEL 8.5
virsh:    7.4.0
qemu-img: 6.0.0
kernel:   4.18.0-325.el8.s390x

------- Comment From Max.Bender1 2021-08-31 11:22 EDT-------

(In reply to comment #42)
> Have you been able to reproduce on libvirt 7.6 yet? Is it possible to get a
> stack trace for this version?

Still WIP on this item; I have some other bugs that are taking priority. Can you expand on what you would like in regards to a "stack trace"? Do you mean just enabling debug logging for libvirt?

------- Comment From Max.Bender1 2021-09-07 15:07 EDT-------

I am unable to reproduce this on RHEL 8.5 AV 1.1 (libvirt 7.6.0) after a few attempts. Looks like something got fixed? Do we care about backports to 8.4?
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (virt:av bug fix and enhancement update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2021:4684