Red Hat Bugzilla – Bug 1162208
libvirtd occasionally crashes at the end of migration
Last modified: 2015-03-05 02:47:20 EST
Description of problem:
Migration of VMs in RHEV3.5 on RHEL7.0 hosts occasionally fails because of a libvirtd crash (segfault) happening near the end.

Version-Release number of selected component (if applicable):
libvirt-1.1.1-29.el7_0.3.x86_64

How reproducible:
About 1/10th of the time

Steps to Reproduce:
1. Connect to a VM running in a RHEV3.5 environment on RHEL7.0 hosts using SPICE (tested only with 64-bit Windows 7 as a guest, default configuration)
2. Migrate the VM
3. Repeat until the error appears

Actual results:
The VM shuts down, libvirtd segfaults and is restarted on the source.
Created attachment 955828
libvirtd log

The coredump is too large to attach; you can find it here: http://download.eng.brq.redhat.com/scratch/tjamrisk/libvirt-coredump
Could you install the required debuginfo packages and get the output of the gdb command "thread apply all backtrace" from the coredump? That wouldn't be so big...
Created attachment 955852
backtrace
So this is what happens in the two threads involved in this crash:

Thread 10                                     Thread 1
qemuMigrationPerform
qemuMigrationPerformJob
doPeer2PeerMigrate
doPeer2PeerMigrate3
qemuMigrationConfirmPhase
qemuProcessStop
qemuProcessKill
/* Clear out dynamically assigned labels */
qemuDomainObjEnterRemote
virObjectRef(vm)
virObjectUnlock(vm)
                                              qemuDomainGetBlockInfo
virConnectClose                               qemuDomObjFromDomain
                                              virDomainObjListFindByUUID
                                              virObjectLock(vm)
                                              qemuOpenFile
                                              /* use cleared labels */
                                              SIGSEGV
qemuDomainObjExitRemote
qemuDomainRemoveInactive

The problem is in qemuProcessStop, which clears out dynamically assigned labels but leaves the seclabel structures, now full of NULL pointers, in vm->def->seclabels. While the first thread has the domain object unlocked for the remote call, the second thread locks it and tries to use one of those seclabels. The problem does not affect persistent domains because vm->def is completely removed and replaced with a persistent version of def pointed to by vm->newdef.
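As a rough illustration of the explanation above, the following C sketch models the state a second thread would find after the stop path has run. It is not libvirt code: the structs and the functions def_new, def_free, process_stop and label_state_of are hypothetical, heavily simplified stand-ins, and the idea that the persistent definition carries no seclabel entry at all is an assumption made only for this model.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical, simplified stand-ins for libvirt's structures. */
struct sec_label { char *label; };             /* label is NULL once cleared */
struct domain_def { struct sec_label *sec; };  /* NULL when no entry exists */
struct domain {
    struct domain_def *def;     /* live definition */
    struct domain_def *newdef;  /* persistent definition, NULL if transient */
};

enum label_state { NO_ENTRY, ENTRY_WITHOUT_LABEL, ENTRY_WITH_LABEL };

/* Build a definition; a NULL label means "no seclabel entry at all". */
struct domain_def *def_new(const char *label)
{
    struct domain_def *def = calloc(1, sizeof(*def));
    if (!def || !label)
        return def;
    def->sec = calloc(1, sizeof(*def->sec));
    if (def->sec) {
        def->sec->label = malloc(strlen(label) + 1);
        if (def->sec->label)
            strcpy(def->sec->label, label);
    }
    return def;
}

void def_free(struct domain_def *def)
{
    if (!def)
        return;
    if (def->sec)
        free(def->sec->label);
    free(def->sec);
    free(def);
}

/* Models only the part of the stop path that matters for this bug. */
void process_stop(struct domain *vm)
{
    /* Clear the dynamically assigned label; the entry itself stays. */
    if (vm->def->sec) {
        free(vm->def->sec->label);
        vm->def->sec->label = NULL;
    }
    /* Persistent domain: the live def is dropped and replaced. */
    if (vm->newdef) {
        def_free(vm->def);
        vm->def = vm->newdef;
        vm->newdef = NULL;
    }
}

/* What another thread sees once it manages to lock the object. */
enum label_state label_state_of(const struct domain *vm)
{
    if (!vm->def->sec)
        return NO_ENTRY;
    return vm->def->sec->label ? ENTRY_WITH_LABEL : ENTRY_WITHOUT_LABEL;
}
```

In this model a transient domain ends up in the ENTRY_WITHOUT_LABEL state after process_stop, which is the state that is dangerous for any caller that checks only for the presence of the entry. A persistent domain never exposes that state because its definition is swapped out.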
This is already fixed upstream by v1.2.5-112-g7eb0ee1 (the scenario described in the commit message is obviously not the only possible way to hit the crash):

commit 7eb0ee175b278a4439cee65a7a554767f0be9cd1
Author: Ján Tomko <jtomko@redhat.com>
Date:   Thu Jun 12 10:50:43 2014 +0200

    Fix crash when saving a domain with type none dac label

    qemuDomainGetImageIds did not check if there was a label
    in the seclabel, thus crashing on
    <seclabel type='none' model='dac'/>

    https://bugzilla.redhat.com/show_bug.cgi?id=1108590
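The kind of check the commit message describes can be sketched in C as follows. This is not the upstream patch: struct sec_label, parse_ids and get_image_ids are hypothetical names, and the label format handled here (numeric "[+]UID:[+]GID") is a simplifying assumption. The point is only that both the entry and its label are tested before the label is parsed.

```c
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-in for a security label entry. */
struct sec_label {
    const char *model;  /* e.g. "dac" */
    const char *label;  /* e.g. "+107:+107"; may be NULL */
};

/* Parse "[+]UID:[+]GID" (numeric only in this sketch).
 * Returns 0 on success, -1 on malformed input; outputs are
 * written only on success. */
int parse_ids(const char *label, unsigned *uid, unsigned *gid)
{
    char *end;
    unsigned long u, g;
    const char *p = label;

    if (*p == '+')
        p++;
    u = strtoul(p, &end, 10);
    if (end == p || *end != ':')
        return -1;

    p = end + 1;
    if (*p == '+')
        p++;
    g = strtoul(p, &end, 10);
    if (end == p || *end != '\0')
        return -1;

    *uid = (unsigned)u;
    *gid = (unsigned)g;
    return 0;
}

/* Start from the caller's defaults and override them only when a
 * label is actually present. Checking `sec` alone is not enough:
 * an entry whose label is NULL must be skipped as well. */
int get_image_ids(const struct sec_label *sec,
                  unsigned def_uid, unsigned def_gid,
                  unsigned *uid, unsigned *gid)
{
    *uid = def_uid;
    *gid = def_gid;

    if (sec && sec->label)
        return parse_ids(sec->label, uid, gid);

    return 0;
}
```

With a guard of this shape, an entry left behind with a NULL label behaves the same as a missing entry, so the caller falls back to its defaults instead of dereferencing a NULL pointer.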
The patch is included in 7.1, feel free to request a backport for 7.0.z...
Steps to reliably reproduce this crash:
0. patch libvirt to make the race condition window bigger (see attached patches), build it, and start the modified libvirtd
1. create a transient domain
2. start a peer-to-peer migration of the domain; once a "doPeer2PeerMigrate:??? : SLEEPING" debug log appears, run "virsh domblkinfo $DOMAIN $DISK"

Without the fix mentioned in comment 6, libvirtd just segfaults.
Created attachment 959275
patch for reproducing the bug on libvirt from 7.0.z
Created attachment 959276
patch for reproducing the bug on libvirt from 7.1
I can reproduce this bug with libvirt-1.1.1-29.el7.x86_64 :

Steps:
1. Rebuild libvirt with the patch Jiri offered in comment 9.
2. Install the libvirt built in step 1 and restart libvirtd on the source host:
# service libvirtd start
Redirecting to /bin/systemctl start libvirtd.service
3. Prepare a transient VM on the source host (do not set <seclabel type='none' model='dac'/> in the guest XML, because that causes another crash):
# virsh create test6.xml
Domain test6 created from test6.xml
4. Migrate the VM from the source host (running the rebuilt libvirtd) to the target host using p2p migration:
# virsh migrate test6 qemu+ssh://lhuang/system --p2p
5. Before the migration succeeds (during the sleep), run domblkinfo on the source host from another terminal:
# virsh domblkinfo test6 hda
error: End of file while reading data: Input/output error
error: One or more references were leaked after disconnect from the hypervisor
error: Failed to reconnect to the hypervisor
6. Check the coredump file with gdb on the source host; it has the same cause as described in comment 5.

And cannot verify this bug with libvirt-1.2.8-10.el7.x86_64 :

1. Rebuild libvirt with the patch Jiri offered in comment 9.
2. Install the libvirt built in step 1 and restart libvirtd on the source host:
# service libvirtd start
Redirecting to /bin/systemctl start libvirtd.service
3. Prepare a transient VM on the source host (do not set <seclabel type='none' model='dac'/> in the guest XML, because that causes another crash):
# virsh create test6.xml
Domain test6 created from test6.xml
4. Migrate the VM from the source host (running the rebuilt libvirtd) to the target host using p2p migration:
# virsh migrate test6 qemu+ssh://lhuang/system --p2p
5. Before the migration succeeds (during the sleep), run domblkinfo on the source host from another terminal:
# virsh domblkinfo test6 hda
Capacity:       4294967296
Allocation:     4294967296
Physical:       4294967296
Sorry, I made a mistake in comment 14.

> And cannot verify this bug with libvirt-1.2.8-10.el7.x86_64 :

s/cannot verify this bug/cannot reproduce
*** Bug 1174869 has been marked as a duplicate of this bug. ***
Thanks, Tomas. Moving the bug to VERIFIED. We also re-verified it (PASS) with the latest libvirt version.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://rhn.redhat.com/errata/RHSA-2015-0323.html