Bug 1162208
Summary: | libvirtd occasionally crashes at the end of migration | ||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Product: | Red Hat Enterprise Linux 7 | Reporter: | Tomas Jamrisko <tjamrisk> | ||||||||||
Component: | libvirt | Assignee: | Jiri Denemark <jdenemar> | ||||||||||
Status: | CLOSED ERRATA | QA Contact: | Virtualization Bugs <virt-bugs> | ||||||||||
Severity: | high | Docs Contact: | |||||||||||
Priority: | high | ||||||||||||
Version: | 7.0 | CC: | dyuan, fromani, jdenemar, lhuang, mzhan, rbalakri, tdosek, tjamrisk, wzhang, xuzhang, zpeng | ||||||||||
Target Milestone: | rc | Keywords: | ZStream | ||||||||||
Target Release: | --- | ||||||||||||
Hardware: | Unspecified | ||||||||||||
OS: | Unspecified | ||||||||||||
Whiteboard: | |||||||||||||
Fixed In Version: | libvirt-1.2.8-1.el7 | Doc Type: | Bug Fix | ||||||||||
Doc Text: |
Cause: Libvirt did not properly check whether a DAC security label is non-NULL before trying to parse user/group ownership from it.
Consequence: When the virDomainGetBlockInfo API is called on a transient domain that has just finished migration to another host, its DAC security label may already be NULL, which crashes libvirtd. Since RHEV uses transient domains and periodically calls virDomainGetBlockInfo, whether a call lands in the window that crashes libvirtd is just a matter of timing.
Fix: Properly check DAC label before trying to parse it.
Result: Libvirtd no longer crashes in the described scenario.
|
Story Points: | --- | ||||||||||
Clone Of: | |||||||||||||
: | 1171124 (view as bug list) | Environment: | |||||||||||
Last Closed: | 2015-03-05 07:47:20 UTC | Type: | Bug | ||||||||||
Regression: | --- | Mount Type: | --- | ||||||||||
Documentation: | --- | CRM: | |||||||||||
Verified Versions: | Category: | --- | |||||||||||
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |||||||||||
Cloudforms Team: | --- | Target Upstream Version: | |||||||||||
Embargoed: | |||||||||||||
Bug Depends On: | |||||||||||||
Bug Blocks: | 1151953, 1171124 | ||||||||||||
Attachments: |
|
Description
Tomas Jamrisko
2014-11-10 14:16:27 UTC
Created attachment 955828 [details]
libvirtd log

coredump too large to attach, you can find it here: http://download.eng.brq.redhat.com/scratch/tjamrisk/libvirt-coredump

Could you install the required debuginfo packages, and get the output of "thread apply all backtrace" gdb command from the coredump? That wouldn't be so big...

Created attachment 955852 [details]
backtrace

So this is what happens in the two threads involved in this crash:

Thread 10 | Thread 1
---|---
qemuMigrationPerform |
qemuMigrationPerformJob |
doPeer2PeerMigrate |
doPeer2PeerMigrate3 |
qemuMigrationConfirmPhase |
qemuProcessStop |
qemuProcessKill |
/* Clear out dynamically assigned labels */ |
qemuDomainObjEnterRemote |
virObjectRef(vm) |
virObjectUnlock(vm) |
 | qemuDomainGetBlockInfo
virConnectClose | qemuDomObjFromDomain
 | virDomainObjListFindByUUID
 | virObjectLock(vm)
 | qemuOpenFile
 | /* use cleared labels */
 | SIGSEGV
qemuDomainObjExitRemote |
qemuDomainRemoveInactive |

The problem is in qemuProcessStop, which clears out dynamically assigned labels, but the seclabel structures full of NULL pointers remain in vm->def->seclabels. The second thread then tries to use one of the seclabels. The problem does not affect persistent domains because vm->def is completely removed and replaced with a persistent version of def pointed to by vm->newdef.

This is already fixed upstream by v1.2.5-112-g7eb0ee1 (the scenario described in the commit message is obviously not the only possible way to hit the crash):

    commit 7eb0ee175b278a4439cee65a7a554767f0be9cd1
    Author: Ján Tomko <jtomko>
    Date: Thu Jun 12 10:50:43 2014 +0200

    Fix crash when saving a domain with type none dac label

    qemuDomainGetImageIds did not check if there was a label in the seclabel, thus crashing on <seclabel type='none' model='dac'/>

    https://bugzilla.redhat.com/show_bug.cgi?id=1108590

The patch is included in 7.1, feel free to request a backport for 7.0.z...

Steps to reliably reproduce this crash:

0. patch libvirt to make the race condition window bigger (see attached patches), build it, and start the modified libvirtd
1. create a transient domain
2. once a "doPeer2PeerMigrate:??? : SLEEPING" debug log appears, run "virsh domblkinfo $DOMAIN $DISK"

libvirtd just segfaults without the fix mentioned in comment 6.

Created attachment 959275 [details]
patch for reproducing the bug on libvirt from 7.0.z

Created attachment 959276 [details]
patch for reproducing the bug on libvirt from 7.1
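
The analysis above comes down to a label entry that stays reachable after its text has been cleared, and a reader that parses it without checking. A minimal sketch of that guard in C follows. It is not libvirt's code: `struct sec_label`, `clear_label`, and `parse_dac_ids` are hypothetical names invented for illustration, and the plain "uid:gid" label format is an assumption.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical stand-in for one security label entry (not libvirt's struct). */
struct sec_label {
    const char *model;  /* security model name, e.g. "dac" */
    char *label;        /* "uid:gid" text, or NULL once it has been cleared */
};

/* Mimics the cleanup step: the label text is freed and set to NULL,
 * but the entry itself stays in place where other callers can find it. */
void clear_label(struct sec_label *sec)
{
    free(sec->label);
    sec->label = NULL;
}

/* Parse "uid:gid" out of a label entry.
 * Returns 0 on success, 1 if there is no label to parse (the caller
 * should fall back to defaults), and -1 if the label text is malformed. */
int parse_dac_ids(const struct sec_label *sec,
                  unsigned int *uid, unsigned int *gid)
{
    unsigned int u, g;
    int consumed = 0;

    assert(uid != NULL && gid != NULL);

    /* The guard this bug is about: never dereference a missing label. */
    if (sec == NULL || sec->label == NULL)
        return 1;

    /* %n records how many characters were consumed, so trailing junk is
     * detected by checking that parsing stopped at the terminator. */
    if (sscanf(sec->label, "%u:%u%n", &u, &g, &consumed) != 2 ||
        sec->label[consumed] != '\0')
        return -1;

    *uid = u;
    *gid = g;
    return 0;
}
```

With this shape, a caller that reaches the entry after `clear_label` has run gets the "no label" return value instead of passing a NULL pointer to the parser, which is the behaviour the Doc Text describes for the fixed build.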
I can reproduce this bug with libvirt-1.1.1-29.el7.x86_64.

Steps:

1. Rebuild libvirt with the patch Jiri offered in comment 9.
2. Install the libvirt built in step 1 and restart libvirtd on the source host:
    # service libvirtd start
    Redirecting to /bin/systemctl start libvirtd.service
3. Prepare a transient VM on the source host (the guest XML doesn't set <seclabel type='none' model='dac'/>, because this will cause another crash):
    # virsh create test6.xml
    Domain test6 created from test6.xml
4. Migrate the VM from the source host (where libvirtd has been rebuilt) to the target host (p2p migration):
    # virsh migrate test6 qemu+ssh://lhuang/system --p2p
5. Before the migration succeeds (during the sleep time), run domblkinfo on the source host from another terminal:
    # virsh domblkinfo test6 hda
    error: End of file while reading data: Input/output error
    error: One or more references were leaked after disconnect from the hypervisor
    error: Failed to reconnect to the hypervisor
6. Check the coredump file with gdb on the source host; it has the same cause as in comment 5.

And cannot verify this bug with libvirt-1.2.8-10.el7.x86_64 :

1. Rebuild libvirt with the patch Jiri offered in comment 9.
2. Install the libvirt built in step 1 and restart libvirtd on the source host:
    # service libvirtd start
    Redirecting to /bin/systemctl start libvirtd.service
3. Prepare a transient VM on the source host (the guest XML doesn't set <seclabel type='none' model='dac'/>, because this will cause another crash):
    # virsh create test6.xml
    Domain test6 created from test6.xml
4. Migrate the VM from the source host (where libvirtd has been rebuilt) to the target host (p2p migration):
    # virsh migrate test6 qemu+ssh://lhuang/system --p2p
5. Before the migration succeeds (during the sleep time), run domblkinfo on the source host from another terminal:
    # virsh domblkinfo test6 hda
    Capacity: 4294967296
    Allocation: 4294967296
    Physical: 4294967296

Sorry, I made a mistake in comment 14.
>And cannot verify this bug with libvirt-1.2.8-10.el7.x86_64 :
s/cannot verify this bug/cannot reproduce

*** Bug 1174869 has been marked as a duplicate of this bug. ***

Thanks Tomas. Move the bug to VERIFIED. And we also re-verified it as PASS with the latest libvirt version.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://rhn.redhat.com/errata/RHSA-2015-0323.html |