Bug 1047659
Summary: | bidirectional VMs migration between 2 hosts fail on VM doesn't exist / fatal error | ||||||||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Product: | Red Hat Enterprise Linux 6 | Reporter: | Ilanit Stein <istein> | ||||||||||||||||
Component: | libvirt | Assignee: | Peter Krempa <pkrempa> | ||||||||||||||||
Status: | CLOSED ERRATA | QA Contact: | Ilanit Stein <istein> | ||||||||||||||||
Severity: | high | Docs Contact: | |||||||||||||||||
Priority: | high | ||||||||||||||||||
Version: | 6.5 | CC: | ahadas, cpelland, dyuan, honzhang, iheim, istein, jdenemar, jiahu, juzhou, lpeer, michal.skrivanek, mzhan, pkrempa, rbalakri, Rhev-m-bugs, yeylon, zpeng | ||||||||||||||||
Target Milestone: | rc | Keywords: | Upstream, ZStream | ||||||||||||||||
Target Release: | --- | ||||||||||||||||||
Hardware: | Unspecified | ||||||||||||||||||
OS: | Unspecified | ||||||||||||||||||
Whiteboard: | virt | ||||||||||||||||||
Fixed In Version: | libvirt-0.10.2-30.el6 | Doc Type: | Bug Fix | ||||||||||||||||
Doc Text: | Story Points: | --- | |||||||||||||||||
Clone Of: | |||||||||||||||||||
: | 1054785 (view as bug list) | Environment: | |||||||||||||||||
Last Closed: | 2014-10-14 04:19:35 UTC | Type: | Bug | ||||||||||||||||
Regression: | --- | Mount Type: | --- | ||||||||||||||||
Documentation: | --- | CRM: | |||||||||||||||||
Verified Versions: | Category: | --- | |||||||||||||||||
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |||||||||||||||||
Cloudforms Team: | --- | Target Upstream Version: | |||||||||||||||||
Embargoed: | |||||||||||||||||||
Bug Depends On: | |||||||||||||||||||
Bug Blocks: | 1054785, 1055578 | ||||||||||||||||||
Attachments: |
|
Description
Ilanit Stein
2014-01-01 16:11:32 UTC
Created attachment 844154 [details]
engine log (migration @17:31)
Created attachment 844155 [details]
host 1 vdsm & libvirt logs
host time is 2 hours behind rhevm time. (migration @15:31)
Created attachment 844156 [details]
host 2 vdsm & libvirt logs
host time is 2 hours behind rhevm time. (migration @15:31)
Created attachment 845762 [details]
host 2 libvirt log
missing libvirt log
Problem didn't reproduce on is30 (7 trials). There is a libvirt crash exactly at that time when vdsm has libvirt connection issues during migration. That's a bummer…. Do you have more logs, e.g. qemu log or libvirt core dump? If you cannot reproduce anymore then we have to close this I suppose:( 2014-01-01 15:31:03.305+000027559: debug : virObjectUnref:135 : OBJECT_UNREF: obj=0x7fc72801a610 2014-01-01 15:31:03.305+000027572: debug : virObjectUnref:135 : OBJECT_UNREF: obj=0x7fc738000b00 2014-01-01 15:31:03.305+00006023: warning : qemuDomainObjEnterMonitorInternal:1005 : This thread seems to be the async job owner; entering monitor without asking for a nested job is dangerous 2014-01-01 15:31:03.305+000027559: debug : virEventPollDispatchHandles:469 : i=31 w=121 2014-01-01 15:31:03.305+0000: 27559: debug : qemuMonitorDispose:247 : mon=0x7fc72801a610 ====== end of log ===== 2014-01-01 15:31:03.502+0000: 9997: info : libvirt version: 0.10.2, package: 29.el6_5.2 (Red Hat, Inc. <http://bugzilla.redhat.com/bugzilla>, 2013-12-11-08:49:17, x86-022.build.eng.bos.redhat.com) Created attachment 846088 [details]
host20 qemu logs
Created attachment 846089 [details]
host19 qemu logs
then it looks like a purely libvirt crash please take a look into libvirt crash Can you attach gdb backtrace of all libvirt threads when the deamon crashed? Created attachment 847679 [details]
core dump
Reproducer: commit ba95d2d1220061ce86fd0b66f0a2f19a90c6b51c Author: Peter Krempa <pkrempa> Date: Tue Jan 14 19:09:36 2014 +0100 DO NOT APPLY UPSTREAM: Reproducer Apply this patch and run a patched libvirt daemon. Then start a VM and kill it's process after few seconds: # virsh start VM & sleep 3; killall -9 qemu-kvm Upstream version doesn't crash but the virsh command returns cryptic error message: error: invalid argument: monitor must not be NULL 0.10.2 and similar downstream versions segfault: Program received signal SIGSEGV, Segmentation fault. [Switching to Thread 0x7fffed467700 (LWP 29256)] __pthread_mutex_lock (mutex=0x10) at pthread_mutex_lock.c:50 50 unsigned int type = PTHREAD_MUTEX_TYPE (mutex); Missing separate debuginfos, use: debuginfo-install VirtualBox-4.2-4.2.22_91556_el6-1.x86_64 (gdb) bt #0 __pthread_mutex_lock (mutex=0x10) at pthread_mutex_lock.c:50 #1 0x00007fffe891585e in qemuDomainObjEnterMonitorInternal (driver=0x7fffe00858f0, driver_locked=true, obj=0x7fffe01689c0, asyncJob=<value optimized out>) at qemu/qemu_domain.c:1008 #2 0x00007fffe892662a in qemuProcessDetectVcpuPIDs (driver=0x7fffe00858f0, vm=0x7fffe01689c0) at qemu/qemu_process.c:1831 #3 0x00007fffe892c417 in qemuProcessStart (conn=0x7fffdc000ae0, driver=0x7fffe00858f0, vm=0x7fffe01689c0, migrateFrom=0x0, stdin_fd=-1, stdin_path=0x0, snapshot=0x0, vmop=VIR_NETDEV_VPORT_PROFILE_OP_CREATE, flags=1) at qemu/qemu_process.c:4022 #4 0x00007fffe8973f3e in qemuDomainObjStart (conn=0x7fffdc000ae0, driver=0x7fffe00858f0, vm=0x7fffe01689c0, flags=<value optimized out>) at qemu/qemu_driver.c:6025 diff --git a/src/qemu/qemu_process.c b/src/qemu/qemu_process.c index 9331744..a27eded 100644 --- a/src/qemu/qemu_process.c +++ b/src/qemu/qemu_process.c @@ -226,40 +226,42 @@ qemuConnectAgent(virQEMUDriverPtr driver, virDomainObjPtr vm) return 0; if (virSecurityManagerSetDaemonSocketLabel(driver->securityManager, vm->def) < 0) { VIR_ERROR(_("Failed to set security context for agent for %s"), vm->def->name); goto cleanup; } /* Hold an extra reference because we can't allow 'vm' to be * deleted while the agent is active */ virObjectRef(vm); ignore_value(virTimeMillisNow(&priv->agentStart)); virObjectUnlock(vm); agent = qemuAgentOpen(vm, config, &agentCallbacks); + sleep(10); + virObjectLock(vm); priv->agentStart = 0; if (virSecurityManagerClearSocketLabel(driver->securityManager, vm->def) < 0) { VIR_ERROR(_("Failed to clear security context for agent for %s"), vm->def->name); goto cleanup; } if (agent == NULL) virObjectUnref(vm); if (!virDomainObjIsActive(vm)) { qemuAgentClose(agent); goto cleanup; } priv->agent = agent; if (priv->agent == NULL) { Fixed upstream: commit b952cbbccafd5ead8b5a70b2608a1d5a7f03b31e Author: Peter Krempa <pkrempa> Date: Tue Jan 14 19:13:30 2014 +0100 qemu: Avoid operations on NULL monitor if VM fails early https://bugzilla.redhat.com/show_bug.cgi?id=1047659 If a VM dies very early during an attempted connect to the guest agent while the locks are down the domain monitor object will be freed. The object is then accessed later as any failure during guest agent startup isn't considered fatal. In the current upstream version this doesn't lead to a crash as virObjectLock called when entering the monitor in qemuProcessDetectVcpuPIDs checks the pointer before attempting to dereference (lock) it. The NULL pointer is then caught in the monitor helper code. Before the introduction of virObjectLockable - observed on 0.10.2 - the pointer is locked directly via virMutexLock leading to a crash. To avoid this problem we need to differentiate between the guest agent not being present and the VM quitting when the locks were down. The fix reorganizes the code in qemuConnectAgent to add the check and then adds special handling to the callers. Thanks Peter's help, i can reproduce this with build libvirt-0.10.2-29.el6_5.2. verify with build: libvirt-0.10.2-31.el6.x86_64 step: 1:apply Peter's patch in comment 15 2:Build libvirt pkg & install 3:prepare a guest with guest agent configured. 4:# virsh start VM & sleep 3; killall -9 qemu-kvm # error: Failed to start domain rhel6.5 error: internal error guest crashed while connecting to the guest agent [1]+ Exit 1 virsh start rhel6.5 # service libvirtd status libvirtd (pid 1546) is running... libvirtd not crash, move to verified. Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. http://rhn.redhat.com/errata/RHBA-2014-1374.html |