Bug 1047659 - bidirectional VMs migration between 2 hosts fail on VM doesn't exist / fatal error
Status: CLOSED ERRATA
Product: Red Hat Enterprise Linux 6
Classification: Red Hat
Component: libvirt
Version: 6.5
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: rc
Target Release: ---
Assigned To: Peter Krempa
QA Contact: Ilanit Stein
Whiteboard: virt
Keywords: Upstream, ZStream
Depends On:
Blocks: 1054785 1055578
 
Reported: 2014-01-01 11:11 EST by Ilanit Stein
Modified: 2016-04-26 10:18 EDT
CC List: 17 users

See Also:
Fixed In Version: libvirt-0.10.2-30.el6
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Clones: 1054785
Environment:
Last Closed: 2014-10-14 00:19:35 EDT
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:


Attachments
engine log (migration @17:31) (512.09 KB, application/x-gzip), 2014-01-01 11:46 EST, Ilanit Stein
host 1 vdsm & libvirt logs (1.13 MB, application/zip), 2014-01-01 11:47 EST, Ilanit Stein
host 2 vdsm & libvirt logs (1.07 MB, application/zip), 2014-01-01 11:48 EST, Ilanit Stein
host 2 libvirt log (687.74 KB, application/x-xz), 2014-01-05 09:19 EST, Ilanit Stein
host20 qemu logs (32.90 KB, application/zip), 2014-01-06 08:38 EST, Ilanit Stein
host19 qemu logs (34.26 KB, application/zip), 2014-01-06 08:39 EST, Ilanit Stein
core dump (1.31 MB, application/x-xz), 2014-01-09 09:43 EST, Ilanit Stein

Description Ilanit Stein 2014-01-01 11:11:32 EST
Description of problem:
Two hosts (64 GB of memory each), each running 8 RHEL VMs with 4 GB of memory per VM.
Concurrently migrate all VMs running on host 1 to host 2. A few seconds later,
while this migration is still in progress,
concurrently migrate all VMs running on host 2 to host 1.
This results in migration failure for all VMs.
Some VM processes were killed.

Version-Release number of selected component (if applicable):
is29

Actual results:
All VMs failed to migrate:
13 VMs failed with the event: "Migration failed due to Error: Fatal error during migration"

3 VMs went to the "down" state in the UI, with the event error: "Migration failed due to Error: Desktop does not exist"

Expected results:
Migration should succeed.
Each host has enough memory to run 16 such VMs.
Comment 1 Ilanit Stein 2014-01-01 11:46:08 EST
Created attachment 844154 [details]
engine log (migration @17:31)
Comment 2 Ilanit Stein 2014-01-01 11:47:55 EST
Created attachment 844155 [details]
host 1 vdsm & libvirt logs

Host time is 2 hours behind the RHEV-M time (migration @15:31).
Comment 3 Ilanit Stein 2014-01-01 11:48:48 EST
Created attachment 844156 [details]
host 2 vdsm & libvirt logs

Host time is 2 hours behind the RHEV-M time (migration @15:31).
Comment 5 Ilanit Stein 2014-01-05 09:19:16 EST
Created attachment 845762 [details]
host 2 libvirt log

missing libvirt log
Comment 6 Ilanit Stein 2014-01-05 09:53:52 EST
Problem didn't reproduce on is30 (7 trials).
Comment 7 Michal Skrivanek 2014-01-06 07:56:15 EST
There is a libvirt crash at exactly the time when vdsm has libvirt connection issues during the migration. That's a bummer…

Do you have more logs, e.g. a qemu log or a libvirt core dump? If you cannot reproduce it anymore, then I suppose we have to close this :(


2014-01-01 15:31:03.305+0000: 27559: debug : virObjectUnref:135 : OBJECT_UNREF: obj=0x7fc72801a610
2014-01-01 15:31:03.305+0000: 27572: debug : virObjectUnref:135 : OBJECT_UNREF: obj=0x7fc738000b00
2014-01-01 15:31:03.305+0000: 6023: warning : qemuDomainObjEnterMonitorInternal:1005 : This thread seems to be the async job owner; entering monitor without asking for a nested job is dangerous
2014-01-01 15:31:03.305+0000: 27559: debug : virEventPollDispatchHandles:469 : i=31 w=121
2014-01-01 15:31:03.305+0000: 27559: debug : qemuMonitorDispose:247 : mon=0x7fc72801a610


     ====== end of log =====

2014-01-01 15:31:03.502+0000: 9997: info : libvirt version: 0.10.2, package: 29.el6_5.2 (Red Hat, Inc. <http://bugzilla.redhat.com/bugzilla>, 2013-12-11-08:49:17, x86-022.build.eng.bos.redhat.com)
Comment 8 Ilanit Stein 2014-01-06 08:38:26 EST
Created attachment 846088 [details]
host20 qemu logs
Comment 9 Ilanit Stein 2014-01-06 08:39:08 EST
Created attachment 846089 [details]
host19 qemu logs
Comment 10 Michal Skrivanek 2014-01-07 03:53:20 EST
Then it looks like a pure libvirt crash.
Comment 11 Michal Skrivanek 2014-01-07 03:53:47 EST
Please take a look into the libvirt crash.
Comment 13 Jiri Denemark 2014-01-07 04:50:31 EST
Can you attach a gdb backtrace of all libvirt threads from when the daemon crashed?
Comment 14 Ilanit Stein 2014-01-09 09:43:25 EST
Created attachment 847679 [details]
core dump
Comment 15 Peter Krempa 2014-01-14 13:29:36 EST
Reproducer:

commit ba95d2d1220061ce86fd0b66f0a2f19a90c6b51c
Author: Peter Krempa <pkrempa@redhat.com>
Date:   Tue Jan 14 19:09:36 2014 +0100

    DO NOT APPLY UPSTREAM: Reproducer
    
    Apply this patch and run a patched libvirt daemon. Then start a VM and
    kill its process after a few seconds:
    
     # virsh start VM & sleep 3; killall -9 qemu-kvm
    
    Upstream version doesn't crash but the virsh command returns cryptic
    error message:
     error: invalid argument: monitor must not be NULL
    
    0.10.2 and similar downstream versions segfault:
    
    Program received signal SIGSEGV, Segmentation fault.
    [Switching to Thread 0x7fffed467700 (LWP 29256)]
    __pthread_mutex_lock (mutex=0x10) at pthread_mutex_lock.c:50
    50        unsigned int type = PTHREAD_MUTEX_TYPE (mutex);
    Missing separate debuginfos, use: debuginfo-install VirtualBox-4.2-4.2.22_91556_el6-1.x86_64
    (gdb) bt
     #0  __pthread_mutex_lock (mutex=0x10) at pthread_mutex_lock.c:50
     #1  0x00007fffe891585e in qemuDomainObjEnterMonitorInternal (driver=0x7fffe00858f0, driver_locked=true, obj=0x7fffe01689c0, asyncJob=<value optimized out>)
         at qemu/qemu_domain.c:1008
     #2  0x00007fffe892662a in qemuProcessDetectVcpuPIDs (driver=0x7fffe00858f0, vm=0x7fffe01689c0) at qemu/qemu_process.c:1831
     #3  0x00007fffe892c417 in qemuProcessStart (conn=0x7fffdc000ae0, driver=0x7fffe00858f0, vm=0x7fffe01689c0, migrateFrom=0x0, stdin_fd=-1, stdin_path=0x0, snapshot=0x0,
         vmop=VIR_NETDEV_VPORT_PROFILE_OP_CREATE, flags=1) at qemu/qemu_process.c:4022
     #4  0x00007fffe8973f3e in qemuDomainObjStart (conn=0x7fffdc000ae0, driver=0x7fffe00858f0, vm=0x7fffe01689c0, flags=<value optimized out>) at qemu/qemu_driver.c:6025

diff --git a/src/qemu/qemu_process.c b/src/qemu/qemu_process.c
index 9331744..a27eded 100644
--- a/src/qemu/qemu_process.c
+++ b/src/qemu/qemu_process.c
@@ -226,40 +226,42 @@ qemuConnectAgent(virQEMUDriverPtr driver, virDomainObjPtr vm)
         return 0;

     if (virSecurityManagerSetDaemonSocketLabel(driver->securityManager,
                                                vm->def) < 0) {
         VIR_ERROR(_("Failed to set security context for agent for %s"),
                   vm->def->name);
         goto cleanup;
     }

     /* Hold an extra reference because we can't allow 'vm' to be
      * deleted while the agent is active */
     virObjectRef(vm);

     ignore_value(virTimeMillisNow(&priv->agentStart));
     virObjectUnlock(vm);

     agent = qemuAgentOpen(vm,
                           config,
                           &agentCallbacks);

+    sleep(10);
+
     virObjectLock(vm);
     priv->agentStart = 0;

     if (virSecurityManagerClearSocketLabel(driver->securityManager,
                                            vm->def) < 0) {
         VIR_ERROR(_("Failed to clear security context for agent for %s"),
                   vm->def->name);
         goto cleanup;
     }

     if (agent == NULL)
         virObjectUnref(vm);

     if (!virDomainObjIsActive(vm)) {
         qemuAgentClose(agent);
         goto cleanup;
     }
     priv->agent = agent;

     if (priv->agent == NULL) {
Comment 16 Peter Krempa 2014-01-15 12:11:14 EST
Fixed upstream:

commit b952cbbccafd5ead8b5a70b2608a1d5a7f03b31e
Author: Peter Krempa <pkrempa@redhat.com>
Date:   Tue Jan 14 19:13:30 2014 +0100

    qemu: Avoid operations on NULL monitor if VM fails early
    
    https://bugzilla.redhat.com/show_bug.cgi?id=1047659
    
    If a VM dies very early during an attempted connect to the guest agent
    while the locks are down the domain monitor object will be freed. The
    object is then accessed later as any failure during guest agent startup
    isn't considered fatal.
    
    In the current upstream version this doesn't lead to a crash as
    virObjectLock called when entering the monitor in
    qemuProcessDetectVcpuPIDs checks the pointer before attempting to
    dereference (lock) it. The NULL pointer is then caught in the monitor
    helper code.
    
    Before the introduction of virObjectLockable - observed on 0.10.2 - the
    pointer is locked directly via virMutexLock leading to a crash.
    
    To avoid this problem we need to differentiate between the guest agent
    not being present and the VM quitting when the locks were down. The fix
    reorganizes the code in qemuConnectAgent to add the check and then adds
    special handling to the callers.
Comment 19 zhe peng 2014-01-16 05:45:09 EST
Thanks to Peter's help, I can reproduce this with build libvirt-0.10.2-29.el6_5.2.
Comment 23 zhe peng 2014-04-15 02:10:46 EDT
Verified with build:
libvirt-0.10.2-31.el6.x86_64

Steps:
1. Apply Peter's patch from comment 15.
2. Build the libvirt package and install it.
3. Prepare a guest with the guest agent configured.
4. # virsh start VM & sleep 3; killall -9 qemu-kvm
# error: Failed to start domain rhel6.5
error: internal error guest crashed while connecting to the guest agent


[1]+  Exit 1                  virsh start rhel6.5
# service libvirtd status
libvirtd (pid  1546) is running...

libvirtd did not crash; moving to VERIFIED.
Comment 25 errata-xmlrpc 2014-10-14 00:19:35 EDT
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHBA-2014-1374.html
