Bug 1047659 - bidirectional VMs migration between 2 hosts fail on VM doesn't exist / fatal error
Status: CLOSED ERRATA
Product: Red Hat Enterprise Linux 6
Classification: Red Hat
Component: libvirt
Version: 6.5
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: rc
Target Release: ---
Assigned To: Peter Krempa
QA Contact: Ilanit Stein
Whiteboard: virt
Keywords: Upstream, ZStream
Depends On:
Blocks: 1054785 1055578
 
Reported: 2014-01-01 11:11 EST by Ilanit Stein
Modified: 2016-04-26 10:18 EDT
CC List: 17 users

See Also:
Fixed In Version: libvirt-0.10.2-30.el6
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Clones: 1054785
Environment:
Last Closed: 2014-10-14 00:19:35 EDT
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:


Attachments
engine log (migration @17:31) (512.09 KB, application/x-gzip), 2014-01-01 11:46 EST, Ilanit Stein
host 1 vdsm & libvirt logs (1.13 MB, application/zip), 2014-01-01 11:47 EST, Ilanit Stein
host 2 vdsm & libvirt logs (1.07 MB, application/zip), 2014-01-01 11:48 EST, Ilanit Stein
host 2 libvirt log (687.74 KB, application/x-xz), 2014-01-05 09:19 EST, Ilanit Stein
host20 qemu logs (32.90 KB, application/zip), 2014-01-06 08:38 EST, Ilanit Stein
host19 qemu logs (34.26 KB, application/zip), 2014-01-06 08:39 EST, Ilanit Stein
core dump (1.31 MB, application/x-xz), 2014-01-09 09:43 EST, Ilanit Stein

Description Ilanit Stein 2014-01-01 11:11:32 EST
Description of problem:
Two hosts (64 GB of memory each), each running 8 RHEL VMs with 4 GB of memory per VM.
Concurrently migrate all VMs running on host 1 to host 2. A few seconds later,
while this migration is still in progress,
concurrently migrate all VMs running on host 2 to host 1.
This results in migration failure for all VMs.
Some VM processes were killed.

Version-Release number of selected component (if applicable):
is29

Actual results:
All VMs failed to migrate:
13 VMs failed with the event: "Migration failed due to Error: Fatal error during migration"

3 VMs went to the "down" state in the UI, with the event error: "Migration failed due to Error: Desktop does not exist"

Expected results:
Migration should succeed.
Each host has enough memory to run 16 such VMs.
Comment 1 Ilanit Stein 2014-01-01 11:46:08 EST
Created attachment 844154 [details]
engine log (migration @17:31)
Comment 2 Ilanit Stein 2014-01-01 11:47:55 EST
Created attachment 844155 [details]
host 1 vdsm & libvirt logs

Host time is 2 hours behind the RHEV-M time (migration @15:31).
Comment 3 Ilanit Stein 2014-01-01 11:48:48 EST
Created attachment 844156 [details]
host 2 vdsm & libvirt logs

Host time is 2 hours behind the RHEV-M time (migration @15:31).
Comment 5 Ilanit Stein 2014-01-05 09:19:16 EST
Created attachment 845762 [details]
host 2 libvirt log

missing libvirt log
Comment 6 Ilanit Stein 2014-01-05 09:53:52 EST
Problem didn't reproduce on is30 (7 trials).
Comment 7 Michal Skrivanek 2014-01-06 07:56:15 EST
There is a libvirt crash at exactly the time when vdsm has libvirt connection issues during the migration. That's a bummer…

Do you have more logs, e.g. a qemu log or a libvirt core dump? If you cannot reproduce it anymore, then I suppose we have to close this :(


2014-01-01 15:31:03.305+0000: 27559: debug : virObjectUnref:135 : OBJECT_UNREF: obj=0x7fc72801a610
2014-01-01 15:31:03.305+0000: 27572: debug : virObjectUnref:135 : OBJECT_UNREF: obj=0x7fc738000b00
2014-01-01 15:31:03.305+0000: 6023: warning : qemuDomainObjEnterMonitorInternal:1005 : This thread seems to be the async job owner; entering monitor without asking for a nested job is dangerous
2014-01-01 15:31:03.305+0000: 27559: debug : virEventPollDispatchHandles:469 : i=31 w=121
2014-01-01 15:31:03.305+0000: 27559: debug : qemuMonitorDispose:247 : mon=0x7fc72801a610


     ====== end of log =====

2014-01-01 15:31:03.502+0000: 9997: info : libvirt version: 0.10.2, package: 29.el6_5.2 (Red Hat, Inc. <http://bugzilla.redhat.com/bugzilla>, 2013-12-11-08:49:17, x86-022.build.eng.bos.redhat.com)
Comment 8 Ilanit Stein 2014-01-06 08:38:26 EST
Created attachment 846088 [details]
host20 qemu logs
Comment 9 Ilanit Stein 2014-01-06 08:39:08 EST
Created attachment 846089 [details]
host19 qemu logs
Comment 10 Michal Skrivanek 2014-01-07 03:53:20 EST
Then it looks like a pure libvirt crash.
Comment 11 Michal Skrivanek 2014-01-07 03:53:47 EST
Please take a look into the libvirt crash.
Comment 13 Jiri Denemark 2014-01-07 04:50:31 EST
Can you attach a gdb backtrace of all libvirt threads from when the daemon crashed?
Comment 14 Ilanit Stein 2014-01-09 09:43:25 EST
Created attachment 847679 [details]
core dump
Comment 15 Peter Krempa 2014-01-14 13:29:36 EST
Reproducer:

commit ba95d2d1220061ce86fd0b66f0a2f19a90c6b51c
Author: Peter Krempa <pkrempa@redhat.com>
Date:   Tue Jan 14 19:09:36 2014 +0100

    DO NOT APPLY UPSTREAM: Reproducer
    
    Apply this patch and run a patched libvirt daemon. Then start a VM and
    kill its process after a few seconds:
    
     # virsh start VM & sleep 3; killall -9 qemu-kvm
    
    Upstream version doesn't crash but the virsh command returns cryptic
    error message:
     error: invalid argument: monitor must not be NULL
    
    0.10.2 and similar downstream versions segfault:
    
    Program received signal SIGSEGV, Segmentation fault.
    [Switching to Thread 0x7fffed467700 (LWP 29256)]
    __pthread_mutex_lock (mutex=0x10) at pthread_mutex_lock.c:50
    50        unsigned int type = PTHREAD_MUTEX_TYPE (mutex);
    Missing separate debuginfos, use: debuginfo-install VirtualBox-4.2-4.2.22_91556_el6-1.x86_64
    (gdb) bt
     #0  __pthread_mutex_lock (mutex=0x10) at pthread_mutex_lock.c:50
     #1  0x00007fffe891585e in qemuDomainObjEnterMonitorInternal (driver=0x7fffe00858f0, driver_locked=true, obj=0x7fffe01689c0, asyncJob=<value optimized out>)
         at qemu/qemu_domain.c:1008
     #2  0x00007fffe892662a in qemuProcessDetectVcpuPIDs (driver=0x7fffe00858f0, vm=0x7fffe01689c0) at qemu/qemu_process.c:1831
     #3  0x00007fffe892c417 in qemuProcessStart (conn=0x7fffdc000ae0, driver=0x7fffe00858f0, vm=0x7fffe01689c0, migrateFrom=0x0, stdin_fd=-1, stdin_path=0x0, snapshot=0x0,
         vmop=VIR_NETDEV_VPORT_PROFILE_OP_CREATE, flags=1) at qemu/qemu_process.c:4022
     #4  0x00007fffe8973f3e in qemuDomainObjStart (conn=0x7fffdc000ae0, driver=0x7fffe00858f0, vm=0x7fffe01689c0, flags=<value optimized out>) at qemu/qemu_driver.c:6025

diff --git a/src/qemu/qemu_process.c b/src/qemu/qemu_process.c
index 9331744..a27eded 100644
--- a/src/qemu/qemu_process.c
+++ b/src/qemu/qemu_process.c
@@ -226,40 +226,42 @@ qemuConnectAgent(virQEMUDriverPtr driver, virDomainObjPtr vm)
         return 0;

     if (virSecurityManagerSetDaemonSocketLabel(driver->securityManager,
                                                vm->def) < 0) {
         VIR_ERROR(_("Failed to set security context for agent for %s"),
                   vm->def->name);
         goto cleanup;
     }

     /* Hold an extra reference because we can't allow 'vm' to be
      * deleted while the agent is active */
     virObjectRef(vm);

     ignore_value(virTimeMillisNow(&priv->agentStart));
     virObjectUnlock(vm);

     agent = qemuAgentOpen(vm,
                           config,
                           &agentCallbacks);

+    sleep(10);
+
     virObjectLock(vm);
     priv->agentStart = 0;

     if (virSecurityManagerClearSocketLabel(driver->securityManager,
                                            vm->def) < 0) {
         VIR_ERROR(_("Failed to clear security context for agent for %s"),
                   vm->def->name);
         goto cleanup;
     }

     if (agent == NULL)
         virObjectUnref(vm);

     if (!virDomainObjIsActive(vm)) {
         qemuAgentClose(agent);
         goto cleanup;
     }
     priv->agent = agent;

     if (priv->agent == NULL) {
Comment 16 Peter Krempa 2014-01-15 12:11:14 EST
Fixed upstream:

commit b952cbbccafd5ead8b5a70b2608a1d5a7f03b31e
Author: Peter Krempa <pkrempa@redhat.com>
Date:   Tue Jan 14 19:13:30 2014 +0100

    qemu: Avoid operations on NULL monitor if VM fails early
    
    https://bugzilla.redhat.com/show_bug.cgi?id=1047659
    
    If a VM dies very early during an attempted connect to the guest agent
    while the locks are down the domain monitor object will be freed. The
    object is then accessed later as any failure during guest agent startup
    isn't considered fatal.
    
    In the current upstream version this doesn't lead to a crash as
    virObjectLock called when entering the monitor in
    qemuProcessDetectVcpuPIDs checks the pointer before attempting to
    dereference (lock) it. The NULL pointer is then caught in the monitor
    helper code.
    
    Before the introduction of virObjectLockable - observed on 0.10.2 - the
    pointer is locked directly via virMutexLock leading to a crash.
    
    To avoid this problem we need to differentiate between the guest agent
    not being present and the VM quitting when the locks were down. The fix
    reorganizes the code in qemuConnectAgent to add the check and then adds
    special handling to the callers.
Comment 19 zhe peng 2014-01-16 05:45:09 EST
Thanks to Peter's help, I can reproduce this with build libvirt-0.10.2-29.el6_5.2.
Comment 23 zhe peng 2014-04-15 02:10:46 EDT
Verified with build:
libvirt-0.10.2-31.el6.x86_64

Steps:
1. Apply Peter's patch from comment 15.
2. Build the libvirt package and install it.
3. Prepare a guest with the guest agent configured.
4. # virsh start VM & sleep 3; killall -9 qemu-kvm
# error: Failed to start domain rhel6.5
error: internal error guest crashed while connecting to the guest agent


[1]+  Exit 1                  virsh start rhel6.5
# service libvirtd status
libvirtd (pid  1546) is running...

libvirtd did not crash; moving to VERIFIED.
Comment 25 errata-xmlrpc 2014-10-14 00:19:35 EDT
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHBA-2014-1374.html
