Bug 1083238
Summary: | Migration of a VM fails if another API allowed during migration takes a long time to finish | |
---|---|---|---
Product: | Red Hat Enterprise Linux 6 | Reporter: | Anitha Udgiri <audgiri>
Component: | libvirt | Assignee: | Jiri Denemark <jdenemar>
Status: | CLOSED ERRATA | QA Contact: | Virtualization Bugs <virt-bugs>
Severity: | urgent | Docs Contact: |
Priority: | urgent | |
Version: | 6.0 | CC: | audgiri, bazulay, dyuan, ederevea, iheim, jdenemar, jentrena, lpeer, mzhan, ofrenkel, pdwyer, rbalakri, stefanha, tdosek, ukar, vfeenstr, ydu, yeylon, zhwang, zpeng
Target Milestone: | rc | Keywords: | Upstream, ZStream
Target Release: | 6.6 | |
Hardware: | Unspecified | |
OS: | Unspecified | |
Whiteboard: | virt | |
Fixed In Version: | libvirt-0.10.2-37.el6 | Doc Type: | Bug Fix
Doc Text: | | Story Points: | ---
Clone Of: | | |
Cloned As: | 1122908 (view as bug list) | Environment: |
Last Closed: | 2014-10-14 04:20:53 UTC | Type: | Bug
Regression: | --- | Mount Type: | ---
Documentation: | --- | CRM: |
Verified Versions: | | Category: | ---
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: |
Cloudforms Team: | --- | Target Upstream Version: |
Embargoed: | | |
Bug Depends On: | | |
Bug Blocks: | 1100671, 1122908 | |
Description (Anitha Udgiri, 2014-04-01 18:18:48 UTC)
Anitha, could you also attach the libvirtd log from the target host covering the same time frame as the log from the source? And also /var/log/libvirt/qemu/DOMAIN_NAME.log from both the source and the target hosts?

The "Thread-6067683::ERROR::2014-03-14 15:04:02,084::vm::285::vm.Vm::(run) vmId=`794d339d-f7b5-40f3-8b4f-e1fd29fd2832`::Failed to migrate" error seems to be a result of an overloaded host. While the domain was in the process of migrating to another host, vdsm called virDomainBlockStats:

```
2014-03-14 15:02:55.123+0000: 15425: debug : virDomainBlockStats:7200 : dom=0x7f49900a1e40, (VM: name=XXXX, uuid=794d339d-f7b5-40f3-8b4f-e1fd29fd2832), disk=hdc, stats=0x7f499cd8ead0, size=40
2014-03-14 15:03:14.634+0000: 15425: debug : qemuDomainObjBeginJobInternal:808 : Starting job: query (async=migration out)
2014-03-14 15:03:14.634+0000: 15425: debug : qemuMonitorGetBlockStatsInfo:1468 : mon=0x7f498415f420 dev=ide0-1-0
2014-03-14 15:03:14.634+0000: 15425: debug : qemuMonitorSend:903 : QEMU_MONITOR_SEND_MSG: mon=0x7f498415f420 msg={"execute":"query-blockstats","id":"libvirt-1093563"} fd=-1
```

While this thread was waiting for the result of the query-blockstats QMP command, the migration thread wanted to check the state of the migration. For that it needs exclusive access to the monitor, which was still in use by thread 15425, so waiting for the job timed out:

```
2014-03-14 15:03:44.000+0000: 15429: warning : qemuDomainObjBeginJobInternal:839 : Cannot start job (async nested, none) for domain XXXX; current job is (query, migration out) owned by (15425, 15429)
2014-03-14 15:03:44.000+0000: 15429: error : qemuDomainObjBeginJobInternal:843 : Timed out during operation: cannot acquire state change lock
```

Libvirt considered this a fatal error (such timeouts are signs of a non-responsive or hung qemu-kvm process) and cancelled the migration:

```
2014-03-14 15:03:44.000+0000: 15429: debug : doPeer2PeerMigrate3:2544 : Finish3 0x7f4984160710 ret=-1
2014-03-14 15:03:55.220+0000: 15429: error : virNetClientProgramDispatchError:174 : An error occurred, but the cause is unknown
2014-03-14 15:03:55.220+0000: 15429: debug : doPeer2PeerMigrate3:2577 : Confirm3 0x7f498c00dd00 cancelled=1 vm=0x7f498416a300
2014-03-14 15:03:55.220+0000: 15429: debug : qemuMigrationConfirm:3227 : driver=0x7f49900f5fa0, conn=0x7f498c00dd00, vm=0x7f498416a300, cookiein=(null), cookieinlen=0, flags=3, retcode=1
2014-03-14 15:03:55.220+0000: 15429: debug : qemuDomainObjSetJobPhase:689 : Setting 'migration out' phase to 'confirm3_cancelled'
2014-03-14 15:03:55.220+0000: 15429: debug : qemuMigrationEatCookie:755 : cookielen=0 cookie='(null)'
2014-03-14 15:03:55.220+0000: 15429: debug : qemuProcessStartCPUs:2825 : Using lock state '(null)'
2014-03-14 15:04:01.948+0000: 15429: debug : qemuDomainObjBeginJobInternal:808 : Starting job: async nested (async=migration out)
2014-03-14 15:04:01.948+0000: 15429: debug : qemuMonitorStartCPUs:1187 : mon=0x7f498415f420
2014-03-14 15:04:01.948+0000: 15429: debug : qemuMonitorSend:903 : QEMU_MONITOR_SEND_MSG: mon=0x7f498415f420 msg={"execute":"cont","id":"libvirt-1093564"}
```

Note the "An error occurred, but the cause is unknown" returned by the Finish method called on the target host. We need the logs from there to see what happened, but I guess the domain actually disappeared by itself on the target host before we attempted to kill it explicitly.
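To make the failing pattern concrete, here is a minimal sketch (not the actual vdsm code) of the two concurrent operations described above: one thread polling virDomainBlockStats() while the main thread live-migrates the same domain. The domain name, disk target, and destination URI are illustrative placeholders.

```c
/* Minimal sketch of the API pattern behind this bug: a stats thread
 * calls virDomainBlockStats() while the main thread live-migrates the
 * same domain.  "XXXX", "hdc" and the destination URI are placeholders.
 * Build with: gcc repro.c -o repro -lvirt -lpthread */
#include <libvirt/libvirt.h>
#include <pthread.h>
#include <stdio.h>
#include <unistd.h>

static volatile int migrating = 1;

static void *poll_block_stats(void *arg)
{
    virDomainPtr dom = arg;
    virDomainBlockStatsStruct stats;

    while (migrating) {
        /* On affected libvirt versions, if QEMU answers the underlying
         * query-blockstats slowly, this call holds the monitor job long
         * enough for the migration thread's status poll to hit the 30s
         * "cannot acquire state change lock" timeout. */
        if (virDomainBlockStats(dom, "hdc", &stats, sizeof(stats)) == 0)
            printf("rd_bytes=%lld wr_bytes=%lld\n",
                   (long long)stats.rd_bytes, (long long)stats.wr_bytes);
        sleep(1);
    }
    return NULL;
}

int main(void)
{
    virConnectPtr conn = virConnectOpen("qemu:///system");
    virDomainPtr dom = conn ? virDomainLookupByName(conn, "XXXX") : NULL;
    pthread_t tid;

    if (!dom) {
        fprintf(stderr, "cannot connect or find domain\n");
        return 1;
    }

    pthread_create(&tid, NULL, poll_block_stats, dom);

    /* Live peer-to-peer migration, the mode vdsm uses. */
    if (virDomainMigrateToURI(dom, "qemu+tcp://target-host/system",
                              VIR_MIGRATE_LIVE | VIR_MIGRATE_PEER2PEER,
                              NULL, 0) < 0)
        fprintf(stderr, "migration failed\n");

    migrating = 0;
    pthread_join(tid, NULL);
    virDomainFree(dom);
    virConnectClose(conn);
    return 0;
}
```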
The original query-blockstats request finished just before the end of the cancelled migration:

```
2014-03-14 15:04:01.946+0000: 15413: debug : qemuMonitorIOProcess:353 : QEMU_MONITOR_IO_PROCESS: mon=0x7f498415f420 buf={"return": [{"device": "drive-ide0-1-0", "stats": {"flush_total_time_ns": 0, "wr_highest_offset": 0, "wr_total_time_ns": 0, "wr_bytes": 0, "rd_total_time_ns": 0, "flush_operations": 0, "wr_operations": 0, "rd_bytes": 0, "rd_operations": 0}}, {"device": "drive-virtio-disk0", "parent": {"stats": {"flush_total_time_ns": 0, "wr_highest_offset": 46814906368, "wr_total_time_ns": 0, "wr_bytes": 0, "rd_total_time_ns": 0, "flush_operations": 0, "wr_operations": 0, "rd_bytes": 0, "rd_operations": 0}}, "stats": {"flush_total_time_ns": 0, "wr_highest_offset": 46814906368, "wr_total_time_ns": 88839345819632, "wr_bytes": 361862714368, "rd_total_time_ns": 13164079237085, "flush_operations": 0, "wr_operations": 33862105, "rd_bytes": 249739890688, "rd_operations": 9968284}}], "id": "libvirt-1093563"} len=796
2014-03-14 15:04:01.947+0000: 15425: debug : qemuDomainObjEndJob:937 : Stopping job: query (async=migration out)
2014-03-14 15:04:01.948+0000: 15425: debug : virDomainFree:2281 : dom=0x7f49900a1e40, (VM: name=XXXX, uuid=794d339d-f7b5-40f3-8b4f-e1fd29fd2832)
```

(In reply to Jiri Denemark from comment #10)
> Anitha, could you also attach the libvirtd log from the target host covering
> the same time frame as the log from the source? And also
> /var/log/libvirt/qemu/DOMAIN_NAME.log from both the source and the target
> hosts?

Jiri, attached the same.

Thanks, so the "An error occurred, but the cause is unknown" error message is a harmless bug in qemuMigrationFinish. If the source aborts the migration for whatever reason and calls virDomainMigrateFinish3 with a non-zero cancelled argument, the API returns NULL (normally used to indicate an error) without setting any error (logically, because it correctly killed QEMU and cleaned up after the cancelled incoming migration).

So there's apparently a bug in our migration code that causes migration to be aborted if QEMU is slow to respond.

The "Domain not found: no domain with matching uuid" error only shows up on the destination and is a result of something (likely vdsm) calling virDomainLookupByUUID on the domain just after it was killed because the source cancelled the migration.

(In reply to Jiri Denemark from comment #17)
> So there's apparently a bug in our migration code that causes migration to
> be aborted if QEMU is slow to respond.
>
> The "Domain not found: no domain with matching uuid" error only shows up on
> the destination and is a result of something (likely vdsm) calling
> virDomainLookupByUUID on the domain just after it was killed because the
> source cancelled the migration.

Thanks Jiri. What are the possible circumstances where the QEMU response is slow? I ask because in one of the cases, one of the VMs fails to migrate, possibly due to heavy load. But the same VM, when migrated separately under a perceivably lighter load, continues to fail with the same error in one case but succeeds in another case.

Hard to say what causes QEMU to respond so slowly. We would need a QEMU expert to help answer this question. Stefan, do you have any idea why query-blockstats would be so slow (more than 30 seconds) to respond? In any case, I'm afraid we won't be able to fix this bug without replacing our query-migrate polling with a migration-completed event, which was apparently forgotten (see bug 580006).
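For reference, the query-migrate polling mentioned above is internal to libvirt's qemu driver, but a management client can watch the same migration progress from the outside with virDomainGetJobInfo(). A rough sketch, assuming a migration of a placeholder domain "XXXX" was already started elsewhere (e.g. by vdsm or virsh):

```c
/* Rough client-side analog of the query-migrate polling discussed
 * above: watch an in-progress migration via virDomainGetJobInfo().
 * The domain name is a placeholder; the migration itself is assumed
 * to have been started by another client. */
#include <libvirt/libvirt.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    virConnectPtr conn = virConnectOpen("qemu:///system");
    virDomainPtr dom = conn ? virDomainLookupByName(conn, "XXXX") : NULL;
    virDomainJobInfo info;

    if (!dom)
        return 1;

    /* A migration shows up as an unbounded job; poll until it ends. */
    while (virDomainGetJobInfo(dom, &info) == 0 &&
           info.type == VIR_DOMAIN_JOB_UNBOUNDED) {
        if (info.dataTotal > 0)
            printf("migrated %llu of %llu bytes\n",
                   info.dataProcessed, info.dataTotal);
        sleep(1);
    }
    printf("job finished or no longer running\n");

    virDomainFree(dom);
    virConnectClose(conn);
    return 0;
}
```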
Actually, until the event is supported by qemu, the migration thread could possibly ignore lock timeouts when it asks for migration status. I'll think about it more and see if it's doable.

(In reply to Jiri Denemark from comment #16)
> Thanks, so the "An error occurred, but the cause is unknown" error message
> is a harmless bug in qemuMigrationFinish. If the source aborts the
> migration for whatever reason and calls virDomainMigrateFinish3 with a
> non-zero cancelled argument, the API returns NULL (normally used to
> indicate an error) without setting any error (logically, because it
> correctly killed QEMU and cleaned up after the cancelled incoming
> migration).

IMO this should be changed to reflect that the migration was cancelled. It's not correct that the cause is unknown.

(In reply to Jiri Denemark from comment #19)
> Hard to say what causes QEMU to respond so slowly. We would need a QEMU
> expert to help answer this question. Stefan, do you have any idea why
> query-blockstats would be so slow (more than 30 seconds) to respond?

Jiri, sorry for the late response. I looked at qemu-kvm-0.12.1.2-2.415.el6_5.7 query-blockstats: the query-blockstats command itself is lightweight and does not perform blocking operations. I suspect that something else (migration) is monopolizing the QEMU main loop. If you'd like to find the root cause, we'd need to look at the stack trace of the QEMU process (e.g. gstack $(pgrep qemu)).

(In reply to Jiri Denemark from comment #20)
> Actually, until the event is supported by qemu, the migration thread could
> possibly ignore lock timeouts when it asks for migration status. I'll think
> about it more and see if it's doable.

This solution should work and seems reasonable until we have the event implemented in qemu. I'll propose a patch upstream.

Patches implementing the change suggested in comment #20 were sent upstream for review: https://www.redhat.com/archives/libvir-list/2014-May/msg00403.html

This is fixed upstream by v1.2.4-60-g1643b15..v1.2.4-63-gcc4882f:

```
commit 1643b158f6de87a133a5d8a80f69786055f709b1
Author: Jiri Denemark <jdenemar>
Date:   Tue May 13 14:34:55 2014 +0200

    qemu: Make qemuDomainObjBeginNestedJob static

    It's only used within qemu_domain.c.

    Signed-off-by: Jiri Denemark <jdenemar>

commit dca377dbd63a514963a06f4f15aa61fb8cb7f19b
Author: Jiri Denemark <jdenemar>
Date:   Tue May 13 14:39:35 2014 +0200

    qemuDomainObjBeginNestedJob: Return -2 for temporary failures

    If job queue is full or waiting for a job times out, the function
    returns -2 so that it can be handled in a different way by callers.
    The change is safe since all existing callers of
    qemuDomainObjBeginNestedJob check the return value to be less than
    zero.

    Signed-off-by: Jiri Denemark <jdenemar>

commit b5e0532e0446c6255ecedc2d459f4d47172b9225
Author: Jiri Denemark <jdenemar>
Date:   Wed May 14 13:22:34 2014 +0200

    qemu: Make qemuProcess{Start,Stop}CPUs easier to follow

    As a side effect, the return value of qemuDomainObjEnterMonitorAsync
    is not directly used as the return value of
    qemuProcess{Start,Stop}CPUs.

    Signed-off-by: Jiri Denemark <jdenemar>

commit cc4882f4ab470334b37e6dd8e65cba60c16e868a
Author: Jiri Denemark <jdenemar>
Date:   Tue May 13 14:45:31 2014 +0200

    qemu: Ignore temporary job errors when checking migration status

    When qemu driver is polling for migration to finish (in
    qemuMigrationWaitForCompletion), it may happen that another job
    allowed during migration is running and if it does not finish within
    30 seconds, migration would be cancelled because of that. However,
    we can just ignore the timeout and let the waiting loop try again
    later.

    If an event fired at the end of migration is ever implemented in
    QEMU, we can just wait for the event instead of polling for
    migration status and libvirt will behave consistently, i.e.,
    migration won't be cancelled in case another job started during
    migration takes long time to finish.

    For bug https://bugzilla.redhat.com/show_bug.cgi?id=1083238

    Signed-off-by: Jiri Denemark <jdenemar>
```
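Reduced to a sketch, the control flow these commits introduce looks roughly like the following. This is a simplified illustration, not the actual libvirt source; the helper functions are stubs standing in for the qemu driver internals.

```c
/* Simplified sketch of the fix described in the commits above: the
 * migration wait loop treats -2 ("job queue full or waiting for the
 * job timed out") as a temporary condition and polls again, instead
 * of cancelling the migration.  Not the real libvirt code. */
#include <stdio.h>
#include <unistd.h>

/* Stands in for qemuDomainObjBeginNestedJob(): 0 on success,
 * -2 on a temporary failure, -1 on a fatal error. */
static int begin_nested_job(void)
{
    return 0;  /* stub: always succeeds */
}

/* Stands in for querying migration status over the monitor. */
static int fetch_migration_status(int *completed)
{
    *completed = 1;  /* stub: pretend migration just finished */
    return 0;
}

static int wait_for_completion(void)
{
    int completed = 0;

    while (!completed) {
        int rc = begin_nested_job();

        if (rc == -2) {
            /* Another job allowed during migration (such as the
             * query-blockstats call in this report) still holds the
             * monitor: skip this poll and retry later, rather than
             * treating the timeout as a hung QEMU and aborting. */
            sleep(1);
            continue;
        }
        if (rc < 0)
            return -1;  /* real error: abort the migration */

        if (fetch_migration_status(&completed) < 0)
            return -1;
        /* ...end the nested job, publish progress to callers... */
        if (!completed)
            sleep(1);
    }
    return 0;
}

int main(void)
{
    printf("migration %s\n",
           wait_for_completion() == 0 ? "completed" : "failed");
    return 0;
}
```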
Reproduction steps:

1. In the first terminal, set the breakpoint in libvirtd.

1.1. Find the libvirtd PID:

```
# ps aux | grep libvirtd
root 15352 0.5 0.1 831628 15368 ? Sl 16:16 0:00 libvirtd --daemon
root 15486 0.0 0.0 103252 836 pts/2 S+ 16:17 0:00 grep libvirtd
```

1.2. Attach gdb and set the breakpoint:

```
# gdb libvirtd 15352
(gdb) br qemuMonitorGetBlockStatsInfo
Breakpoint 1 at 0x477b90: file qemu/qemu_monitor.c, line 1483.
(gdb) i b
Num Type       Disp Enb Address            What
1   breakpoint keep y   0x0000000000477b90 in qemuMonitorGetBlockStatsInfo at qemu/qemu_monitor.c:1483
(gdb) r
The program being debugged has been started already.
Start it from the beginning? (y or n) y
```

2. Open the second terminal and perform the migration.

2.1. Prepare the migration environment and start the guest:

```
# virsh start rhel7k
Domain rhel7k started
```

2.2. Set the migration speed to a low value:

```
# virsh migrate-setspeed rhel7k 1
# virsh migrate-getspeed rhel7k
1
```

2.3. Start the migration:

```
# virsh migrate --live rhel7k qemu+ssh://$target_ip/system --verbose
```

3. Before the migration finishes, execute the following command in a third terminal; gdb stops libvirtd at the breakpoint in the first terminal:

```
# virsh domblkstat rhel7k hda
```

4. Open a fourth terminal and stop the qemu process:

```
# ps aux | grep $guest_name
# kill -STOP $qemu_process_id
```

5. Let libvirtd continue running in the first terminal:

```
(gdb) c
Continuing.
```

6. Wait about 30 seconds; the migration fails in the second terminal:

```
# virsh migrate --live rhel7k qemu+ssh://$target_ip/system --verbose
root@$target_ip's password:
Migration: [ 67 %]error: Timed out during operation: cannot acquire state change lock
```

Verified this bug with libvirt-0.10.2-37.el6; the following were the verification steps:

1. Execute steps 1-5 of the reproduction steps above.
2. Wait for more than 30 seconds; the migration does not fail in the second terminal, it just waits for the other job to finish.
3. Run the following command in the fourth terminal; the guest continues to migrate and successfully migrates to the target:

```
# kill -CONT $qemu_process_id
```

According to the above, marking this bug verified.

(In reply to zhenfeng wang from comment #31)
> 1. In the first terminal, set the breakpoint in libvirtd.
>
> 1.1. Find the libvirtd PID:
>
> # ps aux | grep libvirtd
> root 15352 0.5 0.1 831628 15368 ? Sl 16:16 0:00 libvirtd --daemon
> root 15486 0.0 0.0 103252 836 pts/2 S+ 16:17 0:00 grep libvirtd
>
> 1.2. Attach gdb and set the breakpoint:
>
> # gdb libvirtd 15352
> (gdb) br qemuMonitorGetBlockStatsInfo
> Breakpoint 1 at 0x477b90: file qemu/qemu_monitor.c, line 1483.
> (gdb) r
> The program being debugged has been started already.
> Start it from the beginning? (y or n) y

Sorry, I forgot to give the package used for reproducing in comment 31; I reproduced this bug with libvirt-0.10.2-35.el6.x86_64.rpm.
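As an aside, steps 4-6 of the reproduction and step 3 of the verification (freezing QEMU past the 30-second job-wait timeout, then resuming it) can also be driven by a small program instead of manual kill commands. A sketch, with the QEMU PID supplied on the command line:

```c
/* Sketch of the "hung QEMU" simulation from the reproduction steps:
 * SIGSTOP freezes the qemu-kvm process so the pending query-blockstats
 * never returns, and SIGCONT after the 30s job-wait timeout lets the
 * migration (on fixed libvirt) pick up where it left off.
 * Usage: ./freeze-qemu <qemu_pid>   (PID found via ps, as in step 4) */
#include <signal.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>
#include <unistd.h>

int main(int argc, char **argv)
{
    pid_t pid;

    if (argc != 2) {
        fprintf(stderr, "usage: %s <qemu_pid>\n", argv[0]);
        return 1;
    }
    pid = (pid_t)atoi(argv[1]);

    if (kill(pid, SIGSTOP) < 0) {   /* step 4: freeze QEMU */
        perror("kill(SIGSTOP)");
        return 1;
    }
    sleep(40);                      /* longer than the 30s timeout */
    if (kill(pid, SIGCONT) < 0) {   /* verification step 3: resume */
        perror("kill(SIGCONT)");
        return 1;
    }
    return 0;
}
```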
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHBA-2014-1374.html