Bug 1098833
Summary: | After migration failed in last phase, didn't resume VM on source host | |
---|---|---|---
Product: | Red Hat Enterprise Linux 6 | Reporter: | David Gibson <dgibson>
Component: | libvirt | Assignee: | Jiri Denemark <jdenemar>
Status: | CLOSED ERRATA | QA Contact: | Virtualization Bugs <virt-bugs>
Severity: | high | Docs Contact: |
Priority: | high | |
Version: | 6.5 | CC: | bazulay, dyuan, fromani, iheim, jdenemar, jentrena, juzhou, lpeer, michal.skrivanek, mkalinin, mzhan, ofrenkel, rbalakri, shyu, tdosek, yeylon, zhwang, zpeng
Target Milestone: | rc | Keywords: | ZStream
Target Release: | 6.6 | |
Hardware: | All | |
OS: | Linux | |
Whiteboard: | virt | |
Fixed In Version: | libvirt-0.10.2-37.el6 | Doc Type: | Bug Fix
Doc Text: | | Story Points: | ---
Clone Of: | | Environment: |
Last Closed: | 2014-10-14 04:21:53 UTC | Type: | Bug
Regression: | --- | Mount Type: | ---
Documentation: | --- | CRM: |
Verified Versions: | | Category: | ---
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: |
Cloudforms Team: | --- | Target Upstream Version: |
Embargoed: | | |
Bug Depends On: | | |
Bug Blocks: | 1100642 | |
Description
David Gibson
2014-05-19 01:11:00 UTC
Francesco, I've just attached the source and destination vdsm logs (private, since they contain some customer details). Let me know if you need logs from earlier or later than those. I have the source and destination sosreports, so let me know if you need the libvirtd or qemu logs too. I didn't see anything useful in the source libvirtd or qemu logs, but I haven't had a chance to look over the destination logs yet.

David, since this happened at the libvirt level we'd need the libvirt logs (and maybe the qemu logs) as well. So far it doesn't look like anything caused by vdsm (other than assuming the source is in the UP state at the end of migration), but let's see the libvirt logs to tell for sure.

So, as Julio says, unfortunately we don't have the destination libvirt logs. Julio has the extract from the source libvirt logs, but I'm pretty sure there's nothing useful in there.
But in any case, for the purposes of this BZ, I'm not actually interested in what caused the initial failure (I'm pursuing that separately with the customer; it may have been some network reconfiguration which was ongoing at the time).
But:
> So far it doesn't look like anything caused by vdsm(other than assuming src UP state at the end of migration),
The assumption that the source is UP after the failure is exactly the bug I'm reporting here.
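For reference, since the destination libvirt logs were not available in this case, here is a minimal sketch of how verbose libvirtd logging could be enabled on both hosts ahead of a future occurrence. The log level, output path, and restart command are illustrative assumptions, not settings taken from this report:

```bash
# Illustrative only: turn on debug logging in libvirtd so the libvirt side of a
# future migration failure is captured (run on both source and destination).
cat >> /etc/libvirt/libvirtd.conf <<'EOF'
log_level = 1
log_outputs="1:file:/var/log/libvirt/libvirtd.log"
EOF
# RHEL 6 style service restart; adjust for the local init system.
service libvirtd restart
```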
(In reply to David Gibson from comment #8) Right, but the expectation from libvirt is that the VM will remain UP if migration fails for whatever reason. Francesco, please update with your findings; to me this looks like a libvirt problem.

What happens here is that libvirt does indeed try to recover the VM whose migration failed:

    2014-05-14 22:04:03.831+0000: 6588: debug : qemuMonitorIOWrite:462 : QEMU_MONITOR_IO_WRITE: mon=0x7f346c05e6c0 buf={"execute":"query-migrate","id":"libvirt-9407"}^M len=49 ret=49 errno=11
    2014-05-14 22:04:03.832+0000: 6588: debug : qemuMonitorIOProcess:354 : QEMU_MONITOR_IO_PROCESS: mon=0x7f346c05e6c0 buf={"return": {"status": "active", "total-time": 42101, "ram": {"total": 6543581184, "remaining": 1029115904, "transferred": 3257235874}}, "id": "libvirt-9407"}^M len=159
    2014-05-14 22:04:03.833+0000: 6607: debug : virConnectIsAlive:19720 : conn=0x7f349833fbd0
    2014-05-14 22:04:03.883+0000: 6607: debug : qemuDomainObjBeginJobInternal:808 : Starting job: async nested (async=migration out)
    2014-05-14 22:04:03.883+0000: 6607: debug : qemuMonitorGetMigrationStatus:1871 : mon=0x7f346c05e6c0
    2014-05-14 22:04:03.883+0000: 6607: debug : qemuMonitorSend:904 : QEMU_MONITOR_SEND_MSG: mon=0x7f346c05e6c0 msg={"execute":"query-migrate","id":"libvirt-9408"}^M fd=-1
    2014-05-14 22:04:03.883+0000: 6588: warning : virKeepAliveTimerInternal:156 : No response from client 0x7f349847f780 after 5 keepalive messages in 30 seconds
    2014-05-14 22:04:03.883+0000: 6588: debug : qemuMonitorIOWrite:462 : QEMU_MONITOR_IO_WRITE: mon=0x7f346c05e6c0 buf={"execute":"query-migrate","id":"libvirt-9408"}^M len=49 ret=49 errno=11
    2014-05-14 22:04:03.884+0000: 6588: debug : qemuMonitorIOProcess:354 : QEMU_MONITOR_IO_PROCESS: mon=0x7f346c05e6c0 buf={"return": {"status": "active", "total-time": 42154, "ram": {"total": 6543581184, "remaining": 1018503168, "transferred": 3267832235}}, "id": "libvirt-9408"}^M len=159
    2014-05-14 22:04:03.885+0000: 6607: debug : virConnectIsAlive:19720 : conn=0x7f349833fbd0
    2014-05-14 22:04:03.885+0000: 6607: error : qemuMigrationWaitForCompletion:1096 : operation failed: Lost connection to destination host
    2014-05-14 22:04:03.885+0000: 6607: debug : doPeer2PeerMigrate3:2626 : Finish3 0x7f349833fbd0 ret=-1
    2014-05-14 22:04:03.885+0000: 6607: error : virNetClientSendInternal:1962 : internal error client socket is closed
    2014-05-14 22:04:03.885+0000: 6607: debug : doPeer2PeerMigrate3:2659 : Confirm3 0x7f349813c7a0 cancelled=1 vm=0x7f346c060f00
    2014-05-14 22:04:03.885+0000: 6607: debug : qemuMigrationConfirm:3310 : driver=0x7f349800b2e0, conn=0x7f349813c7a0, vm=0x7f346c060f00, cookiein=(null), cookieinlen=0, flags=1003, retcode=1
    2014-05-14 22:04:03.885+0000: 6607: debug : qemuDomainObjSetJobPhase:689 : Setting 'migration out' phase to 'confirm3_cancelled'
    2014-05-14 22:04:03.886+0000: 6607: debug : qemuMigrationEatCookie:758 : cookielen=0 cookie='(null)'
    2014-05-14 22:04:03.886+0000: 6607: debug : qemuProcessStartCPUs:2870 : Using lock state '(null)'
    2014-05-14 22:04:03.886+0000: 6607: debug : qemuDomainObjBeginJobInternal:808 : Starting job: async nested (async=migration out)
    2014-05-14 22:04:03.886+0000: 6607: debug : qemuMonitorStartCPUs:1197 : mon=0x7f346c05e6c0
    2014-05-14 22:04:03.886+0000: 6607: debug : qemuMonitorSend:904 : QEMU_MONITOR_SEND_MSG: mon=0x7f346c05e6c0 msg={"execute":"cont","id":"libvirt-9409"}^M fd=-1
    2014-05-14 22:04:03.886+0000: 6588: debug : qemuMonitorIOWrite:462 : QEMU_MONITOR_IO_WRITE: mon=0x7f346c05e6c0 buf={"execute":"cont","id":"libvirt-9409"}^M len=40 ret=40 errno=11
    2014-05-14 22:04:03.887+0000: 6588: debug : qemuMonitorIOProcess:354 : QEMU_MONITOR_IO_PROCESS: mon=0x7f346c05e6c0 buf={"return": {}, "id": "libvirt-9409"}^M

A few seconds later, however, libvirt receives a STOP event through the QEMU monitor and pauses the VM accordingly:

    2014-05-14 22:04:11.474+0000: 6610: debug : virConnectIsAlive:19720 : conn=0x7f34782e3b30
    2014-05-14 22:04:11.474+0000: 6588: debug : qemuMonitorIOProcess:354 : QEMU_MONITOR_IO_PROCESS: mon=0x7f346c05e6c0 buf={"timestamp": {"seconds": 1400105051, "microseconds": 474529}, "event": "STOP"}^M len=81
    2014-05-14 22:04:11.474+0000: 6588: debug : qemuMonitorEmitStop:1018 : mon=0x7f346c05e6c0
    2014-05-14 22:04:11.474+0000: 6588: debug : qemuProcessHandleStop:728 : Transitioned guest drx1bi2bfuiz1 to paused state
    2014-05-14 22:04:11.475+0000: 6588: debug : qemuProcessHandleStop:738 : Preserving lock state '(null)'

The STOP event doesn't seem related to the migration, as that should already have concluded. I don't think the QEMU logs will help shed further light on this, but I'll be happy to dig into them if needed. Overall, I do think the libvirt behaviour is correct here. Regarding VDSM, given that libvirt automatically tries to resume the VM, I don't think there is room for improvement there either: the VM should be un-paused only with the engine's consent.
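As a hedged aside, here are two ways one could confirm from the source host why the guest ended up paused in a situation like this. The domain name "rhel6.5" is a placeholder taken from the reproduction steps below, and the exact reason string printed may vary:

```bash
# Ask libvirt for the state plus the reason it recorded for the pause.
virsh domstate --reason rhel6.5
# e.g. "paused (migrating)" would point at a migration-related stop

# Or query QEMU directly over the monitor; query-status is a standard QMP command.
virsh qemu-monitor-command rhel6.5 '{"execute":"query-status"}'
```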
Unfortunately, libvirt's behavior is not correct here. After libvirtd detects the broken connection to the destination daemon, it does not send the migrate_cancel QMP command to the source qemu-kvm process. Thus, if only the libvirtd connection is broken while the connection between the source and destination qemu-kvm processes is still alive, the migration keeps running, and once it finishes, the source QEMU stops the virtual CPUs.

To reproduce this bug, one just needs to follow a few simple steps:

1) Start a peer-to-peer migration.
2) Before the migration finishes, kill libvirtd on the destination host.
3) The source libvirtd will report that the migration failed because of the broken connection to the destination, but the domain is in fact live-migrated: it becomes paused as soon as QEMU finishes transferring the migration data to the destination QEMU.

Patch sent upstream for review: https://www.redhat.com/archives/libvir-list/2014-May/msg00746.html

I can reproduce this issue. Steps:

1. Soft-mount the NFS storage on both the source and the destination:
    10.66.106.33:/var/lib/libvirt/images/ on /var/lib/libvirt/migrate type nfs (rw,soft,timeo=15,retrans=3,nosharecache,vers=4,addr=10.66.106.33,clientaddr=10.66.100.103)
2. Set the migration speed to 1 so the migration takes a long time to finish:
    # virsh migrate-setspeed rhel6.5 1
3. Start the migration:
    # virsh migrate --live --p2p rhel6.5 qemu+ssh://10.66.100.102/system --verbose
4. Before the migration finishes, kill libvirtd on the destination:
    # kill -9 `pidof libvirtd`
5. On the source, change the speed back:
    # virsh migrate-setspeed rhel6.5 8796093022207

On the source host I get this error:
    Migration: [ 0 %]error: operation failed: Lost connection to destination host

Checking the guest status:
    # virsh list --all
     Id    Name                           State
    ----------------------------------------------------
     16    rhel6.5                        paused
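The reproduction steps above lend themselves to a small script. This is a rough sketch under stated assumptions: the guest name, destination address, background/sleep timing, and passwordless root ssh access to the destination are all placeholders, not part of the original report:

```bash
#!/bin/bash
# Sketch of the reproduction flow described above (placeholder names/addresses).
GUEST=rhel6.5
DEST=10.66.100.102

# Throttle the migration so there is time to break the destination daemon.
virsh migrate-setspeed "$GUEST" 1

# Start a live peer-to-peer migration in the background.
virsh migrate --live --p2p "$GUEST" qemu+ssh://$DEST/system --verbose &

# Before it completes, kill libvirtd on the destination host.
sleep 10
ssh root@"$DEST" 'kill -9 $(pidof libvirtd)'

# Wait for the failed migration command, restore the speed, check the state.
wait
virsh migrate-setspeed "$GUEST" 8796093022207
virsh domstate "$GUEST"   # with the unfixed build the guest ends up "paused"
```

Per the analysis above, on an unfixed build issuing `virsh qemu-monitor-command "$GUEST" '{"execute":"migrate_cancel"}'` before QEMU finishes transferring would be a manual way to keep the guest running; the missing cancellation on the libvirt side is exactly what this bug is about.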
I can reproduce the problem with libvirt-0.10.2-36.el6.x86_64, following comment 15. Verified with the new build, libvirt-0.10.2-37.el6.x86_64. Verification steps:

1. Prepare an NFS server and soft-mount the export on both the source and target hosts:
    # mount -t nfs $nfs_ip:/var/lib/libvirt/images/ /var/lib/libvirt/migrate -o soft,timeo=15,retrans=3,nosharecache,vers=4,addr=$nfs_ip,clientaddr=source_ip
2. Prepare a healthy guest using an image on the NFS mount and set the migration speed to a low value so the migration takes a long time to finish:
    # virsh migrate-setspeed test2 5
3. Start the migration:
    # virsh migrate --live --p2p test2 qemu+ssh://$target_ip/system --verbose
4. Before the migration finishes, kill libvirtd on the destination:
    # kill -9 `pidof libvirtd`
5. Check the source side. The source host reports:
    Migration: [ 80 %]error: operation failed: Lost connection to destination host
   Checking the guest status:
    # virsh domstate test2
    running

So, moving to VERIFIED.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHBA-2014-1374.html
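For completeness, a minimal sketch of the pass/fail check implied by the verification steps above, assuming the same guest name ("test2") and that it is run on the source host after the failed migration:

```bash
# With the fixed build (libvirt-0.10.2-37.el6) the guest should stay running
# on the source host; with the unfixed build it ends up paused.
state=$(virsh domstate test2)
if [ "$state" = "running" ]; then
    echo "PASS: guest kept running on the source after the failed migration"
else
    echo "FAIL: guest is '$state' on the source"
fi
```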