Bug 1098833 - After migration failed in last phase, didn't resume VM on source host
Summary: After migration failed in last phase, didn't resume VM on source host
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Linux 6
Classification: Red Hat
Component: libvirt
Version: 6.5
Hardware: All
OS: Linux
Priority: high
Severity: high
Target Milestone: rc
Target Release: 6.6
Assignee: Jiri Denemark
QA Contact: Virtualization Bugs
URL:
Whiteboard: virt
Depends On:
Blocks: 1100642
 
Reported: 2014-05-19 01:11 UTC by David Gibson
Modified: 2019-04-29 01:43 UTC
CC List: 18 users

Fixed In Version: libvirt-0.10.2-37.el6
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2014-10-14 04:21:53 UTC
Target Upstream Version:
Embargoed:


Attachments


Links
Red Hat Knowledge Base (Solution) 885563, Status: None, Last Updated: Never
Red Hat Product Errata RHBA-2014:1374, Priority: normal, Status: SHIPPED_LIVE, Summary: libvirt bug fix and enhancement update, Last Updated: 2014-10-14 08:11:54 UTC

Description David Gibson 2014-05-19 01:11:00 UTC
Description of problem:

It looks like a migration failed because the connection to the destination libvirtd was lost during the final phase, while the VM was paused.

The source VM has remained paused, rather than being resumed automatically after the migration failure.

Version-Release number of selected component (if applicable):

RHEV 3.3.0
vdsm-4.13.2-0.11.el6ev.x86_64 (source)
libvirt-0.10.2-29.el6_5.7.x86_64 (source)

How reproducible:

Tricky.

Steps to Reproduce:
1. Migrate VM from one host to another
2. Have a network error occur during the final migration phase, causing the migration to fail (one hypothetical way to simulate this is sketched below).
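
One hypothetical way to simulate such a failure in a lab (this is not how the customer's failure was triggered) is to block only the libvirtd traffic on the destination while the migration is still running, so that the libvirt connection breaks but the QEMU-to-QEMU migration stream stays up. Assuming the hosts talk over the default libvirtd TCP/TLS ports:

# iptables -I INPUT -p tcp --dport 16509 -j DROP
# iptables -I INPUT -p tcp --dport 16514 -j DROP

(Remove the rules again afterwards with the matching iptables -D commands.)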

Actual results:

Migration fails, and the VM remains present but paused on the source host.

Expected results:

Migration fails, but the VM is automatically resumed on the source host.

Additional info:

Relevant log extract from the source vdsm log:

Thread-731854::DEBUG::2014-05-14 23:04:01,097::vm::746::vm.Vm::(run) vmId=`b3398089-1e10-46a7-b31c-bfbb21500401`::setting migration downtime to 50
libvirtEventLoop::DEBUG::2014-05-14 23:04:03,889::vm::4982::vm.Vm::(_onLibvirtLifecycleEvent) vmId=`b3398089-1e10-46a7-b31c-bfbb21500401`::event Resumed detail 1 opaque None
Thread-731490::DEBUG::2014-05-14 23:04:03,891::libvirtconnection::108::libvirtconnection::(wrapper) Unknown libvirterror: ecode: 9 edom: 10 level: 2 message: operation failed: Lost connection to destination host
Thread-731490::DEBUG::2014-05-14 23:04:03,892::vm::752::vm.Vm::(cancel) vmId=`b3398089-1e10-46a7-b31c-bfbb21500401`::canceling migration downtime thread
Thread-731490::DEBUG::2014-05-14 23:04:03,892::vm::822::vm.Vm::(stop) vmId=`b3398089-1e10-46a7-b31c-bfbb21500401`::stopping migration monitor thread
Thread-731854::DEBUG::2014-05-14 23:04:03,892::vm::749::vm.Vm::(run) vmId=`b3398089-1e10-46a7-b31c-bfbb21500401`::migration downtime thread exiting
Thread-731494::DEBUG::2014-05-14 23:04:03,895::vm::315::vm.Vm::(run) vmId=`800bc216-a2a1-42fe-8de4-baa26bb52eea`::migration semaphore acquired
Thread-731490::ERROR::2014-05-14 23:04:03,895::vm::239::vm.Vm::(_recover) vmId=`b3398089-1e10-46a7-b31c-bfbb21500401`::operation failed: Lost connection to destination host

Comment 5 David Gibson 2014-05-20 05:49:25 UTC
Francesco, just attached the source and destination vdsm logs (private, since they have some customer details).  Let me know if you need logs earlier or later in time than those.

I have the source and destination sosreports, so let me know if you need the libvirtd or qemu logs too.  I didn't see anything useful in the source libvirtd or qemu logs, but I haven't had a chance to look over the destination logs yet.

Comment 6 Michal Skrivanek 2014-05-20 08:06:56 UTC
David, since this happened at libvirt level we'd need the libvirt logs (and maybe qemu logs) as well.
So far it doesn't look like anything caused by vdsm (other than it assuming the source is in the UP state at the end of migration), but let's see the libvirt logs to tell for sure.
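
For reference, a rough sketch of how libvirtd debug logging can be enabled on RHEL 6 to capture such logs (the level and output path below are only examples; see the comments in libvirtd.conf):

Add to /etc/libvirt/libvirtd.conf on the host being debugged:
log_level = 1
log_outputs = "1:file:/var/log/libvirt/libvirtd-debug.log"

then restart the daemon (note that restarting libvirtd on a production RHEV host is disruptive):
# service libvirtd restart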

Comment 8 David Gibson 2014-05-21 04:30:24 UTC
So, as Julio says, unfortunately we don't have the destination libvirt logs.  Julio has the extract from the source libvirt logs, but I'm pretty sure there's nothing useful in there.

But in any case, for the purposes of this BZ, I'm not actually interested in what caused the initial failure (I'm pursuing that separately with the customer; it may have been some network reconfiguration which was ongoing at the time).

But..

> So far it doesn't look like anything caused by vdsm(other than assuming src UP state at the end of migration),

The assumption that the source is UP after the failure is exactly the bug I'm reporting here.

Comment 9 Michal Skrivanek 2014-05-21 06:37:07 UTC
(In reply to David Gibson from comment #8)
Right, but the expectation from libvirt is that the VM will remain UP if migration fails for whatever reason. Francesco, please update with your findings; to me this looks like a libvirt problem.

Comment 11 Francesco Romani 2014-05-21 09:43:06 UTC
What happens here is that libvirt does indeed try to recover the VM that failed to migrate:

2014-05-14 22:04:03.831+0000: 6588: debug : qemuMonitorIOWrite:462 : QEMU_MONITOR_IO_WRITE: mon=0x7f346c05e6c0 buf={"execute":"query-migrate","id":"libvirt-9407"}^M
 len=49 ret=49 errno=11
2014-05-14 22:04:03.832+0000: 6588: debug : qemuMonitorIOProcess:354 : QEMU_MONITOR_IO_PROCESS: mon=0x7f346c05e6c0 buf={"return": {"status": "active", "total-time": 42101, "ram": {"total": 6543581184, "remaining": 1029115904, "transferred": 3257235874}}, "id": "libvirt-9407"}^M
 len=159
2014-05-14 22:04:03.833+0000: 6607: debug : virConnectIsAlive:19720 : conn=0x7f349833fbd0
2014-05-14 22:04:03.883+0000: 6607: debug : qemuDomainObjBeginJobInternal:808 : Starting job: async nested (async=migration out)
2014-05-14 22:04:03.883+0000: 6607: debug : qemuMonitorGetMigrationStatus:1871 : mon=0x7f346c05e6c0
2014-05-14 22:04:03.883+0000: 6607: debug : qemuMonitorSend:904 : QEMU_MONITOR_SEND_MSG: mon=0x7f346c05e6c0 msg={"execute":"query-migrate","id":"libvirt-9408"}^M
 fd=-1
2014-05-14 22:04:03.883+0000: 6588: warning : virKeepAliveTimerInternal:156 : No response from client 0x7f349847f780 after 5 keepalive messages in 30 seconds
2014-05-14 22:04:03.883+0000: 6588: debug : qemuMonitorIOWrite:462 : QEMU_MONITOR_IO_WRITE: mon=0x7f346c05e6c0 buf={"execute":"query-migrate","id":"libvirt-9408"}^M
 len=49 ret=49 errno=11
2014-05-14 22:04:03.884+0000: 6588: debug : qemuMonitorIOProcess:354 : QEMU_MONITOR_IO_PROCESS: mon=0x7f346c05e6c0 buf={"return": {"status": "active", "total-time": 42154, "ram": {"total": 6543581184, "remaining": 1018503168, "transferred": 3267832235}}, "id": "libvirt-9408"}^M
 len=159
2014-05-14 22:04:03.885+0000: 6607: debug : virConnectIsAlive:19720 : conn=0x7f349833fbd0
2014-05-14 22:04:03.885+0000: 6607: error : qemuMigrationWaitForCompletion:1096 : operation failed: Lost connection to destination host
2014-05-14 22:04:03.885+0000: 6607: debug : doPeer2PeerMigrate3:2626 : Finish3 0x7f349833fbd0 ret=-1
2014-05-14 22:04:03.885+0000: 6607: error : virNetClientSendInternal:1962 : internal error client socket is closed
2014-05-14 22:04:03.885+0000: 6607: debug : doPeer2PeerMigrate3:2659 : Confirm3 0x7f349813c7a0 cancelled=1 vm=0x7f346c060f00
2014-05-14 22:04:03.885+0000: 6607: debug : qemuMigrationConfirm:3310 : driver=0x7f349800b2e0, conn=0x7f349813c7a0, vm=0x7f346c060f00, cookiein=(null), cookieinlen=0, flags=1003, retcode=1
2014-05-14 22:04:03.885+0000: 6607: debug : qemuDomainObjSetJobPhase:689 : Setting 'migration out' phase to 'confirm3_cancelled'
2014-05-14 22:04:03.886+0000: 6607: debug : qemuMigrationEatCookie:758 : cookielen=0 cookie='(null)'
2014-05-14 22:04:03.886+0000: 6607: debug : qemuProcessStartCPUs:2870 : Using lock state '(null)'
2014-05-14 22:04:03.886+0000: 6607: debug : qemuDomainObjBeginJobInternal:808 : Starting job: async nested (async=migration out)
2014-05-14 22:04:03.886+0000: 6607: debug : qemuMonitorStartCPUs:1197 : mon=0x7f346c05e6c0
2014-05-14 22:04:03.886+0000: 6607: debug : qemuMonitorSend:904 : QEMU_MONITOR_SEND_MSG: mon=0x7f346c05e6c0 msg={"execute":"cont","id":"libvirt-9409"}^M
 fd=-1
2014-05-14 22:04:03.886+0000: 6588: debug : qemuMonitorIOWrite:462 : QEMU_MONITOR_IO_WRITE: mon=0x7f346c05e6c0 buf={"execute":"cont","id":"libvirt-9409"}^M
 len=40 ret=40 errno=11
2014-05-14 22:04:03.887+0000: 6588: debug : qemuMonitorIOProcess:354 : QEMU_MONITOR_IO_PROCESS: mon=0x7f346c05e6c0 buf={"return": {}, "id": "libvirt-9409"}^M

A few seconds later, however, libvirt receives a STOP event through the QEMU monitor and pauses the VM accordingly:

2014-05-14 22:04:11.474+0000: 6610: debug : virConnectIsAlive:19720 : conn=0x7f34782e3b30
2014-05-14 22:04:11.474+0000: 6588: debug : qemuMonitorIOProcess:354 : QEMU_MONITOR_IO_PROCESS: mon=0x7f346c05e6c0 buf={"timestamp": {"seconds": 1400105051, "microseconds": 474529}, "event": "STOP"}^M
 len=81
2014-05-14 22:04:11.474+0000: 6588: debug : qemuMonitorEmitStop:1018 : mon=0x7f346c05e6c0
2014-05-14 22:04:11.474+0000: 6588: debug : qemuProcessHandleStop:728 : Transitioned guest drx1bi2bfuiz1 to paused state
2014-05-14 22:04:11.475+0000: 6588: debug : qemuProcessHandleStop:738 : Preserving lock state '(null)'

The STOP event doesn't seem related to the migration, which should have concluded before. I don't think the QEMU logs will help us shed further light on this, but I'll be happy to dig into them if needed.

Overall, I do think the libvirt behaviour is correct here.

Regarding VDSM, given that libvirt automatically tries to resume the VM, I don't think there is room for improvement there either. The VM should only be un-paused with the engine's consent.

Comment 12 Jiri Denemark 2014-05-21 12:16:36 UTC
Unfortunately, libvirt's behavior is not correct here. After libvirtd detects a broken connection to the destination daemon, it does not send the migrate_cancel QMP command to the source qemu-kvm process. Thus, if only the libvirtd connection is broken but the connection between the source and destination qemu-kvm processes is still alive, the migration will keep running, and once it finishes, the source QEMU stops the virtual CPUs.
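
For illustration only (this is not the eventual fix, just the missing step performed by hand on an unpatched libvirt): while the source QEMU is still migrating, the now-orphaned migration can be cancelled through the monitor, and the guest resumed if QEMU has already stopped it. The domain name "guest1" below is just a placeholder; in a RHEV setup the resume should go through the engine, as noted in comment 11.

# virsh qemu-monitor-command guest1 '{"execute":"migrate_cancel"}'
# virsh resume guest1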

Comment 13 Jiri Denemark 2014-05-21 13:22:36 UTC
To reproduce this bug, one just needs to follow a few simple steps:

1) start peer-to-peer migration
2) before the migration finishes, kill libvirtd on destination host
3) the source libvirtd will report that the migration failed because of the broken connection to the destination, but the source QEMU is in fact still live-migrating the domain; the domain will become paused as soon as QEMU finishes transferring the migration data to the destination QEMU

Comment 14 Jiri Denemark 2014-05-22 11:57:03 UTC
Patch sent upstream for review: https://www.redhat.com/archives/libvir-list/2014-May/msg00746.html

Comment 15 zhe peng 2014-05-23 03:17:53 UTC
I can reproduce this issue:
Steps:
1: soft-mount NFS storage on both the source and the destination.
10.66.106.33:/var/lib/libvirt/images/ on /var/lib/libvirt/migrate type nfs (rw,soft,timeo=15,retrans=3,nosharecache,vers=4,addr=10.66.106.33,clientaddr=10.66.100.103)
2: set the migration speed to 1 (so the migration takes a long time to finish)
#virsh migrate-setspeed rhel6.5 1
3: do the migration
# virsh migrate --live --p2p rhel6.5 qemu+ssh://10.66.100.102/system --verbose
4: before the migration finishes, kill libvirtd on the destination
# kill -9 `pidof libvirtd`
5: on the source, change the speed back
# virsh migrate-setspeed rhel6.5 8796093022207

On the source host we get the error:
Migration: [  0 %]error: operation failed: Lost connection to destination host
Check the guest status:
# virsh list --all
 Id    Name                           State
----------------------------------------------------
 16    rhel6.5                        paused

Comment 19 zhoujunqin 2014-05-28 11:06:33 UTC
I can reproduce this issue with package libvirt-0.10.2-36.el6.x86_64, following the steps in Comment 15.
Verified with the new build libvirt-0.10.2-37.el6.x86_64.

Verification steps:
1. Prepare an NFS server, and soft-mount the NFS share on both the source and the target host.
# mount -t nfs  $nfs_ip:/var/lib/libvirt/images/ /var/lib/libvirt/migrate -o soft,timeo=15,retrans=3,nosharecache,vers=4,addr=$nfs_ip,clientaddr=source_ip

2. Prepare a healthy guest using an image on the NFS share, and set the migration speed to a low value (so the migration takes a long time to finish):
# virsh migrate-setspeed test2 5

3. Start migration
# virsh migrate --live --p2p test2 qemu+ssh://$target_ip/system --verbose

4. Before the migration finishes, kill libvirtd on the destination:
# kill -9 `pidof libvirtd`

5. Check the source side. On the source host we get the error:
Migration: [ 80 %]error: operation failed: Lost connection to destination host
Check the guest status:
# virsh domstate test2
running

So, moving to VERIFIED.

Comment 23 errata-xmlrpc 2014-10-14 04:21:53 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHBA-2014-1374.html

