Bug 1098833
Summary: | After migration failed in last phase, didn't resume VM on source host | |
---|---|---|---
Product: | Red Hat Enterprise Linux 6 | Reporter: | David Gibson <dgibson>
Component: | libvirt | Assignee: | Jiri Denemark <jdenemar>
Status: | CLOSED ERRATA | QA Contact: | Virtualization Bugs <virt-bugs>
Severity: | high | Docs Contact: |
Priority: | high | |
Version: | 6.5 | CC: | bazulay, dyuan, fromani, iheim, jdenemar, jentrena, juzhou, lpeer, michal.skrivanek, mkalinin, mzhan, ofrenkel, rbalakri, shyu, tdosek, yeylon, zhwang, zpeng
Target Milestone: | rc | Keywords: | ZStream
Target Release: | 6.6 | |
Hardware: | All | |
OS: | Linux | |
Whiteboard: | virt | |
Fixed In Version: | libvirt-0.10.2-37.el6 | Doc Type: | Bug Fix
Doc Text: | | Story Points: | ---
Clone Of: | | Environment: |
Last Closed: | 2014-10-14 04:21:53 UTC | Type: | Bug
Regression: | --- | Mount Type: | ---
Documentation: | --- | CRM: |
Verified Versions: | | Category: | ---
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: |
Cloudforms Team: | --- | Target Upstream Version: |
Embargoed: | | |
Bug Depends On: | | |
Bug Blocks: | 1100642 | |
Description
David Gibson
2014-05-19 01:11:00 UTC
Francesco, I've just attached the source and destination vdsm logs (private, since they contain some customer details). Let me know if you need logs from earlier or later than those. I have the source and destination sosreports, so let me know if you need the libvirtd or qemu logs too. I didn't see anything useful in the source libvirtd or qemu logs, but I haven't had a chance to look over the destination logs yet.

David, since this happened at the libvirt level we'd need the libvirt logs (and maybe the qemu logs) as well. So far it doesn't look like anything caused by vdsm (other than assuming the source is in the UP state at the end of migration), but let's see the libvirt logs to tell for sure.

So, as Julio says, unfortunately we don't have the destination libvirt logs. Julio has the extract from the source libvirt logs, but I'm pretty sure there's nothing useful in there.
But in any case, for the purposes of this BZ, I'm not actually interested in what caused the initial failure (I'm pursuing that separately with the customer; it may have been some network reconfiguration which was ongoing at the time).
But:
> So far it doesn't look like anything caused by vdsm(other than assuming src UP state at the end of migration),
The assumption that the source is UP after the failure is exactly the bug I'm reporting here.
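For reference, since the destination libvirt logs were not available in this case, here is a minimal sketch of how verbose libvirtd logging could be enabled on both hosts ahead of a future occurrence. The log level, output path, and restart command are illustrative assumptions, not settings taken from this report:

```bash
# Illustrative only: turn on debug logging in libvirtd so the libvirt side of a
# future migration failure is captured (run on both source and destination).
cat >> /etc/libvirt/libvirtd.conf <<'EOF'
log_level = 1
log_outputs="1:file:/var/log/libvirt/libvirtd.log"
EOF
# RHEL 6 style service restart; adjust for the local init system.
service libvirtd restart
```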
(In reply to David Gibson from comment #8) Right, but the expectation from libvirt is that the VM will remain UP if migration fails for whatever reason. Francesco, please update with your findings; to me this looks like a libvirt problem.

What happens here is that libvirt does indeed try to recover the VM whose migration failed:

    2014-05-14 22:04:03.831+0000: 6588: debug : qemuMonitorIOWrite:462 : QEMU_MONITOR_IO_WRITE: mon=0x7f346c05e6c0 buf={"execute":"query-migrate","id":"libvirt-9407"}^M len=49 ret=49 errno=11
    2014-05-14 22:04:03.832+0000: 6588: debug : qemuMonitorIOProcess:354 : QEMU_MONITOR_IO_PROCESS: mon=0x7f346c05e6c0 buf={"return": {"status": "active", "total-time": 42101, "ram": {"total": 6543581184, "remaining": 1029115904, "transferred": 3257235874}}, "id": "libvirt-9407"}^M len=159
    2014-05-14 22:04:03.833+0000: 6607: debug : virConnectIsAlive:19720 : conn=0x7f349833fbd0
    2014-05-14 22:04:03.883+0000: 6607: debug : qemuDomainObjBeginJobInternal:808 : Starting job: async nested (async=migration out)
    2014-05-14 22:04:03.883+0000: 6607: debug : qemuMonitorGetMigrationStatus:1871 : mon=0x7f346c05e6c0
    2014-05-14 22:04:03.883+0000: 6607: debug : qemuMonitorSend:904 : QEMU_MONITOR_SEND_MSG: mon=0x7f346c05e6c0 msg={"execute":"query-migrate","id":"libvirt-9408"}^M fd=-1
    2014-05-14 22:04:03.883+0000: 6588: warning : virKeepAliveTimerInternal:156 : No response from client 0x7f349847f780 after 5 keepalive messages in 30 seconds
    2014-05-14 22:04:03.883+0000: 6588: debug : qemuMonitorIOWrite:462 : QEMU_MONITOR_IO_WRITE: mon=0x7f346c05e6c0 buf={"execute":"query-migrate","id":"libvirt-9408"}^M len=49 ret=49 errno=11
    2014-05-14 22:04:03.884+0000: 6588: debug : qemuMonitorIOProcess:354 : QEMU_MONITOR_IO_PROCESS: mon=0x7f346c05e6c0 buf={"return": {"status": "active", "total-time": 42154, "ram": {"total": 6543581184, "remaining": 1018503168, "transferred": 3267832235}}, "id": "libvirt-9408"}^M len=159
    2014-05-14 22:04:03.885+0000: 6607: debug : virConnectIsAlive:19720 : conn=0x7f349833fbd0
    2014-05-14 22:04:03.885+0000: 6607: error : qemuMigrationWaitForCompletion:1096 : operation failed: Lost connection to destination host
    2014-05-14 22:04:03.885+0000: 6607: debug : doPeer2PeerMigrate3:2626 : Finish3 0x7f349833fbd0 ret=-1
    2014-05-14 22:04:03.885+0000: 6607: error : virNetClientSendInternal:1962 : internal error client socket is closed
    2014-05-14 22:04:03.885+0000: 6607: debug : doPeer2PeerMigrate3:2659 : Confirm3 0x7f349813c7a0 cancelled=1 vm=0x7f346c060f00
    2014-05-14 22:04:03.885+0000: 6607: debug : qemuMigrationConfirm:3310 : driver=0x7f349800b2e0, conn=0x7f349813c7a0, vm=0x7f346c060f00, cookiein=(null), cookieinlen=0, flags=1003, retcode=1
    2014-05-14 22:04:03.885+0000: 6607: debug : qemuDomainObjSetJobPhase:689 : Setting 'migration out' phase to 'confirm3_cancelled'
    2014-05-14 22:04:03.886+0000: 6607: debug : qemuMigrationEatCookie:758 : cookielen=0 cookie='(null)'
    2014-05-14 22:04:03.886+0000: 6607: debug : qemuProcessStartCPUs:2870 : Using lock state '(null)'
    2014-05-14 22:04:03.886+0000: 6607: debug : qemuDomainObjBeginJobInternal:808 : Starting job: async nested (async=migration out)
    2014-05-14 22:04:03.886+0000: 6607: debug : qemuMonitorStartCPUs:1197 : mon=0x7f346c05e6c0
    2014-05-14 22:04:03.886+0000: 6607: debug : qemuMonitorSend:904 : QEMU_MONITOR_SEND_MSG: mon=0x7f346c05e6c0 msg={"execute":"cont","id":"libvirt-9409"}^M fd=-1
    2014-05-14 22:04:03.886+0000: 6588: debug : qemuMonitorIOWrite:462 : QEMU_MONITOR_IO_WRITE: mon=0x7f346c05e6c0 buf={"execute":"cont","id":"libvirt-9409"}^M len=40 ret=40 errno=11
    2014-05-14 22:04:03.887+0000: 6588: debug : qemuMonitorIOProcess:354 : QEMU_MONITOR_IO_PROCESS: mon=0x7f346c05e6c0 buf={"return": {}, "id": "libvirt-9409"}^M

A few seconds later, however, libvirt receives a STOP event through the QEMU monitor and pauses the VM accordingly:

    2014-05-14 22:04:11.474+0000: 6610: debug : virConnectIsAlive:19720 : conn=0x7f34782e3b30
    2014-05-14 22:04:11.474+0000: 6588: debug : qemuMonitorIOProcess:354 : QEMU_MONITOR_IO_PROCESS: mon=0x7f346c05e6c0 buf={"timestamp": {"seconds": 1400105051, "microseconds": 474529}, "event": "STOP"}^M len=81
    2014-05-14 22:04:11.474+0000: 6588: debug : qemuMonitorEmitStop:1018 : mon=0x7f346c05e6c0
    2014-05-14 22:04:11.474+0000: 6588: debug : qemuProcessHandleStop:728 : Transitioned guest drx1bi2bfuiz1 to paused state
    2014-05-14 22:04:11.475+0000: 6588: debug : qemuProcessHandleStop:738 : Preserving lock state '(null)'

The STOP event doesn't seem related to the migration, as that should already have concluded. I don't think the QEMU logs will help shed further light on this, but I'll be happy to dig into them if needed. Overall, I do think the libvirt behaviour is correct here. Regarding VDSM, given that libvirt automatically tries to resume the VM, I don't think there is room for improvement there either: the VM should be un-paused only with the engine's consent.
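As a hedged aside, here are two ways one could confirm from the source host why the guest ended up paused in a situation like this. The domain name "rhel6.5" is a placeholder taken from the reproduction steps below, and the exact reason string printed may vary:

```bash
# Ask libvirt for the state plus the reason it recorded for the pause.
virsh domstate --reason rhel6.5
# e.g. "paused (migrating)" would point at a migration-related stop

# Or query QEMU directly over the monitor; query-status is a standard QMP command.
virsh qemu-monitor-command rhel6.5 '{"execute":"query-status"}'
```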
Unfortunately, libvirt's behavior is not correct here. After libvirtd detects the broken connection to the destination daemon, it does not send the migrate_cancel QMP command to the source qemu-kvm process. Thus, if only the libvirtd connection is broken while the connection between the source and destination qemu-kvm processes is still alive, the migration keeps running, and once it finishes, the source QEMU stops the virtual CPUs.

To reproduce this bug, one just needs to follow a few simple steps:

1) Start a peer-to-peer migration.
2) Before the migration finishes, kill libvirtd on the destination host.
3) The source libvirtd will report that the migration failed because of the broken connection to the destination, but the domain is in fact live-migrated: it becomes paused as soon as QEMU finishes transferring the migration data to the destination QEMU.

Patch sent upstream for review: https://www.redhat.com/archives/libvir-list/2014-May/msg00746.html

I can reproduce this issue. Steps:

1. Soft-mount the NFS storage on both the source and the destination:
    10.66.106.33:/var/lib/libvirt/images/ on /var/lib/libvirt/migrate type nfs (rw,soft,timeo=15,retrans=3,nosharecache,vers=4,addr=10.66.106.33,clientaddr=10.66.100.103)
2. Set the migration speed to 1 so the migration takes a long time to finish:
    # virsh migrate-setspeed rhel6.5 1
3. Start the migration:
    # virsh migrate --live --p2p rhel6.5 qemu+ssh://10.66.100.102/system --verbose
4. Before the migration finishes, kill libvirtd on the destination:
    # kill -9 `pidof libvirtd`
5. On the source, change the speed back:
    # virsh migrate-setspeed rhel6.5 8796093022207

On the source host I get this error:
    Migration: [ 0 %]error: operation failed: Lost connection to destination host

Checking the guest status:
    # virsh list --all
     Id    Name                           State
    ----------------------------------------------------
     16    rhel6.5                        paused
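The reproduction steps above lend themselves to a small script. This is a rough sketch under stated assumptions: the guest name, destination address, background/sleep timing, and passwordless root ssh access to the destination are all placeholders, not part of the original report:

```bash
#!/bin/bash
# Sketch of the reproduction flow described above (placeholder names/addresses).
GUEST=rhel6.5
DEST=10.66.100.102

# Throttle the migration so there is time to break the destination daemon.
virsh migrate-setspeed "$GUEST" 1

# Start a live peer-to-peer migration in the background.
virsh migrate --live --p2p "$GUEST" qemu+ssh://$DEST/system --verbose &

# Before it completes, kill libvirtd on the destination host.
sleep 10
ssh root@"$DEST" 'kill -9 $(pidof libvirtd)'

# Wait for the failed migration command, restore the speed, check the state.
wait
virsh migrate-setspeed "$GUEST" 8796093022207
virsh domstate "$GUEST"   # with the unfixed build the guest ends up "paused"
```

Per the analysis above, on an unfixed build issuing `virsh qemu-monitor-command "$GUEST" '{"execute":"migrate_cancel"}'` before QEMU finishes transferring would be a manual way to keep the guest running; the missing cancellation on the libvirt side is exactly what this bug is about.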
I can reproduce the problem with libvirt-0.10.2-36.el6.x86_64, following comment 15. Verified with the new build, libvirt-0.10.2-37.el6.x86_64. Verification steps:

1. Prepare an NFS server and soft-mount the export on both the source and target hosts:
    # mount -t nfs $nfs_ip:/var/lib/libvirt/images/ /var/lib/libvirt/migrate -o soft,timeo=15,retrans=3,nosharecache,vers=4,addr=$nfs_ip,clientaddr=source_ip
2. Prepare a healthy guest using an image on the NFS mount and set the migration speed to a low value so the migration takes a long time to finish:
    # virsh migrate-setspeed test2 5
3. Start the migration:
    # virsh migrate --live --p2p test2 qemu+ssh://$target_ip/system --verbose
4. Before the migration finishes, kill libvirtd on the destination:
    # kill -9 `pidof libvirtd`
5. Check the source side. The source host reports:
    Migration: [ 80 %]error: operation failed: Lost connection to destination host
   Checking the guest status:
    # virsh domstate test2
    running

So, moving to VERIFIED.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHBA-2014-1374.html
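For completeness, a minimal sketch of the pass/fail check implied by the verification steps above, assuming the same guest name ("test2") and that it is run on the source host after the failed migration:

```bash
# With the fixed build (libvirt-0.10.2-37.el6) the guest should stay running
# on the source host; with the unfixed build it ends up paused.
state=$(virsh domstate test2)
if [ "$state" = "running" ]; then
    echo "PASS: guest kept running on the source after the failed migration"
else
    echo "FAIL: guest is '$state' on the source"
fi
```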