Description of problem
----------------------

NBD-based libvirt storage migration fails with:

    error: invalid argument: monitor must not be NULL

How reproducible: Consistently.

Version / Environment
---------------------

The environment is a nested Fedora-25 environment, so I'm trying to
migrate a level-2 Fedora-25 guest.

libvirt version:

    $ rpm -q libvirt-daemon-kvm
    libvirt-daemon-kvm-3.2.0-1.fc25.x86_64

The QEMU binary used to boot the nested guest (to be migrated), on the
source guest hypervisor (the source host from which migration
originates) _and_ on the destination guest hypervisor, is built from
Git with this specific patch from QEMU:

    https://lists.nongnu.org/archive/html/qemu-block/2017-04/msg00085.html
    -- [RFC PATCH for-2.9] block: Ignore guest dev permissions during
       incoming migration

So the built binary version (showing `git describe` output) is:

    $ /home/stack/build/build-qemu/x86_64-softmmu/qemu-system-x86_64 --version
    QEMU emulator version 2.8.93 (v2.9.0-rc3-3-g3a8624b)

NB: However, the QEMU version is irrelevant; Michal Privoznik confirms
on IRC that this is a clear libvirt bug.

Steps to reproduce
~~~~~~~~~~~~~~~~~~

(1) Set up two hosts:
    https://kashyapc.fedorapeople.org/virt/libvirt-migration-tests-with-qemu+tcp.txt

(2) Then, migrate the guest to the destination host:

    $ virsh migrate --verbose --copy-storage-all \
        --p2p --live l2-f25 qemu+ssh://root@devstack-a/system

Actual results
--------------

    $ virsh migrate --verbose --copy-storage-all \
        --p2p --live l2-f25 qemu+ssh://root@devstack-a/system
    error: invalid argument: monitor must not be NULL

Expected results
----------------

NBD-based live storage migration succeeds.

Additional info
---------------

With libvirt log filters enabled, the error seems to be coming from:

    [...]
    2017-04-06 14:39:09.573+0000: 1065: error : virNetClientProgramDispatchError:177 : invalid argument: monitor must not be NULL
    [...]
Created attachment 1269478 [details] libvirtd log with log filters enabled
This is most likely fixed by the following series: https://www.redhat.com/archives/libvir-list/2017-April/msg00219.html
(In reply to Jiri Denemark from comment #2)
> This is most likely fixed by the following series:
> https://www.redhat.com/archives/libvir-list/2017-April/msg00219.html

I just built (RPMs) from libvirt Git, which has the above series
("qemu: Properly reset all migration capabilities").

I was here when I tested it:

    $ git describe
    v3.2.0-80-gbe193c4

(Test-1) Migrate a guest from source to destination:

    Result: Succeeds (the migrated guest successfully runs on the
    destination)

(Test-2) Once 'Test-1' finished successfully, and the guest is running
successfully on the destination, migrate it back to source:

    Result: Fails.

    $ virsh migrate --verbose --copy-storage-all \
        --p2p --live l2-f25 qemu+ssh://root@l1-f25/system
    error: operation failed: migration job: is not active

Looking at the source debug log (attached), I see the dreaded "cannot
acquire state change lock" error.

    [...]
    2017-04-10 06:29:23.322+0000: 22676: warning : qemuDomainObjBeginJobInternal:3607 : Cannot start job (modify, none) for domain l2-f25; current job is (none, migration out) owned by (0 <null>, 16698 remoteDispatchDomainMigratePerform3Params) for (0s, 96s)
    2017-04-10 06:29:23.322+0000: 22676: error : qemuDomainObjBeginJobInternal:3619 : Timed out during operation: cannot acquire state change lock (held by remoteDispatchDomainMigratePerform3Params)
    [...]
    2017-04-10 06:31:57.525+0000: 16698: error : qemuMigrationCheckJobStatus:1420 : operation failed: migration job: is not active
    2017-04-10 06:31:57.525+0000: 16698: debug : qemuMigrationCancelDriveMirror:785 : Cancelling drive mirrors for domain l2-f25
    [...]
    2017-04-10 06:31:57.538+0000: 16698: debug : qemuMigrationDriveMirrorCancelled:700 : All disk mirrors are gone
    2017-04-10 06:31:57.538+0000: 16698: debug : doPeer2PeerMigrate3:4428 : Finish3 0x7f39d801e3d0 ret=-1
    2017-04-10 06:31:57.539+0000: 16698: debug : qemuDomainObjEnterRemote:3918 : Entering remote (vm=0x563b26a60e60 name=l2-f25)
    2017-04-10 06:31:57.783+0000: 16698: error : virNetClientProgramDispatchError:177 : migration successfully aborted
    [...]
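For anyone triaging a similar debug log, the messages quoted above can be pulled out with a small filter. This is only a sketch: the function name `migration_errors` is made up for illustration, the patterns are simply the message strings seen in this report, and the log path in the usage comment is an assumption about where libvirtd is configured to write its log.

```shell
# Hypothetical helper: print only the error/warning lines from a libvirtd
# log that mention the failures discussed in this report.
# $1 = path to the log file.
migration_errors() {
    # First keep lines logged at "error" or "warning" priority, then
    # narrow them down to the message strings quoted in this bug.
    grep -E ': (error|warning) : ' "$1" |
        grep -E 'state change lock|Cannot start job|migration job|monitor must not be NULL'
}

# Usage (the path is an assumption; use whatever log_outputs points at):
#   migration_errors /var/log/libvirt/libvirtd.log
```

Filtering on the priority field first keeps the surrounding debug lines (drive mirror cancellation, Finish3, and so on) out of the result.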
Created attachment 1270406 [details] libvirtd debug log of destination host (after a failed migration from destination to source)
Created attachment 1270407 [details] libvirtd debug log of source host (after a failed migration from destination to source)
The "cannot acquire state change lock" message comes from an attempt to process a NIC_RX_FILTER_CHANGED event during migration, which is currently not possible. This might be worth a separate bug report...
Anyway, the main problem ("migration job: is not active") should be fixed by https://www.redhat.com/archives/libvir-list/2017-April/msg01479.html which I just sent upstream for review.
The issue should be fixed now by

commit fc48fc7930f560c4341f4afe1285848dfdb60278
Refs: v3.3.0-rc1-2-gfc48fc793
Author:     Jiri Denemark <jdenemar>
AuthorDate: Fri Apr 28 15:56:44 2017 +0200
Commit:     Jiri Denemark <jdenemar>
CommitDate: Tue May 2 12:26:35 2017 +0200

    qemu: Don't reset "events" migration capability

    When creating v3.2.0-77-g8be3ccd04 commit, I completely forgot that one
    migration capability is very special. It's the "events" capability which
    tells QEMU to report "MIGRATION" events. Since libvirt always wants the
    events, it is enabled in qemuConnectMonitor and the rest of the code
    should not touch it.

    https://bugzilla.redhat.com/show_bug.cgi?id=1439841
    https://bugzilla.redhat.com/show_bug.cgi?id=1441165

    Messed-up-by: Jiri Denemark <jdenemar>
    Signed-off-by: Jiri Denemark <jdenemar>
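One way to check from the outside that the "events" capability is still enabled after a migration attempt is to ask QEMU directly through the monitor. The sketch below assumes the QMP `query-migrate-capabilities` command is available in the QEMU build in use and that its reply is a list of objects with "capability" and "state" keys; the helper name `events_enabled` is hypothetical and is not part of libvirt.

```shell
# Hypothetical helper: read a QMP "query-migrate-capabilities" reply on
# stdin and succeed only if the "events" capability is reported as on.
events_enabled() {
    # Strip whitespace, put each JSON object on its own line by splitting
    # on '}', pick the object for "events", and test its state. Matching
    # the two keys separately avoids depending on their order.
    tr -d ' \t\n' | tr '}' '\n' |
        grep '"capability":"events"' |
        grep -q '"state":true'
}

# Usage on the host running the guest (domain name taken from this report):
#   virsh qemu-monitor-command l2-f25 \
#       '{"execute":"query-migrate-capabilities"}' | events_enabled \
#       && echo "events capability is on"
```

Note that using the monitor this way bypasses libvirt, so it is best kept to debugging on test hosts.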