Bug 1439841

Summary: libvirt: NBD-based storage migration fails with "error: invalid argument: monitor must not be NULL"
Product: [Community] Virtualization Tools
Reporter: Kashyap Chamarthy <kchamart>
Component: libvirt
Assignee: Jiri Denemark <jdenemar>
Status: CLOSED NEXTRELEASE
QA Contact:
Severity: unspecified
Docs Contact:
Priority: unspecified
Version: unspecified
CC: jdenemar, libvirt-maint, rbalakri
Target Milestone: ---
Target Release: ---
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version: libvirt-3.3.0
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2017-05-02 10:30:28 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Attachments:
Description | Flags
libvirtd log with log filters enabled | none
libvirtd debug log of destination host (after a failed migration from destination to source) | none
libvirtd debug log of source host (after a failed migration from destination to source) | none

Description Kashyap Chamarthy 2017-04-06 16:07:58 UTC
Description of problem
----------------------

NBD-based libvirt storage migration fails with:

    "error: invalid argument: monitor must not be NULL"


How reproducible: Consistently.


Version / Environment
---------------------

This is a nested Fedora-25 environment, so the guest I'm trying to
migrate is a level-2 Fedora-25 guest.

libvirt version:

    $ rpm -q libvirt-daemon-kvm
    libvirt-daemon-kvm-3.2.0-1.fc25.x86_64

The QEMU binary used to boot the nested guest (the one to be migrated)
on the source guest hypervisor (the host from which the migration
originates) _and_ on the destination guest hypervisor is built from Git,
with this specific QEMU patch applied:

    https://lists.nongnu.org/archive/html/qemu-block/2017-04/msg00085.html
    -- [RFC PATCH for-2.9] block: Ignore guest dev permissions during
    incoming migration

So the built binary version (showing `git describe` output):

    $ /home/stack/build/build-qemu/x86_64-softmmu/qemu-system-x86_64 --version
    QEMU emulator version 2.8.93 (v2.9.0-rc3-3-g3a8624b)


NB: The QEMU version is irrelevant here, though: Michal Privoznik
confirmed on IRC that this is clearly a libvirt bug.


Steps to reproduce
------------------

(1) Set up two hosts:

https://kashyapc.fedorapeople.org/virt/libvirt-migration-tests-with-qemu+tcp.txt

(2) Then, migrate the guest to the destination host:

    $ virsh migrate --verbose --copy-storage-all \
        --p2p --live l2-f25 qemu+ssh://root@devstack-a/system


Actual results
--------------

    $ virsh migrate --verbose --copy-storage-all \
        --p2p --live l2-f25 qemu+ssh://root@devstack-a/system
    error: invalid argument: monitor must not be NULL


Expected results
----------------

NBD-based live storage migration succeeds.

Additional info
---------------

With the libvirt log filters enabled, the error appears to come from:

[...]
2017-04-06 14:39:09.573+0000: 1065: error : virNetClientProgramDispatchError:177 : invalid argument: monitor must not be NULL
[...]

Comment 1 Kashyap Chamarthy 2017-04-06 19:56:47 UTC
Created attachment 1269478
libvirtd log with log filters enabled

Comment 2 Jiri Denemark 2017-04-07 06:19:09 UTC
This is most likely fixed by the following series: https://www.redhat.com/archives/libvir-list/2017-April/msg00219.html

Comment 3 Kashyap Chamarthy 2017-04-10 07:54:59 UTC
(In reply to Jiri Denemark from comment #2)
> This is most likely fixed by the following series:
> https://www.redhat.com/archives/libvir-list/2017-April/msg00219.html

I just built RPMs from libvirt Git, which includes the above series
("qemu: Properly reset all migration capabilities").  This is the commit
I tested:

    $ git describe
    v3.2.0-80-gbe193c4


(Test-1) Migrate a guest from source to destination:

         Result: Succeeds (the migrated guest successfully runs on the
                 destination)

(Test-2) Once 'Test-1' finished successfully, and the guest is running
         successfully on the destination, migrate it back to source:

         Result: Fails.

          $ virsh migrate --verbose --copy-storage-all \
                --p2p --live l2-f25 qemu+ssh://root@l1-f25/system

          error: operation failed: migration job: is not active


Looking at the source debug log (attached), I see the dreaded "cannot
acquire state change lock" error.

[...]
2017-04-10 06:29:23.322+0000: 22676: warning : qemuDomainObjBeginJobInternal:3607 : Cannot start job (modify, none) for domain l2-f25; current job is (none, migration out) owned by (0 <null>, 16698 remoteDispatchDomainMigratePerform3Params) for (0s, 96s)
2017-04-10 06:29:23.322+0000: 22676: error : qemuDomainObjBeginJobInternal:3619 : Timed out during operation: cannot acquire state change lock (held by remoteDispatchDomainMigratePerform3Params)
[...]
2017-04-10 06:31:57.525+0000: 16698: error : qemuMigrationCheckJobStatus:1420 : operation failed: migration job: is not active
2017-04-10 06:31:57.525+0000: 16698: debug : qemuMigrationCancelDriveMirror:785 : Cancelling drive mirrors for domain l2-f25
[...]
2017-04-10 06:31:57.538+0000: 16698: debug : qemuMigrationDriveMirrorCancelled:700 : All disk mirrors are gone
2017-04-10 06:31:57.538+0000: 16698: debug : doPeer2PeerMigrate3:4428 : Finish3 0x7f39d801e3d0 ret=-1
2017-04-10 06:31:57.539+0000: 16698: debug : qemuDomainObjEnterRemote:3918 : Entering remote (vm=0x563b26a60e60 name=l2-f25)
2017-04-10 06:31:57.783+0000: 16698: error : virNetClientProgramDispatchError:177 : migration successfully aborted
[...]
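
The log lines above follow a regular "timestamp: thread-id: level :
function:line : message" layout, so the warnings and errors can be pulled
out of a long debug log with a small script. This is a rough sketch: the
regular expression is inferred only from the lines quoted in this report,
not from any libvirt format specification, and the name `problem_entries`
is made up for illustration.

```python
import re

# Pattern inferred from the log excerpts in this report (an assumption,
# not an official libvirt log grammar).
LOG_LINE = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d{3}[+-]\d{4}): "
    r"(?P<tid>\d+): (?P<level>\w+) : "
    r"(?P<func>\w+):(?P<line>\d+) : (?P<msg>.*)$"
)

SAMPLE = """\
2017-04-10 06:31:57.525+0000: 16698: error : qemuMigrationCheckJobStatus:1420 : operation failed: migration job: is not active
2017-04-10 06:31:57.525+0000: 16698: debug : qemuMigrationCancelDriveMirror:785 : Cancelling drive mirrors for domain l2-f25
[...]
2017-04-10 06:31:57.783+0000: 16698: error : virNetClientProgramDispatchError:177 : migration successfully aborted
"""


def problem_entries(lines):
    """Yield a dict of fields for every warning or error line."""
    for raw in lines:
        match = LOG_LINE.match(raw.rstrip("\n"))
        # Skip lines that do not fit the layout (e.g. "[...]" markers)
        # and lines at other levels such as debug or info.
        if match and match.group("level") in ("warning", "error"):
            yield match.groupdict()


if __name__ == "__main__":
    for entry in problem_entries(SAMPLE.splitlines()):
        print(entry["ts"], entry["level"], entry["func"], entry["msg"])
```

To scan a real log, pass an open file object instead of
`SAMPLE.splitlines()`.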

Comment 4 Kashyap Chamarthy 2017-04-10 08:00:35 UTC
Created attachment 1270406
libvirtd debug log of destination host (after a failed migration from destination to source)

Comment 5 Kashyap Chamarthy 2017-04-10 08:02:20 UTC
Created attachment 1270407
libvirtd debug log of source host (after a failed migration from destination to source)

Comment 6 Jiri Denemark 2017-04-28 13:32:25 UTC
The "cannot acquire state change lock" message is related to processing a NIC_RX_FILTER_CHANGED event during migration, which is currently impossible. This might be worth a separate bug report...

Comment 7 Jiri Denemark 2017-04-28 14:31:06 UTC
Anyway, the main problem ("migration job: is not active") should be fixed by https://www.redhat.com/archives/libvir-list/2017-April/msg01479.html which I just sent upstream for review.

Comment 8 Jiri Denemark 2017-05-02 10:30:28 UTC
The issue should be fixed now by

commit fc48fc7930f560c4341f4afe1285848dfdb60278
Refs: v3.3.0-rc1-2-gfc48fc793
Author:     Jiri Denemark <jdenemar>
AuthorDate: Fri Apr 28 15:56:44 2017 +0200
Commit:     Jiri Denemark <jdenemar>
CommitDate: Tue May 2 12:26:35 2017 +0200

    qemu: Don't reset "events" migration capability

    When creating v3.2.0-77-g8be3ccd04 commit, I completely forgot that one
    migration capability is very special. It's the "events" capability which
    tells QEMU to report "MIGRATION" events. Since libvirt always wants the
    events, it is enabled in qemuConnectMonitor and the rest of the code
    should not touch it.

    https://bugzilla.redhat.com/show_bug.cgi?id=1439841
    https://bugzilla.redhat.com/show_bug.cgi?id=1441165

    Messed-up-by: Jiri Denemark <jdenemar>
    Signed-off-by: Jiri Denemark <jdenemar>
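
One way to read the commit message above is that resetting migration
capabilities should turn every capability off except "events", which is
left exactly as it was. The sketch below models that idea in Python. It is
not libvirt's actual (C) code: `reset_migration_capabilities` and
`UNTOUCHED` are hypothetical names, and the capabilities other than
"events" are included only as examples.

```python
# Capabilities the reset must leave alone. Per the commit message,
# "events" is enabled once in qemuConnectMonitor and must stay that way.
UNTOUCHED = frozenset({"events"})


def reset_migration_capabilities(caps):
    """Return a copy of `caps` with every capability disabled,
    except those in UNTOUCHED, whose state is kept unchanged.

    Illustrative model only; not taken from the libvirt source.
    """
    return {
        name: (enabled if name in UNTOUCHED else False)
        for name, enabled in caps.items()
    }


if __name__ == "__main__":
    # Example state after a migration attempt (capability names other
    # than "events" are illustrative).
    before = {"events": True, "xbzrle": True, "auto-converge": False}
    print(reset_migration_capabilities(before))
```

Under this model, a reset that also cleared "events" would stop QEMU from
reporting "MIGRATION" events, which is the mistake the commit title
describes.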