Bug 1151723

Summary: migration hangs after using migrate with --graphicsuri and the guest state is left locked
Product: Red Hat Enterprise Linux 7
Reporter: Luyao Huang <lhuang>
Component: libvirt
Assignee: Jiri Denemark <jdenemar>
Status: CLOSED ERRATA
QA Contact: Virtualization Bugs <virt-bugs>
Severity: medium
Docs Contact:
Priority: medium
Version: 7.1
CC: dgilbert, dyuan, fjin, hhuang, huding, jdenemar, juzhang, knoel, mzhan, quintela, rbalakri, virt-maint, xfu, zpeng
Target Milestone: rc
Target Release: ---
Hardware: x86_64
OS: Linux
Whiteboard:
Fixed In Version: libvirt-2.0.0-2.el7
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2016-11-03 18:10:42 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 1288337    
Attachments (Description, Flags):
libvirtd log on source host (none)
the second migration gets stuck if the first migration is cancelled immediately (none)

Description Luyao Huang 2014-10-11 07:48:52 UTC
Description of problem:
Migration hangs after running migrate --graphicsuri with an invalid URI, and the guest state is left locked.
This issue was only seen with guests that use SPICE.

Version-Release number of selected component (if applicable):
libvirt-1.2.8-5.el7.x86_64
qemu-img-rhev-2.1.2-3.el7.x86_64


How reproducible:
100%

Steps to Reproduce:

1. Prepare a guest that can be migrated successfully and set up a migration environment:
# virsh list --all
 Id    Name                           State
----------------------------------------------------
 4     test3                          running

2. # virsh dumpxml test3
    <graphics type='spice' autoport='yes' listen='127.0.0.1'>
      <listen type='address' address='127.0.0.1'/>
    </graphics>


3. Migrate using an invalid graphics URI (vnc123 is used here):
# time virsh migrate test3 --graphicsuri vnc123 qemu+ssh://10.66.70.127/system --live --verbose
root.70.127's password: 
Migration: [100 %]
Migration: [100 %]
Migration: [100 %]

Migration: [100 %]^C^C                         <------hang

real	3m53.970s
user	0m0.146s
sys	0m0.097s

4. On the source host:

# virsh list --all
 Id    Name                           State
----------------------------------------------------
 4     test3                          paused

5.# virsh resume test3
error: Failed to resume domain test3
error: Timed out during operation: cannot acquire state change lock

6.# virsh destroy test3
Domain test3 destroyed



Actual results:

The migrate command hangs. After pressing Ctrl+C, the guest state on the source host is locked,
and the guest on the destination host shows as running but cannot be used (no reads or writes possible).

Expected results:
An error should be reported before the migration starts.

Additional info:

log from libvirtd.log:
2014-10-11 07:43:15.601+0000: 15140: error : qemuDomainMigrateGraphicsRelocate:2143 : invalid argument: unknown graphics type (null)
2014-10-11 07:43:15.601+0000: 15140: warning : qemuMigrationRun:3555 : unable to provide data for graphics client relocation
2014-10-11 07:43:18.067+0000: 15140: warning : qemuMigrationCancelDriveMirror:1632 : Unable to stop block job on drive-ide0-0-0

Comment 1 Jiri Denemark 2014-11-27 16:12:36 UTC
It's apparently something in qemu-kvm-rhev. It works with qemu-kvm-1.5.3-60.el7, doesn't work with qemu-kvm-rhev-2.1.2-3.el7 and doesn't work in qemu-kvm-rhev-2.1.2-14.el7 either.

Comment 2 Jiri Denemark 2014-12-02 12:29:15 UTC
So the difference between 1.5.3 and 2.1.2 is in the response to query-spice in
the case of an invalid graphics URI. Normally, we send client_migrate_info and,
at the end of migration, wait for query-spice to return migrated = true.
However, if an invalid graphics URI is passed to our migration APIs (i.e.,
something that does not start with spice://), we don't call client_migrate_info,
but we still wait for query-spice (as long as SPICE is enabled for the domain,
of course) to return migrated = true at the end of migration.

With qemu-kvm-1.5.3, a SPICE_DISCONNECTED event is emitted, followed by
SPICE_MIGRATE_COMPLETED. Once migration completes, query-spice returns:

    {
      "return": {
        "migrated": true,
        "enabled": true,
        "auth": "none",
        "port": 5900,
        "compiled-version": "0.12.4",
        "host": "0.0.0.0",
        "channels": [

        ],
        "mouse-mode": "server"
      },
      "id": "libvirt-25"
    }


With qemu-kvm-rhev-2.1.2, however, no SPICE-related events are emitted, and at the
end of migration query-spice always returns (172.17.172.1 is the client):

    {
      "return": {
        "migrated": false,
        "enabled": true,
        "auth": "none",
        "port": 5900,
        "compiled-version": "0.12.4",
        "host": "0.0.0.0",
        "channels": [
          {
            "port": "51853",
            "family": "ipv4",
            "channel-type": 1,
            "connection-id": 2035481344,
            "host": "172.17.172.1",
            "channel-id": 0,
            "tls": false
          },
          {
            "port": "51854",
            "family": "ipv4",
            "channel-type": 2,
            "connection-id": 2035481344,
            "host": "172.17.172.1",
            "channel-id": 0,
            "tls": false
          },
          {
            "port": "51855",
            "family": "ipv4",
            "channel-type": 3,
            "connection-id": 2035481344,
            "host": "172.17.172.1",
            "channel-id": 0,
            "tls": false
          },
          {
            "port": "51856",
            "family": "ipv4",
            "channel-type": 4,
            "connection-id": 2035481344,
            "host": "172.17.172.1",
            "channel-id": 0,
            "tls": false
          }
        ],
        "mouse-mode": "server"
      },
      "id": "libvirt-238"
    }

and libvirt ends up in an endless loop waiting for migrated = true.
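
For illustration, a minimal standalone sketch of the wait loop described above
(not the actual libvirt source; query_spice_migrated() is a hypothetical
stand-in for the query-spice call):

    /* Illustrative sketch only, not the actual libvirt source.  It mimics the
     * behaviour described above: at the end of migration, libvirt keeps
     * polling query-spice and only returns once "migrated" flips to true. */
    #include <stdbool.h>
    #include <stdio.h>
    #include <unistd.h>

    /* Hypothetical stand-in for sending "query-spice" and reading "migrated". */
    static bool query_spice_migrated(void)
    {
        /* With qemu-kvm-rhev 2.1.2 and no prior client_migrate_info command,
         * this never becomes true, so the caller below never returns. */
        return false;
    }

    static int wait_for_spice_migration(void)
    {
        while (!query_spice_migrated())
            sleep(1);               /* poll forever; nothing ever breaks the loop */
        return 0;
    }

    int main(void)
    {
        puts("waiting for SPICE client migration to complete...");
        return wait_for_spice_migration();   /* hangs, matching the reported symptom */
    }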

Perhaps we should not wait for spice to finish migration when we didn't call
client_migrate_info, I don't know. But it still seems QEMU behaves strangely.

Comment 5 Gerd Hoffmann 2015-09-09 12:38:48 UTC
> Perhaps we should not wait for spice to finish migration when we didn't call
> client_migrate_info, I don't know.

Yes, you should not.

BTW: no need to poll 'migrate', you can just wait for SPICE_MIGRATE_COMPLETED.

> But it still seems QEMU behaves strangely.

Why?  Sending a SPICE migration notification when no SPICE client migration happened in the first place is strange. This was fixed here:

============================= cut here =================================

commit a76a2f729aae21c45c7e9eef8d1d80e94d1cc930
Author: Gerd Hoffmann <kraxel>
Date:   Tue Apr 29 09:27:31 2014 +0200

    spice: fix libvirt snapshots
    
    Only notify spice-server about migration events in case we got
    target host information beforehand.  So we kick the seamless spice
    client migration only in case a actual live migration happens, not
    when libvirt uses live-migration-to-file for snapshotting.
    
    Signed-off-by: Gerd Hoffmann <kraxel>
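
A simplified, self-contained sketch of the guard this commit describes (the
names and structure here are hypothetical and do not mirror the real QEMU
code):

    /* The idea: remember whether client_migrate_info ever supplied a target
     * host, and skip the SPICE client-migration machinery otherwise. */
    #include <stdbool.h>
    #include <stdio.h>

    struct spice_migration_state {
        bool have_target;   /* set once client_migrate_info provided a target host */
    };

    static void on_client_migrate_info(struct spice_migration_state *s)
    {
        s->have_target = true;          /* a real target host was handed to SPICE */
    }

    static void on_migration_event(struct spice_migration_state *s)
    {
        if (!s->have_target) {
            /* e.g. live-migration-to-file for a snapshot, or a migration where
             * libvirt never sent client_migrate_info: nothing to hand over */
            printf("no SPICE client migration target, skipping notification\n");
            return;
        }
        printf("notifying spice-server about the migration event\n");
    }

    int main(void)
    {
        struct spice_migration_state s = { .have_target = false };
        on_migration_event(&s);         /* skipped: no target was ever provided */
        on_client_migrate_info(&s);
        on_migration_event(&s);         /* now the SPICE client is told to switch */
        return 0;
    }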

Comment 6 Jiri Denemark 2016-03-01 15:45:11 UTC
Fixed upstream by v1.3.2-48-gbd7c8a6:

commit bd7c8a693d4d5f036ac55990bf5785dd19774685
Author:     Jiri Denemark <jdenemar>
AuthorDate: Mon Feb 29 13:18:13 2016 +0100
Commit:     Jiri Denemark <jdenemar>
CommitDate: Tue Mar 1 15:59:00 2016 +0100

    qemu: Don't always wait for SPICE to finish migration
    
    When SPICE graphics is configured for a domain but we did not ask the
    client to switch to the destination, we should not wait for
    SPICE_MIGRATE_COMPLETED event (which will never come).
    
    https://bugzilla.redhat.com/show_bug.cgi?id=1151723
    
    Signed-off-by: Jiri Denemark <jdenemar>
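
The effect of this change, as a simplified standalone sketch (not the actual
patch; the field names here are illustrative scaffolding rather than code
copied from the libvirt sources):

    #include <stdbool.h>
    #include <stdio.h>

    struct migration_job {
        bool spiceMigration;    /* true only if client_migrate_info was sent */
        bool spiceMigrated;     /* set when SPICE_MIGRATE_COMPLETED arrives */
    };

    static void wait_for_spice(struct migration_job *job)
    {
        /* The fix: if we never asked the SPICE client to switch to the
         * destination, there is no SPICE_MIGRATE_COMPLETED event to wait for. */
        if (!job->spiceMigration)
            return;

        while (!job->spiceMigrated) {
            /* in the real code this would block until the event handler for
             * SPICE_MIGRATE_COMPLETED signals completion */
        }
    }

    int main(void)
    {
        struct migration_job job = { .spiceMigration = false, .spiceMigrated = false };
        wait_for_spice(&job);   /* returns immediately instead of hanging */
        puts("migration finished without waiting for SPICE");
        return 0;
    }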

Comment 7 Mike McCune 2016-03-28 22:45:31 UTC
This bug was accidentally moved from POST to MODIFIED via an error in automation, please see mmccune with any questions

Comment 9 Fangge Jin 2016-04-11 10:12:43 UTC
I can reproduce this with libvirt-1.2.8-5.el7.x86_64 and qemu-kvm-rhev-2.1.2-23.el7.x86_64.

Verification passed with libvirt-1.3.3-1.el7.x86_64 and qemu-kvm-rhev-2.5.0-4.el7.x86_64.

Steps:
1.# virsh list --all
 Id    Name                           State
----------------------------------------------------
 8     rhel7.2-1030                   running
2. Connect to the guest's graphics console:
# remote-viewer spice://10.66.5.57:5900
3.# virsh migrate rhel7.2-1030 qemu+ssh://10.66.4.113/system --live --verbose --graphicsuri aakdkdie
Migration: [100 %]
[root@fjin-5-57 2.1.2-23]#
4. On the source host:
# virsh list
 Id    Name                           State
----------------------------------------------------
5. On the target host:
# virsh list
 Id    Name                           State
----------------------------------------------------
 6     rhel7.2-1030                   running

6. Connect to the guest's graphics console:
# remote-viewer spice://10.66.4.113:5900
7. In the guest, perform some operations; reads and writes work.
8. Migrate back:
# virsh migrate rhel7.2-1030 qemu+ssh://10.66.5.57/system --live --verbose --graphicsuri spice://10.66.5.57:5900
Migration: [100 %]
[root@fjin-4-113 ~]#

Comment 10 Fangge Jin 2016-04-12 03:48:30 UTC
After doing more testing, I found that when the guest is persistent and a migration with --graphicsuri {invalid_uri} is done after a successful migration, the migration hangs (waiting for the SPICE migration to finish).
My guess is that wait_for_spice is not reset to false after the first successful migration.


Steps:
0. Guest rhel7.2-1030 is persistent on the source host.
1. On the source host:
# virsh migrate rhel7.2-1030 qemu+ssh://10.66.4.113/system --live --verbose 
Migration: [100 %]
[root@fjin-5-57 libvirt]

2. On the target host, migrate back:
# virsh migrate rhel7.2-1030 qemu+ssh://10.66.5.57/system --live --verbose 
Migration: [100 %]

3. On the source host, migrate with an invalid graphicsuri:
# virsh migrate rhel7.2-1030 qemu+ssh://10.66.4.113/system --live --verbose --graphicsuri 10.66.4.113
Migration: [100 %]
Migration: [100 %]
(after several minutes, virsh still hangs)

Comment 11 Fangge Jin 2016-04-12 03:50:13 UTC
Created attachment 1146190 [details]
libvirtd log on source host

Comment 12 Jiri Denemark 2016-07-05 08:52:28 UTC
Indeed, libvirt doesn't properly reset job->spiceMigration and thus a migration with an incorrect graphics URI will get stuck in case the domain was migrated with a correct graphics URI before. This bug affects mainly persistent domains; transient domains are affected only if the first migration is cancelled. Patches for this issue were sent for review upstream:

  https://www.redhat.com/archives/libvir-list/2016-July/msg00108.html

Unfortunately, there is a related bug 1352836, which needs to be taken into account when testing this bug with qemu-kvm-rhev-2.6.
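
A minimal standalone illustration of the missing reset (hypothetical code, not
the upstream patch): the flag tracking whether client_migrate_info was sent
must start out false for every migration attempt, otherwise a value left over
from an earlier successful migration makes a later attempt with a bad graphics
URI wait forever.

    #include <stdbool.h>
    #include <stdio.h>

    struct migration_job {
        bool spiceMigration;    /* did we send client_migrate_info this time? */
    };

    static void start_migration(struct migration_job *job, bool graphics_uri_ok)
    {
        job->spiceMigration = false;        /* the fix: reset on every attempt */
        if (graphics_uri_ok)
            job->spiceMigration = true;     /* client_migrate_info was sent */
    }

    int main(void)
    {
        struct migration_job job = { .spiceMigration = false };

        start_migration(&job, true);        /* first migration, valid graphics URI */
        printf("first attempt waits for SPICE:  %s\n",
               job.spiceMigration ? "yes" : "no");

        start_migration(&job, false);       /* second migration, invalid graphics URI */
        printf("second attempt waits for SPICE: %s\n",
               job.spiceMigration ? "yes" : "no");   /* "no" thanks to the reset */
        return 0;
    }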

Comment 13 Jiri Denemark 2016-07-08 11:46:36 UTC
This should now be fixed upstream by v2.0.0-59-ga16ea1a..v2.0.0-60-gf34b981:

commit a16ea1a0f3e6b9eb8be4be7a664af76e47bbceba
Author:     Jiri Denemark <jdenemar>
AuthorDate: Tue Jul 5 10:07:24 2016 +0200
Commit:     Jiri Denemark <jdenemar>
CommitDate: Fri Jul 8 13:35:17 2016 +0200

    qemu: Properly reset spiceMigration flag

    Otherwise migration during which we didn't send client_migrate_info QMP
    command will get stuck waiting for SPICE migration to finish if libvirtd
    sent the QMP command in a previous migration attempt.

    Broken by bd7c8a69.

    https://bugzilla.redhat.com/show_bug.cgi?id=1151723

    Signed-off-by: Jiri Denemark <jdenemar>

commit f34b981e403ce7abf41c0047e1b5610e1f5269db
Author:     Jiri Denemark <jdenemar>
AuthorDate: Wed Jun 29 15:01:17 2016 +0200
Commit:     Jiri Denemark <jdenemar>
CommitDate: Fri Jul 8 13:36:00 2016 +0200

    qemu: Drop useless SPICE migration code

    The spiceMigration flag will never be true if there is no SPICE graphics
    configured for the domain.

    https://bugzilla.redhat.com/show_bug.cgi?id=1151723

    Signed-off-by: Jiri Denemark <jdenemar>

Comment 15 Fangge Jin 2016-07-12 06:49:42 UTC
Verified on libvirt-2.0.0-2.el7.x86_64 and qemu-kvm-rhev-2.6.0-12.el7.x86_64.

Scenario 1: migrate with invalid graphicsuri -> migrate back with default graphicsuri -> migrate with correct graphicsuri

1. Define and start a guest with SPICE graphics on host A.

2.Connect a spice client to the guest:
# remote-viewer spice://hp-dl385g7-05.lab.eng.pek2.redhat.com:5900

3. Migrate the guest to host B with an invalid graphicsuri:
# virsh migrate rhel7.2 qemu+ssh://hp-dl385g7-06.lab.eng.pek2.redhat.com/system --live --verbose --graphicsuri abcdefg
Migration: [100 %]

Virsh did not hang after the migration reached 100%, and the SPICE client disconnected.

4.After migration, connect a spice client to the guest again:
# remote-viewer spice://hp-dl385g7-06.lab.eng.pek2.redhat.com:5900

5. Migrate the guest back with the default graphicsuri:
# virsh migrate rhel7.2 qemu+tcp://hp-dl385g7-05.lab.eng.pek2.redhat.com/system --live --verbose 
Migration: [100 %]

The spice migration finishes successfully.

6. Migrate the guest to host B again with the correct graphicsuri:
# virsh migrate rhel7.2 qemu+ssh://hp-dl385g7-06.lab.eng.pek2.redhat.com/system --live --verbose --graphicsuri spice://hp-dl385g7-06.lab.eng.pek2.redhat.com:5900
Migration: [100 %]

The spice migration finishes successfully.



Scenario 2: prepare a persistent guest, migrate with default graphicsuri -> migrate back with default graphicsuri -> migrate with invalid graphicsuri again

1. Define and start a guest with SPICE graphics on host A.

2.Connect a spice client to the guest:
# remote-viewer spice://hp-dl385g7-05.lab.eng.pek2.redhat.com:5900

3. Migrate the guest to host B with the default graphicsuri:
# virsh migrate rhel7.2 qemu+ssh://hp-dl385g7-06.lab.eng.pek2.redhat.com/system --live --verbose 
Migration: [100 %]

4.Migrate back:
# virsh migrate rhel7.2 qemu+ssh://hp-dl385g7-05.lab.eng.pek2.redhat.com/system --live --verbose 
Migration: [100 %]

5. Migrate the guest to host B with an invalid graphicsuri:
# virsh migrate rhel7.2 qemu+ssh://hp-dl385g7-06.lab.eng.pek2.redhat.com/system --live --verbose --graphicsuri abcdefg
Migration: [100 %]

Virsh did not hang after the migration reached 100%, and the SPICE client disconnected.

Comment 16 Fangge Jin 2016-07-12 06:53:07 UTC
With qemu-kvm-rhev-2.6.0-12.el7.x86_64, I can't reproduce the issue described in comment 12 (and reported in Bug 1352836).

Comment 17 Fangge Jin 2016-07-12 11:04:52 UTC
I ran into another problem: if the first migration is cancelled immediately after issuing the command and migration is then attempted again, the second migration gets stuck after the memory is 100% transferred.


# virsh migrate rhel7.2 qemu+tcp://hp-dl385g7-06.lab.eng.pek2.redhat.com/system --live --verbose  --p2p
^Cerror: operation aborted: migration out: canceled by client

Connect a spice client to the guest:
# remote-viewer spice://hp-dl385g7-05.lab.eng.pek2.redhat.com:5900

# virsh migrate rhel7.2 qemu+tcp://hp-dl385g7-06.lab.eng.pek2.redhat.com/system --live --verbose  --p2p
Migration: [100 %]
Migration: [100 %]
Migration: [100 %]
Migration: [100 %]^C
[root@hp-dl385g7-05 2.6.0-13]#

Comment 18 Fangge Jin 2016-07-12 11:08:55 UTC
Created attachment 1178860 [details]
the second migration gets stuck if the first migration is cancelled immediately

Comment 19 Jiri Denemark 2016-07-19 07:57:28 UTC
I think it's the same issue as reported in bug 1352836, but in your case different steps were needed to reproduce it.

Comment 20 Fangge Jin 2016-08-08 08:32:24 UTC
Based on comment 15, moving this bug to VERIFIED. I will track the issue in comment 17 by adding a new test case.

Comment 22 errata-xmlrpc 2016-11-03 18:10:42 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHSA-2016-2577.html