Bug 1352836 - SPICE_MIGRATE_COMPLETED is not sent in some cases
Summary: SPICE_MIGRATE_COMPLETED is not sent in some cases
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Linux 7
Classification: Red Hat
Component: spice
Version: 7.3
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: unspecified
Target Milestone: rc
Target Release: ---
Assignee: Default Assignee for SPICE Bugs
QA Contact: SPICE QE bug list
URL:
Whiteboard:
Duplicates: 1339910
Depends On:
Blocks:
 
Reported: 2016-07-05 08:52 UTC by Jiri Denemark
Modified: 2016-11-04 03:45 UTC
CC List: 12 users

Fixed In Version: spice-0.12.4-19.el7
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2016-11-04 03:45:12 UTC
Target Upstream Version:
Embargoed:


Attachments
successful migration on RHEL 7.2 (51.59 KB, application/octet-stream), 2016-07-08 09:22 UTC, Jiri Denemark
migration on RHEL 7.3 (39.83 KB, application/octet-stream), 2016-07-08 09:52 UTC, Jiri Denemark


Links
Red Hat Product Errata RHBA-2016:2324 (normal, SHIPPED_LIVE): spice bug fix and enhancement update, last updated 2016-11-03 13:43:33 UTC

Description Jiri Denemark 2016-07-05 08:52:00 UTC
Description of problem:

qemu-kvm-rhev does not send the SPICE_MIGRATE_COMPLETED event during a migration that uses the client_migrate_info QMP command if the domain was previously migrated without this command.
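For reference, the missing event can be observed directly on the QMP monitor: after the QMP greeting and the qmp_capabilities handshake, qemu emits SPICE_MIGRATE_COMPLETED as an asynchronous event. A minimal sketch of checking an event stream for it (the helper name and the sample lines are illustrative, not taken from the attached logs):

```python
import json

def saw_spice_migrate_completed(qmp_lines):
    """Scan newline-delimited JSON from a QMP monitor and report
    whether the SPICE_MIGRATE_COMPLETED event appeared."""
    for line in qmp_lines:
        try:
            msg = json.loads(line)
        except ValueError:
            continue  # skip partial or garbled lines
        if msg.get("event") == "SPICE_MIGRATE_COMPLETED":
            return True
    return False

# Sample streams (timestamps illustrative). In the working case both
# events arrive; in the broken case the SPICE event never does.
good = [
    '{"event": "MIGRATION", "data": {"status": "completed"}, '
    '"timestamp": {"seconds": 1467708720, "microseconds": 0}}',
    '{"event": "SPICE_MIGRATE_COMPLETED", '
    '"timestamp": {"seconds": 1467708721, "microseconds": 0}}',
]
bad = good[:1]
```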

Version-Release number of selected component (if applicable):

qemu-kvm-rhev-2.6.0-11.el7.x86_64
spice-server-0.12.4-18.el7.x86_64
libvirt-2.0.0-1.el7.x86_64 + patches for bug 1151723

How reproducible:

100%

Steps to Reproduce:
1. start a new domain with spice graphics
2. connect a spice client to the domain (virt-viewer, remote-viewer, ...)
3. virsh migrate DOM qemu+tcp://host2/system --live --graphicsuri ble
   (note, migration will work fine without "--graphicsuri ble")
4. virsh migrate DOM qemu+tcp://host1/system --live

Actual results:

There are two possible results in step 4:
- migration works if a spice client IS NOT connected to the domain
- migration gets stuck in "Waiting for SPICE to finish migration" if a spice client IS connected to the domain

Any number of migrations without a connected spice client between steps 3 and 4 will succeed, but as soon as a spice client connects to the domain, migration gets stuck. The direction (host1 -> host2 or host2 -> host1) is irrelevant here.

BTW, the migration gets stuck in libvirt after the domain is already happily running on the destination host, so just hitting ^C in the stuck virsh migrate command lets libvirt recover.

Expected results:

Migration should finish in both cases.

Additional info:

SPICE migration works just fine in all these cases on 7.2, with
qemu-kvm-rhev-2.3.0-31.el7_2.17.x86_64
spice-server-0.12.4-15.el7_2.1.x86_64
libvirt-2.0.0-1.el7.x86_64 + patches for bug 1151723

The incorrect graphics URI "ble" forces libvirt not to send the client_migrate_info QMP command before migration.
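For context, client_migrate_info is the QMP command libvirt issues before migration to point the connected SPICE client at the destination host. A rough sketch of its shape (the hostname and port values here are made-up examples; the argument names follow the QMP schema):

```python
import json

def build_client_migrate_info(hostname, port=None, tls_port=None):
    """Build the client_migrate_info QMP command that libvirt sends
    before migration when the graphics URI is valid. Values passed
    in are illustrative, not from the bug's environment."""
    args = {"protocol": "spice", "hostname": hostname}
    if port is not None:
        args["port"] = port
    if tls_port is not None:
        args["tls-port"] = tls_port
    return {"execute": "client_migrate_info", "arguments": args}

cmd = build_client_migrate_info("host2.example.com", port=5900)
print(json.dumps(cmd))
```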

Even though spice migration does not get stuck without an incorrect graphics URI, it still doesn't seem to be seamless: virt-viewer shows it is disconnected before it switches to the destination host, whereas on 7.2 the graphics just stops for a fraction of a second and resumes without showing the "disconnected" message.

Comment 2 Jiri Denemark 2016-07-05 08:53:33 UTC
I will provide logs when I have a better internet connection (probably no sooner than on Thursday).

Comment 3 Jiri Denemark 2016-07-08 09:22:31 UTC
Created attachment 1177586 [details]
successful migration on RHEL 7.2

1) migration with wrong graphics URI, i.e., without client_migrate_info
    rhel72-1# virsh migrate nest qemu+tcp://rhel72-2.virt/system --p2p --live --graphicsuri ble

    The migration process is shown in rhel72-1.virt-libvirtd.log
    - line 2966: migration starts
    - line 3050: libvirt complains about wrong URI
    - line 3228: libvirt sends "migrate" QMP command
    - line 3311: libvirt processed MIGRATION completed event
    - line 3461: libvirt is not waiting for SPICE migration to finish
    - line 3484: migration completed

2) migration with a default migration URI, using client_migrate_info
    rhel72-2# virsh migrate nest qemu+tcp://rhel72-1.virt/system --p2p --live

    You can watch the process in rhel72-2.virt-libvirtd.log
    - line 3612: migration starts
    - line 3700: libvirt sends "client_migrate_info" QMP command
    - line 3878: libvirt sends "migrate" QMP command
    - line 3955: libvirt processed MIGRATION completed event
    - line 4091: libvirt processed SPICE_MIGRATE_COMPLETED event
    - line 4195: libvirt waits for SPICE migration to finish (which already
      happened so we don't really wait here)
    - line 4215: migration completed

Comment 4 Jiri Denemark 2016-07-08 09:52:39 UTC
Created attachment 1177592 [details]
migration on RHEL 7.3

1) migration with wrong graphics URI, i.e., without client_migrate_info
    rhel1# virsh migrate nest qemu+tcp://rhel2.virt/system --p2p --live --graphicsuri ble

    The migration process is shown in rhel1.virt-libvirtd.log
    - line 1388: migration starts
    - line 1472: libvirt complains about wrong URI
    - line 1664: libvirt sends "migrate" QMP command
    - line 1802: libvirt processed MIGRATION completed event
    - line 1952: libvirt is not waiting for SPICE migration to finish
    - line 1975: migration completed

2) migration with a default migration URI, using client_migrate_info
    rhel2# virsh migrate nest qemu+tcp://rhel1.virt/system --p2p --live

    You can watch the process in rhel2.virt-libvirtd.log
    - line 2110: migration starts
    - line 2198: libvirt sends "client_migrate_info" QMP command
    - line 2413: libvirt sends "migrate" QMP command
    - line 2531: libvirt processed MIGRATION completed event
    - line 2691: libvirt is waiting for SPICE migration to finish (the event
      never comes and virt-viewer just disconnects after 624ms)
    - line 2922: ^C to the stuck virsh migrate command
    - line 2962: migration completed

Comment 5 Jiri Denemark 2016-07-08 09:55:59 UTC
BTW, I tested this with upstream QEMU (2.6.0) and it is broken there as well.

Comment 6 Marc-Andre Lureau 2016-07-12 15:17:29 UTC
(In reply to Jiri Denemark from comment #5)
> BTW, I tested this with upstream QEMU (2.6.0) and it is broken there as well.

So this looks like a qemu 2.6 regression: if you just downgrade qemu to 2.3 (keeping the spice libraries, libvirt, etc.), does it work?

I am assuming that you didn't change the spice-gtk version in your tests either. The SPICE_DEBUG=1 log in the 7.3 case 2) (the broken case, right?) could be helpful. Thanks.

Comment 7 Marc-Andre Lureau 2016-07-13 14:43:27 UTC
I managed to reproduce it: migration with 2.3.0-31 worked fine; with 2.6.0-13 the source remained paused while the destination was running, and spice-gtk printed extra errors (besides some "harmless" criticals that should also be fixed, but that's unrelated).

Comment 8 Marc-Andre Lureau 2016-07-13 16:17:41 UTC
The trouble seems to come from qemu completing the migration before the client finishes it (in migrate_connect_complete_cb, which was always empty for some reason); it then falls back to switch_host (considering seamless migration failed), but that somehow confuses qemu/spice (although the client seems to handle the transition quite well):

2016-07-13 16:04:51.843+0000: initiating migration
main_channel_migrate_src_complete: 
main_channel_migrate_src_complete: client 0x7f7302009180 SWITCH_HOST
main_channel_marshall_migrate_switch: 
main_channel_client_handle_migrate_connected: client 0x7f7302009180 connected: 1 seamless 1
main_channel_client_handle_migrate_connected: client 0x7f7302009180 MIGRATE_CANCEL

Now investigating what changed in qemu with seamless migration, or whether the fix is simply to wait for migrate_connect_complete_cb.

Comment 9 Marc-Andre Lureau 2016-07-13 17:10:05 UTC
Actually, this isn't always happening in my test. I am testing with a disk-less VM, so I wonder if the issue is actually reproducible with 2.3, even with a real VM (since there is nothing in qemu really waiting for connect_cb). It's a timing issue: a big VM will likely finish connect_cb before doing the actual live migration, but many things in the migration path likely changed between 2.3 and 2.6, and thus uncovered this bug. I think we need a qemu fix, but I wonder whether this is a regression, or whether it's easy to reproduce in 2.6 with a real/big VM.

Comment 10 Jiri Denemark 2016-07-18 14:55:16 UTC
*** Bug 1339910 has been marked as a duplicate of this bug. ***

Comment 11 Jiri Denemark 2016-07-19 08:35:46 UTC
It's possible there is some kind of race. I wasn't able to reproduce this bug with 2.3 at all, while with 2.6 it was almost 100% reproducible (the migration succeeded only once). BTW, my guest was running a "while true; do date '+%H:%M:%S.%N'; done" loop on its console.

Comment 12 Marc-Andre Lureau 2016-07-20 08:48:18 UTC
The issue is quite clearly on the spice server side: it does not call migrate_end_complete() when falling back to switch-host.

We could also fix the "race" in qemu, but there would still be cases where qemu shouldn't wait forever for the spice client.

I am struggling to understand how the code works, so I don't have a definitive solution yet; something like this seems to help:
@@ -3051,6 +3051,7 @@ static void migrate_timeout(void *opaque)
         main_channel_migrate_cancel_wait(reds->main_channel);
         /* in case part of the client haven't yet completed the previous migration, disconnect them */
         reds_mig_target_client_disconnect_all(reds);
+        reds->mig_wait_connect = FALSE;
         reds_mig_cleanup(reds);


To reproduce, I tweaked spice-gtk with the following change:

@@ -2100,6 +2100,9 @@ static SpiceChannel* migrate_channel_connect(spice_migrate *mig, int type, int i
     SPICE_DEBUG("migrate_channel_connect %d:%d", type, id);
 
     SpiceChannel *newc = spice_channel_new(mig->session, type, id);
+    if (type != SPICE_CHANNEL_MAIN)
+        g_usleep(G_TIME_SPAN_SECOND * 4);


moving to spice server for further help.

Comment 13 Marc-Andre Lureau 2016-07-20 14:01:59 UTC
sent fix to ML:
https://lists.freedesktop.org/archives/spice-devel/2016-July/030835.html

Comment 16 David Jaša 2016-09-12 16:53:42 UTC
After some testing with migrations between 0.12.4-{15,19}, I can't say that migrations to -19 do work and to -15 do not. What happens is that
1) migrations sometimes fail to finish, disconnecting the client
2) migrations sometimes do finish and disconnect the client
3) migrations sometimes do finish with client still connected

but it is not consistent which of these outcomes takes place.


(In reply to Marc-Andre Lureau from comment #8)
> The trouble seems to come from qemu completing the migration before the client
> finishes it (in migrate_connect_complete_cb, which was always empty for some
> reason); it then falls back to switch_host (considering seamless migration
> failed), but that somehow confuses qemu/spice

Why isn't it possible to finish seamless migration even when qemu is done? IIRC that was the semi-seamless mode of operation; seamless mode was designed so that qemu could be started on the dst host while the src_host -> client -> dst_host spice state transfer was still running (and that was also the raison d'être of seamless-migration=on: to instruct libvirt to kill the src qemu only after this spice sync was done).

Comment 17 Marc-Andre Lureau 2016-09-14 10:31:55 UTC
(In reply to David Jaša from comment #16)
> After some testing with migrations between 0.12.4-{15,19}, I can't say that
> migrations to -19 do work and to -15 do not. What happens is that
> 1) migrations sometimes fail to finish, disconnecting the client
> 2) migrations sometimes do finish and disconnect the client
> 3) migrations sometimes do finish with client still connected
> 
> but it is not consistent which of these outcomes takes place.

What should work is migrating from -19; the fix is on the migration source side.

> (In reply to Marc-Andre Lureau from comment #8)
> > The trouble seems to come from qemu completing the migration before the client
> > finishes it (in migrate_connect_complete_cb, which was always empty for some
> > reason); it then falls back to switch_host (considering seamless migration
> > failed), but that somehow confuses qemu/spice
> 
> Why isn't it possible to finish seamless migration even when qemu is done?

I think the spice server interface was designed so that qemu should wait for migrate_connect_complete_cb; unfortunately, it never did. Even in that case, it would probably need some timeout, and then fall back to a different method or disconnect the client, so the same server bug that was fixed here could still happen.

Now, the logic to fall back to switch-host is in the spice server, in main_channel_client_migrate_src_complete(). So I can imagine it could be improved to keep trying to finish the ongoing seamless migration instead (for how long? does this need new API to tell qemu to wait?). Tbh, it looks like a corner case to me; in general, spice will migrate faster than the VM, but it may be worth trying to improve the spice server.
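The completion/fallback decision discussed above can be modeled as a small sketch (this is an illustrative model of the logic, not spice-server code; the function and outcome names are made up to mirror the discussion):

```python
def migrate_src_complete(client_connected, seamless_done, timed_out):
    """Model the source-side decision at migration end: finish
    immediately with no client, finish seamlessly when the client is
    done, or fall back to switch-host on timeout. In every terminal
    case completion must be signalled back to qemu; the bug was that
    the timeout/fallback path did not clear the wait flag, so qemu
    never saw SPICE_MIGRATE_COMPLETED."""
    if not client_connected:
        return ("complete", None)        # nothing to wait for
    if seamless_done:
        return ("complete", "seamless")  # client finished in time
    if timed_out:
        return ("complete", "switch_host")  # fallback; must still complete
    return ("waiting", None)             # keep waiting for the client
```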

Comment 18 David Jaša 2016-09-23 15:27:55 UTC
I've put some effort into it and it works both ways for me - VERIFIED/SanityOnly.

Comment 20 errata-xmlrpc 2016-11-04 03:45:12 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHBA-2016-2324.html

