Bug 1352836
| Summary: | SPICE_MIGRATE_COMPLETED is not sent in some cases | ||||||||
|---|---|---|---|---|---|---|---|---|---|
| Product: | Red Hat Enterprise Linux 7 | Reporter: | Jiri Denemark <jdenemar> | ||||||
| Component: | spice | Assignee: | Default Assignee for SPICE Bugs <rh-spice-bugs> | ||||||
| Status: | CLOSED ERRATA | QA Contact: | SPICE QE bug list <spice-qe-bugs> | ||||||
| Severity: | unspecified | Docs Contact: | |||||||
| Priority: | unspecified | ||||||||
| Version: | 7.3 | CC: | cfergeau, chayang, dgilbert, djasa, fjin, huding, juzhang, knoel, marcandre.lureau, tpelka, virt-maint, zhguo | ||||||
| Target Milestone: | rc | Keywords: | Regression | ||||||
| Target Release: | --- | ||||||||
| Hardware: | Unspecified | ||||||||
| OS: | Unspecified | ||||||||
| Whiteboard: | |||||||||
| Fixed In Version: | spice-0.12.4-19.el7 | Doc Type: | If docs needed, set a value | ||||||
| Doc Text: | Story Points: | --- | |||||||
| Clone Of: | Environment: | ||||||||
| Last Closed: | 2016-11-04 03:45:12 UTC | Type: | Bug | ||||||
| Regression: | --- | Mount Type: | --- | ||||||
| Documentation: | --- | CRM: | |||||||
| Verified Versions: | Category: | --- | |||||||
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |||||||
| Cloudforms Team: | --- | Target Upstream Version: | |||||||
| Embargoed: | |||||||||
| Attachments: |
Description
Jiri Denemark
2016-07-05 08:52:00 UTC
I will provide logs when I have a better internet connection (probably no sooner than on Thursday).
Created attachment 1177586 [details]
successful migration on RHEL 7.2
1) migration with wrong graphics URI, i.e., without client_migrate_info
rhel72-1# virsh migrate nest qemu+tcp://rhel72-2.virt/system --p2p --live --graphicsuri ble
The migration process is shown in rhel72-1.virt-libvirtd.log
- line 2966: migration starts
- line 3050: libvirt complains about wrong URI
- line 3228: libvirt sends "migrate" QMP command
- line 3311: libvirt processed MIGRATION completed event
- line 3461: libvirt is not waiting for SPICE migration to finish
- line 3484: migration completed
2) migration with a default migration URI, using client_migrate_info
rhel72-2# virsh migrate nest qemu+tcp://rhel72-1.virt/system --p2p --live
You can watch the process in rhel72-2.virt-libvirtd.log
- line 3612: migration starts
- line 3700: libvirt sends "client_migrate_info" QMP command
- line 3878: libvirt sends "migrate" QMP command
- line 3955: libvirt processed MIGRATION completed event
- line 4091: libvirt processed SPICE_MIGRATE_COMPLETED event
- line 4195: libvirt waits for SPICE migration to finish (which already
happened so we don't really wait here)
- line 4215: migration completed
Created attachment 1177592 [details]
migration on RHEL 7.3
1) migration with wrong graphics URI, i.e., without client_migrate_info
rhel1# virsh migrate nest qemu+tcp://rhel2.virt/system --p2p --live --graphicsuri ble
The migration process is shown in rhel1.virt-libvirtd.log
- line 1388: migration starts
- line 1472: libvirt complains about wrong URI
- line 1664: libvirt sends "migrate" QMP command
- line 1802: libvirt processed MIGRATION completed event
- line 1952: libvirt is not waiting for SPICE migration to finish
- line 1975: migration completed
2) migration with a default migration URI, using client_migrate_info
rhel2# virsh migrate nest qemu+tcp://rhel1.virt/system --p2p --live
You can watch the process in rhel2.virt-libvirtd.log
- line 2110: migration starts
- line 2198: libvirt sends "client_migrate_info" QMP command
- line 2413: libvirt sends "migrate" QMP command
- line 2531: libvirt processed MIGRATION completed event
- line 2691: libvirt is waiting for SPICE migration to finish (the event
never comes and virt-viewer just disconnects after 624ms)
- line 2922: ^C to the stuck virsh migrate command
- line 2962: migration completed
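The two runs above differ only in whether libvirt has a SPICE event to wait for. As a rough sketch of that decision (the struct and function names are hypothetical and are not taken from libvirt), the completion check could be modelled like this:

```c
#include <stdbool.h>

/* Hypothetical model of one outgoing migration job as seen by the
 * management layer. These names are illustrative, not libvirt's. */
typedef struct {
    bool client_migrate_info_sent;  /* graphics relocation was requested */
    bool migration_completed;       /* MIGRATION completed event processed */
    bool spice_migrate_completed;   /* SPICE_MIGRATE_COMPLETED event processed */
} MigJob;

/* Returns true when the job may be reported as finished. */
bool mig_job_can_finish(const MigJob *job)
{
    if (!job->migration_completed)
        return false;
    if (!job->client_migrate_info_sent)
        return true;  /* case 1: nothing to wait for on the SPICE side */
    return job->spice_migrate_completed;  /* case 2: needs the SPICE event */
}
```

In this model, case 2 on RHEL 7.3 corresponds to a job with client_migrate_info_sent and migration_completed both set and spice_migrate_completed never set, so the check keeps returning false until the command is interrupted.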
BTW, I tested this with upstream QEMU (2.6.0) and it is broken there as well.

(In reply to Jiri Denemark from comment #5)
> BTW, I tested this with upstream QEMU (2.6.0) and it is broken there as well.
So this looks like a qemu 2.6 regression: if you just downgrade qemu to 2.3 (keeping spice libraries, libvirt etc), it works? I am assuming that you didn't change the spice-gtk version either in your tests. The SPICE_DEBUG=1 log in the 7.3 2) case (the broken case, right) could be helpful. thanks

I managed to reproduce: migration with 2.3.0-31 worked fine; with 2.6.0-13 the source remained paused, the dest is running, and spice-gtk prints extra errors (besides some "harmless" criticals that should also be fixed, but that's unrelated).
The trouble seems to come from qemu completing the migration before client finishes it (in migrate_connect_complete_cb which was always empty for some reason), then it fallbacks to switch_host (considering seamless failed), but that somehow confuses qemu/spice (although the client seems to handle the transition quite ok):
2016-07-13 16:04:51.843+0000: initiating migration
main_channel_migrate_src_complete:
main_channel_migrate_src_complete: client 0x7f7302009180 SWITCH_HOST
main_channel_marshall_migrate_switch:
main_channel_client_handle_migrate_connected: client 0x7f7302009180 connected: 1 seamless 1
main_channel_client_handle_migrate_connected: client 0x7f7302009180 MIGRATE_CANCEL
Now investigating what changed in qemu with seamless migration, or whether the fix is simply to wait for migrate_connect_complete_cb.

Actually, this isn't always happening in my test. I am testing with a disk-less VM. I wonder if the issue is actually reproducible with 2.3, even with a real VM (since there is nothing in qemu really waiting for connect_cb).

It's a timing issue: a big VM will likely finish connect_cb before doing actual live migration, but between 2.3 and 2.6 likely many things in the migration path changed, and thus uncovered this bug.
I think we need a qemu fix, but I wonder if this is a regression, or if it's easy to reproduce in 2.6 with a real/big VM.

*** Bug 1339910 has been marked as a duplicate of this bug. ***

It's possible there is some kind of race. I wasn't able to reproduce this bug with 2.3 at all, while with 2.6 it was almost 100% (it worked only once even with 2.6). BTW, my guest was running a "while true; do date '+%H:%M:%S.%N'; done" loop on its console.

The issue is quite clearly on spice server side, not calling migrate_end_complete() when falling back to switch-host.
We could also fix the "race" in qemu, but there would still be cases where qemu shouldn't wait forever for the spice client.
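To illustrate what "not waiting forever" might look like, here is a minimal sketch of a bounded wait. The names and the idea of a fixed millisecond limit are assumptions made for illustration; neither qemu nor libvirt is claimed to implement it this way.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical outcome of polling for the SPICE client during migration. */
typedef enum {
    WAIT_KEEP_WAITING,  /* client not done yet, limit not reached */
    WAIT_DONE,          /* client finished its part of the migration */
    WAIT_GIVE_UP        /* limit reached: fall back or disconnect the client */
} WaitVerdict;

/* Decide what to do after waiting waited_ms for the client.
 * limit_ms is an assumed, caller-chosen upper bound. */
WaitVerdict spice_wait_verdict(bool spice_done, uint64_t waited_ms,
                               uint64_t limit_ms)
{
    if (spice_done)
        return WAIT_DONE;
    if (waited_ms >= limit_ms)
        return WAIT_GIVE_UP;
    return WAIT_KEEP_WAITING;
}
```

Whatever component owns such a limit, the give-up path still has to report completion to the caller, which is the path the server-side problem described above affects.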
I am struggling to understand how the code works, so I don't have a definitive solution yet, but something like this seems to help:
@@ -3051,6 +3051,7 @@ static void migrate_timeout(void *opaque)
     main_channel_migrate_cancel_wait(reds->main_channel);
     /* in case part of the client haven't yet completed the previous migration, disconnect them */
     reds_mig_target_client_disconnect_all(reds);
+    reds->mig_wait_connect = FALSE;
     reds_mig_cleanup(reds);
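One possible reading of why clearing the flag helps can be shown with a toy model. It assumes that the cleanup routine picks which completion callback to fire based on the wait flags; that branching, the ToyReds type, and the mig_inprogress and mig_wait_disconnect fields are assumptions for illustration. Only mig_wait_connect appears in the patch above.

```c
#include <stdbool.h>

/* Which completion callback the toy cleanup decided to fire. */
typedef enum { CB_NONE, CB_CONNECT_COMPLETE, CB_END_COMPLETE } MigCallback;

/* Toy stand-in for the server state. This is not the real spice structure. */
typedef struct {
    bool mig_inprogress;
    bool mig_wait_connect;
    bool mig_wait_disconnect;
} ToyReds;

/* Assumed behaviour: at most one callback is notified, and a pending
 * "wait for connect" takes precedence over "wait for disconnect". */
MigCallback toy_mig_cleanup(ToyReds *r)
{
    MigCallback fired = CB_NONE;

    if (r->mig_inprogress) {
        if (r->mig_wait_connect)
            fired = CB_CONNECT_COMPLETE;
        else if (r->mig_wait_disconnect)
            fired = CB_END_COMPLETE;
    }
    r->mig_inprogress = false;
    r->mig_wait_connect = false;
    r->mig_wait_disconnect = false;
    return fired;
}

/* Timeout handler with and without the one-line change from the patch. */
MigCallback toy_migrate_timeout(ToyReds *r, bool apply_fix)
{
    if (apply_fix)
        r->mig_wait_connect = false;
    return toy_mig_cleanup(r);
}
```

Under these assumptions, a timeout that arrives while both wait flags are set reports only the connect completion without the change, and reports the end completion with it.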
To reproduce, I tweaked spice-gtk with the following change:
@@ -2100,6 +2100,9 @@ static SpiceChannel* migrate_channel_connect(spice_migrate *mig, int type, int i
     SPICE_DEBUG("migrate_channel_connect %d:%d", type, id);
     SpiceChannel *newc = spice_channel_new(mig->session, type, id);
+    if (type != SPICE_CHANNEL_MAIN)
+        g_usleep(G_TIME_SPAN_SECOND * 4);
Moving to spice server for further help.
After some testing with migrations between 0.12.4-{15,19}, I can't say that migrations to -19 do work and to -15 do not. What happens is that
1) migrations sometimes fail to finish, disconnecting the client
2) migrations sometimes do finish and disconnect the client
3) migrations sometimes do finish with client still connected
but it's not consistent which of the options take place.
(In reply to Marc-Andre Lureau from comment #8)
> The trouble seems to come from qemu completing the migration before client
> finishes it (in migrate_connect_complete_cb which was always empty for some
> reason), then it fallbacks to switch_host (considering seamless failed), but
> that somehow confuses qemu/spice
Why it isn't possible to finish seamless migration even when qemu is done? IIRC that was semi-seamless mode of operation, seamless mode was designed so that qemu could be started on dst host while src_host -> client -> dst_host spice state transfer was still running (and that was also seamless-migration=on raison d'être - to instruct libvirt to only kill src qemu after this spice sync was done).
(In reply to David Jaša from comment #16)
> After some testing with migrations between 0.12.4-{15,19}, I can't say that
> migrations to -19 do work and to -15 do not. What happens is that
> 1) migrations sometimes fail to finish, disconnecting the client
> 2) migrations sometimes do finish and disconnect the client
> 3) migrations sometimes do finish with client still connected
>
> but it's not consistent which of the options take place.
What should work is migrating from -19, the fix is on the migration src side.
> (In reply to Marc-Andre Lureau from comment #8)
> > The trouble seems to come from qemu completing the migration before client
> > finishes it (in migrate_connect_complete_cb which was always empty for some
> > reason), then it fallbacks to switch_host (considering seamless failed), but
> > that somehow confuses qemu/spice
>
> Why it isn't possible to finish seamless migration even when qemu is done?
I think spice server interface design was that qemu should wait for migrate_connect_complete_cb, unfortunately it never did. Even in this case, it would probably need to have some timeout, and fallback to a different method or disconnect the client, so the same server bug that was fixed here could happen. Now, the logic to fallback to switch-mode is in spice server, main_channel_client_migrate_src_complete(). So I can imagine it could be improved to keep trying to finish the ongoing seamless migration instead (for how long? does this need new API to tell qemu to wait?). Tbh, it looks like a corner case to me, in general spice will be faster at migrating than the VM, but it may be worth trying to improve the spice server.

I've put some effort into it and it works both ways for me - VERIFIED/SanityOnly.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.
For information on the advisory, and where to find the updated files, follow the link below.
If the solution does not work for you, open a new bug report.
https://rhn.redhat.com/errata/RHBA-2016-2324.html