Bug 1036201 - RFE: make seamless migration not use a timeout, or not time out on init
Summary: RFE: make seamless migration not use a timeout, or not time out on init
Keywords:
Status: CLOSED WONTFIX
Alias: None
Product: Red Hat Enterprise Linux 7
Classification: Red Hat
Component: spice
Version: 7.0
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: unspecified
Target Milestone: rc
Target Release: 7.3
Assignee: Default Assignee for SPICE Bugs
QA Contact: SPICE QE bug list
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2013-11-29 17:43 UTC by David Jaša
Modified: 2016-05-02 12:21 UTC (History)
15 users

Fixed In Version:
Doc Type: Enhancement
Doc Text:
Clone Of:
Environment:
Last Closed: 2016-05-02 12:21:18 UTC
Target Upstream Version:


Attachments
ping stats over some less stable connections (182.93 KB, text/plain)
2013-11-29 17:50 UTC, David Jaša

Description David Jaša 2013-11-29 17:43:18 UTC
Description of problem:
I recently had to work on a network with very slow DNS queries (~2 s) and I found out that this condition made the spice client fall back to the undesirable SWITCH_HOST migration mode. Giving the spice client a bit more slack to finish client_migrate_info would fix this problem with possibly low consequences.
By "a bit more slack" I mean proceeding with the "migrate" qemu command no sooner than 5 seconds after a client_migrate_info that didn't return a result.

(I'm not sure whether the timeout is actually applied at the libvirt level; please reassign to qemu, which is next in the chain, if libvirt just waits for the client_migrate_info result before calling migrate.)
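For illustration, the requested sequencing (issue "migrate" only after client_migrate_info has completed or a grace period has elapsed) could be sketched like this. This is a hypothetical Python model, not libvirt or qemu code; `send_client_migrate_info` and `send_migrate` are made-up stand-ins for the real monitor commands:

```python
import threading

def migrate_with_grace(send_client_migrate_info, send_migrate, grace_s=5.0):
    """Issue client_migrate_info, then wait up to grace_s seconds for its
    completion event before issuing migrate. Hypothetical sketch only."""
    done = threading.Event()
    # Ask the client to connect to the destination; on_complete fires
    # once the client reports the seamless connection is ready.
    send_client_migrate_info(on_complete=done.set)
    seamless = done.wait(timeout=grace_s)  # True if the client made it in time
    # migrate proceeds either way; the grace period only decides whether
    # the migration can still be seamless.
    send_migrate()
    return seamless  # False => expect the SWITCH_HOST fallback
```

The point is that "migrate" is never blocked indefinitely; the grace period only gives a slow client (e.g. one stuck on DNS) a chance to finish connecting first.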

Version-Release number of selected component (if applicable):
libvirt-1.1.1-13.el7.x86_64
qemu-kvm-1.5.3-19.el7.x86_64
spice-server-0.12.4-3.el7.x86_64

How reproducible:
always

Steps to Reproduce:
1. have two hosts with a VM migratable between the two, and a client (may be the same as the host)
2. create very long response times for dns on client (such as 2 s; using a distant server or a WAN emulation)
3. connect with remote-viewer/virt-viewer to the VM
4. migrate the VM

Actual results:
the client doesn't manage to connect to dst qemu before migrate is called on src qemu; already-connected channels are disconnected and migration falls back to SWITCH_HOST / migrate_switch mode

Expected results:
the client is given enough time to connect to dst qemu even in case of slow DNS (or a network hiccup at client_migrate_info time)

Additional info:
IMO a sensible timeout for client_migrate_info is in the 5-10 second range - still way smaller than a typical migration time, and large enough to connect even over a slow link.

Comment 2 David Jaša 2013-11-29 17:50:05 UTC
Created attachment 830740 [details]
ping stats over some less stable connections

Just for the record - in the attachment there is a real-world ping statistic of a WAN connection to work (one of the spice use cases): note the several-seconds-long network hiccups, even with some packet loss - should these align with client_migrate_info time, we could get hard-to-debug switch_host migrations.

Comment 3 Jiri Denemark 2013-12-02 10:22:35 UTC
There's indeed no timeout in libvirt anywhere around client_migrate_info. I guess it must be either in qemu or in spice.

Comment 4 Gerd Hoffmann 2013-12-04 07:38:15 UTC
No timeout in qemu-kvm either.  qemu just calls spice_server_migrate_connect(), then waits for the callback from spice-server.  Reassigning.

Comment 5 RHEL Program Management 2014-03-22 06:26:25 UTC
This request was not resolved in time for the current release.
Red Hat invites you to ask your support representative to
propose this request, if still desired, for consideration in
the next release of Red Hat Enterprise Linux.

Comment 6 Marc-Andre Lureau 2014-07-03 13:46:12 UTC
There is a 10s timeout for migrate_info to complete the seamless destination connection. Past that time, the client will receive a switch host after migration completes (server_migrate_end).

I think the reason was that we don't want to "block" the qemu migration from happening, since seamless migration waits for all channels to be connected before migrating. I guess "waiting" for destination connections is necessary to have a seamless experience, or else you would have to wait after migration completed and it wouldn't feel seamless anymore...

It seems there is a tradeoff here. I think we shouldn't "block" qemu migration longer than 10s. If the client is too slow or stuck, switch_host is a reasonable fallback.

So, I am tempted to close as not a bug.
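A rough model of the behaviour described in this comment (wait up to 10s for all channels to connect to the destination, otherwise fall back to switch-host) might look like this. The function name, parameters, and the SEAMLESS/SWITCH_HOST strings are illustrative, not spice-server's actual API:

```python
import time

def wait_for_channels(connected, expected, poll=lambda: None, timeout_s=10.0):
    """Wait until every expected channel has connected to the destination,
    or until the timeout expires; fall back to switch-host in the latter
    case. Simplified model, not spice-server code."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        if expected <= connected:          # all expected channels connected
            return "SEAMLESS"
        poll()                             # let the connection state advance
    return "SWITCH_HOST"                   # timed out: degrade gracefully
```

Under this model the migration itself is never held up for more than `timeout_s`; only the quality of the client's experience degrades.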

Comment 7 Marc-Andre Lureau 2014-07-03 13:50:31 UTC
David, please comment, thanks

Comment 8 David Jaša 2014-07-03 14:07:48 UTC
(In reply to Marc-Andre Lureau from comment #6)
> There is a 10s timeout for migrate_info to complete seamless destination
> connection. Passed that time, the client will receive a switch host after
> migration complete (server_migrate_end). 
> 

I agree, it's good from a resilience POV (in most cases, keeping the VM up is more important than keeping the client connected)

> I think the reason was that we don't want to "block" the qemu migration from
> happening, since seamless migration waits for all channels to be connected
> before migrating. I guess "waiting" for destination connections is necessary
> to have a seamless experience, or else you would have to wait after
> migration completed and it wouldn't feel seamless anymore...
> 

I'd still like this as a fallback to real seamless migration (call it not-so-seamless migration: instead of switch_host, keep the source running, have the client connect to the destination like after client_migrate_info, transfer the spice-server state, switch the client to dest, give the go-ahead to stop source qemu). It wouldn't feel seamless, but stuff like USB redirection or monitor config wouldn't be reset and RHEV could re-allow connections to the migrating VM.

> It seems there is a tradeoff here. I think we shouldn't "block" qemu
> migration longer than 10s. If the client is to slow or stuck, switch_host is
> a reasonable fallback.

IMO the sweet spot could be in 10 - 15 s range. What is the current timeout?

> 
> So, I am tempted to close as not a bug.

Comment 9 Marc-Andre Lureau 2014-07-03 15:39:49 UTC
(In reply to David Jaša from comment #8)

> I'd still like this as a fallback to real seamless migration (say
> not-so-seamless migration: instead of switch_host, keep source running, have
> client connect to destination like after client_migrate_info, transfer
> spice-server state, switch client to dest, give go-ahead to stop source
> qemu). It wouldn't feel seamless but stuff like USB redirection or monitor
> config wouldnt' be reset and RHEV could re-allow connection to migrating VM.

I wish Yonit could comment on that; it might not be doable... And to me this is overkill for little gain. We already have 3 migration paths; that would be another one.

If it's possible to change the behaviour, I would rather change seamless to not block migration start, even if the channels aren't all connected. So if it's possible, get rid of the timeout altogether. A "slow client" perhaps won't feel so seamless, but we would still keep the channel state that you listed.

> > It seems there is a tradeoff here. I think we shouldn't "block" qemu
> > migration longer than 10s. If the client is to slow or stuck, switch_host is
> > a reasonable fallback.
> 
> IMO the sweet spot could be in 10 - 15 s range. What is the current timeout?

As I said, it's 10s. It's short for the client to connect and handshake all channels and reply to origin server, but it's long for the other side initiating the migration.

Comment 10 Marc-Andre Lureau 2014-07-03 15:42:05 UTC
What about changing to "RFE: make seamless migration do not timeout on init" ?

Comment 11 David Jaša 2014-07-04 10:05:26 UTC
(In reply to Marc-Andre Lureau from comment #9)
> ...
> If it's possible to change the behaviour, I would rather change seamless to
> not block migration start, even if the channel aren't all connected. So if
> it's possible, get rid of the timeout alltogether. A "slow client" won't
> perhaps feel so seamless, but we would still keep the channel state that you
> listed.  
> 

So the timeline would be:
1. client_migrate_info src -> client
2. main channel connects
3. other channels try connecting
4. migration starts, dst qemu accepts no new connections now
5. migration finishes
6. (in order to avoid timeouts) dst qemu orders the client to connect the rest of the channels
7. the rest is just like regular seamless migration

That's pretty much what I requested above, and I'm afraid it wouldn't be easy either.
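Purely as illustration, the timeline above could be written down as an ordered phase list (the phase names here are made up, not protocol constants):

```python
def lazy_seamless_phases():
    """Yield the phases of the proposed 'lazy' seamless migration in order.
    Hypothetical sketch of the timeline above, not an actual protocol."""
    yield "client_migrate_info"             # 1. src tells the client about dst
    yield "main_channel_connected"          # 2. main channel connects to dst
    yield "secondary_channels_connecting"   # 3. other channels keep trying
    yield "migration_running"               # 4. dst accepts no new connections
    yield "migration_finished"              # 5. guest state transferred
    yield "connect_remaining_channels"      # 6. dst asks the client to finish
    yield "seamless_handover"               # 7. as in regular seamless migration
```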

> > > It seems there is a tradeoff here. I think we shouldn't "block" qemu
> > > migration longer than 10s. If the client is to slow or stuck, switch_host is
> > > a reasonable fallback.
> > 
> > IMO the sweet spot could be in 10 - 15 s range. What is the current timeout?
> 
> As I said, it's 10s. It's short for the client to connect and handshake all
> channels and reply to origin server, but it's long for the other side
> initiating the migration.

OK, 10 s should stay.

Comment 12 Marc-Andre Lureau 2014-07-06 22:39:12 UTC
(In reply to Marc-Andre Lureau from comment #10)
> What about changing to "RFE: make seamless migration do not timeout on init"
> ?

Comment 14 David Jaša 2014-07-10 12:31:53 UTC
(In reply to Marc-Andre Lureau from comment #12)
> (In reply to Marc-Andre Lureau from comment #10)
> > What about changing to "RFE: make seamless migration do not timeout on init"
> > ?

That would also be an effective change of the migration mode (I mean channels that miss the client_migrate_info "window" connecting to dest after migration finishes), wouldn't it?

Comment 15 Marc-Andre Lureau 2014-07-10 13:02:07 UTC
(In reply to David Jaša from comment #14)
> (In reply to Marc-Andre Lureau from comment #12)
> > (In reply to Marc-Andre Lureau from comment #10)
> > > What about changing to "RFE: make seamless migration do not timeout on init"
> > > ?
> 
> That would be also effective change of migration mode (I mean channels
> missing client_migrate_info "window" connecting to dest after migration
> finishes), wouldn't it?

yes, an improvement of seamless migration. Eventually, get rid of the initial timeout: afaik, it doesn't help much to ensure the client is ready before starting migration. If it's not ready within 10s, there is still a good chance it will be ready later, when migration finishes. If not, it could fall back to switch mode then (I imagine the reason it uses the fallback is to avoid freezing the destination when seamless migration finishes)

Comment 16 David Jaša 2014-07-10 13:49:59 UTC
OK, but then I'd prefer a new RFE for this lazy :) seamless migration.

> Eventually, get rid of the initial timeout:

A timeout has to remain there if the main channel is to connect before migration starts, doesn't it?

Comment 17 Marc-Andre Lureau 2014-07-10 14:11:58 UTC
(In reply to David Jaša from comment #16)
> OK, but then I'd prefer a new RFE for this lazy :) seamless migration.
> 
> > Eventually, get rid of the initial timeout:
> 
> A timeout has to remain there if the main channel is to connect before
> migration starts, doesn't it?

Well, I am not convinced. In any case, we should not consider the current behaviour a bug, but propose an RFE. Imho, this kind of undefined/unclear RFE is better handled upstream.

Comment 18 Marc-Andre Lureau 2014-08-25 16:35:04 UTC
So David, should we close this bug and open a new "RFE: make seamless migration do not timeout on init" or just repurpose this one?

Comment 19 Marc-Andre Lureau 2015-01-02 15:36:42 UTC
moving to 7.2

Comment 22 David Jaša 2016-05-02 12:21:18 UTC
Let's close it now. There has been no other occurrence of this bug since it was reported, and it will be even less likely to hit again with the rise of local DNS caches.

